Ian McKenna, Ph.D. Senior Financial Engineer

Integrating Advanced Analytics with Big Data
Ian McKenna, Ph.D.
Senior Financial Engineer
© 2017 The MathWorks, Inc.

The Goal
SCALE!

The Solution
tall

Agenda
Introduction to tall data
Case Study: Predictive Analytics
Scaling with PCT/MDCS
Scaling with Spark/Hadoop
  – Interactive Mode using MDCS
  – Deployment using MATLAB Compiler
Summary

Datastore - Accessing Big Data Sources
Easily access large sets of data
Works with various data formats
  – Databases
  – CSV files
  – Excel files
  – Images
Select & preview columns/formats easily
Use with parallel computing tools
Easily use local and remote data sources
  – HDFS (hdfs:///)
  – Amazon S3 (s3://)
  – Azure Blob Storage (wasbs://)
  – Databases
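The datastore workflow on this slide (point at files, select columns, preview) can be sketched as follows. This is a minimal illustration, not code from the talk: the temporary folder, the file name prices.csv, and the column names Minute, Price, and Volume are all made up so the example runs on its own.

```matlab
% Write a tiny synthetic CSV file so the example is self-contained.
% (Folder, file name, and column names are hypothetical.)
folder = tempname;
mkdir(folder);
T = table((1:5)', [10.5; 11.0; 10.8; 11.2; 11.6], [100; 120; 90; 150; 110], ...
    'VariableNames', {'Minute', 'Price', 'Volume'});
writetable(T, fullfile(folder, 'prices.csv'));

% Create a datastore over every CSV file in the folder.
ds = datastore(fullfile(folder, '*.csv'));

% Select only the columns of interest.
ds.SelectedVariableNames = {'Minute', 'Price'};

% Preview the first rows without reading the whole data set.
firstRows = preview(ds);

% Read everything into memory (fine here only because the file is tiny).
allRows = readall(ds);
```

The same datastore call should accept a remote location such as an hdfs:/// or s3:// path in place of the local folder, provided the environment is configured for that storage system.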

Big Data Frameworks in MATLAB
(ordered from Ease of Use to Greater Control)
Tall
  – Local, PCT, MDCS
  – MDCS + Spark
  – Compiler + Spark
MapReduce
  – Deploy to Hadoop or run with MDCS
MATLAB API for Spark
  – Access Spark functions (flatMap, aggregate, etc.)
  – Access Spark RDD API and create standalone apps

Tall Data
New data type for data that doesn’t fit into memory
Designed for mathematical/statistical operations
Looks like a normal MATLAB array
  – Supports numeric types, tables, datetimes, strings, etc.
  – 300 tall enabled functions supported in MATLAB
Process big data on your desktop, compute clusters, and Hadoop/Spark systems
[Figure: tall data compared with machine memory]

Big Data Without Big Changes
1 File
1000 Files
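A small sketch of the "one file versus many files" idea: only the datastore location changes, while the analysis code stays the same. The three synthetic files and the Price column are assumptions made for illustration, and mapreducer(0) is used so the example runs serially without Parallel Computing Toolbox.

```matlab
% Create three small synthetic CSV files (hypothetical data).
folder = tempname;
mkdir(folder);
for k = 1:3
    T = table((1:4)' + 4*(k-1), 'VariableNames', {'Price'});
    writetable(T, fullfile(folder, sprintf('part%d.csv', k)));
end

% Run tall calculations serially in the local MATLAB session.
mapreducer(0);

% One file.
dsOne = datastore(fullfile(folder, 'part1.csv'));
tOne = tall(dsOne);
meanOne = gather(mean(tOne.Price));

% Many files: only the file pattern changes.
dsAll = datastore(fullfile(folder, '*.csv'));
tAll = tall(dsAll);
meanAll = gather(mean(tAll.Price));
```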

Analytics With Tall Include
Machine Learning
  – fitlm (linear regression)
  – fitglm (logistic & generalized linear)
  – fitckernel (Gaussian kernel classification)
  – fitrlinear (SVM regression)
  – fitclinear (SVM classification)
  – fitctree (classification tree)
  – fitcnb (naïve bayes)
  – fitcdiscr (discriminant analysis)
  – TreeBagger (random forest)
  – lasso (lasso regression)
  – pca (principal component analysis)
  – kmeans (clustering)
Cleaning
  – [...]
  – synchronize
  – retime
  – splitapply
  – datasample
  – cvpartition
Visualizing
  – [...]
  – histogram
  – histogram2
  – pie

Example Analytics Use Case
Objective: Predict Apple Stock Price
Inputs:
  – Price series for all constituents of S&P100
  – Scale to billions of rows (20 years of minutely data)
Approach:
  – Preprocess and explore data
  – Work with subset of data for prototyping
  – Fit regression models
  – Predict price and validate model
  – Scale to full data set on HDFS
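The approach above (prototype on a subset, fit a regression model, predict, validate, then scale) might look roughly like the sketch below. It is not the model from the talk: the data are synthetic random numbers rather than S&P 100 prices, the variable names x1, x2, and y are placeholders, and it assumes Statistics and Machine Learning Toolbox is available for fitlm and predict.

```matlab
% Synthetic stand-in for the price data (not real market data).
rng(0);
n = 1000;
x1 = randn(n, 1);
x2 = randn(n, 1);
y = 2*x1 - 3*x2 + 5 + 0.01*randn(n, 1);
data = table(x1, x2, y);

% Run tall calculations serially in the local session.
mapreducer(0);
tt = tall(data);

% Prototype on a small in-memory subset.
subset = gather(head(tt, 200));
mdlSmall = fitlm(subset, 'ResponseVar', 'y');

% Fit the same model on the full tall table.
mdlTall = fitlm(tt, 'ResponseVar', 'y');

% Predict and validate with a root-mean-square error.
yhat = gather(predict(mdlTall, tt));
rmse = sqrt(mean((yhat - data.y).^2));
```

To scale to the full data set, the tall table would be built from a datastore that points at the HDFS location instead of an in-memory table; the fitting and prediction lines would stay as they are.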

Scaling Analytics With Tall
(from Prototype to Production)
Non-scaled Desktop Application
Tall (Local)
Tall + Parallel Computing (Local)
Tall + MDCS (MATLAB Distributed Computing Server)
Tall + MDCS + Spark
Tall + MATLAB Compiler + Spark

What Is Spark/Hadoop?
Cluster management and computing software for big data
Hadoop:
  – HDFS (File System)
  – YARN (Scheduler)
  – MapReduce (Programming Model)
Spark: Computational engine
MATLAB is certified for HDP and Cloudera
[Diagram: MATLAB on top of MapReduce (batch) and Spark (in-memory), which run on YARN over HDFS]

Tall With Spark + Hadoop
MATLAB workers must be installed or accessible to all worker nodes
  – MATLAB MDCS workers (working from MATLAB)
  – MATLAB Runtime (deployed)
[Diagram: Edge Node (client libraries, spark-submit script); Master (Name Node, YARN Resource Manager); Worker Nodes, each with MATLAB workers, cache, tasks, and a Data Node; HDFS]

Running On Spark + Hadoop (MDCS)

Desktop (tall with PCT)
% Define the execution environment.
mapreducer(gcp);
% Access the data.
d = datastore('/home/data/SP100/*.csv');
t = tall(d);

Spark
%% Define the execution environment (Spark environment)
setenv('HADOOP_HOME', '/usr/hdp/2.6.2.0/hadoop');
setenv('SPARK_HOME', '/usr/hdp/2.6.2.0/spark');
% Spark connection
cluster [...] spark.executor.instances') = '16';
mapreducer(cluster);
% Access the data (HDFS access)
d = datastore([...]csv');
t = tall(d);

Running On Spark + Hadoop (MDCS)

Deploying Applications to Spark
[Diagram, steps 1 to 3: MATLAB, Toolboxes, MATLAB Compiler, .sh script, Edge Node, Worker Nodes, MATLAB Runtime]

Big Data for New Users
Desktop
  – Datastore & tall
  – Run in parallel
  – Prototype code locally
Compute Clusters
  – Scale parallel applications to grid, cluster, & cloud
Spark + Hadoop
  – Run in parallel on Spark cluster
  – Deploy as standalone applications

Multiple Choices, Many Benefits
Benefits of Spark/Hadoop
  – Scalability and robustness
  – Fault-tolerant distributed data storage
  – Move compute to the data
Benefits of MDCS
  – Interactive connection
  – Easy prototyping and debugging
Benefits of Compiler
  – Easily invoke from outside MATLAB
  – Royalty free deployment
  – No licensing necessary on cluster

Easy Scaling with Tall
Designed for visualization, data cleansing, statistics, and machine learning
Deferred evaluation optimizes big data analytics
Perform visualizations directly on big data
Easily convert between in-memory and out-of-memory
No need to rewrite code, just call ‘tall’
Support production and prototype using ‘isdeployed’
tall: SCALE!
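Three of the points above (deferred evaluation, converting between in-memory and tall arrays, and switching on isdeployed) can be illustrated with a short sketch. The HDFS path is a hypothetical placeholder and is only assigned, never read; the variable names are made up for the example.

```matlab
% Choose settings for prototype versus production with isdeployed.
if isdeployed
    % Hypothetical production location; replace with a real HDFS path.
    dataLocation = 'hdfs:///path/to/data/*.csv';
    runMode = 'production';
else
    % Prototype locally and run tall calculations serially.
    dataLocation = '';
    runMode = 'prototype';
    mapreducer(0);
end

% Convert an in-memory array to a tall array.
x = (1:10)';
tx = tall(x);

% Deferred evaluation: this only queues the operations.
s = sum(tx.^2);
isDeferred = istall(s);

% gather triggers the computation and returns an in-memory result.
total = gather(s);
```

Keeping results unevaluated until gather is called gives MATLAB the chance to combine the queued operations into fewer passes over the data.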

Summary
Get started scaling right away on your local machine with tall
Don’t need Spark/HDFS cluster to scale, can use MDCS
MATLAB scales from desktop to production
  – Transition from desktop to cluster with minimal changes
  – Using Spark/HDFS is simple with MATLAB
[Diagram: MATLAB Desktop (Client) connected to a cluster through a scheduler]

MATLAB Central Community
Every month, over 2 million MATLAB & Simulink users visit MATLAB Central to get questions answered, download code and improve programming skills.
MATLAB Answers: Q&A forum; most questions get answered in only 60 minutes
File Exchange: Download code from a huge repository of free code including tens of thousands of open source community files
Cody: Sharpen programming skills while having fun
Blogs: Get the inside view from Engineers who build and support MATLAB & Simulink
ThingSpeak: Explore IoT Data
And more for you to explore
Learn, Contribute, Connect

Get Training
CPE Approved Provider
Accelerate your learning curve:
- Customized curriculum
- Learn best practices
- Practice on real-world examples
Options to fit your needs:
- Self-paced (online)
- Instructor led (online and in-person)
- Customized curriculum (on-site)

Consulting
Engineering expertise and deep product knowledge, specializing in:
  – Application development using MATLAB
  – Model-Based Design using Simulink and Stateflow
  – Embedded systems development
  – Enterprise-wide integration of MathWorks products into engineering process and systems
  – Jumpstart services for a fast, smooth transition to MathWorks products
Project-based services for a growing number of industries, including aerospace and defense, automotive, communications, power and marine, and financial services
www.mathworks.com/consulting

Contact us to learn more!
Senior Financial Engineers
  – Ian McKenna (Ian.McKenna@mathworks.com) [Chicago]
  – Marshall Alphonso (Marshall.Alphonso@mathworks.com) [NYC]
Senior Account Managers
  – Chuck Castricone (Chuck.Castricone@mathworks.com)
  – Mike DeLucia (Mike.DeLucia@mathworks.com)
  – Jim Coughlin (Jim.Coughlin@mathworks.com)
  – Mark DeMaio (Mark.DeMaio@mathworks.com)
  – David Habeeb (David.Habeeb@mathworks.com)

Appendix

Requirements
MDCS
  – Windows, Linux, Mac
Spark
  – Linux & Mac (on cluster)
  – 17b, MDCS method: can use tall arrays on a Spark cluster, supporting all architectures for the client and Linux & Mac architectures for the cluster (includes cross-platform support)
  – MDCS: Spark 1.x or 2.x (Spark enabled Hadoop system only)
  – Compiler: Spark 1.x or 2.x (Spark enabled Hadoop system only)
  – Hadoop 2.x or higher

Tips
Use head/tail to pull a portion of the data into memory (also faster!)
Work with the unevaluated array as much as possible
  – Gives MATLAB the ability to further optimize execution
‘Gathering’ more results in one call is faster!
  – [a,b,c] = gather(a,b,c)
Use ‘dot’ notation or array2table, cell2mat, etc. to index data types
Make sure indices are in sorted order (e.g. T([2 5 7],:))
  – Use ‘sort’ on the indices
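Two of these tips, pulling a few rows with head and gathering several results in a single call, are sketched below. The table and its column names A and B are made up for the example.

```matlab
% Run tall calculations serially in the local session.
mapreducer(0);

% Small synthetic tall table (hypothetical columns A and B).
tt = tall(table((1:100)', (101:200)', 'VariableNames', {'A', 'B'}));

% Pull only the first few rows into memory.
firstRows = gather(head(tt, 5));

% Queue several deferred results, then gather them together so the
% data can be processed in as few passes as possible.
minA = min(tt.A);
maxA = max(tt.A);
meanB = mean(tt.B);
[minA, maxA, meanB] = gather(minA, maxA, meanB);
```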

MATLAB Integrates With Many Systems
Built-in support for interoperability with various analytics platforms:
  – HDFS, Hadoop/MapReduce, YARN, Spark 1.X
  – Cloudera, Hortonworks
  – MongoDB
  – Cloud and local databases using ODBC/JDBC
  – AWS S3 and Azure Blob
Applications our service teams can assist with include:
  – Running MathWorks products on cloud platforms (AWS, Azure, Google, etc.)
  – Read/write from: AWS S3, Azure Blob, Azure Data Lake
  – Streaming data: Kafka, Azure IoT Hub, Azure Event Hub, and AWS services
  – Tableau, Qlikview, Spotfire
  – Hive, Cassandra, Impala, Parquet, and AVRO
  – Netezza, Teradata

Tall With Spark + Hadoop
[Diagram: Edge Node (MATLAB, client libraries, spark-submit script); Master (Name Node, YARN Resource Manager); four Worker Nodes, each with an executor, cache, task, and Data Node; HDFS]
