Big Data: Scale Down, Scale Up, Scale Out


Big Data: Scale Down, Scale Up, Scale Out
Phillip B. Gibbons
Intel Science & Technology Center for Cloud Computing
Keynote Talk at IPDPS'15, May 28, 2015

Slide 2: ISTC for Cloud Computing
- 11.5M over 5 years + 4 Intel researchers. Launched Sept 2011
- 25 faculty, 87 students (CMU, Berkeley, GA Tech, Princeton, Washington)
- Underlying infrastructure enabling the future of cloud computing
- www.istc-cc.cmu.edu

Slide 4: Big Data Performance Challenge
Whenever the volume or velocity of data overwhelms current processing systems/techniques, the resulting performance falls far short of what is desired.
This talk: focus on performance as the key challenge.
Many other challenges, including:
- variety of data, veracity of data
- analytics algorithms that scale
- programming
- security, privacy
- insights from the data, visualization

Slide 5: How to Tackle the Big Data Performance Challenge
Three approaches to improving performance by orders of magnitude:
- Scale down the amount of data processed or the resources needed to perform the processing
- Scale up the computing resources on a node, via parallel processing & faster memory/storage
- Scale out the computing to distributed nodes in a cluster/cloud or at the edge

Slide 6: Scale Down
Scale down the amount of data processed or the resources needed to perform the processing.
Goal: answer queries much faster/cheaper than brute force.
- Specific query? Memoized answer
- Family of queries:
  - Retrieval? Good index
  - With underlying common subquery (table)? Materialized view
  - Aggregation? Data cube
Important scale-down tool: approximation (w/ error guarantees)

Slide 7: Big Data Queries circa 1995
Decision Support Systems (DSS): SQL query in, exact answer out. Long response times!
Scale-down insight: often EXACT answers are not required
- DSS applications are usually exploratory: early feedback helps identify "interesting" regions
- Preview answers while waiting. Trial queries
- Aggregate queries: precision to the "last decimal" not needed

Slide 8: Fast Approximate Answers
Often, we are only interested in the leading digits of the answer.
E.g., average salary:
- 59,152.25 (exact) in 10 minutes, vs.
- 59,000 +/- 500 (with 95% confidence) in 10 seconds, computed from a small synopsis (GB/MB) of the data
Orders of magnitude speed-up, because synopses are orders of magnitude smaller than the original data.
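The speed/accuracy trade-off on this slide can be sketched with plain uniform sampling (a hypothetical helper, not the talk's precomputed synopses): estimate an average from a small sample and report an approximate 95% confidence interval alongside it.

```python
import math
import random
import statistics

def approx_avg(data, sample_frac=0.01, z=1.96):
    """Estimate an average from a small uniform sample; z = 1.96
    gives an (approximate) 95% confidence half-width."""
    n = max(2, int(len(data) * sample_frac))
    sample = random.sample(data, n)
    mean = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return mean, half_width
```

On a 200,000-row table this scans only 2,000 rows, the same orders-of-magnitude saving the slide describes, at the cost of a small, quantified error.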

Slide 9: The Aqua Architecture [Sigmod'98, ...]
Picture without Aqua: the SQL query Q goes directly against the data warehouse, which also absorbs the warehouse data updates.

Slide 10: The Aqua Architecture [Sigmod'98, ...]
Picture with Aqua: the client (browser, Excel) sends SQL query Q over the network (HTML/XML); a rewriter turns Q into a query Q' on the AQUA synopses stored in the data warehouse; the result (w/ error bounds) is returned to the client. An AQUA tracker maintains the synopses as warehouse data updates arrive.
- Aqua is middleware, between client & warehouse (client: error bound reporting; warehouse SW: unmodified)
- Aqua synopses are stored in the warehouse
- Aqua intercepts the user query and rewrites it to be a query Q' on the synopses. The data warehouse returns an approximate answer

Slide 11: Precomputed, Streaming Synopses
Our insights (circa 1996):
- Precomputed is often faster than on-the-fly
  - Better access pattern than sampling
  - Small synopses can reside in memory
- Compute synopses via one-pass streaming
  - Seeing the entire data is very helpful: provably & in practice (biased sampling for group-bys, distinct-value sampling, join sampling, sketches & other statistical functions)
  - Incrementally update synopses as new data arrives
Bottom line: orders of magnitude faster on DSS queries

Slide 12: Example: Distinct-Values Queries
Template:
  select count(distinct target-attr)
  from rel
  where P
Example using the TPC-D/H/R schema:
  select count(distinct o_custkey)
  from orders
  where o_orderdate >= '2014-05-28'
How many distinct customers placed orders in the past year?
- The orders table has many rows for each customer, but each customer must be counted only once, and only if that customer has an order in the past year

Slide 13: Distinct-Values Query Approaches
- Estimate from a random sample (statistics, databases, etc.)
  - Lousy in practice: e.g., a 10% sample cannot distinguish 5 distinct values from 50
  - [Charikar'00] Need linear sample size
- Flajolet-Martin '85 (u = universe size)
  - One-pass algorithm, stores O(log u) bits
  - Only produces a count, can't apply a predicate
- Our approach: Distinct Sampling [VLDB'01]
  - One-pass, stores O(t * log u) tuples
  - Yields a sample of the distinct values, with up to a t-size uniform sample of rows for each value
  - First to provide provably good error guarantees

Slide 14: Accuracy vs. Data Skew [VLDB'01]
[Figure: ratio error (1-7) vs. Zipf parameter (0-4) for Distinct Sampling, GEE, and AE; data set size 1M, sample size 1%]
Over the entire range of skew:
- Distinct Sampling has 1.00-1.02 ratio error
- At least 25 times smaller relative error than GEE and AE

Slide 15: Scale Down Today
- Hundreds and hundreds of clever algorithms
  - Synopsis-based approximations tailored to query families
  - Reduce data size, data dimensionality, memory needed, etc.
- Synopses routinely used in Big Data analytics applications at Google, Twitter, Facebook, etc.
  - E.g., Twitter's open-source Summingbird toolkit [Boykin et al, VLDB'14]:
    - HyperLogLog: number of unique users who perform a certain action; followers-of-followers
    - Count-Min Sketch: number of times each query is issued to Twitter search in a span of time; building histograms
    - Bloom filters: keep track of users who have been exposed to an event, to avoid duplicate impressions (10^8 events/day for 10^8 users)
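Of the three synopses listed, the Bloom filter is the easiest to sketch. This is a hypothetical minimal version (blake2b-salted hashes, arbitrarily chosen parameters), not Summingbird's implementation; it can report false positives but never false negatives, which is exactly the property needed for suppressing duplicate impressions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hashes into an m-bit array."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # k independent bit positions derived from salted hashes.
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True = "probably seen"; False = "definitely not seen".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

The false-positive rate is roughly (1 - e^(-kn/m))^k for n inserted items, so m and k are sized for the expected workload.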

Slide 16: How to Tackle the Big Data Performance Challenge
- Scale Down
- Scale Up the computing resources on a node, via parallel processing & faster memory/storage
- Scale Out

Slide 17: Why Scale Up When You Can Scale Out?
- Much of the Big Data focus has been on Scale Out (Hadoop, etc.)
- But if the data fits in the memory of a multicore, then often an order of magnitude better performance
  - GraphLab (multicore) is 1000x faster than Hadoop (cluster)
  - Multicores now have 1-12 TB memory: most graph analytics problems fit!
- Even when the data doesn't fit, we still want to take advantage of Scale Up whenever we can

Slide 18: Multicore: 144-core Xeon Haswell E7-v3
[Diagram: 8 sockets x 18 cores; each core has 2 HW threads, a 32KB L1, and a 256KB L2; each socket has a 45MB shared L3 cache; up to 12 TB main memory; attached: hard drives & flash devices]

Slide 19: Hierarchy Trends
- Good performance [energy] requires effective use of the hierarchy
- Hierarchy getting richer
  - More cores
  - More levels of cache
  - New memory/storage technologies: Flash/SSDs, emerging PCM
- Bridge gaps in hierarchies: can't just look at the last level of the hierarchy

Slide 20: Hi-Spade: Hierarchy-Savvy Sweet Spot
[Figure: hierarchy-savvy sweet spot, trading off programming effort against performance across platforms]
Goals: modest effort, good performance in practice, robust, strong theoretical foundation

Slide 21: What Yields Good Hierarchy Performance?
- Spatial locality: use what's brought in
- Temporal locality: reuse it
- Constructive sharing: don't step on others' toes
[Diagram: CPU1, CPU2, CPU3, each with a private L1, over a shared L2 cache; "stepping on toes" = all CPUs write B at the same time]
Two design options:
- Cache-aware: focus on the bottleneck level
- Cache-oblivious: design for any cache size
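The cache-oblivious option can be illustrated with the classic recursive matrix transpose: by halving the larger dimension until a small base case, some level of the recursion fits in cache whatever the cache size is, with no M or B parameter appearing in the code. A sketch (the base-case cutoff is an arbitrary assumption):

```python
def co_transpose(A, B, r0=0, c0=0, rows=None, cols=None, base=32):
    """Cache-oblivious out-of-place transpose: B[j][i] = A[i][j].

    Recursively halves the larger dimension; at some recursion depth
    every sub-block fits in cache, for any cache size M.
    """
    if rows is None:
        rows, cols = len(A), len(A[0])
    if rows <= base and cols <= base:
        for i in range(r0, r0 + rows):     # small block: copy directly
            for j in range(c0, c0 + cols):
                B[j][i] = A[i][j]
    elif rows >= cols:
        h = rows // 2                      # split the row range
        co_transpose(A, B, r0, c0, h, cols, base)
        co_transpose(A, B, r0 + h, c0, rows - h, cols, base)
    else:
        h = cols // 2                      # split the column range
        co_transpose(A, B, r0, c0, rows, h, base)
        co_transpose(A, B, r0, c0 + h, rows, cols - h, base)
```

A cache-aware version would instead pick the block size from the known cache parameters of the bottleneck level.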

Slide 22: Multicore Hierarchies' Key New Dimension: Scheduling
Scheduling of parallel threads has a LARGE impact on hierarchy performance.
Key reason: caches are not fully shared.
Recall our problem scenario: all CPUs want to write B at the same time.
[Diagram: CPU1-CPU3 with private L1s over a shared L2 cache]
Can mitigate (but not solve) the problem if we can schedule the writes to be far apart in time.

Slide 23: Program-centric Analysis
- Start with a portable program description: a dynamic Directed Acyclic Graph (DAG)
- Analyze the DAG without reference to cores, caches, connections
- Program-centric metrics:
  - Number of operations (Work, W)
  - Length of the critical path (Depth, D)
  - Data reuse patterns (Locality)
Our goal: program-centric metrics + a smart thread scheduler delivering provably good performance on many platforms

Slide 24: Parallel Cache Complexity Model [Simhadri, 2013]
Decompose the task into maximal subtasks that fit in space M, plus the "glue" operations between them.
Cache complexity Q*(M,B) = (sum over the M-fitting subtasks of the space each one touches) + (one cache miss for every access in the glue)
The parameters M, B are either used in the algorithm (cache-aware) or not (cache-oblivious).

Slide 25: Space-Bounded Scheduler
[Chowdhury, Silvestri, Blakeley, Ramachandran IPDPS'10]
Key ideas:
- Assumes the space use (working-set sizes) of tasks is known (or can be suitably estimated)
- Assigns a task to a cache C that fits the task's working set, reserves the space in C, then recurses on the subtasks using the CPUs and caches that share C (below C in the hierarchy)
- Cache costs: optimal sum over levels i of Q*(Mi) x Ci, where Ci is the miss cost for level-i caches [SPAA'11]
Experiments on a 32-core Nehalem: reduces cache misses by up to 65% vs. work-stealing [SPAA'14]

Slide 26: Sharing vs. Contention
- Sharing: operations that share the same memory location (or possibly other resource)
- Contention: serialized access to a resource (a potential performance penalty of sharing)
Replace the concurrent update with a Priority Update: it updates only if the new value has higher priority than the current one.
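A sketch of the priority-update idea (the published version uses a hardware compare-and-swap loop; a lock stands in for CAS here, and the class name is illustrative). The key point is the cheap read-only check up front: under high sharing, most losing writers return without ever touching the lock, so heavy sharing does not turn into heavy contention.

```python
import threading

class PriorityCell:
    """Priority update: a write takes effect only if its priority
    beats the current one (lock standing in for compare-and-swap)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.value = None
        self.priority = float("-inf")

    def priority_update(self, value, priority):
        # Cheap read-only check: most losing writers stop here,
        # without acquiring the lock at all.
        if priority <= self.priority:
            return False
        with self._lock:
            if priority > self.priority:   # re-check under the lock
                self.value, self.priority = value, priority
                return True
        return False
```

With random priorities, only a logarithmic number of the concurrent writers ever attempt the locked path in expectation, which is why the next slide's experiments show it performing well under high sharing.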

Slide 27: Priority Update Has Low Contention under High Sharing [SPAA'13]
[Figure: running time under increasing sharing; 5 runs of 10^8 operations on a 40-core Intel Nehalem, random priorities. Ordinary concurrent updates perform poorly under high sharing, while priority update performs well under high sharing]

Slide 28: Further Research Directions
- Determinism at the function-call abstraction, commutative building blocks, deterministic reservations for loops, use of priority update [PPoPP'12, SPAA'13, SODA'15]
- Scaling up by redesigning algorithms & data structures to take advantage of new storage/memory technologies [VLDB'08, SIGMOD'10, CIDR'11, SIGMOD'11, SPAA'15]

Slide 29: How to Tackle the Big Data Performance Challenge
- Scale Down
- Scale Up
- Scale Out the computing to distributed nodes in a cluster/cloud or at the edge

Slide 30: Big Learning Frameworks & Systems
- Goal: an easy-to-use programming framework for Big Data analytics that delivers good performance on large (and small) clusters
- Idea: discover & take advantage of distinctive properties of Big Learning algorithms
  - Use training data to learn the parameters of a model
  - The iterate-until-convergence approach is common
  - E.g., stochastic gradient descent for matrix factorization or multiclass logistic regression; LDA via Gibbs sampling; PageRank; deep learning; ...

Slide 31: Parameter Servers for Distributed ML
- Provides all machines with convenient access to the global model parameters
- Enables easy conversion of single-machine parallel ML algorithms
  - "Distributed shared memory" programming style
  - Replace local memory access with PS access
[Diagram: Workers 1-4 connected to a parameter table spread over one or more machines]
† Ahmed et al. (WSDM'12), Power and Li (OSDI'10)

Single-machine parallel:
  UpdateVar(i) {
    old = y[i]
    delta = f(old)
    y[i] += delta
  }

Distributed with PS:
  UpdateVar(i) {
    old = PS.read(y, i)
    delta = f(old)
    PS.inc(y, i, delta)
  }
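The two UpdateVar variants on this slide differ only in where the table lives. A toy, single-process stand-in for the parameter-server interface (read/inc, as in the pseudocode; the names are illustrative, and a real PS shards the table across machines and batches network traffic):

```python
class ParameterServer:
    """Toy in-process stand-in for a parameter-server table,
    exposing read/inc, the two calls the slide's pseudocode uses."""
    def __init__(self, size):
        self.table = [0.0] * size

    def read(self, i):
        return self.table[i]

    def inc(self, i, delta):
        self.table[i] += delta

def update_var(ps, i, f):
    # The slide's distributed UpdateVar: read, compute a delta, increment.
    old = ps.read(i)
    ps.inc(i, f(old))
```

Because updates are expressed as increments rather than overwrites, concurrent workers' deltas compose, which is what makes the conversion from single-machine code mechanical.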

Slide 32: The Cost of Bulk Synchrony
[Diagram: four threads running iterations 1-3; each thread idles at the end of every iteration until the others catch up]
- Wasted computing time! Threads must wait for each other
- The end-of-iteration sync gets longer with larger clusters; precious computing time is wasted
- But: fully asynchronous execution has no algorithm convergence guarantees

Slide 33: Stale Synchronous Parallel (SSP) [NIPS'13]
[Diagram: threads 1-4 at iterations 0-9, staleness threshold = 3; Thread 1 waits until Thread 2 has reached iteration 4; Thread 1 will always see updates older than the threshold, but may not see newer ones (possible error)]
- Allow threads to usually run at their own pace
- Fastest/slowest threads not allowed to drift more than S iterations apart
- Protocol: check the cache first; if too old, get the latest version from the network
- Consequence: fast threads must check the network every iteration, while slow threads check only every S iterations (fewer network accesses, so they catch up!)
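The "check cache first" protocol can be sketched as follows (the names and version bookkeeping are illustrative assumptions, not the NIPS'13 implementation): a read is served from the local cache unless the cached copy is more than S iterations older than the worker's own clock.

```python
class SSPWorker:
    """Sketch of the SSP read protocol. The server maps
    key -> (value, version), where version is the iteration in
    which that value was produced."""
    def __init__(self, server, staleness):
        self.server = server
        self.staleness = staleness
        self.cache = {}        # key -> (value, version)
        self.clock = 0         # this worker's iteration counter

    def read(self, key):
        value, version = self.cache.get(key, (None, -10**9))
        if version < self.clock - self.staleness:
            # Cached copy too stale: fetch the latest from the server.
            value, version = self.server[key]
            self.cache[key] = (value, version)
        return value

    def advance_clock(self):
        self.clock += 1
```

A slow worker's clock advances rarely, so its cached copies stay "fresh enough" longer and it makes fewer fetches, which is exactly the catch-up effect the slide describes.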

Slide 34: Staleness Sweet Spot [ATC'14]
[Figure: convergence time vs. staleness bound; a moderate, nonzero staleness gives the best time]

Slide 35: Enhancements to SSP
- Early transmission of larger parameter changes, up to the bandwidth limit [SoCC'15]
- Find sets of parameters with weak dependency to compute on in parallel
  - Reduces errors from parallelization
- Low-overhead work migration to eliminate transient straggler effects
- Exploit repeated access patterns of iterative algorithms (IterStore) [SoCC'14]
  - Optimizations: prefetching, parameter data placement, static cache policies, static data structures, NUMA memory management

Slide 36: IterStore: Exploiting Iterativeness [SoCC'14]
[Figure: collaborative filtering (CF) on the Netflix data set, 8 machines x 64 cores]

Slide 37: Big Learning Systems, Big Picture
Framework approaches:
- BSP-style approaches: Hadoop, Spark
- Think-like-a-vertex: Pregel, GraphLab
- Parameter server: Yahoo!, SSP
These tend to revisit the same problems with ad hoc techniques.
What is the entire big picture?

Slide 38: Unified Scale Down, Scale Up, Scale Out Big Data System?
No system combines all three.
Research questions:
- How best to combine them: programming & performance challenges
- Scale-down techniques for machine learning? E.g., early iterations on data synopses
- Scale-up techniques more broadly applied? Lessons from decades of parallel computing research
- Scale out beyond the data center? Lessons from the IrisNet project? [Sigmod'03, PC 2003]

Slide 39: How to Tackle the Big Data Performance Challenge
Three approaches to improving performance by orders of magnitude:
- Scale down the amount of data processed or the resources needed to perform the processing
- Scale up the computing resources on a node, via parallel processing & faster memory/storage
- Scale out the computing to distributed nodes in a cluster/cloud or at the edge
Acknowledgment: thanks to MANY collaborators

Slide 40: Appendix

Slide 41: References (1/3)

Slides 9-11:
[Sigmod'98] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. ACM SIGMOD, 1998.
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. ACM SIGMOD, 1999.
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. ACM SIGMOD, 1999. Demo paper.
S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. ACM SIGMOD, 2000.
N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci., 2002. Special issue on Best of PODS'99.
M. Garofalakis and P. B. Gibbons. Probabilistic wavelet synopses. ACM TODS, 2004.

Slides 13-14:
[Charikar'00] M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards Estimation Error Guarantees for Distinct Values. ACM PODS, 2000.
[Flajolet-Martin'85] P. Flajolet and G. N. Martin. Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci., 1985.
[VLDB'01] P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. VLDB, 2001.

Slide 15:
[Boykin et al. VLDB'14] P. O. Boykin, S. Ritchie, I. O'Connell, and J. Lin. Summingbird: A Framework for Integrating Batch and Online MapReduce Computations. PVLDB, 2014.

Slide 24:
[Simhadri, 2013] H. V. Simhadri. Program-Centric Cost Models for Parallelism and Locality. Ph.D. thesis, 2013.

Slide 42: References (2/3)

Slide 25:
[Chowdhury, Silvestri, Blakeley, Ramachandran IPDPS'10] R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. IEEE IPDPS, 2010.
[SPAA'11] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. Scheduling Irregular Parallel Computations on Hierarchical Caches. ACM SPAA, 2011.
[SPAA'14] H. V. Simhadri, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and A. Kyrola. Experimental analysis of space-bounded schedulers. ACM SPAA, 2014.

Slide 27:
[SPAA'13] J. Shun, G. E. Blelloch, J. T. Fineman, and P. B. Gibbons. Reducing contention through priority updates. ACM SPAA, 2013.

Slide 28:
[PPoPP'12] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and J. Shun. Internally deterministic parallel algorithms can be fast. ACM PPoPP, 2012.
[SPAA'13] see above.
[SODA'15] J. Shun, Y. Gu, G. E. Blelloch, J. T. Fineman, and P. B. Gibbons. Sequential Random Permutation, List Contraction and Tree Contraction are Highly Parallel. ACM-SIAM SODA, 2015.
[VLDB'08] S. Nath and P. B. Gibbons. Online maintenance of very large random samples on flash storage. VLDB, 2008.
[SIGMOD'10] S. Chen, P. B. Gibbons, and S. Nath. PR-join: A non-blocking join achieving higher result rate with statistical guarantee. ACM SIGMOD, 2010.
[CIDR'11] S. Chen, P. B. Gibbons, and S. Nath. Rethinking database algorithms for phase change memory. CIDR, 2011.
[SIGMOD'11] M. Athanassoulis, S. Chen, A. Ailamaki, P. B. Gibbons, and R. Stoica. MASM: Efficient online updates in data warehouses. ACM SIGMOD, 2011.

Slide 43: References (3/3)

[SPAA'15] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, and J. Shun. Sorting with Asymmetric Read and Write Costs. ACM SPAA, 2015.

Slide 31:
[Ahmed et al. (WSDM'12)] A. Ahmed, M. Aly, J. Gonzalez, S. M. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. ACM WSDM, 2012.
[Power and Li (OSDI'10)] R. Power and J. Li. Piccolo: Building Fast, Distributed Programs with Partitioned Tables. Usenix OSDI, 2010.

Slide 33:
[NIPS'13] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. Gibson, G. Ganger, and E. Xing. More effective distributed ML via a stale synchronous parallel parameter server. NIPS, 2013.

Slide 34:
[ATC'14] H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Exploiting Bounded Staleness to Speed Up Big Data Analytics. Usenix ATC, 2014.

Slides 35-36:
[SoCC'14] H. Cui, A. Tumanov, J. Wei, L. Xu, W. Dai, J. Haber-Kucharsky, Q. Ho, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Exploiting iterative-ness for parallel ML computations. ACM SoCC, 2014.
[SoCC'15] J. Wei, W. Dai, A. Qiao, Q. Ho, H. Cui, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics. ACM SoCC, 2015.

Slide 38:
[Sigmod'03] A. Deshpande, S. Nath, P. B. Gibbons, and S. Seshan. Cache-and-query for wide area sensor databases. ACM SIGMOD, 2003.
[PC 2003] P. B. Gibbons, B. Karp, Y. Ke, S. Nath, and S. Seshan. IrisNet: An architecture for a worldwide sensor web. IEEE Pervasive Computing, 2003.

Slide 44: Acknowledgments
The work presented in this talk resulted from various collaborations with a large number of students and colleagues. I thank all of my co-authors, whose names appear in the list of references.
A number of these slides were adapted from slides created by my co-authors, and I thank them for those slides.

