Apache Hadoop Ingestion & Dispersal Framework - O'Reilly Media


Strata NY 2018, September 12, 2018
Apache Hadoop Ingestion & Dispersal Framework
Danny Chen, Omkar Joshi, Eric Sayle
Uber Hadoop Platform Team

Agenda
- Mission
- Overview
- Need for Hadoop ingestion & dispersal framework
- Deep Dive: High-Level Architecture, Abstractions and Building Blocks
- Configuration & Monitoring of Jobs
- Completeness & Data Deletion
- Learnings

Uber Apache Hadoop Platform Team Mission
Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage, leveraging the Hadoop ecosystem.

Overview
- Any source to any sink
- Ease of onboarding
- Business impact & importance of data & data store location
- Suite of Hadoop ecosystem tools

Introducing Marmaray

Open sourced in September 2018
https://github.com/uber/marmaray
Blog post: n-source/

Marmaray (Ingestion): Why?
- Raw data needed in Hadoop data lake
- Ingested raw data -> derived datasets
- Reliable and correct schematized data
- Maintenance of multiple data pipelines

Marmaray (Dispersal): Why?
- Derived datasets in Hive
- Need arose to serve live traffic
- Duplicate and ad hoc dispersal pipelines
- Future dispersal needs

Marmaray: Main Features
- Released to production end of 2017
- Automated schema management
- Integration with monitoring & alerting systems
- Fully integrated with workflow orchestration tool
- Extensible architecture
- Open sourced

Marmaray: Uber Eats Use Case

Hadoop Data Ecosystem at Uber

Hadoop Data Ecosystem at Uber (diagram: Hadoop Data Lake with Marmaray handling ingestion into it and dispersal out of it)

High-Level Architecture & Technical Deep Dive

High-Level Architecture (diagram): Datafeed Config Store, Schema Service, Metadata Manager (checkpoint store), and WorkUnitCalculator feed a Source Connector reading from the Input Storage System; records pass through a chain of converters (Converter 1, Converter 2, ...) to a Sink Connector writing to the Output Storage System; failed records go to Error Tables, and the job reports to the M3 Monitoring & Alerting System.
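The chain of converters is the abstraction sitting between source and sink connectors: each converter transforms records and can flag records it cannot handle so they are routed to the error tables. A minimal sketch of that idea in Java; the `Converter` and `ConverterResult` names and shape are illustrative assumptions, not Marmaray's actual API:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical converter abstraction: turns one input record into zero or more
// output records, or marks it as an error record with a reason.
interface Converter<IN, OUT> {
    ConverterResult<OUT> convert(IN record);
}

// Carries either successfully converted records or an error description.
final class ConverterResult<OUT> {
    final List<OUT> records;
    final String error;          // null when conversion succeeded

    private ConverterResult(List<OUT> records, String error) {
        this.records = records;
        this.error = error;
    }

    static <OUT> ConverterResult<OUT> success(OUT record) {
        return new ConverterResult<>(Collections.singletonList(record), null);
    }

    static <OUT> ConverterResult<OUT> error(String reason) {
        return new ConverterResult<>(Collections.emptyList(), reason);
    }

    boolean isError() {
        return error != null;
    }
}
```

Chaining two such converters (e.g. decode, then schema-conform) then amounts to feeding the success records of one into the next, while error results are written to the error table at whichever stage they occur.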


Schema Service (diagram): schemas are fetched by name & version. The Schema Service Reader decodes binary data into a GenericRecord; the Schema Service Writer encodes a GenericRecord back into binary data.
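A minimal sketch of that decode/encode path using plain Avro, since the data lake works with Avro GenericRecords. The `getSchema` lookup is a placeholder assumption standing in for the real schema service; only the Avro classes below are real APIs:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class SchemaServiceCodecSketch {

    // Placeholder for the real schema-service lookup by name and version.
    static Schema getSchema(String name, int version) {
        throw new UnsupportedOperationException("fetch schema from schema service");
    }

    // Reader/decoder: binary payload -> GenericRecord.
    static GenericRecord decode(byte[] payload, Schema schema) throws IOException {
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return reader.read(null, decoder);
    }

    // Writer/encoder: GenericRecord -> binary payload.
    static byte[] encode(GenericRecord record, Schema schema) throws IOException {
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```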


Metadata Manager (checkpoint store): init() is called on job start to load state from persistent storage (e.g. HDFS) into an in-memory copy; set(key, value) and get(key) are called zero or more times by different JobDAG components during the run; persist() is called after the job finishes to write the in-memory copy back to persistent storage.
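A rough sketch of that init/get/set/persist lifecycle; the class and method shapes are assumptions for illustration, not Marmaray's actual metadata manager:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical checkpoint store mirroring the lifecycle described above.
public class InMemoryMetadataManager {

    private final Map<String, String> state = new ConcurrentHashMap<>();

    // init(): called on job start; load previously persisted checkpoints
    // (e.g. read from an HDFS file) into the in-memory copy.
    public void init(Map<String, String> persisted) {
        state.putAll(persisted);
    }

    // get(key): called zero or more times by different JobDAG components.
    public Optional<String> get(String key) {
        return Optional.ofNullable(state.get(key));
    }

    // set(key, value): called zero or more times during the run.
    public void set(String key, String value) {
        state.put(key, value);
    }

    // persist(): called after the job finishes; the caller writes this
    // snapshot back to persistent storage (e.g. HDFS).
    public Map<String, String> persist() {
        return new HashMap<>(state);
    }
}
```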

Fork Operator: why is it needed?
- Avoid reprocessing input records
- Avoid re-reading input records (or, in Spark, re-executing input transformations)
- Splits input records into schema-conforming records and error records

Fork Operator & Fork Function (diagram): the fork function tags each input record (r1, r2, ..., rx) as success or failure; the tagged records are persisted using Spark's disk/memory persistence level; a success filter function yields the schema-conforming records and a failure filter function yields the error records.
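A simplified Spark sketch of the same idea: tag once, persist, then filter twice, so the upstream read and transformations run only once. The `isSchemaConforming` helper and record type are stand-in assumptions; the persistence and filtering pattern is the point:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

public class ForkOperatorSketch {

    // Assumed helper standing in for real schema validation.
    static boolean isSchemaConforming(String record) {
        return record != null && !record.isEmpty();
    }

    public static void fork(JavaRDD<String> inputRecords) {
        // Tag every record exactly once with a success/failure flag.
        JavaRDD<Tuple2<Boolean, String>> tagged =
                inputRecords.map(r -> new Tuple2<>(isSchemaConforming(r), r));

        // Persist so the two downstream filters do not re-read the source
        // or re-execute the upstream transformations.
        tagged.persist(StorageLevel.MEMORY_AND_DISK());

        // Success filter: schema-conforming records continue toward the sink.
        JavaRDD<String> schemaConforming = tagged.filter(t -> t._1()).map(Tuple2::_2);

        // Failure filter: error records go to the error tables.
        JavaRDD<String> errorRecords = tagged.filter(t -> !t._1()).map(Tuple2::_2);

        System.out.println("good=" + schemaConforming.count()
                + " errors=" + errorRecords.count());
    }
}
```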

Easy to add support for a new source & sink (diagram): Hive, Kafka, S3, Cassandra, and any new source all exchange data with the data lake as GenericRecords.

Support for writing into multiple systems (diagram): a single ingestion from Kafka into the data lake (GenericRecord) can be written out to multiple sinks, e.g. Hive Table 1 and Hive Table 2.

JobDag & JobDagActions: JobDagActions run after a JobDAG completes, e.g. report metrics for monitoring and register the table in Hive.
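A minimal sketch of that post-run hook pattern; the interface and the two actions are illustrative assumptions, not Marmaray's actual classes:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical hook executed after a JobDAG completes.
interface JobDagAction {
    void execute(boolean dagSucceeded);
}

class ReportMetricsAction implements JobDagAction {
    @Override
    public void execute(boolean dagSucceeded) {
        // A real action would emit counters/gauges to the monitoring system.
        System.out.println("job_success=" + (dagSucceeded ? 1 : 0));
    }
}

class RegisterHiveTableAction implements JobDagAction {
    @Override
    public void execute(boolean dagSucceeded) {
        if (dagSucceeded) {
            // A real action would register the output table/partition in the Hive metastore.
            System.out.println("registering table in Hive metastore");
        }
    }
}

class JobDagActionRunner {
    static void runAll(boolean dagSucceeded) {
        List<JobDagAction> actions =
                Arrays.asList(new ReportMetricsAction(), new RegisterHiveTableAction());
        actions.forEach(a -> a.execute(dagSucceeded));
    }
}
```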

Need for running multiple JobDags together
- Frequency of data arrival
- Number of messages
- Average record size & complexity of schema
- A Spark job has a driver and 1 or more executors
- Not an efficient model to handle spikes
- Too many topics to ingest (~2000)

JobManager
- Single Spark job running ingestion for 300 topics
- Executes multiple JobDAGs
- Manages execution ordering for multiple JobDAGs
- Manages shared Spark context
- Enables job- and tier-level locking
(diagram: JobMgr runs one Spark job ingesting kafka-topic 1 through kafka-topic N as JobDAG 1 ... JobDAG N, fed from a waiting queue of JobDAGs)
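A heavily simplified sketch of that scheduling loop: a waiting queue of DAGs submitted onto one shared Spark application, with a bound on how many run at once. The queue shape, DAG type, and thread-pool bound are assumptions; the real JobManager also enforces job- and tier-level locking:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JobManagerSketch {

    // A JobDAG here is just a runnable unit of ingestion work for one topic.
    interface JobDag extends Runnable {
        String name();
    }

    private final Queue<JobDag> waitingQueue = new ConcurrentLinkedQueue<>();
    private final ExecutorService pool;

    public JobManagerSketch(int maxConcurrentDags) {
        // All DAGs share the same Spark application; this pool only bounds
        // how many DAGs are submitted to it concurrently.
        this.pool = Executors.newFixedThreadPool(maxConcurrentDags);
    }

    public void enqueue(JobDag dag) {
        waitingQueue.add(dag);
    }

    public void runAll() throws InterruptedException {
        JobDag dag;
        while ((dag = waitingQueue.poll()) != null) {
            final JobDag current = dag;
            pool.submit(() -> {
                System.out.println("running " + current.name());
                current.run();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```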

Completeness (diagram): records at the source (Kafka) and the sink (Hive) are grouped into 10-minute buckets, including the latest bucket, so per-bucket counts can be compared.

Completeness contd.
- Why not run queries on the source and sink datasets periodically? Possible for very small datasets, but it won't work for billions of records; very expensive!
- Bucketizing records: create time-based buckets, say every 2 or 10 minutes.
- Count records at source and sink during every run.
- Does it give a 100% guarantee? No, but with high probability it is close to it.
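A small sketch of the bucket-count comparison described above: map each record's event time to a 10-minute bucket, then compare per-bucket counts collected at the source and the sink. The bucket size and the map inputs are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class CompletenessCheckSketch {

    private static final long BUCKET_MILLIS = 10 * 60 * 1000L;   // 10-minute buckets

    // Map an event timestamp to the start of its 10-minute bucket.
    static long bucketOf(long eventTimeMillis) {
        return (eventTimeMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
    }

    // Compare per-bucket counts collected at the source (Kafka) and sink (Hive);
    // any bucket where the sink count is lower flags potentially missing records.
    static Map<Long, Long> missingCounts(Map<Long, Long> sourceCounts,
                                         Map<Long, Long> sinkCounts) {
        Map<Long, Long> missing = new HashMap<>();
        sourceCounts.forEach((bucket, srcCount) -> {
            long sinkCount = sinkCounts.getOrDefault(bucket, 0L);
            if (sinkCount < srcCount) {
                missing.put(bucket, srcCount - sinkCount);
            }
        });
        return missing;
    }
}
```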

Completeness, high level (diagram): input records (IR) and input error records (IER) are tracked against output records (OR) and output error records (OER); error records land in the error table and output records land in Hoodie (Hive).

Hadoop's old way of storing Kafka data (diagram): a Kafka topic is laid out as date partitions (2014/01/01, 2015/01/02, ..., 2018/08/06). Older partitions hold stitched parquet files (~4 GB, ~400 files per partition); the latest date partition holds non-stitched parquet files (~40 MB, 20-40K files per partition).

Data Deletion (Kafka)
- The old architecture is designed to be append/read only
- No indexes: the only way to update is to rewrite the entire partition, and the entire partition must be scanned to find out whether a record is present or not
- GDPR requires all data to be cleaned up once a user requests deletion, which means re-writing entire partitions
- This is a big architectural change and many companies are struggling to solve it

Marmaray + Hudi (Hoodie) to the rescue

Hoodie data layout (diagram): each date partition of the Kafka topic (2014/01/01, 2015/02/02, ..., 2018/08/06) holds parquet files named by file group and commit timestamp (f1 ts1.parquet, ..., f7 ts2.parquet). Updates write new versions of the affected file groups with a newer commit (f1 ts3.parquet, f8 ts3.parquet), and the Hoodie metadata directory tracks the commits (ts1.commit, ts2.commit, ts3.commit).
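Because readers only pick up the latest committed version of each file group, an update or a user-deletion rewrite touches a single file group rather than a whole partition. A toy sketch of that "latest version per file group" resolution; the underscore-based file naming here is an assumption for illustration, not Hudi's actual naming scheme:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LatestFileVersionSketch {

    // Keep only the newest commit timestamp seen for each file group, e.g.
    // ["f1_ts1.parquet", "f1_ts3.parquet", "f5_ts2.parquet"]
    // resolves to {f1 -> f1_ts3.parquet, f5 -> f5_ts2.parquet}.
    static Map<String, String> latestVersions(List<String> files) {
        Map<String, String> latest = new HashMap<>();
        for (String file : files) {
            String base = file.replace(".parquet", "");
            String fileGroup = base.split("_")[0];
            String commitTs = base.split("_")[1];
            String current = latest.get(fileGroup);
            if (current == null || commitTs.compareTo(commitTsOf(current)) > 0) {
                latest.put(fileGroup, file);
            }
        }
        return latest;
    }

    private static String commitTsOf(String file) {
        return file.replace(".parquet", "").split("_")[1];
    }
}
```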

Configuration (example YAML)

    common:
      hadoop:
        fs.defaultFS: "hdfs://namenode/"
    hoodie:
      table name: "mydb.table1"
      base path: "/path/to/my.db/table1"
      metrics prefix: "marmaray"
      enable metrics: true
      parallelism: 64
    kafka:
      conn:
        bootstrap.servers: ...
        socket.receive.buffer.bytes: 5242880
        fetch.message.max.bytes: 20971520
        auto.commit.enable: false
        fetch.min.bytes: 5242880
      source:
        topic name: "topic1"
        max messages: 1024
        read parallelism: 64
    error table:
      enabled: true
      dest path: "/path/to/my.db/table1/.error"
      date partitioned: true

Monitoring & Alerting

Learnings
- Spark
  - Parquet: better record compression with column alignments
- Kafka
  - Off-heap memory usage of Spark and YARN killing our containers
  - External shuffle server overloading
  - Be gentle while reading from Kafka brokers
- Cassandra
  - Cassandra SSTable streaming (no throttling), no monitoring
  - No backfill for dispersal


Other Relevant Talks
- Your 5 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber - Wed 11:20 am
- Hudi: Unifying storage and serving for batch and near-real-time analytics - Wed 5:25 pm

We are hiring!
Positions available in Seattle, Palo Alto & San Francisco
Email: hadoop-platform-jobs@uber.com

Useful links: https://eng.uber.com/m3/

Q & A?

Follow our Facebook page: www.facebook.com/uberopensource

