Apache Hadoop Ingestion & Dispersal Framework - O'Reilly Media


Strata NY 2018, September 12, 2018
Apache Hadoop Ingestion & Dispersal Framework
Danny Chen, Omkar Joshi, Eric Sayle
Uber Hadoop Platform Team

Agenda
- Mission
- Overview
- Need for Hadoop ingestion & dispersal framework
- Deep Dive: High-Level Architecture, Abstractions and Building Blocks
- Configuration & Monitoring of Jobs
- Completeness & Data Deletion
- Learnings

Uber Apache Hadoop Platform Team Mission
Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage, leveraging the Hadoop ecosystem.

Overview
- Any source to any sink
- Ease of onboarding
- Business impact & importance of data & data store location
- Suite of Hadoop ecosystem tools

Introducing Marmaray

Open sourced in September 2018
https://github.com/uber/marmaray
Blog post: n-source/

Marmaray (Ingestion): Why?
- Raw data needed in Hadoop data lake
- Ingested raw data -> derived datasets
- Reliable and correct schematized data
- Maintenance of multiple data pipelines

Marmaray (Dispersal): Why?
- Derived datasets in Hive
- Need arose to serve live traffic
- Duplicate and ad hoc dispersal pipelines
- Future dispersal needs

Marmaray: Main Features
- Released to production end of 2017
- Automated schema management
- Integration with monitoring & alerting systems
- Fully integrated with workflow orchestration tool
- Extensible architecture
- Open sourced

Marmaray: Uber Eats Use Case

Hadoop Data Ecosystem at Uber

Hadoop Data Ecosystem at Uber (diagram: Hadoop Data Lake with Marmaray handling ingestion into it and dispersal out of it)

High-Level Architecture & Technical Deep Dive

High-Level Architecture (diagram): Datafeed Config Store, Schema Service, Metadata Manager (checkpoint store), and WorkUnitCalculator feed a Source Connector reading from the Input Storage System; records pass through a chain of converters (Converter 1, Converter 2, ...) to a Sink Connector writing to the Output Storage System; failed records go to Error Tables, and the job reports to the M3 Monitoring & Alerting System.
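The chain of converters is the abstraction sitting between source and sink connectors: each converter transforms records and can flag records it cannot handle so they are routed to the error tables. A minimal sketch of that idea in Java; the `Converter` and `ConverterResult` names and shape are illustrative assumptions, not Marmaray's actual API:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical converter abstraction: turns one input record into zero or more
// output records, or marks it as an error record with a reason.
interface Converter<IN, OUT> {
    ConverterResult<OUT> convert(IN record);
}

// Carries either successfully converted records or an error description.
final class ConverterResult<OUT> {
    final List<OUT> records;
    final String error;          // null when conversion succeeded

    private ConverterResult(List<OUT> records, String error) {
        this.records = records;
        this.error = error;
    }

    static <OUT> ConverterResult<OUT> success(OUT record) {
        return new ConverterResult<>(Collections.singletonList(record), null);
    }

    static <OUT> ConverterResult<OUT> error(String reason) {
        return new ConverterResult<>(Collections.emptyList(), reason);
    }

    boolean isError() {
        return error != null;
    }
}
```

Chaining two such converters (e.g. decode, then schema-conform) then amounts to feeding the success records of one into the next, while error results are written to the error table at whichever stage they occur.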


Schema Service (diagram): schemas are fetched by name & version. The Schema Service Reader decodes binary data into a GenericRecord; the Schema Service Writer encodes a GenericRecord back into binary data.
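A minimal sketch of that decode/encode path using plain Avro, since the data lake works with Avro GenericRecords. The `getSchema` lookup is a placeholder assumption standing in for the real schema service; only the Avro classes below are real APIs:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class SchemaServiceCodecSketch {

    // Placeholder for the real schema-service lookup by name and version.
    static Schema getSchema(String name, int version) {
        throw new UnsupportedOperationException("fetch schema from schema service");
    }

    // Reader/decoder: binary payload -> GenericRecord.
    static GenericRecord decode(byte[] payload, Schema schema) throws IOException {
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return reader.read(null, decoder);
    }

    // Writer/encoder: GenericRecord -> binary payload.
    static byte[] encode(GenericRecord record, Schema schema) throws IOException {
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```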


Metadata Manager (checkpoint store): init() is called on job start to load state from persistent storage (e.g. HDFS) into an in-memory copy; set(key, value) and get(key) are called zero or more times by different JobDAG components during the run; persist() is called after the job finishes to write the in-memory copy back to persistent storage.
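A rough sketch of that init/get/set/persist lifecycle; the class and method shapes are assumptions for illustration, not Marmaray's actual metadata manager:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical checkpoint store mirroring the lifecycle described above.
public class InMemoryMetadataManager {

    private final Map<String, String> state = new ConcurrentHashMap<>();

    // init(): called on job start; load previously persisted checkpoints
    // (e.g. read from an HDFS file) into the in-memory copy.
    public void init(Map<String, String> persisted) {
        state.putAll(persisted);
    }

    // get(key): called zero or more times by different JobDAG components.
    public Optional<String> get(String key) {
        return Optional.ofNullable(state.get(key));
    }

    // set(key, value): called zero or more times during the run.
    public void set(String key, String value) {
        state.put(key, value);
    }

    // persist(): called after the job finishes; the caller writes this
    // snapshot back to persistent storage (e.g. HDFS).
    public Map<String, String> persist() {
        return new HashMap<>(state);
    }
}
```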

Fork Operator: why is it needed?
- Avoid reprocessing input records
- Avoid re-reading input records (or, in Spark, re-executing input transformations)
- Splits input records into schema-conforming records and error records

Fork Operator & Fork Function (diagram): the fork function tags each input record (r1, r2, ..., rx) as success or failure; the tagged records are persisted using Spark's disk/memory persistence level; a success filter function yields the schema-conforming records and a failure filter function yields the error records.
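A simplified Spark sketch of the same idea: tag once, persist, then filter twice, so the upstream read and transformations run only once. The `isSchemaConforming` helper and record type are stand-in assumptions; the persistence and filtering pattern is the point:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

public class ForkOperatorSketch {

    // Assumed helper standing in for real schema validation.
    static boolean isSchemaConforming(String record) {
        return record != null && !record.isEmpty();
    }

    public static void fork(JavaRDD<String> inputRecords) {
        // Tag every record exactly once with a success/failure flag.
        JavaRDD<Tuple2<Boolean, String>> tagged =
                inputRecords.map(r -> new Tuple2<>(isSchemaConforming(r), r));

        // Persist so the two downstream filters do not re-read the source
        // or re-execute the upstream transformations.
        tagged.persist(StorageLevel.MEMORY_AND_DISK());

        // Success filter: schema-conforming records continue toward the sink.
        JavaRDD<String> schemaConforming = tagged.filter(t -> t._1()).map(Tuple2::_2);

        // Failure filter: error records go to the error tables.
        JavaRDD<String> errorRecords = tagged.filter(t -> !t._1()).map(Tuple2::_2);

        System.out.println("good=" + schemaConforming.count()
                + " errors=" + errorRecords.count());
    }
}
```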

Easy to add support for a new source & sink (diagram): Hive, Kafka, S3, Cassandra, and any new source all exchange data with the data lake as GenericRecords.

Support for writing into multiple systems (diagram): a single ingestion from Kafka into the data lake (GenericRecord) can be written out to multiple sinks, e.g. Hive Table 1 and Hive Table 2.

JobDag & JobDagActions: JobDagActions run after a JobDAG completes, e.g. report metrics for monitoring and register the table in Hive.
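A minimal sketch of that post-run hook pattern; the interface and the two actions are illustrative assumptions, not Marmaray's actual classes:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical hook executed after a JobDAG completes.
interface JobDagAction {
    void execute(boolean dagSucceeded);
}

class ReportMetricsAction implements JobDagAction {
    @Override
    public void execute(boolean dagSucceeded) {
        // A real action would emit counters/gauges to the monitoring system.
        System.out.println("job_success=" + (dagSucceeded ? 1 : 0));
    }
}

class RegisterHiveTableAction implements JobDagAction {
    @Override
    public void execute(boolean dagSucceeded) {
        if (dagSucceeded) {
            // A real action would register the output table/partition in the Hive metastore.
            System.out.println("registering table in Hive metastore");
        }
    }
}

class JobDagActionRunner {
    static void runAll(boolean dagSucceeded) {
        List<JobDagAction> actions =
                Arrays.asList(new ReportMetricsAction(), new RegisterHiveTableAction());
        actions.forEach(a -> a.execute(dagSucceeded));
    }
}
```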

Need for running multiple JobDags together
- Frequency of data arrival
- Number of messages
- Average record size & complexity of schema
- A Spark job has a driver and 1 or more executors
- Not an efficient model to handle spikes
- Too many topics to ingest (~2000)

JobManager
- Single Spark job running ingestion for 300 topics
- Executes multiple JobDAGs
- Manages execution ordering for multiple JobDAGs
- Manages shared Spark context
- Enables job- and tier-level locking
(diagram: JobMgr runs one Spark job ingesting kafka-topic 1 through kafka-topic N as JobDAG 1 ... JobDAG N, fed from a waiting queue of JobDAGs)
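A heavily simplified sketch of that scheduling loop: a waiting queue of DAGs submitted onto one shared Spark application, with a bound on how many run at once. The queue shape, DAG type, and thread-pool bound are assumptions; the real JobManager also enforces job- and tier-level locking:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JobManagerSketch {

    // A JobDAG here is just a runnable unit of ingestion work for one topic.
    interface JobDag extends Runnable {
        String name();
    }

    private final Queue<JobDag> waitingQueue = new ConcurrentLinkedQueue<>();
    private final ExecutorService pool;

    public JobManagerSketch(int maxConcurrentDags) {
        // All DAGs share the same Spark application; this pool only bounds
        // how many DAGs are submitted to it concurrently.
        this.pool = Executors.newFixedThreadPool(maxConcurrentDags);
    }

    public void enqueue(JobDag dag) {
        waitingQueue.add(dag);
    }

    public void runAll() throws InterruptedException {
        JobDag dag;
        while ((dag = waitingQueue.poll()) != null) {
            final JobDag current = dag;
            pool.submit(() -> {
                System.out.println("running " + current.name());
                current.run();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```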

Completeness (diagram): records at the source (Kafka) and the sink (Hive) are grouped into 10-minute buckets, including the latest bucket, so per-bucket counts can be compared.

Completeness contd.
- Why not run queries on the source and sink datasets periodically? Possible for very small datasets, but it won't work for billions of records; very expensive!
- Bucketizing records: create time-based buckets, say every 2 or 10 minutes.
- Count records at source and sink during every run.
- Does it give a 100% guarantee? No, but with high probability it is close to it.
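A small sketch of the bucket-count comparison described above: map each record's event time to a 10-minute bucket, then compare per-bucket counts collected at the source and the sink. The bucket size and the map inputs are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class CompletenessCheckSketch {

    private static final long BUCKET_MILLIS = 10 * 60 * 1000L;   // 10-minute buckets

    // Map an event timestamp to the start of its 10-minute bucket.
    static long bucketOf(long eventTimeMillis) {
        return (eventTimeMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
    }

    // Compare per-bucket counts collected at the source (Kafka) and sink (Hive);
    // any bucket where the sink count is lower flags potentially missing records.
    static Map<Long, Long> missingCounts(Map<Long, Long> sourceCounts,
                                         Map<Long, Long> sinkCounts) {
        Map<Long, Long> missing = new HashMap<>();
        sourceCounts.forEach((bucket, srcCount) -> {
            long sinkCount = sinkCounts.getOrDefault(bucket, 0L);
            if (sinkCount < srcCount) {
                missing.put(bucket, srcCount - sinkCount);
            }
        });
        return missing;
    }
}
```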

Completeness, high level (diagram): input records (IR) and input error records (IER) are tracked against output records (OR) and output error records (OER); error records land in the error table and output records land in Hoodie (Hive).

Hadoop's old way of storing Kafka data (diagram): a Kafka topic is laid out as date partitions (2014/01/01, 2015/01/02, ..., 2018/08/06). Older partitions hold stitched parquet files (~4 GB, ~400 files per partition); the latest date partition holds non-stitched parquet files (~40 MB, 20-40K files per partition).

Data Deletion (Kafka)
- The old architecture is designed to be append/read only
- No indexes: the only way to update is to rewrite the entire partition, and the entire partition must be scanned to find out whether a record is present or not
- GDPR requires all data to be cleaned up once a user requests deletion, which means re-writing entire partitions
- This is a big architectural change and many companies are struggling to solve it

Marmaray + Hudi (Hoodie) to the rescue

Hoodie data layout (diagram): each date partition of the Kafka topic (2014/01/01, 2015/02/02, ..., 2018/08/06) holds parquet files named by file group and commit timestamp (f1 ts1.parquet, ..., f7 ts2.parquet). Updates write new versions of the affected file groups with a newer commit (f1 ts3.parquet, f8 ts3.parquet), and the Hoodie metadata directory tracks the commits (ts1.commit, ts2.commit, ts3.commit).
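Because readers only pick up the latest committed version of each file group, an update or a user-deletion rewrite touches a single file group rather than a whole partition. A toy sketch of that "latest version per file group" resolution; the underscore-based file naming here is an assumption for illustration, not Hudi's actual naming scheme:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LatestFileVersionSketch {

    // Keep only the newest commit timestamp seen for each file group, e.g.
    // ["f1_ts1.parquet", "f1_ts3.parquet", "f5_ts2.parquet"]
    // resolves to {f1 -> f1_ts3.parquet, f5 -> f5_ts2.parquet}.
    static Map<String, String> latestVersions(List<String> files) {
        Map<String, String> latest = new HashMap<>();
        for (String file : files) {
            String base = file.replace(".parquet", "");
            String fileGroup = base.split("_")[0];
            String commitTs = base.split("_")[1];
            String current = latest.get(fileGroup);
            if (current == null || commitTs.compareTo(commitTsOf(current)) > 0) {
                latest.put(fileGroup, file);
            }
        }
        return latest;
    }

    private static String commitTsOf(String file) {
        return file.replace(".parquet", "").split("_")[1];
    }
}
```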

Configuration (example YAML)

    common:
      hadoop:
        fs.defaultFS: "hdfs://namenode/"
    hoodie:
      table name: "mydb.table1"
      base path: "/path/to/my.db/table1"
      metrics prefix: "marmaray"
      enable metrics: true
      parallelism: 64
    kafka:
      conn:
        bootstrap.servers: ...
        socket.receive.buffer.bytes: 5242880
        fetch.message.max.bytes: 20971520
        auto.commit.enable: false
        fetch.min.bytes: 5242880
      source:
        topic name: "topic1"
        max messages: 1024
        read parallelism: 64
    error table:
      enabled: true
      dest path: "/path/to/my.db/table1/.error"
      date partitioned: true

Monitoring & Alerting

Learnings
- Spark
  - Parquet: better record compression with column alignments
- Kafka
  - Off-heap memory usage of Spark and YARN killing our containers
  - External shuffle server overloading
  - Be gentle while reading from Kafka brokers
- Cassandra
  - Cassandra SSTable streaming (no throttling), no monitoring
  - No backfill for dispersal


Other Relevant Talks
- Your 5 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber - Wed 11:20 am
- Hudi: Unifying storage and serving for batch and near-real-time analytics - Wed 5:25 pm

We are hiring!
Positions available in Seattle, Palo Alto & San Francisco
Email: hadoop-platform-jobs@uber.com

Useful links: https://eng.uber.com/m3/

Q & A?

Follow our Facebook page: www.facebook.com/uberopensource

