Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks


Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks
A Use Case Guided Explanation
Chris Herrera, Hashmap

Topics
- Who - Key Hashmap Team Members
- The Use Case - Our Need for a Memory Grid
- Requirements
- Approach V1
- Approach V1.5
- Approach V2
- Lessons Learned
- What's Next
- Questions

Who - Hashmap
- WHO: Big Data, IIoT/IoT, AI/ML services since 2012; HQ in the Atlanta area with offices in Houston, Toronto, and Pune; consulting services and managed services
- REACH: 125 customers across 25 industries
- PARTNERS: Cloud and technology platform providers

Who - Hashmap Team Members
- Jay Kapadnis, Lead Architect, Hashmap, Pune, India
- Akshay Mhetre, Team Lead, Hashmap, Pune, India
- Chris Herrera, Chief Architect/Innovation Officer, Hashmap, Houston, TX

The Use Case Oilfield Drilling Data Processing

Why - Oilfield Drilling Data Processing: The Process
(Diagram: Plan, Execute, Optimize cycle, with data flowing between a WITSML server and a plan store)

Why - Oilfield Drilling Data Processing: The Plan
- How to match the data
- Deduplication
- Missing information
- Various formats
- Various ingest paths
(Diagram: a data analyst working across TDM, EDM, WellView, vendor, financial, and homegrown sources)

Why - Oilfield Drilling Data Processing: Rig Site Data Flow
- Missing classification
- Unknown quality
- Various formats
- Various ingest paths
- Unknown completeness
(Diagram: MWD operational data, mud logger, cement, "Magic", and wireline sources reaching the data analyst as WITSML, CSV, and DLIS feeds via WITSML servers)

Why - Oilfield Drilling Data Processing: The Office
- Impossible to generate insights without huge data cleansing operations
- Extracting value is a very expensive operation that has to be done by a combination of experts
- Generating reports requires a huge number of man-hours
(Diagram: a data analyst working across TDM, EDM, WellView, vendor, financial, and homegrown sources)

Why - Oilfield Drilling Data Processing: BUT WAIT

Why - Oilfield Drilling Data Processing
We still have all the compute to deal with, some of which is very legacy code:
- Parse: parse the data from CSV, WITSML, DLIS, etc.
- Identify & Enrich: understand where the data came from and what its global key should be
- Load: load the data into a staging area to start understanding what to do with it
- Clean: deduplicate, interpolate, pivot, split, aggregate
- Feature Engineering: generate additional features that are required to get useful insights into the data
- Persist & Report: land the data into a store that allows for BI reports and interactive queries

Requirements What do we have to do?

Functional Requirements
Cleaning and feature engineering (the legacy code I referred to):
- Parse WITSML / DLIS
- Attribute Mapping
- Unit Conversions
- Null Value Handling
- Rig Operation Enrichment
- Rig State Detection
- Invisible Lost Time Analysis
- Anomaly Detection

Non-Functional Requirements
1. Heterogeneous Data Ingest - Very flexible ingest; flexible simple transformations
2. Robust Data Pipeline - Easy to debug; trusted
3. Extensible Feature Engineering - Able to support existing computational frameworks / runtimes
4. Scalable - Scales up; scales down
5. Reliable - If a data processing workflow fails at a step, it does not continue with erroneous data

Approach V1 How Then?

Solution V1
- Heterogeneous ingest implemented through a combination of NiFi processors/flows and Spark jobs
- Avro files loaded as external tables
- BI connected via ODBC (Tableau)
- The Zeppelin Hive interpreter was used to access the data in Hive
(Diagram: TDM, EDM, WellView, homegrown sources, CSV files, and a WITSML server feeding HDFS staging and mart areas, with Hive, Spark, and Zeppelin on top for BI reporting)

Issues with the Solution
- Very slow BI
- Tough to debug cleansing
- Tough to debug feature extractions
- A lot of overhead for limited benefit
- Painful data loading process
- Incremental refresh was challenging
- Chaining the jobs together in a workflow was very hard; mostly achieved via Jupyter notebooks
- In order to achieve the functional requirements, all of the computations were implemented in Spark, even if there was little benefit

V1 Achieved Requirements
1. Heterogeneous Data Ingest - Very flexible ingest; flexible simple transformations
2. Robust Data Pipeline - Hard to debug; hard to modify
3. Extensible Feature Engineering - Hard to support other frameworks; hard to modify current computations
4. Scalable - Scales up but not down
5. Robust - Hard to debug

Approach V1.5 An Architectural Midstep

A Quick Architectural Midstep (V1.5)
- Complicated an already complex system
- Did not solve all of the problems
- Needed a simpler way to solve all of the issues
- Ignite persistence was released while we were investigating this
(Diagram: the V1 stack with Ignite added - HDFS/IGFS staging and marts, in-memory MapReduce, and Hive, Spark, and Jupyter on top, fed by TDM, EDM, WellView, homegrown sources, CSV files, and a WITSML server)

Approach V2 How Now?

Approach V2
- Allows for very interactive workflows
- Workflows can be scheduled
- Each workflow is made up of functions (microservices)
- Each instance of a workflow contains its own cache (a minimal sketch follows below)
- Zeppelin is accessed via the Ignite interpreter
- Workflows loaded data and also processed data
(Diagram: Workflow, Scheduler, and Functions APIs on Kubernetes/Docker alongside Ignite, HDFS, Spark, Flink, and Zeppelin; the Ignite Service Grid hosts the functions while the Memory Grid holds one cache per workflow, backed by configurable persistent storage)
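
To make the per-workflow cache idea concrete, here is a minimal sketch in Scala, assuming a small helper that names an Ignite cache after the workflow instance; the object name and naming scheme are illustrative, not the actual Hashmap API.

  import org.apache.ignite.{Ignite, IgniteCache}
  import org.apache.ignite.configuration.CacheConfiguration

  // Hypothetical helper: each workflow instance gets its own cache so that
  // intermediate results from different runs stay isolated.
  object WorkflowCaches {
    def cacheFor(ignite: Ignite, workflowId: String): IgniteCache[String, Array[Byte]] = {
      val cfg = new CacheConfiguration[String, Array[Byte]](s"workflow-$workflowId")
      ignite.getOrCreateCache(cfg)
    }
  }

A run with id "well-42-run-7" would then read and write only the cache workflow-well-42-run-7, matching the one-cache-per-workflow-instance point above.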

Approach V2 - The Workflow
- The source is the location the data is coming from
- The workflow is the data that goes from function to function
- Data stored as DataFrames can be queried by an API or another function
(Diagram: Source to Function 1 to Function 2 to Function 3, each function backed by an Apache Ignite SQL/DataFrame service with key-value caches)

Approach - The Workflow
- Each function runs as a service using the Service Grid
- The function receives input from any source: Kafka*, JDBC, Ignite cache
- Once the function is applied, store the result into the Ignite cache store (see the sketch below)
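
As a rough illustration of such a function, here is a minimal Scala sketch of an Ignite Service Grid service that reads one cache and writes its result to another; the class name, cache names, and the trivial transformation are assumptions for the example, not the actual production functions.

  import org.apache.ignite.Ignite
  import org.apache.ignite.resources.IgniteInstanceResource
  import org.apache.ignite.services.{Service, ServiceContext}

  class CleansingFunction extends Service {
    // Injected by Ignite when the service is deployed on a node.
    @IgniteInstanceResource
    private var ignite: Ignite = _

    override def init(ctx: ServiceContext): Unit = {}
    override def cancel(ctx: ServiceContext): Unit = {}

    override def execute(ctx: ServiceContext): Unit = {
      val in  = ignite.getOrCreateCache[String, String]("raw-records")   // input cache (illustrative name)
      val out = ignite.getOrCreateCache[String, String]("clean-records") // output cache (illustrative name)
      // Apply the function and store the result back into an Ignite cache.
      in.forEach(e => out.put(e.getKey, e.getValue.trim))
    }
  }

  // Deployment, for example as a single instance across the cluster:
  // ignite.services().deployClusterSingleton("cleansing-function", new CleansingFunction())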

Workflow Capabilities
- Start / Stop / Restart
- Execute single functions within a workflow
- Pause execution to validate intermediate steps

Approach - Spark Based Functions - Persistence
- After each function has completed its computation, the Spark DataFrame is stored via distributed storage
- The table name is registered in the SQL PUBLIC schema as tableName

  df.write
    .format(FORMAT_IGNITE)
    .option(OPTION_TABLE, tableName) // table name to store data
    .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")
    .save()

(Diagram: a Spark function writing a DataFrame into an Apache Ignite key-value service)
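
For completeness, a hedged Scala sketch of the reverse direction, reading a persisted table back into Spark through the Ignite DataFrame integration; the config file path and table name are assumptions for the example.

  import org.apache.ignite.spark.IgniteDataFrameSettings._
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("ignite-read-sketch").getOrCreate()

  // Load a table that a previous function persisted; OPTION_CONFIG_FILE points
  // at the Ignite client configuration (the path here is illustrative).
  val persisted = spark.read
    .format(FORMAT_IGNITE)
    .option(OPTION_CONFIG_FILE, "/opt/ignite/config/client.xml")
    .option(OPTION_TABLE, "rig_states") // illustrative table name
    .load()

  persisted.show(20)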

Approach - Intermediate Querying
- Once the data is in the cache, it can optionally be persisted using the Ignite persistence module
- The data can be queried using the Ignite SQL grid module as well
- Allows for intermediate validation of the data as it proceeds through the workflow

  val cache = ignite.getOrCreateCache(cacheConfig)
  val cursor = cache.query(new SqlFieldsQuery(s"SELECT * FROM $tableName LIMIT 20"))
  val data = cursor.getAll

(Diagram: an API and a Spark function both querying the DataFrame held in the Apache Ignite key-value service)
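
Enabling the persistence module mentioned above is a node configuration change on the Ignite side; a minimal Scala sketch, assuming the default data region and an otherwise default node configuration.

  import org.apache.ignite.Ignition
  import org.apache.ignite.configuration.{DataRegionConfiguration, DataStorageConfiguration, IgniteConfiguration}

  // Turn on Ignite native persistence for the default data region so that
  // cached workflow results survive node restarts.
  val storage = new DataStorageConfiguration()
    .setDefaultDataRegionConfiguration(
      new DataRegionConfiguration().setPersistenceEnabled(true))

  val ignite = Ignition.start(new IgniteConfiguration().setDataStorageConfiguration(storage))

  // Persistence-enabled clusters start inactive until explicitly activated.
  ignite.cluster().active(true)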

Approach - Applied to the Use Case
(Diagram: the Workflow and Scheduler APIs drive a pipeline from the WITSML server through a Java WITSML client (Docker), Channel Mapping / Unit Conversion (Docker), and Rig State Detection / Enrichment / Pivot (Spark), with each step backed by an Apache Ignite SQL service and key-value cache)

V2 Achieved Requirements
1. Heterogeneous Data Ingest - Very flexible ingest; flexible transformations
2. Robust Data Pipeline - Easy to debug; easy to modify
3. Extensible Feature Engineering - Easy to add; easy to experiment
4. Scalable - Scales up; scales down
5. Robust - Easy to debug; reliable

Solution Benchmark Setup
- Dimension tables already loaded
- 8 functions (6 wells of data - 5.7 billion points): Ingest / Parse WITSML, Null Value Handling, Interpolation, Depth Adjustments, Drill State Detection, Rig State Detection, Anomaly Detection, Pivot Dataset
- For V1 everything was implemented as a Spark application
- For V2 the computations remained close to their original format

Solution Comparison
- V1 execute time: 9 hours (7 hours without the WITSML download)
- V2 execute time: 2 hours (22 minutes without the WITSML download)
- 19x improvement from V1 to V2

Lessons Learned How Now?

Lessons Learned
- Apache Ignite is a great tool to speed up data processing without a wholesale replacement of technology
- Apache Ignite does have a learning curve; it is definitely worth doing an analysis beforehand to understand what it means to operationalize it
- Accelerating Hive via Ignite was not straightforward and, at times, made it very difficult to debug the actual issues that we were facing
- Spatial querying, while great, is LGPL, so be aware of that before your specific implementation
- Understanding data locality in Ignite is crucial for larger data sets
- Ignite works very well inside of Kubernetes due to its peer-to-peer clustering mechanism
- The thin client JDBC driver does not have affinity awareness, so in multi-node configurations the thick client is preferred (see the sketch below)
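
As a concrete note on the JDBC point, a hedged Scala sketch of the two connection styles; the host name and config path are illustrative.

  import java.sql.DriverManager

  // Thin driver: simplest to use, but every statement is routed through the
  // node named in the URL, with no affinity awareness.
  val thinConn = DriverManager.getConnection("jdbc:ignite:thin://ignite-node-1:10800")

  // Client ("thick") driver: starts a client node from an Ignite config file
  // and is topology-aware, which is why it was preferred for multi-node setups.
  Class.forName("org.apache.ignite.IgniteJdbcDriver")
  val thickConn = DriverManager.getConnection("jdbc:ignite:cfg://file:///opt/ignite/config/client.xml")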

What’s Next How Now?

What's Next
- Implementation of a UI on top of the computational framework
- Implementation of a standard set of "functions" that can be leveraged on top of the memory grid
- Implementation of streaming sources via the Kafka Ignite Sink

Questions
Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks
A Use Case Guided Explanation
Chris Herrera, Hashmap
