Using Apache Spark, Apache Kafka And Apache Cassandra


Using Apache Spark, Apache Kafka and Apache Cassandra to Power Intelligent Applications

Apache Cassandra is well known as the database of choice for powering the most scalable, reliable architectures available. Apache Spark is the state-of-the-art advanced and scalable analytics engine. Apache Kafka is the leading stream processing engine for scale and reliability.

Deployed together, these technologies give developers the building blocks needed to build reliable, scalable and intelligent applications that adapt based on the data they collect.

This paper discusses the use cases, architectural patterns and operational considerations for deploying these technologies to deliver intelligent applications.

Use Cases

Internet of Things

At the core of an IoT application there is a stream of regular observations from (potentially) a large number of devices or items with embedded electronics (e.g. switches, sensors, tags). A stream of IoT data is just “big data”, but analysing that big data in a way that drives actions, recommendations, or provides information is where the application delivers value.

Apache Cassandra is extremely well suited to receiving and storing streams of data. Its always-on availability matches the constant stream of data sent by devices, ensuring your application is always able to store data. In addition, its native storage formats are well suited to efficiently storing and using time series data such as that produced by IoT devices. The scalability of Apache Cassandra means you can be assured that your datastore will smoothly scale as the number of devices and the stream of data grow.

The powerful analytics capabilities and distributed architecture of Apache Spark make it the perfect engine to help you make sense of, and make decisions based on, the data you are receiving from your IoT devices. Spark’s stream processing can quickly determine answers from short-term views of your data as it is received. For analysis running over longer time periods, the Spark Cassandra connector enables Spark to efficiently access data stored in Cassandra to perform analysis.
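As a rough illustration of a "short-term view" over a stream, the sketch below groups device readings into fixed 60-second windows and averages them per device. It is plain Python rather than Spark code, and the reading layout (device id, timestamp in seconds, value) and the function name `windowed_means` are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import mean


def windowed_means(readings, window_seconds=60):
    """Group (device_id, timestamp, value) readings into fixed windows
    and return the mean value per (device_id, window_start)."""
    buckets = defaultdict(list)
    for device_id, timestamp, value in readings:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[(device_id, window_start)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical sensor readings, for illustration only.
readings = [
    ("sensor-1", 0, 20.0),
    ("sensor-1", 30, 22.0),
    ("sensor-1", 65, 30.0),
    ("sensor-2", 10, 5.0),
]
summary = windowed_means(readings)
```

In a real deployment the same grouping would be expressed with the stream processing engine's own windowing operators, so that it scales across the cluster.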

In this context, Apache Kafka is often used as a reliable message buffer. In many IoT scenarios, the flow of data from devices is constant and the devices have very limited capacity to buffer data in the event the central processing service is unavailable. Events from the devices can be written to Kafka when first received and then picked up and processed by the downstream applications. This ensures events are not lost even if the processing elements of the central system become backed up or suffer downtime.

In addition, using Kafka in this manner makes it easy to add further consumers of the event stream to the system. For example, your initial implementation may have a simple application that just saves data to Cassandra for later use, and you then add a second application that performs real-time processing on the event stream. Kafka Streams may also be used as an alternative to Spark Streaming for real-time stream processing.
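The buffering behaviour described above can be modelled with a small in-memory sketch: an append-only log in which each consumer tracks its own read position. The `EventLog` class and the consumer names are hypothetical stand-ins for illustration; this is not the Kafka client API.

```python
class EventLog:
    """In-memory stand-in for a single-partition topic (illustrative only)."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, event):
        # Producers only ever add to the end of the log.
        self._events.append(event)

    def poll(self, consumer, max_events=100):
        # Each consumer reads from its own offset, independent of the others.
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for reading in ("r1", "r2", "r3"):
    log.append(reading)

saved = log.poll("cassandra-writer")        # first consumer drains the backlog
log.append("r4")                            # devices keep writing in the meantime
realtime = log.poll("realtime-analytics")   # a consumer added later
saved_later = log.poll("cassandra-writer")  # first consumer resumes where it left off
```

One simplification to note: in this sketch a newly added consumer always starts from the beginning of the log, whereas in Kafka the starting position of a new consumer group depends on its configuration.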

Financial Services

The pressures for financial services companies to gain a technological edge in data processing are coming not only from the competition but also from consumers. Gaining a competitive edge requires systems that can collect and quickly analyse vast streams of data. Consumers expect that the systems they interact with will be instantly up to date, always available and, increasingly, aware of the context of all their previous interactions and related information.

Addressing these joint pressures, while containing technology costs, requires the adoption of new-generation architectural patterns and technologies. Apache Cassandra, Apache Kafka and Apache Spark are technologies that are ideally placed to form the core of such an architecture. The applicability of these technologies in financial services has been proven many times by leading organisations such as ING and UBS.

One common application we see for Cassandra in financial services is as a persistent cache to support high-volume client requests. In particular, we see this requirement with banks implementing the Payment Services Directive (PSD2) in the EU. This leverages Cassandra’s extreme reliability and built-for-the-cloud architecture to enable financial services organisations to deliver always-on service and avoid the high cost of scaling their legacy (often mainframe) architectures to meet increased client interaction needs. Spark is often included in this architecture to enrich the cached data with sophisticated analysis of trends and patterns in the data, enabling user-facing applications to make this analysis available with interactive response times. Kafka often sits in this picture as a message bus connecting the core processing system to multiple downstream consumers.

Others

The two use cases above are great examples where we see regular adoption of Spark, Kafka and Cassandra. However, there are many other business problems where the three technologies can combine to provide an ideal solution. Some examples that we have seen include:

Ad-Tech
Relying on the low-latency (low double-digit ms) responsiveness and always-on availability of Cassandra to make online advertising placement decisions backed by deep analysis calculated with Spark. Massive flows of inbound events and information can be managed with Kafka.

Application Monitoring
We use a combination of Spark and Cassandra in our own monitoring system that monitors close to 1500 servers. Cassandra seamlessly handles a steady stream of writes with metrics data while Spark is used to calculate regular roll-ups to allow viewing summarised data over long time periods. Kafka acts as a centralisation point for the messages and also as a message buffer.

Inventory Management, particularly in travel
Use Cassandra to track inventory records and Spark to analyse available inventory to determine dynamic pricing, capacity trends, etc.

Architectural Patterns

Batch Updating

At a more rudimentary level, many Cassandra applications have a need for periodic batch processing for data maintenance. While this can include summarisation, it can also include requirements like implementing complex data expiry rules. Running these batches through a single-threaded (or single-machine) batch engine will not scale to the same extent your Cassandra cluster will. Implementing these batch jobs in Spark provides not only a pre-built set of libraries to assist with development of the data processing functionality but also the frameworks to automatically scale the jobs and execute the processing logic on the same servers where the data is stored.

Stream Enrichment

For most applications, a strong design will store in a single Cassandra table all of the information required to service a particular read request (i.e. the data will be highly denormalised). In some cases this denormalisation process will require calculating or looking up additional data to add to a stream before the stream of data is saved. Using Spark Streaming to process data before saving it to Cassandra provides a scalable and reliable technology base for implementing this pattern. Kafka Streams is an alternative engine for implementing this form of stream enrichment.
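The enrichment step can be sketched in a few lines: each incoming event is merged with reference data looked up by key, yielding the denormalised row that would then be saved. This is a plain-Python illustration, not Spark Streaming or Kafka Streams code; the `enrich` function, the device catalogue and the field names are assumptions introduced for the example.

```python
def enrich(events, device_catalog):
    """Yield each event merged with reference data looked up by device_id,
    producing the denormalised row that would be saved."""
    for event in events:
        reference = device_catalog.get(event["device_id"], {})
        yield {**event, **reference}


# Hypothetical reference data and events, for illustration only.
device_catalog = {"d1": {"site": "Sydney", "model": "T-100"}}
events = [
    {"device_id": "d1", "temp": 21.5},
    {"device_id": "d2", "temp": 19.0},  # unknown device: passes through unenriched
]
rows = list(enrich(events, device_catalog))
```

How to treat events with no matching reference data (pass through, drop, or route to an error stream) is a design decision that depends on the application; the sketch simply passes them through.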

Lambda Architecture

The Lambda Architecture is an increasingly popular architectural pattern for handling massive quantities of data through a combination of stream and batch processing. With the Lambda Architecture, you maintain a short-term, volatile view of your data (the speed layer) and a longer-term, more permanent view (the batch layer), with a service to join views across the two (the serving layer).

With Spark and Cassandra, you have the key architectural building blocks you need to implement the Lambda Architecture. Spark Streaming is an ideal engine for implementing the speed layer of the architecture (potentially with results stored in TTL’d tables in Cassandra), while Spark can also be used to perform the longer-term batch calculations and store results in Cassandra.

Kappa Architecture

The Kappa Architecture takes the next step from the Lambda Architecture, removing the batch layer and treating the stream of events as the immutable record of system state. Stream processing maintains summary views as the stream is processed. If the logic of the summary views needs to change, the stream processing logic is updated and the saved streams are reprocessed. The Kappa Architecture removes the need to maintain the separate stream and batch logic that the Lambda Architecture requires.

Once again, the combination of Spark and Cassandra gives you the architectural components you need to implement the Kappa Architecture. Spark Streaming is an ideal processing engine to undertake the calculations needed on the stream of data. Apache Cassandra can be used both as the long-term, immutable store of the data stream and as a store for the results of the stream calculations that are used by the serving layer. An alternative is to use Apache Kafka as your immutable event store and Apache Cassandra as the store for the materialised views calculated from these events.
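The difference between the two patterns can be shown with a toy example. In the sketch below, the Lambda-style `serve` function answers a query by combining a batch view with a speed view, while the Kappa-style approach keeps a single code path and rebuilds a view by replaying the saved event log when the logic changes. The function names and the page-view events are hypothetical, and the views are simple in-memory counters rather than Cassandra tables.

```python
from collections import Counter


def build_view(events, key_fn):
    """Fold an immutable event log into a summary view (count per key)."""
    return Counter(key_fn(e) for e in events)


def serve(batch_view, speed_view, key):
    """Lambda-style serving layer: combine the batch and speed views."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Hypothetical page-view events: (user, page).
event_log = [("ann", "/home"), ("bob", "/home"), ("ann", "/pricing")]

# Lambda: older events are covered by the batch view, recent ones by the speed view.
batch_view = build_view(event_log[:2], key_fn=lambda e: e[1])
speed_view = build_view(event_log[2:], key_fn=lambda e: e[1])
home_views = serve(batch_view, speed_view, "/home")

# Kappa: one code path; changing the logic means replaying the saved stream.
views_by_page = build_view(event_log, key_fn=lambda e: e[1])
views_by_user = build_view(event_log, key_fn=lambda e: e[0])  # new logic, same log
```

The Kappa half only works because the full event log is retained, which is why the choice of immutable event store matters in that pattern.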

Operations

Operating as part of a mission-critical application is the normal mode of operation for Cassandra, and there is a well-established body of knowledge about how to operate Cassandra to achieve the highest levels of availability. Although Kafka is a little newer, it is also widely operated at the highest levels of scale and reliability.

Spark, on the other hand, is often run to provide an analytics environment for use by a small number of data scientists. In this situation, reliability and predictable performance are not as critical as when Spark is deployed as a component of a production application. This section of the paper describes some of the considerations to be applied when deploying Spark for production usage.

Management Environment

The key to reliable operation of any technology is a solid overall management environment, including aspects such as:

- Automated (or at least well-controlled) deployment and configuration management.
- High-quality testing of new configurations prior to deployment.
- Backup and disaster recovery procedures.
- Appropriate monitoring, and systems and people that are paying attention to what is being reported by that monitoring.
- Rigorous incident response procedures and well-trained staff.

None of these items is specific to Kafka, Spark or Cassandra. However, introducing production usage of these technologies will require examination of each of these areas to ensure they remain fit for purpose with the introduction of new architectural components and applications.

High Availability

One specific area to be considered is high-availability architecture (ensuring your overall service continues to run even when components fail). Cassandra is effectively highly available by default: if you use multiple machines and a basic, competent setup, you will have a high-availability cluster. Of course, there is more you can do to reach the very highest levels of availability. Kafka follows a somewhat similar architecture and has similar considerations in terms of distributing data across multiple replicas and placing replicas in multiple availability zones.

For Spark, more detailed consideration is required. Spark by default is resilient to the failure of worker processes, with work being automatically redistributed to running workers should a worker fail. However, the Spark Master and Driver require further consideration. Apache Spark has a built-in capability to make the Spark Master highly available by using an Apache ZooKeeper cluster to control the election of which machine will be the active Master at any point in time.

For the Driver component (which submits jobs to the cluster), it is possible to configure Spark to automatically retry jobs that fail. To enable this, the job must be submitted in cluster mode (--deploy-mode cluster) and with the --supervise flag set. As this will restart failed jobs from scratch, it is necessary to ensure your jobs are idempotent when using this functionality.
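To make the two requirements above concrete, the sketch below first assembles a spark-submit argument list with cluster deploy mode and supervision enabled, and then shows what an idempotent job body can look like: results are written by key, so a restart from scratch overwrites earlier output instead of duplicating it. The master URL, jar name, class name and the `run_rollup_job` function are hypothetical placeholders, and the target "table" is a Python dict standing in for a keyed store.

```python
def spark_submit_args(master_url, app_jar, main_class):
    """Build a spark-submit argument list for a supervised cluster-mode job."""
    return [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", "cluster",  # driver runs inside the cluster
        "--supervise",               # restart the driver if it fails
        "--class", main_class,
        app_jar,
    ]


def run_rollup_job(source_rows, target_table):
    """Idempotent job body: totals are recomputed in full and written by key,
    so a rerun from scratch overwrites rather than duplicates."""
    totals = {}
    for key, amount in source_rows:
        totals[key] = totals.get(key, 0) + amount
    target_table.update(totals)  # upsert by key


# Hypothetical values, for illustration only.
args = spark_submit_args("spark://master-host:7077", "rollup-job.jar", "com.example.RollupJob")

table = {}
rows = [("2024-01-01", 5), ("2024-01-01", 7), ("2024-01-02", 1)]
run_rollup_job(rows, table)
run_rollup_job(rows, table)  # simulated restart after a failure
```

A job that instead appended to, or incremented, existing values on each run would double-count after a supervised restart, which is the failure mode the idempotency requirement guards against.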

Monitoring

It is important for any production system to have quality monitoring in place to help detect and diagnose problems. This starts with basic operating-system-level monitoring of metrics such as CPU load and free disk space. It should then extend to monitoring that the expected system processes are running.

For Cassandra and Kafka, a broad range of metrics is available out of the box, and these are sufficient to monitor usage of Cassandra and Kafka for the vast majority of use cases. It will likely be necessary to tune alerting thresholds for your application, but the important metrics to monitor are fairly standard and well known.

Spark also provides built-in monitoring capabilities, including a UI that allows you to review the progress of your jobs. However, given the extremely diverse nature of workloads that Spark can handle, it will likely also be necessary to implement error handling and reporting as part of your production jobs, in addition to relying on the native Spark metrics.

Workload Isolation

One of the unique advantages of Cassandra is its ability to provide workload isolation through its native multi-data centre architecture support. By setting up two Cassandra “data centres” in the same physical data centre (or cloud provider region), you can isolate the load of your Spark analytic reads to a single data centre, ensuring the processing capacity and response times of your online processes are minimally impacted when batch processing runs in Spark.
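One way to express the two-data-centre layout is at the keyspace level, replicating data to both logical data centres with NetworkTopologyStrategy. The helper below only builds the CQL statement as a string; the keyspace name, the data centre names (`dc_online`, `dc_analytics`) and the replication factors are assumptions chosen for illustration and would need to match the names your cluster actually reports.

```python
def keyspace_cql(keyspace, replicas_per_dc):
    """Return a CREATE KEYSPACE statement using NetworkTopologyStrategy
    with one replication factor per Cassandra data centre."""
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical names: one data centre for online traffic, one for Spark reads.
cql = keyspace_cql("metrics", {"dc_online": 3, "dc_analytics": 3})
```

With a layout like this, the online application would typically connect to the first data centre using a data-centre-local consistency level, while Spark jobs are pointed at the second, so that analytic reads do not compete with online requests.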

Conclusion

Cassandra, Kafka and Spark form a powerful combination for many use cases. However, architecting and running distributed technologies at scale and with the highest levels of reliability and security requires a specialist environment including tools such as monitoring, management processes, and skilled and experienced staff.

Instaclustr’s focus is the provision of the world’s best managed environment for running open-source, distributed technologies reliably, at scale. We bring to your application a proven management platform and over 13 million node hours of experience running these technologies in production.

Discover More: Apache Spark, Apache Cassandra, Apache Kafka

Brought to you by instaclustr.com
