Apache Cassandra On AWS

Upload by : Grady Mosby

Apache Cassandra on AWS
Guidelines and Best Practices
January 2016

Amazon Web Services – Apache Cassandra on AWS
January 2016

© 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Notices
This document is provided for informational purposes only. It represents AWS’s current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

Contents
Notices
Abstract
Introduction
NoSQL on AWS
Cassandra: A Brief Introduction
Cassandra: Key Terms and Concepts
Write Request Flow
Compaction
Read Request Flow
Cassandra: Resource Requirements
Storage and IO Requirements
Network Requirements
Memory Requirements
CPU Requirements
Planning Cassandra Clusters on AWS
Planning Regions and Availability Zones
Planning an Amazon Virtual Private Cloud
Planning Elastic Network Interfaces
Planning High-Performance Storage Options
Planning Instance Types Based on Storage Needs
Deploying Cassandra on AWS
Setting Up High Availability
Automating This Setup
Setting Up for Security
Monitoring by Using Amazon CloudWatch
Using Multi-Region Clusters
Performing Backups
Building Custom AMIs
Migration into AWS
Analytics on Cassandra with Amazon EMR
Optimizing Data Transfer Costs
Benchmarking Cassandra
Using the Cassandra Quick Start Deployment
Conclusion
Contributors
Further Reading
Notes

Abstract
Amazon Web Services (AWS) is a flexible, cost-effective, easy-to-use cloud computing platform. Apache Cassandra is a popular NoSQL database that is widely deployed in the AWS cloud. Running your own Cassandra deployment on Amazon Elastic Compute Cloud (Amazon EC2) is a great solution for users whose applications have high throughput requirements.

This whitepaper provides an overview of Cassandra and its implementation on the AWS cloud platform. It also discusses best practices and implementation characteristics such as performance, durability, and security, and focuses on AWS features relevant to Cassandra that help ensure scalability, high availability, and disaster recovery in a cost-effective manner.

Introduction
NoSQL databases are a type of database optimized for high-performance operations on large datasets. Each type of NoSQL database provides its own

interface for accessing the system and its features. One way to choose a NoSQL database type is by looking at the underlying data model, as shown following:

• Key-value stores: Data is organized as key-value relationships and accessed by primary key. These products are typically distributed row stores. Examples are Cassandra and Amazon DynamoDB.
• Graph databases: Data is organized as graph data structures and accessed through semantic queries. Examples are Titan and Neo4j.
• Document databases: Data is organized as documents (for example, JSON files) and accessed by fields within the documents. Examples are MongoDB and DynamoDB.
• Columnar databases: Data is organized as sections of columns of data, rather than rows of data. Example: HBase.

DynamoDB shows up in both document and key-value stores in this list because it supports storing and querying both key-value pairs and objects in a document format like JSON, XML, or HTML.

NoSQL on AWS
Amazon Web Services provides several NoSQL database software options for customers looking for a fully managed solution, or for customers who want full control over their NoSQL databases but who don’t want to manage hardware infrastructure. All our solutions offer flexible, pay-as-you-go pricing, so you can quickly and easily scale at a low cost.

Consider the following options as possible alternatives to building your own system with open source software (OSS) or a commercial NoSQL product.

• Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.[1] All data items in DynamoDB are stored on solid-state drives (SSDs) and are automatically replicated across three facilities in an AWS region to provide built-in high availability and data durability. With Amazon DynamoDB, you can offload the administrative burden of operating and scaling a highly available distributed database cluster while paying a low variable price for only the resources you consume.

• Amazon Simple Storage Service (Amazon S3) provides a simple web services interface that can store and retrieve any amount of data anytime from anywhere on the web.[2] Amazon S3 gives developers access to the same highly scalable, reliable, secure, fast, and inexpensive infrastructure that Amazon uses to run its own global network of websites. Amazon S3 maximizes benefits of scale, and passes those benefits on to you.

Cassandra: A Brief Introduction
Note: This is a brief overview of how Cassandra works; to learn more, visit the DataStax documentation.[3]

Apache Cassandra is a massively scalable open source NoSQL database, which is ideal for managing large amounts of structured, semi-structured, and unstructured data across multiple distributed locations. Cassandra is based on the log-structured merge-tree, a data structure that is highly efficient with high-volume write operations.[4] The most popular use case for Cassandra is storing time series data.

Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times. Cassandra is a masterless peer-to-peer distributed system where data is distributed among all nodes in the cluster. Each node has knowledge about the topology of the cluster and exchanges information across the cluster every second.

Cassandra: Key Terms and Concepts
Before we discuss best practices and considerations for using Cassandra on AWS, let us review some key concepts.

A cluster is the largest unit of deployment in Cassandra. Each cluster consists of nodes from one or more distributed locations (Availability Zones or AZ in AWS terms).

A distributed location contains a collection of nodes that are part of a cluster. In general, while designing a Cassandra cluster on AWS, we recommend that you

use multiple Availability Zones to store your data in the cluster. You can configure Cassandra to replicate data across multiple Availability Zones, which will allow your database cluster to remain highly available even in the event of an Availability Zone failure. To ensure even distribution of data, the number of Availability Zones should be a multiple of the replication factor. The Availability Zones are also connected through low-latency links, which further helps avoid latency for replication.

A node is a part of a single distributed location in a Cassandra cluster that stores partitions of data according to the partitioning algorithm.

A commit log is a write-ahead log on every node in the cluster. Every write operation made to Cassandra is first written sequentially to this append-only structure, which is then flushed from the write-back cache on the operating system (OS) to disk either periodically or in batches. In the event of a node recovery, the commit logs are replayed to perform recovery of data.

A memtable is basically a write-back cache of data rows that can be looked up by key. It is an in-memory structure. A single memtable only stores data for a single table and is flushed to disk when node global memory thresholds have been reached, when the commit log is full, or after a table-level interval is reached.

An SStable (sorted string table) is a logical structure made up of multiple physical files on disk. An SStable is created when a memtable is flushed to disk. An SStable is an immutable data structure. Memtables are sorted by key and then written out sequentially to create an SStable. Thus, write operations in Cassandra are extremely fast, costing only a commit log append and an amortized sequential write operation for the flush.

A bloom filter is a probabilistic data structure for testing set membership that never produces a false negative, but can be tuned for false positives. Thus, if a bloom filter responds that a key is not present in an SStable, then the key is not present, but if it responds that the key is present in the SStable, it might or might not be present. Bloom filters can help scale read requests in Cassandra. They can also save disk read operations against the SStable by indicating that a key is not present in it. Bloom filters are off-heap structures.
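The bloom filter behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cassandra's implementation: the class name BloomFilter, the bit-array size, the number of hash functions, and the use of salted SHA-256 are all hypothetical choices made for readability.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy bloom filter: never a false negative, tunable false-positive rate."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits          # more bits -> fewer false positives
        self.num_hashes = num_hashes      # hash functions applied per key
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# One filter per SStable: populate it with the keys that the SStable holds.
sstable_filter = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(row_key)
```

A read would call might_contain before touching the SStable on disk and skip the file whenever the answer is False. Increasing num_bits lowers the false-positive rate at the cost of memory, which is the tuning trade-off mentioned above.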

An index file maintains the offset of keys into the main data file (SStable). Cassandra by default holds a sample of the index file in memory, which stores the offset for every 128th key in the main data file (this value is configurable). Index files can also help scale read operations better because they can provide you the random position in the SStable from which you can sequentially scan to get the data. Without the index files, you need to scan the whole SStable to retrieve data.

A keyspace is a logical container in a cluster that contains one or more tables. Replication strategy is typically defined at the keyspace level.

A table, also known as a column family, is a logical entity within a keyspace consisting of a collection of ordered columns fetched by row. Primary key definition is required while defining a table.

Write Request Flow
The following diagram shows a Cassandra cluster with seven nodes with a replication factor of 3. The clients are writing to the cluster using quorum consistency level.[5] While using quorum consistency level, write operations succeed if two out of three nodes acknowledge success to the coordinator (the node that the client connects to).
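The "two out of three" figure follows from the usual quorum rule of a strict majority of replicas. The short sketch below works through that arithmetic; the function names quorum, request_succeeds, and tolerated_failures are hypothetical helpers for illustration, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def request_succeeds(acks: int, replication_factor: int) -> bool:
    """True when enough replicas acknowledged the request for quorum."""
    return acks >= quorum(replication_factor)


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while quorum requests still succeed."""
    return replication_factor - quorum(replication_factor)
```

With a replication factor of 3, quorum(3) is 2, so a quorum request still succeeds when one replica is unavailable.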

Figure 1: Write Request Flow

The preceding diagram illustrates a typical write request to Cassandra with three-way replication, as described following:

1. A client sends a request to a node in the cluster to store a given key. At this point, the node might or might not own the partition for the key. If it does not, the node acts as a coordinator (the case in this example). Note that a node can act as a replica, as a coordinator, or as both (if the node maps to the data and is talking to the client).
2. The coordinator determines the replica nodes that should store the key and forwards the request to those nodes.

3. Each node that gets the key performs a sequential write operation of the data, along with the metadata required to recreate the data, in the commit log locally.
4. The key along with its data is written to the in-memory memtable locally.
5. Replica nodes respond back to the coordinator with a success or failure.
6. Depending on the consistency level specified as part of the request, the coordinator will respond with success or failure to the client. For example, with a consistency level of quorum and a replication factor of 3, the coordinator will respond with success as soon as two out of three nodes respond with success.

If, during step 5, some nodes fail to respond (for example, one out of three nodes), then the coordinator stores a hint locally to send the write operation to the failed node or nodes when they are available again. These hints are stored with a time to live equal to the gc_grace_seconds parameter value, so that they do not get replayed later. Hints will only be recorded for a period equal to the max_hint_window_in_ms parameter (defined in cassandra.yaml), which defaults to three hours.

As the clients keep writing to the cluster, a background thread keeps checking the size of all current memtables. If the thread determines that the node global memory thresholds have been reached, the commit log is full, or a table-level interval has been reached, it creates a new memtable to replace the current one and marks the replaced memtable for flushing. The memtables marked for flush are flushed to disk by another thread (typically, by multiple threads).

Once a memtable is flushed to disk, all entries for the keys corresponding to that memtable that reside in a commit log are no longer required, and those commit log segments are marked for recycling.

When a memtable is flushed to disk, a couple of other data structures are created: a bloom filter and an index file.
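The write path above can be condensed into a small in-memory model. The sketch below is only an illustration under stated assumptions: ReplicaNode, coordinate_write, the up flag, and the entry-count flush threshold memtable_limit are hypothetical names, hints are kept as plain tuples, and real Cassandra flushes on memory and commit log thresholds rather than on a row count.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str]


class ReplicaNode:
    """Toy replica: commit log append, memtable update, flush to SStable."""

    def __init__(self, name: str, up: bool = True, memtable_limit: int = 3) -> None:
        self.name = name
        self.up = up                          # False simulates a failed node
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.commit_log: List[Row] = []       # append-only write-ahead log
        self.memtable: Dict[str, str] = {}    # in-memory rows, looked up by key
        self.sstables: List[Tuple[Row, ...]] = []  # immutable, sorted by key

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # step 3: sequential log append
        self.memtable[key] = value            # step 4: update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Sort the memtable by key and write it out as one immutable SStable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()               # flushed entries are not needed


def coordinate_write(
    replicas: List[ReplicaNode], key: str, value: str, required_acks: int
) -> Tuple[bool, List[Tuple[str, str, str]]]:
    """Forward a write to all replicas; keep a hint for each one that is down."""
    acks = 0
    hints: List[Tuple[str, str, str]] = []
    for replica in replicas:
        if replica.up:
            replica.write(key, value)
            acks += 1                         # step 5: replica reports success
        else:
            hints.append((replica.name, key, value))
    return acks >= required_acks, hints       # step 6: apply consistency level


# Replication factor 3 with one replica down, written at quorum (2 acks).
nodes = [ReplicaNode("a"), ReplicaNode("b"), ReplicaNode("c", up=False)]
results = [
    coordinate_write(nodes, k, v, required_acks=2)
    for k, v in (("k2", "v2"), ("k1", "v1"), ("k3", "v3"))
]
```

In this run every write succeeds at quorum even though node c is down, the coordinator accumulates one hint per write for c, and the third write pushes nodes a and b over the flush threshold so that each produces one key-sorted SStable.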

Compaction
The number of SStables can increase over a period of time. To keep the SStables manageable, Cassandra automatically performs minor compactions by default. Compaction merges multiple SStables based on an algorithm that you specify using a compaction strategy.

Compaction optimizes read operations by reducing the number of SStables that must be read to satisfy a read request. It merges multiple SStables, based on a configurable threshold, to create one or more new, immutable SStables. For example, the default compaction strategy, Size Tiered Compaction, groups multiple similar-sized SStables together and creates a single large SStable. It keeps iterating this process on similar-sized SStables.

Compaction does not modify existing SStables (remember, SStables are immutable) and only creates a new SStable from the existing ones. When a new SStable is created, the older ones are marked for deletion. Thus, the used space is temporarily higher during compaction. The amount of space overhead due to compaction depends on the compaction strategy used. This space overhead needs to be accounted for during the planning process. SStables that are marked for deletion are deleted using a reference counting mechanism or during a restart.

Read Request Flow
Before we dive into the read request flow, we will summarize what we know about a Cassandra cluster.

In a cluster, each row is replicated across multiple nodes (depending on your replication factor). There is no concept of a master node. This approach means that any node in the cluster that contains the row can answer queries about that row. Cassandra uses the Gossip protocol to exchange information about network topology among nodes. By virtue of Gossip, every node learns about the topology of the cluster and can determine where in the cluster a request for a given row should be sent.

In the diagram following, we have a Cassandra cluster with seven nodes and a replication factor of 3. The clients read from the cluster using quorum

consistency level. While using quorum consistency level, read operations succeed if two out of three nodes acknowledge success.

With this brief context, let us look at how the read requests are served. The figure and list following illustrate.

Figure 2: Read Request Flow

1. A client sends a request to a node in the cluster to get data for a given key, K. At this point, if the key is not mapped to this node, then the node acts as

a coordinator. Note that a node can act as a replica, as a coordinator, or as both (if the node maps to the data and is talking to the client).
2. The coordinator determines the replica nodes that might contain the key and forwards the request to those nodes. While sending the request to the replica nodes, the coordinator determines which node is closest to itself (through a snitch). It sends a request for the full data to the closest node and a request for a digest, generated from a hash of the data, to the other nodes. (A snitch determines which host is closest to the current location.)
3. The request is forwarded to the internal services of the node for further processing.
4. A request for data from both the memtable and SStables is made. The request iterates over the bloom filters for the SStables asking whether the key is present.
5. Because the memtable is in memory, data might be returned faster from the memtable, but one or more SStables still need to be consulted for the data.
6. If a bloom filter responds that the key is not present, the next bloom filter is checked. If a bloom filter responds that the key might be present (which is the case here), then it checks the sample index in memory.
7. A binary search is performed on the sample index to determine a starting offset into the actual index file. This offset is used to seek into the index file and do a sequential read operation to obtain the offset into the SStable for the actual key.
8. With the offset obtained from step 7, the actual data from the SStable is returned by seeking to that offset in the SStable file.
9. The data for the key is returned from the SStable lookup. The filter command consolidates all versions of the key data obtained from SStable lookups and the memtable.
10. The latest consolidated version of the key data is returned to the internal services.
11. The same process is repeated by the internal services on other nodes and results are returned back to the coordinator node.

12. The coordinator compares the digests obtained from all nodes and determines if there is a conflict within the data. If there is a conflict, the coordinator reconciles the data and returns the reconciled version back to the client. A read repair is also initiated to make the data consistent.

Note that the preceding steps do not cover caching. To learn more about caching, refer to the DataStax documentation.[6]

Read repairs can resolve data inconsistencies when the written data is read. But when the written data is not read, you can only use either the hinted handoff [7] or anti-entropy [8] mechanism.

Cassandra: Resource Requirements
Let us now take a look at the resources required to run Cassandra. We will look at storage and I/O, CPU, memory, and networking requirements.

Storage and IO Requirements
Most of the I/O happening in Cassandra is sequential. But there are cases where you require random I/O. An example is when reading SStables during read operations.

SSD is the recommended storage mechanism for Cassandra, because it provides extremely low-latency response times for random read operations while supplying ample sequential write performance for compaction operations. Replication and storage overhead due to compaction have to be taken into account while determining storage requirements.

The recommended file system for all volumes is XFS. Ext4 can be used if you prefer it. Ext3 is considerably slower, and we recommend that you avoid it.

AWS provides two types of storage options, namely local storage and Amazon Elastic Block Store (Amazon EBS). Local storage is available locally to the instance, and EBS is network-attached storage. We will talk more about choosing a storage option on AWS for Cassandra later in this whitepaper.
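To make the storage guidance above concrete, the following back-of-the-envelope sketch shows one way replication and compaction overhead might feed into a capacity estimate. The function names and the numbers in the usage note are illustrative assumptions, not figures from this whitepaper; in particular, the fraction of disk kept free for compaction depends on the compaction strategy and should come from your own measurements.

```python
def required_cluster_disk_gb(
    raw_data_gb: float, replication_factor: int, compaction_headroom: float
) -> float:
    """Estimate total disk for a cluster.

    raw_data_gb          -- size of one copy of the data (assumed input)
    replication_factor   -- number of copies kept of each row
    compaction_headroom  -- fraction of disk kept free for compaction,
                            0 <= value < 1 (hypothetical planning figure)
    """
    if not 0 <= compaction_headroom < 1:
        raise ValueError("compaction_headroom must be in [0, 1)")
    replicated_gb = raw_data_gb * replication_factor
    # Only (1 - headroom) of each disk may hold live data.
    return replicated_gb / (1 - compaction_headroom)


def required_disk_per_node_gb(total_disk_gb: float, node_count: int) -> float:
    """Spread the cluster total evenly across nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return total_disk_gb / node_count
```

For example, under the assumption of 1,000 GB of raw data, a replication factor of 3, and half of each disk reserved for compaction, required_cluster_disk_gb(1000, 3, 0.5) gives 6,000 GB in total, or 1,000 GB per node on a six-node cluster.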

Network Requirements
Cassandra uses the Gossip protocol to exchange information with other nodes about network topology. The use of Gossip coupled with distrib

