Understanding Write Behaviors of Storage Backends in Ceph Object Store

Dong-Yun Lee*, Kisik Jeong*, Sang-Hoon Han*, Jin-Soo Kim*, Joo-Young Hwang†, and Sangyeun Cho†
*Computer Systems Laboratory, Sungkyunkwan University, South Korea
†Memory Business, Samsung Electronics Co., Ltd., South Korea
{dongyun.lee, kisik, shhan}@csl.skku.edu, jinsookim@skku.edu, {jooyoung.hwang, sangyeun.cho}@samsung.com

Abstract—Ceph is a scalable, reliable and high-performance storage solution that is widely used in cloud computing environments. Internally, Ceph provides three different storage backends: FileStore, KStore and BlueStore. However, little effort has been devoted to identifying the differences among these storage backends and their implications on performance. In this paper, we carry out an extensive analysis with a microbenchmark and a long-term workload to compare the Ceph storage backends and understand their write behaviors, focusing on WAF (Write Amplification Factor). To analyze WAF accurately, we carefully classify write traffic into several categories for each storage backend.

We find that writes are amplified by more than 13x, no matter which Ceph storage backend is used. In FileStore, the overhead of Ceph write-ahead journaling triples write traffic compared to the original data size. FileStore also suffers from the journaling of journal problem, generating a relatively large amount of file system metadata and journal traffic. KStore shows severe fluctuations in IOPS (I/O Operations Per Second) and WAF due to large compaction overheads. BlueStore delivers stable performance on both HDDs and SSDs in terms of IOPS, WAF and latency. Overall, FileStore performs the best among all storage backends on SSDs, while BlueStore is also highly promising with good average and tail latency even on HDDs.

I. INTRODUCTION

In the cloud computing era, a stable, consistent and high-performance block storage service is essential to run a large number of virtual machines. Ceph is a storage solution that meets all of these demanding requirements and has attracted a spotlight in the last decade. Ceph is a scalable, highly reliable software-defined storage solution that provides multiple interfaces for object, block and file level storage [1]. Ceph aims at completely distributed storage without a single point of failure and at high fault tolerance with no special hardware support. Since Ceph provides strong consistency to clients, users can access objects, block devices and files without worrying about consistency. Moreover, because it has a scale-out structure, Ceph can improve its performance gradually by adding cluster nodes [2].

Internally, all storage services in Ceph are built upon the Ceph RADOS (Reliable Autonomic Distributed Object Store) layer [3], which manages fixed-size objects in a scalable, distributed and reliable manner. Ceph provides three different storage backends in the RADOS layer: FileStore, KStore and BlueStore. FileStore and KStore manage objects on top of traditional file systems and key-value stores (e.g., LevelDB and RocksDB), respectively. BlueStore, on the other hand, is a new object store architecture that has been actively developed for the Ceph RADOS layer in recent years. BlueStore saves object data directly into the raw block device, while managing the metadata in a small key-value store such as RocksDB. Currently, Ceph can be freely configured to use any one of these storage backends.

Due to Ceph's popularity in the cloud computing environment, several research efforts have been made to find optimal Ceph configurations under a given Ceph cluster setting [4], [5] or to tune its performance for fast storage such as SSDs (Solid-State Drives) [6]. However, little attention has been paid to the differences among the storage backends available in Ceph and their implications on overall performance. In this paper, we compare the write behaviors and performance of the Ceph backends with a focus on WAF (Write Amplification Factor).

Studying the WAF of the various storage backends is enlightening for understanding the storage access behaviors of Ceph for the following reasons. First, WAF has a major impact not only on overall performance, but also on device lifetime when Ceph runs on SSDs. Second, the larger the WAF, the more limited the effective bandwidth available from the underlying storage device. In particular, an HDD (Hard Disk Drive) exhibits very low IOPS (I/O Operations Per Second) compared to an SSD, so it is very important to use the raw hardware bandwidth effectively. Finally, as in previous research on SQLite, issues such as the journaling of journal problem [7] may arise when a distributed storage service is implemented on top of a local file system.
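Throughout the paper, WAF is the ratio of the total traffic written to the storage devices to the traffic originally issued by the client, computed both in aggregate and per traffic category. The short Python sketch below is our own illustration of this definition, not the authors' measurement code; the category names and numbers are made up for the example.

# Minimal sketch of the WAF definition (ours, not the paper's tooling).
def waf(bytes_written_per_category, client_bytes):
    """Return (total WAF, per-category WAF) for one experiment."""
    total = sum(bytes_written_per_category.values())
    per_category = {name: written / client_bytes
                    for name, written in bytes_written_per_category.items()}
    return total / client_bytes, per_category

# Example with made-up numbers: the client writes 1 GiB, the devices absorb more.
GiB = 1 << 30
observed = {"ceph_data": 3 * GiB,                  # replication factor of 3
            "ceph_journal": 3 * GiB,               # write-ahead journal copies
            "fs_metadata_and_journal": 1 * GiB}
total_waf, breakdown = waf(observed, 1 * GiB)
print(total_waf, breakdown)                        # total_waf == 7.0 in this toy example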

We have used a microbenchmark and a long-term workload of 4KiB random writes to measure the write traffic of the various Ceph storage backends on both HDDs and SSDs. Our results with the long-term workload indicate that Ceph amplifies the amount of write traffic by more than 13x under a replication factor of 3, regardless of the storage backend used. In FileStore, we find that write-ahead journaling with a separate Ceph journal does not double, but rather triples, write traffic compared to the original data size. The journaling of journal problem is also severe, producing file system metadata and journal traffic as large as the original data size. In the case of KStore, the compaction process takes up almost all write traffic, resulting in poor tail latency and severe fluctuations in IOPS. Finally, BlueStore is free from the journaling of journal problem because it stores data directly on the storage device. However, the RocksDB traffic for storing metadata and object attributes still overwhelms data traffic by more than a factor of three in BlueStore.

The rest of the paper is organized as follows. Section II presents more detailed background on Ceph. Section III introduces our measurement methodology and experimental configurations. In Section IV, we run several microbenchmarks and discuss the basic write behaviors of the Ceph backends. Section V evaluates the Ceph backends with the long-term workload on HDDs and SSDs. We discuss related work in Section VI. Finally, Section VII concludes the paper.

II. BACKGROUND

This section gives a brief overview of the Ceph architecture and its storage backends.

A. Ceph Architecture

Ceph provides multiple storage services at the object level (Ceph object storage), the block level (Ceph block storage) and the file level (Ceph file system). Internally, they are all based on one unified layer called RADOS (Reliable Autonomic Distributed Object Store). Ceph consists of several daemons running in the RADOS layer, each of which performs a specific task. The Ceph monitor (MON) daemon manages the cluster-wide node information called the cluster map. The Ceph metadata (MDS) daemon is needed only for the file-level service, to maintain the metadata of each file as is done in traditional distributed file systems. Finally, the Ceph object storage device (OSD) daemon is responsible for retrieving and storing objects by interacting with its local disks.

One important feature of the Ceph object and block storage is that clients can directly contact the OSD daemon that holds the primary copy of the required data. In traditional distributed storage systems, clients first have to make a request to a centralized server to obtain metadata (e.g., data locations), which can be a performance bottleneck as well as a critical single point of failure. Ceph eliminates the need for such a centralized server by placing data with a pseudo-random distribution algorithm called CRUSH [8].

B. Ceph RADOS Block Device (RBD)

The Ceph RADOS block device, also known as RBD, provides a thin-provisioned block device to clients. A block device represents a consecutive sequence of bytes, and Ceph divides it into a set of objects of equal size. When a client modifies a region of the RBD, the corresponding objects are automatically striped and replicated over the entire Ceph cluster. The object size is set to 4MiB by default.

There are two types of RBD: librbd and krbd. librbd is a user-level library implementation that is widely used as primary block storage for virtual machines in cloud computing platforms such as OpenStack. krbd is implemented as a kernel module that exports device files directly to the kernel so that clients can mount them just like conventional disks. In this paper, we use the krbd module to investigate the performance of the Ceph RADOS block device without any interference from a hypervisor or other virtual machines.
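Because the RBD address space is simply cut into fixed-size objects, the objects touched by a block-level write follow directly from the write's offset and length. The sketch below is our own illustration of this 4MiB striping, not Ceph code; replication and CRUSH placement are deliberately left out.

# Illustrative sketch of RBD striping (ours, not Ceph source code).
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB default object size

def split_write(offset, length, object_size=OBJECT_SIZE):
    """Yield (object_index, offset_in_object, chunk_length) for a block-level write."""
    end = offset + length
    while offset < end:
        index = offset // object_size
        off_in_obj = offset % object_size
        chunk = min(object_size - off_in_obj, end - offset)
        yield index, off_in_obj, chunk
        offset += chunk

# A 4 KiB write at offset 6 MiB lands entirely in object 1:
print(list(split_write(6 * 1024 * 1024, 4096)))  # [(1, 2097152, 4096)]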
C. Ceph Storage Backends

The Ceph OSD daemon consists of many functional modules that together support software-defined storage services. At the heart of the Ceph OSD daemon is a module called ObjectStore, which is responsible for how objects are stored and managed. In particular, Ceph is designed to support multiple storage engines by registering them as different backends for ObjectStore. The stable Ceph Jewel LTS version currently supports three kinds of storage backends: FileStore, KStore and BlueStore. In the following subsections, we briefly present the overall architecture and characteristics of each storage backend.

1) FileStore

In FileStore, each object is stored as a separate file in an underlying local file system such as XFS, BTRFS or ZFS. With FileStore, Ceph mandates the use of an external Ceph journal to ensure consistency. Since Ceph guarantees strong consistency among data copies, all write operations are treated as atomic transactions. Unfortunately, there is no POSIX API that provides atomicity for compound write operations. Instead, FileStore first writes incoming transactions to its own journal disk in an append-only manner. After writing to the journal, worker threads in FileStore perform the actual write operations to the file system with the writev() system call. Every few seconds, up to filestore_max_sync_interval (5 seconds by default), FileStore calls syncfs() on the disk and then drops the corresponding journal entries. In this way, FileStore provides strong consistency.

Having the external journal also brings a performance benefit, as the write speed is improved by the append-only logging mechanism. To enhance Ceph performance further, many systems employ SSDs as journal devices. In theory, any local file system can be used for FileStore, but due to some issues related to extended attributes (xattr), XFS is the only file system officially recommended by the Ceph developers.
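The FileStore write path described above can be viewed as a simple pipeline: append the transaction to the Ceph journal and make it durable, apply it to the file system, and periodically sync the file system and trim the journal. The following sketch is a deliberately simplified illustration of that flow written by us, not FileStore's actual implementation; transaction encoding, worker threading, acknowledgement handling and error paths are omitted, and os.sync() stands in for syncfs().

# Simplified illustration of FileStore-style write-ahead journaling (our sketch).
import os, time

class MiniFileStore:
    def __init__(self, journal_path, data_dir, sync_interval=5.0):
        self.journal = open(journal_path, "ab")     # append-only journal
        self.data_dir = data_dir
        self.sync_interval = sync_interval          # cf. filestore_max_sync_interval
        self.last_sync = time.monotonic()

    def submit(self, name, offset, data):
        # 1) Journal the transaction in an append-only manner and make it durable.
        record = b"%s %d %d\n" % (name.encode(), offset, len(data)) + data
        self.journal.write(record)
        self.journal.flush()
        os.fsync(self.journal.fileno())             # transaction is now durable
        # 2) Apply the write to the object file (done by worker threads in FileStore).
        path = os.path.join(self.data_dir, name)
        with open(path, "r+b" if os.path.exists(path) else "wb") as f:
            f.seek(offset)
            f.write(data)
        # 3) Periodically sync the file system and drop old journal entries.
        if time.monotonic() - self.last_sync >= self.sync_interval:
            os.sync()                               # stands in for syncfs()
            self.journal.truncate(0)                # journal entries can be dropped
            self.last_sync = time.monotonic()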

2) KStore

KStore is an experimental storage backend in the Ceph Jewel version. The basic idea behind KStore is to encapsulate everything, from object data to object metadata, as key-value pairs and to put them into a key-value store, since key-value stores are already highly optimized for storing and managing key-value pairs. Any key-value store can be used for KStore by interposing a simple translation layer between ObjectStore and the key-value store. The Ceph Jewel version currently supports LevelDB, RocksDB and KineticStorage for KStore. In this paper, we have performed our experiments on LevelDB and RocksDB.
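To make the idea of such a translation layer concrete, the sketch below (our own, purely conceptual code, not Ceph's) stripes object data into fixed-size key-value pairs and stores object attributes under separate keys; a Python dict stands in for LevelDB or RocksDB, and the key format is hypothetical.

# Conceptual sketch of a KStore-like translation layer (not Ceph code).
STRIPE_SIZE = 65536            # cf. kstore_default_stripe_size (65536 by default)

class MiniKStore:
    def __init__(self, stripe_size=STRIPE_SIZE):
        self.kv = {}           # the backing key-value store
        self.stripe_size = stripe_size

    def write(self, obj, offset, data):
        """Store a write as one key-value pair per touched stripe."""
        end = offset + len(data)
        while offset < end:
            stripe = offset // self.stripe_size
            key = "data/%s/%08d" % (obj, stripe)          # hypothetical key format
            chunk = bytearray(self.kv.get(key, bytes(self.stripe_size)))
            start = offset % self.stripe_size
            n = min(self.stripe_size - start, end - offset)
            chunk[start:start + n] = data[:n]
            self.kv[key] = bytes(chunk)
            data = data[n:]
            offset += n

    def set_attr(self, obj, name, value):
        self.kv["attr/%s/%s" % (obj, name)] = value       # metadata as key-value too

store = MiniKStore(stripe_size=4096)   # 4 KiB stripes, as configured in Section III
store.write("rbd_data.0", 8192, b"x" * 4096)
store.set_attr("rbd_data.0", "snapset", b"...")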
3) BlueStore

BlueStore is another experimental storage backend that is being actively developed by the Ceph community, and it is expected to become stable in the next release. The key idea of BlueStore is to manage objects effectively while avoiding the limitations of FileStore. One problem in FileStore is the double-write penalty caused by external journaling. Instead, BlueStore saves object data directly in a raw block device, while managing the metadata with RocksDB. Since BlueStore bypasses the local file system layer, file system overheads such as the journaling of journal can be avoided. Note that, because a file system is still required to run RocksDB, BlueStore internally has a tiny user-level file system named BlueFS. Moreover, Ceph usually deals with a huge number of objects and often has to enumerate all of them in an ordered fashion for consistency checking and recovery, yet POSIX does not provide any efficient way to retrieve objects from multiple directories. Another benefit of using RocksDB is therefore that Ceph can simply retrieve and enumerate all objects stored in the system.

III. MEASUREMENT METHODOLOGY

A. Evaluation Environment

Figure 1 illustrates the organization of our experimental Ceph testbed. In our experiments, we use one administration server to run the Ceph monitor daemon (MON). The same server is also used as a client which generates I/O requests to the Ceph RBD. We use four storage servers to run the Ceph OSD daemons. There are two private networks in the system: a public network that connects all the servers with 10Gbps Ethernet, and a storage network that connects the four storage servers with 40Gbps InfiniBand.

Fig. 1: Experimental Ceph Testbed. Admin server / client (x1): DELL R730XD, Intel Xeon CPU E5-2640 v3, 128 GB memory. OSD servers (x4): DELL R730, Intel Xeon CPU E5-2640 v3, 32 GB memory, with HGST UCTSSC600 600 GB x4, Samsung PM1633 960 GB x4 and Intel 750 series 400 GB x2. Switches (x2): DELL N4032 (public network, 10Gbps Ethernet) and Mellanox SX6012 (storage network, 40Gbps InfiniBand).

Each storage server is equipped with four 600GB HGST SAS HDDs, four 960GB Samsung PM1633 SAS SSDs and two 400GB Intel 750 NVMe SSDs. We conduct the measurements for the Ceph storage backends on the SAS HDDs and SAS SSDs. In our configuration, a storage server runs four OSD daemons, either on the four HDDs or on the four SSDs, so that each OSD daemon works on a single storage device. Therefore, there are 16 OSDs in total in our Ceph testbed. The NVMe SSDs are used for external journaling in FileStore and for WAL (Write-Ahead Logging) in BlueStore. All experiments are conducted on the Linux 4.4.43 kernel with the latest Ceph Jewel LTS version (v10.2.5).

B. Microbenchmark

First, we analyze how WAF changes when we issue a single write request to the RBD under different circumstances. Recall that the entire RBD space is divided into a set of 4MiB objects by default. When the replication factor is set to its default value of three, three copies of each object are distributed over the available OSDs.

Our microbenchmark is designed to measure the amount of write traffic in the following cases (cf. Figure 2): (1) a write to an empty object (denoted as 1ST WRITE), (2) a write immediately following the 1ST WRITE (denoted as 2ND WRITE), (3) a write to the middle of the object that leaves a hole between the 2ND WRITE and the current write (denoted as 3RD WRITE), and (4) an overwrite of the location written by the 1ST WRITE (denoted as OVERWRITE). The microbenchmark repeats the same experiment while doubling the request size from 4KiB to 4MiB (i.e., 4KiB, 8KiB, 16KiB, ..., 4MiB). We expect the amount of metadata traffic generated by Ceph to vary in each case. To avoid any interference from the previous experiment, we reinstall Ceph for each request size. We use the ftrace tool in Linux to collect and classify the trace results.

Fig. 2: Workload in Microbenchmark (positions of the 1ST WRITE, 2ND WRITE, 3RD WRITE and OVERWRITE within a 4MiB object).
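The structure of the microbenchmark can be expressed as a small driver loop. The sketch below is our own illustration, not the authors' script; write_to_rbd() and reinstall_ceph() are hypothetical placeholders for the actual krbd I/O and cluster re-deployment, and the offset of the 3RD WRITE is chosen only to leave a hole after the 2ND WRITE.

# Illustrative driver for the microbenchmark (our sketch, not the authors' script).
KiB, MiB = 1024, 1024 * 1024

def request_sizes():
    """4KiB, 8KiB, 16KiB, ..., 4MiB."""
    size = 4 * KiB
    while size <= 4 * MiB:
        yield size
        size *= 2

def run_microbenchmark(write_to_rbd, reinstall_ceph):
    for size in request_sizes():
        reinstall_ceph()                          # avoid interference between runs
        write_to_rbd(0, size)                     # 1ST WRITE: empty object
        write_to_rbd(size, size)                  # 2ND WRITE: right after the 1st
        write_to_rbd(3 * size, size)              # 3RD WRITE: leaves a hole after the 2nd
        write_to_rbd(0, size)                     # OVERWRITE: same location as the 1st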

C. Long-Term Workload

To perform experiments under a long-term workload, we use the fio tool in Linux to generate 4KiB random writes to the Ceph RBD, periodically measuring IOPS and WAF for FileStore, KStore (with LevelDB and RocksDB) and BlueStore. For each storage backend, we classify all write requests into several categories and calculate the WAF of each category. As mentioned before, the Ceph RBD is widely used to provide large, reliable and high-performance storage for virtual machines. In such VDI (Virtual Desktop Infrastructure) environments, it is well known that write requests are mostly random, with sizes ranging from 4KiB to 8KiB [9], [10]. This is why we focus on 4KiB random writes in this paper.

Each long-term experiment is performed in the following order:
1) Install Ceph and create an empty 64GiB krbd partition.
2) Drop the page cache, call sync, and wait for 600 seconds so that all dirty data is flushed to disk.
3) Perform 4KiB random writes with a queue depth of 128 (QD 128) using fio to the krbd partition until the total amount written reaches 90% of the capacity (i.e., 57.6GiB).

All experiments are conducted on 16 HDDs first and then repeated on 16 SSDs. According to our experiments, a single write thread with QD 128 is enough to saturate the OSD servers on HDDs, but not on SSDs. Therefore, we perform the experiments on SSDs with two write threads (each with QD 128).

As the purpose of this paper is to analyze WAF, we keep track of the number of sectors written to each disk. We also measure IOPS and the latency distribution, since they are traditionally important metrics in storage performance analysis. During the long-term experiments, we observed that the ftrace overhead affects our results, degrading overall IOPS by up to 5%. To eliminate this overhead, we did not use ftrace in the long-term experiments. Instead, we modified the Linux kernel so that it collects the number of written sectors for each category and exports them via the /proc interface. At runtime, we periodically read those values from the /proc file system. The sampling period is set to 15 seconds for HDDs, but it is shortened to 3 seconds for SSDs because SSDs are much faster than HDDs. The detailed settings and the write traffic classification scheme for each storage backend are described below. Unless otherwise specified, we use the default configuration of Ceph. In all experiments, the replication factor is set to three.

Fig. 3: Internal Architectures of Ceph Storage Backends: (a) FileStore, (b) KStore, (c) BlueStore. The figure shows, for each backend, where Ceph data, Ceph metadata, the Ceph journal, key-value DB and WAL traffic, compaction traffic, and file system metadata and journal are written (XFS partitions on HDDs or SSDs, the raw device and BlueFS, and the NVMe SSDs).

1) FileStore

We create a single partition on each disk and mount it with the XFS file system. This partition is dedicated to a Ceph OSD daemon as its main data storage. Since FileStore needs an additional Ceph journal partition for its write-ahead journaling, we use two NVMe SSDs in each storage server. Each NVMe SSD is divided into two partitions, each of which is assigned to a Ceph OSD daemon as its Ceph journal partition.

As shown in Figure 3(a), we classify write traffic in FileStore into the following categories:
- Ceph data: Replicated client data written by the Ceph OSD daemons.
- Ceph metadata: Data written by the Ceph OSD daemons other than Ceph data.
- Ceph journal: Data written to the Ceph journal partition by the Ceph OSD daemons.
- File system metadata: File system metadata written by XFS (e.g., inodes, bitmaps, etc.).
- File system journal: File system journal written by XFS.

Inside the kernel, it is very difficult to separate Ceph data from Ceph metadata unless there is an explicit hint from the Ceph layer. Instead, we first calculate the amount of Ceph data by multiplying the amount of data written by the client by the replication factor. We then obtain the amount of Ceph metadata by subtracting the amount of Ceph data from the total amount of data written by XFS for regular files and directories. Thus, Ceph metadata also includes any data written to LevelDB by the Ceph OSD daemons.
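In other words, the FileStore breakdown is obtained by simple arithmetic over the sampled counters. The sketch below is our own illustration of that calculation together with the per-category WAF; it is not the modified kernel code, and the 512-byte sector size is our assumption.

# Our illustration of the FileStore traffic breakdown (not the kernel patch).
SECTOR = 512                    # assumed sector size for converting counts to bytes
REPLICATION_FACTOR = 3

def filestore_breakdown(client_bytes, xfs_regular_file_sectors,
                        ceph_journal_sectors, fs_metadata_sectors,
                        fs_journal_sectors):
    ceph_data = client_bytes * REPLICATION_FACTOR
    # Everything XFS wrote for regular files and directories beyond the replicated
    # client data is counted as Ceph metadata (this also absorbs LevelDB writes).
    ceph_metadata = xfs_regular_file_sectors * SECTOR - ceph_data
    categories = {
        "Ceph data": ceph_data,
        "Ceph metadata": ceph_metadata,
        "Ceph journal": ceph_journal_sectors * SECTOR,
        "File system metadata": fs_metadata_sectors * SECTOR,
        "File system journal": fs_journal_sectors * SECTOR,
    }
    waf = {name: volume / client_bytes for name, volume in categories.items()}
    waf["Total"] = sum(categories.values()) / client_bytes
    return categories, waf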
2) KStore

For KStore, we use only one partition per Ceph OSD daemon, mounted with the XFS file system. Since the write requests from the client are 4KiB in size, we set the stripe size of key-value pairs (kstore_default_stripe_size) to 4096 instead of the default value of 65536.

As shown in Figure 3(b), write traffic in KStore is classified into the following categories:
- Ceph data: Replicated client data written by the Ceph OSD daemons.
- Ceph metadata: Data written by the Ceph OSD daemons other than Ceph data.
- Compaction: Data written by LevelDB or RocksDB during compaction.
- File system metadata: File system metadata written by XFS.

