Scaled RDMA Performance And Storage Design With Windows Server 2012 R2

Scaled RDMA Performance and Storage Design with Windows Server 2012 R2
Dan Lovinger, Principal Software Engineer, Windows File Server, Microsoft
2013 Storage Developer Conference. Microsoft Corporation. All Rights Reserved.

Outline
SMB3 Application Workloads
Real Hardware – Methodology
2012 Results and Discussion*
Comparison to 2012 R2 RTM
Scaling to Racks and Full Deployments
*There's a paper you can download!

Goals
Demonstrate SMB3 is a valid best choice for application workloads
Evaluate potential of new server hardware with SMB3
Evaluate performance of RDMA-capable fabric(s)
Demonstrate that it is reasonable to consider remotely deployed storage for highly scaled server environments
Chart a future performance course, and metrics to use

Key SMB3 Application Workloads
Hyper-V (virtualization), SQL
8K Random
  VHDs and database tables
  Pure read, plus read/write mix
512K Sequential
  Backup, disk migration, decision support/data mining
  Pure read
  Can be 512K, but performance and requirements largely the same; also 64K

EchoStreams FlacheSAN2
Appliance combining SAS HBAs, enterprise SSDs, and high-speed networking with Windows Server 2012 and Storage Spaces
Server (FlacheSAN2):
  Networking: 3x Mellanox ConnectX-3 FDR InfiniBand HCAs
  CPU: 2x Intel Xeon E5-2650 (8c16t, 2.00 GHz); latest version with E5-2665 2.40 GHz CPUs
  DRAM: 32GB
  Storage: 5x LSI 2308-based PCIe Gen 3.0 SAS HBAs (6 possible, sixth via mezzanine slot), 8x Intel 520 SSDs per controller – five groups of eight for 40 total SSDs (48 possible)
  5x mirrored 4-column 2-copy Spaces, exposed as SMB3 shares
Client (generic white box):
  Networking: 3x Mellanox ConnectX-3 FDR InfiniBand HCAs
  CPU: 2x Intel Xeon E5-2680 (8c16t, 2.70 GHz)
  DRAM: 128GB

Methodology
Client workload generator: Microsoft SQLIO
  Affinitized to run on specific CPU cores
  Two instances, one per socket
Server virtual drives
  Each share exposes two 100GB files
  Client instances split load per-socket
Goal: emulate a typical NUMA-aware modern application, e.g. Windows Hyper-V with guests affinitized to specific socket(s) and core(s), accessing per-VM VHDs
Units: KB, MB, GB are decimal (10^3, 10^6, 10^9); KiB, MiB, GiB are IEC 60027-2 binary (2^10, 2^20, 2^30)
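As a quick illustration of the unit convention above (an addition, not part of the deck), a few lines of Python make the decimal-vs-binary distinction concrete:

```python
# SI (decimal) units used for KB/MB/GB vs. IEC binary units for KiB/MiB/GiB.
KB, MB, GB = 10**3, 10**6, 10**9
KiB, MiB, GiB = 2**10, 2**20, 2**30

# Each share exposes two 100 GB (decimal) files.
file_bytes = 100 * GB
print(file_bytes / GiB)    # ~93.13 when expressed in binary gigabytes

# An 8 KiB I/O is 8192 bytes, not 8000.
print(8 * KiB, 8 * KB)     # 8192 8000
```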

Metric: Overhead in Cycles/Byte (c/B)
Standard measure of CPU bandwidth efficiency:
  c/B = (%Privileged CPU Utilization × Core Clock Frequency × #Cores) / Bandwidth in Bytes/s
Privileged CPU utilization comes from Windows performance counters
  Discounts any unrelated activity, and the load generator itself
Core clock is not constant – must configure the system under test to minimize processor frequency variation:
  Hyperthreading disabled
  TurboBoost and SpeedStep disabled
  Virtualization disabled
  BIOS deep C-states disabled
  Windows power plan set to Max Performance
Re-enabling these can improve performance, i.e. the results are conservative.
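The calculation behind this metric is simple enough to sketch in Python (a minimal illustration, not the tooling used for the deck; collecting the counter values themselves is assumed to have happened elsewhere):

```python
def cycles_per_byte(priv_cpu_pct, core_clock_hz, num_cores, bytes_per_sec):
    """CPU overhead in cycles per byte transferred.

    priv_cpu_pct  : %Privileged CPU utilization (0-100), averaged over the
                    run, from Windows performance counters.
    core_clock_hz : core clock frequency, effectively constant once Turbo,
                    SpeedStep, and deep C-states are disabled.
    num_cores     : physical core count (hyperthreading disabled).
    bytes_per_sec : measured bandwidth in bytes per second.
    """
    cycles_spent_per_sec = (priv_cpu_pct / 100.0) * core_clock_hz * num_cores
    return cycles_spent_per_sec / bytes_per_sec

# For example: roughly 11.6% CPU across the client's 16 cores at 2.70 GHz
# while moving 16.40 GB/s works out to about the 0.31 c/B reported later
# in the deck for 512 KiB reads.
print(round(cycles_per_byte(11.6, 2.70e9, 16, 16.40e9), 2))  # 0.31
```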

Metric: Latency
Two client-visible components of latency:
  Wire – visible in Windows perfmon "stalls"
  Server – filesystem (NTFS) processing time plus storage processing time
Measured as the 90th percentile
Captured with Windows Performance Analyzer
  Individual I/O latencies
  1M samples or 1 minute, with warm-up
Unexpected latency increase can indicate a bottleneck being reached, e.g. CPU saturation or other overhead
[Diagram: client latency decomposed into wire latency (client to server – bit transmission time, including request queuing on/off the adapter) and server latency (FlacheSAN2: LSI HBAs and SSD groups).]

Latency Methodology
Windows Performance Toolkit:
  xperf -on fileio
  xperf -d trace.etl
  xperf -i trace.etl -o trace.txt -a dumper
Correlate relevant fileio events
Trace both sides of the wire simultaneously, post warm-up
Difference the client and server side histograms
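One simple way to do that last differencing step, sketched in Python under the assumption that the dumped fileio events have already been correlated and reduced to per-I/O latency samples for each side (that parsing is not shown here):

```python
def percentile(samples_us, p):
    """Nearest-rank percentile of a list of latency samples (microseconds)."""
    ordered = sorted(samples_us)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_profile(client_us, server_us, percentiles=(50, 90, 99)):
    """Compare client- and server-observed latency distributions and
    attribute the difference to the wire (transmission time plus
    adapter queuing), per the methodology above."""
    profile = {}
    for p in percentiles:
        c, s = percentile(client_us, p), percentile(server_us, p)
        profile[p] = {"client_us": c, "server_us": s, "wire_us": c - s}
    return profile
```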

Result 1: Single I/O Latency
Single random I/O to a single share
Used to establish the base latency expected of the systems
Consistent, good performance, exposing wire and SSD latencies
[Table: 90th-percentile read and write latency (us) at 1, 8, 64, and 512 KiB, broken into client, server, and wire components. Charts: cumulative latency distributions (client vs. server) and wire latency vs. percentile for 1K/8K/64K/512K.]

Result 2: Small I/O Scaling – Read
Client CPU comfortable
Server CPU saturates at high thread count
Note relatively low server CPU clock (2.00 GHz)

1KiB random read:
Threads       IOPs    90th (us)  c/B   %CPU  %CPU Srv
1 (20 I/O)    76650   265        43.3  7.9   10.8
2 (40 I/O)    144050  320        43.3  14.8  21.7
4 (80 I/O)    244250  390        41.4  24.0  48.4
8 (160 I/O)   360950  560        41.7  35.7  84.5
16 (320 I/O)  438400  1040       44.9  46.6  99.9

8KiB random read:
Threads       IOPs    90th (us)  c/B   %CPU  %CPU Srv
1 (20 I/O)    64500   310        7.9   9.7   9.8
2 (40 I/O)    123600  365        7.0   16.4  20.3
4 (80 I/O)    211500  445        6.5   26.3  46.6
8 (160 I/O)   327050  530        6.4   40.0  82.5
16 (320 I/O)  425900  955        7.2   58.2  100.0

[Charts: scaled 8KiB read at 20 IO/thread – IO/s, 90th-percentile latency, and %CPU vs. threads, showing saturation at 16 threads; 8KiB read wire latency vs. percentile at 1T/2T/4T/8T.]

Result 3: Small I/O Scaling – 60/40 Read/Write
Similar to read, as expected, since the load is not bandwidth-limited
Scaling may increase on bi-directional links, if available

1KiB:
Threads       IOPs    90th R (us)  90th W (us)  c/B   %CPU
1 (20 I/O)    69800   300          350          44.3  7.3
2 (40 I/O)    125900  355          410          45.3  13.5
4 (80 I/O)    206450  435          495          43.4  21.3
8 (160 I/O)   319150  545          635          39.6  30.0
16 (320 I/O)  424850  960          1140         47.1  47.5

8KiB:
Threads       IOPs    90th R (us)  90th W (us)  c/B   %CPU
1 (20 I/O)    70700   310          270          7.1   9.5
2 (40 I/O)    124950  370          340          7.0   16.6
4 (80 I/O)    210850  450          410          6.9   27.4
8 (160 I/O)   328150  575          510          6.8   42.5
16 (320 I/O)  375900  1235         1330         7.4   52.7

[Chart: mixed 8KiB read and write wire latency vs. percentile at 8 threads, 20 IO/thread.]
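To see why this mixed small-I/O load stays far from the bandwidth limit (and therefore scales much like the pure-read case), a quick back-of-the-envelope check in Python:

```python
# Approximate bandwidth consumed by the 8 KiB 60/40 mix at 16 threads,
# using the IOPS figure from the table above.
KiB = 2**10
GB = 10**9

iops = 375_900                # 8 KiB mix, 16 threads (320 outstanding I/Os)
io_size = 8 * KiB

bandwidth = iops * io_size    # bytes per second
print(bandwidth / GB)         # ~3.1 GB/s

# Compare with the ~16.4 GB/s the three FDR links deliver for large reads
# later in the deck: the small-I/O load is IOPS/CPU bound, not wire bound.
```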

Result 4: Large I/O (Read)
Full bandwidth (16 GBps!) achievable, at very low CPU
512KB reaches the limit of the network at just under 16 threads
  Multichannel round-robin leads to some latency variation near the limit
CPU limit is much better behaved, by comparison

64KiB:
Threads       GBytes/s  90th (us)  c/B   %CPU
1 (20 I/O)    2.45      550        1.22  6.9
2 (40 I/O)    4.95      630        1.06  12.2
4 (80 I/O)    8.58      780        1.05  20.8
8 (160 I/O)   11.84     1300       1.06  29.0
16 (320 I/O)  13.99     2520       1.09  35.3

512KiB:
Threads       GBytes/s  90th (us)  c/B   %CPU
1 (20 I/O)    6.64      1630       0.31  4.7
2 (40 I/O)    11.34     2570       0.29  7.6
4 (80 I/O)    14.41     4970       0.29  9.8
8 (160 I/O)   15.68     9930       0.30  10.9
16 (320 I/O)  16.40     19900      0.31  11.6

[Charts: large-read wire latency vs. percentile at 8 threads, 20 IO/thread (64KiB in us, 512KiB in ms); scaling of 64KiB and 512KiB I/O – GBytes/s and 50th/90th-percentile latency vs. threads.]

Conclusions (Windows Server 2012)
Maximum bandwidth: 16.4 GB/s (~5.5 GB/s per adapter) at 0.31 c/B overhead, for 512KiB I/Os
High IOPS to real storage: 376,000 IOPS to the FlacheSAN2 at 6.4 c/B overhead, for 8KiB I/Os
Near-constant latency profile

Approaching RTM – Small I/O
8KiB random read to fictitious storage (/dev/zero); "WIP" results as of the Windows Server 2012 R2 'MP' Preview:

               WS2012 IOPS   WS "WIP" IOPS   Δ IOPS   Δ c/B
1x 54Gbps NIC  330,000       460,000         +36%     -17%
2x 54Gbps NIC  660,000       860,000         +30%     -15%

Intermediate results from local-only internal optimizations:
  Enhanced NUMA awareness
  Improved request batching, locking, cacheline false sharing, etc.
Future improvements expected from:
  Further optimizations
  Use of iWARP/InfiniBand remote invalidation
Refer to the earlier Greg Kramer / Tom Talpey presentation for final numbers!

2012 to 2012 R2
Same client; server CPU clock increases by 20%
SSDs age about 9 months
Mezzanine LSI adapter option installed; sixth SSD group now available

        E5-2650   E5-2665   Δ
Normal  2.0 GHz   2.4 GHz   +20%
Turbo   2.8 GHz   3.1 GHz   +11%

2012 to 2012 R2 at 5 Groups
Small read @ 20 QD/thread
Up 30% at the limit, above the nominal 20% from clock alone
2012 R2 reaches 583K IOPS at 1KiB and 549K at 8KiB
[Chart: IO/s vs. threads (1-16) for 1KiB and 8KiB, 2012 vs. 2012 R2.]

2012 to 2012 R2 – 5 Group Latency
End-to-end latency improves very significantly at saturation
[Chart: 8KiB latency (us) vs. threads (1-16), 2012 vs. 2012 R2.]

2012 R2: 5 to 6 Groups
Small read, now 24 QD/thread
Up 20%, as expected, until CPU saturation and max TDP
[Charts: IO/s vs. threads (1-16) for 1KiB and 8KiB with 5 vs. 6 groups; percentage improvement with 6 groups vs. threads.]

2012 R2: Balanced vs. High Performance
Impact of power management varies over load
Same final destination near saturation
[Charts: IO/s and 90th-percentile read latency vs. threads (1-16) for the Balanced and High Performance power plans, for 100% read and for a 60:40 read/write mix.]

Scaling

Classic Cluster-in-a-Box Storage Connectivity
Great 2-point resiliency and easy shared storage
Limited in scale and resiliency
24-120 shared storage devices possible
[Diagram: Server A and Server B each connected to JBOD A and JBOD B.]

Scale-out File Server Storage Connectivity
Great scale and resiliency
  No single point of failure
  Dual path to storage devices from each server
48-280 shared storage devices possible
Scale-out file server allows for resource/load balancing
[Diagram: Servers A-D connected to JBODs A-D; connectivity shown for a single server.]

And Now For Something Different!

Performance:
  100% reads, 4KiB block: 1 million IOPS
  100% reads, 8KiB block: 500K IOPS
  100% writes, 4KiB block: 600K IOPS
  100% writes, 8KiB block: 300K IOPS
Configuration as tested: V6616 (SLC)
  4x dual-port Mellanox ConnectX-3
  2x internal gateways: 8c Sandy Bridge at 1.8 GHz, 48GB DRAM, Windows 2012 R2 Failover Cluster
  8x 1TB shares exported, 2 per client
Planned configuration for GA:
  MLC: 64TB, 32TB, 12TB
  SLC: 16TB
  Samples and POC gear available immediately
Interconnect: 40 GbE RoCE RDMA, SMB 3.0 / SMB Direct
4 external clients:
  2x dual-port Mellanox ConnectX-3
  4c Xeon at 2.53 GHz, 24GB DRAM
  Windows 2012 R2, SQLIO

References
Windows Server 2012 EchoStreams FlacheSAN2 (paper): px?id 38432
EchoStreams FlacheSAN2: http://www.echostreams.com/flachesan2.html
SMB 3.0 Specification (MS-SMB2)
SMB Direct Specification (MS-SMBD)
Windows Performance Analyzer: http://go.microsoft.com/fwlink/?LinkId=214551
Contact: danlo -at- microsoft.com
