Asynchronous Peer-to-Peer Device Communication - NVIDIA

13th ANNUAL WORKSHOP 2017
Asynchronous Peer-to-Peer Device Communication
Feras Daoud, Leon Romanovsky
[ 28 March, 2017 ]

Agenda
- Peer-to-Peer communication
- PeerDirect technology
- PeerDirect and PeerDirect Async
- Performance
- Upstream work

Peer-to-Peer Communication

Peer-to-Peer Communication
"Direct data transfer between PCI-E devices without the need to use main memory as a temporary storage or use of the CPU for moving data."
Main advantages:
- Allow direct data transfer between devices
- Control the peers directly from other peer devices
- Accelerate transfers between different PCI-E devices
- Improve latency, system throughput, CPU utilization, and energy usage
- Cut out the middleman

PeerDirect Technology

Timeline

Prior To GPUDirect
- GPUs use driver-allocated pinned memory buffers for transfers
- RDMA drivers use pinned buffers for zero-copy, kernel-bypass communication
- It was impossible for RDMA drivers to pin memory allocated by the GPU
- Userspace needed to copy data between the GPU driver's system memory region and the RDMA memory region
[Diagram: GPU data staged through CPU memory via the chipset]
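To make the staging concrete, here is a minimal sketch of that two-copy path. It assumes the QP, the work request (pointing at the RDMA-registered buffer's MR), and both host buffers already exist; all names are placeholders for this example, not part of the original slides.

    /* Sketch of the pre-GPUDirect path: data moves GPU -> CUDA host buffer ->
     * RDMA-registered buffer before it can be sent. qp and wr (whose sg_list
     * points at rdma_buf's MR) are assumed to be set up elsewhere. */
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>
    #include <string.h>

    void send_from_gpu(struct ibv_qp *qp, struct ibv_send_wr *wr,
                       const void *gpu_buf, void *cuda_host_buf,
                       void *rdma_buf, size_t len)
    {
        /* 1. GPU memory -> host buffer owned by the GPU driver. */
        cudaMemcpy(cuda_host_buf, gpu_buf, len, cudaMemcpyDeviceToHost);

        /* 2. Host buffer -> RDMA-registered region (the redundant CPU copy). */
        memcpy(rdma_buf, cuda_host_buf, len);

        /* 3. Send from the RDMA region. */
        struct ibv_send_wr *bad_wr;
        ibv_post_send(qp, wr, &bad_wr);
    }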

GPUDirect/GPUDirect P2P
- GPU and RDMA device share the same "pinned" buffers
- GPU copies the data to system memory
- RDMA device sends it from there
Advantages:
- Eliminate the need to make a redundant copy in CUDA host memory
- Eliminate CPU bandwidth and latency bottlenecks
[Diagram: CPU memory shared between CPU, chipset, GPU memory, and GPU]
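As an illustration of the shared pinned buffer, the sketch below registers one host buffer with both CUDA and the RDMA stack, so only the single GPU-to-host copy remains. pd, qp and wr are assumed to exist elsewhere, and the helper names are invented for this example.

    /* Sketch of the GPUDirect 1.0 path: one pinned host buffer is visible to
     * both CUDA and the HCA, so the host-to-host copy disappears. */
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>
    #include <stdlib.h>

    struct ibv_mr *setup_shared_buf(struct ibv_pd *pd, void **buf, size_t len)
    {
        *buf = malloc(len);
        cudaHostRegister(*buf, len, cudaHostRegisterDefault);     /* pin for CUDA DMA */
        return ibv_reg_mr(pd, *buf, len, IBV_ACCESS_LOCAL_WRITE); /* pin for RDMA     */
    }

    void send_from_gpu(struct ibv_qp *qp, struct ibv_send_wr *wr,
                       const void *gpu_buf, void *shared_buf, size_t len)
    {
        /* Single copy: GPU memory -> shared pinned buffer. */
        cudaMemcpy(shared_buf, gpu_buf, len, cudaMemcpyDeviceToHost);

        /* Send directly from the shared buffer. */
        struct ibv_send_wr *bad_wr;
        ibv_post_send(qp, wr, &bad_wr);
    }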

GPUDirect RDMA/PeerDirect
- CPU synchronizes between GPU tasks and data transfer
- HCA directly accesses GPU memory
Advantages:
- Direct path for data exchange
- Eliminate the need to make a redundant copy in host memory
[Diagram: HCA reads GPU memory directly through the chipset]

GPUDirect RDMA/PeerDirect: CPU Utilization
[Timeline diagram: GPU, CPU, HCA]

    while (fin) {
        gpu_kernel<<<..., stream>>>(buf);   /* GPU compute on the stream               */
        cudaStreamSynchronize(stream);      /* CPU waits for the kernel to finish      */
        ibv_post_send(buf);                 /* CPU posts the send of buf (abbreviated) */
        ibv_poll_cq(cqe);                   /* CPU polls for the completion (abbreviated) */
    }

GPUDirect Async/PeerDirect Async
- Control the HCA from the GPU
- Performance
  - Enable batching of multiple GPU and communication tasks
  - Reduce latency
  - Reduce CPU utilization
- Lightweight CPU
  - Less power
- CPU prepares and queues compute and communication tasks on the GPU
- GPU triggers communication on the HCA
- HCA directly accesses GPU memory
[Diagram: CPU queues work; GPU triggers the HCA; HCA accesses GPU memory]

GPUDirect Async/PeerDirect Async
[Timeline diagram: GPU, CPU, HCA]

    while (fin) {
        gpu_kernel<<<..., stream>>>(buf);       /* queue compute on the stream           */
        gds_stream_queue_send(stream, qp, buf); /* queue the send; GPU triggers the HCA  */
        gds_stream_wait_cq(stream, cqe);        /* queue the wait for the completion     */
    }

CPU is free
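Under the hood, the GPU-side trigger boils down to memory writes and waits queued on the CUDA stream. The sketch below illustrates that mechanism with the CUDA driver's stream memory operations (available on platforms that support them); it is not the PeerDirect Async API itself, and db_ptr, db_value and cqe_flag_ptr are placeholders for addresses and values that, in the real flow, come from the verbs "bytecode".

    /* Illustration only: GPU-triggered doorbell write and completion wait,
     * expressed with CUDA stream memory operations. */
    #include <cuda.h>

    void queue_send_and_wait(CUstream stream,
                             CUdeviceptr db_ptr,       /* mapped HCA doorbell (placeholder)  */
                             unsigned int db_value,    /* doorbell value for the posted WQE  */
                             CUdeviceptr cqe_flag_ptr) /* mapped completion flag (placeholder) */
    {
        /* The GPU rings the doorbell once the preceding kernel on `stream` finishes. */
        cuStreamWriteValue32(stream, db_ptr, db_value, CU_STREAM_WRITE_VALUE_DEFAULT);

        /* The stream then waits until the HCA reports the completion. */
        cuStreamWaitValue32(stream, cqe_flag_ptr, 1, CU_STREAM_WAIT_VALUE_GEQ);
    }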

Peer-to-Peer Evolution
GPUDirect:
- Eliminate the need to make a redundant copy in CUDA host memory
- Eliminate CPU bandwidth and latency bottlenecks
PeerDirect:
- Eliminate the need to make a redundant copy in host memory
- Direct path for data exchange
PeerDirect Sync:
- Control the RDMA device from the GPU
- Reduce CPU utilization

PeerDirect

PeerDirect: How Does It Work?
- Allows ibv_reg_mr() to register peer memory
- Peer devices implement a new kernel module - io_peer_mem
  - It registers with the RDMA subsystem via ib_register_peer_memory_client()
- io_peer_mem implements the following callbacks:
  - acquire() - detects whether a virtual memory range belongs to the peer
  - get_pages() - asks the peer for the physical memory addresses matching the memory region
  - dma_map() - requests the bus addresses for the memory region
  - Matching callbacks for release: dma_unmap(), put_pages() and release()
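To visualize the shape of such a client, here is a self-contained, deliberately simplified callback table. It mirrors the roles listed above but is hypothetical: the real structure and prototypes are defined by peer_mem.h in MLNX_OFED (and implemented by modules such as nv_peer_mem), and they differ from these.

    /* Hypothetical, simplified callback table; the real interface lives in
     * MLNX_OFED's peer_mem.h and its prototypes differ from these. */
    #include <stddef.h>
    #include <stdint.h>

    struct peer_mem_client_sketch {
        /* Does [addr, addr + size) belong to this peer device? */
        int  (*acquire)(uintptr_t addr, size_t size, void **ctx);
        /* Pin the peer pages backing the range. */
        int  (*get_pages)(uintptr_t addr, size_t size, void *ctx);
        /* Produce DMA (bus) addresses for the pinned pages. */
        int  (*dma_map)(void *ctx, uint64_t *dma_addrs, size_t *n_addrs);
        /* Teardown, mirroring the calls above. */
        void (*dma_unmap)(void *ctx);
        void (*put_pages)(void *ctx);
        void (*release)(void *ctx);
    };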

PeerDirect: Memory Region Registration
[Sequence diagram: user-space verbs app, RDMA subsystem, peer client, peer device, HCA]
1. The application calls ibv_reg_mr()
2. (a) acquire() - the peer client claims the address range ("mine!")
3. (b) get_pages() - the peer device pins the peer pages and returns the physical pages
4. dma_map() - DMA addresses are returned and the MR is registered with the HCA
5. ibv_reg_mr() returns success; the MR can be used for PeerDirect
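For a sense of what this looks like from user space, the sketch below registers a cudaMalloc'ed buffer directly with ibv_reg_mr(). It assumes an RDMA-capable NIC, a CUDA GPU, and a loaded peer client module (e.g. nv_peer_mem), and it picks the first HCA purely for brevity.

    /* Sketch: registering GPU memory with ibv_reg_mr(), which only succeeds
     * when a peer client for the GPU is loaded. */
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA, for brevity */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        void *gpu_buf;
        size_t len = 1 << 20;
        cudaMalloc(&gpu_buf, len);                             /* device (GPU) memory */

        /* With PeerDirect, this pins GPU pages via the peer client's
         * acquire()/get_pages()/dma_map() callbacks instead of get_user_pages(). */
        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        printf("GPU MR %s (lkey=0x%x)\n", mr ? "registered" : "failed",
               mr ? mr->lkey : 0);

        if (mr) ibv_dereg_mr(mr);
        cudaFree(gpu_buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }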

PeerDirect Async

PeerDirect Async: How Does It Work?
- Allows peer devices to control the network card
  - Latency reduction, batching of management operations
- Two new supported operations:
  - Queue a set of send operations to be triggered by the GPU - ibv_exp_peer_commit_qp()
  - Test for a "successful completion" - ibv_exp_peer_peek_cq()
- Dedicated QPs and CQs for PeerDirect Sync
  - Avoids interlocking PeerDirect Sync with the normal post-send/poll-CQ paths
- Device agnostic
  - Currently built to support NVIDIA GPUs
  - Can support other HW as well - FPGAs, storage controllers

Transmit Operation
Create a QP - mark it for PeerDirect Sync - associate it with the peer.
1. Post work requests using ibv_post_send()
   - The doorbell record is not updated
   - The doorbell is not rung
2. Use ibv_exp_peer_commit_qp() to get bytecode for committing all WQEs currently posted to the send work queue
3. Queue the translated bytecode operations on the peer, after the operations that generate the data to be sent
[Diagram: (1) CPU queues the work request, (2) CPU passes the bytecode to the GPU, (3) GPU triggers the send on the HCA using the bytecode]
A conceptual sketch of this flow follows.
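The sketch below is conceptual: ibv_post_send() and the CUDA stream write are real calls, but peer_op and get_commit_ops() are hypothetical stand-ins for the descriptor list that ibv_exp_peer_commit_qp() returns in the experimental verbs API, whose exact structures vary across MLNX_OFED releases.

    /* Conceptual transmit path: post WQEs (doorbell suppressed on a
     * PeerDirect Sync QP), fetch the commit "bytecode", and queue the
     * corresponding stores on the GPU stream. */
    #include <cuda.h>
    #include <infiniband/verbs.h>

    struct peer_op {           /* hypothetical: one store descriptor     */
        CUdeviceptr  target;   /* doorbell record or doorbell register   */
        unsigned int value;    /* value to store                         */
    };

    /* Hypothetical helper standing in for ibv_exp_peer_commit_qp(). */
    static int get_commit_ops(struct ibv_qp *qp, struct peer_op *ops, int max_ops)
    {
        (void)qp; (void)ops; (void)max_ops;
        return 0;              /* a real build would translate the exp bytecode here */
    }

    void queue_send_on_gpu(CUstream stream, struct ibv_qp *qp,
                           struct ibv_send_wr *wr)
    {
        struct ibv_send_wr *bad_wr;
        ibv_post_send(qp, wr, &bad_wr);      /* 1. post WQEs; doorbell not rung */

        struct peer_op ops[4];
        int n = get_commit_ops(qp, ops, 4);  /* 2. get the commit "bytecode"    */

        for (int i = 0; i < n; i++)          /* 3. queue the stores after the
                                              *    data-producing kernels        */
            cuStreamWriteValue32(stream, ops[i].target, ops[i].value,
                                 CU_STREAM_WRITE_VALUE_DEFAULT);
    }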

Completion Handling
Create a CQ - mark it for PeerDirect Sync - associate it with the peer.
1. Use ibv_exp_peer_peek_cq() to get bytecode for peeking the CQ at a specific offset from the currently expected CQ entry
2. Queue the translated operations on the peer, before the operations that use the received data
3. Synchronize the CPU with the peer to ensure that all the operations have ended
4. Use ibv_poll_cq() to consume the completion entries
[Diagram: (1) CPU passes the poll bytecode to the GPU, (2) GPU peeks for the completion on the HCA, (3) GPU reports it has finished, (4) CPU reclaims the completions]
A conceptual sketch of this flow follows.
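Again conceptually: cuStreamWaitValue32(), cuStreamSynchronize() and ibv_poll_cq() are real calls, while peek_target and peek_value are hypothetical stand-ins for the poll descriptor that ibv_exp_peer_peek_cq() would return.

    /* Conceptual completion path: the GPU stream waits on the CQE word the
     * HCA will write; the CPU later reclaims the entries with ibv_poll_cq(). */
    #include <cuda.h>
    #include <infiniband/verbs.h>

    void wait_recv_on_gpu(CUstream stream, struct ibv_cq *cq,
                          CUdeviceptr peek_target,  /* mapped CQE flag word (hypothetical)       */
                          unsigned int peek_value)  /* value meaning "CQE ready" (hypothetical)  */
    {
        /* 1.-2. Queue the peek before the kernels that consume the received data:
         *       the stream blocks until the HCA writes the expected value. */
        cuStreamWaitValue32(stream, peek_target, peek_value, CU_STREAM_WAIT_VALUE_GEQ);

        /* ... kernels that use the received buffer are launched on `stream` here ... */

        /* 3.-4. Synchronize with the peer, then consume the completion entries. */
        cuStreamSynchronize(stream);
        struct ibv_wc wc;
        ibv_poll_cq(cq, 1, &wc);
    }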

Performance

Performance Mode
[*] Modified ud_pingpong test: recv + GPU kernel + send on each side.
2 nodes: Ivy Bridge Xeon, K40, Connect-IB, MLNX switch; 10,000 iterations, message size: 128 B, batch size: 20

Economy Mode
- 25% faster
- 45% less CPU load
[*] Modified ud_pingpong test; HW same as in the previous slide.

Upstream Work

Peer-to-Peer – Upstream Proposals
- Peer-to-Peer DMA Mapping
  - Map DMA addresses of one PCI device into the IOVA of another device
- ZONE_DEVICE
  - Extend ZONE_DEVICE functionality to memory not cached by the CPU
- RDMA extension to DMA-BUF
  - Allow creating a memory region from a DMA-BUF file handle
- IOPMEM
  - A block device for PCI-E memory
- Heterogeneous Memory Management (HMM)
  - A common address space will allow migration of memory between devices

13th ANNUAL WORKSHOP 2017
THANK YOU
Feras Daoud, Leon Romanovsky

BACKUP

Bytecode

