IOctopus: Outsmarting Nonuniform DMA

Igor Smolyar, Alex Markuze, Boris Pismenny, Haggai Eran, Gerd Zellweger, Austin Bolen, Liran Liss, Adam Morrison, and Dan Tsafrir
(Technion, Mellanox, VMware Research, Dell, Tel Aviv University)

Abstract

In a multi-CPU server, memory modules are local to the CPU to which they are connected, forming a nonuniform memory access (NUMA) architecture. Because non-local accesses are slower than local accesses, the NUMA architecture might degrade application performance. Similar slowdowns occur when an I/O device issues nonuniform DMA (NUDMA) operations, as the device is connected to memory via a single CPU. NUDMA effects therefore degrade application performance similarly to NUMA effects.

We observe that the similarity is not inherent but rather a product of disregarding the intrinsic differences between I/O and CPU memory accesses. Whereas NUMA effects are inevitable, we show that NUDMA effects can and should be eliminated. We present IOctopus, a device architecture that makes NUDMA impossible by unifying multiple physical PCIe functions—one per CPU—in a manner that makes them appear as one, both to the system software and externally to the server. IOctopus requires only a modest change to the device driver and firmware. We implement it on existing hardware and demonstrate that it improves throughput and latency by as much as 2.7x and 1.28x, respectively, while ridding developers of the need to combat (what appeared to be) an unavoidable type of overhead.

CCS Concepts. Hardware: Communication hardware, interfaces and storage; Software and its engineering: Operating systems; Input / output.

Keywords. NUDMA; NUMA; OS I/O; DDIO; PCIe; bifurcation

ACM Reference Format:
Igor Smolyar, Alex Markuze, Boris Pismenny, Haggai Eran, Gerd Zellweger, Austin Bolen, Liran Liss, Adam Morrison, and Dan Tsafrir. 2020. IOctopus: Outsmarting Nonuniform DMA. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20), March 16–20, 2020, Lausanne, Switzerland. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3373376.3378509

1 Introduction

In modern multi-CPU servers, each CPU is physically connected to its own memory module(s), forming a node, and can access remote memory of other nodes via a CPU interconnect [2, 32, 82, 94].
The resulting nonuniform memory access (NUMA) architecture can severely degrade application performance, due to the latency of remote memory accesses and the limited bandwidth of the interconnect [49]. NUMA effects are inevitable, because there are legitimate, canonical application behaviors that mandate CPU access to the memory of remote nodes—e.g., an application that requires more memory than is available on the local node. Therefore, despite extensive NUMA support in production systems and many research efforts [11, 15, 18, 21, 29, 43, 48, 49, 54, 85, 86], a "silver bullet" solution to the problem seems unrealistic.

The NUMA topology is usually perceived as consisting of CPUs and memory modules only, but it actually includes I/O devices as well. CPUs are equipped with I/O controllers that mediate direct memory access (DMA) by the device to system memory. Consequently, device DMA to the memory of its node is faster and enjoys higher throughput than accesses to remote node memory. We refer to such DMA as nonuniform DMA (NUDMA).

Similarly to NUMA, NUDMA can degrade the performance of I/O-intensive applications, and the many techniques proposed for addressing the problem [11, 13, 28, 31, 35, 74, 81, 91, 92] only alleviate its symptoms instead of solving it.

This paper presents IOctopus, a device architecture that makes NUDMA impossible once and for all. The observation underlying IOctopus is that the similarity between NUMA and NUDMA is not inherent. It is a product of disregarding the intrinsic differences between device and CPU memory accesses. I/O devices are external to the NUMA topology, gaining access to it through the PCIe fabric. It is therefore possible to eliminate NUDMA by connecting the device to every CPU, which allows it to steer each DMA request to the PCIe endpoint connected to the target node.

Crucially, the IOctopus architecture is not simply about device wiring. In fact, there exist commercially available NICs whose form factor consists of two PCIe cards that can be connected to different CPUs [63]. There also exist "multi-host" NICs [16, 38, 62]—aimed at serving multiple servers in a rack [76]—that could be engineered to connect to multiple CPUs within one server.

However, these commercial NIC architectures still suffer from NUDMA effects, because they tacitly assume that a PCIe endpoint must correspond to a physical MAC address. MAC addresses are externally visible, which prompts the OS to associate the PCIe endpoints with separate logical entities such as network interfaces. The IOctopus insight is that decomposing one physical entity—the NIC—into multiple logical entities is the root cause of NUDMA. This decomposition forces a permanent association between a socket and the PCIe endpoint corresponding to the socket's interface, which, in turn, leads to NUDMA if the process using the socket migrates to a CPU remote from that PCIe endpoint.

Accordingly, IOctopus introduces a conceptually new device architecture, in which all of a device's PCIe endpoints are abstracted into a single entity, both physically and logically. The IOctopus model crystallizes that the PCIe endpoints are not independent entities. They are extensions of one entity—the limbs of an octopus.

We describe the design and implementation of octoNIC, an IOctopus-based 100 Gb/s NIC device prototype, and of its device driver. We show that the IOctopus design enables leveraging standard Linux networking APIs to completely eliminate NUDMA. We also report on initial work to apply IOctopus principles to NVMe storage media.

Our evaluation on standard networking benchmarks shows that, compared to a Mellanox 100 Gb/s NIC which suffers from NUDMA, the octoNIC prototype improves throughput by up to 2.7x and lowers network latencies by 1.28x.

2 Background and Motivation

Modern servers are often multisocket systems housing several multicore CPUs. Each CPU is physically connected to its own "local" memory modules, forming a node. CPU cores access "remote" memory of other nodes in a cache-coherent manner via the CPU interconnect. (For x86, this interconnect is HyperTransport (HT) [2, 32] for AMD processors, or QuickPath Interconnect (QPI) [82, 94] and, more recently, UltraPath Interconnect (UPI) [5, 40] for Intel processors.) Remote accesses into a module M are satisfied by the memory controller of M's CPU. Node topology is such that some nodes might be connected to others indirectly via intermediate nodes, in which case remote accesses traverse through multiple memory controllers.

Figure 1. I/O interactions might suffer from nonuniformity. There are four types of such interactions: DMAs and interrupts (initiated by I/O devices), and MMIO and PIO operations (initiated by CPUs). [Figure shows CPU0/DRAM0 and CPU1/DRAM1 joined by the CPU interconnect; the device is attached to CPU0's I/O controller, and local DMA may be served by the LLC via DDIO.]

2.1 NUMA

The ability to access both local and remote modules creates a non-uniform memory access (NUMA) architecture that poses a serious challenge to operating system kernel designers. The challenge stems from the slower remote read/write operations as well as the limited bandwidth and asymmetric nature of the interconnect [49].
Together, these factors can severely degrade the performance of applications.

Addressing the NUMA challenge is nontrivial. It involves accounting for often conflicting considerations and goals, such as: (1) bringing applications closer to their memory and (2) co-locating them at the same node if they communicate via shared memory, while (3) avoiding overcrowding individual CPUs and preventing harmful competition over their resources (notably their cache and memory controller capacities); (4) deciding whether it is preferable to migrate applications closer to their memory pages or the other way around; (5) weighing the potential benefit of migrating applications between nodes against the overhead of continuously monitoring their memory access patterns at runtime, which allows for (6) approximating an optimal node-to-application assignment at any given time in the face of changing workload conditions.

Due to the challenging nature and potential negative impact of NUMA, this issue serves as the focus of many research and development efforts [11, 15, 18, 21, 29, 43, 48, 49, 54, 85, 86]. Production operating system kernels and hypervisors—such as Linux/KVM, FreeBSD, and VMware ESXi—provide basic NUMA support: by satisfying application memory allocations from within the memory modules of the node that runs them [27, 31, 68, 88, 93]; by exposing the NUMA topology to applications [17]; by allowing applications to decide their node affinity [44]; and by automatically migrating virtual memory pages residing on remote nodes to the local node of the corresponding applications [20, 52, 80, 89].
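On Linux, for example, such controls are exposed to applications through the libnuma library (and the numactl utility). The following sketch is a minimal illustration of this kind of node-affinity interface (it is not taken from the paper): it binds the calling thread to a chosen node and satisfies an allocation from that node's memory (compile with -lnuma).

/* Minimal illustration of application-controlled node affinity using
 * libnuma (compile with -lnuma). Not from the paper; error handling
 * is reduced to the essentials. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = 0;                      /* the node we want to be local to */
    size_t len = 64 * 1024 * 1024;     /* 64 MiB working set */

    /* Run the calling thread only on the CPUs of `node` ... */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return 1;
    }
    /* ... and satisfy the allocation from that node's memory modules,
     * so that subsequent accesses are local. */
    void *buf = numa_alloc_onnode(len, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    memset(buf, 0, len);               /* touch the pages (local accesses) */

    printf("thread and %zu bytes pinned to node %d\n", len, node);
    numa_free(buf, len);
    return 0;
}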
2.2 The Problem of NUDMA – Nonuniform DMA
We usually perceive the NUMA topology as consisting of CPUs and memory modules only. However, the topology contains a third type of hardware—I/O devices—as illustrated in Figure 1. In addition to memory controllers, CPUs have I/O controllers, which mediate all memory interactions involving I/O devices. As each device is connected to a single I/O controller, I/O interactions are nonuniform as well. Namely, local interactions between the device and its node (CPU0 and DRAM0 in Figure 1) are speedier and enjoy a higher throughput as compared to remote interactions of the device (with CPU1 and DRAM1), because the latter must traverse through the CPU interconnect and therefore suffer from the same NUMA limitations.

Most of the traffic that flows through I/O controllers is typically associated with direct memory access (DMA) activity, which takes place when devices read from or write to memory while fulfilling I/O requests; we denote this activity as nonuniform DMA (NUDMA). There are other forms of nonuniform I/O: CPU cores communicate with I/O devices via memory-mapped I/O (MMIO) and port I/O (PIO), and devices communicate with cores via interrupts. These types of interactions are also depicted in Figure 1. However, for brevity, and since interrupts, MMIO, and PIO operations tend to be fewer as compared to DMA operations, we overload the term NUDMA to collectively refer to all types of nonuniform I/O activity.

In Intel systems, whenever possible, Data Direct I/O (DDIO) technology satisfies local DMAs using the last-level cache (LLC), keeping the DRAM uninvolved [37] (bottom/left arrow in Figure 1). But DDIO technology only works locally; it does not work for remote DMA, thereby further exacerbating the problem of nonuniformity. The negative implications of the inability to leverage DDIO technology are more than just longer latency. With the ever-increasing bandwidth of I/O devices, studies show that DRAM bandwidth is already becoming a bottleneck resource [3, 55]. This problem further increases the motivation to utilize DDIO, as serving DMA operations using the caches may substantially reduce the load that the DRAM modules experience [45].

We note that NUDMA activity frequently translates to "traditional" NUMA overheads. For example, if a device DMA-writes to some memory location that is currently cached by a CPU remotely to the device, then the corresponding cache line L is invalidated as a consequence, and the CPU has to fetch L from DRAM when subsequently accessing it.

No good solutions to the NUDMA problem exist, and so the relevant state of the art is limited, consisting of recommending that users manually pin I/O-intensive applications to the node that is connected to the corresponding device [13, 28, 31, 35, 81, 92], automatically doing such pinning [14, 30, 74, 77, 78, 87], and migrating some of the threads away from the said local node if it becomes overloaded [11]. Significant effort was invested in making OS schedulers NUDMA-aware [11, 74, 81, 91], which makes an already very sophisticated and sensitive subsystem even more complex and harder to maintain. All of these techniques clearly do not solve the NUDMA problem and only try to alleviate some of its symptoms if/when possible. It seems there is little else that can be done.
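To make this state of the art concrete, the sketch below shows the kind of manual pinning these techniques boil down to: it reads a NIC's node from Linux sysfs and migrates the calling thread to that node. It is an illustrative sketch only, not part of IOctopus; the interface name eth0 is a placeholder (compile with -lnuma).

/* Illustrative sketch of the "pin the application to the device's node"
 * workaround (not part of IOctopus). Compile with -lnuma. The interface
 * name below is a placeholder. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/* Return the NUMA node of a network device, or -1 if unknown.
 * Linux exposes it at /sys/class/net/<ifname>/device/numa_node. */
static int nic_numa_node(const char *ifname)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/class/net/%s/device/numa_node", ifname);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int node = -1;
    if (fscanf(f, "%d", &node) != 1)
        node = -1;
    fclose(f);
    return node;
}

int main(void)
{
    const char *ifname = "eth0";       /* placeholder interface name */
    int node = nic_numa_node(ifname);

    if (node < 0 || numa_available() < 0) {
        fprintf(stderr, "cannot determine %s's node; running unpinned\n", ifname);
        return 1;
    }
    /* Pin the I/O thread (and, via the default local-allocation policy,
     * its future memory) to the NIC's node. */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return 1;
    }
    printf("%s is attached to node %d; thread pinned there\n", ifname, node);
    return 0;
}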
2.3 Multiple Device Queues Do Not Solve NUDMA

Modern high-throughput I/O devices—NICs in our context—support multiple per-device queues. Using these queues, the operating system and the device work in tandem to increase parallelism and improve memory locality. IOctopus uses device queues, but they alone are ineffective against NUDMA.

A queue is a cyclic array (known as a "ring buffer" or simply a "ring") in DRAM, which the OS accesses through load/store operations, and the device accesses using DMA. The queue consists of descriptors that encapsulate I/O requests, which are issued by the OS and are processed by the device. NICs offer two types of queues: transmit (Tx) queues for sending packets from DRAM to the outside world, and receive (Rx) queues for traffic in the opposite direction. Each such queue instance may be further subdivided into two rings, such that one is associated with the requests (that the CPU asks the device to process) and the other is associated with the responses (that the device issues after processing the corresponding requests).

When the device is local to the node, the OS carefully uses Tx queues to increase memory locality. Here, we outline how the Linux kernel accomplishes this goal with Transmit Packet Steering (XPS) [53]; other kernels use similar mechanisms [26, 67]. The Linux network stack maps each core C to a different Tx queue Q, such that Q's memory is allocated from C's node. Additionally, memory allocations of packets transmitted via Q are likewise fulfilled using the same node. Cores can then transmit simultaneously through their individual queues in an uncoordinated, NU(D)MA-friendly manner while avoiding synchronization overheads. When a thread T that executes on C issues a system call to open a socket file descriptor S, the network stack associates Q with S, saving Q's identifier in the socket data structure. After that, whenever T transmits through S, the network stack checks that T still runs on C. If it does not, the network stack updates S to point to the queue of T's new core. (The actual modification happens after Q is drained of any outstanding packets that originated from S, to avoid out-of-order transmissions.)
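The following user-space sketch models the data structures just described: a per-core request ring of descriptors whose memory is allocated on the core's node, and an XPS-like send path that posts to the ring of the core it currently runs on. The types and helpers are invented for illustration; a real NIC driver additionally performs DMA mapping, rings a doorbell via MMIO, and reaps completions from a response ring (compile with -lnuma).

/* Simplified, user-space model of per-core Tx descriptor rings with
 * node-local memory (types and helpers invented for illustration).
 * Compile with -lnuma; _GNU_SOURCE is needed for sched_getcpu(). */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct tx_desc {                 /* one I/O request, as seen by the device */
    uint64_t buf_addr;           /* (DMA) address of the packet buffer */
    uint32_t len;                /* packet length in bytes */
    uint32_t flags;              /* e.g., ownership/"valid" bit */
};

struct tx_ring {                 /* cyclic array ("ring buffer") of descriptors */
    struct tx_desc *desc;
    uint32_t size;               /* number of slots, power of two */
    uint32_t head;               /* next slot the OS will fill */
    int node;                    /* NUMA node holding the ring's memory */
};

/* One ring per core, its memory allocated from the core's node. */
static struct tx_ring *ring_create(int core, uint32_t size)
{
    int node = numa_node_of_cpu(core);
    struct tx_ring *r = numa_alloc_onnode(sizeof(*r), node);
    if (!r)
        return NULL;
    r->desc = numa_alloc_onnode(size * sizeof(*r->desc), node);
    if (!r->desc) {
        numa_free(r, sizeof(*r));
        return NULL;
    }
    r->size = size;
    r->head = 0;
    r->node = node;
    return r;
}

/* XPS-like queue selection: transmit through the ring of the core we are
 * currently running on, so descriptor and buffer accesses stay node-local. */
static void xmit(struct tx_ring **per_core_ring, uint64_t buf, uint32_t len)
{
    struct tx_ring *r = per_core_ring[sched_getcpu()];
    struct tx_desc *d = &r->desc[r->head & (r->size - 1)];

    d->buf_addr = buf;
    d->len = len;
    d->flags = 1;                /* hand the descriptor to the device */
    r->head++;
    /* A real driver would now ring the NIC's doorbell (an MMIO write)
     * and later reap a completion from the response ring. */
}

int main(void)
{
    if (numa_available() < 0)
        return 1;

    int ncpus = numa_num_configured_cpus();
    struct tx_ring **rings = calloc(ncpus, sizeof(*rings));
    if (!rings)
        return 1;
    for (int c = 0; c < ncpus; c++) {
        rings[c] = ring_create(c, 256);   /* 256 slots (power of two) */
        if (!rings[c])
            return 1;
    }

    static char payload[1500];            /* stand-in for a packet buffer */
    xmit(rings, (uint64_t)(uintptr_t)payload, sizeof(payload));
    printf("posted one descriptor on core %d's ring (node %d)\n",
           sched_getcpu(), rings[sched_getcpu()]->node);
    return 0;
}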

Assuming the device is local to the node, receiving packets with good memory locality is also possible, although it is somewhat more challenging than transmission and requires additional device support. Linux associates separate Rx queues with cores similarly to Tx queues, such that the associated ring buffers and packet buffers are allocated locally. The difference is that, when receiving, it is not the OS that steers the incoming packets to queues, but rather the NIC. Therefore, modern NICs support Accelerated Receive Flow Steering [53] (ARFS) by (1) providing the OS with an API that allows it to associate networking flows¹ with Rx queues, and by (2) steering incoming packets accordingly. When the OS migrates T away from C, the OS updates the NIC regarding T's new queue using the ARFS API. Once again, the actual update is delayed until the original queue is drained of packets of S, to avoid out-of-order receives.

¹ An IP flow is uniquely identified by its 5-tuple: source IP, source port, destination IP, destination port, and protocol ID.
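The sketch below models the flow-to-queue steering just described in user space: a table maps a flow's 5-tuple to an Rx queue index, and when the consuming thread migrates, the entry is re-pointed to the queue of the new core. This is a conceptual model with invented types and a toy hash; in reality the steering table resides in the NIC and is programmed through a driver callback, but the sketch captures the association that ARFS maintains.

/* Conceptual, user-space model of ARFS-style flow steering (invented
 * types; the real steering table lives in the NIC and is programmed
 * through a driver callback). */
#include <stdint.h>
#include <stdio.h>

struct flow_key {                /* the 5-tuple that identifies an IP flow */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

#define TABLE_SIZE 1024
static uint16_t steer_table[TABLE_SIZE];   /* flow hash -> Rx queue index */

static uint32_t flow_hash(const struct flow_key *k)
{
    /* Toy hash; NICs typically hash the same fields (e.g., Toeplitz). */
    uint32_t h = k->saddr ^ k->daddr ^ (((uint32_t)k->sport << 16) | k->dport);
    return (h ^ k->proto) % TABLE_SIZE;
}

/* Called when the OS learns that the thread consuming this flow now runs
 * on a core whose (node-local) Rx queue is `rxq`. In Linux this corresponds
 * to a driver flow-steering callback; the update takes effect only after
 * the old queue drains, to preserve in-order delivery. */
static void steer_flow(const struct flow_key *k, uint16_t rxq)
{
    steer_table[flow_hash(k)] = rxq;
}

/* The NIC-side lookup: an arriving packet's 5-tuple selects the queue into
 * which its buffer and completion descriptor are DMA-written. */
static uint16_t select_rxq(const struct flow_key *k)
{
    return steer_table[flow_hash(k)];
}

int main(void)
{
    struct flow_key k = { 0x0a000001, 0x0a000002, 12345, 80, 6 /* TCP */ };

    steer_flow(&k, 3);           /* consumer runs on a core mapped to Rx queue 3 */
    printf("packets of this flow go to Rx queue %u\n", select_rxq(&k));

    steer_flow(&k, 7);           /* consumer migrated; re-steer to queue 7 */
    printf("after migration: Rx queue %u\n", select_rxq(&k));
    return 0;
}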
Together, XPS and ARFS improve memory locality, and they eliminate all NU(D)MA effects if the device is local to N—the node that executes T. However, both techniques are ineffective against remote devices. For example, assume that the NIC is remote to N, and that L is a line that is cached by the CPU of N. If L holds the content of an Rx completion descriptor or packet buffer that will soon be DMA-written by the NIC on packet arrival, then L will have to be invalidated before the NIC is able to DMA-write it, as DDIO is not operational when the device is remote. When L is next read by T, its new content will have to be fetched from DRAM.

2.4 Remote DDIO Will Not Solve NUDMA

Even if hardware evolves and extends DDIO support to apply to remote devices, NU(D)MA effects nevertheless persist. Even if the NIC could write to a remote LLC, its accesses would suffer from increased latency on the critical data path, while contending over the bandwidth of the CPU interconnect (Figure 1). A less drastic remote DDIO design would allocate the line written by the NIC in the local LLC even if the target address belongs to another node. However, the remote CPU would still have to read the data from the NIC's node, resulting in cache-line ping-pongs between nodes and again increasing the critical path latency.

We empirically validate that the latter remote DDIO design does not alleviate NU(D)MA overheads in a significant way as follows. Remote DDIO already partially works for DMA-writes in cases where a response ring (containing I/O request completion notifications) is allocated locally to the device and remotely to the CPU. Let us denote the latter ring as R. After receiving a packet, the NIC DMA-writes to R the corresponding completion notification. In this case, the physical destination of the DMA is the LLC of the CPU that is local to the NIC, because device-to-memory write activity allocates cache lines in the LLC when the target memory is local to the device [37]—as is the case for R. In the pktgen benchmark experiment (described in detail in §5), which is dominated by memory accesses to rings, we find that allocating R remotely to pktgen and locally to the NIC yields only a marginal performance improvement of up to 2%.

2.5 Multiple Devices Do Not Solve NUDMA

NUDMA effects can potentially be alleviated by installing multiple identical I/O devices, one for each CPU, thus allowing all cores to enjoy their own local device [83, 87]. Let us assume that the system's owner is willing to tolerate the potentially wasted bandwidth and added hardware price associated with purchasing a different NIC for each CPU node in each server, along with enough network switches with enough ports to connect all these NICs. This costlier setup can help to curb NU(D)MA effects, but only if the workload is inherently static enough to ensure that all threads remain in their original nodes throughout their lifetime. (And of course only if these threads are limited to exclusively using local devices.)

In contrast, dynamic workloads that require load balancing between CPUs will experience NU(D)MA overheads, because, technically, once a socket S is established, there is no generally applicable way to make the bytes that it streams flow through a different physical device. Therefore, using the above notation, if a thread T migrates from one CPU to another, its socket file descriptor S will still be served by the device at the original node, thereby incurring NU(D)MA overheads.

With Ethernet, for example, the inability to change the association between S and its original NIC stems from the fact that an IP address is associated with exactly one MAC. While it is possible to transition this IP address from one NIC (and MAC) to another, doing so would mean that all the other threads that use this IP address would either lose connectivity or have to change their association as well, potentially causing new NUDMA effects.

When a server is connected to a switch through multiple NICs, it may instruct the switch to treat these NICs as one channel (called "bonding" [51] or "teaming" [50]), if the switch supports EtherChannel [19] or 802.3ad IEEE link aggregation [33]. This approach does not eliminate NUDMA activity either, because there is no way for the server to ask the switch to steer flows of some thread T to a specific NIC, and then to another NIC, based on the CPU where T is currently executing. Switches do not support, for example, a mechanism similar to the aforementioned ARFS (§2.3). (While SDN switches have similar capabilities [75], they typically do not provide individual hosts with the ability to steer between aggregated links.) It is possible to design switches that support ARFS-like functionality, but we would have to replace all the existing infrastructure to enjoy it.

2.6 Technology Trends: One Device May Be Enough

In addition to the fact that multiple I/O devices do not solve the NUDMA problem (§2.5), in the case of networking, we contend that technology trends suggest that the I/O capacity of a single device should typically be enough to satisfy the needs of all the CPUs in the server. Figure 2 depicts these trends by showing the past and predicted progression of the network bandwidth that a single NIC supports vs. the network bandwidth that a single CPU may consume. The two NIC lines shown correspond to the full-duplex throughput of a single- and a dual-port NIC, respectively.

Figure 2. The bandwidth of the NIC exceeds what a single CPU could use. Top labels show Ethernet generations. Bottom labels show the number of cores per CPU. (Data taken from various sources corresponding to Intel/AMD CPUs [8, 39, 70] and Mellanox and Intel NICs [6, 34, 39, 58, 59, 65].) [Plot: throughput (Gbps) vs. year, 2008–2020, with lines for a single-port NIC, a dual-port NIC, and per-CPU consumption at 513 Mbps/core and at 10 Gb/s/core.]

The bottom-most CPU line assumes that every core in the CPU consumes 513 Mbps. This figure reflects an upper bound on the per-core TCP throughput that was reported for Amazon EC2 high-spec instances (4xlarge and up: 8xlarge, 12xlarge, etc.) with 8 and more cores when all cores concurrently engage in networking [7, 90]. An earlier report from 2014 shows that 8-core instances of four cloud providers (Amazon, Google, Rackspace, and Softlayer) consume at most 380 Mbps per core [71].

The upper CPU line assumes an unusually high per-core rate of 10 Gb/s TCP, which consumes about 50% of a core's cycles in a bare-metal setup when running the canonical netperf benchmark [42]; let us assume the other 50% is needed for computation, as netperf does not do anything useful. The number of cores shown reflects the highest per-CPU core count available from Intel and AMD for the corresponding year. We multiply the assumed maximal per-core bandwidth by the highest core count and display the product as the maximal throughput that one CPU may produce/consume (optimistically assuming that OSes can provide linear scaling when all CPU cores simultaneously do I/O). The figure indicates that one NIC is capable of satisfying the needs of multiple CPUs, even in such a demanding scenario. Others have reached a similar conclusion [46].

3 Design

In this section, we describe the design of IOctopus, which consists of hardware and software components that together eliminate all NUDMA effects. We begin by observing the inherent differences between NUMA and NUDMA that make IOctopus possible (§3.1). We next describe the hardware/firmware support that IOctopus necessitates in wiring (§3.2) and networking (§3.3). We then describe the software, operating system aspect of IOctopus, which introduces a new type of I/O device that is local to all CPUs (§3.4). In the subsequent section, we describe how we implemented all of these components (§4).

Figure 3. NUMA effects are inevitable for some canonical algorithm classes, which dictate that CPU cores in one NUMA node must access the memory of another (a–c). NUDMA effects are likewise presently unavoidable (d), but not due to true node sharing. [Panels: (a) need more memory; (b) need more CPU; (c) communicate via shared mem.; (d) DMA to remote mem. Panels a–c illustrate NUMA as inevitable in common use cases; panel d illustrates NUDMA.]

3.1 True and False Node Sharing

NUMA effects cannot be eliminated. This is true despite the extensive NUMA support provided by production systems and all of the associated research efforts (§2.1). NUMA effects are inevitable because there are legitimate, canonical algorithm classes that mandate CPU cores to access the memory of remote NUMA nodes.
Let us use the term "true node sharing" to denote such situations, where, algorithmically, it is impossible to avoid NUMA effects, as CPU cores are meant to access memory of remote nodes, by design.

True node sharing occurs, for example: when a single thread running on a single core solves a problem that requires more memory than is available on the local node (Figure 3a); or when the problem being solved requires relatively little memory and is housed by a single node, but additional cores—more than are available on the local CPU—can accelerate the solution considerably (Figure 3b); or when the problem is solved with a classically structured parallel job, where each thread is assigned a core to execute a series of compute steps, separated by communication steps whereby all threads read from the memory of their peers in order to carry out the subsequent compute step (Figure 3c) [24].

The initial insight underlying the design of IOctopus is that NUDMA activity is not the result of true node sharing. This is so because, by definition (§2.2), NUDMA activity does not involve cores accessing any memory modules, neither local nor remote (Figure 3d). Instead, it is the device that accesses the memory.

More specifically, as its name suggests, the NUMA architecture intentionally makes memory accesses of CPU cores nonuniform. It employs a distributed memory controller that unifies the memory modules spread across all nodes into a single coherent address space. Memory access latencies experienced by cores are then determined by the internal topology of the distributed memory space. In contrast, I/O devices are entirely external to this topology, gaining access to it via a PCIe fabric. Thus, the specific connection point between the PCIe fabric and the NUMA memory space determines the memory access latencies that devices experience. Namely, assuming it is possible to connect the NIC in Figure 3d via PCIe to both CPUs, then, in principle, it may be possible to eliminate NUDMA effects.

In light of the above, we denote NUDMA effects as happening due to "false node sharing." When restating our aforementioned insight using this terminology, we can say that the inherent difference between NUMA and NUDMA effects is that the former are the result of true node sharing, whereas the latter are the result of false node sharing. This articulation is helpful, because it highlights why, in principle, NUDMA effects may be avoidable.

3.2 Wiring Hardware Support

Connecting I/O devices via PCIe to only a single CPU is an old, standard practice, which is so pervasive that it appears as carved in stone. Consequently, one might easily mistakenly believe that there are sound technical reasons that prevent us from connecting a device to multiple CPUs. However, this is not the case. Such connectivity already exists in production, and we contend that its availability will become more and more prevalent in the future, as discussed below.

Before we conduct the discussion, however, it is essential to note that, by itself, connecting an I/O device to multiple CPUs does not eliminate NUDMA. Rather, such connectivity is equivalent to using multiple devices, as discussed in §2.5. Namely, for technical reasons explained later on, connecting a device to multiple CPUs translates to adding more PCIe endpoints to the PCIe fabric, such that each endpoint is local to its own CPU but remote to all the rest.

PCIe Bifurcation and Extenders. Currently, probably the most straightforward approach that can be used to connect one I/O device to multiple CPUs is through PCIe bifurcation [41], which enables splitting a single PCIe link into several.² The vendor of the I/O device can implement different types of bifurcation, e.g., a 32-lane PCIe link could be split into 2 or 4 PCIe endpoints with link widths of 16 and 8 lanes, respectively. The additional endpoints that bifurcation creates could be connected to other CPUs.

² The citation [41] refers to bifurcating one CPU PCIe link into multiple links to the same CPU; we are presenting bifurcation to multiple CPUs here.

In some bifurcation cases—e.g., splitting 16 lanes into two 8-lane endpoints connected to different CPUs—the resulting available bandwidth between the device and a single CPU may not be sufficient for certain workloads. To alleviate this problem, vendors can support extending, say, a 16-lane PCIe device with an additional 16-lane PCIe endpoint (provided that internally the device has 32 lanes; additional resources are required [66]).
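The locality of each resulting PCIe endpoint is visible to system software; Linux, for instance, exposes every PCIe function's NUMA node and negotiated link width in sysfs. The sketch below reads both attributes for one function (the PCI address is a placeholder); on a device bifurcated across CPUs, each endpoint would report a different node.

/* Read a PCIe function's NUMA node and negotiated link width from
 * Linux sysfs. The PCI address below is a placeholder; on a device
 * bifurcated across CPUs, each endpoint reports a different node. */
#include <stdio.h>
#include <string.h>

static int read_attr(const char *bdf, const char *attr, char *out, size_t len)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/%s", bdf, attr);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(out, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}

int main(void)
{
    const char *bdf = "0000:3b:00.0";  /* placeholder PCIe endpoint address */
    char node[32], width[32];

    if (read_attr(bdf, "numa_node", node, sizeof(node)) != 0 ||
        read_attr(bdf, "current_link_width", width, sizeof(width)) != 0) {
        fprintf(stderr, "cannot read sysfs attributes of %s\n", bdf);
        return 1;
    }
    node[strcspn(node, "\n")] = '\0';
    width[strcspn(width, "\n")] = '\0';
    printf("%s: NUMA node %s, negotiated link width x%s\n", bdf, node, width);
    return 0;
}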
Attesting to the architectural viability of using PCIe bifurcation to connect a single I/O device to multiple targets is the fact that Broadcom, Intel, and Mellanox already produce "multi-host" NICs [16, 38, 62]. The goal of a multi-host NIC is to simultaneously serve 2–4 physical servers in a consolidated manner [76]. Given that such connectivity works for multiple servers in a rack, it stands to reason that it should also work for multiple CPUs within one server.

IOctopus is a joint project developed by several organizations, including Mellanox, which is a networking vendor. Mellanox
