NoC-enabled Software/Hardware Co-Design Framework for Accelerating k-mer Counting


Biresh Kumar Joardar*, Priyanka Ghosh*, Partha Pratim Pande*, Ananth Kalyanaraman*, Sriram Krishnamoorthy†

*School of EECS, Washington State University, Pullman, WA 99164, U.S.A.
{biresh.joardar, priyanka.ghosh, ananth, pande}@wsu.edu

ABSTRACT
Counting k-mers (substrings of fixed length k) in DNA and protein sequences generates non-uniform and irregular memory access patterns. Processing-in-Memory (PIM) architectures have the potential to significantly reduce the overheads associated with such frequent and irregular memory accesses. However, existing k-mer counting algorithms are not designed to exploit the advantages of PIM architectures. Furthermore, owing to thermal constraints, the allowable power budget is limited in conventional PIM designs. Moreover, k-mer counting generates unbalanced and long-range traffic patterns that need to be handled by an efficient Network-on-Chip (NoC). In this paper, we present an NoC-enabled software/hardware co-design framework to implement high-performance k-mer counting. The proposed architecture enables more computational power and efficient communication between cores and memory, all without creating a thermal bottleneck, while the software component exposes more in-memory opportunities to exploit the PIM and aids in the NoC design. Experimental results show that the proposed architecture outperforms a state-of-the-art software implementation of k-mer counting utilizing Hybrid Memory Cube (HMC) by up to 7.14X, while allowing significantly higher power budgets.

KEYWORDS
Manycore, M3D, Thermal, PIM, k-mer counting, Co-design

1 INTRODUCTION
Analysis of biomolecular data such as DNA and proteins has been one of the primary drivers of scientific discovery in the biological sciences. From a computational perspective, these biomolecules can be represented as strings (equivalently, sequences). Hence, sequence analysis occupies a significant portion of many bioinformatics workflows.
One such operation is k-mer counting, where the goal is to determine the counts of all distinct fixed-length substrings of length k in a large collection of input sequences. Computing k-mer abundance profiles is often necessary for several bioinformatics applications, e.g., de novo genome assembly, repeat identification, etc.

Challenges: From a software perspective, implementing k-mer counting in a resource- and time-efficient manner is a challenging task. Existing software-only solutions for efficient k-mer counting, e.g., KMC2 [1] and Gerbil [2], do not consider hardware limitations, resulting in sub-optimal performance. The counting process generates irregular memory accesses and sparse computations that result in more frequent memory reads/writes. Conventional memory architectures provide limited bandwidth with high read/write latencies. This presents a performance bottleneck for k-mer counting, which relies on repeated memory accesses. As a result, computing units often remain idle, as a large portion of the execution time is spent moving data to/from memory. Moreover, k-mer counting generates significant communication between the Processing Elements (PEs).

†HPC Group, Pacific Northwest National Laboratory, Richland, WA 99352, U.S.A.
sriram@pnnl.gov

© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. NOCS '19, October 17–18, 2019, New York, NY, USA. ACM ISBN 978-1-4503-6700-4/19/10, $15.00
For example, in Gerbil [2], k-mers are repeatedly distributed among the threads responsible for counting, introducing a significant amount of on-chip traffic. Without an efficient communication backbone, this can lead to longer execution times, as PEs would remain idle for a greater number of cycles waiting for data. To overcome these inefficiencies in conventional architectures, we posit that a carefully designed software/hardware co-design framework can be better equipped to derive the best out of both worlds. More specifically, in this work, we argue that the emerging paradigm of Processing-in-Memory (PIM), enabled by a Network-on-Chip (NoC), presents a promising solution for these applications.

PIM takes advantage of emerging 3D-stacked memory-logic devices (such as Micron's Hybrid Memory Cube, or HMC) to enable high-bandwidth, low-latency and low-energy memory access [3]. However, conventional PIM architectures are restricted by thermal constraints, as temperature impacts both memory retention and overall performance [4, 5]. Conventional 2.5D and 3D PIM architectures often have limited power budgets and computational capability before reaching the allowable temperature threshold [5, 6]. The role of the NoC in PIM architectures is also understudied. Due to the single layer of logic in existing PIM architectures, a planar NoC is used for communication among the PEs [7]. However, 2D NoCs, such as the mesh, are not suited for the long-range communication that is inherent in k-mer counting.
Moreover, k-mer counting generates unbalanced traffic that imposes an additional layer of design complexity.

To overcome the above-mentioned challenges, in this paper we propose an NoC-enabled manycore architecture that exploits the benefits of emerging Monolithic 3D (M3D) integration to incorporate multiple logic layers in PIM architectures with 3D-stacked memory, for high-performance k-mer counting. Experimentally, we show that even with multiple logic layers and a higher power budget, the proposed architecture does not violate thermal constraints. The main contributions of this work are:

- We profile k-mer counting to extract features relevant to the design of a high-performance NoC and the overall PIM architecture for k-mer counting.
- We present a software/hardware co-design framework: the hardware consists of a PIM with multiple logic layers enabled by M3D integration, while the software component enables high-performance k-mer counting by utilizing the benefits of PIM and aids in NoC design.
- We perform a thorough experimental evaluation of the proposed co-design framework and show significant improvements over an appropriate baseline.

2 RELATED WORK

2.1 k-mer Counting
The task of k-mer counting is memory-intensive and involves creating a histogram of all k-length substrings in a DNA sequence. KMC2 [1], DSK [8] and Gerbil [2] are some of the popular tools for this purpose, with Gerbil representing the state-of-the-art in software, as it outperforms most of the other tools. However, it requires the repetitive use of off-chip secondary memory and therefore fails to fully exploit the high-bandwidth, low-latency memory access facilitated by PIM. Manycore CPU- and GPU-based platforms are the preferred choices for implementing k-mer counting [1, 2]. However, these works do not address the memory bottleneck. In [9], the authors designed a custom FPGA-based architecture connected to an HMC for approximate k-mer counting. However, the memory (HMC) is connected to the PEs using serial links in a 2.5D architecture, which is not as efficient as completely on-chip solutions, i.e., 3D PIM [3, 10].

2.2 Processing-in-Memory
Processing-in-Memory (PIM) involves moving the computational units closer to memory. This allows efficient data transfer from memory, enabled by 3D integration [3]. Prior work has mostly focused on Through-Silicon Via (TSV)-based PIM architectures, which are prone to high temperatures [5, 11]. Heat from processing elements can significantly affect the retention time of DRAMs [4]. Beyond 85°C, the overheads needed to counter lower DRAM retention can significantly offset the benefits of PIM [5].

To reduce the effect of temperature, a 2.5D architecture is a popular choice for implementing PIM, where PEs are placed near the memory and connected via interposers. However, lateral heat flow from the PEs can significantly affect memory temperature [6]. In 3D-PIM architectures, memory is stacked directly on top of the PEs, which enables better throughput and latency [3]. However, these are more prone to high temperatures, as the PEs are placed in the same vertical stack. In [12], the authors propose to use a mix of 2.5D and 3D PIMs. They map the application onto PEs based on its memory and compute requirements for best performance. However, their methodology assumes prior knowledge of the application, which is often not feasible. Memory-centric NoCs, which connect multiple HMCs to facilitate efficient data transfer, have been studied [7]. The role of an NoC connecting different vaults of an HMC is discussed in [13]. However, these implementations are limited to 2D NoCs, which are inefficient in addressing the challenges posed by k-mer counting, i.e., unbalanced traffic and long-range communication.

To overcome these limitations, we propose an NoC-enabled PIM-based architecture that amalgamates: (a) multiple logic layers in conventional PIM, (b) M3D-based vertical integration, and (c) efficient 3D-NoC design for high-performance k-mer counting, while remaining within the 85°C temperature threshold. To take advantage of PIM's more efficient memory access and to aid NoC design, we also propose an alternative software approach to count k-mers that outperforms the Gerbil framework.

3 HIGH-PERFORMANCE k-mer COUNTING

Problem statement: Given a DNA sequence s of length l, a "k-mer" is defined as a substring of length k in s (k being an integer, k ≤ l). All k-mers in s can be generated by simply sliding a window of length k over s. Given a set S of n such input sequences (aka "reads"), the problem of k-mer counting is one of determining the total number of occurrences of each distinct k-mer present in the reads of S. In this work, we mainly focus on Gerbil [2] as our software baseline for k-mer counting. Gerbil is a recently proposed methodology that outperforms several well-known counters, e.g., KMC2 [1] and DSK [8], for higher values of k (e.g., k ≥ 32), making it an appropriate software baseline to consider.

3.1 Gerbil: Overview and Analysis
Gerbil is a two-phase algorithmic implementation of k-mer counting: (a) a 'Distribution' phase, where the input reads are partitioned into multiple intermediate files on disk, and (b) a 'Counting' phase, which reads each of these intermediate files (one by one) for counting. The entire procedure is designed to make optimal use of conventional manycore architectures. Each step in Gerbil operates in a pipelined fashion for high-throughput counting. Popular techniques like load balancing and the use of failure buffers to handle hash conflicts have also been employed for better performance. Fig. 1 illustrates the overall workflow of Gerbil.

Fig. 1: Illustration of workflow in Gerbil (F: Input files, B: Buckets, C: CPU; 'Buckets' is synonymous with 'intermediate files' in [2])

Next, we thoroughly analyze Gerbil using detailed full-system simulations on Gem5 [14] to study relevant features that are crucial in designing an efficient manycore architecture. Fig. 2(a) shows the different categories of instructions (operations) involved in Gerbil with real-world inputs. We observe that nearly two-thirds of Gerbil consists of integer operations. Memory operations (including I/O) constitute the second-largest category (32.5%). The remainder is made up of NoOps, while floating-point instructions are negligible.

Traditional off-chip main-memory and secondary-memory accesses are slow [10], which can cause significant CPU stalls. In Gerbil, memory and I/O operations contribute significantly to the runtime, the effect of which is captured in Fig. 2(b). Full-system simulations on Gem5 with 64 Intel x86 cores executing Gerbil show that the CPUs are utilized less than 15% of the time (Fig. 2(b)) for any number of intermediate files. The intermediate files generated after the Distribution phase of Gerbil are stored and then eventually read back from the slow off-chip secondary memory for further processing (as shown in Fig. 1). Moreover, the counting process involves irregular memory accesses, which makes caching ineffective. As a result, even though integer instructions constitute the majority of Gerbil's operations, Fig. 2(b) clearly highlights that most of the execution time is spent fetching/storing data rather than in actual computation. The CPU utilization gets worse as more intermediate files are generated (4.5% CPU utilization in the case of 512 files), since this involves more off-chip memory access. This translates to a significant increase in runtime, as observed in Fig. 2(b). Overall, it is clear that Gerbil does not efficiently utilize the computing resources, resulting in sub-optimal performance. Fig. 2 also shows that, for k-mer counting, slow memory access presents a more serious bottleneck to performance than computation, making it an ideal case for PIM. However, Gerbil's dependence on secondary memory (Fig. 1) makes it inappropriate for PIM architectures, as it fails to fully exploit the high-bandwidth, low-latency memory access facilitated by PIM. Therefore, a PIM-friendly k-mer counting software solution that complements the hardware is necessary.

Fig. 3 shows the communication between every pair of CPUs (Ci) for three different input datasets in the form of a heat map. Here, we define the amount of communication (traffic) as the number of flits exchanged between a pair of cores during k-mer counting, as obtained from full-system Gem5 simulations of a manycore system with 64 cores.
As shown in Fig. 3, k-mer counting in Gerbil exhibits a significant amount of data exchange between cores. Darker patches (a few have been highlighted in red in Fig. 3(a) as examples) indicate heavier communication between a pair of cores. Planar logic in conventional PIM offers only a limited number of design and floor-planning choices. Hence, frequently communicating cores may get placed far from each other, leading to long-range communication. Also, we observe several lighter patches in Fig. 3 indicating lower communication. This shows that the communication in Gerbil is highly unbalanced: a few of the cores carry heavy data traffic while the rest have relatively negligible traffic. These heavily communicating cores, e.g., C1 in Fig. 3(a), can become traffic hotspots during execution, which affects performance. Without a suitable NoC backbone, this can result in higher latency that in turn increases execution time. It is well known that 2D NoCs (forced by the single layer of logic in conventional PIM) are not scalable and not suited to handle long-range communication. Therefore, an efficient NoC is crucial for high-performance k-mer counting.

3.2 PIM-Counter: PIM-Friendly k-mer Counter
In this section we present PIM-Counter, a PIM-friendly multi-threaded algorithm designed to overcome the I/O bottleneck of Gerbil, exploit the PIM-based architecture, and aid in the NoC design. Fig. 4 shows the workflow of PIM-Counter. As discussed earlier, Gerbil relies on secondary-memory usage, which results in inefficient CPU utilization (Fig. 2) and is not suited for PIM-based architectures. In contrast, the proposed PIM-Counter (Fig. 4) uses an on-chip memory-friendly approach to utilize the benefits of PIM.

Fig. 4: Illustration of workflow in proposed k-mer counting methodology: PIM-Counter (F: Input files, B: Buckets, C: CPU)
As illustrated in Fig. 4, PIM-Counter has three main steps:

Step-1: Input loading: Instead of reading the input files in batches using multiple I/O passes as in Gerbil (Fig. 1), PIM-Counter performs a single I/O pass. The inputs are then loaded uniformly across the PIM cubes. Here, a cube (Fig. 5, which shows the overall PIM architecture) is analogous to an HMC vault [13] in that it consists of both logic and memory. However, unlike conventional HMC vaults, we also consider PEs, e.g., CPUs, as part of the logic layer (i.e., 3D PIM). We discuss the hardware architecture in more detail in the next section.

Fig. 2: Gerbil: (a) Instruction types, and (b) CPU utilization and runtime (normalized) for varying numbers of intermediate files

Fig. 3: CPU-to-CPU communication profile for Gerbil in the form of a heat map for input datasets (a) E. coli, (b) Prochlorococcus sp., and (c) Vibrio cholerae (Ci: CPU i; red boxes highlight a few of the patches of heavy communication)

Step-2: Bucketing: Once the strings are loaded into memory, uniformly across the partitions ('cubes' in Fig. 5), the local thread(s) in the corresponding cubes generate all k-mers from each string by sliding a window of length k. To overcome the challenge imposed by the use of a large value of k, we use the concept of minimizers, which was originally introduced in the context of building de Bruijn graphs [15]. The idea is to hash each k-mer using its least (or, equivalently, most) frequent m-mer, where m < k (e.g., m = 7, k = 32), and migrate that k-mer to the minimizing m-mer's bucket. Here, the term bucket refers to the collection of all k-mers that share the same minimizer and is analogous to the 'intermediate files' used in Gerbil. However, unlike in Gerbil, these buckets reside in the on-chip memory. Each cube is responsible for a different, non-overlapping set of buckets. The mapping of a bucket to a cube id is achieved using a hash function in linear congruential form (e.g., ((A·x + B) mod P), where A, B and P are constants), which distributes all possible buckets across the different cubes. As a result, the bucket responsible for a k-mer could reside either on the local cube (the same cube as the computing PE) or on a remote cube (any cube other than the local cube). For example, in Fig. 5, Cube-16 is the local cube of CPU-16, while Cube-1 is a remote cube for CPU-16. Memory in the local cube can be accessed by the cores using vertical interconnects only. A remote cube, however, must be reached via one or more planar links. Accessing remote cubes is costlier, as data must traverse a longer physical distance, which can result in higher execution time. The NoC should support this data movement.

In PIM-Counter, the data movement (traffic pattern) between PEs depends on the hash function, which defines the mapping of k-mer buckets to cube ids. Therefore, it is important to choose suitable values for A, B and P (and hence the hash function) such that the resulting traffic is balanced. Overall, our aim here is to choose a suitable mapping that distributes the traffic evenly among the PEs to avoid hotspots in the NoC during execution.
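The bucketing step above can be sketched in a few lines. This is an illustrative reconstruction rather than the authors' code: for concreteness we use the lexicographically smallest m-mer as the minimizer (the paper buckets on the least-frequent m-mer), and the constants A, B, P, the deterministic m-mer encoding, and all function names are example choices.

```python
from collections import defaultdict

def minimizer(kmer, m):
    # Smallest m-mer of the k-mer (illustrative minimizer ordering).
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def cube_for_bucket(mmer, A, B, P, num_cubes):
    # Linear congruential mapping ((A*x + B) mod P) of a bucket to a
    # cube id, folded onto the available cubes.
    x = int.from_bytes(mmer.encode(), "big")  # deterministic encoding
    return ((A * x + B) % P) % num_cubes

def bucketize(reads, k, m, A=6364136223846793005,
              B=1442695040888963407, P=(1 << 61) - 1, num_cubes=16):
    # buckets[cube][minimizer] -> list of k-mers routed to that cube.
    buckets = [defaultdict(list) for _ in range(num_cubes)]
    for read in reads:
        for i in range(len(read) - k + 1):  # slide a window of length k
            kmer = read[i:i + k]
            mm = minimizer(kmer, m)
            buckets[cube_for_bucket(mm, A, B, P, num_cubes)][mm].append(kmer)
    return buckets
```

Because the cube id depends only on the minimizer, all occurrences of a k-mer land in the same cube, so the subsequent counting step can reduce each bucket purely locally.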
We use full-system Gem5 simulations to determine the hash function that yields the best traffic distribution (shown in the experimental results).

Step-3: Counting: In the final step, the thread(s) local to each cube aggregate the counts for each distinct k-mer represented in its local buckets; this is achieved using a parallel reduction. Here, PIM-Counter fully exploits the locality benefits of PIM, as data is already available on each thread's corresponding local cube (due to the previous bucketing phase) and can be accessed using just the vertical links. Due to the physical proximity of memory in PIM, CPU stalls are greatly reduced, as data can be fetched relatively faster than in conventional architectures (where data is fetched from physically distant or off-chip memory).

Overall, PIM-Counter presents a PIM-friendly k-mer counting alternative that can outperform other counting tools, as it benefits from the high-bandwidth, low-latency and low-energy memory access facilitated by PIM. It also enables efficient communication between PEs by reducing traffic hotspots.

Fig. 5: Proposed PIM-based architecture with multiple logic and memory layers enabled via M3D integration

4 NoC-ENABLED 3D-PIM DESIGN
In this section, we introduce the features of the proposed 3D-PIM enabled by M3D integration, followed by the NoC design that supports the communication generated by k-mer counting.

4.1 PIM Architecture for k-mer Counting
PIM allows high-bandwidth, low-latency and low-energy memory access by moving computation closer to memory [10]. The faster memory access enabled by PIM is crucial for k-mer counting, as a large fraction of time (~85%) is spent fetching/storing data to/from memory in Gerbil (Fig. 2). However, temperature presents an important limitation in conventional PIM architectures. DRAM retention capability is lowered beyond 85°C.
After the temperature exceeds this threshold, the refresh rate must be doubled for every 10°C increase in memory temperature. Higher refresh rates consume more power and result in lower memory performance [5]. Also, traditional power management techniques are often not tailored for memory. Therefore, placing memory directly on top of (or near) the PEs in PIM, without addressing thermal issues, can be detrimental to performance.

In [6], the authors found that 2.5D PIM architectures are prone to lateral heat flow from PEs even when placed 10 mm away from the HMC. Placing memory farther away to reduce temperature also defeats the main purpose of PIM, which is to bring computation closer to memory. 3D PIM architectures, where PEs are in the same vertical stack as memory, are even more sensitive. Therefore, conventional PIM architectures (both 2.5D and 3D) typically use either (a) PEs with simpler architectures (as complex cores, e.g., Out-of-Order (OoO) CPUs, tend to consume more power [5]), (b) fewer cores, e.g., [12], or (c) minimal computing power [6] (or all of the above) to remain within the temperature threshold. Due to these restrictions, conventional PIM architectures have lower computation capability, which affects performance, and are not scalable with increasing system size. Moreover, PIM architectures are restricted to a single logic layer and multiple memory layers, as logic (PEs) dissipates more heat than memory [5]. It is well known that 2D logic provides limited floor-planning choices and requires more die area than an equivalent 3D counterpart. However, multiple logic layers stacked vertically in 3D ICs are prone to higher temperatures, as PEs farther away from the sink cannot dissipate heat easily [16]. As PEs consume more power than memory, the use of multiple layers of logic in PIMs is typically avoided.
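The refresh-rate penalty described earlier in this subsection can be captured by a simple model. This is only an illustrative sketch: the exact doubling schedule (per started 10°C step above the 85°C threshold) is our assumption, and the function name is ours.

```python
import math

def refresh_rate_multiplier(temp_c, threshold_c=85.0, step_c=10.0):
    # Illustrative DRAM refresh model: at or below the 85 C retention
    # threshold the nominal refresh rate suffices; above it, the rate
    # doubles for every (started) 10 C increase in memory temperature.
    if temp_c <= threshold_c:
        return 1
    return 2 ** math.ceil((temp_c - threshold_c) / step_c)
```

Under this model a PIM stack at 105°C would need 4x the nominal refresh rate, consuming more power and lowering memory performance, which is why the proposed design aims to stay within 85°C.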
As a result, only a few cores can be integrated given a fixed area constraint.

Overall, our objective for a "suitable PIM architecture" is one that should: (a) allow a larger volume of computation (logic) to be integrated without incurring extra area and thermal overheads; and (b) enable efficient data exchange between cores and memory. Taking advantage of the benefits of 3D ICs, in this work we propose a PIM architecture that incorporates multiple logic layers in conventional PIM for high performance. Fig. 5 shows the proposed architecture with multiple logic layers (similar to 3D ICs [16]) and multiple memory layers. Each logic layer consists of multiple PEs, while the memory layers consist of conventional DRAM. The cores are connected using a Network-on-Chip (NoC) to support efficient on-chip communication; we discuss the NoC design in the next sub-section. The use of multiple logic layers enables a greater number of cores to be integrated compared to traditional PIM (single logic layer) under an "iso-area" setting. All the layers are virtually (not physically) divided into several equal cubes. Each cube consists of an equal amount of resources, i.e., one core per logic layer (placed vertically on top of each other) and the portion of memory directly above it. For example, in Fig. 5 (assuming two logic layers and following a similar numbering convention for CPUs), Cube-16 consists of CPU-16, CPU-32 and the part of memory directly above them.

Conventional TSV-based 3D architectures are susceptible to higher temperatures and hence cannot be used to design the proposed architecture (Fig. 5) [11]. Consecutive layers in TSV-based designs are physically attached using a bonding material, e.g., Benzocyclobutene (BCB), that exhibits poor thermal conductivity. This impedes the seamless flow of heat across the layers, resulting in a considerable increase in temperature in the layers away from the heat sink. Moreover, the relatively thick silicon substrate (several micrometers) in TSV-based designs causes heat to spread laterally within the substrate instead of vertically towards the sink. This results in higher on-chip temperatures, which is undesirable in PIM architectures.

On the other hand, emerging M3D integration allows faster dissipation of heat than its TSV-based counterparts [16]. The absence of a bonding material and the relatively smaller dimensions (nanometers as opposed to micrometers) lead to superior thermal characteristics compared to TSV-based designs.
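The cube numbering convention above (one core per logic layer, stacked vertically) can be sketched as follows; the helper names are ours, and the defaults of 16 cores per layer and two logic layers simply mirror the Fig. 5 example.

```python
def cube_members(cube_id, cores_per_layer=16, logic_layers=2):
    # CPUs stacked vertically in one cube, following the Fig. 5
    # numbering: Cube-16 with two logic layers -> CPU-16 and CPU-32.
    return [cube_id + layer * cores_per_layer
            for layer in range(logic_layers)]

def local_cube(cpu_id, cores_per_layer=16):
    # The cube whose memory this CPU can reach using vertical links only;
    # every other cube is remote and requires one or more planar hops.
    return (cpu_id - 1) % cores_per_layer + 1
```

For instance, cube_members(16) returns [16, 32], matching the Cube-16 example in the text, and local_cube(32) maps CPU-32 back to Cube-16.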
Therefore, we argue that high-performance yet thermally viable PIM architectures with multiple logic (and memory) layers, as shown in Fig. 5, should be designed using M3D integration. Experimentally, we show that M3D-based PIM designs are superior in terms of both performance and temperature, enabling higher power budgets compared to their TSV-based counterparts. Moreover, M3D enables the design of area- and power-efficient multi-tier logic blocks [17]. The possibility of multi-tier logic blocks, e.g., NoC routers, enables the design of high-performance and energy-efficient NoCs, which is essential to support efficient k-mer counting.

4.2 NoC Design for k-mer Counting
To achieve high performance, the choice of overall NoC connectivity should be governed by the traffic pattern generated by the application under consideration. As shown in Fig. 3, k-mer counting introduces significant long-range and unbalanced traffic that must be handled by the NoC. The unbalanced traffic is addressed by choosing a suitable bucket-to-cube mapping (hash function) in PIM-Counter, as discussed in Section 3. PIM-Counter makes the traffic more uniform compared to Gerbil, reducing the chances of traffic hotspots (shown later in the experimental results section). To efficiently handle the long-range traffic pattern, a 3D small-world (SW) NoC architecture is a suitable choice. The vertical links in 3D NoCs bring cores physically closer and enable the long-range communication shortcuts necessary for designing a high-performance SW NoC [19]. Moreover, the vertical connectivity in M3D is facilitated by monolithic inter-tier vias (MIVs), which are 100x smaller and more energy efficient than conventional TSVs [18].

Fig. 6: Illustration of proposed M3D-enabled SW-NoC with multi-tier routers [17] and small-world properties [19] (the color contrast between Layer 1 and Layer 2 is for differentiation only)
Overall, we utilize the benefits of M3D to design high-performance, yet energy-efficient SW NoCs.

To design a suitable 3D SW NoC, the placement of links and routers needs to be optimized based on the application (k-mer counting in this case). By optimizing the placement of the routers and links, it is possible to effectively address the communication challenges inherent in k-mer counting (Fig. 3). We demonstrate in a later section that the designed NoC (executing PIM-Counter) outperforms Gerbil running on an equivalent platform. Next, we discuss the details of the NoC optimization.

Optimization Objective: For the NoC performance evaluation, we consider two objectives: latency and energy. We estimate network latency and energy using the analytical models proposed in [16]. For an N-core system, the average network latency is modeled as:

Lat = \sum_{i=1}^{N} \sum_{j=1}^{N} (r \cdot h_{ij} + d_{ij}) \cdot f_{ij} \Big/ \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij}    (1)

Here, f_ij represents the number of flits exchanged between core i and core j (Fig. 3), obtained from full-system k-mer counting simulations on Gem5. The parameter r represents the number of router stages, h_ij denotes the number of hops between the two cores, while d_ij incorporates the effect of the physical distance that messages must traverse based on the routing protocol.

The network energy is modeled using the following equations:

E_router = \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \sum_{k=1}^{R} r_{ijk} (E_r \cdot P_k)    (2)

E_link = \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \left( \sum_{k=1}^{L} p_{ijk} \cdot d_k \cdot E_{planar} + \sum_{k=1}^{V} q_{ijk} \cdot E_{vertical} \right)    (3)

E = E_router + E_link    (4)

Here, E_r denotes the average router logic energy per port and P_k denotes the number of ports available at router k. The total link energy is divided into two parts due to the different physical characteristics of planar and vertical links, with L planar links and V vertical links in the network. d_k represents the physical length of link k. The indicators p_ijk and q_ijk denote whether planar link k or vertical link k, respectively, is utilized to communicate between core i and core j, while r_ijk indicates whether router k lies on that route. E_planar and E_vertical denote the energy consumed per flit by planar metal wires and vertical links (TSV or MIV), respectively.
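The latency and router-energy models of Eqs. (1) and (2) can be evaluated directly from the profiled traffic matrix. The sketch below is our own illustration, assuming dense NumPy arrays for f, h, d and a per-pair router-usage indicator; the link-energy term of Eq. (3) follows the same pattern.

```python
import numpy as np

def avg_network_latency(f, h, d, r):
    # Eq. (1): traffic-weighted average of per-message latency,
    # r * h_ij (router stages over h_ij hops) plus the distance term d_ij.
    return np.sum((r * h + d) * f) / np.sum(f)

def router_energy(f, uses_router, E_r, ports):
    # Eq. (2): E_router = sum_ij f_ij * sum_k r_ijk * (E_r * P_k),
    # where uses_router[i, j, k] = 1 if router k lies on the i->j route
    # and ports[k] is the port count P_k of router k.
    per_pair = uses_router @ (E_r * ports)  # sum over k for each (i, j)
    return np.sum(f * per_pair)
```

An NoC optimizer can then score each candidate router/link placement by recomputing h, d, and the usage indicators and re-evaluating these two objectives.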
All the required power numbers were obtained using Synopsys Prime Power at the 28 nm node. The total network energy E is the sum of the router logic and link energies.

We optimize the two objectives, latency and energy, using a machine learning-based optimization technique.

Fig. 7: PIM-Counter: (a) Instruction types, and (b) Number of memory operations compared to Gerbil (normalized; ~2.5X reduction)

Fig. 8: Average CPU utilization in Gerbil and PIM-Counter

