Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories


Shuangchen Li¹, Cong Xu², Qiaosha Zou¹⁵, Jishen Zhao³, Yu Lu⁴, and Yuan Xie¹
¹University of California, Santa Barbara  ²Hewlett Packard Labs
³University of California, Santa Cruz  ⁴Qualcomm Inc.  ⁵Huawei Technologies Inc.
{shuangchenli, yuanxie}@ece.ucsb.edu

ABSTRACT

Processing-in-memory (PIM) provides high bandwidth, massive parallelism, and high energy efficiency by performing computation in main memory, thereby eliminating the overhead of data movement between CPU and memory. While most recent work has focused on PIM in DRAM with 3D die-stacking technology, we propose to leverage the unique features of emerging non-volatile memory (NVM), such as resistance-based storage and current sensing, to enable efficient PIM design in NVM. We propose Pinatubo[1], a Processing In Non-volatile memory ArchiTecture for bUlk Bitwise Operations. Instead of integrating complex logic inside the cost-sensitive memory, Pinatubo redesigns the read circuitry so that it can compute the bitwise logic of two or more memory rows very efficiently and support one-step multi-row operations. Experimental results on data-intensive graph processing and database applications show that Pinatubo achieves a 500× speedup and 28000× energy saving on bitwise operations, and a 1.12× overall speedup and 1.11× overall energy saving over a conventional processor.

1. INTRODUCTION

In the big data era, the "memory wall" is becoming the toughest challenge as we move toward exascale computing. Moving data is much more expensive than computing itself: a DRAM access consumes 200 times more energy than a floating-point operation [14]. Memory-centric processing-in-memory (PIM) and near-data computing (NDC) appear as promising approaches to address such challenges.* By designing processing units inside/near main memory, PIM/NDC dramatically reduces the overhead of data movement. It also takes advantage of the large memory-internal bandwidth, and hence massive parallelism. For example, the internal bandwidth of the Hybrid Memory Cube (HMC) [20] is 128 times larger than that of its SerDes interface [4].

Early PIM efforts [19] were unsuccessful due to practical concerns. They integrated processing units and memory on the same die; unfortunately, designing and manufacturing performance-optimized logic and density-optimized memory together is not cost-effective. For instance, complex logic designs require extra metal layers, which is undesirable for cost-sensitive memory vendors.

* This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award number DE-SC0013553, with disclaimer at http://seal.ece.ucsb.edu/doe/. It was also supported in part by NSF 1533933, 1461698, and 1500848, and a grant from Qualcomm. Zhao is supported by UCSC start-up funding.
[1] Mount Pinatubo is an active volcano that erupted in 1991. We envision our design invigorating future PIM research, similar to the rejuvenation of life after a volcanic eruption.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DAC '16, June 05-09, 2016, Austin, TX, USA
© 2016 ACM. ISBN 978-1-4503-4236-0/16/06...$15.00
DOI: http://dx.doi.org/10.1145/2897937.2898064
Recent achievements in 3D-stacked memory have revived PIM research [18] by decoupling the logic and memory circuits into different dies. For example, in the HMC stacked memory structure, an extra logic die is stacked with multiple DRAM dies using a massive number of through-silicon vias [18].

Meanwhile, emerging non-volatile memories (NVMs), i.e., phase-change memory (PCM) [10], spin-transfer torque magnetic random access memory (STT-MRAM) [24], and resistive random access memory (ReRAM) [8], provide promising features such as high density, ultra-low standby power, promising scalability, and non-volatility. They have shown great potential as candidates for next-generation main memory [15, 28, 27].

The goal of this paper is to show NVM's potential for enabling PIM architecture, whereas almost all existing efforts focus on DRAM systems and depend heavily on 3D integration. NVM's unique features, such as resistance-based storage (in contrast to charge-based storage in DRAM) and current sensing (in contrast to the voltage sensing used in DRAM), can provide inherent computing capabilities [13, 16]. Therefore, NVM can enable PIM without requiring 3D integration. In addition, it requires only insignificant modifications to the peripheral circuitry, resulting in a cost-efficient solution. Furthermore, NVM-enabled PIM computes on in-memory analog signals, which is much more energy-efficient than designs that use digital circuits.

In this paper, we propose Pinatubo, a Processing In Non-volatile memory ArchiTecture for bUlk Bitwise Operations, including OR, AND, XOR, and INV. Pinatubo works by activating two (or more) rows simultaneously; the output of the memory is then the bitwise operation result over the open rows. The result can be sent to the I/O bus or written back to another memory row directly. The major modification to the NVM-based memory lies in the sense amplifier (SA) design. Whereas a normal memory read only requires the SA to differentiate the bitline resistance between R_high and R_low, Pinatubo adds further reference circuits to the SA so that it can distinguish {R_low/2 (logic "1,1"), R_high||R_low (logic "0,1"), R_high/2 (logic "0,0")} for 2-row AND/OR operations. It also potentially supports multi-row OR operations when memory cells with a high ON/OFF ratio are provided. Although we use 1T1R PCM as an example in this paper, Pinatubo does not rely on a particular NVM technology or cell structure, as long as the technology is based on resistive cells.

Our contributions in this paper are as follows:
- We propose a low-cost processing-in-NVM architecture with insignificant circuit modification and no requirement for 3D integration.
- We design a software/hardware interface that is visible to both the programmer and the hardware.
- We evaluate our proposed architecture on data-intensive graph processing and database applications, and compare our work with a SIMD processor, an accelerator-in-memory PIM, and the state-of-the-art in-DRAM computing approach.

2. NVM BACKGROUND

Although their working mechanisms and features vary, PCM, STT-MRAM, and ReRAM share a common basis: all of them are built on resistive cells. To represent logic "0" and "1", they rely on the difference in cell resistance (R_high or R_low). To switch between logic "0" and "1", a voltage/current of certain polarity, magnitude, and duration is required. The memory cells typically adopt a 1T1R structure [10], with a wordline (WL) controlling the access transistor, a bitline (BL) for data sensing, and a source line (SL) to provide write currents of different polarity.

Architecting NVM as main memory has been well studied [28, 15]. The SA design is the major difference between NVM and DRAM. Unlike conventional charge-based DRAM, resistance-based NVM requires a larger SA to convert a resistance difference into a voltage/current signal. Therefore, multiple adjacent columns share one SA through a multiplexer (MUX), which results in a smaller row buffer.

Figure 1: A current-based SA (CSA) [8]. Panels: (a) Phase 1: current sampling; (b) Phase 2: current-ratio amplification; (c) Phase 3: 2nd-stage amplification.

Fig. 1 shows the mechanism of a state-of-the-art CSA [8]. There are three phases during sensing: current sampling, current-ratio amplification, and 2nd-stage amplification.

3. MOTIVATION AND OVERVIEW

Bitwise operations are very important and widely used by databases [26], graph processing [5], bio-informatics [21], and image processing [6]. They are applied to replace expensive arithmetic operations. Modern processors are already aware of this strong demand and have developed accelerating solutions, such as Intel's SIMD extensions SSE/AVX.

We propose Pinatubo to accelerate bitwise operations inside the NVM-based main memory. Fig. 2 shows an overview of our design.

Figure 2: Overview: (a) the computing-centric approach, moving tons of data to the CPU and writing the result back; (b) the proposed Pinatubo architecture, performing n-row bitwise operations inside the NVM in one step.

The conventional computing-centric architecture in Fig. 2(a) fetches every bit-vector from memory sequentially. The data walks through the narrow DDR bus and all the memory hierarchies, and is finally processed by the limited ALUs in the cores. Even worse, the result then needs to be written back to memory, suffering the data movement overhead again. Pinatubo, in Fig. 2(b), performs bit-vector operations inside the memory. Only commands and addresses are required on the DDR bus, while all the data stays inside the memory. To perform a bitwise operation, Pinatubo activates the two (or more) memory rows that store the bit-vectors simultaneously, and the modified SA outputs the desired result. Thanks to in-memory computation, the result no longer needs the memory bus; it is written to the destination address directly through the write driver (WD), bypassing the I/O and the bus.

Pinatubo embraces two major benefits of the PIM architecture: first, the reduction of data movement; second, the large internal bandwidth and massive parallelism. Pinatubo performs memory-row-length (typically 4Kb for NVM) bit-vector operations. Furthermore, it supports multi-row operations, which compute multi-operand operations in one step, bringing an equivalent bandwidth 1000× larger than the DDR3 bus.

4. ARCHITECTURE AND CIRCUIT DESIGN

In this section, we first show the architecture design that enables NVM main memory for PIM. We then show the circuit modifications for the SA, LWL driver, WD, and global buffers.

4.1 From Main Memory to Pinatubo

Main memory has several physical/logical hierarchies. Channels run in parallel, and each channel contains several ranks that share the address/data bus. Each rank typically has 8 physical chips, and each chip typically has 8 banks, as shown in Fig. 3(a). Banks in the same chip share the I/O, and banks in different chips work in a lock-step manner. Each bank has several subarrays.
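Pinatubo chooses where a bitwise operation is computed based on which part of this hierarchy the operand rows fall into (the three operation types are detailed next). The following sketch illustrates that classification; the hierarchy parameters and address layout are illustrative assumptions, not the paper's exact geometry.

```python
# Hypothetical sketch: classify a Pinatubo bitwise operation by where
# its operand rows live. Field widths below are illustrative only.

ROWS_PER_SUBARRAY = 512
SUBARRAYS_PER_BANK = 64
BANKS_PER_CHIP = 8

def locate(row_addr):
    """Decompose a flat row address into (bank, subarray) coordinates."""
    subarray_global = row_addr // ROWS_PER_SUBARRAY
    bank = (subarray_global // SUBARRAYS_PER_BANK) % BANKS_PER_CHIP
    subarray = subarray_global % SUBARRAYS_PER_BANK
    return bank, subarray

def classify(row_a, row_b):
    """Pick where the bitwise operation is computed (cf. Fig. 3)."""
    bank_a, sub_a = locate(row_a)
    bank_b, sub_b = locate(row_b)
    if (bank_a, sub_a) == (bank_b, sub_b):
        return "intra-subarray"   # computed by the modified SA
    if bank_a == bank_b:
        return "inter-subarray"   # add-on logic at the global row buffer
    return "inter-bank"           # add-on logic at the I/O buffer

assert classify(0, 100) == "intra-subarray"
assert classify(0, ROWS_PER_SUBARRAY) == "inter-subarray"
assert classify(0, ROWS_PER_SUBARRAY * SUBARRAYS_PER_BANK) == "inter-bank"
```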

As Fig. 3(b) shows, subarrays share the GDLs and the global row buffer. One subarray contains several MATs, as shown in Fig. 3(c), which also work in lock-step. Each MAT has its private SAs and WDs. Since an NVM SA is much larger than a DRAM SA, several (32 in our experiment) adjacent columns share one SA through a MUX.

Figure 3: The Pinatubo architecture: (a) chip (inter-bank op.); (b) bank (inter-subarray op.); (c) MAT (intra-subarray op.). Glossary: Global WordLine (GWL), Global DataLine (GDL), Local WordLine (LWL), SelectLine (SL), BitLine (BL), Column SelectLine (CSL), Sense Amplifier (SA), Write Driver (WD).

According to the physical addresses of the operand rows, Pinatubo performs three types of bitwise operations: intra-subarray, inter-subarray, and inter-bank.

Intra-subarray operations. If the operand rows are all within one subarray, Pinatubo performs intra-subarray operations in each MAT of this subarray. As shown in Fig. 3(c), the computation is done by the modified SA. Multiple rows are activated simultaneously, and the output of the modified SA is the operation result. The operation commands (e.g., AND or OR) are sent by the controller and change the reference circuit of the SA. The LWL driver is also modified to support multi-row activation. If the operation result needs to be written back to the same subarray, it is fed directly into the WDs locally as an in-place update.

Inter-subarray operations. If the operand rows are in different subarrays but in the same bank, Pinatubo performs inter-subarray operations, as shown in Fig. 3(b). These are based on digital circuits added to the global row buffer. The first operand row is read into the global row buffer, while the second operand row is read onto the GDL. The two operands are then combined by the add-on logic, and the final result is latched in the global row buffer.

Inter-bank operations. If the operand rows are in different banks but still in the same chip, Pinatubo performs inter-bank operations, as shown in Fig. 3(a). They are done by the add-on logic in the I/O buffer, with a mechanism similar to inter-subarray operations.

Note that Pinatubo does not handle operations between bit-vectors that are either in the same row or in different chips. Such operations can be avoided by optimized memory mapping, as shown in Section 5.

4.2 Peripheral Circuitry Modification

SA modification: The key idea of Pinatubo is to use the SA for intra-subarray bitwise operations. Unlike charge-based DRAM/SRAM, the SA for NVM senses the resistance on the BL. Fig. 5 shows the BL resistance distribution during read and OR operations, as well as the reference value assignment. Fig. 5(a) shows the sensing mechanism for a normal read (though the SA actually senses current, the figure shows resistance distributions for simplicity). The resistance of a single cell (either R_low or R_high) is compared with the reference value (R_ref-read), determining whether the result is "0" or "1". For bitwise operations, an example 2-row OR operation is shown in Fig. 5(b). Since two rows are activated simultaneously, the resistance on the BL is the parallel connection of two cells. There are three cases: R_low||R_low (logic "1","1"), R_low||R_high ("1","0"), and R_high||R_high ("0","0")[2]. To perform an OR operation, the SA should output "1" for the first two cases and "0" for the last. To achieve this, we simply shift the reference value to the middle of R_low||R_high and R_high||R_high, denoted R_ref-or. Note that we assume variation is well controlled, so that no overlap exists between the "1" and "0" regions. In summary, to compute AND and OR, we only need to change the reference value of the SA.

Figure 5: Modifying reference values in the SA to enable Pinatubo: (a) the SA reads with R_ref-read; (b) the SA processes OR with R_ref-or.

Fig. 6(a) shows the corresponding circuit modification, based on the CSA [8] introduced in Section 2. As explained above, we add two more reference circuits to support AND/OR operations. For XOR, we need two micro-steps: first, one operand is read into the capacitor Ch; second, the other operand is read into the latch. The output of the two add-on transistors is the XOR result. For INV, we simply output the differential value from the latch. The output is selected among the READ, AND, OR, XOR, and INV results by a MUX. Fig. 6(b) shows the HSPICE validation of the proposed circuit. The circuit is tested over a wide range of cell resistances from recent PCM, STT-MRAM, and ReRAM prototypes [23].

Figure 6: Current sense amplifier (CSA) modification (left) and HSPICE validation (right).

Multi-row operations: Pinatubo supports multi-row operations that further accelerate bitwise computation. A multi-row operation computes the result of multiple operands in a single operation.

[2] "||" denotes the production-over-sum (parallel resistance) operation.
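The reference-shift scheme can be sanity-checked numerically. The sketch below assumes illustrative cell resistances (R_low = 10 kΩ, R_high = 1 MΩ; real values vary by technology), and also checks the n-row OR sensing margin analyzed for multi-row operations in Section 4.2.

```python
# Numeric sanity check of the reference-shift sensing scheme.
# Cell resistances are illustrative assumptions, not values from the paper.

R_LOW, R_HIGH = 10e3, 1e6  # logic "1" and logic "0" cell resistance

def parallel(*rs):
    """Parallel ("production over sum") resistance of the open cells."""
    return 1.0 / sum(1.0 / r for r in rs)

# 2-row activation: the bitline sees two cells in parallel.
r11 = parallel(R_LOW, R_LOW)    # ("1","1") = R_low/2
r10 = parallel(R_LOW, R_HIGH)   # ("1","0") = R_low || R_high
r00 = parallel(R_HIGH, R_HIGH)  # ("0","0") = R_high/2

# OR outputs "0" only for ("0","0"): reference between r10 and r00.
R_REF_OR = (r10 + r00) / 2
# AND outputs "1" only for ("1","1"): reference between r11 and r10.
R_REF_AND = (r11 + r10) / 2

def sense(r_bl, r_ref):
    """Lower resistance -> larger sensed current -> logic '1'."""
    return int(r_bl < r_ref)

assert [sense(r, R_REF_OR) for r in (r11, r10, r00)] == [1, 1, 0]
assert [sense(r, R_REF_AND) for r in (r11, r10, r00)] == [1, 0, 0]

# n-row OR: the worst-case "1" bitline (one R_low, n-1 R_high in parallel)
# must be told apart from the all-"0" bitline (n R_high in parallel). With
# r = R_high/R_low, the relative margin simplifies to (r - 1)/(r + n - 1):
# a high ON/OFF ratio sustains large n, a low ratio does not.
def or_margin(n, on_off_ratio):
    return (on_off_ratio - 1.0) / (on_off_ratio + n - 1.0)

assert or_margin(128, 1000.0) > 0.85  # PCM-like ratio: healthy at 128 rows
assert or_margin(2, 2.0) < 0.34       # STT-MRAM-like ratio: slim at 2 rows
```

Note how small the AND margin already is at two rows (r11 vs. r10 above), which is consistent with the paper's choice to restrict multi-row operations to OR.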

For PCM and ReRAM, which encode R_high as logic "0", Pinatubo can calculate n-row OR operations[3]. After activating n rows simultaneously, Pinatubo needs to distinguish the bit combination with only one "1" (which yields "1") from the all-"0" combination (which yields "0"). This leads to a reference value between R_low||(R_high/(n-1)) and R_high/n. This sensing margin is similar to that of TCAM designs [17]. State-of-the-art PCM-based TCAM supports a 64-bit WL with two cells per bit; we therefore assume 128-row operations at maximum for PCM. For STT-MRAM, since the ON/OFF ratio is already low, we conservatively assume 2-row operations at maximum.

LWL driver modification: Conventional memory activates one row at a time, whereas Pinatubo requires multi-row activation in which each activation is a random access. The modifications to the LWL driver circuit and their HSPICE validation are shown in Fig. 7. Normally, the LWL driver amplifies the decoded address signal with a group of inverters. We modify each LWL driver by adding two more transistors. The first transistor feeds the signal between the inverters back and serves as a latch. The second transistor forces the driver's input to ground. During multi-row activation, a RESET signal is sent out first, ensuring that no WL has latched anything. Then, every time an address is decoded, the selected WL signal is latched and held at VDD until the next RESET signal arrives. Therefore, after all the addresses have been issued, all the corresponding WLs are driven to the high voltage.

Figure 7: Local wordline (LWL) driver modification (left) and HSPICE validation (right).

WD modification: Fig. 8(a) shows the modification to a WD for STT-MRAM/ReRAM. We do not show PCM's WD, since it is simpler, with a unidirectional write current. The write current/voltage is set on the BL or SL according to the write input data. Normally, the WD's input comes from the data bus. We modify the WD circuit so that the SA result can be fed directly to the WD. This bypasses the bus overhead when writing results back to the memory.

Global buffer modification: To support inter-subarray and inter-bank operations, we add digital logic to the row buffers and I/O buffers. The logic circuit's inputs are the data from the data bus and from the buffer. The output is selected by the control signals and then latched in the buffer, as shown in Fig. 8(b).

Figure 8: (a) Modifications to the write driver (WD). (b) Modifications for inter-subarray/bank operations.

[3] Multi-row AND in PCM/ReRAM is not supported, since it is unlikely that R_low/(n-1)||R_high and R_low/n can be differentiated when n > 2.

5. SYSTEM SUPPORT

Fig. 4 shows an overview of Pinatubo's system design. The software support consists of the programming model and run-time support. The programming model provides two functions for programmers: bit-vector allocation (pim_malloc(...)) and bitwise operations (pim_op(dst, src1, src2, data_t, op_t, len)). The run-time support includes modifications to the C/C++ run-time library and the OS, as well as a dynamically linked driver library. The C/C++ run-time library is modified to provide a PIM-aware data allocation function, which ensures that different bit-vectors are allocated to different memory rows, since Pinatubo can only process inter-row operations. The OS provides PIM-aware memory management that maximizes the opportunity for intra-subarray operations. The OS also exposes the bit-vector mapping information and physical addresses (PAs) to the PIM run-time driver library via syscall. Based on the PAs, the dynamically linked driver library first optimizes and reschedules the operation requests, and then issues extended ISA instructions for PIM [3]. The hardware control utilizes the DDR mode register (MR) and commands: the extended instructions are translated into DDR commands and issued over the DDR bus to the main memory, and the MR in the main memory is set to configure the PIM operations.

Figure 4: Pinatubo system support.

6. EXPERIMENT

In this section, we compare Pinatubo with state-of-the-art solutions and present the performance and energy results.

6.1 Experiment Setup

The three counterparts we compare against are described below:

SIMD is a 4-core, 4-issue out-of-order x86 Haswell processor running at 3.3GHz. It contains a 128-bit SIMD unit with SSE/AVX for bitwise operation acceleration. The cache hierarchy consists of 32KB L1, 256KB L2, and 6MB L3 caches.

S-DRAM is the in-DRAM computation solution that accelerates bitwise operations [22]. Operations are executed by charge sharing in DRAM. Due to DRAM's destructive reads, this solution requires copying data before computing. Only 2-row AND and OR are supported.

AC-PIM is an accelerator-in-memory solution, in which even the intra-subarray operations are implemented with digital logic gates, as shown in Fig. 8(b).

S-DRAM works with a 65nm 4-channel DDR3-1600 DRAM. AC-PIM and Pinatubo work on a 1T1R-PCM-based main memory whose tRCD-tCL-tWR is 18.3-8.9-151.1ns [9]. SIMD works with DRAM when compared with S-DRAM, and with PCM when compared with AC-PIM and Pinatubo. Note that the experiment takes 1T1R PCM as a case study, but Pinatubo can also work with other technologies and cell structures.

The parameters for S-DRAM are scaled from existing work [22]. The parameters for AC-PIM are collected from a synthesis tool at 65nm. As for Pinatubo's parameters, the analog/mixed-signal part (SA, WD, and LWL) is extracted from HSPICE simulation, and the digital part (controllers and logic for inter-subarray/bank operations) is extracted from the synthesis tool. Based on these low-level parameters, we heavily modify NVSim [11] for the NVM circuit modeling and CACTI-3DD [9] for the main memory modeling, in order to obtain the high-level parameters. We also modify the PIN-based simulator Sniper [7] for the SIMD processor and the NVM-based memory system, and develop an in-house simulator to evaluate AC-PIM, S-DRAM, and Pinatubo.

The evaluation benchmarks and data sets are shown in Table 1; Vector uses only OR operations, while Graph and Database use all of AND, OR, XOR, and INV.

Table 1: Benchmarks and data sets.
Vector: pure bit-vector OR operations. Data set: e.g., 19-16-1(s/r) means 2^19-length vectors, 2^16 vectors, 2^1-row OR operations (sequential/random access).
Graph: bitmap-based BFS for graph processing [5]. Data sets: dblp-2010, eswiki-2013, amazon-2008 [1].
Database: bitmap-based database (FastBit [26]) application. Data sets: 240/480/720 queries on STAR [2].

6.2 Performance and Energy

Fig. 9 shows Pinatubo's OR operation throughput. We make four observations. First, throughput increases with longer bit-vectors, because they make better use of the memory's internal bandwidth and parallelism. Second, we observe two turning points, A and B, after which the throughput improvement slows down. Turning point A is caused by the shared SA in NVM: bit-vectors longer than 2^14 have to be mapped to columns that share an SA, and each part is processed serially. Turning point B is caused by the row-length limit: bit-vectors longer than 2^19 have to be mapped to multiple ranks that work serially. Third, Pinatubo is capable of multi-row operations (shown in the legend): for n-row OR operations, larger n provides larger bandwidth. Fourth, the y-axis divides into three regions: below the DDR bus bandwidth, which only contains results for short bit-vectors; within the memory-internal bandwidth, which contains the majority of the results; and beyond the internal bandwidth, reached thanks to multi-row operations. DRAM systems can never reach the beyond-internal-bandwidth region.

Figure 9: Pinatubo's throughput (GBps) for OR operations (legend: 2-row to 128-row OR; turning points A and B).

We compare Pinatubo with both 2-row (Pinatubo-2) and 128-row (Pinatubo-128) operations against two aggressive baselines in Fig. 10, which shows the speedup on bitwise operations. We make three observations. First, S-DRAM outperforms Pinatubo-2 in some cases with very long bit-vectors, because DRAM-based solutions benefit from larger row buffers than the NVM-based solution. However, the advantage of NVM's multi-row operations still dominates: Pinatubo-128 is 22× faster than S-DRAM. Second, the AC-PIM solution is much slower than Pinatubo in every single case. Third, multi-row operations show their superiority especially when intra-subarray operations dominate. An opposite example is 14-16-7r, where all operations are random accesses dominated by inter-subarray/bank operations, so Pinatubo-128 is as slow as Pinatubo-2.

Figure 10: Speedup normalized to the SIMD baseline.

Fig. 11 shows the energy saving results. The observations are similar to those for speedup: S-DRAM is better than Pinatubo-2 in some cases but worse than Pinatubo-128 on average. AC-PIM never has a chance to save more energy than any of the other three solutions, since both S-DRAM and Pinatubo rely on highly energy-efficient analog computing. On average, Pinatubo saves 2800× energy on bitwise operations compared with the SIMD processor.

Figure 11: Energy saving normalized to SIMD.

Fig. 12 shows the overall speedup and energy saving of Pinatubo on the two real-world applications. The "ideal" legend represents the result with zero latency and energy spent on bitwise operations. We make three observations. First, Pinatubo almost achieves the ideal acceleration. Second, limited by the proportion of bitwise operations, Pinatubo improves graph processing applications by 1.15× with 1.14× energy saving; however, the result is data dependent. For the eswiki and amazon data sets, since the connectivity is "loose", most of the time is spent searching for an unvisited bit-vector; for dblp, the speedup is 1.37×. Third, for the database application, Pinatubo achieves 1.29× overall speedup and energy saving.

Figure 12: Overall speedup and energy saving normalized to the SIMD baseline.

6.3 Overhead Evaluation

Fig. 13 shows the area overhead results. As shown in Fig. 13(a), Pinatubo incurs an insignificant area overhead of only 0.9%. In contrast, AC-PIM has a 6.4% area overhead, which is critical for the cost-sensitive memory industry. S-DRAM reports a 0.5% capacity loss, but that is a DRAM-only result and orthogonal to Pinatubo's overhead evaluation. Fig. 13(b) shows the area overhead breakdown. The majority of the area overhead is taken by inter-subarray/bank operations; among the intra-subarray operations, XOR takes most of the area.

Figure 13: Area overhead comparison (left) and breakdown (right).

7. RELATED WORK

Pinatubo distinguishes itself from PIM work, in-DRAM computing work, and logic-in-memory work. First, unlike other PIM work, Pinatubo needs no 3D integration [18] and does not suffer from the logic/memory coupling problem [19]; it benefits from NVM's resistive-cell feature and provides cost- and energy-efficient PIM. Although ProPRAM [25] leverages NVM for PIM, it uses NVM's lifetime-enhancement peripheral circuits for computing rather than the character of NVM cells themselves; moreover, bitwise AND/OR is not supported in ProPRAM, and it computes with digital circuits, while Pinatubo takes advantage of highly energy-efficient analog computing. Second, in-DRAM bulk bitwise operations based on charge sharing have been proposed [22]; however, they suffer from destructive reads, so operands must be copied before computing, incurring unnecessary overhead, and at most 2-row operations are supported. Third, other work uses NVM for logic-in-memory functionality, such as associative memory [17, 12], and recent studies use ReRAM crossbar arrays to implement IMPLY-based logic operations [16, 13]. However, none of these necessarily fit the PIM concept: they use memory technology to implement the processing unit, but the processing unit does not also serve as main memory.

