Bounding Worst-Case DRAM Performance On Multicore Processors

Regular Paper
Journal of Computing Science and Engineering, Vol. 7, No. 1, March 2013, pp. 53-66

Bounding Worst-Case DRAM Performance on Multicore Processors

Yiqiang Ding, Lan Wu, and Wei Zhang*
Department of Electrical and Computer Engineering, Virginia Commonwealth University, Richmond, VA, USA
dingy4@vcu.edu, wul3@vcu.edu, wzhang4@vcu.edu

Abstract

Bounding the worst-case DRAM performance for a real-time application is a challenging problem that is critical for computing the worst-case execution time (WCET), especially on multicore processors, where the DRAM memory is usually shared by all of the cores. Typically, DRAM commands from consecutive DRAM accesses can be pipelined on DRAM devices according to the spatial locality of the data fetched by them. By considering the effect of DRAM command pipelining, we propose a basic approach to bounding the worst-case DRAM performance. An enhanced approach is proposed to reduce the overestimation from invalid DRAM access sequences by checking the timing order of the co-running applications on a dual-core processor. Compared with the conservative approach, which assumes that no DRAM command pipelining exists, our experimental results show that the basic approach can bound the WCET more tightly, by 15.73% on average. The experimental results also indicate that the enhanced approach can further improve the tightness of the WCET by 4.23% on average as compared to the basic approach.

Categories: Embedded computing

Keywords: Performance; Reliability; Real-time scheduling; WCET; Multicore processor

I. INTRODUCTION

With the rapid development of computing technology and the diminishing returns of complex uniprocessors, multicore processors are being used more widely in the computer industry. Future high-performance real-time systems are likely to benefit from multicore processors due to the significant boost in processing capability, low power consumption, and high density.

In real-time systems, especially hard real-time systems, it is crucial to accurately obtain the worst-case execution time (WCET) of real-time tasks to ensure the correctness of schedulability analysis. Although the WCET of a real-time application can be obtained by measurement-based approaches, the results are generally unreliable because it is impossible to exhaust all the possible program paths. Alternatively, static WCET analysis [1] can be used to compute the WCET, which should be safe and as accurate as possible. The WCET of a real-time application is not only determined by its own attributes, but is also affected by the timing of architectural components, such as pipelines, caches, and branch predictors. Most prior research has focused on WCET analysis for single-threaded applications running on uniprocessors [2-6], but these methods cannot be easily applied to estimate the WCET on multicore processors with shared resources, such as a shared L2 cache and DRAM memory.
This is because the possible interference in the shared resources between different threads can significantly increase the complexity of WCET analysis.

Due to its structural simplicity and high density, DRAM is usually utilized as the main memory in current popular processors, including multicore processors.

A DRAM system consists of multiple components, such as a memory access controller, command/address buses, a data bus, and DRAM devices. The latency of an access to the DRAM varies with the status of each component when it is accessed. One recent work studied the vulnerability of current multicore processors to a new class of denial of service (DoS) attacks [7]. Under the current DRAM architecture, a thread with a particular memory access pattern can overwhelm the shared resources in the DRAM, preventing other threads from using these resources efficiently. Therefore, the latencies of the DRAM accesses from other threads can be prolonged.

There have been several studies to model and predict DRAM memory performance. Ahn et al. [8] performed a performance analysis of scientific and multimedia applications on DRAM memory with various parameters, and found that the most critical performance factors are high read-write turnaround penalties and internal DRAM bank conflicts. They then developed an accurate analytical model for the effective random-access bandwidth given the DRAM technology parameters and the burst length. Yuan and Aamodt [9] proposed a hybrid analytical model to predict DRAM access efficiency based on memory trace profiling. Bucher and Calahan [10] modeled the performance of an interleaved common memory of a multiprocessor using queuing and simulation methods. Choi et al. [11] presented an analytical model to predict DRAM performance based on DRAM timing and memory access pattern parameters. However, these prior studies have focused on predicting the average-case DRAM performance, rather than the worst case. For example, the DRAM access patterns assumed in these studies were based on typical access patterns or derived from simulated traces, which cannot be safely used to represent the worst-case DRAM access patterns needed to derive the WCET.

Research was performed recently to bound the worst-case DRAM performance on a uniprocessor by considering the impact of the row-buffer management policy [12]. However, it is more challenging to conduct WCET analysis on a multicore processor by bounding the worst-case DRAM performance, for the following reasons. First, the DRAM access pattern of a thread depends on its accesses to higher-level cache memories, such as the L2 cache. If the DRAM memory is shared by different cores, the accesses of a thread can be greatly impacted by inter-core DRAM access interference. Second, the worst-case latency of a DRAM access of a thread is determined not only by the number of simultaneous DRAM accesses from other threads, but also by the timing order of all these DRAM accesses and the spatial locality of the data fetched by them. However, the timing order of simultaneous DRAM accesses from co-running threads is hard to determine through static analysis, because all the threads run independently on different cores.

To overcome these difficulties, this paper first investigates the timing characteristics of DRAM accesses with a focus on DRAM devices. Our study shows that the DRAM commands from multiple consecutive DRAM accesses can be pipelined on DRAM devices, and that the degree of DRAM command pipelining varies according to the spatial locality of the data accessed, which may impact the worst-case latency of each access.
A basic approach is then proposed to estimate the worst-case situation of DRAM command pipelining, which leads to the worst-case latency for a DRAM access among a sequence of consecutive DRAM accesses. An enhanced approach is proposed to reduce the overestimation from invalid DRAM access sequences by checking the timing order constraints of concurrent applications. In addition, we utilize the extended integer linear programming (ILP) approach [4] to model the constraints between the accesses to the higher-level cache memory and the DRAM accesses. The worst-case DRAM performance is integrated into the objective function of the extended ILP approach to bound the WCET of a real-time task running on a multicore processor.

The rest of the paper is organized as follows. First, the multicore architecture studied in this work is described in Section II. Next, the background of the DRAM system is introduced in Section III. Section IV presents the timing characteristics of DRAM accesses, with a focus on DRAM devices. Then, we introduce two approaches to bound the worst-case DRAM performance in Section V. Section VI introduces the evaluation methodology, and Section VII gives the experimental results. Finally, conclusions are presented in Section VIII.

II. SYSTEM ARCHITECTURE

Fig. 1 shows the system architecture of the multicore processor with N cores (N > 1) studied in this paper.

Fig. 1. Target system architecture.

Each core is symmetric, with its own processing unit, pipeline, L1 instruction and data caches, and private L2 cache, which is not uncommon in commercial multicore designs. The DRAM is shared by all cores through a shared bus. In order to focus on bounding the worst-case DRAM performance, the interactions between the DRAM and the hard disk are ignored in our study. It is assumed that all the code and data of a thread are loaded into the DRAM beforehand, such that no page fault occurs during subsequent execution.

III. DRAM MEMORY SYSTEM

Generally, a DRAM memory system comprises three major components, as shown in Fig. 2. The DRAM devices store the actual data; the memory controller is responsible for the communication between the DRAM devices and the processor; and the buses connect the DRAM devices and the memory controller to transfer addresses, commands, and data.

Fig. 2. DRAM architecture.

DRAM device: Multiple levels of storage entities are organized hierarchically in a DRAM device, such that DRAM accesses can be served in parallel at a certain level according to the spatial locality of the data being accessed. The memory array is the fundamental storage entity in a DRAM device. A bank is a set of independent memory arrays, and has a two-dimensional structure with multiple rows and columns. A bank also has a row buffer, and data can only be read from this buffer. A rank consists of a set of banks sharing the same I/O gating, and operates in lockstep to a given DRAM command. A channel is defined as a set of ranks that share the data bus. For example, multiple DRAM accesses to different ranks in the same channel can be executed in parallel, except when the data are transferred on the shared data bus.

Memory controller: The memory controller manages the flow of data in and out of the DRAM devices connected to it. The row-buffer management policy, the address mapping scheme, and the memory transaction and DRAM command ordering scheme are three important design considerations and implementations for the memory controller.

There are two types of row-buffer management policies: the open-page policy and the close-page policy. The open-page policy is designed to favor memory accesses to the same row of memory by keeping the row buffer open and holding a row of data for ready access. In contrast, the close-page policy is designed to favor accesses to random locations in the DRAM, and optimally supports DRAM access patterns with low degrees of spatial locality. In a multicore processor, the intermixing of DRAM access sequences from multiple threads reduces the spatial locality of the overall access sequence, so the close-page policy can achieve better performance [13] without any optimization in the memory controller [14]. The DRAM access transactions and DRAM commands are queued in the memory controller, and this queuing delay also affects the performance of the DRAM. DRAM commands can be scheduled by various scheduling algorithms [15, 16] based on different factors, such as the availability of resources in the DRAM devices. In our study, the memory controller is assumed to have no optimization, and the close-page policy and the first-come-first-serve (FCFS) scheduling algorithm are used.
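To make the address mapping scheme concrete, the sketch below decodes a physical address into channel, rank, bank, row, and column indices. The bit-field widths, their ordering, and the example addresses are hypothetical; the paper does not specify a particular mapping, and a real memory controller may interleave the fields differently.

# Hypothetical address-mapping sketch: decode a physical address into
# DRAM coordinates. Field widths and ordering are illustrative only.
from dataclasses import dataclass

@dataclass
class DramAddress:
    channel: int
    rank: int
    bank: int
    row: int
    column: int

def decode_address(phys_addr: int,
                   column_bits: int = 10,
                   bank_bits: int = 2,
                   rank_bits: int = 2,
                   channel_bits: int = 0) -> DramAddress:
    """Split a physical address into (channel, rank, bank, row, column).

    The order column -> bank -> rank -> channel -> row is one common
    choice; it is an assumption, not the mapping used in the paper.
    """
    addr = phys_addr
    column = addr & ((1 << column_bits) - 1); addr >>= column_bits
    bank = addr & ((1 << bank_bits) - 1);     addr >>= bank_bits
    rank = addr & ((1 << rank_bits) - 1);     addr >>= rank_bits
    channel = addr & ((1 << channel_bits) - 1) if channel_bits else 0
    addr >>= channel_bits
    row = addr
    return DramAddress(channel, rank, bank, row, column)

# Example: two addresses that differ only in their high-order (row) bits
# map to the same bank of the same rank and would conflict on that bank.
a = decode_address(0x0012_3400)
b = decode_address(0x0056_3400)
assert (a.rank, a.bank) == (b.rank, b.bank)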
IV. TIMING OF ACCESSING DRAM MEMORY SYSTEMS

In this section, we study the timing characteristics of DRAM accesses, both for an individual DRAM access and for multiple consecutive DRAM accesses. Also, the worst-case latency for a DRAM access among a sequence of consecutive DRAM accesses is derived.

Generally, the timing of a DRAM access consists of three parts: the latency through the bus between the processor and the memory controller, the queuing delay in the memory controller, and the latency of accessing the DRAM device. In this paper, as we focus on estimating the latency of accessing DRAM devices, the worst-case bus latency and the worst-case queuing delay in the memory controller are safely estimated as constants by a conservative approach, assuming that a given DRAM access from one core has to wait for the bus transfer and the memory-controller queuing of the N-1 DRAM accesses issued simultaneously from the other cores of a multicore processor with N cores.

A. Generic DRAM Access Protocol

Typically, a DRAM access is translated into several DRAM commands that move data between the memory controller and the DRAM devices. A generic DRAM access protocol can be modeled by considering only the necessary basic DRAM commands and the related timing constraints. It is assumed that two different commands can be fully pipelined on a DRAM device only if they do not conflict on any shared resource at a given time, which we call DRAM command pipelining. The whole procedure by which the DRAM commands of a given DRAM access fulfill the data movement is illustrated in Fig. 3. The figure also shows the resources required by these commands, which cannot be shared with the commands of other DRAM accesses concurrently. In the first phase, the command is transported via the command and address buses and decoded by the DRAM device. In the second phase, the data are moved into a bank. The data are transported on the shared I/O gating circuit in the third phase. Finally, the data are transferred to the memory controller by the data bus.

Fig. 3. Command and data movement for an individual DRAM device access on a generic DRAM device.

In the generic DRAM access protocol, three generic DRAM commands are defined: row access commands, column access commands, and precharge commands. The timing parameters related to these commands are shown in Table 1. tRCD, tCAS, and tBURST are all part of tRAS, as shown in Fig. 4. The DRAM refresh command is not covered in the generic DRAM protocol, because it is not issued by any DRAM access, and it could interrupt the command pipeline periodically.

Table 1. Timing parameters defined in the generic DRAM access protocol

Parameter   Description
tBURST      Data burst duration
tCMD        Command transport duration
tCAS        Column access strobe latency
tRAS        Row access strobe latency
tRCD        Row to column command delay
tRP         Row precharge duration
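For the code sketches that follow, the Table 1 parameters can be collected into a small structure. The numeric defaults are placeholder values in memory-bus cycles, not the configuration evaluated in the paper (its Table 2 is not reproduced in this excerpt).

# Generic DRAM timing parameters from Table 1. The default values are
# placeholders for illustration; the paper's actual configuration
# (Table 2) is not reproduced here.
from dataclasses import dataclass

@dataclass
class DramTiming:
    tBURST: int = 8   # data burst duration
    tCMD: int = 2     # command transport duration
    tCAS: int = 10    # column access strobe latency
    tRAS: int = 24    # row access strobe latency
    tRCD: int = 10    # row to column command delay
    tRP: int = 10     # row precharge duration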

B. Timing of an Individual DRAM Device Access

A typical cycle of an individual DRAM device access that reads data consists of three major phases: row access, column access, and precharge. The details of the cycle are illustrated in Fig. 4. Including the time of command transfer, the latency of the whole cycle can be computed by Equation (1), whose timing parameters are illustrated in Fig. 4. As the data movement of a DRAM access is finished at the end of the column access, the latency of a read cycle without the precharge phase is described by Equation (2), whose timing parameters are also shown in Fig. 4.

Fig. 4. A cycle of a DRAM device access to read data [13].

tREAD = tCMD + tRAS + tRP    (1)

tREAD = tCMD + tRCD + tCAS + tBURST    (2)

C. Timing of Consecutive DRAM Device Accesses

Multiple consecutive DRAM accesses happen more frequently on a multicore processor than on a uniprocessor for the following reasons. First, the number of DRAM accesses issued concurrently increases with the number of cores. Second, there is no data dependency or control flow constraint between the DRAM accesses of threads running on different cores. However, the DRAM commands of consecutive DRAM accesses can rarely be fully pipelined, because these commands need to share resources in the DRAM devices concurrently. The degree of DRAM command pipelining depends on the spatial locality of the data fetched by the consecutive DRAM accesses, as well as the state of the DRAM devices, which can impact the latency of a DRAM access within a sequence of consecutive DRAM accesses.

Fig. 5 demonstrates the latencies of two consecutive DRAM device accesses in three cases with different data spatial locality between them. Both DRAM accesses are ready to be executed at the same time. The latency of the first access, T1, is the same in all cases according to Equation (2), and is not affected by the second access at all. However, the latency of the second access, T2, varies, because the degree of DRAM command pipelining differs in the three cases. As the data fetched by both accesses are in the same bank in Fig. 5a, the first command of the second access is not released until the data fetched by the first access have been restored and the row has been precharged. Because the second access has to wait for the full cycle of the first access, and only the transport of its row access command is pipelined with the precharge phase of the first access, its latency T2 can be described by Equation (3). In Fig. 5b, where the two accesses fetch data in different banks of the same rank, they conflict only on the I/O gating circuit and the data bus. Also, as the row required by the second access should be precharged in the case of a bank conflict, the first command of the second access is executed after the start of the first access with a time interval of at least tRP + tRCD to avoid conflicts. So the latency of the second access is the sum of this minimal time interval and the latency of a read cycle without the precharge phase, as described by Equation (4). In Fig. 5c, the data fetched by the two accesses are on different ranks, so the accesses conflict only on the data bus. Similar to Fig. 5b, the minimal time interval between the starts of the two accesses needed to avoid the conflict turns out to be only tBURST, and T2 in this case can be computed by Equation (5).

Fig. 5. The latencies of two consecutive DRAM device accesses with different spatial locality, calculated using Equations (3)-(5), respectively. (a) Two consecutive DRAM memory accesses to the same bank, (b) two consecutive DRAM memory accesses to different banks of the same rank, and (c) two consecutive DRAM memory accesses to different ranks.

T2(case a) = tRAS + tRP + tCMD + tRCD + tCAS + tBURST    (3)

T2(case b) = tRCD + tRP + tCMD + tRCD + tCAS + tBURST    (4)

T2(case c) = tBURST + tCMD + tRCD + tCAS + tBURST    (5)

It can easily be concluded that the later of two consecutive DRAM accesses has the worst-case latency when both accesses fetch data in the same bank. Furthermore, this can be extended to the case of N consecutive DRAM accesses (N > 2), since they can be divided into multiple instances of two consecutive DRAM accesses. Therefore, the worst-case latency of a given access is Tn, as shown in Equation (6), if it is the last one in the sequence of consecutive DRAM accesses and all the accesses fetch data in the same bank.

Tn = (N - 1) * (tRAS + tRP) + tCMD + tRCD + tCAS + tBURST    (6)
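A minimal sketch of Equations (1)-(6) follows, using the DramTiming structure introduced above. It simply transcribes the formulas; it does not model command scheduling or refresh, and the default timing values remain placeholders.

# Sketch of Equations (1)-(6). read_latency follows Equation (2); the
# t2_* functions give the latency of the second of two back-to-back
# accesses for the three spatial-locality cases of Fig. 5; tn_same_bank
# is Equation (6) for N same-bank accesses.
def read_cycle_full(t: DramTiming) -> int:          # Equation (1)
    return t.tCMD + t.tRAS + t.tRP

def read_latency(t: DramTiming) -> int:             # Equation (2)
    return t.tCMD + t.tRCD + t.tCAS + t.tBURST

def t2_same_bank(t: DramTiming) -> int:             # Equation (3)
    return (t.tRAS + t.tRP) + read_latency(t)

def t2_same_rank_diff_bank(t: DramTiming) -> int:   # Equation (4)
    return (t.tRCD + t.tRP) + read_latency(t)

def t2_diff_rank(t: DramTiming) -> int:             # Equation (5)
    return t.tBURST + read_latency(t)

def tn_same_bank(n: int, t: DramTiming) -> int:     # Equation (6)
    return (n - 1) * (t.tRAS + t.tRP) + read_latency(t)

timing = DramTiming()
# Equation (6) with N = 2 reduces to the same-bank case of Equation (3).
assert t2_same_bank(timing) == tn_same_bank(2, timing)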
V. ANALYZING WORST-CASE DRAM PERFORMANCE

Our assumptions: In this work, we develop a WCET analysis method to derive the WCET for real-time applications running on multicore processors by modeling and bounding the worst-case DRAM performance. We focus on studying the instruction accesses through the memory hierarchy, and assume the data cache is perfect. Also, in our WCET analysis, we have not considered the timing caused by bus conflicts and DRAM memory refreshing. We assume in-order pipelines without branch prediction. In our software model, we assume a set of independent tasks executes concurrently on different cores, and there is no data sharing or synchronization among those tasks.

We extend the implicit path enumeration technique (IPET) [4] to obtain the WCET of a real-time application on a multicore processor with its worst-case DRAM performance. In IPET, the objective function of the ILP problem that calculates the WCET is subject to structural constraints, functionality constraints, and micro-architecture constraints, all of which can usually be described as linear equations or inequalities. In addition, some equations are created to describe the equality relationship between the execution counts of basic blocks and line blocks (l-blocks), to connect the control flow graph (CFG) and the cache conflict graph (CCG).

As there are only private L1 and L2 caches in the multicore architecture studied, our WCET analysis approach only needs to construct a CCG for each L1 and L2 cache to build the cache constraints. The CCG of an L2 cache describes the constraints between the L2 cache accesses and the DRAM accesses, as an L2 cache miss results in a DRAM access. In order to account for the worst-case DRAM performance, the objective function of the WCET for each thread is given in Equation (7), which includes the computing time, the latency to access the L1 cache, and the latency to access the L2 cache. The last part, Σ(i=1..n) Ci * Mi, indicates the total latency of the DRAM accesses. Specifically, Ci is the worst-case latency of a given DRAM access, and Mi denotes its number of executions, which is bounded by the cache constraints from the CCG of the L2 cache.

WCET = Computing Time + L1 Latency + L2 Latency + Σ(i=1..n) Ci * Mi    (7)
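As an illustration of how Equation (7) can be posed as an ILP objective, the sketch below uses the open-source PuLP modeler on a made-up two-block CFG. PuLP is a stand-in (the paper does not prescribe a particular ILP solver), and the block costs, cache latencies, and the few constraints are invented for the example; the paper's actual structural, functionality, micro-architecture, and CCG constraint sets are much richer.

# Hypothetical sketch of the Equation (7) objective with PuLP.
# Block costs, latencies, and constraints are invented for illustration.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

prob = LpProblem("wcet_bound", LpMaximize)

# Execution counts of two basic blocks and of the DRAM accesses of B2.
x1 = LpVariable("x_B1", lowBound=0, cat="Integer")
x2 = LpVariable("x_B2", lowBound=0, cat="Integer")
m1 = LpVariable("m_DRAM1", lowBound=0, cat="Integer")  # L2 misses of B2

compute = {x1: 12, x2: 30}        # computing time per execution (cycles)
l1_latency, l2_latency = 1, 10    # per-access cache latencies (cycles)
c1 = 73                           # worst-case DRAM latency Ci (cycles)

# Equation (7): computing time + L1 latency + L2 latency + sum(Ci * Mi).
prob += (lpSum(cost * v for v, cost in compute.items())
         + l1_latency * (x1 + x2)
         + l2_latency * x2
         + c1 * m1)

# Invented structural and cache constraints standing in for the real
# CFG/CCG constraint sets: B1 runs once, B2 at most 10 times per run of
# B1, and each execution of B2 causes at most one L2 miss (DRAM access).
prob += x1 == 1
prob += x2 <= 10 * x1
prob += m1 <= x2

prob.solve()
print("WCET bound (cycles):", value(prob.objective))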

A. Conservative Approach

If there are N identical cores sharing the DRAM of a multicore processor, it is safe but pessimistic to estimate the worst-case latency of each DRAM access based on two assumptions. The first assumption is that each DRAM access from a thread is always issued simultaneously with N-1 other DRAM accesses from the N-1 co-running threads, and that this access starts to be executed only after all the other accesses have finished. The second is that all these consecutive DRAM accesses fetch data in the same bank, which results in the worst-case scenario described in Section IV-C. Therefore, the worst-case latency of a DRAM access can be computed by Equation (8), where (N - 1) * (tRAS + tRP) is the delay of waiting for the other N-1 accesses to finish, and tCMD + tRCD + tCAS + tBURST is the actual DRAM device access latency of this access. In addition, the calculation of Ci includes the bus access latency tBUS and the queuing delay in the memory controller tQUEUE, both of which are safely estimated as constants, as discussed in Section IV. Although this approach is safe, it is pessimistic and may result in considerable overestimation.

Ci = (N - 1) * (tRAS + tRP) + tCMD + tRCD + tCAS + tBURST + tBUS + tQUEUE    (8)
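The conservative bound of Equation (8) is a one-line function of the core count and the timing parameters. T_BUS and T_QUEUE below are placeholder constants, as the paper only states that the bus latency and queuing delay are bounded conservatively.

# Equation (8): conservative per-access DRAM latency bound for N cores.
# T_BUS and T_QUEUE are placeholder constants for the (conservatively
# bounded) bus latency and memory-controller queuing delay.
T_BUS = 4
T_QUEUE = 6

def conservative_latency(n_cores: int, t: DramTiming) -> int:
    wait_for_others = (n_cores - 1) * (t.tRAS + t.tRP)
    own_access = t.tCMD + t.tRCD + t.tCAS + t.tBURST
    return wait_for_others + own_access + T_BUS + T_QUEUE

# Example: the bound grows linearly with the number of cores.
print([conservative_latency(n, DramTiming()) for n in (1, 2, 4)])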
B. A Basic Approach

In order to reduce the overestimation of the conservative approach, the basic approach is proposed by considering the effect of DRAM command pipelining. As discussed in Section IV-C, the performance of DRAM command pipelining among consecutive DRAM accesses depends on the spatial locality of the data fetched. The worst-case situation of DRAM command pipelining happens when the data fetched by consecutive DRAM accesses are in the same bank, which degrades the degree of DRAM command pipelining the most. Given a thread (task), the basic approach first checks the DRAM address of the data fetched by each DRAM access. Then, it determines the maximum number of DRAM accesses from other threads that fetch data in the same bank as this access. If no DRAM access from another thread is found to fetch data in the same bank, it then examines the number of DRAM accesses from other threads that fetch data in the same rank.

The basic approach is described in Algorithm 1. The input of this algorithm is the N co-running threads, and the output is the WCET objective function of each thread. The worst-case DRAM performance of each co-running thread is estimated individually. The worst-case latency of a given DRAM access Mj in a given thread Ti is estimated as follows. First, addr, the DRAM address of the data fetched by Mj, is translated from the physical address according to the given address mapping policy, and the bank id b and rank id r are both derived from addr. Then, the number of other co-running threads with DRAM accesses fetching data in the same bank b is counted as Nb (line 9 of Algorithm 1). These Nb threads are excluded from the remaining procedure. Since it is possible that DRAM accesses from the remaining threads still fetch data in a common bank bk other than b, the maximum number of threads with DRAM accesses fetching data in bank bk is calculated as N0b[k] and stored in an array (lines 11 to 15). These threads are also excluded from the remaining procedure. In the next step, the number of threads with DRAM accesses fetching data in the same rank r is calculated as Nr. At the end of the processing for Mj, the number of threads with DRAM accesses fetching data on different ranks is computed as Ndr. The worst-case latency Cj for Mj is then calculated based on Equations (3)-(5). The algorithm terminates when the worst-case DRAM performance has been estimated and added into the WCET objective function, based on Equation (7), for all the co-running threads.

Algorithm 1. Basic Approach
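The counting step of Algorithm 1 can be sketched as follows. This is a simplified reading of the prose above, since the full pseudocode of Algorithm 1 is not reproduced in this excerpt: for one access of interest it classifies each co-running thread by whether that thread has an access to the same bank, to the same rank, or only to other ranks, and it omits the paper's N0b grouping of remaining threads that conflict among themselves. The thread representation is hypothetical.

# Simplified sketch of the per-access counting step of Algorithm 1.
# A thread is modeled as the set of (rank, bank) pairs its DRAM accesses
# touch; this representation and the omission of the N0b grouping are
# simplifications for illustration.
from typing import List, Set, Tuple

def classify_conflicts(target: Tuple[int, int],
                       other_threads: List[Set[Tuple[int, int]]]):
    """Return (Nb, Nr, Ndr) for one DRAM access.

    target        -- (rank, bank) of the access under analysis
    other_threads -- for each co-running thread, the (rank, bank)
                     pairs of its DRAM accesses
    """
    rank, bank = target
    nb = nr = ndr = 0
    for accesses in other_threads:
        if (rank, bank) in accesses:               # same bank, same rank
            nb += 1
        elif any(r == rank for r, _ in accesses):  # same rank only
            nr += 1
        elif accesses:                             # only other ranks
            ndr += 1
    return nb, nr, ndr

# Invented example loosely modeled on access M1 of Fig. 6: two other
# threads touch rank 1 / bank 1, and one thread touches other ranks only.
threads = [{(1, 1), (1, 2)}, {(1, 1)}, {(4, 4), (3, 3)}]
print(classify_conflicts((1, 1), threads))   # -> (2, 0, 1)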

C. An Example of the Basic Approach

An example of the basic approach is shown in Fig. 6. In this example, there are 4 threads running concurrently on a multicore processor with 4 cores, and each thread contains multiple DRAM accesses. It is supposed that the DRAM in this example has 4 ranks, and that each rank has 4 banks. A DRAM access is represented by a rectangle with the name Mi. The numbers inside the parentheses denote the DRAM address of the data fetched by this access, where the first number is the rank id and the second number is the bank id. For example, the first DRAM access in Thread A is named M1, and its rank id and bank id are both 1. In addition, all the DRAM accesses are connected by edges to indicate the timing order derived from the CFG.

Fig. 6. An example of estimating the worst-case DRAM performance by the basic approach.

The estimation procedure for Thread A starts by checking the DRAM address of M1. M5 in Thread B and M10 in Thread C are found to fetch data in bank 1 of rank 1 as well, so Nb for M1 is 2. As Thread D does not have any DRAM access to rank 1, Nr is 0 and Ndr is 1. Therefore, the worst-case latency of M1 can be computed by Equation (9). Only one DRAM access, M9 in Thread C, fetches data in bank 2 of rank 1, the same bank as M2, while M8 and M13 are found to access rank 3 in Thread B and Thread D, respectively; so Nb is 1, Nr is 2, and Ndr is 1 for M2, and the worst-case latency for M2 can be calculated by Equation (10). The cases of M3 and M4 are similar. Although no other DRAM access fetches data in the same bank as M3 or M4, either M5 and M10 or M6 and M11 in Thread B and Thread C access a common bank. However, there is no DRAM access fetching data in the same rank as M3, so the worst-case latency of M3 can be derived as Equation (11). In contrast, either M12 or M14 fetches data in rank 4, so the worst-case latency of M4 can be computed by Equation (12).

C1 = tCMD + tRCD + tCAS + 2 * tBURST + 2 * (tRAS + tRP) + tBUS + tQUEUE    (9)

C2 = tCMD + tRCD + tCAS + tBURST + tRAS + tRP + 2 * (tRP + tRCD) + tBUS + tQUEUE    (10)

C3 = tCMD + tRCD + tCAS + 2 * tBURST + 2 * (tRAS + tRP) + tBUS + tQUEUE    (11)

C4 = tCMD + 2 * tRCD + tCAS + tBURST + 2 * tRAS + 3 * tRP + tBUS + tQUEUE    (12)

With the specific timing parameters given in Table 2, the worst-case latencies C1, C2, C3, and C4 are calculated as 63, 59, 63, and 66 cycles, respectively. In contrast, the worst-case latency for all of these DRAM accesses is estimated to be 73 cycles by the conservative approach. It is clear that the basic approach reduces the overestimation of the worst-case DRAM performance in the conservative approach.
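For completeness, Equations (9)-(12) can be written directly against the DramTiming structure and the placeholder T_BUS/T_QUEUE constants defined earlier. Since Table 2 is not reproduced in this excerpt, the placeholder timing values give illustrative numbers only, not the 63/59/63/66-cycle results reported above.

# Equations (9)-(12) for the Fig. 6 example, transcribed literally.
# With placeholder timing values the results are illustrative only.
def example_latencies(t: DramTiming, t_bus: int = T_BUS,
                      t_queue: int = T_QUEUE) -> dict:
    base = t.tCMD + t.tRCD + t.tCAS + t.tBURST            # Equation (2)
    c1 = base + t.tBURST + 2 * (t.tRAS + t.tRP) + t_bus + t_queue        # (9)
    c2 = base + t.tRAS + t.tRP + 2 * (t.tRP + t.tRCD) + t_bus + t_queue  # (10)
    c3 = base + t.tBURST + 2 * (t.tRAS + t.tRP) + t_bus + t_queue        # (11)
    c4 = base + t.tRCD + 2 * t.tRAS + 3 * t.tRP + t_bus + t_queue        # (12)
    return {"C1": c1, "C2": c2, "C3": c3, "C4": c4}

print(example_latencies(DramTiming()))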

D. An Enhanced Approach

Although the basic approach considers the effect of DRAM command pipelining on the worst-case DRAM performance, there is still overestimation due to the timing order constraints of the co-running DRAM accesses, since the order of the DRAM accesses of a given thread can impact the timing order of the DRAM accesses of other threads. This problem can be explained by the example in Fig. 7. Assume that there are two DRAM accesses in each thread: Thread 1 contains Mi and Mj, and Thread 2 contains Mk and Ml, where Mi and Ml fetch data in the same bank, and so do Mj and Mk. If the timing order is not taken into account, there are two bank conflicts. However, it is clear that the timing order of the threads is violated in the case of two bank conflicts. If Mi and Ml are issued simultaneously from both threads and a bank conflict between them is taken into account, Thread 2 must have reached the end of Ml while Thread 1 has not yet started the execution of Mj, which indicates that the bank conflict between Mj and Mk cannot happen. The same analysis can be applied to Mj and Mk. Therefore, there can be at most one bank conflict in the worst case, and the same analysis applies to rank conflicts.

An enhanced approach is proposed to compute the worst-case DRAM performance more accurately by eliminating such invalid DRAM access sequences.
