REAPR: Reconfigurable Engine For Automata Processing


Ted Xie1, Vinh Dang2, Jack Wadden2, Kevin Skadron2, Mircea Stan1
1Department of Electrical and Computer Engineering, University of Virginia
2Department of Computer Science, University of Virginia
{ted.xie, vqd8a, wadden, skadron, mircea}@virginia.edu

Abstract— Finite automata have proven their usefulness in high-profile domains ranging from network security to machine learning. While prior work focused on their applicability for purely regular expression workloads such as antivirus and network security rulesets, recent research has shown that automata can optimize the performance of algorithms in other areas such as machine learning and even particle physics. Unfortunately, their emulation on traditional CPU architectures is fundamentally slow and further bottlenecked by memory. In this paper, we present REAPR: Reconfigurable Engine for Automata PRocessing, a flexible framework that synthesizes RTL for automata processing applications as well as I/O circuitry to handle data transfer to and from the kernel. We show that even with memory and control flow overheads, FPGAs still enable extremely high-throughput computation of automata workloads compared to other architectures.

I. INTRODUCTION

Many modern applications essential to domains such as high-performance computing, network security, and machine learning benefit from an approach based on finite automata. For instance, in the case of network security, regular expressions used for deep packet inspection are represented in memory as their equivalent nondeterministic finite automata; RegEx engines such as Intel HyperScan [1] perform automata transformations to maintain excellent performance in terms of both runtime and memory utilization. In the field of machine learning, Tracy et al.
[2] have shown that automata can even be used for a high-performance implementation of the Random Forest classification algorithm.

Previously, engineers have relied on Moore's Law scaling of computing power to keep up with demands for these workloads, despite the fact that traditional von Neumann architectures are poorly suited for automata processing. Now in the "post-Moore" era, it is no longer prudent to rely on advances in manufacturing for improvements in computational throughput. The unsuitability of traditional CPU architectures for automata processing is amplified by the increasing demand for such applications; heterogeneous accelerator-based solutions are the way forward.

Specifically, prior work [3] [4] has shown that digital circuits enable a one-to-one spatial mapping between automata states and circuit components such as registers and wires. Due to the volatility of many automata processing workloads such as network security and machine learning, in which the rulesets and algorithms change frequently, non-reprogrammable application-specific integrated circuits (ASICs) are too inflexible and too costly to be commercially viable; reconfigurable platforms such as FPGAs and the Micron Automata Processor [4] are therefore an ideal target for these workloads.

Our goal in this paper is to develop a high-throughput and scalable engine for enabling spatial automata processing on FPGAs, as opposed to prior work which focused purely on regular expressions.
To achieve this goal, we have developed REAPR (Reconfigurable Engine for Automata PRocessing), a flexible and parameterizable tool that generates RTL to emulate finite automata.

The major contributions of REAPR are the following:
- The first effort to characterize automata performance for applications other than regular expressions on the FPGA, using the ANMLZoo benchmark suite [5];
- A head-to-head comparison of FPGA computational throughput against a best-effort CPU automata processing engine (Intel HyperScan) and the Micron Automata Processor;
- The first automata-to-RTL translation tool with I/O generation capability;
- To our knowledge, the first attempt to run an automata processing application (Random Forest) on actual hardware by using AXI and PCI-Express to enable CPU-FPGA communication;
- A maximally-sized Levenshtein automaton to determine peak state capacity with complex routing.

II. PRIOR WORK

A. CPU Automata Engines

In a nondeterministic finite automaton (NFA), symbols from the input stream are broadcast to each state simultaneously, and each state connects to several other states, each of which may or may not activate depending on whether a given state matches the incoming symbol. For each symbol, an NFA engine must determine the next set of activated states, which involves a linear-time scan of the adjacency lists of all states in the current activated set. In the worst case, the adjacency list may contain nearly all of the states in the automaton; therefore, the runtime on a CPU for simulating an m-state automaton on n symbols is O(n · m), and for non-trivial data with non-trivial automata (n ≈ m), the overall runtime is quadratic. CPU NFA processing is additionally hampered by the so-called "memory wall" due to the NFA's pointer-chasing execution model, and it is therefore desirable to drastically reduce the number of memory accesses per input item. To mask memory latency, state-of-the-art NFA engines such as Intel HyperScan [1] perform SIMD vector operations to execute as many state transitions as possible for a given memory transaction. Even so, such optimizations cannot escape the fact that sequential von Neumann architectures are fundamentally ill-suited for these types of workloads.

In order to improve the run-time complexity of automata traversals, some regular expression engines transform the NFA into its equivalent deterministic finite automaton (DFA). A DFA has only one state active for any given symbol cycle and is functionally equivalent to an NFA; this is achieved through a process known as subset construction, which involves enumerating all possible paths through an NFA. Converting an NFA to a DFA has the benefit of reducing the runtime to O(n) for n symbols (note that the runtime is now independent of automaton size) and requires only one memory access per input symbol, but frequently causes an exponential increase in the number of states necessary; this phenomenon is often referred to as state explosion. Subset construction for large automata incurs a huge memory footprint, which may actually cause performance degradation due to memory overhead in von Neumann machines.

Prior work by Becchi [6] attempted to leverage the best of both types of finite automata (the spatial density of NFAs and the temporal density of DFAs). By intercepting the subset construction algorithm and not expanding paths that would result in state explosion, Becchi achieved a 98-99% reduction in memory capacity requirements and up to a 10x reduction in memory transactions.

B. Automata on FPGA

1) NFA: Past implementations of NFAs on FPGAs [3] [7] focused on synthesizing only regular expression matching circuits for applications such as antivirus file scanning and network intrusion detection. REAPR extends this prior work by focusing on a more diverse set of finite automata to address the fact that the workload for automata processing is much richer and more diverse than regular expressions.
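The per-symbol frontier scan described in Section II-A can be sketched in a few lines of software. This is an illustrative model only (not REAPR's implementation or HyperScan's): each state carries a character class and an adjacency list, start states are seeded once for simplicity, and the inner loop over the active set is the O(n · m) cost the text describes.

```python
# Illustrative model of CPU NFA simulation: for each of n input symbols,
# scan the adjacency lists of every state in the current active set (O(m)
# in the worst case), giving O(n * m) overall.

def simulate_nfa(states, start_ids, data):
    """states: id -> (char_class: set of symbols, successors: list of ids).
    Returns the set of states active after consuming `data`."""
    active = set(start_ids)
    for symbol in data:
        nxt = set()
        for s in active:                      # worst case: nearly all m states
            char_class, successors = states[s]
            if symbol in char_class:          # this state matches the symbol
                nxt.update(successors)        # so its successors activate
        active = nxt
    return active

# Hypothetical 3-state automaton recognizing the sequence "ab":
states = {0: ({'a'}, [1]), 1: ({'b'}, [2]), 2: (set(), [])}
```

Each step chases pointers into `states`, which is exactly the irregular memory access pattern the "memory wall" discussion refers to.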
We extend the underlying approaches for NFA RTL generation from prior work, adapt them for other NFA applications, and detail our process in Section IV.

2) DFA: Several efforts [8] to accelerate automata processing with FPGAs use Aho-Corasick DFAs as the underlying data structures. A major motivator behind this design choice is the ease of translation between a DFA and a simple transition table, which is easily implemented using BRAM. One benefit of this approach is that BRAM contents can be hot-swapped easily, whereas a spatial design requires a full recompilation to realize even a single change. Because DFAs do not exploit the native bit-level parallelism in digital hardware and are much better suited to memory-bound CPU architectures, REAPR focuses only on the spatial implementation of NFAs.

C. The Micron Automata Processor

1) Architecture and Overview: The Micron Automata Processor (AP) [4] is a reconfigurable fabric for emulating finite automata in hardware, designed in a 50nm DRAM process. Its fundamental building block is the State Transition Element (STE), which is the hardware realization of an automaton state as well as the next-state transition logic. The first-generation AP packs roughly 49,000 states per chip, and with 32 chips per board, one AP card holds nearly 1.6 million states. REAPR implements automata states in a similar manner but is more extensible due to the flexibility of the FPGA.

2) Programming Interface: In addition to a multi-lingual (C, Python, Java) SDK, Micron offers the Automata Network Markup Language (ANML), an XML-based format for describing the interconnectivity of AP components such as STEs, booleans, and counters. Developers can either write their own ANML files or generate them from regular expressions using the SDK's apcompile command [9].

D. Other Architectures

Several past efforts have proposed modifications to existing von Neumann architectures specifically to increase the performance of automata processing workloads.
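The transition-table style of DFA execution described in Section II-B-2 can be modeled compactly. The sketch below is a generic table-driven matcher under our own assumptions (a small hypothetical DFA, with a default dead/start transition of 0), not the Aho-Corasick construction of [8]; the point is that each input symbol costs exactly one table lookup, which is what makes the BRAM mapping attractive.

```python
# Table-driven DFA matching: one transition-table lookup per input symbol,
# O(n) total, independent of automaton size. The table below is an
# illustrative 3-state DFA accepting strings that end in "ab".

def run_dfa(table, accepting, data, start=0):
    state = start
    for symbol in data:
        state = table[state].get(symbol, 0)   # single memory access per symbol
    return state in accepting

table = {
    0: {'a': 1},          # initial state
    1: {'a': 1, 'b': 2},  # just saw 'a'
    2: {'a': 1},          # just saw "ab" (accepting)
}
accepting = {2}
```

In a BRAM implementation, `table` becomes a dense (state, symbol) → next-state memory, which is why its contents can be hot-swapped without recompilation.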
HARE (Hardware Accelerator for Regular Expressions) [10] uses an array of parallel modified RISC processors to emulate the Aho-Corasick DFA representation of regular expression rulesets. The Unified Automata Processor (UAP) [11] also uses an array of parallel processors to execute automata transitions and can emulate any automaton, not just Aho-Corasick. However, because these works are 1) not FPGA-centric (both are ASICs), 2) based on the von Neumann model and not spatially distributed like REAPR, and 3) confined to just regular expressions (as opposed to general automata applications), we do not directly compare against them.

III. BENCHMARKS

We synthesize the ANMLZoo [5] automata benchmark suite developed by Wadden et al. to determine the efficiency of REAPR. ANMLZoo contains several applications falling into three broad categories: regular expressions, widgets, and mesh. The applications, along with their categories, are listed below in Table I. Detailed descriptions of these benchmarks can be found in the ANMLZoo paper [5].

TABLE I: ANMLZoo is a benchmark suite for automata engines, suitable for many platforms such as traditional CPUs, GPUs, FPGAs, and the Micron Automata Processor.

Regular expressions: Snort, Dotstar, ClamAV, PowerEN, Brill Tagging, Protomata
Mesh: Hamming Distance, Levenshtein Distance*
Widgets: Entity Resolution, Sequential Pattern Mining (SPM), Fermi, Random Forest

*ANMLZoo's Levenshtein benchmark is actually three 24x20 automata, each with 928 states.

ANMLZoo is normalized for one AP chip, so these benchmarks synthesized for the FPGA provide a direct comparison of equivalent kernel performance between the two platforms.

A. Maximally-Sized Levenshtein Automaton

In addition to comparing the relative performance of the AP versus an FPGA, it is also useful to know exactly what the upper bounds are for FPGA capacity. For this reason, we resize the Levenshtein benchmark so that it completely saturates the FPGA's on-chip LUT resources. We have chosen Levenshtein specifically because it is the smallest and therefore worst-performing application in ANMLZoo, due to the clash between its 2D-mesh topology and the AP's tree-based routing. The poor routing can be observed in the fact that Levenshtein has the smallest number of states in ANMLZoo, thus wasting the most computational potential. We believe that Levenshtein represents an application that is not only inefficient on the AP, but very well-suited to the FPGA and its 2D-mesh routing network.

Algorithm 1 NFA-RTL Translation Algorithm
procedure NFA2RTL(incoming_symbol)
    for each STE do
        generate DFF dff
        generate 1b x 256 character class RAM cc
        generate 1b signal activated
        for each incoming STE iSTE do
            activated ← activated OR iSTE.output
        end for
        generate 1b signal char_matches
        char_matches ← cc[incoming_symbol]
        generate 1b output signal output
        output ← char_matches AND activated
    end for
end procedure

IV. RTL GENERATION

This work focuses mainly on the hardware synthesis of nondeterministic finite automata rather than DFAs.
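The per-STE evaluation in Algorithm 1 can be modeled in software to make its semantics concrete. This is a behavioral sketch under our own assumptions (signal names mirror the pseudocode; start STEs are modeled as always-enabled), not REAPR's generated RTL: each STE holds a 256-entry character-class column, and its registered output is the AND of a class match with the OR-reduction of enable signals from incoming STEs.

```python
# Behavioral model of one STE evaluation cycle from Algorithm 1
# (illustrative; the real design is a DFF plus a 1x256 character-class RAM).

class STE:
    def __init__(self, char_class):
        # 256-bit character-class column: cc[b] == 1 iff byte b is accepted
        self.cc = [1 if chr(b) in char_class else 0 for b in range(256)]
        self.output = 0       # registered output (the DFF)
        self.incoming = []    # STEs whose outputs enable this one

def cycle(stes, start_stes, symbol):
    """One symbol cycle: all STEs evaluate in parallel, then commit."""
    for ste in stes:
        # reduction OR of enable signals; start STEs are always enabled
        activated = any(src.output for src in ste.incoming) or ste in start_stes
        ste.next_output = 1 if (ste.cc[symbol] and activated) else 0
    for ste in stes:          # commit registered outputs simultaneously
        ste.output = ste.next_output

# Hypothetical two-STE chain matching 'a' then 'b':
a, b = STE('a'), STE('b')
b.incoming = [a]
```

The two-phase loop (evaluate, then commit) mimics the clock edge: every STE reads the previous cycle's outputs, just as the hardware does.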
The NFA's highly parallel operation of matching one single datum against many states ("Multiple Instruction Single Data" in Flynn's taxonomy) maps very well to the abundant parallelism offered by spatial architectures such as the FPGA and AP. While DFAs can also be implemented spatially, the argument is less compelling because 1) DFAs only need to perform a single symbol match per cycle, and therefore are better suited to von Neumann architectures, and 2) DFAs often have a huge area requirement.

Spatial architectures implement automata states as transition logic ANDed with a single register representing whether the state is activated. This is the case for the AP as well as prior work [3] [7]. In the case of REAPR and the AP, the transition logic is actually merged with the state to transform a traditional NFA into a homogeneous finite automaton [12]. In these homogeneous FAs, the combined state-transition structure is referred to as a state transition element (STE). Each STE's transition logic is one-hot encoded as a 1x256 memory column (the "character class") and is ANDed with the activation state register, the input to which is the reduction OR of enable signals coming from other states. With this design, a single STE will only output "1" when its activation state is driven high by other states and the current symbol is accepted in its character class. Algorithm 1 describes this process and Figure 1 shows a visual representation of it.

We propose two design methodologies to represent character classes in hardware, using either the FPGA's lookup tables (LUTs) or BRAM.

Fig. 1: Automata states can be easily mapped to registers and look-up tables ("logic").

A. LUT-Based Design

Each state must accept a range of characters corresponding to outgoing transitions in a canonical finite automaton. LUTs are well-suited for this task due to their proximity to the state registers within a CLB; a LUT-based flow will not need to use as much long-distance wiring to connect to a far-away BRAM.

B. BRAM-Based Design

The main disadvantage of using LUTs for storing the character class is the long compilation time; FPGA compilers aggressively minimize logic for LUT designs, which drastically increases compiler effort. Using BRAMs for transition logic circumvents the expensive optimization step and therefore significantly decreases compile time.

The AP's approach to generating hardware NFAs is very similar to the BRAM design, except that Micron stores the 256-bit columns in DRAM banks instead of FPGA BRAM. This has the benefit of high state density due to the higher density of DRAM compared to SRAM.

C. I/O

Prior works considered only kernel performance rather than system performance. While this approach has the benefit of greatly reducing the implementation difficulty of a research project, it does not provide a full analysis because real systems are not I/O-agnostic. A main contribution of REAPR is the inclusion of I/O circuitry over PCI-Express and AXI for the Random Forest benchmark, making REAPR the first work to offer a truly end-to-end automata accelerator design flow for FPGAs.

Fig. 2: Execution flow of AXI and PCI-Express transactions for automata processing kernels: the PCI-E controller receives data; AXI writes the data to global memory; the kernel reads the data via AXI and processes the automaton; AXI writes reports back to global memory; and reports are written out over the PCI-E bus. These stages are pipelined.

We adopt a high-level synthesis (HLS)-centric approach by designing the I/O interface using HLS and modifying the generated Verilog code to integrate our automata kernels. Xilinx SDAccel [13] then generates AXI and PCI-Express circuitry for our kernels. Testing automata circuits with real data on real hardware allows us to obtain more realistic benchmark results than the simulations used by prior works. The overall execution flow of REAPR with I/O is shown in Figure 2.

To integrate our RTL kernels into the HLS-generated Verilog, we design the I/O kernel to have some very basic dummy computation. A simplified code snippet is shown in Listing 1, which shows data being copied from the input buffer to the output buffer after being added to 0xFA. In the Verilog, we can search for this dummy addition and substitute the addition operation with our automata RTL kernel.

Listing 1: I/O kernel with dummy computation
void io_kernel(din *indata, dout *outdata) {
    for (int i = 0; i < DATA_SIZE; i++) {
        outdata[i] = indata[i] + 0xFA;
    }
}

D. Reporting

A major challenge to implementing automata on FPGAs is not in the kernel itself, but rather in the I/O.
For every 8-bit symbol processed by REAPR, thousands of reports may fire, requiring per-cycle storage on the order of kilobits. This massive amount of data transfer has a non-negligible overhead on overall throughput.

To illustrate the detrimental effects of I/O on performance, consider the following example. In the Random Forest application, there are 1,661 reporting states corresponding to 10 feature classifications [2]. The host CPU post-processes this report data, so all of it must be preserved. A 10 MB input file will therefore generate 16.61 GB worth of output signals. Assuming a 250 MHz kernel clock rate and a 10 GBps PCI-Express link with single-stream blocking control flow, the overall end-to-end throughput of the system can be expressed as follows:

Throughput = 10 MB / (10 MB / 10 GBps + 10 MB / 250 MBps + 16.61 GB / 10 GBps)

Evaluating the above expression (0.001 s + 0.04 s + 1.661 s ≈ 1.70 s to process 10 MB) gives an overall throughput of just 5.8 MBps, only about 2% of the expected 250 MBps. Efficient reporting is therefore a crucial part of developing high-performance automata processing kernels.

Fig. 3: The pipelined voter module for Random Forest compresses the output size from 1,661 bits to just 8 bits.

To demonstrate an example of efficient report processing, we delegate the voting stage of the Random Forest algorithm on-chip so that instead of exporting 1,661 bits of report information per cycle, we can export just the vote instead. The Random Forest ("RF") kernel in the ANMLZoo benchmark suite is trained on the MNIST handwriting database for digits 0-9 [2], so only four bits are necessary to encode the vote per cycle. However, because the minimum word width of SDAccel is 8 bits (one byte), we set the vote output width to 8 bits instead.
This enables a factor-of-207 reduction in necessary report storage compared to the original 1,661 bits.

Each of the report bits in the RF kernel corresponds to one of ten possible feature classifications. The voter module, shown in Figure 3, contains ten identical stages. Each voter stage vi takes as input the 10 classification vectors (c0 - c9), the determined vote from the previous stage (vote), and the number of votes corresponding to that classification (max). Each stage i calculates the Hamming weight w of classification vector ci and compares it to max. If w > max, then the current stage passes i as vote and w as max. All of the classification vectors ci are passed to the next stage. Because the throughput of this voter module is one vote per cycle, it has no negative impact on the overall throughput of the Random Forest kernel.
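The stage-by-stage running maximum described above can be sketched behaviorally. This is an illustrative Python model of the voter's logic, not the pipelined RTL (which evaluates one stage per clock); it shows how ten Hamming-weight comparisons reduce 1,661 report bits to a single class index.

```python
# Behavioral model of the report-compaction voter: each of ten stages
# computes the Hamming weight of its classification vector and carries
# forward the running (vote, max) pair, as in Figure 3.

def hamming_weight(bits):
    return sum(bits)

def vote(class_vectors):
    """class_vectors: ten bit-vectors c0..c9, one per digit class.
    Returns (winning class index, its Hamming weight)."""
    best_class, best_weight = 0, -1
    for i, ci in enumerate(class_vectors):   # one voter stage per class
        w = hamming_weight(ci)
        if w > best_weight:                  # pass (i, w) to the next stage
            best_class, best_weight = i, w
    return best_class, best_weight
```

Because the comparison is strict (w > max), ties resolve to the lowest class index; whether the hardware breaks ties the same way is not specified in the text.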

V. EVALUATION

All FPGA metrics were obtained for the Xilinx Kintex UltraScale 060 FPGA (Alpha Data ADM-PCIE-KU3 board) with an x16 PCI-Express interface, 2,160 18 Kb BRAMs, and 331k CLB LUTs. The FPGA's host computer has a quad-core Intel Core i7-4820K CPU running at 3.70 GHz and 32 GB of 1866 MHz DDR3 RAM. CPU performance results were obtained on a six-core Intel Core i7-5820K running at 3.30 GHz with 32 GB of 2133 MHz DDR4 RAM.

To obtain the synthesis and place & route results, we use Xilinx Vivado's Out-of-Context (OOC) synthesis and implementation feature. OOC allows us to synthesize RTL designs for which the number of pins exceeds the maximum number on our selected chip (1,156) in the absence of a general-purpose report-offloading architecture. For future work, we hope to implement such an architecture to obtain more confident data regarding throughput, power consumption, and resource utilization.

All CPU benchmark results are obtained by running a modified version of VASim [14] that uses Intel's HyperScan tool as its automata processing back-end and an ANML (instead of regular expression) parser as its front-end. We choose HyperScan as a general indicator of a state-of-the-art, highly optimized CPU automata processing engine.

Because the AP and REAPR have similar run-time execution models and are both PCI-Express boards, we can safely disregard data transfer and control overheads to make general capacity and throughput comparisons between the two platforms. While in reality the I/O circuitry has a non-negligible effect on both capacity and performance for both platforms, we aim to draw high-level intuitions about the architectures rather than the minutiae of reporting.

A. ANMLZoo Benchmark Results

Our primary figure of merit to quantify capacity is the CLB utilization of the FPGA chip. CLB usage is mainly a function of two variables: state complexity and routing complexity. Automata with very simple state character classes will require very few CLBs to implement.
Similarly, very complexly routed applications (for instance, Levenshtein) have so many nets that the FPGA's dedicated routing blocks are insufficient, so the compiler instead uses LUTs for routing. The CLB utilization can be observed in Figure 4.

CLB utilization ranges from 2-70% for the LUT-based design and 1.4-46% for the BRAM-based design. In most cases, using BRAM results in a net reduction in CLB utilization because the expensive state transition logic is stored in dedicated BRAM instead of distributed LUTs.

Figure 4 also shows the results of compiling ANMLZoo in the BRAM flavor. Theoretically, the total state capacity for BRAM automata designs is the number of states per 18 Kb BRAM cell multiplied by the number of cells. Ideally, we would be able to map transition logic to a 256-row by w-column block, where w = 18 Kb / 256 b = 72. The closest BRAM configuration we can use is 512 x 36, which means that we can only fit 36 256-bit column vectors into one BRAM cell instead of 72. Multiplying the number of states per cell (36) by the number of cells (2,160) gives a per-chip BRAM-based state capacity of 77,760. Most applications' BRAM utilization is almost exactly their number of states divided by the total on-chip BRAM state capacity, except for Dotstar, ER, and SPM. In these cases, the applications have more than the 77k allowable states for the BRAM design, so REAPR starts implementing states as LUTs after the limit is surpassed.

Figure 5 shows the state complexity of ANMLZoo applications, which ranges from 1-3 CLBs per state. While the complexity for logic-based automata varies dramatically based on the complexity of transition logic and enable signals (node in-degree), for BRAM it remains relatively consistent at roughly 1 CLB per state. Notable exceptions to this trend are Hamming, Levenshtein, and Entity Resolution.
Hamming and Levenshtein are mesh-based automata with high routing congestion, and ER is so large that the on-chip BRAM resources are exhausted and LUTs are used to implement the remaining states.

We use Vivado's estimated maximum frequency (Fmax) to approximate throughput for REAPR; the results are displayed in Figure 6. Because the hardware NFA consumes one 8-bit symbol per cycle, the peak computational throughput mirrors the clock frequency. For ANMLZoo, REAPR achieves between 222 MHz (SPM) and 686 MHz (Hamming Distance), corresponding to 222 MBps and 686 MBps throughput.

One interesting result of the estimated power analysis reported by Vivado (see Figure 7) is the observation that the BRAM implementation consumes much more power (1.6 W - 28 W) than the LUT designs (0.8 W - 3.07 W). The reason for this discrepancy is twofold: 1) BRAMs in general are much larger circuits than LUTs, and powering them at high frequencies is quite expensive; 2) routing to and from BRAM cells requires many of the FPGA's longer long-distance wires, which tend to dissipate more energy. In future work, we hope to program all of these BRAM-based circuits onto actual hardware and measure TDP to verify the power consumption.

Figure 8 shows the power efficiency of ANMLZoo applications, which we define as the ratio between throughput and power consumption. In all cases, the LUT-based designs are significantly more power efficient than the BRAM designs due to their much lower power consumption.

Using Fmax (without any reporting or I/O circuitry) as the computational throughput, we can determine the speedup (seen in Figure 9) against a high-end CPU running Intel HyperScan. In the worst case, REAPR is on par with HyperScan, and in the best case it achieves over a 2,000x speedup for the SPM application for both the BRAM- and LUT-based designs.

B. Random Forest with I/O Circuitry

Using the pipelined voter module, we are able to achieve an average throughput of 240 MBps for the Random Forest kernel, including data transfer and control overheads. Compared to HyperScan's performance of 1.31 MBps for this

Fig. 4: CLB and BRAM utilization for ANMLZoo benchmarks using the LUT-based and BRAM-based design methodologies. Note that none of the benchmarks exceed 70% utilization, and that in most cases the BRAM-based design uses fewer CLBs.

Fig. 5: The average state complexity of an automaton is the ratio of CLBs placed and routed versus the number of automaton states. Automata with very simple character classes intuitively need less logic (CLBs) than more complex automata.

Fig. 6: Clock frequency in most cases is degraded when comparing LUT-based automata to BRAM-based, and in general ranges from 200 MHz to nearly 700 MHz.

Fig. 7: Estimated power consumption for LUT NFAs is very low; the most power-hungry application, Dotstar, is estimated to consume barely more than 3 W. The BRAM implementation is much more inefficient, peaking above 25 W for three applications.

Fig. 8: Estimated power efficiency of the LUT-based design greatly outstrips that of BRAM. Mesh benchmarks (Hamming and Levenshtein) perform very well in this aspect.

Fig. 9: Speedup ranges from 1.03x to 2,188x for LUT-based automata and 0.87x to 2,009x for BRAM-based automata.

application, we achieve a 183x speedup on real hardware.

C. Maximally-Sized Levenshtein Automaton

To demonstrate the true power of FPGAs for automata processing, we have developed a new "standard candle"¹ for the Levenshtein benchmark using the approach described by Tracy et al. [15]. By generating and synthesizing larger and larger edit distance automata, we have discovered that for a distance of 20 (the same as the ANMLZoo Levenshtein), the longest Levenshtein kernel that fits on our Kintex UltraScale FPGA has a length of 1,550, requiring 63,570 states. Compared to the 2,784 states in the 24x20 ANMLZoo Levenshtein benchmark, the FPGA achieves a 22.8x improvement in per-chip capacity.

VI. DISCUSSION

A. The Importance of I/O

Our implementation of Random Forest, including the pipelined voter mechanism, achieved 240 MBps overall throughput. This data point proves that I/O handling can have a substantial impact on overall system performance. By delegating the voting portion of the Random Forest algorithm on-chip, REAPR enables FPGA developers to achieve a 40.7x speedup over the estimated worst-case performance of 5.8 MBps. Moreover, the compacted output data stream allows the kernel to operate at 96% of its estimated 250 MBps throughput, indicating that I/O overheads are minimized with our approach.

Assuming that similar high-performance reporting can be implemented for other benchmarks (i.e., these other protocols are also roughly 73% efficient), REAPR is in the worst case still roughly on par with a best-effort CPU approach, and in the best case orders of magnitude faster. This verifies that FPGAs are an excellent platform for automata, and that future work should focus on efficient reporting protocols.

B. The Importance of Application and Platform Topology

In the case of the Hamming and Levenshtein benchmarks, both of which have 2D mesh topologies, the AP compiler was unable to efficiently place and route due to a clash with the AP's tree-like hierarchical routing network.
Such a limitation does not exist on the FPGA, which has a 2D mesh routing topology, as exemplified by the FPGA's 22.8x capacity improvement for Levenshtein compared to the AP. Additionally, Hamming and Levenshtein were two of the best-performing benchmarks in terms of power efficiency. Therefore, applications using 2D mesh-like automata are better suited to the closer-matching 2D routing network available on an FPGA.

C. Logic vs. BRAM

In general, using BRAM to hold state transition logic enables significant savings in terms of CLB utilization; in the BRAM design methodology, CLBs are only used for combining enable signals and in some cases routing, rather than character matching.

¹A standard candle in the context of spatial automata processing is an automaton sized to saturate the resources of a single chip.

