Hardware/Software Codesign - University Of Florida


[Theerayod Wiangtong, Peter Y.K. Cheung, and Wayne Luk]

Hardware/Software Codesign
[A systematic approach targeting data-intensive applications]

IEEE SIGNAL PROCESSING MAGAZINE [14] MAY 2005
1053-5888/05/$20.00 ©2005 IEEE

Reconfigurable hardware has received increasing attention in the past decade due to its adaptable capability and short design time. Instead of using field-programmable gate arrays (FPGAs) simply as application-specific integrated circuit (ASIC) replacements, designers can combine reconfigurable hardware with conventional instruction processors in a codesign system, providing a flexible and powerful means of implementing computationally demanding digital signal processing (DSP) applications. This type of codesign system is the focus of this article.

Most traditional codesign implementations are application specific and do not have a standard method for implementing tasks. A hardware model is usually very different from those used in software. These distinctive views of hardware and software tasks can cause problems in the codesign process. For example, swapping tasks between hardware and software can result in a totally new structure in the control circuit. In addition, many design tools leave the designers to make their own decisions on task partitioning and scheduling, although these decisions dramatically affect the system performance and cost. For example, partitioning in [1] has to be done manually and there is no reconfiguration at run-time.

This article presents a systematic approach to hardware/software codesign targeting data-intensive applications. We focus on application processes that can be represented in directed acyclic graphs (DAGs) and use a synchronous dataflow (SDF) model, the popular form of dataflow employed in DSP systems [2], when running the processes.
The codesign system is based on the UltraSONIC reconfigurable platform, a system designed jointly at Imperial College and the SONY Broadcast Laboratory. This system is modeled as a loosely coupled structure consisting of a single instruction processor and multiple reconfigurable hardware elements. We suggest a new method of constructing and handling system tasks for this real codesign system. Both hardware and software tasks are structured in an interchangeable manner without sacrificing the benefit of concurrency found in conventional hardware implementations.

Our design environment involves an automated partitioning and scheduling algorithm to make a decision on where and when tasks are implemented and run. The CPS algorithm, collectively named for the three main steps of cluster, partition, schedule, is proposed to find the minimum processing time under a specified set of real-world conditions and constraints (communication time, memory conflict, and bus conflict), which are not addressed in [3]. (Note that processing time in this context is the time to process all tasks in the system. It has the same meaning as the maximum completion time, total production time, schedule length, or makespan that can be found in some literature.) Results from the CPS algorithm are used at the implementation stage, where the application-independent infrastructure is available to facilitate designers. This predesigned infrastructure, provided as standard library modules, is responsible for the control/interface mechanism in the codesign system. For example, the designed tasks must be encapsulated by the standard task wrapper to be able to collaborate with the rest of the system. To control operations such as task executions, run-time reconfigurations, and data transfers, an automatically generated task manager program is used. This design approach reduces design errors and supports system modularity, scalability, and manageability for run-time reconfiguration.

THE ULTRASONIC RECONFIGURABLE PLATFORM
Our codesign environment targets UltraSONIC [4], a reconfigurable computing system designed to cope with the computational power and the high data throughput demanded by real-time video applications. The architecture exploits the spatial and temporal parallelism in video processing algorithms. It also facilitates design reuse and supports the software plug-in methodology.

The structure of the board is shown in Figure 1. The system consists of plug-in processing element (PIPE) modules that perform the actual processing.
The standard PIPE contains an XCV1000E FPGA and 4 × 2 MB SRAM. The PCI bus connects the UltraSONIC board to a host PC. Data transfers between the UltraSONIC and the host PC are performed over this PCI bus. (Note that the UltraSONIC main board is a universal PCI card, meaning that it can operate at 66 or 33 MHz in a 32- or 64-b PCI slot running at 5 or 3.3 V, controlled by the local bus controller.) On the board, there is one global bus, called the PIPE bus (PB), and two local buses, called the PIPE flow global (PFG) and PIPE flow chain (PFC).

[FIG1] The architecture of the UltraSONIC reconfigurable platform.

Each PIPE consists of three main parts:
- PIPE engine: handles computations specified by the user. PE registers are used to receive parameter values from the host during computational processes.
- PIPE router: responsible for data movement. Routes are programmable via the internal PR registers. It consumes around 10% of resources in an XCV1000E device.
- PIPE memory: provides local buffering of data. Two independent data/address ports are provided for two equal memory blocks (4 MB each).

SYSTEM SPECIFICATIONS AND MODELS
The reconfigurable hardware elements of our codesign system are programmable FPGAs. Run-time reconfiguration is supported in this reconfigurable platform, which does increase the complexity of our system. Features that we are concerned with in our reconfigurable system are as follows:
- There are many hardware processing elements (PEs) that are reconfigurable.
- Reconfiguration can be performed at run-time.
- All parameters such as the number of PEs, the number of gates on each PE, communication time, and configuration time are taken into account.
- On each PE where the number of logic gates is limited, hardware tasks may have to be divided into several temporal groups that will be reconfigured at run-time.
- Tasks must be scheduled without conflicts on shared resources, such as buses or memories.

SYSTEM MODEL
As shown in Figure 1, the target system is a loosely coupled model by nature, which means that there is no shared memory used as a medium for transferring data. There is only local memory on each PE.
To transfer data between PEs, a communication channel is directly established between both ends that must communicate.

The target system can be modeled as a system consisting of a single software element and multiple reconfigurable hardware elements [see Figure 2(a)]. There are two main types of buses in this system: global and local buses. Configurations can be completed through the global bus. This global bus is also used as a communication channel between hardware and software. However, transferring data between the hardware PEs is done through the local bus. Each PE has its own local memory for storing input and output data for all internal tasks. In addition, there is a well-established mechanism for users to control PEs from the host processor based on a set of well-defined application program interface (API) functions.

Tasks in a system process are dynamically implemented and executed (one step for each task) either in the CPU or PEs, according to precedence relationships and priorities. All operations are controlled and initiated by a task manager program running in software. There exists a local controller in each PE to interact with the task manager and to be responsible for all local operations such as executing tasks, memory interfacing, and transferring data between PEs. An example of implementing tasks in the system is given in Figure 2(b).

[FIG2] The UltraSONIC system. (a) The system model. (b) An example of task implementation.

TASK MODEL
The tasks that we implement in our system are assumed to conform to the following restrictions:
- Software and hardware tasks are built uniformly to be performed under the same control mechanism. This simplifies system management and task swapping.
- Tasks implemented in each hardware PE are coarse-grain tasks, which may consist of one or more functional tasks (blocks or loops).
- Communication between tasks is always through local single-port memory used as buffers between tasks.
- Tasks for a PE may be dynamically swapped in and out using dynamic reconfiguration.

There are different types of tasks to be specified in this system: normal, software-only, hardware-only, and dummy tasks. A normal task is free to be partitioned and scheduled either in hardware or software resources. A software-only task is a task that users intentionally implement in software without exception. Similarly, a hardware-only task is implemented solely in hardware. A dummy task is either source or sink for inputting and outputting data, respectively, and is not involved in any computation. In our system, we assume that inputs and outputs are initially provided and written to the microprocessor memory; a dummy task is therefore a software task by default.

EXECUTION CRITERIA
After tasks have been loaded into a PE and are ready to be processed, they cannot be interrupted in the middle of the execution process; in other words, they are nonpreemptive. Task execution is completed in three consecutive steps: read input data, process the data, and write the results. This is done repeatedly until input data stored in memory is completely processed. Thus, the communication time between the local memory and the task (while executing) is considered to be a part of the task execution time [5]. Also, resource conflicts must be prevented, which means that a shared resource (such as memory or bus) is available to only one task at a time.

The main restriction of this execution model is that exactly one task in a given PE is active at any one time. This is a direct consequence of the single-port memory restriction that allows one task to access the memory at any given time. However, multiple tasks can run concurrently when they are mapped to different PEs. This is an improvement over the model proposed by others in [6].

FIRING RULES
We adopt the rules by which data is processed by a task from the synchronous data flow (SDF) computational model [7]. However, in this work, our tasks are coarse-grained tasks represented in a DAG and we use the PE local memory as buffers between nodes, replacing the FIFO in [7] and [8]. As a result, tasks mapped to the same PE must be executed sequentially to avoid memory conflict. If a task requires a large amount of input data, the data must be sliced into sufficiently small units for processing to take place. The task is fired repeatedly as read-execute-write cycles until all data for the task has been processed.

An example of our task execution model is shown in Figure 3. In this example, task A, which has two incoming edges (IE1, IE2) and two outgoing edges (OE1, OE2), is a task to be fired. There are 100 tokens from IE1 and 50 tokens from IE2 to be executed. The number shown inside the parentheses on each edge is the number of data values needed for each firing iteration, representing the consuming rate or producing rate of a task. For instance, the consuming rate on IE1 of task A is four, while the producing rate on OE1 is one.

To process data, task A first reads four tokens from IE1 followed by two tokens from IE2. This must be performed sequentially because these inputs are stored in the same single-port local memory used as buffers on each edge. All of these input tokens are stored inside the node before being processed. When execution is completed, one token is written out to OE1, followed by three tokens to OE2. These steps are performed repeatedly until all data (shown as the first number on the edge) is processed.

[FIG3] Example of firing process. (Task A with edges IE1 100/(4), IE2 50/(2), OE1 25/(1), and OE2 75/(3); the sequence Read (IE1), Read (IE2), Execute, Write (OE1), Write (OE2) repeats 25 times.)

CODESIGN ENVIRONMENT
Figure 4 depicts the codesign environment of the UltraSONIC system. It is divided into the front-end and the back-end stage. The front end is responsible for system specifications, input intermediate format, and system partitioning/scheduling. The back end involves hardware/software task design and implementation, design verifications, control mechanisms, and system debugging.
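
As a minimal sketch of the firing rules illustrated by task A in Figure 3, the number of read-execute-write cycles and the tokens written to each outgoing edge can be derived from the token counts and rates on the edges. The function name `firing_plan` and the dictionary layout are illustrative assumptions and are not part of the UltraSONIC tool flow.

```python
def firing_plan(tokens_in, consume, produce):
    """Return (firings, tokens_out) for one task under SDF-style firing.

    tokens_in: input edge -> tokens waiting in the local memory buffer
    consume:   input edge -> tokens read per firing (consuming rate)
    produce:   output edge -> tokens written per firing (producing rate)
    """
    counts = set()
    for edge, total in tokens_in.items():
        rate = consume[edge]
        if total % rate != 0:
            raise ValueError(f"{edge}: {total} tokens is not a multiple of rate {rate}")
        counts.add(total // rate)
    if len(counts) != 1:
        # Every input edge must be drained by the same number of firings.
        raise ValueError("input edges imply different firing counts")
    firings = counts.pop()
    tokens_out = {edge: firings * rate for edge, rate in produce.items()}
    return firings, tokens_out


# Values taken from the task A example: 100/(4) and 50/(2) in, (1) and (3) out.
firings, out = firing_plan(
    tokens_in={"IE1": 100, "IE2": 50},
    consume={"IE1": 4, "IE2": 2},
    produce={"OE1": 1, "OE2": 3},
)
print(firings, out)
```

With the figure's numbers this gives 25 firings and 25 and 75 tokens on OE1 and OE2, matching the edge labels 25/(1) and 75/(3).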

At the front end, the design to be implemented is assumed to be described in a suitable high-level language, which is then mapped to a DAG at coarse-grained level. Nodes and edges in the DAG represent tasks and data dependencies, respectively. The group of algorithms known as the CPS algorithm reads a textual input file that includes DAG information and parameters for the clustering, partitioning, and scheduling process. During this input stage, users can specify the type of tasks as normal, software-only, hardware-only, or dummy tasks.

After obtaining the results of the partitioning and scheduling process, which are the physical and temporal bindings for each task, we can start the back-end design implementation phase. In the case of hardware tasks, they may be divided into many temporal groups that can either be statically mapped to a hardware resource or dynamically configured during run-time.

We currently assume that software tasks are manually written in C/C++, while hardware tasks are designed manually in a hardware description language (such as Verilog in this work) using a library-based approach. Once all the hardware tasks for a given PE are available, they are wrapped in a predesigned circuit, called xPEtask, which is application independent. Commercially available synthesis and place-and-route tools are then used to produce the final configuration files for each hardware element. Each task in this implementation method requires some hardware overhead to implement the task frame wrapper circuit. Therefore, our system favors partitioning algorithms that generate coarse-grained tasks.

The results from the partitioning and scheduling process, the memory allocator, the task control protocol, the API functions, and the configuration files of hardware tasks are used to automatically generate the codes for the task manager program that controls all operations in this system (such as dynamic configuration, task execution, and data transfer). The resulting task manager is inherently multithreaded to ensure that tasks can run concurrently where possible.

In the following section, the important parts of the codesign environment, including the CPS algorithm, the task manager program, and the infrastructure, are described.

[FIG4] The proposed codesign environment.

THE CPS ALGORITHM
The CPS algorithm in the front-end stage plays an important role in this codesign system. It helps designers in task clustering, partitioning, and scheduling, which are known as intractable problems [9]. The CPS method is a combination of three heuristic algorithms: the two-phase clustering, the tabu search, and the list scheduling. This combination is designed to obtain a good result in a reasonable time frame.

The two-phase clustering algorithm [10] is used as a preprocessing step to modify the granularity of tasks and enable more task parallelism. On average, it has been shown to achieve 15% shorter processing time for different task granularities. A new, smaller DAG with coarser grained tasks is then partitioned and scheduled in order to obtain the minimum processing time.

The heuristic algorithm, based on tabu
search, is used to partition tasks into software and hardware [11]. It has been modified for this real system, which has a search space of K^N (where K is the number of processing elements and N is the number of tasks). This exponentially increasing search space makes an exhaustive search impractical. Although heuristic search could yield a near-optimal solution, convergence speed could be greatly improved by using a good initial guess.

The list scheduler is used to order tasks, without any shared resource conflicts, with regard to partitioning results, task precedence, and the target system model. Our scheduling process has a tight relationship with the partitioning process. It is used as a cost function to examine the processing time of the guessed solutions from the partitioner. After the processing time information is obtained, it is sent back to guide the partitioner to explore only promising regions. This iteration process between partitioning and scheduling to minimize processing time will terminate when the stop condition (such as the number of iterations specified by the designer) is met.

THE TASK MANAGER
Two main control methods in a distributed architecture are centralized control and distributed control [12]. In this work, we choose to implement the former due to its simplicity and good matching to the PC host processor in our system. The centralized control task manager program is used to orchestrate the sequencing of all hardware and software tasks, the transfer of data and the synchronization between them, and the dynamic reconfiguration of FPGAs in PEs when required. Because this program runs on the software processor (a host PC), the service time of which is uncertain and depends on several unpredictable factors, a time-triggered method is not suitable for real-time control actions. To properly synchronize the execution of the tasks and the communication between tasks, our task manager employs an event-triggered protocol when running an application. However, unlike a reactive codesign system [13], we do not regard external real-time events as triggers. Instead, we use the termination of each task execution or data transfer as event triggers, and the signaling of such events is done through dedicated registers.

With a single CPU model, the software processor must run the task manager and the software tasks simultaneously. A multithreaded programming technique is then employed to run these two types of processes concurrently. A mutex (short for mutual exclusion) is a way of communicating among threads that are executing asynchronously.

EXAMPLE
Figure 5(a) shows an example of a DAG. After the partitioning and scheduling process, operation sequences to run the DAG can be obtained as shown in Figure 5(b). This information is used to automatically generate the task manager program based on the message-based, event-triggered protocol as described earlier. In this example, the task manager first sends a message to execute task A in the processor (SW). Consequently, the configuration process runs to load the temporal group of tasks B and C into PE1. The task manager waits until task A is finished before initiating data transfer between SW and PE1, preparing for task B to be executed next. This example also shows that there are two hardware temporal groups for PE1 that will be reconfigured at run time.

[FIG5] (a) DAG example, (b) operation sequences of the task manager, and (c) the task manager control view. (N.B. TM: task manager; CFG: configuration period; TRF: inter-PE data transfer.)

Figure 5(c) shows the conceptual control view of the task manager and its operations. The task manager communicates with a local task controller on each PE in order to assert control. A message board is used in each PE to receive commands from the task manager or to flag finishing status to the task manager. As can be seen, a message indicating execution completion from task B is posted to a specific register inside PE1. The task manager program polls this register, finds the message, and then proceeds to the next scheduled task (in this case, task C). Using this method, tasks on each PE run independently because the program operates asynchronously at the system level.

LIMITATIONS
At present, the task manager program, which is based on an asynchronous control protocol and runs in software, is not suited to processing streaming data, which usually employs a pipelined structure. We cannot initiate several messages to run tasks in different pipeline stages simultaneously from the manager program, which is a sequential software code. However, one possibility to extend this work is to use special hardware nodes that can handle real-time pipelining to control operations, rather than the task manager program [14].

DESIGN INFRASTRUCTURE: THE TASK WRAPPER
Both hardware and software tasks are designed manually based on the results of partitioning. However, we provide an infrastructure to assist designers. A task core will reside in a standard predesigned structure called the task wrapper, which is responsible for cooperating with the task manager and local controllers. The software task wrapper is an automatically generated code that partly resides in the task manager program. The hardware task wrapper is a predesigned HDL code with generic parameters for different tasks with different consumption and production rates.

[FIG6] The hardware design structure in each PIPE. (Panel labels: (a) use shift register; (b) use RAM block.)
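
The event-triggered protocol described for the task manager (a Start message posted to a PE's message board, a Finish flag posted back, and the manager polling that flag before moving on) can be sketched in software as follows. This is a simplified simulation under stated assumptions: the class `MessageBoard`, the functions `local_controller` and `task_manager`, and the three-step schedule are all hypothetical, and Python threads stand in for the PEs. It does not use the UltraSONIC API.

```python
import threading
import time


class MessageBoard:
    """Hypothetical stand-in for the message-board registers of one PE."""

    def __init__(self):
        self._lock = threading.Lock()  # mutex guarding both registers
        self._start = None             # task ID the manager wants started
        self._finish = None            # task ID the PE reports as finished

    def post_start(self, task_id):
        with self._lock:
            self._start, self._finish = task_id, None

    def take_start(self):
        with self._lock:
            task_id, self._start = self._start, None
            return task_id

    def post_finish(self, task_id):
        with self._lock:
            self._finish = task_id

    def poll_finish(self):
        with self._lock:
            return self._finish


def local_controller(board, stop, log):
    """Local task controller: run a task when Start appears, then flag Finish."""
    while not stop.is_set():
        task_id = board.take_start()
        if task_id is None:
            time.sleep(0.001)
            continue
        log.append(task_id)          # placeholder for the read-execute-write cycles
        board.post_finish(task_id)


def task_manager(schedule, boards):
    """Walk a precomputed schedule; each Finish flag is the event that
    triggers the next operation (no timers are involved)."""
    for pe, task_id in schedule:
        boards[pe].post_start(task_id)
        while boards[pe].poll_finish() != task_id:
            time.sleep(0.001)        # poll the register


boards = {"PE0": MessageBoard(), "PE1": MessageBoard()}
stop, log = threading.Event(), []
workers = [threading.Thread(target=local_controller, args=(b, stop, log))
           for b in boards.values()]
for w in workers:
    w.start()

# Illustrative schedule only: B then C on PE1, then E on PE0.
task_manager([("PE1", "B"), ("PE1", "C"), ("PE0", "E")], boards)

stop.set()
for w in workers:
    w.join()
print(log)
```

For brevity this manager waits for each task before posting the next Start, so it serializes all work. A manager closer to the one described would post Starts to several PEs and poll all boards in turn, so that tasks mapped to different PEs overlap.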

IMPLEMENTATION AND RESULTS
This section first provides implementation details of the proposed codesign framework for the UltraSONIC system. We then describe a case study for our approach based on JPEG compression and present some experimental results.

Figure 6 shows how the predesigned infrastructure for hardware tasks is implemented in the UltraSONIC PIPE. The infrastructure consists of three application-independent modules: xPEcontrol, xPEregister, and xPEtask. Based on the information on xPEregister, xPEcontrol can control operations of all hardware tasks resident in xPEtask wrappers. The total hardware overhead of this infrastructure is modest. It consumes around 10% of the XCV1000E FPGA on each PIPE.

By conforming to a set of design rules, designers can concentrate on constructing the task cores without having to worry about the interfacing or controlling mechanisms. This infrastructure facilitates fast design cycles and reduces error-prone aspects of the design process.

CASE STUDY: JPEG COMPRESSION
JPEG compression has been used as a real codesign application [15]–[16] for grey scale images, using a block size of 4 × 4 pixels. In this work, however, the standard JPEG compression [17] with block size of 8 × 8 pixels for color images is implemented using the DCT baseline method with sequential encoding.

The DAG of the JPEG compression algorithm is illustrated in Figure 7. Tasks such as Read.BMP, RGB2YCbCr, Encoder, and Write.JPG are implemented in software for convenience. The other modules (Level Shifters, 2D-DCTs, and Quantizers) can be implemented in either software or hardware. For hardware, the designed tasks are wrapped by the xPEtask wrapper with the RAM structure. Note that we employ a one-dimensional fast DCT architecture for eight pixels [18]. To deal with two-dimensional data blocks with size of 8 × 8, the DCT block is applied first in the horizontal direction, then the vertical.

[FIG7] (a) DAG of JPEG compression algorithm implemented in this work. (b) The results after clustering and partitioning.

EXPERIMENTAL RESULTS
For the software-only solution, C codes of every task module are written and run under the control of the task manager program. Table 1 shows execution times of the software-only solution for different sizes of pictures. The values are averaged from 20 runs on the UltraSONIC host, a PC with a Pentium II processor running at 450 MHz.

[TABLE 1] EXECUTION TIMES OF TASKS RUNNING IN SOFTWARE.
Image (pixels)    BMP size
400 × 300         352 KB
640 × 480         901 KB
800 × 600         1,407 KB
800 × 800         1,876 KB
1,024 × 768       2,305 KB
(Per-task execution times in ms and their percentages are not legible in this copy; the JPEG sizes read 4. KB, 18 KB, 55 KB, 63 KB, and 70 KB.)
*The average time to process one color component.

When using hardware to alleviate computational jobs in this JPEG algorithm, we can substantially reduce the processing time by around 83% on average for given image sizes (see Table 2). To accomplish this, the tasks, including Level Shifter, 2D-DCT, and Quantizer (which consume about 88% of processing time if all tasks are run in software), are all moved into hardware. This move results in substantial processing time reduction. The correct results of every image size are produced.

[TABLE 2] COMPARISONS OF PROCESSING TIME OF JPEG COMPRESSION.
Image (pixels)    Software solution    Hardware/software solution*
400 × 300         3.15 s               0.57 s
640 × 480         8.23 s               1.34 s
800 × 600         12.96 s              2.04 s
800 × 800         17.38 s              2.71 s
1,024 × 768       21.71 s              3.26 s
*Using only two PIPEs available in the system.

We also accomplish the implementation of the 8- and 16-point FFT algorithms, which contain 24 and 52 task nodes, respectively. The radix-2 butterfly node is represented as a task in the implementation. Unfortunately, the details cannot be exhibited in this article due to space constraints.

SUMMARY
This article introduces and demonstrates a task-based hardware/software codesign environment specialized for real-time video applications. Both the automated partitioning and scheduling environment (the predesigned infrastructure in the form of wrappers) and the task manager program help to provide a fast and robust route for supporting demanding applications in our codesign system. The UltraSONIC reconfigurable computer, which has been used for implementing many industrial-grade applications at SONY and Imperial College, allows us to develop a realistic system model. Many simplifying assumptions found in previous research, such as zero communication overhead and no possible resource conflicts, become unnecessary.

Current and future work includes improvement of our task execution model, which uses local memory as a shared buffer for hardware tasks on each PE. This limits the possible degree of concurrency within a PE. The task manager should also be improved for better concurrency. Furthermore, we plan to extend our ideas here to cover the SONIC-on-a-chip system [19], which would require improving our system model at different levels.

Peter Y.K. Cheung is the deputy head of the Electrical and Electronic Engineering Department at Imperial College, University of London, where he is professor of digital systems. His research interests include VLSI architectures for DSP and video processing, reconfigurable computing, embedded systems, and high-level synthes
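
The row-column scheme mentioned in the JPEG case study (a one-dimensional 8-point DCT applied first horizontally, then vertically, to each 8 × 8 block) can be illustrated with a short sketch. It assumes a naive orthonormal DCT-II in place of the fast architecture of [18], and it omits level shifting and quantization. The names `dct_1d` and `dct_2d` are illustrative only.

```python
import math

N = 8  # block edge length used by baseline JPEG


def dct_1d(x):
    """Naive orthonormal DCT-II of a sequence (O(n^2), not a fast DCT)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Row-column 2-D DCT: 1-D DCT on every row, then on every column."""
    rows = [dct_1d(r) for r in block]              # horizontal pass
    cols = [dct_1d(list(c)) for c in zip(*rows)]   # vertical pass, one column at a time
    return [list(r) for r in zip(*cols)]           # transpose back to row-major order


# A flat block has only a DC term, which makes the result easy to check.
flat = [[10.0] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))
```

Because the two passes reuse the same 1-D routine, a single hardware DCT core can serve both directions, which is consistent with the single 8-point architecture described in the case study.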
