Steve Krueger DSP Architecture Group Dallas


DMA: The Hidden Key to DSP Systems
Steve Krueger, DSP Architecture Group, Dallas

What Do DSP Systems Do?
- Process
  - filter, scale, transform, encode/decode, correlate, etc.
  - these operations are processor- and data-intensive
- Signals
  - generally continuous or nearly continuous streams of sampled real-world data
  - usually real-time

Impediments to Processing
- Can't process until data is received (often a whole block of data)
- Most algorithms make many memory accesses
- Memory access latency increases with memory size and distance from the processor
- Some operations are too specialized for a general-purpose DSP

Model
[Figure: block diagram. Blocks: Proc, DMA, large on-chip memory, off-chip memory, off-chip data source/sink. Memory latency, nearest to farthest: a few cycles, 10 cycles, 50 - 100 cycles.]

C6400 System
[Figure: block diagram. Blocks: C6400 DSP, 16 kB Mem, Correlation CoProc, McBSP, int, EDMA, 1024 kB on-chip memory, off-chip memory, off-chip data source/sink. Memory latency, nearest to farthest: a few cycles, 10 cycles, 50 - 100 cycles.]

DMA Solves Two Problems
- I/O interrupt loading slows the processor's work on its tasks
- Keeping data in close memory avoids latency and gives higher performance

Problem of Interrupt Loading
- Each peripheral usually interrupts to indicate that data is ready.
- The processor takes the interrupt by entering an interrupt handler, transferring the data, then resuming the interrupted processing.
- This interrupt sequence takes at least 20 cycles on most modern architectures, and many more on some.

Interrupts of High Speed Peripheral
- A high-speed peripheral can generate a lot of interrupts:
  - Say you have a 1 Msps ADC that produces an 8-bit conversion every microsecond.
  - If you interrupt for every conversion, you have 1 million interrupts per second.
  - If each interrupt takes 30 cycles, this uses 5% of a C6416 at 600 MHz.
  - What if your system needs 8 of these?
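The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the eight-ADC figure assumes the individual loads simply add.

```python
def interrupt_load(interrupts_per_second, cycles_per_interrupt, cpu_hz):
    """Return the fraction of CPU cycles consumed by interrupt servicing."""
    return interrupts_per_second * cycles_per_interrupt / cpu_hz

# Figures from the slide: 1 Msps ADC, 30 cycles per interrupt, 600 MHz C6416.
one_adc = interrupt_load(1_000_000, 30, 600_000_000)
eight_adcs = 8 * one_adc  # assumption: the eight loads simply add

print(f"one ADC: {one_adc:.0%} of the CPU")
print(f"eight ADCs: {eight_adcs:.0%} of the CPU")
```

Under that assumption, eight such peripherals would take roughly 40% of the processor before any signal processing is done.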

Solutions to Interrupt Problem
- Can buffer at the peripheral and interrupt less frequently. Might interrupt once per 32 conversions.
  - increases complexity and cost of the peripheral, adds latency to peripheral data, and makes error handling more complex
- Can make the peripheral a bus master and have it store data into memory.
  - increases the complexity and cost of the peripheral

Solutions (continued)
- Can have a central bus master to service any peripheral
  - keeps peripherals simple and shares the cost between all peripherals
  - This is the basic idea of DMA (Direct Memory Access)

DMA Terms
- Event: a hardware signal that initiates a transfer on a DMA channel
- Channel: one thread of transfer. It contains source and destination addresses plus control information
- Element: an 8-, 16-, or 32-bit datum
- Frame: group of elements
- Array: group of contiguous elements
- Block: group of frames or arrays

Peripheral DMA Flow
[Figure: I/O peripheral, DMA controller, and memory, with arrows labeled (1) event, (2) read data register, (3) write to memory.]
1. The peripheral signals an event to indicate it has data ready.
2. The DMA controller reads the peripheral data.
3. The DMA controller writes the data to memory.

TI EDMA Peripheral DMA Flow
[Figure: I/O peripheral, EDMA channel processor, transfer controller, and memory, with arrows labeled (1) event, (2) send transfer cmd, (3) read, (4) write.]
1. The peripheral signals an event to indicate it has data ready.
2. The EDMA controller sends a data transfer command to the data move engine.
3. The transfer controller reads the data register.
4. The transfer controller writes memory.

Interrupt Reduction
- DMA processes I/O events to accumulate buffers for processing
- Reduces DSP interrupts by a factor of 10 to 1000 and increases tolerable interrupt latency by a similar factor
[Figure: timing diagram with rows labeled Events, Transfers, Interrupts, Buffers.]
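A rough way to see where the 10 to 1000 times figure comes from: if the DMA gathers N events into a buffer and interrupts once per buffer, the CPU interrupt rate drops by N and the time available to respond grows by about N. The sketch below assumes one interrupt per full buffer and uses hypothetical function names.

```python
def cpu_interrupt_rate(event_rate_hz, events_per_buffer):
    """Interrupts per second when the CPU is interrupted once per full buffer."""
    return event_rate_hz / events_per_buffer

def buffer_period_s(event_rate_hz, events_per_buffer):
    """Time to fill one buffer, a rough bound on tolerable interrupt latency."""
    return events_per_buffer / event_rate_hz

EVENT_RATE_HZ = 1_000_000  # the 1 Msps ADC used earlier in the talk

for n in (1, 10, 100, 1000):  # n = 1 is the interrupt-per-sample case
    rate = cpu_interrupt_rate(EVENT_RATE_HZ, n)
    period_us = buffer_period_s(EVENT_RATE_HZ, n) * 1e6
    print(f"{n:5d} events/buffer: {rate:9.0f} interrupts/s, {period_us:7.1f} us per buffer")
```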

Memory Latency Reduction
- A program's performance is reduced when it accesses long-latency memories.
- Two ways a processor can be implemented to handle the wait:
  - overall stall: all processor activity waits
  - dependence stall: only the operations that depend on the pending access wait


Cost of Dependence Stall
- Dependence stall is expensive to implement. These are the techniques of out-of-order CPU architecture.
- Must examine a large "window" of instructions to find those (few) that aren't dependent on any pending operation.
- Must keep a time-dependent set of processor state (GPRs) so each instruction can run in its proper environment.
- Must discard or rewind state on an exception or other interruption.

Memory Stalls in DSP Systems
- Dependence stall (out-of-order) techniques are too expensive for DSP systems.
- They also make it harder to predict real-time behavior.
- No DSPs have used these techniques.
- Instead, all DSPs have used total stall during memory latency.
- Inexpensive, but can hurt performance.
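To get a feel for why total stall hurts, here is a deliberately crude model: every memory access stalls the whole processor for the full latency, so execution time is compute cycles plus accesses times latency. The workload numbers and the exact latencies are assumptions chosen for illustration (2 cycles stands in for "a few cycles" and 75 for the 50 - 100 cycle range); they are not measurements from the talk.

```python
def total_stall_cycles(compute_cycles, memory_accesses, latency_cycles):
    """Cycles for a loop when every access stalls the whole processor."""
    return compute_cycles + memory_accesses * latency_cycles

# Assumed latencies, loosely based on the ranges quoted in the system model.
ASSUMED_LATENCY = {"local": 2, "on-chip": 10, "off-chip": 75}

# Hypothetical workload: 1000 cycles of arithmetic and 400 memory accesses.
COMPUTE_CYCLES = 1000
MEMORY_ACCESSES = 400

estimates = {
    where: total_stall_cycles(COMPUTE_CYCLES, MEMORY_ACCESSES, latency)
    for where, latency in ASSUMED_LATENCY.items()
}
for where, cycles in estimates.items():
    print(f"{where:8s}: {cycles:6d} cycles")
```

Even this simple model shows the same trend as the talk argues: the same code slows down sharply as its data moves farther from the processor.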

Effects of Memory Latency
[Figure: bar chart of execution time (axis 0 to 700) for data in local, on-chip, and off-chip memory.]

Cache Memory
- Cache memory is a well-known way to keep frequently accessed information in a local memory for low-latency access.
- But cache has problems in some DSP applications:
  - Data isn't always reused, or has limited reuse
  - Cache only loads data when it is requested the first time
  - Cache line size may not be a good match for data size

Effects of Cache Memory
[Figure: bar chart of execution time (axis 0 to 100) for data in local, on-chip, and off-chip memory.]

Addressable Local RAM
- Most newer TI DSPs let local RAM be configured as either cache memory or addressable RAM.
- If a programmer is clever and uses DMA, local RAM can be very effective in many DSP applications, achieving near-ideal processing rates.

Stream Processing
- Data supply is an infinite (or practically infinite) stream.
- The arrival of data is predictable.
- Each datum or block of data is processed the same way.
- The processing is very regular and predictable.
A problem with these characteristics allows accurate prediction of which data will be needed and when.

Basic Plan
- Use a block algorithm to process buffers of input samples.
- Wrap the buffer processing loop with a data transfer loop.
- Perform transfers in parallel with processing.
- Start inbound transfers ahead of processing so they complete before the data is needed.
- Perform outbound transfers once processing is complete.
This is a job for DMA!

Streaming Prefetch Loop Nest
[Figure: nested structure of the application.]
- Setup and initialization
- Data transfer loop
  - Outer loop to prefetch source buffers using DMA and double buffering
  - Inner loop (the algorithm) performs the algorithm on a buffer, producing an output buffer
  - Outer loop to post-store the output buffers using DMA and double buffering
- Cleanup and exit

Data Transfer Loop
[Figure: flowchart. Recoverable labels: start; "Prev ibuf arrived?" (N branch); "Prev obuf done?" (N branch); to alg; loop to start.]
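The loop nest described above can be sketched as a ping-pong scheme over two input and two output buffers. The sketch below is a sequential Python model, not DSP code: the points where real hardware would start a DMA transfer and carry on are marked in comments, and the function names and the stand-in algorithm are assumptions made for illustration.

```python
def scale_by_two(block):
    """Stand-in for the real block algorithm."""
    return [2 * x for x in block]

def stream(source_blocks, algorithm):
    """Process a stream of blocks with double-buffered input and output."""
    blocks = iter(source_blocks)
    ibuf = [next(blocks, None), None]  # prefetch buffer 0 before the loop starts
    obuf = [None, None]
    sink = []                          # stands in for the off-chip data sink
    cur = 0
    while ibuf[cur] is not None:
        other = 1 - cur
        # Start the inbound transfer for the next buffer (a DMA start on hardware).
        ibuf[other] = next(blocks, None)
        # Post-store the previous output buffer (also a DMA start on hardware).
        if obuf[other] is not None:
            sink.append(obuf[other])
            obuf[other] = None
        # Process the current buffer while those transfers would be in flight.
        obuf[cur] = algorithm(ibuf[cur])
        cur = other
    # Cleanup: drain the last output buffer.
    last = 1 - cur
    if obuf[last] is not None:
        sink.append(obuf[last])
    return sink

result = stream([[1, 2], [3, 4], [5, 6]], scale_by_two)
print(result)
```

On real hardware the two "start" steps return immediately and the loop must check that the previous inbound and outbound transfers have finished before reusing a buffer, which is what the two decision boxes in the flowchart appear to represent.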

Prepare and Start DMA
- To prepare and start a DMA transfer, need to:
  - write a channel program and store it in the DMA controller
  - signal an event to trigger that channel program
- This can be slow
  - Might write and store once, then reuse
  - Might have a faster means for quick transfers

Quick DMA
- C6x1x processors have QDMA
- QDMA is a set of control registers that initiates a TC transfer immediately.
- Need only supply:
  - source address
  - destination address
  - a transfer length
  - and control bits


C6400 System (really)
[Figure: block diagram. Blocks: C6400 DSP, 16 kB RAM/, Correlation CoProc, McBSP, int, EDMA Controller, Transfer Controller, EMIF, 1024 kB on-chip RAM/, off-chip memory, off-chip data source/sink.]

EDMA Highlights
- 16 channels
  - Each channel may chain multiple transfers directed by parameter sets
  - 69 additional parameter sets for reload and linked transfers
  - Performs all transfer types done by the C6x0x DMA
  - Programmed via a dedicated parameter RAM (PaRAM)
- Highly efficient Transfer Controller (TC)
  - Crossbar architecture processes multiple transfers concurrently
  - Fully pipelined cycle-by-cycle prioritization for low channel turnaround
- TC services all DSP and cache memory requests in addition to EDMA channel program transfers

EDMA Controller Features
- High performance: single-cycle throughput
- 2KB parameter RAM stores up to 85 transfer entries
- 16 channels programmable for:
  - Element size (byte, half-word, word)
  - Src/Dst addressing modes
  - Transfer type (2D or non-2D)
  - Priority of the transfer
  - Linked transfers
  - Chaining channels with one event
- Up to 16 sync events (from external device or peripheral)
- Generates a CPU interrupt upon transfer completion
- Emulation and endian support

EDMA Controller Concepts
- Terms
  - Element: 8-, 16-, or 32-bit
  - Frame: group of elements
  - Array: group of contiguous elements
  - Block: group of frames or arrays
  - 2-D transfer: block transfer of arrays
  - Non-2D transfer: block transfer of frames
- 2KB parameter RAM
  - Stores transfer parameters for 16 channels
  - Stores reload parameters for up to 69 entries
  - Each entry comprises 6 words and is always aligned on a 24-byte boundary
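The entry counts quoted here follow directly from the RAM size and the 6-word entry format, as the short calculation below shows. It assumes 32-bit (4-byte) words; the constant and function names are only illustrative.

```python
PARAM_RAM_BYTES = 2 * 1024   # 2KB parameter RAM
WORDS_PER_ENTRY = 6
BYTES_PER_WORD = 4           # assumption: 32-bit words
CHANNEL_ENTRIES = 16         # one entry per channel

ENTRY_BYTES = WORDS_PER_ENTRY * BYTES_PER_WORD             # 24-byte entries
TOTAL_ENTRIES = PARAM_RAM_BYTES // ENTRY_BYTES              # whole entries that fit
RELOAD_ENTRIES = TOTAL_ENTRIES - CHANNEL_ENTRIES            # left for reload/link
LEFTOVER_BYTES = PARAM_RAM_BYTES - TOTAL_ENTRIES * ENTRY_BYTES

def entry_offset(index):
    """Byte offset of parameter entry `index` from the start of the PaRAM."""
    if not 0 <= index < TOTAL_ENTRIES:
        raise IndexError(f"entry {index} is outside the parameter RAM")
    return index * ENTRY_BYTES

print(TOTAL_ENTRIES, RELOAD_ENTRIES, LEFTOVER_BYTES)
print(hex(entry_offset(15)), hex(entry_offset(16)), hex(entry_offset(84)))
```

The 8 bytes left over are too small to hold another entry, which is consistent with the small unused scratch area shown in the parameter RAM map.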

Programmable Addressing
- Src and/or Dst address can:
  - Remain static
  - Increment
  - Decrement
  - Be modified by signed index values
  - Be replaced with link parameters
- Independently programmable for source and destination
- Indexing allows different strides between elements and between frames
- Allows:
  - 2D block transfers on a single event
  - Circular buffering via linked list
  - Data sorting/interleaving

EDMA Architecture
[Figure: block diagram. HPI requests, system events (McBSP requests, timers, /EINTz, CPU initiated), and L2/QDMA requests enter through TR nodes; the EDMA feeds the request queues of the TC, which drives the I/O ports: EMIF, McBSPs, internal memory, HPI.]
- EDMA architecturally has three sections
  - EDMA controller and parameter RAM
  - Transfer Crossbar (TC)
  - Transfer Request (TR) nodes

EDMA Controller and PaRAM
[Figure: events (serial ports, FIFO AF/AE, external devices) enter a priority encoder and finite state machine; the EDMA parameter RAM holds Channel 0 to Channel N params, Reload Channel 0 to Reload Channel N params, and an unused scratch area; a TR node connects upstream TR nodes to downstream TR nodes (TC).]
- Captures events from all system DMA requesters
- Simultaneous events serviced via priority encoder
- FSM reads parameter block from dedicated 2KB PaRAM
- Formatted to create a TRP (Transfer Request Packet), and sent to TC via TR node
- Parameter updates and linking happen while TC performs I/O
- Essentially a multi-threaded, special-purpose processor

EDMA Transfer Parameters
Event N parameters (six 32-bit words; where two fields are listed, the first occupies bits 31-16 and the second bits 15-0):
  Word 0   OPTIONS
  Word 1   SRC ADDRESS
  Word 2   ARRAY/FRAME COUNT | ELEMENT COUNT
  Word 3   DST ADDRESS
  Word 4   ARRAY/FRAME INDEX | ELEMENT INDEX
  Word 5   ELEMENT COUNT RELOAD | LINK ADDRESS
Options field:
  Bits 31-29   PRI
  Bits 28-27   ESIZE
  Bit  26      2DS
  Bits 25-24   SUM
  Bit  23      2DD
  Bits 22-21   DUM
  Bit  20      TCINT
  Bits 19-16   TCC
  Bits 15-2    Resv
  Bit  1       LINK
  Bit  0       FS
Parameter RAM map:
  Event 0 parameters                    words 3 to 5 at 0x01A0000C, 0x01A00010, 0x01A00014
  0x01A00018 to 0x01A0002C              Parameters for Event 1
  ...
  0x01A00168 to 0x01A0017C              Parameters for Event 15
  First reload/link entry (Event N)     words 3 to 5 at 0x01A0018C, 0x01A00190, 0x01A00194
  ...
  0x01A007E0 to 0x01A007F7              Reload parameters for Event Z
  0x01A007F8 to 0x01A007FF              Unused RAM (scratch pad area)
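As a sketch of how an Options word is assembled, the snippet below packs and unpacks the bit fields. The layout used (PRI 31-29, ESIZE 28-27, 2DS 26, SUM 25-24, 2DD 23, DUM 22-21, TCINT 20, TCC 19-16, LINK 1, FS 0) is a reading of the figure on this slide and should be checked against the device documentation before use; the helper names are hypothetical. As a sanity check, the Options value 0x23000000 shown in EDMA Example #1 decodes under this layout to high priority with an indexed source and a fixed destination, which matches that example's description.

```python
# (least significant bit, width) per field, as read from the slide's figure.
OPT_LAYOUT = {
    "PRI": (29, 3), "ESIZE": (27, 2), "2DS": (26, 1), "SUM": (24, 2),
    "2DD": (23, 1), "DUM": (21, 2), "TCINT": (20, 1), "TCC": (16, 4),
    "LINK": (1, 1), "FS": (0, 1),
}

def pack_options(values):
    """Build a 32-bit Options word from a {field name: value} mapping."""
    word = 0
    for name, value in values.items():
        lsb, width = OPT_LAYOUT[name]      # KeyError for an unknown field
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lsb
    return word

def unpack_options(word):
    """Split a 32-bit Options word back into its fields."""
    return {name: (word >> lsb) & ((1 << width) - 1)
            for name, (lsb, width) in OPT_LAYOUT.items()}

example_1 = pack_options({"PRI": 1, "SUM": 0b11, "DUM": 0b00})
print(hex(example_1))
print(unpack_options(0x23000000))
```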

EDMA Transfer Priority
- Channels have no priorities; instead, their transfer parameters have a programmable priority
- 3 priority levels available:
  - Level 0 / Urgent: NOT valid for EDMA transfers
  - Level 1 / High: used by EDMA/HPI transfers
  - Level 2 / Low: used by EDMA transfers
- Level 1 and 2 priorities are independently programmable for 16 channels when competing for:
  - EMIF
  - Peripherals
  - L2 SRAM
- Priority Queue Status Register indicates if a priority queue is empty

EDMA: Interrupt Generation
- Generates a single CPU interrupt (EDMA INT) for all 16 channels
- The Transfer Completion Code (TCC: 0-15) specified for each channel sets the relevant Channel Interrupt Pending bit in CIPR (assuming that the relevant CIER bit is set)
  - Multiple channels can have the same TCC: same ISR for different events
- The Channel Complete Interrupt is generated when the channel executes the transfer to completion, not when the transfer request is submitted
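A small software model may help show how one shared interrupt is fanned out by completion code. This is a sketch, not a driver: the class and method names are invented, and it assumes that a completion always latches its pending bit while the CPU is interrupted only when the matching enable bit is set.

```python
class EdmaInterruptModel:
    """Toy model of CIPR/CIER handling for the single EDMA interrupt."""

    def __init__(self):
        self.cipr = 0        # channel interrupt pending bits, one per TCC
        self.cier = 0        # channel interrupt enable bits, one per TCC
        self.handlers = {}   # TCC -> callback

    def enable(self, tcc, handler):
        self.cier |= 1 << tcc
        self.handlers[tcc] = handler

    def transfer_complete(self, tcc):
        """Called when a transfer finishes with completion code `tcc`."""
        self.cipr |= 1 << tcc
        if self.cier & (1 << tcc):
            self.isr()       # stands in for the single EDMA INT to the CPU

    def isr(self):
        """Shared handler: dispatch and clear every enabled pending bit."""
        for tcc in range(16):
            bit = 1 << tcc
            if self.cipr & self.cier & bit:
                self.cipr &= ~bit
                self.handlers[tcc](tcc)

serviced = []
model = EdmaInterruptModel()
model.enable(5, serviced.append)
model.transfer_complete(5)   # one channel using TCC 5
model.transfer_complete(5)   # a second channel sharing TCC 5
model.transfer_complete(7)   # TCC 7 is not enabled: stays pending, no ISR
print(serviced, bin(model.cipr))
```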

TC Architecture
- TR packets are placed into one of three queues (0 highest, 2 lowest)
- Transfers are performed in order within each queue
- TC pipeline processes each set of TR parameters to perform accesses
- All three queues can be active simultaneously

Typical EDMA Flow
[Figure: flow diagram. Recoverable labels: McBSP, REVT, EDMA, TR, TC queues Q0, Q1, Q2, SRC, DST, RD cmd, pre-WR cmd, WR cmd, data, L2/internal.]

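EDMA Example #1
[Figure: the source holds three frames (0, 1, 2) of four elements each; elements are spaced by EIX and frames by FIX. Every element is written to the same destination address.]
Parameter set:
  Options                0x23000000
  Source address
  fr cnt 0x2 | el cnt 0x4
  Destination address
  fr index FIX | el index EIX
  Don't care | Don't care
Options bits: 31-29 = 001, 28-27 = 00, 26 = 0, 25-24 = 11, 23 = 0, 22-21 = 00, 20 = 0, 19-16 = 0000, followed by the reserved, LINK, and FS bits
- Source elements are spaced by EIX and frames by FIX
- All destination transfers go to a single address
- EDMA supports linear, fixed, decrement, and indexed addressing modes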
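The source addressing in this example can be sketched as a pair of nested loops. Two things here are assumptions rather than statements from the slide: FIX is treated as the offset from the start of one frame to the start of the next (the exact rule on the device may differ by synchronization mode), and the start address and stride values are made up. The frame count field of 0x2 is taken to mean three frames, as drawn in the figure.

```python
def indexed_addresses(start, frames, elements, eix, fix):
    """List source addresses for frames of elements spaced by EIX and FIX.

    Assumption: FIX is measured from the start of one frame to the next.
    """
    addresses = []
    for frame in range(frames):
        frame_start = start + frame * fix
        for element in range(elements):
            addresses.append(frame_start + element * eix)
    return addresses

# Hypothetical values: source at 0x1000, EIX = 8 bytes, FIX = 0x100 bytes.
src = indexed_addresses(0x1000, frames=3, elements=4, eix=8, fix=0x100)
dst = [0x2000] * len(src)   # fixed destination: every write hits one address

for s, d in zip(src, dst):
    print(f"read {s:#06x} -> write {d:#06x}")
```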

EDMA Example #2 - Linking
[Figure: the McBSP (fixed source address) supplies 1 element per event occurrence. Elements are written with indexed addressing (EIX between elements, FIX between frames) into Buff 1 and then, through a frame link, into Buff 2. Frame count 3 (N - 1), element count 4.]
- Double buffering can be easily set up using the linking feature of the EDMA
- Each buffer can create an interrupt to the CPU to inform it that data is ready (alternatively, an EDMA transfer can set a S/W-visible flag)
- Increased numbers of buffers can easily be added through EDMA parameters
- Many exotic combinations are possible using EDMA options, linking, and reload

EDMA Example #2 - Linking
[Figure: three linked parameter sets, each with an optional interrupt to the CPU.]
- Initial setup (1 of the top 16 entries): Options (2DS 0, 2DD 0, FS 0; SRC DIR 00 (fixed), DST DIR 11 (indexed)); source McBSP; FM CT 3, EL CT 4; destination Buff 1; FIX, EIX; El CT Reload 4; Link Address
- Linked set (anywhere in EDMA RAM): same options, source, counts, and indexes; destination Buff 2; El CT Reload 4; Link Address
- Linked set: same options, source, counts, and indexes; destination Buff 1; El CT Reload 4; Link Address
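The effect of linking can be modeled in a few lines: when the active parameter set runs out, the channel loads the set named by its link field and carries on. The sketch below is a simplified model with invented names. It uses 4 frames of 4 elements (frame count field 3, element count 4) and assumes that the third set links back to the Buff 2 set so the two buffers keep alternating; the slide does not spell out that last link.

```python
from dataclasses import dataclass

@dataclass
class ParamSet:
    dst: str       # destination buffer name
    frames: int    # number of frames (the slide's count field holds N - 1)
    elements: int  # elements per frame
    link: int      # index of the parameter set loaded when this one completes

def run_channel(params, first, samples):
    """Deliver one element per event, following links as sets complete."""
    completed = []   # (buffer name, contents) in completion order
    active = first
    fill = []
    for sample in samples:
        fill.append(sample)
        current = params[active]
        if len(fill) == current.frames * current.elements:
            completed.append((current.dst, fill))  # optional CPU interrupt here
            fill = []
            active = current.link                  # reload from the linked set
    return completed

params = {
    0: ParamSet("Buff 1", 4, 4, link=1),  # initial setup in a channel entry
    1: ParamSet("Buff 2", 4, 4, link=2),  # linked set elsewhere in the PaRAM
    2: ParamSet("Buff 1", 4, 4, link=1),  # assumed to link back to the Buff 2 set
}
completed = run_channel(params, 0, range(48))
order = [dst for dst, _ in completed]
print(order)
```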

Quick DMA (QDMA)
QDMA registers (where two fields are listed, the first occupies bits 31-16 and the second bits 15-0):
  0x02000000   QDMA OPT     QDMA Options
  0x02000004   QDMA SRC     Source Address
  0x02000008   QDMA CNT     Line Count | Element Count
  0x0200000C   QDMA DST     Destination Address
  0x02000010   QDMA IDX     Line/Frame Index | Element Index
Pseudo (submit) registers, mapped onto the same physical register set:
  0x02000020   QDMA S OPT   QDMA Options
  0x02000024   QDMA S SRC   Source Address
  0x02000028   QDMA S CNT   Line Count | Element Count
  0x0200002C   QDMA S DST   Destination Address
  0x02000030   QDMA S IDX   Line/Frame Index | Element Index
QDMA options field: PRI (31-29), ESZ (28-27), 2DS (26), SUM (25-24), 2DD (23), DUM (22-21), TCINT (20), TCC (19-16), Reserved (15-1), FS (0)
- QDMAs can be submitted by the CPU as a "fire-n-forget" type block transfer
- QDMA registers are accessible in a single cycle
- Pseudo-mapping of the submit registers allows back-to-back similar transfers to be submitted in a single cycle
- QDMAs support all the same addressing modes as the EDMA
- QDMAs do not support linking, but may be "chained" to EDMA events

QDMA Programming Model
- Initial requests
  - Perform 4 writes to QDMA registers to set up parameters
  - Perform a fifth write to a QDMA pseudo register to set up the fifth parameter and automatically submit the transfer request
- Subsequent requests
  - Write only the changing parameters to QDMA pseudo registers to update parameters and submit the request
- Interrupts and chaining are supported exactly as with the EDMA
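The programming model above can be illustrated with a small register model. This is a sketch in Python rather than register-level code: the class and method names are invented, the addresses and values are hypothetical, and it assumes that every write to a pseudo register submits a request built from the current register contents.

```python
class QdmaModel:
    """Toy model of the QDMA registers and their submit (pseudo) aliases."""

    REGISTERS = ("OPT", "SRC", "CNT", "DST", "IDX")

    def __init__(self):
        self.regs = dict.fromkeys(self.REGISTERS, 0)
        self.submitted = []   # snapshots of each submitted transfer request

    def write(self, name, value):
        """Write a QDMA register without submitting anything."""
        if name not in self.regs:
            raise KeyError(name)
        self.regs[name] = value

    def write_pseudo(self, name, value):
        """Write through the pseudo register: update, then submit."""
        self.write(name, value)
        self.submitted.append(dict(self.regs))

qdma = QdmaModel()

# Initial request: four plain writes, then a fifth write that submits.
qdma.write("OPT", 0x21000000)        # hypothetical options value
qdma.write("SRC", 0x80000000)        # hypothetical off-chip source address
qdma.write("CNT", (0 << 16) | 256)   # line count | element count
qdma.write("IDX", 0)
qdma.write_pseudo("DST", 0x00002000) # hypothetical on-chip destination

# Subsequent request: only the changed parameter, written via the pseudo register.
qdma.write_pseudo("DST", 0x00002400)

print(len(qdma.submitted), [hex(t["DST"]) for t in qdma.submitted])
```

The second request reuses the options, source, count, and index left in the registers, which is what makes back-to-back similar transfers cheap to issue.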

Conclusion
- Two ways that DMA is important in DSP systems:
  - Lower interrupt rate and therefore lower overhead from interrupts
  - Scheduling data into low-latency memories
- These have convinced DSP vendors to provide sophisticated DMA controllers in their on-chip systems.
