DMA implementations for FPGA-based data acquisition systems
Presenter: Wojciech M. Zabołotny
Institute of Electronic Systems, Warsaw University of Technology
XL-th IEEE-SPIE Joint Symposium Wilga 2017

FPGA in DAQ
FPGA chips are a perfect solution for interfacing the FEE in DAQ systems:
– Flexible communication interfaces (either supported by dedicated cores or implementable in the programmable logic)
– Possibility to operate in hard real time. There are no problems with interrupt latencies, and fully deterministic, precise timing is achievable.
There are some disadvantages:
– High cost of an FPGA-based solution
– Difficult implementation of more complex data processing algorithms
– Difficult implementation of more complex communication protocols, especially those that involve buffering and repeated retransmission of huge amounts of data (e.g. TCP/IP)
Solution?

FPGA + "PC" in DAQ
The solution is to use a standard computer ("PC" or "ES") as early as possible in the DAQ chain.
Possible architectures include:
– Using SoCs (e.g. Xilinx Zynq, ZynqMP, Altera SoC FPGAs)
– Using FPGAs "tightly coupled" with the computer system via a high-speed interface, e.g. PCIe
The problem is the efficient delivery of data from the FPGA part to the memory of the computer.
To save CPU power for the actual processing of data, the use of DMA is advisable.

DMA solutions - embarras de richesse
There are various portable solutions available, often for free:
– https://opencores.org/project,wb_dma
– https://opencores.org/project,dma_axi
– https://opencores.org/project,virtex7_pcie_dma
The FPGA vendors provide different DMA IP cores, optimized for their FPGA hardware.
The FPGA implementation offers an exceptional opportunity to prepare a DMA system carefully adjusted to the specific requirements of the particular DAQ.
The following examples were developed for Xilinx FPGAs (7 Series or UltraScale).

The first system
The system was created for the GEM detector DAQ.
The hardware platform was the KC705 board.
The FPGA receives the data from the FEE, preprocesses it, and stores the result in the huge DDR4 memory.
The data must be read from that memory via the PCIe interface.
This solution is well suited to situations where the average data bandwidth is moderate but fluctuating.
In that architecture the natural solution was to use the AXI Central DMA Controller and the AXI Memory Mapped to PCI Express Gen2 IP cores.

Implementation of the first system
[Block diagram. Recoverable labels: FPGA based DAQ system, DDR, DDR controller, AXI, registers, PCIe block, PCIe, Computer system]

Results
The implementation can be easily performed in the Vivado Block Diagram Editor.
The Linux driver allocates the DMA buffer and allows the application to mmap it into its memory.
The theoretical throughput of AXI and PCIe was 16 Gb/s. The maximum achieved throughput was 10.45 Gb/s for writing to DDR and 8.05 Gb/s for reading from DDR.
For a continuous stream of data the memory bus may be a bottleneck.

The second system
The hardware platform was the ZCU102 board, containing both the FPGA and the ARM CPU (SoC).
The second system was created for the acquisition of data from a hardware video encoder (VSI project).
The data was delivered via an AXI4 Stream interface.
The data had to be written to the memory of the PS, connected via an AXI4 interface.
Each fragment was delivered in a separate AXI4 Stream packet, but due to the compression the packet lengths could differ.
The natural solution seemed to be the AXI DMA controller.

Topology of the second system
[Block diagram. Recoverable labels: SoC system, Video signal, Video input block, Video encoder, AXI4 Stream, DMA block, Control registers, AXI Bus, Processing system, DDR controller, DDR]

Problems
To receive a continuous stream of data, it was necessary to use the controller in circular mode.
Unfortunately, the AXI DMA Controller with the original Linux kernel didn't correctly report the length of the last transfer.
Thorough investigation showed that it may be difficult to fix the problem reliably. (The register holding the length of the transfer gets overwritten when the next transfer starts.)
A good alternative was to use the AXI Data Mover (ADM):
– The transfer commands are delivered via an AXI4 Stream interface.
– The status of each transfer is delivered back via another AXI4 Stream interface. There is no risk of losing the information about the length of the transfer!
How to feed the ADM with commands and receive the statuses?
– The AXI Streaming FIFO is a good choice.

Implementation with the Xilinx blocks
The implementation avoids the "buffer overrun" problems.
There are a few (16) DMA buffers (mmapped into the application's memory), and the transfer request for each buffer is generated in advance and written to the FIFO.
After the status of a particular transfer is received, the data is delivered to the application for processing.
Only after the application confirms that the data has been processed may the transfer request be resubmitted to the FIFO.

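The recirculation of transfer requests described on this slide can be modelled in a few lines. This is only a toy sketch, not the actual driver or hardware: the class `RequestRing` and its method names are hypothetical, and it assumes that an incoming frame is simply held back when no request is queued.

```python
from collections import deque

N_BUFFERS = 16  # number of DMA buffers quoted on the slide


class RequestRing:
    """Toy model of the command FIFO and status FIFO around the data mover."""

    def __init__(self, n=N_BUFFERS):
        self.commands = deque(range(n))  # one request per buffer, queued in advance
        self.statuses = deque()          # (buffer number, length) of finished transfers
        self.in_use = deque()            # buffers currently held by the application

    def frame_arrives(self, length):
        """Hardware side: consume one queued request, or stall if none is left."""
        if not self.commands:
            return False                 # assumption: the stream waits, nothing is overwritten
        self.statuses.append((self.commands.popleft(), length))
        return True

    def get(self):
        """Application side: take the oldest completed transfer."""
        buf, length = self.statuses.popleft()
        self.in_use.append(buf)
        return buf, length

    def confirm(self):
        """Application side: release the oldest held buffer and requeue its request."""
        self.commands.append(self.in_use.popleft())
```

Because a request returns to the command FIFO only in `confirm()`, a buffer that the application is still reading can never be the target of a new transfer.
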
Linux driver API
The DMA buffers are mapped into the application's memory. The length of a single buffer must not be smaller than the maximum length of the frame.
Communication with the driver is performed via ioctl calls:
– ADM_START - Starts the data acquisition.
– ADM_STOP - Stops the data acquisition.
– ADM_GET - Returns the number of the next available buffer with a new video frame. If no buffer is available yet, it puts the application to sleep.
– ADM_CONFIRM - Confirms that the last buffer has been processed.
– ADM_RESET - Resets the AXI Data Mover and the AXI Streaming FIFO. It is necessary before a new data acquisition is started, to ensure that no stale commands from the previous, possibly interrupted transmission are stored in those blocks.
The ADM_GET and ADM_CONFIRM ioctls ensure the appropriate synchronization of access to the DMA buffers.

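A possible application-side acquisition loop built on these ioctls is sketched below. The request names are written as C-style identifiers, and their numeric values are hypothetical placeholders (the real ones come from the driver's header). It is also an assumption that the buffer number is delivered as the ioctl return value. The `ioctl` parameter exists only so that the loop can be exercised without the hardware.

```python
import fcntl

# Hypothetical request codes; the real values are defined by the driver.
ADM_START, ADM_STOP, ADM_GET, ADM_CONFIRM, ADM_RESET = range(0x4100, 0x4105)


def acquire(fd, n_frames, process, ioctl=fcntl.ioctl):
    """Receive n_frames frames; `process` is called with each buffer number."""
    ioctl(fd, ADM_RESET)                 # drop stale commands from an earlier run
    ioctl(fd, ADM_START)
    try:
        for _ in range(n_frames):
            buf_no = ioctl(fd, ADM_GET)  # sleeps in the driver until a frame is ready
            process(buf_no)              # work directly on the mmapped buffer
            ioctl(fd, ADM_CONFIRM)       # the buffer may now be reused for DMA
    finally:
        ioctl(fd, ADM_STOP)
```

In a real program `fd` would be the opened device file of the driver, and `process` would index into the mmapped buffer area using the returned buffer number.
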
Results
The DMA system and the driver were carefully tested and are currently used in the VSI system.
Due to the specific features of the data source, no maximum throughput tests were performed.
It was found that even at the maximum frame size of 4 MB and a frame rate of 60 fps, the CPU load related to the reception of data was below 1%.

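For scale, the data rate implied by those figures can be worked out directly. The sketch assumes that "4 MB" means 4 MiB; with decimal megabytes the result is about 5% lower.

```python
FRAME_BYTES = 4 * 2**20  # assumption: 4 MB interpreted as 4 MiB
FPS = 60                 # frame rate quoted on the slide

bytes_per_s = FRAME_BYTES * FPS
gbit_per_s = bytes_per_s * 8 / 1e9

print(f"{bytes_per_s / 1e6:.1f} MB/s = {gbit_per_s:.2f} Gb/s")
```

So the sub-1% CPU load was observed while receiving roughly 2 Gb/s of data.
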
The third system
The third system combined the features of the first two.
The hardware platform was a purpose-developed Artix-7-based PCIe card.
It was the DAQ for the same GEM detector measurement system as in case 1, but now configured for continuous operation. Therefore, DDR buffering of the data was of no use.
The data was delivered via an AXI4 Stream interface, but the packets could be bigger than any reasonable single DMA buffer.
Therefore it was necessary to use another architecture.

Topology of the third system
[Block diagram. Recoverable labels: Measurement data, AXI4 Stream, FPGA based DAQ system, DMA block, Control registers, PCIe block, PCIe]
The IP core used as the DMA engine and PCIe block was the Xilinx DMA for PCIe, also known as XDMA.
The block supports 64-bit addressing on the PCIe side, so it could be used with huge (above 4 GB) sets of DMA buffers.
The block is so complex that it was practically necessary to use the driver provided by Xilinx. Unfortunately, it required certain modifications.

Driver corrections
The original driver supported cyclic transfer only with read/write operations, so no zero-copy transfer was possible.
For cyclic transfer the driver didn't implement any overrun protection:
– The driver checks the "MAGIC number" of the transfer request.
– After the transfer is finished, its status is written back to memory as a "metadata writeback" with another "MAGIC number".
– It is possible to configure the same address for the transfer request and for the writeback, so that the status overwrites the request and makes it impossible to perform the same transfer again.
– After the application has processed the data, the transfer request should be rewritten, with the "MAGIC number" written as the last word. That ensures that an overrun condition will generate a transfer error.
Another problem was related to the handling of huge data in a circular buffer.

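The overrun protection based on the two "MAGIC numbers" can be illustrated with a small model. This is a sketch of the idea only, not the XDMA descriptor layout: the two-word slot, the constant values and the function names are all hypothetical.

```python
REQUEST_MAGIC = 0xAD4B    # placeholder value marking a valid transfer request
WRITEBACK_MAGIC = 0x52B4  # placeholder value marking a status writeback


class Slot:
    """One memory slot shared by a transfer request and its status writeback."""

    def __init__(self, address):
        self.words = [address, REQUEST_MAGIC]


def engine_transfer(slot, length):
    """Engine side: refuse a slot whose request has not been rewritten."""
    if slot.words[-1] != REQUEST_MAGIC:
        raise RuntimeError("overrun: buffer not yet released by the application")
    slot.words = [length, WRITEBACK_MAGIC]  # the status overwrites the request


def rearm(slot, address):
    """Software side: rewrite the request, storing the magic word last."""
    slot.words[0] = address
    slot.words[1] = REQUEST_MAGIC
```

Writing the magic word last matters: until that final store the slot still carries the writeback magic, so the engine can never pick up a half-written request.
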
Buffer mapping
Received data are organized in structures for direct access from the C language.
The scattered DMA buffers were mapped so that they form one huge contiguous buffer in the virtual address space.
To allow efficient direct processing, caching was switched on for the buffer (so synchronization between the CPU and the DMA via ioctls was necessary).
The processing library may simply use a pointer to the data.
– But what about the cyclic buffer?

Buffer mapping (continued)
– The first solution is to use a buffer with a length of 2^N bytes and modular arithmetic to access the contents.

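The modular-arithmetic approach can be sketched as follows. The helper name `read_wrapped` is hypothetical. With a power-of-two length, the modulo operation reduces to a bit mask.

```python
def read_wrapped(buf, start, length):
    """Read `length` bytes from a cyclic buffer, wrapping at the end."""
    size = len(buf)
    mask = size - 1
    # The masking trick is valid only for power-of-two sizes.
    assert size > 0 and size & mask == 0, "buffer length must be a power of two"
    return bytes(buf[(start + i) & mask] for i in range(length))


ring = bytes(range(16))              # a 2**4-byte cyclic buffer
record = read_wrapped(ring, 14, 4)   # a record that crosses the end of the buffer
```

The cost is that every access has to go through the mask, so a record that crosses the end of the buffer cannot be handed to the processing code as a plain pointer to a structure.
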
Buffer mapping (continued)
– The solution is the "overlap mapping".

Overlap mapping
Scattered DMA buffers are mapped as one contiguous buffer in the virtual address space.
Double mapping of the beginning of the buffer ensures that each object stored in the cyclic buffer can be reliably accessed via a standard pointer as a contiguous entity.
[Figure. Physical addresses: Buffer 1, Buffer 0, Buffer 3, Buffer 5, Buffer 6, Buffer 7, Buffer 4, Buffer 2. Virtual addresses in the application: Buffer 0 to Buffer 7 in order, followed again by Buffer 0, Buffer 1, Buffer 2. Labels: "The overlap mapping", "The maximum length of the packet"]

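The effect of the overlap mapping can be imitated with a toy page table. This is a model only: in the real system the aliasing is done by the virtual memory mapping set up by the driver, not by application code. The class `OverlapView` and the tiny buffer size are hypothetical.

```python
BUF_SIZE = 4  # bytes per DMA buffer in this toy model


class OverlapView:
    """Virtual view: all buffers in order, then the first n_overlap mapped again."""

    def __init__(self, buffers, n_overlap):
        # The list holds references, so both mappings alias the same memory.
        self.pages = list(buffers) + list(buffers[:n_overlap])

    def read(self, vaddr, length):
        """Plain linear read; the caller needs no wrap-around handling."""
        out = bytearray()
        for addr in range(vaddr, vaddr + length):
            out.append(self.pages[addr // BUF_SIZE][addr % BUF_SIZE])
        return bytes(out)


buffers = [bytearray([i] * BUF_SIZE) for i in range(8)]  # buffer i is filled with value i
view = OverlapView(buffers, n_overlap=3)
packet = view.read(7 * BUF_SIZE + 2, 6)  # starts in buffer 7, continues into buffer 0
```

For this to work the doubled region has to be at least as long as the longest packet, which is presumably what the figure label about the maximum packet length refers to.
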
Results
The third system was tested with simulated data.
The achieved throughput was 14.2 Gb/s (89% of the theoretical throughput of 16 Gb/s for 4-lane PCIe Gen 2).
Long-term (28 h) tests have demonstrated error-free transmission.

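The 16 Gb/s figure and the 89% ratio can be checked with a short calculation. It uses the standard PCIe Gen 2 parameters (5 GT/s per lane, 8b/10b line coding) and ignores packet-level protocol overhead.

```python
GT_PER_LANE = 5.0    # PCIe Gen 2 signalling rate per lane, in GT/s
LINE_CODING = 8 / 10  # 8b/10b encoding: 8 payload bits per 10 transferred bits
LANES = 4

theoretical_gbps = GT_PER_LANE * LINE_CODING * LANES
efficiency = 14.2 / theoretical_gbps  # achieved throughput from the slide

print(f"theoretical: {theoretical_gbps:.1f} Gb/s, efficiency: {efficiency:.1%}")
```
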
Conclusions
Three DMA systems, adjusted to different data acquisition system architectures and different requirements, are presented.
The simplest version performs DMA transfers on request from the data-processing application.
– There are no problems related to cyclic mode, possible overruns, or synchronization between the DMA and the processing threads.
The third version is a high-performance system able to almost fully utilize the bandwidth of the PCIe bus for the delivery of a continuous stream of data over a long time.
Possible ways to work around deficiencies of the IP core design have been presented.
All presented DMA systems have been successfully synthesized, implemented and tested. They may be reused in different DAQ systems, both those based on SoC chips using only the AXI bus and PCIe-based systems with PCIe endpoint blocks.
The presented solutions are based on IP cores provided by Xilinx. However, similar blocks are also available for FPGA or SoC chips from other vendors. The described techniques used in the Linux kernel drivers should also be portable to other hardware platforms.

Thank you for your attention!