Pipelining & Verilog - MIT


Pipelining & Verilog
6.111 Fall 2016, Lecture 9

- Division
- Latency & Throughput
- Pipelining to increase throughput
- Retiming
- Verilog Math Functions

Sequential Divider

Assume the dividend (A) and the divisor (B) have N bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single subtraction at a time and then cycle the circuit N times. This circuit works on unsigned operands; for signed operands one can remember the signs, make the operands positive, then correct the sign of the result.

[Figure: divider datapath with registers P, A (with its LSB), and B, a subtractor, and a sign test on the difference.]

    Init: P <- 0, load A and B
    Repeat N times {
        shift P/A left one bit
        temp = P - B
        if (temp >= 0) { P <- temp, A_LSB <- 1 }
        else A_LSB <- 0
    }
    Done: Q in A, R in P
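The shift-and-subtract loop can be tried out in simulation before committing to hardware. The following is a minimal sketch for unsigned operands with N = 8; the module name `restoring_div_sketch`, the function name `divide`, and the packed {quotient, remainder} return value are assumptions made for illustration and do not come from the lecture.

```verilog
// Simulation-only sketch of the shift-subtract loop (unsigned, N = 8).
// Names (restoring_div_sketch, divide) are hypothetical.
module restoring_div_sketch;
  localparam N = 8;

  // Returns {quotient, remainder}. The divisor b must be nonzero.
  function [2*N-1:0] divide(input [N-1:0] a_in, input [N-1:0] b);
    reg [N:0]   p;      // partial remainder, one extra bit for the shift
    reg [N-1:0] a;      // starts as the dividend, ends as the quotient
    reg [N:0]   temp;   // trial difference; bit N acts as the sign
    integer     i;
    begin
      p = 0;
      a = a_in;
      for (i = 0; i < N; i = i + 1) begin
        {p, a} = {p, a} << 1;        // shift P/A left one bit
        temp   = p - {1'b0, b};      // trial subtraction
        if (!temp[N]) begin          // temp >= 0: keep it, quotient bit = 1
          p    = temp;
          a[0] = 1'b1;
        end                          // otherwise P is left alone, bit stays 0
      end
      divide = {a, p[N-1:0]};
    end
  endfunction

  reg [N-1:0] q, r;
  initial begin
    {q, r} = divide(8'd100, 8'd7);
    $display("100 / 7 -> q=%0d r=%0d", q, r);   // 100 = 7*14 + 2
  end
endmodule
```

A function like this costs N subtractors if synthesized as written, which is exactly what the sequential circuit avoids; it is meant only as a reference model to compare against.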

Verilog divider.v

    // The divider module divides one number by another. It
    // produces a signal named "ready" when the quotient output
    // is ready, and takes a signal named "start" to indicate
    // that the input dividend and divider are ready.
    // sign -- 0 for unsigned, 1 for twos complement
    //
    // It uses a simple restoring divide algorithm.
    // http://en.wikipedia.org/wiki/Division_(digital)#Restoring_division
    // L. Williams, MIT '13

    module divider #(parameter WIDTH = 8) (
        input                  clk, sign, start,
        input      [WIDTH-1:0] dividend,
        input      [WIDTH-1:0] divider,
        output reg [WIDTH-1:0] quotient,
        output     [WIDTH-1:0] remainder,
        output                 ready
    );
        reg [WIDTH-1:0]   quotient_temp;
        reg [WIDTH*2-1:0] dividend_copy, divider_copy, diff;
        reg               negative_output;

        // The remainder is negated whenever the quotient is negative.
        assign remainder = (!negative_output) ?
                           dividend_copy[WIDTH-1:0] :
                           ~dividend_copy[WIDTH-1:0] + 1'b1;

        // Iteration counter. "bit" is a reserved word in SystemVerilog,
        // so compile as Verilog-2001 or rename this register.
        reg [5:0] bit;
        reg       del_ready = 1;
        assign ready = (!bit) & ~del_ready;   // one-cycle pulse when bit reaches 0

        wire [WIDTH-2:0] zeros = 0;
        initial bit = 0;
        initial negative_output = 0;

        // Blocking assignments are intentional below: diff is computed
        // and consumed within the same clock edge.
        always @(posedge clk) begin
            del_ready <= !bit;
            if (start) begin
                bit           = WIDTH;
                quotient      = 0;
                quotient_temp = 0;
                // Take absolute values when operating in signed mode.
                dividend_copy = (!sign || !dividend[WIDTH-1]) ?
                                {1'b0, zeros, dividend} :
                                {1'b0, zeros, ~dividend + 1'b1};
                divider_copy  = (!sign || !divider[WIDTH-1]) ?
                                {1'b0, divider, zeros} :
                                {1'b0, ~divider + 1'b1, zeros};
                // Result is negative when exactly one operand is negative.
                negative_output = sign &&
                    ((divider[WIDTH-1] && !dividend[WIDTH-1]) ||
                     (!divider[WIDTH-1] && dividend[WIDTH-1]));
            end
            else if (bit > 0) begin
                diff          = dividend_copy - divider_copy;
                quotient_temp = quotient_temp << 1;
                if (!diff[WIDTH*2-1]) begin      // difference is non-negative
                    dividend_copy    = diff;
                    quotient_temp[0] = 1'd1;
                end
                quotient = (!negative_output) ?
                           quotient_temp :
                           ~quotient_temp + 1'b1;
                divider_copy = divider_copy >> 1;
                bit          = bit - 1'b1;
            end
        end
    endmodule

Math Functions in Coregen

Wide selection of math functions available.

Coregen Divider

[Screenshot: Coregen divider configuration dialog. Annotation: not necessary in many applications.]

Details in data sheet.

Coregen Divider

[Screenshot: Coregen divider configuration dialog, clocks-per-division settings.]

- Choose the minimum number for the application.
- Ready For Data: needed if clocks/divide > 1.

Performance Metrics for Circuits

Circuit Latency (L): time between arrival of new input and generation of corresponding output. For combinational circuits this is just tPD.

Circuit Throughput (T): rate at which new outputs appear. For combinational circuits this is just 1/tPD or 1/L.

Coregen Divider Latency

Latency depends on the dividend width and the fractional remainder width.

Performance of Combinational Circuits

For combinational logic: L = tPD, T = 1/tPD.

[Figure: X feeds F and G; H combines F(X) and G(X) to produce P(X).]

We can't get the answer faster, but are we making effective use of our hardware at all times?

[Timing diagram: X, F(X), G(X), P(X).]

F & G are "idle", just holding their outputs stable while H performs its computation.

Retiming: A very useful transform

- Retiming is the action of moving registers around in the system.
- Registers have to be moved from ALL inputs to ALL outputs or vice versa.

Cutset retiming: a cutset intersects the edges such that cutting them would result in two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges or vice versa.

Benefits of retiming:
- Modify critical path delay
- Reduce total number of registers

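Retiming Combinational Circuits, aka "Pipelining"

[Figure: the same circuit before and after retiming, with block delays 15, 20, and 25. In the pipelined version, input X[i] enters while the output shows P(X[i-2]).]

Unpipelined:
    L = 45
    T = 1/45

Pipelined, assuming ideal registers (i.e., tPD = 0, tSETUP = 0):
    tCLK = 25
    L = 2*tCLK = 50
    T = 1/tCLK = 1/25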
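The numbers on this slide follow directly from the block delays. The worked arithmetic below assumes, as the neighboring pipeline diagram suggests, that F (15) and G (20) run in parallel and both feed H (25); the symbols t_F, t_G, and t_H are introduced here only for the illustration.

```latex
\begin{align*}
\text{Combinational:}\quad
  L &= \max(t_F, t_G) + t_H = \max(15, 20) + 25 = 45,
  & T &= \frac{1}{L} = \frac{1}{45} \\[4pt]
\text{2-stage pipeline:}\quad
  t_{CLK} &= \max(t_F, t_G, t_H) = 25,
  & L &= 2\, t_{CLK} = 50, \quad T = \frac{1}{t_{CLK}} = \frac{1}{25}
\end{align*}
```

Latency gets slightly worse (50 vs. 45) because the faster first stage must wait for the 25-unit clock, while throughput improves from 1/45 to 1/25.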

Pipeline diagrams

[Figure: X feeds F (15) and G (20); their registered outputs feed H (25), whose registered output is P(X).]

                      Clock cycle
    Pipeline stage    i       i+1        i+2          i+3          i+4
    Input             X[i]    X[i+1]     X[i+2]       X[i+3]
    F Reg                     F(X[i])    F(X[i+1])    F(X[i+2])
    G Reg                     G(X[i])    G(X[i+1])    G(X[i+2])
    H Reg                                H(X[i])      H(X[i+1])    H(X[i+2])

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.

Pipeline Conventions

DEFINITION:
- A K-Stage Pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output.
- A COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.

CONVENTION:
Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input).

ALWAYS:
The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths PLUS (input) register tPD PLUS (output) register tSETUP.

The LATENCY of a K-pipeline is K times the period of the clock common to all registers.

The THROUGHPUT of a K-pipeline is the frequency of the clock.
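Written as formulas, the three rules read as follows. The subscripts "reg" and "logic" are notation chosen here to separate register delay from combinational delay; the slide itself states the rules only in words.

```latex
\begin{align*}
t_{CLK} &\ge t_{PD,\mathrm{reg}} + \max_{\text{stages}} t_{PD,\mathrm{logic}} + t_{SETUP} \\
L &= K \cdot t_{CLK} \\
T &= \frac{1}{t_{CLK}}
\end{align*}
```

For instance, with ideal registers and a slowest stage of 8 ns, a 3-pipeline has a clock period of 8 ns, a latency of 24 ns, and a throughput of one result every 8 ns.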

Ill-formed pipelines

Consider a BAD job of pipelining:

[Figure: inputs X and Y, blocks A, B, and C, and two registers labeled 1 and 2.]

For what value of K is the following circuit a K-Pipeline? None.

Problem: Successive inputs get mixed, e.g., B(A(X[i+1]), Y[i]). This happened because some paths from inputs to outputs have 2 registers, and some have only 1!

This CAN'T HAPPEN on a well-formed K-pipeline!
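The mixing problem is easy to reproduce in a few lines of Verilog. This is a minimal sketch, not the circuit from the slide: the module name `illformed_vs_balanced` is hypothetical, and the stand-in functions A(x) = x + 1 and B(a, y) = a ^ y are arbitrary choices made only so the example compiles.

```verilog
// Hypothetical stand-ins: A(x) = x + 1, B(a, y) = a ^ y.
module illformed_vs_balanced #(parameter W = 8) (
  input              clk,
  input      [W-1:0] x, y,
  output reg [W-1:0] bad_out,    // mixes samples from different cycles
  output reg [W-1:0] good_out    // every input-to-output path has 2 registers
);
  reg [W-1:0] a_reg;   // register after A
  reg [W-1:0] y_reg;   // balancing register on the Y path

  always @(posedge clk) begin
    a_reg    <= x + 1'b1;        // stage 1: A(X)
    y_reg    <= y;               // stage 1: delay Y so it stays aligned with A(X)

    // X path has 2 registers, Y path has only 1: ill-formed.
    // This pairs A(X) from the previous cycle with the current Y.
    bad_out  <= a_reg ^ y;

    // Both paths have 2 registers: a well-formed 2-pipeline.
    good_out <= a_reg ^ y_reg;
  end
endmodule
```

The fix costs one extra register on the shorter path, which is the "back-to-back registers" situation that shows up whenever paths of different depth reconverge.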

A pipelining methodology

Step 1: Add a register on each output.

Step 2: Add another register on each output. Draw a cut-set contour that includes all the new registers and some part of the circuit. Retime by moving regs from all outputs to all inputs of the cut-set.

Repeat until satisfied with T.

STRATEGY: Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS).

[Figure: example circuit with block delays A = 4 ns, B = 3 ns, C = 8 ns, D = 4 ns, E = 2 ns, F = 5 ns. Result: T = 1/8 ns, L = 24 ns.]

Pipeline Example

[Figure: example circuit with input X and blocks with delays of 1 and 2, shown with successive pipelining contours.]

              LATENCY   THROUGHPUT
    2-pipe:   4         1/2
    3-pipe:   6         1/2

OBSERVATIONS:
- 1-pipeline improves neither L nor T.
- T improved by breaking long combinational paths, allowing a faster clock.
- Too many stages cost L, don't improve T.
- Back-to-back registers are often required to keep the pipeline well formed.

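Pipeline Example - Verilog

Lab 3 Pong
- G = game logic, 8 ns tpd
- C = draw round puck, uses a multiply with 9 ns tpd
- System clock 65 MHz = 15 ns period: oops (8 ns + 9 ns = 17 ns does not fit in one period)

[Figure: x (hcount, vcount, etc.) -> G -> y (intermediate wires) -> C -> pixel. In the pipelined version a register follows G (y2) and another follows C (pixel), both driven by clock.]

No pipeline (G and C stand for blocks of logic):

    assign y = G(x);        // logic for y
    assign pixel = C(y);    // logic for pixel

or, equivalently:

    reg [N:0] x, y;
    reg [23:0] pixel;
    always @* begin
        y = G(x);
        pixel = C(y);
    end

Pipeline:

    always @(posedge clock) begin
        ...
        y2 <= G(x);         // pipeline y
        pixel <= C(y2);     // pipeline pixel
    end

Latency = 2 clock cycles! Implications?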
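One likely implication, stated here as an assumption rather than something the slide spells out, is that any signal travelling alongside pixel to the display (for example sync and blanking signals) must be delayed by the same two clock cycles, or the image will be shifted relative to its timing. A minimal sketch follows; the module name `sync_delay2` and all port names are hypothetical.

```verilog
// Hypothetical helper: delays display timing signals by two clock cycles
// so they line up with a pixel value that takes two cycles to compute.
module sync_delay2 (
  input      clock,
  input      hsync_in, vsync_in, blank_in,
  output reg hsync_out, vsync_out, blank_out
);
  reg hsync_d1, vsync_d1, blank_d1;   // values after one cycle

  always @(posedge clock) begin
    // first register stage (matches the y2 register)
    hsync_d1  <= hsync_in;
    vsync_d1  <= vsync_in;
    blank_d1  <= blank_in;
    // second register stage (matches the pixel register)
    hsync_out <= hsync_d1;
    vsync_out <= vsync_d1;
    blank_out <= blank_d1;
  end
endmodule
```

The same idea applies to any side-band signal: every path from the timing generator to the output should pass through the same number of registers as the pixel path.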

Increasing Throughput: Pipelining

Idea: split processing across several clock cycles by dividing the circuit into pipeline stages separated by registers that hold values passing from one stage to the next.

[Figure: adder built from full-adder (FA) cells with a pipeline register inserted.]

(Throughput = 1/(4*tPD,FA) instead of 1/(8*tPD,FA))
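As a concrete instance of this idea, the sketch below splits an 8-bit addition into two 4-bit halves with a register between them. It assumes the slide's figure is an 8-bit ripple-carry adder cut in the middle, which the 1/8 vs. 1/4 tPD,FA figures suggest but the extracted text does not confirm; the module name `add8_pipe2` and its signal names are hypothetical.

```verilog
// Sketch: 8-bit adder pipelined into two 4-bit stages.
// sum is valid two clock edges after a and b are sampled.
module add8_pipe2 (
  input            clk,
  input      [7:0] a, b,
  output reg [8:0] sum
);
  reg [4:0] lo_sum;        // low-nibble sum; lo_sum[4] is the carry out
  reg [3:0] a_hi, b_hi;    // high nibbles carried forward to stage 2

  always @(posedge clk) begin
    // Stage 1: add the low nibbles, hold the high nibbles.
    lo_sum   <= a[3:0] + b[3:0];
    a_hi     <= a[7:4];
    b_hi     <= b[7:4];

    // Stage 2: add the high nibbles plus the stage-1 carry,
    // and pass the low nibble of the sum through.
    sum[3:0] <= lo_sum[3:0];
    sum[8:4] <= a_hi + b_hi + lo_sum[4];
  end
endmodule
```

Each stage now contains only a 4-bit carry chain, so the clock can run roughly twice as fast, at the cost of one extra cycle of latency and the registers that carry the high nibbles forward.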

How about tPD = 1/2 tPD,FA?

[Figure: adder with additional pipeline registers.]

Timing Reports

65 MHz ≈ 27 MHz * 2.4

[Screenshot: synthesis report.]
- Multiple: 7.251 ns
- Total propagation delay: 34.8 ns

History of Computational Fabrics

- Discrete devices: relays, transistors (1940s-50s)
- Discrete logic gates (1950s-60s)
- Integrated circuits (1960s-70s)
    - e.g. TTL packages: Data Book for 100's of different parts
- Gate Arrays (IBM 1970s)
    - Transistors are pre-placed on the chip & Place and Route software puts the chip together automatically; only program the interconnect (mask programming)
- Software Based Schemes (1970's-present)
    - Run instructions on a general purpose core
- Programmable Logic (1980's to present)
    - A chip that can be reprogrammed after it has been fabricated
    - Examples: PALs, EPROM, EEPROM, PLDs, FPGAs
    - Excellent support for mapping from Verilog
- ASIC Design (1980's to present)
    - Turn Verilog directly into layout using a library of standard cells
    - Effective for high-volume and efficient use of silicon area

Reconfigurable Logic

- Logic blocks: to implement combinational and sequential logic
- Interconnect: wires to connect inputs and outputs to logic blocks
- I/O blocks: special logic blocks at periphery of device for external connections

Key questions:
- How to make logic blocks programmable? (after chip has been fabbed!)
- What should the logic granularity be?
- How to make the wires programmable? (after chip has been fabbed!)
- Specialized wiring structures for local vs. long distance routes?
- How many wires per logic block?

Programmable Array Logic (PAL)

- Based on the fact that any combinational logic can be realized as a sum-of-products
- PALs feature an array of AND-OR gates with programmable interconnect

[Figure: input signals -> AND array (programming of product terms) -> OR array (programming of sum terms) -> output signals.]

RAM Based Field Programmable Logic - Xilinx

[Figure: Xilinx FPGA architecture: an array of Configurable Logic Blocks (CLBs) joined by programmable interconnect and surrounded by I/O Blocks (IOBs). Insets show the internals of an IOB (pad, input buffer, slew rate control, delay) and of a CLB.]

LUT Mapping

- N-LUT: direct implementation of a truth table, so it can realize any function of N inputs.
- N-LUT requires 2^N storage elements (latches).
- N inputs select one latch location (like a memory).

[Figure: 4-LUT example: inputs select one of the latches to drive the output. Latches are set by the configuration bitstream.]
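The "inputs select one stored bit" behaviour can be modelled in a couple of lines. This is a behavioural sketch only, not a vendor primitive; the module name `lut4_sketch` and the parameter name `INIT` are chosen here for illustration.

```verilog
// Behavioural sketch of a 4-input LUT: 2^4 = 16 stored bits,
// and the 4 inputs pick one of them, exactly like a 16x1 memory read.
module lut4_sketch #(
  parameter [15:0] INIT = 16'h8000   // truth table; default is a 4-input AND
) (
  input  [3:0] in,
  output       out
);
  wire [15:0] truth = INIT;   // contents that the bitstream would load
  assign out = truth[in];     // inputs act as the address
endmodule
```

Changing INIT changes the function without changing the hardware: for example, 16'hFFFE gives a 4-input OR and 16'h6996 gives a 4-input XOR.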

Configuring the CLB as a RAM

Memory is built using latches, not FFs.

[Figure: CLB configured as a 16x2 RAM.]

Read is the same as a LUT function!

Xilinx 4000 Interconnect

[Figure: Xilinx 4000 interconnect structure.]

Xilinx 4000 Interconnect Details

[Figure: detailed view of Xilinx 4000 interconnect.]

Wires are not ideal!

Add Bells & Whistles

[Figure labels: Hard Processor, Gigabit Serial, I/O, 18 Bit / 36 Bit / 18 Bit, ...nce Control, BRAM, Clock Mgmt.]

Courtesy of David B. Parlour, ISSCC 2004 Tutorial, "The Reality and Promise of Reconfigurable Computing in Digital Signal Processing"

The Virtex II CLB (Half Slice Shown)

[Figure: Virtex II CLB half slice.]

Adder Implementation

[Figure: half slice with inputs A, B, and Cin and outputs Y and Cout.]

- LUT: A xor B
- Y = A xor B xor Cin
- Dedicated carry logic produces Cout
- 1 half-slice = 1-bit adder
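The split between LUT and dedicated carry logic can be written out as a behavioural model. This sketch assumes the usual carry-multiplexer arrangement, in which the LUT output selects between passing Cin along and generating the carry from A; the module name `adder_bit_sketch` and the wire name `p` are hypothetical, and no vendor primitives are instantiated.

```verilog
// Behavioural sketch of one adder bit mapped onto a half slice.
module adder_bit_sketch (
  input  a, b, cin,
  output y, cout
);
  wire p = a ^ b;              // computed in the LUT ("propagate")
  assign y    = p ^ cin;       // sum bit: A xor B xor Cin
  assign cout = p ? cin : a;   // carry mux: propagate Cin, or A when A == B
endmodule
```

When A and B differ the carry simply passes through; when they are equal the carry out equals A (and B). Keeping this mux in dedicated logic lets the carry chain bypass the general-purpose routing, which is what makes wide adders fast on the FPGA.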

FPGA's

                    CLB        Dist RAM      Block RAM      Multipliers
    Virtex 2        8,448      1,056 kbit    2,592 kbit     144 (18x18)
    Virtex 6        667,000    6,200 kbit    22,752 kbit    1,344 (25x18)
    Spartan 3E      240        15 kbit       72 kbit        4 (18x18)
    Artix-7 A100    7,925      1,188 kbit    4,860 kbit     240 (25x18)

Annotations: DSP with 25x18 multiplier; Gigabit ethernet support.

Design Flow - Mapping

- Technology Mapping: Schematic/HDL to physical logic units
- Compile functions into basic LUT-based groups (function of target architecture)

[Figure: gate-level logic on inputs a, b, c, d feeding a D flip-flop, mapped into a single LUT followed by a flip-flop.]

    always @(posedge clock or negedge reset)
    begin
        if (!reset)
            q <= 0;
        else
            q <= (a & b & c) | (b & d);
    end

Design Flow - Placement & Route

- Placement: assign logic location on a particular device.
- Routing: iterative process to connect CLB inputs/outputs and IOBs. Optimizes critical path delay; can take hours or days for large, dense designs.
- Iterate placement if timing not met.
- Satisfy timing? Generate bitstream to configure device.

[Figure: LUTs placed on the device and routed.]

Challenge! Cannot use full chip for reasonable speeds (wires are not ideal). Typically no more than 50% utilization.

Example: Verilog to FPGA

    module adder64 (
        input  [63:0] a, b,
        output [63:0] sum
    );
        assign sum = a + b;
    endmodule

Flow: Synthesis -> Tech Map -> Place & Route

[Figure: 64-bit adder example placed on a Virtex II XC2V2000.]

How are FPGAs Used?

- Logic emulation
- Prototyping
    - Ensemble of gate arrays used to emulate a circuit to be manufactured
    - Get more/better/faster debugging done than with simulation
- Reconfigurable hardware
    - One hardware block used to implement more than one function
- Special-purpose computation engines
    - Hardware dedicated to solving one problem (or class of problems)
    - Accelerators attached to general-purpose computers (e.g., in a cell phone!)

[Figure: FPGA-based emulator (courtesy of IKOS).]

Summary

- FPGAs provide a flexible platform for implementing digital computing.
- A rich set of macros and I/Os is supported (multipliers, block RAMs, ROMs, high-speed I/O).
- A wide range of applications, from prototyping (to validate a design before ASIC mapping) to high-performance spatial computing.
- Interconnects are a major bottleneck (physical design and locality are important considerations).

Test Bench

Module under test:

    module sample(
        input        bit_in,
        input  [3:0] bus_in,
        output       out_bit,
        output [7:0] out_bus
    );
        . . . Verilog . . .
    endmodule

Test fixture:

    module sample_tf;
        // Inputs
        reg       bit_in;
        reg [3:0] bus_in;

        // Outputs
        wire       out_bit;
        wire [7:0] out_bus;

        // Instantiate the Unit Under Test (UUT)
        sample uut (
            .bit_in(bit_in),
            .bus_in(bus_in),
            .out_bit(out_bit),
            .out_bus(out_bus)
        );

        initial begin
            // Initialize Inputs
            bit_in = 0;
            bus_in = 0;

            // Wait 100 ns for global reset to finish
            #100;

            // Add stimulus here
        end
    endmodule
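To show what might go under "Add stimulus here", the sketch below is a complete, self-checking variant of the fixture. Since the body of sample is elided in the slide, it uses a made-up stand-in, `sample_sketch`, whose behaviour (invert the bit, duplicate the nibble) is purely an assumption so that the example can run on its own.

```verilog
`timescale 1ns / 1ps

// Hypothetical stand-in for the elided "sample" module.
module sample_sketch (
  input        bit_in,
  input  [3:0] bus_in,
  output       out_bit,
  output [7:0] out_bus
);
  assign out_bit = ~bit_in;
  assign out_bus = {bus_in, bus_in};
endmodule

module sample_sketch_tf;
  reg        bit_in;
  reg  [3:0] bus_in;
  wire       out_bit;
  wire [7:0] out_bus;
  integer    i, errors;

  sample_sketch uut (
    .bit_in(bit_in), .bus_in(bus_in),
    .out_bit(out_bit), .out_bus(out_bus)
  );

  initial begin
    errors = 0;
    bit_in = 0;
    bus_in = 0;
    #100;                                  // wait for global reset to finish

    // Stimulus: walk through all 16 bus values, toggling bit_in.
    for (i = 0; i < 16; i = i + 1) begin
      bus_in = i[3:0];
      bit_in = i[0];
      #10;                                 // let the outputs settle
      if (out_bus !== {bus_in, bus_in} || out_bit !== ~bit_in) begin
        errors = errors + 1;
        $display("MISMATCH at bus_in=%h bit_in=%b", bus_in, bit_in);
      end
    end

    $display("done, %0d errors", errors);
    $finish;
  end
endmodule
```

Comparing with `!==` rather than `!=` makes the check fail on X or Z outputs as well, which catches unconnected or uninitialized signals early.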
