DEPLOYING QUANTIZATION-AWARE TRAINED NETWORKS

DEPLOYING QUANTIZATION-AWARE TRAINED NETWORKS USING TENSORRT
Dheeraj Peri, Jhalak Patel, Josh Park

AGENDA
QUANTIZATION IN NEURAL NETWORKS
  Post Training Quantization (PTQ)
  Quantization Aware Training (QAT)
DESIGNING QUANTIZED NETWORKS
  Train QAT network in TensorFlow
  Transforming QAT network to ONNX
ACCELERATE QUANTIZED NETWORKS WITH TENSORRT
  Optimize QAT networks with TensorRT
  Inference and evaluation

INTRODUCTION
State of the art neural networks have seen tremendous success on computer vision, natural language processing, and robotics tasks.
With millions of floating-point operations, deployment of AI models in real time is challenging.
Some of the techniques for making neural networks faster and lighter:
1) Architectural improvements
2) Designing new and efficient layers which can replace traditional layers
3) Neural network pruning, which removes unimportant weights
4) Software and hardware optimizations
5) Quantization techniques

QUANTIZATION IN NEURAL NETWORKS
Quantization is the process of converting continuous values to a discrete set of values using linear or non-linear scaling techniques.
Dequantized FP32 tensors should not deviate too much from the pre-quantized FP32 tensors.
Quantization parameters are essential for minimizing information loss when converting from higher precision to lower precision.

QUANTIZATION SCHEMES
Floating-point tensors can be converted to lower precision tensors using a variety of quantization schemes, e.g., R = s(Q - z), where R is the real value, Q is the quantized value, and s and z are the scale and zero point, the quantization parameters (q-params) to be determined.
For symmetric quantization, the zero point is set to 0. This means the real value 0.0 maps exactly to the quantized value 0.
q-params can be determined from either post training quantization or quantization aware training schemes.
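
For illustration, a minimal NumPy sketch of the symmetric scheme described above (the tensor values, the int8 range and the function names are assumptions for this example, not from the talk):

    import numpy as np

    def symmetric_quantize(x, num_bits=8):
        # Symmetric scheme: zero point is fixed at 0, so real 0.0 maps to quantized 0.
        qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
        scale = np.abs(x).max() / qmax              # per-tensor scale from the dynamic range
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # R = s * Q (zero point = 0): should stay close to the original FP32 tensor.
        return q.astype(np.float32) * scale

    x = np.random.randn(4, 4).astype(np.float32)
    q, scale = symmetric_quantize(x)
    print("max abs error:", np.abs(x - dequantize(q, scale)).max())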

POST TRAINING QUANTIZATION (PTQ)
Start with a pre-trained model and evaluate it on a calibration dataset.
Calibration data is used to calibrate the model. It can be a subset of the training data.
Calculate dynamic ranges of weights and activations in the network to compute quantization parameters (q-params).
Quantize the network using the q-params and run inference.
(Flow: pre-trained model + calibration data -> gather layer statistics -> compute q-params -> quantize model)
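
As a rough illustration of the "gather layer statistics -> compute q-params" steps, a NumPy sketch that derives a per-tensor symmetric scale from hypothetical calibration activations (real calibrators typically use histogram or entropy-based methods rather than a plain maximum):

    import numpy as np

    def compute_activation_scale(calibration_batches, num_bits=8):
        # Gather layer statistics: track the dynamic range observed over calibration data.
        max_abs = max(np.abs(batch).max() for batch in calibration_batches)
        # Compute q-params: a single per-tensor scale for symmetric int8 quantization.
        return max_abs / (2 ** (num_bits - 1) - 1)

    calibration_batches = [np.random.randn(8, 64).astype(np.float32) for _ in range(10)]
    print("activation scale:", compute_activation_scale(calibration_batches))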

QUANTIZATION AWARE TRAINING (QAT)
Start with a pre-trained model and introduce quantization ops at various layers.
Finetune it for a small number of epochs.
This simulates the quantization process that occurs during inference.
The goal is to learn the q-params, which can help to reduce the accuracy drop between the quantized model and the pre-trained model.
(Flow: pre-trained model -> add QAT ops -> finetune with QAT ops -> store q-params -> quantize model for inference)

PTQ VS QAT
PTQ                                              QAT
Usually fast                                     Slow
No re-training of the model                      Model needs to be trained/finetuned
Plug and play of quantization schemes            Plug and play of quantization schemes (requires re-training)
Less control over final accuracy of the model    More control over final accuracy since q-params are learned during training

QAT IN TENSORFLOW
TF has a quantization API which automatically adds quantization ops to a given graph:
tf.contrib.quantize.create_training_graph()
tf.contrib.quantize.create_eval_graph()
It provides tools to rewrite the original graph and add quantization ops for weights and activations.
Additional arguments need to be provided to configure the type of quantization.
We use the tf.quantization.quantize_and_dequantize (QDQ) operation for symmetric quantization:
output = round(input * scale) * inverse_scale
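
A minimal sketch of the QDQ op on a single tensor (assumes TensorFlow 1.x; the tensor values and the range are illustrative):

    import tensorflow as tf  # TF 1.x

    x = tf.constant([[0.3, -1.2], [2.5, 0.0]], dtype=tf.float32)
    # Symmetric fake quantization: values are snapped to an 8-bit grid and mapped
    # back to float, i.e. output = round(input * scale) * inverse_scale.
    x_qdq = tf.quantization.quantize_and_dequantize(
        x, input_min=-2.5, input_max=2.5,
        signed_input=True, num_bits=8, range_given=True)

    with tf.Session() as sess:
        print(sess.run(x_qdq))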

TOOLKIT
Deep Learning Examples toolkit open sourced by NVIDIA.
NGC container support with the latest features from different frameworks.
End-to-end workflow for deploying ResNet-50 with QAT in TensorRT:
1) Finetuning RN-50 with QAT
2) Post processing
3) Exporting the frozen graph
4) TF2ONNX conversion
5) TensorRT inference

STEP 1: FINETUNING RN50 WITH QAT
tf.contrib.quantize.create_training_graph adds quantization nodes to the RN-50 graph.
Quantization nodes are added at weights (conv/FC layers) and activation layers in the network.
Load the pre-trained weights, finetune the QAT model and save the new weights.
(Flow: RN-50 graph -> tf.contrib.quantize.create_training_graph -> finetuning with pretrained weights -> new weights)
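
A skeleton of this step (TF 1.x with tf.contrib; a tiny stand-in model replaces the real RN-50, and the optimizer settings and checkpoint name are assumptions):

    import numpy as np
    import tensorflow as tf  # TF 1.x with tf.contrib

    graph = tf.Graph()
    with graph.as_default():
        images = tf.placeholder(tf.float32, [None, 224, 224, 3])
        labels = tf.placeholder(tf.int64, [None])
        # Stand-in for the RN-50 forward pass: one conv block + global pooling + FC.
        net = tf.layers.conv2d(images, 8, 3, activation=tf.nn.relu)
        net = tf.reduce_mean(net, axis=[1, 2])
        logits = tf.layers.dense(net, 1000)
        loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
        # Rewrite the graph in place: fake-quant nodes are inserted at conv/FC weights
        # and at activations.
        tf.contrib.quantize.create_training_graph(input_graph=graph, quant_delay=0)
        train_op = tf.train.MomentumOptimizer(1e-4, 0.9).minimize(loss)
        saver = tf.train.Saver()

    with tf.Session(graph=graph) as sess:
        sess.run(tf.global_variables_initializer())  # in practice: restore pre-trained weights here
        x = np.random.rand(2, 224, 224, 3).astype(np.float32)
        y = np.random.randint(0, 1000, size=2)
        sess.run(train_op, feed_dict={images: x, labels: y})  # finetuning loop goes here
        saver.save(sess, "rn50_qat.ckpt")                     # save the new (QAT) weights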

STEP 2: POST PROCESSING
This step is required to ensure TensorRT builds successfully on the RN-50 QAT graph.
After finetuning, convert the final fully connected (FC) layer into a 1x1 convolution layer, preserving the same weights.
(RN-50 QAT: FC layer for 1000-class classification -> replaced with a 1x1 conv)
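
The weight transformation behind this step, as a NumPy sketch (shows only the reshape; the actual graph surgery is framework specific, and 2048 is the standard ResNet-50 feature dimension):

    import numpy as np

    # The FC layer maps a 2048-d feature vector to 1000 classes: weights are [2048, 1000].
    fc_weights = np.random.randn(2048, 1000).astype(np.float32)

    # The same mapping as a 1x1 convolution over a 1x1x2048 feature map:
    # conv weights are [kH, kW, in_channels, out_channels] = [1, 1, 2048, 1000],
    # so the values are preserved and only the shape changes.
    conv1x1_weights = fc_weights.reshape(1, 1, 2048, 1000)

    assert np.array_equal(fc_weights, conv1x1_weights[0, 0])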

STEP 3: EXPORTING FROZEN GRAPHS
Generate a frozen graph using the RN-50 QAT graph and the new weights from the finetuning stage.
This step converts the variables in the graph to constants by using the weights in the checkpoints.
Both data formats (NCHW and NHWC) can be used, although NCHW is recommended for the final graph.
(Flow: RN-50 QAT graph + new weights -> convert variables to constants -> frozen TF graph)
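
A minimal freezing sketch (TF 1.x; the checkpoint paths and the output node name are assumptions and depend on how the graph was built):

    import tensorflow as tf  # TF 1.x

    with tf.Session(graph=tf.Graph()) as sess:
        # Restore the finetuned QAT graph and its new weights from the checkpoint.
        saver = tf.train.import_meta_graph("rn50_qat.ckpt.meta")
        saver.restore(sess, "rn50_qat.ckpt")
        # Convert variables to constants so the graph ships as a single .pb file.
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names=["resnet50/output/softmax"])
        with tf.gfile.GFile("frozen_rn50_qat.pb", "wb") as f:
            f.write(frozen.SerializeToString())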

STEP 4: TF2ONNX CONVERSION
The TF2ONNX converter (https://github.com/onnx/tensorflow-onnx) transforms a TensorFlow pb file to ONNX.
It has conversion support for all common deep learning layers.
Support for QDQ layers has been added to the TF2ONNX converter for this conversion (a fake quant ops rewriter in tf2onnx).
QDQ ops store information about the dynamic ranges of the tensors. This is converted to scale and zero-point parameters during ONNX conversion.
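
A typical converter invocation on the frozen graph (the input/output tensor names and the opset are assumptions; inspect the frozen graph for the actual names):

    python -m tf2onnx.convert --input frozen_rn50_qat.pb \
        --inputs input:0 --outputs resnet50/output/softmax:0 \
        --output rn50_qat.onnx --opset 11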

STEP 5: TENSORRT INFERENCE
The generated ONNX graph with QuantizeLinear and DequantizeLinear ops is parsed using the ONNX parser available in TensorRT.
TensorRT performs several optimizations on this graph and builds an optimized engine for the specific GPU.
(Flow: ONNX graph -> TensorRT ONNX parser -> build engine (offline) -> execute at runtime)

TENSORRT INFERENCE ACCELERATOR

QUANTIZATION
Q(x, scale, zero_point) = round(x / scale) + zero_point   (affine)
Q(x, scale) = round(x / scale)                             (symmetric*)
Non-quantized op: X (fp32) -> Op -> Y (fp32)
Quantized op: X (int8), W (int8) -> Op(S_i, S_w, S_o) -> Y (fp32 / int8)
* TensorRT only supports symmetric quantization

PTQ MODEL INFERENCE
Calibration computes: per-tensor activation scales and per-channel weight scales.
Calibration quantizes: activation tensors and weights.
(Flow: FP32 model trained without QAT + calibration data -> TensorRT calibration -> engine with both quantized and non-quantized ops)
There is no control over which ops are quantized.
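
For contrast with the QAT path, a skeleton of a calibrator that feeds calibration batches to TensorRT during PTQ (assumes the TensorRT Python API with pycuda; the class name, batch shapes and file names are illustrative):

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    class RN50Calibrator(trt.IInt8EntropyCalibrator2):
        # Supplies preprocessed batches so TensorRT can compute activation scales.
        def __init__(self, batches, cache_file="calibration.cache"):
            trt.IInt8EntropyCalibrator2.__init__(self)
            self.batches = iter(batches)                  # list of float32 NCHW arrays
            self.cache_file = cache_file
            self.device_input = cuda.mem_alloc(batches[0].nbytes)

        def get_batch_size(self):
            return 1

        def get_batch(self, names):
            try:
                batch = next(self.batches)
            except StopIteration:
                return None                               # no more data: calibration done
            cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
            return [int(self.device_input)]

        def read_calibration_cache(self):
            return None                                   # no cached scales: recalibrate

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)

The calibrator is attached to the builder config (config.int8_calibrator) before building the engine; with QAT this step is unnecessary because the scales come from the QDQ ops in the graph.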

PTQ LIMITATIONS
(Diagram: op A feeding op B, and op C feeding op D; A, C and D are expected to execute in INT8, while B is expected to execute in FP32. Quantized GEMM followed by a high precision activation for accuracy, e.g. LSTM; quantized GEMM followed by a low precision activation for speed, e.g. image classification.)
For best results, the network must:
- specify where quantization and dequantization take place
- learn the best quantization scales.

QAT MODEL INFERENCE
(Diagram: X (fp32) -> Q (quantize, scale) -> int8 -> A (INT8) -> DQ (dequantize, scale) -> fp32 -> B (FP32) -> Y, a quantized GEMM followed by a high precision activation for accuracy, e.g. LSTM; and X (fp32) -> Q -> int8 -> C (INT8) -> int8 -> D (INT8) -> DQ -> fp32 -> Y, a quantized GEMM followed by a low precision activation for speed, e.g. image classification.)

QUANTIZATION OPS
ONNX::QuantizeLinear: X (fp32) -> Q -> Y (int8), y = round(x / scale_k) + zero_point_k
ONNX::DequantizeLinear: Y (int8) -> DQ -> X (fp32), x = (y - zero_point_k) * scale_k
Zero point must be 0, i.e. symmetric scaling.
Per-tensor scaling, or per-channel scaling with an arbitrary scaling axis (k).
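
The two ops expressed numerically, as a NumPy sketch of the symmetric per-channel case with scaling axis k = 0 (the weight shape and scale computation are illustrative; the ONNX ops also saturate to the int8 range, mirrored by the clip below):

    import numpy as np

    def quantize_linear(x, scale, axis=0):
        # y = round(x / scale_k), zero point fixed at 0 (symmetric scaling)
        s = np.expand_dims(scale, tuple(i for i in range(x.ndim) if i != axis))
        return np.clip(np.round(x / s), -128, 127).astype(np.int8)

    def dequantize_linear(y, scale, axis=0):
        # x = y * scale_k (zero point is 0)
        s = np.expand_dims(scale, tuple(i for i in range(y.ndim) if i != axis))
        return y.astype(np.float32) * s

    w = np.random.randn(64, 3, 3, 3).astype(np.float32)       # conv weights, 64 output channels
    scale = np.abs(w).reshape(64, -1).max(axis=1) / 127.0      # one scale per channel k
    w_int8 = quantize_linear(w, scale)
    w_recon = dequantize_linear(w_int8, scale)
    print("max abs error:", np.abs(w - w_recon).max())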

QDQ OPS INSERTIONS: RECOMMENDATION
Recommend QDQ ops insertion at the inputs of quantizable ops.
This matches QLinear/QConv semantics, i.e. low precision input, high precision output.
No complexity in deciding whether to quantize an output or not. Just don't.
Let the ops decide what precision input they want.

QDQ OPS INSERTIONS: RECOMMENDATION
Inserting QDQ ops at inputs (recommended):
Makes life easy for framework quantization tools: no special logic for Conv-BN or Conv-ReLU. Just insert QDQ in front of quantizable ops and leave the rest to the back end (TensorRT).
Makes life easy for back end optimizers (TensorRT): explicit quantization, with no implicit rule such as "quantize operator input if output is quantized".
Inserting QDQ ops at outputs (not recommended, but supported):
Some framework quantization tools have this behavior by default.
Sub-optimal performance when the network is partially quantized, i.e. not all ops are quantized.
Optimal performance when the network is fully quantized, i.e. all ops in the network are quantized.

QDQ OPS INSERTIONS: AT INPUTS
Some ops require high precision input from QConv/QLinear. Don't insert QDQ at their inputs. E.g. LayerNorm (BERT), Sigmoid, TanH (LSTM), Swish (EfficientNet).
Some ops can handle low precision input without accuracy drop. Insert QDQ at their inputs. E.g. GeLU (BERT), Softmax (BERT).
BERT large finetuned for SQuAD v1.1 (91.01 F1 in fp32):
Ops with quantized input             F1
Baseline: Linear, MM, BMM            90.66
Baseline + GeLU                      90.28
Baseline + LayerNorm after Linear    5.98
EfficientNet-b3 (81.61 top-1 in fp32):
Ops with quantized input             Top-1
Conv                                 80.28
Conv + Swish                         78.37

QDQ OPS INSERTIONS
(Diagram comparing a QAT model containing Norm and Linear ops with QDQ insertions against the resulting quantized graph with QLinear ops running in int8; weights QDQ for the Linear op is omitted to simplify the diagram.)

EXAMPLE: QAT MODEL
(Diagram placeholder: example QAT model, contrasted with a model trained without QAT.)

FINE-TUNED TF GRAPH: WITH FAKE QUANT OPS
Activation quantization is per-tensor; weight quantization can be per-tensor or per-channel.
(Diagram: Conv -> fp32 -> Relu -> fp32 -> FQ -> fp32 -> Conv)
Fake Quant ops are inserted before quantizable ops.
WLOG, FQ can be FakeQuant*, QDQV2 or QDQV3.

FINE-TUNED ONNX GRAPH: WITH QDQ OPS
(Diagram: Conv -> fp32 -> Relu -> fp32 -> Q -> int8 -> DQ -> fp32 -> Conv -> fp32)
The QDQ rewriter in the TF2ONNX converter replaces Fake Quant ops with QDQ pairs.

QDQ GRAPH OPTIMIZER: FOLD
(Diagram: Conv -> fp32 -> Relu -> fp32 -> Q -> int8 -> DQ -> fp32 -> Conv -> fp32)
Note: the QDQ graph optimizer is part of the generic TensorRT graph optimizer.

QDQ GRAPH OPTIMIZER: MATCH QUANTIZED OP AND FUSE
We fuse DQ ops with Conv, Conv with Relu, and the Q op with ConvRelu to create a QConvRelu with INT8 inputs and INT8 output.
(Diagram: Relu -> fp32 -> Q -> int8 -> DQ -> fp32 -> Conv -> fp32)
If there is no Q op available for epilog fusion, this will fuse into a QConv with FP32 output.

QDQ GRAPH OPTIMIZER: QUANTIZED INFERENCE GRAPH
(Diagram placeholder: the resulting quantized inference graph after fusion, with FP32 output.)

INFERENCE PIPELINE
Create the network with the kEXPLICIT_PRECISION flag.
Set trt.BuilderFlag.INT8 to enable INT8 precision.
Parse the ResNet-50 ONNX graph using the ONNX parser available in TensorRT and build the TensorRT engine.
Set up the test data pipeline and perform input pre-processing and resizing operations.
Run the engine on the input data. Copy the outputs of the model back to the host.
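
A condensed sketch of this pipeline (assumes a TensorRT 7.x-era Python API with pycuda; the ONNX file name, tensor shapes, workspace size and binding order are assumptions):

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    # Explicit batch + explicit precision network (the QDQ ops control the precision).
    flags = (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) | \
            (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_PRECISION))

    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open("rn50_qat.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)      # enable INT8 precision
    config.max_workspace_size = 1 << 30
    engine = builder.build_engine(network, config)

    # Run one preprocessed batch through the engine and copy the output back to the host.
    context = engine.create_execution_context()
    inp = np.random.rand(1, 3, 224, 224).astype(np.float32)   # stand-in for a real image
    out = np.empty((1, 1000), dtype=np.float32)
    d_inp, d_out = cuda.mem_alloc(inp.nbytes), cuda.mem_alloc(out.nbytes)
    cuda.memcpy_htod(d_inp, inp)
    context.execute_v2([int(d_inp), int(d_out)])
    cuda.memcpy_dtoh(out, d_out)
    print("top-1 class:", out.argmax())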

EVALUATION OF RESNET-50 QAT NETWORK
The evaluation has been performed on an RTX 2080 Ti GPU with TensorFlow 1.15. The TF network runs in FP32, whereas TensorRT inference runs in INT8 precision.
There is a slight drop in accuracy (0.15%).
Preprocessing of input images influences the final accuracy.
Runtime is significantly improved by TensorRT: around a 12x speed up.

CONCLUSION
Quantization aware training provides a new alternative for deploying networks in lower precision.
Since quantization scales are computed during training, QAT models might be less prone to accuracy drop during inference compared to PTQ networks in some cases.
We have demonstrated an end-to-end workflow for a ResNet-50 QAT model and shown that the INT8 accuracy is close to the FP32 model.
