DEPLOYING QUANTIZATION-AWARE TRAINED NETWORKS

DEPLOYING QUANTIZATION-AWARE TRAINED NETWORKS USING TENSORRT
Dheeraj Peri, Jhalak Patel, Josh Park

AGENDA
QUANTIZATION IN NEURAL NETWORKS
  Post Training Quantization (PTQ)
  Quantization Aware Training (QAT)
DESIGNING QUANTIZED NETWORKS
  Train QAT network in TensorFlow
  Transforming QAT network to ONNX
ACCELERATE QUANTIZED NETWORKS WITH TENSORRT
  Optimize QAT networks with TensorRT
  Inference and evaluation

INTRODUCTION
State of the art neural networks have seen tremendous success on computer vision, natural language processing, and robotics tasks.
With millions of floating-point operations, deployment of AI models in real time is challenging.
Some of the techniques for making neural networks faster and lighter:
1) Architectural improvements
2) Designing new and efficient layers which can replace traditional layers
3) Neural network pruning, which removes unimportant weights
4) Software and hardware optimizations
5) Quantization techniques

QUANTIZATION IN NEURAL NETWORKS
Quantization is the process of converting continuous values to a discrete set of values using linear or non-linear scaling techniques.
Dequantized FP32 tensors should not deviate too much from the pre-quantized FP32 tensors.
Quantization parameters are essential for minimizing information loss when converting from higher precision to lower precision.

QUANTIZATION SCHEMES
Floating-point tensors can be converted to lower precision tensors using a variety of quantization schemes, e.g., R = s(Q - z), where R is the real value, Q is the quantized value, and s and z are the scale and zero point, the quantization parameters (q-params) to be determined.
For symmetric quantization, the zero point is set to 0. This means the real value 0.0 maps exactly to the quantized value 0.
q-params can be determined from either post training quantization or quantization aware training schemes.
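
For illustration, a minimal NumPy sketch of the symmetric scheme described above (the tensor values, the int8 range and the function names are assumptions for this example, not from the talk):

    import numpy as np

    def symmetric_quantize(x, num_bits=8):
        # Symmetric scheme: zero point is fixed at 0, so real 0.0 maps to quantized 0.
        qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
        scale = np.abs(x).max() / qmax              # per-tensor scale from the dynamic range
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # R = s * Q (zero point = 0): should stay close to the original FP32 tensor.
        return q.astype(np.float32) * scale

    x = np.random.randn(4, 4).astype(np.float32)
    q, scale = symmetric_quantize(x)
    print("max abs error:", np.abs(x - dequantize(q, scale)).max())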

POST TRAINING QUANTIZATION (PTQ)
Start with a pre-trained model and evaluate it on a calibration dataset.
Calibration data is used to calibrate the model. It can be a subset of the training data.
Calculate dynamic ranges of weights and activations in the network to compute quantization parameters (q-params).
Quantize the network using the q-params and run inference.
(Flow: pre-trained model + calibration data -> gather layer statistics -> compute q-params -> quantize model)
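
As a rough illustration of the "gather layer statistics -> compute q-params" steps, a NumPy sketch that derives a per-tensor symmetric scale from hypothetical calibration activations (real calibrators typically use histogram or entropy-based methods rather than a plain maximum):

    import numpy as np

    def compute_activation_scale(calibration_batches, num_bits=8):
        # Gather layer statistics: track the dynamic range observed over calibration data.
        max_abs = max(np.abs(batch).max() for batch in calibration_batches)
        # Compute q-params: a single per-tensor scale for symmetric int8 quantization.
        return max_abs / (2 ** (num_bits - 1) - 1)

    calibration_batches = [np.random.randn(8, 64).astype(np.float32) for _ in range(10)]
    print("activation scale:", compute_activation_scale(calibration_batches))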

QUANTIZATION AWARE TRAINING (QAT)
Start with a pre-trained model and introduce quantization ops at various layers.
Finetune it for a small number of epochs.
This simulates the quantization process that occurs during inference.
The goal is to learn the q-params, which can help to reduce the accuracy drop between the quantized model and the pre-trained model.
(Flow: pre-trained model -> add QAT ops -> finetune with QAT ops -> store q-params -> quantize model for inference)

PTQ VS QAT
PTQ                                              QAT
Usually fast                                     Slow
No re-training of the model                      Model needs to be trained/finetuned
Plug and play of quantization schemes            Plug and play of quantization schemes (requires re-training)
Less control over final accuracy of the model    More control over final accuracy since q-params are learned during training

QAT IN TENSORFLOW
TF has a quantization API which automatically adds quantization ops to a given graph:
tf.contrib.quantize.create_training_graph()
tf.contrib.quantize.create_eval_graph()
It provides tools to rewrite the original graph and add quantization ops for weights and activations.
Additional arguments need to be provided to configure the type of quantization.
We use the tf.quantization.quantize_and_dequantize (QDQ) operation for symmetric quantization:
output = round(input * scale) * inverse_scale
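
A minimal sketch of the QDQ op on a single tensor (assumes TensorFlow 1.x; the tensor values and the range are illustrative):

    import tensorflow as tf  # TF 1.x

    x = tf.constant([[0.3, -1.2], [2.5, 0.0]], dtype=tf.float32)
    # Symmetric fake quantization: values are snapped to an 8-bit grid and mapped
    # back to float, i.e. output = round(input * scale) * inverse_scale.
    x_qdq = tf.quantization.quantize_and_dequantize(
        x, input_min=-2.5, input_max=2.5,
        signed_input=True, num_bits=8, range_given=True)

    with tf.Session() as sess:
        print(sess.run(x_qdq))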

TOOLKIT
Deep Learning Examples toolkit open sourced by NVIDIA.
NGC container support with the latest features from different frameworks.
End-to-end workflow for deploying ResNet-50 with QAT in TensorRT:
1) Finetuning RN-50 with QAT
2) Post processing
3) Exporting the frozen graph
4) TF2ONNX conversion
5) TensorRT inference

STEP 1: FINETUNING RN50 WITH QAT
tf.contrib.quantize.create_training_graph adds quantization nodes to the RN-50 graph.
Quantization nodes are added at weights (conv/FC layers) and activation layers in the network.
Load the pre-trained weights, finetune the QAT model and save the new weights.
(Flow: RN-50 graph -> tf.contrib.quantize.create_training_graph -> finetuning with pretrained weights -> new weights)
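
A skeleton of this step (TF 1.x with tf.contrib; a tiny stand-in model replaces the real RN-50, and the optimizer settings and checkpoint name are assumptions):

    import numpy as np
    import tensorflow as tf  # TF 1.x with tf.contrib

    graph = tf.Graph()
    with graph.as_default():
        images = tf.placeholder(tf.float32, [None, 224, 224, 3])
        labels = tf.placeholder(tf.int64, [None])
        # Stand-in for the RN-50 forward pass: one conv block + global pooling + FC.
        net = tf.layers.conv2d(images, 8, 3, activation=tf.nn.relu)
        net = tf.reduce_mean(net, axis=[1, 2])
        logits = tf.layers.dense(net, 1000)
        loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
        # Rewrite the graph in place: fake-quant nodes are inserted at conv/FC weights
        # and at activations.
        tf.contrib.quantize.create_training_graph(input_graph=graph, quant_delay=0)
        train_op = tf.train.MomentumOptimizer(1e-4, 0.9).minimize(loss)
        saver = tf.train.Saver()

    with tf.Session(graph=graph) as sess:
        sess.run(tf.global_variables_initializer())  # in practice: restore pre-trained weights here
        x = np.random.rand(2, 224, 224, 3).astype(np.float32)
        y = np.random.randint(0, 1000, size=2)
        sess.run(train_op, feed_dict={images: x, labels: y})  # finetuning loop goes here
        saver.save(sess, "rn50_qat.ckpt")                     # save the new (QAT) weights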

STEP 2: POST PROCESSING
This step is required to ensure TensorRT builds successfully on the RN-50 QAT graph.
After finetuning, convert the final fully connected (FC) layer into a 1x1 convolution layer, preserving the same weights.
(RN-50 QAT: FC layer for 1000-class classification -> replaced with a 1x1 conv)
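
The weight transformation behind this step, as a NumPy sketch (shows only the reshape; the actual graph surgery is framework specific, and 2048 is the standard ResNet-50 feature dimension):

    import numpy as np

    # The FC layer maps a 2048-d feature vector to 1000 classes: weights are [2048, 1000].
    fc_weights = np.random.randn(2048, 1000).astype(np.float32)

    # The same mapping as a 1x1 convolution over a 1x1x2048 feature map:
    # conv weights are [kH, kW, in_channels, out_channels] = [1, 1, 2048, 1000],
    # so the values are preserved and only the shape changes.
    conv1x1_weights = fc_weights.reshape(1, 1, 2048, 1000)

    assert np.array_equal(fc_weights, conv1x1_weights[0, 0])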

STEP 3: EXPORTING FROZEN GRAPHS
Generate a frozen graph using the RN-50 QAT graph and the new weights from the finetuning stage.
This step converts the variables in the graph to constants by using the weights in the checkpoints.
Both data formats (NCHW and NHWC) can be used, although NCHW is recommended for the final graph.
(Flow: RN-50 QAT graph + new weights -> convert variables to constants -> frozen TF graph)
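
A minimal freezing sketch (TF 1.x; the checkpoint paths and the output node name are assumptions and depend on how the graph was built):

    import tensorflow as tf  # TF 1.x

    with tf.Session(graph=tf.Graph()) as sess:
        # Restore the finetuned QAT graph and its new weights from the checkpoint.
        saver = tf.train.import_meta_graph("rn50_qat.ckpt.meta")
        saver.restore(sess, "rn50_qat.ckpt")
        # Convert variables to constants so the graph ships as a single .pb file.
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names=["resnet50/output/softmax"])
        with tf.gfile.GFile("frozen_rn50_qat.pb", "wb") as f:
            f.write(frozen.SerializeToString())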

STEP 4: TF2ONNX CONVERSION
The TF2ONNX converter (https://github.com/onnx/tensorflow-onnx) transforms a TensorFlow pb file to ONNX.
It has conversion support for all common deep learning layers.
Support for QDQ layers has been added to the TF2ONNX converter for this conversion (a fake quant ops rewriter in tf2onnx).
QDQ ops store information about the dynamic ranges of the tensors. This is converted to scale and zero-point parameters during ONNX conversion.
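
A typical converter invocation on the frozen graph (the input/output tensor names and the opset are assumptions; inspect the frozen graph for the actual names):

    python -m tf2onnx.convert --input frozen_rn50_qat.pb \
        --inputs input:0 --outputs resnet50/output/softmax:0 \
        --output rn50_qat.onnx --opset 11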

STEP 5: TENSORRT INFERENCE
The generated ONNX graph with QuantizeLinear and DequantizeLinear ops is parsed using the ONNX parser available in TensorRT.
TensorRT performs several optimizations on this graph and builds an optimized engine for the specific GPU.
(Flow: ONNX graph -> TensorRT ONNX parser -> build engine (offline) -> execute at runtime)

TENSORRT INFERENCE ACCELERATOR

QUANTIZATION
Q(x, scale, zero_point) = round(x / scale) + zero_point   (affine)
Q(x, scale) = round(x / scale)                             (symmetric*)
Non-quantized op: X (fp32) -> Op -> Y (fp32)
Quantized op: X (int8), W (int8) -> Op(S_i, S_w, S_o) -> Y (fp32 / int8)
* TensorRT only supports symmetric quantization

PTQ MODEL INFERENCE
Calibration computes: per-tensor activation scales and per-channel weight scales.
Calibration quantizes: activation tensors and weights.
(Flow: FP32 model trained without QAT + calibration data -> TensorRT calibration -> engine with both quantized and non-quantized ops)
There is no control over which ops are quantized.
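
For contrast with the QAT path, a skeleton of a calibrator that feeds calibration batches to TensorRT during PTQ (assumes the TensorRT Python API with pycuda; the class name, batch shapes and file names are illustrative):

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    class RN50Calibrator(trt.IInt8EntropyCalibrator2):
        # Supplies preprocessed batches so TensorRT can compute activation scales.
        def __init__(self, batches, cache_file="calibration.cache"):
            trt.IInt8EntropyCalibrator2.__init__(self)
            self.batches = iter(batches)                  # list of float32 NCHW arrays
            self.cache_file = cache_file
            self.device_input = cuda.mem_alloc(batches[0].nbytes)

        def get_batch_size(self):
            return 1

        def get_batch(self, names):
            try:
                batch = next(self.batches)
            except StopIteration:
                return None                               # no more data: calibration done
            cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
            return [int(self.device_input)]

        def read_calibration_cache(self):
            return None                                   # no cached scales: recalibrate

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)

The calibrator is attached to the builder config (config.int8_calibrator) before building the engine; with QAT this step is unnecessary because the scales come from the QDQ ops in the graph.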

PTQ LIMITATIONS
(Diagram: op A feeding op B, and op C feeding op D; A, C and D are expected to execute in INT8, while B is expected to execute in FP32. Quantized GEMM followed by a high precision activation for accuracy, e.g. LSTM; quantized GEMM followed by a low precision activation for speed, e.g. image classification.)
For best results, the network must:
- specify where quantization and dequantization take place
- learn the best quantization scales.

QAT MODEL INFERENCE
(Diagram: X (fp32) -> Q (quantize, scale) -> int8 -> A (INT8) -> DQ (dequantize, scale) -> fp32 -> B (FP32) -> Y, a quantized GEMM followed by a high precision activation for accuracy, e.g. LSTM; and X (fp32) -> Q -> int8 -> C (INT8) -> int8 -> D (INT8) -> DQ -> fp32 -> Y, a quantized GEMM followed by a low precision activation for speed, e.g. image classification.)

QUANTIZATION OPS
ONNX::QuantizeLinear: X (fp32) -> Q -> Y (int8), y = round(x / scale_k) + zero_point_k
ONNX::DequantizeLinear: Y (int8) -> DQ -> X (fp32), x = (y - zero_point_k) * scale_k
Zero point must be 0, i.e. symmetric scaling.
Per-tensor scaling, or per-channel scaling with an arbitrary scaling axis (k).
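
The two ops expressed numerically, as a NumPy sketch of the symmetric per-channel case with scaling axis k = 0 (the weight shape and scale computation are illustrative; the ONNX ops also saturate to the int8 range, mirrored by the clip below):

    import numpy as np

    def quantize_linear(x, scale, axis=0):
        # y = round(x / scale_k), zero point fixed at 0 (symmetric scaling)
        s = np.expand_dims(scale, tuple(i for i in range(x.ndim) if i != axis))
        return np.clip(np.round(x / s), -128, 127).astype(np.int8)

    def dequantize_linear(y, scale, axis=0):
        # x = y * scale_k (zero point is 0)
        s = np.expand_dims(scale, tuple(i for i in range(y.ndim) if i != axis))
        return y.astype(np.float32) * s

    w = np.random.randn(64, 3, 3, 3).astype(np.float32)       # conv weights, 64 output channels
    scale = np.abs(w).reshape(64, -1).max(axis=1) / 127.0      # one scale per channel k
    w_int8 = quantize_linear(w, scale)
    w_recon = dequantize_linear(w_int8, scale)
    print("max abs error:", np.abs(w - w_recon).max())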

QDQ OPS INSERTIONS: RECOMMENDATION
Recommend QDQ ops insertion at the inputs of quantizable ops.
This matches QLinear/QConv semantics, i.e. low precision input, high precision output.
No complexity in deciding whether to quantize an output or not. Just don't.
Let the ops decide what precision input they want.

QDQ OPS INSERTIONS: RECOMMENDATION
Inserting QDQ ops at inputs (recommended):
Makes life easy for framework quantization tools: no special logic for Conv-BN or Conv-ReLU. Just insert QDQ in front of quantizable ops and leave the rest to the back end (TensorRT).
Makes life easy for back end optimizers (TensorRT): explicit quantization, with no implicit rule such as "quantize operator input if output is quantized".
Inserting QDQ ops at outputs (not recommended, but supported):
Some framework quantization tools have this behavior by default.
Sub-optimal performance when the network is partially quantized, i.e. not all ops are quantized.
Optimal performance when the network is fully quantized, i.e. all ops in the network are quantized.

QDQ OPS INSERTIONS: AT INPUTS
Some ops require high precision input from QConv/QLinear. Don't insert QDQ at their inputs. E.g. LayerNorm (BERT), Sigmoid, TanH (LSTM), Swish (EfficientNet).
Some ops can handle low precision input without accuracy drop. Insert QDQ at their inputs. E.g. GeLU (BERT), Softmax (BERT).
BERT large finetuned for SQuAD v1.1 (91.01 F1 in fp32):
Ops with quantized input             F1
Baseline: Linear, MM, BMM            90.66
Baseline + GeLU                      90.28
Baseline + LayerNorm after Linear    5.98
EfficientNet-b3 (81.61 top-1 in fp32):
Ops with quantized input             Top-1
Conv                                 80.28
Conv + Swish                         78.37

QDQ OPS INSERTIONS
(Diagram comparing a QAT model containing Norm and Linear ops with QDQ insertions against the resulting quantized graph with QLinear ops running in int8; weights QDQ for the Linear op is omitted to simplify the diagram.)

EXAMPLE: QAT MODEL
(Diagram placeholder: example QAT model, contrasted with a model trained without QAT.)

FINE-TUNED TF GRAPH: WITH FAKE QUANT OPS
Activation quantization is per-tensor; weight quantization can be per-tensor or per-channel.
(Diagram: Conv -> fp32 -> Relu -> fp32 -> FQ -> fp32 -> Conv)
Fake Quant ops are inserted before quantizable ops.
WLOG, FQ can be FakeQuant*, QDQV2 or QDQV3.

FINE-TUNED ONNX GRAPH: WITH QDQ OPS
(Diagram: Conv -> fp32 -> Relu -> fp32 -> Q -> int8 -> DQ -> fp32 -> Conv -> fp32)
The QDQ rewriter in the TF2ONNX converter replaces Fake Quant ops with QDQ pairs.

QDQ GRAPH OPTIMIZER: FOLD
(Diagram: Conv -> fp32 -> Relu -> fp32 -> Q -> int8 -> DQ -> fp32 -> Conv -> fp32)
Note: the QDQ graph optimizer is part of the generic TensorRT graph optimizer.

QDQ GRAPH OPTIMIZER: MATCH QUANTIZED OP AND FUSE
We fuse DQ ops with Conv, Conv with Relu, and the Q op with ConvRelu to create a QConvRelu with INT8 inputs and INT8 output.
(Diagram: Relu -> fp32 -> Q -> int8 -> DQ -> fp32 -> Conv -> fp32)
If there is no Q op available for epilog fusion, this will fuse into a QConv with FP32 output.

QDQ GRAPH OPTIMIZER: QUANTIZED INFERENCE GRAPH
(Diagram placeholder: the resulting quantized inference graph after fusion, with FP32 output.)

INFERENCE PIPELINE
Create the network with the kEXPLICIT_PRECISION flag.
Set trt.BuilderFlag.INT8 to enable INT8 precision.
Parse the ResNet-50 ONNX graph using the ONNX parser available in TensorRT and build the TensorRT engine.
Set up the test data pipeline and perform input pre-processing and resizing operations.
Run the engine on the input data. Copy the outputs of the model back to the host.
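
A condensed sketch of this pipeline (assumes a TensorRT 7.x-era Python API with pycuda; the ONNX file name, tensor shapes, workspace size and binding order are assumptions):

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    # Explicit batch + explicit precision network (the QDQ ops control the precision).
    flags = (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) | \
            (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_PRECISION))

    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open("rn50_qat.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)      # enable INT8 precision
    config.max_workspace_size = 1 << 30
    engine = builder.build_engine(network, config)

    # Run one preprocessed batch through the engine and copy the output back to the host.
    context = engine.create_execution_context()
    inp = np.random.rand(1, 3, 224, 224).astype(np.float32)   # stand-in for a real image
    out = np.empty((1, 1000), dtype=np.float32)
    d_inp, d_out = cuda.mem_alloc(inp.nbytes), cuda.mem_alloc(out.nbytes)
    cuda.memcpy_htod(d_inp, inp)
    context.execute_v2([int(d_inp), int(d_out)])
    cuda.memcpy_dtoh(out, d_out)
    print("top-1 class:", out.argmax())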

EVALUATION OF RESNET-50 QAT NETWORK
The evaluation has been performed on an RTX 2080 Ti GPU with TensorFlow 1.15. The TF network runs in FP32, whereas TensorRT inference runs in INT8 precision.
There is a slight drop in accuracy (0.15%).
Preprocessing of input images influences the final accuracy.
Runtime is significantly improved by TensorRT: around a 12x speed up.

CONCLUSION
Quantization aware training provides a new alternative for deploying networks in lower precision.
Since quantization scales are computed during training, QAT models might be less prone to accuracy drop during inference compared to PTQ networks in some cases.
We have demonstrated an end-to-end workflow for a ResNet-50 QAT model and shown that the INT8 accuracy is close to the FP32 model.
