ApproxTuner: A Compiler and Runtime System for Adaptive Approximations


ApproxTuner: A Compiler and Runtime System for Adaptive Approximations

Hashim Sharif (University of Illinois at Urbana-Champaign, USA, hsharif3@illinois.edu)
Yifan Zhao (University of Illinois at Urbana-Champaign, USA, yifanz16@illinois.edu)
Maria Kotsifakou (Runtime Verification, Inc., USA, maria.kotsifakou@runtimeverification.com)
Akash Kothari (University of Illinois at Urbana-Champaign, USA, akashk4@illinois.edu)
Ben Schreiber (University of Illinois at Urbana-Champaign, USA, bjschre2@illinois.edu)
Elizabeth Wang (University of Illinois at Urbana-Champaign, USA, eyw3@illinois.edu)
Yasmin Sarita (Cornell University, USA, ycs4@cornell.edu)
Nathan Zhao (University of Illinois at Urbana-Champaign, USA, nz11@illinois.edu)
Keyur Joshi (University of Illinois at Urbana-Champaign, USA, kpjoshi2@illinois.edu)
Vikram S. Adve (University of Illinois at Urbana-Champaign, USA, vadve@illinois.edu)
Sasa Misailovic (University of Illinois at Urbana-Champaign, USA, misailo@illinois.edu)
Sarita Adve (University of Illinois at Urbana-Champaign, USA, sadve@illinois.edu)

Abstract

Manually optimizing the tradeoffs between accuracy, performance, and energy for resource-intensive applications with flexible accuracy or precision requirements is extremely difficult. We present ApproxTuner, an automatic framework for accuracy-aware optimization of tensor-based applications while requiring only high-level end-to-end quality specifications. ApproxTuner implements and manages approximations in algorithms, system software, and hardware.

The key contribution in ApproxTuner is a novel three-phase approach to approximation-tuning that consists of development-time, install-time, and run-time phases. Our approach decouples tuning of hardware-independent and hardware-specific approximations, thus providing retargetability across devices. To enable efficient autotuning of approximation choices, we present a novel accuracy-aware tuning technique called predictive approximation-tuning, which significantly speeds up autotuning by analytically predicting the accuracy impacts of approximations.

We evaluate ApproxTuner across 10 convolutional neural networks (CNNs) and a combined CNN and image-processing benchmark. For the evaluated CNNs, using only hardware-independent approximation choices we achieve a mean speedup of 2.1x (max 2.7x) on a GPU and a 1.3x mean speedup (max 1.9x) on the CPU, while staying within 1 percentage point of inference accuracy loss. For two different accuracy-prediction models, ApproxTuner speeds up tuning by 12.8x and 20.4x compared to conventional empirical tuning while achieving comparable benefits.

CCS Concepts: • Software and its engineering → Compilers.

Keywords: Approximate Computing, Compilers, Heterogeneous Systems, Deep Neural Networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea. © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-8294-6/21/02.

1 Introduction

With the ubiquitous deployment of highly compute-intensive machine-learning and big-data processing workloads, optimizing these workloads is becoming increasingly important.
A wide range of applications using these workloads are being deployed on both the cloud and the edge, including image processing, object classification, speech recognition, and face recognition [8, 43, 46]. Many of these workloads are very computationally demanding, which makes it challenging to achieve the desired performance levels on resource-constrained systems.

Many modern machine-learning and image-processing applications are inherently “approximate” in the sense that the input data are often collected from noisy sensors (e.g., image and audio streams) and the output results are usually probabilistic (e.g., for object classification or facial recognition). Such applications can trade off small amounts of result quality (or accuracy) for improved performance and efficiency [60], where result quality is an application-specific property such as inference accuracy in a neural network or peak signal-to-noise ratio (PSNR) in an image-processing pipeline. Previous research has presented many individual domain-specific and system-level techniques for trading accuracy for performance. For instance, reduced-precision models are widespread in deep learning [3, 7, 14, 34, 45]. Recent specialized accelerators incorporate hardware-specific approximations that provide significant improvements in performance and energy in exchange for relaxed accuracy [1, 2, 11, 29, 30, 48, 59]. Such techniques can provide a crucial strategy for achieving the necessary performance and/or energy for emerging applications in edge computing.

In practice, a realistic application (e.g., a neural network, or a combination of an image-processing pipeline and an image-classification network) can make use of multiple approximation techniques for different computations in the code, each with its own parameters that must be tuned, to achieve the best results. For example, our experiments show that for the ResNet-18 network, which contains 22 tensor operations, the best combination is to use three different approximations with different parameter settings in different operations.
A major open challenge is how to automatically select, configure, and tune the parameters for combinations of approximation techniques, while meeting end-to-end requirements on energy, performance, and accuracy.

To address this broad challenge, we need to solve several specific challenges. First, the variety of software and hardware approximation choices, and the tuning knobs for each of them, induce a large search space for accuracy-aware optimization – up to 7^91 configurations in our benchmarks. Second, efficiently searching such large spaces is made even more difficult because individual “candidate configurations” (sample points in the search space) can be expensive to measure on edge systems when estimating accuracy, performance, and energy. For example, measurement-based (aka “empirical”) tuning for the ResNet-50 neural network took 11 days on a server-class machine. Third, the diversity of hardware compute units often used in edge-computing devices [63] yields diverse hardware-specific approximation options with varying accuracy-performance tradeoffs. Moreover, optimizations in specialized hardware accelerators often create important opportunities for orders-of-magnitude improvements. These hardware-specific tuning opportunities are critical, but difficult to exploit without sacrificing software portability, which is crucial in some important domains, like mobile devices. Fourth, the best accuracy-performance-energy tradeoffs may vary significantly at run time, depending on system load, battery level, or time-varying application requirements.

1.1 ApproxTuner System

We present ApproxTuner, an automatic framework for accuracy-aware optimization of tensor-based applications (an important subset of edge-computing applications). It finds application configurations that maximize speedup and/or energy savings of an application, subject to end-to-end quality specifications.
It addresses all of the challenges above, and is the first and only system to handle all of them, as we discuss in Section 9.

Tensor computations are widely used in machine-learning frameworks and in many important domains such as image processing, scientific computing, and others [3, 31–33]. ApproxTuner focuses on tensor-based computations for two reasons. First, limiting the system to these computations enables algorithmic approximations specific to tensor operations (in addition to generic software and hardware approximations). Adding support for other classes of computations can be done in a similar fashion, but is outside the scope of this work. Second, many current and emerging hardware accelerators focus on tensor computations [1, 2, 29, 30, 48], enabling novel hardware-specific approximations for these computations. ApproxTuner includes support for a novel experimental, analog-domain accelerator [59] that exemplifies such approximation opportunities and associated challenges.

ApproxTuner tackles the last two challenges above – and enables hardware-specific, yet portable, tuning and run-time adaptation – by decomposing the optimization process into three stages: development-time, install-time, and run-time. At development time, ApproxTuner selects hardware-independent approximations and creates a tradeoff curve that contains the approximations with the highest quality and performance that ApproxTuner found during search. At install time, the system refines this curve using hardware-specific optimizations and performance measurements. The final tradeoff curve is included with the program binary. The program can use the refined curve for the best static choices of approximations before the run, and it can further adapt these choices, using the same curves, based on run-time conditions.
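As an illustration of how such a shipped tradeoff curve can drive both static and dynamic choices, the sketch below picks the highest-QoS configuration that meets a QoS bound and a required speedup (e.g., to counteract a low-power-mode slowdown). The curve entries, knob names, and selection policy are our own simplified assumptions for illustration, not ApproxTuner's actual interface.

```python
# Hypothetical sketch of consulting a shipped tradeoff curve at run time.
# The curve entries, knob names, and selection policy are simplified
# assumptions for illustration; they are not ApproxTuner's actual API.

def select_config(curve, qos_min, perf_min):
    """curve: list of (qos, perf, config) tradeoff points.
    Return the highest-QoS point meeting both bounds, or None."""
    feasible = [pt for pt in curve if pt[0] >= qos_min and pt[1] >= perf_min]
    return max(feasible, key=lambda pt: (pt[0], pt[1]), default=None)

curve = [
    (89.0, 1.0, {"conv1": 0}),  # exact baseline (knob 0 = no approximation)
    (88.5, 1.6, {"conv1": 1}),
    (88.0, 2.1, {"conv1": 2}),
]

# Normal conditions: take the most accurate feasible configuration.
print(select_config(curve, qos_min=88.0, perf_min=1.0))  # → (89.0, 1.0, {'conv1': 0})
# A low-power mode demands a 2x speedup: give up a little accuracy for it.
print(select_config(curve, qos_min=88.0, perf_min=2.0))  # → (88.0, 2.1, {'conv1': 2})
```

Because the curve is precomputed, this run-time lookup is just a scan over a small list, which is consistent with the paper's claim of negligible adaptation overhead.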
Using the final tradeoff curve at run time as well keeps the overheads of run-time adaptation negligible.

ApproxTuner tackles the first two challenges – and enables efficient navigation of the large tradeoff space and efficient estimation of performance and quality – by introducing novel predictive approximation-tuning. Predictive approximation-tuning uses one-time error profiles of individual approximations, together with error composition models for tensor-based applications, to predict end-to-end application accuracy. ApproxTuner also facilitates distributed approximation-tuning, since the error profile collection can happen at multiple client devices, while a centralized server

can perform time-intensive autotuning. It makes install-time tuning (with hardware-specific approximations) feasible when it would otherwise be prohibitively expensive on a single resource-constrained edge device.

1.2 Contributions

In summary, our contributions are:

- A system that combines a wide range of existing hardware, software, and algorithmic approximations, supports diverse heterogeneous systems, and provides an easy-to-use programming interface for accuracy-aware tuning. Our results show that different kinds of approximations and approximation knobs are best suited for different applications, and also across sub-computations in the same application.

- A novel three-phase accuracy-aware tuning technique that provides performance portability, retargetability to compute units with hardware-specific approximation knobs, and dynamic tuning. It splits tuning into: 1) construction of tradeoff curves for hardware-independent approximations at development time, 2) mapping to hardware-specific approximations at install time, and 3) fast approximation selection at run time.

- A novel predictive approximation-tuning technique that uses compositional models for accuracy prediction to speed up both development-time and install-time tuning. For two different accuracy-prediction models, our predictive tuning strategy speeds up tuning by 12.8x and 20.4x compared to conventional empirical tuning while achieving comparable benefits.

- Experimental evaluation on 11 benchmarks – 10 CNNs and 1 combined CNN and image-processing benchmark – shows:

  Hardware-independent Approximations. When using only hardware-independent approximations, ApproxTuner achieves a geomean speedup of 2.1x and energy reduction of 2x on the GPU, and a geomean speedup of 1.3x and energy reduction of 1.3x on the CPU, with 1 percentage point accuracy loss.

  Hardware Approximations.
At install time, ApproxTuner maps tensor operations to an energy-efficient analog compute accelerator, PROMISE, and provides a geomean energy reduction of 3.25x (across benchmarks) with a 1 percentage point drop in inference accuracy. ApproxTuner exploits such hardware-specific approximations without sacrificing object-code portability.

  Runtime Adaptation for Approximations. To counteract performance slowdowns imposed by runtime conditions such as low-power modes, ApproxTuner can dynamically tune approximation knobs with extremely low run-time overheads, by using tradeoff curves shipped with the application binary.

2 ApproxTuner Overview

[Fig. 1. ApproxTuner workflow.]

Figure 1 shows the high-level workflow for ApproxTuner. ApproxTuner builds on the HPVM and ApproxHPVM compiler systems [35, 57], which are briefly described below. ApproxTuner takes as input programs written in Keras or PyTorch (for convolutional neural networks, or CNNs), or in an extension of C that can be compiled to HPVM [35] (for other tensor-based programs), and compiles them to the ApproxHPVM internal representation (IR) [57]. ApproxHPVM manages the compilation to various compute units (Section 2.1). ApproxTuner optimizes the computations in this IR in three phases: 1) development-time, 2) install-time, and 3) run-time (Section 2.2). ApproxTuner’s goal is to select combinations of software and hardware approximations (detailed in Section 2.3) that meet performance, energy, and accuracy targets.

2.1 Preliminaries and Terminology

HPVM and ApproxHPVM. HPVM IR is a dataflow-graph-based parallel program representation that captures coarse- and fine-grain data and task parallelism [35]. HPVM provides a portable IR and a retargetable compiler framework that can target diverse compute units, e.g., CPUs, GPUs, and FPGAs. ApproxHPVM extends HPVM with support for tensor operations and limited support for accuracy-aware tuning (see Section 9) [57].
The tensor operations represent data-parallel computations on tensors, commonly used in CNNs and image-processing applications. ApproxHPVM also adds back-ends to the HPVM system for a few custom machine-learning accelerators/libraries, including one for cuDNN on NVIDIA GPUs and one for a programmable analog in-memory compute accelerator called PROMISE [59].

This work focuses on approximations for a set of predefined tensor operations included in ApproxHPVM, such as convolutions, matrix multiplication, ReLU, map, and reduce. We refer to Sharif et al. [57, Table 1] for the definition of these operations in ApproxHPVM. These operations are the units of scheduling and approximation in ApproxTuner, where a schedule is a mapping of tensor operations to compute units in the target system. Hardware-independent approximations for an operation are those whose impact on the outputs (i.e.,

the semantics) of a program is fixed, regardless of the hardware used to execute the operation. Examples include filter sampling, perforated convolutions, and reduced precision, described below (Section 2.3). Some approximations, like the use of (IEEE) FP16 instead of FP32, may have hardware-independent semantics, and yet may be implemented in hardware for efficiency (in fact, they may be too inefficient to use without hardware support, possibly making them non-portable). Other approximations are called hardware-specific; the output-quality impact of these approximations is specific to the choice of hardware used. The impact on energy and performance will usually be hardware-dependent for all kinds of approximations.

Quality of Service. A quality-of-service (QoS) metric is a (usually domain-specific) metric over the outputs of some computation, such as inference accuracy or PSNR. The QoS metric function for a tensor-based program, QoS : T_out × T_gold → R, takes the output tensor T_out and the exact output T_gold, and produces a scalar value.

A QoS constraint is a constraint over the QoS level of an application. As higher is conventionally better for QoS, we assume a QoS constraint is a lower bound; the opposite case can be treated similarly.

Knobs, Configurations, and Tradeoff Curves. An approximation knob is a discrete-valued parameter of an approximation method (represented using integers in ApproxTuner) that can be modified to control the quality, energy, and running time. A zero value denotes no approximation. Each tensor operation may be assigned an approximation choice, or none. A configuration is a map Config : op → Int that assigns an approximation knob value to every tensor operation in the program. The search space is the set of all possible configurations.

A tradeoff point is a triple (QoS, Perf, config), which specifies the quality of service and the performance of the configuration on specified inputs.
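To make these definitions concrete, here is a small sketch (our own Python, not part of ApproxTuner) of a PSNR-style QoS metric, a configuration mapping tensor operations to knob values, and a lower-bound QoS constraint check. The operation names are hypothetical.

```python
import math

def psnr_qos(t_out, t_gold, peak=1.0):
    """PSNR in dB between a (flattened) approximate output tensor and the
    exact ("golden") output; higher is better."""
    mse = sum((a - b) ** 2 for a, b in zip(t_out, t_gold)) / len(t_gold)
    if mse == 0.0:
        return math.inf  # identical outputs: no quality loss
    return 10.0 * math.log10(peak ** 2 / mse)

# A configuration assigns one knob value to every tensor op (0 = exact).
config = {"conv1": 2, "conv2": 0, "matmul1": 1, "relu1": 0}

# A QoS constraint is a lower bound on the metric.
def meets_constraint(qos_value, qos_threshold):
    return qos_value >= qos_threshold
```

For a CNN, inference accuracy over a calibration set would play the role of psnr_qos here; the knob/configuration structure is the same.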
The set of all tradeoff points, denoted 𝒮, represents the tradeoff space. To compare tradeoff points, we define a dominance relation (⪯) in the usual way [18]: a point s1 = (QoS1, Perf1, config1) is dominated by a point s2 = (QoS2, Perf2, config2) iff it has both lower QoS and worse performance. Formally, s1 ⪯ s2 iff QoS1 ≤ QoS2 and Perf1 ≤ Perf2. Strict dominance, s1 ≺ s2, is defined as dominance when the two points have unequal QoS or Perf.

A search algorithm explores a subset of the tradeoff space S ⊆ 𝒮 to find desirable tradeoff points. One way to describe desirable points is through Pareto sets. A Pareto set of a set S is its subset that consists of non-dominated points:

    PS(S) = { s | s ∈ S ∧ ¬∃ s' ∈ S . s ≺ s' }    (1)

This set defines a tradeoff curve, which contains linear segments between the points in the Pareto set. Intuitively, these points have the best tradeoffs that the search algorithm was able to find. We describe how to select configurations, or combinations of configurations, from a tradeoff curve in Section 5. We also define a relaxed version of the curve, PS_ε, which includes tradeoff points that are within ε distance of points in the Pareto set:

    PS_ε(S) = { s | s ∈ S ∧ ∃ s' ∈ PS(S) . dist(s, s') ≤ ε }    (2)

Here, dist is the usual Euclidean distance in the tradeoff space.

2.2 Overview of Three Stages of ApproxTuner

Our goal is to automatically select approximation knobs that minimize energy and/or maximize performance for a given program or its component, while satisfying some user-specified end-to-end QoS constraint. We refer to this task as approximation-tuning and do it in three stages.

The development-time stage (Section 3) computes a tradeoff curve using only hardware-independent approximations. ApproxTuner takes as input a program and a QoS constraint, and generates a set of possible configurations, S_0, that optimize a hardware-agnostic performance metric (e.g., operation count) and produce QoS values within the constraint.
In practice, the QoS and performance calculated at development time can be imprecise, because they use hardware-agnostic estimates of performance (e.g., operation count) and, optionally, of accuracy (with predictive tuning). To make it more likely that we capture points with good measured QoS and performance, we create the relaxed tradeoff curve, PS_ε(S_0), which is shipped with the application.

The install-time stage (Section 4) uses the development-time tradeoff curves and measures the actual performance of each configuration in those curves on the target hardware. Next, we create a refined tradeoff curve, PS(S'), where S' is the PS_ε curve from development time updated with real performance measurements. This stage can be run any time the target hardware is available. If hardware-specific approximations (not known at development time) are available, which may also change the QoS metrics, the install-time stage performs a new autotuning step that includes the hardware approximations and generates a new tradeoff curve.

The run-time stage (Section 5) takes the program’s final tradeoff curve from the install-time phase and uses it for dynamic approximation tuning.
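The tradeoff-curve constructions used by these stages, strict dominance, the Pareto set PS of Equation (1), and its ε-relaxation PS_ε of Equation (2), can be sketched as follows (illustrative Python of our own, assuming points are (QoS, Perf) pairs where higher is better on both axes):

```python
# Illustrative sketch (not ApproxTuner code) of strict dominance, the
# Pareto set PS of Equation (1), and the relaxed set PS_eps of Equation (2).
# Points are (QoS, Perf) pairs; higher is better on both axes.
from math import dist

def strictly_dominated(s1, s2):
    """s1 < s2: s2 is at least as good on both axes and better on one."""
    (q1, p1), (q2, p2) = s1, s2
    return q1 <= q2 and p1 <= p2 and (q1 < q2 or p1 < p2)

def pareto_set(points):
    """PS(S): the points of S not strictly dominated by any other point."""
    return [s for s in points
            if not any(strictly_dominated(s, t) for t in points)]

def relaxed_pareto_set(points, eps):
    """PS_eps(S): points within Euclidean distance eps of the Pareto set."""
    ps = pareto_set(points)
    return [s for s in points if any(dist(s, p) <= eps for p in ps)]
```

The ε-relaxation matters at development time because the hardware-agnostic QoS and performance values are estimates: a point slightly off the estimated frontier may land on the true frontier once real measurements are taken at install time.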

