IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, VOL. 15, NO. 1, MARCH 2018

Efficient Deep Neural Network Serving: Fast and Furious

Feng Yan, Member, IEEE, Yuxiong He, Member, IEEE, Olatunji Ruwase, Member, IEEE, and Evgenia Smirni, Senior Member, IEEE

Abstract—The emergence of deep neural networks (DNNs) as a state-of-the-art machine learning technique has enabled a variety of artificial intelligence applications for image recognition, speech recognition and translation, drug discovery, and machine vision. These applications are backed by large DNN models running in serving mode on a cloud computing infrastructure to process client inputs such as images, speech segments, and text segments. Given the compute-intensive nature of large DNN models, a key challenge for DNN serving systems is to minimize request response latencies. This paper characterizes the behavior of different parallelism techniques for supporting scalable and responsive serving systems for large DNNs. We identify and model two important properties of DNN workloads: 1) homogeneous request service demand and 2) interference among requests running concurrently due to cache/memory contention. These properties motivate the design of serving deep learning systems fast (SERF), a dynamic scheduling framework that is powered by an interference-aware queueing-based analytical model. To minimize response latency for DNN serving, SERF quickly identifies and switches to the optimal parallel configuration of the serving system by using both empirical and analytical methods. Our evaluation of SERF using several well-known benchmarks demonstrates its good latency prediction accuracy, its ability to correctly identify optimal parallel configurations for each benchmark, its ability to adapt to changing load conditions, and its efficiency advantage (at least three orders of magnitude faster) over exhaustive profiling. We also demonstrate that SERF supports other scheduling objectives and can be extended to any general machine learning serving system with similar parallelism properties.

Index Terms—Deep learning, DNN serving, scheduling, parallelism, performance, analytical model, interference-aware.

Manuscript received May 4, 2017; revised September 22, 2017; accepted November 4, 2017. Date of publication February 21, 2018; date of current version March 9, 2018. This work is supported by NSF grants CCF-1218758, CCF-1649087, and CCF-1756013. The associate editor coordinating the review of this paper and approving it for publication was Yixin Diao. (Corresponding author: Feng Yan.) F. Yan is with the Department of Computer Science and Engineering, University of Nevada at Reno, Reno, NV 89557 USA (e-mail: fyan@unr.edu). Y. He and O. Ruwase are with Microsoft Research, Redmond, WA 98052 USA (e-mail: yuxhe@microsoft.com; olruwase@microsoft.com). E. Smirni is with the Department of Computer Science, College of William and Mary, Williamsburg, VA 23187 USA (e-mail: esmirni@cs.wm.edu). Digital Object Identifier 10.1109/TNSM.2018.2808352

I. INTRODUCTION

THE recent advances in Deep Neural Network (DNN) models have enabled state-of-the-art accuracy on important yet challenging artificial intelligence tasks, such as image recognition [1]–[3] and captioning [4], [5], video classification [6], [7] and captioning [8], speech recognition [9], [10], and text processing [11].
These advances in DNNs have enabled a variety of new applications, including personal digital assistants [12], real-time natural language processing and translation [13], photo search [14] and captioning [15], drug discovery [16], and self-driving cars [17].

A key driver of these recent improvements in DNN performance is the ability to train large DNN models, containing billions of neural connections, using large amounts of training data [1], [3], [16], [18]. Once trained, these big DNN models are deployed in a serving mode to process application inputs, such as images, voice commands, speech segments, and handwritten text. However, big DNN models require significant compute cycles and memory bandwidth to process each input, and are therefore impractical to run on battery-powered and small form-factor hardware devices, such as laptops, tablets, and mobile phones. Consequently, big DNN models are typically deployed as client-server applications, with the client running on a mobile device, and the server, including the model, running as a serving system in the cloud (e.g., Cortana, Siri, and Google Now). This paper presents how to build a scalable and responsive serving system for these large DNN models.

Like other interactive online services, such as Web search and online gaming, DNN serving requires consistently low response times to attract and retain users. Computing the answer for a user request using large DNN models may take seconds or even minutes to complete when running sequentially on a single machine.

One promising approach for reducing DNN serving latency is to parallelize the computation. There are three complementary ways to achieve large-scale parallelism in DNN serving systems. First, the DNN model, which consists of billions of neurons and connections, can be partitioned across multiple servers; each request is then processed concurrently on these servers, with communication across them (inter-node parallelism). Second, at each server, a request can be further parallelized using multiple threads, exploiting the multicore architecture of modern hardware (intra-node parallelism). Finally, multiple requests can be processed concurrently within each multicore server (service parallelism). Service parallelism does not reduce the latency of an individual query; nonetheless, processing multiple requests in parallel is valuable because it improves the server's throughput and potentially reduces the time that a request waits for execution.

Fig. 1. Latency under low load (left plot) and high load (right plot) using different configurations (inter-node parallelism is set to 1) for ImageNet-22K.

While these parallelism techniques present opportunities to reduce DNN serving latency, deciding the optimal parallelism configuration is challenging. Applying parallelism degrees blindly can harm performance. For example, service parallelism may increase memory system contention to the point of prolonging request processing time; inter-node parallelism may prolong request processing if the cross-machine communication overhead exceeds the computation speedup. Figure 1 shows the latency of serving the ImageNet-22K workload [19] under different combinations of service and intra-node parallelism on an 8-core machine (refer to Section V-A for the detailed experimental setup). The left and right scatter plots represent low and high load conditions (request arrival rate), respectively. Each point represents one parallel configuration, and the size of the point indicates its latency value. The figure demonstrates that: (1) Many parallel configurations are possible, even with only 8 cores and without considering inter-node parallelism. (2) The latency difference between the best and the worst parallel configuration can be significant, i.e., orders of magnitude, and this gap grows further under higher loads. (3) The latency values and the best parallel configuration change as a function of the load.

We propose SERF, a framework for serving deep learning systems fast, which integrates lightweight profiling with a queueing-based prediction model to quickly find optimal parallel configurations for DNN serving. SERF exploits three key intuitions to address the three challenges. First, it employs a dynamic scheduler that determines online the ideal parallelism configuration based on the system load. Second, as it is hard to accurately estimate the impact of a parallel configuration on latency, SERF leverages lightweight profiling to measure workload latencies of a few key configurations on the hardware of interest. Third, instead of exhaustive profiling over all configurations under all loads, which is unavoidably slow and impractical, we develop the core component of SERF — a queueing-based analytical model for performance prediction — which uses only limited, fast profiling to record the essential system and workload information that is used as input to the model. Using this input, the model achieves remarkably accurate predictions of the request latency of any parallel configuration under any given load, and can thus be used online in a dynamic workload setting.
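To make the scheduling idea concrete, the sketch below is a simplified stand-in for SERF, not the model from the paper (the interference-aware queueing formulation is developed in Section IV). It combines hypothetical profiled service times for a few configurations with a crude utilization-based latency estimate and picks the configuration with the lowest predicted latency at the current arrival rate; all numbers are made up.

```python
def predict_latency(service_time_s, arrival_rate_rps, max_service_parallelism):
    """Crude queueing estimate of mean response time (illustrative stand-in only).

    Utilization is arrival rate times service time divided by the number of
    requests allowed to run concurrently; beyond utilization 1 the system is unstable.
    """
    utilization = arrival_rate_rps * service_time_s / max_service_parallelism
    if utilization >= 1.0:
        return float("inf")  # overloaded: the queue grows without bound
    return service_time_s / (1.0 - utilization)

def pick_configuration(profiled, arrival_rate_rps):
    """Return the (service, inter-node, intra-node) tuple with the lowest predicted latency."""
    return min(profiled,
               key=lambda cfg: predict_latency(profiled[cfg], arrival_rate_rps, cfg[0]))

# Hypothetical profiled service times (seconds per request) for a few configurations.
profiled = {(1, 1, 8): 0.30, (2, 1, 4): 0.34, (4, 1, 2): 0.45}
print(pick_configuration(profiled, 0.5))  # light load -> fastest single-request config wins
print(pick_configuration(profiled, 6.0))  # heavy load -> config that can sustain the rate wins
```

Even this toy version reproduces the qualitative behavior in Figure 1: the best configuration changes with load, so a static choice is inherently suboptimal.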
We implement SERF in the context of an image classification service based on the image classification module of the Adam distributed deep learning framework [3]. We stress that SERF is not limited to the Adam architecture, but is also applicable to serving systems based on other DNN frameworks (e.g., Caffe [20], Theano [21], and Torch7 [22]), since similar parallelism decisions and configuration knobs are available there. Our current prototype includes implementations of our parallelism techniques, as well as a load generator for simulating the request arrival process. We evaluate its performance on a 20-machine cluster and conduct extensive experiments with several state-of-the-art classification benchmarks, including ImageNet [19] and CIFAR [23]. We demonstrate the accuracy of our queueing-based prediction model by comparing its predictions with testbed measurements. Moreover, we show that, compared to using static parallel configurations, SERF swiftly recommends the optimal configuration under various loads. Compared to exhaustive profiling, SERF adapts three orders of magnitude faster in dynamic, ever-changing environments.

We also demonstrate that SERF supports different scheduling objectives, e.g., finding the minimum amount of resources required to meet a target latency SLO (service level objective), and that it can be extended to support any general machine learning serving system with characteristics similar to a DNN serving system. We summarize the main contributions of the paper as follows: (1) We conduct a comprehensive workload characterization of a DNN serving system, highlighting the opportunities and challenges of using different parallelism techniques to reduce response latency (Section III). (2) We propose the SERF scheduling framework, which integrates lightweight profiling and a queueing-based latency prediction model to find the best parallel configurations effectively and efficiently (Section IV). (3) We implement SERF and evaluate it on a cluster of machines; the experimental results verify its effectiveness and efficiency (Section V).

II. BACKGROUND

DNNs consist of large numbers of neurons, each with multiple inputs and a single output called an activation. Neurons are connected hierarchically, layer by layer, with the activations of neurons in layer l-1 serving as inputs to neurons in layer l. This deep hierarchical structure enables DNNs to learn complex tasks, such as image recognition, speech recognition, and text processing.

A DNN service platform supports training and serving. DNN training is offline batch processing that uses learning algorithms, such as stochastic gradient descent (SGD) [24], and labeled training data to tune the neural network parameters for a specific task. DNN serving is instead interactive processing requiring a fast response per request, e.g., within 7-10 milliseconds for speech applications [25], and within 200-300 milliseconds even for challenging large-scale models like ImageNet-22K. It deploys the trained DNN models in serving mode to answer user requests; e.g., for a dog recognition application, a user request provides a dog image as input and receives the type of the dog as output. The response time of a request is the sum of its service time (execution time) and its waiting time. An important common performance metric for interactive workloads is the average request response time (average latency), which we adopt in our work.

In DNN serving, each user input, which we refer to as a request, is evaluated layer by layer in a feed-forward manner, where the output of layer l-1 becomes the input of layer l. More specifically, define $a_i$ as the activation (output) of neuron i in layer l. The value of $a_i$ is computed as a function of its J inputs from neurons in the preceding layer l-1 as $a_i = f\big(\big(\sum_{j=1}^{J} w_{ij} a_j\big) + b_i\big)$, where $w_{ij}$ is the weight associated with the connection between neuron i at layer l and neuron j at layer l-1, and $b_i$ is the bias term associated with neuron i. The activation function f, associated with all neurons in the network, is a pre-defined non-linear function, typically a sigmoid or hyperbolic tangent. Therefore, for a given request, its main computation at each layer l is a matrix-vector multiplication of the weight matrix of the layer with the activation vector from layer l-1 (or the input vector if l = 0).
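This per-layer computation can be sketched in a few lines. The code below is a generic illustration with toy dimensions, not code from the Adam framework: each layer multiplies its weight matrix with the previous layer's activation vector, adds the bias, and applies a sigmoid activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, layers, activation=sigmoid):
    """Evaluate one request: `layers` is a list of (W, b) pairs, one per layer.

    For layer l, a = f(W @ a_prev + b), i.e., a matrix-vector multiplication of the
    layer's weights with the activations of layer l-1 (or the input vector if l = 0).
    """
    a = x
    for W, b in layers:
        a = activation(W @ a + b)
    return a

# Toy example with made-up dimensions: 4 inputs -> 3 hidden neurons -> 2 outputs.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 4)), rng.standard_normal(3)),
          (rng.standard_normal((2, 3)), rng.standard_normal(2))]
print(feed_forward(rng.standard_normal(4), layers))
```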

Inter-node, intra-node, and service-level parallelisms are well supported among various DNN models and applications [1], [3], [26]. Inter-node parallelism partitions the neural network across multiple nodes/machines, with activations of neural connections that cross node/machine boundaries exchanged as network messages. Intra-node parallelism uses multi-threading to parallelize the feed-forward evaluation of each input image using multiple cores; as the computation at each DNN layer is simply a matrix-vector multiplication, it can be easily parallelized using parallel libraries such as OpenMP [27] or TBB [28] by employing a parallel for loop. Service-level parallelism is essentially admission control that limits the maximum number of concurrently running requests. We define a parallelism configuration as a combination of the intra-node parallelism degree, the inter-node parallelism degree, and the maximum allowed service parallelism degree. Note that service parallelism is defined as a maximum rather than an exact value because of the random request arrival process; e.g., at certain moments, the system may have fewer requests than the defined service parallelism degree.
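As a rough illustration of how these knobs compose (a sketch under our own simplifying assumptions, not the serving system's implementation), service parallelism can be enforced with a semaphore that admits a bounded number of requests at a time, while each admitted request uses a pool of intra-node worker threads; inter-node partitioning is omitted here.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical configuration knobs (not values from the paper):
MAX_SERVICE_PARALLELISM = 2   # at most 2 requests execute concurrently
INTRA_NODE_DEGREE = 4         # each admitted request may use 4 worker threads

admission = threading.Semaphore(MAX_SERVICE_PARALLELISM)

def evaluate_request(req_id, pool):
    # Placeholder for the per-request feed-forward pass; a real server would split
    # each layer's matrix-vector product across the pool's worker threads.
    list(pool.map(lambda part: time.sleep(0.01), range(INTRA_NODE_DEGREE)))
    return f"request {req_id} done"

def serve_request(req_id):
    with admission:  # service-level parallelism via admission control
        with ThreadPoolExecutor(max_workers=INTRA_NODE_DEGREE) as pool:  # intra-node threads
            return evaluate_request(req_id, pool)

# Submit a small burst of requests; only two run at a time, the rest wait for admission.
with ThreadPoolExecutor(max_workers=8) as frontend:
    print(list(frontend.map(serve_request, range(5))))
```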
III. WORKLOAD CHARACTERIZATION

In this section, we present a comprehensive workload characterization that shows the opportunities and challenges of using the various parallelism techniques to reduce DNN serving latency, as well as their implications for the design of SERF. We make four key observations: (1) Parallelism impacts service time in complex ways, making it difficult to model service times without workload profiling. (2) DNN workloads have homogeneous requests, i.e., service times under the same parallelism degree exhibit little variance, which allows SERF to measure request service time with affordable profiling cost. (3) DNN workloads exhibit interference among concurrently running requests, which motivates the new model and solution of SERF. (4) DNN workloads show load-dependent behavior, which indicates the importance of accurate latency estimation and parallel configuration adaptation according to system load.

We present workload characterization results of two well-known image classification benchmarks, CIFAR-10 [2] and ImageNet-22K [19], on servers using Intel Xeon E5-2450 processors. Each processor has 8 cores, with private 32KB L1 and 256KB L2 caches, and a shared 20MB L3 cache. The detailed experimental setup for both workloads and hardware is provided in Section V.

A. Impact of Parallelism on Service Time

Fig. 2. Service time comparison under different parallelism techniques using ImageNet-22K. Each plot reports the speedup when increasing the degree of only one parallelism technique (fixing the other two).

Modeling the impact of parallelism on DNN serving without workload profiling is challenging because parallelism has complex effects on the computation and communication components of request service time (shown in Figures 2 and 3). Figure 2 shows the DNN request service speedup for different degrees of intra-node, inter-node, and service parallelism. For intra-node parallelism, the speedup is close to linear up to 3 cores, but slows down beyond 4 cores. This effect is due to the limited memory bandwidth: when the total memory bandwidth demand is close to or exceeds the available bandwidth, the bandwidth per core is reduced, decreasing speedup. For inter-node parallelism, increasing the parallelism degree from 1 to 2 yields a 2X service time speedup because the computation time, which is dominant, is halved, while communication time grows marginally; increasing from 2 to 4 results in super-linear speedup due to caching effects, as the working set fits in the L3 cache; increasing from 4 to 8 results in a smaller speedup increase as communication starts to dominate service time. For service parallelism, parallelism degrees of 2 or more result in increased service time due to memory interference among concurrently serviced requests. These results are indicative of the impact of the different parallelism techniques on service time. Speedups can vary widely, depending on many factors, including DNN size, the ratio of computation to communication, cache size, and memory bandwidth.
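The memory-bandwidth argument can be illustrated with a back-of-the-envelope model (our own toy calculation with made-up numbers, not a model from the paper): speedup grows roughly linearly with the number of cores until the aggregate bandwidth demand saturates the socket, after which it flattens.

```python
def intra_node_speedup(cores, bw_demand_per_core_gbs, total_bw_gbs):
    """Toy model: speedup is the smaller of the compute-bound speedup (linear in cores)
    and the bandwidth-bound speedup (total bandwidth over per-core demand)."""
    compute_bound = cores
    bandwidth_bound = total_bw_gbs / bw_demand_per_core_gbs
    return min(compute_bound, bandwidth_bound)

# Hypothetical numbers: each thread demands 8 GB/s, the socket provides 32 GB/s,
# so speedup scales until 4 cores and then plateaus.
for p in (1, 2, 3, 4, 6, 8):
    print(p, intra_node_speedup(p, 8, 32))
```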

Fig. 5. Service time comparison with different numbers of concurrent requests.

Fig. 3. Relationship between inter-node and intra-node parallelism using ImageNet-22K.

Figure 3 demonstrates the relationship between inter-node and intra-node parallelism: the results indicate that the degree of one parallelism technique can affect the behavior of another. More precisely, the intra-node parallelism speedup depends on the degree of inter-node parallelism: the speedup reduces with larger inter-node parallelism. This is because communication time becomes an increasingly dominant portion of service time with larger degrees of inter-node parallelism; therefore, the computation time improvements of intra-node parallelism become less important to overall service time.

In summary, since parallelism efficiency depends on various factors (e.g., workload and hardware properties) and since one parallelism technique can affect the behavior of others, it is difficult to accurately model service time. SERF circumvents this by incorporating workload profiling to predict request service time.

B. Homogeneous Requests

Fig. 4. CDH (Cumulative Data Histogram) of service times. The left plot is with parallelism degree tuple (2, 1, 4) and the right plot is with (4, 4, 2).

We observe that for a given parallelism degree tuple, defined as (service parallelism degree, inter-node parallelism degree, intra-node parallelism degree), the service times of DNN requests exhibit very little variance because the same amount of computation and communication is performed for each request. (Note that a parallelism degree tuple is different from a parallelism configuration: in a parallelism degree tuple, each parallelism is set exactly to the degree value, whereas in a parallelism configuration, the max service parallelism is an admission policy that defines the maximum allowed degree of service parallelism.) Thus, we refer to DNN requests as being homogeneous. Figure 4 shows two examples corresponding to two representative cases of parallelism degrees. The first example, shown in the left plot of Figure 4, uses parallelism degree tuple (2, 1, 4), where the majority of requests fall in the range of 330ms to 340ms and the SCV (squared coefficient of variation) is only 0.03. The second example, shown in the right plot of Figure 4, is under parallelism (4, 4, 2), where most requests fall in the range of 130ms to 160ms with an SCV of 0.09. The slightly larger variance can be attributed to variations in the cross-machine communication delays caused by inter-node parallelism. The magnitude of these variations is consistent with what is normally expected in computer communication systems when running a request multiple times [2
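For reference, the SCV reported above is the variance of the measured service times divided by the square of their mean; the snippet below shows the computation on hypothetical measurements (the numbers are made up, not the paper's data).

```python
import numpy as np

def squared_coefficient_of_variation(samples):
    """SCV = variance / mean^2; values close to 0 indicate homogeneous service times."""
    samples = np.asarray(samples, dtype=float)
    return samples.var() / samples.mean() ** 2

# Hypothetical per-request service times (ms) measured under one parallelism degree tuple.
service_times_ms = [331, 334, 336, 338, 333, 340, 335, 332]
print(round(squared_coefficient_of_variation(service_times_ms), 4))
```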
