Swayam - Arpan Gujarati


Swayam: Distributed Autoscaling for Machine Learning as a Service
Arpan Gujarati, Björn B. Brandenburg, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley

Machine Learning as a Service (MLaaS)
Examples: Amazon Machine Learning, Google Cloud AI, and other data science & machine learning platforms.

Machine Learning as a Service (MLaaS)
1. Training: a dataset and an untrained model are used to produce a trained model.
2. Prediction: a query to the trained model produces an answer.

Machine Learning as a Service (MLaaS)
This work focuses on prediction: models are already trained and available for prediction serving; a query to a trained model produces an answer.

Swayam: distributed autoscaling of the compute resources needed for prediction serving, inside the MLaaS infrastructure.

Prediction serving (application perspective)
An application or end user sends an image to the MLaaS provider's image classifier and receives the prediction "cat".

Prediction serving (provider perspective)
The MLaaS provider has finite compute resources ("backends" for prediction), multiple request dispatchers ("frontends"), and lots of trained models.
(1) A new prediction request arrives for the pink model.
(2) A frontend receives the request.
(3) The request is dispatched to an idle backend.
(4) The backend fetches the pink model.
(5) The request outcome is predicted.
(6) The response is sent back through the frontend.

Prediction serving (objectives)
‣ For the MLaaS provider: resource efficiency on the finite compute resources (the backends)
‣ For the application / end user: low latency, i.e., SLA compliance

Static partitioning of trained models
The trained models are partitioned among the finite backends, so there is no need to fetch and install the pink model on demand.
Problem: not all models are used at all times.
Problem: there are many more models than backends, and each model has a high memory footprint.
Static partitioning is infeasible: it achieves low latency only at the cost of resource efficiency.

Classical approach: autoscaling
The number of active backends for the pink model is automatically scaled up or down based on the request load over time.
With ideal autoscaling:
‣ Enough backends to guarantee low latency
‣ The number of active backends over time is minimized for resource efficiency

Autoscaling for MLaaS is challenging [1/3]
Challenge: the provisioning time for a backend to fetch a model (step 4) is a few seconds, whereas the execution time of a prediction (step 5) is only 10 ms to 500 ms.
Requirement: predictive autoscaling to hide the provisioning latency.

Autoscaling for MLaaS is challenging [2/3]
The MLaaS architecture is large-scale and multi-tiered: a hardware broker, frontends, and backends (VMs, containers).
Challenge: multiple frontends, each with only partial information about the workload.
Requirement: fast, coordination-free, globally-consistent autoscaling decisions on the frontends.

Autoscaling for MLaaS is challenging [3/3]
Strict, model-specific SLAs on response times, for example:
‣ "99% of requests must complete under 500 ms"
‣ "99.9% of requests must complete under 1 s"
‣ "[A] 95% of requests must complete under 850 ms"
‣ "[B] Tolerate up to a 25% increase in request rates without violating [A]"
Challenge: no closed-form solutions to get the response-time distributions needed for SLA-aware autoscaling.
Requirement: accurate waiting-time and execution-time distributions.

Swayam: model-driven distributed autoscaling
Challenges: provisioning times of a few seconds vs. execution times of 10 ms to 500 ms; multiple frontends with partial information about the workload; no closed-form solutions for the response-time distributions needed for SLA-aware autoscaling.
We address these challenges by leveraging specific ML workload characteristics, and we design an analytical model for resource estimation that allows distributed and predictive autoscaling.

Outline
1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results

System architecture
A hardware broker forwards requests from applications / end users to the frontends; the frontends dispatch requests to backends. Each trained model (pink, blue, green, ...) has a dedicated set of backends drawn from a global pool of backends.
Objective: each dedicated set of backends should dynamically scale.
1. If load decreases, extra backends go back to the global pool (for resource efficiency).
2. If load increases, new backends are set up in advance (for SLA compliance).
In what follows, we focus on the pink model.

Key idea 1: Assign states to each backend
‣ cold: in the global pool; hasn't executed a request for a while
‣ warm: dedicated to a trained model
  ‣ in-use: maybe executing a request
    ‣ busy: executing a request
    ‣ idle: waiting for a request
  ‣ not-in-use: dedicated, but not used due to reduced load; can be safely garbage collected (scale-in) or easily transitioned to an in-use state (scale-out)
How do frontends know which dedicated backends to use, and which not to use?
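The state hierarchy above can be sketched as a small state machine. This is a minimal illustration, not the paper's implementation; the `Backend` class and its transition methods are assumed names.

```python
from enum import Enum, auto

class State(Enum):
    COLD = auto()             # in the global pool
    WARM_NOT_IN_USE = auto()  # dedicated, but unused due to reduced load
    WARM_IDLE = auto()        # in use, waiting for a request
    WARM_BUSY = auto()        # in use, executing a request

class Backend:
    """Hypothetical backend with the cold/warm, in-use/not-in-use states."""

    def __init__(self):
        self.state = State.COLD

    def provision(self):
        # cold -> warm: fetch and install the model (takes a few seconds)
        assert self.state is State.COLD
        self.state = State.WARM_NOT_IN_USE

    def scale_out(self):
        # not-in-use -> in-use: cheap, the model is already installed
        assert self.state is State.WARM_NOT_IN_USE
        self.state = State.WARM_IDLE

    def scale_in(self):
        # in-use (idle) -> not-in-use: candidate for garbage collection
        assert self.state is State.WARM_IDLE
        self.state = State.WARM_NOT_IN_USE

    def garbage_collect(self):
        # not-in-use -> cold: return the backend to the global pool
        assert self.state is State.WARM_NOT_IN_USE
        self.state = State.COLD

    def start_request(self):
        assert self.state is State.WARM_IDLE
        self.state = State.WARM_BUSY

    def finish_request(self):
        assert self.state is State.WARM_BUSY
        self.state = State.WARM_IDLE
```

The assertions encode the allowed transitions: scale-out and scale-in only move between the cheap warm sub-states, while the expensive cold-to-warm provisioning is a separate step.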

Key idea 2: Order the dedicated set of backends
The backends dedicated to the pink model are numbered 1-12. If 9 backends are sufficient for SLA compliance:
‣ Frontends use backends 1-9 (warm in-use, busy or idle)
‣ Backends 10-12 transition to the not-in-use state (warm not-in-use)
How do frontends know how many backends are sufficient?
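Because every frontend applies the same agreed-on ordering, the "use backends 1 through n" rule needs no coordination: any frontend that computes the same n arrives at the same partition. A minimal sketch (the helper name is hypothetical):

```python
def split_backends(ordered_backends, n):
    """Given the agreed-on ordering and the required count n, return the
    in-use prefix and the not-in-use suffix (hypothetical helper)."""
    return ordered_backends[:n], ordered_backends[n:]

# 12 dedicated backends, 9 sufficient for SLA compliance.
in_use, not_in_use = split_backends(list(range(1, 13)), 9)
```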

Key idea 3: Swayam instance on every frontend
A Swayam instance on every frontend computes the globally consistent minimum number of backends necessary for SLA compliance; incoming requests are dispatched only to the in-use backends.

Outline
1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results

Making globally-consistent decisions at each frontend (Swayam instance)
What is the minimum number of backends required for SLA compliance? Swayam estimates:
1. The expected request execution time
2. The expected request waiting time
3. The total request load
All three estimates leverage ML workload characteristics.

Determining expected request execution times
We studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform.
[Figure: normalized frequency (%) of service times (ms) for Trace 1, data binned at 10 ms, with a fitted log-normal distribution]
Variation is low:
‣ Fixed-sized feature vectors
‣ Input-independent control flow
‣ Non-deterministic machine & OS events are the main sources of variability
Execution times are modeled using log-normal distributions.
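Fitting a log-normal only requires the mean and standard deviation of the log-transformed service times. A stdlib-only sketch; the synthetic "trace" and its parameters are assumptions for illustration, not Azure trace values:

```python
import math
import random
import statistics

def fit_lognormal(samples_ms):
    """Estimate (mu, sigma) of a log-normal from service-time samples."""
    logs = [math.log(s) for s in samples_ms]
    return statistics.fmean(logs), statistics.pstdev(logs)

def lognormal_percentile(mu, sigma, p, n=100_000, seed=1):
    """Approximate the p-th percentile by sampling (no scipy needed)."""
    rng = random.Random(seed)
    draws = sorted(rng.lognormvariate(mu, sigma) for _ in range(n))
    return draws[int(p / 100 * n)]

# Synthetic trace: service times around 100 ms with low variation.
rng = random.Random(42)
trace = [rng.lognormvariate(math.log(100), 0.1) for _ in range(10_000)]
mu, sigma = fit_lognormal(trace)
p99 = lognormal_percentile(mu, sigma, 99)
```

The fitted `(mu, sigma)` pair is exactly what a per-model execution-time distribution needs; the percentile helper is one way to query the tail of that distribution.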

Determining expected request waiting times: load balancing (LB)
[Figure: waiting time (ms) vs. number of backends for four LB policies, with a 350 ms threshold marked]
‣ Global scheduling and partitioned scheduling perform well, but there are implementation tradeoffs
‣ Join-Idle-Queue (JIQ) does not result in good tail waiting times
‣ Random dispatch gives much better tail waiting times
We use an LB policy based on random dispatch.
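The tail-waiting-time behavior of random dispatch can be reproduced in miniature: each request goes to a uniformly chosen backend's FIFO queue. This is a sketch under assumed arrival and service rates, not the paper's experimental parameters:

```python
import random

def simulate_random_dispatch(n_backends, arrival_rate_per_ms,
                             mean_service_ms, n_requests=20_000, seed=7):
    """Return per-request waiting times (ms) under random dispatch,
    with Poisson arrivals and exponential service times (assumptions)."""
    rng = random.Random(seed)
    free_at = [0.0] * n_backends  # earliest time each backend is free
    t, waits = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate_per_ms)  # next arrival
        b = rng.randrange(n_backends)              # random dispatch
        start = max(t, free_at[b])
        waits.append(start - t)
        free_at[b] = start + rng.expovariate(1 / mean_service_ms)
    return waits

def p99(xs):
    return sorted(xs)[int(0.99 * len(xs))]

# 0.2 requests/ms with 100 ms mean service = an offered load of
# 20 backends' worth of work; compare 30 vs. 60 backends.
tail_30 = p99(simulate_random_dispatch(30, 0.2, 100))
tail_60 = p99(simulate_random_dispatch(60, 0.2, 100))
```

Because random dispatch splits the arrival stream independently, each backend behaves like its own single-server queue, which is what makes the waiting-time distribution analytically tractable.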

Determining the total request load (in the near future, to account for high provisioning times)
The hardware broker spreads requests uniformly among the F frontends, so each frontend observes a local rate L' = L/F, where L is the total request rate and F is the total number of frontends.
Each Swayam instance:
‣ Predicts L' for the near future (the prediction horizon depends on the time to set up a new backend)
‣ Given F, computes L = F × L' (F is determined from the broker, or through a gossip protocol)
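Since each Swayam instance only observes its own arrivals, a per-frontend rate estimate multiplied by F recovers the total load. A minimal sketch; the EWMA smoothing is an assumed choice, not necessarily the paper's predictor:

```python
class LoadEstimator:
    """Per-frontend load estimator: smooths the locally observed request
    rate L' and extrapolates the total rate L ~= F * L'."""

    def __init__(self, n_frontends, alpha=0.3):
        self.F = n_frontends   # learned from the broker or via gossip
        self.alpha = alpha     # EWMA smoothing factor (assumption)
        self.local_rate = 0.0  # L', requests per second

    def observe(self, requests, interval_s):
        """Fold one measurement interval into the smoothed local rate."""
        sample = requests / interval_s
        self.local_rate = (self.alpha * sample
                           + (1 - self.alpha) * self.local_rate)

    def total_load(self):
        # The broker spreads load uniformly, so L ~= F * L'.
        return self.F * self.local_rate
```

For example, a frontend in an 8-frontend deployment that steadily sees 100 requests/s converges to a total-load estimate of about 800 requests/s.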

Making globally-consistent decisions at each frontend (Swayam instance)
What is the minimum number of backends required for SLA compliance? Recall the three inputs: the expected request execution time, the expected request waiting time, and the total request load.

SLA-aware resource estimation
For each trained model, the SLA consists of a response-time threshold RTmax, a service level SLmin, and a burst threshold U. Swayam computes n, the minimum number of backends required for SLA compliance:
1. Initialization: n = 1.
2. Response-time modeling: combine the waiting-time distribution (for the load, amplified based on the burst threshold U) with the execution-time distribution to obtain the SLmin-percentile response time; a closed-form expression is given in the appendix.
3. If the SLmin-percentile response time is at most RTmax, output n; otherwise, increment n and retry, as long as the estimate is not SLA compliant.
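The estimation loop can be sketched end-to-end: amplify the load by the burst threshold U, compute the SLmin-percentile response time for n backends, and increase n until the SLA holds. The paper derives a closed-form percentile; here a small random-dispatch simulation with exponential service times stands in for it, purely for illustration:

```python
import random

def percentile_response_time(n, load_rps, mean_service_ms, sl_min,
                             n_requests=10_000, seed=3):
    """SLmin-percentile response time (ms) for n backends under random
    dispatch, estimated by simulation (a stand-in for the closed form)."""
    rng = random.Random(seed)
    free_at = [0.0] * n
    t, resp = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(load_rps) * 1000.0  # arrival gap, in ms
        b = rng.randrange(n)                     # random dispatch
        start = max(t, free_at[b])
        service = rng.expovariate(1 / mean_service_ms)
        free_at[b] = start + service
        resp.append(start - t + service)         # waiting + execution
    return sorted(resp)[int(sl_min * len(resp))]

def min_backends(rt_max_ms, sl_min, load_rps, burst_u, mean_service_ms,
                 n_max=100):
    """Smallest n whose SLmin-percentile response time meets RTmax,
    with the load amplified by the burst threshold U."""
    amplified = load_rps * burst_u
    # Below the offered load, queues are unstable; start just above it.
    n = int(amplified * mean_service_ms / 1000.0) + 1
    while n <= n_max:
        if percentile_response_time(n, amplified, mean_service_ms,
                                    sl_min) <= rt_max_ms:
            return n
        n += 1
    return n_max

# Example SLA: C = 100 ms, RTmax = 5C = 500 ms, SLmin = 99%, U = 2x.
n = min_backends(rt_max_ms=500, sl_min=0.99, load_rps=5.0,
                 burst_u=2.0, mean_service_ms=100)
```

The closed-form expression in the paper replaces the inner simulation, which is what makes this loop cheap enough to run on every frontend.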

Swayam framework (recap)
A Swayam instance on every frontend computes the globally consistent minimum number of backends necessary for SLA compliance; incoming requests are dispatched only to the in-use backends dedicated to the pink model.

Outline
1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results

Evaluation setup
‣ Prototype in C on top of Apache Thrift
‣ 100 backends per service, 8 frontends, 1 broker, 1 server (for simulating the clients)
‣ Workload: 15 production service traces (Microsoft Azure MLaaS)
‣ Three-hour traces (request arrival times and computation times)
‣ Query computation & model setup times emulated by spinning

SLA configuration for each model
‣ Response-time threshold RTmax = 5C, where C denotes the mean computation time for the model
‣ Desired service level SLmin = 99%: 99% of the requests must have response times under RTmax
‣ Burst threshold U = 2x: tolerate an increase in request rate of up to 100%
‣ Initially, 5 pre-provisioned backends

Baseline: Clairvoyant Autoscaler (ClairA)
‣ Knows the processing time of each request beforehand
‣ Can travel back in time to provision a backend
‣ "Deadline-driven" approach to minimize resource waste
‣ ClairA1 assumes zero setup times and immediate scale-ins; reflects the size of the workload
‣ ClairA2 assumes non-zero setup times and lazy scale-ins; Swayam-like
‣ Both ClairA1 and ClairA2 depend on RTmax, but not on SLmin or U

Resource usage vs. SLA compliance
[Figure: normalized resource usage of ClairA1, ClairA2, and Swayam for trace IDs 1-15, annotated with Swayam's frequency of SLA compliance]
‣ Swayam performs much better than ClairA2 in terms of resource efficiency
‣ Swayam is resource efficient, but at the cost of some SLA compliance (e.g., a 97% frequency of SLA compliance on Trace 2)
‣ On one very bursty trace, Swayam seems to perform poorly

Summary
‣ Perfect SLA compliance, irrespective of the input workload, is too expensive in terms of resource usage (as modeled by ClairA)
‣ To ensure resource efficiency, practical systems need to trade off some SLA compliance while managing client expectations
‣ Swayam strikes a good balance for MLaaS prediction serving, realizing significant resource savings at the cost of occasional SLA violations
‣ Easy integration into any existing request-response architecture

Thank you. Questions?

