CS6787 Lecture 2 — Fall 2021 (cs.cornell.edu)


Getting SGD Off The Ground: Basic Techniques We Always Use (CS6787 Lecture 2 — Fall 2021)

The hidden cost of SGD. By switching to SGD, we eliminated the costly sum over the dataset: gradient descent computes $w_{t+1} = w_t - \alpha_t \frac{1}{N} \sum_{i=1}^N \nabla f(w_t; x_i)$, while SGD computes $w_{t+1} = w_t - \alpha_t \nabla f(w_t; x_{i_t})$. But the cost of computing the individual gradients remains, and we'll need to run more steps.

Using gradients naively is problematic. Hardware efficiency perspective: we need some way to compute gradients efficiently on the underlying hardware. Software engineering perspective: if we had to express the gradients by hand, this requires human effort. It also makes it difficult to change the objective, since we'd need to re-derive the gradient, and it makes us prone to bugs.

How do we address these problems? Automatic differentiation: compute a gradient automatically — just need to specify the objective. Prime example: backpropagation. We can also decompose the gradient computation into operators that will run efficiently on hardware. Machine learning frameworks: make it easy to express learning tasks in a high-level language, with support for building scalable systems.

Why Automatic Differentiation? There are other classical approaches we have to compute derivatives. Symbolic differentiation: express the objective function as a mathematical expression, then differentiate symbolically by applying the chain rule and other rules of calculus. Numerical differentiation: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x-h)}{2h} \approx \frac{f(x+h) - f(x-h)}{2h}$ for small $h$. Why might these be a bad approach for SGD?
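
A minimal Python sketch (not from the lecture) of central-difference numerical differentiation for a multivariate objective; the function name numerical_grad and the quadratic loss are illustrative. It highlights one reason this is a poor fit for SGD: each gradient estimate costs two objective evaluations per parameter, and the result is only approximate.

```python
import numpy as np

def numerical_grad(f, w, h=1e-5):
    """Approximate the gradient of f at w by central differences.

    Costs 2 * len(w) evaluations of f per gradient estimate, versus a
    small constant factor over one evaluation for reverse-mode autodiff.
    """
    grad = np.zeros_like(w)
    for k in range(len(w)):
        e = np.zeros_like(w)
        e[k] = h
        grad[k] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

# Illustrative objective: f(w) = 0.5 * ||w||^2, whose exact gradient is w.
f = lambda w: 0.5 * np.dot(w, w)
print(numerical_grad(f, np.array([1.0, -2.0, 3.0])))  # approx [ 1. -2.  3.]
```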

How does automatic differentiation work? Main idea: transform the program that computes the objective directly into a program that computes the gradient. There are many ways of doing automatic differentiation; two broad classes are forward mode and reverse mode. For most ML applications, we use backpropagation, a particular flavor of reverse-mode automatic differentiation, one specialized to compute gradients of neural network objectives.

Backpropagation. Start with a computation graph that represents the function to be differentiated. Forward pass: compute through the graph normally, as if we were computing the function value, saving all intermediate values used in the computation — memory cost. Backward pass: now proceed backwards through the graph, computing gradients of the function value w.r.t. intermediates.

Backpropagation: A simple example. Consider using backpropagation to differentiate the function $h(x) = \exp(\sin(\cos(x)))$, whose derivative is $h'(x) = \exp(\sin(\cos(x))) \cdot \cos(\cos(x)) \cdot (-\sin(x))$. Notice that there's a bunch of redundant expressions here. Backpropagation will compute, in order: forward pass $a_1 = \cos(x)$, $a_2 = \sin(a_1)$, $a_3 = \exp(a_2)$; backward pass $g_3 = \exp(a_2)$, $g_2 = g_3 \cdot \cos(a_1)$, $h'(x) = g_2 \cdot (-\sin(x))$.
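
A minimal plain-Python sketch (not from the lecture) of the forward and backward passes for this example; it illustrates how the intermediates saved in the forward pass are reused when computing the gradient.

```python
import math

def h_and_grad(x):
    # Forward pass: compute h(x) = exp(sin(cos(x))), saving intermediates.
    a1 = math.cos(x)
    a2 = math.sin(a1)
    a3 = math.exp(a2)           # h(x)

    # Backward pass: gradients of h w.r.t. each intermediate, then x.
    g3 = a3                     # dh/da2 = exp(a2), reusing the forward value
    g2 = g3 * math.cos(a1)      # dh/da1, reusing the saved a1
    g1 = g2 * (-math.sin(x))    # dh/dx = h'(x)
    return a3, g1

value, grad = h_and_grad(0.5)
```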

Key thing to know: automatic differentiation can compute gradients for any function that has differentiable components, to arbitrary precision, using a small constant factor of additional compute compared with the cost to compute the objective.

Systems tradeoffs of backpropagation. Question: what are some aspects of backpropagation that are beneficial from a systems/hardware efficiency perspective? Question: what are some aspects of backpropagation that may present an additional systems/hardware efficiency cost, relative to the cost of computing the function (but not the gradient)?

This solves part of the problem, but there's still a software engineering challenge. How do we build software that people can use to reliably and robustly build, train, and deploy machine learning solutions?

The answer: machine learning frameworks. Goal: make ML easier, from a software engineering perspective, by making the computations more reliable, debuggable, and robust. Goal: make ML scalable, to large datasets running on distributed heterogeneous hardware. Goal: make ML accessible, so that even people who aren't ML systems experts can get good performance.

ML frameworks come in a few flavors. General machine learning frameworks (goal: make a wide range of ML workloads and applications easy for users). General big data processing frameworks (focus: computing large-scale parallel operations quickly; typically has machine learning as a major, but not the only, application). Deep learning frameworks (focus: fast scalable backpropagation and inference, although typically supporting other applications as well).

How can we evaluate an ML framework? How popular is it? Use drives use — ML frameworks have a snowball effect: popular frameworks attract more development and eventually more features. Who is behind it? Major companies ensure long-term support. What are its features? Often the least important consideration — unfortunately.

Common Features of Machine Learning Frameworks

What do ML frameworks support? Basic tensor operations: provides the low-level math behind all the algorithms, including support for running them on hardware such as GPUs. Automatic differentiation: used to make it easy to run backprop on any model. Simple-to-use composable implementations of systems techniques, including most of the techniques we will discuss in the remainder of this course.

Tensors. CS way to think about it: a tensor is a multidimensional array. Math way to think about it: a tensor is a multilinear map $T : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \cdots \times \mathbb{R}^{d_n} \to \mathbb{R}$, equivalently $T \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_n}$, where $T(x_1, x_2, \ldots, x_n)$ is linear in each $x_i$ with the other inputs fixed. Here the number $n$ is called the order of the tensor. For example, a matrix is just a 2nd-order tensor.

Examples of Tensors in Machine Learning. The CIFAR10 dataset consists of 60000 32x32 color images. We can write the training set as a tensor $T_{\text{CIFAR10}} \in \mathbb{R}^{32 \times 32 \times 3 \times 60000}$. Gradients for deep learning can also be tensors. Example: a fully-connected layer with 100 input and 100 output neurons, and mini-batch size $b = 32$, gives $G \in \mathbb{R}^{100 \times 100 \times 32}$.

Common Operations on Tensors. Elementwise operations — looks like vector sum. Example: the Hadamard product $(A \odot B)_{i_1, i_2, \ldots, i_n} = A_{i_1, i_2, \ldots, i_n} B_{i_1, i_2, \ldots, i_n}$. Broadcast operations — expand along one or more dimensions. Example: $A \in \mathbb{R}^{11 \times 1}$, $B \in \mathbb{R}^{11 \times 5}$, then with broadcasting $(A \odot B)_{i,j} = A_{i,1} B_{i,j}$; the extreme version of this is the tensor product. Matrix-multiply-like operations — sum or reduce along a dimension, also called tensor contraction.
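
A minimal NumPy sketch (not from the lecture) of the three kinds of operation above; the shapes follow the slide's broadcasting example.

```python
import numpy as np

B = np.random.randn(11, 5)
C = np.random.randn(11, 5)
hadamard = B * C            # elementwise: result[i, j] = B[i, j] * C[i, j]

A = np.random.randn(11, 1)
broadcasted = A * B         # broadcast: result[i, j] = A[i, 0] * B[i, j]

D = np.random.randn(5, 3)
contracted = B @ D          # contraction: result[i, k] = sum_j B[i, j] * D[j, k]
```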

Broadcasting makes ML easy to write. Here's how easy it is to write the loss and gradient for logistic regression; it doesn't even need to include a for-loop (see the sketch below).
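
A hedged NumPy sketch of what such no-for-loop code can look like, assuming labels $y_i \in \{-1, +1\}$; the variable names are illustrative and this is not necessarily the exact code shown in lecture.

```python
import numpy as np

def logistic_loss_and_grad(w, X, y):
    """Average logistic loss and its gradient; X is (N, d), y in {-1, +1}^N.

    Broadcasting handles the whole dataset at once: no explicit loop.
    """
    margins = y * (X @ w)                     # shape (N,)
    loss = np.mean(np.log1p(np.exp(-margins)))
    coeffs = -y / (1.0 + np.exp(margins))     # chain rule: dloss_i/dw = coeffs[i] * X[i]
    grad = X.T @ coeffs / len(y)              # shape (d,)
    return loss, grad
```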

Tensors: a systems perspective. Loads of data parallelism: tensors are in some sense the structural embodiment of data parallelism, though with multiple dimensions it's not always obvious which one is best to parallelize over. Predictable linear memory access patterns: great for locality. Many different ways to organize the computation: creates opportunities for frameworks to automatically optimize.

General Machine Learning Frameworks

Python and NumPy. NumPy adds large multi-dimensional array and matrix types (tensors) to Python and supports basic numerical operations on tensors, on the CPU. SciPy builds on NumPy and adds tools for scientific computing: optimization, data structures, statistics, symbolic computing, etc. The ecosystem also has an interactive interface (Jupyter) and a neat plotting tool (matplotlib). A great ecosystem for prototyping systems.

Julia and MATLAB. Julia: a relatively new language (8 years old) with a growing community; natively supports numerical computing and all the tensor ops; syntax is nicer than Python, and it's often faster; has Flux, a library for machine learning that supports backpropagation; but less support from the community and less library support. MATLAB: the decades-old standard for numerical computing; supports tensor computation, and some people use it for ML; but it has less attention from the community because it's proprietary.

scikit-learn. A broad, full-featured toolbox of machine learning and data analysis tools, in Python. Features support for classification, regression, clustering, and dimensionality reduction, including SVM, logistic regression, k-Means, and PCA.

Theano. Machine learning library for Python, created by the University of Montreal. Supports tight integration with NumPy, and also supports CPU and GPU integration, making it very fast for a lot of applications. Development has ceased because of competition from other libraries.

General Big Data Processing Frameworks

The original: MapReduce/Hadoop. Invented by Google to handle distributed processing. People started to use it for distributed machine learning, and people still use it today. But it's mostly been supplanted by other libraries, and for good reason: Hadoop does a lot of disk writes in order to be robust against failure of individual machines, which is not necessary for machine learning applications.

Apache Spark. Open-source cluster computing framework, built in Scala, which can also embed in Python. Developed by the Berkeley AMP lab, now spun off into a company: Databricks. The original pitch: 100x faster than Hadoop/MapReduce. Architecture based on resilient distributed datasets (RDDs), essentially a distributed fault-tolerant data-parallel array.

Spark MLlib. Scalable machine learning library built on top of Spark. Supports most of the same algorithms scikit-learn supports: classification, regression, decision trees, clustering, topic modeling. Not primarily a deep learning library. Major benefit: interaction with other processing in Spark, such as SparkSQL to handle database-like computation and GraphX to handle graph-like computation.

Apache Mahout. Backend-independent programming environment for machine learning. Can support Spark as a backend, but also supports basic MapReduce/Hadoop. Focuses mostly on collaborative filtering, clustering, and classification, similarly to MLlib and scikit-learn. Also not very deep learning focused.

Many more here. Lots of very good frameworks for large-scale parallel programming don't end up becoming popular. Takeaway: it's important to release code people can use easily, and to capture a group of users who can then help develop the framework.

Deep Learning Frameworks

Caffe. Deep learning framework developed by Berkeley AI Research. Declarative expressions for describing network architecture. Fast — runs on CPUs and GPUs out of the box, and supports a lot of optimization techniques. Huge community of users both in academia and industry.

Caffe code example

TensorFlow. End-to-end deep learning system, developed by Google Brain. API primarily in Python, with support for other languages. Architecture: build up a computation graph in Python, then the framework schedules it automatically on the available resources. Although recently TensorFlow has announced an eager version. Super-popular, and still very popular for deploying ML.

TensorFlow code example
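
A hedged sketch of a small TensorFlow 2 (Keras) classifier in this spirit; it is illustrative, not the example shown in lecture.

```python
import tensorflow as tf

# A small fully-connected classifier trained with minibatch SGD.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, batch_size=32, epochs=5)  # data loading omitted
```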

PyTorch. Python package that focuses on tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system. Eager computation out of the box. Uses a technique called reverse-mode auto-differentiation, which allows users to change network behavior arbitrarily with zero lag or overhead; fastest implementation of this method. PyTorch is the most popular framework for ML research.

PyTorch example
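
A hedged sketch of a comparable PyTorch model and one training step, illustrating the eager, tape-based autograd style; it is not the lecture's exact code.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 784)               # an illustrative random minibatch
y = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                        # reverse-mode autodiff (backprop)
optimizer.step()
```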

MXNet. Deep learning library from Apache. Scalable C++ backend, with support for many frontend languages, including Python, Scala, C++, R, and Perl. Focus on scalability to multiple GPUs. Sometimes performs better than competing approaches.

MXNet Example (from the MXNet MNIST tutorial)
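
A hedged sketch in the style of MXNet's symbolic API for an MNIST-sized multilayer perceptron, not the exact tutorial listing; the training call is left as a comment since it assumes an MNIST data iterator.

```python
import mxnet as mx

# Declare the network symbolically; MXNet schedules the resulting graph.
data = mx.sym.Variable('data')
fc1 = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
fc2 = mx.sym.FullyConnected(data=act1, num_hidden=10, name='fc2')
mlp = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

# Training sketch (train_iter would be an mx.io iterator over MNIST):
# mod = mx.mod.Module(symbol=mlp, context=mx.cpu())
# mod.fit(train_iter, optimizer='sgd',
#         optimizer_params={'learning_rate': 0.1}, num_epoch=10)
```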

...and many other frameworks for ML: Theano, ONNX, JAX. New frameworks will continue to be developed!

ML Frameworks: Conclusion. We use ML frameworks to both make ML code run more efficiently and make it easier to express learning procedures. These frameworks are important to know about because they give you the tools you can use to build ML software. Most of you will be using an ML framework to do your course project.

To get SGD off the ground, we don't just use software. Here are some basic statistical techniques that we pretty much always use. Getting SGD Off The Ground: Basic Techniques We Always Use (CS6787 Lecture 2 — Fall 2021)

Mini-Batching

Gradient Descent vs. SGD. Gradient descent: all examples at once, $w_{t+1} = w_t - \alpha_t \frac{1}{N} \sum_{i=1}^N \nabla f(w_t; x_i)$. Stochastic gradient descent: one example at a time, $w_{t+1} = w_t - \alpha_t \nabla f(w_t; x_{i_t})$. Is it really all or nothing? Can we do something intermediate?

Mini-Batch Stochastic Gradient Descent. An intermediate approach: $w_{t+1} = w_t - \alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla f(w_t; x_i)$, where $B_t$ is sampled uniformly from the set of all subsets of $\{1, \ldots, N\}$ of size $b$. The $b$ parameter is the batch size; typically we choose $b \ll N$. Also called mini-batch gradient descent.
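
A minimal NumPy sketch (not from the lecture) of this update rule; grad_f is a placeholder for whatever per-example gradient the model defines.

```python
import numpy as np

def minibatch_sgd(grad_f, w0, X, alpha=0.1, b=32, num_steps=1000, seed=0):
    """Mini-batch SGD: w <- w - alpha * (average gradient over a batch).

    grad_f(w, x) returns the gradient of f(w; x) for a single example x.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    N = len(X)
    for _ in range(num_steps):
        batch = rng.choice(N, size=b, replace=False)  # B_t: uniform subset of size b
        g = np.mean([grad_f(w, X[i]) for i in batch], axis=0)
        w = w - alpha * g
    return w
```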

How does the runtime cost of Mini-Batch compare to SGD and Gradient Descent? It takes less time to compute each update than gradient descent: we only need to sum up $b$ gradients, rather than $N$, in $w_{t+1} = w_t - \alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla f(w_t; x_i)$. But it takes more time for each update than SGD. So what's the benefit? It's more like gradient descent, so maybe it converges faster than SGD?

Mini-Batch SGD Converges. Start by breaking up the update rule into expected update and noise:
$$w_{t+1} - w^* = w_t - w^* - \alpha_t \left(\nabla h(w_t) - \nabla h(w^*)\right) - \alpha_t \cdot \frac{1}{|B_t|} \sum_{i \in B_t} \left(\nabla f(w_t; x_i) - \nabla h(w_t)\right).$$
Second moment bound:
$$\mathbf{E}\left[\|w_{t+1} - w^*\|^2\right] = \mathbf{E}\left[\left\|w_t - w^* - \alpha_t \left(\nabla h(w_t) - \nabla h(w^*)\right)\right\|^2\right] + \alpha_t^2 \, \mathbf{E}\left[\left\|\frac{1}{|B_t|} \sum_{i \in B_t} \left(\nabla f(w_t; x_i) - \nabla h(w_t)\right)\right\|^2\right].$$

Mini-Batch SGD Converges (continued). Let $\Delta_i = \nabla f(w_t; x_i) - \nabla h(w_t)$, and let $\epsilon_i = 1$ if $i \in B_t$ and $\epsilon_i = 0$ if $i \notin B_t$. Then
$$\mathbf{E}\left[\left\|\frac{1}{|B_t|} \sum_{i \in B_t} \left(\nabla f(w_t; x_i) - \nabla h(w_t)\right)\right\|^2\right] = \frac{1}{b^2} \, \mathbf{E}\left[\left\|\sum_{i=1}^N \epsilon_i \Delta_i\right\|^2\right] = \frac{1}{b^2} \sum_{i=1}^N \sum_{j=1}^N \mathbf{E}\left[\epsilon_i \epsilon_j\right] \Delta_i^T \Delta_j.$$

Mini-Batch SGD Converges (continued). Because we sampled $B_t$ uniformly at random, for $i \ne j$,
$$\mathbf{E}[\epsilon_i \epsilon_j] = \mathbf{P}(i \in B_t \wedge j \in B_t) = \mathbf{P}(i \in B_t)\,\mathbf{P}(j \in B_t \mid i \in B_t) = \frac{b}{N} \cdot \frac{b-1}{N-1}, \qquad \mathbf{E}\left[\epsilon_i^2\right] = \mathbf{P}(i \in B_t) = \frac{b}{N}.$$
So we can bound our squared error term as
$$\mathbf{E}\left[\left\|\frac{1}{|B_t|} \sum_{i \in B_t} \Delta_i\right\|^2\right] = \frac{1}{b^2}\left(\frac{b(b-1)}{N(N-1)} \sum_{i \ne j} \Delta_i^T \Delta_j + \frac{b}{N} \sum_{i=1}^N \|\Delta_i\|^2\right).$$

Mini-Batch SGD Converges (continued). Since $\sum_{i=1}^N \Delta_i = \sum_{i=1}^N \left(\nabla f(w_t; x_i) - \nabla h(w_t)\right) = 0$, we have $\sum_{i \ne j} \Delta_i^T \Delta_j = \left\|\sum_{i=1}^N \Delta_i\right\|^2 - \sum_{i=1}^N \|\Delta_i\|^2 = -\sum_{i=1}^N \|\Delta_i\|^2$, so
$$\mathbf{E}\left[\left\|\frac{1}{|B_t|} \sum_{i \in B_t} \Delta_i\right\|^2\right] = \frac{1}{b^2}\left(\frac{b}{N} - \frac{b(b-1)}{N(N-1)}\right) \sum_{i=1}^N \|\Delta_i\|^2 = \frac{N-b}{bN(N-1)} \sum_{i=1}^N \|\Delta_i\|^2.$$

Mini-Batch SGD Converges (continued).
$$\mathbf{E}\left[\left\|\frac{1}{|B_t|} \sum_{i \in B_t} \left(\nabla f(w_t; x_i) - \nabla h(w_t)\right)\right\|^2\right] = \frac{N-b}{b(N-1)} \cdot \frac{1}{N} \sum_{i=1}^N \left\|\nabla f(w_t; x_i) - \nabla h(w_t)\right\|^2 \le \frac{N-b}{b(N-1)} \cdot M \le \frac{M}{b}.$$
Compared with SGD, the squared error term is decreased by a factor of $b$.

Mini-Batch SGD Converges (continued). Recall that SGD converged to a noise ball of size at most
$$\lim_{T \to \infty} \mathbf{E}\left[\|w_T - w^*\|^2\right] \le \frac{\alpha M}{2\mu - \alpha \mu^2}.$$
Since mini-batching decreases the error term by a factor of $b$, it will have
$$\lim_{T \to \infty} \mathbf{E}\left[\|w_T - w^*\|^2\right] \le \frac{\alpha M}{(2\mu - \alpha \mu^2)\, b}.$$
The noise ball is smaller by the same factor!

Advantages of Mini-Batch (reprise). Takes less time to compute each update than gradient descent: only needs to sum up $b$ gradients, rather than $N$, in $w_{t+1} = w_t - \alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla f(w_t; x_i)$. Converges to a smaller noise ball than stochastic gradient descent: $\lim_{T \to \infty} \mathbf{E}\left[\|w_T - w^*\|^2\right] \le \frac{\alpha M}{(2\mu - \alpha \mu^2)\, b}$.

How to choose the batch size? Mini-batching is not a free win. Naively, compared with SGD, it takes $b$ times as much effort to get a $b$-times-as-accurate answer; but we could have gotten a $b$-times-as-accurate answer by just running SGD for $b$ times as many steps with a step size of $\alpha/b$. But it still makes sense to run it for systems and statistical reasons: mini-batching exposes more parallelism, and mini-batching lets us estimate statistics about the full gradient more accurately — we'll see this come up in Batch Normalization. Another use case for hyperparameter optimization.

Mini-Batch SGD is very widely used, including in basically all neural network training. $b = 32$ is a classical typical default value for batch size, from "Practical Recommendations for Gradient-Based Training of Deep Architectures," Bengio 2012. Nowadays larger batch sizes are more common: $b = 64$, $b = 128$. Some of this change is driven by systems considerations!

Overfitting, Generalization Error, and Regularization

Minimizing Training Loss is Not our Real Goal. Training loss looks like $h(w) = \frac{1}{N} \sum_{i=1}^N f(w; x_i)$. What we actually want to minimize is the expected loss on new examples, drawn from some real-world distribution $\phi$: $\bar h(w) = \mathbf{E}_{x \sim \phi}\left[f(w; x)\right]$. Typically, we assume the training examples were drawn from this distribution.

Overfitting. Minimizing the training loss doesn't generally minimize the expected loss on new examples: they are two different objective functions, after all. The difference between the empirical loss on the training set and the expected loss on new examples is called the generalization error. Even a model that has high accuracy on the training set can have terrible performance on new examples; this phenomenon is called overfitting.

Demo

How to address overfitting. There are many, many techniques to deal with overfitting, with varying computational costs. But this is a systems course, so what can we do with little or no extra computational cost? Notice from the demo that some loss functions do better than others. Can we modify our loss function to prevent overfitting?

Regularization. Add an extra regularization term to the objective function. Most popular type: L2 regularization,
$$h(w) = \frac{1}{N} \sum_{i=1}^N f(w; x_i) + \lambda \|w\|_2^2 = \frac{1}{N} \sum_{i=1}^N f(w; x_i) + \lambda \sum_{k=1}^d w_k^2.$$
Also popular: L1 regularization,
$$h(w) = \frac{1}{N} \sum_{i=1}^N f(w; x_i) + \lambda \|w\|_1 = \frac{1}{N} \sum_{i=1}^N f(w; x_i) + \lambda \sum_{k=1}^d |w_k|.$$

Benefits of Regularization. Cheap to compute: for SGD and L2 regularization, there's just an extra scaling,
$$w_{t+1} = (1 - 2\alpha_t \lambda)\, w_t - \alpha_t \nabla f(w_t; x_{i_t}).$$
L2 regularization makes the objective strongly convex, which makes it easier to get and prove bounds on convergence. Helps with overfitting.
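
A minimal sketch (not from the lecture) of this regularized update in plain Python; grad_f is again a placeholder per-example gradient and lam stands for the regularization parameter $\lambda$.

```python
def sgd_step_l2(w, x, grad_f, alpha, lam):
    """One SGD step on the L2-regularized objective f(w; x) + lam * ||w||^2.

    The gradient of the regularizer is 2 * lam * w, which appears as a
    simple rescaling: w <- (1 - 2*alpha*lam) * w - alpha * grad_f(w, x).
    """
    return (1.0 - 2.0 * alpha * lam) * w - alpha * grad_f(w, x)
```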

Demo

How to choose the regularization parameter? One way is to use an independent validation set to estimate the test error, and set the regularization parameter manually so that it is high enough to avoid overfitting. This is what we saw in the demo. But doing this naively can be computationally expensive: we need to re-run the learning algorithm many times (see the sketch below). Yet another use case for hyperparameter optimization.
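
A hedged sketch (not from the lecture) of the naive procedure described above: re-train once per candidate $\lambda$ and keep the value with the lowest validation loss. train_model and validation_loss are placeholder functions standing in for the learning algorithm and the held-out evaluation.

```python
def choose_lambda(train_model, validation_loss,
                  candidates=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Pick the regularization parameter with the lowest validation loss.

    Each candidate requires a full re-run of training, which is why this
    naive approach is computationally expensive.
    """
    best_lam, best_loss = None, float("inf")
    for lam in candidates:
        w = train_model(lam)           # e.g. mini-batch SGD with L2 penalty lam
        loss = validation_loss(w)      # empirical loss on the validation set
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam
```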

More general forms of regularization. Regularization is used more generally to describe anything that helps prevent overfitting, by biasing learning to make some models more desirable a priori. Many techniques that give throughput improvements also have a regularizing effect; sometimes this is a win-win of better statistical and hardware performance.

Early Stopping

Asymptotically large training sets. Setting 1: we have a distribution $\phi$ and we sample a very large (asymptotically infinite) number of points from it, then run stochastic gradient descent on that training set for only $N$ iterations. Can our algorithm in this setting overfit? No, because its training set is asymptotically equal to the true distribution. Can we compute this efficiently? No, because its training set is asymptotically infinitely large.

Consider a second setting. Setting 1: we have a distribution $\phi$ and we sample a very large (asymptotically infinite) number of points from it, then run stochastic gradient descent on that training set for only $N$ iterations. Setting 2: we have a distribution $\phi$ and we sample $N$ points from it, then run stochastic gradient descent using each of these points exactly once. What is the difference between the output of SGD in these two settings? Asymptotically, there's no difference! So SGD in Setting 2 will also never overfit.

Early Stopping. Motivation: if we only use each training example once for SGD, then we can't overfit. So if we only use each example a few times, we probably won't overfit too much. Early stopping: just stop running SGD before it converges.
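
A minimal NumPy sketch (not from the lecture) of the flavor of early stopping described here: make only a few passes over the training data and stop, well before SGD has converged. The step size and epoch cap are illustrative.

```python
import numpy as np

def sgd_early_stopping(grad_f, w0, X, alpha=0.1, max_epochs=3, seed=0):
    """Plain SGD that stops after a small number of passes over the data.

    Using each example only a few times limits how much the model can fit
    the idiosyncrasies of the training set.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(max_epochs):        # early stopping: few epochs, then quit
        for i in rng.permutation(len(X)):
            w = w - alpha * grad_f(w, X[i])
    return w
```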

Benefits of Early Stopping. Cheap to compute: it literally just does less work. It seems like the technique was designed to make systems run faster. Helps with overfitting.

Questions? Upcoming things Please fill out the paper assignment survey tonight!

