CS534: Machine Learning - College Of Engineering

CS534: Machine Learning
Thomas G. Dietterich
221C Dearborn Hall
tgd@cs.orst.edu
http://www.cs.orst.edu/~tgd/classes/534

Course Overview
– Introduction: basic problems and questions in machine learning; example applications
– Linear Classifiers
– Five Popular Algorithms
  – Decision trees (C4.5)
  – Neural networks (backpropagation)
  – Probabilistic networks (Naïve Bayes; Mixture models)
  – Support Vector Machines (SVMs)
  – Nearest Neighbor Method
– Theories of Learning: PAC, Bayesian, Bias-Variance analysis
– Optimizing Test Set Performance: overfitting, penalty methods, holdout methods, ensembles
– Sequential and Spatial Data: Hidden Markov models, Conditional Random Fields, Hidden Markov SVMs
– Problem Formulation: designing input and output representations

Supervised Learning
– Given: training examples ⟨x, f(x)⟩ for some unknown function f.
– Find: a good approximation to f.
Example Applications
– Handwriting recognition
  x: data from pen motion
  f(x): letter of the alphabet
– Disease Diagnosis
  x: properties of patient (symptoms, lab tests)
  f(x): disease (or maybe, recommended therapy)
– Face Recognition
  x: bitmap picture of person's face
  f(x): name of person
– Spam Detection
  x: email message
  f(x): spam or not spam

Appropriate Applications for Supervised Learning
– Situations where there is no human expert
  x: bond graph of a new molecule
  f(x): predicted binding strength to AIDS protease molecule
– Situations where humans can perform the task but can't describe how they do it
  x: bitmap picture of hand-written character
  f(x): ASCII code of the character
– Situations where the desired function is changing frequently
  x: description of stock prices and trades for last 10 days
  f(x): recommended stock transactions
– Situations where each user needs a customized function f
  x: incoming email message
  f(x): importance score for presenting to the user (or deleting without presenting)

Formal Setting
– Training examples ⟨x, y⟩ are drawn independently at random according to an unknown probability distribution P(x, y).
– The learning algorithm analyzes the examples and produces a classifier f.
– Given a new data point ⟨x, y⟩ drawn from P, the classifier is given x and predicts ŷ = f(x).
– The loss L(ŷ, y) is then measured.
– Goal of the learning algorithm: find the f that minimizes the expected loss.

Formal Version of Spam Detection
– P(x, y): distribution of email messages x and their true labels y ("spam" or "not spam")
– Training sample: a set of email messages that have been labeled by the user
– Learning algorithm: what we study in this course!
– f: the classifier output by the learning algorithm
– Test point: a new email message x (with its true, but hidden, label y)
– Loss function L(ŷ, y):

                        true label y
                        spam     not spam
    ŷ = spam              0         10
    ŷ = not spam          1          0

Three Main Approaches to Machine Learning
– Learn a classifier: a function f.
– Learn a conditional distribution: a conditional distribution P(y | x)
– Learn the joint probability distribution: P(x, y)
In the first two weeks, we will study one example of each method:
– Learn a classifier: the LMS algorithm
– Learn a conditional distribution: logistic regression
– Learn the joint distribution: linear discriminant analysis

Inferring a classifier f from P(y | x)
Predict the ŷ that minimizes the expected loss:

$$f(x) = \arg\min_{\hat{y}} E_{y|x}[L(\hat{y}, y)] = \arg\min_{\hat{y}} \sum_{y} P(y \mid x)\, L(\hat{y}, y)$$

Example: Making the spam decision
– Suppose our spam detector predicts that P(y = "spam" | x) = 0.6. What is the optimal classification decision ŷ?
– Expected loss of ŷ = "spam" is 0 × 0.6 + 10 × 0.4 = 4
– Expected loss of ŷ = "no spam" is 1 × 0.6 + 0 × 0.4 = 0.6
– Therefore, the optimal prediction is "no spam"
(Loss values are taken from the table on the previous slide; here P(y = "spam" | x) = 0.6 and P(y = "not spam" | x) = 0.4.)
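This decision rule is easy to state in code. The following sketch (my own, not from the slides) uses the loss matrix and the posterior P(y = "spam" | x) = 0.6 from the example above:

```python
# Sketch of the expected-loss decision rule, using the slides' loss matrix.
LOSS = {  # LOSS[y_hat][y_true]
    "spam":     {"spam": 0, "not spam": 10},
    "not spam": {"spam": 1, "not spam": 0},
}

def expected_loss(y_hat, p_spam):
    """Expected loss of predicting y_hat when P(y = spam | x) = p_spam."""
    return LOSS[y_hat]["spam"] * p_spam + LOSS[y_hat]["not spam"] * (1 - p_spam)

def optimal_prediction(p_spam):
    """Return the prediction that minimizes expected loss."""
    return min(LOSS, key=lambda y_hat: expected_loss(y_hat, p_spam))

print(expected_loss("spam", 0.6))      # 0*0.6 + 10*0.4 = 4.0
print(expected_loss("not spam", 0.6))  # 1*0.6 +  0*0.4 = 0.6
print(optimal_prediction(0.6))         # "not spam"
```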

Inferring a classifier from the joint distribution P(x, y)
We can compute the conditional distribution according to the definition of conditional probability:

$$P(y = k \mid x) = \frac{P(x, y = k)}{\sum_j P(x, y = j)}$$

In words, compute P(x, y = k) for each value of k, then normalize these numbers.
Compute ŷ using the method from the previous slide.
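As a small illustration (my own sketch, with made-up joint probabilities), normalizing P(x, y = k) over k recovers P(y = k | x):

```python
# Hypothetical joint probabilities P(x, y = k) for one particular x.
joint = {"spam": 0.03, "not spam": 0.02}   # made-up numbers, for illustration only

total = sum(joint.values())                # this sum is P(x)
conditional = {k: v / total for k, v in joint.items()}
print(conditional)                         # {'spam': 0.6, 'not spam': 0.4}
```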

Fundamental Problem of Machine Learning: It is ill-posed

  Example   x1   x2   x3   x4   y
     1       0    0    1    0   0
     2       0    1    0    0   0
     3       0    0    1    1   1
     4       1    0    0    1   1
     5       0    1    1    0   0
     6       1    1    0    0   0
     7       0    1    0    1   0

Learning Appears Impossible
There are 2^16 = 65,536 possible boolean functions over four input features. We can't figure out which one is correct until we've seen every possible input-output pair. After 7 examples, we still have 2^9 possibilities.

  x1   x2   x3   x4   y
   0    0    0    0   ?
   0    0    0    1   ?
   0    0    1    0   0
   0    0    1    1   1
   0    1    0    0   0
   0    1    0    1   0
   0    1    1    0   0
   0    1    1    1   ?
   1    0    0    0   ?
   1    0    0    1   1
   1    0    1    0   ?
   1    0    1    1   ?
   1    1    0    0   0
   1    1    0    1   ?
   1    1    1    0   ?
   1    1    1    1   ?

Solution: Work with a restricted hypothesis space
Either by applying prior knowledge or by guessing, we choose a space of hypotheses H that is smaller than the space of all possible functions:
– simple conjunctive rules
– m-of-n rules
– linear functions
– multivariate Gaussian joint probability distributions
– etc.

Illustration: Simple Conjunctive Rules
There are only 16 simple conjunctions (no negation). However, no simple rule explains the data. The same is true for simple clauses.

  Rule                          Counterexample
  true ⇔ y                            1
  x1 ⇔ y                              3
  x2 ⇔ y                              2
  x3 ⇔ y                              1
  x4 ⇔ y                              7
  x1 ∧ x2 ⇔ y                         3
  x1 ∧ x3 ⇔ y                         3
  x1 ∧ x4 ⇔ y                         3
  x2 ∧ x3 ⇔ y                         3
  x2 ∧ x4 ⇔ y                         3
  x3 ∧ x4 ⇔ y                         4
  x1 ∧ x2 ∧ x3 ⇔ y                    3
  x1 ∧ x2 ∧ x4 ⇔ y                    3
  x1 ∧ x3 ∧ x4 ⇔ y                    3
  x2 ∧ x3 ∧ x4 ⇔ y                    3
  x1 ∧ x2 ∧ x3 ∧ x4 ⇔ y               3

A larger hypothesis space: m-of-n rules
– At least m of the n variables must be true
– There are 32 possible rules
– Only one rule is consistent!
[Table: for each subset of {x1, x2, x3, x4} and each m (1-of, 2-of, 3-of, 4-of), a counterexample from the training data; the only rule with no counterexample is "at least 2 of {x1, x3, x4}".]
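The consistency check above can be reproduced mechanically. The sketch below (my own code, not from the course) enumerates all 32 m-of-n rules over the four features and tests each one against the seven training examples from the earlier table:

```python
from itertools import combinations

# The seven training examples from the earlier slide: (x1, x2, x3, x4, y).
DATA = [
    (0, 0, 1, 0, 0),
    (0, 1, 0, 0, 0),
    (0, 0, 1, 1, 1),
    (1, 0, 0, 1, 1),
    (0, 1, 1, 0, 0),
    (1, 1, 0, 0, 0),
    (0, 1, 0, 1, 0),
]

def consistent(variables, m):
    """True when 'at least m of the given variables are 1' predicts y on every example."""
    for *x, y in DATA:
        prediction = int(sum(x[v] for v in variables) >= m)
        if prediction != y:
            return False
    return True

# Enumerate all 32 m-of-n rules over non-empty subsets of {x1, x2, x3, x4}.
for n in range(1, 5):
    for variables in combinations(range(4), n):
        for m in range(1, n + 1):
            if consistent(variables, m):
                names = ", ".join(f"x{v + 1}" for v in variables)
                print(f"consistent: at least {m} of {{{names}}}")
# Per the slide, only "at least 2 of {x1, x3, x4}" should be printed.
```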

Two Views of Learning
View 1: Learning is the removal of our remaining uncertainty
– Suppose we knew that the unknown function was an m-of-n boolean function. Then we could use the training examples to deduce which function it is.
View 2: Learning requires guessing a good, small hypothesis class
– We can start with a very small class and enlarge it until it contains an hypothesis that fits the data.

We could be wrong!
Our prior "knowledge" might be wrong.
Our guess of the hypothesis class could be wrong:
– The smaller the class, the more likely we are wrong.

Two Strategies for Machine Learning
Develop Languages for Expressing Prior Knowledge
– Rule grammars, stochastic models, Bayesian networks
– (Corresponds to the Prior Knowledge view)
Develop Flexible Hypothesis Spaces
– Nested collections of hypotheses: decision trees, neural networks, cases, SVMs
– (Corresponds to the Guessing view)
In either case we must develop algorithms for finding an hypothesis that fits the data.

Terminology
– Training example. An example of the form ⟨x, y⟩. x is usually a vector of features, y is called the class label. We will index the features by j; hence xj is the j-th feature of x. The number of features is n.
– Target function. The true function f, the true conditional distribution P(y | x), or the true joint distribution P(x, y).
– Hypothesis. A proposed function or distribution h believed to be similar to f or P.
– Concept. A boolean function. Examples for which f(x) = 1 are called positive examples or positive instances of the concept. Examples for which f(x) = 0 are called negative examples or negative instances.

Terminology
– Classifier. A discrete-valued function. The possible values f(x) ∈ {1, ..., K} are called the classes or class labels.
– Hypothesis space. The space of all hypotheses that can, in principle, be output by a particular learning algorithm.
– Version space. The space of all hypotheses in the hypothesis space that have not yet been ruled out by a training example.
– Training sample (or training set, or training data). A set of N training examples drawn according to P(x, y).
– Test set. A set of examples used to evaluate a proposed hypothesis h.
– Validation set. A set of examples (typically a subset of the training set) used to guide the learning algorithm and prevent overfitting.

Key Issues in Machine Learning
– What are good hypothesis spaces? Which spaces have been useful in practical applications?
– What algorithms can work with these spaces? Are there general design principles for learning algorithms?
– How can we optimize accuracy on future data points? This is related to the problem of "overfitting".
– How can we have confidence in the results? (the statistical question) How much training data is required to find an accurate hypothesis?
– Are some learning problems computationally intractable? (the computational question)
– How can we formulate application problems as machine learning problems? (the engineering question)

A framework for hypothesis spaces
– Size. Does the hypothesis space have a fixed size or a variable size? Fixed-size spaces are easier to understand, but variable-size spaces are generally more useful. Variable-size spaces introduce the problem of overfitting.
– Stochasticity. Is the hypothesis a classifier, a conditional distribution, or a joint distribution? This affects how we evaluate hypotheses. For a deterministic hypothesis, a training example is either consistent (correctly predicted) or inconsistent (incorrectly predicted). For a stochastic hypothesis, a training example is more likely or less likely.
– Parameterization. Is each hypothesis described by a set of symbolic (discrete) choices or by a set of continuous parameters? If both are required, we say the space has a mixed parameterization. Discrete parameters must be found by combinatorial search methods; continuous parameters can be found by numerical search methods.

A Framework for Hypothesis Spaces (2)

A Framework for Learning Algorithms
Search Procedure
– Direct Computation: solve for the hypothesis directly
– Local Search: start with an initial hypothesis, make small improvements until a local maximum
– Constructive Search: start with an empty hypothesis, gradually add structure to it until a local optimum
Timing
– Eager: analyze training data and construct an explicit hypothesis
– Lazy: store the training data and wait until a test data point is presented, then construct an ad hoc hypothesis to classify that one data point
Online vs. Batch (for eager algorithms)
– Online: analyze each training example as it is presented
– Batch: collect examples, analyze them in a batch, output an hypothesis

A Framework for Learning Algorithms (2)

Linear Threshold Units

  h(x) = +1  if w1·x1 + ... + wn·xn ≥ w0
         −1  otherwise

We assume that each feature xj and each weight wj is a real number (we will relax this later).
We will study three different algorithms for learning linear threshold units:
– Perceptron: classifier
– Logistic Regression: conditional distribution
– Linear Discriminant Analysis: joint distribution
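As a concrete illustration (a sketch of my own, assuming NumPy), an LTU can be coded directly from this definition; the example weights encode the conjunction x1 ∧ x2 ∧ x4 from the next slide:

```python
import numpy as np

def ltu_predict(w, w0, x):
    """Linear threshold unit: +1 if w . x >= w0, else -1."""
    return 1 if np.dot(w, x) >= w0 else -1

# x1 AND x2 AND x4, encoded as weights (1, 1, 0, 1) with threshold 3.
w = np.array([1.0, 1.0, 0.0, 1.0])
print(ltu_predict(w, 3.0, np.array([1, 1, 0, 1])))   # +1
print(ltu_predict(w, 3.0, np.array([1, 1, 0, 0])))   # -1
```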

What can be represented by an LTU: Conjunctions

  x1 ∧ x2 ∧ x4 ⇔ y
  1·x1 + 1·x2 + 0·x3 + 1·x4 ≥ 3

At least m-of-n

  at-least-2-of {x1, x3, x4} ⇔ y
  1·x1 + 0·x2 + 1·x3 + 1·x4 ≥ 2

Things that cannot be represented
Non-trivial disjunctions:

  (x1 ∧ x2) ∨ (x3 ∧ x4) ⇔ y
  1·x1 + 1·x2 + 1·x3 + 1·x4 ≥ 2 predicts f(⟨0, 1, 1, 0⟩) = 1.

Exclusive-OR:

  (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2) ⇔ y

A canonical representation
Given a training example of the form (⟨x1, x2, x3, x4⟩, y), transform it to (⟨1, x1, x2, x3, x4⟩, y).
The parameter vector will then be w = ⟨w0, w1, w2, w3, w4⟩.
We will call the unthresholded hypothesis u(x, w): u(x, w) = w · x.
Each hypothesis can be written h(x) = sgn(u(x, w)).
Our goal is to find w.

The LTU Hypothesis Space
– Fixed size: there are $O\!\left(2^{n^2}\right)$ distinct linear threshold units over n boolean features
– Deterministic
– Continuous parameters

Geometrical View
Consider three training examples with labels in {+1, −1}: ⟨1.0, 1.0⟩, ⟨0.5, 3.0⟩, ⟨2.0, 2.0⟩.
We want a classifier that looks like the following:
[Figure: a linear decision boundary separating the two classes.]

The Unthresholded Discriminant Function is a Hyperplane
The equation u(x) = w · x is a plane:

  ŷ = +1  if u(x) ≥ 0
      −1  otherwise

Machine Learning and Optimization
When learning a classifier, the natural way to formulate the learning problem is the following:
– Given: a set of N training examples {(x1, y1), (x2, y2), ..., (xN, yN)} and a loss function L
– Find: the weight vector w that minimizes the expected loss on the training data

$$J(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(\mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}_i),\, y_i)$$

In general, machine learning algorithms apply some optimization algorithm to find a good hypothesis. In this case, J is piecewise constant, which makes this a difficult problem.

Approximating the expected loss by a smooth function
Simplify the optimization problem by replacing the original objective function by a smooth, differentiable function. For example, consider the hinge loss:

$$\tilde{J}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \max(0,\; 1 - y_i\, \mathbf{w} \cdot \mathbf{x}_i)$$

[Figure: the loss plotted as a function of w · xi when y = +1.]
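A short NumPy sketch (mine) of this smoothed objective, assuming labels in {−1, +1}:

```python
import numpy as np

def hinge_objective(w, X, y):
    """J~(w) = (1/N) * sum_i max(0, 1 - y_i * (w . x_i)), with y_i in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins))
```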

Minimizing J by Gradient Descent Search
– Start with weight vector w0
– Compute the gradient

$$\nabla \tilde{J}(\mathbf{w}_0) = \left( \frac{\partial \tilde{J}(\mathbf{w}_0)}{\partial w_0},\; \frac{\partial \tilde{J}(\mathbf{w}_0)}{\partial w_1},\; \ldots,\; \frac{\partial \tilde{J}(\mathbf{w}_0)}{\partial w_n} \right)$$

– Compute w1 = w0 − η ∇J̃(w0), where η is a "step size" parameter
– Repeat until convergence

Computing the Gradient
Let $J^i(\mathbf{w}) = \max(0,\, -y_i\, \mathbf{w} \cdot \mathbf{x}_i)$. Then

$$\frac{\partial J(\mathbf{w})}{\partial w_k} = \frac{\partial}{\partial w_k} \left[ \frac{1}{N} \sum_{i=1}^{N} J^i(\mathbf{w}) \right] = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial J^i(\mathbf{w})}{\partial w_k}$$

$$\frac{\partial J^i(\mathbf{w})}{\partial w_k} = \frac{\partial}{\partial w_k} \max\!\left(0,\; -y_i \sum_j w_j x_{ij}\right) = \begin{cases} 0 & \text{if } y_i \sum_j w_j x_{ij} \ge 0 \\ -y_i x_{ik} & \text{otherwise} \end{cases}$$

Batch Perceptron Algorithm
Given: training examples (xi, yi), i = 1 ... N
Let w = (0, 0, 0, 0, ..., 0) be the initial weight vector.
Repeat until convergence:
  Let g = (0, 0, ..., 0) be the gradient vector.
  For i = 1 to N do
    ui = w · xi
    If (yi · ui ≤ 0)
      For j = 1 to n do
        gj := gj + yi · xij
  g := g / N
  w := w + η g
Simplest case: η = 1, don't normalize g: "Fixed Increment Perceptron"
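Below is a NumPy sketch of the batch perceptron described above (my own rendering of the pseudocode; labels are assumed to be in {−1, +1} and each xi to already include the constant 1 feature from the canonical representation):

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, max_epochs=100):
    """Batch perceptron for labels y in {-1, +1}.

    X is an N x n array whose first column is assumed to be the constant
    1 feature (the canonical representation from the earlier slide).
    """
    N, n = X.shape
    w = np.zeros(n)
    for _ in range(max_epochs):
        g = np.zeros(n)
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                g += yi * xi              # accumulate minus the gradient of J
                mistakes += 1
        if mistakes == 0:                 # every example classified correctly
            break
        w = w + eta * g / N               # gradient step (g is already -grad J)
    return w
```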

Online Perceptron Algorithm
Let w = (0, 0, 0, 0, ..., 0) be the initial weight vector.
Repeat forever:
  Accept training example i: ⟨xi, yi⟩
  ui = w · xi
  If (yi · ui ≤ 0)
    For j = 1 to n do
      gj := yi · xij
    w := w + η g
This is called stochastic gradient descent because the overall gradient is approximated by the gradient from each individual example.
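A sketch of the corresponding online (stochastic gradient) step, under the same assumptions as the batch version above:

```python
import numpy as np

def online_perceptron_step(w, xi, yi, eta=1.0):
    """One stochastic-gradient step: update w only when (xi, yi) is misclassified."""
    if yi * np.dot(w, xi) <= 0:
        w = w + eta * yi * xi
    return w
```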

Learning Rates and Convergence
The learning rate η must decrease to zero in order to guarantee convergence. The online case is known as the Robbins-Monro algorithm. It is guaranteed to converge under the following assumptions:

$$\lim_{t \to \infty} \eta_t = 0, \qquad \sum_{t=0}^{\infty} \eta_t = \infty, \qquad \sum_{t=0}^{\infty} \eta_t^2 < \infty$$

The learning rate is also called the step size. Some algorithms (e.g., Newton's method, conjugate gradient) choose the step size automatically and converge faster.
There is only one "basin" for linear threshold units, so a local minimum is the global minimum. Choosing a good starting point can make the algorithm converge faster.

Decision Boundaries
– A classifier can be viewed as partitioning the input space or feature space X into decision regions.
– A linear threshold unit always produces a linear decision boundary. A set of points that can be separated by a linear decision boundary is said to be linearly separable.

Exclusive-OR is Not Linearly Separable

Extending Perceptron to More than Two Classes
If we have K > 2 classes, we can learn a separate LTU for each class. Let wk be the weight vector for class k. We train it by treating examples from class y = k as the positive examples and treating the examples from all other classes as negative examples. Then we classify a new data point x according to

$$\hat{y} = \arg\max_k \; \mathbf{w}_k \cdot \mathbf{x}$$
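A small sketch (mine) of the resulting prediction rule, assuming the K per-class weight vectors are stacked into a K × n matrix W:

```python
import numpy as np

def predict_multiclass(W, x):
    """Return argmax_k w_k . x, where row k of W is the weight vector for class k."""
    return int(np.argmax(W @ x))
```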

Summary of Perceptron algorithm for LTUs
– Directly learns a classifier
– Local search: begins with an initial weight vector and modifies it iteratively to minimize an error function. The error function is loosely related to the goal of minimizing the number of classification errors.
– Eager: the classifier is constructed from the training examples, which can then be discarded.
– Online or batch: both variants of the algorithm can be used.

Logistic Regression
Learn the conditional distribution P(y | x).
Let py(x; w) be our estimate of P(y | x), where w is a vector of adjustable parameters. Assume only two classes y = 0 and y = 1, and

$$p_1(\mathbf{x}; \mathbf{w}) = \frac{\exp(\mathbf{w} \cdot \mathbf{x})}{1 + \exp(\mathbf{w} \cdot \mathbf{x})}, \qquad p_0(\mathbf{x}; \mathbf{w}) = 1 - p_1(\mathbf{x}; \mathbf{w})$$

On the homework, you will show that this is equivalent to

$$\log \frac{p_1(\mathbf{x}; \mathbf{w})}{p_0(\mathbf{x}; \mathbf{w})} = \mathbf{w} \cdot \mathbf{x}$$

In other words, the log odds of class 1 is a linear function of x.
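A direct sketch of the model (mine, assuming NumPy); the code uses the algebraically equivalent form 1 / (1 + exp(−w · x)) that appears later in the derivation:

```python
import numpy as np

def p1(x, w):
    """P(y = 1 | x) = exp(w.x) / (1 + exp(w.x)), computed as 1 / (1 + exp(-w.x))."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def p0(x, w):
    """P(y = 0 | x) = 1 - p1(x; w)."""
    return 1.0 - p1(x, w)

# Log odds: log(p1/p0) = w . x, so the boundary p1 = 0.5 is the plane w . x = 0.
```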

Why the exp function?
One reason: a linear function has a range from (−∞, +∞), and we need to force it to be positive and sum to 1 in order to be a probability.

Deriving a Learning Algorithm
Since we are fitting a conditional probability distribution, we no longer seek to minimize the loss on the training data. Instead, we seek to find the probability distribution h that is most likely given the training data.
Let S be the training sample. Our goal is to find h to maximize P(h | S):

$$\begin{aligned}
\arg\max_h P(h \mid S) &= \arg\max_h \frac{P(S \mid h)\, P(h)}{P(S)} && \text{by Bayes' rule} \\
&= \arg\max_h P(S \mid h)\, P(h) && \text{because } P(S) \text{ doesn't depend on } h \\
&= \arg\max_h P(S \mid h) && \text{if we assume } P(h) = \text{uniform} \\
&= \arg\max_h \log P(S \mid h) && \text{because log is monotonic}
\end{aligned}$$

The distribution P(S | h) is called the likelihood function. The log likelihood is frequently used as the objective function for learning. It is often written as ℓ(w).
The h that maximizes the likelihood on the training data is called the maximum likelihood estimator (MLE).

Computing the Likelihood
In our framework, we assume that each training example (xi, yi) is drawn from the same (but unknown) probability distribution P(x, y). This means that the log likelihood of S is the sum of the log likelihoods of the individual training examples:

$$\log P(S \mid h) = \log \prod_i P(\mathbf{x}_i, y_i \mid h) = \sum_i \log P(\mathbf{x}_i, y_i \mid h)$$

Computing the Likelihood (2)
Recall that any joint distribution P(a, b) can be factored as P(a | b) P(b). Hence, we can write

$$\arg\max_h \log P(S \mid h) = \arg\max_h \sum_i \log P(\mathbf{x}_i, y_i \mid h) = \arg\max_h \sum_i \log P(y_i \mid \mathbf{x}_i, h)\, P(\mathbf{x}_i \mid h)$$

In our case, P(x | h) = P(x), because it does not depend on h, so

$$\arg\max_h \log P(S \mid h) = \arg\max_h \sum_i \log P(y_i \mid \mathbf{x}_i, h)$$

Log Likelihood for Conditional Probability Estimators
We can express the log likelihood in a compact form known as the cross entropy.
Consider an example (xi, yi):
– If yi = 0, the log likelihood is log[1 − p1(xi; w)]
– If yi = 1, the log likelihood is log[p1(xi; w)]
These cases are mutually exclusive, so we can combine them to obtain:

$$\ell(y_i; \mathbf{x}_i, \mathbf{w}) = \log P(y_i \mid \mathbf{x}_i, \mathbf{w}) = (1 - y_i) \log[1 - p_1(\mathbf{x}_i; \mathbf{w})] + y_i \log p_1(\mathbf{x}_i; \mathbf{w})$$

The goal of our learning algorithm will be to find w to maximize

$$J(\mathbf{w}) = \sum_i \ell(y_i; \mathbf{x}_i, \mathbf{w})$$

Fitting Logistic Regression by Gradient Ascent

$$\frac{\partial J(\mathbf{w})}{\partial w_j} = \sum_i \frac{\partial}{\partial w_j} \ell(y_i; \mathbf{x}_i, \mathbf{w})$$

$$\begin{aligned}
\frac{\partial}{\partial w_j} \ell(y_i; \mathbf{x}_i, \mathbf{w}) &= \frac{\partial}{\partial w_j} \Big( (1 - y_i) \log[1 - p_1(\mathbf{x}_i; \mathbf{w})] + y_i \log p_1(\mathbf{x}_i; \mathbf{w}) \Big) \\
&= (1 - y_i) \frac{1}{1 - p_1(\mathbf{x}_i; \mathbf{w})} \left( -\frac{\partial p_1(\mathbf{x}_i; \mathbf{w})}{\partial w_j} \right) + y_i \frac{1}{p_1(\mathbf{x}_i; \mathbf{w})} \frac{\partial p_1(\mathbf{x}_i; \mathbf{w})}{\partial w_j} \\
&= \left[ \frac{y_i}{p_1(\mathbf{x}_i; \mathbf{w})} - \frac{1 - y_i}{1 - p_1(\mathbf{x}_i; \mathbf{w})} \right] \frac{\partial p_1(\mathbf{x}_i; \mathbf{w})}{\partial w_j} \\
&= \left[ \frac{y_i (1 - p_1(\mathbf{x}_i; \mathbf{w})) - (1 - y_i)\, p_1(\mathbf{x}_i; \mathbf{w})}{p_1(\mathbf{x}_i; \mathbf{w})\,(1 - p_1(\mathbf{x}_i; \mathbf{w}))} \right] \frac{\partial p_1(\mathbf{x}_i; \mathbf{w})}{\partial w_j} \\
&= \left[ \frac{y_i - p_1(\mathbf{x}_i; \mathbf{w})}{p_1(\mathbf{x}_i; \mathbf{w})\,(1 - p_1(\mathbf{x}_i; \mathbf{w}))} \right] \frac{\partial p_1(\mathbf{x}_i; \mathbf{w})}{\partial w_j}
\end{aligned}$$

Gradient Computation (continued)
Note that p1 can also be written as

$$p_1(\mathbf{x}_i; \mathbf{w}) = \frac{1}{1 + \exp[-\mathbf{w} \cdot \mathbf{x}_i]}$$

From this, we obtain:

$$\begin{aligned}
\frac{\partial p_1(\mathbf{x}_i; \mathbf{w})}{\partial w_j} &= -\frac{1}{(1 + \exp[-\mathbf{w} \cdot \mathbf{x}_i])^2} \frac{\partial}{\partial w_j} \big(1 + \exp[-\mathbf{w} \cdot \mathbf{x}_i]\big) \\
&= -\frac{1}{(1 + \exp[-\mathbf{w} \cdot \mathbf{x}_i])^2} \exp[-\mathbf{w} \cdot \mathbf{x}_i] \, \frac{\partial}{\partial w_j}(-\mathbf{w} \cdot \mathbf{x}_i) \\
&= -\frac{1}{(1 + \exp[-\mathbf{w} \cdot \mathbf{x}_i])^2} \exp[-\mathbf{w} \cdot \mathbf{x}_i] \, (-x_{ij}) \\
&= p_1(\mathbf{x}_i; \mathbf{w})\,(1 - p_1(\mathbf{x}_i; \mathbf{w}))\, x_{ij}
\end{aligned}$$

Completing the Gradient Computation
The gradient of the log likelihood of a single point is therefore

$$\begin{aligned}
\frac{\partial}{\partial w_j} \ell(y_i; \mathbf{x}_i, \mathbf{w}) &= \left[ \frac{y_i - p_1(\mathbf{x}_i; \mathbf{w})}{p_1(\mathbf{x}_i; \mathbf{w})\,(1 - p_1(\mathbf{x}_i; \mathbf{w}))} \right] p_1(\mathbf{x}_i; \mathbf{w})\,(1 - p_1(\mathbf{x}_i; \mathbf{w}))\, x_{ij} \\
&= (y_i - p_1(\mathbf{x}_i; \mathbf{w}))\, x_{ij}
\end{aligned}$$

The overall gradient is

$$\frac{\partial J(\mathbf{w})}{\partial w_j} = \sum_i (y_i - p_1(\mathbf{x}_i; \mathbf{w}))\, x_{ij}$$
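The final expression is simple enough to check numerically. The sketch below (my own, with arbitrary random values) compares the analytic single-example gradient (yi − p1(xi; w)) xij against a central finite-difference approximation:

```python
import numpy as np

def log_likelihood(w, x, y):
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (1 - y) * np.log(1 - p) + y * np.log(p)

def analytic_grad(w, x, y):
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (y - p) * x

# Compare against a finite-difference approximation on a random example.
rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1
eps = 1e-6
numeric = np.array([
    (log_likelihood(w + eps * e, x, y) - log_likelihood(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, analytic_grad(w, x, y), atol=1e-6))   # True
```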

Batch Gradient Ascent for Logistic Regression
Given: training examples (xi, yi), i = 1 ... N
Let w = (0, 0, 0, 0, ..., 0) be the initial weight vector.
Repeat until convergence:
  Let g = (0, 0, ..., 0) be the gradient vector.
  For i = 1 to N do
    pi = 1 / (1 + exp[−w · xi])
    errori = yi − pi
    For j = 1 to n do
      gj := gj + errori · xij
  w := w + η g    (step in direction of increasing gradient)
An online gradient ascent algorithm can be constructed, of course.
Most statistical packages use a second-order (Newton-Raphson) algorithm for faster convergence. Each iteration of the second-order method can be viewed as a weighted least squares computation, so the algorithm is known as Iteratively Reweighted Least Squares (IRLS).
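Below is a NumPy sketch of this batch gradient-ascent loop (my own transcription of the pseudocode; the step size, iteration cap, and convergence test are placeholder choices, X is assumed to include the constant 1 feature, and labels are in {0, 1}):

```python
import numpy as np

def logistic_regression_gradient_ascent(X, y, eta=0.1, max_iters=1000, tol=1e-6):
    """Fit w by batch gradient ascent on the log likelihood.

    X: N x n array (first column assumed to be the constant 1 feature).
    y: length-N vector of 0/1 labels.
    """
    N, n = X.shape
    w = np.zeros(n)
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # p_i = P(y = 1 | x_i, w)
        g = X.T @ (y - p)                  # gradient: sum_i (y_i - p_i) x_i
        w = w + eta * g                    # step in direction of increasing gradient
        if np.linalg.norm(g) < tol:        # stop once the gradient is (nearly) zero
            break
    return w
```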

Logistic Regression Implements a Linear Discriminant Function
In the 2-class 0/1 loss function case, we should predict ŷ = 1 if

$$\begin{aligned}
E_{y|\mathbf{x}}[L(0, y)] &> E_{y|\mathbf{x}}[L(1, y)] \\
\sum_y P(y \mid \mathbf{x})\, L(0, y) &> \sum_y P(y \mid \mathbf{x})\, L(1, y) \\
P(y = 0 \mid \mathbf{x})\, L(0, 0) + P(y = 1 \mid \mathbf{x})\, L(0, 1) &> P(y = 0 \mid \mathbf{x})\, L(1, 0) + P(y = 1 \mid \mathbf{x})\, L(1, 1)
\end{aligned}$$

With 0/1 loss, L(0, 0) = L(1, 1) = 0 and L(0, 1) = L(1, 0) = 1, so this reduces to predicting ŷ = 1 when P(y = 1 | x) > P(y = 0 | x), i.e., when the log odds w · x > 0, which is a linear discriminant.
