ELEG/CISC 867: Advanced Machine Learning, Spring 2019
Lecture 3: PAC Learning
Lecturer: Xiugang Wu (02/21/2019, 02/26/2019 & 02/28/2019)

In the previous lecture, we introduced a simplified learning model and showed that, for a finite hypothesis class under the realizability assumption, if the ERM rule is applied to a sufficiently large training sample (whose size is independent of the underlying distribution or labeling function), then the output hypothesis will be probably approximately correct. This lecture discusses the PAC (Probably Approximately Correct) learning model in its full generality.

1 PAC Learning Model

Last lecture, we made several assumptions on the following learning model to simplify our discussion. We now relax these assumptions and introduce the general PAC learning framework.

Figure 1: PAC learning model. (Learning: the learner maps a training sample $\{(X_i, Y_i)\}_{i=1}^n$ to a hypothesis $h \in \mathcal{H}$. Prediction: $h$ is applied to a new input $X$ to produce $\hat{Y} = h(X)$.)

Label set. Last time, we assumed the label set $\mathcal{Y} = \{0, 1\}$, which corresponds to predicting a binary label for a given example $X$. However, many learning tasks take a different form. For example, one may wish to predict a real-valued number (say, the temperature at 9:00 p.m. tomorrow), or a label picked from a finite set of labels (like the topic of the main story in tomorrow's paper). One can easily extend our model to such regression or multiclass classification problems by relaxing the label set to be the set of real vectors or a finite set of multiple labels. This generalized set $\mathcal{Y}$ is also often referred to as the target set.

Data-Generation Mechanism. Last time, we assumed that the target variable $Y$ is fully determined by the input $X$, which may not be a realistic assumption in many cases; think, for example, of predicting which team will win a basketball game based on their history. From now on, we replace the "target labeling function" with a more flexible notion, namely a data-labels generating distribution. In particular, we consider a joint distribution $P_{XY}$, or simply $P$, over $\mathcal{X} \times \mathcal{Y}$. One can view such a distribution as being composed of two parts: a distribution $P_X$ over unlabeled domain points (sometimes called the marginal distribution) and a conditional distribution over labels for each domain point, $P_{Y|X}$.
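To make this data-generation mechanism concrete, here is a minimal Python sketch (the particular marginal, the conditional probabilities, and the function names are illustrative assumptions, not part of the lecture) that factors a joint distribution $P$ into a marginal $P_X$ and a conditional $P_{Y|X}$ and draws an i.i.d. training sample from it. Note that the label is no longer a deterministic function of the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint distribution P = P_X x P_{Y|X} over X = {0, 1, 2}, Y = {0, 1}.
p_y1_given_x = {0: 0.9, 1: 0.5, 2: 0.1}   # conditional P(Y = 1 | X = x)

def sample_training_set(n):
    """Draw n i.i.d. pairs (X_i, Y_i) from P: first X_i ~ P_X, then Y_i ~ P_{Y|X = X_i}."""
    x = rng.integers(0, 3, size=n)                        # marginal P_X: uniform on {0, 1, 2}
    probs = np.array([p_y1_given_x[xi] for xi in x])      # success probability for each X_i
    y = (rng.random(n) < probs).astype(int)               # Y_i ~ Bernoulli(P(Y = 1 | X = x_i))
    return x, y

x_train, y_train = sample_training_set(n=100)
```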

Performance Measure of a Predictor. We now introduce a general framework for quantifying the performance of a predictor. Given a target set $\mathcal{Y}$ and a reconstructed target set $\hat{\mathcal{Y}}$, let $\ell$ be any function from $\mathcal{Y} \times \hat{\mathcal{Y}}$ to the set of nonnegative real numbers, $\ell: \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$. Such functions are referred to as loss functions, as they are used to quantify how bad we feel about our reconstruction $\hat{y}$ once we find out the ground truth $y$. Define the risk $L(h, P)$ associated with a predictor $h$ under the data-generating distribution $P$ as the expected loss when applying $h$ to $X$, i.e.

$$L(h, P) \triangleq \mathbb{E}_{(X,Y) \sim P}[\ell(Y, h(X))].$$

This risk is also called the true risk, as it statistically measures the true performance of the predictor $h$ on unseen data. In contrast, one can also consider the empirical risk that the predictor $h$ incurs over the training sample,

$$\frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i)).$$

It can be readily seen that the empirical risk is simply the risk of $h$ evaluated under the empirical distribution $P_n$, i.e. $L(h, P_n)$.

Several widely used loss functions are as follows:

- 0-1 loss: The 0-1 loss, widely used in classification, is defined as $\ell_{0\text{-}1}(y, \hat{y}) = \mathbb{I}(y \neq \hat{y})$. The risk of $h$ under distribution $P$ and loss function $\ell_{0\text{-}1}$ is simply the probability of error, $\mathbb{E}_P[\ell_{0\text{-}1}(Y, h(X))] = P(Y \neq h(X))$.

- Square loss: The square loss, also known as the $\ell_2$ loss or quadratic loss, is usually used in the regression problem and is defined as $\ell_{\mathrm{sq}}(y, \hat{y}) = \|y - \hat{y}\|^2$. The risk of $h$ under distribution $P$ and loss function $\ell_{\mathrm{sq}}$ is generally known as the Mean Square Error (MSE), $\mathbb{E}_P[\ell_{\mathrm{sq}}(Y, h(X))] = \mathbb{E}_P[\|Y - h(X)\|^2]$.

- Logarithmic loss: The logarithmic loss, or log loss for short, is widely used in classification when the reconstruction is "soft" and $\hat{y}$ represents a distribution over $\mathcal{Y}$: $\ell_{\log}(y, \hat{y}) = \log \frac{1}{\hat{y}(y)}$. The risk of $h$ under distribution $P$ and loss function $\ell_{\log}$ is given by $\mathbb{E}_{(X,Y) \sim P}[\ell_{\log}(Y, h(X))] = \mathbb{E}_{(X,Y) \sim P}[-\log [h(X)](Y)]$. The empirical risk of $h$ under training sample $z^n$ and loss function $\ell_{\log}$ is given by

  $$\frac{1}{n} \sum_{i=1}^{n} \log \frac{1}{[h(x_i)](y_i)},$$

  which can also be written as

  $$\frac{1}{n} \sum_{i=1}^{n} H(p_i, q_i),$$

  where $H(p_i, q_i)$ is the cross entropy between the distributions $p_i$ and $q_i$ over $\mathcal{Y}$, defined as

  $$H(p_i, q_i) \triangleq \sum_{y \in \mathcal{Y}} p_i(y) \log \frac{1}{q_i(y)},$$

  and $p_i$ and $q_i$ are, respectively, the distribution induced by seeing $y_i$ and the distribution the predictor outputs, i.e. $p_i(y) = \mathbb{I}(y = y_i)$ and $q_i(y) = [h(x_i)](y)$. For this reason, the log loss is also widely known as the cross-entropy loss.
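The sketch below writes out these three loss functions and the empirical risk $L(h, P_n)$ in Python; the function names and the convention that a "soft" predictor returns a dictionary mapping labels to probabilities are my own illustrative assumptions.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """0-1 loss: indicator that the prediction disagrees with the label."""
    return float(y != y_hat)

def square_loss(y, y_hat):
    """Square loss ||y - y_hat||^2; averaging it over P gives the MSE."""
    return float(np.sum((np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2))

def log_loss(y, y_hat):
    """Log loss: y_hat is a distribution over labels, e.g. {0: 0.2, 1: 0.8}."""
    return float(-np.log(y_hat[y]))

def empirical_risk(h, xs, ys, loss):
    """Empirical risk L(h, P_n) = (1/n) * sum_i loss(y_i, h(x_i))."""
    return float(np.mean([loss(y, h(x)) for x, y in zip(xs, ys)]))
```

Averaging log_loss over the sample is exactly the average cross entropy between the one-hot distribution $p_i$ induced by $y_i$ and the predicted distribution $q_i = h(x_i)$, matching the rewriting above.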

Bayes Predictor. Suppose that one knows the underlying distribution $P$. Then the predictor

$$f = \operatorname*{argmin}_{h} L(h, P)$$

that minimizes the true risk is called the Bayes predictor, or Bayes estimator, or Bayes decision rule, and its resultant risk

$$\min_{h} L(h, P)$$

is called the Bayes risk. Under different loss functions, the Bayes predictor takes different forms:

- 0-1 loss: Under the 0-1 loss, the Bayes predictor $f$ is given by the well-known maximum a posteriori (MAP) rule, i.e.

  $$f(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} p_{Y|X}(y|x),$$

  with the Bayes risk

  $$L(f, P) = 1 - \sum_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} p_{X,Y}(x, y).$$

- Square loss: Under the square loss, the Bayes predictor $f$ is given by the conditional expectation of $Y$ given $X = x$, i.e.

  $$f(x) = \mathbb{E}_P[Y \mid X = x],$$

  with the Bayes risk

  $$L(f, P) = \mathbb{E}_P[\mathrm{Var}(Y \mid X)].$$

- Log loss: Under the log loss, the Bayes predictor $f$ is given by the conditional distribution of $Y$ given $X = x$, i.e.

  $$[f(x)](y) = p_{Y|X}(y|x),$$

  with the Bayes risk being the conditional entropy of $Y$ given $X$:

  $$L(f, P) = \mathbb{E}_{(X,Y) \sim P}[-\log p_{Y|X}(Y|X)] = H_P(Y \mid X).$$
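The following sketch computes the Bayes predictors and the Bayes risk under the 0-1 and square losses for a small, fully known joint distribution over binary $\mathcal{X}$ and $\mathcal{Y}$; the probability table and function names are illustrative assumptions, since in practice $P$ is of course unknown.

```python
import numpy as np

# Known joint distribution p_xy[x, y] over X = {0, 1} and Y = {0, 1}; entries sum to 1.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

def bayes_predictor_01(x):
    """MAP rule f(x) = argmax_y p_{Y|X}(y|x); the argmax of p_{XY}(x, .) picks the same y."""
    return int(np.argmax(p_xy[x]))

def bayes_risk_01():
    """Bayes risk under the 0-1 loss: 1 - sum_x max_y p_{XY}(x, y)."""
    return float(1.0 - p_xy.max(axis=1).sum())

def bayes_predictor_sq(x):
    """Bayes predictor under the square loss: f(x) = E[Y | X = x]."""
    p_y_given_x = p_xy[x] / p_xy[x].sum()
    return float(np.dot([0.0, 1.0], p_y_given_x))

print(bayes_predictor_01(0), bayes_predictor_01(1), bayes_risk_01())   # 0 1 0.25
```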

Unfortunately, since we do not know $P$, we cannot utilize the above Bayes predictors to achieve the minimal possible error. Instead, what the learner does have access to is the training sample. So we will choose some hypothesis class and require that the learner, based on the training sample, find a predictor whose error is not much larger than the best possible error achievable by any hypothesis within the class. This idea is formalized in the following definition of PAC learnability.

1.1 PAC Learnability

Given the above general formulation of a learning model, we are now ready to define PAC learnability.

Definition 1.1 (PAC Learnability) A hypothesis class $\mathcal{H}$ is PAC (Probably Approximately Correct) learnable if there exist a function $n_{\mathcal{H}}: (0,1) \times (0,1) \to \mathbb{N}$ and a learning algorithm with the following property: For every $\epsilon, \delta \in (0,1)$ and every distribution $P$, if $n \geq n_{\mathcal{H}}(\epsilon, \delta)$, then the hypothesis $h_{Z^n}$ output by the algorithm on the sample $Z^n$ satisfies

$$P_{Z^n \sim P^n}\left( L(h_{Z^n}, P) \leq \min_{h \in \mathcal{H}} L(h, P) + \epsilon \right) \geq 1 - \delta.$$

Several remarks about the above definition of PAC learnability are as follows:

- Accuracy and confidence parameters: The definition of PAC learnability contains the two approximation parameters mentioned before. The accuracy parameter $\epsilon$ determines how far the output predictor can be from the optimal one within the class (this corresponds to the "approximately correct" part), and the confidence parameter $\delta$ indicates how likely the output predictor is to meet that accuracy requirement (this corresponds to the "probably" part of "PAC").

- Sample complexity: The function $n_{\mathcal{H}}$ determines the sample complexity of learning $\mathcal{H}$, that is, how many examples are at least required to guarantee a probably approximately correct solution. The sample complexity is a function of the accuracy and confidence parameters. It also depends on properties of the hypothesis class $\mathcal{H}$; for example, in the last lecture we showed that for a finite class satisfying the realizability assumption, the sample complexity depends on the log of the size of $\mathcal{H}$. In fact, using the above definition of PAC learnability, one can rephrase the result we showed in the last lecture as follows: Every finite hypothesis class $\mathcal{H}$ satisfying the realizability assumption is PAC learnable with sample complexity $n_{\mathcal{H}}(\epsilon, \delta) \leq \left\lceil \log(|\mathcal{H}|/\delta) / \epsilon \right\rceil$.

- Agnostic PAC learning: This framework is also generally known as agnostic PAC learning, as it does not assume realizability. We will often omit the prefix "agnostic" in this course, in which case PAC learning refers to this general agnostic case.

2 Finite Hypothesis Class: Agnostic Case

We now revisit the finite hypothesis class and show that any finite hypothesis class is learnable with ERM in the agnostic case, i.e. even without the realizability assumption.

2.1 Uniform Convergence Is Sufficient For Learnability

For ERM to work, it suffices to ensure that the empirical risks of all hypotheses in $\mathcal{H}$ are good approximations of their true risks. In other words, we need that, uniformly over all hypotheses in $\mathcal{H}$, the empirical risk is close to the true risk. This is formalized in the following.

Definition 2.1 ($\epsilon$-representative sample) A training sequence $Z^n$ is called $\epsilon$-representative if

$$\forall h \in \mathcal{H}, \quad |L(h, P_n) - L(h, P)| \leq \epsilon.$$
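To make Definition 2.1 concrete, here is a small Python sketch that checks whether a sample is $\epsilon$-representative for a finite class. It assumes, purely for illustration, that the true risk $L(h, P)$ of each hypothesis can be computed (i.e. that $P$ is known), which is never the case for an actual learner.

```python
import numpy as np

def is_eps_representative(xs, ys, hypotheses, true_risk, loss, eps):
    """Definition 2.1: the sample (xs, ys) is eps-representative if
    |L(h, P_n) - L(h, P)| <= eps for every h in the (finite) class."""
    for h in hypotheses:
        empirical = np.mean([loss(y, h(x)) for x, y in zip(xs, ys)])
        if abs(empirical - true_risk(h)) > eps:
            return False
    return True
```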

Lemma 2.1 If $Z^n$ is $\epsilon/2$-representative, then the output $h_{Z^n}$ of $\mathrm{ERM}_{\mathcal{H}}(Z^n)$ satisfies

$$L(h_{Z^n}, P) \leq L(h^*, P) + \epsilon,$$

where we assume that $h^*$ achieves the minimum risk within the class $\mathcal{H}$.

Proof:

$$L(h_{Z^n}, P) \leq L(h_{Z^n}, P_n) + \epsilon/2 \leq L(h^*, P_n) + \epsilon/2 \leq L(h^*, P) + \epsilon/2 + \epsilon/2 = L(h^*, P) + \epsilon.$$

The above lemma implies that to ensure that ERM is a PAC learner, it suffices to show that with probability at least $1 - \delta$, $Z^n$ is $\epsilon$-representative. The uniform convergence condition formalizes this requirement.

Definition 2.2 (Uniform convergence) We say $\mathcal{H}$ has the uniform convergence property if there exists a function $n^{UC}_{\mathcal{H}}: (0,1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0,1)$ and every $P$, if $Z^n \sim P^n$ with $n \geq n^{UC}_{\mathcal{H}}(\epsilon, \delta)$, then with probability at least $1 - \delta$, $Z^n$ is $\epsilon$-representative.

Corollary 2.1 If $\mathcal{H}$ has the uniform convergence property with a function $n^{UC}_{\mathcal{H}}$, then the class is PAC learnable with sample complexity $n_{\mathcal{H}}(\epsilon, \delta) \leq n^{UC}_{\mathcal{H}}(\epsilon/2, \delta)$, and in this case ERM is a successful PAC learner for $\mathcal{H}$.

2.2 Finite Classes Are Agnostic PAC Learnable

We now show that finite classes are agnostic PAC learnable by showing that uniform convergence holds for any finite hypothesis class. For this, we first introduce a measure concentration inequality due to Hoeffding, which quantifies the gap between empirical averages and their expected value.

Lemma 2.2 (Hoeffding's Inequality) Let $X_1, X_2, \ldots, X_n$ be a sequence of independent random variables and assume that for all $i$, $\mathbb{E}[X_i] = \mu$ and $P(X_i \in [a, b]) = 1$. Then, for any $\epsilon > 0$,

$$P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right| > \epsilon \right) \leq 2 e^{-\frac{2n\epsilon^2}{(b-a)^2}}.$$

Now consider the empirical risk of any $h \in \mathcal{H}$ under training sample $Z^n$,

$$L(h, P_n) = \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, h(X_i)),$$

where the $\ell(Y_i, h(X_i))$, $i \in [1:n]$, are i.i.d. with mean $L(h, P)$. Let us further assume that the range of $\ell$ is $[0, 1]$. Then, applying Hoeffding's inequality to the sequence $\ell(Y_i, h(X_i))$, we obtain

$$P^n\left( |L(h, P_n) - L(h, P)| > \epsilon \right) \leq 2 e^{-2n\epsilon^2},$$

and therefore, by the union bound over the finite class,

$$P^n\left( \exists h \in \mathcal{H} \text{ s.t. } |L(h, P_n) - L(h, P)| > \epsilon \right) \leq 2 |\mathcal{H}| e^{-2n\epsilon^2}.$$

This shows that if

$$n \geq \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2},$$

then

$$P^n\left( |L(h, P_n) - L(h, P)| \leq \epsilon, \ \forall h \in \mathcal{H} \right) \geq 1 - \delta,$$

and the corollary below follows immediately.
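As a quick numerical sanity check (the values of $\mu$, $n$, and $\epsilon$ below are arbitrary illustrative choices), the following sketch compares a Monte Carlo estimate of the deviation probability for i.i.d. $\{0,1\}$-valued losses with the Hoeffding bound $2e^{-2n\epsilon^2}$; multiplying that bound by $|\mathcal{H}|$ gives the union bound used above.

```python
import numpy as np

rng = np.random.default_rng(1)

def deviation_probability(mu, n, eps, trials=20000):
    """Monte Carlo estimate of P(|(1/n) sum_i X_i - mu| > eps) for X_i ~ Bernoulli(mu),
    i.e. for i.i.d. losses taking values in [0, 1] with mean mu."""
    losses = (rng.random((trials, n)) < mu).astype(float)
    deviations = np.abs(losses.mean(axis=1) - mu)
    return float((deviations > eps).mean())

n, eps, mu = 200, 0.1, 0.3
print("Monte Carlo estimate:", deviation_probability(mu, n, eps))
print("Hoeffding bound     :", 2 * np.exp(-2 * n * eps**2))   # about 0.0366
```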

Corollary 2.2 Let $\mathcal{H}$ be a finite hypothesis class and $\ell$ be a loss function with range $[0, 1]$. Then $\mathcal{H}$ enjoys the uniform convergence property with sample complexity

$$n^{UC}_{\mathcal{H}}(\epsilon, \delta) \leq \left\lceil \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} \right\rceil.$$

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity

$$n_{\mathcal{H}}(\epsilon, \delta) \leq n^{UC}_{\mathcal{H}}(\epsilon/2, \delta) \leq \left\lceil \frac{2 \log(2|\mathcal{H}|/\delta)}{\epsilon^2} \right\rceil.$$
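Putting the pieces together, the sketch below evaluates the sample-complexity bound of Corollary 2.2 and implements the ERM rule over a finite class; the function names and the example values of $\epsilon$, $\delta$, and $|\mathcal{H}|$ are illustrative assumptions.

```python
import numpy as np

def agnostic_sample_complexity(eps, delta, class_size):
    """Corollary 2.2: n_H(eps, delta) <= ceil(2 * log(2|H|/delta) / eps^2)."""
    return int(np.ceil(2 * np.log(2 * class_size / delta) / eps**2))

def erm(hypotheses, xs, ys, loss):
    """ERM rule: return a hypothesis in the finite class with the smallest empirical risk."""
    risks = [np.mean([loss(y, h(x)) for x, y in zip(xs, ys)]) for h in hypotheses]
    return hypotheses[int(np.argmin(risks))]

# With eps = 0.1, delta = 0.05, and |H| = 1000, roughly 2120 examples suffice.
print(agnostic_sample_complexity(eps=0.1, delta=0.05, class_size=1000))
```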
