
Boosting: Foundations and Algorithms

“48740 7P 8291 000.tex” — 10/1/2012 — 17:41 — page i

Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

A complete list of the books published in this series may be found at the back of the book.

Boosting: Foundations and Algorithms

Robert E. Schapire
Yoav Freund

The MIT Press
Cambridge, Massachusetts
London, England

© 2012 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

For information about special quantity discounts, please email special_sales@mitpress.mit.edu

This book was set in Times Roman by Westchester Book Composition.
Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Schapire, Robert E.
Boosting : foundations and algorithms / Robert E. Schapire and Yoav Freund.
p. cm. — (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 978-0-262-01718-3 (hardcover : alk. paper)
1. Boosting (Algorithms) 2. Supervised learning (Machine learning) I. Freund, Yoav. II. Title.
Q325.75.S33 2012
006.3'1—dc23
2011038972

10 9 8 7 6 5 4 3 2

To our families

On the cover: A randomized depiction of the potential function Φt(s) used in the boost-by-majority algorithm, as given in equation (13.30). Each pixel, identified with an integer pair (t, s), was randomly colored blue with probability Φt(s), and was otherwise colored yellow (with colors inverted where lettering appears). The round t runs horizontally from T = 1225 at the far left down to 0 at the far right, and position s runs vertically from 225 at the top to −35 at the bottom. An edge of γ = 0.06 was used. [Cover design by Molly Seamans and the authors.]

Contents

Series Foreword
Preface

1  Introduction and Overview
   1.1  Classification Problems and Machine Learning
   1.2  Boosting
   1.3  Resistance to Overfitting and the Margins Theory
   1.4  Foundations and Algorithms
   Summary; Bibliographic Notes; Exercises

I  CORE ANALYSIS

2  Foundations of Machine Learning
   2.1  A Direct Approach to Machine Learning
   2.2  General Methods of Analysis
   2.3  A Foundation for the Study of Boosting Algorithms
   Summary; Bibliographic Notes; Exercises

3  Using AdaBoost to Minimize Training Error
   3.1  A Bound on AdaBoost's Training Error
   3.2  A Sufficient Condition for Weak Learnability
   3.3  Relation to Chernoff Bounds
   3.4  Using and Designing Base Learning Algorithms
   Summary; Bibliographic Notes; Exercises

4  Direct Bounds on the Generalization Error
   4.1  Using VC Theory to Bound the Generalization Error
   4.2  Compression-Based Bounds
   4.3  The Equivalence of Strong and Weak Learnability
   Summary; Bibliographic Notes; Exercises

5  The Margins Explanation for Boosting's Effectiveness
   5.1  Margin as a Measure of Confidence
   5.2  A Margins-Based Analysis of the Generalization Error
   5.3  Analysis Based on Rademacher Complexity
   5.4  The Effect of Boosting on Margin Distributions
   5.5  Bias, Variance, and Stability
   5.6  Relation to Support-Vector Machines
   5.7  Practical Applications of Margins
   Summary; Bibliographic Notes; Exercises

II  FUNDAMENTAL PERSPECTIVES

6  Game Theory, Online Learning, and Boosting
   6.1  Game Theory
   6.2  Learning in Repeated Game Playing
   6.3  Online Prediction
   6.4  Boosting
   6.5  Application to a "Mind-Reading" Game
   Summary; Bibliographic Notes; Exercises

7  Loss Minimization and Generalizations of Boosting
   7.1  AdaBoost's Loss Function
   7.2  Coordinate Descent
   7.3  Loss Minimization Cannot Explain Generalization
   7.4  Functional Gradient Descent
   7.5  Logistic Regression and Conditional Probabilities
   7.6  Regularization
   7.7  Applications to Data-Limited Learning
   Summary; Bibliographic Notes; Exercises

8  Boosting, Convex Optimization, and Information Geometry
   8.1  Iterative Projection Algorithms
   8.2  Proving the Convergence of AdaBoost
   8.3  Unification with Logistic Regression
   8.4  Application to Species Distribution Modeling
   Summary; Bibliographic Notes; Exercises

III  ALGORITHMIC EXTENSIONS

9  Using Confidence-Rated Weak Predictions
   9.1  The Framework
   9.2  General Methods for Algorithm Design
   9.3  Learning Rule-Sets
   9.4  Alternating Decision Trees
   Summary; Bibliographic Notes; Exercises

10  Multiclass Classification Problems
   10.1  A Direct Extension to the Multiclass Case
   10.2  The One-against-All Reduction and Multi-label Classification
   10.3  Application to Semantic Classification
   10.4  General Reductions Using Output Codes
   Summary; Bibliographic Notes; Exercises

11  Learning to Rank
   11.1  A Formal Framework for Ranking Problems
   11.2  A Boosting Algorithm for the Ranking Task
   11.3  Methods for Improving Efficiency
   11.4  Multiclass, Multi-label Classification
   11.5  Applications
   Summary; Bibliographic Notes; Exercises

IV  ADVANCED THEORY

12  Attaining the Best Possible Accuracy
   12.1  Optimality in Classification and Risk Minimization
   12.2  Approaching the Optimal Risk
   12.3  How Minimizing Risk Can Lead to Poor Accuracy
   Summary; Bibliographic Notes; Exercises

13  Optimally Efficient Boosting
   13.1  The Boost-by-Majority Algorithm
   13.2  Optimal Generalization Error
   13.3  Relation to AdaBoost
   Summary; Bibliographic Notes; Exercises

14  Boosting in Continuous Time
   14.1  Adaptiveness in the Limit of Continuous Time
   14.2  BrownBoost
   14.3  AdaBoost as a Special Case of BrownBoost
   14.4  Experiments with Noisy Data
   Summary; Bibliographic Notes; Exercises

Appendix: Some Notation, Definitions, and Mathematical Background
   A.1  General Notation
   A.2  Norms
   A.3  Maxima, Minima, Suprema, and Infima
   A.4  Limits
   A.5  Continuity, Closed Sets, and Compactness
   A.6  Derivatives, Gradients, and Taylor's Theorem
   A.7  Convexity
   A.8  The Method of Lagrange Multipliers
   A.9  Some Distributions and the Central Limit Theorem

Bibliography
Index of Algorithms, Figures, and Tables
Subject and Author Index

Series Foreword

The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that are transforming many industrial and scientific fields. Recently, several research communities have converged on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press Series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high quality research and innovative applications.

The MIT Press is extremely pleased to publish this contribution by Robert Schapire and Yoav Freund. The development of boosting algorithms by Schapire, Freund, and their collaborators over the last twenty years has had an immense impact on machine learning, statistics, and data mining. Originally developed to address a fundamental theoretical question, boosting has become a standard tool for solving a wide variety of problems in machine learning and optimization. The book offers a definitive, yet highly accessible, treatment of boosting. It explains the theory underlying the basic algorithm as well as presenting extensions to confidence-rated prediction, multi-class classification, and ranking. This book will serve as a valuable reference for researchers and as a focused introduction to machine learning for undergraduate and beginning graduate students interested in understanding this elegant approach to machine learning.

Preface

This book is about boosting, an approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules. A remarkably rich theory has evolved around boosting, with connections to a wide range of topics including statistics, game theory, convex optimization, and information geometry. In addition, AdaBoost and other boosting algorithms have enjoyed practical success with applications, for instance, in biology, vision, and speech processing. At various times in its history, boosting has been the subject of controversy for the mystery and paradox that it seems to present.

In writing this book, we have aimed to reach nearly anyone with an interest in boosting (as well as an appropriate, but relatively minimal, technical background), whether students or advanced researchers, whether trained in computer science, statistics, or some other field. We specifically hope that the book will be useful as an educational tool, and have therefore included exercises in every chapter. Although centered on boosting, the book introduces a variety of topics relevant to machine learning generally, as well as to related fields such as game theory and information theory.

The main prerequisite for this book is an elementary background in probability. We also assume familiarity with calculus and linear algebra at a basic, undergraduate level. An appendix provides background on some more advanced mathematical concepts which are used mainly in later chapters. The central notions of machine learning, boosting, and so on are all presented from the ground up.

Research on boosting has spread across multiple publications and disciplines over a period of many years. This book attempts to bring together, organize, extend, and simplify a significant chunk of this work. Some of this research is our own or with co-authors, but a very large part of what we present—including a few of the chapters almost in their entirety—is based on the contributions of the many other excellent researchers who work in this area. Credit for such previously published work is given in the bibliographic notes at the end of every chapter. Although most of the material in this book has appeared elsewhere, the majority of chapters also include new results that have never before been published.

The focus of this book is on foundations and algorithms, but also on applications. Following a general introduction to machine learning algorithms and their analysis, the book explores in part I the core theory of boosting, particularly its ability to generalize (that is, make accurate predictions on new data). This includes an analysis of boosting's training error, as well as bounds on the generalization error based both on direct methods and on the margins theory. Next, part II systematically explores some of the other myriad theoretical viewpoints that have helped to explain and understand boosting, including the game-theoretic interpretation, the view of AdaBoost as a greedy procedure for minimizing a loss function, and an understanding of boosting as an iterative-projection algorithm with connections to information geometry and convex optimization. Part III focuses on practical extensions of AdaBoost based on the use of confidence-rated weak hypotheses, and for multiclass and ranking problems. Finally, some advanced theoretical topics are covered in part IV, including the statistical consistency of AdaBoost, optimal boosting, and boosting algorithms which operate in continuous time. Although the book is organized around theory and algorithms, most of the chapters include specific applications and practical illustrations.

Readers with particular interests, or those organizing a course, might choose one of a number of natural tracks through this book. For a more theoretical treatment, part III could be omitted. A track focused on the practical application of boosting might omit chapters 4, 6, and 8, and all of part IV. A statistical approach might emphasize chapters 7 and 12 while omitting chapters 4, 6, 8, 13, and 14. Some of the proofs included in this book are somewhat involved and technical, and can certainly be skipped or skimmed. A rough depiction of how the chapters depend on each other is shown in figure P.1.

[Figure P.1: An approximate depiction of how the chapters of this book depend on each other. Each edge u → v represents a suggestion that chapter u be read before chapter v. (The dashed edge indicates that section 11.4 depends on chapter 10, but the other sections of chapter 11 do not.)]

This book benefited tremendously from comments and criticisms we received from numerous individuals. We are especially grateful to ten students who read an earlier draft of the book as part of a Princeton graduate seminar course: Jonathan Chang, Sean Gerrish, Sina Jafarpour, Berk Kapicioglu, Indraneel Mukherjee, Gungor Polatkan, Alexander Schwing, Umar Syed, Yongxin (Taylor) Xi, and Zhen (James) Xiang. Their close reading and numerous suggestions, both in and out of class, were extraordinarily helpful and led to significant improvements in content and presentation in every one of the chapters.

Thanks also to Peter Bartlett, Vladimir Koltchinskii, Saharon Rosset, Yoram Singer, and other anonymous reviewers of this book for their time and their many constructive suggestions and criticisms. An incomplete list of the many, many others who provided help, comments, ideas, and insights includes: Shivani Agarwal, Jordan Boyd-Graber, Olivier Chapelle, Kamalika Chaudhuri, Michael Collins, Edgar Dobriban, Miro Dudík, Dave Helmbold, Ludmila Kuncheva, John Langford, Phil Long, Taesup Moon, Lev Reyzin, Ron Rivest, Cynthia Rudin, Rocco Servedio, Matus Telgarsky, Paul Viola, and Manfred Warmuth. Our apologies to others who were surely, though certainly not intentionally, omitted from this list.

We are grateful to our past and present employers for supporting this work: AT&T Labs; Columbia Center for Computational Learning Systems; Princeton University; University of California, San Diego; and Yahoo! Research. Support for this research was also generously provided by the National Science Foundation under awards 0325463, 0325500, 0513552, 0812598, and 1016029.

Thanks to all of our collaborators and colleagues whose research appears in this book, and who kindly allowed us to include specific materials, especially figures, as cited and acknowledged with gratitude in the appropriate chapters. We are grateful to Katherine Almeida, Ada Brunstein, Jim DeWolf, Marc Lowenthal, Molly Seamans and everyone at MIT Press for their tireless assistance in preparing and publishing this book. Thanks also to the various editors at other publishers we considered, and to all those who helped with some occasionally thorny copyright issues, particularly Laurinda Alcorn and Frank Politano.

Finally, we are grateful for the love, support, encouragement, and patience provided by our families: Roberta, Jeni, and Zak; Laurie, Talia, and Rafi; and by our parents: Hans and Libby, Ora and Rafi.

1  Introduction and Overview

How is it that a committee of blockheads can somehow arrive at highly reasoned decisions, despite the weak judgment of the individual members? How can the shaky separate views of a panel of dolts be combined into a single opinion that is very likely to be correct? That this possibility of garnering wisdom from a council of fools can be harnessed and used to advantage may seem far-fetched and implausible, especially in real life. Nevertheless, this unlikely strategy turns out to form the basis of boosting, an approach to machine learning that is the topic of this book. Indeed, at its core, boosting solves hard machine-learning problems by forming a very smart committee of grossly incompetent but carefully selected members.

To see how this might work in the context of machine learning, consider the problem of filtering out spam, or junk email. Spam is a modern-day nuisance, and one that is ideally handled by highly accurate filters that can identify and remove spam from the flow of legitimate email. Thus, to build a spam filter, the main problem is to create a method by which a computer can automatically categorize email as spam (junk) or ham (legitimate). The machine learning approach to this problem prescribes that we begin by gathering a collection of examples of the two classes, that is, a collection of email messages which have been labeled, presumably by a human, as spam or ham. The purpose of the machine learning algorithm is to automatically produce from such data a prediction rule that can be used to reliably classify new examples (email messages) as spam or ham.

For any of us who has ever been bombarded with spam, rules for identifying spam or ham will immediately come to mind. For instance, if it contains the word Viagra, then it is probably spam. Or, as another example, email from one's spouse is quite likely to be ham. Such individual rules of thumb are far from complete as a means of separating spam from ham. A rule that classifies all email containing Viagra as spam, and all other email as ham, will very often be wrong. On the other hand, such a rule is undoubtedly telling us something useful and nontrivial, and its accuracy, however poor, will nonetheless be significantly better than simply guessing entirely at random as to whether each email is spam or ham.

Intuitively, finding these weak rules of thumb should be relatively easy—so easy, in fact, that one might reasonably envision a kind of automatic "weak learning" program that,

given any set of email examples, could effectively search for a simple prediction rule that may be rough and rather inaccurate, but that nonetheless provides some nontrivial guidance in separating the given examples as spam or ham. Furthermore, by calling such a weak learning program repeatedly on various subsets of our dataset, it would be possible to extract a collection of rules of thumb. The main idea of boosting is to somehow combine these weak and inaccurate rules of thumb into a single "committee" whose overall predictions will be quite accurate.

In order to use these rules of thumb to maximum advantage, there are two critical problems that we face: First, how should we choose the collections of email examples presented to the weak learning program so as to extract rules of thumb that will be the most useful? And second, once we have collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule? For the latter question, a reasonable approach is simply for the combined rule to take a vote of the predictions of the rules of thumb. For the first question, we will advocate an approach in which the weak learning program is forced to focus its attention on the "hardest" examples, that is, the ones for which the previously chosen rules of thumb were most apt to give incorrect predictions.

Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above. This book presents in detail much of the recent work on boosting, focusing especially on the AdaBoost algorithm, which has undergone intense theoretical study and empirical testing. In this first chapter, we introduce AdaBoost and some of the key concepts required for its study. We also give a brief overview of the entire book. See the appendix for a description of the notation used here and throughout the book, as well as some brief, mathematical background.

1.1  Classification Problems and Machine Learning

This book focuses primarily on classification problems in which the goal is to categorize objects into one of a relatively small set of classes. For instance, an optical character recognition (OCR) system must classify images of letters into the categories A, B, C, etc. Medical diagnosis is another example of a classification problem in which the goal is to diagnose a patient. In other words, given the symptoms manifested by the patient, our goal is to categorize him or her as a sufferer or non-sufferer of a particular disease. The spam-filtering example is also a classification problem in which we attempt to categorize emails as spam or ham.

We focus especially on a machine-learning approach to classification problems. Machine learning studies the design of automatic methods for making predictions about the future based on past experiences. In the context of classification problems, machine-learning methods attempt to learn to predict the correct classifications of unseen examples through the careful examination of examples which were previously labeled with their correct classifications, usually by a human.
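To make the weak "rules of thumb" from the spam example concrete, here is a minimal sketch in Python. The rule, the toy messages, and their labels are all invented for illustration; they are not from the book.

```python
# A single weak "rule of thumb" for spam filtering, as in the example above.
# The rule and the tiny labeled dataset are invented for illustration only.

def viagra_rule(message):
    """Predict spam (+1) if the word 'viagra' appears, else ham (-1)."""
    return 1 if "viagra" in message.lower() else -1

# Hypothetical labeled examples: (message, label), with +1 = spam, -1 = ham
examples = [
    ("cheap Viagra now!!!", 1),
    ("Viagra discount, act fast", 1),
    ("lunch with your spouse at noon?", -1),
    ("minutes from Tuesday's meeting", -1),
    ("you have won a free prize", 1),   # spam that this rule misses
]

mistakes = sum(1 for msg, label in examples if viagra_rule(msg) != label)
print(f"rule is wrong on {mistakes} of {len(examples)} examples")
# prints: rule is wrong on 1 of 5 examples
```

On this toy data the rule errs on one message out of five: poor as a complete filter, but clearly better than random guessing, which is exactly the kind of weak classifier boosting is designed to combine.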

We refer to the objects to be classified as instances. Thus, an instance is a description of some kind which is used to derive a predicted classification. In the OCR example, the instances are the images of letters. In the medical-diagnosis example, the instances are descriptions of a patient's symptoms. The space of all possible instances is called the instance space or domain, and is denoted by X. A (labeled) example is an instance together with an associated label indicating its correct classification. Instances are also sometimes referred to as (unlabeled) examples.

During training, a learning algorithm receives as input a training set of labeled examples called the training examples. The output of the learning algorithm is a prediction rule called a classifier or hypothesis. A classifier can itself be thought of as a computer program which takes as input a new unlabeled instance and outputs a predicted classification; so, in mathematical terms, a classifier is a function that maps instances to labels. In this book, we use the terms classifier and hypothesis fairly interchangeably, with the former emphasizing a prediction rule's use in classifying new examples, and the latter emphasizing the fact that the rule has been (or could be) generated as the result of some learning process. Other terms that have been used in the literature include rule, prediction rule, classification rule, predictor, and model.

To assess the quality of a given classifier, we measure its error rate, that is, the frequency with which it makes incorrect classifications. To do this, we need a test set, a separate set of test examples. The classifier is evaluated on each of the test instances, and its predictions are compared against the correct classifications of the test examples. The fraction of examples on which incorrect classifications were made is called the test error of the classifier.
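The error-rate measurement just described is straightforward to express in code; the threshold classifier and the labeled test examples below are hypothetical stand-ins, not from the book.

```python
# Measuring a classifier's error rate on a test set, per the definition above.
# The classifier and the labeled test examples are invented for illustration.

def error_rate(classifier, labeled_examples):
    """Fraction of examples on which the classifier predicts the wrong label."""
    mistakes = sum(1 for x, y in labeled_examples if classifier(x) != y)
    return mistakes / len(labeled_examples)

# A toy classifier on integer instances: predict +1 exactly when x >= 10
classify = lambda x: 1 if x >= 10 else -1

test_set = [(5, -1), (9, 1), (15, 1), (30, 1)]
print(error_rate(classify, test_set))  # prints 0.25: only x = 9 is misclassified
```

The corresponding accuracy is simply one minus the error rate.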
Similarly, the fraction of mistakes on the training set is called the training error. The fraction of correct predictions is called the (test or training) accuracy.

Of course, the classifier's performance on the training set is not of much interest since our purpose is to build a classifier that works well on unseen data. On the other hand, if there is no relationship at all between the training set and the test set, then the learning problem is unsolvable; the future can be predicted only if it resembles the past. Therefore, in designing and studying learning algorithms, we generally assume that the training and test examples are taken from the same random source. That is, we assume that the examples are chosen randomly from some fixed but unknown distribution D over the space of labeled examples and, moreover, that the training and test examples are generated by the same distribution. The generalization error of a classifier measures the probability of misclassifying a random example from this distribution D; equivalently, the generalization error is the expected test error of the classifier on any test set generated by D. The goal of learning can now be stated succinctly as producing a classifier with low generalization error.

To illustrate these concepts, consider the problem of diagnosing a patient with coronary artery disease. For this problem, an instance consists of a description of the patient including items such as sex, age, cholesterol level, chest pain type (if any), blood pressure, and results of various medical tests. The label or class associated with each instance is a diagnosis provided by a doctor as to whether or not the patient described actually suffers from the

disease. During training, a learning algorithm is provided with a set of labeled examples and attempts to produce a classifier for predicting if new patients suffer from the disease. The goal is to produce a classifier that is as accurate as possible. Later, in section 1.2.3, we describe experiments using a publicly available dataset for this problem.

1.2  Boosting

We can now make some of the informal notions about boosting described above more precise. Boosting assumes the availability of a base or weak learning algorithm which, given labeled training examples, produces a base or weak classifier. The goal of boosting is to improve the performance of the weak learning algorithm while treating it as a "black box" which can be called repeatedly, like a subroutine, but whose innards cannot be observed or manipulated. We wish to make only the most minimal assumptions about this learning algorithm. Perhaps the least we can assume is that the weak classifiers are not entirely trivial in the sense that their error rates are at least a little bit better than a classifier whose every prediction is a random guess. Thus, like the rules of thumb in the spam-filtering example, the weak classifiers can be rough and moderately inaccurate, but not entirely trivial and uninformative. This assumption, that the base learner produces a weak hypothesis that is at least slightly better than random guessing on the examples on which it was trained, is called the weak learning assumption, and it is central to the study of boosting.

As with the words classifier and hypothesis, we use the terms base and weak roughly interchangeably, with weak emphasizing mediocrity in performance and base connoting use as a building block.

Like any learning algorithm, a boosting algorithm takes as input a set of training examples (x1, y1), …, (xm, ym) where each xi is an instance from X, and each yi is the associated label or class.
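The weak learning assumption is often written quantitatively: under the current distribution over the training examples, the base classifier's weighted error should be at most 1/2 minus some positive "edge". A small sketch of computing that weighted error, with an invented hypothesis, dataset, and distribution:

```python
# Weighted error of a base classifier under a distribution over examples.
# The hypothesis h, the examples, and the weights are invented for illustration.

def weighted_error(h, examples, dist):
    """Pr_{i ~ dist}[h(x_i) != y_i]: total weight on misclassified examples."""
    return sum(d for (x, y), d in zip(examples, dist) if h(x) != y)

h = lambda x: 1 if x % 2 == 0 else -1                  # hypothetical classifier
examples = [(0, 1), (1, -1), (2, 1), (3, 1), (4, -1)]  # (instance, label) pairs
dist = [0.3, 0.3, 0.2, 0.1, 0.1]                       # weights summing to 1

err = weighted_error(h, examples, dist)
print(err)  # prints 0.2: h errs on x = 3 and x = 4, each carrying weight 0.1
```

Here the weighted error is 0.2, so this hypothesis is well clear of random guessing under this particular distribution, with an edge of 0.3.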
For now, and indeed for most of this book, we assume the simplest case in which there are only two classes, −1 and +1, although we do explore extensions to multiclass problems in chapter 10.

A boosting algorithm's only means of learning from the data is through calls to the base learning algorithm. However, if the base learner is simply called repeatedly, always with the same set of training data, we cannot expect anything interesting to happen; instead, we expect the same, or nearly the same, base classifier to be produced over and over again, so that little is gained over running the base learner just once. This shows that the boosting algorithm, if it is to improve on the base learner, must in some way manipulate the data that it feeds to it.

Indeed, the key idea behind boosting is to choose training sets for the base learner in such a fashion as to force it to infer something new about the data each time it is called. This can be accomplished by choosing training sets on which we can reasonably expect the performance of the preceding base classifiers to be very poor—even poorer than their regular weak performance. If this can be accomplished, then we can expect the base learner

to output a new base classifier which is significantly different from its predecessors. This is because, although we think of the base learner as a weak and mediocre learning algorithm, we nevertheless expect it to output classifiers that make nontrivial predictions.

We are now ready to describe in detail the boosting algorithm AdaBoost, which incorporates these ideas, and whose pseudocode is shown as algorithm 1.1. AdaBoost proceeds in rounds or iterative calls to the base learner. For choosing the training sets provided to the base learner on each round, AdaBoost maintains a distribution over the training examples. The distribution used on the t-th round is denoted Dt, and the weight it assigns to training example i is denoted Dt(i). Intuitively, this weight is a measure of the importance of correctly classifying example i on the current round. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that, effectively, hard examples get successively higher weight, forcing the base learner to focus its attention on them.

Algorithm 1.1
The boosting algorithm AdaBoost

Given: (x1, y1), …, (xm, ym) where xi ∈ X, yi ∈ {−1, +1}.
Initialize: D1(i) = 1/m for i = 1, …, m.
For t = 1, …, T:
•  Train weak learner using distribution Dt.
•  Get weak hypothesis ht : X → {−1, +1}.
•  Aim: select ht to minimalize the weighted error
       ϵt ≐ Pr_{i∼Dt}[ht(xi) ≠ yi].
•  Choose αt = (1/2) ln((1 − ϵt)/ϵt).
•  Update, for i = 1, …, m:
       Dt+1(i) = (Dt(i)/Zt) × { e^{−αt} if ht(xi) = yi;  e^{αt} if ht(xi) ≠ yi }
               = Dt(i) exp(−αt yi ht(xi)) / Zt,
   where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).
Output the final hypothesis:
       H(x) = sign( Σ_{t=1}^T αt ht(x) ).
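Algorithm 1.1 translates almost line for line into code. The sketch below uses simple one-dimensional threshold classifiers ("decision stumps") as a stand-in base learner; the stump learner and the toy dataset are invented for illustration and are not from the book. On this data no single stump is perfect, yet a few rounds of boosting drive the training error to zero.

```python
# A direct sketch of algorithm 1.1 (AdaBoost) with threshold "stumps" as the
# weak learner. The stump learner and toy dataset are invented for illustration.
import math

def best_stump(xs, ys, dist):
    """Weak learner: exhaustively pick the threshold/sign stump with least
    weighted error under the distribution dist."""
    best_err, best_h = None, None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            h = lambda x, t=thr, s=sign: s if x <= t else -s
            err = sum(d for x, y, d in zip(xs, ys, dist) if h(x) != y)
            if best_err is None or err < best_err:
                best_err, best_h = err, h
    return best_err, best_h

def adaboost(xs, ys, T):
    m = len(xs)
    dist = [1.0 / m] * m                   # D_1(i) = 1/m
    hyps = []                              # pairs (alpha_t, h_t)
    for _ in range(T):
        eps, h = best_stump(xs, ys, dist)  # weighted error eps_t
        if eps <= 0.0 or eps >= 0.5:       # perfect stump, or no edge left
            if eps <= 0.0:
                hyps.append((1.0, h))
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        # D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h_t(x_i)) / Z_t
        dist = [d * math.exp(-alpha * y * h(x))
                for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)                      # normalization factor Z_t
        dist = [d / z for d in dist]
        hyps.append((alpha, h))
    def H(x):                              # final weighted-majority vote
        return 1 if sum(a * h(x) for a, h in hyps) >= 0 else -1
    return H

if __name__ == "__main__":
    # Labels +1, +1, -1, -1, +1, +1: no single stump gets all six right,
    # but three rounds of boosting produce a perfect combined classifier.
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1, 1, -1, -1, 1, 1]
    H = adaboost(xs, ys, T=3)
    print([H(x) for x in xs])  # prints [1, 1, -1, -1, 1, 1]
```

Note how the reweighting plays out: the first stump errs on the last two points, so their weights rise and the second stump must attend to them, and so on; the final vote weighted by the αt values classifies every training example correctly even though each stump alone is wrong on at least two.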

