Introduction to Machine Learning: Improve Performance by Observation


Introduction to Machine Learning: Improve Performance by Observation
CS271P, Fall Quarter, 2018
Introduction to Artificial Intelligence
Prof. Richard Lathrop
Read Beforehand: R&N Ch. 18.1-18.4

You will be expected to know:
- Attributes, error function, classification & regression, hypothesis (predictor function)
- What is supervised learning?
- Decision tree algorithm
- Entropy & information gain
- Tradeoff between train and test error with model complexity
- Cross-validation

Deep Learning in Physics: Searching for Exotic Particles (Thanks to Pierre Baldi)


Thanks to Pierre Baldi, Daniel Whiteson, and Peter Sadowski

Higgs Boson Detection (Thanks to Pierre Baldi)
Deep network improves AUC by 8% over BDT (Boosted Decision Trees in the TMVA package).
Nature Communications, July 2014

Application to Extra-Tropical Cyclones (Thanks to Padhraic Smyth)
Gaffney et al., Climate Dynamics, 2007

(Thanks to Padhraic Smyth. Figure panels: Original Data, Iceland Cluster, Greenland Cluster, Horizontal Cluster.)

Cluster Shapes for Pacific Typhoon Tracks (Thanks to Padhraic Smyth)
Camargo et al., J. Climate, 2007

Tropical Cyclones, Western North Pacific (Thanks to Padhraic Smyth)
Camargo et al., J. Climate, 2007

An ICS Undergraduate Success Story (Thanks to Padhraic Smyth)
“The key student involved in this work started out as an ICS undergrad. Scott Gaffney took ICS 171 and 175, got interested in AI, started to work in my group, decided to stay in ICS for his PhD, did a terrific job in writing a thesis on curve-clustering and working with collaborators in climate science to apply it to important scientific problems, and is now one of the leaders of Yahoo! Labs reporting directly to the CEO there, http://labs.yahoo.com/author/gaffney/. Scott grew up locally in Orange County and is someone I like to point to as a great success story for ICS.” --- From Padhraic Smyth

Thanks to Xiaohui Xie


p53 and Human Cancers (Thanks to Richard Lathrop)
- p53 is a central tumor suppressor protein, “the guardian of the genome.”
- Cancer mutants: about 50% of all human cancers have p53 mutations.
- Rescue mutants: several second-site mutations restore functionality to some p53 cancer mutants in vivo.
(Figure: p53 core domain bound to DNA; image generated with UCSF Chimera.)
Cho, Y., Gorina, S., Jeffrey, P.D., Pavletich, N.P. Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science, v265, pp. 346-355, 1994.

Active Learning for Biological Discovery (Thanks to Richard Lathrop)
Find cancer rescue mutants through a cycle of knowledge, theory, and experiment.

Computational Active Learning
Pick the best (most informative) unknown examples to label.
- Known examples (Example 1 ... Example N) form the training set, which trains the classifier.
- The classifier scores the unknown examples (Example N+1, N+2, ..., Example M).
- Choose example(s) to label, add the new example(s) to the training set, and repeat.

Visualization of Selected Regions (Thanks to Richard Lathrop)
- Positive region: predicted active, 96-105 (green)
- Negative region: predicted inactive, 223-232 (red)
- Expert region: predicted active, 114-123 (blue)

Novel Single-a.a. Cancer Rescue Mutants (Thanks to Richard Lathrop)

                  MIP Positive (96-105)   MIP Negative (223-232)   Expert (114-123)
# Strong Rescue   8 (p = 0.008)           0                        6 (not significant)
# Weak Rescue     3 (not significant)     2                        7 (not significant)
Total # Rescue    11 (p = 0.022)          2                        13 (not significant)

- No significant differences between the MIP Positive and Expert regions.
- Both were statistically significantly better than the MIP Negative region.
- The Positive region rescued for the first time the cancer mutant P152L. No previous single-a.a. rescue mutants in any region.

Complete architectures for intelligence?
- Search? Solve the problem of what to do.
- Learning? Learn what to do.
- Logic and inference? Reason about what to do.
- Encoded knowledge / “expert” systems? Know what to do.
- Modern view: it’s complex & multi-faceted.

Importance of representation
- Definition of “state” can be very important.
- A good representation:
  - Reveals important features
  - Hides irrelevant detail
  - Exposes useful constraints
  (the three above are most important)
  - Makes frequent operations easy to do
  - Is rapidly or efficiently computable (it’s nice to be fast)

Reveals important features / Hides irrelevant detail
“You can’t learn what you can’t represent.” --- G. Sussman
In search: A man is traveling to market with a fox, a goose, and a bag of oats. He comes to a river. The only way across the river is a boat that can hold the man and exactly one of the fox, goose, or bag of oats. The fox will eat the goose if left alone with it, and the goose will eat the oats if left alone with it. How can the man get all his possessions safely across the river?
A good representation makes this problem easy. Encode each state as a bit vector MFGO (M = man, F = fox, G = goose, O = oats; 0 = starting side, 1 = ending side). The solution is then a path through these states:
0000 → 1010 → 0010 → 1110 → 0100 → 1101 → 0101 → 1111
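With this representation, the puzzle reduces to a tiny graph search. The sketch below runs breadth-first search over the MFGO bit-vector states; the encoding is from the slide, but the function names (`is_safe`, `successors`, `solve`) are illustrative.

```python
from collections import deque

# State = (man, fox, goose, oats); 0 = starting side, 1 = ending side.
START, GOAL = (0, 0, 0, 0), (1, 1, 1, 1)

def is_safe(state):
    man, fox, goose, oats = state
    if fox == goose != man:    # fox eats goose without the man present
        return False
    if goose == oats != man:   # goose eats oats without the man present
        return False
    return True

def successors(state):
    man = state[0]
    # The man crosses alone, or with one item currently on his side.
    for carry in (None, 1, 2, 3):
        if carry is not None and state[carry] != man:
            continue
        nxt = list(state)
        nxt[0] = 1 - man
        if carry is not None:
            nxt[carry] = 1 - state[carry]
        nxt = tuple(nxt)
        if is_safe(nxt):
            yield nxt

def solve():
    frontier, parent = deque([START]), {START: None}
    while frontier:
        state = frontier.popleft()
        if state == GOAL:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)

path = solve()
print(len(path))  # 8 states, i.e., 7 crossings
```

Breadth-first search finds the shortest plan, which matches the 8-state path shown on the slide.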

Reveals important features / Hides irrelevant detail
“You can’t learn what you can’t represent.” --- G. Sussman
In logic: If the unicorn is mythical, then it is immortal, but if it is not mythical, then it is a mortal mammal. If the unicorn is either immortal or a mammal, then it is horned. The unicorn is magical if it is horned. Prove that the unicorn is both magical and horned.
A good representation makes this problem easy. With Y = unicorn is mYthical, R = unicorn is moRtal, M = unicorn is a maMmal, H = unicorn is Horned, G = unicorn is maGical, the knowledge base in clause form (with the negated goal) is:
(¬Y ∨ ¬R) ∧ (Y ∨ R) ∧ (Y ∨ M) ∧ (R ∨ H) ∧ (¬M ∨ H) ∧ (¬H ∨ G) ∧ (¬G ∨ ¬H)
Resolution then derives (¬R ∨ M), (M ∨ H), (H), (G), and finally ( ), the empty clause, proving the unicorn is both magical and horned.
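The resolution proof can be sanity-checked by brute force over all truth assignments: if every model of the knowledge base makes H and G true, the conclusion is entailed. A minimal sketch (the symbols follow the slide; the function name `kb` is illustrative):

```python
from itertools import product

def kb(Y, R, M, H, G):
    # CNF of the unicorn puzzle: Y mythical, R mortal, M mammal, H horned, G magical.
    return ((not Y or not R)     # mythical -> immortal
            and (Y or R)         # not mythical -> mortal
            and (Y or M)         # not mythical -> mammal
            and (R or H)         # immortal -> horned
            and (not M or H)     # mammal -> horned
            and (not H or G))    # horned -> magical

# Entailment check: H and G must hold in every model of the KB.
entailed = all(H and G
               for Y, R, M, H, G in product([False, True], repeat=5)
               if kb(Y, R, M, H, G))
print(entailed)  # True
```

This truth-table check is exponential in the number of propositions, which is exactly why resolution over a good clause representation matters for larger problems.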

A Learning Problem
Predict whether someone is going to play tennis on a given day, given some weather conditions. We have records for the past two weeks. Today is Sunny, Hot, Normal humidity, and Strong wind. Playing tennis?

A Learning Problem
Training data. The target variable is also called the class label, goal, output variable, or dependent variable. The attributes are also called input variables, features, or covariates. New data: ?

Terminology
- Attributes: also known as features, variables, independent variables, covariates
- Target variable: also known as goal predicate, dependent variable, ...
- Classification: also known as discrimination, supervised classification, ...
- Error function: also known as objective function, loss function, ...

Types of learning
- Supervised learning: learn a mapping, attributes → target
  - Classification: target variable is discrete (e.g., spam email)
  - Regression: target variable is real-valued (e.g., stock market)
- Unsupervised learning: understand hidden data structure
  - Clustering: group data into “similar” groups
  - Latent space embedding: learn a simple data representation
- Other types of learning
  - Reinforcement learning: e.g., game-playing agent
  - Learning to rank: e.g., document ranking in Web search
  - And many others ...

Types of learning
- Data is labeled (learn mapping: attributes → target):
  - Discrete target variable → Classification (e.g., perceptron classifier)
  - Continuous target variable → Regression
- Data is unlabeled (discover hidden structure):
  - Clustering (e.g., k-means)

Simple illustrative learning problem
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
- Alternate: is there an alternative restaurant nearby?
- Bar: is there a comfortable bar area to wait in?
- Fri/Sat: is today Friday or Saturday?
- Hungry: are we hungry?
- Patrons: number of people in the restaurant (None, Some, Full)
- Price: price range ($, $$, $$$)
- Raining: is it raining outside?
- Reservation: have we made a reservation?
- Type: kind of restaurant (French, Italian, Thai, Burger)
- WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)

Training Data for Supervised Learning
Training data X1:
1. Alternate: is there an alternative restaurant nearby? Yes
2. Bar: is there a comfortable bar area to wait in? No
3. Fri/Sat: is today Friday or Saturday? No
4. Hungry: are we hungry? Yes
5. Patrons: number of people in the restaurant (None, Some, Full) Some
6. Price: price range ($, $$, $$$) $$$
7. Raining: is it raining outside? No
8. Reservation: have we made a reservation? Yes
9. Type: kind of restaurant (French, Italian, Thai, Burger) French
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60) 0-10
Label: training example X1 waited for a table at the restaurant.


Supervised or Inductive learning
An unknown target function f maps each input x to an output f(x).

Supervised or Inductive learning
We approximate the unknown target function f(x) with a hypothesis h(x, θ), parameterized by θ.

Empirical Error Functions
- E(h) = Σ distance[h(x; θ), f(x)], where the sum is over all training pairs in the training data D.
- Examples:
  - distance = squared error, if h and f are real-valued (regression)
  - distance = delta-function, if h and f are categorical (classification)
- Choosing the error function E() is as important as choosing the hypothesis function h():
  - E() should reflect the real “loss” in the problem
  - But it is often chosen for mathematical/algorithmic convenience
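The two distances above can be sketched in a few lines; the function names here are illustrative, not from the lecture code.

```python
def squared_error(h_vals, f_vals):
    # Regression: sum of (h(x) - f(x))^2 over all training pairs
    return sum((h - f) ** 2 for h, f in zip(h_vals, f_vals))

def zero_one_error(h_vals, f_vals):
    # Classification: delta-function distance, i.e., count of misclassifications
    return sum(h != f for h, f in zip(h_vals, f_vals))

print(squared_error([1.0, 2.0], [1.5, 2.0]))              # 0.25
print(zero_one_error(["spam", "ham"], ["spam", "spam"]))  # 1
```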

Supervised Learning as Optimization or Search
- Empirical learning = finding h(x), or h(x; θ), that minimizes the error function E(h)
  - If E(h) is differentiable → continuous optimization problem using gradient descent, etc. (e.g., multi-layer neural networks)
  - If E(h) is non-differentiable (e.g., classification) → systematic search problem through the space of functions h (e.g., decision tree classifiers)
- Once we decide on the functional form of h and the error function E, machine learning typically reduces to a large search or optimization problem
- Additional aspect: we really want to learn a function h that will generalize well to new data, not just memorize the training data (we will return to this later)
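As a concrete instance of the differentiable case, here is a minimal gradient-descent sketch for a 1-D linear hypothesis h(x; a, b) = a·x + b under squared error. The data, learning rate, and iteration count are all illustrative.

```python
# Gradient descent on E = sum (a*x + b - y)^2 for data that follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b, lr = 0.0, 0.0, 0.02
for _ in range(2000):
    ga = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))  # dE/da
    gb = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))      # dE/db
    a, b = a - lr * ga, b - lr * gb

print(round(a, 3), round(b, 3))  # 2.0 1.0
```

Decision tree learning, by contrast, searches a discrete space of trees, which is why it uses greedy heuristics instead of gradients.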

Decision Tree Representations
Key requirements:
- Attribute-value description: attributes must be expressible in a fixed collection of properties or attributes (e.g., True/False; hot, mild, cold; price ranges).
- Predefined classes (target values): the target function has discrete output values (Boolean or multiclass).

Decision Tree Representations
- Each internal node is labeled with an attribute, and each edge is labeled with a value of that attribute. Leaf nodes are labeled with the target variable.
- Every path in the tree corresponds to one row of the truth table.
- A decision tree can represent any Boolean function, in DNF (disjunction of conjunctions), e.g., A xor B = (A ∧ ¬B) ∨ (¬A ∧ B).

Decision Tree Representations
- Constrain h(·) to be a decision tree. This is the R&N tree for the Restaurant Wait problem (figure).


Decision Tree Learning
How many distinct decision trees are there with n Boolean attributes? As many as there are Boolean functions of n inputs: 2^(2^n). With 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 possible decision trees!
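The count follows because each decision tree over n Boolean attributes computes a Boolean function of n inputs, and there are 2^(2^n) such functions. A quick check of the number on the slide:

```python
# Each truth table over n Boolean inputs has 2^n rows, and each row can be
# labeled True or False independently, giving 2^(2^n) Boolean functions.
n = 6
num_functions = 2 ** (2 ** n)
print(num_functions)  # 18446744073709551616
```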

Decision Tree Learning
- Find the smallest decision tree consistent with the n examples. Unfortunately, this is provably intractable to do optimally.
- Termination criteria:
  - For noiseless data: if all examples at a node have the same label, declare it a leaf and back up.
  - For noisy data it might not be possible to find a “pure” leaf using the given attributes (we’ll return to this later); a simple approach is to put a depth bound on the tree (or grow to max depth) and use majority vote.
- Greedy heuristic search used in practice:
  - Select the root node that is “best” in some sense.
  - Partition the data into multiple subsets, depending on the root attribute’s value.
  - Recursively grow subtrees, until the termination criteria are met.

Pseudocode for decision tree learning (figure), annotated with: the termination conditions; the loop through all values of the best attribute; and the recursive call.
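The pseudocode can be made runnable as a small greedy, ID3-style learner: terminate on pure nodes or exhausted attributes (majority vote), otherwise split on the attribute with the highest information gain and recurse. This is a minimal sketch; all names are illustrative, not the lecture's actual code.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    # Expected entropy reduction from splitting on attribute `attr`.
    n = len(labels)
    split = {}
    for x, y in zip(examples, labels):
        split.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

def learn_tree(examples, labels, attrs):
    if len(set(labels)) == 1:                        # pure node: leaf, back up
        return labels[0]
    if not attrs:                                    # out of attributes: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))
    branches = {}
    for value in {x[best] for x in examples}:        # loop over values of best
        idx = [i for i, x in enumerate(examples) if x[best] == value]
        branches[value] = learn_tree([examples[i] for i in idx],   # recursive call
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return (best, branches)

def predict(tree, x):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[x[attr]]
    return tree

# Demo on A xor B, which needs both attributes:
X = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 0}, {"A": 1, "B": 1}]
y = ["No", "Yes", "Yes", "No"]
tree = learn_tree(X, y, ["A", "B"])
print([predict(tree, x) for x in X])  # ['No', 'Yes', 'Yes', 'No']
```

Note that on xor data the root split has zero information gain, yet the greedy learner still reaches a consistent tree after the recursive splits, one illustration of why greedy search is a heuristic rather than an optimal procedure.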

Choosing an Attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”.

Choosing an Attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”.
- Patrons? is a better choice. How can we quantify this?
  - Information gain (next slides)
  - Other metrics are also used, e.g., Gini impurity, variance reduction; these often give very similar results to information gain in practice.

Entropy and Information
“Entropy” is a measure of randomness (amount of uncertainty, amount of disorder).
https://www.youtube.com/watch?v=ZsY4WcQOrfk

Entropy, H(p), with only 2 outcomes
Consider a 2-class problem: p = probability of class #1, and 1 − p = probability of class #2.
In the binary case, H(p) is 0 at p = 0 and p = 1 (low entropy, low disorder, low uncertainty) and is maximal at p = 0.5 (high entropy, high disorder, high uncertainty).

Entropy and Information
Entropy: H(X) = −Σ p(x) log2 p(x)
- Log base two; the units of entropy are “bits”
- If only two outcomes: H(p) = −p log2 p − (1 − p) log2(1 − p)
Examples:
- H(X) = 0.25 log2 4 + 0.25 log2 4 + 0.25 log2 4 + 0.25 log2 4 = log2 4 = 2 bits (max entropy for 4 outcomes)
- H(X) = 0.75 log2(4/3) + 0.25 log2 4 ≈ 0.8113 bits
- H(X) = 1 log2 1 = 0 bits (min entropy)
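The three worked examples can be checked with a few lines of code; the `entropy` helper here is an illustrative sketch.

```python
import math

def entropy(probs):
    # H = -sum p * log2(p), in bits; terms with p = 0 contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0) + 0.0

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0  (max entropy for 4 outcomes)
print(round(entropy([0.75, 0.25]), 4))    # 0.8113
print(entropy([1.0]))                     # 0.0  (min entropy)
```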

Information Gain
- H(P) = current entropy of the class distribution P at a particular node, before further partitioning the data.
- H(P | A) = conditional entropy given attribute A = weighted average entropy of the conditional class distribution, after partitioning the data according to the values of A.
- Gain(A) = H(P) − H(P | A); sometimes written IG(A) or InformationGain(A).
- By definition, conditional entropy cannot be greater than entropy, so information gain must be non-negative.
- Simple rule in decision tree learning: at each internal node, split on the attribute with the largest information gain [or, equivalently, the smallest H(P | A)].

Root Node Example
For the training set, p = 6/12 positive and 1 − p = 6/12 negative, so
H(6/12, 6/12) = −(6/12) log2(6/12) − (6/12) log2(6/12) = 1 bit.
Consider the attributes Patrons and Type: Patrons has the highest IG of all attributes, and so is chosen by the learning algorithm as the root.
Information gain is then repeatedly applied at internal nodes until all leaves contain only examples from one class or the other.
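The root-node computation can be reproduced numerically from the class counts in the restaurant training set (2 None/0 positive, 4 Some/4 positive, 6 Full/2 positive for Patrons; an even split for Type). The helper names below are illustrative.

```python
import math

def H(p):
    # Binary entropy (bits) of a class proportion p.
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(parent_pos, parent_n, children):
    # children: list of (subset_size, subset_positives) produced by the split.
    before = H(parent_pos / parent_n)
    after = sum(n * H(pos / n) for n, pos in children) / parent_n
    return before - after

# Patrons: None -> 0/2 positive, Some -> 4/4, Full -> 2/6
print(round(info_gain(6, 12, [(2, 0), (4, 4), (6, 2)]), 3))          # 0.541
# Type: French 1/2, Italian 1/2, Thai 2/4, Burger 2/4
print(round(info_gain(6, 12, [(2, 1), (2, 1), (4, 2), (4, 2)]), 3))  # 0.0
```

This matches the next slide: IG(Patrons) ≈ 0.541 bits and IG(Type) = 0 bits.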

Choosing an attribute
IG(Patrons) = 0.541 bits
IG(Type) = 0 bits

Decision Tree Learned
Decision tree learned from the 12 examples (figure; the learned tree tests Patrons? at the root and Hungry? below it).

Decision Tree Learned
Comparison (figure): the original R&N tree vs. the learned tree.

Assessing Performance
Training data performance is typically optimistic, e.g., the error rate on training data. Reasons?
- The classifier may not have enough data to fully learn the concept (but on training data we don’t know this).
- For noisy data, the classifier may overfit the training data.
In practice we want to assess performance “out of sample”: how well will the classifier do on new, unseen data? This is the true test of what we have learned (just like a classroom).
With large data sets we can partition our data into two subsets, train and test:
- Build a model on the training data.
- Assess performance on the test data.

Example of Test Performance
Restaurant problem:
- Simulate 100 data sets of different sizes.
- Train on this data, and assess performance on an independent test set.
- Learning curve = plot of accuracy as a function of training set size.
- Typical “diminishing returns” effect (some nice theory to explain this).

Overfitting and Underfitting
(Figure: data points, Y vs. X.)

A Complex Model
Y = high-order polynomial in X (figure: Y vs. X).

A Much Simpler Model
Y = aX + b + noise (figure: Y vs. X).

Overfitting and Underfitting
(Figure: the complex and the simple fit side by side, Y vs. X.)

Example 2
My biologist colleagues say, “Oh, that’s the sample that we dropped on the floor!”


How Overfitting Affects Prediction
(Figure: predictive error vs. model complexity.) Error on training data decreases steadily as model complexity grows, while error on test data first falls and then rises. The ideal range for model complexity lies between too-simple models (underfitting) and too-complex models (overfitting).
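One way to see these curves concretely is k-nearest-neighbors on noisy data, where a small k plays the role of a too-complex model and a large k of a too-simple one. This example is illustrative, not from the lecture; the data, sizes, and noise rate are made up.

```python
import random

random.seed(0)
points = [random.random() for _ in range(60)]
# True class is (x > 0.5); flip each label with probability 0.2 (label noise).
data = [(x, (x > 0.5) != (random.random() < 0.2)) for x in points]
train, test = data[:40], data[40:]

def knn_predict(k, x):
    # Majority vote among the k nearest training points (1-D distance).
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(label for _, label in neighbors) * 2 > k

def error(k, dataset):
    return sum(knn_predict(k, x) != y for x, y in dataset) / len(dataset)

for k in (1, 5, 25):
    print(k, error(k, train), error(k, test))
# k = 1 memorizes the training set (training error 0), yet it tends to do
# worse on held-out data than a moderate k, mirroring the curves above.
```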

Training and Validation Data
- We can use the class labels (target variable) of the training data to compute the training error.
- We do not know the labels of new data. How can we estimate the error on test data?
- We can split the full data set into a training set and a validation set. Idea: train each model on the “training data”, then test each model’s accuracy on the validation data.

The k-fold Cross-Validation Method
- Why just choose one particular 90/10 “split” of the data? In principle we could do this multiple times: “k-fold cross-validation” (e.g., k = 10).
- Randomly partition the full data set into k disjoint subsets (each roughly of size n/k, where n = total number of training data points).
- for i = 1:k (here k = 10): train on the other 90% of the data; Acc(i) = accuracy on the held-out 10%.
- Cross-validation accuracy = (1/k) Σi Acc(i).
- Choose the method with the highest cross-validation accuracy.
- Common values for k are 5 and 10.
- Can also do “leave-one-out”, where k = n.
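The procedure above can be sketched directly: shuffle, partition into k disjoint folds, train on k − 1 folds, score on the held-out fold, and average. The toy majority-class "model" and all names are illustrative.

```python
import random

def k_fold_cv(data, k, train_fn, acc_fn, seed=0):
    # Randomly partition the data into k disjoint folds of roughly equal size,
    # then train on k-1 folds and measure accuracy on the held-out fold.
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        held_out = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        accs.append(acc_fn(train_fn(train), held_out))
    return sum(accs) / k   # cross-validation accuracy = (1/k) * sum_i Acc(i)

# Toy "model": always predict the majority class seen in training.
def train_fn(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def acc_fn(model, fold):
    return sum(y == model for _, y in fold) / len(fold)

data = [(x, x < 7) for x in range(10)]   # 7 positive, 3 negative examples
cv_acc = k_fold_cv(data, 5, train_fn, acc_fn)
print(cv_acc)  # 0.7
```

Here the majority class is True in every training split, so the cross-validation accuracy equals the overall fraction of True labels, 7/10.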

Disjoint Validation Data Sets
(Figure: the full data set split into training data and validation data, also known as test data.) Each partition (1st, 2nd, 3rd, 4th, 5th) holds out a different disjoint subset as the validation set; the rest is used for training.

More on Cross-Validation
Notes:
- Cross-validation generates an approximate estimate of how well the learned model will do on “unseen” data.
- By averaging over different partitions, it is more robust than a single train/validate partition of the data.
- “k-fold” cross-validation is a generalization: partition the data into k disjoint validation subsets of size n/k; train, validate, and average over the k partitions (e.g., k = 10 is commonly used).
- k-fold cross-validation is approximately k times more computationally expensive than just fitting a model to all of the data.

You will be expected to know:
- Attributes, error function, classification & regression, hypothesis (predictor function)
- What is supervised learning?
- Decision tree algorithm
- Entropy & information gain
- Tradeoff between train and test error with model complexity
- Cross-validation

Summary
- Supervised (inductive) learning
  - Error function, class of hypotheses/models {h}
  - Want to minimize E on our training data
  - Example: decision tree learning
- Generalization
  - Training-data error is over-optimistic
  - We want to see performance on test data
  - Cross-validation is a useful practical approach

