Introduction to Machine Learning: Improve Performance by Observation
CS271P, Fall Quarter, 2018
Introduction to Artificial Intelligence
Prof. Richard Lathrop
Read Beforehand: R&N Ch. 18.1-18.4
You will be expected to know:
- Attributes, error function, classification & regression, hypothesis (predictor function)
- What is supervised learning?
- Decision tree algorithm
- Entropy & information gain
- Tradeoff between training and test error with model complexity
- Cross-validation
Deep Learning in Physics: Searching for Exotic Particles
Thanks to Pierre Baldi
Thanks to Pierre Baldi, Daniel Whiteson, Peter Sadowski
Higgs Boson Detection
Deep network improves AUC by 8% over BDT (Boosted Decision Trees in the TMVA package).
Nature Communications, July 2014
Thanks to Pierre Baldi
Application to Extra-Tropical Cyclones
Gaffney et al., Climate Dynamics, 2007
Thanks to Padhraic Smyth
[Figure: original data vs. the Iceland, Greenland, and Horizontal clusters]
Thanks to Padhraic Smyth
Cluster Shapes for Pacific Typhoon Tracks
Camargo et al., J. Climate, 2007
Thanks to Padhraic Smyth
Tropical Cyclones, Western North Pacific
Camargo et al., J. Climate, 2007
Thanks to Padhraic Smyth
An ICS Undergraduate Success Story
“The key student involved in this work started out as an ICS undergrad. Scott Gaffney took ICS 171 and 175, got interested in AI, started to work in my group, decided to stay in ICS for his PhD, did a terrific job in writing a thesis on curve-clustering and working with collaborators in climate science to apply it to important scientific problems, and is now one of the leaders of Yahoo! Labs reporting directly to the CEO there, http://labs.yahoo.com/author/gaffney/. Scott grew up locally in Orange County and is someone I like to point to as a great success story for ICS.”
--- Padhraic Smyth
Thanks to Xiaohui Xie
p53 and Human Cancers
- p53 is a central tumor suppressor protein, “the guardian of the genome.”
- Cancer mutants: about 50% of all human cancers have p53 mutations.
- Rescue mutants: several second-site mutations restore functionality to some p53 cancer mutants in vivo.
[Figure: p53 core domain bound to DNA; image generated with UCSF Chimera]
Cho, Y., Gorina, S., Jeffrey, P.D., Pavletich, N.P. Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science 265:346-355, 1994.
Thanks to Richard Lathrop
Active Learning for Biological Discovery
Find cancer rescue mutants: a cycle of knowledge, theory, and experiment.
Thanks to Richard Lathrop
Computational Active Learning
Pick the best (most informative) unknown examples to label.
[Diagram: known examples 1..N form the training set and train the classifier; the classifier chooses which unknown example(s) among N+1..M to label next; the newly labeled example(s) are added to the training set, and the cycle repeats.]
Visualization of Selected Regions
- Positive region: predicted active, 96-105 (green)
- Negative region: predicted inactive, 223-232 (red)
- Expert region: predicted active, 114-123 (blue)
Thanks to Richard Lathrop
Novel Single-a.a. Cancer Rescue Mutants

                  MIP Positive (96-105)   MIP Negative (223-232)   Expert (114-123)
# Strong Rescue   8                       0 (p < 0.008)            6 (not significant)
# Weak Rescue     3                       2 (not significant)      7 (not significant)
Total # Rescue    11                      2 (p < 0.022)            13 (not significant)

- No significant differences between the MIP Positive and Expert regions.
- Both were statistically significantly better than the MIP Negative region.
- The Positive region rescued the cancer mutant P152L for the first time; there were no previous single-a.a. rescue mutants in any region.
Thanks to Richard Lathrop
Complete architectures for intelligence?
- Search? Solve the problem of what to do.
- Learning? Learn what to do.
- Logic and inference? Reason about what to do.
- Encoded knowledge / “expert” systems? Know what to do.
- Modern view: it's complex & multi-faceted.
Importance of representation
- Definition of “state” can be very important
- A good representation:
  – Reveals important features
  – Hides irrelevant detail
  – Exposes useful constraints
    (the three above are the most important)
  – Makes frequent operations easy to do
  – Rapidly or efficiently computable (it's nice to be fast)
Reveals important features / Hides irrelevant detail
“You can't learn what you can't represent.” --- G. Sussman
In search: A man is traveling to market with a fox, a goose, and a bag of oats. He comes to a river. The only way across the river is a boat that can hold the man and exactly one of the fox, goose, or bag of oats. The fox will eat the goose if left alone with it, and the goose will eat the oats if left alone with it. How can the man get all his possessions safely across the river?
A good representation makes this problem easy: encode each state as four bits (M F G O), where M = man, F = fox, G = goose, O = oats, 0 = starting side, 1 = ending side. For example, 1010 is the safe state after the man first ferries the goose across.
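With that bit representation, the puzzle reduces to a short breadth-first search. A minimal sketch (the encoding follows the slide's M/F/G/O legend; all function names are my own):

```python
from collections import deque

def safe(state):
    m, f, g, o = state
    # Fox eats goose, or goose eats oats, if left without the man.
    if f == g != m:
        return False
    if g == o != m:
        return False
    return True

def successors(state):
    m = state[0]
    # The man crosses alone (carry=0) or with one item from his bank
    # (carry indexes the fox, goose, or oats slot of the state tuple).
    for carry in range(4):
        new = list(state)
        new[0] = 1 - m
        if carry and state[carry] == m:
            new[carry] = 1 - m
        elif carry:
            continue  # that item is on the other bank
        new = tuple(new)
        if safe(new):
            yield new

def solve(start=(0, 0, 0, 0), goal=(1, 1, 1, 1)):
    # BFS over 4-bit states returns a shortest sequence of safe states.
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in successors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None
```

BFS finds the classic 7-crossing solution: the goose must go first, since every other opening move leaves an unsafe pair alone.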
Reveals important features / Hides irrelevant detail
“You can't learn what you can't represent.” --- G. Sussman
In logic: If the unicorn is mythical, then it is immortal, but if it is not mythical, then it is a mortal mammal. If the unicorn is either immortal or a mammal, then it is horned. The unicorn is magical if it is horned. Prove that the unicorn is both magical and horned.
A good representation makes this problem easy. With Y = unicorn is mYthical, R = unicorn is moRtal, M = unicorn is a maMmal, H = unicorn is Horned, G = unicorn is maGical, the knowledge base in clause form is:
(¬Y ∨ ¬R) (Y ∨ R) (Y ∨ M) (R ∨ H) (¬M ∨ H) (¬H ∨ G)
Adding the negated goal (¬G ∨ ¬H), resolution derives (¬H), then (R) and (¬M), then (¬Y) and (Y), and finally the empty clause ( ), proving that the unicorn is both magical and horned.
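The same entailment can be checked mechanically by enumerating all truth assignments, since there are only 2^5 of them. A minimal sketch (function names are mine; the five symbols follow the slide's legend):

```python
from itertools import product

def implies(p, q):
    return (not p) or q

def kb(y, r, m, h, g):
    # Mythical -> immortal; not mythical -> mortal mammal;
    # immortal or mammal -> horned; horned -> magical.
    return (implies(y, not r) and
            implies(not y, r and m) and
            implies((not r) or m, h) and
            implies(h, g))

def entails(goal):
    # KB entails goal iff goal holds in every model of the KB.
    return all(goal(*v) for v in product([False, True], repeat=5) if kb(*v))

print(entails(lambda y, r, m, h, g: h and g))  # True: horned and magical
print(entails(lambda y, r, m, h, g: y))        # False: mythical is undetermined
```

Note the second query: the KB does not settle whether the unicorn is mythical, which is why the exercise asks only about horned and magical.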
A Learning Problem
Predict whether someone is going to play tennis on a given day, given some weather conditions, using records for the past two weeks.
Today is Sunny, Hot, Normal humidity, and Strong wind. Playing tennis?
A Learning Problem
Training data:
- Attributes (input variables; features; covariates)
- Target variable (class label; goal; output variable; dependent variable)
New data: target value = ?
Terminology
- Attributes: also known as features, variables, independent variables, covariates
- Target variable: also known as goal predicate, dependent variable
- Classification: also known as discrimination, supervised classification
- Error function: also known as objective function, loss function
Types of learning
- Supervised learning: learn a mapping, attributes → target
  – Classification: target variable is discrete (e.g., spam email)
  – Regression: target variable is real-valued (e.g., stock market)
- Unsupervised learning: understand hidden data structure
  – Clustering: group data into “similar” groups
  – Latent space embedding: learn a simple data representation
- Other types of learning
  – Reinforcement learning: e.g., game-playing agent
  – Learning to rank: e.g., document ranking in Web search
  – And many others ...
Types of learning
- Data is labeled (learn mapping: attributes → target)
  – Discrete target variable → Classification (e.g., Perceptron classifier)
  – Continuous target variable → Regression
- Data is unlabeled (discover hidden structure)
  – e.g., K-means Clustering
Simple illustrative learning problem
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
Alternate: is there an alternative restaurant nearby?
Bar: is there a comfortable bar area to wait in?
Fri/Sat: is today Friday or Saturday?
Hungry: are we hungry?
Patrons: number of people in the restaurant (None, Some, Full)
Price: price range ($, $$, $$$)
Raining: is it raining outside?
Reservation: have we made a reservation?
Type: kind of restaurant (French, Italian, Thai, Burger)
WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)
Training Data for Supervised Learning
Training example X1:
1. Alternate: is there an alternative restaurant nearby? Yes
2. Bar: is there a comfortable bar area to wait in? No
3. Fri/Sat: is today Friday or Saturday? No
4. Hungry: are we hungry? Yes
5. Patrons: number of people in the restaurant (None, Some, Full) Some
6. Price: price range ($, $$, $$$) $$$
7. Raining: is it raining outside? No
8. Reservation: have we made a reservation? Yes
9. Type: kind of restaurant (French, Italian, Thai, Burger) French
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60) 0-10
Target: in training example X1, we waited for a table at the restaurant.
Training Data for Supervised Learning
Supervised or Inductive learning
An unknown target function f maps each input x to an output f(x).
Supervised or Inductive learning
We learn a hypothesis h(x, θ), parameterized by θ, to approximate the unknown target function f(x).
Empirical Error Functions
E(h) = Σ distance( h(x; θ), f(x) ), where the sum is over all training pairs in the training data D.
Examples:
- distance = squared error, (h(x) − f(x))², if h and f are real-valued (regression)
- distance = delta-function, 1 if h(x) ≠ f(x) else 0, if h and f are categorical (classification)
Choosing the error function E(·) is as important as choosing the hypothesis function h(·):
– E(·) should reflect the real “loss” in the problem
– But it is often chosen for mathematical/algorithmic convenience
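The two distance choices above can be written directly as empirical error functions. A minimal sketch (function names are mine):

```python
def squared_error(h_vals, f_vals):
    # Regression: sum of squared differences over all training pairs.
    return sum((h - f) ** 2 for h, f in zip(h_vals, f_vals))

def zero_one_error(h_vals, f_vals):
    # Classification: delta-function distance, i.e. a count of the
    # misclassified training examples.
    return sum(1 for h, f in zip(h_vals, f_vals) if h != f)
```

Here `h_vals` are the hypothesis's predictions and `f_vals` the true target values over the training data D.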
Supervised Learning as Optimization or Search
- Empirical learning = finding the h(x), or h(x; θ), that minimizes the error function E(h)
  – If E(h) is differentiable → continuous optimization problem using gradient descent, etc. (e.g., multi-layer neural networks)
  – If E(h) is non-differentiable (e.g., classification) → systematic search problem through the space of functions h (e.g., decision tree classifiers)
- Once we decide on the functional form of h and the error function E, machine learning typically reduces to a large search or optimization problem
- Additional aspect: we really want to learn a function h that will generalize well to new data, not just memorize the training data; we will return to this later
Decision Tree Representations
Key requirements:
- Attribute-value description: attributes must be expressible in a fixed collection of properties or attributes (e.g., True/False; hot, mild, cold; $, $$, $$$).
- Predefined classes (target values): the target function has discrete output values (boolean or multiclass).
Decision Tree Representations
- Each internal node is labeled with an attribute, and each edge is labeled with a value of that attribute; leaf nodes are labeled with the target variable.
- Every path in the tree corresponds to one row of the truth table.
- Decision trees can represent any Boolean function, e.g., in DNF (disjunction of conjunctions):
  A xor B ≡ (A ∧ ¬B) ∨ (¬A ∧ B)
Decision Tree Representations
- Constrain h(·) to be a decision tree
  – This is the R&N tree for the Restaurant Wait problem: [figure]
Decision Tree Learning
How many distinct decision trees are there with n Boolean attributes? At least as many as there are Boolean functions of n attributes: each of the 2^n truth-table rows can be labeled True or False independently, giving 2^(2^n) functions. With 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 possible decision trees!
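A quick check of that count (the helper name is mine):

```python
def num_boolean_functions(n):
    # Each of the 2**n truth-table rows can be labeled 0 or 1
    # independently, so there are 2**(2**n) distinct Boolean functions.
    return 2 ** (2 ** n)

print(num_boolean_functions(6))  # 18446744073709551616
```

Already at n = 6 the hypothesis space is astronomically large, which is why exhaustive search over trees is hopeless.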
Decision Tree Learning
- Find the smallest decision tree consistent with the n examples
  – Unfortunately this is provably intractable to do optimally
- Termination criteria:
  – For noiseless data, if all examples at a node have the same label, declare it a leaf and back up
  – For noisy data it might not be possible to find a “pure” leaf using the given attributes (we'll return to this later); a simple approach is to put a depth bound on the tree (or grow to max depth) and use majority vote
- Greedy heuristic search used in practice:
  – Select the root node attribute that is “best” in some sense
  – Partition the data into multiple subsets, depending on the root attribute's value
  – Recursively grow subtrees until the termination criteria are met
Pseudocode for Decision tree learning
[Figure: R&N decision-tree-learning pseudocode, annotated to highlight the termination conditions, the loop through all values of the best attribute, and the recursive call.]
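The annotated pseudocode can be fleshed out as follows. This is a minimal sketch, not the R&N code itself: the data format (a list of dicts plus a target key), the dict-based tree encoding, and all function names are my own assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    # Entropy before the split minus the weighted entropy after it.
    base = entropy([e[target] for e in examples])
    remainder = 0.0
    for val in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == val]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

def learn_tree(examples, attrs, target):
    labels = [e[target] for e in examples]
    # Termination: pure node, or no attributes left (majority vote).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: split on the attribute with the largest gain.
    best = max(attrs, key=lambda a: info_gain(examples, a, target))
    branches = {}
    # Loop through all values of the best attribute; recursive call.
    for val in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == val]
        branches[val] = learn_tree(subset, [a for a in attrs if a != best], target)
    return (best, branches)
```

Internal nodes come out as `(attribute, {value: subtree})` pairs and leaves as bare class labels; on the XOR truth table, for example, it recovers a depth-2 tree.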
Choosing an Attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
Choosing an Attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
- Patrons? is a better choice
  – How can we quantify this?
  – Information gain (next slides)
  – Other metrics are also used, e.g., Gini impurity, variance reduction; often very similar results to information gain in practice
Entropy and Information
“Entropy” is a measure of randomness (amount of uncertainty, amount of disorder).
See e.g. https://www.youtube.com/watch?v=ZsY4WcQOrfk
Entropy, H(p), with only 2 outcomes
Consider a 2-class problem: p = probability of class #1, 1 − p = probability of class #2.
[Figure: H(p) vs. p for 0 ≤ p ≤ 1; H(p) peaks at p = 0.5 (high entropy, high disorder, high uncertainty) and falls to 0 as p approaches 0 or 1 (low entropy, low disorder, low uncertainty).]
Entropy and Information
Entropy: H(X) = −Σᵢ pᵢ log₂ pᵢ
– Log base two, so the units of entropy are “bits”
– If only two outcomes: H(p) = −p log₂ p − (1 − p) log₂(1 − p)
Examples:
- H(X) = 0.25 log₂ 4 + 0.25 log₂ 4 + 0.25 log₂ 4 + 0.25 log₂ 4 = log₂ 4 = 2 bits (max entropy for 4 outcomes)
- H(X) = 0.75 log₂(4/3) + 0.25 log₂ 4 ≈ 0.8113 bits
- H(X) = 1 log₂ 1 = 0 bits (min entropy)
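A quick numerical check of the three examples above (the helper name `H` is mine):

```python
import math

def H(ps):
    # Entropy in bits of a probability distribution given as a list.
    # (Written as 0.0 - sum to avoid returning -0.0 for certainty.)
    return 0.0 - sum(p * math.log2(p) for p in ps if p > 0)

print(round(H([0.25] * 4), 4))    # 2.0    (max entropy for 4 outcomes)
print(round(H([0.75, 0.25]), 4))  # 0.8113
print(H([1.0]))                   # 0.0    (min entropy)
```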
Information Gain
- H(P) = current entropy of the class distribution P at a particular node, before further partitioning the data
- H(P | A) = conditional entropy given attribute A = weighted average entropy of the conditional class distribution, after partitioning the data according to the values of A
- Gain(A) = H(P) − H(P | A)
  – Sometimes written IG(A) or InformationGain(A)
- Note that by definition, the conditional entropy can't be greater than the entropy, so information gain must be non-negative
- Simple rule in decision tree learning:
  – At each internal node, split on the attribute with the largest information gain [or equivalently, the smallest H(P | A)]
Root Node Example
For the training set, p = 6/12 positive and 1 − p = 6/12 negative:
H(6/12, 6/12) = −(6/12) log₂(6/12) − (6/12) log₂(6/12) = 1 bit
Consider the attributes Patrons and Type: Patrons has the highest IG of all attributes and so is chosen by the learning algorithm as the root.
Information gain is then repeatedly applied at internal nodes until all leaves contain only examples from one class or the other.
Choosing an attribute
IG(Patrons) ≈ 0.541 bits
IG(Type) = 0 bits
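Both numbers can be reproduced from the per-value class counts in the R&N restaurant data (Patrons: None 0/2 wait, Some 4/4, Full 2/6; Type: every value splits evenly), as I read them from the R&N table. A sketch, with function names mine:

```python
import math

def H2(p):
    # Binary entropy in bits.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# (waiting, total) counts per attribute value.
patrons = [(0, 2), (4, 4), (2, 6)]        # None, Some, Full
rtype = [(1, 2), (1, 2), (2, 4), (2, 4)]  # French, Italian, Thai, Burger

def info_gain(splits, prior=0.5):
    # H(P) minus the weighted average entropy after the split.
    n = sum(t for _, t in splits)
    remainder = sum(t / n * H2(w / t) for w, t in splits)
    return H2(prior) - remainder

print(round(info_gain(patrons), 3))  # 0.541
print(round(info_gain(rtype), 3))    # 0.0
```

Type carries no information because every restaurant type has the same 50/50 class split as the root; Patrons produces two pure children (None, Some), which is where its gain comes from.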
Decision Tree Learned
Decision tree learned from the 12 examples: [figure, with Patrons? at the root and Hungry? below it]
Decision Tree Learned
[Figure: the R&N tree and the learned tree, side by side]
Assessing Performance
Training data performance is typically optimistic, e.g., the error rate on training data. Reasons?
- The classifier may not have enough data to fully learn the concept (but on training data we don't know this)
- For noisy data, the classifier may overfit the training data
In practice we want to assess performance “out of sample”: how well will the classifier do on new unseen data? This is the true test of what we have learned (just like a classroom).
With large data sets we can partition our data into 2 subsets, train and test:
- build a model on the training data
- assess performance on the test data
Example of Test Performance
Restaurant problem:
- simulate 100 data sets of different sizes
- train on this data, and assess performance on an independent test set
- learning curve = plot of accuracy as a function of training set size
- typical “diminishing returns” effect (some nice theory to explain this)
Overfitting and Underfitting
[Figure: data points, Y vs. X]
A Complex Model
Y = high-order polynomial in X
[Figure: Y vs. X]
A Much Simpler Model
Y = aX + b + noise
[Figure: Y vs. X]
Overfitting and Underfitting
[Figure: the two fits side by side, Y vs. X]
Example 2
My biologist colleagues say, “Oh, that's the sample that we dropped on the floor!”
How Overfitting affects Prediction
[Figure: predictive error vs. model complexity; error on training data keeps decreasing as complexity grows]
How Overfitting affects Prediction
[Figure: predictive error vs. model complexity, now with both error on training data and error on test data]
How Overfitting affects Prediction
[Figure: predictive error vs. model complexity, showing error on training data, error on test data, and the ideal range for model complexity between too-simple models (left) and too-complex models (right)]
Training and Validation Data
- We can use the class labels (target variable) of the training data to compute the training error.
- We do not know the labels of new data. How can we estimate the error on test data?
- We can split the full data set into a training set and a validation set.
Idea: train each model on the “training data” and then test each model's accuracy on the validation data.
The k-fold Cross-Validation Method
- Why just choose one particular 90/10 “split” of the data? In principle we could do this multiple times: “k-fold cross-validation” (e.g., k = 10)
- Randomly partition the full data set into k disjoint subsets (each roughly of size n/k, where n = total number of training data points)
  for i = 1:10 (here k = 10)
    – train on 90% of the data
    – Acc(i) = accuracy on the other 10%
  end
  Cross-Validation-Accuracy = (1/k) Σᵢ Acc(i)
- Choose the method with the highest cross-validation accuracy
- Common values for k are 5 and 10
- Can also do “leave-one-out”, where k = n
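The loop above can be sketched as follows; `train_fn` and `accuracy_fn` are placeholder names I introduced for any learner and any scoring function:

```python
import random

def k_fold_cv(data, k, train_fn, accuracy_fn, seed=0):
    # Shuffle a copy, then slice into k roughly equal disjoint folds.
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        validate = folds[i]  # held-out fold
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        accs.append(accuracy_fn(model, validate))
    # Cross-Validation-Accuracy = (1/k) * sum of the k fold accuracies.
    return sum(accs) / k
```

Each example appears in exactly one validation fold and in k − 1 training sets, which is what makes the averaged accuracy an estimate of out-of-sample performance.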
Disjoint Validation Data Sets
[Figure: the full data set is split into training data and validation data (aka test data); 1st partition]
Disjoint Validation Data Sets
[Figure: 1st and 2nd partitions, each holding out a different validation subset]
Disjoint Validation Data Sets
[Figure: partitions 1-5, each holding out a different disjoint validation subset of the full data set]
More on Cross-Validation
Notes:
– Cross-validation generates an approximate estimate of how well the learned model will do on “unseen” data
– By averaging over different partitions it is more robust than just a single train/validate partition of the data
– “k-fold” cross-validation is a generalization: partition the data into k disjoint validation subsets of size n/k; train, validate, and average over the k partitions (e.g., k = 10 is commonly used)
– k-fold cross-validation is approximately k times more computationally expensive than just fitting a model to all of the data
You will be expected to know:
- Attributes, error function, classification & regression, hypothesis (predictor function)
- What is supervised learning?
- Decision tree algorithm
- Entropy & information gain
- Tradeoff between training and test error with model complexity
- Cross-validation
Summary
- Supervised (inductive) learning
  – Error function, class of hypotheses/models {h}
  – Want to minimize E on our training data
  – Example: decision tree learning
- Generalization
  – Training data error is over-optimistic
  – We want to see performance on test data
  – Cross-validation is a useful practical approach