Lectures on Machine Learning - Lecture 1: From Artificial Intelligence to Machine Learning


Lectures on Machine Learning
Lecture 1: from artificial intelligence to machine learning

Stefano Carrazza
TAE2018, 2-15 September 2018
European Organization for Nuclear Research (CERN)

Acknowledgement: this project has received funding from the HICCUP ERC Consolidator grant (614577) and from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 740006 (N3PDF: Machine Learning, PDFs, QCD).

Why lectures on machine learning?

- Because it is an essential set of algorithms for building models in science.
- Because of the rapid development of new tools and algorithms in recent years.
- Because nowadays it is a requirement in experimental and theoretical physics.
- Because of the large interest from the HEP community: IML, conferences, grants.

What to expect from these lectures?

- Learn the basics of machine learning techniques.
- Learn when and how to apply machine learning algorithms.

The talk is divided into three lectures:

Lecture 1 (today): artificial intelligence, machine learning, model representation, metrics.

Lecture 2 (tomorrow): parameter learning, non-linear models, beyond neural networks, clustering.

Lecture 3 (tomorrow): hyperparameter tuning, cross-validation, ML in practice, the PDF case study.

Some references

Books:
- The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman.
- An Introduction to Statistical Learning, G. James, D. Witten, T. Hastie, R. Tibshirani.
- Deep Learning, I. Goodfellow, Y. Bengio, A. Courville.

Online resources:
- HEP-ML: https://github.com/iml-wg/HEP-ML-Resources
- TensorFlow: http://tensorflow.org
- Keras: http://keras.io
- Scikit-learn: http://scikit-learn.org

Artificial Intelligence

Artificial intelligence timeline [figure]

Defining A.I.

Artificial intelligence (A.I.) is "the science and engineering of making intelligent machines" (John McCarthy, 1956).

A.I. consists of the development of computer systems that perform tasks commonly associated with intelligence, such as learning. Its subfields include machine learning, natural language processing, knowledge reasoning, computer vision, speech, planning, and robotics.

A.I. and humans

There are two categories of A.I. tasks:

- Abstract and formal: easy for computers but difficult for humans, e.g. playing chess (IBM's Deep Blue, 1997). These motivated the knowledge-based approach to artificial intelligence.
- Intuitive for humans but hard to describe formally: e.g. recognizing faces in images or spoken words. These require concept capture and generalization.

A.I. technologies

Historically, the knowledge-based approach has not led to major success with tasks that are intuitive for humans, because it:

- requires human supervision and hard-coded logical inference rules,
- lacks representation learning ability.

Solution: the A.I. system needs to acquire its own knowledge, e.g. by writing a program which learns the task. This capability is known as machine learning (ML).

Venn diagram for A.I.

Artificial intelligence (e.g. knowledge bases) contains machine learning (e.g. logistic regression), which contains representation learning (e.g. autoencoders), which contains deep learning (e.g. MLPs).

When representation learning is difficult, ML provides deep learning techniques which allow the computer to build complex concepts out of simpler concepts, e.g. artificial neural networks (MLPs).

Machine Learning

Machine learning definition

Definition from A. Samuel in 1959: "Field of study that gives computers the ability to learn without being explicitly programmed."

Definition from T. Mitchell in 1998: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance on T, as measured by P, improves with experience E."

Machine learning examples

Thanks to work in A.I. and new computational capabilities:

- Database mining: search engines, spam filters, medical and biological records.
- Intuitive tasks for humans: autonomous driving, natural language processing, robotics (reinforcement learning), game playing (DQN algorithms).
- Human learning: concept/human recognition, computer vision, product recommendation.

ML applications in HEP

ML in experimental HEP

There are many applications in experimental HEP involving the LHC measurements, including the Higgs discovery, such as: tracking, particle identification, fast simulation, and event filtering.

ML in experimental HEP

Some remarkable examples are:

- Signal-background detection: decision trees, artificial neural networks, support vector machines.
- Jet discrimination: deep learning imaging techniques via convolutional neural networks.
- HEP detector simulation: generative adversarial networks, e.g. LAGAN and CaloGAN.

ML in theoretical HEP

Supervised learning:
- the structure of the proton at the LHC: parton distribution functions,
- theoretical prediction and combination,
- Monte Carlo reweighting techniques,
- BSM searches and exclusion limits.

Unsupervised learning:
- clustering and compression,
- density estimation and anomaly detection,
- Monte Carlo sampling.

[Figures: NNPDF3.1 NNLO parton distributions $xf(x,\mu^2)$ at $\mu^2 = 10$ GeV$^2$ and $\mu^2 = 10^4$ GeV$^2$; PDF4LHC15 recommendation; top quark rapidity $y(t)$, cross section per bin from POWHEG BOX + PYTHIA8 with neural network Sudakov.]

Machine learning algorithms

Machine learning algorithms fall into three classes:

- Supervised learning: regression, classification, etc. Labels are known: the algorithm processes a training data set whose desired output is provided by a supervisor.
- Unsupervised learning: clustering, dimensionality reduction, etc. Labels are unknown and there is no training data set: the algorithm must discover an interpretation from the features of the input data alone.
- Reinforcement learning: real-time decisions, etc. An agent chooses the best action for the input data and learns from the reward returned by the environment.

Machine learning algorithms

More than 60 algorithms exist [figure].

Workflow in machine learning

The operative workflow in ML is summarized by the following steps:

Data → Model → Cost function → Training (driven by an optimizer) → Cross-validation → Best model.

The best model is then used to:
- supervised learning: make predictions for new observed data,
- unsupervised learning: extract features from the input data.

Models and metrics


Model representation in supervised learning

We define parametric and structured models for statistical inference; examples: linear models, neural networks, decision trees. A machine learning algorithm uses the training data set to build a model, which then maps an input $x$ to an estimated prediction.

Given a training set of input-output pairs $A = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, find a model $M$ such that
$$M(x) \approx y,$$
where $x$ is the input vector and $y$ is a discrete label in classification or a real value in regression.

Model representation in supervised learning

Examples of models:

- Linear regression: we define a vector $x \in \mathbb{R}^n$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output:
$$\hat{y}(x) = w^T x + b,$$
where $w \in \mathbb{R}^n$ is a vector of parameters and $b$ a constant.
- Generalized linear models are also available, increasing the power of linear models.
- Non-linear models: neural networks (discussed later).
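As a concrete illustration, here is a minimal sketch of fitting $\hat{y}(x) = w^T x + b$ by least squares with NumPy; the synthetic data and all parameter values are invented for this example:

```python
import numpy as np

# Synthetic data: 200 samples, 3 input features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.1 * rng.normal(size=200)

# Append a column of ones so the bias b is learned as an extra weight.
X1 = np.hstack([X, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = coef[:-1], coef[-1]
print(w, b)  # should be close to true_w and true_b
```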

Model representation trade-offs

However, the selection of the appropriate model comes with trade-offs:

- Prediction accuracy vs. interpretability: e.g. a linear model vs. splines or neural networks. Roughly, moving from linear regression through decision trees, k-nearest neighbors, random forests, and support vector machines to neural nets, interpretability decreases while accuracy increases.
- Optimal capacity/flexibility: the number of parameters and the architecture must be chosen to deal with overfitting and underfitting situations.

Assessing the model performance

How do we check model performance? We define metrics and statistical estimators for it. Examples:

- Regression: cost / loss / error function.
- Classification: cost function, precision, accuracy, recall, ROC, AUC.

Assessing the model performance - cost function

To assess the model performance we define a cost function $J(w)$ which often measures the difference between the target and the model output. In an optimization procedure, given a model $\hat{y}_w$, we search for:
$$\arg\min_w J(w).$$

The mean squared error (MSE) is the most commonly used cost function for regression:
$$J(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_w(x_i))^2,$$
a quadratic and convex function in linear regression.
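A minimal sketch of the MSE cost in NumPy (the toy targets below are invented for illustration):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: J(w) = (1/n) * sum_i (y_i - y_hat_i)^2."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])
print(mse(y, y_hat))  # 0.02
```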

Assessing the model performance - cost function

Other cost functions depend on the nature of the problem. Some examples:

Regression with uncertainties, chi-square:
$$J(w) = \sum_{i,j=1}^{n} (y_i - \hat{y}_w(x_i)) \, (\sigma^{-1})_{ij} \, (y_j - \hat{y}_w(x_j)),$$
where $\sigma_{ij}$ is the data covariance matrix, e.g. for LHC data the experimental statistical and systematic correlations.

[Figure: ATLAS1JET11, R = 0.4, k-factor models: NNLO/NLO ratio vs. $p_T$ (GeV) in rapidity bins from $y = 0.2$ to $y = 2.8$, comparing NN model, k-factor, and CGP.]
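A sketch of the chi-square cost, assuming NumPy; the diagonal covariance in the check is invented for illustration:

```python
import numpy as np

def chi2(y, y_hat, cov):
    """Chi-square: residuals weighted by the inverse covariance matrix.
    Solving the linear system avoids forming the explicit inverse."""
    r = y - y_hat
    return r @ np.linalg.solve(cov, r)

# With a diagonal covariance this reduces to sum_i (r_i / sigma_i)^2.
y = np.array([1.0, 2.0])
y_hat = np.array([1.2, 1.9])
cov = np.diag([0.1**2, 0.2**2])
print(chi2(y, y_hat, cov))  # 4.0 + 0.25 = 4.25
```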

Assessing the model performance - cost function

Logistic regression (binary classification): cross-entropy
$$J(w) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_w(x_i) + (1 - y_i) \log(1 - \hat{y}_w(x_i)) \right],$$
where $\hat{y}_w(x_i) = 1/(1 + e^{-w^T x_i})$.
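A minimal sketch of the cross-entropy for a logistic model, assuming NumPy; the epsilon guard is an implementation detail added here to avoid log(0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, X, w):
    """Binary cross-entropy for the logistic model y_hat = sigmoid(X @ w)."""
    y_hat = sigmoid(X @ w)
    eps = 1e-12  # numerical guard against log(0)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
```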

Assessing the model performance - cost function

Density estimation / regression: negative log-likelihood
$$J(w) = -\sum_{i=1}^{n} \log \hat{y}_w(x_i).$$

Other choices include the Kullback-Leibler divergence, RMSE, MAE, etc.

[Figure: Gaussian mixture pdf vs. RTBM model, with sampling $N_s = 10^5$; marginal distributions $P(v_1)$ and $P(v_2)$.]
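A small sketch of the negative log-likelihood, assuming the model density has already been evaluated at the observed points; the Gaussian toy data is invented for illustration:

```python
import numpy as np

def negative_log_likelihood(p):
    """Negative log-likelihood given model densities p(x_i) for each sample."""
    return -np.sum(np.log(p))

# Example: standard Gaussian density evaluated at sampled points.
x = np.random.default_rng(1).normal(size=1000)
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
print(negative_log_likelihood(x := p))
```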

Training and test sets

Another common issue is related to model capacity in supervised learning:

- The model should not learn noise from the data.
- The model should be able to generalize its output to new samples.

To observe this issue we split the total number of examples into a training set and a test set, and monitor:
- the training set error, $J_{\mathrm{Tr}}(w)$,
- the test set / generalization error, $J_{\mathrm{Test}}(w)$.

Training and test sets

The test set is independent of the training set but follows the same probability distribution: the model is built on the training set, then used to predict on the test set in order to estimate its performance.
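A minimal sketch of such a split using scikit-learn (one of the online resources listed above); the synthetic data and the 80/20 proportion are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(1000, 5))
y = X[:, 0] + 0.1 * np.random.default_rng(1).normal(size=1000)

# Hold out 20% of the examples as an independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```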

Bias-variance trade-off

From a practical point of view, dividing the input data into training and test sets exposes a conflict between the training error and the test/generalization error, known as the bias-variance trade-off.

Bias-variance trade-off

Suppose we have a model $\hat{y}(x)$ determined from a training data set, and consider as the true model $Y = y(X) + \epsilon$, with $y(x) = E(Y \mid X = x)$, where the noise $\epsilon$ has zero mean and constant variance.

If we take a point $(x_0, y_0)$ from the test set, then:
$$E[(y_0 - \hat{y}(x_0))^2] = (\mathrm{Bias}[\hat{y}(x_0)])^2 + \mathrm{Var}[\hat{y}(x_0)] + \mathrm{Var}(\epsilon),$$
where
$$\mathrm{Bias}[\hat{y}(x_0)] = E[\hat{y}(x_0)] - y(x_0), \qquad \mathrm{Var}[\hat{y}(x_0)] = E[\hat{y}(x_0)^2] - (E[\hat{y}(x_0)])^2.$$

So the expectation averages over the variability of $y_0$ (the noise) and the variability in the training data.
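The decomposition can be checked numerically. Below is a Monte Carlo sketch, assuming a toy true function $y(x) = \sin(x)$ with Gaussian noise; the polynomial model and all constants are illustrative choices:

```python
import numpy as np

# Estimate bias^2 and variance of a polynomial fit at a test point x0
# by refitting on many independently drawn training sets.
rng = np.random.default_rng(0)
x0, n_train, n_rep, sigma = 1.0, 30, 2000, 0.3
degree = 3  # model flexibility: polynomial degree

preds = np.empty(n_rep)
for r in range(n_rep):
    x = rng.uniform(-np.pi, np.pi, n_train)
    y = np.sin(x) + sigma * rng.normal(size=n_train)
    coeffs = np.polyfit(x, y, degree)   # fit on a fresh training set
    preds[r] = np.polyval(coeffs, x0)   # prediction at the test point

bias2 = (preds.mean() - np.sin(x0)) ** 2
var = preds.var()
print(f"bias^2 = {bias2:.4f}, variance = {var:.4f}, noise = {sigma**2:.4f}")
# Expected test error at x0 is approximately bias^2 + variance + noise.
```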

Bias-variance trade-off

As the flexibility of $\hat{y}$ increases, its variance increases and its bias decreases. Choosing the flexibility based on the average test error amounts to a bias-variance trade-off:

- High bias → underfitting: erroneous assumptions in the learning algorithm.
- High variance → overfitting: excessive sensitivity to small fluctuations (noise) in the training set.

Bias-variance trade-off

More examples of the bias-variance trade-off [figures].

Bias-variance trade-off

Regularization techniques can be applied to modify the learning algorithm and reduce its generalization error but not its training error. For example, adding a weight decay term to the MSE cost function:
$$J(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_w(x_i))^2 + \lambda w^T w,$$
where $\lambda$ is a real number which expresses the preference for weights with a smaller squared $L^2$ norm.
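A sketch of weight decay in practice via ridge regression in scikit-learn (synthetic data). Note that Ridge penalizes the sum of squared residuals rather than their mean, so its alpha corresponds to $n\lambda$ in the convention above:

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.random.default_rng(0).normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.default_rng(1).normal(size=200)

# alpha plays the role of the L2 penalty strength: larger values
# shrink the weights harder, trading variance for bias.
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)
```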

Solution for the bias-variance trade-off

By tuning the hyperparameter $\lambda$ we can regularize a model without explicitly modifying its capacity.

Solution for the bias-variance trade-off

A common way to reduce the bias-variance trade-off and choose the proper learning hyperparameters is to create a validation set that is:
- not used by the training algorithm,
- not used as the test set.

The total number of examples is thus divided into:
- Training set: examples used for learning.
- Validation set: examples used to tune the hyperparameters.
- Test set: examples used only to assess the performance.

Techniques are available to deal with data samples with large and small numbers of examples (discussed later); a minimal sketch of the three-way split follows.
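The sketch below uses scikit-learn on synthetic data; the 60/20/20 proportions are an illustrative choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)

# First carve out the test set, then split the rest into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)
# Result: 60% train, 20% validation, 20% test.
```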

Assessing model performance for classification

In binary classification tasks we usually complement the cost function with the accuracy metric, defined as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Example: with true positives (TP) = 8, false positives (FP) = 2, false negatives (FN) = 4, and true negatives (TN) = 20, the accuracy is (8 + 20)/(8 + 20 + 2 + 4) ≈ 82%.

However, accuracy does not represent the overall situation for skewed classes, i.e. imbalanced data sets with a large disparity between classes, e.g. signal and background. In these cases we define precision and recall.

Assessing model performance for classification

- Precision: the proportion of positive identifications that are correct.
- Recall: the proportion of actual positives that are correctly identified.

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$

For the example above (TP = 8, FP = 2, FN = 4, TN = 20): accuracy ≈ 82%, precision = 80%, recall ≈ 67%.

Various metrics have been developed that rely on both precision and recall, e.g. the F1 score:
$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \approx 73\%.$$
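These numbers can be reproduced directly from the confusion-matrix counts used in the example:

```python
# Confusion-matrix counts from the slides: TP=8, FP=2, FN=4, TN=20.
tp, fp, fn, tn = 8, 2, 4, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.0%} precision={precision:.0%} "
      f"recall={recall:.0%} f1={f1:.0%}")
# accuracy=82% precision=80% recall=67% f1=73%
```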

Assessing model performance for classification

In binary classification we can vary the probability threshold and define the receiver operating characteristic curve (ROC curve): a metric which shows the relationship between correctly classified positive cases, the true positive rate (TPR, i.e. recall), and incorrectly classified negative cases, the false positive rate (FPR, i.e. 1 − specificity):
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$

Assessing model performance for classification

The area under the ROC curve (AUC) represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. AUC provides an aggregate measure of performance across all possible classification thresholds:

- AUC is 0 if all predictions are 100% wrong.
- AUC is 1 if all predictions are correct.
- AUC is scale-invariant and classification-threshold-invariant.
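A sketch of computing the ROC curve and AUC with scikit-learn; the toy scores below are invented so that positives tend to score higher than negatives:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([rng.normal(1.0, 1.0, 500),
                         rng.normal(0.0, 1.0, 500)])

# Sweep the probability threshold to trace out the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", roc_auc_score(y_true, scores))
```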

Summary

Summary

We have covered the following topics:

- Motivation and overview of A.I.
- Definition and overview of ML.
- Model representation: definition and trade-offs.
- Learning metrics for assessing the model performance.
- Metrics for classification.

