Statistical Learning And Data Mining Stat557

Jia Li
Department of Statistics
The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/~jiali

General Information
- Course homepage: http://www.stat.psu.edu/~jiali/stat557
- Prerequisite:
  - Elementary probability theory
  - Conditional distribution, expectation
  - C, Matlab, or S-plus programming

- Textbooks:
  - Required: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, and J. Friedman (ElemStatLearn).
  - Optional:
    1. Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone
    2. Pattern Recognition and Neural Networks by B. Ripley
    3. Principles of Data Mining by H. Mannila, P. Smyth, and D. J. Hand
    4. Data Mining: Concepts and Techniques by J. Han and M. Kamber

What Is Data Mining?
Data mining: tools, methodologies, and theories for revealing patterns in data, a critical step in knowledge discovery.
Driving forces:
- Big data: explosive growth of data in a great variety of fields
  - Enormous volume
  - High complexity: dimension, structure
  - Dynamic
- Cheaper storage devices with higher capacity
- Faster communication
- Better database management systems
- Rapidly increasing computing power: distributed and parallel platforms
- Make data work for us

Research fields
- Statistics
- Machine learning
- Pattern recognition
- Signal processing
- Database

Applications
- Business
  - Wal-Mart data warehouse
  - Credit card companies
- Genomics
  - Human genome project: DNA sequences
  - Microarray data
- Information retrieval
  - Terabytes of data on the internet
  - Multimedia information
- Communication systems
- Speech recognition
- Image analysis
- Many other scientific fields

Problems Focused: Prediction
[Figure]

Terminology
Notation
- Input X: X is often multidimensional. Each dimension of X is denoted by X_j and is referred to as a feature, predictor, or independent variable/variable.
- Output Y: response, dependent variable.
Categorization
- Supervised learning vs. unsupervised learning
  - Is Y available in the training data?
- Regression vs. classification
  - Is Y quantitative or qualitative?
  - A qualitative Y is also denoted by G, with G ∈ {1, 2, ..., K}.

Examples
Email spam (ElemStatLearn):
- Goal: predict whether an email is a junk email, i.e., "spam".
- Raw data: text email messages.
- Input X: relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message.
- Training data set: 4601 email messages with email type known (supervised learning).
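The step from raw text to the input X can be sketched as follows. This is a minimal illustration, not the preprocessing used for the ElemStatLearn data set: the five-entry vocabulary, the tokenizer, and the function name `relative_frequencies` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical vocabulary; the data set on the slide uses 57 words and punctuation marks.
VOCAB = ["free", "money", "meeting", "!", "$"]

def relative_frequencies(message, vocab=VOCAB):
    """Return, for each vocabulary entry, its count divided by the total token count."""
    # A token is a run of letters/apostrophes or a single punctuation character.
    # Digits are ignored by this simple tokenizer.
    tokens = re.findall(r"[a-z']+|[^\sa-z0-9']", message.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[term] / total for term in vocab]

# One email becomes one feature vector with len(VOCAB) dimensions.
x = relative_frequencies("Free money! Claim your free $ now!")
```

Each message maps to a fixed-length vector regardless of its length, which is what lets 4601 emails of different sizes form one training matrix.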

Examples
Handwritten digit recognition (ElemStatLearn):
- Goal: identify single digits 0-9 based on images.
- Raw data: images that are scaled segments from five-digit ZIP codes.
  - 16 × 16 eight-bit grayscale maps
  - Pixel intensities range from 0 (black) to 255 (white).
- Input data: a 256-dimensional vector, or feature vectors with lower dimensions.
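As a small sketch of how a 16 × 16 grayscale map becomes a 256-dimensional input vector: the image below is synthetic, and rescaling intensities to [0, 1] is an assumption of this example rather than something the slides specify.

```python
def flatten_image(image):
    """Flatten a 16 x 16 grid of 0-255 intensities into a 256-dimensional vector.

    Intensities are rescaled to [0, 1]; this rescaling is an illustrative choice.
    """
    if len(image) != 16 or any(len(row) != 16 for row in image):
        raise ValueError("expected a 16 x 16 image")
    # Row-major order: pixel (i, j) lands at index 16 * i + j.
    return [pixel / 255.0 for row in image for pixel in row]

# Synthetic image: a white vertical bar (columns 7 and 8) on a black background.
image = [[255 if col in (7, 8) else 0 for col in range(16)] for row in range(16)]
x = flatten_image(image)
```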

Examples
Image segmentation:
- Goal: segment images into regions of different types, e.g., man-made vs. natural in aerial images, graph and picture vs. text in document images.
- Raw data: grayscale images represented by matrices of size m × n, or color images represented by 3 such matrices.

[Figure] Aerial images. Left: original image of size 512 × 512 with pixel intensity ranging from 0 to 255. Right: hand-labeled classified images. White: man-made. Gray: natural.

- Input data:
  - Divide images into blocks of pixels or form a neighborhood around each pixel.
  - Compute statistics using pixel intensities in each block.
  - An image is converted to an array of input vectors.
- Methodologies:
  - Assume the feature vectors are independent.
  - Employ spatial models to capture dependence among the vectors.
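The block-based feature extraction described above can be sketched as follows. The choice of statistics (mean and variance), the 4 × 4 block size, and the function name `block_features` are assumptions for illustration; the slides do not say which statistics are computed.

```python
from statistics import mean, pvariance

def block_features(image, block=4):
    """Split a grayscale image into non-overlapping block x block tiles and
    return one (mean, variance) feature vector per tile, in row-major order."""
    rows, cols = len(image), len(image[0])
    features = []
    # Tiles that would run past the image border are skipped.
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            pixels = [image[i][j]
                      for i in range(r, r + block)
                      for j in range(c, c + block)]
            features.append((mean(pixels), pvariance(pixels)))
    return features

# Synthetic 8 x 8 image: left half dark (0), right half bright (200).
image = [[0 if j < 4 else 200 for j in range(8)] for i in range(8)]
feats = block_features(image)
```

Classifying each feature vector on its own corresponds to the independence assumption in the list above; a spatial model would additionally use the position of each tile relative to its neighbors.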

Examples
Speech recognition:
- Goal: identify words spoken according to speech signals
  - Automatic voice recognition systems used by airline companies
  - Automatic stock price reporting
- Raw data: voice amplitude sampled at discrete time spots (a time sequence).

- Input data: speech feature vectors computed at the sampling time.
- Methodology:
  - Estimate a Hidden Markov Model (HMM) for each word, e.g., State College, San Francisco, Pittsburgh.
  - For a new word, find the HMM that yields the maximum likelihood.
  - Identify the word as the one associated with that HMM.
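The maximum-likelihood decision rule above can be sketched with the forward algorithm. This is a simplified illustration under stated assumptions: observations are discrete symbols rather than continuous speech feature vectors, the models are given rather than estimated, and every parameter value and name (`word_a`, `word_b`, `recognize`) is invented for the example.

```python
import math

def forward_log_likelihood(obs, init, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm (rescaled at each step)."""
    n_states = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transition matrix, then emit obs[t].
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik

def recognize(obs, word_models):
    """Return the word whose HMM gives the observation sequence the highest likelihood."""
    return max(word_models, key=lambda w: forward_log_likelihood(obs, *word_models[w]))

# Two toy 2-state left-to-right models over symbols {0, 1}.
# Each entry is (initial probabilities, transition matrix, emission matrix).
word_models = {
    "word_a": ([1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "word_b": ([1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}

best = recognize([0, 0, 1, 1], word_models)
```

Working in log-likelihoods with per-step rescaling keeps the comparison stable for long sequences, where the raw likelihood would underflow.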

Examples
DNA Expression Microarray:
- Goal: identify disease or tissue types
- Raw data: for each sample taken from a tissue of a particular disease type, the expression levels of a large collection of genes are measured.
- Input data: cleaned-up gene expression data
  - Normalization
  - Denoising
  - Ample literature on the topic of cleaning microarray data
- Example data set: 4026 genes, 96 samples taken from 9 classes of tissues.
- Challenges:
  - Very high dimensional data
  - Very limited number of samples

Examples
DNA sequence classification:
- Goal: distinguish "junk" segments from coding segments.
- Raw data: sequences of letters, e.g., A, C, G, T for DNA sequences.
- Input data: likelihood ratio statistics computed from stochastic models.
- Supervised learning: estimate stochastic models, select models.
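A likelihood ratio statistic of the kind mentioned above can be sketched with the simplest possible stochastic model, in which bases are independent draws from class-specific frequencies. The slides do not name the models used, so this independence model, the frequency values, and the function name are all assumptions.

```python
import math

def log_likelihood_ratio(segment, coding_freq, junk_freq):
    """Sum over positions of log P(base | coding) - log P(base | junk),
    assuming bases are independent under each model."""
    return sum(math.log(coding_freq[b] / junk_freq[b]) for b in segment)

# Invented base frequencies for the two classes.
coding_freq = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
junk_freq = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

# A positive statistic favors the coding model, a negative one the junk model.
stat = log_likelihood_ratio("GCGCCG", coding_freq, junk_freq)
```

The resulting number can serve as one input feature per segment, with richer models (for example Markov chains over bases) supplying further statistics.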

Supervised Learning
Two types of learning:
- Regression: the response Y is quantitative.
- Classification: the response Y is qualitative, or categorical.
Two aspects in learning:
- Fit the data well.
- Be robust.
Equivalent concepts:
- Training error vs. testing error
- Bias vs. variance
- Fitting vs. overfitting
- Empirical risk vs. model complexity (capacity)
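The gap between training error and testing error can be seen in a small sketch. The 1-nearest-neighbor rule is chosen here only because it fits the training data perfectly by construction; the data points, including the single mislabeled-looking training point, are invented.

```python
def nearest_neighbor_predict(x, train):
    """Predict the label of the training point closest to x (1-nearest neighbor)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def error_rate(data, train):
    """Fraction of (x, label) pairs in data that the 1-NN rule misclassifies."""
    wrong = sum(nearest_neighbor_predict(x, train) != label for x, label in data)
    return wrong / len(data)

# Invented 1-D data: class 0 lies mostly below 5 and class 1 mostly above,
# with one noisy training point (4.0 labeled 1).
train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (6.0, 1), (7.0, 1), (8.0, 1)]
test = [(1.5, 0), (2.4, 0), (3.8, 0), (4.3, 0), (6.5, 1), (7.5, 1)]

train_error = error_rate(train, train)  # each point is its own nearest neighbor
test_error = error_rate(test, train)    # the noisy point misleads nearby test points
```

Here the training error is zero while two of the six test points are misclassified, so a perfect fit to the training data does not by itself indicate a robust rule.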

Learning Spectrum
[Figure]

Regression
Overview:
- Linear models:
  - The mean response is a linear function of the independent variables.
- Generalized linear models
- Expand basis:
  - Splines (polynomials)
  - Reproducing Kernel Hilbert Spaces
  - Wavelet smoothing
- Kernel methods
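As a minimal sketch of a linear model and of basis expansion, the closed-form least-squares fit for a single predictor is shown below. The data are invented, and the squared-input basis is only one illustrative choice of expansion.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Linear model: the mean response is a linear function of x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated from y = 1 + 2x
intercept, slope = fit_simple_linear(xs, ys)

# Basis expansion: replace x by h(x) = x**2 and fit the same linear model.
ys_quadratic = [2.0, 5.0, 14.0, 29.0]   # generated from y = 2 + 3x^2
b0, b1 = fit_simple_linear([x ** 2 for x in xs], ys_quadratic)
```

The second fit is still linear in its parameters even though the fitted curve is quadratic in x, which is the idea behind expanding the basis.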

Classification: A Graphic View
[Figure]

Outline
- Linear regression
- Linear methods for classification
- Prototype methods
- Classification and regression tree (CART)
- Mixture discriminant analysis
- Hidden Markov models and their applications
