
Missing Data Problems in Machine Learning

by

Benjamin M. Marlin

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto.

Copyright © 2008 by Benjamin M. Marlin

Abstract

Missing Data Problems in Machine Learning
Benjamin M. Marlin
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2008

Learning, inference, and prediction in the presence of missing data are pervasive problems in machine learning and statistical data analysis. This thesis focuses on the problems of collaborative prediction with non-random missing data and classification with missing features. We begin by presenting and elaborating on the theory of missing data due to Little and Rubin. We place a particular emphasis on the missing at random assumption in the multivariate setting with arbitrary patterns of missing data. We derive inference and prediction methods in the presence of random missing data for a variety of probabilistic models including finite mixture models, Dirichlet process mixture models, and factor analysis.

Based on this foundation, we develop several novel models and inference procedures for both the collaborative prediction problem and the problem of classification with missing features. We develop models and methods for collaborative prediction with non-random missing data by combining standard models for complete data with models of the missing data process. Using a novel recommender system data set and experimental protocol, we show that each proposed method achieves a substantial increase in rating prediction performance compared to models that assume missing ratings are missing at random.

We describe several strategies for classification with missing features including the use of generative classifiers, and the combination of standard discriminative classifiers with single imputation, multiple imputation, classification in subspaces, and an approach based on modifying the classifier input representation to include response indicators. Results on real and synthetic data sets show that in some cases performance gains over baseline methods can be achieved by methods that do not learn a detailed model of the feature space.

Acknowledgements

I've been privileged to enjoy the support and encouragement of many people during the course of this work. I'll start by thanking my thesis supervisor, Rich Zemel. I've learned a great deal of machine learning from Rich, and have benefitted from his skill and intuition at modelling difficult problems. I'd also like to thank Sam Roweis, who essentially co-supervised much of my PhD research. His enthusiasm for machine learning is insatiable, and his support of this work has been greatly appreciated.

I have benefitted from the advice of a terrific PhD committee including Geoff Hinton and Brendan Frey, as well as Rich and Sam. Rich, Sam, Geoff, and Brendan were all instrumental in helping me to pare down a long list of interesting problems to arrive at the present contents of this thesis. I've appreciated their helpful comments and thoughtful questions throughout the research and thesis writing process. I would like to extend a special thanks to my external examiner, Zoubin Ghahramani, for his thorough reading of this thesis. His detailed comments, questions, and suggestions have helped to significantly improve this thesis.

During the course of this work I have also been very fortunate to collaborate with Malcolm Slaney at Yahoo! Research. I'm very grateful to Malcolm for championing our projects within Yahoo!, and to many other people at Yahoo! who were involved in our work including Sandra Barnat, Todd Beaupre, Josh Deinsen, Eric Gottschalk, Matt Fukuda, Kristen Jower-Ho, Brian McGuiness, Mike Mull, Peter Shafton, Zack Steinkamp, and David Tseng. I would like to thank Dennis DeCoste, who co-supervised me at Yahoo! for a short time, for his continuing interest in this work. Malcolm also helped to coordinate the release of the Yahoo! data set used in this thesis. Malcolm, Rich, and I would like to extend our thanks to Ron Brachman, David Pennock, John Langford, and Lauren McDonnell at Yahoo!, as well as Fred Zhu from the University's Office of Intellectual Property, for their efforts in approving the data release and putting together a data use agreement.

I would like to acknowledge the generous funding of this work provided by the University of Toronto Fellowships program, the Ontario Graduate Scholarships program, and the Natural Sciences and Engineering Research Council Canada Graduate Scholarships program. This work wouldn't have been possible without the support of these programs.

On the personal side, I'd like to thank all my lab mates and friends at the University for good company and interesting discussions over the years including Matt Beal, Miguel Carreira-Perpinan, Stephen Fung, Inmar Givoni, Jenn Listgarten, Ted Meeds, Roland Memisevic, Andriy Mnih, Quaid Morris, Rama Natarajan, David Ross, Horst Samulowitz, Rus Salakhutdinov, Nati Srebro, Liam Stewart, Danny Tarlow, and Max Welling. I'm very grateful to Bruce and Maura Rowat for providing me with a home away from home during my final semester of courses in Toronto. I'm also grateful to Horst Samulowitz, Nati Srebro and Eli Thomas, Sam Roweis, and Ted Meeds for the use of spare rooms/floor space on numerous visits to the University.

I'd like to thank my Mom for never giving up on trying to understand exactly what this thesis is all about, and my Dad for teaching me that you can fix anything with hard work and the right tools. I'd like to thank the whole family for providing a great deal of support, and for their enthusiasm at the prospect of me finishing the 22nd grade. Finally, I'm incredibly grateful to my wife Krisztina for reminding me to eat and sleep when things were on a roll, for love and encouragement when things weren't going well, for always being ready to drop everything and get away from it all when I needed a break, and for understanding all the late nights and weekends that went into finishing this thesis.

Contents

1 Introduction
  1.1 Outline and Contributions
  1.2 Notation
    1.2.1 Notation for Missing Data
    1.2.2 Notation and Conventions for Vector and Matrix Calculus

2 Decision Theory, Inference, and Learning
  2.1 Optimal Prediction and Minimizing Expected Loss
  2.2 The Bayesian Framework
    2.2.1 Bayesian Approximation to the Prediction Function
    2.2.2 Bayesian Computation
    2.2.3 Practical Considerations
  2.3 The Maximum a Posteriori Framework
    2.3.1 MAP Approximation to The Prediction Function
    2.3.2 MAP Computation
  2.4 The Direct Function Approximation Framework
    2.4.1 Function Approximation as Optimization
    2.4.2 Function Approximation and Regularization
  2.5 Empirical Evaluation Procedures
    2.5.1 Training Loss
    2.5.2 Validation Loss
    2.5.3 Cross Validation Loss

3 A Theory of Missing Data
  3.1 Categories of Missing Data
  3.2 The Missing at Random Assumption and Multivariate Data
  3.3 Impact of Incomplete Data on Inference
  3.4 Missing Data, Inference, and Model Misspecification

4 Unsupervised Learning With Random Missing Data
  4.1 Finite Mixture Models
    4.1.1 Maximum A Posteriori Estimation
    4.1.2 Predictive Distribution
  4.2 Dirichlet Process Mixture Models
    4.2.1 Properties of The Dirichlet Process
    4.2.2 Bayesian Inference and the Conjugate Gibbs Sampler
    4.2.3 Bayesian Inference and the Collapsed Gibbs Sampler
    4.2.4 Predictive Distribution and the Conjugate Gibbs Sampler
    4.2.5 Predictive Distribution and the Collapsed Gibbs Sampler
  4.3 Factor Analysis and Probabilistic Principal Components Analysis
    4.3.1 Joint, Conditional, and Marginal Distributions
    4.3.2 Maximum Likelihood Estimation
    4.3.3 Predictive Distribution
  4.4 Mixtures of Factor Analyzers
    4.4.1 Joint, Conditional, and Marginal Distributions
    4.4.2 Maximum Likelihood Estimation
    4.4.3 Predictive Distribution

5 Unsupervised Learning with Non-Random Missing Data
  5.1 The Yahoo! Music Data Set
    5.1.1 User Survey
    5.1.2 Rating Data Analysis
    5.1.3 Experimental Protocols for Rating Prediction
  5.2 The Jester Data Set
    5.2.1 Experimental Protocols for Rating Prediction
  5.3 Test Items and Additional Notation for Missing Data
  5.4 The Finite Mixture/CPT-v Model
    5.4.1 Conditional Identifiability
    5.4.2 Maximum A Posteriori Estimation
    5.4.3 Rating Prediction
    5.4.4 Experimentation and Results
  5.5 The Dirichlet Process Mixture/CPT-v Model
    5.5.1 An Auxiliary Variable Gibbs Sampler
    5.5.2 Rating Prediction for Training Cases
    5.5.3 Rating Prediction for Novel Cases
    5.5.4 Experimentation and Results
  5.6 The Finite Mixture/Logit-vd Model
    5.6.1 Maximum A Posteriori Estimation
    5.6.2 Rating Prediction
    5.6.3 Experimentation and Results
  5.7 Restricted Boltzmann Machines
    5.7.1 Restricted Boltzmann Machines and Complete Data
    5.7.2 Conditional Restricted Boltzmann Machines and Missing Data
    5.7.3 Conditional Restricted Boltzmann Machines and Non User-Selected Items
    5.7.4 Experimentation and Results
  5.8 Comparison of Results and Discussion

6 Classification With Missing Data
  6.1 Frameworks for Classification With Missing Features
    6.1.1 Generative Classifiers
    6.1.2 Case Deletion
    6.1.3 Classification and Imputation
    6.1.4 Classification in Sub-spaces: Reduced Models
    6.1.5 A Framework for Classification with Response Indicators
  6.2 Linear Discriminant Analysis
    6.2.1 Fisher's Linear Discriminant Analysis
    6.2.2 Linear Discriminant Analysis as Maximum Probability Classification
    6.2.3 Quadratic Discriminant Analysis
    6.2.4 Regularized Discriminant Analysis
    6.2.5 LDA and Missing Data
    6.2.6 Discriminatively Trained LDA and Missing Data
    6.2.7 Synthetic Data Experiments and Results
  6.3 Logistic Regression
    6.3.1 The Logistic Regression Model
    6.3.2 Maximum Likelihood Estimation for Logistic Regression
    6.3.3 Regularization for Logistic Regression
    6.3.4 Logistic Regression and Missing Data
    6.3.5 An Equivalence Between Missing Data Strategies for Linear Classification
    6.3.6 Synthetic Data Experiments and Results
  6.4 Perceptrons and Support Vector Machines
    6.4.1 Perceptrons
    6.4.2 Hard Margin Support Vector Machines
    6.4.3 Soft Margin Support Vector Machines
    6.4.4 Soft Margin Support Vector Machine via Loss Penalty
  6.5 Basis Expansion and Kernel Methods
    6.5.1 Basis Expansion
    6.5.2 Kernel Methods
    6.5.3 Kernel Support Vector Machines and Kernel Logistic Regression
    6.5.4 Kernels For Missing Data Classification
    6.5.5 Synthetic Data Experiments and Results
  6.6 Neural Networks
    6.6.1 Feed-Forward Neural Network Architecture
    6.6.2 One Hidden Layer Neural Networks for Classification
    6.6.3 Special Cases of Feed-Forward Neural Networks
    6.6.4 Regularization in Neural Networks
    6.6.5 Neural Network Classification and Missing Data
    6.6.6 Synthetic Data Experiments and Results
  6.7 Real Data Experiments and Results
    6.7.1 Hepatitis Data Set
    6.7.2 Thyroid - AllHypo Data Set
    6.7.3 Thyroid - Sick Data Set
    6.7.4 MNIST Data Set

7 Conclusions
  7.1 Unsupervised Learning with Non-Random Missing Data
  7.2 Classification with Missing Features

Bibliography

Chapter 1

Introduction

Missing data occur in a wide array of application domains for a variety of reasons. A sensor in a remote sensor network may be damaged and cease to transmit data. Certain regions of a gene microarray may fail to yield measurements of the underlying gene expressions due to scratches, finger prints, dust, or manufacturing defects. Participants in a clinical study may drop out during the course of the study leading to missing observations at subsequent time points. A doctor may not order all applicable tests while diagnosing a patient. Users of a recommender system rate an extremely small fraction of the available books, movies, or songs, leading to massive amounts of missing data.

Abstractly, we may consider a random process underlying the generation of incomplete data sets. This generative process can be decomposed into a complete data process that generates complete data sets, and a missing data process that determines which elements of the complete data set will be missing. In the examples given above, the hypothetical complete data set would include measurements from every sensor in a remote sensor network, the result of every medical test relevant to a particular medical condition for every patient, and the rating of every user for every item in a recommender system. The missing data process is sometimes referred to as the missing data mechanism, the observation process, or the selection process. We might imagine that a remote sensor is less likely to transmit data if its operational temperature range is exceeded, that a doctor is less likely to order a test that is invasive, and that a user of a recommender system is less likely to rate a given item if the user does not like that item.

The analysis of missing data processes leads to a theory of missing data in terms of its impact on learning, inference, and prediction. This theory draws a distinction between two fundamental categories of missing data: data that is missing at random and data that is not missing at random. When data is missing at random, the missing data process can be ignored and inference can be based on the observed data only. The resulting computations are tractable in many common generative models. When data is not missing at random, ignoring the missing data process leads to a systematic bias in standard algorithms for unsupervised learning, inference, and prediction. An intuitive example of a process that violates the missing at random assumption is one where the probability of observing the value of a particular feature depends on the value of that feature. All forms of missing data are problematic in the classification setting since standard discriminative classifiers do not include a model of the feature space. As a result, most discriminative classifiers have no natural ability to deal with missing data.
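The decomposition into a complete data process and a missing data process can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the thesis: all quantities (the Gaussian complete data, the observation probabilities) are invented for the example. It applies two hypothetical missing data mechanisms to the same complete data, one that ignores the data values and one where the probability of observing an entry depends on that entry's value, and shows that averaging only the observed entries is biased under the second mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Complete data process: N cases, D features (hypothetical Gaussian data).
N, D = 10_000, 5
X = rng.normal(loc=0.0, scale=1.0, size=(N, D))

# Missing data process 1: observe each entry with a fixed probability,
# independent of the data values.
R_mcar = rng.random((N, D)) < 0.5

# Missing data process 2: the probability of observing an entry depends on
# its value (larger values are more likely to be observed), which violates
# the missing at random assumption. The sigmoid form is an arbitrary choice.
obs_prob = 1.0 / (1.0 + np.exp(-2.0 * X))
R_nmar = rng.random((N, D)) < obs_prob

print("complete-data mean:   %.3f" % X.mean())
print("observed mean (MCAR): %.3f" % X[R_mcar].mean())  # close to the true mean
print("observed mean (NMAR): %.3f" % X[R_nmar].mean())  # systematically biased
```

Under the first mechanism the observed-data mean matches the complete-data mean up to sampling noise; under the second it does not, which is exactly the kind of systematic bias described above.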

1.1 Outline and Contributions

The focus of this thesis is the development of models and algorithms for learning, inference, and prediction in the presence of missing data. The two main problems we study are collaborative prediction with non-random missing data, and classification with missing features. We begin Chapter 2 with a discussion of decision theory as a framework for understanding different learning and inference paradigms including Bayesian inference, maximum a posteriori estimation, maximum likelihood estimation, and regularized function approximation. We review particular algorithms and principles including the Metropolis-Hastings algorithm, the Gibbs sampler, and the Expectation Maximization algorithm. We also discuss procedures for estimating the performance of prediction methods.

Chapter 3 introduces the theory of missing data due to Little and Rubin. We present formal definitions of the three main classes of missing data. We present a detailed investigation of the missing at random assumption in the multivariate case with arbitrary patterns of missing data. We argue that the missing at random assumption is best understood in terms of a set of symmetries imposed on the missing data process. We review the impact of random and non-random missing data on probabilistic inference. We present a study of the effect of data model misspecification on inference in the presence of random missing data. We demonstrate that using an incorrect data model can lead to biased inference and learning even when data is missing at random in the underlying generative process.

Chapter 4 introduces unsupervised learning models in the random missing data setting including finite multinomial mixtures, Dirichlet Process multinomial mixtures, factor analysis, and probabilistic principal component analysis. We present maximum a posteriori learning in finite mixture models with missing data. We derive conjugate and collapsed Gibbs samplers for the Dirichlet Process multinomial mixture model with missing data. We derive complete expectation maximization algorithms for factor analysis, probabilistic principal components analysis, mixtures of factor analyzers, and mixtures of probabilistic principal components analyzers with missing data.
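To give a flavour of how random missing data is handled in the mixture models of Chapter 4, the following sketch evaluates the likelihood of a partially observed data case under a finite multinomial mixture. It is a simplified illustration rather than the thesis implementation, and the function and variable names are invented here. Under the missing at random assumption the missing dimensions are marginalized out analytically, so each case contributes a product over its observed dimensions only.

```python
import numpy as np

def mixture_loglik_observed(x, r, log_pi, log_theta):
    """Log-likelihood of one data case under a finite multinomial mixture,
    using only the observed dimensions (the missing entries marginalize
    out under the missing at random assumption).

    x         : (D,) int array of category values in {0, ..., V-1};
                entries with r[d] == 0 are ignored.
    r         : (D,) response indicators, 1 = observed, 0 = missing.
    log_pi    : (K,) log mixing proportions.
    log_theta : (K, D, V) log category probabilities per component/dimension.
    """
    obs = np.flatnonzero(r)                                  # observed dimensions
    # log p(x_obs | z = k) = sum over observed d of log theta[k, d, x_d]
    log_px_given_z = log_theta[:, obs, x[obs]].sum(axis=1)   # shape (K,)
    # log p(x_obs) = logsumexp_k (log pi_k + log p(x_obs | z = k))
    a = log_pi + log_px_given_z
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Tiny hypothetical example: K=2 components, D=4 dimensions, V=3 categories.
rng = np.random.default_rng(1)
log_pi = np.log(np.array([0.6, 0.4]))
theta = rng.dirichlet(np.ones(3), size=(2, 4))               # (K, D, V)
x = np.array([0, 2, 1, 0])
r = np.array([1, 0, 1, 1])                                   # dimension 2 is missing
print(mixture_loglik_observed(x, r, log_pi, np.log(theta)))
```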

Chapter 5 focuses on the problem of unsupervised learning for collaborative prediction when missing data may violate the missing at random assumption. Collaborative prediction problems like rating prediction in recommender systems are typically solved using unsupervised learning methods. As discussed in Chapter 3, the results of learning and prediction will be biased if the missing at random assumption is violated. We discuss compelling new evidence in the form of a novel user study and the analysis of a new collaborative filtering data set which strongly suggests that the missing at random assumption does not hold in the recommender system domain.

We present four novel models for unsupervised learning with non-random missing data that build on the models and inference procedures for random missing data presented in Chapter 4. These models include the combination of the finite multinomial mixture model and the Dirichlet Process multinomial mixture model with a simple missing data mechanism where the probability that a rating is missing depends only on the value of that rating. We refer to this mechanism as CPT-v since it is parameterized using a simple conditional probability table. We prove that the parameters of the CPT-v missing data mechanism are conditionally identifiable even though the mixture data models are not identifiable. We also combine the finite multinomial mixture model with a more flexible missing data model that we refer to as Logit-vd. The Logit-vd model allows for response probabilities that differ depending on both the underlying rating value, and the identity of the item. The name Logit-vd derives from the fact that the missing data mechanism is represented using an additive logistic model. We review modified contrastive divergence learning for restricted Boltzmann machines with missing data, and offer a new derivation of these learning methods as standard contrastive divergence in an alternative model. The final model we consider is a conditional Restricted Boltzmann Machine that includes energy terms that can account for non-random missing data effects similar to the CPT-v model.

We show that traditional experimental protocols and testing procedures for collaborative prediction implicitly assume missing ratings are missing at random. We show that these procedures fail to detect the effects of non-random missing ratings. To correct this problem we introduce novel experimental protocols and testing procedures specifically designed for collaborative prediction with non-random missing data. Our empirical results show that rating prediction methods based on models that incorporate an explicit non-random missing data mechanism achieve 25% to 40% lower error rates than methods that assume the missing at random assumption holds. To put these results in perspective, the best models studied in our previous work on collaborative filtering achieve approximately 15% lower error rates relative to the simplest models we considered [52, p. 107-108]. We also compare the methods studied in terms of ranking performance, and again show that methods that model the missing data mechanism achieve better ranking performance than methods that treat missing data as if it is missing at random.
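The CPT-v mechanism mentioned above is compact enough to illustrate directly: the probability that a rating is observed is a single table indexed by the rating value. The sketch below is a hypothetical illustration, not code or parameter values from the thesis; the complete-data model is a trivial i.i.d. rating distribution rather than the mixture models actually used in Chapter 5, and the table entries are invented. It shows how value-dependent observation probabilities shift the distribution of observed ratings away from the distribution of the underlying complete ratings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Complete data process (illustrative): i.i.d. ratings on a 1-5 scale.
p_rating = np.array([0.10, 0.15, 0.25, 0.30, 0.20])     # P(rating = v)
ratings = rng.choice(np.arange(1, 6), size=200_000, p=p_rating)

# CPT-v style mechanism: P(observed | rating = v) is a conditional
# probability table over the rating value (entries are made up here).
mu = np.array([0.05, 0.10, 0.20, 0.40, 0.60])            # indexed by v - 1
observed = rng.random(ratings.shape) < mu[ratings - 1]

def distribution(x):
    # Empirical distribution over rating values 1..5.
    return np.bincount(x, minlength=6)[1:] / len(x)

print("complete rating distribution:", np.round(distribution(ratings), 3))
print("observed rating distribution:", np.round(distribution(ratings[observed]), 3))
print("mean of all ratings:      %.3f" % ratings.mean())
print("mean of observed ratings: %.3f" % ratings[observed].mean())
```

Because high ratings are more likely to be observed, the observed ratings over-represent high values, which is why models that ignore the mechanism produce biased predictions.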

In Chapter 6 we consider the problem of classification with missing features. We begin with a discussion of general strategies for dealing with missing data in the classification setting. We consider the application of generative classifiers where missing data can be analytically integrated out of the model. We derive a variation of Fisher's linear discriminant analysis for missing data that uses a factor analysis model for the covariance matrix. We then derive a novel discriminative learning procedure for the classifier based on maximizing the conditional probability of the labels given the observed data.

We study the application of logistic regression, multi-layer neural networks, and kernel classifiers in conjunction with several frameworks for converting a discriminative classifier into a classifier for incomplete data cases. We consider the use of various imputation methods including multiple imputation. For data sets with a limited number of patterns of missing data, we consider a reduced model approach that learns a separate classifier for each pattern of missing data. Finally, we consider an approach based on modifying the input representation of a discriminative classifier in such a way that the classification function depends only on the observed feature values, and which features are observed. Results on real and synthetic data sets show that in some cases performance gains over baseline methods can be achieved without learning detailed models of the input space.

1.2 Notation

We use capital letters to denote random variables, and lowercase letters to denote instantiations of random variables. We use a bold typeface to indicate vector and matrix quantities, and a plain typeface to indicate scalar quantities.

When describing data sets we denote the total number of feature dimensions by $D$, and the total number of data cases by $N$. We denote the feature vector for data case $n$ by $\mathbf{x}_n$, and individual feature values by $x_{dn}$. In the classification setting we denote the total number of classes by $C$. We denote the class variable for data case $n$ by $y_n$, and assume it takes the values $\{-1, 1\}$ in the binary case, and $\{1, \dots, C\}$ in the multi-class case.

We use square bracket notation $[s]$ to represent an indicator function that takes the value 1 if the statement $s$ is true, and 0 if the statement $s$ is false. For example, $[x_{dn} = v]$ would take the value 1 if $x_{dn}$ is equal to $v$, and 0 otherwise.

1.2.1 Notation for Missing Data

Following the standard representation for missing data due to Little and Rubin [49], we introduce a companion vector of response indicators for data case $n$ denoted $\mathbf{r}_n$. $r_{dn}$ is 1 if $x_{dn}$ is observed, and $r_{dn}$ is 0 if $x_{dn}$ is not observed. We denote the number of observed data dimensions in data case $n$ by $D_n$. In addition to the response indicator vector, we introduce a vector $\mathbf{o}_n$ of length $D_n$ listing the dimensions of $\mathbf{x}_n$ that are observed. We define $o_{in} = d$ if $\sum_{j=1}^{d} r_{jn} = i$ and $r_{dn} = 1$. In other words, $o_{in} = d$ if $d$ is the $i$th observed dimension of $\mathbf{x}_n$. We introduce a corresponding vector $\mathbf{m}_n$ of length $D - D_n$ listing the dimensions of $\mathbf{x}_n$ that are missing. We define $m_{in} = d$ if $\sum_{j=1}^{d} (1 - r_{jn}) = i$ and $(1 - r_{dn}) = 1$. In other words, $m_{in} = d$ if $d$ is the $i$th missing dimension of $\mathbf{x}_n$.

We use superscripts to denote sub-vectors and sub-matrices. For example, $\mathbf{x}_n^{o_n}$ denotes the sub-vector of $\mathbf{x}_n$ corresponding to the observed elements of $\mathbf{x}_n$. The element-wise definition of $\mathbf{x}_n^{o_n}$ is $x_{in}^{o_n} = x_{o_{in} n}$. Similarly, if $\Sigma$ is a $D \times D$ matrix then, for example, $\Sigma^{o_n m_n}$ is the sub-matrix of $\Sigma$ obtained by selecting the rows corresponding to the observed dimensions of $\mathbf{x}_n$, and the columns corresponding to the missing dimensions of $\mathbf{x}_n$. The element-wise definition of $\Sigma^{o_n m_n}$ is $\Sigma^{o_n m_n}_{ij} = \Sigma_{o_{in} m_{jn}}$. For simplicity we will often use the notation $\mathbf{x}^{o}$ and $\Sigma^{om}$ in place of $\mathbf{x}_n^{o_n}$ or $\Sigma^{o_n m_n}$ when it is clear which pattern of observed or missing entries is intended.

Projection matrices are another very useful tool for dealing with sub-vectors and sub-matrices induced by missing data. We define the projection matrix $\mathbf{H}^{o_n}$ where $H^{o_n}_{ij} = [o_{jn} = i]$. The matrix $\mathbf{H}^{o_n}$ projects a vector from the $D_n$ dimensional space corresponding to the observed dimensions of $\mathbf{x}_n$ to the full $D$ dimensional feature space. The missing dimensions are filled with zeros. Similarly, we define the projection matrix $\mathbf{H}^{m_n}$ such that $H^{m_n}_{ij} = [m_{jn} = i]$. The matrix $\mathbf{H}^{m_n}$ projects a vector from the $(D - D_n)$ dimensional space corresponding to the missing dimensions of $\mathbf{x}_n$ to the full $D$ dimensional feature space. The observed dimensions are filled with zeros. As we will see later, these projection matrices arise naturally when taking matrix and vector derivatives of the form $\partial \Sigma^{o_n m_n} / \partial \Sigma$.
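This notation maps directly onto array operations. The sketch below is an illustration of the notation rather than code from the thesis: it builds the response indicators $\mathbf{r}_n$, the index vectors $\mathbf{o}_n$ and $\mathbf{m}_n$ (0-based here, 1-based in the text), the sub-vector $\mathbf{x}_n^{o_n}$ and sub-matrix $\Sigma^{o_n m_n}$, and the projection matrices $\mathbf{H}^{o_n}$ and $\mathbf{H}^{m_n}$, then checks the identity, which follows from the definitions, that the sub-matrix can be obtained by sandwiching $\Sigma$ between the projection matrices.

```python
import numpy as np

D = 5
x = np.array([3.1, np.nan, 0.7, np.nan, 5.2])   # NaN marks a missing entry
r = (~np.isnan(x)).astype(int)                  # response indicators r_n

o = np.flatnonzero(r)        # observed dimensions o_n (0-based indices)
m = np.flatnonzero(1 - r)    # missing dimensions m_n
D_n = len(o)

x_obs = x[o]                 # sub-vector x^{o_n}

Sigma = np.arange(1.0, 26.0).reshape(D, D)      # an arbitrary D x D matrix
Sigma_om = Sigma[np.ix_(o, m)]                  # sub-matrix Sigma^{o_n m_n}

# Projection matrices: H^{o_n} is D x D_n with H[i, j] = [o_j == i], so it
# maps a length-D_n vector back into the full D-dimensional space, filling
# the missing dimensions with zeros (and similarly for H^{m_n}).
H_o = np.zeros((D, D_n));     H_o[o, np.arange(D_n)] = 1.0
H_m = np.zeros((D, D - D_n)); H_m[m, np.arange(D - D_n)] = 1.0

print(H_o @ x_obs)                                 # x_n with missing entries zeroed
print(np.allclose(Sigma_om, H_o.T @ Sigma @ H_m))  # True: recovers Sigma^{o_n m_n}
```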

1.2.2 Notation and Conventions for Vector and Matrix Calculus

Throughout this work we will be deriving optimization algorithms that require the closed-form or iterative solution of a set of gradient equations. The gradient equations are derived using matrix calculus. In this section we review the matrix calculus conventions used in this work. First, we assume that all vectors are column vectors unless explicitly stated otherwise. We will follow the convention that the gradient of a scalar function $f$ with respect to a matrix-valued function $\mathbf{g}$ of dimension $A \times B$ is a matrix of size $A \times B$, as seen in Equation 1.2.1. We adopt this convention since it avoids the need to transpose the matrix of partial derivatives when solving gradient equations and performing iterative gradient updates.

$$
\frac{\partial f}{\partial \mathbf{g}} =
\begin{bmatrix}
\frac{\partial f}{\partial g_{11}} & \frac{\partial f}{\partial g_{12}} & \cdots & \frac{\partial f}{\partial g_{1B}} \\
\frac{\partial f}{\partial g_{21}} & \frac{\partial f}{\partial g_{22}} & \cdots & \frac{\partial f}{\partial g_{2B}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial g_{A1}} & \frac{\partial f}{\partial g_{A2}} & \cdots & \frac{\partial f}{\partial g_{AB}}
\end{bmatrix}
\qquad (1.2.1)
$$
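The shape convention in Equation 1.2.1 can be checked numerically on a simple example. The sketch below is illustrative only and not from the thesis: for the scalar function $f(\mathbf{G}) = \mathbf{a}^\top \mathbf{G} \mathbf{b}$, the gradient under this convention is the $A \times B$ matrix $\mathbf{a}\mathbf{b}^\top$, which the code compares against a finite-difference approximation computed entry by entry.

```python
import numpy as np

rng = np.random.default_rng(3)
A, B = 4, 3
a, b = rng.normal(size=A), rng.normal(size=B)
G = rng.normal(size=(A, B))

f = lambda G: a @ G @ b                 # scalar function of an A x B matrix

grad_analytic = np.outer(a, b)          # df/dG = a b^T, an A x B matrix

# Finite-difference check, following the convention that the gradient has
# the same A x B shape as G itself.
eps = 1e-6
grad_numeric = np.zeros((A, B))
for i in range(A):
    for j in range(B):
        E = np.zeros((A, B)); E[i, j] = eps
        grad_numeric[i, j] = (f(G + E) - f(G - E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))   # True
```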

