
Missing Data Problems in Machine Learning

by

Benjamin M. Marlin

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto.

Copyright © 2008 by Benjamin M. Marlin

Abstract

Missing Data Problems in Machine Learning
Benjamin M. Marlin
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2008

Learning, inference, and prediction in the presence of missing data are pervasive problems in machine learning and statistical data analysis. This thesis focuses on the problems of collaborative prediction with non-random missing data and classification with missing features. We begin by presenting and elaborating on the theory of missing data due to Little and Rubin. We place a particular emphasis on the missing at random assumption in the multivariate setting with arbitrary patterns of missing data. We derive inference and prediction methods in the presence of random missing data for a variety of probabilistic models including finite mixture models, Dirichlet process mixture models, and factor analysis.

Based on this foundation, we develop several novel models and inference procedures for both the collaborative prediction problem and the problem of classification with missing features. We develop models and methods for collaborative prediction with non-random missing data by combining standard models for complete data with models of the missing data process. Using a novel recommender system data set and experimental protocol, we show that each proposed method achieves a substantial increase in rating prediction performance compared to models that assume missing ratings are missing at random.

We describe several strategies for classification with missing features including the use of generative classifiers, and the combination of standard discriminative classifiers with single imputation, multiple imputation, classification in subspaces, and an approach based on modifying the classifier input representation to include response indicators. Results on real and synthetic data sets show that in some cases performance gains over baseline methods can be achieved by methods that do not learn a detailed model of the feature space.

Acknowledgements

I've been privileged to enjoy the support and encouragement of many people during the course of this work. I'll start by thanking my thesis supervisor, Rich Zemel. I've learned a great deal of machine learning from Rich, and have benefitted from his skill and intuition at modelling difficult problems. I'd also like to thank Sam Roweis, who essentially co-supervised much of my PhD research. His enthusiasm for machine learning is insatiable, and his support of this work has been greatly appreciated.

I have benefitted from the advice of a terrific PhD committee including Geoff Hinton and Brendan Frey, as well as Rich and Sam. Rich, Sam, Geoff, and Brendan were all instrumental in helping me to pare down a long list of interesting problems to arrive at the present contents of this thesis. I've appreciated their helpful comments and thoughtful questions throughout the research and thesis writing process. I would like to extend a special thanks to my external examiner, Zoubin Ghahramani, for his thorough reading of this thesis. His detailed comments, questions, and suggestions have helped to significantly improve this thesis.

During the course of this work I have also been very fortunate to collaborate with Malcolm Slaney at Yahoo! Research. I'm very grateful to Malcolm for championing our projects within Yahoo!, and to many other people at Yahoo! who were involved in our work including Sandra Barnat, Todd Beaupre, Josh Deinsen, Eric Gottschalk, Matt Fukuda, Kristen Jower-Ho, Brian McGuiness, Mike Mull, Peter Shafton, Zack Steinkamp, and David Tseng. I would like to thank Dennis DeCoste, who co-supervised me at Yahoo! for a short time, for his continuing interest in this work. Malcolm also helped to coordinate the release of the Yahoo! data set used in this thesis. Malcolm, Rich, and I would like to extend our thanks to Ron Brachman, David Pennock, John Langford, and Lauren McDonnell at Yahoo!, as well as Fred Zhu from the University's Office of Intellectual Property, for their efforts in approving the data release and putting together a data use agreement.

I would like to acknowledge the generous funding of this work provided by the University of Toronto Fellowships program, the Ontario Graduate Scholarships program, and the Natural Sciences and Engineering Research Council Canada Graduate Scholarships program. This work wouldn't have been possible without the support of these programs.

On the personal side, I'd like to thank all my lab mates and friends at the University for good company and interesting discussions over the years including Matt Beal, Miguel Carreira-Perpinan, Stephen Fung, Inmar Givoni, Jenn Listgarten, Ted Meeds, Roland Memisevic, Andriy Mnih, Quaid Morris, Rama Natarajan, David Ross, Horst Samulowitz, Rus Salakhutdinov, Nati Srebro, Liam Stewart, Danny Tarlow, and Max Welling. I'm very grateful to Bruce and Maura Rowat for providing me with a home away from home during my final semester of courses in Toronto. I'm also grateful to Horst Samulowitz, Nati Srebro and Eli Thomas, Sam Roweis, and Ted Meeds for the use of spare rooms/floor space on numerous visits to the University.

I'd like to thank my Mom for never giving up on trying to understand exactly what this thesis is all about, and my Dad for teaching me that you can fix anything with hard work and the right tools. I'd like to thank the whole family for providing a great deal of support, and for their enthusiasm at the prospect of me finishing the 22nd grade. Finally, I'm incredibly grateful to my wife Krisztina for reminding me to eat and sleep when things were on a roll, for love and encouragement when things weren't going well, for always being ready to drop everything and get away from it all when I needed a break, and for understanding all the late nights and weekends that went into finishing this thesis.

Contents

1 Introduction
  1.1 Outline and Contributions
  1.2 Notation
    1.2.1 Notation for Missing Data
    1.2.2 Notation and Conventions for Vector and Matrix Calculus

2 Decision Theory, Inference, and Learning
  2.1 Optimal Prediction and Minimizing Expected Loss
  2.2 The Bayesian Framework
    2.2.1 Bayesian Approximation to the Prediction Function
    2.2.2 Bayesian Computation
    2.2.3 Practical Considerations
  2.3 The Maximum a Posteriori Framework
    2.3.1 MAP Approximation to The Prediction Function
    2.3.2 MAP Computation
  2.4 The Direct Function Approximation Framework
    2.4.1 Function Approximation as Optimization
    2.4.2 Function Approximation and Regularization
  2.5 Empirical Evaluation Procedures
    2.5.1 Training Loss
    2.5.2 Validation Loss
    2.5.3 Cross Validation Loss

3 A Theory of Missing Data
  3.1 Categories of Missing Data
  3.2 The Missing at Random Assumption and Multivariate Data
  3.3 Impact of Incomplete Data on Inference
  3.4 Missing Data, Inference, and Model Misspecification

4 Unsupervised Learning With Random Missing Data
  4.1 Finite Mixture Models
    4.1.1 Maximum A Posteriori Estimation
    4.1.2 Predictive Distribution
  4.2 Dirichlet Process Mixture Models
    4.2.1 Properties of The Dirichlet Process
    4.2.2 Bayesian Inference and the Conjugate Gibbs Sampler
    4.2.3 Bayesian Inference and the Collapsed Gibbs Sampler
    4.2.4 Predictive Distribution and the Conjugate Gibbs Sampler
    4.2.5 Predictive Distribution and the Collapsed Gibbs Sampler
  4.3 Factor Analysis and Probabilistic Principal Components Analysis
    4.3.1 Joint, Conditional, and Marginal Distributions
    4.3.2 Maximum Likelihood Estimation
    4.3.3 Predictive Distribution
  4.4 Mixtures of Factor Analyzers
    4.4.1 Joint, Conditional, and Marginal Distributions
    4.4.2 Maximum Likelihood Estimation
    4.4.3 Predictive Distribution

5 Unsupervised Learning with Non-Random Missing Data
  5.1 The Yahoo! Music Data Set
    5.1.1 User Survey
    5.1.2 Rating Data Analysis
    5.1.3 Experimental Protocols for Rating Prediction
  5.2 The Jester Data Set
    5.2.1 Experimental Protocols for Rating Prediction
  5.3 Test Items and Additional Notation for Missing Data
  5.4 The Finite Mixture/CPT-v Model
    5.4.1 Conditional Identifiability
    5.4.2 Maximum A Posteriori Estimation
    5.4.3 Rating Prediction
    5.4.4 Experimentation and Results
  5.5 The Dirichlet Process Mixture/CPT-v Model
    5.5.1 An Auxiliary Variable Gibbs Sampler
    5.5.2 Rating Prediction for Training Cases
    5.5.3 Rating Prediction for Novel Cases
    5.5.4 Experimentation and Results
  5.6 The Finite Mixture/Logit-vd Model
    5.6.1 Maximum A Posteriori Estimation
    5.6.2 Rating Prediction
    5.6.3 Experimentation and Results
  5.7 Restricted Boltzmann Machines
    5.7.1 Restricted Boltzmann Machines and Complete Data
    5.7.2 Conditional Restricted Boltzmann Machines and Missing Data
    5.7.3 Conditional Restricted Boltzmann Machines and Non User-Selected Items
    5.7.4 Experimentation and Results
  5.8 Comparison of Results and Discussion

6 Classification With Missing Data
  6.1 Frameworks for Classification With Missing Features
    6.1.1 Generative Classifiers
    6.1.2 Case Deletion
    6.1.3 Classification and Imputation
    6.1.4 Classification in Sub-spaces: Reduced Models
    6.1.5 A Framework for Classification with Response Indicators
  6.2 Linear Discriminant Analysis
    6.2.1 Fisher's Linear Discriminant Analysis
    6.2.2 Linear Discriminant Analysis as Maximum Probability Classification
    6.2.3 Quadratic Discriminant Analysis
    6.2.4 Regularized Discriminant Analysis
    6.2.5 LDA and Missing Data
    6.2.6 Discriminatively Trained LDA and Missing Data
    6.2.7 Synthetic Data Experiments and Results
  6.3 Logistic Regression
    6.3.1 The Logistic Regression Model
    6.3.2 Maximum Likelihood Estimation for Logistic Regression
    6.3.3 Regularization for Logistic Regression
    6.3.4 Logistic Regression and Missing Data
    6.3.5 An Equivalence Between Missing Data Strategies for Linear Classification
    6.3.6 Synthetic Data Experiments and Results
  6.4 Perceptrons and Support Vector Machines
    6.4.1 Perceptrons
    6.4.2 Hard Margin Support Vector Machines
    6.4.3 Soft Margin Support Vector Machines
    6.4.4 Soft Margin Support Vector Machine via Loss Penalty
  6.5 Basis Expansion and Kernel Methods
    6.5.1 Basis Expansion
    6.5.2 Kernel Methods
    6.5.3 Kernel Support Vector Machines and Kernel Logistic Regression
    6.5.4 Kernels For Missing Data Classification
    6.5.5 Synthetic Data Experiments and Results
  6.6 Neural Networks
    6.6.1 Feed-Forward Neural Network Architecture
    6.6.2 One Hidden Layer Neural Networks for Classification
    6.6.3 Special Cases of Feed-Forward Neural Networks
    6.6.4 Regularization in Neural Networks
    6.6.5 Neural Network Classification and Missing Data
    6.6.6 Synthetic Data Experiments and Results
  6.7 Real Data Experiments and Results
    6.7.1 Hepatitis Data Set
    6.7.2 Thyroid - AllHypo Data Set
    6.7.3 Thyroid - Sick Data Set
    6.7.4 MNIST Data Set

7 Conclusions
  7.1 Unsupervised Learning with Non-Random Missing Data
  7.2 Classification with Missing Features

Bibliography

Chapter 1

Introduction

Missing data occur in a wide array of application domains for a variety of reasons. A sensor in a remote sensor network may be damaged and cease to transmit data. Certain regions of a gene microarray may fail to yield measurements of the underlying gene expressions due to scratches, finger prints, dust, or manufacturing defects. Participants in a clinical study may drop out during the course of the study leading to missing observations at subsequent time points. A doctor may not order all applicable tests while diagnosing a patient. Users of a recommender system rate an extremely small fraction of the available books, movies, or songs, leading to massive amounts of missing data.

Abstractly, we may consider a random process underlying the generation of incomplete data sets. This generative process can be decomposed into a complete data process that generates complete data sets, and a missing data process that determines which elements of the complete data set will be missing. In the examples given above, the hypothetical complete data set would include measurements from every sensor in a remote sensor network, the result of every medical test relevant to a particular medical condition for every patient, and the rating of every user for every item in a recommender system. The missing data process is sometimes referred to as the missing data mechanism, the observation process, or the selection process. We might imagine that a remote sensor is less likely to transmit data if its operational temperature range is exceeded, that a doctor is less likely to order a test that is invasive, and that a user of a recommender system is less likely to rate a given item if the user does not like that item.

The analysis of missing data processes leads to a theory of missing data in terms of its impact on learning, inference, and prediction. This theory draws a distinction between two fundamental categories of missing data: data that is missing at random and data that is not missing at random. When data is missing at random, the missing data process can be ignored and inference can be based on the observed data only. The resulting computations are tractable in many common generative models. When data is not missing at random, ignoring the missing data process leads to a systematic bias in standard algorithms for unsupervised learning, inference, and prediction. An intuitive example of a process that violates the missing at random assumption is one where the probability of observing the value of a particular feature depends on the value of that feature. All forms of missing data are problematic in the classification setting since standard discriminative classifiers do not include a model of the feature space. As a result, most discriminative classifiers have no natural ability to deal with missing data.
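The decomposition into a complete data process and a missing data process can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the thesis: all quantities (the Gaussian complete data, the observation probabilities) are invented for the example. It applies two hypothetical missing data mechanisms to the same complete data, one that ignores the data values and one where the probability of observing an entry depends on that entry's value, and shows that averaging only the observed entries is biased under the second mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Complete data process: N cases, D features (hypothetical Gaussian data).
N, D = 10_000, 5
X = rng.normal(loc=0.0, scale=1.0, size=(N, D))

# Missing data process 1: observe each entry with a fixed probability,
# independent of the data values.
R_mcar = rng.random((N, D)) < 0.5

# Missing data process 2: the probability of observing an entry depends on
# its value (larger values are more likely to be observed), which violates
# the missing at random assumption. The sigmoid form is an arbitrary choice.
obs_prob = 1.0 / (1.0 + np.exp(-2.0 * X))
R_nmar = rng.random((N, D)) < obs_prob

print("complete-data mean:   %.3f" % X.mean())
print("observed mean (MCAR): %.3f" % X[R_mcar].mean())  # close to the true mean
print("observed mean (NMAR): %.3f" % X[R_nmar].mean())  # systematically biased
```

Under the first mechanism the observed-data mean matches the complete-data mean up to sampling noise; under the second it does not, which is exactly the kind of systematic bias described above.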

1.1 Outline and Contributions

The focus of this thesis is the development of models and algorithms for learning, inference, and prediction in the presence of missing data. The two main problems we study are collaborative prediction with non-random missing data, and classification with missing features. We begin Chapter 2 with a discussion of decision theory as a framework for understanding different learning and inference paradigms including Bayesian inference, maximum a posteriori estimation, maximum likelihood estimation, and regularized function approximation. We review particular algorithms and principles including the Metropolis-Hastings algorithm, the Gibbs sampler, and the Expectation Maximization algorithm. We also discuss procedures for estimating the performance of prediction methods.

Chapter 3 introduces the theory of missing data due to Little and Rubin. We present formal definitions of the three main classes of missing data. We present a detailed investigation of the missing at random assumption in the multivariate case with arbitrary patterns of missing data. We argue that the missing at random assumption is best understood in terms of a set of symmetries imposed on the missing data process. We review the impact of random and non-random missing data on probabilistic inference. We present a study of the effect of data model misspecification on inference in the presence of random missing data. We demonstrate that using an incorrect data model can lead to biased inference and learning even when data is missing at random in the underlying generative process.

Chapter 4 introduces unsupervised learning models in the random missing data setting including finite multinomial mixtures, Dirichlet Process multinomial mixtures, factor analysis, and probabilistic principal component analysis. We present maximum a posteriori learning in finite mixture models with missing data. We derive conjugate and collapsed Gibbs samplers for the Dirichlet Process multinomial mixture model with missing data. We derive complete expectation maximization algorithms for factor analysis, probabilistic principal components analysis, mixtures of factor analyzers, and mixtures of probabilistic principal components analyzers with missing data.
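To give a flavour of how random missing data is handled in the mixture models of Chapter 4, the following sketch evaluates the likelihood of a partially observed data case under a finite multinomial mixture. It is a simplified illustration rather than the thesis implementation, and the function and variable names are invented here. Under the missing at random assumption the missing dimensions are marginalized out analytically, so each case contributes a product over its observed dimensions only.

```python
import numpy as np

def mixture_loglik_observed(x, r, log_pi, log_theta):
    """Log-likelihood of one data case under a finite multinomial mixture,
    using only the observed dimensions (the missing entries marginalize
    out under the missing at random assumption).

    x         : (D,) int array of category values in {0, ..., V-1};
                entries with r[d] == 0 are ignored.
    r         : (D,) response indicators, 1 = observed, 0 = missing.
    log_pi    : (K,) log mixing proportions.
    log_theta : (K, D, V) log category probabilities per component/dimension.
    """
    obs = np.flatnonzero(r)                                  # observed dimensions
    # log p(x_obs | z = k) = sum over observed d of log theta[k, d, x_d]
    log_px_given_z = log_theta[:, obs, x[obs]].sum(axis=1)   # shape (K,)
    # log p(x_obs) = logsumexp_k (log pi_k + log p(x_obs | z = k))
    a = log_pi + log_px_given_z
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Tiny hypothetical example: K=2 components, D=4 dimensions, V=3 categories.
rng = np.random.default_rng(1)
log_pi = np.log(np.array([0.6, 0.4]))
theta = rng.dirichlet(np.ones(3), size=(2, 4))               # (K, D, V)
x = np.array([0, 2, 1, 0])
r = np.array([1, 0, 1, 1])                                   # dimension 2 is missing
print(mixture_loglik_observed(x, r, log_pi, np.log(theta)))
```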

Chapter 5 focuses on the problem of unsupervised learning for collaborative prediction when missing data may violate the missing at random assumption. Collaborative prediction problems like rating prediction in recommender systems are typically solved using unsupervised learning methods. As discussed in Chapter 3, the results of learning and prediction will be biased if the missing at random assumption is violated. We discuss compelling new evidence in the form of a novel user study and the analysis of a new collaborative filtering data set which strongly suggests that the missing at random assumption does not hold in the recommender system domain.

We present four novel models for unsupervised learning with non-random missing data that build on the models and inference procedures for random missing data presented in Chapter 4. These models include the combination of the finite multinomial mixture model and the Dirichlet Process multinomial mixture model with a simple missing data mechanism where the probability that a rating is missing depends only on the value of that rating. We refer to this mechanism as CPT-v since it is parameterized using a simple conditional probability table. We prove that the parameters of the CPT-v missing data mechanism are conditionally identifiable even though the mixture data models are not identifiable. We also combine the finite multinomial mixture model with a more flexible missing data model that we refer to as Logit-vd. The Logit-vd model allows for response probabilities that differ depending on both the underlying rating value, and the identity of the item. The name Logit-vd derives from the fact that the missing data mechanism is represented using an additive logistic model. We review modified contrastive divergence learning for restricted Boltzmann machines with missing data, and offer a new derivation of these learning methods as standard contrastive divergence in an alternative model. The final model we consider is a conditional Restricted Boltzmann Machine that includes energy terms that can account for non-random missing data effects similar to the CPT-v model.

We show that traditional experimental protocols and testing procedures for collaborative prediction implicitly assume missing ratings are missing at random. We show that these procedures fail to detect the effects of non-random missing ratings. To correct this problem we introduce novel experimental protocols and testing procedures specifically designed for collaborative prediction with non-random missing data. Our empirical results show that rating prediction methods based on models that incorporate an explicit non-random missing data mechanism achieve 25% to 40% lower error rates than methods that assume the missing at random assumption holds. To put these results in perspective, the best models studied in our previous work on collaborative filtering achieve approximately 15% lower error rates relative to the simplest models we considered [52, p. 107-108]. We also compare the methods studied in terms of ranking performance, and again show that methods that model the missing data mechanism achieve better ranking performance than methods that treat missing data as if it is missing at random.
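The CPT-v mechanism mentioned above is compact enough to illustrate directly: the probability that a rating is observed is a single table indexed by the rating value. The sketch below is a hypothetical illustration, not code or parameter values from the thesis; the complete-data model is a trivial i.i.d. rating distribution rather than the mixture models actually used in Chapter 5, and the table entries are invented. It shows how value-dependent observation probabilities shift the distribution of observed ratings away from the distribution of the underlying complete ratings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Complete data process (illustrative): i.i.d. ratings on a 1-5 scale.
p_rating = np.array([0.10, 0.15, 0.25, 0.30, 0.20])     # P(rating = v)
ratings = rng.choice(np.arange(1, 6), size=200_000, p=p_rating)

# CPT-v style mechanism: P(observed | rating = v) is a conditional
# probability table over the rating value (entries are made up here).
mu = np.array([0.05, 0.10, 0.20, 0.40, 0.60])            # indexed by v - 1
observed = rng.random(ratings.shape) < mu[ratings - 1]

def distribution(x):
    # Empirical distribution over rating values 1..5.
    return np.bincount(x, minlength=6)[1:] / len(x)

print("complete rating distribution:", np.round(distribution(ratings), 3))
print("observed rating distribution:", np.round(distribution(ratings[observed]), 3))
print("mean of all ratings:      %.3f" % ratings.mean())
print("mean of observed ratings: %.3f" % ratings[observed].mean())
```

Because high ratings are more likely to be observed, the observed ratings over-represent high values, which is why models that ignore the mechanism produce biased predictions.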

In Chapter 6 we consider the problem of classification with missing features. We begin with a discussion of general strategies for dealing with missing data in the classification setting. We consider the application of generative classifiers where missing data can be analytically integrated out of the model. We derive a variation of Fisher's linear discriminant analysis for missing data that uses a factor analysis model for the covariance matrix. We then derive a novel discriminative learning procedure for the classifier based on maximizing the conditional probability of the labels given the observed data.

We study the application of logistic regression, multi-layer neural networks, and kernel classifiers in conjunction with several frameworks for converting a discriminative classifier into a classifier for incomplete data cases. We consider the use of various imputation methods including multiple imputation. For data sets with a limited number of patterns of missing data, we consider a reduced model approach that learns a separate classifier for each pattern of missing data. Finally, we consider an approach based on modifying the input representation of a discriminative classifier in such a way that the classification function depends only on the observed feature values, and which features are observed. Results on real and synthetic data sets show that in some cases performance gains over baseline methods can be achieved without learning detailed models of the input space.

1.2 Notation

We use capital letters to denote random variables, and lowercase letters to denote instantiations of random variables. We use a bold typeface to indicate vector and matrix quantities, and a plain typeface to indicate scalar quantities.

When describing data sets we denote the total number of feature dimensions by $D$, and the total number of data cases by $N$. We denote the feature vector for data case $n$ by $\mathbf{x}_n$, and individual feature values by $x_{dn}$. In the classification setting we denote the total number of classes by $C$. We denote the class variable for data case $n$ by $y_n$, and assume it takes the values $\{-1, 1\}$ in the binary case, and $\{1, \dots, C\}$ in the multi-class case.

We use square bracket notation $[s]$ to represent an indicator function that takes the value 1 if the statement $s$ is true, and 0 if the statement $s$ is false. For example, $[x_{dn} = v]$ would take the value 1 if $x_{dn}$ is equal to $v$, and 0 otherwise.

1.2.1 Notation for Missing Data

Following the standard representation for missing data due to Little and Rubin [49], we introduce a companion vector of response indicators for data case $n$ denoted $\mathbf{r}_n$. $r_{dn}$ is 1 if $x_{dn}$ is observed, and $r_{dn}$ is 0 if $x_{dn}$ is not observed. We denote the number of observed data dimensions in data case $n$ by $D_n$. In addition to the response indicator vector, we introduce a vector $\mathbf{o}_n$ of length $D_n$ listing the dimensions of $\mathbf{x}_n$ that are observed. We define $o_{in} = d$ if $\sum_{j=1}^{d} r_{jn} = i$ and $r_{dn} = 1$. In other words, $o_{in} = d$ if $d$ is the $i$th observed dimension of $\mathbf{x}_n$. We introduce a corresponding vector $\mathbf{m}_n$ of length $D - D_n$ listing the dimensions of $\mathbf{x}_n$ that are missing. We define $m_{in} = d$ if $\sum_{j=1}^{d} (1 - r_{jn}) = i$ and $(1 - r_{dn}) = 1$. In other words, $m_{in} = d$ if $d$ is the $i$th missing dimension of $\mathbf{x}_n$.

We use superscripts to denote sub-vectors and sub-matrices. For example, $\mathbf{x}_n^{o_n}$ denotes the sub-vector of $\mathbf{x}_n$ corresponding to the observed elements of $\mathbf{x}_n$. The element-wise definition of $\mathbf{x}_n^{o_n}$ is $x_{in}^{o_n} = x_{o_{in} n}$. Similarly, if $\Sigma$ is a $D \times D$ matrix then, for example, $\Sigma^{o_n m_n}$ is the sub-matrix of $\Sigma$ obtained by selecting the rows corresponding to the observed dimensions of $\mathbf{x}_n$, and the columns corresponding to the missing dimensions of $\mathbf{x}_n$. The element-wise definition of $\Sigma^{o_n m_n}$ is $\Sigma^{o_n m_n}_{ij} = \Sigma_{o_{in} m_{jn}}$. For simplicity we will often use the notation $\mathbf{x}^{o}$ and $\Sigma^{om}$ in place of $\mathbf{x}_n^{o_n}$ or $\Sigma^{o_n m_n}$ when it is clear which pattern of observed or missing entries is intended.

Projection matrices are another very useful tool for dealing with sub-vectors and sub-matrices induced by missing data. We define the projection matrix $\mathbf{H}^{o_n}$ where $H^{o_n}_{ij} = [o_{jn} = i]$. The matrix $\mathbf{H}^{o_n}$ projects a vector from the $D_n$ dimensional space corresponding to the observed dimensions of $\mathbf{x}_n$ to the full $D$ dimensional feature space. The missing dimensions are filled with zeros. Similarly, we define the projection matrix $\mathbf{H}^{m_n}$ such that $H^{m_n}_{ij} = [m_{jn} = i]$. The matrix $\mathbf{H}^{m_n}$ projects a vector from the $(D - D_n)$ dimensional space corresponding to the missing dimensions of $\mathbf{x}_n$ to the full $D$ dimensional feature space. The observed dimensions are filled with zeros. As we will see later, these projection matrices arise naturally when taking matrix and vector derivatives of the form $\partial \Sigma^{o_n m_n} / \partial \Sigma$.
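This notation maps directly onto array operations. The sketch below is an illustration of the notation rather than code from the thesis: it builds the response indicators $\mathbf{r}_n$, the index vectors $\mathbf{o}_n$ and $\mathbf{m}_n$ (0-based here, 1-based in the text), the sub-vector $\mathbf{x}_n^{o_n}$ and sub-matrix $\Sigma^{o_n m_n}$, and the projection matrices $\mathbf{H}^{o_n}$ and $\mathbf{H}^{m_n}$, then checks the identity, which follows from the definitions, that the sub-matrix can be obtained by sandwiching $\Sigma$ between the projection matrices.

```python
import numpy as np

D = 5
x = np.array([3.1, np.nan, 0.7, np.nan, 5.2])   # NaN marks a missing entry
r = (~np.isnan(x)).astype(int)                  # response indicators r_n

o = np.flatnonzero(r)        # observed dimensions o_n (0-based indices)
m = np.flatnonzero(1 - r)    # missing dimensions m_n
D_n = len(o)

x_obs = x[o]                 # sub-vector x^{o_n}

Sigma = np.arange(1.0, 26.0).reshape(D, D)      # an arbitrary D x D matrix
Sigma_om = Sigma[np.ix_(o, m)]                  # sub-matrix Sigma^{o_n m_n}

# Projection matrices: H^{o_n} is D x D_n with H[i, j] = [o_j == i], so it
# maps a length-D_n vector back into the full D-dimensional space, filling
# the missing dimensions with zeros (and similarly for H^{m_n}).
H_o = np.zeros((D, D_n));     H_o[o, np.arange(D_n)] = 1.0
H_m = np.zeros((D, D - D_n)); H_m[m, np.arange(D - D_n)] = 1.0

print(H_o @ x_obs)                                 # x_n with missing entries zeroed
print(np.allclose(Sigma_om, H_o.T @ Sigma @ H_m))  # True: recovers Sigma^{o_n m_n}
```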

1.2.2 Notation and Conventions for Vector and Matrix Calculus

Throughout this work we will be deriving optimization algorithms that require the closed-form or iterative solution of a set of gradient equations. The gradient equations are derived using matrix calculus. In this section we review the matrix calculus conventions used in this work. First, we assume that all vectors are column vectors unless explicitly stated otherwise. We will follow the convention that the gradient of a scalar function $f$ with respect to a matrix-valued function $\mathbf{g}$ of dimension $A \times B$ is a matrix of size $A \times B$, as seen in Equation 1.2.1. We adopt this convention since it avoids the need to transpose the matrix of partial derivatives when solving gradient equations and performing iterative gradient updates.

$$
\frac{\partial f}{\partial \mathbf{g}} =
\begin{bmatrix}
\frac{\partial f}{\partial g_{11}} & \frac{\partial f}{\partial g_{12}} & \cdots & \frac{\partial f}{\partial g_{1B}} \\
\frac{\partial f}{\partial g_{21}} & \frac{\partial f}{\partial g_{22}} & \cdots & \frac{\partial f}{\partial g_{2B}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial g_{A1}} & \frac{\partial f}{\partial g_{A2}} & \cdots & \frac{\partial f}{\partial g_{AB}}
\end{bmatrix}
\qquad (1.2.1)
$$
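The shape convention in Equation 1.2.1 can be checked numerically on a simple example. The sketch below is illustrative only and not from the thesis: for the scalar function $f(\mathbf{G}) = \mathbf{a}^\top \mathbf{G} \mathbf{b}$, the gradient under this convention is the $A \times B$ matrix $\mathbf{a}\mathbf{b}^\top$, which the code compares against a finite-difference approximation computed entry by entry.

```python
import numpy as np

rng = np.random.default_rng(3)
A, B = 4, 3
a, b = rng.normal(size=A), rng.normal(size=B)
G = rng.normal(size=(A, B))

f = lambda G: a @ G @ b                 # scalar function of an A x B matrix

grad_analytic = np.outer(a, b)          # df/dG = a b^T, an A x B matrix

# Finite-difference check, following the convention that the gradient has
# the same A x B shape as G itself.
eps = 1e-6
grad_numeric = np.zeros((A, B))
for i in range(A):
    for j in range(B):
        E = np.zeros((A, B)); E[i, j] = eps
        grad_numeric[i, j] = (f(G + E) - f(G - E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))   # True
```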

