Noise Accumulation in High Dimensional Classification and Total Signal Index


Journal of Machine Learning Research 21 (2020) 1-23. Submitted 2/19; Revised 7/19; Published 1/20.

Noise Accumulation in High Dimensional Classification and Total Signal Index

Miriam R. Elman (elmanm@ohsu.edu)
School of Public Health, Oregon Health & Science University-Portland State University, 3181 SW Sam Jackson Park Rd, Portland, OR 97239, USA

Jessica Minnier (minnier@ohsu.edu)
School of Public Health, Oregon Health & Science University-Portland State University, 3181 SW Sam Jackson Park Rd, Portland, OR 97239, USA

Xiaohui Chang (xiaohui.chang@oregonstate.edu)
College of Business, Oregon State University, 2751 SW Jefferson Way, Corvallis, OR 97331, USA

Dongseok Choi (choid@ohsu.edu)
School of Public Health, Oregon Health & Science University-Portland State University, 3181 SW Sam Jackson Park Rd, Portland, OR 97239, USA

Editor: Xiaotong Shen

Abstract

Great attention has been paid to Big Data in recent years. Such data hold promise for scientific discoveries but also pose challenges to analyses. One potential challenge is noise accumulation. In this paper, we explore noise accumulation in high dimensional two-group classification. First, we revisit a previous assessment of noise accumulation with principal component analyses, which yields a different threshold for discriminative ability than originally identified. Then we extend our scope to its impact on classifiers developed with three common machine learning approaches: random forest, support vector machine, and boosted classification trees. We simulate four scenarios with differing amounts of signal strength to evaluate each method. After determining that noise accumulation may affect the performance of these classifiers, we assess factors that impact it. We conduct simulations with random forest classifiers while varying sample size, signal strength, signal strength proportional to the number of predictors, and signal magnitude. These simulations suggest that noise accumulation affects the discriminative ability of high dimensional classifiers developed using common machine learning methods, and that its effect can be modified by sample size, signal strength, and signal magnitude. We develop the measure total signal index (TSI) to track the trends of total signal and noise accumulation.

Keywords: noise accumulation, classification, high dimensional, random forest, asymptotic, total signal index

© 2020 Miriam R. Elman, Jessica Minnier, Xiaohui Chang, and Dongseok Choi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-117.html.

1. Introduction

Noise accumulation occurs when the simultaneous estimation or testing of multiple parameters results in estimation error. This can happen when many weak predictors, or predictors unrelated to the outcome, are included in a model. Such noise can accumulate, obscuring the true signal and biasing the estimation of the corresponding parameters. Noise accumulation is generally not an issue in conventional statistical settings, where the sample size exceeds the number of predictors, but high dimensional data are highly susceptible to its effects.

Noise accumulation is well known in regression but was first quantified in classification by Fan and Fan (2008). These authors demonstrated that high dimensional classification based on linear discriminant rules can perform equivalently to random guessing due to noise accumulation (Fan and Fan, 2008). They also asserted that projection methods such as principal component analysis (PCA) tend to perform poorly in high dimensional settings. Hall et al. (2008) and Fan (2014) studied distance-based classifiers in these settings and found that performance was adversely affected. The impact of noise accumulation on classification using PCA was further explored using simulation by Fan et al. (2014) in "Challenges of Big Data Analysis." In addition to this work on distance-based classifiers, linear discriminant rules, and PCA, Fan and Fan (2008) showed that the independence classification rule is susceptible to noise accumulation but that its effect can be overcome with variable selection. To our knowledge, classifiers developed with machine learning algorithms such as random forest (Breiman, 2001), which are commonly used in high dimensional settings, have not yet been explored.

In this paper, we are interested in the impact that noise accumulation has on two-group classification for high dimensional data. In Section 2, we use simulation to recreate the scenario described by Fan et al. (2014). In Section 3, we expand the simulations to the high dimensional classification methods random forest (RF), support vector machines (SVM) (Cortes and Vapnik, 1995), and boosted classification trees (BCT) (Friedman et al., 2000). In Section 4, we explore characteristics of noise accumulation in two-group classification, using an RF approach to construct classification rules while varying simulation parameters. In Section 5, we develop a new index, the total signal index (TSI), to track the trends of total signal and noise accumulation. We conclude in Section 6.

All simulations were batch processed in R version 3.4.0 on a computer cluster (R Core Team, 2017). The nodes employed for the analyses ran CentOS Linux 7. PCA was conducted using the prcomp function in base R, while the randomForest (4.6-12), e1071 (1.6-8), and gbm (2.1.3) packages were used to run the RF, SVM, and BCT procedures (Liaw and Wiener, 2002; Meyer et al., 2015; Ridgeway, 2017). We mostly used the default settings of each package for the simulations (thus neglecting the importance of tuning for these methods). Additional information is provided in the Appendix, and code is available on GitHub (Elman, 2018).
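As a point of reference, the following sketch shows one way the computational environment described above could be assembled. It is a minimal illustration rather than the authors' exact setup; the seed value is our own arbitrary choice.

    # Minimal environment setup for the simulations (illustrative sketch).
    library(randomForest)  # RF via randomForest() (Liaw and Wiener, 2002)
    library(e1071)         # SVM via svm() (Meyer et al., 2015)
    library(gbm)           # BCT via gbm() (Ridgeway, 2017)

    set.seed(117)   # arbitrary seed, not from the paper
    sessionInfo()   # records the R and package versions in use

    # PCA uses prcomp() from base R; no additional package is required.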

2. Simulations with PCA

To illustrate the issue of noise accumulation, Fan et al. (2014) explored a classification scenario with data from two classes. A total of p predictors for both classes were drawn from multivariate normal (MVN) distributions with an identity covariance matrix and equal sample size n for each class. Classes 1 and 2 were defined as

    X1, . . . , Xn ~ MVN_p(µ1, I_p)
    Y1, . . . , Yn ~ MVN_p(µ2, I_p),

where µ1 = 0, n = 100 for each class, and p = 1000. The first 10 elements of µ2 were nonzero with value equal to three, and all other entries were zero: µ2 = (3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, . . . , 0). Thus, the nonzero components of µ2 constitute the signal that differentiates the two classes. Fan and colleagues computed principal components for specified numbers of predictors q = 2, 40, 200, and 1000, then visually assessed how well the two classes could be separated by plotting the first two principal components (Fan et al., 2014). They report that discriminative power was high when the number of predictors was low, which in their simulations meant q ≤ 200. When the number of predictors was small enough, there was adequate signal to drown out the noise and differentiate between the classes. As the number of predictors grew, noise eventually overwhelmed the signal, and predicting class membership for new observations became infeasible.

Like Fan et al. (2014), we simulated data for two classes from multivariate normal distributions with an identity covariance matrix and p predictors, where µ1 = 0, µ2 was defined to be sparse with m nonzero elements and the remaining entries equal to zero, and n = 100 for each class. In our simulations, we extended the total number of predictors to p = 5000 and considered three additional scenarios for the nonzero elements of µ2 (Table 1).

Table 1: Scenarios for different classification simulations

    Scenario    m    Form of µ2
    1          10    (3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, . . . , 0)
    2           6    (3, 3, 3, 3, 3, 3, 0, . . . , 0)
    3           2    (3, 3, 0, . . . , 0)
    4          10    (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, . . . , 0)

    µ1 = 0 and Σ1 = Σ2 = I in all scenarios; m represents the number of nonzero elements in µ2.

We computed the principal components for q = 2, 10, 100, 200, 1000, and 5000 and plotted the projections onto the first two components. Figures 1 through 4 show scatterplots with the results of these simulations, depicting class membership by black or red filled circles. A code sketch of this simulation appears after Figure 4.

In general, our results are analogous to the findings of Fan et al. (2014): high discriminative power appears possible when the number of predictors is sufficiently low but declines as the number of predictors increases. However, the threshold for what Fan et al. (2014) deemed low differed in our simulations; we found the threshold for achieving high discriminative power to be much higher. In Scenario 1, we found high discriminative power even up through q = 5000 (Figure 1). In Scenario 2, PCA produced distinct separation up through q = 1000 (Figure 2). When the number of nonzero elements was reduced to m = 2 in Scenario 3 (Figure 3), discriminative ability diminished more quickly, becoming poor at q = 200. In Scenario 4, where the number of nonzero elements was m = 10 and the value of each element was one, high discriminative ability appeared possible when q ≤ 1000 but was otherwise low (Figure 4). Based on these results, it appears that discriminative ability is a function of both signal magnitude (the value of the nonzero elements) and signal strength (the number of nonzero elements).

Figure 1: Scatterplots of the projection of observed data from Scenario 1 (n = 100 for each class, m = 10 nonzero elements of µ2 each equal to three, and µ1 = 0) onto the first two principal components of the m-dimensional space, shown for (a) q = 2, (b) q = 10, (c) q = 100, (d) q = 200, (e) q = 1000, and (f) q = 5000. Black circles indicate the first class, red circles indicate the second.

Figure 2: Scatterplots of the projection of observed data from Scenario 2 (n = 100 for each class, m = 6 nonzero elements of µ2 each equal to three, and µ1 = 0) onto the first two principal components of the m-dimensional space, shown for (a) q = 2, (b) q = 10, (c) q = 100, (d) q = 200, (e) q = 1000, and (f) q = 5000. Black circles indicate the first class, red circles indicate the second.

Figure 3: Scatterplots of the projection of observed data from Scenario 3 (n = 100 for each class, m = 2 nonzero elements of µ2 each equal to three, and µ1 = 0) onto the first two principal components of the m-dimensional space, shown for (a) q = 2, (b) q = 10, (c) q = 100, (d) q = 200, (e) q = 1000, and (f) q = 5000. Black circles indicate the first class, red circles indicate the second.

Figure 4: Scatterplots of the projection of observed data from Scenario 4 (n = 100 for each class, m = 10 nonzero elements of µ2 each equal to one, and µ1 = 0) onto the first two principal components of the m-dimensional space, shown for (a) q = 2, (b) q = 10, (c) q = 100, (d) q = 200, (e) q = 1000, and (f) q = 5000. Black circles indicate the first class, red circles indicate the second.
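To make the simulation design concrete, the sketch below reproduces one panel of the Scenario 1 experiment: it draws the two classes, computes principal components on the first q predictors, and plots the first two component scores. This is a minimal illustration under the stated design; the object names and plotting details are our own.

    # Sketch of one panel of the Scenario 1 PCA simulation (illustrative).
    n <- 100                            # sample size per class
    p <- 5000                           # total number of predictors
    m <- 10                             # nonzero elements of mu2 (Scenario 1)
    mu2 <- c(rep(3, m), rep(0, p - m))

    X <- matrix(rnorm(n * p), n, p)                                    # class 1: MVN_p(0, I_p)
    Y <- matrix(rnorm(n * p), n, p) + matrix(mu2, n, p, byrow = TRUE)  # class 2: MVN_p(mu2, I_p)

    q <- 200                                     # predictors entering the PCA
    pc <- prcomp(rbind(X[, 1:q], Y[, 1:q]))      # PCA on the first q predictors
    plot(pc$x[, 1], pc$x[, 2],
         col = rep(c("black", "red"), each = n),
         xlab = "1st Principal Component", ylab = "2nd Principal Component")

Varying q over 2, 10, 100, 200, 1000, and 5000 reproduces the panels of Figure 1; changing m and the nonzero value in mu2 yields the other scenarios of Table 1.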

3. Simulation with Classification Methods

We expanded the simulations used for PCA to the machine learning methods RF, SVM, and BCT. Using the same scenarios we explored previously (Table 1), we built classifiers with these methods and evaluated their performance. For each method and scenario, a classification rule was developed for q = 2, . . . , 5000 predictors on a training data set. The classifier was then applied to a corresponding test data set and used to predict whether new observations should be categorized into the first or second class. This process was repeated on 100 training data sets, and the resulting classifiers were then used to predict class membership for 100 corresponding test data sets. Each classifier's discriminative power was assessed by the median classification error across the test data sets, with 10th and 90th percentile bounds, obtained by comparing the class predicted by the classifier to the true class in the test data set. We evaluated the overall trend of the median classification error in each scenario as well as the maximum classification error for q ≤ 10 and for q = 5000. A sketch of one replicate of this procedure is given at the end of this section.

3.1. Scenario 1: µ2 = (3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, . . . , 0)

The three classification methods each demonstrated high discriminative ability in Scenario 1. Overall, the median test error was below 10% for RF, SVM, and BCT (Figure 5, row 1). In particular, RF and BCT performed with almost no misclassification when q ≥ 4. Test error reached its maximum for RF and BCT when 2 ≤ q ≤ 4. By q = 10, the test error had dropped substantially for RF and BCT but increased for SVM. Table 2 summarizes the maximum test error for q ≤ 10 and for q = 5000.

3.2. Scenario 2: µ2 = (3, 3, 3, 3, 3, 3, 0, . . . , 0)

Results from the second scenario were similar to the first, except that SVM performed worse (Figure 5, row 2). The overall median test error was below 3% for RF and BCT, and the test error for these methods peaked when 2 ≤ q ≤ 4 (Table 3). Beyond this point, there was almost no test error for these methods. By contrast, SVM had a small initial peak in test error at q = 3, which dropped and then rose even higher as q grew. Table 3 shows the final value of the test error for each method at q = 5000.

3.3. Scenario 3: µ2 = (3, 3, 0, . . . , 0)

There was a decline in the discriminative ability of RF, and especially of SVM, in Scenario 3 (Figure 5, row 3). Despite the increase in test error relative to the previous scenarios, RF performed reasonably well, with an overall median test error below 8%. The SVM classifier did not behave as well; its overall median test error was 35%. BCT still performed at nearly the same level as in Scenarios 1 and 2; its overall median test error was below 4%. Unlike the previous scenarios, the highest test error for RF and BCT occurred not when q ≤ 5 but when q = 5000. Table 4 shows the maximum median test error for q ≤ 10 and for q = 5000.

3.4. Scenario 4: µ2 = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, . . . , 0)

Scenario 4 proved to be a difficult simulation for all classification approaches (Figure 5, row 4), though the test error for SVM was slightly better in this scenario than in the previous one. Overall, the median test error was above 30% for RF and BCT, while it was below 30% for SVM. The test error peaked at 2 ≤ q ≤ 3 for RF and BCT but at q = 5000 for SVM. Table 5 shows the maximum test error for q ≤ 10. After the initial increase, the test error decreased for all of the methods. The behavior of the test error
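As an illustration of the evaluation scheme described at the start of this section, the sketch below generates one training/test pair, fits all three classifiers with default settings, and computes their test errors for a single value of q. It is a simplified, single-replicate sketch under the simulation design above; the helper function and object names are our own, not from the paper's code.

    # One replicate of the classifier evaluation for a given q (illustrative).
    library(randomForest); library(e1071); library(gbm)
    set.seed(1)  # arbitrary seed

    simulate_classes <- function(n, p, mu2) {  # hypothetical helper
      X <- matrix(rnorm(n * p), n, p)                                    # class 1
      Y <- matrix(rnorm(n * p), n, p) + matrix(mu2, n, p, byrow = TRUE)  # class 2
      data.frame(rbind(X, Y), class = factor(rep(c("1", "2"), each = n)))
    }

    n <- 100; p <- 5000; q <- 100
    mu2 <- c(rep(3, 10), rep(0, p - 10))     # Scenario 1
    keep  <- c(1:q, p + 1)                   # first q predictors plus the class label
    train <- simulate_classes(n, p, mu2)[, keep]
    test  <- simulate_classes(n, p, mu2)[, keep]

    test_error <- function(pred) mean(pred != test$class)

    rf <- randomForest(class ~ ., data = train)  # default settings throughout
    sv <- svm(class ~ ., data = train)
    test_error(predict(rf, test))                # RF test error
    test_error(predict(sv, test))                # SVM test error

    # gbm() expects a 0/1 numeric response for distribution = "bernoulli"
    train_g <- transform(train, class = as.numeric(class == "2"))
    bct <- gbm(class ~ ., data = train_g, distribution = "bernoulli")
    p_hat <- predict(bct, test, n.trees = bct$n.trees, type = "response")
    test_error(factor(ifelse(p_hat > 0.5, "2", "1"),
                      levels = levels(test$class)))  # BCT test error

Repeating this over 100 simulated training/test pairs for each q and taking quantile(errors, c(0.1, 0.5, 0.9)) would give the median test error with 10th and 90th percentile bounds of the kind reported in Figure 5 and Tables 2 through 5.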
