Exploratory Data Analysis - GitHub Pages


Exploratory Data Analysis
Roger D. Peng, Stephanie C. Hicks
Advanced Data Science, Term 1, 2019

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”
–John Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, 1962

Data Analysis in a
[Figure; slide title truncated in extraction. Surviving labels: Audience, Results]

Tukey-sian Data Analysis
John Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, 1962

Tukey-sian Data Analysis
- Data analysis must seek for scope and usefulness rather than security
- Data analysis must be willing to err moderately often in order that inadequate evidence shall more often suggest the right answer
- Data analysis must use mathematical argument and mathematical results as bases for judgment rather than as bases for proof or stamps of validity
"These points are meant to be taken seriously."

Tukey-sian Data Analysis
(a1') Recognition of problem
(a1'') One technique used
(a2) Competing techniques used
(a3) Rough comparisons of efficacy
(a4) Comparison in terms of precise (and thereby inadequate) criterion
(a5') Optimization in terms of a precise, and similarly inadequate criterion
(a5'') Comparison in terms of several criteria

“In my experience when a moderately good solution to a problem has been found, it is seldom worth while to spend much time trying to convert this to the 'best' solution. The time is much better spent in real research.”
–George Kimball, "A critique of operations research," J. Wash. Acad. Sci, 1958

Questions to Resolve
- Do we have the right question?
- Do we have the right data?
- Can we sketch the solution?

Phases of Data Analysis
[Figure: numbered phases of a data analysis. Surviving labels: Explore data, refine question; Refined Question; Formal Modeling, Inference; Results; Synthesis; finalize question, sketch answer; Develop models, formal solution; Build narrative, interpret evidence]

EDA Process
[Figure: flow diagram. Surviving labels: Data; Tidy Data; Goal; Refined Question; New Questions; Context; Resources; Audience; EDA; EDA Products (Better Understanding of the Data, Sketch of the Answer)]

Do We Have the Right Question?
- Too vague
  - Unwieldy analysis
- Too specific
  - We don’t have that particular kind of data
  - Affected population is too small
- Does not lead to a decision or intervention
  - Relevance?

Do We Have the Right Data?
- Data are proxies for the key variables
- Insufficient data to make reasonable inferences or predictions
- Missing variables that might be confounders, modifiers
- Missing data prevents complete analysis
- Data with errors can increase bias, uncertainty
Notation shown across the builds of this slide: Y and X1, ..., Xp; Y′ and X1, ..., Xp; Y, X1, ..., Xp and Z1, ..., Zk

Can We Sketch the Solution? ("Lo-Fi" Model)

Can We Sketch the Solution?
- Is there any signal in the data?
- A picture, table, or figure that tells us 80% of the answer
- A simplified model that indicates predictive power
- Further work will test the sensitivity and robustness of our solution
- The sketch will almost never be seen by outsiders
“Inner City” Asthma?

EDA Macro Cycle
- Right Question?
- Right Data?
- Feasible Solution?
Notation on slide: Y; X1, ..., Xp; Z1, ..., Zk

EDA Epicycle
“The value of a plot is that it allows us to see what we never expected to see.”
–John Tukey, Exploratory Data Analysis

EDA Epicycle
- Set Expectations
- Create Summary of the Data
- Compare Summary to Expectations
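The three steps of the epicycle can be sketched in a few lines of Python. This is only an illustration: the function name `epicycle`, the choice of the mean as the summary, the tolerance, and the daily counts are all assumptions made for the example, not part of the lecture.

```python
from statistics import mean

def epicycle(values, expected_mean, tolerance):
    """One pass: set an expectation, summarize the data, compare the two."""
    observed_mean = mean(values)               # create a summary of the data
    deviation = observed_mean - expected_mean  # compare summary to expectation
    return {
        "expected": expected_mean,
        "observed": observed_mean,
        "deviation": deviation,
        "matches": abs(deviation) <= tolerance,
    }

# Hypothetical daily counts; the expectation of about 10 per day is made up.
daily_counts = [8, 12, 9, 30, 11]
result = epicycle(daily_counts, expected_mean=10, tolerance=2)
print(result)
```

When the summary does not match the expectation, as here, the next step is to ask whether the expectation or the data is at fault (the value 30 would be the first thing to look at).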

Expectations vs. Reality
[Figure. Surviving labels: Observed Data; Our Expectation; Our Deviation from Reality; Question; Nature; Context; Resources; Prior Work; Results]

Prevalence of Asthma Amongst Medicaid Enrollees
[Figure: % of People with Asthma in Medicaid, 2009-2010, by Age Category]

Prevalence of Asthma Amongst Medicaid Enrollees
[Figure: % of People with Asthma in Medicaid, 2009-2010, by Age Category. Surviving panel labels: Observed %; Age; Race; residual; Expectation; Deviation]

Creating More Data With Median Polish
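Median polish splits a two-way table into an overall level, a row effect, a column effect, and a residual, which fits the expectation-plus-deviation view used in these slides. The sketch below is a minimal pure-Python version of Tukey's procedure with a fixed number of sweeps; the function name, the iteration count, and the 3 x 3 table of percentages are hypothetical and are not the data or code used in the lecture.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Decompose table[i][j] into overall + row[i] + col[j] + residual[i][j]."""
    resid = [[float(x) for x in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep each row's median out of the residuals into the row effect.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        # Re-center the column effects, moving their median into the overall level.
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep each column's median out of the residuals into the column effect.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Re-center the row effects in the same way.
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid

# Hypothetical table: rows are age categories, columns are race categories.
observed = [
    [9.0, 12.0, 10.0],
    [7.0, 11.0, 8.0],
    [5.0, 6.0, 9.0],
]
overall, row_eff, col_eff, resid = median_polish(observed)
print("overall:", overall)
print("row effects:", row_eff)
print("column effects:", col_eff)
print("residuals:", resid)
```

Every cell satisfies observed = overall + row effect + column effect + residual, so the fitted part plays the role of the expectation and the residual is the deviation. Large residuals mark the cells worth a closer look.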

EDA Pre-Flight Check List
- Check the packaging
- Look at the top and bottom of your data
- Check your "n"s
- Validate with at least one external data source
- Make a plot
- Try the easy solution first
- Follow up

Check the Packaging
- What can you learn about the dataset before looking directly at the data?
- Check rows and columns
- Check metadata; are all variables there that you expected? Are all metadata present?
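A packaging check might look like the sketch below, which uses only Python's `csv` module. The column names, the in-memory file contents, and the "expected" set of variables are assumptions invented for the example.

```python
import csv
import io

# Hypothetical file contents; in practice this would be an open file handle.
raw = io.StringIO(
    "id,date,age_group,visits\n"
    "1,2010-01-01,0-4,3\n"
    "2,2010-01-01,5-14,5\n"
    "3,2010-01-02,0-4,2\n"
)
# Variables we expect from the (assumed) data documentation.
expected_columns = {"id", "date", "age_group", "visits", "county"}

reader = csv.reader(raw)
header = next(reader)            # column names only, not the values
n_rows = sum(1 for _ in reader)  # count rows without inspecting them

missing = expected_columns - set(header)
print(f"{n_rows} rows x {len(header)} columns")
print("missing variables:", sorted(missing))
```

The point is that the shape and the variable list are compared with what the documentation promised before any values are examined.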

Look at the Top and Bottom
- Okay, now you can look at the data
- Check the first few rows
- Check the last few rows; make sure all rows were read properly and there’s no crud at the end
- Time/Date data often sorted; make sure all dates/times are in appropriate range
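As a rough illustration of these checks, the sketch below prints the first and last rows, flags rows whose date field does not parse, and confirms that the remaining dates fall in the expected year. The rows, the trailing "Total" line, and the 2010 date range are hypothetical.

```python
from datetime import date

# Hypothetical rows as read from a file: (date, count) pairs.
rows = [
    ("2010-01-01", "12"),
    ("2010-01-02", "9"),
    ("2010-12-31", "14"),
    ("Total", "35"),  # crud: a summary line left at the end of the file
]

def parse_day(text):
    """Return a date, or None if the text is not an ISO-format date."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None

print("top:", rows[:2])
print("bottom:", rows[-2:])

# Rows whose first field is not a date are candidates for crud.
bad_rows = [r for r in rows if parse_day(r[0]) is None]
days = [d for d in (parse_day(r[0]) for r in rows) if d is not None]
in_range = all(date(2010, 1, 1) <= d <= date(2010, 12, 31) for d in days)

print("rows that failed to parse:", bad_rows)
print("all dates within 2010:", in_range)
```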

ABC: Always be Counting
- Count various aspects of your dataset
- Compare counts with landmarks
- Number of subjects (unique IDs), number of visits per subject, number of locations, number of missing observations, etc.
- Always be counting at every phase (“checking mindset”)
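The counts listed above are cheap to compute. The following sketch assumes a small list of visit records with made-up field names (`id`, `site`, `value`) and a landmark of three enrolled subjects; none of this comes from the lecture's data.

```python
from collections import Counter

# Hypothetical visit records; None marks a missing measurement.
visits = [
    {"id": "A", "site": "site_1", "value": 3.1},
    {"id": "A", "site": "site_1", "value": None},
    {"id": "B", "site": "site_1", "value": 2.4},
    {"id": "C", "site": "site_2", "value": 5.0},
]

n_subjects = len({v["id"] for v in visits})            # unique IDs
visits_per_subject = Counter(v["id"] for v in visits)  # visits per subject
n_sites = len({v["site"] for v in visits})             # number of locations
n_missing = sum(v["value"] is None for v in visits)    # missing observations

expected_subjects = 3  # landmark, e.g. taken from the study protocol (assumed)
matches_landmark = n_subjects == expected_subjects

print(n_subjects, dict(visits_per_subject), n_sites, n_missing, matches_landmark)
```

Repeating the same counts after every filter or merge makes it easy to notice when rows or subjects have silently disappeared.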

Validate With At Least 1 External Source
- Compare your data to something outside the dataset
- Even a single number/summary statistic comparison can be useful
- Compare your measurements to another similar measurement to check that they’re correlated
- Get external upper/lower bounds
  - Ex: number of people should not exceed total population
  - Ex: Check for negative values when they should be positive
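The two bound checks in the examples above can be written as one small helper. In this sketch the function name `check_bounds`, the counts, and the population figure are placeholders chosen for illustration; the population number is not an official statistic.

```python
def check_bounds(values, lower, upper):
    """Return the (index, value) pairs that fall outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if not lower <= v <= upper]

# Hypothetical counts of people; two of them are implausible on purpose.
enrollee_counts = [1200, 950, -4, 7_000_000]
total_population = 5_800_000  # placeholder external upper bound

violations = check_bounds(enrollee_counts, lower=0, upper=total_population)
print(violations)
```

Any value reported here is either a data error or a sign that the external bound was misunderstood, and both are worth knowing early.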

Make a Plot
- Plots show expectations and deviations from those expectations (i.e. distribution mean and outliers)
- Tables generally only show summaries, not deviations; also everything on the same “scale”
- Draw a “fake plot” first

Try the Easy Solution
- First step in building a primary model
- Build prima facie evidence
- Basic argument, without nuance (that comes later)
- Maybe just one plot (or table)
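One possible "easy solution" is a straight-line fit of the outcome on a single predictor, computed by hand with ordinary least squares. The lecture does not prescribe this particular model; the function name `simple_fit` and the five data points are assumptions for the sake of a runnable sketch.

```python
from statistics import mean

def simple_fit(x, y):
    """Least-squares intercept and slope for the line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical predictor and outcome values.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
intercept, slope = simple_fit(x, y)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

A slope clearly different from zero is prima facie evidence of signal. The nuance (confounders, model checking, uncertainty) comes later, once the sketch suggests the question is worth pursuing.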

Follow Up
- Do you have the right question?
- Do you have the right data?
- Do you need other data?
- Could you sketch the solution?
- Is there signal in the data?

[Figure: Daily doctor’s visits for asthma amongst Medicaid enrollees in Maryland, 2010. Vertical axis: Number of Doctor’s Visits; horizontal axis: Date, January to December]

Exploring Data With Models

“All models are wrong, but some are useful.”–George Box

Exploring With Models
- Models represent a formalization of our expectations
- Models can tell us about the unobserved population
- Whether a model fits well depends on the question

Selling A Book

Selling A Book
- What is the question?
- What is the goal? (hint: )
- Do we have the right data?
- Can we sketch a solution?
