Stat 470/670 Lecture 1 - GitHub Pages


Stat 470/670 Lecture 1

What is Exploratory Data Analysis?

We will be exploring numbers. We need to handle them easily and look at them effectively. Techniques for handling and looking--whether graphical, arithmetic, or intermediate--will be important.
Tukey, Exploratory Data Analysis (1977)

A first example: Heights of the highest points by state

## load required packages and data
library(tidyverse)
## -- Attaching packages ------------------------------ tidyverse 1.3.0 --
## v tibble  3.0.1     v dplyr   1.0.2
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts --------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
options(tibble.print_min = 15)
heights <- read_csv("highest-points-by-state.csv")
## Parsed with column specification:
## cols(
##   elevation = col_double(),
##   state = col_character()
## )

A first try at looking at the data

heights
## # A tibble: 50 x 2
##    elevation state
##        <dbl> <chr>
##  1      733. Alabama
##  2     6168. Alaska
##  3     3851. Arizona
##  4      839. Arkansas
##  5     4418. California
##  6     4399. Colorado
##  7      725. Connecticut
##  8      137. Delaware
##  9      105. Florida
## 10     1458. Georgia
## 11     4205. Hawaii
## 12     3859. Idaho
## 13      376. Illinois
## 14      383. Indiana
## 15      509. Iowa
## # ... with 35 more rows

A second try at looking at the data

arrange(heights, elevation)
## # A tibble: 50 x 2
##    elevation state
##        <dbl> <chr>
##  1      105. Florida
##  2      137. Delaware
##  3      163. Louisiana
##  4      246. Mississippi
##  5      247. Rhode Island
##  6      376. Illinois
##  7      383. Indiana
##  8      472. Ohio
##  9      509. Iowa
## 10      540. Missouri
## 11      550. New Jersey
## 12      595. Wisconsin
## 13      603. Michigan
## 14      701. Minnesota
## 15      725. Connecticut
## # ... with 35 more rows

arrange(heights, desc(elevation))
## # A tibble: 50 x 2
##    elevation state
##        <dbl> <chr>
##  1     6168. Alaska
##  2     4418. California
##  3     4399. Colorado
##  4     4392. Washington
##  5     4207. Wyoming
##  6     4205. Hawaii
##  7     4123. Utah
##  8     4011. New Mexico
##  9     4005. Nevada
## 10     3901. Montana
## 11     3859. Idaho
## 12     3851. Arizona
## 13     3426. Oregon
## 14     2667. Texas
## 15     2207. South Dakota
## # ... with 35 more rows

Stem-and-leaf plots

Goals:
- Write down the set of numbers, keeping as much detail as possible
- Pack the numbers efficiently, so you can see all of them at once

These are in conflict!

Remedy: Notice that parts of the numbers (the beginnings) are repeated. The first digit of each number is printed at the beginning of the line, the remainder at the end. The first digit is the “stem”, the remainders are the “leaves”.

Stem-and-leaf plot example

Set of numbers: 16, 17, 17, 17, 17, 18
Stem-and-leaf display:
1 | 677778
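
The construction on this slide can be sketched in a few lines of base R. This is a minimal illustration, assuming two-digit whole numbers so that the tens digit is the stem and the ones digit is the leaf. The variable names are made up for the example, and this is not how R's built-in stem() chooses its scale.

```r
# Numbers from the slide
x <- c(16, 17, 17, 17, 17, 18)

# Assumption: two-digit values, so stem = tens digit, leaf = ones digit
stems  <- x %/% 10
leaves <- x %% 10

# For each stem, paste its sorted leaves into one string
display <- tapply(leaves, stems, function(l) paste(sort(l), collapse = ""))

# One line per stem, in the form "stem | leaves"
stem_lines <- paste(names(display), "|", display)
cat(stem_lines, sep = "\n")
```

With more than one distinct tens digit, the same code would print one line per stem.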

Stem-and-leaf plot for the elevations in meters:

stem(heights$elevation)
##
##   The decimal point is 3 digit(s) to the right of the |
##
##   0 | 11222445555667778
##   1 | 0011123355566779
##   2 | 0027
##   3 | 4999
##   4 | 00122444
##   5 |
##   6 | 2

The stem-and-leaf plot shows that there are three groups of states:
- Alaska
- The western and Rocky Mountain states (California, Colorado, Washington, Wyoming, Hawaii, Utah, New Mexico, Nevada, Montana, Idaho, Arizona, Oregon)
- All the other states
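
One way to turn the three groups into a variable is to cut the elevations at the gaps visible in the tables: no state falls between 2667 m (Texas) and 3426 m (Oregon), or between 4418 m (California) and 6168 m (Alaska). The sketch below uses base R on a handful of rows copied from the tables above. The cut points 3000 and 5000, the group labels, and the name heights_sample are assumptions chosen for illustration.

```r
# A few elevations (meters) copied from the tables above
heights_sample <- data.frame(
  state     = c("Alaska", "California", "Oregon", "Texas", "Florida"),
  elevation = c(6168, 4418, 3426, 2667, 105)
)

# Assumed cut points, placed inside the two gaps in the data
heights_sample$group <- cut(
  heights_sample$elevation,
  breaks = c(0, 3000, 5000, Inf),
  labels = c("other", "western", "Alaska")
)

# Count the states in each group
table(heights_sample$group)
```

With the full heights tibble, the same cut() call could go inside a mutate().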

Note 1

Hoosier Hill: Elevation 1257 feet (383 m, the Indiana entry in the table above)
Source: Google Street View

Note 2

Compare the stem-and-leaf plot with a density estimate

ggplot(heights, aes(x = elevation)) + geom_density()

[Figure: density estimate of elevation; x-axis "elevation" from 0 to 6000, y-axis "density" from 0e+00 to 3e-04]

Where is Alaska?

ggplot(heights, aes(x = elevation)) + geom_density() + geom_rug()

[Figure: the same density estimate with a rug of the individual elevations along the x-axis]

We have made an advance in understanding this set of numbers!

What would traditional statistics have to say about these numbers?

What if we have many more numbers, e.g. census data?
Source: US Census Bureau Public Information Office, via the National Geographic Society

Or a large matrix?
Source: Still from “The Matrix”

Or graph data?
Source: KEGG PATHWAY Database

Exploratory vs. Confirmatory Analyses

Confirmatory analysis
- Probability model for the data specified before analysis takes place
- Given the probability model, test hypotheses or infer parameter values

Exploratory analysis: everything else!

In particular:
- Check distributional assumptions
- Check for outliers
- Decide on variable transformations
- Decide on the form of the model: what variables to include

BUT: Not limited to the work done before fitting a model! In the highest points example, we had an EDA-based advance that wasn’t related to model fitting at all.

What does Tukey say?

1A. Quantitative detective work

Exploratory data analysis is detective work--numerical detective work--or counting detective work--or graphical detective work.

A detective investigating a crime needs both tools and understanding. If he has no fingerprint powder, he will fail to find fingerprints on most surfaces. If he does not understand where the criminal is likely to have put his fingers, he will not look in the right places. Equally, the analyst of data needs both tools and understanding. It is the purpose of this book to provide some of each.

Time will keep us from learning about many tools--we shall try to look at a few of the most general and powerful among the simple ones. We do not guarantee to introduce you to the "best" tools, particularly since we are not sure that there can be unique bests.

Understanding has different limitations. As many detective stories have made clear, one needs quite different sorts of detailed understanding to detect criminals in London's slums, in a remote Welsh village, among Parisian aristocrats, in the cattle-raising west, or in the Australian outback. We do not expect a Scotland Yard officer to do well trailing cattle thieves, or a Texas ranger to be effective in the heart of Birmingham. Equally, very different detailed understandings are needed if we are to be highly effective in dealing with data concerning earthquakes, data concerning techniques of chemical manufacturing, data concerning the sizes and profits of firms in a service industry, data concerning human hearing, data concerning suicide rates, data concerning population growth, data concerning fossil dinosaurs, data concern-

[...] ones, there is likely to be nothing for confirmatory data analysis to consider.

Experiments and certain planned inquiries provide some exceptions and partial exceptions to this rule. They do this because one line of data analysis was planned as part of the experiment or inquiry. Even here, however, restricting one's self to the planned analysis--failing to accompany it with exploration--loses sight of the most interesting results too frequently to be comfortable.

As all detective stories remind us, many of the circumstances surrounding a crime are accidental or misleading. Equally, many of the indications to be discerned in bodies of data are accidental or misleading. To accept all appearances as conclusive would be destructively foolish, either in crime detection or in data analysis. To fail to collect all appearances because some--or even most--are only accidents would, however, be gross misfeasance deserving (and often receiving) appropriate punishment.

Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone--as the first step.

We will be exploring numbers. We need to handle them easily and look at them effectively. Techniques for handling and looking--whether graphical, arithmetic, or intermediate--will be important. The simpler we can make these techniques, the better--so long as they work, and work well. When details make an important difference, they deserve--and will get--emphasis.

review questions
What is exploratory data analysis? How is it related to confirmatory data [...]

Tukey, Exploratory Data Analysis (1977) pp. 1-3

Exploratory: Collect everything that even seems to be true about the data. Detective in character; “magical thinking”.

Confirmatory: Given one pre-planned hypothesis, infer parameter values or test hypotheses. Judicial in character; set a high bar for what we are willing to believe about the data.

The never-ending data analysis cycle:
1. Get data.
2. Perform exploratory analysis to suggest a model.
3. Fit the model.
4. Perform exploratory analysis to critique the model and suggest a new model.
5. Return to step 3.

This workflow is dangerous!
- Using the data more than once
- Assiduous EDA means multiple comparison problems
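
The multiple comparison problem can be made concrete with a little arithmetic. The sketch below assumes 20 independent looks at the data, each tested at the 5% level when no real effect exists. The numbers 20 and 0.05, the one-sample t-test setup, and all variable names are illustrative choices, not part of the lecture.

```r
# Chance of at least one "significant" result when running
# m independent tests at level alpha and no effect is real
alpha <- 0.05
m <- 20
p_any_false_positive <- 1 - (1 - alpha)^m
p_any_false_positive  # about 0.64

# Simulation check (assumed setup): each "look" is a one-sample
# t-test on 30 draws of pure noise
set.seed(470)
n_rep <- 500
any_hit <- replicate(n_rep, {
  p_values <- replicate(m, t.test(rnorm(30))$p.value)
  any(p_values < alpha)
})
mean(any_hit)  # should land near the value computed above
```

Even with nothing to find, roughly two out of three such explorations turn up something that looks significant.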

Tukey’s EDA also emphasizes tools and best practices for carrying out data analysis, all pen-and-paper based.

Example: Tallying

The basis of stem-and-leaf technique, entering an additional digit--or digits--to mark each value, works well for batches of limited size. Once we have much more than 20 leaves on a stem, however, we are likely to feel cramped--and our stems begin to be hard to count. We ought to be able to escape to some other way of handling such information, whenever the other way gives us enough detail.

Standard method: The fast methods involve one pencil (or pen) stroke per item. One method counts by fives in this style:

I  II  III  IIII  [...]

This has been widely used. The writer finds it treacherous, especially when he tries to go fast. (It is too easy for him to do [tally figure] or [tally figure] for this approach to give satisfactory performance.)

Tukey’s proposal: The recommended scheme uses first dots, then box lines, then crossed lines to make a final character for 10. Thus:

4 is [tally figure], 8 is [tally figure], 10 is [tally figure]
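
The bookkeeping behind the recommended scheme can be sketched as a small R function that splits a count into completed characters of 10 plus the strokes of the character in progress. The stroke order assumed here (4 dots, then 4 box lines, then 2 crossed lines) is inferred from the examples for 4, 8, and 10 in the excerpt, and the function name tally_strokes is hypothetical.

```r
# Break a count into the strokes of the tally-by-tens scheme.
# Assumed stroke order within each group of ten: 4 dots, then 4 box
# lines, then 2 crossed lines.
tally_strokes <- function(n) {
  r <- n %% 10  # strokes in the character currently being drawn
  c(
    full_tens     = n %/% 10,
    dots          = min(r, 4),
    box_lines     = min(max(r - 4, 0), 4),
    crossed_lines = max(r - 8, 0)
  )
}

tally_strokes(4)   # dots only
tally_strokes(8)   # dots plus a complete box
tally_strokes(23)  # two finished characters and three dots
```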

Pen-and-paper methods are primarily of historical interest.

Their philosophical descendants are the tidyverse packages in R.

What about this class?

Two categories of topics: what to do and how to do it.

For what to do, organize by type of data:
- Univariate data
- Bivariate data
- Trivariate/Hypervariate data
- Categorical data
- Distance data
- Graph data
- Other topics according to interest

In addition: Dangers of EDA and how to avoid them

In the how to do it bin, we will learn to work with
- R
- ggplot2
- tidyverse packages

How is this class different from others?
- Machine learning: We put less emphasis on supervised learning.
- Data mining: More emphasis on visualization.
- Applied statistics: Less emphasis on 𝑝-values and inference, more flexibility in the methods used.

Texts:
- Cleveland, Visualizing Data
- Wickham, ggplot2: Elegant Graphics for Data Analysis
- Wickham and Grolemund, R for Data Science
- Other notes posted to the class website and Canvas as necessary

Assessment:
- Homeworks (30%)
- Two mini projects (30%)
- Final project (40%)

How to succeed:
- Practice! Follow along with the code examples, and actually type in the commands instead of copying and pasting.
- Start early on assignments and projects.
- Presentation matters: make your documents look nice enough that you would be happy to show them to potential employers as examples of your work.

We will be exploring numbers. We need to handle them easily and look at them effectively. Techniques for handling and looking--whether graphical, arithmetic, or intermediate--will be important.
Tukey, Exploratory Data Analysis (1977)
