Visualizing And Exploring Data


Visual Methods for finding structures in data
- Power of human eye/brain to detect structures
  – Product of eons of evolution
- Display data in ways that capitalize on human pattern processing abilities
- Can find unexpected relationships
  – Limitation: very large data sets

Exploratory Data Analysis
- Explore the data without any clear idea of what we are looking for
- EDA techniques are
  – Interactive
  – Visual
- Many graphical methods for low-dimensional data
- For higher dimensions: Principal Components Analysis

Topics in Visualization
1. Summarizing Data
   – Mean, Variance, Standard Deviation, Skewness
2. Tools for Single Variables (histogram)
3. Tools for Pairs of Variables (scatterplot)
4. Tools for Multiple Variables
5. Principal Components Analysis
   – Reduced number of dimensions

1. Summarizing the data
- Mean: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x(i)$
- Centrality
  – The mean minimizes the sum of squared errors to all samples
  – If there are n data values, the mean is the value such that the sum of n copies of the mean equals the sum of the data values
- Measures of Location
  – Mean is a measure of location
  – Median (value that has an equal number of points above and below)
  – First quartile (value greater than a quarter of the data points)
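The location measures above can be checked numerically. The following is a minimal sketch using only the Python standard library; the sample values and the helper name `sum_squared_errors` are illustrative assumptions, and `statistics.quantiles` uses one of several common quartile conventions.

```python
import statistics

def sum_squared_errors(data, c):
    """Sum of squared differences between each sample and a candidate centre c."""
    return sum((x - c) ** 2 for x in data)

# Hypothetical sample; the values are illustrative only.
data = [2.0, 3.0, 3.0, 5.0, 7.0, 8.0, 14.0]

mean = sum(data) / len(data)        # (1/n) * sum of x(i)
median = statistics.median(data)    # equal number of points above and below
# With n=4, statistics.quantiles returns the three quartile cut points.
lower_q, _, upper_q = statistics.quantiles(data, n=4)

# n copies of the mean add up to the sum of the data values.
copies_sum = len(data) * mean

# Squared error at the mean, and at centres shifted away from the mean.
sse_at_mean = sum_squared_errors(data, mean)
sse_shifted = [sum_squared_errors(data, mean + s) for s in (-1.0, -0.1, 0.1, 1.0)]

print("mean:", mean, "median:", median, "quartiles:", lower_q, upper_q)
print("SSE at mean:", sse_at_mean, "SSE at shifted centres:", sse_shifted)
```

Every shifted centre gives a larger squared error than the mean does, which is the "centrality" property stated on the slide.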

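Measures of Dispersion, or Variability
- Variance: $\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \hat{\mu}]^2$
  – Average squared error in the mean representing the data
- Standard Deviation: $\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \hat{\mu}]^2}$
- Skewness: $\frac{\sum_{i=1}^{n} (x(i) - \hat{\mu})^3}{\left( \sum_{i=1}^{n} (x(i) - \hat{\mu})^2 \right)^{3/2}}$
  – Measures how much the data is one-sided (single long tail)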
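As a rough illustration, the dispersion and skewness formulas above translate directly into code. This sketch follows the slide's definitions (n - 1 in the variance, and the unnormalized ratio for skewness); the two sample lists are made-up values chosen to show a symmetric and a right-tailed case.

```python
import math

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / (n - 1)

def sample_std(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))

def skewness(data):
    """Sum of cubed deviations over (sum of squared deviations) ** 1.5."""
    mu = sum(data) / len(data)
    numerator = sum((x - mu) ** 3 for x in data)
    denominator = sum((x - mu) ** 2 for x in data) ** 1.5
    return numerator / denominator

# Hypothetical samples, for illustration only.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 2.0, 2.0, 3.0, 20.0]

print("variance:", sample_variance(symmetric), "std:", sample_std(symmetric))
print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-tailed):", skewness(right_tailed))
```

A symmetric sample gives a skewness of zero, while a single long right tail gives a positive value.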

2. Tools for Displaying Single Variables
- Basic display for univariate data is the histogram
  – Number of values of the variable that lie in consecutive intervals
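The counting step behind a histogram can be sketched in a few lines. This is a minimal version assuming equal-width intervals; the function name `histogram_counts` and the data are hypothetical.

```python
def histogram_counts(data, low, high, bins):
    """Count the values that fall in `bins` consecutive equal-width intervals on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for x in data:
        if low <= x <= high:
            # A value equal to the top edge is assigned to the last interval.
            k = min(int((x - low) / width), bins - 1)
            counts[k] += 1
    return counts

# Hypothetical univariate data, for illustration only.
data = [0, 0, 1, 3, 3, 3, 7, 7, 9, 10]
low, high, bins = 0, 10, 5
counts = histogram_counts(data, low, high, bins)

# Text rendering: one row per interval, one '#' per value.
width = (high - low) / bins
for k, c in enumerate(counts):
    left = low + k * width
    print(f"[{left:4.1f}, {left + width:4.1f})  " + "#" * c)
```

The choice of interval width matters: wide intervals hide structure, and narrow ones produce a noisy display.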

[Figure: Histogram of supermarket credit card usage; horizontal axis: weeks]
- Many did not use it at all
- These used it every week except holidays

[Figure: Histogram of diastolic blood pressure of individuals (UCI ML archive)]
- Zero BP means the data is missing

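Smoothing estimates
- Kernel function K
- Estimated density at point x is
  $\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left( \frac{x - x(i)}{h} \right)$
- Gaussian kernel with std dev h:
  $K(t, h) = C e^{-\frac{1}{2} \left( \frac{t}{h} \right)^2}$, where $t = x - x(i)$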
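A kernel estimate of this form can be sketched as follows. The code assumes that the constant C is the usual Gaussian normalizer 1 / (h * sqrt(2 * pi)), so that the estimate integrates to one; the slide does not spell out C, and the data values are made up.

```python
import math

def gaussian_kernel(t, h):
    """K(t, h) = C * exp(-0.5 * (t / h) ** 2), with C assumed to be 1 / (h * sqrt(2 * pi))."""
    c = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return c * math.exp(-0.5 * (t / h) ** 2)

def kernel_density(x, data, h):
    """Estimated density at x: the average of kernels centred on each data point x(i)."""
    return sum(gaussian_kernel(x - xi, h) for xi in data) / len(data)

# Hypothetical sample and bandwidth, for illustration only.
data = [1.0, 2.0, 2.5, 6.0]
h = 0.5

dense = kernel_density(2.0, data, h)    # close to three of the data points
sparse = kernel_density(4.0, data, h)   # in the gap between 2.5 and 6.0

# Riemann-sum check that the estimate integrates to roughly 1.
step = 0.01
area = sum(kernel_density(-10.0 + k * step, data, h) for k in range(3000)) * step

print("density at 2.0:", dense, "density at 4.0:", sparse, "area:", area)
```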

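Kernel Estimates with different values of h
[Figure: kernel density estimates of the same data for increasing h]
- Small values of h lead to spiky estimates
- Larger values of h give higher smoothing
- Data is right skewed with a hint of multimodality
- $K(t, h) = C e^{-\frac{1}{2} \left( \frac{t}{h} \right)^2}$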
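The effect of h can be made concrete by counting the local maxima of the estimate on a grid. This is only a sketch: the data, the grid, and the two bandwidths are arbitrary choices, and the Gaussian normalizer is assumed as 1 / (h * sqrt(2 * pi)).

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    c = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return sum(c * math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / len(data)

def count_modes(data, h, lo, hi, steps=2000):
    """Count strict local maxima of the estimate on an evenly spaced grid over [lo, hi]."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    ys = [kde(x, data, h) for x in xs]
    return sum(1 for k in range(1, steps) if ys[k - 1] < ys[k] > ys[k + 1])

# Hypothetical data with two loose clusters, for illustration only.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

spiky = count_modes(data, h=0.1, lo=-5.0, hi=20.0)    # small h: a bump per point
smooth = count_modes(data, h=5.0, lo=-5.0, hi=20.0)   # large h: bumps merge

print("modes with h=0.1:", spiky)
print("modes with h=5.0:", smooth)
```

A very small h shows every data point as its own spike, while a very large h can smooth away real multimodality, so in practice several values are tried.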

3. Tools for Displaying Relationship between two variables
- Box Plots
- Scatter Plots
- Contour Plots
- Time as one of the two variables

Box Plot
[Figure: box plot with the following parts labelled]
- 1.5 times inter-quartile range
- Upper Quartile
- Median
- Lower Quartile
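The quantities labelled on a box plot can be computed as below. This sketch assumes the common convention that whiskers extend to the most extreme data values within 1.5 times the inter-quartile range of the box, with points beyond that drawn individually; the function name and the data are hypothetical, and quartile conventions differ slightly between packages.

```python
import statistics

def box_plot_summary(data, whisker=1.5):
    """Quartiles, whisker ends, and outlying points for a box plot."""
    lower_q, median, upper_q = statistics.quantiles(data, n=4)
    iqr = upper_q - lower_q                      # inter-quartile range
    low_fence = lower_q - whisker * iqr
    high_fence = upper_q + whisker * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
        "whisker_low": min(inside),              # most extreme values within the fences
        "whisker_high": max(inside),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

# Hypothetical sample with one extreme value, for illustration only.
data = [2.0, 3.0, 3.0, 5.0, 7.0, 8.0, 30.0]
summary = box_plot_summary(data)
for name, value in summary.items():
    print(name, value)
```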


Scatterplot
[Figure: scatterplot of credit card repayment data]
- Highly correlated data
- Significant number depart from pattern: worth investigating
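Both observations on this slide have simple numerical counterparts: a correlation coefficient for "highly correlated", and residuals from a fitted line for "depart from pattern". The sketch below uses Pearson correlation and a least-squares line; the threshold of twice the root-mean-square residual is an arbitrary assumption, and the data are made up rather than taken from the credit card example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def departures(xs, ys, factor=2.0):
    """Indices of points whose residual from the least-squares line exceeds factor * RMS residual."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    rms = math.sqrt(sum(r * r for r in residuals) / n)
    return [i for i, r in enumerate(residuals) if abs(r) > factor * rms]

# Hypothetical data: y equals x except for one point that departs from the pattern.
xs = [float(x) for x in range(10)]
ys = list(xs)
ys[4] = 12.0

r = pearson_r(xs, ys)
flagged = departures(xs, ys)
print("correlation:", r, "points worth investigating:", flagged)
```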

Scatterplot Disadvantages
1. With a large number of data points, reveals little structure
2. Can conceal overprinting (many points plotted at the same location), which can be significant for multimodal data
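Overprinting is easy to quantify before plotting. In this sketch, which uses made-up points, a scatterplot would show only as many marks as there are distinct locations, hiding the fact that two of them each carry ten observations.

```python
from collections import Counter

# Hypothetical bivariate data with heavy repetition at two locations.
points = [(3, 3)] * 10 + [(7, 7)] * 10 + [(1, 2), (5, 5), (9, 8)]

location_counts = Counter(points)
distinct = len(location_counts)       # marks visible on a plain scatterplot
hidden = len(points) - distinct       # observations drawn on top of another mark

print("observations:", len(points))
print("distinct locations:", distinct)
print("hidden by overprinting:", hidden)
print("heaviest locations:", location_counts.most_common(2))
```

Common remedies, not covered on the slide, include jittering the points or scaling the mark size by the count at each location.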

Contour Plot
1. Overcomes some scatterplot problems
   – Unimodality can be seen: not apparent in scatterplot
2. Requires a 2-D density estimate to be constructed with a 2-D kernel
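A 2-D density estimate of the kind a contour plot needs can be sketched with a product of two Gaussian kernels. The equal bandwidth h in both directions, the grid, and the data points are assumptions made for illustration; a plotting library would then draw contour lines through the resulting grid of values.

```python
import math

def kde_2d(x, y, points, h):
    """2-D Gaussian kernel density estimate at (x, y), same bandwidth h on both axes."""
    c = 1.0 / (2.0 * math.pi * h * h)
    total = 0.0
    for px, py in points:
        total += c * math.exp(-0.5 * (((x - px) / h) ** 2 + ((y - py) / h) ** 2))
    return total / len(points)

def density_grid(points, h, lo, hi, steps):
    """Evaluate the estimate at the cell centres of a steps x steps grid over [lo, hi]^2."""
    cell = (hi - lo) / steps
    centres = [lo + (k + 0.5) * cell for k in range(steps)]
    return [[kde_2d(x, y, points, h) for x in centres] for y in centres], cell

# Hypothetical cluster of points, for illustration only.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
grid, cell = density_grid(points, h=0.5, lo=-3.0, hi=5.0, steps=80)

mass = sum(sum(row) for row in grid) * cell * cell   # should be close to 1
peak = kde_2d(1.0, 1.0, points, 0.5)
far = kde_2d(4.0, 4.0, points, 0.5)
print("total mass:", mass, "density at cluster:", peak, "density far away:", far)
```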

Display when one of the variables is time
[Figure: time series plots with the following annotations]
- Annual fees introduced
- Time axis: Jan 1963 to Dec 1970
- Peaks in early and late summer and around new year

Tools for Displaying More than Two Variables
- Scatter plots for all pairs of variables
- Trellis Plot
- Parallel Coordinates Plot

More than two variables
- Sheets of paper and computer screens are fine for two variables
- Need projections from higher-dimensional data to a 2-D plane
- Methods
  – Examine all pairs of variables
    - Scatterplot matrix
    - Trellis plot
    - Icons

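CPU performance: Scatter Plot Matrix
[Figure: scatter plot matrix of the 209 CPU data, with some panels annotated "Independent" and others "Correlated"]
Variables:
- Cycle Time
- Minimum Memory
- Maximum Memory
- Cache Size (Kb)
- Minimum Channels
- Maximum Channels
- Relative Performance
- Estimated rel perf (wrt IBM)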
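A scatter plot matrix needs one panel per unordered pair of variables, so the number of panels grows quadratically. The sketch below enumerates the pairs for the eight variables listed above, assuming all eight are included in the matrix.

```python
from itertools import combinations

# Variable names as listed on the slide.
variables = [
    "Cycle Time",
    "Minimum Memory",
    "Maximum Memory",
    "Cache Size (Kb)",
    "Minimum Channels",
    "Maximum Channels",
    "Relative Performance",
    "Estimated rel perf (wrt IBM)",
]

# Each unordered pair of variables gets its own scatter plot panel.
pairs = list(combinations(variables, 2))

p = len(variables)
print("variables:", p, "distinct panels:", len(pairs))   # p * (p - 1) / 2 panels
for x_name, y_name in pairs[:3]:
    print("panel:", x_name, "vs", y_name)
```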

Disadvantage of Scatter Plot Matrices
- Scatter plot matrices are multiple bivariate solutions (2-D projections)
- Not a multivariate solution
- Such projections sacrifice information
- Example: 3 variables, 8 cubes, alternately empty and full
  – Each 1-D and 2-D projection is uniformly distributed!
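The cube example can be reproduced numerically. This sketch splits the unit cube into 8 sub-cubes, fills them in a checkerboard pattern with a regular grid of points, and then counts points per cell in the 3-D, 2-D, and 1-D views; the grid resolution and helper names are arbitrary assumptions.

```python
from collections import Counter
from itertools import product

k = 10                                              # grid points per axis
coords = [(i + 0.5) / k for i in range(k)]

def half(v):
    """0 for the lower half of an axis, 1 for the upper half."""
    return int(v >= 0.5)

# Keep only the points in "full" cubes: those whose half-indices sum to an even number.
points = [
    (x, y, z)
    for x, y, z in product(coords, repeat=3)
    if (half(x) + half(y) + half(z)) % 2 == 0
]

# 3-D view: only 4 of the 8 cubes contain points.
cubes = Counter((half(x), half(y), half(z)) for x, y, z in points)

# 2-D projections: every quadrant of every pair of axes holds the same number of points.
quadrants = {
    (a, b): Counter((half(p[a]), half(p[b])) for p in points)
    for a, b in [(0, 1), (0, 2), (1, 2)]
}

# 1-D projections: each half of each axis holds the same number of points.
halves = {a: Counter(half(p[a]) for p in points) for a in range(3)}

print("occupied cubes:", len(cubes), "of 8")
print("quadrant counts:", {pair: sorted(c.values()) for pair, c in quadrants.items()})
print("half counts:", {axis: sorted(c.values()) for axis, c in halves.items()})
```

No single panel of a scatter plot matrix would reveal the checkerboard structure, because that structure exists only in the joint distribution of all three variables.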

Trellis Plot
- Rather than displaying a scatter plot for each pair of variables
- Fix a particular pair of variables and produce a series of scatter plots, histograms, time series plots, contour plots, etc., one for each level of the other (conditioning) variables

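Trellis Plot (with scatterplots)
[Figure: grid of scatterplot panels, each with a best fit line]
- Panels by sex (Male, Female) and age (Younger, Older)
- Axes: epileptic seizures in 2 week period; epileptic seizures in later 2 week period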
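The data preparation behind a trellis plot is a grouping step: the plotted pair of variables stays fixed, and the records are split by the conditioning variables, one panel per combination. The sketch below uses made-up records with hypothetical field names (`sex`, `age`, `first`, `later`); it is not the epilepsy data shown on the slide.

```python
from collections import defaultdict

def trellis_panels(records, by, x, y):
    """Group (x, y) pairs into panels keyed by the values of the conditioning variables."""
    panels = defaultdict(list)
    for record in records:
        key = tuple(record[name] for name in by)
        panels[key].append((record[x], record[y]))
    return dict(panels)

# Hypothetical records: seizure counts in a 2-week period and in a later 2-week period.
records = [
    {"sex": "Male", "age": "Younger", "first": 5, "later": 3},
    {"sex": "Male", "age": "Older", "first": 11, "later": 14},
    {"sex": "Female", "age": "Younger", "first": 2, "later": 4},
    {"sex": "Female", "age": "Older", "first": 8, "later": 7},
    {"sex": "Male", "age": "Younger", "first": 6, "later": 6},
]

panels = trellis_panels(records, by=("sex", "age"), x="first", y="later")
for key, xy in sorted(panels.items()):
    print(key, xy)
```

Each entry of `panels` would then be drawn as its own scatterplot, with all panels sharing the same axes so that they can be compared directly.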

Icon Plot
- Star Plot: each direction corresponds to a variable; length corresponds to a value
[Figure: star plots for 53 samples of minerals, 12 chemical properties]
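The geometry of a single star icon can be sketched as follows: variable k is assigned the direction 2 * pi * k / p, and the value sets the length along that direction. Scaling each value by a per-variable maximum is an assumption made here so that all rays fit in a unit circle; the sample values are made up.

```python
import math

def star_vertices(values, max_values):
    """Vertices of a star icon: one ray per variable, length proportional to the value."""
    p = len(values)
    vertices = []
    for k, (value, max_value) in enumerate(zip(values, max_values)):
        angle = 2.0 * math.pi * k / p      # direction assigned to variable k
        radius = value / max_value         # length scaled to [0, 1]
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices

# Hypothetical sample with 12 properties, each at half of its maximum.
values = [6.0] * 12
max_values = [12.0] * 12
vertices = star_vertices(values, max_values)

for k, (vx, vy) in enumerate(vertices[:4]):
    print(f"variable {k}: ({vx:.3f}, {vy:.3f})")
```

Joining the vertices in order gives the outline of the star; one such icon is drawn per sample.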

Parallel Coordinates Plot
- Each path represents an individual
- Each count represents a 2-week period
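In a parallel coordinates plot each variable gets its own vertical axis, and each individual becomes a path connecting its values across the axes. The sketch below builds those paths, assuming per-axis min-max scaling (a common choice, not stated on the slide) and made-up counts for three individuals over four 2-week periods.

```python
def parallel_coordinate_paths(rows):
    """Turn each row into a path of (axis index, value scaled to [0, 1] on that axis)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    paths = []
    for row in rows:
        path = []
        for axis, (value, low, high) in enumerate(zip(row, lows, highs)):
            # A constant column has no spread; place it mid-axis.
            scaled = 0.5 if high == low else (value - low) / (high - low)
            path.append((axis, scaled))
        paths.append(path)
    return paths

# Hypothetical counts: one row per individual, one column per 2-week period.
rows = [
    (5, 3, 3, 3),
    (11, 14, 9, 8),
    (0, 4, 0, 5),
]

paths = parallel_coordinate_paths(rows)
for path in paths:
    print(path)
```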

