Python Programming Pandas


Python programming — Pandas
Finn Årup Nielsen
DTU Compute, Technical University of Denmark
October 5, 2013

Overview

Pandas?
Reading data
Summary statistics
Indexing
Merging, joining
Group-by and cross-tabulation
Statistical modeling

Pandas?

"Python Data Analysis Library"

Young library for data analysis.

Homepage: http://pandas.pydata.org/

The main author, Wes McKinney, has written a 2012 book (McKinney, 2012).

Why Pandas?

A better NumPy: keeps track of variable names, better indexing, easier linear modeling (see the sketch below).

A better R: access to a more general programming language.

Why not Pandas?

R: still the primary language for statisticians, which means most advanced tools are there.

NaN/NA (not a number / not available).

Support for third-party algorithms compared to NumPy? Numexpr? (NumExpr support in 0.11.)
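To make the "better NumPy" point concrete, here is a minimal sketch (the column names, row labels and numbers are made up for illustration) contrasting a plain NumPy array with a DataFrame that carries its own column and row labels:

import numpy as np
import pandas as pd

# Plain NumPy: rows and columns are anonymous positions.
data = np.array([[1.70, 65.0], [1.80, 80.0]])
print(data[:, 1].mean())           # which column was that again?

# Pandas: the same numbers with named columns and a labeled index.
df = pd.DataFrame(data, columns=["height", "weight"], index=["anna", "bob"])
print(df["weight"].mean())         # the intention is readable
print(df["height"]["bob"])         # access by label instead of position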

Get some data from R

Get a standard dataset, Pima, from R:

> library(MASS)
> write.csv(Pima.te, "pima.csv")

pima.csv now contains comma-separated values (lines ending like ...0.587,51,"Yes").

Read data with Pandas

Back in Python:

>>> import pandas as pd
>>> pima = pd.read_csv("pima.csv")

"pima" is now what Pandas calls a DataFrame object. This object keeps track of both the data (numerical as well as text) and the column and row headers.

Let us use the first column as the index column:

>>> import pandas as pd
>>> pima = pd.read_csv("pima.csv", index_col=0)

Summary statistics

>>> pima.describe()

[describe() output: for each numeric column ("Unnamed: 0", npreg, glu, ...) the count (332), mean, std, min, quartiles and max.]

... Summary statistics

Other summary statistics (McKinney, 2012, around page 101):

pima.count()   Count the number of rows
pima.mean(), pima.median(), pima.quantile()
pima.std(), pima.var()
pima.min(), pima.max()

Operation across columns instead, e.g., with the mean method (see the sketch below):

pima.mean(axis=1)
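A small sketch (toy data, not the Pima set) of the default column-wise behaviour of these aggregation methods versus axis=1:

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

print(df.mean())          # per column: a -> 2.0, b -> 20.0
print(df.mean(axis=1))    # per row: 5.5, 11.0, 16.5
print(df.describe())      # count, mean, std, min, quartiles, max per column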

Indexing the rows

For example, you can see the first two rows or the three last rows:

>>> pima[0:2]
   npreg  glu  bp  skin   bmi    ped  age type
1      6  148  72    35  33.6  0.627   50  Yes
2      1   85  66    29  26.6  0.351   31   No

>>> pima[-3:]
     npreg  glu  bp  skin   bmi    ped  age type
330     10  101  76    48  32.9  0.171   63   No
331      5  121  72    23  26.2  0.245   30   No
332      1   93  70    31  30.4  0.315   23   No

Notice that this is not an ordinary numerical matrix: we also get text (in the "type" column) within the "matrix"!

Indexing the columns

See a specific column, here 'bmi' (body mass index):

>>> pima["bmi"]
1      33.6
2      26.6
3      28.1
4      31.0
[here I cut out several lines]
330    32.9
331    26.2
332    30.4
Name: bmi, Length: 332

The returned object is a Pandas Series, another of the fundamental objects in the library:

>>> type(pima["bmi"])
<class 'pandas.core.series.Series'>

Conditional indexing

Get the fat people (those with BMI above 30):

>>> pima.shape
(332, 9)
>>> pima[pima["bmi"] > 30].shape
(210, 9)

See a histogram (with from pylab import *):

>>> pima["bmi"].hist()
>>> show()

Or a kernel density estimation plot (McKinney, 2012, p. 239):

>>> pima["bmi"].plot(kind="kde")
>>> show()

Plots

[Figure: histogram and kernel density estimate (KDE) of the "bmi" variable (body mass index) of the Pima data set.]

Row and column conditional indexing

Example by David Marx in R:

A <- runif(10)
B <- runif(10)
C <- runif(10)
D <- runif(10)
E <- runif(10)
df <- data.frame(A, B, C, D, E)
sliced_df <- df[, df[1,] < .5]

That is, select the columns of a dataframe where the value in the first row is below 0.5. Here with a 10-by-5 dataset of uniformly distributed random numbers and columns indexed by letters.

... Row and column conditional indexing

Equivalent in Python:

import pandas as pd
from pylab import *
df = pd.DataFrame(rand(10, 5), columns=["A", "B", "C", "D", "E"])
df.ix[:, df.ix[0, :] < 0.5]

These variations do not work:

df[:, df[0] < 0.5]
df[:, df[:1] < 0.5]
df.ix[:, df[:1] < 0.5]
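In pandas 0.11 and later the same selection can also be written with the .loc/.iloc indexers; a sketch under that assumption, reusing the made-up letter columns from above:

import pandas as pd
from numpy.random import rand

df = pd.DataFrame(rand(10, 5), columns=list("ABCDE"))

# df.iloc[0] is the first row as a Series indexed by the column labels;
# the resulting boolean Series then selects columns through .loc.
sliced = df.loc[:, df.iloc[0] < 0.5]
print(sliced.columns)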

Constructing a DataFrame

Constructing a DataFrame from a dictionary, where the keys become the column names:

>>> import pandas as pd
>>> import string
>>> spam_corpus = map(string.split, ["buy viagra", "buy antibody"])
>>> unique_words = set([word for doc in spam_corpus for word in doc])
>>> word_counts = [(word, map(lambda doc: doc.count(word), spam_corpus))
...                for word in unique_words]
>>> spam_bag_of_words = pd.DataFrame(dict(word_counts))
>>> print(spam_bag_of_words)
   antibody  buy  viagra
0         0    1       1
1         1    1       0

Concatenation

Another corpus, and then concatenation with the previous dataset:

>>> other_corpus = map(string.split, ["buy time", "hello"])
>>> unique_words = set([word for doc in other_corpus for word in doc])
>>> word_counts = [(word, map(lambda doc: doc.count(word), other_corpus))
...                for word in unique_words]
>>> other_bag_of_words = pd.DataFrame(dict(word_counts))
>>> print(other_bag_of_words)
   buy  hello  time
0    1      0     1
1    0      1     0

>>> pd.concat([spam_bag_of_words, other_bag_of_words], ignore_index=True)
   antibody  buy  hello  time  viagra
0         0    1    NaN   NaN       1
1         1    1    NaN   NaN       0
2       NaN    1      0     1     NaN
3       NaN    0      1     0     NaN

Filling in missing data

(McKinney, 2012, page 145)

>>> pd.concat([spam_bag_of_words, other_bag_of_words], ignore_index=True)
   antibody  buy  hello  time  viagra
0         0    1    NaN   NaN       1
1         1    1    NaN   NaN       0
2       NaN    1      0     1     NaN
3       NaN    0      1     0     NaN

>>> pd.concat([spam_bag_of_words, other_bag_of_words], ignore_index=True).fillna(0)
   antibody  buy  hello  time  viagra
0         0    1      0     0       1
1         1    1      0     0       0
2         0    1      0     1       0
3         0    0      1     0       0

Combining datasets

See the Pandas documentation on merging for other Pandas operations (a small sketch of two of them follows below):

concat
join
merge
combine_first
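A minimal sketch (made-up toy frames) of two of these operations: merge for an SQL-style join on a key column, and combine_first for patching missing values from a second frame:

import numpy as np
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# SQL-style outer join on the shared "key" column.
print(pd.merge(left, right, on="key", how="outer"))

# combine_first: take values from the caller, fall back to the argument.
a = pd.DataFrame({"v": [1.0, np.nan, 3.0]})
b = pd.DataFrame({"v": [10.0, 20.0, 30.0]})
print(a.combine_first(b))      # -> 1.0, 20.0, 3.0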

Join example

Two data sets with partially overlapping rows (as not all students answer each questionnaire), where the columns should be concatenated (i.e., scores for the individual questionnaires):

import pandas as pd

xl = pd.ExcelFile("E13_1_Resultater-2013-10-02.xlsx")
df1 = xl.parse("Resultater", index_col=[0, 1, 2], header=3)
df1.columns = map(lambda colname: unicode(colname) + " 1", df1.columns)

xl = pd.ExcelFile("E13_2_Resultater-2013-10-02.xlsx")
df2 = xl.parse("Resultater", index_col=[0, 1, 2], header=3)
df2.columns = map(lambda colname: unicode(colname) + " 2", df2.columns)

df = pd.DataFrame().join([df1, df2], how="outer")
df[["Score 1", "Score 2"]].corr()   # Score correlation

Processing after join

>>> df.ix[:5, ["Score 1", "Score 2"]]

[Output (edited): rows indexed by (Bruger, Fornavn, Efternavn) for users such as "(faan)", "s06..." and "s07...", with the two columns Score 1 and Score 2; values such as 0.709000, 0.741800 and 0.569666, and NaN where a score is missing.]

Note that the second user ("s06.") did not solve the second assignment. The joining operation by default inserts a NaN for the missing element, indicating a missing value (not available, NA). A small sketch of handling such NaNs follows below.
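A small sketch (toy frame with made-up scores) of how such join-introduced NaNs can be inspected or removed:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Score 1": [0.74, 0.57, 0.71],
                   "Score 2": [0.80, np.nan, 0.65]})

print(df["Score 2"].isnull().sum())   # how many rows are missing the second score
print(df.dropna())                    # keep only rows present in both
print(df.corr())                      # correlation; missing pairs are excluded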

The Groupby

Groupby method (McKinney, 2012, chapter 9): splits the dataset based on a key, e.g., a DataFrame column name.

Think of SQL's GROUP BY.

Example with the Pima Indian data set, splitting on the 'type' column (elements are "Yes" and "No") and taking the mean in each of the two groups:

>>> pima.groupby("type").mean()
         npreg         glu  ...
type
No    2.932735  108.188341  ...
Yes   4.614679  141.908257  ...

The returned object from groupby is a DataFrameGroupBy object, while the mean method on that object/class returns a DataFrame object.

... The Groupby

More elaborate, with two aggregating methods:

>>> grouped_by_type = pima.groupby("type")
>>> grouped_by_type.agg([np.mean, np.std])
         npreg                   glu             ...
          mean       std        mean        std  ...
type
No    2.932735  2.781852  108.188341  22.645932  ...
Yes   4.614679  3.901349  141.908257        ...  ...

... The Groupby

Without groupby, checking the mean (32.889908) and std (9.065951) of 'skin' for 'Yes':

>>> np.mean(pima[pima["type"] == "Yes"]["skin"])
32.889908256880737                 # Correct
>>> np.std(pima[pima["type"] == "Yes"]["skin"])
9.0242684519300891                 # ?
>>> import scipy.stats
>>> scipy.stats.nanstd(pima[pima["type"] == "Yes"]["skin"])
9.065951207005341                  # Ok
>>> np.std(pima[pima["type"] == "Yes"]["skin"], ddof=1)
9.065951207005341                  # Degrees of freedom!

NumPy's std is the biased estimate, while Pandas' std is the unbiased estimate (see the check below).
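For comparison, Pandas' own std method gives the unbiased (ddof=1) value directly; a short check, assuming the pima frame loaded earlier in the slides:

import pandas as pd

pima = pd.read_csv("pima.csv", index_col=0)        # as loaded earlier
# Series.std() uses ddof=1 by default, so it matches the unbiased value above.
print(pima[pima["type"] == "Yes"]["skin"].std())   # expected about 9.0660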

Cross-tabulation

For categorical variables, select two columns and generate a matrix with counts of occurrences (McKinney, 2012, p. 277):

>>> pd.crosstab(pima.type, pima.npreg)
npreg   0   1   2   3   4   5   6  7  8  ...
type
No     34  56  38  23  19  13  14  9  5  ...
Yes     1   5  15  11  15   6   7  4  8  ...

Remember:

>>> pima[1:4]
   npreg  glu  bp  skin   bmi    ped  age type
2      1   85  66    29  26.6  0.351   31   No
3      1   89  66    23  28.1  0.167   21   No
4      3   78  50    32  31.0  0.248   26  Yes

Cross-tabulation plot

# Wrong ordering
pd.crosstab(pima.type, pima.npreg).plot(kind="bar")

Cross-tabulation plot

# Transpose
pd.crosstab(pima.type, pima.npreg).T.plot(kind="bar")

Cross-tabulation plot

# Or better:
pd.crosstab(pima.npreg, pima.type).plot(kind="bar")

Other Pandas capabilities

Hierarchical indexing (McKinney, 2012, page 147), sketched below
Missing data support (McKinney, 2012, page 142)
Pivoting (McKinney, 2012, chapter 9)
Time series (McKinney, 2012, chapter 10)
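A minimal sketch (made-up data) of two of these capabilities: a hierarchical (MultiIndex) row index, and a daily time series resampled to monthly means. Note the resample call below uses the newer .resample(...).mean() spelling; the pandas version of these slides used resample("M", how="mean") instead.

import numpy as np
import pandas as pd

# Hierarchical indexing: a two-level row index built from tuples.
idx = pd.MultiIndex.from_tuples([("DK", 2012), ("DK", 2013),
                                 ("SE", 2012), ("SE", 2013)])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
print(s["DK"])                      # select a whole block by the outer level

# Time series: a daily series resampled to monthly means.
dates = pd.date_range("2013-01-01", periods=60, freq="D")
ts = pd.Series(np.arange(60.0), index=dates)
print(ts.resample("M").mean())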

Statistical modeling with statsmodels

Example with the Longley dataset.

Ordinary least squares fitting of a dependent variable "TOTEMP" (Total Employment) from 6 independent variables:

import statsmodels.api as sm

# For 'load_pandas' you need a recent statsmodels
data = sm.datasets.longley.load_pandas()

# Endogenous (response/dependent) & exogenous variables (design matrix)
y, x = data.endog, data.exog

result = sm.OLS(y, x).fit()   # OLS: ordinary least squares
result.summary()              # Print summary

OLS Regression Results

Dep. Variable:     TOTEMP            R-squared:       0.988
Model:             OLS               Adj. R-squared:  0.982
Method:            Least Squares     F-statistic:     161.9
Date:              Mon, 17 Jun 2013  Log-Likelihood:  -117.56
No. Observations:  16                AIC:             247.1
Df Residuals:      10                BIC:             251.8
Df Model:          5

[Coefficient table (coef, std err, t, P>|t|, 95% confidence intervals) and diagnostics (Jarque-Bera, Cond. No. 4.56e+05) omitted.]
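Note that data.exog as loaded contains only the six predictors and no constant column; to fit a model with an explicit intercept one would typically add a constant first, e.g. with add_constant. A hedged sketch of that variant:

import statsmodels.api as sm

data = sm.datasets.longley.load_pandas()
y, x = data.endog, data.exog

# Add a column of ones so that the model includes an intercept term.
x_const = sm.add_constant(x)
result = sm.OLS(y, x_const).fit()
print(result.summary())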

Statsmodels 0.5

"Minimal example" from the statsmodels documentation:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

url = ".../HistData/Guerry.csv"
dat = pd.read_csv(url)
results = smf.ols("Lottery ~ Literacy + np.log(Pop1831)", data=dat).fit()
results.summary()

Note: 1) loading of data from a URL, 2) import statsmodels.formula.api (possible in statsmodels 0.5), 3) R-like specification of the linear model formula (from patsy).

More information

http://pandas.pydata.org/

The canonical book "Python for Data Analysis" (McKinney, 2012).

"Will it Python?": porting R projects to Python, exemplified through scripts from Machine Learning for Hackers (MLFH) by Drew Conway and John Myles White.

Summary

Pandas helps you represent your data (both numerical and categorical) and helps you keep track of what it refers to (by column and row name).

Pandas makes indexing easy.

Pandas has some basic statistics and plotting facilities.

Pandas may work more or less seamlessly with standard statistical models (e.g., the general linear model with OLS estimation).

Watch out: Pandas is still below version 1 numbering!

Standard packaging is not up to date: the newest version of Pandas is 0.11.0, while, e.g., Ubuntu LTS 12.04 ships 0.7.0:

sudo pip install --upgrade pandas

The latest pip version of statsmodels is 0.4.3; the development version is 0.5, with statsmodels.formula.api, which yields more R-like linear modeling.

References

McKinney, W. (2012). Python for Data Analysis. O'Reilly, Sebastopol, California, first edition. ISBN 9781449319793.

