Python Programming Pandas


Python programming — Pandas
Finn Årup Nielsen
DTU Compute, Technical University of Denmark
October 5, 2013

Overview

Pandas?
Reading data
Summary statistics
Indexing
Merging, joining
Group-by and cross-tabulation
Statistical modeling

Pandas?

"Python Data Analysis Library"

Young library for data analysis.

Homepage: http://pandas.pydata.org/

The main author, Wes McKinney, has written a 2012 book (McKinney, 2012).

Why Pandas?

A better NumPy: keeps track of variable names, better indexing, easier linear modeling (see the sketch below).

A better R: access to a more general programming language.

Why not Pandas?

R: still the primary language for statisticians, which means most advanced tools are there.

NaN/NA (not a number / not available).

Support for third-party algorithms compared to NumPy? Numexpr? (NumExpr support in 0.11.)
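To make the "better NumPy" point concrete, here is a minimal sketch (the column names, row labels and numbers are made up for illustration) contrasting a plain NumPy array with a DataFrame that carries its own column and row labels:

import numpy as np
import pandas as pd

# Plain NumPy: rows and columns are anonymous positions.
data = np.array([[1.70, 65.0], [1.80, 80.0]])
print(data[:, 1].mean())           # which column was that again?

# Pandas: the same numbers with named columns and a labeled index.
df = pd.DataFrame(data, columns=["height", "weight"], index=["anna", "bob"])
print(df["weight"].mean())         # the intention is readable
print(df["height"]["bob"])         # access by label instead of position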

Get some data from R

Get a standard dataset, Pima, from R:

> library(MASS)
> write.csv(Pima.te, "pima.csv")

pima.csv now contains comma-separated values (lines ending like ...0.587,51,"Yes").

Read data with Pandas

Back in Python:

>>> import pandas as pd
>>> pima = pd.read_csv("pima.csv")

"pima" is now what Pandas calls a DataFrame object. This object keeps track of both the data (numerical as well as text) and the column and row headers.

Let us use the first column as the index column:

>>> import pandas as pd
>>> pima = pd.read_csv("pima.csv", index_col=0)

Summary statistics

>>> pima.describe()

[describe() output: for each numeric column ("Unnamed: 0", npreg, glu, ...) the count (332), mean, std, min, quartiles and max.]

... Summary statistics

Other summary statistics (McKinney, 2012, around page 101):

pima.count()   Count the number of rows
pima.mean(), pima.median(), pima.quantile()
pima.std(), pima.var()
pima.min(), pima.max()

Operation across columns instead, e.g., with the mean method (see the sketch below):

pima.mean(axis=1)
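A small sketch (toy data, not the Pima set) of the default column-wise behaviour of these aggregation methods versus axis=1:

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

print(df.mean())          # per column: a -> 2.0, b -> 20.0
print(df.mean(axis=1))    # per row: 5.5, 11.0, 16.5
print(df.describe())      # count, mean, std, min, quartiles, max per column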

Indexing the rows

For example, you can see the first two rows or the three last rows:

>>> pima[0:2]
   npreg  glu  bp  skin   bmi    ped  age type
1      6  148  72    35  33.6  0.627   50  Yes
2      1   85  66    29  26.6  0.351   31   No

>>> pima[-3:]
     npreg  glu  bp  skin   bmi    ped  age type
330     10  101  76    48  32.9  0.171   63   No
331      5  121  72    23  26.2  0.245   30   No
332      1   93  70    31  30.4  0.315   23   No

Notice that this is not an ordinary numerical matrix: we also get text (in the "type" column) within the "matrix"!

Indexing the columns

See a specific column, here 'bmi' (body mass index):

>>> pima["bmi"]
1      33.6
2      26.6
3      28.1
4      31.0
[here I cut out several lines]
330    32.9
331    26.2
332    30.4
Name: bmi, Length: 332

The returned object is a Pandas Series, another of the fundamental objects in the library:

>>> type(pima["bmi"])
<class 'pandas.core.series.Series'>

Conditional indexing

Get the fat people (those with BMI above 30):

>>> pima.shape
(332, 9)
>>> pima[pima["bmi"] > 30].shape
(210, 9)

See a histogram (with from pylab import *):

>>> pima["bmi"].hist()
>>> show()

Or a kernel density estimation plot (McKinney, 2012, p. 239):

>>> pima["bmi"].plot(kind="kde")
>>> show()

Plots

[Figure: histogram and kernel density estimate (KDE) of the "bmi" variable (body mass index) of the Pima data set.]

Row and column conditional indexing

Example by David Marx in R:

A <- runif(10)
B <- runif(10)
C <- runif(10)
D <- runif(10)
E <- runif(10)
df <- data.frame(A, B, C, D, E)
sliced_df <- df[, df[1,] < .5]

That is, select the columns of a dataframe where the value in the first row is below 0.5. Here with a 10-by-5 dataset of uniformly distributed random numbers and columns indexed by letters.

... Row and column conditional indexing

Equivalent in Python:

import pandas as pd
from pylab import *
df = pd.DataFrame(rand(10, 5), columns=["A", "B", "C", "D", "E"])
df.ix[:, df.ix[0, :] < 0.5]

These variations do not work:

df[:, df[0] < 0.5]
df[:, df[:1] < 0.5]
df.ix[:, df[:1] < 0.5]
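In pandas 0.11 and later the same selection can also be written with the .loc/.iloc indexers; a sketch under that assumption, reusing the made-up letter columns from above:

import pandas as pd
from numpy.random import rand

df = pd.DataFrame(rand(10, 5), columns=list("ABCDE"))

# df.iloc[0] is the first row as a Series indexed by the column labels;
# the resulting boolean Series then selects columns through .loc.
sliced = df.loc[:, df.iloc[0] < 0.5]
print(sliced.columns)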

Constructing a DataFrame

Constructing a DataFrame from a dictionary, where the keys become the column names:

>>> import pandas as pd
>>> import string
>>> spam_corpus = map(string.split, ["buy viagra", "buy antibody"])
>>> unique_words = set([word for doc in spam_corpus for word in doc])
>>> word_counts = [(word, map(lambda doc: doc.count(word), spam_corpus))
...                for word in unique_words]
>>> spam_bag_of_words = pd.DataFrame(dict(word_counts))
>>> print(spam_bag_of_words)
   antibody  buy  viagra
0         0    1       1
1         1    1       0

Concatenation

Another corpus, and then concatenation with the previous dataset:

>>> other_corpus = map(string.split, ["buy time", "hello"])
>>> unique_words = set([word for doc in other_corpus for word in doc])
>>> word_counts = [(word, map(lambda doc: doc.count(word), other_corpus))
...                for word in unique_words]
>>> other_bag_of_words = pd.DataFrame(dict(word_counts))
>>> print(other_bag_of_words)
   buy  hello  time
0    1      0     1
1    0      1     0

>>> pd.concat([spam_bag_of_words, other_bag_of_words], ignore_index=True)
   antibody  buy  hello  time  viagra
0         0    1    NaN   NaN       1
1         1    1    NaN   NaN       0
2       NaN    1      0     1     NaN
3       NaN    0      1     0     NaN

Filling in missing data

(McKinney, 2012, page 145)

>>> pd.concat([spam_bag_of_words, other_bag_of_words], ignore_index=True)
   antibody  buy  hello  time  viagra
0         0    1    NaN   NaN       1
1         1    1    NaN   NaN       0
2       NaN    1      0     1     NaN
3       NaN    0      1     0     NaN

>>> pd.concat([spam_bag_of_words, other_bag_of_words], ignore_index=True).fillna(0)
   antibody  buy  hello  time  viagra
0         0    1      0     0       1
1         1    1      0     0       0
2         0    1      0     1       0
3         0    0      1     0       0

Combining datasets

See the Pandas documentation on merging for other Pandas operations (a small sketch of two of them follows below):

concat
join
merge
combine_first
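A minimal sketch (made-up toy frames) of two of these operations: merge for an SQL-style join on a key column, and combine_first for patching missing values from a second frame:

import numpy as np
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# SQL-style outer join on the shared "key" column.
print(pd.merge(left, right, on="key", how="outer"))

# combine_first: take values from the caller, fall back to the argument.
a = pd.DataFrame({"v": [1.0, np.nan, 3.0]})
b = pd.DataFrame({"v": [10.0, 20.0, 30.0]})
print(a.combine_first(b))      # -> 1.0, 20.0, 3.0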

Join example

Two data sets with partially overlapping rows (as not all students answer each questionnaire), where the columns should be concatenated (i.e., scores for the individual questionnaires):

import pandas as pd

xl = pd.ExcelFile("E13_1_Resultater-2013-10-02.xlsx")
df1 = xl.parse("Resultater", index_col=[0, 1, 2], header=3)
df1.columns = map(lambda colname: unicode(colname) + " 1", df1.columns)

xl = pd.ExcelFile("E13_2_Resultater-2013-10-02.xlsx")
df2 = xl.parse("Resultater", index_col=[0, 1, 2], header=3)
df2.columns = map(lambda colname: unicode(colname) + " 2", df2.columns)

df = pd.DataFrame().join([df1, df2], how="outer")
df[["Score 1", "Score 2"]].corr()   # Score correlation

Processing after join

>>> df.ix[:5, ["Score 1", "Score 2"]]

[Output (edited): rows indexed by (Bruger, Fornavn, Efternavn) for users such as "(faan)", "s06..." and "s07...", with the two columns Score 1 and Score 2; values such as 0.709000, 0.741800 and 0.569666, and NaN where a score is missing.]

Note that the second user ("s06.") did not solve the second assignment. The joining operation by default inserts a NaN for the missing element, indicating a missing value (not available, NA). A small sketch of handling such NaNs follows below.
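A small sketch (toy frame with made-up scores) of how such join-introduced NaNs can be inspected or removed:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Score 1": [0.74, 0.57, 0.71],
                   "Score 2": [0.80, np.nan, 0.65]})

print(df["Score 2"].isnull().sum())   # how many rows are missing the second score
print(df.dropna())                    # keep only rows present in both
print(df.corr())                      # correlation; missing pairs are excluded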

The Groupby

Groupby method (McKinney, 2012, chapter 9): splits the dataset based on a key, e.g., a DataFrame column name.

Think of SQL's GROUP BY.

Example with the Pima Indian data set, splitting on the 'type' column (elements are "Yes" and "No") and taking the mean in each of the two groups:

>>> pima.groupby("type").mean()
         npreg         glu  ...
type
No    2.932735  108.188341  ...
Yes   4.614679  141.908257  ...

The returned object from groupby is a DataFrameGroupBy object, while the mean method on that object/class returns a DataFrame object.

... The Groupby

More elaborate, with two aggregating methods:

>>> grouped_by_type = pima.groupby("type")
>>> grouped_by_type.agg([np.mean, np.std])
         npreg                   glu             ...
          mean       std        mean        std  ...
type
No    2.932735  2.781852  108.188341  22.645932  ...
Yes   4.614679  3.901349  141.908257        ...  ...

... The Groupby

Without groupby, checking the mean (32.889908) and std (9.065951) of 'skin' for 'Yes':

>>> np.mean(pima[pima["type"] == "Yes"]["skin"])
32.889908256880737                 # Correct
>>> np.std(pima[pima["type"] == "Yes"]["skin"])
9.0242684519300891                 # ?
>>> import scipy.stats
>>> scipy.stats.nanstd(pima[pima["type"] == "Yes"]["skin"])
9.065951207005341                  # Ok
>>> np.std(pima[pima["type"] == "Yes"]["skin"], ddof=1)
9.065951207005341                  # Degrees of freedom!

NumPy's std is the biased estimate, while Pandas' std is the unbiased estimate (see the check below).
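For comparison, Pandas' own std method gives the unbiased (ddof=1) value directly; a short check, assuming the pima frame loaded earlier in the slides:

import pandas as pd

pima = pd.read_csv("pima.csv", index_col=0)        # as loaded earlier
# Series.std() uses ddof=1 by default, so it matches the unbiased value above.
print(pima[pima["type"] == "Yes"]["skin"].std())   # expected about 9.0660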

Cross-tabulation

For categorical variables, select two columns and generate a matrix with counts of occurrences (McKinney, 2012, p. 277):

>>> pd.crosstab(pima.type, pima.npreg)
npreg   0   1   2   3   4   5   6  7  8  ...
type
No     34  56  38  23  19  13  14  9  5  ...
Yes     1   5  15  11  15   6   7  4  8  ...

Remember:

>>> pima[1:4]
   npreg  glu  bp  skin   bmi    ped  age type
2      1   85  66    29  26.6  0.351   31   No
3      1   89  66    23  28.1  0.167   21   No
4      3   78  50    32  31.0  0.248   26  Yes

Cross-tabulation plot

# Wrong ordering
pd.crosstab(pima.type, pima.npreg).plot(kind="bar")

Cross-tabulation plot

# Transpose
pd.crosstab(pima.type, pima.npreg).T.plot(kind="bar")

Cross-tabulation plot

# Or better:
pd.crosstab(pima.npreg, pima.type).plot(kind="bar")

Other Pandas capabilities

Hierarchical indexing (McKinney, 2012, page 147), sketched below
Missing data support (McKinney, 2012, page 142)
Pivoting (McKinney, 2012, chapter 9)
Time series (McKinney, 2012, chapter 10)
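A minimal sketch (made-up data) of two of these capabilities: a hierarchical (MultiIndex) row index, and a daily time series resampled to monthly means. Note the resample call below uses the newer .resample(...).mean() spelling; the pandas version of these slides used resample("M", how="mean") instead.

import numpy as np
import pandas as pd

# Hierarchical indexing: a two-level row index built from tuples.
idx = pd.MultiIndex.from_tuples([("DK", 2012), ("DK", 2013),
                                 ("SE", 2012), ("SE", 2013)])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
print(s["DK"])                      # select a whole block by the outer level

# Time series: a daily series resampled to monthly means.
dates = pd.date_range("2013-01-01", periods=60, freq="D")
ts = pd.Series(np.arange(60.0), index=dates)
print(ts.resample("M").mean())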

Statistical modeling with statsmodels

Example with the Longley dataset.

Ordinary least squares fitting of a dependent variable "TOTEMP" (Total Employment) from 6 independent variables:

import statsmodels.api as sm

# For 'load_pandas' you need a recent statsmodels
data = sm.datasets.longley.load_pandas()

# Endogenous (response/dependent) & exogenous variables (design matrix)
y, x = data.endog, data.exog

result = sm.OLS(y, x).fit()   # OLS: ordinary least squares
result.summary()              # Print summary

OLS Regression Results

Dep. Variable:     TOTEMP            R-squared:       0.988
Model:             OLS               Adj. R-squared:  0.982
Method:            Least Squares     F-statistic:     161.9
Date:              Mon, 17 Jun 2013  Log-Likelihood:  -117.56
No. Observations:  16                AIC:             247.1
Df Residuals:      10                BIC:             251.8
Df Model:          5

[Coefficient table (coef, std err, t, P>|t|, 95% confidence intervals) and diagnostics (Jarque-Bera, Cond. No. 4.56e+05) omitted.]
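Note that data.exog as loaded contains only the six predictors and no constant column; to fit a model with an explicit intercept one would typically add a constant first, e.g. with add_constant. A hedged sketch of that variant:

import statsmodels.api as sm

data = sm.datasets.longley.load_pandas()
y, x = data.endog, data.exog

# Add a column of ones so that the model includes an intercept term.
x_const = sm.add_constant(x)
result = sm.OLS(y, x_const).fit()
print(result.summary())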

Statsmodels 0.5

"Minimal example" from the statsmodels documentation:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

url = ".../HistData/Guerry.csv"
dat = pd.read_csv(url)
results = smf.ols("Lottery ~ Literacy + np.log(Pop1831)", data=dat).fit()
results.summary()

Note: 1) loading of data from a URL, 2) import statsmodels.formula.api (possible in statsmodels 0.5), 3) R-like specification of the linear model formula (from patsy).

More information

http://pandas.pydata.org/

The canonical book "Python for Data Analysis" (McKinney, 2012).

"Will it Python?": porting R projects to Python, exemplified through scripts from Machine Learning for Hackers (MLFH) by Drew Conway and John Myles White.

Summary

Pandas helps you represent your data (both numerical and categorical) and helps you keep track of what it refers to (by column and row name).

Pandas makes indexing easy.

Pandas has some basic statistics and plotting facilities.

Pandas may work more or less seamlessly with standard statistical models (e.g., the general linear model with OLS estimation).

Watch out: Pandas is still below version 1 numbering!

Standard packaging is not up to date: the newest version of Pandas is 0.11.0, while, e.g., Ubuntu LTS 12.04 ships 0.7.0:

sudo pip install --upgrade pandas

The latest pip version of statsmodels is 0.4.3; the development version is 0.5, with statsmodels.formula.api, which yields more R-like linear modeling.

References

McKinney, W. (2012). Python for Data Analysis. O'Reilly, Sebastopol, California, first edition. ISBN 9781449319793.

