Pandas DataFrame Notes


Cheat Sheet: The pandas DataFrame Object
Version 2, May 2015 – Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter

Preliminaries

Start by importing these Python modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame, Series
Note: these are the recommended import aliases

The conceptual model

DataFrame object: The pandas DataFrame is a two-dimensional table of data with column and row indexes. The columns are made up of pandas Series objects.
(Diagram: rows are labelled by the row index, df.index; columns are labelled by the column index, df.columns.)

Series object: an ordered, one-dimensional array of data with an index. All the data in a Series is of the same data type. Series arithmetic is vectorised after first aligning the Series index for each of the operands.
s1 = Series(range(0, 4))       # -> 0, 1, 2, 3
s2 = Series(range(1, 5))       # -> 1, 2, 3, 4
s3 = s1 + s2                   # -> 1, 3, 5, 7
s4 = Series(['a', 'b']) * 3    # -> 'aaa', 'bbb'

The Index object: The pandas Index provides the axis labels for the Series and DataFrame objects. It can only contain hashable objects. A pandas Series has one Index; a DataFrame has two Indexes.
# --- get Index from Series and DataFrame
idx = s.index
idx = df.columns               # the column index
idx = df.index                 # the row index
# --- some Index attributes
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels                # multi-level indexes
# --- some Index methods
a = idx.values                 # get as numpy array
l = idx.tolist()               # get as a python list
idx = idx.astype(dtype)        # change data type
b = idx.equals(o)              # check for equality
idx = idx.union(o)             # union of two indexes
i = idx.nunique()              # number of unique labels
label = idx.min()              # minimum label
label = idx.max()              # maximum label

Get your data into a DataFrame

Load a DataFrame from a CSV file
df = pd.read_csv('file.csv')   # often works
df = pd.read_csv('file.csv', header=0,
    index_col=0, quotechar='"', sep=':',
    na_values=['na', '-', '.', ''])
Note: refer to the pandas docs for all arguments

From inline CSV text to a DataFrame
from io import StringIO             # python 3
# from StringIO import StringIO     # python 2.7
data = """, Animal, Cuteness, ..."""
# (the rest of the inline CSV example text is truncated in this transcription)
df = pd.read_csv(StringIO(data), header=0,
    index_col=0, skipinitialspace=True)
Note: skipinitialspace=True allows a pretty layout

Load DataFrames from a Microsoft Excel file
# Each Excel sheet in a Python dictionary
workbook = pd.ExcelFile('file.xlsx')
dictionary = {}
for sheet_name in workbook.sheet_names:
    df = workbook.parse(sheet_name)
    dictionary[sheet_name] = df
Note: the parse() method takes many arguments like read_csv() above. Refer to the pandas documentation.

Load a DataFrame from a MySQL database
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://'
    + 'USER:PASSWORD@localhost/DATABASE')
df = pd.read_sql_table('table', engine)

Data in Series then combine into a DataFrame
# Example 1 ...
s1 = Series(range(6))
s2 = s1 * s1
s2.index = s2.index + 2        # misalign indexes
df = pd.concat([s1, s2], axis=1)
# Example 2 ...
s3 = Series({'Tom': 1, 'Dick': 4, 'Har': 9})
s4 = Series({'Tom': 3, 'Dick': 2, 'Mar': 5})
df = pd.concat({'A': s3, 'B': s4}, axis=1)
Note: the 1st method gives integer column labels
Note: the 2nd method does not guarantee column order
Note: indexes are aligned on DataFrame creation (see the sketch below)
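
The index alignment mentioned in the notes above can be surprising at first. Below is a minimal sketch, not part of the original sheet, with invented Series values and labels, showing how arithmetic aligns on labels and fills non-matching labels with NaN.
import pandas as pd
from pandas import Series

# arithmetic aligns on index labels before computing;
# labels present in only one operand produce NaN
s5 = Series([1, 2, 3], index=['a', 'b', 'c'])
s6 = Series([10, 20, 30], index=['b', 'c', 'd'])
print(s5 + s6)    # a NaN, b 12.0, c 23.0, d NaN

# the same alignment happens when combining into a DataFrame
df = pd.concat([s5, s6], axis=1)
print(df)         # rows 'a' and 'd' each contain one NaN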

Get a DataFrame from data in a Python dictionary
# default --- assume data is in columns
df = DataFrame({
    'col0': [1.0, 2.0, 3.0, 4.0],
    'col1': [100, 200, 300, 400]})
# --- use helper method for data in rows
df = DataFrame.from_dict({     # data by row
    'row0': {'col0': 0, 'col1': 'A'},
    'row1': {'col0': 1, 'col1': 'B'}
    }, orient='index')
df = DataFrame.from_dict({     # data by row
    'row0': [1, 1+1j, 'A'],
    'row1': [2, 2+2j, 'B']
    }, orient='index')

Create play/fake data (useful for testing)
# --- simple
df = DataFrame(np.random.rand(50, 5))
# --- with a time-stamp row index:
df = DataFrame(np.random.rand(500, 5))
df.index = pd.date_range('1/1/2006',
    periods=len(df), freq='M')
# --- with alphabetic row and col indexes
import string
import random
r = 52    # note: min r is 1; max r is 52
c = 5
df = DataFrame(np.random.randn(r, c),
    columns=['col' + str(i) for i in range(c)],
    index=list((string.uppercase +
        string.lowercase)[0:r]))
# note: string.uppercase/lowercase are Python 2;
#       use string.ascii_uppercase/ascii_lowercase in Python 3
df['group'] = list(''.join(random.choice('abcd')
    for _ in range(r)))

Saving a DataFrame

Saving a DataFrame to a CSV file
df.to_csv('name.csv', encoding='utf-8')

Saving DataFrames to an Excel Workbook
from pandas import ExcelWriter
writer = ExcelWriter('filename.xlsx')
df1.to_excel(writer, 'Sheet1')
df2.to_excel(writer, 'Sheet2')
writer.save()

Saving a DataFrame to MySQL
import pymysql
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://'
    + 'USER:PASSWORD@localhost/DATABASE')
df.to_sql('TABLE', e, if_exists='replace')
Note: if_exists can be 'fail', 'replace' or 'append'

Saving a DataFrame to a Python dictionary
dictionary = df.to_dict()

Saving a DataFrame to a Python string
string = df.to_string()
Note: sometimes may be useful for debugging
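
A quick round-trip for the save and load sections above. This is a minimal sketch, not from the original sheet; the file name example.csv is made up. The point is that index_col=0 on reload restores the row index that to_csv wrote out.
import numpy as np
import pandas as pd
from pandas import DataFrame

# save a small play DataFrame and read it back
df = DataFrame(np.random.rand(3, 2), columns=['a', 'b'],
    index=['r0', 'r1', 'r2'])
df.to_csv('example.csv', encoding='utf-8')     # hypothetical file name
df2 = pd.read_csv('example.csv', index_col=0)  # restore the row index
print(df2.index.tolist())                      # ['r0', 'r1', 'r2']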
Working with the whole DataFrame

Peek at the DataFrame contents
df.info()                    # index & data types
n = 4
dfh = df.head(n)             # get first n rows
dft = df.tail(n)             # get last n rows
dfs = df.describe()          # summary stats for cols
top_left_corner_df = df.iloc[:5, :5]

DataFrame non-indexing attributes
dfT = df.T                   # transpose rows and cols
l = df.axes                  # list of row and col indexes
(r, c) = df.axes             # from above
s = df.dtypes                # Series of column data types
b = df.empty                 # True for empty DataFrame
i = df.ndim                  # number of axes (2)
t = df.shape                 # (row-count, column-count)
(r, c) = df.shape            # from above
i = df.size                  # row-count * column-count
a = df.values                # get a numpy array for df

DataFrame utility methods
dfc = df.copy()              # copy a DataFrame
dfr = df.rank()              # rank each col (default)
dfs = df.sort()              # sort each col (default)
dfc = df.astype(dtype)       # type conversion

DataFrame iteration methods
df.iteritems()               # (col-index, Series) pairs
df.iterrows()                # (row-index, Series) pairs
# example: iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')

Maths on the whole DataFrame (not a complete list)
df = df.abs()                # absolute values
df = df.add(o)               # add df, Series or value
s = df.count()               # non NA/null values
df = df.cummax()             # (cols default axis)
df = df.cummin()             # (cols default axis)
df = df.cumsum()             # (cols default axis)
df = df.cumprod()            # (cols default axis)
df = df.diff()               # 1st diff (col def axis)
df = df.div(o)               # div by df, Series, value
df = df.dot(o)               # matrix dot product
s = df.max()                 # max of axis (col def)
s = df.mean()                # mean (col default axis)
s = df.median()              # median (col default)
s = df.min()                 # min of axis (col def)
df = df.mul(o)               # mul by df, Series, value
s = df.sum()                 # sum axis (cols default)
Note: the methods that return a Series default to working on columns.

DataFrame filter/select rows or cols on label info
df = df.filter(items=['a', 'b'])    # by col
df = df.filter(items=[5], axis=0)   # by row
df = df.filter(like='x')            # keep 'x' in col
df = df.filter(regex='x')           # regex in col
df = df.select(crit=(lambda x: not x % 5))  # rows
Note: select takes a Boolean function, for cols: axis=1
Note: filter defaults to cols; select defaults to rows
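
A minimal sketch of the default-axis behaviour noted in the maths section above; it is not from the original sheet and the toy frame is invented. Reductions work down the columns unless axis=1 is given.
import pandas as pd
from pandas import DataFrame

# reductions default to one result per column;
# axis=1 gives one result per row instead
df = DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
print(df.sum())           # a 6, b 60
print(df.sum(axis=1))     # 11, 22, 33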

Working with Columns

A DataFrame column is a pandas Series object

Get column index and labels
idx = df.columns             # get col index
label = df.columns[0]        # 1st col label
lst = df.columns.tolist()    # get as a list

Change column labels
df.rename(columns={'old': 'new'}, inplace=True)
df = df.rename(columns={'a': 1, 'b': 'x'})

Selecting columns
s = df['colName']            # select col to Series
df = df[['colName']]         # select col to df
df = df[['a', 'b']]          # select 2 or more
df = df[['c', 'a', 'b']]     # change order
s = df[df.columns[0]]        # select by number
df = df[df.columns[[0, 3, 4]]]  # by number
s = df.pop('c')              # get col & drop from df

Selecting columns with Python attributes
s = df.a                     # same as s = df['a']
# cannot create new columns by attribute
df.existing_col = df.a / df.b
df['new_col'] = df.a / df.b
Trap: column names must be valid identifiers.

Adding new columns to a DataFrame
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan, len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
df1[['b', 'c']] = df2[['e', 'f']]
df3 = df1.append(other=df2)
Trap: When adding an indexed pandas object as a new column, only items from the new series that have a corresponding index in the DataFrame will be added. The receiving DataFrame is not extended to accommodate the new series; to merge, see below. (A short sketch of this trap follows at the end of this page.)
Trap: when adding a python list or numpy array, the column will be added by integer position.

Swap column contents – change column order
df[['B', 'A']] = df[['A', 'B']]

Columns value set based on criteria
df['b'] = df['a'].where(df['a'] > 0, other=0)
df['d'] = df['a'].where(df.b != 0, other=df.c)
Note: where other can be a Series or a scalar

Data type conversions
s = df['col'].astype(str)    # Series dtype
na = df['col'].values        # numpy array
pl = df['col'].tolist()      # python list
Note: useful dtypes for Series conversion: int, float, str
Trap: index lost in conversion from Series to array or list

Common column-wide methods/attributes
value = df['col'].dtype      # type of data
value = df['col'].size       # col dimensions
value = df['col'].count()    # non-NA count
value = df['col'].sum()
value = df['col'].prod()
value = df['col'].min()
value = df['col'].max()
value = df['col'].mean()
value = df['col'].median()
value = df['col'].cov(df['col2'])
s = df['col'].describe()
s = df['col'].value_counts()

Find index label for min/max values in column
label = df['col1'].idxmin()
label = df['col1'].idxmax()

Common column element-wise methods
s = df['col'].isnull()
s = df['col'].notnull()           # not isnull()
s = df['col'].astype(float)
s = df['col'].round(decimals=0)
s = df['col'].diff(periods=1)
s = df['col'].shift(periods=1)
s = pd.to_datetime(df['col'])
s = df['col'].fillna(0)           # replace NaN with 0
s = df['col'].cumsum()
s = df['col'].cumprod()
s = df['col'].pct_change(periods=4)
s = pd.rolling_sum(df['col'], window=4)
Note: also rolling_min(), rolling_max(), and many more.

Append a column of row sums to a DataFrame
df['Total'] = df.sum(axis=1)
Note: also means, mins, maxs, etc.

Dropping columns (mostly by label)
df = df.drop('col1', axis=1)
df.drop('col1', axis=1, inplace=True)
df = df.drop(['col1', 'col2'], axis=1)
s = df.pop('col')      # drops from frame
del df['col']          # even classic python works
df.drop(df.columns[0], inplace=True)

Multiply every column in DataFrame by Series
df = df.mul(s, axis=0)    # on matched rows
Note: also add(), sub(), div(), etc.

Vectorised arithmetic on columns
df['proportion'] = df['count'] / df['total']
df['percent'] = df['proportion'] * 100.0

Get the integer position of a column index label
j = df.columns.get_loc('col_name')

Apply numpy mathematical functions to columns
df['log_data'] = np.log(df['col1'])
df['rounded'] = np.round(df['col2'], 2)
Note: many more mathematical functions are available

Selecting columns with .loc, .iloc and .ix
df = df.loc[:, 'col1':'col2']    # inclusive
df = df.iloc[:, 0:2]             # exclusive

Test if column index values are unique/monotonic
if df.columns.is_unique: pass    # ...
b = df.columns.is_monotonic_increasing
b = df.columns.is_monotonic_decreasing
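
The sketch promised above: a minimal, invented example, not from the original sheet, of the column-assignment alignment trap. Values from a new Series are matched on index labels, and unmatched rows become NaN.
import pandas as pd
from pandas import DataFrame, Series

# only values whose index labels already exist in the
# DataFrame are assigned; unmatched rows become NaN and
# the extra label 'q' is silently dropped
df = DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
s = Series([10, 20], index=['y', 'q'])
df['new_col'] = s
print(df['new_col'].tolist())    # [nan, 10.0, nan]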

Working with rows

Get the row index and labels
idx = df.index               # get row index
label = df.index[0]          # 1st row label
lst = df.index.tolist()      # get as a list

Change the (row) index
df.index = idx                       # new ad hoc index
df.index = range(len(df))            # set with list
df = df.reset_index()                # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1', 'r2', 'etc'])
df.rename(index={'old': 'new'}, inplace=True)

Adding rows
df = original_df.append(more_rows_in_df)
Hint: convert to a DataFrame and then append. Both DataFrames should have the same column labels.

Dropping rows (by name)
df = df.drop('row_label')
df = df.drop(['row1', 'row2'])       # multi-row

Boolean row selection by values in a column
df = df[df['col2'] >= 0.0]
df = df[(df['col3'] >= 1.0) | (df['col1'] < 0.0)]
df = df[df['col'].isin([1, 2, 5, 7, 11])]
df = df[~df['col'].isin([1, 2, 5, 7, 11])]
df = df[df['col'].str.contains('hello')]
Trap: bitwise "or", "and" and "not" (ie. | & ~) are co-opted to be Boolean operators on a Series of Booleans.
Trap: need parentheses around comparisons (see the sketch at the end of this page).

Selecting rows using isin over multiple columns
# fake up some data
data = {1: [1, 2, 3], 2: [1, 4, 9], 3: [1, 8, 27]}
df = pd.DataFrame(data)
# multi-column isin
lf = {1: [1, 3], 3: [8, 27]}         # look for
f = df[df[list(lf)].isin(lf).all(axis=1)]

Selecting rows using an index
idx = df[df['col'] >= 2].index
print(df.ix[idx])

Select a slice of rows by integer position
[inclusive-from : exclusive-to [: step]]
default start is 0; default end is len(df)
df = df[:]        # copy DataFrame
df = df[0:2]      # rows 0 and 1
df = df[-1:]      # the last row
df = df[2:3]      # row 2 (the third row)
df = df[:-1]      # all but the last row
df = df[::2]      # every 2nd row (0, 2, ...)
Trap: a single integer without a colon is a column label for integer numbered columns.

Select a slice of rows by label/index
[inclusive-from : inclusive-to [: step]]
df = df['a':'c']  # rows 'a' through 'c'
Trap: doesn't work on integer labelled rows

Append a row of column totals to a DataFrame
# Option 1: use a dictionary comprehension
sums = {col: df[col].sum() for col in df}
sums_df = DataFrame(sums, index=['Total'])
df = df.append(sums_df)
# Option 2: all done with pandas
df = df.append(DataFrame(df.sum(),
    columns=['Total']).T)

Iterating over DataFrame rows
for (index, row) in df.iterrows():
    pass
Trap: row data type may be coerced.

Sorting DataFrame rows values
df = df.sort(df.columns[0], ascending=False)
df.sort(['col1', 'col2'], inplace=True)

Random selection of rows
import random as r
k = 20    # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :]
Note: this sample is not sorted

Sort DataFrame by its row index
df.sort_index(inplace=True)    # sort by row index
df = df.sort_index(ascending=False)

Drop duplicates in the row index
df['index'] = df.index              # 1 create new col
df = df.drop_duplicates(cols='index',
    take_last=True)                 # 2 use new col
del df['index']                     # 3 del the col
df.sort_index(inplace=True)         # 4 tidy up

Test if two DataFrames have same row index
len(a) == len(b) and all(a.index == b.index)

Get the integer position of a row or col index label
i = df.index.get_loc('row_label')
Trap: index.get_loc() returns an integer for a unique match. If not a unique match, may return a slice or mask.

Get integer position of rows that meet condition
a = np.where(df['col'] >= 2)    # numpy array

Test if the row index values are unique/monotonic
if df.index.is_unique: pass     # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing
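
A minimal sketch of the Boolean-selection traps flagged above; the frame and column names are invented for illustration. The bitwise operators bind more tightly than comparisons, so each condition needs its own parentheses.
import pandas as pd
from pandas import DataFrame

# wrap each comparison in parentheses before combining
# with the bitwise operators | (or), & (and), ~ (not)
df = DataFrame({'col1': [-1.0, 0.5, 2.0],
                'col3': [0.0, 1.0, 3.0]})
subset = df[(df['col3'] >= 1.0) | (df['col1'] < 0.0)]
print(len(subset))    # 3 rows match

# without the parentheses, | is applied to 1.0 | df['col1']
# first, which raises an error rather than selecting rows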

Working with cells

Selecting a cell by row and column labels
value = df.at['row', 'col']
value = df.loc['row', 'col']
value = df['col'].at['row']      # tricky
Note: .at[] is the fastest label-based scalar lookup

Setting a cell by row and column labels
df.at['row', 'col'] = value
df.loc['row', 'col'] = value
df['col'].at['row'] = value      # tricky

Selecting and slicing on labels
df = df.loc['row1':'row3', 'col1':'col3']
Note: the "to" on this slice is inclusive.

Setting a cross-section by labels
df.loc['A':'C', 'col1':'col3'] = np.nan
df.loc[1:2, 'col1':'col2'] = np.zeros((2, 2))
df.loc[1:2, 'A':'C'] = othr.loc[1:2, 'A':'C']
Remember: inclusive "to" in the slice

Selecting a cell by integer position
value = df.iat[9, 3]             # [row, col]
value = df.iloc[0, 0]            # [row, col]
value = df.iloc[len(df)-1, len(df.columns)-1]

Selecting a range of cells by int position
df = df.iloc[2:4, 2:4]    # subset of the df
df = df.iloc[:5, :5]      # top left corner
s = df.iloc[5, :]         # returns row as Series
df = df.iloc[5:6, :]      # returns row as row
Note: exclusive "to" – same as python list slicing.

Setting cell by integer position
df.iloc[0, 0] = value     # [row, col]
df.iat[7, 8] = value      # [row, col]

Setting cell range by integer position
df.iloc[0:3, 0:5] = value
df.iloc[1:3, 1:4] = np.ones((2, 3))
df.iloc[1:3, 1:4] = np.zeros((2, 3))
df.iloc[1:3, 1:4] = np.array([[1, 1, 1],
                              [2, 2, 2]])
Remember: exclusive-to in the slice

.ix for mixed label and integer position indexing
value = df.ix[5, 'col1']
df = df.ix[1:5, 'col1':'col3']

Views and copies
From the manual: Setting a copy can cause subtle errors. The rules about when a view on the data is returned are dependent on NumPy. Whenever an array of labels or a Boolean vector are involved in the indexing operation, the result will be a copy.

In summary: indexes and addresses

In the main, these notes focus on simple, single-level Indexes. Pandas also has hierarchical or multi-level Indexes (aka the MultiIndex).

A DataFrame has two Indexes
- Typically, the column index (df.columns) is a list of strings (observed variable names) or (less commonly) integers (the default is numbered from 0 to length-1)
- Typically, the row index (df.index) might be:
  o Integers - for case or row numbers (default is numbered from 0 to length-1);
  o Strings - for case names; or
  o DatetimeIndex or PeriodIndex - for time series data (more below)

Indexing
# --- selecting columns
s = df['col_label']          # scalar
df = df[['col_label']]       # one-item list
df = df[['L1', 'L2']]        # many-item list
df = df[index]               # pandas Index
df = df[s]                   # pandas Series

# --- selecting rows
df = df['from':'inc_to']     # label slice
df = df[3:7]                 # integer slice
df = df[df['col'] > 0.5]     # Boolean Series

df = df.loc['label']         # single label
df = df.loc[container]       # lab list/Series
df = df.loc['from':'to']     # inclusive slice
df = df.loc[bs]              # Boolean Series
df = df.iloc[0]              # single integer
df = df.iloc[container]      # int list/Series
df = df.iloc[0:5]            # exclusive slice
df = df.ix[x]                # loc then iloc

# --- select DataFrame cross-section
# r and c can be scalar, list, slice
df.loc[r, c]     # label accessor (row, col)
df.iloc[r, c]    # integer accessor
df.ix[r, c]      # label access, int fallback
df[c].iloc[r]    # chained – also for .loc

# --- select cell
# r and c must be label or integer
df.at[r, c]      # fast scalar label accessor
df.iat[r, c]     # fast scalar int accessor
df[c].iat[r]     # chained – also for .at

# --- indexing methods
v = df.get_value(r, c)       # get by row, col
df = df.set_value(r, c, v)   # set by row, col
df = df.xs(key, axis)        # get cross-section
df = df.filter(items, like, regex, axis)
df = df.select(crit, axis)

Note: the indexing attributes (.loc, .iloc, .ix, .at, .iat) can be used to get and set values in the DataFrame.
Note: the .loc, .iloc and .ix indexing attributes can accept python slice objects.
But .at and .iat do not.
Note: .loc can also accept Boolean Series arguments.
Avoid: chaining in the form df[col_indexer][row_indexer]
Trap: label slices are inclusive; integer slices are exclusive.
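
A minimal, invented sketch, not from the original sheet, of the final trap above: label slices include their end point, integer slices do not.
import pandas as pd
from pandas import DataFrame

# .loc slices by label and includes 'c';
# .iloc slices by position and excludes position 2
df = DataFrame({'col': [10, 20, 30, 40]},
               index=['a', 'b', 'c', 'd'])
print(len(df.loc['a':'c']))    # 3 rows ('c' included)
print(len(df.iloc[0:2]))       # 2 rows (position 2 excluded)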

Joining/Combining DataFrames

Three ways to join two DataFrames:
- merge (a database/SQL-like join operation)
- concat (stack side by side or one on top of the other)
- combine_first (splice the two together, choosing values from one over the other)

Merge on indexes
df_new = pd.merge(left=df1, right=df2,
    how='outer', left_index=True,
    right_index=True)
How: 'left', 'right', 'outer', 'inner'
How: outer = union/all; inner = intersection

Merge on columns
df_new = pd.merge(left=df1, right=df2,
    how='left', left_on='col1',
    right_on='col2')
Trap: When joining on columns, the indexes on the passed DataFrames are ignored.
Trap: many-to-many merges on a column can result in an explosion of associated data.

Join on indexes (another way of merging)
df_new = df1.join(other=df2, on='col1',
    how='outer')
df_new = df1.join(other=df2, on=['a', 'b'],
    how='outer')
Note: DataFrame.join() joins on indexes by default. DataFrame.merge() joins on common columns by default.

Simple concatenation is often the best
df = pd.concat([df1, df2], axis=0)   # top/bottom
df = df1.append([df2, df3])          # top/bottom
df = pd.concat([df1, df2], axis=1)   # left/right
Trap: can end up with duplicate rows or cols
Note: concat has an ignore_index parameter

Combine first
df = df1.combine_first(other=df2)
# multi-combine with python reduce()
df = reduce(lambda x, y: x.combine_first(y),
    [df1, df2, df3])
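
A minimal sketch, not from the original sheet and with invented frames, contrasting how='outer' and how='inner' in the index merge shown above: outer keeps the union of the row labels, inner keeps only the intersection.
import pandas as pd
from pandas import DataFrame

# 'b' is the only label the two indexes share
df1 = DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = DataFrame({'y': [3, 4]}, index=['b', 'c'])
outer = pd.merge(df1, df2, how='outer',
                 left_index=True, right_index=True)
inner = pd.merge(df1, df2, how='inner',
                 left_index=True, right_index=True)
print(len(outer))    # 3 rows: union of 'a', 'b', 'c'
print(len(inner))    # 1 row: just 'b'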

