PandasGuide - Read The Docs

Upload by : Matteo Vollmer

Pandas Guide
Meher Krishna Patel
Created on : October, 2017
Last updated : May, 2020
More documents are freely available at PythonDSP

Table of contents

1 Pandas Basic
    1.1 Introduction
    1.2 Data structures
        1.2.1 Series
        1.2.2 DataFrame

2 Overview
    2.1 Reading files
    2.2 Data operations
        2.2.1 Row and column selection
        2.2.2 Filter Data
        2.2.3 Sorting
        2.2.4 Null values
        2.2.5 String operations
        2.2.6 Count Values
        2.2.7 Plots
    2.3 Groupby
        2.3.1 Groupby with column-names
        2.3.2 Groupby with custom field
    2.4 Unstack
    2.5 Merge
        2.5.1 Merge with different files
        2.5.2 Merge table with itself
    2.6 Index
        2.6.1 Creating index
        2.6.2 Multiple index
        2.6.3 Reset index
    2.7 Implement using Python-CSV library
        2.7.1 Read the file
        2.7.2 Display movies according to year
        2.7.3 operator.itemgetter
        2.7.4 Replace empty string with 0
        2.7.5 collections.Counter
        2.7.6 collections.defaultdict

3 Numpy
    3.1 Creating Arrays
    3.2 Boolean indexing
    3.3 Reshaping arrays
    3.4 Concatenating the data

4 Data processing
    4.1 Hierarchical indexing
        4.1.1 Creating multiple index
        4.1.2 Partial indexing
        4.1.3 Unstack the data
        4.1.4 Column indexing
        4.1.5 Swap and sort level
        4.1.6 Summary statistics by level
    4.2 File operations
        4.2.1 Reading files
        4.2.2 Writing data to a file
    4.3 Merge
        4.3.1 Many to one
        4.3.2 Inner and outer join
        4.3.3 Concatenating the data
    4.4 Data transformation
        4.4.1 Removing duplicates
        4.4.2 Replacing values
    4.5 Groupby and data aggregation
        4.5.1 Basics
        4.5.2 Iterating over group
        4.5.3 Data aggregation

5 Time series
    5.1 Dates and times
        5.1.1 Generate series of time
        5.1.2 Convert string to dates
        5.1.3 Periods
        5.1.4 Time offsets
        5.1.5 Index data with time
    5.2 Application
        5.2.1 Basics
        5.2.2 Resampling
        5.2.3 Plotting the data
        5.2.4 Moving windows functions

6 Reading multiple files
    6.1 Example: Baby names trend
    6.2 Total boys and girls in year 1880
    6.3 pivot table

Note: Created using Python-3.6.4 and Pandas-0.22.0.
CSV files can be downloaded from below ownloads/

Chapter 1
Pandas Basic

1.1 Introduction

Data processing is an important part of analyzing data, because data is not always available in the desired format. Various processing steps, such as cleaning, restructuring or merging, are required before the data can be analyzed. NumPy, SciPy, Cython and pandas are tools available in Python that can be used for fast processing of data. Further, pandas is built on top of NumPy.

Pandas provides a rich set of functions to process various types of data. Further, working with pandas is fast, easy and more expressive than other tools. Pandas provides fast data processing, like NumPy, along with flexible data manipulation techniques, like spreadsheets and relational databases. Lastly, pandas integrates well with the matplotlib library, which makes it a very handy tool for analyzing data.

Note: In Chapter 1, two important data structures, i.e. Series and DataFrame, are discussed. Chapter 2 shows the frequently used features of pandas with examples. Later chapters include various other information about pandas.

1.2 Data structures

Pandas provides two very useful data structures to process the data, i.e. Series and DataFrame, which are discussed in this section.

1.2.1 Series

The Series is a one-dimensional array that can store various data types, including mixed data types. The row labels in a Series are called the index. Any list, tuple or dictionary can be converted into a Series using the 'pd.Series' constructor, as shown below,

    >>> import pandas as pd
    >>> # converting tuple to Series
    >>> h = ('AA', '2012-02-01', 100, 10.2)
    >>> s = pd.Series(h)
    >>> type(s)
    <class 'pandas.core.series.Series'>

    >>> print(s)
    0            AA
    1    2012-02-01
    2           100
    3          10.2
    dtype: object

    >>> # converting dict to Series
    >>> d = {'name' : 'IBM', 'date' : '2010-09-08', 'shares' : 100, 'price' : 10.2}
    >>> ds = pd.Series(d)
    >>> type(ds)
    <class 'pandas.core.series.Series'>
    >>> print(ds)
    [...]
    dtype: object

Note that in the tuple conversion, the index is set to 0, 1, 2 and 3. We can provide custom index names as follows,

    >>> f = ['FB', '2001-08-02', 90, 3.2]
    >>> f = pd.Series(f, index = ['name', 'date', 'shares', 'price'])
    >>> print(f)
    name              FB
    date      2001-08-02
    shares            90
    price            3.2
    dtype: object
    >>> f['shares']
    90
    >>> f[0]
    'FB'

Elements of the Series can be accessed using the index name or the position, e.g. f['shares'] or f[0] in the above code. Further, specific elements can be selected by providing a list of index names,

    >>> f[['shares', 'price']]
    shares     90
    price     3.2
    dtype: object

1.2.2 DataFrame

DataFrame is the most widely used data structure of pandas. Note that a Series is used to work with one-dimensional arrays, whereas a DataFrame can be used with two-dimensional arrays. A DataFrame has two different indexes, i.e. column-index and row-index.

The most common way to create a DataFrame is by using a dictionary of equal-length lists, as shown below. Further, all spreadsheets and text files are read as DataFrames; therefore it is a very important data structure of pandas.

    >>> data = { 'name' : ['AA', 'IBM', 'GOOG'],
    ...          'date' : ['2001-12-01', '2012-02-10', '2010-04-09'],
    ...          'shares' : [100, 30, 90],
    ...          'price' : [12.3, 10.3, 32.2]
    ...     }
    >>> df = pd.DataFrame(data)
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>
    >>> df
             date  name  price  shares
    0  2001-12-01    AA   12.3     100
    1  2012-02-10   IBM   10.3      30
    2  2010-04-09  GOOG   32.2      90

Additional columns can be added after defining a DataFrame, as below,

    >>> df['owner'] = 'Unknown'
    >>> df
             date  name  price  shares    owner
    0  2001-12-01    AA   12.3     100  Unknown
    1  2012-02-10   IBM   10.3      30  Unknown
    2  2010-04-09  GOOG   32.2      90  Unknown

Currently, the row index is set to 0, 1 and 2. It can be changed using the 'index' attribute, as below,

    >>> df.index = ['one', 'two', 'three']
    >>> df
                 date  name  price  shares    owner
    one    2001-12-01    AA   12.3     100  Unknown
    two    2012-02-10   IBM   10.3      30  Unknown
    three  2010-04-09  GOOG   32.2      90  Unknown

Further, any column of the DataFrame can be set as the index using the 'set_index()' method, as shown below,

    >>> df = df.set_index(['name'])
    >>> df
                date  price  shares    owner
    name
    AA    2001-12-01   12.3     100  Unknown
    IBM   2012-02-10   10.3      30  Unknown
    GOOG  2010-04-09   32.2      90  Unknown

Data can be accessed in two ways, i.e. using the row index and the column index,

    >>> # access data using column-index
    >>> df['shares']
    name
    AA      100
    IBM      30
    GOOG     90
    Name: shares, dtype: int64

    >>> # access data by row-index
    >>> df.ix['AA']
    date      2001-12-01
    price           12.3
    shares           100
    owner        Unknown
    Name: AA, dtype: object

    >>> # access all rows for a column
    >>> df.ix[:, 'name']
    0      AA
    1     IBM
    2    GOOG
    Name: name, dtype: object

    >>> # access specific element from the DataFrame,
    >>> df.ix[0, 'shares']
    100

Any column can be deleted using the 'del' or 'drop' commands,

    >>> del df['owner']
    >>> df
                date  price  shares
    name
    AA    2001-12-01   12.3     100
    IBM   2012-02-10   10.3      30
    GOOG  2010-04-09   32.2      90

    >>> df.drop('shares', axis = 1)
                date  price
    name
    AA    2001-12-01   12.3
    IBM   2012-02-10   10.3
    GOOG  2010-04-09   32.2
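The '.ix' indexer used in this chapter was deprecated in pandas 0.20 and removed in pandas 1.0, so those examples run as written only on older releases such as the Pandas-0.22.0 version this guide was created with. The following is a minimal sketch, assuming a recent pandas release, of how the same selections can be written with the label-based '.loc' and position-based '.iloc' indexers. The variable names are illustrative only and do not come from the guide.

```python
import pandas as pd

# Same sample data as the DataFrame example in this chapter.
data = {
    'name': ['AA', 'IBM', 'GOOG'],
    'date': ['2001-12-01', '2012-02-10', '2010-04-09'],
    'shares': [100, 30, 90],
    'price': [12.3, 10.3, 32.2],
}
df = pd.DataFrame(data).set_index('name')

# Label-based selection of one row (replaces df.ix['AA']).
row_by_label = df.loc['AA']

# Position-based selection of the same row.
row_by_position = df.iloc[0]

# Single element by row label and column label.
shares_of_ibm = df.loc['IBM', 'shares']

# Single element by row position and column position.
first_price = df.iloc[0, df.columns.get_loc('price')]

# drop() returns a new DataFrame; df itself keeps the 'shares' column.
without_shares = df.drop('shares', axis=1)
```

Note that 'drop' leaves the original DataFrame unchanged unless its result is assigned back, whereas 'del df[...]' modifies the DataFrame in place.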

Chapter 2
Overview

In this chapter, various functionalities of pandas are shown with examples, which are explained in later chapters as well.

Note: CSV files can be downloaded from below ownloads/

2.1 Reading files

In this section, two data files are used, i.e. 'titles.csv' and 'cast.csv'. The 'titles.csv' file contains the list of movies with the release year; whereas the 'cast.csv' file has six columns which store the title of the movie, release year, cast member name, type (actor/actress), character and rating of the actor ('n'), as shown below,

    >>> import pandas as pd
    >>> casts = pd.read_csv('cast.csv', index_col=None)
    >>> casts.head()
                      title  year      name   type                character     n
    0        Closet Monster  2015  Buffy #1  actor                  Buffy 4  31.0
    1       Suuri illusioni  1985      Homo  actor                   Guests  22.0
    2   Battle of the Sexes  2017    hutter  actor          Bobby Riggs Fan  10.0
    3  Secret in Their Eyes  2015    hutter  actor          2002 Dodger Fan   NaN
    4            Steve Jobs  2015    hutter  actor  1988 Opera House Patron   NaN

    >>> titles = pd.read_csv('titles.csv', index_col=None)
    >>> titles.tail()
                          title  year
    49995                 Rebel  1970
    49996               Suzanne  1996
    49997                 Bomba  2013
    49998  Aao Jao Ghar Tumhara  1984
    49999            Mrs. Munck  1995

• read_csv : read the data from the csv file.
• index_col=None : there is no index, i.e. the first column is data.
• head() : show only the first five rows of the DataFrame.
• tail() : show only the last five rows of the DataFrame.

If there is an error while reading the file due to encoding, then try the following option as well,

    titles = pd.read_csv('titles.csv', index_col=None, encoding='utf-8')
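The read_csv options above can be tried without downloading the data files. The following is a minimal sketch, assuming a small in-memory stand-in for 'titles.csv' built from three of the rows printed above. The names 'csv_text' and 'titles_sample' are hypothetical, and 'io.StringIO' is used only so that read_csv can treat a string as a file.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'titles.csv': a header plus three sample rows.
csv_text = """title,year
Rebel,1970
Suzanne,1996
Bomba,2013
"""

# index_col=None keeps the default 0, 1, 2, ... row index,
# so the first column ('title') is read as ordinary data.
titles_sample = pd.read_csv(io.StringIO(csv_text), index_col=None)

first_two = titles_sample.head(2)   # first two rows
last_one = titles_sample.tail(1)    # last row only
```

When reading from a real file, the first argument would be the file path, as in the guide's own examples.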

If we simply type the name of the DataFrame (i.e. titles in the below code), then it will show the first thirty and last twenty rows of the file along with the complete list of columns. This can be limited using 'set_option' as below. Further, the total number of rows and columns is displayed at the end of the table.

    >>> pd.set_option('max_rows', 10, 'max_columns', 10)
    >>> titles
                             title  year
    0               The Rising Son  1990
    1      The Thousand Plane Raid  1969
    2             Crucea de piatra  1993
    3                      Country  2000
    4                   Gaiking II  2011
    ...                        ...   ...
    49995                    Rebel  1970
    49996                  Suzanne  1996
    49997                    Bomba  2013
    49998     Aao Jao Ghar Tumhara  1984
    49999               Mrs. Munck  1995

    [50000 rows x 2 columns]

• len : the 'len' command can be used to see the total number of rows in the file,

    >>> len(titles)
    50000

Note: The head() and tail() commands can be used to remind ourselves about the header and contents of the file. These two commands show the first and last 5 lines of the file respectively. Further, we can change the total number of lines displayed by these commands,

    >>> titles.head(3)
                         title  year
    0           The Rising Son  1990
    1  The Thousand Plane Raid  1969
    2         Crucea de piatra  1993

2.2 Data operations

In this section, various useful data operations for DataFrame are shown.

2.2.1 Row and column selection

Any row or column of the DataFrame can be selected by passing the name of the column or row. After selecting one from the DataFrame, it becomes one-dimensional; therefore it is considered a Series.

• ix : use the 'ix' command to select a row from the DataFrame.

    >>> t = titles['title']
    >>> type(t)
    <class 'pandas.core.series.Series'>
    >>> t.head()
    0             The Rising Son
    1    The Thousand Plane Raid
    2           Crucea de piatra

    3                    Country
    4                 Gaiking II
    Name: title, dtype: object

    >>> titles.ix[0]
    title    The Rising Son
    year               1990
    Name: 0, dtype: object

2.2.2 Filter Data

Data can be filtered by providing a boolean expression to the DataFrame. For example, in the below code, movies released after 1985 are selected from the DataFrame 'titles' and stored in a new DataFrame, i.e. after85.

    >>> # movies after 1985
    >>> after85 = titles[titles['year'] > 1985]
    >>> after85.head()
                  title  year
    0    The Rising Son  1990
    2  Crucea de piatra  1993
    3           Country  2000
    4        Gaiking II  2011
    5       Medusa (IV)  2015

Note: When we pass the boolean results to the DataFrame, pandas shows all the rows which correspond to True (rather than displaying True and False), as shown in the above code. Further, '& (and)' and '| (or)' can be used for joining two conditions, as shown below,

In the below code, all the movies in the decade 1990 (i.e. 1990-1999) are selected. Also, 't = titles' is used for simplicity only.

    >>> # display movies in years 1990 - 1999
    >>> t = titles
    >>> movies90 = t[ (t['year'] >= 1990) & (t['year'] < 2000) ]
    >>> movies90.head()
                           title  year
    0             The Rising Son  1990
    2           Crucea de piatra  1993
    12  Poka Makorer Ghar Bosoti  1996
    19          Maa Durga Shakti  1999
    24      Conflict of Interest  1993

2.2.3 Sorting

Sorting can be performed using the 'sort_index' or 'sort_values' methods,

    >>> # find all movies named as 'Macbeth'
    >>> t = titles
    >>> macbeth = t[ t['title'] == 'Macbeth']
    >>> macbeth.head()
             title  year

    4226   Macbeth  1913
    9322   Macbeth  2006
    11722  Macbeth  2013
    17166  Macbeth  1997
    25847  Macbeth  1998

Note that in the above filtering operation the rows stay in index order, because filtering preserves the original row order. Applying 'sort_index' explicitly gives the same result, as shown below,

    >>> # by default, sort by index i.e. row header
    >>> macbeth = t[ t['title'] == 'Macbeth'].sort_index()
    >>> macbeth.head()
             title  year
    4226   Macbeth  1913
    9322   Macbeth  2006
    11722  Macbeth  2013
    17166  Macbeth  1997
    25847  Macbeth  1998

To sort the data by values, the 'sort_values' method can be used. In the below code, the data is sorted by year,

    >>> # sort by year
    >>> macbeth = t[ t['title'] == 'Macbeth'].sort_values('year')
    >>> macbeth.head()
             title  year
    4226   Macbeth  1913
    17166  Macbeth  1997
    25847  Macbeth  1998
    9322   Macbeth  2006
    11722  Macbeth  2013

2.2.4 Null values

Note that various columns may contain no values, which are usually filled as NaN. For example, column 'n' is NaN in rows 3-4 of casts, as shown below,

    >>> casts.ix[3:4]
                      title  year    name   type                character    n
    3  Secret in Their Eyes  2015  hutter  actor          2002 Dodger Fan  NaN
    4            Steve Jobs  2015  hutter  actor  1988 Opera House Patron  NaN

These null values can be easily selected, unselected, or replaced by other values, e.g. empty strings or 0. Various examples of null values are shown in this section.

• The 'isnull' command returns True for each row whose value is null. Since rows 3-4 have NaN values, these are displayed as True.

    >>> c = casts
    >>> c['n'].isnull().head()
    0    False
    1    False
    2    False
    3     True
    4     True
    Name: n, dtype: bool

• 'notnull' is the opposite of isnull; it returns True for non-null values,

    >>> c['n'].notnull().head()
    0     True
    1     True
    2     True
    3    False
    4    False
    Name: n, dtype: bool

• To display the rows with null values, the condition must be passed to the DataFrame,

    >>> c[c['n'].isnull()].head(3)
                        title  year    name   type                character    n
    3    Secret in Their Eyes  2015  hutter  actor          2002 Dodger Fan  NaN
    4              Steve Jobs  2015  hutter  actor  1988 Opera House Patron  NaN
    5  Straight Outta Compton  2015  hutter  actor              Club Patron  NaN

• NaN values can be filled using fillna, ffill (forward fill), bfill (backward fill) etc. In the below code, 'NaN' values are replaced by NA. Further, examples of ffill and bfill are shown in a later part of the tutorial,

    >>> c_fill = c[c['n'].isnull()].fillna('NA')
    >>> c_fill.head(2)
                      title  year    name   type                character   n
    3  Secret in Their Eyes  2015  hutter  actor          2002 Dodger Fan  NA
    4            Steve Jobs  2015  hutter  actor  1988 Opera House Patron  NA

2.2.5 String operations

Various string operations can be performed using the '.str.' accessor. Let's search for the movie "Maa" first,

    >>> t = titles
    >>> t[t['title'] == 'Maa']
           title  year
    38880    Maa  1968

There is only one movie in the list. Now, we want to search for all the movies whose titles start with 'Maa '. The '.str.' accessor is required for such queries, as shown below,

    >>> t[t['title'].str.startswith("Maa ")].head(3)
                      title  year
    19     Maa Durga Shakti  1999
    3046      Maa Aur Mamta  1970
    7470  Maa Vaibhav Laxmi  1989

2.2.6 Count Values

The total number of occurrences can be counted using the 'value_counts()' method. In the following code, the total number of movies is displayed based on year.

    >>> t['year'].value_counts()
    [...]
    Name: year, dtype: int64
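Since the output above is incomplete, the behaviour of 'value_counts()' can be illustrated on a few rows instead. The following is a minimal sketch, assuming a hypothetical five-row sample rather than the real 50000-row 'titles.csv'; the titles and years are placeholders chosen only so that one year repeats.

```python
import pandas as pd

# Hypothetical sample data; not taken from the real 'titles.csv'.
t = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C', 'Film D', 'Film E'],
    'year': [1997, 1998, 1997, 2000, 1970],
})

# value_counts() returns a Series: the index holds the distinct years,
# the values hold how often each year occurs, largest count first.
counts = t['year'].value_counts()

# The same counts re-ordered by year (the index) instead of by count.
counts_by_year = counts.sort_index()
```

Here 'counts' lists 1997 first because it occurs twice, while 'counts_by_year' lists the years in ascending order.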

2.2.7 Plots

Pandas supports the matplotlib library, which can be used to plot the data as well. In the previous section, the total number of movies per year was computed from the DataFrame. In the below code, those values are saved in a new Series and then plotted using pandas,

    >>> import matplotlib.pyplot as plt
    >>> t = titles
    >>> p = t['year'].value_counts()
    >>> p.plot()
    <matplotlib.axes._subplots.AxesSubplot object at 0xaf18df6c>
    >>> plt.show()

The above code generates a plot that does not provide any useful information, because 'value_counts()' orders the years by count rather than chronologically.

It's better to sort the years (i.e. the index) first and then plot the data, as below. Here, the plot shows that the number of movies is increasing every year.

    >>> p.sort_index().plot()
    <matplotlib.axes._subplots.AxesSubplot object at 0xa9cd134c>
    >>> plt.show()

Now, the graph provides some useful information, i.e. the number of movies is increasing each year.

2.3 Groupby

Data can be grouped by column headers. Further, custom formats can be defined to group the various elements of the DataFrame.

2.3.1 Groupby with column-names

In Section Count Values, the number of movies per year was counted using the 'value_counts()' method. The same can be achieved with the 'groupby' method as well. The 'groupby' command returns an object, and we need to apply an additional function to it to get some results. For example, in the below code, data is grouped by 'year' and then the size() command is used. The size() option counts the total number of rows for each year; therefore the re

