ADVANCED PYTHON PROGRAMMING


Data Science

http://rcg.group.shef.ac.uk/courses/python

Course Outline:
- Capture Data
- Manage and Clean Data
- Data Analysis
- Report Data

Requirements: CI6010a, CI6010b, CI6011a

Anaconda Python should be installed on your desktop; please start Spyder. If it is not installed, please go to the Software Centre and install it.

WHAT IS DATA SCIENCE?

CAPTURE DATA

Data sources:
- Scraping it from a website, from figures, etc.
- Pulling the data from a database.
- Accessing an API, etc.

Data types:
- Observational: captured in real time, cannot be reproduced.
- Experimental: data from lab equipment and under controlled conditions.
- Simulation: data generated from test models studying actual or theoretical systems.
- Compiled: the results of data analysis, or aggregated from multiple sources.
- Canonical: fixed or organic collection datasets, usually peer-reviewed, and often published and curated.

Reading Data in Python

Data ranges from flexible to structured:
- Unstructured: data without inherent structure.
- Quasi-structured: textual data with an erratic format that can be formatted with effort.
- Semi-structured: textual data with an apparent pattern (possibly including errors).
- Structured: defined data model (errors less likely).

Two ways to read data: line by line, or into a Pandas DataFrame.
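The "line by line" route can be sketched with plain Python file handling; the file name and its contents below are made-up examples:

```python
# Minimal sketch of reading a text file line by line (no pandas).
# The file name and its contents are hypothetical.
with open("sample.txt", "w") as f:
    f.write("temp humidity\n20 65\n21 63\n")

with open("sample.txt") as f:
    header = f.readline().split()        # first line holds the column names
    rows = [line.split() for line in f]  # remaining lines hold the values

print(header)  # ['temp', 'humidity']
print(rows)    # [['20', '65'], ['21', '63']]
```

This keeps every value as a string; a DataFrame (next section) handles type conversion and labelling for you.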

PANDAS DATAFRAME

The Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Advantages:
- It can present data in a way that is suitable for data analysis.
- The package contains multiple methods for convenient data filtering.
- Pandas has a variety of utilities to perform input/output operations in a seamless manner.

Constructing a DataFrame

import pandas as pd

df1 = pd.read_excel('sample.xlsx')          # Excel file
df2 = pd.read_csv('sample.csv')             # Comma-separated file
df3 = pd.read_table('sample.txt', sep=' ')  # Text file

Constructing a DataFrame Manually

df = pd.DataFrame(data=d, index=i, columns=c)

- Parameter data: ndarray, iterable, dictionary or DataFrame.
- Parameter index: array; RangeIndex by default (0, 1, 2, 3, ..., n).
- Parameter columns: array; RangeIndex by default (0, 1, 2, 3, ..., n), or the keys of the dictionary if the data input is a dictionary.

import pandas as pd

d_1 = [1, 2, 3]
d_2 = {'header_1': [1, 2], 'header_2': [3, 4]}
df_1 = pd.DataFrame(data=d_1)  # Constructing DataFrame from a list
df_2 = pd.DataFrame(data=d_2)  # Constructing DataFrame from a dict
print(df_1)                    # one unnamed column, rows indexed 0, 1, 2
print(df_2)                    # columns named by the dictionary keys

Create a Pandas DataFrame based on the file 'global_temp.txt'. Print out the database.

import pandas as pd

df = pd.read_table('global_temp.txt', sep=' ')
print(df)

MANAGE DATA
- Remove unwanted observations
- Remove outliers
- Fix structural errors
- Handle missing data
- Filter and sort data

The least enjoyable part of data science, yet the part you spend the most time doing.

Unwanted observations:
- Duplicates: frequently arise during collection, such as when combining different datasets.
- Irrelevant data: observations that don't actually fit the specific problem.

Removing identical rows

df = df.drop_duplicates(subset='Last Name', keep='first')

- Parameter subset: takes a column label or list of column labels; only those columns are considered when identifying duplicates.
- Parameter keep: 'first', 'last' or False (False marks all identical rows as duplicates).
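A minimal runnable sketch of the two parameters; the names and scores below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Last Name": ["Smith", "Smith", "Jones"],
                   "Score": [1, 2, 3]})

# keep='first' retains only the first of the two Smith rows
first = df.drop_duplicates(subset="Last Name", keep="first")
print(first["Score"].tolist())  # [1, 3]

# keep=False drops every row whose 'Last Name' is duplicated
none = df.drop_duplicates(subset="Last Name", keep=False)
print(none["Score"].tolist())   # [3]
```

Note that with subset='Last Name' the differing Score values do not stop the two Smith rows from counting as duplicates.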

# Dropping irrelevant columns
df = df.drop(["Sales"], axis=1)

# Dropping irrelevant rows
df = df.drop(["Johnson", "Smith"])

# Dropping rows containing NaN
df = df.dropna()

Handle Missing Data

Dropping observations: replace the entry with the value NaN by using the method .replace(), then remove the whole row where information is missing:

df = df.replace(1, 31)      # Replace 1 with 31
df = df.replace(1, np.nan)  # Replace 1 with np.nan

Warning! Missing data may be informative itself.

Imputing missing values: the gap is filled with artificial data (mean, median, std) having similar properties to the real observations. The added value will not be scientifically valid, no matter how sophisticated your filling method is.
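The two strategies (dropping vs imputing) can be sketched on a small made-up column, where -1 marks a missing reading:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"IQ": [100, -1, 120, 95]})  # -1 marks a missing reading

# Strategy 1: mark the error as NaN, then drop the whole row
dropped = df.replace(-1, np.nan).dropna()
print(len(dropped))  # 3

# Strategy 2: impute the gap with the mean of the valid entries
clean = df.replace(-1, np.nan)
filled = clean.fillna(clean["IQ"].mean())
print(filled["IQ"].tolist())  # [100.0, 105.0, 120.0, 95.0]
```

The imputed 105.0 is plausible but artificial, which is exactly the caveat raised above.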

Unwanted Outliers
- An observation that lies outside the overall pattern of a distribution.
- Common causes: human, measurement and experimental errors.
- Outliers are innocent until proven guilty.

Finding outliers with the method .describe()

The core statistics about a particular column can be studied with the describe() method, which returns:
A. For numeric columns: the value count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles of the data in the column.
B. For string columns: the number of unique entries, the most frequently occurring value ('top'), and the number of times the top value occurs ('freq').

import pandas as pd

d = {"Name": ["...", "...", "...", "...", "Rocky"],  # first four names lost in the transcription
     "Age": [1, 27, 25, 24, 31],
     "IQ": [100, 120, 95, 1300, 101]}
df = pd.DataFrame(d)
print(df.describe())

Investigate the output and look for potential outliers:
- Age minimum of 1: suspicious, too young.
- IQ maximum of 1300: suspicious, too smart.

Finding Outliers with Histograms

df.hist(['Age', 'IQ'])
plt.show()  # may be necessary after importing matplotlib

Look for unexpected behaviour: values far from the general population, nonsense values, a wrong distribution shape, etc.

Removing Outliers from the Data

Remove the outlier by dropping the row, replacing its value one by one, or introducing a threshold.
- Dropping a column or row can be done with the method .drop(), as discussed before.
- Replace the outlier with another value:

df = df.replace(1, 31)      # Replace 1 with 31
df = df.replace(1, np.nan)  # Replace 1 with np.nan

- Introduce a threshold and remove the outlier:

df = df.mask(df > 1, 10)  # Replace every element greater than 1 with 10

Read the database named "iq_scores.csv".
- Drop the insignificant columns: UID and LOCATION_ID.
- Drop the duplicated lines.
- Errors are marked by the number -1; remove them.
- Investigate the histogram of the variable IQ. Search for unexpected behaviour and remove the outliers if there are any.
- Plot the histogram of IQ without any outliers or errors.

import pandas as pd

# Read the database
df = pd.read_csv("iq_scores.csv")
# Drop duplicates
df = df.drop_duplicates(subset='UID', keep='first')
# Drop irrelevant columns
df = df.drop(['UID', 'LOCATION_ID'], axis=1)
# Investigate the data
df.hist('IQ')

import numpy as np

# Remove known errors/missing data
df = df.replace(-1, np.nan).dropna()
# Remove the outlier
df = df.mask(df['IQ'] > 900)
# Investigate the data
df.hist('IQ')

Filtering Data
- Data segmentation: limits of computation, e.g. insufficient memory or CPU performance.
- Filtering by data attributes, e.g. separating the data by time.
- Use the indexer .iloc[].

Sorting Data
- Sorting by some dimension alphabetically or numerically, e.g. sorting by time or date.
- Ascending or descending.
- Use the method .sort_values().
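Since .sort_values() is not demonstrated elsewhere in these slides, here is a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"airline": ["B", "A", "C"],
                   "incidents": [5, 9, 1]})

# Sort numerically, descending
by_incidents = df.sort_values("incidents", ascending=False)
print(by_incidents["airline"].tolist())  # ['A', 'B', 'C']

# Sort alphabetically, ascending (the default)
by_name = df.sort_values("airline")
print(by_name["incidents"].tolist())     # [9, 5, 1]
```

Sorting returns a new DataFrame; the original df is left unchanged unless you reassign it.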

Filtering Data by Using .iloc[]

Example DataFrame with columns A = [1, 2, 3], B = [4, 5, 6] and C = [7, 8, 9].

Select one element of the DataFrame:
df.iloc[row, col]
df.iloc[1, 2]  # Out: 8

Slicing through dimensions:
df.iloc[row1:row2, col1:col2]
df.iloc[0:2, 2:3]  # Out: column C, rows 0 and 1 -> 7, 8

# Select a column of the DataFrame
print(df.iloc[:, 1])    # Output: 4, 5, 6

# Select a row of the DataFrame
print(df.iloc[2, :])    # Output: 3, 6, 9

# First 2 rows
print(df.iloc[0:2, :])

# Remove the last row
print(df.iloc[:2, :])

And so on.
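These selections can be reproduced end to end with a 3x3 frame holding the values from the slides' example (columns A = [1, 2, 3], B = [4, 5, 6], C = [7, 8, 9]):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

print(df.iloc[1, 2])           # 8: row 1, column 'C'
print(df.iloc[:, 1].tolist())  # [4, 5, 6]: column 'B'
print(df.iloc[2, :].tolist())  # [3, 6, 9]: row 2
print(df.iloc[0:2, :].shape)   # (2, 3): the first two rows
```

Remember that .iloc is purely positional; label-based selection uses .loc instead.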

CLEAN DATA

Normalisation typically means rescaling the values into the range [0, 1]. In most cases, when you normalise data you eliminate the units of measurement, enabling you to more easily compare data from different places.

x = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]
x_new = (x - x_min) / (x_max - x_min), with x_min = 1 and x_max = 87
x_new = [0, 0.48, 0.74, 0.25, 0.03, 0.65, 1, 0.51, 0.51, 0.25]

Normalising a NumPy array, or normalising a column of a Pandas DataFrame (here the column named "score" in the DataFrame "df"):

import numpy as np
import pandas as pd

raw_data = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]

x = np.array(raw_data)
x_new = (x - x.min()) / (x.max() - x.min())

df = pd.DataFrame({'score': raw_data})
df['score'] = (df['score'] - df['score'].min()) / (df['score'].max() - df['score'].min())

Data normalisation example

Data Standardisation

Standardisation typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).

x = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]
x_new = (x - μ) / σ, with μ = 39.3 and σ ≈ 25.6
x_new = [-1.49, 0.14, 1.00, -0.63, -1.37, 0.69, 1.86, 0.22, 0.22, -0.63]

Standardising a NumPy array, or standardising a column of a Pandas DataFrame (here the column named "sc" in the DataFrame "df"):

import numpy as np
import pandas as pd

raw_data = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]

x = np.array(raw_data)
x_new = (x - x.mean()) / x.std()

df = pd.DataFrame({'sc': raw_data})
df['sc'] = (df['sc'] - df['sc'].mean()) / df['sc'].std()
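One subtlety worth noting: NumPy's .std() defaults to the population standard deviation (ddof=0), while the pandas .std() defaults to the sample standard deviation (ddof=1), so the NumPy and DataFrame versions above give slightly different results:

```python
import numpy as np
import pandas as pd

x = np.array([1, 43, 65, 23, 4, 57, 87, 45, 45, 23])

pop_std = x.std()              # population std (ddof=0)
samp_std = pd.Series(x).std()  # sample std (ddof=1)

print(round(float(pop_std), 2))   # 25.64
print(round(float(samp_std), 2))  # 27.02
```

Pass ddof explicitly (e.g. x.std(ddof=1) or df['sc'].std(ddof=0)) if the two must match.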

EXPLORATORY DATA ANALYSIS

Aim: an approach to understanding the entire dataset.
Objectives:
1) Detection of mistakes.
2) Checking assumptions.
3) Detecting relationships between variables.
4) Start to play with the data!
Tools: EDA typically relies heavily on visualising the data to assess patterns and identify data characteristics that the analyst would not otherwise know to look for.

Example database: Airline safety

Aim: should travellers avoid flying airlines that have had crashes in the past?
Objectives: we are going to explore the airline safety database between 1985 and 2014.
Tools: univariate and multivariate data visualisation and simple statistical tools.

Example database: Airline safety

The data is stored in CSV format and appears to be structured (no missing data, no structural errors). It contains the following columns:
- airline: the name of the airline company.
- avail_seat_km_per_week: passenger capacity, in available seat-kilometres per week.
- incidents_85_99: incidents between 1985 and 1999.
- fatal_accidents_85_99: fatal accidents between 1985 and 1999.
- fatalities_85_99: fatalities between 1985 and 1999.
- incidents_00_14: incidents between 2000 and 2014.
- fatal_accidents_00_14: fatal accidents between 2000 and 2014.
- fatalities_00_14: fatalities between 2000 and 2014.

df = pd.read_csv('airline-safety_csv.csv')

First rows of the raw data:

airline          avail_seat_km_per_week  incidents_85_99  fatal_accidents_85_99
aeroflot*                    1197672318               76                     14
aerolineas arg.               385803648                6                      0
aeromexico*                   596871813                3                      1

Standardisation helps with inconveniently big numbers: what is the meaning of these raw capacities? After standardising the capacity column:

airline          avail_seat_km_per_week  incidents_85_99  fatal_accidents_85_99
aeroflot*                         -0.12               76                     14
aerolineas arg.                   -0.68                6                      0
aeromexico*                       -0.53                3                      1

Univariate visualisation

df.hist('incidents_85_99')

- Plot each field in the raw dataset.
- Is it the expected distribution?
- Are there any outliers? Around 70 incidents by one airline: an outlier?

Univariate visualisation

df.hist()

- Somebody is flying a lot.
- Somebody is crashing a lot.
- Fewer fatalities recently?

The investigation starts.

Insights: somebody is flying a lot; somebody is crashing a lot.
Questions: Is there a connection? Is my data reliable (around 70 incidents by one airline: an outlier?)? Is it safer to fly today than before? And so on.

Download and load the airline safety database. Standardise the column "avail_seat_km_per_week" and find the airline that had more than 70 incidents between 1985 and 1999.

import pandas as pd

# Read the database
df = pd.read_csv('airline-safety_csv.csv')
# Standardise the capacity column
col = 'avail_seat_km_per_week'
df[col] = (df[col] - df[col].mean()) / df[col].std()
# Mask the airlines with fewer than 70 incidents
dfnan = df.mask(df["incidents_85_99"] < 70)
# Drop the irrelevant rows
df_filtered = dfnan.dropna()
# Print the results
print(df_filtered)

Output:
   airline    avail_seat_km_per_week  incidents_85_99  fatal_accidents_85_99
1  aeroflot*                   -0.12             76.0                   14.0

Aeroflot was flying less than the average, with a high number of incidents.

Multivariate visualisations

- Is there any relationship between the investigated data subsets?
- Is the relationship statistically significant or scientifically interesting?
- A relation between capacity and incidents?

df.plot.scatter('avail_seat_km_per_week', 'incidents_85_99')

Use the corr() function to find the correlation among the columns in the DataFrame using the Pearson method.
- Correlation coefficients are never lower than -1. A coefficient of -1 indicates that the data points in a scatter plot lie exactly on a straight descending line.
- A coefficient of 0 means that the two variables have no linear relation whatsoever; however, some non-linear relation may still exist between them.
- Correlation coefficients are never higher than 1. A coefficient of 1 means that the two variables are perfectly positively linearly related.

                        avail_seat_km_per_week  incidents_85_99  fatal_accidents_85_99  incidents_00_14
avail_seat_km_per_week                1.000000         0.279538               0.468300         0.725917
incidents_85_99                       0.279538         1.000000               0.856991         0.403009

The correlation between incidents_85_99 and fatal_accidents_85_99 (0.86) is high but not scientifically interesting; the correlation between avail_seat_km_per_week and incidents_00_14 (0.73) is high and interesting.
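A minimal sketch of corr() on a tiny made-up frame, where y is an exact linear function of x and z is not:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],    # y = 2x: a perfect positive linear relation
                   "z": [4, 1, 3, 2]})   # no obvious linear relation

corr = df.corr(method="pearson")
print(round(corr.loc["x", "y"], 2))  # 1.0
print(round(corr.loc["x", "z"], 2))  # -0.4
```

corr() returns the full symmetric correlation matrix; .loc picks out individual pairs.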

Investigate the relationship between the variables "incidents_85_99" and "incidents_00_14". Use a scatter plot to visualise the results.

import pandas as pd

# Read the database
df = pd.read_csv('airline-safety_csv.csv')
df.plot.scatter('incidents_85_99', 'incidents_00_14')

There seems to be a relationship, but is it significant? There is a significant improvement between the two periods.

DATA ANALYSIS

Turn insights and ideas into scientifically valid results:
- Use the most promising finding.
- Perform in-depth analysis.
- Check your results.
- Prove your results.

Continue to investigate the details!

Different behaviours seem to be mixed in these statistics: random behaviour? a linear trend? unique behaviour? If possible, try to separate the data, manually or by using an algorithm.

# Filter: keep airlines with fewer than 10 incidents
df_l = df.mask(df["incidents_85_99"] >= 10).dropna()
# Output: air canada, air india, air new zealand, ...
print(df_l['airline'])
# Plot the results
df_l.plot.scatter('incidents_85_99', 'incidents_00_14')
# Check the correlation. Output: 0.36
df_l['incidents_85_99'].corr(df_l['incidents_00_14'])

df m df.mask((df["incid

