ADVANCED PYTHON PROGRAMMING


Data Science

http://rcg.group.shef.ac.uk/courses/python

Course Outline:
- Capture Data
- Manage and Clean Data
- Data Analysis
- Report Data

Requirements: CI6010a, CI6010b, CI6011a

Anaconda Python should be installed on your desktop; please start Spyder. If it is not installed, please go to the Software Centre and install it.

WHAT IS DATA SCIENCE?

CAPTURE DATA

Data sources:
- Scraping it from a website, from figures, etc.
- Pulling the data from a database.
- Accessing an API, etc.

Data types:
- Observational: captured in real time, cannot be reproduced.
- Experimental: data from lab equipment and under controlled conditions.
- Simulation: data generated from test models studying actual or theoretical systems.
- Compiled: the results of data analysis, or aggregated from multiple sources.
- Canonical: fixed or organic collection datasets, usually peer-reviewed, and often published and curated.

Reading Data in Python

Data ranges from flexible to structured:
- Unstructured: data without inherent structure.
- Quasi-structured: textual data with an erratic format that can be formatted with effort.
- Semi-structured: textual data with an apparent pattern (possibly including errors).
- Structured: defined data model (errors less likely).

Two ways to read data: line by line, or into a Pandas DataFrame.
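The "line by line" route can be sketched with plain Python file handling; the file name and its contents below are made-up examples:

```python
# Minimal sketch of reading a text file line by line (no pandas).
# The file name and its contents are hypothetical.
with open("sample.txt", "w") as f:
    f.write("temp humidity\n20 65\n21 63\n")

with open("sample.txt") as f:
    header = f.readline().split()        # first line holds the column names
    rows = [line.split() for line in f]  # remaining lines hold the values

print(header)  # ['temp', 'humidity']
print(rows)    # [['20', '65'], ['21', '63']]
```

This keeps every value as a string; a DataFrame (next section) handles type conversion and labelling for you.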

PANDAS DATAFRAME

The Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Advantages:
- It can present data in a way that is suitable for data analysis.
- The package contains multiple methods for convenient data filtering.
- Pandas has a variety of utilities to perform input/output operations in a seamless manner.

Constructing a DataFrame

import pandas as pd

df1 = pd.read_excel('sample.xlsx')          # Excel file
df2 = pd.read_csv('sample.csv')             # Comma-separated file
df3 = pd.read_table('sample.txt', sep=' ')  # Text file

Constructing a DataFrame Manually

df = pd.DataFrame(data=d, index=i, columns=c)

- Parameter data: ndarray, iterable, dictionary or DataFrame.
- Parameter index: array; RangeIndex by default (0, 1, 2, 3, ..., n).
- Parameter columns: array; RangeIndex by default (0, 1, 2, 3, ..., n), or the keys of the dictionary if the data input is a dictionary.

import pandas as pd

d_1 = [1, 2, 3]
d_2 = {'header_1': [1, 2], 'header_2': [3, 4]}
df_1 = pd.DataFrame(data=d_1)  # Constructing DataFrame from a list
df_2 = pd.DataFrame(data=d_2)  # Constructing DataFrame from a dict
print(df_1)                    # one unnamed column, rows indexed 0, 1, 2
print(df_2)                    # columns named by the dictionary keys

Create a Pandas DataFrame based on the file 'global_temp.txt'. Print out the database.

import pandas as pd

df = pd.read_table('global_temp.txt', sep=' ')
print(df)

MANAGE DATA
- Remove unwanted observations
- Remove outliers
- Fix structural errors
- Handle missing data
- Filter and sort data

The least enjoyable part of data science, yet the part you spend the most time doing.

Unwanted observations:
- Duplicates: frequently arise during collection, such as when combining different datasets.
- Irrelevant data: observations that don't actually fit the specific problem.

Removing identical rows

df = df.drop_duplicates(subset='Last Name', keep='first')

- Parameter subset: takes a column label or list of column labels; only those columns are considered when identifying duplicates.
- Parameter keep: 'first', 'last' or False (False marks all identical rows as duplicates).
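A minimal runnable sketch of the two parameters; the names and scores below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Last Name": ["Smith", "Smith", "Jones"],
                   "Score": [1, 2, 3]})

# keep='first' retains only the first of the two Smith rows
first = df.drop_duplicates(subset="Last Name", keep="first")
print(first["Score"].tolist())  # [1, 3]

# keep=False drops every row whose 'Last Name' is duplicated
none = df.drop_duplicates(subset="Last Name", keep=False)
print(none["Score"].tolist())   # [3]
```

Note that with subset='Last Name' the differing Score values do not stop the two Smith rows from counting as duplicates.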

# Dropping irrelevant columns
df = df.drop(["Sales"], axis=1)

# Dropping irrelevant rows
df = df.drop(["Johnson", "Smith"])

# Dropping rows containing NaN
df = df.dropna()

Handle Missing Data

Dropping observations: replace the entry with the value NaN by using the method .replace(), then remove the whole row where information is missing:

df = df.replace(1, 31)      # Replace 1 with 31
df = df.replace(1, np.nan)  # Replace 1 with np.nan

Warning! Missing data may be informative itself.

Imputing missing values: the gap is filled with artificial data (mean, median, std) having similar properties to the real observations. The added value will not be scientifically valid, no matter how sophisticated your filling method is.
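The two strategies (dropping vs imputing) can be sketched on a small made-up column, where -1 marks a missing reading:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"IQ": [100, -1, 120, 95]})  # -1 marks a missing reading

# Strategy 1: mark the error as NaN, then drop the whole row
dropped = df.replace(-1, np.nan).dropna()
print(len(dropped))  # 3

# Strategy 2: impute the gap with the mean of the valid entries
clean = df.replace(-1, np.nan)
filled = clean.fillna(clean["IQ"].mean())
print(filled["IQ"].tolist())  # [100.0, 105.0, 120.0, 95.0]
```

The imputed 105.0 is plausible but artificial, which is exactly the caveat raised above.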

Unwanted Outliers
- An observation that lies outside the overall pattern of a distribution.
- Common causes: human, measurement and experimental errors.
- Outliers are innocent until proven guilty.

Finding outliers with the method .describe()

The core statistics about a particular column can be studied with the describe() method, which returns:
A. For numeric columns: the value count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles of the data in the column.
B. For string columns: the number of unique entries, the most frequently occurring value ('top'), and the number of times the top value occurs ('freq').

import pandas as pd

d = {"Name": ["...", "...", "...", "...", "Rocky"],  # first four names lost in the transcription
     "Age": [1, 27, 25, 24, 31],
     "IQ": [100, 120, 95, 1300, 101]}
df = pd.DataFrame(d)
print(df.describe())

Investigate the output and look for potential outliers:
- Age minimum of 1: suspicious, too young.
- IQ maximum of 1300: suspicious, too smart.

Finding Outliers with Histograms

df.hist(['Age', 'IQ'])
plt.show()  # may be necessary after importing matplotlib

Look for unexpected behaviour: values far from the general population, nonsense values, a wrong distribution shape, etc.

Removing Outliers from the Data

Remove the outlier by dropping the row, replacing its value one by one, or introducing a threshold.
- Dropping a column or row can be done with the method .drop(), as discussed before.
- Replace the outlier with another value:

df = df.replace(1, 31)      # Replace 1 with 31
df = df.replace(1, np.nan)  # Replace 1 with np.nan

- Introduce a threshold and remove the outlier:

df = df.mask(df > 1, 10)  # Replace every element greater than 1 with 10

Read the database named "iq_scores.csv".
- Drop the insignificant columns: UID and LOCATION_ID.
- Drop the duplicated lines.
- Errors are marked by the number -1; remove them.
- Investigate the histogram of the variable IQ. Search for unexpected behaviour and remove the outliers if there are any.
- Plot the histogram of IQ without any outliers or errors.

import pandas as pd

# Read the database
df = pd.read_csv("iq_scores.csv")
# Drop duplicates
df = df.drop_duplicates(subset='UID', keep='first')
# Drop irrelevant columns
df = df.drop(['UID', 'LOCATION_ID'], axis=1)
# Investigate the data
df.hist('IQ')

import numpy as np

# Remove known errors/missing data
df = df.replace(-1, np.nan).dropna()
# Remove the outlier
df = df.mask(df['IQ'] > 900)
# Investigate the data
df.hist('IQ')

Filtering Data
- Data segmentation: limits of computation, e.g. insufficient memory or CPU performance.
- Filtering by data attributes, e.g. separating the data by time.
- Use the indexer .iloc[].

Sorting Data
- Sorting by some dimension alphabetically or numerically, e.g. sorting by time or date.
- Ascending or descending.
- Use the method .sort_values().
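Since .sort_values() is not demonstrated elsewhere in these slides, here is a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"airline": ["B", "A", "C"],
                   "incidents": [5, 9, 1]})

# Sort numerically, descending
by_incidents = df.sort_values("incidents", ascending=False)
print(by_incidents["airline"].tolist())  # ['A', 'B', 'C']

# Sort alphabetically, ascending (the default)
by_name = df.sort_values("airline")
print(by_name["incidents"].tolist())     # [9, 5, 1]
```

Sorting returns a new DataFrame; the original df is left unchanged unless you reassign it.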

Filtering Data by Using .iloc[]

Example DataFrame with columns A = [1, 2, 3], B = [4, 5, 6] and C = [7, 8, 9].

Select one element of the DataFrame:
df.iloc[row, col]
df.iloc[1, 2]  # Out: 8

Slicing through dimensions:
df.iloc[row1:row2, col1:col2]
df.iloc[0:2, 2:3]  # Out: column C, rows 0 and 1 -> 7, 8

# Select a column of the DataFrame
print(df.iloc[:, 1])    # Output: 4, 5, 6

# Select a row of the DataFrame
print(df.iloc[2, :])    # Output: 3, 6, 9

# First 2 rows
print(df.iloc[0:2, :])

# Remove the last row
print(df.iloc[:2, :])

And so on.
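These selections can be reproduced end to end with a 3x3 frame holding the values from the slides' example (columns A = [1, 2, 3], B = [4, 5, 6], C = [7, 8, 9]):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

print(df.iloc[1, 2])           # 8: row 1, column 'C'
print(df.iloc[:, 1].tolist())  # [4, 5, 6]: column 'B'
print(df.iloc[2, :].tolist())  # [3, 6, 9]: row 2
print(df.iloc[0:2, :].shape)   # (2, 3): the first two rows
```

Remember that .iloc is purely positional; label-based selection uses .loc instead.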

CLEAN DATA

Normalisation typically means rescaling the values into the range [0, 1]. In most cases, when you normalise data you eliminate the units of measurement, enabling you to more easily compare data from different places.

x = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]
x_new = (x - x_min) / (x_max - x_min), with x_min = 1 and x_max = 87
x_new = [0, 0.48, 0.74, 0.25, 0.03, 0.65, 1, 0.51, 0.51, 0.25]

Normalising a NumPy array, or normalising a column of a Pandas DataFrame (here the column named "score" in the DataFrame "df"):

import numpy as np
import pandas as pd

raw_data = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]

x = np.array(raw_data)
x_new = (x - x.min()) / (x.max() - x.min())

df = pd.DataFrame({'score': raw_data})
df['score'] = (df['score'] - df['score'].min()) / (df['score'].max() - df['score'].min())

Data normalisation example

Data Standardisation

Standardisation typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).

x = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]
x_new = (x - μ) / σ, with μ = 39.3 and σ ≈ 25.6
x_new = [-1.49, 0.14, 1.00, -0.63, -1.37, 0.69, 1.86, 0.22, 0.22, -0.63]

Standardising a NumPy array, or standardising a column of a Pandas DataFrame (here the column named "sc" in the DataFrame "df"):

import numpy as np
import pandas as pd

raw_data = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]

x = np.array(raw_data)
x_new = (x - x.mean()) / x.std()

df = pd.DataFrame({'sc': raw_data})
df['sc'] = (df['sc'] - df['sc'].mean()) / df['sc'].std()
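One subtlety worth noting: NumPy's .std() defaults to the population standard deviation (ddof=0), while the pandas .std() defaults to the sample standard deviation (ddof=1), so the NumPy and DataFrame versions above give slightly different results:

```python
import numpy as np
import pandas as pd

x = np.array([1, 43, 65, 23, 4, 57, 87, 45, 45, 23])

pop_std = x.std()              # population std (ddof=0)
samp_std = pd.Series(x).std()  # sample std (ddof=1)

print(round(float(pop_std), 2))   # 25.64
print(round(float(samp_std), 2))  # 27.02
```

Pass ddof explicitly (e.g. x.std(ddof=1) or df['sc'].std(ddof=0)) if the two must match.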

EXPLORATORY DATA ANALYSIS

Aim: an approach to understanding the entire dataset.
Objectives:
1) Detection of mistakes.
2) Checking assumptions.
3) Detecting relationships between variables.
4) Start to play with the data!
Tools: EDA typically relies heavily on visualising the data to assess patterns and identify data characteristics that the analyst would not otherwise know to look for.

Example database: Airline safety

Aim: should travellers avoid flying airlines that have had crashes in the past?
Objectives: we are going to explore the airline safety database between 1985 and 2014.
Tools: univariate and multivariate data visualisation and simple statistical tools.

Example database: Airline safety

The data is stored in CSV format and appears to be structured (no missing data, no structural errors). It contains the following columns:
- airline: the name of the airline company.
- avail_seat_km_per_week: passenger capacity, in available seat-kilometres per week.
- incidents_85_99: incidents between 1985 and 1999.
- fatal_accidents_85_99: fatal accidents between 1985 and 1999.
- fatalities_85_99: fatalities between 1985 and 1999.
- incidents_00_14: incidents between 2000 and 2014.
- fatal_accidents_00_14: fatal accidents between 2000 and 2014.
- fatalities_00_14: fatalities between 2000 and 2014.

df = pd.read_csv('airline-safety_csv.csv')

First rows of the raw data:

airline          avail_seat_km_per_week  incidents_85_99  fatal_accidents_85_99
aeroflot*                    1197672318               76                     14
aerolineas arg.               385803648                6                      0
aeromexico*                   596871813                3                      1

Standardisation helps with inconveniently big numbers: what is the meaning of these raw capacities? After standardising the capacity column:

airline          avail_seat_km_per_week  incidents_85_99  fatal_accidents_85_99
aeroflot*                         -0.12               76                     14
aerolineas arg.                   -0.68                6                      0
aeromexico*                       -0.53                3                      1

Univariate visualisation

df.hist('incidents_85_99')

- Plot each field in the raw dataset.
- Is it the expected distribution?
- Are there any outliers? Around 70 incidents by one airline: an outlier?

Univariate visualisation

df.hist()

- Somebody is flying a lot.
- Somebody is crashing a lot.
- Fewer fatalities recently?

The investigation starts.

Insights: somebody is flying a lot; somebody is crashing a lot.
Questions: Is there a connection? Is my data reliable (around 70 incidents by one airline: an outlier?)? Is it safer to fly today than before? And so on.

Download and load the airline safety database. Standardise the column "avail_seat_km_per_week" and find the airline that had more than 70 incidents between 1985 and 1999.

import pandas as pd

# Read the database
df = pd.read_csv('airline-safety_csv.csv')
# Standardise the capacity column
col = 'avail_seat_km_per_week'
df[col] = (df[col] - df[col].mean()) / df[col].std()
# Mask the airlines with fewer than 70 incidents
dfnan = df.mask(df["incidents_85_99"] < 70)
# Drop the irrelevant rows
df_filtered = dfnan.dropna()
# Print the results
print(df_filtered)

Output:
   airline    avail_seat_km_per_week  incidents_85_99  fatal_accidents_85_99
1  aeroflot*                   -0.12             76.0                   14.0

Aeroflot was flying less than the average, with a high number of incidents.

Multivariate visualisations

- Is there any relationship between the investigated data subsets?
- Is the relationship statistically significant or scientifically interesting?
- A relation between capacity and incidents?

df.plot.scatter('avail_seat_km_per_week', 'incidents_85_99')

Use the corr() function to find the correlation among the columns in the DataFrame using the Pearson method.
- Correlation coefficients are never lower than -1. A coefficient of -1 indicates that the data points in a scatter plot lie exactly on a straight descending line.
- A coefficient of 0 means that the two variables have no linear relation whatsoever; however, some non-linear relation may still exist between them.
- Correlation coefficients are never higher than 1. A coefficient of 1 means that the two variables are perfectly positively linearly related.

                        avail_seat_km_per_week  incidents_85_99  fatal_accidents_85_99  incidents_00_14
avail_seat_km_per_week                1.000000         0.279538               0.468300         0.725917
incidents_85_99                       0.279538         1.000000               0.856991         0.403009

The correlation between incidents_85_99 and fatal_accidents_85_99 (0.86) is high but not scientifically interesting; the correlation between avail_seat_km_per_week and incidents_00_14 (0.73) is high and interesting.
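A minimal sketch of corr() on a tiny made-up frame, where y is an exact linear function of x and z is not:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],    # y = 2x: a perfect positive linear relation
                   "z": [4, 1, 3, 2]})   # no obvious linear relation

corr = df.corr(method="pearson")
print(round(corr.loc["x", "y"], 2))  # 1.0
print(round(corr.loc["x", "z"], 2))  # -0.4
```

corr() returns the full symmetric correlation matrix; .loc picks out individual pairs.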

Investigate the relationship between the variables "incidents_85_99" and "incidents_00_14". Use a scatter plot to visualise the results.

import pandas as pd

# Read the database
df = pd.read_csv('airline-safety_csv.csv')
df.plot.scatter('incidents_85_99', 'incidents_00_14')

There seems to be a relationship, but is it significant? There is a significant improvement between the two periods.

DATA ANALYSIS

Turn insights and ideas into scientifically valid results:
- Use the most promising finding.
- Perform in-depth analysis.
- Check your results.
- Prove your results.

Continue to investigate the details!

Different behaviours seem to be mixed in these statistics: random behaviour? a linear trend? unique behaviour? If possible, try to separate the data, manually or by using an algorithm.

# Filter: keep airlines with fewer than 10 incidents
df_l = df.mask(df["incidents_85_99"] >= 10).dropna()
# Output: air canada, air india, air new zealand, ...
print(df_l['airline'])
# Plot the results
df_l.plot.scatter('incidents_85_99', 'incidents_00_14')
# Check the correlation. Output: 0.36
df_l['incidents_85_99'].corr(df_l['incidents_00_14'])

df m df.mask((df["incid

