Web Scraping For Data Science With Python

Web Scraping for Data Science with Python
Seppe vanden Broucke and Bart Baesens
– Free Extract –
Get the full book on Amazon

This is a free extract from the book “Web Scraping for Data Science with Python” by Seppe vanden Broucke and Bart Baesens (ISBN-13: 978-1979343787), obtained from webscrapingfordatascience.com. This extract is provided free of charge. You are hereby given permission to use and distribute this extract in a noncommercial setting, given that its contents remain unmodified (including this title page) and that you do not charge for it. Permission for commercial use or copying parts of this extract in your own work is not granted without prior written permission. For permission requests and more information, write to:

Seppe vanden Broucke, Naamsestraat 69 - box 3555, 3000 Leuven, Belgium
seppe.vandenbroucke@kuleuven.be

Chapter 1
Introduction

1.1 About this Book

1.1.1 Welcome

Congratulations! By picking up this book, you’ve taken the first steps into the exciting world of web scraping. First of all, we want to thank you, the reader, for choosing this guide to accompany you on this journey.

For those who are not familiar with programming or the deeper workings of the web, web scraping often looks like a black art: the ability to write a program that sets off on its own to explore the Internet and collect data is seen as a magical, exciting, perhaps even scary power to possess. Indeed, there are not many programming tasks that are able to fascinate both experienced and novice programmers in quite such a way as web scraping. Seeing a program working for the first time as it reaches out on the web and starts gathering data never fails to provide a certain rush, feeling like you’ve circumvented the “normal way” of working and just cracked some sort of enigma. It is perhaps for this reason that web scraping is also making a lot of headlines these days.

In this book, we set out to provide a concise and modern guide to web scraping, using Python as our programming language. We know that there are a lot of other books and online tutorials out there, but we felt that there was room for another entry. In particular, we wanted to provide a guide that is “short and sweet”, without falling into the typical “learn this in X hours” trap where important details or best practices are glossed over just for the sake of speed. In addition, you’ll note that we have titled this book “Web Scraping for Data Science”. We’re data scientists ourselves, and have very often found web scraping to be a powerful tool to have in your arsenal for the purpose of data gathering. Many data science projects start with the first step of obtaining an appropriate data set. In some cases (the “ideal situation”, if you will), a data set is readily provided by a business partner, your company’s data warehouse, or your academic supervisor, or can be bought or obtained in a structured format from external data providers, but many truly interesting projects start from collecting a treasure trove of information from the same place as humans do: the web. As such, we set out to offer something that:

• Is concise and to the point, whilst still being thorough.
• Is geared towards data scientists: we’ll show you how web scraping “plugs into” several parts of the data science workflow.
• Takes a code-first approach to get you up to speed quickly without too much boilerplate text.
• Is modern by using well-established best practices and publicly available, open source Python libraries only.
• Goes further than simple basics by showing how to handle the web of today, including JavaScript, cookies, and common web scraping mitigation techniques.
• Includes a thorough managerial and legal discussion regarding web scraping.
• Provides lots of pointers for further reading and learning.
• Includes many larger, fully worked out examples.

We hope you enjoy reading this book as much as we enjoyed writing it. Feel free to contact us in case you have questions, find mistakes, or just want to get in touch! We love hearing from our readers and are open to receiving any thoughts and questions.

Seppe vanden Broucke, seppe.vandenbroucke@kuleuven.be
Bart Baesens, bart.baesens@kuleuven.be

1.1.2 Audience

We have written this book with a data science oriented audience in mind. As such, you’ll probably already be familiar with Python or some other programming language or analytical toolkit (be it R, SAS, SPSS, or something else). If you’re using Python already, you’ll feel right at home. If not, we include a quick Python primer later on in this chapter to catch up with the basics and provide pointers to other guides as well. Even if you’re not using Python yet for your daily data science tasks (many will argue that you should), we want to show you that Python is a particularly powerful language to use for scraping data from the web. We also assume that you have some basic knowledge regarding how the web works. That is, you know your way around a web browser and know what URLs are; we’ll explain the details in depth as we go along.

To summarize, we have written this book to be useful to the following target groups:

• Data science practitioners already using Python and wanting to learn how to scrape the web using this language.
• Data science practitioners using another programming language or toolkit, but wanting to adopt Python to perform the web scraping part of their pipeline.
• Lecturers and instructors of web scraping courses.
• Students working on a web scraping project or aiming to increase their Python skill set.
• “Citizen data scientists” with interesting ideas requiring data from the web.
• Data science or business intelligence managers wanting to get an overview of what web scraping is all about, how it can benefit their teams, and which managerial and legal aspects need to be considered.

1.1.3 Structure

The chapters in this book can be divided into three parts:

• Part 1: Web Scraping Basics (Chapters 1-3): In these chapters, we’ll introduce you to web scraping, explain why it is useful to data scientists, and discuss the key components of the web: HTTP, HTML and CSS. We’ll show you how to write basic scrapers using Python, using the “requests” and “Beautiful Soup” libraries.
• Part 2: Advanced Web Scraping (Chapters 4-6): Here, we delve deeper into HTTP and show you how to work with forms, login screens and cookies. We’ll also explain how to deal with JavaScript-heavy websites and show you how to go from simple web scrapers to advanced web crawlers.
• Part 3: Managerial Concerns and Best Practices (Chapters 7-9): In this concluding part, we discuss managerial and legal concerns regarding web scraping in the context of data science, and also “open the door” to explore other tools and interesting libraries. We also give a general overview of web scraping best practices and tips. The final chapter includes some larger web scraping examples to show how all concepts covered before can be combined and highlights some interesting data science oriented use cases using web scraped data.

This book is set up to be very easy to read and work through. Newcomers are hence simply advised to read through this book from start to finish. That said, the book is structured in such a way that it should be easy to refer back to any part later on in case you want to brush up your knowledge or look up a particular concept.

1.1.4 About the Authors

Seppe vanden Broucke is an assistant professor of data and process science at the Faculty of Economics and Business, KU Leuven, Belgium. His research interests include business data mining and analytics, machine learning, process management, and process mining. His work has been published in well-known international journals and presented at top conferences. Seppe’s teaching includes Advanced Analytics, Big Data and Information Management courses. He also frequently teaches for industry and business audiences. Besides work, Seppe enjoys travelling, reading (Murakami to Bukowski to Asimov), listening to music (Booka Shade to Miles Davis to Claude Debussy), watching movies and series (less so these days due to a lack of time), gaming, and keeping up with the news.

Bart Baesens is a professor of big data and analytics at KU Leuven, Belgium, and a lecturer at the University of Southampton, United Kingdom. He has done extensive research on big data and analytics, credit risk modeling, fraud detection and marketing analytics. Bart has written more than 200 scientific papers and several books. Besides enjoying time with his family, he is also a diehard Club Brugge soccer fan. Bart is a foodie and amateur cook. He loves drinking a good glass of wine (his favorites are white Viognier or red Cabernet Sauvignon) either in his wine cellar or when overlooking the authentic red English phone booth in his garden. Bart loves traveling and is fascinated by World War I and reads many books on the topic.

More information about the authors and their research can be found online at www.dataminingapps.com. The companion website for this book can be found at www.webscrapingfordatascience.com, where you’ll find more information, an errata list, and where we host the examples used throughout this book.

1.2 What is Web Scraping?

Web “scraping” (also called “web harvesting”, “web data extraction” or even “web data mining”) can be defined as “the construction of an agent to download, parse, and organize data from the web in an automated manner”. Or, in other words: instead of a human end-user clicking away in a web browser and copy-pasting interesting parts into, say, a spreadsheet, web scraping offloads this task to a computer program which can execute it much faster, and more correctly, than a human can.

The automated gathering of data from the Internet is probably as old as the Internet itself, and the term “scraping” has been around for much longer than the web. Before “web scraping” became popularized as a term, a practice known as “screen scraping” was already well-established as a way to extract data from a visual representation, which in the early days of computing (think 1960s-80s) often boiled down to simple, text based “terminals”. Just as today, people in those days were also interested in “scraping” large amounts of text from such terminals and storing this data for later use.

1.2.1 Why Web Scraping for Data Science?

When surfing the web using a normal web browser, you’ve probably encountered multiple sites where you considered the possibility of gathering, storing, and analyzing the data presented on the site’s pages. Especially for data scientists, whose “raw material” is data, the web exposes a lot of interesting opportunities:

• There might be an interesting table on a Wikipedia page (or pages) you want to retrieve to perform some statistical analysis.
• Perhaps you want to get a list of reviews from a movie site to perform text mining, create a recommendation engine or build a predictive model to spot fake reviews.
• You might wish to get a listing of properties on a real-estate site to build an appealing geo-visualization.
• You’d like to gather additional features to enrich your data set based on information found on the web, say, weather information to forecast e.g. soft drink sales.
• You might be wondering about doing social network analytics using profile data found on a web forum.
• It might be interesting to monitor a news site for trending new stories on a particular topic of interest.

The web contains lots of interesting data sources that provide a treasure trove for all sorts of interesting things. Sadly, the current unstructured nature of the web does not always make it easy to gather or export this data. Web browsers are very good at showing images, displaying animations, and laying out websites in a way that is visually appealing to humans, but they do not expose a simple way to export their data, at least not in most cases. Instead of viewing the web page by page through your web browser’s window, wouldn’t it be nice to be able to automatically gather a rich data set? This is exactly where web scraping enters the picture.

If you know your way around the web a bit, you’ll probably be wondering: “isn’t this exactly what Application Programming Interfaces (APIs) are for?” Indeed, many websites nowadays provide such an API, which gives the outside world a means to access their data repository in a structured way, meant to be consumed and accessed by computer programs, not humans (although the programs are written by humans, of course). Twitter, Facebook, LinkedIn, and Google, for instance, all provide such APIs in order to search and post tweets, get a list of your friends and their likes, see who you’re connected with, and so on. So why, then, would we still need web scraping? The point is that APIs are a great means to access data sources, provided the website at hand provides one to begin with and that the API exposes the functionality you want. The general rule of thumb is to look for an API first and use that if you can, before setting off to build a web scraper to gather the data. For instance, you can easily use Twitter’s API to get a list of recent tweets, instead of re-inventing the wheel yourself. Nevertheless, there are still various reasons why web scraping might be preferable over the use of an API:

• The website you want to extract data from does not provide an API.
• The API provided is not free (whereas the website is).
• The API provided is rate limited, meaning you can only access it a certain number of times per second, per day, and so on.
• The API does not expose all the data you wish to obtain (whereas the website does).

In all of these cases, web scraping might come in handy. The fact remains that if you can view some data in your web browser, you will be able to access and retrieve it through a program. If you can access it through a program, the data can be stored, cleaned, and used in any way.

1.2.2 Who is Using Web Scraping?

There are many practical applications of having access to and gathering data on the web, many of which fall in the realm of data science. The following list outlines some interesting real-life use cases:

• Many of Google’s products have benefited from Google’s core business of crawling the web. Google Translate, for instance, utilizes text stored on the web to train and improve itself.

• Scraping is being applied a lot in HR and employee analytics. The San Francisco-based hiQ startup specializes in selling employee analyses by collecting and examining public profile information, for instance from LinkedIn (which was not happy about this but was so far unable to prevent this practice following a court case, see -your-boss).
• Digital marketeers and digital artists often use data from the web for all sorts of interesting and creative projects. “We Feel Fine” by Jonathan Harris and Sep Kamvar, for instance, scraped various blog sites for phrases starting with “I feel”, the results of which could then visualize how the world was feeling throughout the day.
• In another study, messages from Twitter, blogs and other social media were scraped to construct a data set which was used to build a predictive model for identifying patterns of depression and suicidal thoughts. This might be an invaluable tool for aid providers, though it of course warrants a thorough consideration of privacy related issues as well (see https://www.sas.com/en redict-suicide-risk-canada.html).
• In a paper titled “The Billion Prices Project: Using Online Prices for Measurement and Research” (see http://www.nber.org/papers/w22111), web scraping was used to collect a data set of online price information which was used to construct a robust daily price index for multiple countries.
• Banks and other financial institutions are using web scraping for competitor analysis. For example, banks frequently scrape competitors’ sites to get an idea of where branches are being opened or closed, or to track loan rates offered, all of which is interesting information which can be incorporated in their internal models and forecasting. Investment firms also often use web scraping, for instance to keep track of news articles regarding assets in their portfolio.
• Sociopolitical scientists are scraping social websites to track population sentiment and political orientation. A famous article called “Dissecting Trump’s Most Rabid Online Following” (see umps-most-rabid-online-following/) analyzes user discussions on reddit using semantic analysis to characterize the online followers and fans of Donald Trump.
• One researcher was able to train a deep learning model based on scraped images from Tinder and Instagram together with their “likes” to predict whether an image would be deemed “attractive” (see …). Smartphone makers are already incorporating such models in their photo apps to help you brush up your pictures.
• In “The Girl with the Brick Earring”, Lucas Woltmann sets out to scrape Lego brick information from https://www.bricklink.com to determine the best selection of Lego pieces to represent an image (see 8/the-girl-with-the-brick-earring.html).
• Lyst, a London-based online fashion marketplace, scraped the web for semi-structured information about fashion products and then applied machine learning to present this information cleanly and elegantly for consumers from one central website. Other data scientists have done similar projects to cluster similar fashion products (see ).
• We’ve supervised a study where web scraping was used to extract information from job sites, to get an idea regarding the popularity of different data science and analytics related tools in the workplace (spoiler: Python and R were both rising steadily).
• Another study from our research group involved using web scraping to monitor news outlets and web forums to track public sentiment regarding Bitcoin.

No matter your field of interest, there’s almost always a use case to improve or enrich your practice based on data. “Data is the new oil”, so the common saying goes, and the web has a lot of it.
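To make the definition from Section 1.2 a little more concrete ("download, parse, and organize data from the web in an automated manner"), here is a minimal sketch of those three steps for the "interesting table on a web page" scenario. It is only an illustration under stated assumptions: the function and class names (`download`, `TableParser`, `parse_table`, `to_csv`) and the sample HTML with its values are invented for this example, and it relies on the Python standard library only so that it runs without installing anything. The book itself builds its scrapers with the "requests" and "Beautiful Soup" libraries, which are not used here.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def download(url):
    """Step 1 (download): fetch a page and return its HTML as text.

    Not called below, so the example runs without network access.
    """
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


class TableParser(HTMLParser):
    """Step 2 (parse): collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return all table rows found in the HTML as lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Step 3 (organize): turn the rows into CSV text for later analysis."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample page standing in for the result of download(url).
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>20</td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML)
print(to_csv(rows))
```

In a real scraper, the string passed to `parse_table` would come from `download(url)` for a page you are permitted to fetch. Keeping the three steps in separate functions makes it easy to swap any one of them, for instance replacing the hand-written parser with a dedicated HTML parsing library.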

Web Scraping for Data Science with Python
Seppe vanden Broucke and Bart Baesens
– End of Extract –
Get the full book on Amazon
