Web Scraping With Python - Programmer Books


Web Scraping with Python
Collecting More Data from the Modern Web
Second Edition

Ryan Mitchell

Beijing · Boston · Farnham · Sebastopol · Tokyo

Web Scraping with Python
by Ryan Mitchell

Copyright 2018 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Allyson MacDonald
Production Editor: Justin Billing
Copyeditor: Sharon Wilkey
Proofreader: Christina Edwards
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2018: Second Edition

Revision History for the Second Edition
2018-03-20: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491985571 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Web Scraping with Python, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-98557-1
[LSI]

Table of Contents

Preface

Part I. Building Scrapers

1. Your First Web Scraper
    Connecting
    An Introduction to BeautifulSoup
    Installing BeautifulSoup
    Running BeautifulSoup
    Connecting Reliably and Handling Exceptions

2. Advanced HTML Parsing
    You Don’t Always Need a Hammer
    Another Serving of BeautifulSoup
    find() and find_all() with BeautifulSoup
    Other BeautifulSoup Objects
    Navigating Trees
    Regular Expressions
    Regular Expressions and BeautifulSoup
    Accessing Attributes
    Lambda Expressions

3. Writing Web Crawlers
    Traversing a Single Domain
    Crawling an Entire Site
    Collecting Data Across an Entire Site
    Crawling Across the Internet

4. Web Crawling Models
    Planning and Defining Objects
    Dealing with Different Website Layouts
    Structuring Crawlers
    Crawling Sites Through Search
    Crawling Sites Through Links
    Crawling Multiple Page Types
    Thinking About Web Crawler Models

5. Scrapy
    Installing Scrapy
    Initializing a New Spider
    Writing a Simple Scraper
    Spidering with Rules
    Creating Items
    Outputting Items
    The Item Pipeline
    Logging with Scrapy
    More Resources

6. Storing Data
    Media Files
    Storing Data to CSV
    MySQL
    Installing MySQL
    Some Basic Commands
    Integrating with Python
    Database Techniques and Good Practice
    “Six Degrees” in MySQL
    Email

Part II. Advanced Scraping

7. Reading Documents
    Document Encoding
    Text
    Text Encoding and the Global Internet
    CSV
    Reading CSV Files
    PDF
    Microsoft Word and .docx

8. Cleaning Your Dirty Data
    Cleaning in Code
    Data Normalization
    Cleaning After the Fact
    OpenRefine

9. Reading and Writing Natural Languages
    Summarizing Data
    Markov Models
    Six Degrees of Wikipedia: Conclusion
    Natural Language Toolkit
    Installation and Setup
    Statistical Analysis with NLTK
    Lexicographical Analysis with NLTK
    Additional Resources

10. Crawling Through Forms and Logins
    Python Requests Library
    Submitting a Basic Form
    Radio Buttons, Checkboxes, and Other Inputs
    Submitting Files and Images
    Handling Logins and Cookies
    HTTP Basic Access Authentication
    Other Form Problems

11. Scraping JavaScript
    A Brief Introduction to JavaScript
    Common JavaScript Libraries
    Ajax and Dynamic HTML
    Executing JavaScript in Python with Selenium
    Additional Selenium Webdrivers
    Handling Redirects
    A Final Note on JavaScript

12. Crawling Through APIs
    A Brief Introduction to APIs
    HTTP Methods and APIs
    More About API Responses
    Parsing JSON
    Undocumented APIs
    Finding Undocumented APIs
    Documenting Undocumented APIs
    Finding and Documenting APIs Automatically
    Combining APIs with Other Data Sources
    More About APIs

13. Image Processing and Text Recognition
    Overview of Libraries
    Pillow
    Tesseract
    NumPy
    Processing Well-Formatted Text
    Adjusting Images Automatically
    Scraping Text from Images on Websites
    Reading CAPTCHAs and Training Tesseract
    Training Tesseract
    Retrieving CAPTCHAs and Submitting Solutions

14. Avoiding Scraping Traps
    A Note on Ethics
    Looking Like a Human
    Adjust Your Headers
    Handling Cookies with JavaScript
    Timing Is Everything
    Common Form Security Features
    Hidden Input Field Values
    Avoiding Honeypots
    The Human Checklist

15. Testing Your Website with Scrapers
    An Introduction to Testing
    What Are Unit Tests?
    Python unittest
    Testing Wikipedia
    Testing with Selenium
    Interacting with the Site
    unittest or Selenium?

16. Web Crawling in Parallel
    Processes versus Threads
    Multithreaded Crawling
    Race Conditions and Queues
    The threading Module
    Multiprocess Crawling
    Multiprocess Crawling
    Communicating Between Processes
    Multiprocess Crawling: Another Approach

17. Scraping Remotely
    Why Use Remote Servers?
    Avoiding IP Address Blocking
    Portability and Extensibility
    Tor
    PySocks
    Remote Hosting
    Running from a Website-Hosting Account
    Running from the Cloud
    Additional Resources

18. The Legalities and Ethics of Web Scraping
    Trademarks, Copyrights, Patents, Oh My!
    Copyright Law
    Trespass to Chattels
    The Computer Fraud and Abuse Act
    robots.txt and Terms of Service
    Three Web Scrapers
    eBay versus Bidder’s Edge and Trespass to Chattels
    United States v. Auernheimer and The Computer Fraud and Abuse Act
    Field v. Google: Copyright and robots.txt
    Moving Forward

Index

Preface

To those who have not developed the skill, computer programming can seem like a kind of magic. If programming is magic, web scraping is wizardry: the application of magic for particularly impressive and useful, yet surprisingly effortless, feats.

In my years as a software engineer, I’ve found that few programming practices capture the excitement of both programmers and laymen alike quite like web scraping. The ability to write a simple bot that collects data and streams it down a terminal or stores it in a database, while not difficult, never fails to provide a certain thrill and sense of possibility, no matter how many times you might have done it before.

Unfortunately, when I speak to other programmers about web scraping, there’s a lot of misunderstanding and confusion about the practice. Some people aren’t sure it’s legal (it is), or how to handle problems like JavaScript-heavy pages or required logins. Many are confused about how to start a large web scraping project, or even where to find the data they’re looking for. This book seeks to put an end to many of these common questions and misconceptions about web scraping, while providing a comprehensive guide to most common web scraping tasks.

Web scraping is a diverse and fast-changing field, and I’ve tried to provide both high-level concepts and concrete examples to cover just about any data collection project you’re likely to encounter. Throughout the book, code samples are provided to demonstrate these concepts and allow you to try them out. The code samples themselves can be used and modified with or without attribution (although acknowledgment is always appreciated). All code samples are available on GitHub for viewing and downloading.

What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.

In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.

In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis, natural language parsing, and information security. Because the scope of the field is so broad, this book covers the fundamental basics of web scraping and crawling in Part I and delves into advanced topics in Part II. I suggest that all readers carefully study the first part and delve into the more specific topics in the second part as needed.

Why Web Scraping?

If the only way you access the internet is through a browser, you’re missing out on a huge range of possibilities. Although browsers are handy for executing JavaScript, displaying images, and arranging objects in a more human-readable format (among other things), web scrapers are excellent at gathering and processing large amounts of data quickly. Rather than viewing one page at a time through the narrow window of a monitor, you can view databases spanning thousands or even millions of pages at once.

In addition, web scrapers can go places that traditional search engines cannot. A Google search for “cheapest flights to Boston” will result in a slew of advertisements and popular flight search sites. Google knows only what these websites say on their content pages, not the exact results of various queries entered into a flight search application. However, a well-developed web scraper can chart the cost of a flight to Boston over time, across a variety of websites, and tell you the best time to buy your ticket.

You might be asking: “Isn’t data gathering what APIs are for?” (If you’re unfamiliar with APIs, see Chapter 12.) Well, APIs can be fantastic, if you find one that suits your purposes. They are designed to provide a convenient stream of well-formatted data from one computer program to another. You can find an API for many types of data you might want to use, such as Twitter posts or Wikipedia pages. In general, it is preferable to use an API (if one exists), rather than build a bot to get the same data. However, an API might not exist or be useful for your purposes, for several reasons:

• You are gathering relatively small, finite sets of data across a large collection of websites without a cohesive API.
• The data you want is fairly small or uncommon, and the creator did not think it warranted an API.
• The source does not have the infrastructure or technical ability to create an API.
• The data is valuable and/or protected and not intended to be spread widely.

Even when an API does exist, the request volume and rate limits, the types of data, or the format of data that it provides might be insufficient for your purposes.

This is where web scraping steps in. With few exceptions, if you can view data in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.

There are obviously many extremely practical applications of having access to nearly unlimited data: market forecasting, machine-language translation, and even medical diagnostics have benefited tremendously from the ability to retrieve and analyze data from news sites, translated texts, and health forums, respectively.

Even in the art world, web scraping has opened up new frontiers for creation. The 2006 project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of English-language blog sites for phrases starting with “I feel” or “I am feeling.” This led to a popular data visualization, describing how the world was feeling day by day and minute by minute.

Regardless of your field, web scraping almost always provides a way to guide business practices more effectively, improve productivity, or even branch off into a brand-new field entirely.

About This Book

This book is designed to serve not only as an introduction to web scraping, but as a comprehensive guide to collecting, transforming, and using data from uncooperative sources. Although it uses the Python programming language and covers many Python basics, it should not be used as an introduction to the language.

If you don’t know any Python at all, this book might be a bit of a challenge. Please do not use it as an introductory Python text. With that said, I’ve tried to keep all concepts and code samples at a beginning-to-intermediate Python programming level in order to make the content accessible to a wide range of readers. To this end, there are occasional explanations of more advanced Python programming and general computer science topics where appropriate. If you are a more advanced reader, feel free to skim these parts!
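To give a rough sense of the Python level assumed, the workflow this preface describes (query a web server, parse the returned HTML, store the result in a database) can be sketched in a few lines. This is a minimal illustration, not code from the book: it uses only the standard library (html.parser and sqlite3) so that it runs without installing anything, whereas the book itself works with BeautifulSoup. The names TitleParser, fetch, parse_title, store, the pages table, and the sample HTML are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first non-empty <h1> tag."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start capturing if no heading text has been found yet.
        if tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title = (self.title or "") + data


def fetch(url):
    """Step 1: query a web server and return the page as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_title(html):
    """Step 2: parse the HTML and extract the needed information."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title else None


def store(connection, url, title):
    """Step 3: store the result in a database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)"
    )
    connection.execute("INSERT INTO pages VALUES (?, ?)", (url, title))
    connection.commit()


# Hypothetical page content, used so the sketch runs offline.
# With network access, SAMPLE_HTML could be replaced by fetch(some_url).
SAMPLE_HTML = "<html><body><h1>An Example Page</h1><p>Hello</p></body></html>"

connection = sqlite3.connect(":memory:")
store(connection, "http://example.com/", parse_title(SAMPLE_HTML))
rows = connection.execute("SELECT url, title FROM pages").fetchall()
print(rows)  # prints [('http://example.com/', 'An Example Page')]
```

The three functions are kept separate on purpose: the parsing and storage steps can be exercised on saved HTML without touching the network, which is usually the easier way to develop a scraper.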

If you’re looking for a more comprehensive Python resource, Introducing Python by Bill Lubanovic (O’Reilly) is a good, if lengthy, guide. For those with shorter attention spans, the video series Introduction to Python by Jessica McKellar (O’Reilly) is an excellent resource. I’ve also enjoyed Think Python by a former professor of mine, Allen Downey (O’Reilly). This last book in particular is ideal for those new to programming, and teaches computer science and software engineering concepts along with the Python language.

Technical books are often able to focus on a single language or technology, but web scraping is a relatively disparate subject, with practices that require the use of databases, web servers, HTTP, HTML, internet security, image processing, data science, and other tools. This book attempts to cover all of these, and other topics, from the perspective of “data gathering.” It should not be used as a complete treatment of any of these subjects, but I believe they are covered in enough detail to get you started writing web scrapers!

Part I covers the subject of web scraping and web crawling in depth, with a strong focus on a small handful of libraries used throughout the book. Part I can easily be used as a comprehensive reference for these libraries and techniques (with certain exceptions, where additional references will be provided). The skills taught in the first part will likely be useful for everyone writing a web scraper, regardless of their particular target or application.

Part II covers additional subjects that the reader might find useful when writing web scrapers, but that might not be useful for all scrapers all the time. These subjects are, unfortunately, too broad to be neatly wrapped up in a single chapter. Because of this, frequent references are made to other resources for additional information.

The structure of this book enables you to easily jump around among chapters to find only the web scraping technique or information that you are looking for. When a concept or piece of code builds on another mentioned in a previous chapter, I explicitly reference the section that it was addressed in.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download.

This book is here to help you get your job done. If the example code in this book is useful to you, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Web Scraping with Python, Second Edition by Ryan Mitchell (O’Reilly). Copyright 2018 Ryan Mitchell, 978-1-491-98557-1.”

If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at permissions@oreilly.com.

Unfortunately, printed books are
