5 Web Scraping I: Introduction To BeautifulSoup


Upload by : Kian Swinton


Lab Objective: Web scraping is the process of gathering data from websites on the internet. Since almost everything rendered by an internet browser as a web page uses HTML, the first step in web scraping is being able to extract information from HTML. In this lab, we introduce BeautifulSoup, Python's canonical tool for efficiently and cleanly navigating and parsing HTML.

HTML

Hyper Text Markup Language, or HTML, is the standard markup language (a language designed for the processing, definition, and presentation of text) for creating webpages. It provides a document with structure and is composed of pairs of tags that surround and define various types of content. Opening tags have a tag name surrounded by angle brackets (<tag-name>). The companion closing tag looks the same, but with a forward slash before the tag name (</tag-name>). A list of all current HTML tags can be found at http://htmldog.com/reference/htmltags.

Most tags can be combined with attributes to include more data about the content, help identify individual tags, and make navigating the document much simpler. In the following example, the <a> tag has id and href attributes.

    <html>                                      <!-- Opening tags -->
        <body>
            <p>
                Click <a id='info' href='http://www.example.com'>here</a>
                for more information.
            </p>                                <!-- Closing tags -->
        </body>
    </html>

In HTML, href stands for hypertext reference, a link to another website. Thus the above example would be rendered by a browser as a single line of text, with "here" being a clickable link to http://www.example.com:

    Click here for more information.
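To make the anatomy of tags and attributes concrete, the following minimal sketch feeds the example above to Python's standard-library html.parser module and records every opening tag together with its attributes. The class name TagLogger and the variable names are illustrative choices, not part of the lab.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record (tag name, attribute dict) for every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.opened = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.opened.append((tag, dict(attrs)))


example_html = """
<html>
    <body>
        <p>
            Click <a id='info' href='http://www.example.com'>here</a>
            for more information.
        </p>
    </body>
</html>
"""

logger = TagLogger()
logger.feed(example_html)
logger.close()

for name, attributes in logger.opened:
    print(name, attributes)
```

Only the <a> tag carries a non-empty attribute dictionary here; the other tags are reported with empty dictionaries.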

Unlike Python, HTML does not enforce indentation (or any whitespace rules), though indentation generally makes HTML more readable. The previous example can even be written equivalently in a single line.

    <html><body><p>Click <a id='info' href='http://www.example.com'>here</a> for more information.</p></body></html>

Special tags, which don't contain any text or other tags, are written without a closing tag and in a single pair of brackets. A forward slash is included between the name and the closing bracket. Examples of these include <hr/>, which describes a horizontal line, and <img/>, the tag for representing an image.

Problem 1. The HTML of a website is easy to view in most browsers. In Google Chrome, go to http://www.example.com, right click anywhere on the page that isn't a picture or a link, and select View Page Source. This will open the HTML source code that defines the page. Examine the source code. What tags are used? What is the value of the type attribute associated with the style tag?
Write a function that returns the set of names of tags used in the website, and the value of the type attribute of the style tag (as a string).
(Hint: there are ten unique tag names.)

BeautifulSoup

BeautifulSoup (bs4) is a package [1] that makes it simple to navigate and extract data from HTML documents. See oc/index.html for the full documentation.

The bs4.BeautifulSoup class accepts two parameters to its constructor: a string of HTML code, and an HTML parser to use under the hood. The HTML parser is technically a keyword argument, but the constructor prints a warning if one is not specified. The standard choice for the parser is "html.parser", which means the object uses the standard library's html.parser module as the engine behind the scenes.

Note: Depending on project demands, a parser other than "html.parser" may be useful. A couple of other options are "lxml", an extremely fast parser written in C, and "html5lib", a slower parser that treats HTML in much the same way a web browser does, allowing for irregularities. Both must be installed independently; see doc/#installing-a-parser for more information.

A BeautifulSoup object represents an HTML document as a tree. In the tree, each tag is a node with nested tags and strings as its children. The prettify() method returns a string that can be printed to represent the BeautifulSoup object in a readable format that reflects the tree structure.

[1] BeautifulSoup is not part of the standard library; install it with conda install beautifulsoup4 or with pip install beautifulsoup4.

    >>> from bs4 import BeautifulSoup

    >>> small_example_html = """
    <html><body><p>
        Click <a id='info' href='http://www.example.com'>here</a>
        for more information.
    </p></body></html>
    """

    >>> small_soup = BeautifulSoup(small_example_html, 'html.parser')
    >>> print(small_soup.prettify())
    <html>
     <body>
      <p>
       Click
       <a href="http://www.example.com" id="info">
        here
       </a>
       for more information.
      </p>
     </body>
    </html>

Each tag in a BeautifulSoup object's HTML code is stored as a bs4.element.Tag object, with actual text stored as a bs4.element.NavigableString object. Tags are accessible directly through the BeautifulSoup object.

    # Get the <p> tag (and everything inside of it).
    >>> small_soup.p
    <p>
        Click <a href="http://www.example.com" id="info">here</a>
        for more information.
    </p>

    # Get the <a> sub-tag of the <p> tag.
    >>> a_tag = small_soup.p.a
    >>> print(a_tag, type(a_tag), sep='\n')
    <a href="http://www.example.com" id="info">here</a>
    <class 'bs4.element.Tag'>

    # Get just the name, attributes, and text of the <a> tag.
    >>> print(a_tag.name, a_tag.attrs, a_tag.string, sep="\n")
    a
    {'id': 'info', 'href': 'http://www.example.com'}
    here
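Besides the attrs dictionary shown above, a Tag also supports dictionary-style lookup of a single attribute. The short sketch below contrasts tag["href"] with tag.get(); the attribute name "title" is used only as an example of something the tag does not have.

```python
from bs4 import BeautifulSoup

html = "<p>Click <a id='info' href='http://www.example.com'>here</a> for more information.</p>"
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.a

# Dictionary-style lookup raises KeyError if the attribute is missing.
link = a_tag["href"]

# get() returns None (or a chosen default) instead of raising.
title = a_tag.get("title")
fallback = a_tag.get("title", "no title")

print(link, title, fallback, sep="\n")
```

Using get() is usually the safer choice when scraping pages whose tags do not all carry the same attributes.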

    Attribute           Description
    name                The name of the tag
    attrs               A dictionary of the attributes
    string              The single string contained in the tag
    strings             Generator for strings of children tags
    stripped_strings    Generator for strings of children tags, stripping whitespace
    text                Concatenation of strings from all children tags

Table 5.1: Data attributes of the bs4.element.Tag class.

Problem 2. The BeautifulSoup class has a find_all() method that, when called with True as the only argument, returns a list of all tags in the HTML source code.
Write a function that accepts a string of HTML code as an argument. Use BeautifulSoup to return a list of the names of the tags in the code. Use your function and the source code from http://www.example.com (see example.html) to check your answers from Problem 1.

Navigating the Tree Structure

Not all tags are easily accessible from a BeautifulSoup object. Consider the following example.

    >>> pig_html = """
    <html><head><title>Three Little Pigs</title></head>
    <body>
    <p class="title"><b>The Three Little Pigs</b></p>
    <p class="story">Once upon a time, there were three little pigs named
    <a href="http://example.com/larry" class="pig" id="link1">Larry,</a>
    <a href="http://example.com/mo" class="pig" id="link2">Mo</a>, and
    <a href="http://example.com/curly" class="pig" id="link3">Curly.</a>
    <p>The three pigs had an odd fascination with experimental construction.</p>
    <p>.</p>
    </body></html>
    """

    >>> pig_soup = BeautifulSoup(pig_html, "html.parser")
    >>> pig_soup.p
    <p class="title"><b>The Three Little Pigs</b></p>

    >>> pig_soup.a
    <a class="pig" href="http://example.com/larry" id="link1">Larry,</a>

Since the HTML in this example has several <p> and <a> tags, only the first tag of each name is accessible directly from pig_soup. The other tags can be accessed by manually navigating through the HTML tree.

Every HTML tag (except for the topmost tag, which is usually <html>) has a parent tag. Each tag also has zero or more sibling and children tags or text. Following a true tree structure, every bs4.element.Tag in a soup has multiple attributes for accessing or iterating through parent, sibling, or child tags.

    Attribute            Description
    parent               The parent tag
    parents              Generator for the parent tags up to the top level
    next_sibling         The tag immediately after the current tag
    next_siblings        Generator for sibling tags after the current tag
    previous_sibling     The tag immediately before the current tag
    previous_siblings    Generator for sibling tags before the current tag
    contents             A list of the immediate children tags
    children             Generator for immediate children tags
    descendants          Generator for all children tags (recursively)

Table 5.2: Navigation attributes of the bs4.element.Tag class.

    >>> print(pig_soup.prettify())
    <html>
     <head>                         # <head> is the parent of the <title>
      <title>
       Three Little Pigs
      </title>
     </head>
     <body>                         # <body> is the sibling of <head>
      <p class="title">             # and the parent of two <p> tags (title and story).
       <b>
        The Three Little Pigs
       </b>
      </p>
      <p class="story">
       Once upon a time, there were three little pigs named
       <a class="pig" href="http://example.com/larry" id="link1">
        Larry,
       </a>
       <a class="pig" href="http://example.com/mo" id="link2">
        Mo
       </a>
       , and
       <a class="pig" href="http://example.com/curly" id="link3">
        Curly.
       </a>                         # The preceding <a> tags are siblings with each
       <p>                          # other and the following two <p> tags.
        The three pigs had an odd fascination with experimental construction.
       </p>
       <p>
        .
       </p>
      </p>
     </body>
    </html>

    # Start at the first <a> tag in the soup.
    >>> a_tag = pig_soup.a
    >>> a_tag
    <a class="pig" href="http://example.com/larry" id="link1">Larry,</a>

    # Get the names of all of <a>'s parent tags, traveling up to the top.
    # The name '[document]' means it is the top of the HTML code.
    >>> [par.name for par in a_tag.parents]     # <a>'s parent is <p>, whose
    ['p', 'body', 'html', '[document]']         # parent is <body>, and so on.

    # Get the next siblings of <a>.
    >>> a_tag.next_sibling
    '\n'                                        # The first sibling is just text.
    >>> a_tag.next_sibling.next_sibling         # The second sibling is a tag.
    <a class="pig" href="http://example.com/mo" id="link2">Mo</a>

    # Alternatively, get all siblings past <a> at once.
    >>> list(a_tag.next_siblings)
    ['\n',
     <a class="pig" href="http://example.com/mo" id="link2">Mo</a>,
     ', and\n',
     <a class="pig" href="http://example.com/curly" id="link3">Curly.</a>,
     '\n',
     <p>The three pigs had an odd fascination with experimental construction.</p>,
     '\n',
     <p>.</p>,
     '\n']

Note carefully that newline characters are considered to be children of a parent tag. Therefore iterating through children or siblings often requires checking which entries are tags and which are just text.

    # Get to the <p> tag that has class="story".
    >>> p_tag = pig_soup.body.p.next_sibling.next_sibling
    >>> p_tag.attrs["class"]                    # Make sure it's the right tag.
    ['story']

    # Iterate through the child tags of <p> and print hrefs whenever they exist.
    >>> for child in p_tag.children:
    ...     if hasattr(child, "attrs") and "href" in child.attrs:
    ...         print(child.attrs["href"])
    http://example.com/larry
    http://example.com/mo
    http://example.com/curly

Note that the "class" attribute of the <p> tag is a list. This is because the "class" attribute can take on several values at once; for example, the tag <p class="story book"> is of class 'story' and of class 'book'.
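An alternative to the hasattr() check is to test each child's type directly. The sketch below, which uses a small made-up HTML fragment rather than the pig example, keeps only the children that are bs4.element.Tag objects and then collects their href values with Tag.has_attr().

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A small illustrative fragment (not the lab's pig_html).
html = """<p class="story">Once upon a time
<a href="http://example.com/larry">Larry,</a>
<a href="http://example.com/mo">Mo</a>, and
<b>friends</b></p>"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the children that are tags; plain strings such as newlines are skipped.
child_tags = [child for child in soup.p.children if isinstance(child, Tag)]

# Among those tags, collect the href of each one that has the attribute.
hrefs = [tag["href"] for tag in child_tags if tag.has_attr("href")]

print([tag.name for tag in child_tags])
print(hrefs)
```

The <b> tag survives the first filter but is dropped by the second because it has no href attribute.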

Note: The behavior of the string attribute of a bs4.element.Tag object depends on the structure of the corresponding HTML tag.

1. If the tag has a string of text and no other child elements, then string is just that text.
2. If the tag has exactly one child tag and the child tag has only a string of text, then the tag has the same string as its child tag.
3. If the tag has more than one child, then string is None. In this case, use strings to iterate through the child strings. Alternatively, the get_text() method returns all text belonging to a tag and to all of its descendants. In other words, it returns anything inside a tag that isn't another tag.

    >>> pig_soup.head
    <head><title>Three Little Pigs</title></head>

    # Case 1: the <title> tag's only child is a string.
    >>> pig_soup.head.title.string
    'Three Little Pigs'

    # Case 2: The <head> tag's only child is the <title> tag.
    >>> pig_soup.head.string
    'Three Little Pigs'

    # Case 3: the <body> tag has several children.
    >>> pig_soup.body.string is None
    True
    >>> print(pig_soup.body.get_text().strip())
    The Three Little Pigs
    Once upon a time, there were three little pigs named
    Larry,
    Mo, and
    Curly.
    The three pigs had an odd fascination with experimental construction.
    .

Problem 3. The file example.html contains the HTML source for http://www.example.com. Write a function that reads the file and loads the code into BeautifulSoup. Find the only <a> tag with a hyperlink and return its text.
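The three cases for the string attribute described in the note can also be compared side by side with stripped_strings and get_text(). The following sketch uses a tiny made-up fragment; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<div><h1> Pigs </h1><p>Larry, <b>Mo</b>, and Curly.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# <h1> has a single string child, so .string is that text (whitespace included).
heading = soup.h1.string

# <p> has three children (a string, a <b> tag, and another string), so .string is None.
paragraph_string = soup.p.string

# stripped_strings yields each piece of text with surrounding whitespace removed.
pieces = list(soup.div.stripped_strings)

# get_text() joins all descendant text; separator and strip are optional arguments.
joined = soup.p.get_text()
compact = soup.div.get_text(separator=" | ", strip=True)

print(repr(heading), paragraph_string, pieces, joined, compact, sep="\n")
```

When a tag mixes text and nested tags, get_text() or stripped_strings is generally the more predictable way to recover its text.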

Searching for Tags

Navigating the HTML tree manually can be helpful for gathering data out of lists or tables, but these kinds of structures are usually buried deep in the tree. The find() and find_all() methods of the BeautifulSoup class identify tags that have distinctive characteristics, making it much easier to jump straight to a desired location in the HTML code. The find() method returns only the first tag that matches the given criteria, while find_all() returns a list of all matching tags. Tags can be matched by name, attributes, and/or text.

    # Find the first <b> tag in the soup.
    >>> pig_soup.find(name='b')
    <b>The Three Little Pigs</b>

    # Find all tags with a class attribute of 'pig'.
    # Since 'class' is a Python keyword, use 'class_' as the argument.
    >>> pig_soup.find_all(class_="pig")
    [<a class="pig" href="http://example.com/larry" id="link1">Larry,</a>,
     <a class="pig" href="http://example.com/mo" id="link2">Mo</a>,
     <a class="pig" href="http://example.com/curly" id="link3">Curly.</a>]

    # Find the first tag that matches several attributes.
    >>> pig_soup.find(attrs={"class": "pig", "href": "http://example.com/mo"})
    <a class="pig" href="http://example.com/mo" id="link2">Mo</a>

    # Find the first tag whose text is 'Mo'.
    >>> pig_soup.find(string='Mo')
    'Mo'                                        # The result is the actual string,
    >>> pig_soup.find(string='Mo').parent       # so go up one level to get the tag.
    <a class="pig" href="http://example.com/mo" id="link2">Mo</a>

Problem 4. The file san_diego_weather.html contains the HTML source for an old page from Weather Underground. [a] Write a function that reads the file and loads it into BeautifulSoup. Return a list of the following tags:

1. The tag containing the date "Thursday, January 1, 2015".
2. The tags which contain the links "Previous Day" and "Next Day."
3. The tag which contains the number associated with the Actual Max Temperature.

This HTML tree is significantly larger than the previous examples. To get started, consider opening the file in a web browser. Find the element that you are searching for on the page, right click it, and select Inspect. This opens the HTML source at the element that the mouse clicked on.

[a] N/2015/1/1/DailyHistory.html?req city San Diego&req state CA&req statename California&reqdb.zip 92101&reqdb.magic 1&reqdb.wmo 99999&MR 1
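As a small illustration of combining find() and find_all() with navigation, the sketch below locates a labelled table cell and then steps to the neighboring cell that holds its value. The table, its class names, and the numbers are invented for illustration and are not taken from san_diego_weather.html; find_next_sibling() is a bs4 method that skips over intervening text.

```python
from bs4 import BeautifulSoup

# Hypothetical table; structure and values are made up for this sketch.
html = """
<table>
  <tr><td class="label">Max Temperature</td><td class="value">59</td></tr>
  <tr><td class="label">Min Temperature</td><td class="value">41</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Match by tag name and text at once: the <td> whose string is the label.
label = soup.find(name="td", string="Max Temperature")

# Step sideways to the next <td> in the same row.
value_cell = label.find_next_sibling("td")

# find_all() returns every match; find() returns None when nothing matches.
values = [td.string for td in soup.find_all("td", class_="value")]
missing = soup.find("td", class_="unknown")

print(value_cell.string, values, missing)
```

Checking for None before using the result of find() avoids an AttributeError when a page does not contain the expected tag.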

Advanced Search Techniques

Consider the problem of finding the tag that is a link to the URL http://example.com/curly.

    >>> pig_soup.find(href="http://example.com/curly")
    <a class="pig" href="http://example.com/curly" id="link3">Curly.</a>

This approach works, but it requires entering the entire URL. To perform generalized searches, the find() and find_all() methods also accept compiled regular expressions from the re module. This way, the methods locate tags whose name, attributes, and/or string matches a pattern.

    >>> import re

    # Find the first tag with an href attribute containing 'curly'.
    >>> pig_soup.find(href=re.compile(r"curly"))
    <a class="pig" href="http://example.com/curly" id="link3">Curly.</a>

    # Find the first tag with a string that starts with 'Cu'.
    >>> pig_soup.find(string=re.compile(r"^Cu")).parent
    <a class="pig" href="http://example.com/curly" id="link3">Curly.</a>

    # Find all tags with text containing 'Three'.
    >>> [tag.parent for tag in pig_soup.find_all(string=re.compile(r"Three"))]
    [<title>Three Little Pigs</title>, <b>The Three Little Pigs</b>]

Finally, to find a tag that has a particular attribute, regardless of the actual value of the attribute, use True in place of search values.

    # Find all tags with an 'id' attribute.
    >>> pig_soup.find_all(id=True)
    [<a class="pig" href="http://example.com/larry" id="link1">Larry,</a>,
     <a class="pig" href="http://example.com/mo" id="link2">Mo</a>,
     <a class="pig" href="http://example.com/curly" id="link3">Curly.</a>]

    # Find the names of all tags WITHOUT an 'id' attribute.
    >>> [tag.name for tag in pig_soup.find_all(id=False)]
    ['html', 'head', 'title', 'body', 'p', 'b', 'p', 'p', 'p']

Problem 5. The file large_banks_index.html is an index of data about large banks, as recorded by the Federal Reserve. [a] Write a function that reads the file and loads the source into BeautifulSoup. Return a list of the tags containing the links to bank data from September 30, 2003 to December 31, 2014, where the dates are in reverse chronological order.

[a] See https://www.federalreserve.gov/releases/lbr/.
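The regular-expression and True filters described above can be combined with a tag name to narrow a search. The sketch below runs them against a made-up list of links; the paths and link text are hypothetical and are not meant to reflect the actual layout of large_banks_index.html.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical index page; the hrefs are invented for this sketch.
html = """
<ul>
  <li><a href="/releases/20141231/default.htm">December 31, 2014</a></li>
  <li><a href="/releases/20140930/default.htm">September 30, 2014</a></li>
  <li><a href="/about.htm">About</a></li>
  <li><a name="top">Top</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# <a> tags whose href contains an eight-digit path component.
dated = soup.find_all("a", href=re.compile(r"/\d{8}/"))

# <a> tags that have an href attribute at all, whatever its value.
with_href = soup.find_all("a", href=True)

# <a> tags whose text starts with 'September'.
by_text = soup.find_all("a", string=re.compile(r"^September"))

print([a["href"] for a in dated])
print(len(with_href), by_text[0]["href"])
```

Because find_all() returns matches in document order, the order of the resulting list follows the order of the links on the page.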

Problem 6. The file large_banks_data.html is one of the pages from the index in Problem 5. [a] Write a function that reads the file and loads the source into BeautifulSoup. Create a single figure with two subplots:

1. A sorted bar chart of the seven banks with the most domestic branches.
2. A sorted bar chart of the seven banks with the most foreign branches.

In the case of a tie, sort the banks alphabetically by name.

[a] 0930/default.htm.
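The tie-breaking rule in Problem 6 can be expressed with a compound sort key. The sketch below shows one possible way to do this on invented (name, count) pairs; the bank names, the counts, and the helper name top_banks are hypothetical and are not taken from the data file.

```python
# Hypothetical (bank name, branch count) pairs; not taken from the data file.
branches = [
    ("Delta Bank", 120),
    ("Alpha Bank", 300),
    ("Charlie Bank", 120),
    ("Bravo Bank", 45),
]


def top_banks(records, n):
    """Return the n records with the largest counts, ties broken alphabetically."""
    # Negating the count sorts it in descending order; the name then
    # sorts in ascending order among records with equal counts.
    ordered = sorted(records, key=lambda rec: (-rec[1], rec[0]))
    return ordered[:n]


top_three = top_banks(branches, 3)
print(top_three)
```

The names and counts of the selected records can then be unpacked into two lists and handed to whichever plotting routine draws the bar charts.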
