Beautiful Soup Documentation - Read The Docs


Beautiful Soup Documentation
Release 4.4.0
Leonard Richardson
Dec 24, 2019

Contents

1 Getting help
2 Quick Start
3 Installing Beautiful Soup
    3.1 Problems after installation
    3.2 Installing a parser
4 Making the soup
5 Kinds of objects
    5.1 Tag
    5.2 NavigableString
    5.3 BeautifulSoup
    5.4 Comments and other special strings
6 Navigating the tree
    6.1 Going down
    6.2 Going up
    6.3 Going sideways
    6.4 Going back and forth
7 Searching the tree
    7.1 Kinds of filters
    7.2 find_all()
    7.3 Calling a tag is like calling find_all()
    7.4 find()
    7.5 find_parents() and find_parent()
    7.6 find_next_siblings() and find_next_sibling()
    7.7 find_previous_siblings() and find_previous_sibling()
    7.8 find_all_next() and find_next()
    7.9 find_all_previous() and find_previous()
    7.10 CSS selectors
8 Modifying the tree
    8.1 Changing tag names and attributes
    8.2 Modifying .string
    8.3 append()
    8.4 extend()
    8.5 NavigableString() and .new_tag()
    8.6 insert()
    8.7 insert_before() and insert_after()
    8.8 clear()
    8.9 extract()
    8.10 decompose()
    8.11 replace_with()
    8.12 wrap()
    8.13 unwrap()
    8.14 smooth()
9 Output
    9.1 Pretty-printing
    9.2 Non-pretty printing
    9.3 Output formatters
    9.4 get_text()
10 Specifying the parser to use
    10.1 Differences between parsers
11 Encodings
    11.1 Output encoding
    11.2 Unicode, Dammit
12 Line numbers
13 Comparing objects for equality
14 Copying Beautiful Soup objects
15 Parsing only part of a document
    15.1 SoupStrainer
16 Troubleshooting
    16.1 diagnose()
    16.2 Errors when parsing a document
    16.3 Version mismatch problems
    16.4 Parsing XML
    16.5 Other parser problems
    16.6 Miscellaneous
    16.7 Improving Performance
17 Translating this documentation
18 Beautiful Soup 3
    18.1 Porting code to BS4

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.

This document covers Beautiful Soup version 4.8.1. The examples in this documentation should work the same way in Python 2.7 and Python 3.2.

You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed and that support for it will be dropped on or after December 31, 2020. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.

This documentation has been translated into other languages by Beautiful Soup users. This document is also available in Brazilian Portuguese.


CHAPTER 1
Getting help

If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
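As a rough illustration of the advice above, the sketch below captures what diagnose() prints so the report can be pasted into a message. The sample markup and the variable names are assumptions made for this example; diagnose() itself lives in the bs4.diagnose module and writes its report to standard output.

```python
import contextlib
import io

from bs4.diagnose import diagnose

# Hypothetical piece of problem markup; substitute the document that fails.
markup = "<p>Some <b>bad<i>markup</p>"

# diagnose() prints its report, so redirect standard output into a buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    diagnose(markup)

report = buffer.getvalue()
print(report)
```

The report lists which parsers are installed and how each one handles the markup, which is usually the first thing someone helping you will want to see.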


CHAPTER 2
Quick Start

Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  <body>
    #   <p class="title">
    #    <b>
    #     The Dormouse's story
    #    </b>
    #   </p>
    #   <p class="story">
    #    Once upon a time there were three little sisters; and their names were
    #    <a class="sister" href="http://example.com/elsie" id="link1">
    #     Elsie
    #    </a>
    #    ,
    #    <a class="sister" href="http://example.com/lacie" id="link2">
    #     Lacie
    #    </a>
    #    and
    #    <a class="sister" href="http://example.com/tillie" id="link3">
    #     Tillie
    #    </a>
    #    ; and they lived at the bottom of a well.
    #   </p>
    #   <p class="story">
    #    ...
    #   </p>
    #  </body>
    # </html>

Here are some simple ways to navigate that data structure:

    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # u'title'

    soup.title.string
    # u'The Dormouse's story'

    soup.title.parent.name
    # u'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    # [u'title']

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags:

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Another common task is extracting all the text from a page:

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...

Does this look like what you need? If so, read on.
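The two tasks above are often combined. The following is a minimal sketch, not part of the original walkthrough, that pairs each link's visible text with its URL. The shortened html_doc and the links dictionary are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the "three sisters" document.
html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Map each link's visible text to its href attribute.
# Tag.get() returns None for a missing attribute instead of raising KeyError.
links = {a.get_text(strip=True): a.get("href") for a in soup.find_all("a")}

for name, url in links.items():
    print(name, url)
```

Using get() rather than dictionary-style access keeps the loop from failing on an <a> tag that has no href.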


CHAPTER 3
Installing Beautiful Soup

If you're using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

    $ apt-get install python-bs4 (for Python 2)
    $ apt-get install python3-bs4 (for Python 3)

Beautiful Soup 4 is published through PyPI, so if you can't install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you're using Python 3).

    $ easy_install beautifulsoup4
    $ pip install beautifulsoup4

(The BeautifulSoup package is probably not what you want. That's the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it's still available, but if you're writing new code you should install beautifulsoup4.)

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

    $ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.

3.1 Problems after installation

Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it's automatically converted to Python 3 code. If you don't install the package, the code won't be converted. There have also been reports on Windows machines of the wrong version being installed.
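When the wrong copy is suspected, one quick check is to ask the interpreter which bs4 it actually imports. This is a small sketch under the assumption that some version of bs4 is importable; the variable names are chosen for this example only.

```python
import sys

import bs4

# Report which interpreter is running and which bs4 copy it imported.
python_version = "%d.%d" % sys.version_info[:2]
bs4_version = bs4.__version__
bs4_location = bs4.__file__

print("Python", python_version)
print("Beautiful Soup", bs4_version, "loaded from", bs4_location)
```

If the printed location points at an unpacked tarball or another interpreter's site-packages, that is a hint that the installation did not go where you expected.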

If you get the ImportError "No module named HTMLParser", your problem is that you're running the Python 2 version of the code under Python 3.

If you get the ImportError "No module named html.parser", your problem is that you're running the Python 3 version of the code under Python 2.

In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.

If you get the SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

    $ python3 setup.py install

or by manually running Python's 2to3 conversion script on the bs4 directory:

    $ 2to3-3.2 -w bs4

3.2 Installing a parser

Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

    $ apt-get install python-lxml
    $ easy_install lxml
    $ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

    $ apt-get install python-html5lib
    $ easy_install html5lib
    $ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

Parser                Typical usage                         Advantages                                                                            Disadvantages
--------------------  ------------------------------------  ------------------------------------------------------------------------------------  -----------------------------------------------
Python's html.parser  BeautifulSoup(markup, "html.parser")  Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)                Not as fast as lxml; less lenient than html5lib
lxml's HTML parser    BeautifulSoup(markup, "lxml")         Very fast; lenient                                                                    External C dependency
lxml's XML parser     BeautifulSoup(markup, "lxml-xml")     Very fast; the only currently supported XML parser                                    External C dependency
html5lib              BeautifulSoup(markup, "html5lib")     Extremely lenient; parses pages the same way a web browser does; creates valid HTML5  Very slow; external Python dependency

If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib, because Python's built-in HTML parser is just not very good in older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
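One way to act on the recommendation above is to prefer lxml when it is installed and fall back to the standard library parser otherwise. The helper pick_parser() below is hypothetical, written for this example, and is not part of Beautiful Soup.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Return "lxml" when that package is importable, else "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
soup = BeautifulSoup("<p>Hello, <b>soup</b>!</p>", parser_name)
print(parser_name, "->", soup.p.get_text())
```

Naming the parser explicitly also keeps results stable across machines, since different parsers can build different trees from invalid markup.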


CHAPTER 4
Making the soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp)

    soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

    BeautifulSoup("Sacr&eacute; bleu!")
    # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
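The sketch below exercises both input styles and the entity conversion in one place. It assumes the standard library parser is named explicitly, and it uses io.StringIO as a stand-in for an open file so that the example does not depend on an index.html existing on disk.

```python
import io

from bs4 import BeautifulSoup

# Parse from a string, naming the parser explicitly.
soup_from_string = BeautifulSoup("<p>Sacr&eacute; bleu!</p>", "html.parser")

# Parse from a file-like object; io.StringIO stands in for an open file.
fake_file = io.StringIO("<p>Sacr&eacute; bleu!</p>")
soup_from_file = BeautifulSoup(fake_file, "html.parser")

# The &eacute; entity has been converted to the Unicode character é.
text = soup_from_string.p.get_text()
print(text)
```

Both soups hold the same tree; the only difference is where the markup came from.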


CHAPTER 5
Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

5.1 Tag

A Tag object corresponds to an XML or HTML tag in the original document:

    soup = BeautifulSoup('<b id="boldest">Extremely bold</b>')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

5.1.1 Name

Every tag has a name, accessible as .name:

    tag.name
    # u'b'

If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:

    tag.name = "blockquote"
    tag
    # <blockquote id="boldest">Extremely bold</blockquote>
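As a self-contained check of the renaming behaviour described above, the following sketch parses a one-tag document with the standard library parser and inspects the markup before and after the change. The variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

original_name = tag.name      # the tag's name as parsed
tag.name = "blockquote"       # rename the tag in place
renamed_markup = str(tag)     # markup generated after the rename

print(original_name)
print(renamed_markup)
```

The rename changes both the opening and the closing tag in the generated markup, while the attributes and contents are left alone.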

5.1.2 Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

    tag['id']
    # u'boldest'

You can access that dictionary directly as .attrs:

    tag.attrs
    # {u'id': 'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

    tag['id'] = 'verybold'
    tag['another-attribute'] = 1
    tag
    # <b another-attribute="1" id="verybold"></b>

    del tag['id']
    del tag['another-attribute']
    tag
    # <b></b>

    tag['id']
    # KeyError: 'id'
    print(tag.get('id'))
    # None

Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

    css_soup = BeautifulSoup('<p class="body"></p>')
    css_soup.p['class']
    # ["body"]

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.p['class']
    # ["body", "strikeout"]

If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

    id_soup = BeautifulSoup('<p id="my id"></p>')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

You can disable this by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor:

    no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
    no_list_soup.p['class']
    # u'body strikeout'

You can use get_attribute_list to get a value that's always a list, whether or not it's a multi-valued attribute:

    id_soup.p.get_attribute_list('id')
    # ["my id"]

If you parse a document as XML, there are no multi-valued attributes:

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # u'body strikeout'

Again, you can configure this using the multi_valued_attributes argument:

    class_is_multi = { '*' : 'class'}
    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
    xml_soup.p['class']
    # [u'body', u'strikeout']

You probably won't need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

    from bs4.builder import builder_registry
    builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

5.2 NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

    tag.string
    # u'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with unicode():

    unicode_string = unicode(tag.string)
    unicode_string
    # u'Extremely bold'
    type(unicode_string)
    # <type 'unicode'>

You can't edit a string in place, but you can replace one string with another, using replace_with():

    tag.string.replace_with("No longer bold")
    tag
    # <blockquote>No longer bold</blockquote>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.

5.3 BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine
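The attribute and string behaviour described in this chapter can be exercised together in a short sketch. It assumes Python 3, where str() plays the role that unicode() plays in the Python 2 examples, and it uses a made-up one-tag document with illustrative variable names.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(
    '<p class="body strikeout" id="my id">Extremely bold</p>', "html.parser"
)
tag = soup.p

css_classes = tag["class"]                  # multi-valued: a list of values
tag_id = tag["id"]                          # not multi-valued: one string
id_as_list = tag.get_attribute_list("id")   # always a list

# Convert the NavigableString to a plain string before keeping it around,
# so that it no longer references the parse tree.
navigable = tag.string
plain = str(navigable)

print(css_classes, tag_id, id_as_list)
print(type(navigable).__name__, type(plain).__name__)
```

Storing plain rather than navigable in long-lived data structures avoids keeping the whole parse tree alive, which is the memory concern raised in section 5.2.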
