Beautiful Soup - RxJS, Ggplot2, Python Data Persistence .

1.34 MB
56 Pages
Upload by : Braxton Mach
Transcription

Beautiful Soup

About the Tutorial

In this tutorial, we show you how to perform web scraping in Python using Beautiful Soup 4 to get data out of HTML, XML and other markup languages. We will try to scrape web pages from various websites (including IMDB). We cover Beautiful Soup 4 and the basic Python tools for efficiently and clearly navigating, searching and parsing HTML web pages. We have tried to cover almost all the functionality of Beautiful Soup 4 in this tutorial. You can combine the features introduced here into one bigger program that captures meaningful data from a website and passes it to another subprogram as input.

Audience

This tutorial is designed to guide you in scraping a web page. The basic aim is to get meaningful data out of a huge, unorganized set of data. The target audience is:

- Anyone who wants to know how to scrape a web page in Python using Beautiful Soup 4.
- Any data science developer or enthusiast who wants to feed this scraped (meaningful) data into Python data science libraries to make better decisions.

Prerequisites

There is no mandatory requirement for this tutorial. However, prior knowledge of any of the following will be an added advantage:

- Knowledge of web technologies (HTML, CSS, the Document Object Model, etc.).
- The Python language (Beautiful Soup is a Python package).
- Prior experience of scraping in any language.
- A basic understanding of the HTML tree structure.

Copyright & Disclaimer

Copyright 2019 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com

Table of Contents

About the Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents

1. Beautiful Soup - Overview
   What is web-scraping?
   Why Web-scraping?
   Why Python for Web Scraping?
   Introduction to Beautiful Soup

2. Beautiful Soup - Installation
   Creating a virtual environment (optional)
   Installing virtual environment
   Installing BeautifulSoup
   Problems after installation
   Installing a Parser
   Running BeautifulSoup

3. Beautiful Soup - Souping the Page
   HTML tree Structure

4. Beautiful Soup - Kinds of objects
   Multi-valued attributes
   NavigableString
   BeautifulSoup
   Comments
   NavigableString Objects

5. Beautiful Soup - Navigating by Tags
   Going down
   .contents and .children
   .descendants
   .string
   .strings and stripped_strings
   Going up
   Going sideways
   Going back and forth

6. Beautiful Soup - Searching the tree
   Kinds of Filters
   find_all()
   find()
   find_parents() and find_parent()
   CSS selectors

7. Beautiful Soup - Modifying the tree
   Changing tag names and attributes
   Modifying .string
   append()
   NavigableString() and .new_tag()
   insert()
   insert_before() and insert_after()
   clear()
   extract()
   decompose()
   replace_with()
   wrap()
   unwrap()

8. Beautiful Soup - Encoding
   Output encoding
   Unicode, Dammit

9. Beautiful Soup - Beautiful Objects
   Comparing objects for equality
   Copying Beautiful Soup objects

10. Beautiful Soup - Parsing only section of a document
   SoupStrainer

11. Beautiful Soup - Trouble Shooting
   Error Handling
   diagnose()
   Parsing error
   XML parser Error
   Other parsing errors

1. Beautiful Soup - Overview

In today's world, we have tons of unstructured data (mostly web data) freely available. Sometimes this data is easy to read and sometimes it is not. However your data is presented, web scraping is a very useful tool for transforming unstructured data into structured data that is easier to read and analyze. In other words, web scraping is one way to collect, organize and analyze this enormous amount of data. So let us first understand what web scraping is.

What is web-scraping?

Scraping is simply the process of extracting (by various means), copying and screening data.

When we scrape or extract data or feeds from the web (from web pages or websites), it is termed web-scraping.

So web scraping, also known as web data extraction or web harvesting, is the extraction of data from the web. In short, web scraping gives developers a way to collect and analyze data from the internet.

Why Web-scraping?

Web-scraping is a great tool for automating most of the things a human does while browsing. It is used in enterprises in a variety of ways:

Data for Research
Smart analysts (such as researchers or journalists) use web scrapers instead of manually collecting and cleaning data from websites.

Products prices & popularity comparison
There are currently a couple of services that use web scrapers to collect data from numerous online sites and use it to compare the popularity and prices of products.

SEO Monitoring
There are numerous SEO tools, such as Ahrefs, Seobility and SEMrush, which are used for competitive analysis and for pulling data from your client's websites.

Search engines
There are some big IT companies whose business depends solely on web scraping.

Sales and Marketing
The data gathered through web scraping can be used by marketers to analyze different niches and competitors, or by sales specialists to sell content marketing or social media promotion services.

Why Python for Web Scraping?

Python is one of the most popular languages for web scraping, as it handles most web crawling tasks very easily.

Below are some of the reasons to choose Python for web scraping:

Ease of Use
Most developers agree that Python is very easy to code in. We don't have to use curly braces "{ }" or semicolons ";" anywhere, which makes code more readable and easier to work with while developing web scrapers.

Huge Library Support
Python provides a huge set of libraries for different requirements, so it is appropriate for web scraping as well as for data visualization, machine learning, etc.

Easily Explicable Syntax
Python is a very readable programming language because its syntax is easy to understand. Python is very expressive, and code indentation helps users tell apart the different blocks or scopes in the code.

Dynamically-typed language
Python is a dynamically-typed language, which means the type of a variable is determined by the value assigned to it, so types do not have to be declared up front. This saves a lot of time and makes work faster.

Huge Community
The Python community is huge, which helps you wherever you get stuck while writing code.

Introduction to Beautiful Soup

Beautiful Soup is a Python library named after the Lewis Carroll poem of the same name in "Alice's Adventures in Wonderland". Beautiful Soup is a Python package and, as the name suggests, it parses messy web data and helps to organize and format it by fixing bad HTML and presenting it to us as an easily traversable XML structure.

In short, Beautiful Soup is a Python package which allows us to pull data out of HTML and XML documents.
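As a rough illustration of "pulling data out of HTML", the sketch below parses a small, deliberately untidy snippet and reads out the page title and the links. It assumes the beautifulsoup4 package is already installed (installation is covered in the next chapter) and uses Python's built-in "html.parser"; the markup, paths and variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately untidy markup: the <p> and <li> tags are never closed.
html = """
<html><head><title>Movie list</title></head>
<body>
<p class="intro">Top picks
<ul>
  <li><a href="/title/1">First film</a>
  <li><a href="/title/2">Second film</a>
</ul>
</body></html>
"""

# Build the parse tree with the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Pull individual pieces of data out of the tree.
page_title = soup.title.string
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(page_title)
for text, href in links:
    print(text, "->", href)
```

Even though some tags are left open, the tree can still be searched with find_all(), which is the kind of tolerance for imperfect markup described above.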

2. Beautiful Soup - Installation

As Beautiful Soup is not part of the Python standard library, we need to install it first. We are going to install the BeautifulSoup 4 library (also known as BS4), which is the latest one.

To isolate our working environment so as not to disturb the existing setup, let us first create a virtual environment.

Creating a virtual environment (optional)

A virtual environment allows us to create an isolated working copy of Python for a specific project without affecting the outside setup.

The best way to install any Python package is with pip. If pip is not installed already (you can check with "pip --version" at your command or shell prompt), you can install it as follows.

Linux environment

    sudo apt-get install python-pip

Windows environment

To install pip on Windows, do the following:

- Download get-pip.py from https://bootstrap.pypa.io/get-pip.py or from GitHub to your computer.
- Open the command prompt and navigate to the folder containing the get-pip.py file.
- Run the following command:

    python get-pip.py

That's it, pip is now installed on your Windows machine.

You can verify your pip installation by running the command below:

    pip --version
    pip 19.2.3 from n37\lib\sitepackages\pip (python 3.7)

Installing virtual environment

Run the command below in your command prompt:

    pip install virtualenv
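The pip check above, and whether Python is currently running inside a virtual environment, can also be inspected from Python itself. The following is a minimal sketch using only the standard library (Python 3.8 or later is assumed for importlib.metadata); the helper names in_virtual_env and installed_version are hypothetical, chosen for this example.

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def in_virtual_env():
    """Return True when the interpreter runs inside a virtual environment."""
    # Old virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("inside virtual environment:", in_virtual_env())
print("pip version:", installed_version("pip"))
```

The same installed_version helper can be called with "virtualenv" or "beautifulsoup4" after installing those packages to confirm they landed in the environment you expect.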

The command below creates a virtual environment ("myEnv") in your current directory:

    virtualenv myEnv

To activate your virtual environment, run the following command:

    myEnv\Scripts\activate

Once the environment is active, the prompt shows "(myEnv)" as a prefix, which tells us that we are inside the virtual environment "myEnv".

To leave the virtual environment, run deactivate:

    (myEnv) C:\Users\yadur> deactivate
    C:\Users\yadur>

As our virtual environment is ready, let us now install Beautiful Soup.

Installing BeautifulSoup

As Beautiful Soup is not part of the standard library, we need to install it. We are going to use the BeautifulSoup 4 package (known as bs4).

Linux Machine

To install bs4 on Debian or Ubuntu Linux using the system package manager, run the command below:

    sudo apt-get install python-bs4    (for python 2.x)
    sudo apt-get install python3-bs4   (for python 3.x)

You can also install bs4 using easy_install or pip (in case you have problems installing with the system packager):

    easy_install beautifulsoup4
    pip install beautifulsoup4

(You may need to use easy_install3 or pip3 respectively if you're using Python 3.)

Windows Machine

Installing beautifulsoup4 on Windows is very simple, especially if you already have pip installed:

    pip install beautifulsoup4

So now beautifulsoup4 is installed on our machine. Let us talk about some problems encountered after installation.

Problems after installation

On a Windows machine you might find that the wrong version has been installed, which mainly shows up as one of these errors:

- ImportError "No module named HTMLParser": you are running the Python 2 version of the code under Python 3.
- ImportError "No module named html.parser": you are running the Python 3 version of the code under Python 2.
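Both ImportError messages trace back to a standard-library module that was renamed: Python 2 ships the parser as HTMLParser, while Python 3 ships it as html.parser. The short sketch below only illustrates that rename; it is not how Beautiful Soup itself handles the difference, and the function name load_html_parser is hypothetical.

```python
import sys


def load_html_parser():
    """Return (HTMLParser class, module name) for the running interpreter."""
    try:
        # Python 3 location of the parser.
        from html.parser import HTMLParser
        return HTMLParser, "html.parser"
    except ImportError:
        # Python 2 location; this import fails under Python 3.
        from HTMLParser import HTMLParser
        return HTMLParser, "HTMLParser"


parser_cls, module_name = load_html_parser()
print("Python %d.%d provides the parser as %s"
      % (sys.version_info[0], sys.version_info[1], module_name))
```

Code written for only one of the two module names raises the matching ImportError when run on the other major version.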

The best way out of either ImportError is to completely remove the existing installation and install Beautiful Soup again.

If you get the SyntaxError "Invalid syntax" on the lin

