Efficient Scraping Of Data From Websites Using Selenium


2022 JETIR June 2022, Volume 9, Issue 6 www.jetir.org (ISSN-2349-5162)
JETIRFM06063 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org, pp. 353-357

EFFICIENT SCRAPING OF DATA FROM WEBSITES USING SELENIUM

1Shreya V. Dhoke, 2Anupama D. Sakhare, 3Satish J. Sharma
1Student, 2Assistant Professor, 3Professor
1,2,3Department of Electronics and Computer Science, Rashtrasant Tukdoji Maharaj Nagpur University, Nagpur, India

Abstract: The Internet is an ocean of information spread across many websites, where it is categorized, interlinked and mostly freely available to everyone. A vast amount of data is created every second, and all this ‘Big Data’ comes in heterogeneous formats. We need to access information quickly. Data can be extracted manually, but doing so is time-consuming and can be very complicated; for this reason Web Scraping is used. Web Scraping is the technique of automating the process of navigating through links and collecting the relevant data from the pages they lead to. The proposed system is a method for the targeted, automated extraction and restructuring of information from web pages. It acquires non-tabular or poorly structured data from websites and converts it into a usable structured format. The main objective of the proposed system is to extract information from one or many websites and process it into simple structures such as CSV files. The proposed system uses the Text Grepping technique, which offers insight into price data, market dynamics, prevailing trends, practices employed by various competitors, and the challenges they face. The result of this technique is easy access to relevant data from websites. The proposed system can be modified to scrape dynamic websites, and it will be beneficial in many business and educational areas.

Keywords: Web scraping, Big Data, CSV file, Structured and Unstructured data

I. INTRODUCTION

The Internet contains information from many websites, where it is categorized, interlinked and freely available to everyone. Some data available on the web is presented in a format that makes it easy to collect and use. Extracting relevant data from websites by manual copy-and-paste, however, is tedious and time-consuming. For this reason Web Scraping is used. Web Scraping is the process of extracting relevant data from websites: it automates navigating through links and collecting the data from the relevant pages. Once automated, Web Scraping replicates the task of manually copying data from websites in a fraction of the time.

The various techniques used for Web Scraping include text pattern matching, HTTP programming, DOM parsing, text grepping, vertical aggregation, semantic annotation recognition and computer-vision web-page analysis. Most of the required data is unstructured data in HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

Fig 1: Web Scraping Structure

II. OBJECTIVES

The objectives to be achieved in this project are:
To acquire non-tabular or poorly structured data
To scrape data from other sites
To verify the possibility of producing statistical outputs using predicted data

Fig 3: Complete Flowchart of Web Scraping

III. REVIEW OF LITERATURE

Web Scraping, i.e. the automated and targeted extraction of data, is a traditional technique for retrieving Web content at scale. A multitude of frameworks and Application Programming Interfaces for developing customized scrapers exist, as do configurable ready-to-use scraping tools.

Renita Crystal Pereira [1] provided a summary of web scraping and of the techniques and tools involved, which face several complexities because data extraction is not simple. These strategies aim to ensure that the data collected is correct, consistent and has better integrity, since the large amount of data present is hard to handle and retain. The techniques still face a few problems; for example, a high volume of web scraping can cause serious harm to websites. The measurement level of the web scraper will vary with the measurement units of the original source file, making the data very difficult to interpret. Use of the Internet and of social networking sites such as Facebook, Twitter and LinkedIn is growing day by day, and a large amount of user information is available on the Internet from everywhere. This also gives hackers an advantage in stealing information. Social networking is important from a business point of view wherever increasing income is concerned. As with online shopping, it helps consumers shop quickly and save time; on the other hand, it supports companies and lets them profit from it.

Kaushal Parikh [2] proposed detecting web scraping with the help of machine learning, which is valuable for research-dependent companies. Web scraping has always been a difficult attack to prevent. Every time a company places its data on the Internet, that data may be copied and then used in another context without the company knowing about it. This is where machine learning steps in. Machine learning is quite effective at pattern detection, so if the machine can be made to recognize the cadence of an intruder, it can prevent these types of threats from occurring. Web scraping solutions are aimed primarily at translating complex data obtained through networks into structured data that can be stored and examined in a central database.

Sameer Padghan [3] proposed an approach in which data is extracted easily from web pages with the help of web scraping.
This method would enable data to be scraped from numerous websites, which minimizes human intervention, saves time and improves the relevance of the data. It also helps the user gather data from a site, save it for their own purposes and use it as they wish. The scraped information may be used for database development, for research purposes and for other similar activities. The use of scraping would increase significantly and will often encroach on the framework to obtain the details. However, such scraping can be stopped by using effective and safe web scraping methods. This method should be treated as a blessing that must be used carefully for the advancement of the human race.

Anand Saurkar [4] presented the technique named Web Scraping. Web scraping is an important methodology for producing structured data from the unstructured data available on the Internet. Scraping produces structured data, which is subsequently collected and evaluated in spreadsheets or in a central database. This research focuses on a summary of the data extraction process of web scraping, various web scraping strategies and the latest tools used to scrape the web. The primary function of this methodology is to obtain web-based information and integrate it into a specific repository. The authors addressed the basics of Web processing and concentrated on Web scraping techniques. The final part of the paper summarizes the numerous technological resources available for effective web scraping in the industry.

Federico Polidoro [5] concentrated on the outcomes of evaluating web scraping strategies, with particular reference to consumer electronics goods and services, in the field of consumer price studies.
Although the research has so far been carried out over a short period of time, it has made it possible to attain important, though not conclusive, efficiency results. Web scraping strategies used in the growth analysis provide exposure to a greater volume of data than is accessible in the existing data set, and thus have the potential to improve the growth estimate. This topic was briefly addressed in the sections devoted to each of the examined items, but in practice engaging with this viewpoint raises a concern about the current survey architecture, which does not allow, or only selectively permits, the use of big data approaches within the existing sampling frameworks.

Jan Kinne [6] proposed a web extraction platform for the accurate and measurable mining of ecosystems for development. The researchers put special emphasis on exploring a possible bias when examining technology structures through company websites, that is, whether all types of companies can be measured using the suggested method. The proposed research system enables an integrated, low-cost simulation of whole business communities, which can be carried out more efficiently and in relatively short time periods compared with conventional techniques. The method is also conveniently extendable to modelling information communities by checking the web pages of research institutions. The key point of the proposed system is to identify and extract particular pieces of data from the unstructured content of a site that reveal information about the current development practices of companies.

To understand how the data extraction process has evolved, one must understand the techniques involved in web scraping; scraping has been around nearly as long as the web.
The motivation behind business web scraping has always been to gain an easy business advantage, and it includes things like undercutting a competitor's promotional pricing, stealing leads, hijacking marketing campaigns, redirecting APIs, and the outright theft of data.

III. EXPERIMENTAL WORK

The proposed system can perform the automation tasks according to the given points:

1. The web is filled with text. Most text, though, is structured according to HTML or XHTML markup tags, which instruct browsers how to display it. These tags are designed to help text appear in readable ways on the web and, like web browsers, web scraping tools can interpret these tags and follow instructions on how to collect the text they contain.

Fig 2: Web Scraping process (Pick a website, Web Scraping, Generation of Structured data)

2. Web scraping tools can range from manual browser plug-ins, to desktop applications, to purpose-built libraries within the Python language.
3. A web scraping tool acts as an Application Programming Interface (API) in that it helps the client (you, the user) interact with data stored on a server (the text).
4. The Selenium library is used to connect, through web drivers, to the Chrome or Firefox browser.
5. The Selenium automation tool simulates navigation on the relevant link and writes the collected data to a CSV file.
6. Unstructured data can thus be obtained in structured format in a CSV file or a spreadsheet.
7. After the CSV file is generated, the data can be analyzed as required.

IV. CONCLUSION

The proposed system reduces the manual work of extracting data and the time it takes. Automatic generation of the data in the required file format makes the data easy to analyze.

Fig 3: CSV file of Desired Output

Fig 4: CSV file
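
The workflow of points 4 to 7 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the CSS selectors (`.title`, `.price`), the column names and the function names `scrape_rows` and `rows_to_csv` are hypothetical, and the browser part assumes Selenium 4 with a local Chrome installation. The browser-driving step and the CSV step are kept separate so that the CSV step can be checked without a browser.

```python
import csv
import io


def scrape_rows(url, row_selector):
    """Open `url` in Chrome and return one dict per matched element.

    Assumes Selenium 4 and a local Chrome; the inner selectors are
    hypothetical and depend entirely on the target page's markup.
    """
    # Imported lazily so rows_to_csv() below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        rows = []
        for item in driver.find_elements(By.CSS_SELECTOR, row_selector):
            rows.append({
                "title": item.find_element(By.CSS_SELECTOR, ".title").text,
                "price": item.find_element(By.CSS_SELECTOR, ".price").text,
            })
        return rows
    finally:
        driver.quit()  # always release the browser, even on errors


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With a browser available, the text returned by `rows_to_csv(scrape_rows(url, ".product"), ["title", "price"])` could be written to a file opened with `newline=""`, where `url` and `.product` stand for a site and selector you are permitted to scrape. Returning CSV text rather than writing directly keeps the structured-output step independent of the browser.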

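Point 1 of the experimental work, and the Text Grepping technique named in the abstract, can be illustrated without a browser. The sketch below uses Python's standard `html.parser` module to collect the text inside markup tags, then a regular expression to grep price-like strings from that text. The sample HTML, the class name `TextCollector` and the price pattern are assumptions made for illustration; the paper does not specify them.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text chunks found between tags."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical pattern: a currency symbol followed by a number.
PRICE_RE = re.compile(r"[$€₹]\s?\d+(?:[.,]\d+)*")


def grep_prices(chunks):
    """Return every price-like match found in the text chunks."""
    return [m.group(0) for chunk in chunks for m in PRICE_RE.finditer(chunk)]


# Made-up page used only to exercise the two functions above.
SAMPLE_HTML = """
<html><body>
  <h1>Offers</h1>
  <script>var x = "$999";</script>
  <p>Notebook: <b>$12.50</b></p>
  <p>Pen set for ₹ 199 only</p>
</body></html>
"""
```

Calling `grep_prices(extract_text(SAMPLE_HTML))` picks up the two prices in the visible text and ignores the one inside the script. On a real site the pattern would need to be adapted to the page's actual price format.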
V. REFERENCES

[1] Renita Crystal Pereira and Vanitha T, “Web Scraping of Social Networks,” Int’l J. of Inno. Res. in Comp. and Comm. Engg., 3(1), 237-240, 2015.
[2] Kaushal Parikh, Dilip Singh, Dinesh Yadav and Mansingh Rathod, “Detection of web scraping using machine learning,” Open access international journal of Science and Engineering, pp. 114-118, Vol. 3, 2018.
[3] Sameer Padghan, Satish Chigle and Rahul Handoo, “Web Scraping-Data Extraction Using Java Application and Visual Basics Macros,” Journal of Advances and Scholarly Researches in Allied Education, pp. 691-695, Vol. 15, 2018.
[4] Anand V. Saurkar, Kedar G. Pathare and Shweta A. Gode, “An Overview On Web Scraping Techniques And Tools,” International Journal on Future Revolution in Computer Science & Communication Engineering, pp. 363-367, Vol. 4, 2018.
[5] Federico Polidoro, Riccardo Giannini, Rosanna Lo Conte, Stefano Mosca and Francesca Rossetti, “Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation,” Statistical Journal of the IAOS, pp. 165-176, 2015.
[6] Jan Kinne and Janna Axenbeck, “Web Mining of Firm Websites: A Framework for Web Scraping and a Pilot Study for Germany,” 2019.
[7] Ingolf Boettcher, “Automatic data collection on the Internet,” pp. 1-9, 2015.
[8] Erin J. Farley and Lisa Pierotte, “An Emerging Data Collection Method for Criminal Justice Researchers,” Justice Research and statistics association, pp. 1-9, 2017.
[9] David Mathew Thomas, Sandeep Mathur, Amity Institute of Information Technology, Amity University (AUUP), Sec-125, Noida.
[10] http://wthtjsjs.cn/gallery/1-whjj-june-541,pdf case study.