Extracting Data From XML - University Of California, Berkeley

Extracting data from XML Wednesday DTL

Parsing - XML package

2 basic models - DOM & SAX

Document Object Model (DOM)
  Tree stored internally as C, or as regular R objects.
  Use XPath to query nodes of interest, extract info.
  Write recursive functions to "visit" nodes, extracting information as they descend the tree.
  Extract information to R data structures via handler functions that are called for particular XML elements by matching the XML name.

SAX
  For processing very large XML files with a low-level state machine via R handler functions - closures.

Preferred Approach

DOM (with internal C representation and XPath)

Given a node, several operations:
  xmlName() - element name (with/without namespace prefix)
  xmlNamespace()
  xmlAttrs() - all attributes
  xmlGetAttr() - particular value
  xmlValue() - get text content
  xmlChildren(), node[[ i ]], node[[ "el-name" ]]
  xmlSApply()
  xmlNamespaceDefinitions()
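The node operations listed above can be tried on a tiny document. This is a minimal sketch, assuming the XML package is installed; the PubDate fragment and its values are made up for illustration and are not taken from the lecture's data.

```r
library(XML)

# Hypothetical one-element document, parsed from a string rather than a file.
txt  <- '<PubDate PubStatus="received"><Year>2008</Year><Month>06</Month><Day>15</Day></PubDate>'
doc  <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)  # internal C-level DOM
node <- xmlRoot(doc)

elName <- xmlName(node)                  # element name
attrs  <- xmlAttrs(node)                 # all attributes, as a named character vector
status <- xmlGetAttr(node, "PubStatus")  # one particular attribute value
kids   <- names(xmlChildren(node))       # names of the child elements
month  <- xmlValue(node[["Month"]])      # text content of one child, selected by name
parts  <- xmlSApply(node, xmlValue)      # text content of every child
```

Indexing with node[[ "Month" ]] selects a child by element name, while xmlSApply() visits every child in turn.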

Examples
  Scraping HTML - (you name it!)
  Zillow - house price estimates
  PubMed articles/abstracts
  European Bank exchange rates
  iTunes - CDs, tracks, play lists, ...
  PMML - predictive modeling markup language
  CIS - Current Index of Statistics/Google Scholar
  Google - Page Rank, Natural Language Processing
  Wikipedia - History of changes, ...
  SBML - Systems biology markup language
  Books - Docbook
  SOAP - eBay, KEGG, ...
  Yahoo Geo/places - given name, get most likely location

PubMed

Professionally archived collection of "medically-related" articles.
Vast collection of information, including
  article abstracts
  submission, acceptance and publication date
  authors
  ...

PubMed

We'll use a sample PubMed example article for simplicity.
Can get a very large, rich ArticleSet with many articles via an HTTP query done from within R/XML package directly.
Take a look at the data to see what is available, or read the documentation, or explore the contents.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.publisherhelp.XML_Tag_Descriptions

    doc = xmlTreeParse("pubmed.xml", useInternal = TRUE)
    top = xmlRoot(doc)
    xmlName(top)
    [1] "ArticleSet"
    names(top)    - child nodes of this root
    [1] "Article" "Article"

- so 2 articles in this set.

Let's fetch the author list for each article. Do it first for just one and then use "apply" to iterate.

    names( top[[ 1 ]] )
    Journal "Journal"
    LastPage "LastPage"
    Language "Language"
    ArticleIdList "ArticleIdList"
    ObjectList "ObjectList"
    ArticleTitle "ArticleTitle"
    ELocationID "ELocationID"
    AuthorList "AuthorList"    - what we want
    History "History"
    FirstPage "FirstPage"
    ELocationID "ELocationID"
    GroupList "GroupList"
    Abstract "Abstract"

    art = top[[ 1 ]][[ "AuthorList" ]]

    names(art)
    [1] "Author" "Author" "Author" "Author" "Author" "Author"
    names(art[[1]])
    [1] "FirstName"   "MiddleName"  "LastName"    "Suffix"
    [5] "Affiliation"

So how do we get these values, e.g. to put in a data frame? Each element is a node with text content.

So loop over the nodes and get the content as a string:

    xmlSApply(art[[1]], xmlValue)

To do this for all authors of the article:

    xmlSApply(art, function(x) xmlSApply(x, xmlValue))

How do we deal with the different types of fields in the names? e.g. First, Middle, Last, Affiliation, CollectiveName.
It is a data representation/analysis question from here.
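One possible answer to the question of differing fields is to fix a set of columns and fill in NA where an author lacks a field. The sketch below assumes the XML package is installed; the author records, the `fields` vector and the helper `oneAuthor()` are hypothetical and only illustrate the idea.

```r
library(XML)

# Hypothetical author list: not every author has the same child elements.
txt <- paste0(
  "<AuthorList>",
  "<Author><FirstName>Ann</FirstName><LastName>Lee</LastName>",
  "<Affiliation>Dept A</Affiliation></Author>",
  "<Author><FirstName>Bo</FirstName><MiddleName>C</MiddleName>",
  "<LastName>Diaz</LastName></Author>",
  "<Author><CollectiveName>Example Study Group</CollectiveName></Author>",
  "</AuthorList>")
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
art <- xmlRoot(doc)

# Columns we want in the data frame (an assumption for this sketch).
fields <- c("FirstName", "MiddleName", "LastName", "Suffix",
            "Affiliation", "CollectiveName")

# Turn one Author node into a fixed-length vector; absent fields become NA.
oneAuthor <- function(author) {
  vals <- xmlSApply(author, xmlValue)
  structure(unname(vals[fields]), names = fields)
}

m <- do.call(rbind, lapply(xmlChildren(art), oneAuthor))
rownames(m) <- NULL
authors <- as.data.frame(m, stringsAsFactors = FALSE)
```

Each row of `authors` is one Author node, so a collective author simply has NA in the personal-name columns.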

PubMed Dates

In the History element, have date received, accepted, aheadofprint.
May want to look at publication lag (i.e. received to publication time) for different journals.
So get these dates for all the articles.

    <History>
      <PubDate PubStatus="received">
        <Year>...</Year>
        <Month>06</Month>
        <Day>15</Day>
      </PubDate>
      <PubDate PubStatus="accepted">
        <Year> ... </Day>
      </PubDate>

Find the element PubDate within History which has an attribute whose value is "received".
Can use art[["History"]][["PubDate"]] to get all 3 elements.
But what if we want to access the 'received' dates for all the articles in a single operation, then the accepted, ...
Need a language to identify nodes with a particular characteristic/condition.

XPath

XPath is a language for expressing such node subsetting, with rich semantics for identifying nodes
  by name
  with specific attributes present
  with attributes with particular values
  with parents, ancestors, children

XPath = YALTL (Yet another language to learn)

XPath language
  /node - top-level node
  //node - node at any level
  node[@attr-name] - node that has an attribute named "attr-name"
  node[@attr-name = 'bob'] - node that has attribute named attr-name with value 'bob'
  node/@x - value of attribute x in node with such attr.

Returns a collection of nodes, attributes, etc.
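Each of the XPath forms above can be checked against a small document. This is a sketch under the assumption that the XML package is installed; the ArticleSet below is invented for illustration.

```r
library(XML)

# Hypothetical document: two articles, three PubDate nodes, two with PubStatus.
txt <- paste0(
  "<ArticleSet>",
  "<Article><History>",
  "<PubDate PubStatus='received'><Year>2007</Year></PubDate>",
  "<PubDate PubStatus='accepted'><Year>2008</Year></PubDate>",
  "</History></Article>",
  "<Article><History><PubDate><Year>2006</Year></PubDate></History></Article>",
  "</ArticleSet>")
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

nTop      <- length(getNodeSet(doc, "/ArticleSet"))             # /node
nAny      <- length(getNodeSet(doc, "//PubDate"))               # //node
nWithAttr <- length(getNodeSet(doc, "//PubDate[@PubStatus]"))   # node[@attr-name]
nReceived <- length(getNodeSet(doc,
                    "//PubDate[@PubStatus = 'received']"))      # node[@attr-name = 'value']
statuses  <- as.character(xpathSApply(doc, "//PubDate/@PubStatus"))  # node/@x
```

The counts make the difference between "has the attribute" and "has the attribute with this value" concrete.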

Let's find the date when the articles were received:

    nodes = getNodeSet(top, "//History/PubDate[@PubStatus = 'received']")

2 nodes - 1 per article.
Extract year, month, day:

    lapply(nodes, function(x) xmlSApply(x, xmlValue))

Easy to get date "accepted" and "aheadofprint".
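Once the year, month and day are extracted, the lag between two statuses is a date subtraction. The sketch below assumes the XML package is installed and that each PubDate has Year, Month and Day children; the dates and the helper `getDates()` are hypothetical.

```r
library(XML)

# Hypothetical single-article set with a received and an accepted date.
txt <- paste0(
  "<ArticleSet><Article><History>",
  "<PubDate PubStatus='received'><Year>2008</Year><Month>06</Month><Day>15</Day></PubDate>",
  "<PubDate PubStatus='accepted'><Year>2008</Year><Month>07</Month><Day>01</Day></PubDate>",
  "</History></Article></ArticleSet>")
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# Return one Date per article for the given PubStatus value.
getDates <- function(doc, status) {
  path  <- sprintf("//History/PubDate[@PubStatus = '%s']", status)
  nodes <- getNodeSet(doc, path)
  parts <- lapply(nodes, function(x) xmlSApply(x, xmlValue))
  as.Date(sapply(parts, function(p)
    paste(p[["Year"]], p[["Month"]], p[["Day"]], sep = "-")))
}

received <- getDates(doc, "received")
accepted <- getDates(doc, "accepted")
lagDays  <- as.numeric(accepted - received)   # days from received to accepted
```

The same helper works unchanged for "aheadofprint", since only the attribute value in the XPath expression differs.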

Text mining of abstract

Content of abstract as words:

    abstracts = xpathApply(top, "//Abstract", xmlValue)

Now, break up into words, stem the words, remove the stop-words:

    abstractWords = lapply(abstracts, strsplit, "[[:space:]]")
    library(Rstem)
    abstractWords = lapply(abstractWords, function(x) wordStem(x[[1]]))

Remove stop words (keep only the words that are not in stopWords):

    lapply(abstractWords, function(x) x[!(x %in% stopWords)])
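The word-splitting and stop-word steps can be sketched in base R alone. Stemming is left out here so that the example does not depend on Rstem; the abstracts and the tiny `stopWords` vector are invented for illustration.

```r
# Hypothetical abstracts, standing in for the output of
# xpathApply(top, "//Abstract", xmlValue).
abstracts <- list("The cells were treated and the treated cells grew.",
                  "Growth of the treated cells")

# A deliberately tiny stop-word list (an assumption for this sketch).
stopWords <- c("the", "and", "of", "were")

abstractWords <- lapply(abstracts, function(a) {
  # split on whitespace and punctuation, then lower-case
  words <- tolower(strsplit(a, "[[:space:][:punct:]]+")[[1]])
  words <- words[nzchar(words)]       # drop empty strings
  words[!(words %in% stopWords)]      # keep only the non-stop-words
})

# Frequency of each remaining word across all abstracts.
wordCounts <- sort(table(unlist(abstractWords)), decreasing = TRUE)
```

Note the negation in the last filtering step: indexing with `x %in% stopWords` alone would keep the stop words rather than remove them.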

Zillow - house prices

Thanks to Roger, yesterday evening I found the Zillow XML API (Application Programming Interface).
Can register with Zillow, make queries to find estimated house prices for a given house, comparables, demographics, ...
Put address, city-state-zip & Zillow login in the URL request.
Can put this at the end of a URL within xmlTreeParse():

    "http://www.zillow.com/.../...?zwsid=...&address=1029%20Bob's%20Way&citystatezip=Berkeley"

But spaces are problematic, as are other characters.

So I use

    library(RCurl)
    reply = ...hResults.htm",
            'zws-id' = "AB-XXXXXXXXXXX 10312q",
            address = "1093 Zuchini Way",
            citystatezip = "Berkeley, CA, 94212")

reply is text from the Web server containing XML.
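To see why spaces and other characters are a problem in a hand-built URL, it helps to look at what percent-escaping does to the parameter values. This sketch uses base R's URLencode() and URLdecode() purely as an illustration; it is not the RCurl call from the slide, and the `query` string is a hypothetical fragment rather than a complete Zillow request.

```r
# Parameter values as a person would type them.
address      <- "1093 Zuchini Way"
citystatezip <- "Berkeley, CA, 94212"

# reserved = TRUE also escapes characters such as "," that have a
# special meaning inside URLs.
query <- paste0("address=", URLencode(address, reserved = TRUE),
                "&citystatezip=", URLencode(citystatezip, reserved = TRUE))

# Decoding recovers the original, human-readable text.
roundTrip <- URLdecode(query)
```

Passing the parameters as separate named arguments, as in the RCurl call above, leaves this escaping to the library instead of the caller.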

<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<SearchResults:searchresults xsi:schemaLocation=\"...sd /vstatic/ Results.xsd\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:SearchResults=\"http://www.zillow.com/static/xsd/SearchResults.xsd\">\n\n<request>\n<address>112 Bob's Way Avenue</address>\n<citystatezip>Berkeley, CA, 94212</citystatezip>\n</request>\n\n<message>\n<text>Request successfully processed</text>\n<code>0</code>\n\t\t\n</message>\n\n\n<response>\n\t\t<results>\n\t\t\t\n\t\t\t<result>\n\t\t\t\t\t<zpid>24842792</zpid>\n\t<links>\n\t\t<homedetails>http://www.zillow.com/HomeDetails.htm?city=Berkeley&state=CA&zprop=24842792&s cid=Pa-Cv-X1CLz1carc3c49ms htxqb&partner=X1-CLz1carc3c49ms htxqb</homedetails>\n\t\t<graphsanddata>http://www.zillow.com/Charts.htm?chartDuration=5years&zpid=24842792&cbt=8965965681136447050%7E1%7E43-17yrvL 7nIj-Y5pqbsoqb nh1QW4CVIhubJRAXIOkwbPosbEGChw**&s cid=Pa-Cv-X1CLz1carc3c49ms htxqb&partner=X1-CLz1carc3c49ms htxqb</graphsanddata>\n\t\t<mapthishome>http://www.zillow.com/search/RealEstateSearch.htm?zpid=24842792#src url&s cid=Pa-Cv-X1-CLz1carc3c49ms htxqb&partner=X1CLz1carc3c49ms htxqb</mapthishome>\n\t\t<myestimator>http://www.zillow.com/myestimator/Edit.htm?zprop=24842792&s cid=Pa-Cv-X1CLz1carc3c49ms htxqb&partner=X1-CLz1carc3c49ms htxqb</myestimator>\n\t\t<myzestimator deprecated=\"true\">http://www.zillow.com/myestimator/Edit.htm?zprop=24842792&s cid=Pa-Cv-X1-CLz1carc3c49ms htxqb&partner=X1CLz1carc3c49ms htxqb</myzestimator>\n\t</links>\n\t<address>\n\t\t<street>1292 Bob's way</street>\n\t\t<zipcode>94</zipcode>\n\t\t<city>Berkeley</city>\n\t\t<state>CA</state>\n\t\t<latitude>34.882544</latitude>\n\t\t<longitude>-123.11111</longitude>\n\t</address>\n\t\n\t\n\t<zestimate>\n\t\t<amount currency=\"USD\">803000</amount>\n\t\t<last-updated>07/14/2008</last-updated>\n\t\t\n\t\t\n\t\t\t<oneWeekChange deprecated=\"true\"></oneWeekChange>\n\t\t\n\t\t\n\t\t\t<valueChange currency=\"USD\" duration=\"31\">-33500</valueChange>\n\t\t\n\t\t\n\t\t<valuationRange>\n\t\t\t<low currency=\"USD\">650430</low>\n\t\t\t

    <?xml version="1.0" encoding="utf-8"?>
    <SearchResults:searchresults
        xsi:schemaLocation="http://www.zillow.com/static/xsd/SearchResults.xsd /vstatic/ Results.xsd"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:SearchResults="http://www.zillow.com/static/xsd/SearchResults.xsd">
      <request>
        <address>123 Bob's Way</address>
        <citystatezip>Berkeley, CA, 94217</citystatezip>
      </request>
      <message>
        <text>Request successfully processed</text>
        <code>0</code>
      </message>
      <response>
        <results>
          <result>
            <zpid>1111111</zpid>
            <links>
            ...

Processing the result

We want to get the value of the element <amount>803000</amount>

    doc = xmlTreeParse(reply, asText = TRUE, useInternal = TRUE)
    xmlValue(doc[["//amount"]])
    [1] "803000"

Other information too.
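The "other information" can be pulled out the same way, using attributes and further XPath queries. This is a sketch assuming the XML package is installed; the `reply` string is a heavily trimmed, namespace-free stand-in for the server response shown earlier, not the real thing.

```r
library(XML)

# Hypothetical, simplified reply text (no namespace prefix, few elements).
reply <- paste0(
  '<?xml version="1.0" encoding="utf-8"?>',
  '<searchresults><response><results><result>',
  '<zpid>24842792</zpid>',
  '<zestimate>',
  '<amount currency="USD">803000</amount>',
  '<valuationRange><low currency="USD">650430</low></valuationRange>',
  '</zestimate>',
  '</result></results></response></searchresults>')

doc <- xmlTreeParse(reply, asText = TRUE, useInternalNodes = TRUE)

amountNode <- getNodeSet(doc, "//amount")[[1]]
amount     <- as.numeric(xmlValue(amountNode))      # text content, as a number
currency   <- xmlGetAttr(amountNode, "currency")    # attribute on the same node
low        <- as.numeric(xpathSApply(doc, "//valuationRange/low", xmlValue))
```

xmlValue() always returns text, so the conversion to a number is an explicit step.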

2004 Election Results
http://www.princeton.edu/~rvdb/JAVA/election2004/

Where are the data?

Within days of the election? USA Today, CNN, ...
http://www.usatoday.com/news/politicselections/vote2004/results.htm
By state, by county, by senate/house, ...

read.table ?

Within the noise/ads, look for a table whose first cell is "County". Actually a <td><b>County</b></td>.
How do we know this? Look at one or two HTML files out of the 50. Verify the rest.
Then, given the associated table element, we can extract the values row by row and get a data.frame/...

XPath expression

    <table ...>
      <tr>
        <td class="notch medium" width="153"><b>County</b></td>
        <td class="notch medium" align="Right" width="65"><b>Total Precincts</b></td>
        <td class="notch medium" align="Right" width="70"><b>Precincts Reporting</b></td>
        <td class="notch medium" align="Right" width="60"><b>Bush</b></td>
        <td class="notch medium" align="Right" width="60"><b>Kerry</b></td>
        <td class="notch medium" align="Right" width="60"><b>Nader</b></td>
      </tr>

Little bit of trial and error:

    getNodeSet(nj, "//table[tr/td/b/text() = 'Total Precincts']")

Could be more specific, e.g. tr[1] - first row.

Now that we have the table node, read the data into an R data structure:

    rows = xmlApply(v[[1]], function(x) xmlSApply(x, xmlValue))

i.e. for each row, loop over the td and get its value.
Got some "\n\t\t\t", and the last row is "Updated."; the first row is the County, Total Precincts, ... header.
So discard the rows without 7 entries, then remove the 7th entry ("\n\t\t\t").

    v = getNodeSet(nj, "//table[tr/td/b/text() = 'Total Precincts']")
    rows = xmlApply(v[[1]], function(x) xmlSApply(x, xmlValue))
    # only the rows with 7 elements
    rows = rows[sapply(rows, length) == 7]
    # Remove the 7th element, and transpose to put back into
    # counties as rows, precinct, candidates, ... as columns.
    # So get a (number of counties) by 6 matrix of character vectors.
    rows = t(sapply(rows, "[", -7))
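The same row-by-row extraction can be carried through to a data frame with numeric columns. The sketch below assumes the XML package is installed; the HTML page, county names and vote counts are invented, and the table has 4 cells per row rather than the 7 in the real pages.

```r
library(XML)

# Hypothetical page with a header row, two data rows and a trailing note row.
html <- paste0(
  "<html><body><table>",
  "<tr><td><b>County</b></td><td><b>Total Precincts</b></td>",
  "<td><b>Bush</b></td><td><b>Kerry</b></td></tr>",
  "<tr><td>Alpha</td><td>10</td><td>120</td><td>100</td></tr>",
  "<tr><td>Beta</td><td>20</td><td>200</td><td>250</td></tr>",
  "<tr><td>Updated.</td></tr>",
  "</table></body></html>")
page <- htmlParse(html, asText = TRUE)

# Locate the table by the text of one of its header cells.
tbl <- getNodeSet(page, "//table[tr/td/b/text() = 'Total Precincts']")[[1]]

# One character vector of cell values per row.
rows <- lapply(getNodeSet(tbl, ".//tr"),
               function(tr) xpathSApply(tr, "./td", xmlValue))
rows <- rows[sapply(rows, length) == 4]      # drop the "Updated." row

cells   <- do.call(rbind, rows)              # character matrix, header in row 1
results <- data.frame(cells[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(results) <- cells[1, ]

# Everything except the county name is a count.
num <- names(results)[-1]
results[num] <- lapply(results[num], as.integer)
```

Filtering on the number of cells per row plays the same role as the "rows with 7 elements" test above.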

Learning XPath

XPath is another language, part of the XML technologies:
  XInclude
  XPointer
  XSL
  XQuery

Can't we extract the data from the XML tree/DOM (Document Object Model) without it and just use R programming? - Yes

    doc = xmlTreeParse("pubmed.xml")

Now have a tree in R
  recursive - list of children which are lists of children
  or recursive tree of C-level nodes

Write an R function which "visits" each node and extracts and stores the data from those nodes that are relevant, e.g. the Author, PubDate nodes.

Recursive functions are sometimes difficult to write.
Have to store the results "globally"/non-locally - leads to closures/lexical scoping - "advanced R".
Have to traverse the entire tree via R code - SLOW!
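A recursive visitor that stores its results non-locally might look like the following. It is a sketch, assuming the XML package is installed; the document and the helper `makeCollector()` are hypothetical.

```r
library(XML)

# Hypothetical document, parsed into an R-level tree (lists of lists).
txt <- paste0(
  "<ArticleSet><Article><AuthorList>",
  "<Author><LastName>Smith</LastName></Author>",
  "<Author><LastName>Jones</LastName></Author>",
  "</AuthorList></Article></ArticleSet>")
doc <- xmlTreeParse(txt, asText = TRUE)

# The closure keeps 'found' alive between calls to visit().
makeCollector <- function(target) {
  found <- character()
  visit <- function(node) {
    if (xmlName(node) == target)
      found <<- c(found, xmlValue(node))   # non-local assignment
    kids <- xmlChildren(node)
    for (i in seq_along(kids))
      visit(kids[[i]])                     # recurse into every child
    invisible(NULL)
  }
  list(visit = visit, values = function() found)
}

collector <- makeCollector("LastName")
collector$visit(xmlRoot(doc))
lastNames <- collector$values()
```

Every node is visited from R code, which is what makes this approach slow on large trees.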

Handlers

Alternative approach: when we read the XML tree into R and convert it to a list of lists of children, ...
When converting each C-level node, see if the caller has a function registered corresponding to the name/type of the node.
If so, call it and allow it to extract and store the data.
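A handler registered under an element name is called for each such node while the tree is being built. The sketch below assumes the XML package is installed; the History fragment and the helper `collectStatus()` are hypothetical.

```r
library(XML)

# Hypothetical fragment with two PubDate elements.
txt <- paste0(
  "<History>",
  "<PubDate PubStatus='received'><Year>2008</Year></PubDate>",
  "<PubDate PubStatus='accepted'><Year>2008</Year></PubDate>",
  "</History>")

# A closure: the PubDate handler records an attribute from each node it sees.
collectStatus <- function() {
  statuses <- character()
  list(PubDate = function(node, ...) {
         statuses <<- c(statuses, xmlGetAttr(node, "PubStatus"))
         node    # return the node so that it stays in the tree
       },
       statuses = function() statuses)
}

h <- collectStatus()
invisible(xmlTreeParse(txt, asText = TRUE, handlers = h))
statuses <- h$statuses()
```

No explicit traversal is written here: the parser does the walking and the handler only reacts to the nodes it was registered for.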

Efficient Parsing

Problem with previous styles is we have the entire tree in memory and then extract the data:
  2 times the data in memory at the end
  Bad news for large datasets
  All of Wikipedia pages - 11 gigabytes

Need to read the XML as it passes as a stream, extracting and storing the contents and discarding the XML.
SAX parsing - "Simple API for XML"!

    xmlEventParse(content,
                  list(startElement = function(node, ...) ...,
                       endElement = function(node, ...) ...,
                       text = function(x) ...,
                       comment = function(x) ...,
                       ...))

Whenever the XML parser sees a start/end/text/comment node, it calls the corresponding R function, which maintains state.
Awkward to write, but there to handle very large data.
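A small state machine of this kind needs only a flag that says whether the parser is currently inside the element of interest. This is a sketch, assuming the XML package is installed; the document and the helper `makeHandlers()` are hypothetical, and the handler argument names are this example's own choice.

```r
library(XML)

# Hypothetical document, small enough to pass as text.
txt <- paste0(
  "<ArticleSet>",
  "<Article><LastName>Smith</LastName></Article>",
  "<Article><LastName>Jones</LastName></Article>",
  "</ArticleSet>")

makeHandlers <- function(target) {
  inTarget <- FALSE          # state: are we inside a <target> element?
  values   <- character()    # text collected so far
  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == target) inTarget <<- TRUE
    },
    text = function(content, ...) {
      if (inTarget && nzchar(content)) values <<- c(values, content)
    },
    endElement = function(name, ...) {
      if (name == target) inTarget <<- FALSE
    })
  list(handlers = handlers, values = function() values)
}

h <- makeHandlers("LastName")
invisible(xmlEventParse(txt, handlers = h$handlers, asText = TRUE))
lastNames <- h$values()
```

Only the collected strings remain in memory afterwards; no tree is ever built.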

Schema

Just like a database has a schema describing the characteristics of columns in all tables within a database, XML documents often have an XML Schema (or Document Type Definition - DTD) describing the "template" tree and what elements can/must go where, attributes, etc.
The XML Schema is written in XML, so we can read it!
And we can actually create R data types to represent the same elements in XML directly in R.
So we can automate some of the reading of XML elements into useful, meaningful R objects, although these are harder to programmatically flatten into data frames.

RCurl

xmlTreeParse() & xmlEventParse() can read from files, compressed files, URLs, direct text - but limited connection support.
RCurl package provides very rich ways that extend R's ability to access content from URLs, etc. over the Internet:
  HTTPS - encrypted/secure HTTP
  passwords/authentication
  efficient, persistent connections
  multiplexing
  different protocols

Pass results to XML parser or other consumers.

Exceptions/Conditions
