Data Path - Python Crawler - Xpath

Posted by robkir on Fri, 02 Aug 2019 20:49:05 +0200

Introduction to XML

What is XML?

  • XML stands for eXtensible Markup Language.
  • XML is a markup language, much like HTML.
  • XML is designed to transfer data, not to display it.
  • XML tags are not predefined; you define them yourself.
  • XML is designed to be self-descriptive.
  • XML is a W3C Recommendation.

W3School Official Documentation: http://www.w3school.com.cn/xm...

The difference between XML and HTML

Different grammar requirements

  • HTML tag names are case-insensitive; XML is strictly case-sensitive.
  • HTML is sometimes lenient: if the context makes it clear where a paragraph or list item ends, you can omit closing tags such as </p> or </li>. XML is a strict tree structure, and the end tag must never be omitted.
  • In XML, an element that has a single tag with no matching closing tag must end with a / character. That way the parser knows it does not need to look for an end tag.
  • In XML, attribute values must be enclosed in quotes. In HTML, the quotes are optional.
  • In HTML, you can have attribute names without values. In XML, every attribute must have a value.
  • In an XML document, the parser does not automatically strip white space, whereas HTML collapses runs of white space when rendering.
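
The strictness rules above can be checked with a short script. This is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree parser (the article itself uses lxml later on). The helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, one per grammar rule listed above
samples = {
    "closed tags": "<p>text</p>",
    "missing end tag": "<ul><li>item</ul>",
    "mismatched case": "<Book></book>",
    "empty element with slash": "<root><br/></root>",
    "empty element without slash": "<root><br></root>",
    "unquoted attribute": "<a href=link1.html>x</a>",
    "attribute without value": "<input disabled/>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

Only the two samples that follow every XML rule are accepted; an HTML parser would tolerate most of the others.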

Different design goals

  • XML is designed to transfer and store data, with the focus on the content of the data.
  • HTML is designed to display data, with the focus on how the data looks.

Node Relationships in XML

1. Parent
Each element and attribute has a parent.
Here is a simple XML example where the book element is the parent of the title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

2. Children
Element nodes can have zero, one or more children.
In the following example, the title, author, year, and price elements are all children of the book element:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

3. Sibling
Nodes that share the same parent.
In the following example, the title, author, year, and price elements are siblings:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

4. Ancestor
The parent of a node, the parent of a parent, and so on.
In the following example, the ancestors of the title element are the book and bookstore elements:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

5. Descendant
Children of a node, children of a child, and so on.
In the following example, the descendants of a bookstore are book, title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>
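
The five relationships above can be explored in code. The following is a rough sketch, assuming the standard-library xml.etree.ElementTree module rather than lxml. ElementTree elements do not keep a reference to their parent, so the parent_of dictionary (a name chosen here for illustration) is built by hand.

```python
import xml.etree.ElementTree as ET

# Same sample as above; the XML declaration is left out for brevity
xml_text = """<bookstore>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
</bookstore>"""

root = ET.fromstring(xml_text)

# Map every child element to its parent element
parent_of = {child: parent for parent in root.iter() for child in parent}

title = root.find("./book/title")

# Parent: the element that directly contains title
book = parent_of[title]

# Children: the elements directly inside book
children = [child.tag for child in book]

# Siblings: the other children of the same parent
siblings = [child.tag for child in book if child is not title]

# Ancestors: walk upward until the root has been reached
ancestors = []
node = title
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# Descendants: every element below the root, at any depth
descendants = [el.tag for el in root.iter() if el is not root]

print(book.tag, children, siblings, ancestors, descendants)
```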

XPath

What is XPath?

XPath (XML Path Language) is a language for finding information in an XML document; it can be used to traverse the elements and attributes of the document. It was originally designed for searching XML documents, but it works for HTML documents as well.
So when writing a crawler, you can use XPath to extract the information you need.

W3School Official Documentation: http://www.w3school.com.cn/xp...

XPath development tools

  1. Open-source XPath expression editor: XMLQuire (for XML files)
  2. Chrome plugin: XPath Helper
  3. Firefox plugin: XPath Checker

Use XPath

XPath uses path expressions to select nodes or node sets in an XML document. These path expressions are very similar to the paths used in an ordinary computer file system.
1. Common XPath rules

Expression    Description
nodename      Select all nodes named nodename
/             Select direct child nodes of the current node (at the start of a path, select from the root)
//            Select descendant nodes of the current node, at any depth
.             Select the current node
..            Select the parent of the current node
@             Select attributes
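
As a quick illustration of these rules, here is a small sketch that uses the limited XPath subset supported by the standard-library xml.etree.ElementTree module (an assumption made so the snippet runs without extra packages). The lang attribute on title is added here purely for the example, and ElementTree only accepts @ inside a predicate such as [@lang].

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; lang="en" is added so the attribute rule has something to match
root = ET.fromstring(
    "<bookstore>"
    "<book><title lang='en'>Harry Potter</title><price>29.99</price></book>"
    "</bookstore>"
)

# nodename and / : direct children of the current node named book
direct_children = [el.tag for el in root.findall("./book")]

# // : descendants named title at any depth
all_titles = [el.text for el in root.findall(".//title")]

# . : the current node itself
current = root.find(".").tag

# .. : the parent of each title element
parents_of_title = [el.tag for el in root.findall(".//title/..")]

# @ : elements that carry a lang attribute
with_lang = [el.get("lang") for el in root.findall(".//*[@lang]")]

print(direct_children, all_titles, current, parents_of_title, with_lang)
```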

2. XPath usage examples
Take the following XML document as an example:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

Path expression    Result
bookstore          Select all nodes named bookstore
/bookstore         Select the root element bookstore. Note: a path that starts with a forward slash / is an absolute path to an element
bookstore/book     Select all book elements that are children of bookstore
//book             Select all book elements, wherever they are in the document
bookstore//book    Select all book elements that are descendants of the bookstore element, however deep they sit beneath it
//@lang            Select all attributes named lang
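
The expressions in the table can be tried out with lxml, which is installed in the next section. This is only a sketch: the lang="en" attribute is an assumption added so that //@lang has something to match, and all expressions are evaluated with the bookstore element as the context node.

```python
from lxml import etree

# Sample document as bytes, so the encoding declaration is accepted by fromstring()
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
</bookstore>"""

root = etree.fromstring(xml_bytes)

# Absolute path: starts at the document root
root_nodes = root.xpath("/bookstore")

# All book elements anywhere in the document
books = root.xpath("//book")

# All book elements beneath bookstore, at any depth
nested_books = root.xpath("/bookstore//book")

# All attributes named lang (returned as strings)
langs = root.xpath("//@lang")

# Relative path: evaluated from the context node, here the bookstore element
relative_titles = root.xpath("book/title/text()")

# The context node is already bookstore, so it has no bookstore child
no_match = root.xpath("bookstore/book")

print([el.tag for el in root_nodes], len(books), langs, relative_titles, no_match)
```

A relative expression such as bookstore/book only matches when the context node is the parent of bookstore, which is why the last query comes back empty here.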

lxml library use

lxml library installation

1. Windows Installation
Open cmd to enter command-line mode, then run:

pip3 install lxml

2. Ubuntu 16.04 installation
Press Ctrl+Alt+T to open a terminal, then run:

sudo apt-get install -y build-essential libssl-dev libffi-dev libxml2-dev libxslt1-dev zlib1g-dev

After installing the dependencies, install lxml with pip:

sudo pip3 install lxml

3. Verify installation
Import the lxml module; if no error is raised, the installation succeeded.

$ python3
>>> import lxml
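
The same check can be scripted. The sketch below is one possible way to do it: it uses importlib.util.find_spec from the standard library to test for the package without raising an error, and reads lxml's version tuple only when the import is possible. The function name lxml_available is made up for this example.

```python
import importlib.util


def lxml_available():
    """Return True if the lxml package can be imported in this environment."""
    return importlib.util.find_spec("lxml") is not None


installed = lxml_available()
if installed:
    from lxml import etree

    # LXML_VERSION is a tuple of integers
    print("lxml version:", ".".join(str(part) for part in etree.LXML_VERSION))
else:
    print("lxml is not installed; run: pip3 install lxml")
```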

etree module use

Initial use
File name: lxml_test.py

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# etree.HTML() parses the string into an HTML document and automatically repairs the markup
html = etree.HTML(text)

# Serialize the HTML document to bytes
ret = etree.tostring(html)

# tostring() returns bytes; decode() converts them to a str
print(ret.decode('utf-8'))

Output results:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

The etree module automatically corrects the HTML: in this example it not only adds the missing closing li tag but also wraps the content in html and body tags.
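
Once the string has been parsed, the XPath rules from earlier can be applied to the result, which is the step a crawler actually needs. The following is a minimal sketch on a shortened, hypothetical version of the list above (three items, the last one still missing its closing tag).

```python
from lxml import etree

# Shortened sample; the last <li> is deliberately left unclosed
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

html = etree.HTML(text)

# The parser returns the html root element and adds a body element
root_tag = html.tag
body_tag = html[0].tag

# The unclosed <li> is still recognised as a list item
item_count = len(html.xpath('//li'))

# Text of every link inside a list item
link_texts = html.xpath('//li/a/text()')

# href attribute of every link inside a list item
link_hrefs = html.xpath('//li/a/@href')

print(root_tag, body_tag, item_count, link_texts, link_hrefs)
```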

File Read
In addition to parsing strings directly, lxml can read from files. Here I save the output of the lxml_test.py script above as test.html:

python lxml_test.py >> test.html

The contents of test.html, as shown by cat test.html:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

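Use the etree.parse() method to read the file:

from lxml import etree

# Parse the file with an HTML parser; the default parser expects well-formed XML
html = etree.parse('./test.html', etree.HTMLParser())

# Serialize the parsed document and decode the bytes to a str
ret = etree.tostring(html)
print(ret.decode('utf-8'))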

Output results:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

The output has an additional DOCTYPE declaration, which has no effect on the parsing results.
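
To try the file-based workflow without creating test.html by hand, the sketch below writes a small HTML fragment to a temporary directory and parses it with etree.parse(). The markup is a hypothetical one-item excerpt of the file above. The [@class="bold"] predicate is standard XPath, although predicates are not covered in the rules table earlier.

```python
import os
import tempfile

from lxml import etree

# Hypothetical one-item excerpt of test.html
markup = (
    '<div><ul>'
    '<li class="item-inactive"><a href="link3.html">'
    '<span class="bold">third item</span></a></li>'
    '</ul></div>'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(markup)
    # parse() reads the whole file, so the tree stays usable after cleanup
    tree = etree.parse(path, etree.HTMLParser())

# parse() returns an ElementTree; getroot() gives the html element
root_tag = tree.getroot().tag

# Text inside the span whose class attribute is "bold"
bold_text = tree.xpath('//span[@class="bold"]/text()')

# href of the link inside the list item
link = tree.xpath('//li/a/@href')

print(root_tag, bold_text, link)
```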
