Introduction to XML
What is XML?
- XML stands for EXtensible Markup Language
- XML is a markup language, much like HTML
- XML is designed to transfer data, not display it
- XML tags are not predefined; you define your own tags.
- XML is designed to be self-descriptive.
- XML is a W3C Recommendation
W3School Official Documentation: http://www.w3school.com.cn/xm...
The difference between XML and HTML
Different grammar requirements
- HTML tag and attribute names are case-insensitive; XML is case-sensitive.
- HTML is sometimes lenient. If the context clearly shows where a paragraph or list item ends, you can omit closing tags such as </p> or </li>. XML is a strict tree structure, and the end tag must never be omitted.
- In XML, an element that has a single tag with no matching closing tag must end with a / character (for example, <br/>). This way the parser knows it does not have to look for an end tag.
- In XML, attribute values must be enclosed in quotes. In HTML, the quotes are optional.
- In HTML, you can have attribute names without values. In XML, every attribute must have a value.
- In an XML document, white space is not automatically removed by the parser, whereas HTML collapses white space.
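The strictness rules above can be checked with any XML parser. The following is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than a browser; the snippets and the is_well_formed helper are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets: a lenient HTML parser would tolerate each of them,
# but all except the last break one of the XML rules listed above.
snippets = {
    "omitted end tag": "<ul><li>one</ul>",
    "unquoted attribute value": "<a href=link1.html>first</a>",
    "attribute without a value": "<input disabled/>",
    "mismatched case": "<Title>Harry Potter</title>",
    "well-formed": '<a href="link1.html">first<br/></a>',
}


def is_well_formed(markup):
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


results = {name: is_well_formed(markup) for name, markup in snippets.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

Only the last snippet is accepted; the others raise ParseError.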
Different design goals
- XML is designed to transfer and store data, with the focus on the content of the data.
- HTML is designed to display data, with the focus on how the data looks.
Node Relationships in XML
1. Parent
Each element and attribute has a parent.
Here is a simple XML example where the book element is the parent of the title, author, year, and price elements:
<?xml version="1.0" encoding="utf-8"?>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
2. Children
Element nodes can have zero, one or more children.
In the following example, the title, author, year, and price elements are all children of the book element:
<?xml version="1.0" encoding="utf-8"?>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
3. Sibling
Nodes that have the same parent.
In the following example, the title, author, year, and price elements are siblings:
<?xml version="1.0" encoding="utf-8"?>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
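The parent, child, and sibling relationships can also be inspected in code. This is a small sketch using the standard-library xml.etree.ElementTree; the parent_of lookup is a helper built for this example, because ElementTree elements do not store a link to their parent.

```python
import xml.etree.ElementTree as ET

xml_doc = """<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>"""

book = ET.fromstring(xml_doc)

# Children: iterating over an element yields its direct child elements.
children = [child.tag for child in book]

# Parent: build a child -> parent lookup by visiting every element once.
parent_of = {child: parent for parent in book.iter() for child in parent}

title = book.find("title")

# Siblings: the other children of the same parent.
siblings = [el.tag for el in parent_of[title] if el is not title]

print(children)              # ['title', 'author', 'year', 'price']
print(parent_of[title].tag)  # book
print(siblings)              # ['author', 'year', 'price']
```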
4. Ancestor
The parent of a node, the parent of a parent, and so on.
In the following example, the ancestors of the title element are the book and bookstore elements:
<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
5. Descendant
Children of a node, children of a child, and so on.
In the following example, the descendants of the bookstore element are the book, title, author, year, and price elements:
<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
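Ancestors and descendants can be listed the same way. The sketch below again assumes the standard-library xml.etree.ElementTree; the ancestors function and the parent_of lookup are illustrative helpers, not part of any library.

```python
import xml.etree.ElementTree as ET

xml_doc = """<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

bookstore = ET.fromstring(xml_doc)

# Descendants: iter() yields the element itself first, then everything
# below it in document order, so drop the first item.
descendants = [el.tag for el in bookstore.iter()][1:]

# Child -> parent lookup, needed because elements do not know their parent.
parent_of = {child: parent for parent in bookstore.iter() for child in parent}


def ancestors(element):
    """Return the tags of all ancestors, nearest ancestor first."""
    found = []
    while element in parent_of:
        element = parent_of[element]
        found.append(element.tag)
    return found


title = bookstore.find("book/title")
print(descendants)       # ['book', 'title', 'author', 'year', 'price']
print(ancestors(title))  # ['book', 'bookstore']
```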
XPath
What is XPath?
XPath, short for XML Path Language, is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document. It was originally designed for searching XML documents, but it works on HTML documents as well.
So when writing a crawler, you can use XPath to extract the information you need.
W3School Official Documentation: http://www.w3school.com.cn/xp...
XPath Development Tools
- Open source XPath expression editor: XMLQuire (works with XML files)
- Chrome plugin: XPath Helper
- Firefox plugin: XPath Checker
Using XPath
XPath uses path expressions to select nodes or node sets in an XML document. These path expressions are very similar to the paths used in an ordinary computer file system.
1. Common XPath Rules
Expression | Description |
---|---|
nodename | Select all child elements named nodename of the current node |
/ | Select from the root node when the path starts with /; between steps, select a direct child |
// | Select descendant nodes of the current node, at any depth |
. | Select the current node |
.. | Select the parent of the current node |
@ | Select attributes |
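The rules in the table can be tried out quickly. This is a rough sketch using the standard-library xml.etree.ElementTree, which supports only a subset of XPath (attributes can be used in predicates but not selected directly); the category attribute is an assumption added for the example.

```python
import xml.etree.ElementTree as ET

bookstore = ET.fromstring(
    "<bookstore><book category='fiction'>"
    "<title>Harry Potter</title><price>29.99</price>"
    "</book></bookstore>"
)

# nodename: child elements of the current node with that name
books = bookstore.findall("book")
# / between steps: move to a direct child
titles = bookstore.findall("book/title")
# //: search all descendants of the current node
prices = bookstore.findall(".//price")
# .: the current node itself
current = bookstore.findall(".")
# ..: the parent of the node selected so far
parents = bookstore.findall(".//title/..")
# @: in ElementTree, attributes are only usable inside a predicate
fiction = bookstore.findall("book[@category='fiction']")

print([el.tag for el in books])    # ['book']
print(titles[0].text)              # Harry Potter
print(prices[0].text)              # 29.99
print([el.tag for el in current])  # ['bookstore']
print([el.tag for el in parents])  # ['book']
print(len(fiction))                # 1
```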
2. XPath usage examples
Take the following XML document as an example:
<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Path expression | Result |
---|---|
bookstore | Select all elements named bookstore that are children of the current node |
/bookstore | Select the root element bookstore. Note: if a path starts with a forward slash /, it is an absolute path to an element |
bookstore/book | Select all book elements that are children of bookstore |
//book | Select all book elements, no matter where they are in the document |
bookstore//book | Select all book elements that are descendants of the bookstore element, no matter where they are located beneath it |
//@lang | Select all attributes named lang |
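The path expressions in the table can be evaluated with the xpath() method of lxml (installation is covered in the next section). The following is a sketch only: the lang attribute on title is an assumption added so that //@lang has something to match, and tags is a helper written for this example.

```python
from lxml import etree

# Bytes are used here because lxml refuses str input that carries
# an XML encoding declaration.
xml_doc = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = etree.fromstring(xml_doc)


def tags(expression):
    """Evaluate an XPath expression and return the tags of the matched elements."""
    return [el.tag for el in root.xpath(expression)]


print(tags("/bookstore"))         # ['bookstore']
print(tags("/bookstore/book"))    # ['book']
print(tags("//book"))             # ['book']
print(tags("/bookstore//title"))  # ['title']
print(tags("book"))               # ['book'] (relative to the root element)
print(root.xpath("//@lang"))      # ['en']
```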
Using the lxml library
Installing the lxml library
1. Windows Installation
Open cmd to enter command-line mode and run:
pip3 install lxml
2. Installation on Ubuntu 16.04
Press Ctrl+Alt+T to open a terminal and run:
sudo apt-get install -y build-essential libssl-dev libffi-dev libxml2-dev libxslt1-dev zlib1g-dev
After installing the dependencies, install lxml with pip:
sudo pip3 install lxml
3. Verify installation
Import the lxml module. If no error is raised, the installation succeeded.
$ python3
>>> import lxml
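Beyond a bare import, the installed versions can be printed as a quick sanity check. This short sketch relies on the LXML_VERSION and LIBXML_VERSION tuples exposed by lxml.etree.

```python
from lxml import etree

# Version of the lxml package itself, as a tuple of integers
print("lxml:", etree.LXML_VERSION)
# Version of the underlying libxml2 C library that lxml is running against
print("libxml2:", etree.LIBXML_VERSION)
```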
Using the etree module
Initial use
File name: lxml_test.py
# Import the etree module from the lxml library
from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
# Note: the last <li> above is deliberately left without its closing </li> tag
# etree.HTML() parses the string into an HTML document and automatically corrects the markup
html = etree.HTML(text)
# Serialize the HTML document
ret = etree.tostring(html)
# tostring() returns bytes, so decode() is used to convert the result to a string
print(ret.decode('utf-8'))
Output results:
<html><body>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>
The etree module automatically corrects HTML code. In this example it not only completes the li tag but also adds the body and html tags.
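Once the string has been parsed and repaired, the resulting tree can be queried with XPath, which is the typical step when crawling. The sketch below uses a shortened copy of the sample markup; the variable names are chosen for illustration.

```python
from lxml import etree

# Shortened sample markup; the last <li> is intentionally left unclosed.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)

# Text of every link inside a list item
link_texts = html.xpath('//li/a/text()')
# href attribute of every link
hrefs = html.xpath('//li/a/@href')
# Only the links whose list item has class="item-0"
item0 = html.xpath('//li[@class="item-0"]/a/text()')
# The repaired tree still contains three li children under ul
li_count = len(html.xpath('//ul/li'))

print(link_texts)  # ['first item', 'third item', 'fifth item']
print(hrefs)       # ['link1.html', 'link3.html', 'link5.html']
print(item0)       # ['first item', 'fifth item']
print(li_count)    # 3
```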
File Read
In addition to parsing strings directly, lxml can read from files. Here I save the content from the lxml_test.py example above as test.html:
python lxml_test.py >> test.html
The contents of test.html, shown with cat test.html:
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
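Use the etree.parse() method to read the file, passing an HTMLParser instance so that the file is parsed as HTML:
from lxml import etree
# HTMLParser lives in the etree module, so it is referenced as etree.HTMLParser
html = etree.parse('./test.html', etree.HTMLParser())
ret = etree.tostring(html)
print(ret.decode('utf-8'))
Output results: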
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>
The output has an additional DOCTYPE declaration, which has no effect on the parsing results.
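To see that the parsed tree is usable regardless of the DOCTYPE line, here is a self-contained sketch that writes a small HTML file to a temporary directory, parses it with etree.parse(), and queries it. The temporary file and the shortened markup are assumptions made so the example runs on its own.

```python
import os
import tempfile

from lxml import etree

markup = '''<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the markup to a file so that it can be read back with etree.parse()
    path = os.path.join(tmp_dir, "test.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    tree = etree.parse(path, etree.HTMLParser())

# XPath queries work on the parsed tree as usual
hrefs = tree.xpath('//li[@class="item-0"]/a/@href')
print(hrefs)  # ['link1.html', 'link5.html']

# Serializing a single element instead of the whole tree leaves out
# the DOCTYPE and the html/body wrappers
ul = tree.xpath('//ul')[0]
ul_markup = etree.tostring(ul, encoding="unicode")
print(ul_markup)
```

Passing encoding="unicode" to tostring() returns a str directly, so no decode() call is needed.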