Python 3 crawler notes -- parsing library XPath

Posted by jester626 on Sun, 27 Oct 2019 12:47:55 +0100

  • XPath overview: XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents.

Common rules of XPath

Expression    Description
nodename      Select all child nodes of the current node named nodename
/             Select direct children of the current node
//            Select descendants of the current node
.             Select the current node
..            Select the parent of the current node
@             Select attributes
  • The following example selects all title nodes whose lang attribute has the value eng.
//title[@lang='eng']
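To make the expression concrete, here is a minimal sketch that runs it with lxml. The bookstore document and its titles are made up for this illustration and are not part of the original notes.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = '''
<bookstore>
    <book><title lang="eng">Harry Potter</title></book>
    <book><title lang="fr">Le Petit Prince</title></book>
</bookstore>
'''
root = etree.fromstring(xml.strip())

# Keep only title elements whose lang attribute equals "eng".
titles = root.xpath("//title[@lang='eng']")
title_texts = [t.text for t in titles]
print(title_texts)
```

Only the first title should be printed, because the second one has lang="fr".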
  • In the following example, we first import the etree module of the lxml library and declare a piece of HTML text, then pass it to etree.HTML() to initialize it, which constructs an object that can be queried with XPath. Note that the last li node in the HTML text is not closed; the etree module corrects the HTML automatically.
from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

#Output results:
<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>
  • Here we call the tostring() method to output the corrected HTML. After processing, the closing tag of the last li node has been filled in, and the html and body nodes have been added automatically. The result is of type bytes, so we use the decode() method to convert it to str.

  • In addition, you can also read the text file directly for parsing, for example:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
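The file-based examples in these notes read ./test.html, which presumably holds the same markup as the text variable above. The sketch below is self-contained under that assumption: it writes a shortened copy of the markup to a temporary file (the temporary directory is an assumption of this sketch, not part of the original) and parses it the same way.

```python
import os
import tempfile
from lxml import etree

# Shortened copy of the sample markup used throughout these notes.
sample = '''<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
'''

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    # etree.parse() returns an ElementTree; HTMLParser tolerates HTML quirks.
    tree = etree.parse(path, etree.HTMLParser())
    root_tag = tree.getroot().tag
    links = tree.xpath('//li/a/@href')

print(root_tag, links)
```

The root tag is html because the parser wraps the fragment in html and body nodes, as in the etree.HTML() example.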

All nodes

  • Here * matches all nodes, so every node in the HTML text is retrieved. The return value is a list in which each item is of type Element, shown together with the node name: html, body, div, ul, li, a, and so on.
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

#Operation result:
[<Element html at 0x10510d9c8>, <Element body at 0x10510da08>, <Element div at 0x10510da48>, <Element ul at 0x10510da88>, <Element li at 0x10510dac8>, <Element a at 0x10510db48>, <Element li at 0x10510db88>, <Element a at 0x10510dbc8>, <Element li at 0x10510dc08>, <Element a at 0x10510db08>, <Element li at 0x10510dc48>, <Element a at 0x10510dc88>, <Element li at 0x10510dcc8>, <Element a at 0x10510dd08>]

Of course, node names can also be specified for matching here. If you want to get all li nodes, an example is as follows:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])

#Operation result
[<Element li at 0x105849208>, <Element li at 0x105849248>, <Element li at 0x105849288>, <Element li at 0x1058492c8>, <Element li at 0x105849308>]
<Element li at 0x105849208>

Child node

  • We can find the child nodes or descendant nodes of an element with / or //. For example, to select all direct a child nodes of the li nodes:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)
  • The / here selects direct child nodes. To get all descendant nodes, use //. For example, to obtain all descendant a nodes under the ul node:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)
  • But if you use //ul/a here, you won't get any results, because / only selects direct child nodes, and the ul node has no direct a child nodes, only li nodes.
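The difference between / and // can be checked with a small sketch. It uses an inline, shortened copy of the sample markup instead of test.html so that it runs on its own; the variable names are chosen for this illustration.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)

direct = html.xpath('//ul/a')        # a nodes that are direct children of ul
descendants = html.xpath('//ul//a')  # a nodes anywhere below ul
via_li = html.xpath('//ul/li/a')     # spelling out the intermediate li step

print(len(direct), len(descendants), len(via_li))
```

The first query finds nothing, while the other two each find both a nodes.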

Parent node

  • To find the parent node, you can use ..
  • For example, we first select the a node whose href attribute is link4.html, then obtain its parent node, and then obtain the parent's class attribute. The relevant code is as follows:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
  • Alternatively, we can get the parent node through the parent:: axis, as follows:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
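Neither parent example shows its output, so here is a self-contained sketch that runs both forms on an inline copy of part of the sample markup. The last lines use lxml's getparent() method as a Python-side alternative; that approach is an addition of this sketch and does not come from the original notes.

```python
from lxml import etree

text = '''
<ul>
     <li class="item-inactive"><a href="link3.html">third item</a></li>
     <li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
'''
html = etree.HTML(text)

# Both expressions step from the a node up to its parent li.
with_dots = html.xpath('//a[@href="link4.html"]/../@class')
with_axis = html.xpath('//a[@href="link4.html"]/parent::*/@class')

# Same navigation done in Python on the selected element.
a_node = html.xpath('//a[@href="link4.html"]')[0]
parent_class = a_node.getparent().get('class')

print(with_dots, with_axis, parent_class)
```

All three should report the class of the fourth li node, item-1.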

Attribute matching

  • @: attribute filtering
  • For example, if you want to select the li node whose class is item-0, you can do this:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

Text acquisition

  • text(): get the text in the node
  • In the following example, result1 selects the direct text of the li nodes and yields only a line break with indentation: the a node wraps the visible text, so the only direct text is the whitespace left in the auto-corrected last li node. result2 goes layer by layer, first selecting the li node and then the text of its direct a child node. result3 uses // to match all text in the li node and its descendant nodes.
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result1 = html.xpath('//li[@class="item-0"]/text()')
print(result1)

result2 = html.xpath('//li[@class="item-0"]/a/text()')
print(result2)

result3 = html.xpath('//li[@class="item-0"]//text()')
print(result3)

#Output results:
['\n     ']
['first item', 'fifth item']
['first item', 'fifth item', '\n     ']
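The stray whitespace entry is usually not wanted. As a hedged sketch (not part of the original notes), it can be dropped either in Python or inside the XPath expression with a normalize-space() predicate, which keeps only text nodes that contain something other than whitespace.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

raw = html.xpath('//li[@class="item-0"]//text()')

# Option 1: filter out whitespace-only strings in Python.
cleaned = [t.strip() for t in raw if t.strip()]

# Option 2: let XPath skip text nodes that are empty after normalization.
filtered = html.xpath('//li[@class="item-0"]//text()[normalize-space()]')

print(raw)
print(cleaned)
print(filtered)
```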

Attribute acquisition

  • @: gets the value of an attribute
  • Note the difference between attribute matching and attribute acquisition: attribute matching filters nodes by an attribute name and value in square brackets, such as [@href="link1.html"], while @href at the end of a path returns the value of that attribute.
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

#Output results:
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

Attribute multi value matching

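  • contains(): when a node's attribute has multiple values, an exact match on a single value fails; contains() matches as long as the attribute contains the given value. The first argument is the attribute name and the second is the value to look for.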
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
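The sketch below contrasts an exact match with contains(). One caveat worth knowing: contains() is a plain substring test, so "li" also matches a class such as list-end. The second li node here is a hypothetical addition to show that, and the concat/normalize-space expression is a common idiom for matching a whole class token rather than something taken from the original notes.

```python
from lxml import etree

text = '''
<ul>
<li class="li li-first"><a href="link.html">first item</a></li>
<li class="list-end"><a href="other.html">other item</a></li>
</ul>
'''
html = etree.HTML(text)

# Exact comparison fails: no class attribute is exactly "li".
exact = html.xpath('//li[@class="li"]/a/text()')

# Substring test: matches both "li li-first" and "list-end".
partial = html.xpath('//li[contains(@class, "li")]/a/text()')

# Whole-token test: pad with spaces so only the token "li" matches.
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(exact, partial, token)
```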

Multi attribute matching

  • and: an XPath operator that combines several conditions, so that a node must satisfy all of them to match
from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
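A quick way to see the effect of and is to change one condition so that it no longer holds. The sketch below does that, and also shows the or operator for comparison; the name value "other" is made up for the illustration.

```python
from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)

# Both conditions hold, so the node matches.
both = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')

# The second condition fails, so "and" rejects the node.
wrong_name = html.xpath('//li[contains(@class, "li") and @name="other"]/a/text()')

# With "or", one satisfied condition is enough.
either = html.xpath('//li[contains(@class, "li") or @name="other"]/a/text()')

print(both, wrong_name, either)
```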

Sequential selection

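  • Sometimes a selection matches several nodes at the same time, but we only want one of them, such as the second node or the last node. In that case, pass an index or a function in square brackets.
  • Note that the ordinal in square brackets starts with 1, not 0. In the code below, li[1] selects the first li node, li[last()] the last one, li[position()<3] the first two, and li[last()-2] the third from last.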
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)
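The original does not list the output of these four selections. The following sketch repeats them on the same markup with distinct variable names (chosen for this illustration) so that each result can be inspected separately.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

first = html.xpath('//li[1]/a/text()')                  # first li
last = html.xpath('//li[last()]/a/text()')              # last li
first_two = html.xpath('//li[position()<3]/a/text()')   # positions 1 and 2
third_from_last = html.xpath('//li[last()-2]/a/text()') # 5 - 2 = position 3

print(first, last, first_two, third_from_last)
```

Positions are counted among the li children of each parent, so in a document with several lists //li[1] would return the first li of every list.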

Node axis selection

  • In the first selection, we call the ancestor axis to get all ancestor nodes. The axis name is followed by two colons and then the node selector. Here we use * to match all nodes, so the result is every ancestor of the first li node: html, body, div and ul.

  • In the second selection, we add a restriction by writing div after the colons, so the only result is the div ancestor node.

  • In the third selection, we call the attribute axis with a selector of *, which means all attributes of the node. The return value is the list of attribute values of the li node.

  • In the fourth selection, we call the child axis to get direct child nodes. Here we add a qualification and select the a node whose href attribute is link1.html.

  • In the fifth selection, we call the descendant axis to get descendant nodes. We add the constraint span, so the result contains only the span node and not the a node.

  • In the sixth selection, we call the following axis to get all nodes after the current node. Although we use * here, we also add an index, so we only get the second subsequent node.

  • In the seventh selection, we call the following-sibling axis to get all sibling nodes after the current node. Here we use *, so we get all subsequent siblings.

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)


#Operation result
[<Element html at 0x107941808>, <Element body at 0x1079418c8>, <Element div at 0x107941908>, <Element ul at 0x107941948>]
[<Element div at 0x107941908>]
['item-0']
[<Element a at 0x1079418c8>]
[<Element span at 0x107941948>]
[<Element a at 0x1079418c8>]
[<Element li at 0x107941948>, <Element li at 0x107941988>, <Element li at 0x1079419c8>, <Element li at 0x107941a08>]
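Element addresses such as 0x107941808 differ from run to run and say little about which node was matched. As a small sketch, the same axis queries can be summarized by tag names and attribute values instead; the variable names are chosen for this illustration.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

# Ancestors of the first li, listed by tag name in document order.
ancestor_tags = [el.tag for el in html.xpath('//li[1]/ancestor::*')]

# Second node on the following axis: the a inside the second li.
second_following = html.xpath('//li[1]/following::*[2]')[0]

# Later siblings of the first li, identified by their class attribute.
sibling_classes = [el.get('class')
                   for el in html.xpath('//li[1]/following-sibling::*')]

print(ancestor_tags)
print(second_following.tag, second_following.get('href'))
print(sibling_classes)
```

The following axis walks the document in order, so its first node is the second li and its second node is that li's a child.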
