A Python illiterate's beginner diary, day seven: using xml, xslt and xpath in Python, and a first Python scrape

Posted by Danno13 on Sun, 23 Jan 2022 22:50:10 +0100

Using Python crawlers for data collection seems very popular these days, so let's learn about Python crawlers too. Well, take a look at the Python technology roadmap in Uncle Long's blog. It's very good, it covers everything.... Wait, is something missing?

Reading the technology roadmap again carefully... there seems to be no mention of xml or xslt, only xpath?

Why does Lao Gu suddenly care about xml and xslt? Because he ran into some real oddities in his past scraping work... Take this example page, https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/BILLS-117hr3237ih.xml , historical case information from an American legal service website. The page... looks like html. Open the browser console and inspect the elements — yes, it's html. But view the source and you'll be stunned... this is xml rendered with an xslt stylesheet!!!

So that future collection won't be blocked by xml sites, we should first get the basics down and continue only after learning xml properly. After all, html is relatively easy to parse — etree, dom, xpath and even css selectors will basically do the job. The page above is fairly simple and its xml fairly standard, so extracting information is easy, but... this is not html content! You have to handle it as xml! What's more, Lao Gu has met xml that only yields the expected result after a series of complex combinations in xslt! So, in Gu's mind, xml and xslt must come before crawler learning.

-----------------------------------

First, the knowledge source: https://docs.python.org/zh-cn/3/library/xml.html . That page lists six submodules of xml. Let's go through them one by one and sort out which ones we will use first; namespaces will not be discussed for now. (An aside: Gu has always believed that the most important thing in programming is the idea, followed by an understanding of the tools, language and environment. You only need to know what packages the language has, what each package can do, and how to realize your idea with them, and you can start programming. The concrete implementation is really just mechanical labor.)

1,xml.etree.ElementTree: ElementTree API, a simple and lightweight XML processor

import xml.etree.ElementTree as ET

# Load xml from a file and get an ElementTree, i.e. the XmlDocument equivalent
tree = ET.parse('xml_file.xml')
# Get the root node (Element) from the ElementTree object
root = tree.getroot()

# Load the xml using a string and get the root node (Element) directly
root = ET.fromstring(xml_doc_string)

# ET.tostring serializes the given Element (and its subtree) as an xml string, i.e. Node.OuterXml
xml_str = ET.tostring(root)

# Element.tag gets the tag name of the specified element, that is, NodeName
print(root.tag)

# Element.attrib gets the attribute dictionary of the given Element, i.e. Attributes
print(root.attrib)

# Iterating over an Element yields its direct children, i.e. Node.ChildNodes
for e in root:
    print(e.tag)

# Element can be treated directly as a list: child nodes are indexed with numbers, and indexing can be chained
root[0] # Returns the first child node of root
root[0][0] # Returns the first child node of the first child node of root

# Element.iter() iterates over all descendants (not just direct children) with the given tag name
print([n.tag for n in root.iter('item')])

# Element.findall() searches only the direct children of the current element for the given tag.
# Element.find() finds the first child with a given tag;
# Element.text then gives that element's text content,
# and Element.get() reads its attributes:

# The text attribute of Element is the text of the current node only (excluding child nodes), i.e. InnerText
print(root.text)

# Element.findall() supports a subset of xpath. Without a path prefix it is relative to the current node;
# to use a path prefix such as //, the expression must explicitly start from the current node, i.e. begin with '.'
print([n.tag for n in root.findall('.//item')])

# ET.SubElement adds a child node to an element
ET.SubElement(root, 'new_tag', {'attr': 'value'})  # 'new_tag' and the attribute dict are placeholders
After reading through ElementTree, a few takeaways:
1,findall() only supports simple xpath, not full xpath — e.g. [name()="item"] does not work. Apart from text comparison ([.='value']), attribute comparison ([@attr]) and child-element comparison ([tag]) there seem to be very few supported functions; the only one I know of so far is last() (a small sketch of the supported forms follows below)
2,Because * only acts as a wildcard, and given point 1, ET does not support regex-style xpath either
3,text is only the current node's own text and does not include the text of child nodes. That is fine in itself, but I'm not used to it yet
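To make point 1 concrete, here is a minimal sketch of the predicate forms that findall() does accept. The xml snippet and tag names are made up purely for illustration (the [tag='value'] form needs Python 3.7+):

import xml.etree.ElementTree as ET

demo = ET.fromstring('<catalog><item id="a"><price>10</price></item><item id="b"><price>20</price></item></catalog>')

print([e.get('id') for e in demo.findall('.//item[@id]')])         # [@attribute]
print([e.get('id') for e in demo.findall(".//item[@id='b']")])     # [@attribute='value']
print([e.get('id') for e in demo.findall('.//item[price]')])       # [tag]
print([e.get('id') for e in demo.findall(".//item[price='20']")])  # [tag='value']
print([e.get('id') for e in demo.findall('.//item[last()]')])      # [last()]
# Anything fancier, e.g. .//item[name()="item"], raises SyntaxError: invalid predicate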

2,xml.dom: DOM API definition

After staring at it for a long time.... this is a base class with no implemented methods. Skip. Reading the description as a whole, it is the same set of things as the browser DOM, so it should be very convenient for people who are used to js.

3,xml.dom.minidom: minimal DOM implementation

After reading this part, I found it very thin: apart from parse and parseString there is nothing much to look at.

Combined with xml.dom, the parts defined by the DOM API are basically settled: this minidom is the js DOM of long ago. There are no newer gadgets like querySelector; getElementById isn't usable here either, only the getElementsBy* methods. I then tested it with parseString: if the returned html is not well formed it cannot be parsed at all, so it can't be used as an html parser.
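For the record, a minimal minidom sketch; the xml string is made up just to show parseString and the getElementsBy* style of lookup:

from xml.dom import minidom

doc = minidom.parseString('<catalog><item id="a">first</item><item id="b">second</item></catalog>')

items = doc.getElementsByTagName('item')   # the only practical lookup style here
print(items[0].getAttribute('id'))         # -> a
print(items[0].firstChild.data)            # text of the first item -> first
print(doc.documentElement.toxml())         # serialize back to an xml string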

4,xml.dom.pulldom: supports building partial DOM trees

I don't understand this part yet. A pull parser??? What is it for? After looking at the examples in this section, my general impression is that it supports something like dom events(?). Forget it: for collection we need to parse js, xml/xslt and html, but we don't need event support. Skip.
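For reference only (since we are skipping it anyway), my rough understanding of a pull parser is that it streams events and only builds DOM nodes on demand; a minimal sketch with a made-up xml string:

from xml.dom import pulldom

events = pulldom.parseString('<catalog><item id="a">first</item><item id="b">second</item></catalog>')
for event, node in events:
    # expand only the elements we care about into small DOM fragments
    if event == pulldom.START_ELEMENT and node.tagName == 'item':
        events.expandNode(node)
        print(node.toxml())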

5,xml.sax: SAX2 base class and convenience function

What on earth is this... Calling a = xml.sax.parseString('<r/>') directly actually prompts missing 1 required positional argument: 'handler'. So I have to provide a handler myself? A parser? Off to Baidu...

https://www.cnblogs.com/hongfei/p/python-xml-sax.html — this article gives an example. Um... uh huh... uh huh... I see. sax is a parsing approach that mixes a bit of xsd, a bit of xslt and a serialization structure: you supply the parsing format (xsd-like), the data reading (xslt-like) and the structured serialization, and the serialization itself is implemented by overriding the handler. All very nice, but when collecting we mostly don't inspect the specific content of the xml, and the reading is done with the xslt that the xml itself specifies — so better to skip it.
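Still, to show the shape of it, here is a minimal handler sketch (the xml and tag names are made up); the handler just collects the text of every item element:

import xml.sax

class ItemHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []
    def startElement(self, name, attrs):
        if name == 'item':
            self.in_item = True
            self.items.append('')
    def characters(self, content):
        if self.in_item:
            self.items[-1] += content
    def endElement(self, name):
        if name == 'item':
            self.in_item = False

handler = ItemHandler()
# parseString needs the handler; bytes input works across Python 3 versions
xml.sax.parseString(b'<r><item>first</item><item>second</item></r>', handler)
print(handler.items)   # -> ['first', 'second']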

6,xml.parsers.expat: expat parser binding

Uh huh.... First impression: it's like a stripped-down sax? Either way, this doesn't help with collection.
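Just for the record, a minimal expat sketch (made-up xml) showing the callback style:

import xml.parsers.expat

def on_start(name, attrs):
    print('start', name, attrs)

def on_text(data):
    if data.strip():
        print('text', data)

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = on_start
p.CharacterDataHandler = on_text
p.Parse('<r><item id="a">first</item></r>', True)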

------------------------------

Having read the introductions above, only etree feels close to the xml handling I'm used to from c# and vb(?). Anyway, next let's see how python uses xslt to parse xml. We'll have to keep searching Baidu, since there's nothing about it in this xml package... dizzying.

https://www.cnblogs.com/gooseeker/p/5501716.html

Oh, roar...

You have to install a new python package, lxml. Okay....
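Installing it should just be the usual pip route, something like:

python -m pip install lxml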

------------------------------

Still take https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/BILLS-117hr3237ih.xml as the example page. Its xslt stylesheet is given in the declaration; the xslt address is https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/billres.xsl

Let's do a simple test and attempt our first scrape plus xslt parsing. Take it slowly; this is a long process of trial and error. I'd also advise students to learn to read error messages and learn to search for answers, so that you can truly teach yourselves and improve — don't just ask others for answers to such simple questions.

# The first step is to collect our target content. The address is as follows
# https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/BILLS-117hr3237ih.xml

from urllib import request

xml_doc = request.urlopen('https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/BILLS-117hr3237ih.xml')
print(xml_doc)

The first step already fails. Heh heh, as expected: HTTPError: Forbidden

Why? Because python's default request headers are missing too much, and many websites do basic checks on them. At the very least the User-Agent is inspected: if it starts with php, contains spider or bot, or starts with java-, the request is rejected outright. I haven't looked at what python sends by default, but presumably there is no browser information in it, so step two is to forge a user agent.

from urllib import request

req = request.Request('https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/BILLS-117hr3237ih.xml')
req.encoding = 'utf-8'  # note: Request has no 'encoding' attribute, so this line has no real effect
req.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3870.400 QQBrowser/10.8.4405.400')
xml_str = request.urlopen(req).read()
print(xml_str)

Compared with c#, collecting data in python really is convenient. Back then Gu's c# collection class alone ran to nearly 2000 lines to handle all the various cases. I don't yet know how far urllib.request can be pushed; we'll find out later. For now, look at the result of this fetch.

Well, the fetch succeeded.... But here's a question: is the collected content a string? Why is there an extra b prefix stuck on the front?

After a few twists and turns, it turned out that regex did not play well with this b-prefixed value, and parsing the variable with etree threw an error outright — it kept complaining that row 1, column 1 was not a < angle bracket, which was downright outrageous. Then I thought I'd pull it apart character by character and reassemble it without the b prefix.

print([n for n in xml_str])

And when traversing this "string".... hmm, it's all numbers. Oh! It's like c#'s byte[] type! It merely displays like a string, which misled me... With the b prefix, it is a bytes value! Once you know the reason, the rest is easy — just decode it.

xml_str = xml_str.decode(encoding='utf-8')
print(xml_str)

Hey, that's more like it... This is the xml text we really collected.

Then convert the collected content into xml

import lxml.etree as ET

xml_doc = ET.XML(xml_str)
print(xml_doc)

Um... it's an Element object

print(ET.tostring(xml_doc))

It looks like lxml.etree and xml.etree.ElementTree are very similar: apart from the strengthened xpath and xslt support, everything else seems consistent with xml.etree.ElementTree. Well, we already went through ElementTree above, and students coming over from c# or vb are lucky: there is hardly anything extra to learn for lxml.etree.

As it happens, this is a fairly complete xml document, so let's try what xpath can do

....

....

....

xpath doesn't seem to be enhanced at all? Annoyed again, I opened the implementation file and looked: site-packages\lxml\_elementpath.py

An excerpt of the content:

    if signature == "@-":
        # [@attribute] predicate
    if signature == "@-='":
        # [@attribute='value']
    if signature == "-" and not re.match(r"-?\d+$", predicate[0]):
        # [tag]
    if signature == ".='" or (signature == "-='" and not re.match(r"-?\d+$", predicate[0])):
        # [.='value'] or [tag='value']
    if signature == "-" or signature == "-()" or signature == "-()-":
        # [index] or [last()] or [last()-index]
    raise SyntaxError("invalid predicate")

????? Where is the promised enhancement? Is that really all? Never mind XPath 2.0 — XPath 1.0 alone has plenty of functions and methods. So these python packages only support their own little xpath, which is not real XPath at all!!! Forget it, I was too naive... Keep collecting the xml....
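One hedged aside before moving on: the excerpt above is only the engine behind findall() (ElementPath). As far as I can tell, lxml Elements also expose a separate xpath() method that does implement full XPath 1.0, real functions included. A tiny sketch against the xml_doc parsed above:

# xpath() is a different engine from findall(); real XPath 1.0 expressions work here
print(xml_doc.xpath('count(//*)'))   # total number of elements, via a real XPath function
print(xml_doc.xpath('name(/*)'))     # name of the root element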

Since it is an xml file, let's check whether it specifies an xslt stylesheet; if so, we need to fetch that stylesheet too and use it to transform the xml.

import re

print(re.findall('<\?xml[:-]stylesheet.*?href="([^"]+)',xml_str))

xslt_url = re.sub(r'(?<=[/\\])[^/\\]+$',re.findall('<\?xml[:-]stylesheet.*?href="([^"]+)',xml_str)[0],req.full_url)
print(xslt_url)

Hmm, then continue and fetch this xslt file

req = request.Request(xslt_url)
req.encoding = 'utf-8'
req.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3870.400 QQBrowser/10.8.4405.400')
xsl_str = request.urlopen(req).read()
xsl_str = xsl_str.decode(encoding='utf-8')
xslt_doc = ET.XML(xsl_str)
print(xslt_doc)

Good, another normal xml document. Now, try turning it into an xslt transform

xsl_parse = ET.XSLT(xslt_doc)

As expected, an error again, but one I haven't seen before. Let's take a look: XSLT parseError: cannot resolve URI string://__STRING__XSLT__/billres-details.xsl

I see — this xslt file itself include-s other xslt files, and since what we collected is just a string, not saved to disk, the file at that relative address cannot be found. Fine: just modify the relevant content in xsl_str directly and regenerate xslt_doc

xsl_str = re.sub('(<xsl:include href=")([^"]+)("/>)',r'\1'+re.sub(r'(?<=[/\\])[^/\\]+$','',req.full_url)+r'\2\3',xsl_str)
xslt_doc = ET.XML(xsl_str)
print(xslt_doc)
xsl_parse = ET.XSLT(xslt_doc)

An error is also reported: XSLT parseError: cannot resolve URI https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/billres-details.xsl

Hmmmm..... The URI cannot be resolved — the same error as before, only the address is different. Hmm... so if it isn't a local file in the same folder, it won't be loaded automatically. Change the approach: download the related xslt files into the same folder, then try loading again. Hmmmm.... Learning a new language really is hard.

OK, let's reset the code to just fetching the xml content. The code is as follows

import os  # used below for the local xslt cache
import re  # used below to extract the stylesheet references
import lxml.etree as ET
from urllib import request

req = request.Request('https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/BILLS-117hr3237ih.xml')
req.encoding = 'utf-8'
req.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3870.400 QQBrowser/10.8.4405.400')
xml_str = request.urlopen(req).read()
xml_str = xml_str.decode(encoding='utf-8')
xml_doc = ET.XML(xml_str)

Then use a regex to pull all the xslt references out of the xml. Also record the uri of the xml so we can compute where the xslt files live; otherwise we won't be able to fetch them

base_url = re.sub(r'(?<=[/\\])[^/\\]+$','',req.full_url)
xslt_url = re.findall('<\?xml[:-]stylesheet.*?href="([^"]+)',xml_str)
xslt_first = xslt_url[0]

Because we are not sure whether the xslt nests further xslt files, we keep a list of xslt files to fetch and record the entry xslt file separately

if os.path.exists(r'D:\\work\\Log\\' + xslt_first)==False: # If the entry xslt file does not exist
    while len(xslt_url)>0: # While the list of xslt files to be collected is not empty
        fn  = xslt_url.pop(0) # Assign fn to the current collection xslt file name and remove it from the list
        url = base_url + fn   # Calculate the real uri of xslt
        req_xslt = request.Request(url)
        req_xslt.encoding = 'utf-8'
        req_xslt.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3870.400 QQBrowser/10.8.4405.400')
        xsl_str = request.urlopen(req_xslt).read()
        xsl_str = xsl_str.decode(encoding='utf-8')  # Get xslt file contents
        f = open(r'D:\\work\\Log\\'+fn,'w+',encoding='utf-8')  # Open a file with the same name in the temp folder, overwrite mode
        f.write(xsl_str) # Write xslt information
        f.close()
        xslt_url += re.findall(r'(?<=xsl:include href=")([^"]+)',xsl_str) # If other xslt files are introduced into the current xslt file, they will be added to the list to be downloaded

Uh huh. This way we downloaded three xslt files: besides the one that reported the error earlier, there was also one more xsl for tables.

For the last step, let's do a magic trick and turn the xml into html~~~~

xslt_doc = ET.parse(r'D:\\work\\Log\\' + xslt_first)
xsl_parse = ET.XSLT(xslt_doc)
result = xsl_parse(xml_doc)
print(result)
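As a small follow-up sketch, assuming the transform above succeeded: the result can be serialized with str() and written to a local html file for inspection (the output path below is just illustrative):

html_str = str(result)   # serialized html produced by the xslt transform
with open(r'D:\work\Log\BILLS-117hr3237ih.html', 'w', encoding='utf-8') as f:
    f.write(html_str)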

So: our first collection attempt, our xml and xslt parsing attempt, and our file read/write attempt are all done. Congratulations.

---------------------------------

Postscript: this post had a difficult birth, because Lao Gu still has a normal job and studies in whatever time is left over. I hope you will offer some opinions.

Preview: the next article wraps up a collection class that integrates fetching, forging header information and so on, in preparation for formal collection.

Topics: Python xml xpath