Use of xpath selector

Posted by ki on Mon, 09 Sep 2019 13:43:30 +0200

1. What is an XPath selector?
XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document, and for pulling data out of markup it is usually simpler and more convenient than regular expressions.
Since XPath works on XML, let's first talk about XML. What is XML?

XML stands for eXtensible Markup Language.

XML is a markup language, very similar to HTML.

XML is designed to transfer data, not display it.

XML tags are not predefined; we define them ourselves.

XML is designed to be self-descriptive.

XML is a W3C Recommendation.
In short, the difference between XML and HTML is that XML is mainly used to transmit data, while HTML is mainly used to display it.
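
To make "tags defined by ourselves" concrete, here is a minimal sketch that parses a tiny XML document with lxml. The document and its tag names (note, to, sender, body) are invented for this illustration.

```python
from lxml import etree

# An invented XML document: none of these tag names are predefined anywhere,
# we simply chose names that describe the data they carry.
xml_text = """
<note>
  <to>Tove</to>
  <sender>Jani</sender>
  <body>Remember the meeting on Friday</body>
</note>
""".strip()

root = etree.fromstring(xml_text)

# Each child element describes its own content through its tag name.
fields = {child.tag: child.text for child in root}
print(root.tag)
print(fields)
```

The tag names alone tell us what each value means, which is what "self-descriptive" refers to.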

Selecting Nodes
XPath uses path expressions to select nodes or node sets in an XML document. These path expressions are very similar to those we see in conventional computer file systems.
The most commonly used path expressions are listed below:
1. nodename  Selects the child nodes of the current node that have this name.
2. /  Selects from the root node.
3. //  Selects matching nodes anywhere in the document below the current node, regardless of their location.
4. .  Selects the current node.
5. ..  Selects the parent of the current node.
6. @  Selects attributes.
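
The six expressions above can be tried out with lxml. This is a minimal sketch, assuming lxml is installed; the bookstore document is invented sample data.

```python
from lxml import etree

# Invented sample data for illustration.
doc = etree.fromstring("""<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>""")

# nodename: children of the context node (here the bookstore root) named "book"
books = doc.xpath("book")
# / : an absolute path that starts at the root
titles_abs = doc.xpath("/bookstore/book/title")
# // : price elements anywhere in the document
prices = doc.xpath("//price")
# . : the current node, so ./title is a title child of the first book
first_title = books[0].xpath("./title")[0]
# .. : the parent of the current node
parent = first_title.xpath("..")[0]
# @ : attribute values
langs = doc.xpath("//title/@lang")

print(len(books), len(titles_abs), [p.text for p in prices])
print(first_title.text, parent.tag, langs)
```

Note that xpath() always returns a list, even when only one node matches, which is why the single results are taken with [0].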

Predicates
A predicate is written in square brackets and is used to find a specific node or a node that contains a specified value.
Here are some path expressions with predicates and what they select:
/bookstore/book[1] selects the first book element that is a child of the bookstore element.
//title[@lang] selects all title elements that have an attribute named lang.
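
Here is a short sketch of these predicates in lxml. The sample document is invented, and the third query, which tests an attribute for a specific value, is an extra illustration of "a node containing a specified value".

```python
from lxml import etree

# Invented sample data: the last title deliberately has no lang attribute.
doc = etree.fromstring("""<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
  </book>
  <book>
    <title lang="fr">Le Petit Prince</title>
  </book>
  <book>
    <title>Learning XML</title>
  </book>
</bookstore>""")

# Positions in XPath start at 1, not 0.
first_book = doc.xpath("/bookstore/book[1]")
# Titles that have a lang attribute, whatever its value.
with_lang = doc.xpath("//title[@lang]")
# Titles whose lang attribute equals a specific value.
english = doc.xpath("//title[@lang='en']")

print(first_book[0].findtext("title"))
print([t.text for t in with_lang])
print([t.text for t in english])
```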

Selecting Unknown Nodes
XPath wildcards can be used to select unknown XML elements.

    * matches any element node.
    @* matches any attribute node.
    node() matches any type of node.
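
The wildcards can be compared side by side in a small sketch. The document below is invented, and it is written without whitespace between tags so that node() does not also return whitespace text nodes.

```python
from lxml import etree

# Invented sample data; the first book contains a comment node on purpose.
doc = etree.fromstring(
    '<bookstore>'
    '<book category="web"><!-- bestseller -->'
    '<title lang="en">Learning XML</title><price>39.95</price></book>'
    '<book><title>Plain Title</title><price>10.00</price></book>'
    '</bookstore>'
)

# * : every element child of bookstore, whatever its name
children = doc.xpath("/bookstore/*")
# @* : title elements that carry at least one attribute of any name
with_any_attr = doc.xpath("//title[@*]")
# //@* : the values of all attributes in the document
all_attrs = doc.xpath("//@*")
# node() : every child node of the first book, including the comment
nodes = doc.xpath("/bookstore/book[1]/node()")

print([e.tag for e in children])
print([t.text for t in with_any_attr])
print(list(all_attrs))
print(len(nodes))
```

The difference between * and node() shows up in the last query: * would skip the comment, while node() returns it together with the two elements.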

Selecting Several Paths
By using the "|" operator in a path expression, you can select several paths at once.
Example: //book/title | //book/price selects all title and price elements of all book elements.
//title | //price selects all title and price elements in the document.
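
A quick sketch of the two union expressions, assuming lxml. The sample document is invented and has one extra title directly under bookstore, so the two queries return different results.

```python
from lxml import etree

# Invented sample data: one title sits outside any book element.
doc = etree.fromstring("""<bookstore>
  <title>Store catalogue</title>
  <book>
    <title>Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>""")

# Only titles and prices that are children of a book element.
in_books = doc.xpath("//book/title | //book/price")
# Titles and prices anywhere in the document.
anywhere = doc.xpath("//title | //price")

print(len(in_books), len(anywhere))
print(sorted(e.text for e in anywhere if e.tag == "title"))
```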
That covers the most commonly used XPath syntax.

Using XPath
In Python, XPath support comes from the lxml library, which we install with pip3 install lxml (the example below also uses requests). Let's look at XPath selectors in a real example:

import json

import requests
from lxml import etree


class Tieba:

    def __init__(self, tieba_name):
        self.tieba_name = tieba_name  # Name of the tieba (forum) to crawl
        # A mobile User-Agent; a browser User-Agent can be used instead
        self.headers = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1"}

    def get_total_url_list(self):
        '''Build the list of all page URLs'''
        url = "https://tieba.baidu.com/f?kw=" + self.tieba_name + "&ie=utf-8&pn={}&"
        url_list = []
        for i in range(100):  # Build 100 URLs in a loop; pn advances by 50 per page
            url_list.append(url.format(i * 50))
        return url_list  # Return the list of 100 URLs

    def parse_url(self, url):
        '''Send a request, get the response, and parse the HTML with etree'''
        print("parsing url:", url)
        response = requests.get(url, headers=self.headers, timeout=10)  # Send the request
        html = response.content.decode()  # Get the HTML as a string (UTF-8)
        html = etree.HTML(html)  # Parse it into an Element that supports xpath()
        return html

    def get_title_href(self, url):
        '''Get the title and href of every post on one page'''
        html = self.parse_url(url)
        # Group by li element, one group per post.
        # NOTE: the attribute test inside [@...] is missing from the source text;
        # the expression is not valid XPath until it is filled in.
        li_temp_list = html.xpath("//li[@]")
        total_items = []
        for i in li_temp_list:  # Iterate over the groups
            href = i.xpath("./a/@href")  # Relative path, evaluated from this li
            href = "https:" + href[0] if len(href) > 0 else None
            text = i.xpath("./a/div[1]/span[1]/text()")
            text = text[0] if len(text) > 0 else None
            item = dict(  # Put the fields in a dictionary
                href=href,
                text=text
            )
            total_items.append(item)
        return total_items  # Return all items on the page

    def get_img(self, url):
        '''Get all the pictures in a post'''
        html = self.parse_url(url)  # Element that supports xpath()
        # NOTE: the attribute name after "data-" is truncated in the source text.
        img_list = html.xpath('//div[@data-]/@data-url')
        img_list = [i.split("src=")[-1] for i in img_list]  # Keep the picture URL after "src="
        img_list = [requests.utils.unquote(i) for i in img_list]  # Undo percent-encoding
        return img_list

    def save_item(self, item):
        '''Save one item'''
        # ensure_ascii=False writes non-ASCII text as-is, so open the file as UTF-8
        with open("teibatupian.txt", "a", encoding="utf-8") as f:
            f.write(json.dumps(item, ensure_ascii=False, indent=2))
            f.write("\n")

    def run(self):
        # 1. Work out the URL pattern and build the URL list
        url_list = self.get_total_url_list()
        for url in url_list:
            # 2. Iterate over the URL list, send each request, and parse the HTML with etree
            # 3. Extract title and href
            total_item = self.get_title_href(url)
            for item in total_item:
                href = item["href"]
                # Get the picture list of the post; skip the request when there is no link
                item["img"] = self.get_img(href) if href else []
                # 4. Save it locally
                print(item)
                self.save_item(item)


if __name__ == "__main__":
    tieba = Tieba("Beauty")
    tieba.run()
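
The two XPath steps in the crawler (group by li, then query relative to each group, and pull the picture URL out of data-url) can be tried offline. This is only a sketch on invented markup: the class name "thread", the attribute "data-kind", the simplified ./a/span path and the example.com URLs are all assumptions made up for illustration, not what the real page uses. It also uses urllib.parse.unquote from the standard library in place of requests.utils.unquote.

```python
from urllib.parse import unquote

from lxml import etree

# Invented markup standing in for a downloaded page (all names are hypothetical).
html_text = """
<ul>
  <li class="thread"><a href="//example.com/p/1"><span>First post</span></a></li>
  <li class="thread"><a><span>No link here</span></a></li>
  <li class="ad">ignored</li>
</ul>
<div data-kind="image"
     data-url="https://example.com/view?size=big&amp;src=https%3A%2F%2Fexample.com%2Fpic%2Fabc.jpg"></div>
"""
page = etree.HTML(html_text)

items = []
# Step 1: select one element per post, then query relative to it with "./".
for li in page.xpath("//li[@class='thread']"):
    hrefs = li.xpath("./a/@href")
    texts = li.xpath("./a/span/text()")
    items.append({
        # xpath() returns a list, so guard against an empty result
        "href": ("https:" + hrefs[0]) if hrefs else None,
        "text": texts[0] if texts else None,
    })

# Step 2: keep what follows "src=" in data-url and undo the percent-encoding.
images = [
    unquote(value.split("src=")[-1])
    for value in page.xpath("//div[@data-kind='image']/@data-url")
]

print(items)
print(images)
```

Grouping first and then using relative paths keeps each href paired with its own title, even when one of them is missing for a given post.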

Those are the basics of XPath.
