Using XPath selectors
1. What is an XPath selector?
XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document, and for pulling data out of markup it is usually simpler and more convenient than regular expressions.
Since XPath works on XML, a quick word on XML itself. What is XML?
XML stands for EXtensible Markup Language.
XML is a markup language, very similar to HTML.
XML is designed to transport data, not to display it.
XML tags are not predefined; you define your own.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.
In short, the difference between XML and HTML is that XML is mainly used to transmit data, while HTML is mainly used to display it.
Selecting Nodes
XPath uses path expressions to select nodes or node sets in an XML document. These path expressions are very similar to those we see in conventional computer file systems.
The most commonly used path expressions are listed below:
1. nodename Selects all child nodes of the current node that are named nodename.
2. / Selects from the root node.
3. // Selects nodes anywhere in the document that match the selection, regardless of their location.
4. . Selects the current node.
5. .. Selects the parent of the current node.
6. @ Selects attributes.
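The path expressions above can be tried out with lxml. The following is a minimal sketch; the bookstore document and all variable names are made up for illustration and are not part of the original example.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml)   # the <bookstore> element
first_book = root[0]           # the first <book> element

child_books = root.xpath("book")                # nodename: children named book
abs_books = root.xpath("/bookstore/book")       # /  : path starting at the root
all_titles = first_book.xpath("//title")        # // : whole document, even when called on first_book
own_titles = first_book.xpath(".//title")       # .//: only below the current node
current = first_book.xpath(".")                 # .  : the current node
parent = first_book.xpath("..")                 # .. : the parent of the current node
langs = root.xpath("//title/@lang")             # @  : attribute values

print(len(child_books), len(abs_books), len(all_titles), len(own_titles))
print(current[0].tag, parent[0].tag, langs)
```

Note that a path beginning with // is evaluated against the whole document even when it is called on a sub-element; prefix it with a dot (.//) to search only below the current node.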
Predicates
A predicate is written in square brackets and is used to find a particular node or a node that contains a specified value.
Here are some path expressions with predicates and what they select:
/bookstore/book[1] selects the first book element that is a child of the bookstore element.
//title[@lang] selects all title elements that have an attribute named lang.
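A short sketch of these predicates in lxml follows. The sample document is hypothetical, and the last expression (an attribute compared with a value) is added here as an illustration of "a node containing a specified value".

```python
from lxml import etree

# Hypothetical sample document: the third title has no lang attribute.
xml = """
<bookstore>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
  <book><title>Untitled Notes</title></book>
</bookstore>
"""
root = etree.fromstring(xml)

# Positions in XPath start at 1, not 0.
first_title = root.xpath("/bookstore/book[1]/title/text()")

# Only titles that carry a lang attribute, whatever its value.
with_lang = root.xpath("//title[@lang]/text()")

# Only titles whose lang attribute equals "fr".
french = root.xpath("//title[@lang='fr']/text()")

print(first_title, with_lang, french)
```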
Selecting unknown nodes
XPath wildcards can be used to select unknown XML elements.
* Matches any element node.
@* Matches any attribute node.
node() Matches a node of any type.
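The wildcards can be sketched as follows. The one-line document is a made-up example, written without whitespace between tags so that node() does not also return whitespace-only text nodes.

```python
from lxml import etree

# Hypothetical sample document, kept on one line on purpose.
xml = (
    '<bookstore><book id="b1" category="web">'
    '<title lang="en">Learning XML</title><price>39.95</price>'
    '</book></bookstore>'
)
root = etree.fromstring(xml)

# * matches every child element of book, whatever its name.
child_tags = [el.tag for el in root.xpath("//book/*")]

# @* matches every attribute of book; lxml returns the attribute values.
book_attrs = root.xpath("//book/@*")

# node() matches any kind of node: here the text inside title ...
title_nodes = root.xpath("//title/node()")

# ... and here the two element children of book.
book_nodes = root.xpath("//book/node()")

print(child_tags, book_attrs, title_nodes, len(book_nodes))
```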
Selecting several paths
Using the "|" operator in a path expression, you can select several paths at once.
Example: //book/title | //book/price selects all title and price elements of all book elements.
//title | //price selects all title and price elements in the document.
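To see the difference between the two union expressions, here is a small sketch. The document, including the magazine element, is an assumption made for the example.

```python
from lxml import etree

# Hypothetical sample document: two books plus one magazine.
xml = """
<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
  <magazine><title>XML Weekly</title><price>4.50</price></magazine>
</bookstore>
"""
root = etree.fromstring(xml)

# Titles and prices that sit directly under a book element.
in_books = root.xpath("//book/title | //book/price")

# Titles and prices anywhere in the document, magazine included.
anywhere = root.xpath("//title | //price")

print(len(in_books), len(anywhere))
print(sorted(el.tag for el in in_books))
```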
That covers the commonly used XPath syntax.
Using XPath
In Python, XPath support comes from the lxml library, which is installed with pip3 install lxml. Let's look at an XPath selector in a real example:
import json

import requests
from lxml import etree


class Tieba:
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name  # name of the tieba (forum) to crawl
        # A mobile User-Agent; a desktop browser User-Agent works as well
        self.headers = {
            "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) "
                          "AppleWebKit/601.1.46 (KHTML, like Gecko) "
                          "Version/9.0 Mobile/13B143 Safari/601.1"
        }

    def get_total_url_list(self):
        '''Build the list of page URLs'''
        url = "https://tieba.baidu.com/f?kw=" + self.tieba_name + "&ie=utf-8&pn={}&"
        url_list = []
        for i in range(100):  # build 100 URLs in a loop, 50 posts per page
            url_list.append(url.format(i * 50))
        return url_list  # list of 100 URLs

    def parse_url(self, url):
        '''Send a request, get the response, and parse the HTML with etree'''
        print("parsing url:", url)
        response = requests.get(url, headers=self.headers, timeout=10)  # send the request
        html = response.content.decode()  # the HTML as a string
        html = etree.HTML(html)  # the HTML as an Element, which has an xpath method
        return html

    def get_title_href(self, url):
        '''Get the title and href of every post on a page'''
        html = self.parse_url(url)
        # Group by li tag. The attribute test after @ is truncated in the
        # source and has to be filled in before this expression is valid.
        li_temp_list = html.xpath("//li[@]")
        total_items = []
        for i in li_temp_list:  # iterate over the groups
            href = "https:" + i.xpath("./a/@href")[0] if len(i.xpath("./a/@href")) > 0 else None
            text = i.xpath("./a/div[1]/span[1]/text()")
            text = text[0] if len(text) > 0 else None
            item = dict(  # put the fields in a dictionary
                href=href,
                text=text
            )
            total_items.append(item)
        return total_items  # all items on one page

    def get_img(self, url):
        '''Get all the pictures in a post'''
        html = self.parse_url(url)
        # The attribute name after @data- is truncated in the source as well.
        img_list = html.xpath('//div[@data-]/@data-url')
        img_list = [i.split("src=")[-1] for i in img_list]  # extract the picture URL
        img_list = [requests.utils.unquote(i) for i in img_list]  # undo URL encoding
        return img_list

    def save_item(self, item):
        '''Save one item'''
        # utf-8 is set explicitly because ensure_ascii=False writes non-ASCII text
        with open("teibatupian.txt", "a", encoding="utf-8") as f:
            f.write(json.dumps(item, ensure_ascii=False, indent=2))
            f.write("\n")

    def run(self):
        # 1. Work out the URL pattern and build the URL list
        url_list = self.get_total_url_list()
        for url in url_list:
            # 2. Walk the URL list, send requests, and parse the responses with etree
            # 3. Extract title and href
            total_item = self.get_title_href(url)
            for item in total_item:
                href = item["href"]
                # Get the picture list of the post; skip the request if there is no link
                img_list = self.get_img(href) if href else []
                item["img"] = img_list
                # 4. Save it locally
                print(item)
                self.save_item(item)


if __name__ == "__main__":
    tieba = Tieba("Beauty")
    tieba.run()
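The grouping pattern used in get_title_href (select the li elements first, then run relative XPath expressions starting with ./ on each one) can be tried offline. This is only a sketch: the HTML snippet, the class names and the example.com links are invented for illustration and do not reflect the real page structure.

```python
from lxml import etree

# Hypothetical HTML fragment standing in for a downloaded page.
html_text = """
<ul>
  <li class="post"><a href="//example.com/p/1"><div><span>First post</span></div></a></li>
  <li class="post"><a href="//example.com/p/2"><div><span>Second post</span></div></a></li>
  <li class="ad"><span>no link here</span></li>
</ul>
"""
doc = etree.HTML(html_text)

items = []
for li in doc.xpath("//li"):          # one group per li element
    # Relative expressions only look inside the current li.
    hrefs = li.xpath("./a/@href")
    texts = li.xpath("./a/div[1]/span[1]/text()")
    items.append({
        # xpath() always returns a list, so check it before indexing.
        "href": ("https:" + hrefs[0]) if hrefs else None,
        "text": texts[0] if texts else None,
    })

for item in items:
    print(item)
```

Grouping first keeps each href paired with the text from the same li, and an li that lacks one of the fields yields None rather than shifting the remaining values out of alignment.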
Those are the basics of XPath.