This blog is a supplement to the knowledge of the Scrapy Selector.
Scrapy Selector
The Scrapy framework has its own data extraction mechanism. The related classes are called Selectors, and they can select specified parts of an HTML document through XPath and CSS expressions.
Scrapy Selectors are implemented on top of the parsel library, which is itself a parsing library built on lxml, so its usage and efficiency are close to lxml. A later part of the 120 crawlers column will supplement the relevant knowledge points of that library.
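Since parsel can also be used on its own, here is a minimal sketch of using it directly, independent of Scrapy (assuming parsel is installed, e.g. via pip install parsel):

from parsel import Selector

# build a Selector from a plain HTML string, much as Scrapy does internally
sel = Selector(text="<html><body><h1>hello parsel</h1></body></html>")
print(sel.xpath("//h1/text()").get())   # hello parsel
print(sel.css("h1::text").get())        # hello parsel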
Basic usage of Selectors
In this learning process, the CSDN column ranking page is used for testing.
The Selector object can be accessed directly through the response object:
import scrapy

class CSpider(scrapy.Spider):
    name = 'c'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list/column']

    def parse(self, response):
        # the Selector object can be accessed directly through the response object
        print(response.selector)
Since XPath and CSS expressions are used so often, these two methods can be called directly on the response object, for example:
def parse(self, response):
    # the Selector object can be accessed directly through the response object
    # print(response.selector)
    response.xpath("XPath expression")
    response.css("CSS expression")
If you look at the source code of these two methods, you will find that at their core they simply call the corresponding methods of the selector object.
Source code:
def xpath(self, query, **kwargs):
    return self.selector.xpath(query, **kwargs)

def css(self, query):
    return self.selector.css(query)
When writing code, calling these methods on the response object meets most requirements, but the Selector class itself is useful in some special scenarios, such as parsing a piece of HTML read from a local string or file:
import scrapy
from scrapy.selector import Selector

class CSpider(scrapy.Spider):
    name = 'c'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list/column']

    def parse(self, response):
        body = """
        <html>
        <head>
            <title>This is a title</title>
        </head>
        <body>
            This is the content
        </body>
        </html>
        """
        # instantiate a Selector object and call its xpath method
        ret = Selector(text=body).xpath("//title").get()
        print(ret)
Learning about Selectors from the Scrapy shell
Use the following command to enter the interactive shell. The example uses the Huawei Cloud blog address, https://bbs.huaweicloud.com/blogs.
> scrapy shell https://bbs.huaweicloud.com/blogs
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000000004E64320>
[s]   item       {}
[s]   request    <GET https://bbs.huaweicloud.com/blogs>
[s]   response   <200 https://bbs.huaweicloud.com/blogs>
[s]   settings   <scrapy.settings.Settings object at 0x0000000004E640F0>
[s]   spider     <CSpider 'c' at 0x5161080>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
>>>
At this point, entering response returns the corresponding object.
>>> response
<200 https://bbs.huaweicloud.com/blogs>
>>>
Try to get the page title:
>>> response.xpath("//title") [<Selector xpath='//Title 'data =' < title > Huawei cloud blog_ Big data blog_ AI blog_ Cloud computing blog_ Developer Center - China... >] >>> response.xpath("//title/text()") [<Selector xpath='//Title / text() 'data =' Huawei cloud blog_ Big data blog_ AI blog_ Cloud computing blog_ Developer Center - Huawei cloud '>] >>>
Since the returned data is a list, the actual values are extracted with the following methods.
>>> response.xpath("//title/text()").get() 'Huawei cloud blog_Big data blog_AI Blog_Cloud computing blog_Developer Center-Hua Weiyun' >>> response.xpath("//title/text()").getall() ['Huawei cloud blog_Big data blog_AI Blog_Cloud computing blog_Developer Center-Hua Weiyun'] >>>
Getting the page's title tag alone is not very useful; next, extract the blog post titles from the page.
>>> response.xpath("//a[@class='blogs-title two-line']/@title").get() 'AppCube Standard page development of practice - playing with application magic cube' >>> response.xpath("//a[@class='blogs-title two-line']/@title").getall() ['AppCube Standard page development of practice - playing with application magic cube', 'Hongmeng light core M Nuclear source code analysis series XVII (3) exception information ExcInfo', '1024 Solicitation order - [prize solicitation] play with the application cube and the low code construction platform', 'Does the front end need to write automated tests','''''Content omission]
At this point, you should have noticed that to extract the contents of a Selector object, you need to use the get() and getall() methods, which return a single element and all matched elements respectively.
Apart from the expression syntax, the css() method is consistent with the xpath() method; both return SelectorList objects.
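Because both methods return SelectorList objects, the two styles can even be chained; a minimal sketch, reusing the Huawei Cloud page from the shell session above:

>>> response.css("a.two-line").xpath("./@title").getall()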
Like the Selector object, a SelectorList also has its own instance methods, such as xpath(), css(), getall(), get(), re(), re_first(), and the attrib attribute.
Another thing to note is that the get() method has an alias, extract_first(), which developers also use frequently.
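As a sketch of that alias and of re_first(), again assuming the Huawei Cloud page from the shell session above:

>>> response.xpath("//a[@class='blogs-title two-line']/@title").extract_first()   # same result as get()
>>> response.xpath("//a[@class='blogs-title two-line']/@title").re_first(r'Hongmeng.*')   # first regex match, or None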
When using the get() method, if no node matches, it returns None; you can test for that directly, or provide a default value.
# check whether the result is None
>>> response.xpath("//a[@class='blogs-title']/@title").get() is None
True
# provide a default value instead
>>> response.xpath("//a[@class='blogs-title']/@title").get(default='no data')
'no data'
The title attribute above can also be obtained with the attrib attribute of the object instead of @title. The following code returns all attributes of the first matched element.
>>> response.xpath("//a[@class='blogs-title two-line']").attrib
Notes on CSS selectors
Standard CSS selectors cannot select text nodes or attribute values, so Scrapy provides the following extensions:
- to select the text of a tag, use ::text;
- to select an attribute value, use ::attr(attr_name).
The test code is as follows:
>>> response.css("a.two-line::text") >>> response.css("a.two-line::attr(title)")
Did you notice the re() method mentioned above?
Using the re() method, you can apply a regular expression to the extraction results, for example, matching the titles that begin with "Hongmeng" among all extracted titles.
>>> response.xpath("//a[@class='blogs-title two-line']/@title").re(r' Hongmeng. * ') ['Hongmeng light core M Nuclear source code analysis series XVII (3) exception information ExcInfo']
A usage scenario difference between XPath and CSS
If an element in a web page carries many class values, XPath becomes inconvenient to use, and CSS selectors are better suited to this scenario.
If you use XPath for this kind of fuzzy matching, you end up with code like the following:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
In this case, a CSS selector is far more concise, needing only the following short line of code.
*.someclass
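As a side note, Scrapy also registers a has-class() XPath extension function (mentioned in the list below) that covers the same need from the XPath side; a sketch:

>>> response.xpath("//*[has-class('someclass')]")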
Other supplementary notes
If namespaces appear in the returned data, you can use the remove_namespaces() method of the Selector object to remove them.
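A minimal sketch, assuming the response is an XML document with namespaces (for example an Atom feed), following the pattern in the Scrapy docs:

>>> response.xpath("//link")          # returns [] because of the namespace
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")          # now matches the link elements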
Using selectors well depends largely on your proficiency with XPath expressions; for the basics, please refer to the earlier blog posts.
Some advanced functions are added here (see the sketch after this list):
- starts-with(): tests whether content begins with a given string;
- contains(): tests whether content contains a given string;
- re:test(): allows a regular expression to be used inside the XPath;
- has-class(): tests whether an element carries a given class;
- normalize-space(): strips leading and trailing whitespace.
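A hedged sketch of these functions in use, reusing class names from the Huawei Cloud page above:

>>> response.xpath("//a[starts-with(@class, 'blogs-title')]")
>>> response.xpath("//a[contains(@class, 'two-line')]")
>>> response.xpath("//a[re:test(@class, 'blogs-title')]")   # EXSLT regex namespace, registered by parsel
>>> response.xpath("//*[has-class('blogs-title')]")
>>> response.xpath("normalize-space(//title/text())").get()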
Written at the end
Today is the 252nd / 365 day of continuous writing.
Looking forward to your follows, likes, comments, and favorites.
More highlights
The "100 crawlers" column is on sale; after buying it, you can study the whole series of columns.