The CSDN hot list and the Huawei Cloud blog can be used to practice Python Scrapy crawlers

Posted by heshan on Sun, 31 Oct 2021 09:57:39 +0100

This post supplements the knowledge of Scrapy selectors.

Scrapy selectors

The Scrapy framework has its own data extraction mechanism, called selectors, which can select the specified parts of an HTML document through XPath and CSS expressions.

Scrapy selectors are implemented on top of the parsel library, a parsing library that uses lxml underneath, so their usage and efficiency are close to lxml's. A later part of the 120 crawlers column will cover that library's details.
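As a small taste of that library, here is a minimal sketch of using parsel directly, outside of Scrapy; the HTML snippet is made up for illustration.

from parsel import Selector

# A made-up HTML snippet, just for illustration
html = '<ul><li class="item">first</li><li class="item">second</li></ul>'
sel = Selector(text=html)

print(sel.css("li.item::text").getall())              # ['first', 'second']
print(sel.xpath("//li[@class='item']/text()").get())  # 'first'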

Basic usage of selectors

This section uses the CSDN column ranking page for testing.

The Selector object can be accessed directly through the response object:

import scrapy


class CSpider(scrapy.Spider):
    name = 'c'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list/column']

    def parse(self, response):
        # Selector object, which can be called directly through the response object
        print(response.selector)

Since XPath and CSS selectors are used so frequently, both methods can be called directly on the response object, for example:

def parse(self, response):
    # Selector object, which can be called directly through the response object
    # print(response.selector)
    response.xpath("XPath expression")
    response.css("CSS expression")

If you look at the source code of these two methods, you will find that at their core they simply call the corresponding methods of the selector object. Source code:

def xpath(self, query, **kwargs):
    return self.selector.xpath(query, **kwargs)

def css(self, query):
    return self.selector.css(query)

When writing code, the methods on the response object meet most needs, but the Selector class itself applies in some special scenarios, such as parsing a piece of HTML code read from a local file:

import scrapy
from scrapy.selector import Selector

class CSpider(scrapy.Spider):
    name = 'c'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list/column']

    def parse(self, response):
        body ="""
        <html>
            <head>
                <title>This is a title</title>
            </head>
            <body>
                This is the content
            </body>
        </html>
        """
        # Instantiate the Selector object and call the xpath() method
        ret = Selector(text=body).xpath("//title").get()
        print(ret)

Learning about selectors from the Scrapy command line

Use the following command to enter interactive shell mode. The example uses the Huawei Cloud blog address, https://bbs.huaweicloud.com/blogs .

> scrapy shell https://bbs.huaweicloud.com/blogs

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000000004E64320>
[s]   item       {}
[s]   request    <GET bbs.huaweicloud.com/blogs>
[s]   response   <200 bbs.huaweicloud.com/blogs>
[s]   settings   <scrapy.settings.Settings object at 0x0000000004E640F0>
[s]   spider     <CSpider 'c' at 0x5161080>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

Now, entering response returns the response object.

>>> response
<200 bbs.huaweicloud.com/blogs>
>>>

Try to get the page title

>>> response.xpath("//title")
[<Selector xpath='//Title 'data =' < title > Huawei cloud blog_ Big data blog_ AI blog_ Cloud computing blog_ Developer Center - China... >]
>>> response.xpath("//title/text()")
[<Selector xpath='//Title / text() 'data =' Huawei cloud blog_ Big data blog_ AI blog_ Cloud computing blog_ Developer Center - Huawei cloud '>]
>>>

Since the obtained data is a sequence, it is extracted by the following method.

>>> response.xpath("//title/text()").get()
'Huawei cloud blog_Big data blog_AI Blog_Cloud computing blog_Developer Center-Hua Weiyun'
>>> response.xpath("//title/text()").getall()
['Huawei cloud blog_Big data blog_AI Blog_Cloud computing blog_Developer Center-Hua Weiyun']
>>>

Getting the page's title tag is not very useful by itself, so next let's extract the blog titles on the page.

>>> response.xpath("//a[@class='blogs-title two-line']/@title").get()
'AppCube Standard page development of practice - playing with application magic cube'
>>> response.xpath("//a[@class='blogs-title two-line']/@title").getall()
['AppCube Standard page development of practice - playing with application magic cube', 'Hongmeng light core M Nuclear source code analysis series XVII (3) exception information ExcInfo', '1024 Solicitation order - [prize solicitation] play with the application cube and the low code construction platform', 'Does the front end need to write automated tests','''''Content omission]

By now you will have noticed that extracting content from a Selector object requires the get() and getall() methods, which return a single element and a list of elements respectively.

The css() method works the same way as xpath() except for the selector syntax; both return SelectorList objects.
Like the Selector object, a SelectorList has its own instance methods, such as xpath(), css(), getall(), get(), re(), re_first(), and the attrib attribute.
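Because a SelectorList exposes the same query methods, calls can be chained before extracting; a small sketch reusing the class name from the earlier examples:

# Narrow the selection first, then keep querying on the result
links = response.xpath("//a[@class='blogs-title two-line']")
titles = links.xpath("./@title").getall()  # title attributes of every match
first = links.xpath("./@title").get()      # title attribute of the first match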

Another thing to note: the get() method has an older alias, extract_first(), which you will still see used often.
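The old and new spellings return the same results, so the following pairs are interchangeable (getall() likewise has the alias extract()):

# get()/getall() are the modern names; the older aliases still work
response.xpath("//title/text()").get()
response.xpath("//title/text()").extract_first()  # same result as get()
response.xpath("//title/text()").getall()
response.xpath("//title/text()").extract()        # same result as getall()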

When using the get() method, if no matching tag is found, the result is None; you can test for that, or provide a default value.

# Check whether the result is None
>>> response.xpath("//a[@class='blogs-title']/@title").get() is None
True
# Provide a default value
>>> response.xpath("//a[@class='blogs-title']/@title").get(default='no data')
'no data'

The title attribute above can also be obtained through the object's attrib attribute instead of @title. The following code gets all the attributes of the first matched element:

>>> response.xpath("//a[@class='blogs-title two-line']").attrib

Notes on CSS selectors

CSS selectors cannot natively select text nodes or attribute values, so the following extended syntax is provided:

  • to select a tag's text, use ::text;
  • to select an attribute value, use ::attr(attr_name).

The test code is as follows:

>>> response.css("a.two-line::text")
>>> response.css("a.two-line::attr(title)")

Did you notice the re() method mentioned earlier?

The re() method applies a regular expression to the extracted results, for example, to match the titles beginning with Hongmeng among all the extracted titles:

>>> response.xpath("//a[@class='blogs-title two-line']/@title").re(r' Hongmeng. * ')
['Hongmeng light core M Nuclear source code analysis series XVII (3) exception information ExcInfo']
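There is also a re_first() shortcut that returns only the first regular-expression match instead of a list:

# re_first() returns the first regex match as a string, or None
response.xpath("//a[@class='blogs-title two-line']/@title").re_first(r'Hongmeng.*')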

A usage scenario where XPath and CSS selectors differ:
If an element carries many class values, XPath becomes inconvenient, and CSS selectors suit this scenario better.
Fuzzy matching by class in XPath ends up as code like this:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

In such cases, a CSS selector is far more concise; only the following short line of code is needed:

*.someclass
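If you still prefer XPath, parsel (and therefore Scrapy) registers a custom has-class() function that avoids the verbose contains/concat idiom; a sketch:

# has-class() is a parsel extension function available in Scrapy selectors
response.xpath("//*[has-class('someclass')]")
# roughly equivalent to the CSS form
response.css("*.someclass")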

Other supplementary notes

If namespaces appear in the returned data, you can remove them with the Selector object's remove_namespaces() method.

Using selectors well largely depends on your proficiency with XPath expressions; for the basics, refer to the earlier blog post on the topic.

Some higher-order functions are worth adding here (quick sketches follow the list):

  • starts-with(): tests what a value begins with;
  • contains(): tests whether a value contains a substring;
  • re:test(): applies a regular expression inside an XPath predicate;
  • has-class(): tests whether an element carries a given class;
  • normalize-space(): strips leading and trailing whitespace.
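Quick sketches of the helpers above, runnable from the Scrapy shell; the class names and titles reuse the earlier Huawei Cloud examples.

response.xpath("//a[starts-with(@title, 'AppCube')]/@title").get()
response.xpath("//a[contains(@title, 'Hongmeng')]/@title").get()
response.xpath("//a[re:test(@title, '^1024')]/@title").get()   # EXSLT regex namespace, pre-registered by parsel
response.xpath("//a[has-class('blogs-title')]/@title").getall()
response.xpath("normalize-space(//title/text())").get()        # trimmed title text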

Written at the end

Today is day 252 of 365 days of continuous writing.
I look forward to your follows, likes, comments, and bookmarks.


Topics: Python, crawler