25. Python Rapid Development of a Scrapy Distributed Search Engine - Introduction to the Request Object and the Response Object

Posted by evaoparah on Tue, 30 Jul 2019 21:31:03 +0200


Request

A Request is the Request() object we build in the spider file; it tells Scrapy which address to fetch. We construct these Request objects ourselves (for example in start_requests() or inside a callback) and return them to the scheduler.

Request() parameters:

url = string, the URL to request

callback = the callback function that will handle the response

method = string, the HTTP method, e.g. GET or POST

headers = dictionary, request headers such as the browser User-Agent

cookies = cookies to send with the request

meta = dictionary of key-value pairs passed straight through to the callback function

encoding = the encoding of the request (defaults to utf-8)

priority = defaults to 0; the higher the value, the earlier the scheduler dispatches the request

dont_filter = defaults to False; if set to True, the request bypasses the duplicate filter, so the URL is crawled even if it has been requested before

A short sketch combining several of these parameters is shown below, followed by a full spider example.
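The following is a minimal sketch, not taken from the original post, that only illustrates how the parameters listed above can be combined on a single Request; the URL, header value, cookie value and the meta payload are placeholders.

# -*- coding: utf-8 -*-
# Sketch only: shows the Request() parameters described above in one call.
import scrapy
from scrapy.http import Request

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        yield Request(
            url='http://www.luyin.org/',                  # string URL to request
            callback=self.parse_page,                     # callback that receives the Response
            method='GET',                                 # request method, GET or POST
            headers={'User-Agent': 'Mozilla/5.0'},        # dictionary of request headers
            cookies={'session': 'placeholder'},           # cookies to send (placeholder value)
            meta={'page_type': 'home'},                   # values passed straight to the callback
            encoding='utf-8',                             # request encoding
            priority=1,                                   # higher value = scheduled earlier
            dont_filter=True,                             # True = skip the duplicate filter
        )

    def parse_page(self, response):
        print(response.meta['page_type'])                 # read the value passed through meta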

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request,FormRequest

class PachSpider(scrapy.Spider):                            #A spider must inherit from scrapy.Spider
    name = 'pach'                                           #Spider name
    allowed_domains = ['www.luyin.org']                     #Domains allowed to be crawled
    # start_urls = ['']                                     #start_urls only suits requests that need no login, because cookies and other settings cannot be attached to it

    header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}  #Browser user agent

    def start_requests(self):    #start_requests() replaces start_urls
        """Request the login page first, enable the cookie jar so cookies are recorded, and set the callback function"""
        return [Request(
            url='http://www.luyin.org/',
            headers=self.header,
            meta={'cookiejar':1},       #Enable cookie recording and pass the cookie jar to the callback function
            callback=self.parse
        )]

    def parse(self, response):
        title = response.xpath('/html/head/title/text()').extract()
        print(title)
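The spider above only requests the login page; the actual form submission with FormRequest (imported but not used yet) would happen in the callback. The following is a minimal sketch, not from the original post, of what that callback could look like inside the same spider; the form field names username/password are hypothetical and the form on the login page provides the submit address.

    # Sketch only: this callback would replace the parse() above.
    # The form field names are hypothetical; from_response() reads the form
    # on the login page and submits to its action URL.
    def parse(self, response):
        return FormRequest.from_response(
            response,
            headers=self.header,
            meta={'cookiejar': response.meta['cookiejar']},      #keep using the same cookie jar
            formdata={'username': 'user', 'password': 'pass'},   #hypothetical field names
            callback=self.after_login
        )

    def after_login(self, response):
        print(response.xpath('/html/head/title/text()').extract())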

 

Response

A Response is the response object returned by the downloader; it is handed to the callback function.

Response attributes:
headers = the response headers
status = the HTTP status code that was returned
body = the page content, as bytes
url = the URL that was crawled

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request,FormRequest

class PachSpider(scrapy.Spider):                            #A spider must inherit from scrapy.Spider
    name = 'pach'                                           #Spider name
    allowed_domains = ['www.luyin.org']                     #Domains allowed to be crawled
    # start_urls = ['']                                     #start_urls only suits requests that need no login, because cookies and other settings cannot be attached to it

    header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}  #Browser user agent

    def start_requests(self):    #start_requests() replaces start_urls
        """Request the login page first, enable the cookie jar so cookies are recorded, and set the callback function"""
        return [Request(
            url='http://www.luyin.org/',
            headers=self.header,
            meta={'cookiejar':1},       #Enable cookie recording and pass the cookie jar to the callback function
            callback=self.parse
        )]

    def parse(self, response):
        title = response.xpath('/html/head/title/text()').extract()
        print(title)
        print(response.headers)    #response headers
        print(response.status)     #HTTP status code
        # print(response.body)     #page content as bytes (commented out because it is long)
        print(response.url)        #the URL that was crawled
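Because body is returned as bytes, it usually has to be decoded before doing string work on it, and values passed in through the Request's meta dictionary come back on response.meta. The following is a short sketch, not from the original post, of a callback such as parse() that reads both; the slice length is arbitrary.

    # Sketch only: decode the byte-type body and read the meta dictionary
    # inside a callback such as parse().
    def parse(self, response):
        html = response.body.decode(response.encoding)   #decode the byte body into a string
        print(html[:100])                                #first 100 characters of the page
        print(response.meta.get('cookiejar'))            #value passed in through Request meta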

Topics: Python Windows Firefox