1. Concept
A crawler, despite its name, is not an animal but a computer program.
This program has its own specific functions: it can browse the World Wide Web and obtain the required information according to a series of rules given by the user. Such programs are called web crawlers or spiders.
Because it has a certain ability of intelligent analysis, it is also known as a robot program.
Application fields of crawlers:
For example, search companies such as Baidu and Google use their self-developed crawler programs to crawl, analyze, classify and store the data in web pages on the Internet, and then provide it to users.
News aggregation applications also use crawler programs to crawl the news on each news website, classify it and present it to users.
Crawler programs can be used in any application field that needs data analysis. For price analysis, for example, a crawler can collect the prices of a product from each online mall by keyword, compare and analyze them, and then show the user an intuitive comparison table.
When a crawler collects data from the network, it needs to abide by the robots protocol.
The robots protocol (robots.txt) is a list drawn up by a website that specifies which resources crawlers may crawl and which they may not when they collect data from the site.
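As a hedged sketch of how that check might look, Python's built-in urllib.robotparser module can read a site's robots.txt and answer whether a URL may be fetched; the URLs below are examples only:

```python
import urllib.robotparser

# Parse the site's robots.txt before crawling (example URL, adjust as needed)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.cnblogs.com/robots.txt")
rp.read()

# can_fetch() tells us whether the given user agent may crawl the URL
allowed = rp.can_fetch("*", "https://www.cnblogs.com/guo-ke/p/15951196.html")
print("Allowed to crawl:", allowed)
```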
Workflow of a crawler:
- Identify the target page. This page is the start page or entry page.
- Download the page data and extract the relevant information in some way (such as a regular expression). Links in the page can also be extracted, so pages can be analyzed and extracted recursively (see the sketch after this list).
- Store the information persistently for subsequent processing.
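A minimal sketch of this workflow, assuming the entry URL, the regular expressions and the depth limit are only placeholders for illustration (persistent storage and URL deduplication are omitted):

```python
import re
import urllib.request
import urllib.error

def crawl(url, depth=1):
    """Download a page, print its title, and follow the links it contains."""
    if depth < 0:
        return
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            html = resp.read().decode("utf-8", errors="ignore")
    except (urllib.error.URLError, ValueError):
        return  # skip pages that cannot be fetched
    # Extract information from the page, here just the <title> text
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    if title:
        print(url, "->", title.group(1).strip())
    # Extract absolute links and analyze them recursively
    for link in re.findall(r'href="(https?://[^"]+)"', html):
        crawl(link, depth - 1)

# Entry page (example URL only); a real crawler would also deduplicate URLs
crawl("https://www.cnblogs.com/guo-ke/", depth=1)
```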
2. Python crawler module
One of the core tasks of a crawler program is to download the data of a specified page by sending a network request.
In essence, a crawler is a network application.
Python provides rich libraries and modules to help developers develop such applications quickly.
2.1 urllib Library
The urllib library is a Python built-in library and does not need to be installed separately. The complete urllib library includes the following five modules:
- urllib.request: sends request packets over different protocols and obtains the response after the request.
- urllib.response: used to handle the response packet data.
- urllib.error: contains the exceptions raised by urllib.request.
- urllib.parse: used to parse and process URLs (illustrated briefly after this list).
- urllib.robotparser: used to parse a site's robots.txt file.
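A brief, hedged illustration of urllib.parse, which splits a URL into its components and builds query strings from a dictionary (the URL and parameters are examples only):

```python
from urllib import parse

# Split a URL into scheme, host, path and query
parts = parse.urlparse("https://www.baidu.com/s?wd=java")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Build a URL-encoded query string from a dictionary
query = parse.urlencode({"wd": "java", "pn": 10})
print(query)  # wd=java&pn=10
```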
Use the urllib.request module to send a network request:
```python
import urllib.request

# URL address based on the https protocol
url = "https://www.cnblogs.com/guo-ke/p/15951196.html"
# Build a request object
req = urllib.request.Request(url)
# Send the request packet using the urlopen() method
with urllib.request.urlopen(req) as resp:
    # Parse data
    data = resp.read()
    print(data.decode())
```
- urllib.request.Request() class description: builds a request packet.
Class prototype declaration:
```python
class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):
        # ... other code blocks
```
Description of the constructor parameters:
- **url:** the URL address to request.
- **data:** the data to send; it must be of type bytes. If it is a dictionary, encode it first with urllib.parse.urlencode().
- **headers:** a dictionary describing the request header information. It can be specified in the constructor or added later by calling the add_header() method. The default user agent is Python-urllib.
- **origin_req_host:** specifies the host name or IP address of the requester.
- **unverifiable:** indicates whether the request is unverifiable. The default is False.
- **method:** specifies the method used in the request, such as **GET, POST or PUT**.
Many websites have anti-crawler measures, and visits that do not come from a browser are considered illegal requests. So the crawler needs to disguise itself as a browser:
```python
from urllib import request, parse

url = 'http://www.guo-ke.com/post'
headers = {
    # Disguised as the Chrome browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
    'Host': 'guo-ke.org'
}
data_dict = {
    'name': 'guoke',
    'key': 'python'
}
# The data must be a byte stream
data = bytes(parse.urlencode(data_dict), encoding='utf8')
'''
# Convert the dictionary to a URL-encoded string
data = parse.urlencode(data_dict)
# Encode the string into a byte stream
data = data.encode()
'''
req = request.Request(url=url, data=data, headers=headers, method='POST')
# req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36')
with request.urlopen(req) as response:
    res = response.read()
    print(res.decode('utf-8'))
```
**Tip:** when the data parameter is supplied or method="POST" is specified, the request is a POST request.
A GET request can also attach request parameters to the URL: https://www.baidu.com/s?wd=java
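A small sketch, assuming the Baidu search URL above, of how urllib.parse.urlencode() can build that query string before the request is sent (Baidu may still reject clients without a full browser user agent):

```python
from urllib import parse, request

# Build the query string and append it to the URL with '?'
params = parse.urlencode({'wd': 'java'})
url = 'https://www.baidu.com/s?' + params

req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with request.urlopen(req) as resp:
    print(resp.status)
```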
- urllib.request.urlopen() method description: sends a network request.
```python
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):
```
Parameter Description:
- url: can receive a URL string or a urllib.request.Request object.
- data: the data of a POST request; for a GET request it is set to None.
- timeout: sets the access timeout for the site. It is only used when the request is made over the HTTP, HTTPS or FTP protocols (see the sketch after this list).
- cafile, capath: used to specify a CA digital certificate when an HTTPS request is made. cafile specifies the digital certificate file; capath specifies the directory that contains the digital certificate files.
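As a hedged sketch, the timeout parameter can be combined with the urllib.error module mentioned earlier to handle slow or failing requests; the URL is an example only:

```python
import urllib.request
import urllib.error

url = "https://www.cnblogs.com/guo-ke/"  # example URL
try:
    # Abort the request if no connection can be made within 5 seconds
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(resp.getcode())
except urllib.error.HTTPError as e:
    print("Server returned an error status:", e.code)
except urllib.error.URLError as e:
    print("Request failed:", e.reason)
```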
**Description of the return type:** no matter what protocol is used, the returned object includes three general methods.
- geturl(): returns the URL of the requested resource.
- info(): returns metadata information, such as the message headers.
- getcode(): returns the status code of the response. If there is an error, a URLError exception is thrown.
When the request is made over the http or https protocol, an http.client.HTTPResponse object is returned. In addition to the above three methods, this object also provides:
- read(): gets the bytes returned in the response body; it can be read only once. Call decode() on the result before printing it.
- getheaders(): gets the returned response header information.
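A short, hedged illustration of these response methods against the example URL used earlier:

```python
import urllib.request

url = "https://www.cnblogs.com/guo-ke/p/15951196.html"
with urllib.request.urlopen(url) as resp:
    print(resp.geturl())      # the URL of the requested resource
    print(resp.getcode())     # status code, e.g. 200
    print(resp.info())        # message headers (metadata)
    print(resp.getheaders())  # response headers as a list of tuples
    body = resp.read()        # bytes; can only be read once
    print(body.decode("utf-8")[:200])
```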
Use urllib.request to download a picture:
The strength of a crawler program is that it can download the data users need in batches and recursively. The powerful logic behind it may be supported by a simple principle, so we can first try to download one picture to get a glimpse of the whole.
```python
import urllib.request

# Picture URL
url = "https://img2022.cnblogs.com/blog/2749732/202202/2749732-20220222195931956-448995015.jpg"

with urllib.request.urlopen(url) as resp:
    data = resp.read()
    with open("d:/my_file.jpg", "wb") as f:
        f.write(data)
```
Open the corresponding drive letter and you can see that the picture has been downloaded successfully.
urllib.request also provides a more convenient urlretrieve() method, which saves the downloaded byte stream directly as a file.
```python
from urllib import request

url = "https://img2022.cnblogs.com/blog/2749732/202202/2749732-20220222195931956-448995015.jpg"
# The second parameter of urlretrieve() is the location and name of the saved file
request.urlretrieve(url, 'd:/temp.jpg')
```
2.2 requests Library
requests is a third-party library written on top of urllib. It needs to be downloaded and installed before use:
pip3 install requests
It mainly provides two methods:
1. get() method: used to send GET requests.
```python
def get(url, params=None, **kwargs):
    # ......
```
Parameter Description:
- url: the URL resource (string type) to request.
- params: query data, which can be a dictionary, list, tuple or bytes.
- kwargs: optional keyword arguments, such as request headers, described as key-value pairs.
Basic GET usage:
```python
import requests

# Send the request with GET and get the response
response = requests.get('https://www.cnblogs.com/guo-ke/p/15925214.html')
# View the response content with text
print(response.text)
```
GET request with parameters:
```python
import requests

# The parameters are appended to the URL after a question mark, separated by &
response = requests.get('https://www.baidu.com/s?wd=java')
print(response.text)
```
Parameters in dictionary format:
```python
import requests

data = {
    'wd': 'java'
}
# The params argument takes a dictionary; there is no need to URL-encode it yourself
response = requests.get('https://www.baidu.com/s', params=data)
print(response.text)
```
The get() method returns a Response object, which provides properties and methods for parsing the data in the response packet.
- response.encoding: get the current encoding.
- response.encoding = 'utf-8': set encoding.
- response.text: automatically decode according to the character encoding of the response header.
- response.content: returned in byte form (binary).
- response.headers: stores the server response headers as a dictionary-like object. The keys are case-insensitive; looking up a missing key with get() returns None.
- response.status_code: the response status code.
- response.raw: returns the original response body, that is, the underlying urllib response object; read it with response.raw.read().
- response.json(): the built-in JSON decoder in requests, which returns the content parsed as JSON, provided that the returned content is in JSON format; otherwise an exception is thrown if parsing fails (a short example follows this list).
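A hedged sketch of these attributes, assuming a JSON endpoint such as httpbin.org/get so that json() has something to parse:

```python
import requests

response = requests.get('https://httpbin.org/get')

print(response.status_code)                   # e.g. 200
print(response.encoding)                      # encoding guessed from the headers
print(response.headers.get('Content-Type'))   # case-insensitive lookup
print(response.text[:100])                    # decoded text
print(response.content[:100])                 # raw bytes
print(response.json())                        # parsed JSON, since httpbin returns JSON
```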
Download a picture
```python
import requests

# Picture address
url = "https://img2022.cnblogs.com/blog/2749732/202202/2749732-20220222195931956-448995015.jpg"
# Request header metadata
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}
response = requests.get(url, headers=headers)
# Get the byte stream data
data = response.content
# Save the byte stream data
with open("d:/test.jpg", "wb") as f:
    f.write(data)
```
2. post() method: sends a POST request.
```python
def post(url, data=None, json=None, **kwargs):
```
Parameter Description:
- url: the URL resource (string type) to request.
- data: the data to send, which can be a dictionary, list, tuple or bytes.
- json: data in JSON format.
- kwargs: optional keyword arguments, such as request headers, described as key-value pairs.
Basic usage
```python
import requests

data = {'name': 'zhuzhu', 'age': '23'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
# httpbin.org returns the posted form data as JSON, so json() can parse it
print(response.json())
```
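The json parameter listed above sends the body as JSON instead of form data; a brief, hedged sketch against the same httpbin.org endpoint:

```python
import requests

payload = {'name': 'zhuzhu', 'age': 23}
# json= serializes the dictionary and sets the Content-Type to application/json
response = requests.post("http://httpbin.org/post", json=payload)
print(response.json()['json'])  # httpbin echoes the JSON body back
```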
3. Summary
requests, a third-party library built on top of urllib, is easier to use than urllib itself. This article only gives a brief description of their APIs; more methods and usages can be found in the documentation.