The concept of crawlers and how to use the requests library

Posted by mY.sweeT.shadoW on Fri, 14 Jan 2022 12:46:00 +0100

1, What is a crawler

1.1 crawlers

A crawler usually means a web crawler: a technique for collecting information. At its core, it simulates a browser sending network requests to a target website, receives the responses, then parses, extracts, and saves the information we want. In principle, anything a browser can reach through normal access (as opposed to hacking) can be crawled.

Search engines such as Baidu and Google are, in essence, crawlers too.

1.2 anti-crawler measures

Crawlers usually need a large amount of target data, so they send a large number of requests and consume a lot of the target website's resources. Most websites therefore take anti-crawling measures to identify and reject requests sent by crawlers. Image CAPTCHAs are among the most common of these measures. Much of the difficulty in writing a crawler lies in getting past the various anti-crawling measures that major websites deploy.

1.3 crawler protocol (Robots protocol)

The crawler protocol tells crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file named robots.txt, so it is also called the Robots protocol.

The crawler protocol is a "gentlemen's agreement". No technical mechanism forces crawlers or search engines to comply. In other words, nothing breaks technically if we ignore it, but ignoring it can expose us to legal risk.

For almost every website, the URL of the crawler protocol is the domain name followed by /robots.txt. For example, GitHub's is https://github.com/robots.txt, and part of its content is as follows:

# If you would like to crawl GitHub contact us via https://support.github.com?tags=dotcom-robots
# We also provide an extensive API: https://docs.github.com
User-agent: baidu
crawl-delay: 1


User-agent: *

Disallow: /*/pulse
Disallow: /*/tree/
......
Disallow: /account-login
Disallow: /Explodingstuff/

Syntax rules for crawler protocol:

User-agent: specifies which crawlers the rules that follow apply to; * means all crawlers.
Disallow: a URL path that must not be crawled.
Allow: a URL path that may be crawled.
Crawl-delay: the interval between successive requests, in seconds; a value such as 5 is typical.
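
As a rough illustration of how these rules are applied, the sketch below feeds a small robots.txt to Python's standard-library urllib.robotparser. The file contents, the example.com URLs, and the crawler name my-crawler are made up for the example. As far as I know this parser only does plain prefix matching, so wildcard paths such as /*/pulse are not interpreted the way GitHub's file intends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, modelled on the rules described above.
ROBOTS_TXT = """\
User-agent: baidu
Crawl-delay: 1

User-agent: *
Disallow: /account-login
Disallow: /private/
"""


def load_rules(text):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# The "*" group applies to any crawler that has no group of its own.
blocked = rules.can_fetch("my-crawler", "https://example.com/account-login")
allowed = rules.can_fetch("my-crawler", "https://example.com/about")
# Crawl-delay is reported per user agent, or None when it is not set.
baidu_delay = rules.crawl_delay("baidu")

print(blocked, allowed, baidu_delay)
```

A polite crawler would call can_fetch before every request and sleep for the reported delay between requests.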

2, The requests library

requests is a simple and elegant Python HTTP library. Its API is designed for humans and is quick to pick up.

The following code shows how simple and elegant the requests library is:

import requests

r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
r.status_code    # 200
r.headers['content-type']    # 'application/json; charset=utf8'
r.encoding    # 'utf-8'
r.text    # '{"type":"User"...'
r.json()    # {'private_gists': 419, 'total_private_repos': 77, ...}

3, requests library quick start

3.1 installing requests with pip

pip install requests

3.2 send request

Sending a network request with Requests is very simple: each function is named after the HTTP method it sends.

import requests

r = requests.get('https://api.github.com/events')
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

The return value r is a Response object, from which we can get all the information we want.

3.3 passing URL parameters

To send data in the URL's query string, you can either write it into the URL by hand or pass a dictionary through the params keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)

The encoded URL is available as r.url:

print(r.url)
# http://httpbin.org/get?key1=value1&key2=value2

Note: a dictionary key whose value is None is not added to the URL's query string.

We can also pass a list as a value:

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)
# http://httpbin.org/get?key1=value1&key2=value2&key2=value3
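
To see both behaviours without sending anything over the network, one option is to build the request and only prepare it. This is a minimal sketch; the helper name build_url is my own, while requests.Request and its prepare() method are part of the library.

```python
import requests


def build_url(base, params):
    """Return the URL Requests would send, without performing the request."""
    return requests.Request("GET", base, params=params).prepare().url


# A key whose value is None is left out of the query string.
without_none = build_url("http://httpbin.org/get",
                         {"key1": "value1", "key2": None})

# A list value repeats the key once per item.
with_list = build_url("http://httpbin.org/get",
                      {"key1": "value1", "key2": ["value2", "value3"]})

print(without_none)
print(with_list)
```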

3.4 response content

Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly.

r = requests.get('https://api.github.com/events')
r.text
# '[{"repository": {"open_issues":0,"url":"https://github.com/...

After the request is sent, Requests makes an educated guess about the response encoding based on the HTTP headers, and uses that guess when you access r.text. You can read and change the encoding through the r.encoding attribute:

r.encoding
# 'utf-8'
r.encoding = 'ISO-8859-1'

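If you change the encoding, Requests uses the new value of r.encoding every time r.text is accessed. When the HTTP headers give the wrong encoding, for example because an HTML or XML document declares its encoding in the body, inspect the raw bytes in r.content to find the correct one and then set r.encoding.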
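
The effect of r.encoding on r.text can be sketched offline. The Response below is built by hand, and assigning its private _content attribute is a testing shortcut I am assuming for the example, not something a real crawler should do; normally the object comes back from requests.get().

```python
import requests

# Hand-built Response so the example runs without a network connection.
r = requests.Response()
r.status_code = 200
r._content = "caf\u00e9".encode("utf-8")  # the raw bytes that r.content returns

r.encoding = "ISO-8859-1"  # wrong encoding: the accented letter is garbled
garbled = r.text

r.encoding = "utf-8"       # corrected after looking at r.content
fixed = r.text

print(repr(r.content), repr(garbled), repr(fixed))
```

Because r.text is decoded again on each access, changing r.encoding takes effect immediately and r.content is never modified.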

3.4.1 binary response content

For non-text responses, the body is available as bytes through r.content. Requests automatically decodes the gzip and deflate transfer encodings.

For example, to create an image from the binary data returned by a request, you can use the following code:

from PIL import Image
from io import BytesIO

i = Image.open(BytesIO(r.content))

3.4.2 JSON response content

Requests has a built-in JSON decoder, which can help us process JSON data:

import requests

r = requests.get('https://api.github.com/events')
r.json()
# [{'repository': {'open_issues': 0, 'url': 'https://github.com/...

If JSON decoding fails, r.json() raises an exception. Note that a successful call to r.json() does not mean the request succeeded: some servers include a JSON object in a failed response (such as the error details of an HTTP 500), and that JSON is decoded and returned all the same. To check whether the request succeeded, call r.raise_for_status() or check that r.status_code is what you expect.
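
The following sketch shows why r.json() alone is not a success check. It uses a hand-built Response standing in for a server error; the URL and error message are hypothetical, and setting _content directly is an offline shortcut assumed for the example.

```python
import requests

# Stand-in for a failed request whose body is still valid JSON.
r = requests.Response()
r.status_code = 500
r.url = "https://api.example.com/items"  # hypothetical endpoint
r._content = b'{"error": "database unavailable"}'

body = r.json()  # decoding succeeds even though the request failed

try:
    r.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
    failed = False
except requests.exceptions.HTTPError as exc:
    failed = True
    print("request failed:", exc)

print(body, failed)
```

In a crawler it is usually safer to call raise_for_status() first and parse the body only afterwards.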

3.5 raw response content

To get the raw socket response from the server, access r.raw. Make sure stream=True is set on the initial request.

r = requests.get('https://api.github.com/events', stream=True)
r.raw
# <urllib3.response.HTTPResponse object at 0x101194810>
r.raw.read(10)
# b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In general, however, you should save a streamed response to a file with a pattern like this:

with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)

The iter_content method iterates over the response data. chunk_size must be an int or None. A value of None behaves differently depending on the value of stream:

With stream=True on the request, data is read as it arrives, in whatever size the chunks are received, which avoids loading a large response into memory all at once.
With stream=False, the data is returned as a single chunk.
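
Here is a self-contained sketch of that file-saving pattern. The helper save_stream and the 8192-byte chunk size are my own choices, and the Response is faked with an in-memory buffer so the code runs offline; in real use it would come from requests.get(url, stream=True).

```python
import io
import os
import tempfile

import requests


def save_stream(response, path, chunk_size=8192):
    """Write the response body to path chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as fd:
        for chunk in response.iter_content(chunk_size=chunk_size):
            written += fd.write(chunk)
    return written


# Offline stand-in: raw is an in-memory file object replacing the socket.
fake = requests.Response()
fake.status_code = 200
fake.raw = io.BytesIO(b"x" * 20000)

target = os.path.join(tempfile.mkdtemp(), "download.bin")
total = save_stream(fake, target)
print(total, os.path.getsize(target))
```

Only one chunk is held in memory at a time, which is the point of combining stream=True with iter_content.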

3.6 custom request header

If you want to add HTTP headers to a request, simply pass a dict to the headers parameter. For example, to set the User-Agent:

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get(url, headers=headers)

Note: each header value must be a string, bytestring, or unicode. Passing unicode header values is permitted but not recommended.

3.7 complex POST requests

If you need to submit HTML form data, simply pass a dictionary to the data parameter:

payload = {'key1': 'value1', 'key2': 'value2'}

r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)
"""
{
  ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
  ...
}
"""

When several elements in the form share the same key, you can pass a sequence of (key, value) tuples to the data parameter.

payload = (('key1', 'value1'), ('key1', 'value2'))
r = requests.post('http://httpbin.org/post', data=payload)
print(r.text)
"""
{
  ...
  "form": {
    "key1": [
      "value1",
      "value2"
    ]
  },
  ...
}
"""

To send the payload as JSON instead, pass it to the json parameter, which encodes it for you:

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

r = requests.post(url, json=payload)
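
To compare what data and json actually put on the wire, the request can be prepared without being sent. This is only a sketch: prepare_post is a hypothetical helper, and the payloads are the ones used above.

```python
import json
from urllib.parse import parse_qs

import requests

URL = "http://httpbin.org/post"


def prepare_post(**kwargs):
    """Build a POST request without sending it, so the body can be inspected."""
    return requests.Request("POST", URL, **kwargs).prepare()


form = prepare_post(data={"key1": "value1", "key2": "value2"})
multi = prepare_post(data=[("key1", "value1"), ("key1", "value2")])
as_json = prepare_post(json={"some": "data"})

# data= produces a form-encoded body; repeated keys are kept.
print(form.headers["Content-Type"], parse_qs(form.body))
print(parse_qs(multi.body))
# json= serializes the payload and sets the JSON content type.
print(as_json.headers["Content-Type"], json.loads(as_json.body))
```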

3.8 uploading files with POST

url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}

r = requests.post(url, files=files)
r.text
"""
{
  ...
  "files": {
    "file": "<censored...binary...data>"
  },
  ...
}
"""

You can explicitly set the file name, content type, and extra headers:

url = 'http://httpbin.org/post'
files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})}

r = requests.post(url, files=files)
r.text
"""
{
  ...
  "files": {
    "file": "<censored...binary...data>"
  },
  ...
}
"""

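The snippets above leave the opened file for the garbage collector to close. A slightly more careful variant, sketched here with hypothetical helper names, wraps the upload in a with block. The second helper prepares an upload from in-memory bytes so that the multipart encoding can be inspected without a network connection; upload_file itself is defined but not called.

```python
import requests

URL = "http://httpbin.org/post"  # same test endpoint as above


def upload_file(path, url=URL):
    """Send a file from disk; the with block closes it once the request is done."""
    with open(path, "rb") as fh:
        return requests.post(url, files={"file": fh}, timeout=10)


def prepare_upload(filename, content, content_type):
    """Build (but do not send) a multipart upload from in-memory content."""
    files = {"file": (filename, content, content_type)}
    return requests.Request("POST", URL, files=files).prepare()


prepared = prepare_upload("report.csv", b"name,total\nwidgets,3\n", "text/csv")
print(prepared.headers["Content-Type"])            # multipart type plus boundary
print(b'filename="report.csv"' in prepared.body)   # file name is in the body
```
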
3.9 response status code

Get response status code:

r = requests.get('http://httpbin.org/get')
r.status_code  # 200

3.10 response headers

The server's response headers are available as a Python dictionary:

r.headers
"""
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'connection': 'close',
    'server': 'nginx/1.0.4',
    'x-runtime': '148ms',
    'etag': '"e1ca502697e5c9317743dc078f67693f"',
    'content-type': 'application/json'
}
"""

This dictionary is special, though: it is made just for HTTP headers. According to RFC 2616, HTTP header names are case-insensitive, so we can access a header using any capitalization:

r.headers['Content-Type']  # 'application/json'
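
The same behaviour can be tried without a request, because r.headers is an instance of requests.structures.CaseInsensitiveDict. A small sketch with a made-up header set:

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({"Content-Type": "application/json"})

lower = headers["content-type"]        # lookup ignores case
upper = headers.get("CONTENT-TYPE")    # so does get()
present = "Content-type" in headers    # and membership tests
keys = list(headers)                   # the original casing is preserved

print(lower, upper, present, keys)
```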

3.11 cookies

Read a cookie from a response:

r.cookies['example_cookie_name']

Attach a cookie to the request:

cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)

You can also build cookies with a RequestsCookieJar. It behaves like a dictionary but offers a more complete interface, suitable for use across multiple domains and paths.

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
url = 'http://httpbin.org/cookies'
r = requests.get(url, cookies=jar)
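
The jar can also be queried locally, which makes the domain and path scoping easier to see. This sketch reuses the two cookies above and sends nothing; no_such_cookie is a made-up name used to show the default value.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set("tasty_cookie", "yum", domain="httpbin.org", path="/cookies")
jar.set("gross_cookie", "blech", domain="httpbin.org", path="/elsewhere")

# Only cookies whose domain and path both match are returned.
for_cookies_path = jar.get_dict(domain="httpbin.org", path="/cookies")
# Without filters, every cookie in the jar is included.
everything = jar.get_dict()
# Dictionary-style lookup, optionally narrowed by domain and path.
tasty = jar.get("tasty_cookie", domain="httpbin.org", path="/cookies")
missing = jar.get("no_such_cookie", default="n/a")

print(for_cookies_path, everything, tasty, missing)
```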

3.12 redirection and request history

By default, Requests automatically follows redirects for every method except HEAD.

You can use the history attribute of the Response object to track redirection. It is a list of the Response objects created along the way, ordered from oldest to most recent.

r.history    # [<Response [301]>]

If you are using GET, OPTIONS, POST, PUT, PATCH, or DELETE, you can disable redirect handling with the allow_redirects parameter:

r = requests.get('http://github.com', allow_redirects=False)
r.status_code    # 301
r.history    # []

If HEAD is used, redirection can also be enabled:

r = requests.head('http://github.com', allow_redirects=True)
r.url    # 'https://github.com/'
r.history    # [<Response [301]>]
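
When debugging a crawler it can help to print the whole redirect chain. The helper below is a sketch of one way to do that. Both function names are my own, and the responses are faked by hand so the example runs offline; with a real request you would simply pass the object returned by requests.get().

```python
import requests


def redirect_chain(response):
    """Return (status_code, url) for every hop, oldest first, final response last."""
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]


def fake_response(status, url):
    """Build a bare Response for demonstration purposes only."""
    r = requests.Response()
    r.status_code = status
    r.url = url
    return r


final = fake_response(200, "https://github.com/")
final.history = [fake_response(301, "http://github.com/")]

chain = redirect_chain(final)
print(chain)
```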

3.13 timeout

You can use the timeout parameter to make Requests stop waiting for a response after a given number of seconds. Nearly all production code should set it; without it, your program may hang indefinitely. Note that timeout is not a limit on the whole download; it is the time Requests will wait without receiving any data from the server:

requests.get('http://github.com', timeout=0.01)

3.14 errors and exceptions

When a network problem occurs (DNS failure, refused connection, and so on), Requests raises a ConnectionError exception.

If the HTTP request returns an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.

If a request times out, a Timeout exception is raised.

If a request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.

All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.
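
Putting the timeout and the exception hierarchy together, a crawler's fetch function might look roughly like the sketch below. The name safe_get, the 5-second default, the error strings, and the injectable get argument (used here to simulate a timeout without a network) are all assumptions of this example rather than part of Requests.

```python
import requests


def safe_get(url, timeout=5.0, get=requests.get):
    """Return (response, error); exactly one of the two is None."""
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()
        return response, None
    except requests.exceptions.Timeout:
        return None, "timed out"
    except requests.exceptions.TooManyRedirects:
        return None, "too many redirects"
    except requests.exceptions.ConnectionError:
        return None, "network problem"
    except requests.exceptions.HTTPError as exc:
        return None, f"bad status: {exc.response.status_code}"
    except requests.exceptions.RequestException as exc:
        # Catch-all for anything else Requests raises, such as an invalid URL.
        return None, f"request failed: {exc}"


def always_times_out(url, timeout):
    """Fake getter that simulates a server which never answers."""
    raise requests.exceptions.Timeout("simulated timeout")


timed_out = safe_get("http://example.com", get=always_times_out)
# A URL without a scheme is rejected before any network traffic happens.
invalid = safe_get("not-a-url")

print(timed_out)
print(invalid)
```

The specific exceptions are listed before RequestException because except clauses are tried in order and the base class would otherwise catch everything.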

For more advanced usage, please refer to the official Requests documentation.
