Python Crawler 04: The requests Library

Posted by leocon on Tue, 01 Feb 2022 16:40:48 +0100

Catalogue

1, Installation and documentation address

2, Send GET request

Add headers and query parameters

The difference between response.text and response.content

3, Send POST request

4, Using proxies

5, Cookies

6, Session

7, Handling untrusted SSL certificates

 

Although the urllib module in Python's standard library already covers most of the functionality we usually need, its API is awkward to use. Requests advertises itself as "HTTP for Humans", and it is indeed more concise and convenient.

 

1, Installation and documentation address

It can be installed easily with pip:

pip install requests

Chinese documentation: http://docs.python-requests.org/zh_CN/latest/index.html
GitHub address: https://github.com/requests/requests
 

2, Send GET request

The simplest way to send a GET request is to call requests.get:

import requests
response = requests.get("https://www.baidu.com/")

Add headers and query parameters

If you want to add headers, pass the headers parameter to include header information in the request. If you want to pass query parameters in a URL, use the params parameter. The example code is as follows:

import requests

headers = {
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"
}
data = {
    "kw": "China"
}
url = "https://www.baidu.com/s"
# params accepts query parameters as a dictionary or a string; a dictionary is automatically URL-encoded, so urlencode() is not needed
response = requests.get(url=url, params=data, headers=headers)

# View the response status code
print("status_code: {}".format(response.status_code))  # status_code: 200

# View the encoding requests will use to decode response.text
print("encoding:{}".format(response.encoding))  # encoding:utf-8

# View the full URL, with the query string appended
print("url:{}".format(response.url))  # url:https://www.baidu.com/s?kw=China

# View the response content; response.content is the raw byte stream
print("content:{}".format(response.content))

# Requests will automatically decode the content, and most unicode character sets can be decoded seamlessly.
print("text:{}".format(response.text))

The difference between response.text and response.content

    1. response.content: the raw bytes fetched from the network, without any decoding, so its type is bytes. In fact, data transmitted on disk and over the network is always bytes.
    2. response.text: a str, produced by the requests library decoding response.content. Decoding requires an encoding, and requests guesses which one to use, so the guess is sometimes wrong and the result is garbled text. In that case, decode manually with response.content.decode("utf-8").
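
A minimal sketch of that manual fallback, assuming the page is actually UTF-8 encoded:

import requests

response = requests.get("https://www.baidu.com/")
print(response.encoding)  # the encoding requests guessed from the response
# Bypass the guess and decode the raw bytes explicitly
text = response.content.decode("utf-8")
print(text[:200])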

 

3, Send POST request

The most basic POST request can be sent with the post method:

response = requests.post("http://www.baidu.com/",data=data)

Passing in data:
There is no need to urlencode the data yourself; just pass in a dictionary. For example, the code that requests job-listing data from Lagou:

import requests

url = "https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false"
headers = {
    "Referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER",
    "Cookie": "LGUID=20161229121751-c39adc5c-cd7d-11e6-8409-5254005c3644; user_trace_token=20210531172956-5c418892-9a48-4fa1-8377-2c4f7a20b741; LG_HAS_LOGIN=1; hasDeliver=0; privacyPolicyPopup=false; WEBTJ-ID=20210531185241-179c20de542206-0e1348daf26807-2b6f686a-1049088-179c20de543150; RECOMMEND_TIP=true; __lg_stoken__=677cc1b348553c3ed5e9cbb7b390a2ff300eb24fefe8a8e97e42e2872fc9543fba2800c9390bbd1d173c49e0c0362f67288bd32b2db49b0ed2db58d21a0b452d975350e4ed22; index_location_city=%E5%B9%BF%E5%B7%9E; login=false; unick=""; _putrc=""; JSESSIONID=ABAAAECABIEACCAAF01E6707FDF7DE8820405BA09C6C439; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; X_HTTP_TOKEN=34e72e60c648e0f923883522611a51a83da2b43601; sensorsdata2015session=%7B%7D; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2221787087%22%2C%22first_id%22%3A%22179c20de60575-03879bc46582c2-2b6f686a-1049088-179c20de606146%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24os%22%3A%22Windows%22%2C%22%24browser%22%3A%22Chrome%22%2C%22%24browser_version%22%3A%2257.0.2987.98%22%7D%2C%22%24device_id%22%3A%22179c20de60575-03879bc46582c2-2b6f686a-1049088-179c20de606146%22%7D; _gat=1; _ga=GA1.2.2125950788.1622453397; _gid=GA1.2.1174524155.1622458625; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1622453401,1622453584,1622458348,1622460836; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1622538830; TG-TRACK-CODE=index_search; LGSID=20210601171352-cf8ce267-baa7-496a-85b4-d5c3239d2f39; LGRID=20210601171402-1ac136e6-0d19-488f-9639-d1ad2bb75fe7; SEARCH_ID=6a4a97b38c434d13830a64516514c7e3"
}
data = {
    "first": "true",
    "pn": 1,
    "kd": "python"
}

response = requests.post(url=url, data=data, headers=headers)
print(response.text)
print(response.json())  # Call the built-in JSON decoder to parse the data

Sending a POST request is very simple: just call requests.post.
If the response contains JSON data, you can call response.json() to convert the JSON string into a dictionary. If JSON decoding fails, response.json() will raise an exception.
To check whether the request succeeded, use response.raise_for_status() or check whether response.status_code matches your expectation, as sketched below.
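
A minimal sketch of both checks, assuming http://httpbin.org/post as an echo endpoint:

import requests

resp = requests.post("http://httpbin.org/post", data={"kw": "python"})

# Raises requests.HTTPError if the status code is 4xx/5xx
resp.raise_for_status()

# Or compare the status code yourself
if resp.status_code == 200:
    print(resp.json())  # parse the JSON body into a dict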

 

4, Using proxies

Adding a proxy with requests is also very simple: just pass the proxies parameter to the request method (such as get or post). The example code is as follows:

import requests

url = "http://httpbin.org/ip"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER",
}
proxy = {
    "http": "42.193.23.248:16817"
}
resp = requests.get(url=url, headers=headers, proxies=proxy)
print(resp.text)
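
If the proxy requires basic authentication, requests accepts the user:password@host syntax. A sketch with placeholder credentials and address:

import requests

proxies = {
    "http": "http://user:password@10.10.1.10:3128",   # placeholder proxy
    "https": "http://user:password@10.10.1.10:3128",  # reuse it for https traffic
}
resp = requests.get("http://httpbin.org/ip", proxies=proxies)
print(resp.text)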

 

5, Cookies

If a response contains cookies, you can use the cookies attribute to get the returned cookie values:

import requests
url = "http://www.baidu.com"
resp = requests.get(url)
print("cookies: {}".format(resp.cookies))
print("cookies_dict: {}".format(resp.cookies.get_dict()))
"""
result:
    cookies: <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
    cookies_dict: {'BDORZ': '27315'}
"""

6, Session

A Session object lets you persist certain parameters across requests. It also persists cookies across all requests made from the same Session instance.

Therefore, if you send multiple requests to the same host, the underlying TCP connection will be reused, which can yield a significant performance improvement (see HTTP persistent connections).

Previously, with the urllib library, we used an opener to send multiple requests that shared cookies. With requests, cookies can be shared by using the Session object the library provides. Note that this is not the session concept from web development; it is simply a session object. Taking a website login as an example (the code below logs in to a WordPress blog), the example code is as follows:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"
}
data = {
    "log": "123123@qq.com",
    "pwd": "12312",
    "wp-submit": "Log In",
    "redirect_to": "http://47.106.134.xx:10086/wp-blog/wp-admin/",
    "testcookie": "1"
}
login_url = "http://47.106.134.xx:10086/wp-blog/wp-login.php"
admin_url = "http://47.106.134.xx:10086/wp-blog/wp-admin/"

session = requests.Session()
# Log in
session.post(login_url, data=data, headers=headers)
# Enter the management page
response = session.get(admin_url)
with open("admin.html", "wb") as f:
    f.write(response.content)
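
A smaller sketch of the same cookie sharing, assuming httpbin.org as a test endpoint; the cookie set by the first request is sent automatically with the second:

import requests

session = requests.Session()
# The Set-Cookie header from this response is stored in the session's cookie jar
session.get("http://httpbin.org/cookies/set/sessioncookie/123456")
# The stored cookie is sent automatically on the next request
resp = session.get("http://httpbin.org/cookies")
print(resp.text)  # {"cookies": {"sessioncookie": "123456"}}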

7, Handling untrusted SSL certificates

For a website with a trusted SSL certificate, such as https://www.baidu.com/, requests returns the response directly. For a website whose certificate is not trusted (historically, http://www.12306.cn/mormhweb/ was a common example), pass verify=False to skip certificate verification. The example code is as follows:

resp = requests.get('http://www.12306.cn/mormhweb/', verify=False)
print(resp.content.decode('utf-8'))
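
With verify=False, urllib3 emits an InsecureRequestWarning on every request; a minimal sketch of silencing it:

import urllib3
import requests

# Suppress the warning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get('http://www.12306.cn/mormhweb/', verify=False)
print(resp.status_code)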

Topics: Python, crawler, requests