Crawler Basics: urllib

Posted by keithschm on Tue, 25 Jan 2022 21:31:25 +0100

Basic web crawler learning notes

urllib Library

The urllib library is one of the most basic network request libraries in Python. It can simulate the behavior of a browser, send a request to a specified server, and save the data the server returns.

1. urlopen function

In Python 3's urllib library, all methods related to network requests are grouped under the urllib.request module. The basic usage of the urlopen() function is as follows:

request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
from urllib import request

# One way of writing it
resp = request.urlopen('http://www.baidu.com/')
# Another way of writing it
url = 'http://www.baidu.com/'
resp = request.urlopen(url=url)

# Read everything
print(resp.read())
# Read 10 bytes
print(resp.read(10))
# Read one line
print(resp.readline())
# Read all remaining lines
print(resp.readlines())
# Get the status code
print(resp.getcode())

The urlopen function is explained in detail below:

(1) url: the URL to request

(2) data: the request body data. If this value is set, the request becomes a POST request

(3) Return value: the return value is an http.client.HTTPResponse object, which is a file-like object with read(size), readline, readlines, getcode and other methods.
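As a quick illustration of point (2), here is a minimal sketch of a POST request sent through urlopen; the test endpoint http://httpbin.org/post is an assumption of this sketch (the article itself only uses httpbin's /get and /ip), and the data must be passed as URL-encoded bytes:

from urllib import parse, request

# Assumed test endpoint; httpbin simply echoes the submitted form data back
url = 'http://httpbin.org/post'
# data must be bytes, so urlencode the dict and then encode the string
data = parse.urlencode({'name': 'spider', 'year': 2022}).encode('utf-8')
# Because data is given, urlopen sends a POST request
resp = request.urlopen(url, data=data)
print(resp.read().decode('utf-8'))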

2. urlretrieve function

This function makes it easy to save a file from the web locally. The following code downloads the Baidu home page to the current directory:

from urllib import request
# Note the argument order; this downloads Baidu's home page as baidu.html in the current path
request.urlretrieve('http://www.baidu.com/', 'baidu.html')
# Written like the following, however, it does NOT work
url = 'http://www.baidu.com/'
request.urlretrieve(url=url, 'baidu.html')

# It raises the following error:
  request.urlretrieve(url=url, 'baidu.html')
                                             ^
SyntaxError: positional argument follows keyword argument
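If you do want to use keyword arguments, pass both parameters by keyword; a small sketch (urlretrieve's second parameter is named filename):

from urllib import request

url = 'http://www.baidu.com/'
# Passing both arguments by keyword avoids the SyntaxError above
request.urlretrieve(url=url, filename='baidu.html')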

The following example saves a picture locally:

from urllib import request

# Skip SSL certificate verification to prevent validation errors
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

request.urlretrieve('https://tenfei03.cfp.cn/creative/vcg/800/new/VCG21a1834bfe5.jpg', '1.jpg')

3. urlencode function

When a real browser sends a request, if the URL contains Chinese or other special characters, the browser automatically encodes them for us. If you send the request from code, however, you have to encode them manually, and that is what the urlencode function is for. urlencode converts dictionary data into URL-encoded data. The example code is as follows:

# parse is related to parsing
from urllib import parse
data = {'name': '爬虫基础', 'year': 2022, 'add': '吉林白城'}  # name='Crawler Basics', add='Jilin Baicheng'
qs = parse.urlencode(data)
print(qs)

If we send the request with the URL left unencoded, the example code is as follows:

from urllib import request
url = 'https://www.baidu.com/s?wd=李小龙'  # wd=李小龙 ("Bruce Lee") contains Chinese characters, which cannot be sent unencoded
resp = request.urlopen(url)
print(resp.read())

The error is as follows:

UnicodeEncodeError                        Traceback (most recent call last)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)

For comparison, now encode "Bruce Lee" with urlencode and then send the request. The code is as follows:

from urllib import request
from urllib import parse

url = 'https://www.baidu.com/s?'
data = {'wd': '李小龙'}  # "Bruce Lee"
params = parse.urlencode(data)
url = url+params
resp = request.urlopen(url)
print(resp.read())

The response comes back normally this time and the page content is returned.

4. parse_qs function

parse_qs can decode URL-encoded query parameters. The example code is as follows:

from urllib import parse
qs='name=%E7%88%AC%E8%99%AB%E5%9F%BA%E7%A1%80&year=2022&add=%E5%90%89%E6%9E%97%E7%99%BD%E5%9F%8E'
print(parse.parse_qs(qs))

The decoded value is still a dictionary, but each value becomes a list.
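For reference, parse_qs groups repeated keys into lists, while the related parse_qsl returns a list of (key, value) tuples; a small sketch with made-up values:

from urllib import parse

qs = 'a=1&a=2&b=3'
print(parse.parse_qs(qs))    # {'a': ['1', '2'], 'b': ['3']}
print(parse.parse_qsl(qs))   # [('a', '1'), ('a', '2'), ('b', '3')]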

5. urlparse and urlsplit

Sometimes the components of a URL need to be split apart, and that is what urlparse and urlsplit are for. The code is as follows:

from urllib import request, parse
url = 'http://www.baidu.com/s?wd=java&username=123#23'

res = parse.urlparse(url)
print(res)

In the result: scheme corresponds to http, netloc corresponds to www.baidu.com, path corresponds to /s, the query string after ? corresponds to query (wd=java&username=123), and the part after # corresponds to fragment (23).

To obtain an attribute value, the example code is as follows:

from urllib import request, parse
url = 'http://www.baidu.com/s?wd=java&username=123#23'

res = parse.urlparse(url)
print('scheme:', res.scheme)
print('netloc:', res.netloc)
print('path:', res.path)
print('params:', res.params)
print('query: ', res.query)
print('fragment:', res.fragment)
# result
scheme: http
netloc: www.baidu.com
path: /s
params: 
query:  wd=java&username=123
fragment: 23

urlsplit is used the same way. The example code is as follows:

from urllib import request, parse
url = 'http://www.baidu.com/s?wd=java&username=123#23'

res = parse.urlsplit(url)

print(res)
print('scheme:', res.scheme)
print('netloc:', res.netloc)
print('path:', res.path)
# print('params:', res.params)
print('query: ', res.query)
print('fragment:', res.fragment)
SplitResult(scheme='http', netloc='www.baidu.com', path='/s', query='wd=java&username=123', fragment='23')
scheme: http
netloc: www.baidu.com
path: /s
query:  wd=java&username=123
fragment: 23

The difference between the two: urlsplit has no params attribute, while urlparse does. If you insert a parameter segment between the path and the '?' with a semicolon (for example /s;hello?wd=java), urlparse returns params='hello'.
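A small sketch of that difference (the ;hello segment is only an illustrative value):

from urllib import parse

url = 'http://www.baidu.com/s;hello?wd=java#23'
print(parse.urlparse(url).params)   # 'hello'
print(parse.urlsplit(url).path)     # '/s;hello' -- urlsplit leaves it in the path and has no params attribute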

request.Request class

First, request the page without adding any headers. The example code is as follows:

from urllib import request
url = 'https://www.lagou.com/wn/jobs?labelWords=&fromSearch=true&suginput=&kd=python'
resp = request.urlopen(url)
print(len(resp.read()))
executed in 707ms, finished 12:47:22 2022-01-23
127034

If you want to add request headers to the request, you must use the request.Request class. For example, to add a User-Agent, the example code is as follows:

from urllib import request
url = 'https://www.lagou.com/wn/jobs?labelWords=&fromSearch=true&suginput=&kd=python'
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'}

# Use the request.Request class to instantiate a req request object
req = request.Request(url, headers=headers)

# Then open the request object with request.urlopen()
resp = request.urlopen(req)
print(len(resp.read()))
executed in 788ms, finished 12:47:42 2022-01-23
124447
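The Request class also accepts a data parameter, which turns the request into a POST just like urlopen does; a minimal sketch, again assuming httpbin.org/post as the test endpoint:

from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'}
# data must be bytes; its presence makes this a POST request
data = parse.urlencode({'kd': 'python'}).encode('utf-8')

req = request.Request(url, data=data, headers=headers)
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))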

ProxyHandler processor (proxy settings)

Many websites detect how often a given IP accesses them within a certain period (through traffic statistics, system logs, etc.). If the number of accesses looks abnormal, the site will block that IP. Therefore, we can set up some proxies and switch to a different proxy from time to time; even if one IP is blocked, we can switch to another and continue crawling. In urllib, a proxy server is configured with ProxyHandler. The following code shows how to use a custom opener with a proxy server:

from urllib import request

# Without a proxy
resp = request.urlopen('http://httpbin.org/get')
print(resp.read().decode('utf-8'))

# Use a proxy. It is a dictionary: if the proxy supports http the key is 'http', if it supports https the key is 'https'; the value is the 'ip:port' of the proxy server
handler = request.ProxyHandler({'http':'218.66.161.88:31769'})

opener = request.build_opener(handler)
req = request.Request('http://httpbin.org/ip')
resp = opener.open(req)
print(resp.read())

The result of the request sent through the proxy is as follows:

b'<!DOCTYPE html>\r\n<html>\r\n\r\n<head>\r\n    <meta charset="UTF-8" />\r\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\r\n    <meta name="viewport" content="width=device-width, initial-scale=1.0,maximum-scale=1.0, user-scalable=no" />\r\n    <meta name="apple-mobile-web-app-title" content="\xe4\xba\xac\xe4\xb8\x9c\xe8\xb7\xaf\xe7\x94\xb1" />\r\n    <meta http-equiv="pragma" content="no-cache">\r\n    <meta http-equiv="cache-control" content="no-cache">\r\n    <meta http-equiv="expires" content="0">\r\n    <title>JD</title>\r\n    <link href="./P_IMG/icon.png" rel="icon" type="image/x-icon" />\r\n    <script type="text/Javascript" src="./Public_JS/jquery.min.js"></script>\r\n    <script type="text/Javascript" src="./Public_JS/redirect.js"></script>\r\n    <script type="text/Javascript" src="./Public_JS/jquery.cookie.js"></script>\r\n</head>\r\n\r\n<body>\r\n    <div id="main"></div>\r\n</body>\r\n<script type="text/Javascript">\r\n    var errNo = 0;\r\n    $(function () {\r\n        var DEFAULT_VERSION = 9.0;\r\n        var ua = navigator.userAgent.toLowerCase();\r\n        var isIE = ua.indexOf("msie") > -1;\r\n        var safariVersion;\r\n        if (isIE) {\r\n            safariVersion = ua.match(/msie ([\\d.]+)/)[1];\r\n        }\r\n        if (safariVersion <= DEFAULT_VERSION) {\r\n            alert(\'\xe7\xb3\xbb\xe7\xbb\x9f\xe6\xa3\x80\xe6\xb5\x8b\xe5\x88\xb0\xe6\x82\xa8\xe6\xad\xa3\xe5\x9c\xa8\xe4\xbd\xbf\xe7\x94\xa8ie9\xe5\x8f\x8a\xe4\xbb\xa5\xe4\xb8\x8b\xe5\x86\x85\xe6\xa0\xb8\xe7\x9a\x84\xe6\xb5\x8f\xe8\xa7\x88\xe5\x99\xa8\xef\xbc\x8c\xe4\xb8\x8d\xe8\x83\xbd\xe5\xae\x9e\xe7\x8e\xb0\xe5\xae\x8c\xe7\xbe\x8e\xe4\xbd\x93\xe9\xaa\x8c\xef\xbc\x8c\xe8\xaf\xb7\xe6\x9b\xb4\xe6\x8d\xa2\xe6\x88\x96\xe5\x8d\x87\xe7\xba\xa7\xe6\xb5\x8f\xe8\xa7\x88\xe5\x99\xa8\xe8\xae\xbf\xe9\x97\xae\xef\xbc\x81\')\r\n        };\r\n\r\n        var htmlHref = window.location.href;\r\n        htmlHref = htmlHref.replace(/^http:\\/\\//, "");\r\n        var index = htmlHref.indexOf(\'/\');\r\n        htmlHref = htmlHref.substring(0, index);\r\n        $.cookie("HostAddrIP", htmlHref, {\r\n            expires: 7, path: \'/\'\r\n        });\r\n        $.ajax({\r\n            type: "post",\r\n            url: "http://" + htmlHref + "/jdcapi",\r\n            data: JSON.stringify({\r\n                "jsonrpc": "2.0",\r\n                "id": 20,\r\n                "method": "call",\r\n                "params": [\r\n                    "00000000000000000000000000000000",\r\n                    "jdcapi.static",\r\n                    "get_initialization_info",\r\n                    {}\r\n                ]\r\n            }),\r\n            dataType: "json",\r\n            success: function (data) {\r\n                if(data.result[0] == 0 && data.result[1].intialized == \'1\'){\r\n                    if (browserRedirect()) { \r\n                        window.location.href = "./M_Guide/html/M_Login.html";\r\n                    } else \r\n                    { \r\n                        window.location.href = "./P_Guide/html/P_Login.html"; \r\n                    }\r\n                }else{\r\n                    initUserLogin(htmlHref);\r\n                    if(errNo){\r\n                        if (browserRedirect()) { window.location.href = "./M_Guide/html/M_Welcome.html";} else { window.location.href = "./P_Guide/html/P_Welcome.html"; }\r\n                    }else{\r\n                        
alert(\'\xe7\xb3\xbb\xe7\xbb\x9f\xe9\x94\x99\xe8\xaf\xaf\xef\xbc\x8c\xe8\xaf\xb7\xe9\x87\x8d\xe8\xaf\x95\');\r\n                    }\r\n                }\r\n            }\r\n        });\r\n        \r\n    });\r\n    var initUserLogin = function (htmlHref) {\r\n        $.ajax({\r\n            type: "post",\r\n            url: "http://" + htmlHref + "/jdcapi",\r\n            async:false,\r\n            data: JSON.stringify({\r\n                "jsonrpc":"2.0",\r\n                "id":1,\r\n                "method":"call",\r\n                "params":[\r\n                    "00000000000000000000000000000000",\r\n                    "session",\r\n                    "login",\r\n                    {\r\n                        "username": "root",\r\n                        "password": "admin",\r\n                        "timeout":0\r\n                    }\r\n                ]\r\n            }),\r\n            dataType: "json",\r\n            success: function (data) {\r\n                if(data.result[0] == 0){\r\n                    errNo = 1;\r\n                    $.cookie("sessionid", data.result[1].ubus_rpc_session, {\r\n                        expires: 7, path: \'/\'\r\n                    });\r\n                }else{\r\n                    errNo = 0;\r\n                }\r\n            }\r\n        });\r\n    };\r\n  </script>\r\n\r\n</html>'

Commonly used proxy providers include:

• Kuaidaili: http://www.kuaidaili.com/

• Daili Cloud: http://www.dailiyun.com/

Access without proxy:

from urllib import request

url = 'http://httpbin.org/ip'
resp = request.urlopen(url)
print(resp.read())
executed in 544ms, finished 19:51:12 2022-01-23
b'{\n  "origin": "222.163.139.128"\n}\n'

Access with a proxy:

1. Pass the proxy into ProxyHandler to build a handler

2. Build an opener from the handler

3. Send the request with the opener

from urllib import request

url = 'http://httpbin.org/ip'
# Pass the proxy into ProxyHandler to build a handler
handler = request.ProxyHandler({'http':'122.226.57.50:8888'})
# Build an opener from the handler
opener = request.build_opener(handler)
# Send the request with the opener
resp = opener.open(url)
print(resp.read())

The results are as follows:

executed in 698ms, finished 20:04:30 2022-01-23
b'{\n  "origin": "122.226.57.70"\n}\n'

Summary:

1. How a proxy works: before requesting the target website, the request is first sent to the proxy server, which then requests the target website on our behalf; after the proxy server obtains the target website's data, it forwards it back to our code.

2. http://httpbin.org: this website makes it easy to inspect the parameters of an HTTP request. For example:

http://httpbin.org/get
http://httpbin.org/user-agent
http://httpbin.org/ip

3. Using proxies in Code:

• Use urllib.request.ProxyHandler and pass in a proxy, which is a dictionary. The key of the dictionary is 'http' or 'https', depending on the type the proxy server accepts, and the value is 'ip:port'.

• Use the created handler and request.build_opener to create an opener.

• Call the open function of the created opener to send the request.

What is a cookie

On a website, HTTP requests are stateless: even after the first connection and a successful login, the server still cannot tell who is making the second request. Cookies were introduced to solve this problem. After the first login, the server returns some data (the cookie) to the browser, and the browser saves it locally. When the user sends a second request, the browser automatically attaches the stored cookie data, and the server can then tell which user is making the request from the data the browser carries. The amount of data a cookie can store is limited; different browsers have different limits, but it is generally no more than 4KB, so cookies can only hold a small amount of data.

cookie format:

Set-Cookie: NAME=VALUE; Expires/Max-Age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE

Parameter meaning:

• NAME: the name of the cookie

• VALUE: the value of the cookie

• Expires/Max-Age: the expiration time of the cookie

• Path: the path the cookie applies to

• Domain: the domain the cookie applies to

• SECURE: whether the cookie is only sent over HTTPS.
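A concrete example of such a header (all values here are made up for illustration):

Set-Cookie: sessionid=38afes7a8; Expires=Wed, 26 Jan 2022 07:28:00 GMT; Path=/; Domain=.example.com; Secure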

Simulate login using the http.cookiejar library and HTTPCookieProcessor

A cookie is a text file that a website server stores in the user's browser in order to identify the user and track the session; it can keep the login information until the user's next session with the server. Let's take logging in to Renren as an example. To visit someone's home page you must log in first, which means the request must carry the login cookie. If we want to access the page from code, we must send the correct cookie information. There are two solutions. The first is to log in with a browser, copy the cookie information, and put it in the headers. The example code is as follows:

from urllib import request

# Enter http://httpbin.org/user-agent in the browser address bar to see the User-Agent for the headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36", 
 'Cookie':'taihe_bi_sdk_uid=1034eaf8795b3aa571215b0229493246; taihe_bi_sdk_session=74f6c86d2578f7944791f18bc8de8141; Hm_lvt_ad6b0fd84f08dc70750c5ee6ba650172=1642950134,1642950199,1642990773; LOCAL_STORAGE_KEY_RENREN_USER_BASIC_INFO={"userName":"Bruce_liu","userId":2147694583,"headUrl":"http://rrfmn.rrimg.com/","secretKey":"32d3ccb0d3f81e9bfec587d6b2b8074c","sessionKey":"g1c1iAHmWq1VXPxR"}; Hm_lpvt_ad6b0fd84f08dc70750c5ee6ba650172=1642990794' }
# Request bruce_liu's personal home page, carrying the copied cookie in the headers
url = 'http://www.renren.com/personal/2147694583'

req = request.Request(url=url, headers=headers)
resp = request.urlopen(req)
with open('renren.html', 'w', encoding='utf-8') as fp:
  # The write function must be given a str
  # resp.read() returns bytes
  # bytes -> decode -> str
  # str -> encode -> bytes
  fp.write(resp.read().decode('utf-8'))

http.cookiejar module

The main classes in this module are CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar. The four classes work as follows:

1. CookieJar: an object that manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. All cookies are kept in memory; once the CookieJar instance is garbage-collected, its cookies are lost.

2. FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar, it creates a FileCookieJar instance, retrieves cookie information and stores cookies in a file. filename is the name of the file where cookies are stored. When delayload is True, deferred file access is supported, i.e. the file is only read, or data only written to it, when needed.

3. MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, it creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.

4. LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, it creates a FileCookieJar instance compatible with the libwww-perl standard Set-Cookie3 file format.

Using http.cookiejar.CookieJar and request.HTTPCookieProcessor to log in to Renren. The example code is as follows:

from urllib import request
from urllib import parse
from http.cookiejar import CookieJar


headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"}

# Define a function that returns opener
def get_opener():
    # Sign in
    # Create a cookie jar object
    cookiejar = CookieJar()
    # Create an HTTPCookieProcessor object using the cookiejar
    handler = request.HTTPCookieProcessor(cookiejar)
    
    # Create an opener using the handler created in the previous step
    opener = request.build_opener(handler)
    
    return opener
   
    
# Define a login function of Renren
def login_renren(opener):    
    
    # Use the opener to send the login request (Renren email and password)


    data = {
        'email': '11343154@qq.com',
        'password': '2008-lxw'
        }
    login_url = 'http://www.renren.com/personal/2147694583'
    req = request.Request(login_url, data=parse.urlencode(data).encode('utf-8'), headers=headers)

    opener.open(req)
    
# Define a function to visit the personal home page
def visit_profile(opener):

    # Visit personal homepage
    bruce_liu_url = 'http://www.renren.com/personal/2147694583'

    # You don't need to create a new opener to get the page of your personal home page
    # Because the previous opener already contains the cookie information required for login
    req = request.Request(bruce_liu_url, headers=headers)
    resp = opener.open(req)
    with open('renren2.html', 'w', encoding='utf-8') as fp:
        fp.write(resp.read().decode('utf-8'))

if __name__=='__main__':
    opener1 = get_opener()
    login_renren(opener1)
    visit_profile(opener1)

Save cookies locally

To save cookies locally, you can use the save method of the cookiejar, and you need to specify a file name.

                                                              
from urllib import request
from http.cookiejar import MozillaCookieJar

headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"}
# First create a MozillaCookieJar
cookiejar = MozillaCookieJar('cookie.txt')
# Use the cookiejar to create a handler
handler = request.HTTPCookieProcessor(cookiejar)
# Use the handler to create an opener
opener = request.build_opener(handler)

# httpbin's /cookies/set endpoint sets the cookie course=spider in the response
req = request.Request('http://httpbin.org/cookies/set/course/spider', headers=headers)

resp = opener.open(req)
# print(resp.read())
cookiejar.save(ignore_discard=True, ignore_expires=True)

Then load the local cookie with the following code:

from urllib import request
from http.cookiejar import MozillaCookieJar
import  ssl
ssl._create_default_https_context=ssl._create_unverified_context
headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"}
# First create a MozillaCookieJar
cookiejar = MozillaCookieJar('cookie.txt')
# Call load() with ignore_discard=True so that session cookies are loaded as well
cookiejar.load(ignore_discard=True)
# Use the cookiejar to create a handler
handler = request.HTTPCookieProcessor(cookiejar)
# Use the handler to create an opener
opener = request.build_opener(handler)

resp = opener.open('http://httpbin.org/cookies')
for cookie in cookiejar:
    print(cookie)

requests Library

Although the urllib module in Python's standard library already covers most of the functionality we normally need, its API feels cumbersome to use. Requests advertises itself as "HTTP for Humans", meaning it is more concise and convenient.

Installation and documentation address:

It can be installed very easily with pip:

pip install requests

GitHub address: https://github.com/requests/requests

Send GET request

1. The simplest way to send a GET request is to call requests.get

import requests
url = 'http://www.baidu.com/'
resp = requests.get(url)

2. Add headers and query parameters. To add headers, pass the request header information with the headers parameter. To pass parameters in the URL, use the params parameter. The example code is as follows:

import requests
kw = {'wd': 'python'}
headers = {
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
}

# params accepts query parameters as a dictionary or string. A dictionary is converted to URL-encoded form automatically, no urlencode() needed
resp = requests.get('http://www.baidu.com/s', params=kw, headers=headers)

# View the response content; resp.text returns the data as a decoded str (unicode)
print(resp.text)

# View the response content; resp.content returns the raw bytes
print(resp.content)

# View the full url address
print(resp.url)

# View the response character encoding
print(resp.encoding)

# View the response status code
print(resp.status_code)

# Decode the response content and write it to an html file
with open('baidu.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.content.decode('utf-8'))

The difference between response.text and response.content

  1. response.content: the raw data fetched from the network without any decoding, so it is of type bytes. (Strings stored on disk and transmitted over the network are bytes.)
  2. response.text: the str that requests produces by decoding response.content. Decoding needs an encoding, and requests guesses which one to use, so the guess is sometimes wrong and the decoded text comes out garbled. In that case, use response.content.decode('utf-8') to decode manually.
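A small sketch of handling a wrong guess; whether utf-8 is the right encoding depends on the page, so treat that as an assumption:

import requests

resp = requests.get('http://www.baidu.com/')
print(resp.encoding)        # the encoding requests guessed from the response headers
resp.encoding = 'utf-8'     # override the guess; resp.text will now decode with utf-8
print(resp.text[:200])
# Or decode the raw bytes yourself:
print(resp.content.decode('utf-8')[:200])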

Send POST request

1. The most basic POST request can be sent with the post method. If JSON data is returned, you can call response.json() to convert the JSON string into a dictionary or list.

response = requests.post('http://www.baidu.com/', data=data)

2. Passing in data

There is no need to encode it with urlencode; just pass in a dictionary. For example, the code that requests data from Lagou:

#_*_coding:utf-8_*_
# Author: liuxiaowei
# Created on: 1/24/22 6:06 PM
# File     : post.py
# IDE      : PyCharm

import requests

url = 'https://www.lagou.com/jobs/position'
headers = {
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
}

data = {
    'first':'true',
    'pn':1,
    'kd':'python'
}

resp = requests.post(url, headers=headers, data=data)

# If it is json data, you can call the json method directly
print(resp.json())

Difference between GET request and POST request

A GET request passes its query parameters with params=, while a POST request passes its form data with data=.
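A minimal side-by-side sketch, using httpbin.org (already used above for /get and /ip) as the test server:

import requests

# GET: the dictionary goes into the query string via params=
resp_get = requests.get('http://httpbin.org/get', params={'kd': 'python'})
print(resp_get.json()['args'])    # {'kd': 'python'}

# POST: the dictionary goes into the request body via data=
resp_post = requests.post('http://httpbin.org/post', data={'kd': 'python'})
print(resp_post.json()['form'])   # {'kd': 'python'}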

Using a proxy

Adding a proxy with requests is also very simple: just pass the proxies parameter to the request method (get, post, etc.). The example code is as follows:

import requests

url = 'http://httpbin.org/get'

headers = {
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
}

proxy = {
    'http':'61.150.96.27:36880'  # https://www.kuaidaili.com/free/inha/3/ Shaanxi Hanzhong Telecom
}

resp = requests.get(url, headers=headers, proxies=proxy)
with open('xx.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.text)
    
    
The results are as follows:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-61eea7d1-0f1dae897b843858628d2d7e"
  },
  "origin": "61.150.96.27",
  "url": "http://httpbin.org/get"
}

The origin field in the response is the proxy IP, which confirms that the request went through the proxy.

cookie

If a response contains cookies, you can use the cookies attribute to get the returned cookie values:

import requests

url = 'http://www.renren.com/PLogin.do'
data = {'email':'970138074@qq.com', 'password':'pythonspider'}
resp = requests.get('http://www.baidu.com/')
print(resp.cookies)
print(resp.cookies.get_dict())

The results are as follows:

executed in 102ms, finished 21:51:15 2022-01-24
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
{'BDORZ': '27315'}
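The returned RequestsCookieJar also supports dict-style access to individual values; a small sketch (BDORZ is simply the cookie name from the output above and may change):

import requests

resp = requests.get('http://www.baidu.com/')
# Dict-style access to a single cookie value
print(resp.cookies['BDORZ'])        # e.g. '27315'
print(resp.cookies.get('BDORZ'))    # same value; returns None if the cookie is missing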

session

Previously, with the urllib library, we used an opener to send multiple requests so that cookies could be shared among them. To share cookies with requests, we can use the Session object provided by the requests library. Note that this session is not the session concept from web development; here it is simply an object that represents one browsing session. Take Renren as an example again, implemented with requests. The example code is as follows:

import requests

url = 'http://www.renren.com/PLogin.do'
headers = {
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
}


data = {'email':'970138074@qq.com', 'password':'pythonspider'}
session = requests.Session()
session.post(url, data=data, headers=headers)

resp = session.get('http://www.renren.com/880151247/profile')

with open('renren.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.text)

Handling untrusted SSL certificates

For a website whose SSL certificate is not trusted (take https://www.baidu.com/ as the example URL), you can pass verify=False so that requests returns the response directly. The example code is as follows:

resp = requests.get(url, verify=False)  # set verify to False
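A slightly fuller sketch; silencing the InsecureRequestWarning with urllib3 is optional and an addition here, not part of the original example:

import requests
import urllib3

# Optional: silence the InsecureRequestWarning that requests emits when verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'https://www.baidu.com/'
resp = requests.get(url, verify=False)   # skip SSL certificate verification
print(resp.status_code)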

Topics: Python crawler