Basic crawler learning -- urllib and urllib3

Posted by youngp on Tue, 18 Jan 2022 23:29:02 +0100

urllib is an official built-in standard library, so nothing needs to be downloaded; in Python 3 it merges Python 2's urllib and urllib2. urllib3 is a third-party library that solves thread safety and adds features such as connection pooling. urllib and urllib3 complement each other.

1. The urllib library

The urllib library mainly includes four modules:

  1. urllib.request: request module
  2. urllib.error: exception handling module
  3. urllib.parse: URL parsing module
  4. urllib.robotparser: robots.txt parsing module

1.1. urllib.request module

The request module is mainly responsible for constructing and initiating network requests, and for adding headers, proxies, and so on.

It can simulate how a browser initiates a request:
1. Initiate a network request.
2. Add headers.
3. Operate cookies.
4. Use proxies.

1.1.1. Initiating a network request

urlopen method

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Function: urlopen is a method for sending a simple network request and returning the result.
Parameters:
① url: required; can be a string or a Request object.
② data: None means a GET request; if data is given (bytes / file object / iterable), a POST request is sent and the data is submitted as the form body.
③ timeout: in seconds; for example, timeout=0.1 sets the timeout to 0.1 seconds (if the request takes longer, an error is raised!).
Return value: after sending a network request, the classes and methods in the urllib library return a response object, which contains the result of the request along with some properties and methods for processing the returned data.

Example:

from urllib import request
# test_url = "http://httpbin.org/get"  # note: with a GET request, data must be empty
test_url = "http://httpbin.org/post"
res = request.urlopen(test_url, data=b"spider")
print(res.read())     # the whole body as a byte string
print(res.getcode())  # get the status code
print(res.info())     # get the response headers
print(res.read())     # reading the byte string again returns empty
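
The timeout behavior described above can be demonstrated with a deliberately small value. A minimal sketch (if httpbin happens to answer within 0.1 s, no error is raised):

import socket
from urllib import error, request

try:
    res = request.urlopen("http://httpbin.org/get", timeout=0.1)
    print(res.getcode())
except (error.URLError, socket.timeout) as e:
    # the connection or read took longer than 0.1 seconds
    print("request timed out:", e)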

Request object

urlopen can initiate the most basic requests, but its few parameters are not enough to build a complete request (with custom request headers and different request methods). For that, you can construct a Request object.

class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):
        pass

Function: Request constructs a complete network request; it returns a Request object.
Parameters:
① url: required; a string
② data: bytes
③ headers: request header information
④ method: GET by default; can be POST, PUT, DELETE, etc.
Return value: a Request object

Example:

from urllib import request
#Request object
test_url="http://httpbin.org/get"
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
req=request.Request(test_url,headers=headers)
res=request.urlopen(req)
print(res.read())

# Using the data and method parameters of the Request object
print("************************************")
test_url="http://httpbin.org/put"
req=request.Request(test_url,headers=headers,data=b"updatedata",method="PUT")
res=request.urlopen(req)
print(res.read())

response object

After sending a network request, the classes and methods in the urllib library return a response object, which contains the result of the request along with some properties and methods for processing the returned data.

  • read(): gets the data returned by the response; it can only be used once
  • readline(): reads one line
  • info(): gets the response headers
  • geturl(): gets the URL that was accessed
  • getcode(): returns the status code
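
A quick sketch of these methods against httpbin:

from urllib import request

res = request.urlopen("http://httpbin.org/get")
print(res.geturl())    # the URL that was actually accessed
print(res.getcode())   # the status code, e.g. 200
print(res.info())      # the response headers
print(res.readline())  # the first line of the body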

1.1.2. Adding request headers

from urllib import request
test_url="http://httpbin.org/get"
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
req=request.Request(test_url,headers=headers)
res=request.urlopen(req)
print(res.read())

1.1.3. Operating cookies

from urllib import request
from http import cookiejar

# Create a cookie jar object
cookie = cookiejar.CookieJar()
# Create a cookie handler
cookies = request.HTTPCookieProcessor(cookie)
# Use it as a parameter to create an opener object
opener = request.build_opener(cookies)
# Use this opener to send requests
res = opener.open("http://www.baidu.com")
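
After the request returns, the CookieJar holds whatever cookies the server set; a minimal sketch of inspecting them:

# iterate over the cookies collected in the CookieJar
for item in cookie:
    print(item.name, item.value)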

1.1.4. Setting a proxy

from urllib import request

url = 'http://httpbin.org/ip'
# Proxy address
proxy = {'http': '180.76.111.69:3128'}
# Proxy handler
proxies = request.ProxyHandler(proxy)
# Create an opener object
opener = request.build_opener(proxies)

res = opener.open(url)
print(res.read().decode())

1.2. urllib.parse module

The parse module is a tool module that provides methods for handling and parsing URLs. A URL may only contain ASCII characters, but in practice the parameters passed through the URL of a GET request often contain special characters such as Chinese, so the URL must be encoded.

1.2.1. Encoding a single parameter

  • parse.quote(): percent-encode Chinese characters into ASCII

from urllib import parse
name = "动画片"  # "cartoon" in Chinese
asc_name = parse.quote(name)  # percent-encode the Chinese characters
print(asc_name)  # Result: %E5%8A%A8%E7%94%BB%E7%89%87

  • parse.unquote(): decode ASCII percent-escapes back to Chinese

from urllib import parse
name = '%E5%8A%A8%E7%94%BB%E7%89%87'
print(parse.unquote(name))  # Result: 动画片 ("cartoon")

1.2.2. Encoding multiple parameters

When sending a request, you often need to pass many parameters, and splicing them together with string methods is troublesome. The parse.urlencode() method converts a dictionary into URL request parameters and handles the splicing; the parse.parse_qs() method converts them back into a dictionary.

  • parse.urlencode()
  • parse.parse_qs()

Example:

from urllib import parse, request

# parse.urlencode() converts a dictionary into URL request parameters
params = {"name": "电影", "name2": "电视剧", "name3": "动画片"}  # film, TV play, cartoon
asc_name = parse.urlencode(params)  # dictionary -> URL request parameter string
print(asc_name)  # name=%E7%94%B5%E5%BD%B1&name2=%E7%94%B5%E8%A7%86%E5%89%A7&name3=%E5%8A%A8%E7%94%BB%E7%89%87
test_url = "http://httpbin.org/get?{}".format(asc_name)
print(test_url)
res = request.urlopen(test_url)
print(res.read())

# parse_qs converts the string back to its original form
new_params = parse.parse_qs(asc_name)
print(new_params)  # {'name': ['电影'], 'name2': ['电视剧'], 'name3': ['动画片']}

1.3. urllib.error module

1.3.1. URLError and HTTPError

The error module is mainly responsible for handling exceptions. If an error occurs in the request, we can use the error module to handle it, mainly including URLError and HTTPError.

  • URLError: the base class of the error module's exceptions; exceptions raised by the request module can be handled with this class.
  • HTTPError: a subclass of URLError, which mainly contains three attributes:
    • code: the status code of the request
    • reason: the reason for the error
    • headers: the headers of the response

Example:

from urllib import error,request
try:
    res=request.urlopen("https://jianshu.com")
    print(res.read())
except error.HTTPError as e:
    print('Requested status code:',e.code)
    print('Cause of error:',e.reason)
    print('Response header:',e.headers)

------------result-----------------
Requested status code: 403
Cause of error: Forbidden
Response header: Server: Tengine
Date: Mon, 12 Jul 2021 04:40:02 GMT
Content-Type: text/html
Content-Length: 584
Connection: close
Vary: Accept-Encoding
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
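
URLError covers lower-level failures such as a host that cannot be resolved. A minimal sketch (the hostname is a deliberately invalid placeholder):

from urllib import error, request

try:
    res = request.urlopen("http://this-host-does-not-exist.example")
except error.URLError as e:
    # no HTTP response was received, so only reason is available
    print('Cause of error:', e.reason)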

1.4. urllib.robotparser module

The robotparser module is mainly responsible for handling the crawler protocol file, i.e. parsing robots.txt. Since this is a gentleman's agreement that crawlers generally do not abide by, the module is rarely used in practice.

To view a site's robots protocol, append /robots.txt to the site's address.

For example, Baidu's robots protocol: http://www.baidu.com/robots.txt

The full name of Robots protocol (also known as crawler protocol, robot protocol, etc.) is "Robots Exclusion Protocol". The website tells search engines which pages can be crawled and which pages cannot be crawled through Robots protocol.

The robots.txt file is a plain text file; you can create and edit it with any common text editor, such as Notepad on Windows. robots.txt is a protocol, not a command: it is the first file a search engine checks when visiting a website, and it tells the spider which files on the server may be viewed.
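
Even so, the module is simple to use. A minimal sketch against Baidu's robots protocol:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()
# can_fetch() reports whether the given user agent may crawl a URL
print(rp.can_fetch("Baiduspider", "http://www.baidu.com/s"))
print(rp.can_fetch("*", "http://www.baidu.com/s"))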

2. The urllib3 library

2.1. Features

urllib3 is a powerful, well-organized HTTP client library for Python. Much of the Python ecosystem already uses urllib3. It provides many important features that are missing from the Python standard library:
1. Thread safety
2. Connection pooling
3. Client-side SSL/TLS verification
4. File uploads with multipart encoding
5. Helpers for retrying requests and dealing with HTTP redirects
6. Support for compressed encodings (gzip, deflate)
7. Proxy support for HTTP and SOCKS
8. 100% test coverage

2.2. Installation

Install with the pip command:

pip install urllib3

2.3. Using urllib3

2.3.1. Basic steps for initiating a request

1. Import the urllib3 library

import urllib3

2. Instantiate a PoolManager object, which handles all the details of connection pooling and thread safety

http=urllib3.PoolManager()

3. Send a request using the request method

res=http.request("GET","http://www.baidu.com")

2.3.2. request method

request(self, method, url, fields=None, headers=None,**urlopen_kw)

Function: sends a complete network request
Parameters:
① method: the request method (GET, POST, PUT, DELETE, ...)
② url: a string
③ fields: a dictionary; converted to URL parameters for a GET request and to form data for a POST request
④ headers: a dictionary
Return value: a response object

Example:

import urllib3

http = urllib3.PoolManager()
url = 'http://httpbin.org/get'
headers = {'header1':'python','header2':'java'}
fields = {'name':'you','passwd':'12345'}

res = http.request('GET',url,fields=fields,headers=headers)

print('Status code:',res.status)
print('Response header:',res.headers)
print('data:',res.data)

2.3.3. Proxies

You can use ProxyManager for HTTP proxy operations:

import urllib3

proxy = urllib3.ProxyManager('http://180.76.111.69:3128')
res = proxy.request('GET', 'http://httpbin.org/ip')
print(res.data)

2.3.4. Request data

  • For GET, HEAD, and DELETE requests, you can add query parameters by providing the dictionary parameter fields:

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/get', fields={'mydata': 'python'})
print(r.data.decode())
  • For POST and PUT requests, parameters that belong in the URL need to be URL-encoded first and then spliced onto the URL:

import urllib3
from urllib import parse

http = urllib3.PoolManager()
data = parse.urlencode({'myname': 'pipi'})
url = 'http://httpbin.org/post?' + data
r = http.request('POST', url)
print(r.data.decode())
  • JSON
    When initiating a request, you can send encoded JSON data by setting the body parameter and the Content-Type header:

import urllib3
import json

http = urllib3.PoolManager()
data = {'username': 'python'}
encoded_data = json.dumps(data).encode('utf-8')
r = http.request('POST',
                 'http://httpbin.org/post',
                 body=encoded_data,
                 headers={'Content-Type': 'application/json'})
print(json.loads(r.data.decode('utf-8'))['json'])
  • Files
    For file uploads, we can imitate the way a browser form submits them:

import json
import urllib3

http = urllib3.PoolManager()
with open('example.txt') as fp:
    file_data = fp.read()
r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'filefield': ('example.txt', file_data)})
print(json.loads(r.data.decode('utf-8'))['files'])
  • Binary data
    For binary data uploads, we specify the body and set the Content-Type request header:

import urllib3
import json

http = urllib3.PoolManager()
with open('example.jpg', 'rb') as fb:
    binary_data = fb.read()
r = http.request('POST',
                 'http://httpbin.org/post',
                 body=binary_data,
                 headers={'Content-Type': 'image/jpeg'})
print(json.loads(r.data.decode('utf-8')))

2.3.5. The response object

  • The HTTP response object provides properties such as status, data, and headers:
import urllib3
http=urllib3.PoolManager()
r=http.request('GET','http://httpbin.org/ip')
print(r.status)
print(r.data)
print(r.headers)
  • JSON content
    JSON data returned in the response can be decoded with the json module; json.loads() turns it into a dictionary, as in the sketch below.
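
A minimal sketch (http://httpbin.org/ip returns a JSON body):

import urllib3
import json

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/ip')
data = json.loads(r.data.decode('utf-8'))  # bytes -> str -> dict
print(data)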

  • Binary content
    The data returned by the response is of byte type. For large amounts of data, it is better to process it as a stream:

import urllib3
http=urllib3.PoolManager()
r=http.request('GET','http://httpbin.org/bytes/10241', preload_content=False)
for chunk in r.stream(32):
    print(chunk)

The response can also be treated as a file object:

import urllib3
http=urllib3.PoolManager()
r=http.request('GET','http://httpbin.org/bytes/10241', preload_content=False)
for line in r:
    print(line)