Structure of urllib Library
The urllib library contains the following four modules:
- request: the basic HTTP request module
- error: the exception handling module
- parse: a tool module for parsing and handling URLs
- robotparser: a module for parsing robots.txt files
urlopen method
Simple requests can be sent using the urlopen method
API
urllib.request.urlopen(url, data=None, [timeout,] *, cafile=None, capath=None, cadefault=False, context=None)
- url: the URL to request
- data: data to send with the request; if set, the request method becomes POST instead of GET
- timeout: timeout in seconds; a URLError exception is raised when the timeout expires
- cafile: CA certificate file
- capath: directory containing CA certificates
- cadefault: deprecated, defaults to False
- context: used to specify SSL settings; must be an ssl.SSLContext object
In addition, the urlopen method can also accept a Request object as a parameter, as described later
Send GET request
from urllib.request import urlopen

url = 'https://www.python.org'
resp = urlopen(url=url)
# The data returned by read() is bytes and needs to be decoded manually
print(resp.read().decode('utf-8'))
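The object returned by urlopen is an http.client.HTTPResponse; besides read(), it also exposes the status code and the response headers. A short sketch against the same URL:

from urllib.request import urlopen

resp = urlopen('https://www.python.org')
print(type(resp))                # <class 'http.client.HTTPResponse'>
print(resp.status)               # Status code, e.g. 200
print(resp.getheaders())         # All response headers as a list of (name, value) tuples
print(resp.getheader('Server'))  # The value of a single response header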
Send POST request
from urllib.request import urlopen
from urllib.parse import urlencode

url = 'https://www.httpbin.org/post'
data = {'name': 'germey'}
# Use urlencode to encode the data, then convert the resulting string to bytes
data = bytes(urlencode(data), encoding='utf-8')
# When data is carried, the request method becomes POST
resp = urlopen(url=url, data=data)
print(resp.read().decode('utf-8'))
Handling timeouts
import socket
from urllib.request import urlopen
from urllib.error import URLError

url = 'https://www.httpbin.org/get'
try:
    resp = urlopen(url=url, timeout=0.1)  # timeout in seconds
    html = resp.read().decode('utf-8')
    print(html)
except URLError as e:  # A URLError is raised when the timeout expires
    if isinstance(e.reason, socket.timeout):  # Check the specific type of the underlying error
        print('TIME OUT')
Request class
The Request class lets you attach more information to a request, such as request headers and the request method
API
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- url: the URL to request
- data: the data to send; must be of type bytes
- headers: request headers, as a dictionary; headers can be passed via this parameter or added later with the Request object's add_header method
- origin_req_host: the host name or IP address of the original requester
- unverifiable: whether the request is unverifiable
- method: the request method, such as GET, POST or PUT
Usage
from urllib.request import Request
from urllib.request import urlopen
from urllib.parse import urlencode

url = 'https://www.httpbin.org/post'
data = bytes(urlencode({'name': 'germey'}), encoding='utf-8')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36',
    'Host': 'www.httpbin.org',
}
req = Request(url=url, data=data, headers=headers, method='POST')
# Still use urlopen to send the request, passing in the Request object as the parameter
resp = urlopen(req)
print(resp.read().decode('utf-8'))
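As noted in the parameter list, headers can also be attached after the Request object has been created, using its add_header method. A minimal sketch of the same POST request using add_header instead of the headers parameter:

from urllib.request import Request, urlopen
from urllib.parse import urlencode

url = 'https://www.httpbin.org/post'
data = bytes(urlencode({'name': 'germey'}), encoding='utf-8')
req = Request(url=url, data=data, method='POST')
# Attach a single header after construction instead of passing a headers dict
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
resp = urlopen(req)
print(resp.read().decode('utf-8'))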
Using Handler
Handlers deal with special situations during a request, such as login authentication, cookies, and proxies
The base class urllib.request.BaseHandler provides the most basic methods, such as default_open and protocol_request
A number of Handler subclasses inherit from BaseHandler to handle specific situations:
- HTTPDefaultErrorHandler: handles HTTP error responses by raising an HTTPError exception
- HTTPRedirectHandler: handles redirects
- HTTPCookieProcessor: handles cookies
- ProxyHandler: sets a proxy; the proxy mapping is empty by default
- HTTPPasswordMgr: manages passwords and maintains a table of usernames and passwords
- HTTPBasicAuthHandler: handles HTTP basic authentication
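Handlers are combined into an opener with build_opener, which accepts any number of them; the sections below use one handler at a time, but several can be stacked. A minimal sketch (the proxy address is a placeholder):

import http.cookiejar
from urllib.request import build_opener, ProxyHandler, HTTPCookieProcessor

cookies = http.cookiejar.CookieJar()
proxy_handler = ProxyHandler({'http': 'http://127.0.0.1:8080'})  # Placeholder proxy address
# build_opener takes any number of handlers and returns an opener that applies all of them
opener = build_opener(proxy_handler, HTTPCookieProcessor(cookies))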
Handling login authentication
from urllib.request import HTTPPasswordMgrWithDefaultRealm
from urllib.request import HTTPBasicAuthHandler
from urllib.request import build_opener
from urllib.error import URLError

url = 'https://ssr3.scrape.center/'
username = 'admin'
password = 'admin'
pwd_mgr = HTTPPasswordMgrWithDefaultRealm()  # Create a password manager instance
pwd_mgr.add_password(None, url, username, password)  # Add the username and password
auth_handler = HTTPBasicAuthHandler(pwd_mgr)  # Create an authentication handler from the password manager
opener = build_opener(auth_handler)  # Build an opener from the handler; its open method works like urlopen
try:
    # Requests sent with opener.open carry the credentials configured above
    resp = opener.open(url)
    html = resp.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
Handling proxies
from urllib.error import URLError
from urllib.request import ProxyHandler
from urllib.request import build_opener

url = 'https://www.baidu.com'
proxy_handler = ProxyHandler({  # Create a proxy handler
    'http': 'http://118.190.244.234:3128',
    'https': 'https://118.190.244.234:3128'
})
opener = build_opener(proxy_handler)  # Create the opener
try:
    # Send the request through the proxy
    resp = opener.open(url)
    html = resp.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
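An opener can also be installed as the global default with urllib.request.install_opener, after which plain urlopen calls go through it. A minimal sketch (the proxy address is a placeholder):

import urllib.request
from urllib.request import ProxyHandler, build_opener

opener = build_opener(ProxyHandler({'http': 'http://127.0.0.1:8080'}))  # Placeholder proxy address
urllib.request.install_opener(opener)  # Install the opener as the global default
resp = urllib.request.urlopen('https://www.baidu.com')  # Plain urlopen now goes through the proxy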
Handling cookies
# Print cookies directly
import http.cookiejar
import urllib.request

url = 'https://www.baidu.com'
cookies = http.cookiejar.CookieJar()  # Create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookies)  # Create a handler from the CookieJar object
opener = urllib.request.build_opener(handler)  # Create the opener
resp = opener.open(url)
# The CookieJar object can be iterated like a list to read the cookies
for cookie in cookies:
    # Print the name and value attributes of each Cookie object
    print(cookie.name, '=', cookie.value)
# Write cookies to a file in Mozilla format
import http.cookiejar
import urllib.request

url = 'https://www.baidu.com'
filename = 'bd_m_cookie.txt'  # Name of the file in which to save the cookies
# MozillaCookieJar handles reading and writing cookie files and supports the Mozilla cookie file format
cookies = http.cookiejar.MozillaCookieJar(filename=filename)
handler = urllib.request.HTTPCookieProcessor(cookiejar=cookies)
opener = urllib.request.build_opener(handler)
resp = opener.open(url)
# Save the cookies to the file
cookies.save(ignore_discard=True, ignore_expires=True)
"""File content:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	1675640364	BAIDUID	48B3F4D3CCDDB7205C471C7941363BCE:FG=1
.baidu.com	TRUE	/	FALSE	3791588011	BIDUPSID	48B3F4D3CCDDB72072B89C5EEAF3C1AE
.baidu.com	TRUE	/	FALSE	3791588011	PSTM	1644104364
www.baidu.com	FALSE	/	FALSE	1644104664	BD_NOT_HTTPS	1
"""
# Write cookies to a file in LWP format
import http.cookiejar
import urllib.request

url = 'https://www.baidu.com'
filename = 'bd_lwp_cookie.txt'
# Only the class changes to LWPCookieJar; everything else is the same
cookies = http.cookiejar.LWPCookieJar(filename=filename)
handler = urllib.request.HTTPCookieProcessor(cookiejar=cookies)
opener = urllib.request.build_opener(handler)
resp = opener.open(url)
cookies.save(ignore_expires=True, ignore_discard=True)
"""File content:
#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="519E24A62494ECF40B4A6244CFFA07C3:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2023-02-06 00:13:16Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=519E24A62494ECF45DB636DC550D8CA7; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2090-02-24 03:27:23Z"; version=0
Set-Cookie3: PSTM=1644106396; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2090-02-24 03:27:23Z"; version=0
Set-Cookie3: BD_NOT_HTTPS=1; path="/"; domain="www.baidu.com"; path_spec; expires="2022-02-06 00:18:16Z"; version=0
"""
# Read cookies from a file
import urllib.request
import http.cookiejar

url = 'https://www.baidu.com'
filename = 'bd_lwp_cookie.txt'
cookies = http.cookiejar.LWPCookieJar()
# Load the previously saved file
cookies.load(filename=filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookiejar=cookies)
opener = urllib.request.build_opener(handler)
resp = opener.open(url)
html = resp.read().decode('utf-8')
print(html)
Handling exceptions
URLError class
The URLError class inherits from OSError and is the base exception class of the urllib.error module
Any exception raised while sending a request with urllib can be caught as a URLError
URLError has a reason attribute indicating the cause of the error
reason may be a string or an exception object (for example, a socket.timeout instance when a timeout occurs)
HTTPError class
The HTTPError class is a subclass of URLError that specifically handles HTTP request errors
It has three attributes:
- code: the response status code
- reason: the cause of the error; may be a string or an object
- headers: the response headers
Example
import urllib.request
from urllib.error import URLError, HTTPError

url = 'https://cuiqingcai.com/404'
try:
    resp = urllib.request.urlopen(url, timeout=1)
    html = resp.read().decode('utf-8')
    print(html)
except HTTPError as e:
    print(e.reason, e.headers, e.url, e.fp, e.code, sep='\n')
except URLError as e:
    print(type(e.reason), '\n', e.reason)
else:
    print('success')
"""Output:
Not Found
Server: GitHub.com
Content-Type: text/html; charset=utf-8
Access-Control-Allow-Origin: *
ETag: "60789243-247b"
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; img-src data:; connect-src 'self'
x-proxy-cache: MISS
X-GitHub-Request-Id: E15A:6107:132CB29:158E796:61FF1AA9
Accept-Ranges: bytes
Date: Sun, 06 Feb 2022 00:55:58 GMT
Via: 1.1 varnish
Age: 501
X-Served-By: cache-hkg17931-HKG
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1644108959.779112,VS0,VE1
Vary: Accept-Encoding
X-Fastly-Request-ID: cce2ac7f081b0d937fe93e90656fce56b5e6cc03
X-Cache-Lookup: Cache Miss
X-Cache-Lookup: Cache Miss
X-Cache-Lookup: Cache Miss
Content-Length: 9339
X-NWS-LOG-UUID: 17106243714350687226
Connection: close
X-Cache-Lookup: Cache Miss

https://cuiqingcai.com/404
<http.client.HTTPResponse object at 0x0000019F340B1B80>
404
"""
Common methods in parse module
The urllib.parse module provides many methods for handling URLs
urlparse
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
- url: the URL to parse
- scheme: the default scheme, used if the URL itself contains no scheme information
- allow_fragments: whether to parse the fragment separately; if set to False, the fragment is treated as part of the preceding component
# Use urlparse to split a URL into its components
from urllib.parse import urlparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
result = urlparse(url=url, scheme='https', allow_fragments=True)
# The return value behaves like a named tuple: parts can be accessed by index or by attribute
print(result)
# ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
print(result.scheme)  # Access by attribute
# https
print(result[1])  # Access by index
# www.baidu.com
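For comparison, the effect of the scheme and allow_fragments parameters (note that without a leading '//' the host ends up in path rather than netloc):

from urllib.parse import urlparse

# The scheme argument is only used when the URL itself carries no scheme
print(urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https'))
# ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

# With allow_fragments=False the fragment stays attached to the preceding component
print(urlparse('https://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False))
# ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')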
urlunparse
urlunparse is the inverse of urlparse: it assembles the parts of a URL into a complete URL
urllib.parse.urlunparse(components)
- components: an iterable of URL components with a fixed length of 6
# Use urlunparse to assemble a URL
from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
# https://www.baidu.com/index.html;user?a=6#comment
urlsplit
urlsplit is similar to urlparse, except that params is not split out separately and stays part of the path
urllib.parse.urlsplit(url, scheme='', allow_fragments=True)
# Use urlsplit to split a URL
from urllib.parse import urlsplit

url = 'https://www.baidu.com/index.html;user?id=5#comment'
print(urlsplit(url))
# SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
urlunsplit
urlunsplit is similar to urlunparse, except that the iterable of components passed in must have a length of 5
# Use urlunsplit to assemble a URL
from urllib.parse import urlunsplit

data = ['https', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))
# https://www.baidu.com/index.html?a=6#comment
urljoin
urljoin takes a base URL and another URL (usually a relative URL), analyzes the scheme, netloc, and path of the base URL, and joins them with the second URL to produce a complete URL
urllib.parse.urljoin(base, url, allow_fragments=True)
- base: the base URL
- url: the URL to join with the base URL
- allow_fragments: whether to handle the fragment separately
from urllib.parse import urljoin

print(urljoin('https://www.baidu.com', 'FAQ.html'))
# https://www.baidu.com/FAQ.html
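If the second URL already contains a scheme or netloc, those parts take precedence and the base URL only fills in what is missing:

from urllib.parse import urljoin

print(urljoin('https://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
# https://cuiqingcai.com/FAQ.html  (the second URL is complete, so the base is ignored)
print(urljoin('https://www.baidu.com/about.html', '?category=2'))
# https://www.baidu.com/about.html?category=2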
urlencode
urlencode serializes parameters given as a dictionary into a query string (e.g. "name=germey&age=25")
from urllib.parse import urlencode

params = {
    'user': 'germey',
    'age': 25
}
base_url = 'https://www.baidu.com?'  # Note that the '?' has to be added manually
url = base_url + urlencode(params)
print(url)
# https://www.baidu.com?user=germey&age=25
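urlencode also accepts a sequence of key/value tuples, and with doseq=True a list value is expanded into repeated parameters:

from urllib.parse import urlencode

print(urlencode([('user', 'germey'), ('age', 25)]))
# user=germey&age=25
print(urlencode({'tag': ['python', 'spider']}, doseq=True))
# tag=python&tag=spider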
parse_qs
parse_qs deserializes the GET request parameter string and returns the parameters in dictionary form
from urllib.parse import parse_qs

query = 'name=mergey&age=25'
print(parse_qs(query))
# {'name': ['mergey'], 'age': ['25']}
parse_qsl
parse_qsl is similar to parse_qs, but returns a list of tuples instead
from urllib.parse import parse_qsl

query = 'name=mergey&age=25'
print(parse_qsl(query))
# [('name', 'mergey'), ('age', '25')]
quote
quote converts a string containing Chinese (or other non-ASCII) characters into a percent-encoded string that can be used as a URL parameter
from urllib.parse import quote

keyword = '高清壁纸'  # 'HD wallpapers' in Chinese
base_url = 'https://www.baidu.com/s?wd='
url = base_url + quote(keyword)
print(url)
# https://www.baidu.com/s?wd=%E9%AB%98%E6%B8%85%E5%A3%81%E7%BA%B8

# Under the hood, the string is encoded to UTF-8 bytes, each byte is converted to
# hexadecimal, and a '%' is placed in front of each byte
bs = bytes(keyword, encoding='utf-8')
b_list = []
for b in bs:
    b_list.append(hex(b)[-2:].upper())
b_str = '%' + '%'.join(b_list)
print(b_str)
# %E9%AB%98%E6%B8%85%E5%A3%81%E7%BA%B8
unquote
unquote has the opposite function to quote
from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E9%AB%98%E6%B8%85%E5%A3%81%E7%BA%B8'
print(unquote(url))
# https://www.baidu.com/s?wd=高清壁纸
Robots protocol
The Robots protocol, also called the crawler protocol, is formally known as the Robots Exclusion Protocol. It tells crawlers which pages may be crawled and which may not
It usually takes the form of a robots.txt file placed in the root directory of the website
robots.txt files generally have three types of entries:
- User-agent: the name of the crawler the rules apply to
- Disallow: a path that is not allowed to be crawled
- Allow: a path that is allowed to be crawled
Examples
- Disallow all crawlers from accessing any directory:
  User-agent: *
  Disallow: /
- Allow all crawlers to access all directories (or simply leave the robots.txt file empty):
  User-agent: *
  Disallow:
- Disallow all crawlers from accessing certain directories:
  User-agent: *
  Disallow: /private/
  Disallow: /tmp/
- Allow only one crawler to access all directories:
  User-agent: WebCrawler
  Disallow:
  User-agent: *
  Disallow: /
Common crawler names
Crawler name | Website |
---|---|
BaiduSpider | Baidu |
Googlebot | Google |
360Spider | 360 Search |
YodaoBot | Youdao |
ia_archiver | Alexa |
Scooter | AltaVista |
Bingbot | Bing |
Parsing the Robots protocol
The RobotFileParser class in the urllib.robotparser module can read and parse robots.txt files
The RobotFileParser class has several common methods:
- set_url: sets the URL of the robots.txt file; if the url parameter is passed when instantiating RobotFileParser, this method is not needed
- read: fetches and parses the robots.txt file; it must be called before any checks, otherwise all of them return False
- parse: parses the contents of a robots.txt file; the argument is an iterable of lines from the file
- can_fetch: determines whether the given user agent is allowed to fetch the given URL
- mtime: returns the time when the robots.txt file was last fetched and parsed
- modified: sets the time when the robots.txt file was last fetched and parsed to the current time
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://www.baidu.com/robots.txt')
parser.read()  # Read the file before checking
print(parser.can_fetch('Baiduspider', 'https://www.baidu.com'))
# True
print(parser.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))
# True
print(parser.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))
# False
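Instead of set_url and read, the robots.txt content can be fetched separately and fed to the parse method described above:

from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
content = urlopen('https://www.baidu.com/robots.txt').read().decode('utf-8')
parser.parse(content.split('\n'))  # parse takes an iterable of lines
print(parser.can_fetch('Baiduspider', 'https://www.baidu.com'))
# True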