Crawler Basics: urllib

Posted by 01hanstu on Sun, 06 Feb 2022 04:49:52 +0100

Structure of urllib Library

The urllib library contains the following four modules:

  • request: the basic HTTP request module
  • error: the exception handling module
  • parse: the URL-handling utility module
  • robotparser: parses robots.txt files to determine which pages a crawler may access

urlopen method

Simple requests can be sent using the urlopen method

API

urllib.request.urlopen(url, data=None, [timeout,] *, cafile=None, capath=None, cadefault=False, context=None)

  • url: the URL to request
  • data: the data to send with the request; if set, the request method becomes POST instead of GET
  • timeout: timeout in seconds; a URLError exception is raised on timeout
  • cafile: CA certificate file
  • capath: path to a directory of CA certificates
  • cadefault: deprecated; defaults to False
  • context: used to specify SSL settings; must be an ssl.SSLContext object

In addition, the urlopen method can also accept a Request object as a parameter, as described later
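
The context parameter mentioned above takes an ssl.SSLContext object. A minimal sketch, assuming you want to disable certificate verification (only suitable for testing):

import ssl
from urllib.request import urlopen

# Build an SSL context; verification is turned off here purely for illustration
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
resp = urlopen('https://www.python.org', context=context)
print(resp.status)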

Send GET request

from urllib.request import urlopen

url = 'https://www.python.org'
resp = urlopen(url=url)
print(resp.read().decode('utf-8'))  # The data returned by the read() method is bytes and needs to be decoded manually
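
The object returned by urlopen is an http.client.HTTPResponse, so besides read() you can also inspect the status code and response headers. A short sketch (not part of the original example):

from urllib.request import urlopen

resp = urlopen('https://www.python.org')
print(resp.status)               # Response status code, e.g. 200
print(resp.getheaders())         # All response headers as a list of tuples
print(resp.getheader('Server'))  # A single response header by name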

Send POST request

from urllib.request import urlopen
from urllib.parse import urlencode

url = 'https://www.httpbin.org/post'
data = {'name': 'germey'}
# Use urlencode to encode the data, then convert it from str to bytes
data = bytes(urlencode(data), encoding='utf-8')
# Once data is carried, the request method changes to POST
resp = urlopen(url=url, data=data)
print(resp.read().decode('utf-8'))

Handling timeouts

import socket
from urllib.request import urlopen
from urllib.error import URLError

url = 'https://www.httpbin.org/get'
try:
    resp = urlopen(url=url, timeout=0.1)  # timeout in seconds
    html = resp.read().decode('utf-8')
    print(html)
except URLError as e:  # The URLError exception is thrown when the timeout expires
    if isinstance(e.reason, socket.timeout):  # Determine the specific type of exception
        print('TIME OUT')

Request class

The Request class lets you attach more information to a request, such as headers and the request method

API

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

  • url: the URL to request
  • data: the data to send; must be of type bytes
  • headers: the request headers as a dictionary; headers can be passed via this parameter or added later with the Request object's add_header method (see the sketch after this list)
  • origin_req_host: the requester's host name or IP address
  • unverifiable: whether the request is unverifiable
  • method: the request method, e.g. 'GET' or 'POST'
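
As mentioned above, headers can also be added after the Request object is created, using its add_header method. A minimal sketch (the User-Agent string is arbitrary):

from urllib.request import Request, urlopen

req = Request('https://www.httpbin.org/get')
# Add a header after constructing the Request instead of passing a headers dict
req.add_header('User-Agent', 'Mozilla/5.0 (compatible; example-crawler)')
resp = urlopen(req)
print(resp.status)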

Usage

from urllib.request import Request
from urllib.request import urlopen
from urllib.parse import urlencode

url = 'https://www.httpbin.org/post'
data = bytes(urlencode({'name': 'germey'}), encoding='utf-8')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36',
    'host': 'www.httpbin.org',
}
req = Request(url=url, data=data, headers=headers, method='POST')
resp = urlopen(req)  # Still use urlopen to send the Request, and pass in the Request object as a parameter
print(resp.read().decode('utf-8'))

Using Handler

Handlers can deal with special situations in the request process, such as login authentication, cookies, and proxies

The base class urllib.request.BaseHandler provides the most basic methods, such as default_open and protocol_request

Various Handler subclasses inherit BaseHandler to handle various situations:

  • HTTPDefaultErrorHandler: handles HTTP error responses by raising an HTTPError exception
  • HTTPRedirectHandler: handles redirects
  • HTTPCookieProcessor: handles cookies
  • ProxyHandler: sets a proxy; by default no proxy is set
  • HTTPPasswordMgr: manages passwords; maintains a table of user names and passwords
  • HTTPBasicAuthHandler: handles basic authentication
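
To illustrate the BaseHandler hooks mentioned above, a custom handler can define an http_request method that pre-processes every outgoing request. A minimal sketch (the handler class and User-Agent value are made up for illustration):

from urllib.request import BaseHandler, build_opener

class UserAgentHandler(BaseHandler):
    """Attach a User-Agent header to every HTTP/HTTPS request."""
    def http_request(self, req):
        req.add_header('User-Agent', 'MyCrawler/1.0')  # hypothetical User-Agent
        return req
    https_request = http_request  # reuse the same pre-processing for HTTPS

opener = build_opener(UserAgentHandler())
resp = opener.open('https://www.httpbin.org/get')
print(resp.read().decode('utf-8'))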

Handling login authentication

from urllib.request import HTTPPasswordMgrWithDefaultRealm
from urllib.request import HTTPBasicAuthHandler
from urllib.request import build_opener
from urllib.error import URLError

url = 'https://ssr3.scrape.center/'
username = 'admin'
password = 'admin'

pwd_mgr = HTTPPasswordMgrWithDefaultRealm()  # Create password manager instance
pwd_mgr.add_password(None, url, username, password)  # Add user name and password

auth_handler = HTTPBasicAuthHandler(pwd_mgr)  # Use the password manager object to create an authentication processor object
opener = build_opener(auth_handler)  # Build an opener with the auth handler; opener.open sends requests just like urlopen

try:  # Requests sent with opener.open carry the credentials configured above
    resp = opener.open(url)
    html = resp.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Handling proxies

from urllib.error import URLError
from urllib.request import ProxyHandler
from urllib.request import build_opener

url = 'https://www.baidu.com'
proxy_handler = ProxyHandler({  # Create a proxy handler
    'http': 'http://118.190.244.234:3128',
    'https': 'https://118.190.244.234:3128'
})
opener = build_opener(proxy_handler)  # Create opener
try:  # Send request
    resp = opener.open(url)
    html = resp.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Handling cookies

# Print cookies directly
import http.cookiejar
import urllib.request

url = 'https://www.baidu.com'

cookies = http.cookiejar.CookieJar()  # Create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookies)  # Create a handler object using the CookieJar object
opener = urllib.request.build_opener(handler)  # Create opener
resp = opener.open(url)
for cookie in cookies:  # Cookie information can be obtained through the cookie jar object, which is similar to a list
    # Get the name and value attributes of the Cookie object
    print(cookie.name, '=', cookie.value)
# Write Mozilla format Cookie information to a file
import http.cookiejar
import urllib.request

url = 'https://www.baidu.com'

filename = 'bd_m_cookie.txt'  # The name of the file where you want to save the Cookie information
# MozillaCookieJar can read and write cookie files in the Mozilla/Netscape format
cookies = http.cookiejar.MozillaCookieJar(filename=filename)
handler = urllib.request.HTTPCookieProcessor(cookiejar=cookies)
opener = urllib.request.build_opener(handler)
resp = opener.open(url)
# Save Cookie information into file
cookies.save(ignore_discard=True, ignore_expires=True)

"""Document content
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	1675640364	BAIDUID	48B3F4D3CCDDB7205C471C7941363BCE:FG=1
.baidu.com	TRUE	/	FALSE	3791588011	BIDUPSID	48B3F4D3CCDDB72072B89C5EEAF3C1AE
.baidu.com	TRUE	/	FALSE	3791588011	PSTM	1644104364
www.baidu.com	FALSE	/	FALSE	1644104664	BD_NOT_HTTPS	1
"""
# Write Cookie information in LWP format to file
import http.cookiejar
import urllib.request

url = 'https://www.baidu.com'
filename = 'bd_lwp_cookie.txt'
# Just switch to LWPCookieJar here; everything else is the same
cookies = http.cookiejar.LWPCookieJar(filename=filename)
handler = urllib.request.HTTPCookieProcessor(cookiejar=cookies)
opener = urllib.request.build_opener(handler)
resp = opener.open(url)
cookies.save(ignore_expires=True, ignore_discard=True)

"""Document content
#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="519E24A62494ECF40B4A6244CFFA07C3:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2023-02-06 00:13:16Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=519E24A62494ECF45DB636DC550D8CA7; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2090-02-24 03:27:23Z"; version=0
Set-Cookie3: PSTM=1644106396; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2090-02-24 03:27:23Z"; version=0
Set-Cookie3: BD_NOT_HTTPS=1; path="/"; domain="www.baidu.com"; path_spec; expires="2022-02-06 00:18:16Z"; version=0
"""
# Read Cookie information from file
import urllib.request
import http.cookiejar

url = 'https://www.baidu.com'
filename = 'bd_lwp_cookie.txt'
cookies = http.cookiejar.LWPCookieJar()
# Load previously saved files
cookies.load(filename=filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookiejar=cookies)
opener = urllib.request.build_opener(handler)
resp = opener.open(url)
html = resp.read().decode('utf-8')
print(html)

Handling exceptions

URLError class

The URLError class inherits from OSError and is the base exception class of the urllib.error module

Any exception raised while sending a request with urllib can be caught as a URLError

URLError has a reason attribute indicating the cause of the error

reason may be a string or an error object (for example, a <class 'socket.timeout'> object when a timeout occurs)
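
A minimal sketch of catching URLError and inspecting reason (the host name below is made up and assumed to be unreachable):

from urllib.request import urlopen
from urllib.error import URLError

try:
    urlopen('https://nonexistent.example.invalid/')  # hypothetical unreachable host
except URLError as e:
    print(type(e.reason))  # e.g. a socket.gaierror for a DNS failure
    print(e.reason)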

HTTPError class

The HTTPError class is a subclass of URLError and deals specifically with HTTP request errors

Contains three attributes:

  • code: the response status code
  • reason: the cause of the error; may be a string or an object
  • headers: the response headers

Example

import urllib.request
from urllib.error import URLError, HTTPError

url = 'https://cuiqingcai.com/404'

try:
    resp = urllib.request.urlopen(url, timeout=1)
    html = resp.read().decode('utf-8')
    print(html)
except HTTPError as e:
    print(e.reason, e.headers, e.url, e.fp, e.code, sep='\n')
except URLError as e:
    print(type(e.reason), '\n', e.reason)
else:
    print('success')

"""
Not Found
Server: GitHub.com
Content-Type: text/html; charset=utf-8
Access-Control-Allow-Origin: *
ETag: "60789243-247b"
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; img-src data:; connect-src 'self'
x-proxy-cache: MISS
X-GitHub-Request-Id: E15A:6107:132CB29:158E796:61FF1AA9
Accept-Ranges: bytes
Date: Sun, 06 Feb 2022 00:55:58 GMT
Via: 1.1 varnish
Age: 501
X-Served-By: cache-hkg17931-HKG
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1644108959.779112,VS0,VE1
Vary: Accept-Encoding
X-Fastly-Request-ID: cce2ac7f081b0d937fe93e90656fce56b5e6cc03
X-Cache-Lookup: Cache Miss
X-Cache-Lookup: Cache Miss
X-Cache-Lookup: Cache Miss
Content-Length: 9339
X-NWS-LOG-UUID: 17106243714350687226
Connection: close
X-Cache-Lookup: Cache Miss


https://cuiqingcai.com/404
<http.client.HTTPResponse object at 0x0000019F340B1B80>
404
"""

Common functions in the parse module

The urllib.parse module provides many functions for handling URLs

urlparse

urllib.parse.urlparse(url, scheme='', allow_fragments=True)

  • url: the URL to parse
  • scheme: the default scheme (protocol), used if the URL does not contain one
  • allow_fragments: whether to split out the fragment; if False, the fragment stays attached to the preceding part (see the sketch after the example below)
# Use urlparse to split a URL into its components
from urllib.parse import urlparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
result = urlparse(url=url, scheme='https', allow_fragments=True)
print(result)  # The return value is similar to a named tuple. You can use index value or attribute value
# ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
print(result.scheme)  # Use attribute value
# https
print(result[1])  # Use index value
# www.baidu.com
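
A short sketch of the allow_fragments=False case mentioned above: the fragment is not split out and stays attached to the query (or to the path, if there is no query):

from urllib.parse import urlparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
result = urlparse(url=url, allow_fragments=False)
print(result)
# ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')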

urlunparse

The opposite of urlparse: urlunparse assembles the parts of a URL into a complete URL

urllib.parse.urlunparse(components)

  • components: an iterable of URL components with a fixed length of 6
# Assemble a URL with urlunparse
from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
# https://www.baidu.com/index.html;user?a=6#comment

urlsplit

urlsplit is similar to urlparse, except that params is not split out separately but kept as part of the path

urllib.parse.urlsplit(url, scheme='', allow_fragments=True)

# Split a URL with urlsplit
from urllib.parse import urlsplit

url = 'https://www.baidu.com/index.html;user?id=5#comment'
print(urlsplit(url))
# SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

urlunsplit

urlunsplit is similar to urlunparse, except that the iterable of components passed in must have length 5

# Assemble a URL with urlunsplit
from urllib.parse import urlunsplit

data = ['https', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))
# https://www.baidu.com/index.html?a=6#comment

urljoin

urljoin takes a base URL and another URL (usually a relative one); it analyzes the scheme, netloc and path of the base URL, joins them with the second URL, and returns a complete URL

urllib.parse.urljoin(base, url, allow_fragments=True)

  • base: the base URL
  • url: the URL (usually relative) to join with the base URL
  • allow_fragments: whether to split out the fragment separately
from urllib.parse import urljoin

print(urljoin('https://www.baidu.com', 'FAQ.html'))
# https://www.baidu.com/FAQ.html
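
A few more cases illustrating the standard behavior of urljoin:

from urllib.parse import urljoin

# A relative path replaces the last path segment of the base URL
print(urljoin('https://www.baidu.com/about.html', 'FAQ.html'))
# https://www.baidu.com/FAQ.html

# If the second URL is absolute, it takes precedence over the base URL
print(urljoin('https://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
# https://cuiqingcai.com/FAQ.html

# scheme, netloc and path of the base URL are kept when the second URL lacks them
print(urljoin('https://www.baidu.com/about.html', '?category=2'))
# https://www.baidu.com/about.html?category=2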

urlencode

urlencode serializes a dictionary of parameters into a query string (e.g. "name=germey&age=25")

from urllib.parse import urlencode

params = {
    'user': 'germey',
    'age': 25
}
base_url = 'https://www.baidu.com?'  # Note that the trailing '?' has to be added yourself
url = base_url + urlencode(params)
print(url)
# https://www.baidu.com?user=germey&age=25

parse_qs

parse_qs deserializes the GET request parameter string and returns the parameters in dictionary form

from urllib.parse import parse_qs

query = 'name=mergey&age=25'
print(parse_qs(query))
# {'name': ['mergey'], 'age': ['25']}

parse_qsl

parse_qsl is similar to parse_qs, but returns a list of tuples instead

from urllib.parse import parse_qsl

query = 'name=mergey&age=25'
print(parse_qsl(query))
# [('name', 'mergey'), ('age', '25')]

quote

quote percent-encodes a string containing non-ASCII characters (such as Chinese) so that it can be used as a parameter in a URL

from urllib.parse import quote

keyword = '高清壁纸'  # 'HD wallpapers' in Chinese
base_url = 'https://www.baidu.com/s?wd='

url = base_url + quote(keyword)
print(url)
# https://www.baidu.com/s?wd=%E9%AB%98%E6%B8%85%E5%A3%81%E7%BA%B8

# Under the hood: the string is encoded to UTF-8 bytes, each byte is converted to hex, and a '%' is prepended to each
bs = bytes(keyword, encoding='utf-8')
b_list = []
for b in bs:
    b_list.append(hex(b)[-2:].upper())  # last two hex digits of the byte (all bytes here are >= 0x80)
b_str = '%' + '%'.join(b_list)
print(b_str)
# %E9%AB%98%E6%B8%85%E5%A3%81%E7%BA%B8

unquote

unquote has the opposite function to quote

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E9%AB%98%E6%B8%85%E5%A3%81%E7%BA%B8'
print(unquote(url))
# https://www.baidu.com/s?wd=高清壁纸

Robots protocol

The Robots protocol, also called the crawler protocol (full name: Robots Exclusion Protocol), tells crawlers which pages may be crawled and which may not

It usually takes the form of a file named robots.txt placed in the root directory of the website

robots.txt files generally have three types of entries:

  • User-agent: the name of the crawler the rules apply to
  • Disallow: a path that is not allowed to be crawled
  • Allow: a path that is allowed to be crawled

Examples

  • Disable all crawlers from accessing all directories
    User-agent: *
    Disallow: /
  • Allow all crawlers to access all directories (or leave the robots.txt file blank)
    User-agent: *
    Disallow:
  • Disable all crawlers from accessing certain directories
    User-agent: *
    Disallow: /private/
    Disallow: /tmp/
  • Only one crawler is allowed to access all directories
    User-agent: WebCrawler
    Disallow:
    User-agent: *
    Disallow: /

Common crawler names

Crawler name    Website
BaiduSpider     Baidu
Googlebot       Google
360Spider       360 Search
YodaoBot        Youdao
ia_archiver     Alexa
Scooter         AltaVista
Bingbot         Bing

Parsing Robots protocol

The RobotFileParser class in the urllib.robotparser module can read and parse robots.txt files

The RobotFileParser class has several common methods:

  • set_url: sets the URL of the robots.txt file; not needed if the URL was passed when instantiating RobotFileParser
  • read: reads and parses the robots.txt file; it must be called before any query, otherwise can_fetch always returns False
  • parse: parses the contents of a robots.txt file; the argument is a list of lines from the file (see the sketch after the example below)
  • can_fetch: returns whether the given user agent is allowed to fetch the given URL
  • mtime: returns the time the robots.txt file was last fetched and parsed
  • modified: sets the last-parsed time to the current time
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://www.baidu.com/robots.txt')
parser.read()  # Read before parsing
print(parser.can_fetch('Baiduspider', 'https://www.baidu.com'))  # True
print(parser.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))  # True
print(parser.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))  # False
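
As an alternative to set_url and read, the parse method mentioned above can be fed the lines of a robots.txt file fetched separately. A minimal sketch:

from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Fetch robots.txt manually and pass its lines to parse()
lines = urlopen('https://www.baidu.com/robots.txt').read().decode('utf-8').split('\n')
parser.parse(lines)
print(parser.can_fetch('Baiduspider', 'https://www.baidu.com'))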

Topics: Python crawler