Using and mastering the basics of Python's urllib library for web crawling

Posted by lily on Tue, 04 Feb 2020 14:14:01 +0100

The four modules of urllib

urllib.request
urllib.error
urllib.parse
urllib.robotparser

Get page source code

import urllib.request
response=urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode('utf-8'))
#Get Baidu's source code

POST request

import urllib.parse
import urllib.request
data=bytes(urllib.parse.urlencode({"name":"hello"}),encoding='utf-8')
response=urllib.request.urlopen("http://httpbin.org/post",data=data)
print(response.read().decode('utf-8'))

The data argument here carries the parameters to pass. It must be byte-stream content (i.e. bytes type), so the dictionary is first converted with the bytes() method; once data is supplied, the request method becomes POST.

Timeout tests

import urllib.request
import urllib.error
import socket
try:
	response=urllib.request.urlopen("http://httpbin.org/get",timeout=0.1)
except urllib.error.URLError as e:
	if isinstance(e.reason,socket.timeout):
		print("TIME OUT")

The timeout parameter is passed in here; if the server does not respond within 0.1 seconds, a URLError is raised whose reason is a socket.timeout.

Response

1. Response type

import urllib.request
response=urllib.request.urlopen("http://httpbin.org/get")
print(type(response))

The returned result is: <class 'http.client.HTTPResponse'>

2. Status code
3. Response headers
4. Response body

import urllib.request
response=urllib.request.urlopen("http://www.python.org")
print(response.status)#Response status code
print(response.getheaders())#Get header information
print(response.getheader('Server'))

Request with multiple parameters

from urllib import request,parse
url='http://httpbin.org/post'
headers={
	'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
	'Host':'httpbin.org'
}
dict={
	'name':'Germey'
}
data=bytes(parse.urlencode(dict),encoding='utf-8')
req=request.Request(url=url,data=data,headers=headers,method='POST')
response=request.urlopen(req)
print(response.read().decode('utf-8'))
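
Headers do not have to be passed at construction time; a Request object also supports add_header(). A minimal sketch of the same request built that way:

from urllib import request,parse
url='http://httpbin.org/post'
data=bytes(parse.urlencode({'name':'Germey'}),encoding='utf-8')
req=request.Request(url=url,data=data,method='POST')
# Add headers one by one instead of passing a dict
req.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
req.add_header('Host','httpbin.org')
response=request.urlopen(req)
print(response.read().decode('utf-8'))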

Advanced usage: Handler

Some requests need cookies or proxy settings, and Handlers are what urllib provides for that.

import urllib.request
proxy_handler=urllib.request.ProxyHandler({
	'http':'http://127.0.0.1:9743',
	'https':'https://127.0.0.1:9743'
})
opener=urllib.request.build_opener(proxy_handler)
response=opener.open('http://httpbin.org/get')
print(response.read())
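
Proxies are only one kind of Handler; HTTP basic authentication can be handled the same way. A minimal sketch using HTTPBasicAuthHandler, assuming a hypothetical local server that requires a username and password:

import urllib.error
import urllib.request

# Hypothetical URL and credentials, for illustration only
url='http://localhost:5000/'
username,password='admin','admin'

# Register the credentials with a password manager
p=urllib.request.HTTPPasswordMgrWithDefaultRealm()
p.add_password(None,url,username,password)
auth_handler=urllib.request.HTTPBasicAuthHandler(p)
opener=urllib.request.build_opener(auth_handler)
try:
	result=opener.open(url)
	print(result.read().decode('utf-8'))
except urllib.error.URLError as e:
	print(e.reason)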

Cookies

import http.cookiejar,urllib.request
cookie=http.cookiejar.CookieJar()
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value)
# Declare a CookieJar object, use HTTPCookieProcessor to build a Handler, then use build_opener to build the opener and call its open method
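
Cookies can also be saved to a file and reused later. A minimal sketch using the MozillaCookieJar subclass (the filename cookies.txt is just an example):

import http.cookiejar,urllib.request
filename='cookies.txt' # example filename
cookie=http.cookiejar.MozillaCookieJar(filename)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
# Persist the cookies in Mozilla format, keeping session and expired cookies too
cookie.save(ignore_discard=True,ignore_expires=True)

Loading them back works symmetrically with cookie.load(filename, ignore_discard=True, ignore_expires=True).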

Exception handling

from urllib import request,error
try:
	response=request.urlopen('http://cuiqingcai.com/index.html')
except error.HTTPError as e:
	print(e.reason,e.code,e.headers,sep='\n') #HTTPError is a subclass of URLError, so catch it first
except error.URLError as e:
	print(e.reason)
else:
	print('request successful')

URL analysis:

Parts of a URL:

What parts does a URL contain? Take "http://www.baidu.com/index.html?name=mo&age=25#dowell" as an example; it can be divided into six parts:

1. Transmission protocol: http, https

2. Domain name: for example, in www.baidu.com, baidu.com is the primary domain name and www is the server (host) name

3. Port: if it is not filled in, port 80 is used by default for HTTP

4. Path: e.g. /path1/path1.2 in http://www.baidu.com/path1/path1.2; / represents the root directory

5. Parameters carried: ?name=mo&age=25

6. Hash value (anchor): #dowell

--------
Original link: https://blog.csdn.net/qq_/article/details/83689928
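
As a quick check, these parts can be recovered programmatically with urllib.parse.urlparse (introduced next):

from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html?name=mo&age=25#dowell')
print(result.scheme)   #http (transmission protocol)
print(result.hostname) #www.baidu.com (domain name)
print(result.port)     #None (no explicit port, so 80 is implied)
print(result.path)     #/index.html (path)
print(result.query)    #name=mo&age=25 (parameters)
print(result.fragment) #dowell (hash value)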

urlparse: splitting a URL

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: the URL to be parsed
scheme='': the default protocol to assume if the URL itself carries none; if the URL already has a protocol, this parameter has no effect
allow_fragments=True: whether to parse the anchor. True (the default) keeps the fragment as its own component; False ignores it, merging it into the preceding component

from urllib.parse import urlparse
result=urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)
#Print result: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
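
To see the effect of allow_fragments, parsing the same kind of URL with allow_fragments=False merges the anchor into the query:

from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)
print(result)
#Print result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')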

urlunparse (combination)

from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))
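#Print result: http://www.baidu.com/index.html;user?a=6#comment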

urljoin

For the urljoin method, we provide a base_url (basic link) as the first parameter and a new link as the second parameter. The method analyzes the scheme, netloc and path of base_url to supplement the missing parts of the new link, and returns the result.

from urllib.parse import urljoin
print(urljoin("http://www.baidu.com","FAQ.html"))
print(urljoin("http://www.baidu.com","https://cuiqinghua.com/FAQ.html"))
print(urljoin("http://www.baidu.com","?category=2"))
#Print results
#http://www.baidu.com/FAQ.html
#https://cuiqinghua.com/FAQ.html
#http://www.baidu.com?category=2
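
One more case worth noting: when the base URL already carries a path, a relative link replaces its last segment (a small illustrative example):

from urllib.parse import urljoin
print(urljoin("http://www.baidu.com/about.html","FAQ.html"))
#http://www.baidu.com/FAQ.html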

urlencode converts a dictionary object into URL query parameters

from urllib.parse import urlencode
params={
	'name':'germey',
	'age':22
	}
base_url='http://www.baidu.com?'
url=base_url+urlencode(params)
print(url)
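#Print result: http://www.baidu.com?name=germey&age=22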

quote

This method percent-encodes content into URL format. When a URL carries Chinese parameters, leaving them unencoded may cause garbled requests, so quote is used to encode them.

from urllib.parse import quote
keyword="壁纸" #"wallpaper"
url="https://www.baidu.com/s?wd="+quote(keyword)
print(url)
#https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

unquote

This method can decode the url

from urllib.parse import unquote
url='https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))
#https://www.baidu.com/s?wd=壁纸

