1-20 Crawler: the urllib module

Posted by clanstyles on Wed, 06 May 2020 02:06:44 +0200

Using cookies with urllib:

If you already know the cookie, or you obtained it by capturing packets, you can put it directly in the request headers and the request is treated as logged in.
The cookie information JD (Jingdong) returns before login is different from what it returns after login,
so you can log in to JD, capture the cookie information, and then visit any page on the site with it.

import urllib.request
url = "http://www.jd.com"
header = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
          "cookie": "xxxxx"}
req = urllib.request.Request(url=url, headers=header)
res = urllib.request.urlopen(req)
text = res.read()

urllib's cookie-related classes

In Python 2 the cookie module is called cookielib: import cookielib
In Python 3 the cookie module is called http.cookiejar: import http.cookiejar
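
For a script that has to run under both interpreters, a compatibility import is one option. This is only a minimal sketch, and the alias `cookiejar_module` is a name chosen for illustration:

```python
try:
    # Python 3: the module lives inside the http package
    import http.cookiejar as cookiejar_module
except ImportError:
    # Python 2: the same classes are provided by cookielib
    import cookielib as cookiejar_module

# Both modules expose CookieJar, FileCookieJar and MozillaCookieJar
jar = cookiejar_module.CookieJar()
print(len(jar))  # number of cookies currently held by the jar
```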

The concept of opener

When you fetch a URL you use an opener (an instance of urllib.request.OpenerDirector, which was urllib2.OpenerDirector in Python 2). So far we have always used the default opener through urlopen.

urlopen can be understood as a special, ready-made opener instance. The parameters usually passed to it are just url, data and timeout.
If we need to use cookies, this default opener is not enough, so we need to build a more general opener that can handle cookies.
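
As a rough sketch of this idea, the snippet below builds such an opener without sending any request. The `addheaders` value and the call to `install_opener` are optional extras shown for illustration, not steps the text above requires:

```python
import urllib.request
import http.cookiejar

# A jar to hold cookies, and a handler that reads and updates it on each request
cookiejar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookiejar)

# build_opener returns an OpenerDirector: the default handlers plus ours
opener = urllib.request.build_opener(handler)
print(isinstance(opener, urllib.request.OpenerDirector))

# Optional: headers sent with every request made through this opener
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# Optional: make this opener the one that urllib.request.urlopen uses
urllib.request.install_opener(opener)
```

After `install_opener`, plain `urllib.request.urlopen(url)` calls go through the cookie-aware opener as well. Without it, only `opener.open(...)` uses the jar.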

Printing the cookie object to the terminal

import urllib.request
import http.cookiejar

url = "http://www.hao123.com"
req = urllib.request.Request(url)
cookiejar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookiejar)
opener = urllib.request.build_opener(handler)
r = opener.open(req)
print(cookiejar)
<CookieJar[<Cookie BAIDUID=93B415355E0704B2BC94B5D514468898:FG=1 for .hao123.com/>, <Cookie hz=0 for .www.hao123.com/>, <Cookie ft=1 for www.hao123.com/>, <Cookie v_pg=normal for www.hao123.com/>]>
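
The jar printed above is iterable, and each item is an http.cookiejar.Cookie with attributes such as name, value, domain and path. The sketch below shows one way to read them. So that it runs offline, it fills the jar by hand through a hypothetical helper, `make_cookie`, instead of contacting hao123:

```python
import http.cookiejar


def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand for this demo."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cookiejar = http.cookiejar.CookieJar()
cookiejar.set_cookie(make_cookie("hz", "0", ".www.hao123.com"))
cookiejar.set_cookie(make_cookie("ft", "1", "www.hao123.com"))

# Each item in the jar is a Cookie object
for c in cookiejar:
    print(c.name, c.value, c.domain, c.path)

# Collect the cookies into a plain dict keyed by cookie name
as_dict = {c.name: c.value for c in cookiejar}
```

With a jar filled by `opener.open(req)`, the same loop works unchanged.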

Saving cookies to a file:

import urllib.request
import http.cookiejar

url = "http://www.hao123.com"
req = urllib.request.Request(url)

cookieFileName = "cookie.txt"
cookiejar = http.cookiejar.MozillaCookieJar(cookieFileName)  # cookie jar backed by a file
handler = urllib.request.HTTPCookieProcessor(cookiejar)
opener = urllib.request.build_opener(handler)
r = opener.open(req)
print(cookiejar)
cookiejar.save()  # written to cookie.txt; session cookies are skipped unless ignore_discard=True is passed



MozillaCookieJar inherits from FileCookieJar, which inherits from CookieJar.
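
The inheritance chain can be checked directly, and a save/load round trip shows what the ignore_discard flag changes. This is a minimal offline sketch: the temporary path and the hand-built cookie are assumptions made for the demo, not part of the original example:

```python
import http.cookiejar
import os
import tempfile

# The class hierarchy described above
print(issubclass(http.cookiejar.MozillaCookieJar, http.cookiejar.FileCookieJar))
print(issubclass(http.cookiejar.FileCookieJar, http.cookiejar.CookieJar))

# A throwaway file so the demo does not touch a real cookie.txt
path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

jar = http.cookiejar.MozillaCookieJar(path)
# A session cookie: no expiry time, marked to be discarded at session end
jar.set_cookie(http.cookiejar.Cookie(
    version=0, name="ft", value="1",
    port=None, port_specified=False,
    domain="www.hao123.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))

# Default save() leaves session cookies out of the file
jar.save()
loaded_default = http.cookiejar.MozillaCookieJar(path)
loaded_default.load()

# With ignore_discard=True the session cookie is written and read back
jar.save(ignore_discard=True)
loaded = http.cookiejar.MozillaCookieJar(path)
loaded.load(ignore_discard=True)

print(len(loaded_default), len(loaded))
```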

Reading cookie information from a file and using it for a request:
import urllib.request
import http.cookiejar
cookie_filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(cookie_filename)
cookie.load(cookie_filename, ignore_discard=True, ignore_expires=True)
print(cookie)
url = "http://www.hao123.com"
req = urllib.request.Request(url)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)  # build an opener with urllib.request.build_opener
response = opener.open(req)

print(response.read().decode("utf-8"))  # decode the bytes as UTF-8 to avoid garbled text
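
Hard-coding "utf-8" only works when the page really is UTF-8. One possible refinement, sketched here under that assumption, is to read the charset from the Content-Type header and fall back to UTF-8. The helper `decode_body` is hypothetical, and the demo uses a stand-in headers object so that it runs offline:

```python
import email.message


def decode_body(raw_bytes, headers, fallback="utf-8"):
    """Hypothetical helper: decode a response body using the declared charset."""
    # get_content_charset() returns None when the header names no charset
    charset = headers.get_content_charset() or fallback
    # errors="replace" keeps going if a few bytes do not match the charset
    return raw_bytes.decode(charset, errors="replace")


# Offline demo: a stand-in for response.headers
headers = email.message.Message()
headers["Content-Type"] = "text/html; charset=gbk"
text = decode_body("你好".encode("gbk"), headers)
print(text)
```

With a live response this could be called as `decode_body(response.read(), response.headers)`, since the headers object returned by urllib provides the same `get_content_charset()` method.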
