Python crawler learning 02

Posted by TheIrishThug on Tue, 01 Feb 2022 22:45:16 +0100

Crawler instance and requests module

requests is built on top of urllib3.
requests wraps that lower-level library in a simpler API, which makes it easy to use.
requests also transparently decompresses gzip-compressed response content for us.
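
To make the decompression step concrete, here is a minimal offline sketch using the standard-library gzip module. It only illustrates what "decompressing a gzip response body" means. It is not how requests implements it internally, and the sample HTML string is made up for the example.

```python
import gzip

# A made-up response body, standing in for what a server would send.
original_body = "<html><title>demo</title></html>".encode("utf-8")

# What a server does when it answers with Content-Encoding: gzip.
compressed_body = gzip.compress(original_body)

# What a client has to undo before the HTML is readable.
# requests performs this step for us behind the scenes.
restored_body = gzip.decompress(compressed_body)

print(compressed_body[:2])  # gzip data starts with the magic bytes 1f 8b
print(restored_body.decode())
```
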
Simple application of requests

import requests
r = requests.get('http://www.baidu.com')

print(r)
print(r.status_code)

# <Response [200]> means the status code of the response is 200, i.e. the request succeeded
# 200

Define a variable r, then use the requests.get() method to fetch Baidu's home page. get is the request method. (Common request methods for websites: GET and POST.)
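
The difference between the two request methods can be seen without touching the network by preparing requests and inspecting them. This is only a sketch: the example.com URL and the q parameter are placeholders chosen for illustration, not something from this post.

```python
import requests

# Hypothetical URL and parameter name, used only for illustration.
url = "http://example.com/search"

# GET: parameters travel in the URL's query string.
get_request = requests.Request("GET", url, params={"q": "python"}).prepare()
print(get_request.method, get_request.url)

# POST: parameters travel in the request body instead.
post_request = requests.Request("POST", url, data={"q": "python"}).prepare()
print(post_request.method, post_request.url, post_request.body)
```

In everyday code the shortcuts requests.get(url, params=...) and requests.post(url, data=...) build and send the same kind of request in one call.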

Common status codes:
200: success
302: temporary redirect to a new URL (request redirection)
307: temporary redirect to a new URL (the request method is kept unchanged)
404: page not found
500: internal server error
503: service unavailable; when crawling, this often means the site's anti-crawler protection has kicked in
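
As a quick reference, the standard library already knows the official reason phrase for each of these codes. The helper name describe_status below is made up for this sketch.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase


for code in (200, 302, 307, 404, 500, 503):
    print(code, describe_status(code))
```

With a real response you could call describe_status(r.status_code). requests also offers r.raise_for_status(), which raises an exception for 4xx and 5xx responses.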

print(r.text)
#<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é"</title></head>

This returns a string of HTML, the front-end code of Baidu's home page, but the Chinese in it is garbled. This is an encoding problem: the server sends bytes in one encoding and requests decodes them with another. When the two disagree, the text comes out garbled, so we need to specify the encoding explicitly.
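
The effect can be reproduced offline. The sketch below assumes the page is UTF-8 and the client wrongly decodes it as ISO-8859-1, which is the fallback requests is documented to use for text responses whose headers declare no charset. Whether that is exactly what happened with Baidu here is an assumption.

```python
title = "百度一下，你就知道"

# The bytes a server would send for a UTF-8 page.
raw = title.encode("utf-8")

# Decoding with the wrong encoding does not fail, it just yields nonsense.
garbled = raw.decode("iso-8859-1")

# Decoding with the right encoding restores the text.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```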

r.encoding = 'gbk'
print(r.text)
# The <title> text is still garbled, just in a different way
The text turns into another kind of garbled text, because the page is not GBK-encoded.


r.encoding = 'utf-8'
print(r.text)
# <title>百度一下，你就知道</title>
The Chinese now displays correctly.

You can also simplify the code: use content (the raw bytes) and call decode() on it, which defaults to utf-8.
print(r.content.decode())
# <title>百度一下，你就知道</title>
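
A small offline check of both points, using the page title as stand-in bytes rather than a live response: decode() with no argument behaves like decode("utf-8"), and forcing GBK onto UTF-8 bytes produces wrong characters.

```python
raw = "百度一下，你就知道".encode("utf-8")

# decode() with no argument uses UTF-8.
default_decoded = raw.decode()
explicit_decoded = raw.decode("utf-8")

# Forcing GBK on UTF-8 bytes gives wrong characters; errors="replace"
# keeps the call from raising on byte sequences that are invalid in GBK.
gbk_decoded = raw.decode("gbk", errors="replace")

print(default_decoded == explicit_decoded)
print(repr(gbk_decoded))
```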

Save web page picture

import requests

image_url = 'https://image.baidu.com/search/detail?ct=503316480&z=0&ipn=d&word=tupian1&step_word=&hs=2&pn=1&spn=0&di=660&pi=0&rn=1&tn=baiduimagedetail&is=0%2C0&istype=0&ie=utf-8&oe=utf-8&in=&cl=2&lm=-1&st=undefined&cs=3815466111%2C1684641624&os=4137808727%2C3345370830&simid=0%2C0&adpicid=0&lpn=0&ln=785&fr=&fmq=1622687818028_R&fm=&ic=undefined&s=undefined&hd=undefined&latest=undefined&copyright=undefined&se=&sme=&tab=0&width=undefined&height=undefined&face=undefined&ist=&jit=&cg=&bdtype=0&oriquery=&objurl=https%3A%2F%2Fgimg2.baidu.com%2Fimage_search%2Fsrc%3Dhttp%3A%2F%2Fb-ssl.duitang.com%2Fuploads%2Fitem%2F201504%2F09%2F20150409H2449_KLesa.thumb.700_0.jpeg%26refer%3Dhttp%3A%2F%2Fb-ssl.duitang.com%26app%3D2002%26size%3Df9999%2C10000%26q%3Da80%26n%3D0%26g%3D0n%26fmt%3Djpeg%3Fsec%3D1625279823%26t%3D71a1f72cb8b25e5f22356998128bfe34&fromurl=ippr_z2C%24qAzdH3FAzdH3Fooo_z%26e3B17tpwg2_z%26e3Bv54AzdH3Fks52AzdH3F%3Ft1%3Dm8m9bn8n9&gsm=2&rpstart=0&rpnum=0&islist=&querylist=&nojc=undefined'
response = requests.get(image_url)
print(response.content)

# Save the picture locally
# File operation syntax: with open()
with open('image.jpg','wb') as f:
    f.write(response.content)

Use requests to download a picture from a web page and save it locally.
First, import the requests module and define a variable holding the picture's link. Then request the picture with requests.get() and print its content. There is no need to decode the content here, because a picture is binary data rather than encoded text.
The output is a long run of binary data (a bytes object).
Save pictures locally
File operation syntax: with open()
There are two parameters in the parentheses. The first is the local file name to save to (image.jpg).
The second is the file access mode: w opens the file for writing and creates it if it does not exist; wb does the same in binary mode.
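
The write step can be tried without any download. In the sketch below the bytes are made up stand-ins for response.content, and the helper names save_bytes and looks_like_jpeg are hypothetical. The signature check is worth having because if a URL points at an HTML page rather than at the image file itself, the saved file will not open as a picture.

```python
import os
import tempfile

JPEG_SIGNATURE = b"\xff\xd8\xff"  # first bytes of a JPEG file


def looks_like_jpeg(data: bytes) -> bool:
    """Cheap check that downloaded bytes are a JPEG, not an HTML page."""
    return data.startswith(JPEG_SIGNATURE)


def save_bytes(path: str, data: bytes) -> int:
    """Write data in binary mode and return the number of bytes written."""
    with open(path, "wb") as f:
        return f.write(data)


# Stand-ins for response.content (made-up bytes, not a real picture).
fake_jpeg = JPEG_SIGNATURE + b"\x00" * 16
fake_html = b"<!DOCTYPE html><html></html>"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "image.jpg")
    written = save_bytes(path, fake_jpeg)
    with open(path, "rb") as f:  # rb reads the file back as bytes
        read_back = f.read()

print(written, read_back == fake_jpeg)
print(looks_like_jpeg(fake_jpeg), looks_like_jpeg(fake_html))
```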

By default, the picture is saved in the current working directory.

So the picture ends up in my Python project folder.
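
If you are unsure which folder that is, pathlib can show where a relative file name such as image.jpg will land. This is a small sketch and does not create the file.

```python
from pathlib import Path

# A relative file name is resolved against the current working directory.
target = Path("image.jpg").resolve()

print(Path.cwd())
print(target)
```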

headers (request headers)

import requests
r = requests.get('http://www.baidu.com')

print(r.content.decode())

The output is the front-end code of Baidu's home page. Copy it into an .html file and open it: a Baidu home page appears, but compared with the real Baidu home page the code is much shorter and most of the JavaScript is missing.
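
A likely reason is that the server can tell the request did not come from a browser. The sketch below shows, without sending anything, what requests identifies itself as by default and how a headers dictionary changes that. The User-Agent string is an assumed example; copy the real one from your own browser.

```python
import requests

# Assumed browser-like User-Agent string, for illustration only.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/96.0.4664.110 Safari/537.36"
}

# What requests announces itself as when no headers are given.
default_agent = requests.utils.default_user_agent()
print(default_agent)

# Prepare (but do not send) a request that carries the browser headers.
prepared = requests.Request(
    "GET", "http://www.baidu.com", headers=BROWSER_HEADERS
).prepare()
print(prepared.headers["User-Agent"])
```

To actually send it, pass the same dictionary to requests.get('http://www.baidu.com', headers=BROWSER_HEADERS). Whether the server then returns the full page depends on the site and is not guaranteed.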

Topics: Python