The Requests library is the most important and most commonly used library for Python crawlers, and you must master it.
Let's get to know this library.
Requests
Requests is the most commonly used HTTP request library in Python, and it is extremely simple to use. Before using it, you first need to install requests: you can install it with one click in PyCharm, or simply run pip install requests.
1. Response and encoding
```python
import requests

url = 'http://www.baidu.com'
r = requests.get(url)
print(type(r))
print(r.status_code)
print(r.encoding)
#print(r.content)
print(r.cookies)
```

Get:

```
<class 'requests.models.Response'>
200
ISO-8859-1
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
```
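The ISO-8859-1 above is just the fallback requests uses when the server's Content-Type header does not declare a charset, which is why r.text can come out garbled for Chinese pages. A minimal sketch of fixing it with apparent_encoding, which guesses the charset from the response body:

```python
import requests

url = 'http://www.baidu.com'
r = requests.get(url)
print(r.encoding)                  # ISO-8859-1: fallback from the HTTP headers
r.encoding = r.apparent_encoding   # re-guess the charset from the body itself
print(r.encoding)                  # typically utf-8 for this page
print(r.text[:200])                # now decodes without mojibake
```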
2. GET request method
```python
values = {'user': 'aaa', 'id': '123'}
url = 'http://www.baidu.com'
r = requests.get(url, params=values)
print(r.url)
```

Get:

```
http://www.baidu.com/?user=aaa&id=123
```
3. POST request method
```python
values = {'user': 'aaa', 'id': '123'}
url = 'http://www.baidu.com'
r = requests.post(url, data=values)
print(r.url)
#print(r.text)
```

Get:

```
http://www.baidu.com/
```
4. Request header processing
```python
# Note: the User-Agent value must be a string, not a set
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.baidu.com/'
r = requests.get(url, headers=header)
print(r.content)
```
Note on handling request headers
In many cases the server checks whether a request comes from a real browser, so we need to disguise ourselves as a browser in the request headers. When making requests, it is generally best to masquerade as a browser to avoid access-denied and similar errors; this also works around a common anti-crawler strategy.
In particular, whatever request we make from now on, always bring the headers; don't cut corners to save effort. Think of it like a traffic rule: running a red light isn't necessarily dangerous, but it isn't safe, and stopping at red and going at green costs us very little. The same goes for crawler requests: always add the headers to avoid errors.
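With requests itself, one convenient pattern for making sure every request carries the same headers is a Session; a minimal sketch, reusing a browser User-Agent string like the one above:

```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request it makes
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36'
})
r = session.get('http://www.baidu.com')
print(r.status_code)
```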
```python
import urllib.request  # urllib2 in Python 2

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.qq.com/'
request = urllib.request.Request(url, headers=header)
response = urllib.request.urlopen(request)
# Note: you need to decode the page content; first check the page's charset
print(response.read().decode('gbk'))
```
Open www.qq.com in a browser and press F12 to view the User-Agent:
User-Agent: some servers or proxies use this value to determine whether the request was sent by a browser.
Content-Type: when using a REST interface, the server checks this value to decide how the content of the HTTP body should be parsed.
application/xml: used in XML RPC, such as RESTful/SOAP calls
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a Web form
When using RESTful or SOAP services provided by a server, a wrong Content-Type setting will cause the server to refuse the request.
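A minimal sketch of getting the Content-Type right when calling a JSON API. Note that httpbin.org is a public echo service used here purely for illustration, and that requests sets this header for you when you use the json= parameter:

```python
import requests

url = 'http://httpbin.org/post'
payload = {'user': 'aaa', 'id': '123'}

# json= serializes the dict and sets Content-Type: application/json for us
r = requests.post(url, json=payload)
print(r.json()['headers']['Content-Type'])   # application/json

# A browser-style form submission uses data= instead, which sends
# Content-Type: application/x-www-form-urlencoded
r = requests.post(url, data=payload)
print(r.json()['headers']['Content-Type'])   # application/x-www-form-urlencoded
```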
5. Response code and response header processing
```python
url = 'http://www.baidu.com'
r = requests.get(url)
if r.status_code == requests.codes.ok:
    print(r.status_code)
    print(r.headers)
    # It is recommended to use .get() to read header fields
    print(r.headers.get('content-type'))
else:
    r.raise_for_status()
```

Get:

```
200
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:57 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Wed, 17 Jan 2018 07:21:21 GMT', 'Content-Type': 'text/html'}
text/html
```
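For an error status, raise_for_status() raises requests.exceptions.HTTPError, which you would normally catch; a minimal sketch, using httpbin's /status endpoint just to force a deterministic 404:

```python
import requests

try:
    r = requests.get('http://httpbin.org/status/404')  # echo service returning 404
    r.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as e:
    print('Request failed:', e)
```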
6. Cookie handling
```python
url = 'https://www.zhihu.com/'
r = requests.get(url)
print(r.cookies)
print(r.cookies.keys())
```

Get:

```
<RequestsCookieJar[<Cookie aliyungf_tc=AQAAACYMglZy2QsAEnaG2yYR0vrtlxfz for www.zhihu.com/>]>
['aliyungf_tc']
```
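Cookies can also be sent with a request, and a Session keeps them across requests automatically; a minimal sketch (the cookie name and value are made up, and httpbin.org is again just an echo service):

```python
import requests

# Send a cookie with a single request (name and value are hypothetical)
r = requests.get('http://httpbin.org/cookies', cookies={'session_id': 'abc123'})
print(r.text)

# A Session stores cookies set by the server and replays them automatically
s = requests.Session()
s.get('http://httpbin.org/cookies/set/session_id/abc123')
r = s.get('http://httpbin.org/cookies')
print(r.text)   # the cookie set above is still there
```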
7. Redirection and history
To handle redirection, you just need to set the allow_redirects field: setting allow_redirects to True allows redirects, and setting it to False forbids them.
```python
url = 'http://www.baidu.com'
r = requests.get(url, allow_redirects=True)
print(r.url)
print(r.status_code)
print(r.history)
```

Get:

```
http://www.baidu.com/
200
[]
```
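Baidu returns the page directly here, so history is empty; a site that does redirect fills it in. A minimal sketch, assuming http://github.com still redirects to its https address:

```python
import requests

r = requests.get('http://github.com', allow_redirects=True)
print(r.url)          # https://github.com/ after following the redirect
print(r.status_code)  # 200
print(r.history)      # [<Response [301]>]

# With allow_redirects=False the 301 itself comes back
r = requests.get('http://github.com', allow_redirects=False)
print(r.status_code)  # 301
```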
8. Timeout setting
The timeout is set through the timeout parameter (in seconds).
```python
url = 'http://www.baidu.com'
r = requests.get(url, timeout=2)
```
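If the server fails to respond within the limit, requests raises requests.exceptions.Timeout; a minimal sketch of catching it (httpbin's /delay endpoint is used here just to force a timeout):

```python
import requests

try:
    # /delay/5 waits 5 seconds before answering, so a 2-second timeout fires
    r = requests.get('http://httpbin.org/delay/5', timeout=2)
except requests.exceptions.Timeout:
    print('The request timed out')
```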
9. Proxy settings
```python
# Map each URL scheme to a proxy server address (one entry per scheme --
# duplicate keys, as in the original, silently overwrite each other).
# The addresses below are placeholders; substitute a real proxy.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)
```
If you found this article helpful, please like and share it. Your support is my greatest encouragement.