Requests: the most important and most commonly used library for Python crawlers, and one you must master

Posted by Repgahroll on Tue, 18 Jan 2022 21:10:45 +0100

The text and pictures in this article come from the network and are for learning and communication only; they serve no commercial purpose. If you have any questions, please contact us promptly.


The Requests library is the most important and most commonly used library in Python crawling, and you must master it.

Let's get to know this library.

Requests

Requests is the most commonly used HTTP request library in Python, and it is extremely simple to use. Before using it, you first need to install requests: PyCharm can install it with one click, or you can run pip install requests from the command line.

1. Response and encoding

import requests

url = 'http://www.baidu.com'
r = requests.get(url)
print(type(r))
print(r.status_code)
print(r.encoding)
#print(r.content)
print(r.cookies)


Get:
<class 'requests.models.Response'>
200
ISO-8859-1
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
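
The ISO-8859-1 above is only a guess taken from the response headers, and Chinese pages often decode incorrectly with it. A minimal sketch of overriding the encoding before reading r.text; r.apparent_encoding sniffs the actual charset from the response body:

import requests

r = requests.get('http://www.baidu.com')
print(r.encoding)                  # ISO-8859-1, guessed from the response headers
r.encoding = r.apparent_encoding   # let requests sniff the real charset from the body
print(r.text[:100])                # now decodes without mojibake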

2. GET request method

values = {'user': 'aaa', 'id': '123'}
url = 'http://www.baidu.com'
r = requests.get(url, params=values)
print(r.url)

Get: http://www.baidu.com/?user=aaa&id=123

3. POST request method

values = {'user': 'aaa', 'id': '123'}
url = 'http://www.baidu.com'
r = requests.post(url, data=values)
print(r.url)
#print(r.text)

Get:
http://www.baidu.com/
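
baidu.com simply ignores the POSTed form data, so the example above shows little. A sketch against httpbin.org (a public echo service, not part of the original example) makes it visible where the form fields end up:

import requests

values = {'user': 'aaa', 'id': '123'}
r = requests.post('http://httpbin.org/post', data=values)
print(r.json()['form'])   # {'user': 'aaa', 'id': '123'} echoed back by httpbin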

4. Request header processing

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.baidu.com/'
r = requests.get(url, headers=header)
print(r.content)

Pay attention to the request headers.
Many servers check whether a request comes from a browser, so we need to disguise the request as a browser request in its headers. In general, it is best to disguise requests as browser requests to avoid denial of access and similar errors; this is also a way around anti-crawler measures.

In particular, no matter what request we make in the future, we should always send headers. Don't be lazy about this. Think of it like a traffic rule: running a red light is not always dangerous, but it is unsafe, and simply stopping at red and going at green is enough to stay safe. The same applies to crawler requests: always add the headers to prevent errors.

import urllib.request

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.qq.com/'
request = urllib.request.Request(url, headers=header)
response = urllib.request.urlopen(request)
print(response.read().decode('gbk'))  # Note: decode the page content here; check the page's charset first
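
For comparison, a sketch of the same request made with requests itself; the encoding override assumes the page is served as GBK, mirroring the urllib example, so check the page's actual charset first:

import requests

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
r = requests.get('http://www.qq.com/', headers={'User-Agent': user_agent})
r.encoding = 'gbk'        # assumed charset, as in the urllib example above
print(r.text[:200])       # first 200 characters of the decoded page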

Open www.qq.com in a browser and press F12 to view the User-Agent:

User-Agent: some servers or proxies check this value to determine whether the request was made by a browser
Content-Type: when using a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed
application/xml: used in XML RPC calls, such as RESTful/SOAP
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When using RESTful or SOAP services provided by a server, a wrong Content-Type setting will cause the server to refuse the request; the sketch below shows how requests sets this header.
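
To illustrate why the Content-Type matters, here is a sketch posting the same payload as a form and as JSON; requests sets the header automatically in both cases (httpbin.org is assumed as an echo service):

import requests

payload = {'user': 'aaa'}

# application/x-www-form-urlencoded: what a browser form submit sends
r1 = requests.post('http://httpbin.org/post', data=payload)
print(r1.request.headers['Content-Type'])  # application/x-www-form-urlencoded

# application/json: requests serializes the dict and sets the header itself
r2 = requests.post('http://httpbin.org/post', json=payload)
print(r2.request.headers['Content-Type'])  # application/json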

5. Response code and response header processing

url = 'http://www.baidu.com'
r = requests.get(url)

if r.status_code == requests.codes.ok:
    print(r.status_code)
    print(r.headers)
    print(r.headers.get('content-type'))  # It is recommended to use get() to read header fields
else:
    r.raise_for_status()

Get:
200
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:57 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Wed, 17 Jan 2018 07:21:21 GMT', 'Content-Type': 'text/html'}
text/html
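
When the status code is not 200, raise_for_status() raises a requests.exceptions.HTTPError. A sketch of that error path, using httpbin.org/status/404 (an external test endpoint, not part of the original example) to get a reliable 404:

import requests

r = requests.get('http://httpbin.org/status/404')
try:
    r.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(e)   # 404 Client Error for the requested URL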

6. Cookie handling

url = 'https://www.zhihu.com/'
r = requests.get(url)
print(r.cookies)
print(r.cookies.keys())

Get:
<RequestsCookieJar[<Cookie aliyungf_tc=AQAAACYMglZy2QsAEnaG2yYR0vrtlxfz for www.zhihu.com/>]>
['aliyungf_tc']
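
To keep cookies across several requests, requests.Session stores the cookie jar for you; cookies can also be attached to a single request. A minimal sketch (the 'token' cookie is a made-up value for illustration):

import requests

s = requests.Session()
s.get('http://www.baidu.com')   # the server sets cookies; the session stores them
print(s.cookies.get_dict())     # later requests on s send these automatically

# attaching a cookie explicitly to one request
r = requests.get('http://www.baidu.com', cookies={'token': 'abc123'})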

7. Redirection and history

To handle redirection, you just need to set the allow_redirects field: set allow_redirects to True to allow redirection, or to False to forbid it.

url = 'http://www.baidu.com'
r = requests.get(url, allow_redirects=True)
print(r.url)
print(r.status_code)
print(r.history)

Get:
http://www.baidu.com/
200
[]
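
The empty history above means baidu.com answered the request directly. A sketch with a URL that does redirect (github.com upgrades http to https, at least at the time of writing) shows what r.history holds:

import requests

r = requests.get('http://github.com', allow_redirects=True)
print(r.url)          # https://github.com/ after following the redirect
print(r.history)      # [<Response [301]>]: the intermediate responses

r = requests.get('http://github.com', allow_redirects=False)
print(r.status_code)  # 301: the redirect itself is returned, not followed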

8. Timeout setting

The timeout is set through the timeout parameter:

url = 'http://www.baidu.com'
r = requests.get(url, timeout=2)
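
If the server does not answer within the timeout, requests raises requests.exceptions.Timeout; a minimal sketch of catching it, reusing the 2-second budget from above:

import requests

try:
    r = requests.get('http://www.baidu.com', timeout=2)
except requests.exceptions.Timeout:
    print('request timed out after 2 seconds')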

9. Proxy settings

# The placeholder addresses below stand in for real proxy servers
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)
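
If the proxy requires HTTP basic authentication, the credentials can be embedded in the proxy URL. A sketch with placeholder values (user, password, and host:port are not real):

import requests

proxies = {
    'http': 'http://user:password@10.10.1.10:3128',   # placeholder credentials and address
    'https': 'http://user:password@10.10.1.10:3128',
}
r = requests.get('http://www.baidu.com', proxies=proxies)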

If you found this article helpful, please like and share it. Your support is my greatest encouragement.

Topics: Python Java C++ Web Development network