requests module advanced application - simulated login

Posted by osti on Sat, 22 Jan 2022 08:17:10 +0100

  • Purpose of simulated login:

Crawl user information that can only be accessed after logging in.

  • Requirement description:

Simulate logging in to the ancient poetry website (so.gushiwen.cn).

  • Coding process:
  1. Crawl the login page, obtain the verification-code picture and save it locally
  2. Recognize the verification code automatically through Super Eagle (Chaojiying)
  3. Click login and capture the URL of the login request, then simulate a post request to that URL whose parameters carry the (dynamically changing) verification code
  4. Full code:
    # @Time : 2022/1/20 0020 9:56
    # @Author : Tzy0425
    # @File : Simulated login to the ancient poetry website.py
    
    import requests
    from lxml import etree
    from chaojiying import Chaojiying_Client
    
    # Encapsulate recognition of the verification-code picture via Super Eagle
    def getCode(filePath):
        chaojiying = Chaojiying_Client('Super Eagle user account', 'Super Eagle user password', 'Software ID')
        im = open(filePath, 'rb').read()  # read the local picture file passed in
        return chaojiying.PostPic(im, 1004)  # 1004: captcha of 1-4 letters/digits
    
    
    # Crawl the login page and grab the verification-code picture
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
    }
    
    url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    code_src = 'https://so.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
    code_data = requests.get(url=code_src, headers=headers).content
    with open('./code_gushi.jpg', 'wb') as fp:
        fp.write(code_data)
    
    # Parse the dynamically changing form parameters from the page
    __VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
    __VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
    
    code_text = getCode('code_gushi.jpg')
    code = code_text.get('pic_str')
    print(code)
    
    login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
    data = {
        '__VIEWSTATE': __VIEWSTATE,
        '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
        'from': 'http://so.gushiwen.cn/user/collect.aspx',
        'email': 'Own account',
        'pwd': 'Own password',
        'code': code,
        'denglu': 'Sign in',
    }
    # Send the login post request with plain requests (not a session) -- this
    # turns out to be the problem, see the explanation below
    response = requests.post(url=login_url, data=data, headers=headers)
    login_page_text = response.text
    print(response.status_code)
    with open('./gushici.html', 'w', encoding='utf-8') as fp:
        fp.write(login_page_text)

Under normal circumstances, the code above should log the user in successfully, and the HTML it saves should be the page we are redirected to after a successful login. After running it, however, we find that although the returned status code is 200, the saved page is still the login page rather than the logged-in user interface. Why???

Don't worry, the following explanation will make it clear.

  • http/https protocol feature:

Statelessness: when a client sends a request to the server, the server does not record any state for that client across requests.

  • Why the expected page data is not returned:

When the code later requests the personal home page (the page we should land on after a successful login), that request only succeeds if the login has already been registered by the server. Because of the stateless nature of http/https, the server does not know that this request comes from a client in a logged-in state, so it returns the login page again. How, then, does a server record the state of a client? With cookies.
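
To see statelessness concretely, here is a minimal sketch using the public test service httpbin.org (not part of the original task; the Session object used for contrast is introduced just below):

import requests

# Two independent requests.get calls share nothing: the cookie set by the
# first request is not sent with the second one
requests.get('https://httpbin.org/cookies/set/token/abc')
print(requests.get('https://httpbin.org/cookies').json())   # {'cookies': {}}

# A Session stores cookies returned by the server and replays them automatically
session = requests.Session()
session.get('https://httpbin.org/cookies/set/token/abc')
print(session.get('https://httpbin.org/cookies').json())    # {'cookies': {'token': 'abc'}}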

  • cookie: lets the server record the state of the client
  1. Manual processing: capture the cookie value with a packet-capture tool and put it into headers. (Not recommended: some cookie values change dynamically, and the operation is cumbersome.)
  2. Automatic processing: first, note that the cookie here is created by the server in response to the post request sent during login. Second, use a session object: it sends requests exactly the same way as the requests module, and any cookie generated during a request is stored in the session object automatically and replayed on later requests.
  3. session usage process:

(1) Create a session object: session = requests.Session()

(2) Use the session object to send the simulated-login post request

(3) Use the session object to send the get request for the personal home page

After this analysis, all that needs to change is that the earlier requests are sent through a session instead of through requests directly: create a session object at the beginning, then route every subsequent request through it, as sketched below.
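
A sketch of the corrected flow with that change applied (same URLs and form fields as before; the Super Eagle credentials and account values are still placeholders):

import requests
from lxml import etree
from chaojiying import Chaojiying_Client

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
session = requests.Session()  # every cookie the server sets is stored here

# (1) Get the login page through the session (the server creates its cookie here)
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = session.get(url=url, headers=headers).text
tree = etree.HTML(page_text)

# (2) Download the verification code through the SAME session, so that the code
# the server generates is tied to the cookie we will log in with
code_src = 'https://so.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
with open('./code_gushi.jpg', 'wb') as fp:
    fp.write(session.get(url=code_src, headers=headers).content)

# (3) Recognize the verification code with Super Eagle, as before
chaojiying = Chaojiying_Client('Super Eagle user account', 'Super Eagle user password', 'Software ID')
code = chaojiying.PostPic(open('./code_gushi.jpg', 'rb').read(), 1004).get('pic_str')

# (4) Send the login post request through the same session: the stored cookie
# is replayed automatically, so the server recognizes the login state
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0],
    '__VIEWSTATEGENERATOR': tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0],
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': 'Own account',
    'pwd': 'Own password',
    'code': code,
    'denglu': 'Sign in',
}
response = session.post(url=login_url, data=data, headers=headers)
with open('./gushici.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)  # should now be the logged-in page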

Finally, one more anti-crawling mechanism and the corresponding counter-strategy:

  • Proxies: one anti-crawling mechanism limits the number of visits per IP per unit of time; the counter-strategy is to crawl through an IP proxy.
  1. Proxy websites: Kuaidaili (fast proxy), Xici proxy, www.goubanjia.com
  2. Types of proxy IP: http and https, each used with the corresponding protocol
  3. Anonymity levels of a proxy IP:

(1) Transparent: the server knows the request came through a proxy and also knows the real IP behind it

(2) Anonymous: the server knows a proxy is used, but does not know the real IP

(3) Highly anonymous: the server neither knows that a proxy is used nor knows the real IP

Relevant code:

Just add the proxies parameter to the requests call and assign it a proxy IP.

import requests

url = 'https://www.baidu.com/s?wd=ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

# Route the request through a proxy; the dict key must match the URL's protocol
page_text = requests.get(url=url, headers=headers, proxies={'https': '222.110.147.50:3128'}).text

with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
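
Free proxy IPs such as the one above fail frequently; here is a small hedged sketch of how the request could be guarded (the 5-second timeout and the fallback message are choices of this sketch, not part of the original):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}     # any normal UA string works here
proxies = {'https': '222.110.147.50:3128'}  # same example proxy as above

try:
    # A short timeout keeps a dead proxy from hanging the crawler
    response = requests.get('https://www.baidu.com/s?wd=ip',
                            headers=headers, proxies=proxies, timeout=5)
    response.raise_for_status()             # treat 4xx/5xx responses as failures too
    print(response.status_code)
except requests.exceptions.RequestException as e:
    print(f'Proxy failed, switch to another proxy or a direct request: {e}')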

Topics: Python crawler