[Python life from scratch ②] Improving the LeetCode crawler

Posted by Sekka on Thu, 27 Jan 2022 18:12:01 +0100

☘ preface ☘

What can you learn after reading this blog?

  • How to read and write Excel files
  • How to find a site's API
  • How multithreading improves crawler efficiency
  • How to deploy to a cloud server so the web page updates in real time

In this post, building on the previous article, I dig into the site's API to speed up data access, use multithreading to go even faster, and finally deploy everything to a web server so the ranking list updates in real time.
If you have not read the previous article or have no background yet, start there: [Python life from scratch ①] Hand-writing a crawler to scrape users' problem-solving data from LeetCode

Approximate reading time of the full text: 20min

🧑🏻 About the author: a young man who switched from industrial design to embedded development
✨ Contact: 2201891280(QQ)

1, Reasons for improvement

Improved code from the previous article (mainly running Chrome in headless mode, without rendering the screen, to improve speed):

from selenium import webdriver

ch_options = webdriver.ChromeOptions()
# Configure headless mode for Chrome (no visible window)
ch_options.add_argument("--headless")
ch_options.add_argument('--no-sandbox')
ch_options.add_argument('--disable-gpu')
ch_options.add_argument('--disable-dev-shm-usage')
ch_options.add_argument('log-level=3')
ch_options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
ch_options.add_experimental_option('useAutomationExtension', False)
ch_options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=ch_options)
driver.implicitly_wait(10)

🚨 1. Time efficiency

The following is a screenshot of the running time on my Raspberry Pi 4B: about 900 s, which is not acceptable on a server.
🎶 2. Resource occupation
This is my computer's resource usage while the crawler is running. If this ran on the server, it would be overwhelmed and might cause my own web service to stop responding.

2, Acquisition of interface

1. Find query information

First visit your LeetCode home page and press F12 to open the developer tools.

Press Ctrl+F and search for the number of problems you have solved.

You will find that the corresponding key is acTotal; now look at the headers of the request that returned it.

2. Determine the request parameters

2.1 Determining the header information


From the headers we can see that the request address is https://leetcode-cn.com/graphql/, so let's POST to it directly and see what comes back.

The response above tells us that we did not provide a CSRF token.

In the normal browser request, these values are actually carried in the headers. Sites like this generally also check such request information, so we just send the corresponding header information along with our request.

Now that the headers are accepted, the response complains about a parameter error instead. Fair enough: we did not pass any parameters at all.
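A minimal sketch of this step, mirroring the int_csrf() helper and the headers dict in the full code at the end of this post:

import requests

# Obtain the csrftoken cookie with a HEAD request, then send it back as a header
sess = requests.session()
sess.head("https://leetcode-cn.com/graphql/")
headers = {
    "x-csrftoken": sess.cookies["csrftoken"],
    "Referer": "https://leetcode-cn.com",
}

# POST with the headers but no parameters yet: the CSRF error should disappear
# and the response should now complain about missing parameters instead
res = requests.post("https://leetcode-cn.com/graphql/", headers=headers)
print(res.status_code, res.text)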

2.2 Determining the request parameters


Check the request payload to find the main data being sent. We copy it and send it to the endpoint as data:

The error message says that variables does not meet the requirements. It turns out the value of variables needs to be a JSON string:

Now we get the data we need, but the query is far too verbose. All the results we care about depend only on the query field, so let's simplify it:

Without breaking the hierarchy, the query is trimmed to the minimum, and the response now contains only the data we want.
Adjust the variables value once more and it works perfectly.
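For reference, this is the simplified query used in the full code at the end of this post (the userSlug in variables is just a placeholder that gets replaced per user):

payload = {"operation_name": "userPublicProfile",
    "query": '''query userPublicProfile($userSlug: String!) {
userProfilePublicProfile(userSlug: $userSlug) {
    submissionProgress {
        acTotal
    }
}
}
''',
    # variables must be passed as a JSON string, not a dict
    "variables": '{"userSlug":"kingley"}'
}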

For convenience, we wrap the query into a function that takes the user name, and verify that it works.
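Here is that function, chaxun ("query"), as it appears in the full code below; it uses the headers and payload defined above and returns -1 if the request fails:

import json
from urllib.parse import urlencode
import requests

def chaxun(username):
    # Fill in the user slug, send the query, and return the acTotal count
    payload['variables'] = json.dumps({"userSlug": f"{username}"})
    res = requests.post("https://leetcode-cn.com/graphql/" + "?" + urlencode(payload), headers=headers)
    if res.status_code != 200:
        return -1
    return res.json()['data']['userProfilePublicProfile']['submissionProgress']['acTotal']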

3, Data read-in and query write-back

1. Data reading and query

We already covered reading the relevant data last time, but this time we only need the user name, so there is a small change: we read the data in and output 0 directly for entries that do not meet the requirements.
If an entry does meet the requirements, we need to pull the user name out of the homepage link, which has the form /u/{username}/, so we can extract it with a regular expression (see the sketch below).

At this point the corresponding data can be output.
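A minimal sketch of that extraction, using the same regular expression and column name as the full code at the end of this post (the Excel file is the registration sheet from the previous article):

import pandas as pd

df = pd.read_excel('111.xlsx')
# Pull the user slug out of links such as https://leetcode-cn.com/u/{username}/
slugs = df["LeetCode homepage"].str.extract(r"leetcode-cn.com/u/([^/]+)(?:/|$)", expand=False)
print(slugs.head())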

2. Writing the data back

At first, in order to paste the results back into the Tencent Docs sheet, I simply used a txt file as output, which is very easy:
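A minimal sketch of that txt output; the result file name and the Name column are assumptions for illustration, not taken from the original code:

# Assumed sketch: one "name <tab> solved count" line per user,
# easy to paste back into the shared Tencent Docs sheet
with open("result.txt", "w", encoding="utf-8") as f:
    for name, slug in zip(df["Name"], slugs):
        f.write(f"{name}\t{chaxun(slug)}\n")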


You can see that the CPU load is now very low and the speed is greatly improved. But requests spends most of its time waiting on the network, so the remaining bottleneck is network latency, and we prefer multi-threaded concurrent requests to speed the program up further.

4, Multithreaded requests

Thanks to Minglao's guidance here, which saved a lot of trouble: @Xiaoming - code entity
First, let's introduce ThreadPoolExecutor from the standard library.
Import it:

from concurrent.futures import ThreadPoolExecutor

Create the thread pool:

with ThreadPoolExecutor(max_workers=10) as executor:

Submit the queries and receive the results in order:

       nums = executor.map(chaxun, df["LeetCode homepage"].str.extract(
           r"leetcode-cn.com/u/([^/]+)(?:/|$)", expand=False))

The second argument is the iterable of input parameters, and map returns the results in the same order as the inputs.
As long as we convert the returned nums into a list, we get the results in the original row order.
Overall function (condensed below):
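The overall flow, condensed from the full code at the end of this post:

import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

if __name__ == '__main__':
    int_csrf()                      # fetch the CSRF token first
    df = pd.read_excel('111.xlsx')  # read the registration sheet
    start = time.time()
    # 10 worker threads, each running the chaxun() query defined earlier
    with ThreadPoolExecutor(max_workers=10) as executor:
        nums = executor.map(chaxun, df["LeetCode homepage"].str.extract(
            r"leetcode-cn.com/u/([^/]+)(?:/|$)", expand=False))
    df['Number of LeetCode problems solved'] = list(nums)
    print("time consumed:", time.time() - start)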

Time consumed and the results obtained:

5, Writing the web front end

Thanks again for Minglao's hard work: @Xiaoming - code entity
pandas can generate an HTML file directly by default; we only need to modify it a little.

1. Default generated result:


2. Change the coloring to apply to the whole row

Looking at the generated stylesheet, we need to change selectors such as #T_18e34_row0_col3 into row0.
Use a regular expression to do the replacement (I could not work this out myself; it was Minglao who came to the rescue again, Minglao yyds!!!):
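The substitution used in the full code below, wrapped here in a small helper for illustration (rowify is just an assumed name, not part of the original code): every per-cell selector such as #T_18e34_row0_col3 becomes a row class such as .row0, so the background gradient colors the whole row.

import re

def rowify(html):
    # Rewrite per-cell selectors (#T_xxxx_rowN_colM) into row classes (.rowN)
    return re.sub(r"#T_.+?(row\d+)_col\d+", r".\1", html)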

3. Improve the style sheet

Finally, add the custom styling you want and return the resulting HTML.
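In the full code below, the end of get_html() does exactly that: it prepends the custom stylesheet (style.css, which is assumed to contain {fontsize} and {justify} placeholders) and a UTF-8 head block, then returns the finished page:

    with open("style.css") as f:
        css = "<style>" + f.read() + "</style>"
    css = css.format(fontsize=28, justify="center")
    html = utf + css + html
    return html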

6, Other supplements

1. Obtaining the CSRF token

requests can obtain the cookie directly; we just extract the csrftoken from the session cookies:

def int_csrf():
	global headers
	sess= requests.session()
	sess.head("https://leetcode-cn.com/graphql/")
	headers['x-csrftoken'] = sess.cookies["csrftoken"]

2. Writing the static page directly to the web server


The file can be written straight into the web root of the corresponding site and accessed directly. The result can be seen here: mobile version of the leaderboard
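As in the full code below, the generated page is simply written into the web server's document root (the /www/xxxx/ path is masked in the original):

with open("/www/xxxx/score.html", 'w', encoding="u8") as f:
    f.write(get_html(df))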

3. Presentation page for the PC

Desktop and mobile resolutions differ, so the page displays differently on each. We therefore made a separate PC page and embedded the data in it; in fact an iframe is used for nesting and scaling.

<style>
	.iframe-body-sty{position: relative;overflow: hidden;height:700px;width: 500px;background-color:#FFF;}
	.iframe-body-sty>#iframe-shrink{position: absolute;transform:scale(0.43);left: -620px;top: -550px;height:1900px;max-width:1600px;width: 1600px;}
</style>
<div class="iframe-body-sty">
	<iframe id="iframe-shrink" src="https://www.xingleigao.top/score.html"></iframe>
</div>

The final effect can be seen here: PC version of the leaderboard

4. Updating the data on a schedule

Use a small shell script to run the update:

cd /home/leetcode
date >> log.txt
python3 leetcode.py >> log.txt 2>&1

Use cron to schedule the task:

crontab -e
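For example, an entry like the following (the hourly schedule and the script name update.sh are assumptions; adjust them to your needs) runs the update script at the start of every hour:

# assumed example: run the update script at minute 0 of every hour
0 * * * * /bin/sh /home/leetcode/update.sh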

7, Write at the end

The whole optimization process went fairly smoothly this time, and as a complete beginner I learned a lot along the way. If you spot any problems, corrections and criticism are welcome. Here is the complete code:

"""
Xinglei's code
CSDN Home page: https://blog.csdn.net/qq_17593855
"""
__author__ = 'Xing Lei'
__time__ = '2022/1/27'

import pandas as pd
import re
import time
from urllib.parse import urlencode
import requests
import json
from concurrent.futures import ThreadPoolExecutor

headers={
    "x-csrftoken":'',
    "Referer":"https://leetcode-cn.com",

}

utf = '''
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<script>
var _hmt = _hmt || [];
(function() {
  var hm = document.createElement("script");
  hm.src = "https://hm.baidu.com/hm.js?f114c8d036eda9fc450e6cbc06a31ebc";
  var s = document.getElementsByTagName("script")[0]; 
  s.parentNode.insertBefore(hm, s);
})();
</script>

'''

payload = {"operation_name": "userPublicProfile",
    "query": '''query userPublicProfile($userSlug: String!) {
userProfilePublicProfile(userSlug: $userSlug) {
    submissionProgress {
        acTotal
    }
}
}
''',
    "variables": '{"userSlug":"kingley"}'
}

def int_csrf():
    global headers
    sess= requests.session()
    sess.head("https://leetcode-cn.com/graphql/")
    headers['x-csrftoken'] = sess.cookies["csrftoken"]

def chaxun(username):
    payload['variables'] = json.dumps({"userSlug" : f"{username}"})
    res= requests.post("https://leetcode-cn.com/graphql/"+"?"+urlencode(payload),headers = headers)
    if res.status_code != 200:
        return -1
    return res.json()['data']['userProfilePublicProfile']['submissionProgress']['acTotal']
    
def get_html(df, cmap="Set3"):
    # Sort by solved count, then drop the columns we do not want to display
    df.sort_values("Number of LeetCode problems solved", ascending=False, inplace=True)
    del df['LeetCode homepage']
    del df['CSDN homepage']
    del df['''Bilibili homepage
(mainly used to find the corresponding person when giving out rewards)''']
    r = (
        df.style.hide_index()
            .background_gradient(cmap=cmap, subset=["Number of LeetCode problems solved"])
    )
    #print(r.render())
    html = '<div>' + r.render() + '</div>'
    # Rewrite per-cell selectors (#T_xxxx_rowN_colM) into row classes (.rowN)
    html = re.sub(r"#T_.+?(row\d+)_col\d+", r".\1", html)
    with open("style.css") as f:
        css = "<style>" + f.read() + "</style>"
    css = css.format(fontsize=28, justify="center")
    html = utf + css + html
    return html


if __name__ == '__main__':
    int_csrf()
    df = pd.read_excel('111.xlsx')
    # Read in the column of LeetCode homepage links and extract the user slugs
    start = time.time()
    with ThreadPoolExecutor(max_workers=10) as executor:
        nums = executor.map(chaxun, df["LeetCode homepage"].str.extract(
            r"leetcode-cn.com/u/([^/]+)(?:/|$)", expand=False))
    df['Number of LeetCode problems solved'] = list(nums)
    with open("/www/xxxx/score.html", 'w', encoding="u8") as f:
        f.write(get_html(df))
    print("time consuming:", time.time() - start)

Finally, thank you very much!! yyds

Topics: Python leetcode crawler