**Crawling Tencent Video with Python by capturing requests**

Posted by Venkatesh on Tue, 08 Mar 2022 20:00:11 +0100

This is my first blog post. My intention is to share what I have learned and to write it all down; the process was frustrating and rewarding at once. If you want to see the process, read on.
If you don't, you can skip straight to the bottom and take the code.

Background: my girlfriend and a friend of hers who teaches needed to embed videos into PPT slides for class, ideally in MP4 format, but after a long search the videos from Tencent Video always came in the wrong format. So they asked me, the one who studied computers. People seem to assume that anyone who studied computers can handle anything computer-related, ha. Anyway.
My first thought was to download the video directly and then convert the format, but most video converters on the market charge a fee, and they would be using it fairly often. Since I studied computers myself, I figured I might as well get the videos into MP4 directly. But... never mind, I won't write out that whole detour; I doubt many people would read it. Straight to the point.

The approach:
First of all, Tencent Video serves movies as a stream: each video segment is only about 10 seconds long.
So the usual trick of grabbing one MP4 URL from the network panel doesn't apply; it would be slow and laborious, and the segments have to be stitched back together.
Second, and most importantly, the URLs Tencent requests carry a token, in plain terms a string produced by an encryption algorithm, which makes it impossible to reproduce the AJAX requests for the video stream yourself.
So to get the segment URLs, you must first get the token. A token in a request generally comes from one of two places:
1. the server sent it at some point;
2. the client generated it itself.
After searching, I found the first doesn't hold: the server never seems to send it (or maybe I just couldn't find it). As for the second, there is no way to generate it yourself: first, you don't know the key; second, even if you knew the key, you wouldn't know which encryption algorithm it uses. So that road is blocked. What to do? Step back and change the angle: if we can observe the requests the page itself makes, the video stream URLs come for free. But Python's selenium and requests have no way to capture the browser's network traffic. It turns out there is a Java tool, browsermobproxy, that can do this, and others before us have already combined selenium with browsermobproxy to capture requests and responses. Thanks to those predecessors for paving the way; otherwise this would have been hard.
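Before diving into the full script: browsermob-proxy hands the captured traffic back as a standard HAR (HTTP Archive) dictionary. As a minimal sketch of what we will do with it later, here is how to walk a HAR-shaped dict and keep only the requests that carry an `index` query parameter. The dict below is made up for illustration; the real one comes from `proxy.har`.

```python
# A minimal HAR-shaped dict, in the layout browsermob-proxy returns
# from proxy.har (the URLs here are made up for illustration)
har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/video?index=1",
                         "queryString": [{"name": "index", "value": "1"}]}},
            {"request": {"url": "https://example.com/page.css",
                         "queryString": []}},
        ]
    }
}

# Walk every captured request and keep only those carrying an index parameter
video_urls = [entry["request"]["url"]
              for entry in har["log"]["entries"]
              if any(q["name"] == "index" for q in entry["request"]["queryString"])]
print(video_urls)  # ['https://example.com/video?index=1']
```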

Programming ideas, version 1.1
1. Capture the request headers
1.1. Configure the browser proxy environment and install the packages
1.1.1. Install Java and configure its environment variables
1.1.2. Download the browsermobproxy package from GitHub
1.1.3. Install browsermobproxy in your Python environment: pip install browsermobproxy
Note: steps 1.1.2 and 1.1.3 are different things and must both be done
1.2. Write the code that captures the requests

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import browsermobproxy
from time import sleep
from lxml import etree
import re
import json

# Start the browsermobproxy service
#########################################
# Note: point this at browsermob-proxy.bat itself, or the server won't start
borproy = r"C:\Users\Administrator\Desktop\test\clraw\browsermob-proxy-2.1.4-bin\browsermob-proxy-2.1.4\bin\browsermob-proxy.bat"
server = browsermobproxy.Server(borproy, {'port': 8080})
server.start()
proxy = server.create_proxy()
print('Service started', 'Port:', proxy.port)

# Configure the browsermobproxy service and selenium
#####################################
chrome_options = Options()  # Browser settings
chrome_options.add_argument('--ignore-certificate-errors')  # Anti-detection setting
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])  # Anti-detection setting
# Roughly: embed browsermobproxy's monitoring proxy into selenium's browser,
# so that we can observe the browser's requests and responses
chrome_options.add_argument('--proxy-server={}'.format(proxy.proxy))
# If the Selenium chromedriver is placed next to python.exe, the
# executable_path parameter can be omitted
driver = webdriver.Chrome(chrome_options=chrome_options)  # Load the configuration into the webdriver

url = "https://v.qq.com/x/page/q3238a78slz.html"
# Create a new, empty HAR ready to receive the traffic of the site we crawl.
# "123" is an arbitrary name; pick whatever you like.
proxy.new_har("123", options={'captureContent': True})

# Capture the requests
###################################################
# Open the page in the browser (you can make it headless if you want)
driver.get(url)
page_text = driver.page_source
# Parse the HTML string
tree = etree.HTML(page_text)
# Get the loading progress bar via XPath
Jazai_progress = tree.xpath('.//txpdiv[@class="txp_progress_load"]//@style')
title_old = tree.xpath('.//title//text()')


##############################################
# The code above just opens the page in the browser.
# Next, selenium simulates the sending of requests.
# Tencent's player only loads the next stream segment when the current
# one is about to finish playing, so:
# 1. after entering the page, selenium clicks along the progress bar
#    to speed up the loading of the data
##############################################



# Poll the progress of the loading bar
# Keep the HAR in memory, don't write it to disk, or it will be very slow
# Check every 3 seconds

# Get page
while True:
    sleep(3)
    page_text = driver.page_source
    # Parse the HTML string
    tree = etree.HTML(page_text)
    # Get the loading progress via XPath
    Jazai_progress = tree.xpath('.//txpdiv[@class="txp_progress_load"]//@style')
    title_new = tree.xpath('.//title//text()')
    # Stop if the page has navigated away
    if title_new != title_old:
        break
    # Strip everything but the digits out of the style attribute
    Jazai_progress_data = re.sub(r'\D', '', Jazai_progress[0])
    print('Video loading progress', Jazai_progress_data, end='')

    if int(Jazai_progress_data) == 0:
        sleep(1)
        print('waiting')

    if Jazai_progress_data == '100':
        break

# (You can also click along the progress bar to push loading along)

result = proxy.har

print('Successfully captured the requests')
result_json = json.dumps(result, indent=4)  # Convert the dict to a JSON string
with open("lagoujob.json", "w", encoding="utf-8", errors="ignore") as f:
    f.write(result_json)

server.stop()
driver.quit()
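As an aside, the progress check in the loop above boils down to stripping everything but digits out of the progress bar's `style` attribute. A small standalone sketch (the helper name and style strings are made-up examples):

```python
import re

def load_progress(style):
    """Keep only the digits from a style string such as 'width: 42%;'."""
    return re.sub(r'\D', '', style)

print(load_progress('width: 42%;'))   # 42
print(load_progress('width: 100%;'))  # 100
```

Note that a fractional width such as `'width: 37.5%;'` would collapse to `'375'`; that is harmless here because the loop only ever compares the result against `'100'`.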

# Extract the segment URLs
import json
import requests

# database = "lagoujob.json"
# data = json.loads(open(database).read())

request_URL = {}
# print(result['log']['entries'][100]['request']['queryString'][0]['name'])
for i in range(len(result['log']['entries'])):
    if len(result['log']['entries'][i]['request']['queryString']) == 0:
        continue

    if result['log']['entries'][i]['request']['queryString'][0]['name'] == 'index':
        URL = result['log']['entries'][i]['request']['url']
        index = result['log']['entries'][i]['request']['queryString'][0]['value']
        request_URL[index] = URL
        # print(request_URL)
print('URLs successfully extracted')
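One caveat: `request_URL` is keyed by the `index` value as a string, and string ordering would put '10' before '2'. Before downloading, it is safer to walk the segments in numeric order so they join in playback order. A sketch with a made-up segment map (`segment_urls` is a hypothetical stand-in for the dict built above):

```python
# Hypothetical segment map, index (as a string) -> segment URL
segment_urls = {'10': 'https://example.com/seg?index=10',
                '2': 'https://example.com/seg?index=2',
                '1': 'https://example.com/seg?index=1'}

# Sort by the numeric value of the index, not the string,
# so '2' comes before '10' and segments join in playback order
ordered = [segment_urls[k] for k in sorted(segment_urls, key=int)]
print(ordered)
```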



#########################################
# Get video

# Spoof the request headers

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
    }
print(request_URL)

with open("time.mp4", "ab") as f:
    for value in request_URL.values():
        video = requests.get(value, headers=headers)
        f.write(video.content)
print('Video downloaded successfully')

Now for the punchline: this method works, but... I find it too much trouble.

Just then I stumbled onto someone else's post and saw another approach.
Posting the link; it feels better than my own:
https://www.cnblogs.com/blogsxyz/p/12811236.html

Topics: Python html crawler