[Python crawler examples] - 4. A detailed walkthrough of crawling bilibili videos

Posted by gacon on Tue, 21 Jan 2020 14:21:11 +0100


I often study on bilibili ("station B"), but my home network is so poor that online playback stutters badly, so I decided to download the videos instead (if you only want to download videos, just use the you-get library). Without further ado, let's get to work.
(Many people seem to crawl bilibili videos through an API that requires a cid parameter; this article does not use that approach.)
Tools used

  1. Python 3.6
  2. requests library
  3. lxml library (XPath parsing)
  4. json library (parse the JSON data to get the download links)
  5. ffmpeg (combine video and audio)

Catalog

  1. Determine video resource address
  2. Download test
  3. Download video and audio (two ways)
  4. Combine video and audio
  5. BiliBiliVideo.py

1. Determine video resource address

(1) Open any video in Chrome and press Ctrl+Shift+C to select the video element and try to get the video link. The address obtained is blob:https://www.bilibili.com/198785ae-c0e6-48c1-b27b-36c5af8935c6. This is a blob URL, which refers to data held inside the browser page, so it cannot be requested directly.

(2) After searching online, an article gave me an idea: capture the page's network traffic, find the links to the video segments, and then use those links to locate where they come from.

(3) The information about all the video segments turns out to come from the video page itself: https://www.bilibili.com/video/av56643958

(4) Parsing this JSON (the complete data is too large; please look at the corresponding location on station B yourself) shows the following:
The quality parameter is the video definition: 112 is HD 1080P+, 80 is HD 1080P, 64 is HD 720P, 32 is clear 480P, and 16 is smooth 360P.
The duration parameter is the length of the video, in seconds.
The frameRate parameter is the frame rate.
The SegmentBase parameter appears to give the byte range of the initialization data (Initialization) and of the segment index (indexRange).
The deadline parameter in the URL is the timestamp at which the link expires.

{
    "code": 0, 
    "message": "0", 
    "ttl": 1, 
    "data": {
        "from": "local", 
        "result": "suee", 
        "message": "", 
        "quality": 64, 
        "format": "flv720", 
        "timelength": 1504366, 
        "accept_format": "flv720,flv480,flv360", 
        "accept_description": [
            "HD 720 P", 
            "Clear 480 P", 
            "Smooth 360 P"
        ], 
        "accept_quality": [
            64, 
            32, 
            16
        ], 
        "video_codecid": 7, 
        "seek_param": "start", 
        "seek_type": "offset", 
        "dash": {
            "duration": 1505, 
            "minBufferTime": 1.5, 
            "min_buffer_time": 1.5, 
            "video": [
                {
                    "id": 64, 
                    "baseUrl": "http://upos-sz-mirrorkodo.bilivideo.com/upgcxcode/03/88/98958803/98958803-1-30064.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579449043&gen=playurl&os=kodobv&oi=1971869914&trid=9412dee30c4640c6907ef910ea2cb04cu&platform=pc&upsig=4b952dd652c9922b546b99e44756fe0a&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151", 
                    "base_url": "http://upos-sz-mirrorkodo.bilivideo.com/upgcxcode/03/88/98958803/98958803-1-30064.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579449043&gen=playurl&os=kodobv&oi=1971869914&trid=9412dee30c4640c6907ef910ea2cb04cu&platform=pc&upsig=4b952dd652c9922b546b99e44756fe0a&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151", 
                    "backupUrl": [
                        "http://upos-sz-mirrorks3.bilivideo.com/upgcxcode/03/88/98958803/98958803-1-30064.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579449043&gen=playurl&os=ks3bv&oi=1971869914&trid=9412dee30c4640c6907ef910ea2cb04cu&platform=pc&upsig=62448ee8270504e8e729d25fc402dc7f&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151"
                    ], 
                    "backup_url": [
                        "http://upos-sz-mirrorks3.bilivideo.com/upgcxcode/03/88/98958803/98958803-1-30064.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579449043&gen=playurl&os=ks3bv&oi=1971869914&trid=9412dee30c4640c6907ef910ea2cb04cu&platform=pc&upsig=62448ee8270504e8e729d25fc402dc7f&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151"
                    ], 
                    "bandwidth": 359889, 
                    "mimeType": "video/mp4", 
                    "mime_type": "video/mp4", 
                    "codecs": "avc1.64001F", 
                    "width": 960, 
                    "height": 534, 
                    "frameRate": "25", 
                    "frame_rate": "25", 
                    "sar": "801:800", 
                    "startWithSap": 1, 
                    "start_with_sap": 1, 
                    "SegmentBase": {
                        "Initialization": "0-995", 
                        "indexRange": "996-4639"
                    }, 
                    "segment_base": {
                        "initialization": "0-995", 
                        "index_range": "996-4639"
                    }, 
                    "codecid": 7
                }, 
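                // Some data is omitted here: the remaining video entries and the start of the "audio" list. The entry below comes from that list.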
                {
                    "id": 30216, 
                    "baseUrl": "http://upos-hz-mirrorks3u.acgvideo.com/upgcxcode/03/88/98958803/98958803-1-30216.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579449043&gen=playurl&os=ks3u&oi=1971869914&trid=9412dee30c4640c6907ef910ea2cb04cu&platform=pc&upsig=669eccff96c56f5586d174870a496b12&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151", 
                    "base_url": "http://upos-hz-mirrorks3u.acgvideo.com/upgcxcode/03/88/98958803/98958803-1-30216.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579449043&gen=playurl&os=ks3u&oi=1971869914&trid=9412dee30c4640c6907ef910ea2cb04cu&platform=pc&upsig=669eccff96c56f5586d174870a496b12&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151", 
                    "backupUrl": [
                        "http://upos-sz-mirrorks3.bilivideo.com/upgcxcode/03/88/98958803/98958803-1-30216.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579449043&gen=playurl&os=ks3bv&oi=1971869914&trid=9412dee30c4640c6907ef910ea2cb04cu&platform=pc&upsig=b6d7b3957f5dbcbf686f267851ec42dd&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151"
                    ], 
                    "backup_url": [
                        "http://upos-sz-mirrorks3.bilivideo.com/upgcxcode/03/88/98958803/98958803-1-30216.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579449043&gen=playurl&os=ks3bv&oi=1971869914&trid=9412dee30c4640c6907ef910ea2cb04cu&platform=pc&upsig=b6d7b3957f5dbcbf686f267851ec42dd&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151"
                    ], 
                    "bandwidth": 67100, 
                    "mimeType": "audio/mp4", 
                    "mime_type": "audio/mp4", 
                    "codecs": "mp4a.40.2", 
                    "width": 0, 
                    "height": 0, 
                    "frameRate": "", 
                    "frame_rate": "", 
                    "sar": "", 
                    "startWithSap": 0, 
                    "start_with_sap": 0, 
                    "SegmentBase": {
                        "Initialization": "0-907", 
                        "indexRange": "908-4551"
                    }, 
                    "segment_base": {
                        "initialization": "0-907", 
                        "index_range": "908-4551"
                    }, 
                    "codecid": 0
                }
            ]
        }
    }, 
    "session": "da9c24388db43b3dfe81ebd676d5e41b", 
    "videoFrame": { }
}
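
To make the field descriptions concrete, here is a minimal sketch that picks the stream URLs out of a trimmed-down copy of the structure above and reads the deadline query parameter. The playinfo dictionary, the example.com URLs, and the helper name link_expiry are illustrative assumptions and are not part of the original script.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

# Trimmed-down stand-in for the JSON above (hypothetical host, shortened URLs)
playinfo = {
    "data": {
        "quality": 64,
        "dash": {
            "duration": 1505,
            "video": [{
                "id": 64,
                "baseUrl": "http://example.com/98958803-1-30064.m4s?deadline=1579449043&gen=playurl",
                "SegmentBase": {"Initialization": "0-995", "indexRange": "996-4639"},
            }],
            "audio": [{
                "id": 30216,
                "baseUrl": "http://example.com/98958803-1-30216.m4s?deadline=1579449043&gen=playurl",
            }],
        },
    }
}


def link_expiry(url):
    """Return the expiry time stored in the deadline query parameter (UTC)."""
    deadline = parse_qs(urlparse(url).query)["deadline"][0]
    return datetime.fromtimestamp(int(deadline), tz=timezone.utc)


dash = playinfo["data"]["dash"]
video_url = dash["video"][0]["baseUrl"]
audio_url = dash["audio"][0]["baseUrl"]
expiry = link_expiry(video_url)
print(dash["duration"], "seconds; link expires at", expiry.isoformat())
```

Comparing the expiry with the current time before a long download is one way to notice that the page has to be fetched again for fresh links.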

2. Download test

(1) Now that we have identified the link above as the video link we want, try sending a request to it directly.
The result: a 403 error. The server denies access.

(2) Go back to Fiddler and check the captured packets. The same link appears repeatedly. Looking closely at the status codes: when the status is 200, the request method is OPTIONS and the response carries no data; when the status is 206, the request method is GET and the response carries data. After consulting some references, my guess is that an OPTIONS request has to be sent to the server before fetching a station B video (its Access-Control-Request-* headers mark it as the browser's CORS preflight), and that GET is then used to fetch the video fragment.

(3) The next step is to test fetching a video fragment with GET. Since both the OPTIONS request and the GET request carry Connection: keep-alive, consider using requests.session() to keep the session.

import requests

# url is the video link, url2 is the audio link
url='https://cn-hbwh2-cmcc-bcache-07.bilivideo.com/upgcxcode/03/88/98958803/98958803-1-30064.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579518843&gen=playurl&os=bcache&oi=1971869869&trid=0cfd59d728114a54bd4747a01f87c9bbu&platform=pc&upsig=2a9e83f3258a7e2694bc83a5fbab8664&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151&origin_cdn=ks3'
url2='http://upos-hz-mirrorks3u.acgvideo.com/upgcxcode/03/88/98958803/98958803-1-30216.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1579519469&gen=playurl&os=ks3u&oi=1971869869&trid=99ee525d6c7f4bc8a414a537797e31f3u&platform=pc&upsig=ce817a7120709c60ac43cf095de05c8d&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=352741151'
headers1={
    'Host': 'cn-hbwh2-cmcc-bcache-04.bilivideo.com',
    'Connection': 'keep-alive',
    'Access-Control-Request-Method': 'GET',
    'Origin': 'https://www.bilibili.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36',
    'Access-Control-Request-Headers': 'range',
    'Accept': '*/*',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.bilibili.com/video/av56643958?t=262',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
headers2={
    'Host': 'cn-hbwh2-cmcc-bcache-04.bilivideo.com',
    'Connection': 'keep-alive',
    'Origin': 'https://www.bilibili.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36',
    'Accept': '*/*',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.bilibili.com/video/av56643958?t=262',
    'Accept-Encoding': 'identity',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Range': 'bytes=0-907'
}

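session=requests.session()
# Send the OPTIONS request first, then GET the fragment (the video link is stored in url)
session.options(url=url,headers=headers1)
res=session.get(url=url,headers=headers2)
print(res)
# The with block closes the file automatically
with open('test1.mp4','wb') as fp:
    fp.write(res.content)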

The file downloads successfully, but it cannot be opened.

(4) Reset the Range to 'Range': 'bytes=0-4639000'. The downloaded file is now 4.42 MB and plays normally for 59 seconds, but without sound. My guess: the 908 bytes downloaded before were not enough to form even one frame (the resolution is 960 * 534, so I guessed that about 62.57 KB is needed for one picture). As for the missing sound, after looking it up I learned that station B serves video and audio separately. The last entry in the JSON has no resolution, so it is probably the audio link.

Verified: the last link is indeed the audio.
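
One way to check what a partial download actually contains is to list its top-level boxes. The sketch below assumes the .m4s file follows the usual ISO BMFF (MP4) layout, in which every box starts with a 4-byte big-endian size followed by a 4-byte type. The function name list_boxes and the synthetic sample bytes are made up for illustration. If the first bytes hold only header boxes such as ftyp and moov, no picture or sound data has arrived yet, which would also explain a file that cannot be opened.

```python
import struct


def list_boxes(data):
    """Return (type, size) for each top-level box whose header is present in data."""
    boxes = []
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack(">I4s", data[pos:pos + 8])
        if size < 8:
            # Sizes 0 (box runs to end of file) and 1 (64-bit size) are not handled in this sketch
            break
        boxes.append((box_type.decode("ascii", "replace"), size))
        pos += size
    return boxes


# Synthetic example: a 16-byte 'ftyp' box followed by an empty 8-byte 'moov' box
sample = (struct.pack(">I4s", 16, b"ftyp") + b"iso5" + b"\x00" * 4
          + struct.pack(">I4s", 8, b"moov"))
print(list_boxes(sample))
```

To inspect a real download, pass it the bytes read with open('test1.mp4', 'rb').read().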

3. Download video and audio (two ways)

Method 1: remove the Range header and download the whole video or audio in one request.

Method 2: download in segments of 512 KB each, using the 416 status code to detect the end. When a request returns 416, the final request uses an open-ended range of the form 'Range': 'bytes=<offset>-' to fetch whatever remains.
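
If the total size is known, the segment boundaries can also be computed up front instead of probing for 416. The following is a sketch of that alternative, not the method this article uses. It assumes the server answers a ranged GET with a header of the form Content-Range: bytes start-end/total, and the helper names total_size and byte_ranges are made up for illustration.

```python
def total_size(content_range):
    """Extract the total length from a header value such as 'bytes 0-907/1200000'."""
    total = content_range.rsplit("/", 1)[1]
    return None if total == "*" else int(total)


def byte_ranges(total, chunk=512 * 1024):
    """Yield Range header values that cover total bytes in chunk-sized pieces."""
    for begin in range(0, total, chunk):
        end = min(begin + chunk, total) - 1
        yield "bytes=%d-%d" % (begin, end)


size = total_size("bytes 0-907/1200000")
for value in byte_ranges(size):
    print(value)
```

With requests, the header value would come from res.headers.get('Content-Range') on the first 206 response.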

The code is as follows:

import requests
import json
from lxml import etree

# Suppress the warnings that verify=False would otherwise trigger (HTTPS certificate checks are skipped)
requests.packages.urllib3.disable_warnings()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36',
    'Referer': 'https://www.bilibili.com/'
}


def GetBiliVideo(homeurl,session=requests.session()):
    res = session.get(url=homeurl, headers=headers, verify=False)
    html = etree.HTML(res.content)
    videoinforms = str(html.xpath('//head/script[3]/text()')[0])[20:]
    videojson = json.loads(videoinforms)
    # Get video and audio links
    VideoURL = videojson['data']['dash']['video'][0]['baseUrl']
    AudioURl = videojson['data']['dash']['audio'][0]['baseUrl']
    print(videojson)
    #Get the name of the video resource
    name = str(html.xpath("//h1/@title")[0].encode('ISO-8859-1').decode('utf-8'))
    # Download video and audio (homeurl is required by BiliBiliDownload)
    BiliBiliDownload(homeurl=homeurl, url=VideoURL, name=name + '_Video', session=session)
    BiliBiliDownload(homeurl=homeurl, url=AudioURl, name=name + '_Audio', session=session)


def BiliBiliDownload(homeurl, url, name, session=requests.session()):
    # Work on a copy so the Range header does not leak into the shared headers
    req_headers = dict(headers, Referer=homeurl)
    session.options(url=url, headers=req_headers, verify=False)
    # Download 512 KB per request
    chunk = 1024 * 512
    begin = 0
    with open(name + '.mp4', 'wb') as fp:
        while True:
            req_headers['Range'] = 'bytes=%d-%d' % (begin, begin + chunk - 1)
            res = session.get(url=url, headers=req_headers, verify=False)
            if res.status_code == 416:
                # The range runs past the end of the file: ask for whatever remains
                req_headers['Range'] = 'bytes=%d-' % begin
                res = session.get(url=url, headers=req_headers, verify=False)
                if res.status_code == 206:
                    fp.write(res.content)
                break
            # Stop on any other error instead of looping forever
            res.raise_for_status()
            fp.write(res.content)
            begin += chunk
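
The XPath expression //head/script[3] together with the [20:] slice depends on the exact position of the script tag in the page. A somewhat less position-dependent sketch is shown here. It assumes the page embeds the data as window.__playinfo__={...} inside a script tag (a prefix of exactly 20 characters, which matches the slice above, but this is an assumption about the page). The function name extract_playinfo and the sample HTML are hypothetical.

```python
import json
import re


def extract_playinfo(html_text):
    """Return the parsed window.__playinfo__ object, or None if it is not found."""
    match = re.search(r"window\.__playinfo__\s*=\s*(\{.*?\})\s*</script>", html_text, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


# Tiny made-up page that mimics the assumed layout
sample = ('<html><head><script>window.__playinfo__={"data":{"dash":'
          '{"video":[{"baseUrl":"v"}],"audio":[{"baseUrl":"a"}]}}}</script></head></html>')
info = extract_playinfo(sample)
print(info["data"]["dash"]["video"][0]["baseUrl"], info["data"]["dash"]["audio"][0]["baseUrl"])
```

With requests, the input would be res.text (or res.content.decode('utf-8')) from the request for the video page.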

4. Combine video and audio

After looking through a lot of material on combining video and audio, I decided to use ffmpeg. Before merging, you need to install ffmpeg; please refer to an ffmpeg installation guide for details. If "ffmpeg" followed by a pile of garbled characters appears during the run, that problem is covered in a separate article.
Here is the code for combining the video and audio:

  1. First add the following code to the video module of the ffmpeg library:
# Combination of audio and video (self added)
# Requires "import subprocess" at the top of the module if it is not already there

def combine_audio(video_file, audio_file, out_file):
    try:
        # Quote the paths so that file names containing spaces still work
        cmd = ('D:/python/ffmpeg-20200115-0dc0837-win64-static/bin/ffmpeg'
               ' -i "' + video_file + '" -i "' + audio_file + '" -acodec copy "' + out_file + '"')
        print(cmd)
        res = subprocess.call(cmd, shell=True)
        if res != 0:
            return False
        print('Muxing Done')
        return True
    except Exception:
        return False
  2. Then you can call the following code in your BiliBiliVideo.py.
# All paths need to use full path
def CombineVideoAudio(videopath,audiopath,outpath):
    ffmpeg.video.combine_audio(videopath,audiopath,outpath)
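
As an alternative to patching the ffmpeg library, the merge can also be run directly with subprocess. This is a minimal sketch under a few assumptions: the ffmpeg executable is on PATH (otherwise pass its full path as the ffmpeg argument), and both streams are copied with -c copy rather than only the audio as in the command above. The function names build_merge_command and merge are made up for this example.

```python
import subprocess


def build_merge_command(video_path, audio_path, out_path, ffmpeg="ffmpeg"):
    """Build an ffmpeg argument list that copies both streams into out_path."""
    # -y overwrites an existing output file; -c copy avoids re-encoding
    return [ffmpeg, "-y", "-i", video_path, "-i", audio_path, "-c", "copy", out_path]


def merge(video_path, audio_path, out_path):
    """Run ffmpeg and report whether it exited successfully."""
    result = subprocess.run(build_merge_command(video_path, audio_path, out_path))
    return result.returncode == 0


print(build_merge_command("my video_Video.mp4", "my video_Audio.mp3", "my video_output.mp4"))
```

Passing the arguments as a list avoids shell quoting problems when the video title contains spaces. Note that subprocess.run raises FileNotFoundError if the ffmpeg executable cannot be found.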

5. BiliBiliVideo.py

Here is the complete code:

import requests
import json
from lxml import etree
import ffmpeg.video

requests.packages.urllib3.disable_warnings()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36',
    'Referer': 'https://www.bilibili.com/'
}


def GetBiliVideo(homeurl,session=requests.session()):
    res = session.get(url=homeurl, headers=headers, verify=False)
    html = etree.HTML(res.content)
    videoinforms = str(html.xpath('//head/script[3]/text()')[0])[20:]
    videojson = json.loads(videoinforms)
    # Get video and audio links
    VideoURL = videojson['data']['dash']['video'][0]['baseUrl']
    AudioURl = videojson['data']['dash']['audio'][0]['baseUrl']
    #Get the name of the video resource
    name = str(html.xpath("//h1/@title")[0].encode('ISO-8859-1').decode('utf-8'))
    # Download video and audio
    print('Downloading video····')
    BiliBiliDownload(homeurl=homeurl,url=VideoURL, name=name + '_Video.mp4', session=session)
    print('Downloading audio····')
    BiliBiliDownload(homeurl=homeurl,url=AudioURl, name=name + '_Audio.mp3', session=session)
    print('Download complete!')
    CombineVideoAudio(name + '_Video.mp4',name + '_Audio.mp3',name + '_output.mp4')

def BiliBiliDownload(homeurl, url, name, session=requests.session()):
    # Work on a copy so the Range header does not leak into the shared headers
    req_headers = dict(headers, Referer=homeurl)
    session.options(url=url, headers=req_headers, verify=False)
    # Download 512 KB per request
    chunk = 1024 * 512
    begin = 0
    with open(name, 'wb') as fp:
        while True:
            req_headers['Range'] = 'bytes=%d-%d' % (begin, begin + chunk - 1)
            res = session.get(url=url, headers=req_headers, verify=False)
            if res.status_code == 416:
                # The range runs past the end of the file: ask for whatever remains
                req_headers['Range'] = 'bytes=%d-' % begin
                res = session.get(url=url, headers=req_headers, verify=False)
                if res.status_code == 206:
                    fp.write(res.content)
                break
            # Stop on any other error instead of looping forever
            res.raise_for_status()
            fp.write(res.content)
            begin += chunk

# All paths need to use full path
def CombineVideoAudio(videopath,audiopath,outpath):
    ffmpeg.video.combine_audio(videopath,audiopath,outpath)

if __name__ == '__main__':
    url = 'https://www.bilibili.com/video/av56643958'
    GetBiliVideo(url)
