Crawling the Bilibili (Station B) Top 100 with Python to uncover the secrets of these popular uploaders!

Posted by r3n on Tue, 04 Jan 2022 04:04:09 +0100

Bilibili (Station B) is becoming an ever larger part of everyone's life.

In the first quarter of 2020, Bilibili's monthly active users reached 172 million, and its daily active users exceeded 50 million.

This steady flow of traffic has sent the follower counts of Bilibili's uploaders (UP owners) soaring, and uploaders with millions of followers are everywhere.

Today I will walk you through crawling and analyzing Bilibili's hottest video ranking to see what kinds of videos everyone likes to watch ~

Requirement analysis

For the ranking list, we crawl the top 100 videos and collect, among other information, each video's rank, title, video link, play count, bullet comment (barrage) count, author, comprehensive score, and the link to the author's details page.

Web page analysis

First, we open the browser's developer tools, as follows. All of the information can be found in the ul tag in the panel on the right,

so we first fetch the page and then use XPath to extract the content of the tags.
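The idea of walking the list items and pulling fields out with path expressions can be tried offline first. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (it supports a limited XPath subset, without text()), and the sample markup, class names, and URLs are made up rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup. The real ranking page is larger
# and its class names may differ.
SAMPLE = """
<ul class="rank-list">
  <li><div class="num">1</div><div class="info"><a class="title" href="//example.com/v/1">First video</a></div></li>
  <li><div class="num">2</div><div class="info"><a class="title" href="//example.com/v/2">Second video</a></div></li>
</ul>
"""

def parse_items(markup):
    """Return (rank, title, href) tuples for every <li> in the list."""
    root = ET.fromstring(markup.strip())  # root is the <ul> element
    rows = []
    for li in root.findall("./li"):
        rank = li.find("./div[@class='num']").text.strip()
        link = li.find("./div[@class='info']/a[@class='title']")
        rows.append((rank, link.text.strip(), link.get("href")))
    return rows

items = parse_items(SAMPLE)
for item in items:
    print(item)
```

With lxml the same loop would use li.xpath(...) and text() or @href steps, which return lists instead of single elements.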

Send request

  import requests

  #URL of the ranking page
  url = ''

  headers = {
       "cookie": "_uuid=7D3DFA6C-6EB1-F72A-632B-C9AF9B9AD4C627183infoc; buvid3=D25672DE-BD2D-4E7C-B79E-DB356316D023167639infoc; sid=aylq5kgg; fingerprint=84acc3579a53d0eba78d769e71574df6; buvid_fp=BA184AFC-F4DC-408A-8897-D0EDEA653CE5148812infoc; buvid_fp_plain=BA184AFC-F4DC-408A-8897-D0EDEA653CE5148812infoc; DedeUserID=434541726; DedeUserID__ckMd5=448fda6ab5098e5e; SESSDATA=78a505c8%2C1643594982%2Cdfa35*81; bili_jct=1d9f4e960fb0ae7fe1de53663029874b; bsource=search_baidu; CURRENT_FNVAL=80; blackside_state=1; rpdid=|(u)YJR~R~)m0J'uYk)ku)~~)",
      "referer": "",
      "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.8 Safari/537.36"

  }

  resp = requests.get(url, headers=headers, timeout=15)

Get the response information

Next, we use XPath to extract the information inside each tag

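  from lxml import etree
  from icecream import ic

  #Parse the response and select the <li> items of the ranking list
  #NOTE: the 'rank-list' class name is an assumption; confirm it in the developer tools
  html = etree.HTML(resp.text)
  lis = html.xpath("//ul[@class='rank-list']/li")

  for li in lis:

    #Ranking
    sort = li.xpath("./div[@class='num']/text()")
    sort = ''.join(sort)

    #Author
    author = li.xpath("./div[@class='content']/div[@class='info']/div[@class='detail']/a/span[@class='data-box up-name']/text()")
    author = ''.join(author).strip()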

    #Comprehensive score
    score = li.xpath("./div[@class='content']/div[@class='info']/div[@class='pts']/div/text()")
    score = ''.join(score)

    #Video title
    title = li.xpath("./div[@class='content']/div[@class='info']/a[@class='title']/text()")
    title = ''.join(title)

    #Video link
    links = li.xpath("./div[@class='content']/div[@class='img']/a/@href")
    links = ''.join(links).strip()[2:]

    #Number of plays
    video_num = li.xpath("./div[@class='content']/div[@class='info']/div[@class='detail']/span[@class='data-box'][1]/text()")
    video_num = ''.join(video_num).strip()

    #Number of barrages
    barrage_num = li.xpath("./div[@class='content']/div[@class='info']/div[@class='detail']/span[@class='data-box'][2]/text()")
    barrage_num = ''.join(barrage_num).strip()

    #Author details
    detail_auth = li.xpath(".//div[@class='content']/div[@class='info']/div[@class='detail']/a/@href")
    detail_auth = ['https:' + i for i in detail_auth]
    detail_auth = ''.join(detail_auth)

    ic(sort, author, score, title,links, video_num, barrage_num, detail_auth)

Some information is as follows:
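A note on the two link fields: the code strips the first two characters from the video link and prepends 'https:' to the author link, which suggests the hrefs on the page are protocol-relative (they start with //). A minimal sketch of a more uniform way to handle both, using urljoin from the standard library. The base URL and the sample hrefs are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Assumed base URL, used here only to supply the "https" scheme.
BASE = "https://www.bilibili.com/"

def absolute(href):
    """Turn a protocol-relative or relative href into an absolute URL."""
    return urljoin(BASE, href.strip())

# Made-up hrefs in the same shape as the ones on the page.
video = absolute("//www.bilibili.com/video/BV0000000000")
space = absolute(" //space.bilibili.com/1 ")
print(video)
print(space)
```

Both fields then come out as full URLs that can be opened directly from the spreadsheet.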

Data saving

Next, we use the openpyxl module to save the collected information to an Excel file, which makes the subsequent data processing and visualization easier.

  import openpyxl as op

  wb = op.Workbook()
  ws = wb.create_sheet(index=0)

  #Header row
  ws.cell(row=1, column=1, value='ranking')
  ws.cell(row=1, column=2, value='author')
  ws.cell(row=1, column=3, value='Comprehensive score')
  ws.cell(row=1, column=4, value='Video title')
  ws.cell(row=1, column=5, value='Video link')
  ws.cell(row=1, column=6, value='Number of plays')
  ws.cell(row=1, column=7, value='Number of barrages')
  ws.cell(row=1, column=8, value='Author details')

  #Write the data rows (one row per video), then save the workbook
  wb.save('Bilibili Top100.xlsx')
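The listing above shows the header row but not how the crawled rows reach the sheet. One possible way, sketched here under assumptions, is to append one list per video with openpyxl's append method. The build_workbook helper and the sample rows are hypothetical, and the workbook is saved to an in-memory buffer so the example leaves no file behind.

```python
from io import BytesIO

import openpyxl as op

HEADER = ['ranking', 'author', 'Comprehensive score', 'Video title',
          'Video link', 'Number of plays', 'Number of barrages', 'Author details']

def build_workbook(rows):
    """Create a workbook whose first sheet holds the header and one row per video."""
    wb = op.Workbook()
    ws = wb.create_sheet(index=0)  # becomes the first sheet
    ws.append(HEADER)              # row 1
    for row in rows:
        ws.append(list(row))       # rows 2, 3, ...
    return wb

# Made-up rows, for illustration only.
rows = [
    ('1', 'author-a', '100', 'title-a', 'example.com/v/1', '1.5万', '3000', 'https://example.com/u/1'),
    ('2', 'author-b', '90', 'title-b', 'example.com/v/2', '9800', '1.2万', 'https://example.com/u/2'),
]

wb = build_workbook(rows)
buffer = BytesIO()
wb.save(buffer)   # pass a file name such as 'Bilibili Top100.xlsx' to write to disk
buffer.seek(0)

loaded = op.load_workbook(buffer)
first_sheet = loaded.worksheets[0]
print(first_sheet.max_row)
```

In the crawler itself, the call to append would sit inside the for loop over the list items, with the save after the loop.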

Data processing

While processing the data, we found that the units of some values are inconsistent, as follows:

Some counts are plain numbers, while others are given in units of ten thousand (万).

So we need to remove the "万" (ten thousand) suffix after the number and convert the numeric strings into a numeric type, which makes the later visualization easier.
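The conversion rule itself is small enough to sketch in plain Python before bringing in pandas. This assumes the suffix on the page is the single character 万 (ten thousand). The to_count helper is a hypothetical name, and Decimal is used so that values such as 445.3 scale exactly.

```python
from decimal import Decimal

def to_count(text):
    """Convert '445.3万' to 4453000 and '8521' to 8521."""
    text = text.strip()
    if text.endswith('万'):
        # Drop the one-character suffix and scale by ten thousand.
        return int(Decimal(text[:-1]) * 10000)
    return int(Decimal(text))

print(to_count('445.3万'))
print(to_count('8521'))
```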

Here we use pandas for the data processing. If anything is unclear, take a look at this tutorial, a very practical pandas summary that I put together myself:

Pandas skills you can't refuse: simple but handy!

  import pandas as pd

  #Read data
  df = pd.read_excel('Bilibili Top100.xlsx')
  #Drop rows with missing values (copy to avoid SettingWithCopyWarning)
  pd_data = df.dropna(subset=['Number of plays', 'Number of barrages']).copy()

  #Format the number of plays
  #Remove '万'; the values stay in units of ten thousand
  pd_data['Number of plays'] = pd_data['Number of plays'].astype(str).str.replace('万', '').astype(float)
  #Convert the number of barrages: 'x万' -> x * 10000
  pd_data['Number of barrages'] = pd_data['Number of barrages'].astype(str).map(lambda x: float(x[:-1]) * 10000 if ('万' in x) else float(x))

  #Save the processed data
  pd_data.to_excel('Bilibili Top101.xlsx')

Red marks the data before processing, and green marks the processed data.

We have converted the numbers from string type to a numeric type and unified their units.

Next, we find the video with the most plays and the video with the most bullet comments (barrages), to see why they are so popular.

  #Read the processed data
  rcv_data = pd.read_excel('Bilibili Top101.xlsx')

  #Maximum playback
  max_video_num = rcv_data[rcv_data['Number of plays'] == rcv_data['Number of plays'].max()]
  ic(max_video_num)

  #Maximum barrage
  max_bag_num = rcv_data[rcv_data['Number of barrages'] == rcv_data['Number of barrages'].max()]
  ic(max_bag_num)

  ic| max_video_num:    Unnamed: 0  ranking    author     Comprehensive score                    Video title                                 Video link   Number of plays   Number of barrages                                 Author details
                   0           0   1  Sheep cuisine  4742269  A lemon costs $100?? delicious food up Master: that's the cost!  445.3  20000
  ic| max_bag_num:     Unnamed: 0  ranking     author     Comprehensive score                   Video title                                 Video link   Number of plays   Number of barrages                                 Author details
                 19          19  20  Burst, leaf  2320837  <B station various etc. level use household present shape !>  252.7  84000
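As a side note, the boolean-mask lookup returns every row that ties for the maximum. When a single row is enough, idxmax is a common alternative in pandas. A minimal sketch follows. The three rows are invented for illustration and are not the crawled data.

```python
import pandas as pd

# Invented rows, for illustration only.
df = pd.DataFrame({
    'Video title': ['a', 'b', 'c'],
    'Number of plays': [120.5, 445.3, 252.7],
    'Number of barrages': [3000.0, 20000.0, 84000.0],
})

# idxmax returns the index label of the first maximum; loc fetches that row.
most_played = df.loc[df['Number of plays'].idxmax()]
most_barrages = df.loc[df['Number of barrages'].idxmax()]
print(most_played['Video title'], most_barrages['Video title'])
```

Both columns must already be numeric for this to work, which is what the data processing step takes care of.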


This is the most-played video.

Is the lemon so expensive because the young lady is good-looking?

Now let's take a look at the video with the most bullet comments.

What is everyone talking about? Take a look.

Finally, we use a liquid fill (water drop) chart to show how the average comprehensive score compares with the highest one

  mean_score = rcv_data['Comprehensive score'].mean()

  max_score = rcv_data['Comprehensive score'].max()
  ic(mean_score/max_score)
  ic| mean_score/max_score: 0.39690731166873916
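The chart code itself is not shown, so the following is only a sketch of how such a ratio could be drawn. It assumes the pyecharts library and its Liquid chart, which the article does not confirm. The score_ratio and render_liquid helpers, the sample scores, and the output file name are all hypothetical.

```python
from statistics import mean

def score_ratio(scores):
    """Mean score divided by the highest score."""
    scores = list(scores)
    return mean(scores) / max(scores)

def render_liquid(ratio, path="score_liquid.html"):
    """Render the ratio as a liquid fill chart (assumes pyecharts is installed)."""
    from pyecharts import options as opts
    from pyecharts.charts import Liquid

    chart = (
        Liquid()
        .add("Mean / max score", [ratio])
        .set_global_opts(title_opts=opts.TitleOpts(title="Comprehensive score"))
    )
    return chart.render(path)

# Made-up scores, for illustration only.
ratio = score_ratio([100, 50, 30])
print(round(ratio, 3))
```

Calling render_liquid(ratio) writes an HTML file that can be opened in a browser. With the real data, the value passed in would be mean_score/max_score.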

Topics: data visualization BiliBili