Using Python to raise a "bug" and crawl the bullet-screen comments

Posted by tomharding on Mon, 07 Feb 2022 20:41:35 +0100

preface

After 108 days, classmate He finally released his latest video on bilibili (station B), "I turned on a light in 108 days...". So let's use a crawler to fetch the video's bullet-screen comments (danmaku) and see what everyone is saying.

I. What is a crawler?

Baidu Encyclopedia says: a program that automatically fetches web content. In my opinion, crawlers are ~~little bugs crawling around on the Internet...~~ okay, I'll stop.
Now let's take a look at how to raise a "bug" of our own to carry off the danmaku of station B.

II. Feeding steps

1. Request the danmaku

First of all, you need to know the URL to request. On station B, the danmaku of a video lives at a fixed address:

http://comment.bilibili.com/+cid+.xml

OK, so the question is: what is the cid? Whatever it is exactly, here is how to find it:

  1. Open the video page, press F12, switch to the "Network" tab, and type "cid" into the filter box.
  2. Click one of the filtered requests and find the cid in the "Payload" panel on the right.
  3. Now we know the danmaku URL for classmate He's video:

http://comment.bilibili.com/499893135.xml

  4. The next step is to send a network request and fetch the page resource. Python has many libraries for sending network requests; here we use the requests library:
import requests

# cid of classmate He's video "I turned on a light in 108 days": 499893135
# Danmaku location
url = "http://comment.bilibili.com/499893135.xml"
# Send the request
req = requests.get(url=url)
# Get the response body as bytes
html_byte = req.content
# Convert bytes to str
html_str = str(html_byte, "utf-8")
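
The cid is the only video-specific part of that snippet, so the same request works for any video once you know its cid. A minimal sketch (danmaku_url is a hypothetical helper of mine, not from the original code):

# Hypothetical helper: build the danmaku URL for any cid
def danmaku_url(cid):
    return "http://comment.bilibili.com/" + str(cid) + ".xml"

print(danmaku_url(499893135))  # http://comment.bilibili.com/499893135.xml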

It is also worth adding a request header so the request looks like it comes from a browser. Pass it through the headers parameter; you can copy your own User-Agent string from the browser's developer tools (the request headers shown in the "Network" tab).

The code then becomes:

import requests

# Pretend to be a browser
header = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.43'
}
# cid of classmate He's video "I turned on a light in 108 days": 499893135
# Danmaku location
url = "http://comment.bilibili.com/499893135.xml"
# Send the request with the header attached
req = requests.get(url=url, headers=header)
# Get the response body as bytes
html_byte = req.content
# Convert bytes to str
html_str = str(html_byte, "utf-8")
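
One small, optional addition of my own (not part of the original post): check that the request actually succeeded before decoding the body, so a blocked or failed request fails loudly instead of silently producing empty results later. Continuing from the snippet above:

# Sanity check (my own addition): stop early if the request failed
req = requests.get(url=url, headers=header)
req.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
print(req.status_code)  # 200 means the danmaku XML came back
html_str = req.content.decode("utf-8")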

2. Analyze the barrage

html_str is now one long string of markup. We need to parse it to pull out the information we want, and this is where the Beautiful Soup library shines. We use it to process the string obtained in step 1: every comment sits inside a <d> tag, so we collect all the <d> tags and extract their text.
The code is as follows (example):

from bs4 import BeautifulSoup

# Parse the markup (html is the html_str obtained in step 1)
soup = BeautifulSoup(html, 'html.parser')
# Find all <d> tags in the document
results = soup.find_all('d')
# Extract the text from each tag
contents = [x.text for x in results]
# Save as a dictionary for the next step
dic = {"contents": contents}

contents is the list of comment strings; wrapping it in a dictionary makes it easy to turn into a DataFrame in the next step.
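
For context, the XML returned by that URL is essentially a flat list of <d> elements, one per comment (the real entries also carry a p attribute with timing and style metadata, omitted here). A tiny self-contained sketch with a hand-written sample shows what find_all('d') gives back:

from bs4 import BeautifulSoup

# Hand-written sample roughly matching the shape of the danmaku XML
sample_xml = """<?xml version="1.0" encoding="UTF-8"?>
<i>
    <d>first comment</d>
    <d>second comment</d>
</i>"""

soup = BeautifulSoup(sample_xml, 'html.parser')
print([d.text for d in soup.find_all('d')])  # ['first comment', 'second comment']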

3. Store the danmaku

The comments are stored in an Excel file, and there are several libraries that can do this. For example:

  • the xlwt library
  • the pandas library

We'll use the pandas library here. The code is as follows (example): build a DataFrame from the dictionary obtained in step 2, then save it with pandas' to_excel API.
import pandas as pd

# Build a DataFrame (a spreadsheet-like table) from the dictionary
df = pd.DataFrame(dic)
# Write the comments column to an Excel file
df["contents"].to_excel('htx.xlsx')

Full code

import requests
from bs4 import BeautifulSoup
import pandas as pd

def main():
    html = askUrl()
    dic = analyse(html)
    writeExcel(dic)

def askUrl():
    #Pretend to be a browser
    header ={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.43'
    }
    # cid of classmate He's video "I turned on a light in 108 days": 499893135
    # Danmaku location
    url = "http://comment.bilibili.com/499893135.xml"
    req = requests.get(url = url, headers=header)
    html_byte = req.content  # response body as bytes
    html_str = str(html_byte,"utf-8")
    return html_str

def analyse(html):
    soup = BeautifulSoup(html,'html.parser')
    results = soup.find_all('d')
    # x.text is the comment text inside each <d> tag
    contents = [x.text for x in results]
    #Save results
    dic ={"contents" : contents}
    return dic

def writeExcel(dic):
    # Build a DataFrame from the dictionary and write the comments to Excel
    df = pd.DataFrame(dic)
    df["contents"].to_excel('htx.xlsx')
    
if __name__ == '__main__':
    main()
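
Before running the script, the third-party packages need to be installed; to_excel also needs an Excel writer such as openpyxl. Roughly, assuming the script above is saved as, say, main.py:

pip install requests beautifulsoup4 pandas openpyxl
python main.py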

Summary

To put it simply, there are three steps:
  1. Send a network request and obtain the resource
  2. Parse and search it to extract the useful information
  3. Store the information

Did you get it? If you did, give it a like before you go!

Topics: Python crawler