The opening ceremony of the Winter Olympics amazed the world: Baidu search index for nearly 90 days + word cloud

Posted by LiamS94 on Sat, 12 Feb 2022 02:14:43 +0100

1, Introduction

In real business scenarios, we often need a crawler to obtain historical Baidu search index data for given keywords and then analyze that data.

Baidu Index lets you experience the beauty of big data. However, obtaining its data is tricky for the following reasons:

  • The page is not static. The JSON returned by the Baidu Index request URL does not contain plain values; it contains encrypted data plus a uniqid. You must make a second request with that uniqid to obtain the decryption key; the front end then decrypts the data and renders it as a line chart.

  • You must be logged in to a Baidu account on the Baidu Index page (the request needs your Cookie).

  • You need to port the front-end decryption code to Python, or use execjs to run the JavaScript code directly.

This article walks through using a Python crawler to obtain historical Baidu search index data for a keyword (the Winter Olympics), visualizes the search index for the last 90 days, and turns collected media coverage into a word cloud.

  • Environment: Anaconda + PyCharm

  • Main libraries used: requests, execjs, datetime, pandas, matplotlib, stylecloud

2, Web page analysis

If you don't have a Baidu account, register one first, then open the Baidu Index official website:

https://www.baidu.com/s?wd= baidu index

Search for the Winter Olympics and select the last 90 days to see the line chart of its search index over that period:

Our goal is to grab this search index data and save it to a local Excel file.

First, obtain your Cookie after logging in (it is required; without it you cannot obtain the data). How to find the Cookie is shown in the figure below:

By analyzing the network requests, the JSON data interface can be found, as shown below:

In the Request URL, the word parameter carries the search keyword (only the Chinese characters are URL-encoded), and days=90 requests roughly the last 90 days of data, counted back from the day before the current date. You can change days to get more or less data. Paste the Request URL into the browser to inspect the response (a plug-in such as JSON Handle makes viewing JSON pages easier):

https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22%E5%86%AC%E5%A5%A5%E4%BC%9A%22,%22wordType%22:1%7D]]&days=90
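As a quick sketch of how this URL can be assembled in Python (the keyword below is just an example, and only the Chinese characters are percent-encoded):

from urllib.parse import quote

keyword = '冬奥会'   # example keyword (Winter Olympics); replace with any keyword
days = 90            # number of days of history to request

# Splice the percent-encoded keyword and the day count into the SearchApi URL
data_url = ('https://index.baidu.com/api/SearchApi/index?area=0'
            '&word=[[%7B%22name%22:%22' + quote(keyword) + '%22,%22wordType%22:1%7D]]'
            '&days={}'.format(days))
print(data_url)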

You can see the following data:

After decrypting the data under all, pc, and wise and comparing them with the values shown in the search index line chart, it turns out that the all field holds the overall search index. The response also contains the uniqid; both the encrypted data and the uniqid change every time you refresh.

Further analysis shows that the uniqid returned by the data request is used in another URL, the one that returns the decryption key (shown below).

So the workflow is: first request the data URL and parse out the encrypted search index data and the uniqid, then splice the uniqid into the key URL to fetch the decryption key, and finally run the decryption routine to recover the search index values.

https://index.baidu.com/Interface/ptbk?uniqid=b92927de43cc02fcae9fbc0cee99e3a9

Once the URLs are identified, the crawler follows the usual pattern: send a request, receive the response, parse the data, decrypt it, and save it.
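To make the decryption step concrete, here is a toy example with a made-up key and made-up encrypted data (not real API output): the first half of the key lists the cipher characters, and the second half lists the plain characters they map to.

# Toy illustration of the decryption mapping (hypothetical key and data, not real API output)
key = 'abcde12,34'   # first half 'abcde' maps one-to-one onto second half '12,34'
source = 'abcde'     # "encrypted" data: each character is looked up in the mapping
mapping = {key[i]: key[len(key) // 2 + i] for i in range(len(key) // 2)}
print(''.join(mapping[ch] for ch in source))  # prints '12,34', i.e. the daily values 12 and 34

The real key and data work the same way, just with longer character sets; the decryption() function in the next section implements exactly this mapping.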

3, Data acquisition

Python code:

# -*- coding: UTF-8 -*-
"""
@Author   :  Ye Tingyun
@official account    : AI Yun Jun Ting
@CSDN     : https://yetingyun.blog.csdn.net/
"""
import execjs
import requests
import datetime
import pandas as pd
from colorama import Fore, init

init()

# Python code for decryption of search index data
def decryption(keys, data):
    dec_dict = {}
    for j in range(len(keys) // 2):
        dec_dict[keys[j]] = keys[len(keys) // 2 + j]

    dec_data = ''
    for k in range(len(data)):
        dec_data += dec_dict[data[k]]
    return dec_data


if __name__ == "__main__":
    # Beijing Winter Olympics Opening Ceremony
    keyword = 'Beijing Winter Olympics'     # Keywords included in Baidu search
    period = 90           # Nearly 90 days
    start_str = 'https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22'
    end_str = '%22,%22wordType%22:1%7D]]&days={}'.format(period)
    dataUrl = start_str + keyword + end_str
    keyUrl = 'https://index.baidu.com/Interface/ptbk?uniqid='
    # Request header
    header = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Cookie': 'Note: replace this with your own Cookie',
        'Host': 'index.baidu.com',
        'Referer': 'https://index.baidu.com/v2/main/index.html',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }
    # Set the request timeout to 16 seconds
    resData = requests.get(dataUrl,
                           timeout=16, headers=header)

    uniqid = resData.json()['data']['uniqid']
    print(Fore.RED + "uniqid: {}".format(uniqid))
    keyData = requests.get(keyUrl + uniqid,
                           timeout=16, headers=header)
    keyData.raise_for_status()
    keyData.encoding = keyData.apparent_encoding

    # Parsing json data
    startDate = resData.json()['data']['userIndexes'][0]['all']['startDate']
    print(Fore.RED + "startDate: {}".format(startDate))
    endDate = resData.json()['data']['userIndexes'][0]['all']['endDate']
    print(Fore.RED + "endDate: {}".format(endDate))
    source = (resData.json()['data']['userIndexes'][0]['all']['data'])  # Original encrypted data
    print(Fore.RED + "Original encrypted data:{}".format(source))
    key = keyData.json()['data']  # secret key
    print(Fore.RED + "Key:{}".format(key))

    res = decryption(key, source)
    # print(type(res))
    resArr = res.split(",")
    # Generate datetime
    dateStart = datetime.datetime.strptime(startDate, '%Y-%m-%d')
    dateEnd = datetime.datetime.strptime(endDate, '%Y-%m-%d')
    dataLs = []
    # Every day from the start date to the end date
    while dateStart <= dateEnd:
        dataLs.append(str(dateStart))
        dateStart += datetime.timedelta(days=1)
        # print(dateStart.strftime('%Y-%m-%d'))

    ls = []
    # Organize the data rows as [keyword, date, index] and print them for a quick check
    for i in range(len(dataLs)):
        ls.append([keyword, dataLs[i], resArr[i]])

    for i in range(len(ls)):
        print(Fore.RED + str(ls[i]))

    # Save data to Excel and set column name
    df = pd.DataFrame(ls)
    df.columns = ["key word", "date", "Baidu search index"]
    df.to_excel("Beijing Winter Olympics search index data 90 days.xlsx", index=False)

The results are as follows:

The decryption can also be done by executing the original JavaScript directly, as shown in the following code:

# A strength of Python is its rich third-party ecosystem: execjs can run JavaScript directly,
# so if you don't want to port the decryption algorithm to Python, you can execute the original JS code
js = execjs.compile('''
            function decryption(t, e){
                for(var a=t.split(""),i=e.split(""),n={},s=[],o=0;o<a.length/2;o++)
                    n[a[o]]=a[a.length/2+o]
                for(var r=0;r<e.length;r++)
                    s.push(n[i[r]])
                return s.join("")
            }
    ''')
res = js.call('decryption', key, source)  # Call the JS function to decrypt; gives the same result as the Python decryption() above

The Python crawler runs successfully, and the data is saved as "Beijing Winter Olympics search index data 90 days.xlsx".

4, Search index visualization

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import math
import warnings


warnings.filterwarnings("ignore")
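# Note: the two Excel files below are assumed to come from running the section 3 crawler twice,
# once with keyword 'opening ceremony of Winter Olympics' and once with 'Beijing Winter Olympics'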
df1 = pd.read_excel("Winter Olympics opening ceremony search index data 90 days.xlsx")
df2 = pd.read_excel("Beijing Winter Olympics search index data 90 days.xlsx")

mpl.rcParams['font.family'] = 'SimHei'

# Generate the x-axis data list (the last two days are dropped to match the y data below)
# x_data = [i for i in range(1, len(df1) + 1)]
x_data = [i for i in range(1, len(df1) - 1)]
y_data1 = df1["Baidu search index"][:-2]
y_data2 = df2["Baidu search index"][:-2]

# Set figure pixel size
fig, ax = plt.subplots(figsize=(6, 4), dpi=500)

# Plot the two lines; color and label are used for the legend
plt.plot(x_data, y_data1,
         color="#FF1493", label="opening ceremony of Winter Olympics")
plt.plot(x_data, y_data2,
         color="#00BFFF", label="Beijing Winter Olympics")


# x y axis label font size
plt.xlabel("Chronological order", 
           fontdict={"size": 15, "weight": "bold", "color": "black"})
plt.ylabel("Baidu search index",
           fontdict={"size": 15, "weight": "bold", "color": "black"}
          )

# Set the size of the axis tick labels
plt.tick_params(axis='x', direction='out',
               labelsize=12, length=4.6)
plt.tick_params(axis='y', direction='out',
               labelsize=12, length=4.6)

# Show Legend 
plt.legend(fontsize=15, frameon=False)
labels = ax.get_xticklabels() + ax.get_yticklabels()
[label.set_fontname('Times New Roman') for label in labels]
[label.set_color('black') for label in labels]
plt.grid(alpha=0.7, ls=":")

# Show the figure
plt.show()
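If you also want to write the chart to disk, one extra line before plt.show() does it; a minimal sketch (the output filename is just a placeholder):

plt.savefig("search_index_90_days.png", dpi=500, bbox_inches="tight")  # placeholder filename; saves the figure next to the script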

The results are as follows:


The search indexes for both the opening ceremony of the Winter Olympics and the Beijing Winter Olympics have shot almost straight up in recent days, showing how much attention everyone is paying.

Science and technology, humanity, romance and passion come together in the name of the Olympics, writing a beautiful chapter for a community with a shared future for mankind. Wishing this year's Beijing Winter Olympics complete success!! 🏆🏆❤️❤️

5, Word cloud picture

Many media outlets rushed to cover the opening ceremony of the Winter Olympics. Here we collect some of that coverage as text and turn it into a word cloud.

with open("Opening ceremony of Winter Olympics.txt", encoding="utf-8") as f:
    data = f.read()

# Text preprocessing removes some useless characters and only extracts Chinese characters
new_data = re.findall('[\u4e00-\u9fa5]+', data, re.S)
new_data = "/".join(new_data)

# Precise mode of text word segmentation
seg_list_exact = jieba.cut(new_data, cut_all=False)

# Load stop words
with open('stop_words.txt', encoding='utf-8') as f:
    # Get the stop word of each line and add it to the collection
    con = f.read().split('\n')
    stop_words = set()
    for i in con:
        stop_words.add(i)

# List parsing removes stop words and single words
result_list = [word for word in seg_list_exact if word not in stop_words and len(word) > 1]
print(result_list)

# https://fontawesome.com/license/free
# Recommended palette color schemes that work well; others I tried looked only average
# colorbrewer.qualitative.Dark2_7
# cartocolors.qualitative.Bold_5
# colorbrewer.qualitative.Set1_8
gen_stylecloud(
    text=' '.join(result_list),                   # Text data
    size=600,                                     # Word cloud size
    font_path='./font/Cat gnawing net sugar round body.ttf',  # A Chinese font must be set for Chinese text to display
    icon_name="fas fa-grin-beam",                 # Icon shape
    output_name='Opening ceremony of Winter Olympics.png',    # Output word cloud file name
    palette='colorbrewer.qualitative.Set1_8',     # Select color scheme
)

If the 2008 Olympic Games showed the world the beauty of China's traditional culture, then this Winter Olympics has shown the world how rapidly China's science and technology have changed and taken off!

6, Summary

The opening ceremony of this year's Winter Olympic Games is really beautiful and amazing all over the world!

Summary:

  • Python crawler: obtain historical data of Baidu search index according to keywords

  • Parse JSON data, decrypt encrypted data, and save the data to Excel

  • matplotlib: visualization of Baidu search index

  • stylecloud: draw word cloud

That's all for this share. If you found it helpful, a like, favorite, and follow would be much appreciated!


Topics: Python Programmer