Hello, I'm Chen Chen
A few days ago, the livestream host Viya (Weiya) was fined 1.341 billion yuan for tax evasion. As soon as the news broke, it caused an uproar online; netizens lamented, only half jokingly, that the fine was more than they could ever earn in a lifetime.
So I crawled the data under this Weibo topic and ran a simple public opinion analysis!
01 Analyzing the page
Because Weibo is easier to crawl from the mobile site, this time we crawl the mobile version of Weibo.
This is where we normally enter keywords to search Weibo content.

Observing this page in the browser's developer tools, I found that every keyword search fires an XHR request whose response carries the data.

Now that we have found the real URL behind the data, we can proceed with the usual crawling steps.
02 Data acquisition
We located the real endpoint that stores the data above; now we just need to request it and extract the data.
1. Send the request
Looking at the request headers, it is not difficult to construct the request.

The construction code is as follows (the headers dict should be copied from the developer tools; the User-Agent below is only a placeholder, fill in your own):

import requests

# request headers copied from the browser; this User-Agent is a placeholder
headers = {'User-Agent': 'Mozilla/5.0'}

key = input("Please enter the crawling keyword:")
for page in range(1, 10):
    params = (
        ('containerid', f'100103type=1&q={key}'),
        ('page_type', 'searchall'),
        ('page', str(page)),
    )
    response = requests.get('https://m.weibo.cn/api/container/getIndex',
                            headers=headers, params=params)
2. Parse the response

From the observation above, the response could be converted into a dictionary and the fields read from it, but after testing I found regex extraction the simplest and most convenient, so that is the approach used here. Interested readers can try the dictionary route (a sketch follows the code below). The code is as follows:
import re

r = response.text
title = re.findall('"page_title":"(.*?)"', r)
comments_count = re.findall('"comments_count":(.*?),', r)
attitudes_count = re.findall('"attitudes_count":(.*?),', r)
# the storage step below also uses these two fields
reposts_count = re.findall('"reposts_count":(.*?),', r)
created_at = re.findall('"created_at":"(.*?)"', r)
for i in range(len(title)):
    # eval() turns the raw \uXXXX escapes in the JSON text back into characters
    print(eval(f"'{title[i]}'"), comments_count[i], attitudes_count[i])
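For the dictionary route, here is a minimal sketch; it assumes the usual m.weibo.cn container layout (data -> cards -> mblog), which you should verify against your own response:

data = response.json()
for card in data.get('data', {}).get('cards', []):
    mblog = card.get('mblog')
    if mblog:  # some cards are ads or card groups without an mblog
        print(mblog.get('text'), mblog.get('comments_count'), mblog.get('attitudes_count'))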
3. Store the data
With the data parsed, we can store it directly. Here I save it to a CSV file. The code is as follows:
import csv

for i in range(len(title)):
    try:
        with open(f'{key}.csv', 'a', newline='') as f:
            writer = csv.writer(f)
            # created_at looks like "Mon Dec 20 21:00:00 +0800 2021",
            # so the split() pieces below are year, month, day, weekday and time
            writer.writerow([eval(f"'{title[i]}'"), comments_count[i],
                             attitudes_count[i], reposts_count[i],
                             created_at[i].split()[-1],   # year
                             created_at[i].split()[1],    # month
                             created_at[i].split()[2],    # day
                             created_at[i].split()[0],    # weekday
                             created_at[i].split()[3]])   # time
    except:
        pass
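Note that the rows are appended without a header, while the cleaning code below reads columns by name. A small sketch that writes a header row once (the column names here are my assumption, chosen to match the cleaning code; adjust as needed):

import csv
import os

# write the header only once, before the first rows are appended
if not os.path.exists(f'{key}.csv'):
    with open(f'{key}.csv', 'a', newline='') as f:
        csv.writer(f).writerow(['title', 'comments_count', 'attitudes_count',
                                'reposts_count', 'year', 'month', 'day',
                                'weekday', 'Time'])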
03 Data cleaning
After collection, the data needs cleaning to meet the analysis requirements before we can visualize it.
1. Import data
Read the crawled data with pandas and preview it.
import pandas as pd

df = pd.read_csv('Weiya.csv', encoding='gbk')
print(df.head(10))

2. Date conversion
We find that the month column holds English abbreviations, which need to be converted to numbers. The code is as follows:
c = []
for i in list(df['month']):
    if i == 'Nov':
        c.append(11)
    elif i == 'Dec':
        c.append(12)
    elif i == 'Apr':
        c.append(4)
    elif i == 'Jan':
        c.append(1)
    elif i == 'Oct':
        c.append(10)
    else:
        c.append(7)
df['month'] = c
df.to_csv('Weiya.csv', encoding='gbk', index=False)
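The same conversion can be written more compactly with a mapping dict; this sketch keeps the fallback of 7 for any abbreviation not listed, mirroring the else branch above:

month_map = {'Jan': 1, 'Apr': 4, 'Oct': 10, 'Nov': 11, 'Dec': 12}
# any month not in the map falls back to 7, as in the else branch above
df['month'] = [month_map.get(m, 7) for m in df['month']]
df.to_csv('Weiya.csv', encoding='gbk', index=False)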

3. View data types
Check the field types and missing values; they already meet the analysis needs, so no further handling is required.
df.info()

04 Visual analysis
Now let's analyze the data visually.
1. Number of posts per day

Only about 100 pages of data were crawled here, which may explain why there are so few posts on the 20th and 21st.
The code is as follows:
from pyecharts.charts import Bar
from pyecharts import options as opts

# collect the distinct December days that appear in the data
c = []
d = {}
a = 0
for i in list(df['month']):
    if i == 12:
        if list(df['day'])[a] not in c:
            c.append(list(df['day'])[a])
    a += 1

# count the posts on each of those days
a = 0
for i in c:
    d[i] = 0
for i in list(df['month']):
    if i == 12:
        d[list(df['day'])[a]] += 1
    a += 1

columns = []
data = []
for k, v in d.items():
    columns.append(k)
    data.append(v)

bar = (
    Bar()
    .add_xaxis(columns)
    .add_yaxis("Number of articles", data)
    .set_global_opts(title_opts=opts.TitleOpts(title="Daily number of microblogs"))
)
bar.render("word frequency.html")
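The same daily counts can also be computed in a couple of lines with pandas, avoiding the manual index bookkeeping; the result feeds into the same Bar chart:

# count December posts per day with pandas instead of manual loops
daily = df[df['month'] == 12].groupby('day').size()
columns = [str(k) for k in daily.index]
data = [int(v) for v in daily.values]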
2. Top 10 by comments and likes

We find that the user doutujun little starfish received the most comments, over 75,000. Let's look at that post to see what makes users engage with it so much.

It also has a great many likes, perhaps because it was posted early and so ranked near the top; another reason may be that its content matches what everyone was feeling. A top-10 sketch with pandas follows below.
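The ranking itself can be pulled out with pandas nlargest; a minimal sketch, assuming the CSV columns are named title, comments_count and attitudes_count (my assumed names from the header sketch earlier; adjust to your file):

import pandas as pd

df = pd.read_csv('Weiya.csv', encoding='gbk')
# the count columns were scraped as text, so cast them to numbers first
df['comments_count'] = pd.to_numeric(df['comments_count'], errors='coerce')
df['attitudes_count'] = pd.to_numeric(df['attitudes_count'], errors='coerce')
print(df.nlargest(10, 'comments_count')[['title', 'comments_count']])
print(df.nlargest(10, 'attitudes_count')[['title', 'attitudes_count']])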
3. Comment time distribution

Analyzing the publication time of all comments, we find that the most comments were posted around 21:00, roughly the time the topic hit the hot-search list. It seems that whether or not something makes the hot-search list still has a great impact on its Weibo activity.
The code is as follows:
import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts

df = pd.read_csv('weiya.csv', encoding='gbk')

# count how many comments fall on each distinct time value
c = []
d = {}
for i in list(df['Time']):
    if i not in c:
        c.append(i)
for i in c:
    d[i] = 0
for i in list(df['Time']):
    d[i] += 1
print(d)

columns = []
data = []
for k, v in d.items():
    columns.append(k)
    data.append(v)

bar = (
    Bar()
    .add_xaxis(columns)
    .add_yaxis("time", data)
    .set_global_opts(title_opts=opts.TitleOpts(title="Time distribution"))
)
bar.render("word frequency.html")
4. Word cloud

The word cloud shows that keywords about tax evasion dominate, which fits the topic, followed by words such as cancellation, ban and imprisonment. It seems everyone genuinely detests illegal behavior.
The code is as follows:
from imageio import imread
import jieba
from wordcloud import WordCloud, STOPWORDS

with open("weiya.txt", encoding='utf-8') as f:
    job_title_1 = f.read()
with open('stop list .txt', 'r', encoding='utf-8') as f:
    stop_word = f.read()

# segment the text with jieba and drop stop words
word = jieba.cut(job_title_1)
words = []
for i in list(word):
    if i not in stop_word:
        words.append(i)
contents_list_job_title = " ".join(words)

# note: STOPWORDS.add() returns None, so add the extra word first
STOPWORDS.add("One")
wc = WordCloud(stopwords=STOPWORDS,
               collocations=False,
               background_color="white",
               font_path=r"K:\Su Xinshi and Liu Kaijian.ttf",
               width=400,
               height=300,
               random_state=42,
               mask=imread('xin.jpg', pilmode="RGB"))
wc.generate(contents_list_job_title)
wc.to_file("Recommendation.png")
05 Summary
As public figures, internet celebrities and stars should lead by example; they cannot enjoy fame and fortune while breaking the law.