I used Python to visualize and analyze the comments on White Snake 2: The Tribulation of the Green Snake

Posted by nay4 on Sun, 19 Sep 2021 15:16:37 +0200

Hello, I'm spicy.

Finally, we are starting the series of articles on data analysis, which is a step up in technical depth from web crawlers. My articles will be published in two columns, one for hands-on projects and one for detailed explanations and summaries of key concepts, with the short-term goal of completing 100 hands-on crawler and data analysis projects.

Libraries involved:

pandas - data processing
pyecharts - data visualization
jieba - word segmentation
collections - data statistics

Visualization part:

Line chart - Line
Bar chart - Bar
Pie chart - Pie
Calendar chart - Calendar
Word cloud - WordCloud
Map - Geo

White Snake 2: The Tribulation of the Green Snake

*Plot introduction:*

White Snake 2: The Tribulation of the Green Snake was released in mainland China on July 23, 2021. It is set in the late Southern Song Dynasty: Xiaobai, trying to save Xu Xian, ends up imprisoned under Leifeng Pagoda by Fahai, and Xiaoqing is cast by Fahai into the strange illusory realm of Shura City. Through several crises, Xiaoqing is rescued by a mysterious masked boy. Determined to get out and rescue Xiaobai, Xiaoqing endures hardship, grows, and finally finds a way to leave together with the masked boy.

Runtime environment: Notebook

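Install third-party packages

!pip install pyecharts
!pip install pandas
!pip install numpy
!pip install openpyxl  # needed by pd.read_excel for .xlsx files
!pip install pillow    # provides the PIL module

Import third-party packages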

import pandas as pd
import numpy as np
from pyecharts.charts import * 
from PIL import Image
from collections import Counter
from pyecharts import options as opts # Chart configuration options
from pyecharts.commons.utils import JsCode # Wraps raw JavaScript snippets
from pyecharts.globals import ThemeType,SymbolType,ChartType # Theme, symbol and chart type constants

Read data

df = pd.read_excel("./White Snake 2.xlsx")
df.head(10)  # View the first 10 rows

	id	user name	city	score	comment	Comment time
0	1142669584	Qitong glutinous rice	guest	5.0	The plot is very attractive. Watching an animated cartoon surprised me	2021-08-31 23:56:30
1	1142662178	LnV14610189	Xining	5.0	Strong picture sense!	2021-08-31 23:36:00
2	1142666877	Alo861902585	Guangzhou	5.0	And a very good connection, wonderful	2021-08-31 23:34:41
3	1142660216	Y.	Xi'an	4.0	The characters in the picture don't say that you can always believe in chasing light. The plot is smooth and the overall rhythm is OK. It is recommended to watch -!	2021-08-31 23:30:56
4	1142669423	I want to see the moon for you	Fengtai	5.0	Yes, although Xiaoqing and Xiaobai's obsession is far fetched (OK). If you can elaborate on the obsession of Niu Mo, the plot will be more perfect	2021-08-31 23:27:28
5	1142669422	Ah, Ka, wow, ah	sunshine	5.0	It's good. It feels more and more attractive	2021-08-31 23:27:12
6	1142669404	Lfz9696	Yongzhou	4.5	OK, very good	2021-08-31 23:23:30
7	1142666812	Tenacity	Guangzhou	4.0	The movie was ok, except that a man next door kept shaking his legs.	2021-08-31 23:22:23
8	1142661206	CQE579669148	Urumqi	5.0	Take a good look, recommend	2021-08-31 23:16:22
9	1142668420	Fat suona	Ili	5.0	The plot is a little incomprehensible, but the animation effects are great! The plot is very moving.	2021-08-31 23:06:36

Data cleaning

Checking for missing values

df.isnull().sum()
id              0
user name       1
city            0
score           0
comment         0
Comment time    0
dtype: int64

The check shows one missing value: a user name is missing, while all other columns are complete. Fill the missing value with "unknown":

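# Assign the result back; an in-place fillna on a selected column may not update df in recent pandas versions
df['user name'] = df['user name'].fillna('unknown')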
df.isnull().sum()
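
As a minimal sketch of the same check-and-fill step, here it is on a toy frame. The sample rows are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for the comment table; values are illustrative only.
sample = pd.DataFrame({
    "user name": ["Y.", None, "Tenacity"],
    "score": [4.0, 5.0, 4.0],
})

# Count missing values per column before filling.
missing_before = sample.isnull().sum()

# Fill the gap and assign the column back to the frame.
sample["user name"] = sample["user name"].fillna("unknown")
missing_after = sample.isnull().sum()

print(missing_before["user name"], missing_after["user name"])
```

Running `isnull().sum()` again after the fill is a cheap way to confirm that no gaps remain before plotting.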

Pyecharts data visualization

Score grade distribution

# Linear gradient
color_js = """new echarts.graphic.LinearGradient(0, 0, 1, 0,
    [{offset: 0, color: '#009ad6'}, {offset: 1, color: '#ed1941'}], false)"""

df_star = df.groupby('score')['comment'].count()
df_star = df_star.sort_values(ascending=True)
x_data = [str(i) for i in list(df_star.index)]
y_data = df_star.values.tolist()
b1 = (
    Bar()
    .add_xaxis(x_data)
    .add_yaxis('',y_data,itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
    .reversal_axis()
    .set_series_opts(label_opts=opts.LabelOpts(position='right'))    
    .set_global_opts(
        yaxis_opts=opts.AxisOpts(name='Rating'),
        xaxis_opts=opts.AxisOpts(name='Number of comments'),
        title_opts=opts.TitleOpts(title='Score grade distribution',pos_left='45%',pos_top="5%"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="85%",pos_top="28%",orient="vertical")
    )
)

df_star = df.groupby('score')['comment'].count()
x_data = [str(i) for i in list(df_star.index)]
y_data = df_star.values.tolist()
p1 = (
    Pie(init_opts=opts.InitOpts(width='800px', height='600px'))
    .add(
    '',
    [list(z) for z in zip(x_data, y_data)],
    radius=['10%', '30%'],
    center=['65%', '60%'],
    label_opts=opts.LabelOpts(is_show=True),
    ) 
    .set_colors(["blue", "green", "#800000", "red", "#000000", "orange", "purple", "red", "#000000", "orange", "purple"])
    .set_series_opts(label_opts=opts.LabelOpts(formatter='score{b}: {c} \n ({d}%)', position="outside"))
)

b1.overlap(p1)
b1.render_notebook()

Ratings of 5.0 account for 56% of the comments, so more than half of the audience gave the film five stars, and ratings of 4.0 or higher account for 85%. Overall, viewers rate this animation highly.
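
The percentages quoted above can also be computed directly instead of being read off the pie chart. The following is a small sketch on hypothetical scores (not the real data); `scores` stands in for `df['score']`:

```python
import pandas as pd

# Hypothetical ratings, for illustration only.
scores = pd.Series([5.0, 5.0, 5.0, 4.5, 4.0, 3.0, 1.0, 5.0, 4.0, 2.5])

# Share of each rating value, as a percentage of all comments.
share = scores.value_counts(normalize=True).sort_index() * 100

# Share of five-star ratings and of ratings at 4.0 or above.
five_star_pct = share.loc[5.0]
four_plus_pct = (scores >= 4.0).mean() * 100

print(round(five_star_pct, 1), round(four_plus_pct, 1))
```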

Distribution of daily comments from August 1, 2021 to August 31, 2021:

# Style settings
# The JavaScript snippets below define the gradient colors for the chart background and the area fill
color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#009ad6'}, {offset: 1, color: '#ed1941'}], false)"""

area_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#eb64fb'}, {offset: 1, color: '#3fbbff0d'}], false)"
)

# Line style parameters (width and shadow)
linestyle_dic = { 'normal': {
                    'width': 2,  
                    'shadowColor': '#696969', 
                    'shadowBlur': 10,  
                    'shadowOffsetY': 10,  
                    'shadowOffsetX': 10,  
                    }
                }

# Convert the comment time column to datetime
df['Comment time'] = pd.to_datetime(df['Comment time'], format='%Y/%m/%d %H:%M:%S')
# Daily comments: group by day of month and count the comments
df_day = df.groupby(df['Comment time'].dt.day)['comment'].count()
day_x_data = [str(i) for i in list(df_day.index)] # x-axis: day of month
day_y_data = df_day.values.tolist() # y-axis: comment counts as a list
 
line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js))) # Line chart with a gradient background
    .add_xaxis(xaxis_data=day_x_data) # Add x-axis data
    .add_yaxis(  # Add y-axis data
        series_name="",  # Series name (left empty)
        y_axis=day_y_data, # Data
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"), # Line style
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"), # Data point labels
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Daily comments in August",
            pos_top="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render_notebook()

Daily comments peaked on August 1 (the data excludes July) and then gradually declined over time, which matches the usual pattern for a film after its release.

Comments per hour

The statistics below are the total number of comments for each hour of the day, summed over the 31 days from August 1 to August 31, 2021. (If you are interested, you can filter by date to view the 24-hour distribution of comments for a single day.)

df_hour = df.groupby(df['Comment time'].dt.hour)['comment'].count()
hours_x_data = [str(i) for i in list(df_hour.index)]
hours_y_data = df_hour.values.tolist()
 
line1 = (
#     Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    Line(init_opts=opts.InitOpts(width='1000px', height='400px'))
    .add_xaxis(xaxis_data=hours_x_data)
    .add_yaxis(
        series_name="",
        y_axis=hours_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_series_opts(
        linestyle_opts=linestyle_dic,label_opts=opts.LabelOpts(font_size=12, color='red' ),
        markpoint_opts=opts.MarkPointOpts(
            data=[opts.MarkPointItem(type_="max",itemstyle_opts=opts.ItemStyleOpts(
            color="#06FFD7", border_width=3)), 
            opts.MarkPointItem(type_="min",itemstyle_opts=opts.ItemStyleOpts(
            color="#06FFD7", border_width=3))],
            symbol_size=[65, 50],
            label_opts=opts.LabelOpts(position="inside", color="red", font_size=10)
            ),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Comments per hour",
            pos_top="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#EB1934", font_family='STKaiti', font_size=20),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#EB1934"),
            axisline_opts=opts.AxisLineOpts(
                is_show=False, 
                linestyle_opts=opts.LineStyleOpts(color="#EB1934")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=False,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#EB1934"),
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(is_show=False, margin=20, color="#EB1934"),
            axisline_opts=opts.AxisLineOpts(
                is_show=False,
                linestyle_opts=opts.LineStyleOpts(width=2, color="#EB1934")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=False,
                length=10,
                linestyle_opts=opts.LineStyleOpts(color="#EB1934"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=False, linestyle_opts=opts.LineStyleOpts(color="#EB1934")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
        graphic_opts=[
            opts.GraphicImage(
                graphic_item=opts.GraphicItem(
                    id_="logo", z=-10, bounding="raw", origin=[50, 100]
                ),
                graphic_imagestyle_opts=opts.GraphicImageStyleOpts(
                    image="./12.jpg",
                    width=1000,
                    height=400,
                    opacity=0.3,
                ),
            )
        ],
    )
)
# line1.render_notebook()
# The background image displays when run locally, but the platform only shows the line chart without it, so a saved screenshot is shown instead. Copy the code and run it locally to see the full chart.
Image.open("./2.png")

Looking at the hourly distribution, people mostly comment from the afternoon into the evening, especially after 17:00, so it seems everyone is fairly dedicated during working hours. The second peak is at 22:00, a time when young people who stay up late are more active, since younger viewers tend to keep later hours.
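
The single-day view mentioned in this section could be sketched as follows. The timestamps are hypothetical, and `times` stands in for `df['Comment time']` after conversion to datetime:

```python
import pandas as pd

# Hypothetical timestamps, for illustration only.
times = pd.Series(pd.to_datetime([
    "2021-08-01 09:15:00",
    "2021-08-01 17:05:00",
    "2021-08-01 17:40:00",
    "2021-08-01 22:10:00",
    "2021-08-02 17:30:00",
]))

# Keep only the comments made on one calendar day.
one_day = times[times.dt.normalize() == pd.Timestamp("2021-08-01")]

# Count comments per hour of that day.
hourly = one_day.groupby(one_day.dt.hour).count()

# Reindex to all 24 hours so hours without comments appear as 0.
hourly = hourly.reindex(range(24), fill_value=0)

print(hourly.loc[17], hourly.loc[22])
```

Reindexing to the full 0-23 range keeps the x-axis complete even when a given day has no comments in some hours.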

3.4 Comments per day of the week

The statistics below are the total number of comments for each day of the week from August 1 to August 31, 2021:

# Add a 'week' column (dayofweek is 0 for Monday, so add 1)
dic = {1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday',7:'Sunday'}
df['week'] = df['Comment time'].dt.dayofweek+1
df['week'] = df['week'].map(dic)
df.head(5)

	id	user name	city	score	comment	Comment time	week
0	1142669584	Qitong glutinous rice	guest	5.0	The plot is very attractive. Watching an animated cartoon surprised me	2021-08-31 23:56:30	Tuesday
1	1142662178	LnV14610189	Xining	5.0	Strong picture sense!	2021-08-31 23:36:00	Tuesday
2	1142666877	Alo861902585	Guangzhou	5.0	And a very good connection, wonderful	2021-08-31 23:34:41	Tuesday
3	1142660216	Y.	Xi'an	4.0	The characters in the picture don't say that you can always believe in chasing light. The plot is smooth and the overall rhythm is OK. It is recommended to watch -!	2021-08-31 23:30:56	Tuesday
4	1142669423	I want to see the moon for you	Fengtai	5.0	Yes, although Xiaoqing and Xiaobai's obsession is far fetched (OK). If you can elaborate on the obsession of Niu Mo, the plot will be more perfect	2021-08-31 23:27:28	Tuesday

# Comments per day of the week
dic = {1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday',7:'Sunday'}
df['week'] = df['Comment time'].dt.dayofweek+1
df1 = df.sort_values('week',ascending=True)
df_week = df1.groupby(['week'])['comment'].count()
week_x_data = [dic[i] for i in list(df_week.index)]
week_y_data = df_week.values.tolist()
 
line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    .add_xaxis(xaxis_data=week_x_data)
    .add_yaxis(
        series_name="",
        y_axis=week_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Comments per day of the week",
            pos_top="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render_notebook()

Looking at the distribution by day of the week, Mondays and Sundays are the most active days for comments. Interestingly, these mark the beginning and the end of the week.
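
One caveat when comparing weekday totals: August 2021 contains five Sundays, Mondays and Tuesdays but only four of each other weekday, so part of the difference may simply come from the calendar. A rough sketch of normalizing by the number of occurrences, using hypothetical totals rather than the real counts:

```python
import pandas as pd

# How many times each weekday occurs in August 2021.
august = pd.date_range("2021-08-01", "2021-08-31")
occurrences = pd.Series(august.day_name()).value_counts()

# Hypothetical weekday totals, for illustration only.
totals = pd.Series({"Monday": 500, "Wednesday": 400, "Sunday": 600})

# Average number of comments per occurrence of each weekday.
per_day_avg = totals / occurrences.reindex(totals.index)

print(occurrences["Sunday"], occurrences["Wednesday"])
print(per_day_avg)
```

With these made-up numbers, Monday and Wednesday end up with the same average even though Monday's total is higher.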

3.5 Calendar chart

times = [x.strftime('%Y-%m-%d') for x in list(pd.date_range('20210801', '20210831'))]
data = [[times[index],day_y_data[index]] for index,item in enumerate( day_y_data)]
Cal = (
    Calendar(init_opts=opts.InitOpts(width="800px", height="500px"))
    .add(
        series_name="Distribution of daily comments in August",
        yaxis_data=data,
        calendar_opts=opts.CalendarOpts(
             pos_top='20%',
             pos_left='5%',
             range_="2021-08",
             cell_size=40,
             # Day / month / year label style settings
             daylabel_opts=opts.CalendarDayLabelOpts(name_map="cn",
                                                     margin=20,
                                                     label_font_size=14,
                                                     label_color='#EB1934', 
                                                     label_font_weight='bold'
                                                    ),
             monthlabel_opts=opts.CalendarMonthLabelOpts(name_map="cn",
                                                         margin=20,
                                                         label_font_size=14,
                                                         label_color='#EB1934', 
                                                         label_font_weight='bold',
                                                         is_show=False
                                                        ),
             yearlabel_opts=opts.CalendarYearLabelOpts(is_show=False),
        ),
        tooltip_opts=opts.TooltipOpts(formatter='{c}'),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            pos_top="2%", 
            pos_left="center", 
            title=""
        ),
        visualmap_opts=opts.VisualMapOpts(
            orient="horizontal", 
            max_=2000,
            pos_bottom='10%',
            is_piecewise=True,
            pieces=[{"min": 1200},
                    {"min": 800, "max": 1200},
                    {"min": 500, "max": 800},
                    {"min": 300, "max": 500},
                    {"min": 80, "max": 300},
                    {"max": 80}],
            range_color=["#F5F5F5", "#FFE4E1", "#FFCC99", "#F08080", "#CD5C5C", "#990000"]
        ),
        legend_opts=opts.LegendOpts(is_show=True,
                                    pos_top='5%',
                                    item_width = 50,
                                    item_height = 30,
                                    textstyle_opts=opts.TextStyleOpts(font_size=16,color='#EB1934'),
                                    legend_icon ='path://M621.855287 587.643358C708.573965 540.110571 768 442.883654 768 330.666667 768 171.608659 648.609267 42.666667 501.333333 42.666667 354.057399 42.666667 234.666667 171.608659 234.666667 330.666667 234.666667 443.22333 294.453005 540.699038 381.59961 588.07363 125.9882 652.794383 21.333333 855.35859 21.333333 1002.666667L486.175439 1002.666667 1002.666667 1002.666667C1002.666667 815.459407 839.953126 634.458526 621.855287 587.643358Z'
                                   ),
    )
)
Cal.render_notebook()
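
The positional pairing of `times` and `day_y_data` above assumes that every day of August has at least one comment. If a day were missing from `df_day`, the later counts would shift onto the wrong dates. A hedged sketch of aligning the counts to the full date range first, using made-up counts and a shortened range:

```python
import pandas as pd

# Hypothetical per-day counts indexed by day of month; day 3 has no comments.
df_day = pd.Series({1: 1500, 2: 900, 4: 700})

dates = pd.date_range("2021-08-01", "2021-08-04")

# Align the counts to every day in the range; missing days become 0.
counts = df_day.reindex(list(dates.day), fill_value=0)

# Build the [date string, value] pairs that the calendar chart expects.
data = [[d.strftime("%Y-%m-%d"), int(n)] for d, n in zip(dates, counts)]

print(data)
```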

3.6 Character popularity

Main characters: Xiaobai, Xiaoqing, Xu Xian, Fahai, Sima, Sister Sun, the leader of the Niutou Gang, the masked man, the owner of Baoqing Workshop, and the scholar

roles=['Xiaobai','Xiaoqing','Xu Xian','Fahai','Sima','Sister Sun','Niutou Gang leader','Masked man','Baoqing Workshop owner','Scholar']
# Join all comments into one string and count how often each character name appears
content=''.join([str(i) for i in list(df['comment'])])
roles_num=[]
for role in roles:
    count=content.count(role)
    roles_num.append((role,count))
roles_num=pd.DataFrame(roles_num)
roles_num.columns=['name','Number of occurrences']
roles_num

	name	Number of occurrences
0	Xiaobai	1523
1	Xiaoqing	2683
2	Xu Xian	239
3	Fahai	396
4	Sima	112
5	Sister Sun	20
6	Niutou Gang leader	1
7	Masked man	3
8	Baoqing Workshop owner	101
9	Scholar	4

# Linear gradient
color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#ed1941'}], false)"""

roles_num=roles_num.sort_values(by='Number of occurrences',ascending=False)
roles_num=roles_num.reset_index(drop=True)
b2 = (
        Bar()
        .add_xaxis(list(roles_num['name']))
        .add_yaxis('frequency', list(roles_num['Number of occurrences']),itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
        .set_global_opts(title_opts=opts.TitleOpts(title='Frequency distribution of film review roles',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            yaxis_opts=opts.AxisOpts(name="frequency",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))

    )
b2.render_notebook()
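
Counting occurrences in one joined string measures total mentions, so a single comment that repeats a name several times counts several times. As an alternative view, the sketch below also counts how many comments mention each character at least once. The comments and the `mentions` frame are hypothetical and only illustrate the idea:

```python
import re

import pandas as pd

# Hypothetical comments, for illustration only.
comments = pd.Series([
    "Xiaoqing is great, Xiaoqing forever",
    "Xiaobai and Xiaoqing",
    "Nice animation",
])
roles = ["Xiaobai", "Xiaoqing", "Fahai"]

rows = []
for role in roles:
    # Total number of times the name appears across all comments.
    total = int(comments.str.count(re.escape(role)).sum())
    # Number of comments that mention the name at least once.
    n_comments = int(comments.str.contains(role, regex=False).sum())
    rows.append((role, total, n_comments))

mentions = pd.DataFrame(rows, columns=["name", "occurrences", "comments"])
print(mentions)
```

Counting per comment also avoids accidental matches that span the boundary between two joined comments.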

3.7 Geographical distribution of viewers

cities = df['city'].values.tolist()
data = Counter(cities).most_common(80)
geo = (
    Geo(init_opts=opts.InitOpts(width="1000px", height="600px", bg_color="#404a59"))
    .add_schema(maptype="china", 
                itemstyle_opts={
                  'normal': {
                      'shadowColor': 'rgba(0, 0, 0, .5)', 
                      'shadowBlur': 5, 
                      'shadowOffsetY': 0, 
                      'shadowOffsetX': 0, 
                      'borderColor': '#fff'
                  }
              }
               )
    .add("Number of comments", data,type_=ChartType.HEATMAP,)
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
       title_opts=opts.TitleOpts(title="Geographical distribution",pos_top="2%", pos_left="center",
                                 title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16)),
       legend_opts=opts.LegendOpts(is_show=False),
       visualmap_opts=opts.VisualMapOpts(
            is_show=True,
            is_piecewise=True,
            min_ = 0,
            max_ = 500,
            split_number = 5,
            series_index=0,
            pos_bottom='5%',
            pos_left='5%',
            textstyle_opts=opts.TextStyleOpts(color="#fff"),
            pieces=[
                {'max':500, 'min':401, 'label':'401-500', 'color': '#990000'},
                {'max':400, 'min':301, 'label':'301-400', 'color': '#CD5C5C'},
                {'max':300, 'min':201, 'label':'201-300', 'color': '#F08080'},
                {'max':200, 'min':101, 'label':'101-200', 'color': '#FFCC99'},
                {'max':100, 'min':0, 'label':'0-100', 'color': '#FFE4E1'},
               ],
            ),
    )
)
geo.render_notebook()

Judging from the map, the audience is concentrated mainly in Beijing, Tianjin, Shanghai, Chongqing, Sichuan, Guangdong, Yunnan, and other places.
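
For reference, the `(city, count)` pairs passed to the map come from `Counter.most_common`. A tiny sketch with a made-up city list shows the shape of that data:

```python
from collections import Counter

# Hypothetical city values, for illustration only.
cities = ["Guangzhou", "Xi'an", "Guangzhou", "Xining", "Guangzhou", "Xi'an"]

# The n most frequent cities as (name, count) tuples, most frequent first.
top = Counter(cities).most_common(2)

# Share of all comments covered by these top cities.
covered = sum(count for _, count in top) / len(cities)

print(top, round(covered, 2))
```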


Topics: Python Data Analysis pandas pyecharts
