Hello, I'm spicy.
At last, we are starting the series of articles on data analysis. Compared with web crawlers, this takes the technical level up a notch. My articles will be published in two columns, one for hands-on projects and one for detailed explanations and summaries of key concepts, with the short-term goal of completing 100 hands-on crawler and data analysis projects.
Libraries involved:
Pandas: data processing
Pyecharts: data visualization
jieba: word segmentation
collections: data statistics
Visualization part:
Line chart: Line
Bar chart: Bar
Pie chart: Pie
Calendar chart: Calendar
Word cloud: WordCloud
Map: Geo
White Snake 2: The Tribulation of the Green Snake
*Plot introduction:*
On July 23, 2021, White Snake 2: The Tribulation of the Green Snake was released in mainland China. Set at the end of the Southern Song Dynasty, it tells how Xiaobai, trying to save Xu Xian, is finally imprisoned beneath Leifeng Pagoda by Fahai, while Xiaoqing is cast by Fahai into the strange, illusory city of Shura. Through several crises, Xiaoqing is rescued by a mysterious masked youth. Determined to get out and rescue Xiaobai, she endures hardship, grows up, and finally finds a way to leave together with the masked youth.
Execution link Notebook
Install third party packages
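!pip install pyecharts
!pip install pandas
!pip install numpy
!pip install openpyxl  # pandas needs this engine to read .xlsx files
!pip install pillow    # provides the PIL module imported below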
Import third party packages
import pandas as pd
import numpy as np
from pyecharts.charts import *
from PIL import Image
from collections import Counter
from pyecharts import options as opts  # chart configuration options
from pyecharts.commons.utils import JsCode  # wraps raw JavaScript snippets
from pyecharts.globals import ThemeType, SymbolType, ChartType  # themes, symbols, chart types
Read data
df = pd.read_excel("./White Snake 2.xlsx")
df.head(10)  # view the first 10 rows
index | id | user name | city | score | comment | Comment time |
---|---|---|---|---|---|---|
0 | 1142669584 | Qitong glutinous rice | guest | 5.0 | The plot is very attractive. Watching an animated cartoon surprised me | 2021-08-31 23:56:30 |
1 | 1142662178 | LnV14610189 | Xining | 5.0 | Strong picture sense! | 2021-08-31 23:36:00 |
2 | 1142666877 | Alo861902585 | Guangzhou | 5.0 | And a very good connection, wonderful | 2021-08-31 23:34:41 |
3 | 1142660216 | Y. | Xi'an | 4.0 | The characters in the picture don't say that you can always believe in chasing light. The plot is smooth and the overall rhythm is OK. It is recommended to watch -! | 2021-08-31 23:30:56 |
4 | 1142669423 | I want to see the moon for you | Fengtai | 5.0 | Yes, although Xiaoqing and Xiaobai's obsession is far fetched (OK). If you can elaborate on the obsession of Niu Mo, the plot will be more perfect | 2021-08-31 23:27:28 |
5 | 1142669422 | Ah, Ka, wow, ah | sunshine | 5.0 | It's good. It feels more and more attractive | 2021-08-31 23:27:12 |
6 | 1142669404 | Lfz9696 | Yongzhou | 4.5 | OK, very good | 2021-08-31 23:23:30 |
7 | 1142666812 | Tenacity | Guangzhou | 4.0 | The movie was ok, except that a man next door kept shaking his legs. | 2021-08-31 23:22:23 |
8 | 1142661206 | CQE579669148 | Urumqi | 5.0 | Take a good look, recommend | 2021-08-31 23:16:22 |
9 | 1142668420 | Fat suona | Ili | 5.0 | The plot is a little incomprehensible, but the animation effects are great! The plot is very moving. | 2021-08-31 23:06:36 |
Data cleaning
Missing value view
df.isnull().sum()

id              0
user name       1
city            0
score           0
comment         0
Comment time    0
dtype: int64
The check shows that there is one missing value: a single user name is missing, while the other columns are complete. Fill the blank with "unknown":
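# Assign the result back; inplace=True on a selected column is unreliable in newer pandas versions
df['user name'] = df['user name'].fillna('unknown')
df.isnull().sum()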
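The check-and-fill step can be tried on a tiny frame before touching the real data. This is only a sketch: the frame `sample` and its values are made up for illustration, and it assigns the filled column back instead of relying on `inplace=True`.

```python
import pandas as pd

# Hypothetical miniature of the review table; the values are made up.
sample = pd.DataFrame({
    "user name": ["Tenacity", None, "Y."],
    "score": [4.0, 5.0, 4.0],
})

missing_before = sample.isnull().sum()  # missing values per column

# Assign the filled column back rather than using inplace=True
sample["user name"] = sample["user name"].fillna("unknown")

missing_after = sample.isnull().sum()
print(missing_before["user name"], missing_after["user name"])
```

Comparing the counts before and after is a quick way to confirm that the fill actually took effect.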
Pyecharts data visualization
Score grade distribution
# Linear gradient
color_js = """new echarts.graphic.LinearGradient(0, 0, 1, 0, [{offset: 0, color: '#009ad6'}, {offset: 1, color: '#ed1941'}], false)"""

# Number of reviews per score, sorted by count for the bar chart
df_star = df.groupby('score')['comment'].count()
df_star = df_star.sort_values(ascending=True)
x_data = [str(i) for i in list(df_star.index)]
y_data = df_star.values.tolist()
b1 = (
    Bar()
    .add_xaxis(x_data)
    .add_yaxis('', y_data, itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
    .reversal_axis()
    .set_series_opts(label_opts=opts.LabelOpts(position='right'))
    .set_global_opts(
        yaxis_opts=opts.AxisOpts(name='Rating'),
        xaxis_opts=opts.AxisOpts(name='Number of ratings'),
        title_opts=opts.TitleOpts(title='Score grade distribution', pos_left='45%', pos_top="5%"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="85%", pos_top="28%", orient="vertical"),
    )
)

# The same counts in score order for the pie chart
df_star = df.groupby('score')['comment'].count()
x_data = [str(i) for i in list(df_star.index)]
y_data = df_star.values.tolist()
p1 = (
    Pie(init_opts=opts.InitOpts(width='800px', height='600px'))
    .add(
        '',
        [list(z) for z in zip(x_data, y_data)],
        radius=['10%', '30%'],
        center=['65%', '60%'],
        label_opts=opts.LabelOpts(is_show=True),
    )
    .set_colors(["blue", "green", "#800000", "red", "#000000", "orange", "purple", "red", "#000000", "orange", "purple"])
    # position belongs to the label options, not to set_series_opts itself
    .set_series_opts(label_opts=opts.LabelOpts(formatter='score{b}: {c} \n ({d}%)', position="outside"))
)
b1.overlap(p1)
b1.render_notebook()
Ratings of 5.0 account for 56% of the total, so more than half of the audience gave five stars, and ratings of four stars or above account for 85%. The audience clearly thinks highly of this animated film.
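Percentages like these can be computed directly rather than read off the chart. The following is a minimal sketch on made-up scores (the series `scores` is hypothetical and is not the real dataset); with the real table, `df['score']` would take its place.

```python
import pandas as pd

# Hypothetical scores, not the real dataset.
scores = pd.Series([5.0, 5.0, 4.5, 4.0, 3.0, 5.0, 2.0, 4.0, 5.0, 1.0])

# Fraction of reviews for each score value
share = scores.value_counts(normalize=True).sort_index()
five_star_share = share.loc[5.0]

# Fraction of reviews with a score of 4.0 or above
four_plus_share = (scores >= 4.0).mean()

print(f"5.0: {five_star_share:.0%}, 4.0 and above: {four_plus_share:.0%}")
```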
Distribution of daily comments from August 1, 2021 to August 31, 2021:
# Styles: the embedded JS snippets define the gradient colours
color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0, [{offset: 0, color: '#009ad6'}, {offset: 1, color: '#ed1941'}], false)"""
area_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#eb64fb'}, {offset: 1, color: '#3fbbff0d'}], false)"
)

# Line style parameters (reused by later charts)
linestyle_dic = {
    'normal': {
        'width': 2,
        'shadowColor': '#696969',
        'shadowBlur': 10,
        'shadowOffsetY': 10,
        'shadowOffsetX': 10,
    }
}

# Convert the time column to datetime. The values look like 2021-08-31 23:56:30,
# so a '%Y/%m/%d' format string would not match; let pandas infer the format.
df['Comment time'] = pd.to_datetime(df['Comment time'])

# Comments per day
df_day = df.groupby(df['Comment time'].dt.day)['comment'].count()  # count comments by day of month
day_x_data = [str(i) for i in list(df_day.index)]  # x-axis
day_y_data = df_day.values.tolist()  # y-axis, as a list
line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))  # line chart with gradient background
    .add_xaxis(xaxis_data=day_x_data)  # x-axis data
    .add_yaxis(  # y-axis data
        series_name="",  # series name
        y_axis=day_y_data,  # data
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),  # line style
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),  # data labels
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Daily comments in August",
            pos_top="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render_notebook()
The number of comments per day peaked on August 1 (the data does not include July) and then gradually declined over time, which matches the usual pattern for a film's run in cinemas.
Comments per hour
These figures are the total number of comments in each hour of the day, summed over the 31 days from August 1, 2021 to August 31, 2021. (If you are interested, you can filter by date to view the 24-hour distribution of reviews for a single day.)
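Filtering to a single day could look like the sketch below. The frame `sample` and its timestamps are hypothetical stand-ins for the real `df`; only the column names follow the table above.

```python
import pandas as pd

# Hypothetical timestamps standing in for df['Comment time'].
sample = pd.DataFrame({
    "Comment time": pd.to_datetime([
        "2021-08-01 09:15:00", "2021-08-01 22:05:00", "2021-08-01 22:40:00",
        "2021-08-02 22:10:00",
    ]),
    "comment": ["a", "b", "c", "d"],
})

# Keep only the rows whose date part equals the chosen day
day = pd.Timestamp("2021-08-01")
one_day = sample[sample["Comment time"].dt.normalize() == day]

# Count comments per hour and keep hours without comments as 0
hourly = (
    one_day.groupby(one_day["Comment time"].dt.hour)["comment"]
    .count()
    .reindex(range(24), fill_value=0)
)
print(hourly[hourly > 0].to_dict())
```

Reindexing to all 24 hours keeps the x-axis complete even when some hours have no comments.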
df_hour = df.groupby(df['Comment time'].dt.hour)['comment'].count()
hours_x_data = [str(i) for i in list(df_hour.index)]
hours_y_data = df_hour.values.tolist()
line1 = (
    # Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    Line(init_opts=opts.InitOpts(width='1000px', height='400px'))
    .add_xaxis(xaxis_data=hours_x_data)
    .add_yaxis(
        series_name="",
        y_axis=hours_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_series_opts(
        linestyle_opts=linestyle_dic,
        label_opts=opts.LabelOpts(font_size=12, color='red'),
        # Mark the highest and lowest hourly counts
        markpoint_opts=opts.MarkPointOpts(
            data=[
                opts.MarkPointItem(type_="max", itemstyle_opts=opts.ItemStyleOpts(
                    color="#06FFD7", border_width=3)),
                opts.MarkPointItem(type_="min", itemstyle_opts=opts.ItemStyleOpts(
                    color="#06FFD7", border_width=3)),
            ],
            symbol_size=[65, 50],
            label_opts=opts.LabelOpts(position="inside", color="red", font_size=10),
        ),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Comments per hour",
            pos_top="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#EB1934", font_family='STKaiti', font_size=20),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#EB1934"),
            axisline_opts=opts.AxisLineOpts(
                is_show=False, linestyle_opts=opts.LineStyleOpts(color="#EB1934")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=False,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#EB1934"),
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(is_show=False, margin=20, color="#EB1934"),
            axisline_opts=opts.AxisLineOpts(
                is_show=False, linestyle_opts=opts.LineStyleOpts(width=2, color="#EB1934")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=False,
                length=10,
                linestyle_opts=opts.LineStyleOpts(color="#EB1934"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=False, linestyle_opts=opts.LineStyleOpts(color="#EB1934")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
        # Background image placed behind the chart
        graphic_opts=[
            opts.GraphicImage(
                graphic_item=opts.GraphicItem(
                    id_="logo", z=-10, bounding="raw", origin=[50, 100]
                ),
                graphic_imagestyle_opts=opts.GraphicImageStyleOpts(
                    image="./12.jpg",
                    width=1000,
                    height=400,
                    opacity=0.3,
                ),
            )
        ],
    )
)
# line1.render_notebook()
# The background image shows up when the chart is rendered locally, but the
# platform only displays the line chart without it, so a saved screenshot is
# shown here instead. Copy the code and run it locally to see the background.
Image.open("./2.png")
Looking at the hourly distribution, people mostly comment from the afternoon into the evening, especially after 17:00; it seems everyone is fairly diligent during working hours. A second peak comes at 22:00, when young night owls are most active; this audience tends to keep late hours.
3.4 Comments per day of the week
These figures are the total number of comments on each day of the week, summed over August 1, 2021 to August 31, 2021:
# Add a 'week' column holding the day of the week
dic = {1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday', 7: 'Sunday'}
df['week'] = df['Comment time'].dt.dayofweek + 1  # dayofweek is 0 for Monday
df['week'] = df['week'].map(dic)
df.head(5)
index | id | user name | city | score | comment | Comment time | week |
---|---|---|---|---|---|---|---|
0 | 1142669584 | Qitong glutinous rice | guest | 5.0 | The plot is very attractive. Watching an animated cartoon surprised me | 2021-08-31 23:56:30 | Tuesday |
1 | 1142662178 | LnV14610189 | Xining | 5.0 | Strong picture sense! | 2021-08-31 23:36:00 | Tuesday |
2 | 1142666877 | Alo861902585 | Guangzhou | 5.0 | And a very good connection, wonderful | 2021-08-31 23:34:41 | Tuesday |
3 | 1142660216 | Y. | Xi'an | 4.0 | The characters in the picture don't say that you can always believe in chasing light. The plot is smooth and the overall rhythm is OK. It is recommended to watch -! | 2021-08-31 23:30:56 | Tuesday |
4 | 1142669423 | I want to see the moon for you | Fengtai | 5.0 | Yes, although Xiaoqing and Xiaobai's obsession is far fetched (OK). If you can elaborate on the obsession of Niu Mo, the plot will be more perfect | 2021-08-31 23:27:28 | Tuesday |
# Comments per day of the week
dic = {1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday', 7: 'Sunday'}
df['week'] = df['Comment time'].dt.dayofweek + 1  # back to numbers 1-7 so the groups sort Monday to Sunday
df1 = df.sort_values('week', ascending=True)
df_week = df1.groupby(['week'])['comment'].count()
week_x_data = [dic[i] for i in list(df_week.index)]
week_y_data = df_week.values.tolist()
line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    .add_xaxis(xaxis_data=week_x_data)
    .add_yaxis(
        series_name="",
        y_axis=week_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Comments per day of the week",
            pos_top="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render_notebook()
Looking at the distribution by day of the week, Mondays and Sundays are the most active days for comments. Interestingly, these are the first and last days of the week.
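One caveat when reading weekly totals for a single month: the weekdays do not all occur equally often. The sketch below, using only the standard library, counts how many times each weekday falls in August 2021 and shows how totals could be turned into per-day averages. The helper `per_day_average` and the numbers in `example_totals` are hypothetical and serve only as illustration.

```python
from collections import Counter
from datetime import date, timedelta

NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

start, end = date(2021, 8, 1), date(2021, 8, 31)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# How many times each weekday occurs in the month
occurrences = Counter(NAMES[d.weekday()] for d in days)

def per_day_average(totals):
    """Divide each weekday total by the number of times that weekday occurs."""
    return {name: totals[name] / occurrences[name] for name in totals}

# Hypothetical totals, only to show the effect of the normalisation.
example_totals = {"Sunday": 500, "Wednesday": 400}
print(dict(occurrences))
print(per_day_average(example_totals))
```

Sunday, Monday and Tuesday each occur five times in August 2021 and the other weekdays four times, so comparing averages rather than raw sums may give a fairer picture.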
3.5 Calendar chart
times = [x.strftime('%Y-%m-%d') for x in list(pd.date_range('20210801', '20210831'))]
# Pair each date with its daily comment count
data = [[t, n] for t, n in zip(times, day_y_data)]
Cal = (
    Calendar(init_opts=opts.InitOpts(width="800px", height="500px"))
    .add(
        series_name="Distribution of daily comments in August",
        yaxis_data=data,
        calendar_opts=opts.CalendarOpts(
            pos_top='20%',
            pos_left='5%',
            range_="2021-08",
            cell_size=40,
            # Day, month and year label styles
            daylabel_opts=opts.CalendarDayLabelOpts(
                name_map="cn",
                margin=20,
                label_font_size=14,
                label_color='#EB1934',
                label_font_weight='bold',
            ),
            monthlabel_opts=opts.CalendarMonthLabelOpts(
                name_map="cn",
                margin=20,
                label_font_size=14,
                label_color='#EB1934',
                label_font_weight='bold',
                is_show=False,
            ),
            yearlabel_opts=opts.CalendarYearLabelOpts(is_show=False),
        ),
        # Wrap the formatter in TooltipOpts instead of passing a bare string
        tooltip_opts=opts.TooltipOpts(formatter='{c}'),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(pos_top="2%", pos_left="center", title=""),
        visualmap_opts=opts.VisualMapOpts(
            orient="horizontal",
            max_=2000,
            pos_bottom='10%',
            is_piecewise=True,
            pieces=[
                {"min": 1200},
                {"min": 800, "max": 1200},
                {"min": 500, "max": 800},
                {"min": 300, "max": 500},
                {"min": 80, "max": 300},
                {"max": 80},
            ],
            range_color=["#F5F5F5", "#FFE4E1", "#FFCC99", "#F08080", "#CD5C5C", "#990000"],
        ),
        legend_opts=opts.LegendOpts(
            is_show=True,
            pos_top='5%',
            item_width=50,
            item_height=30,
            textstyle_opts=opts.TextStyleOpts(font_size=16, color='#EB1934'),
            # SVG path icon; the 'path://' prefix is needed only once
            legend_icon='path://M621.855287 587.643358C708.573965 540.110571 768 442.883654 768 330.666667 768 171.608659 648.609267 42.666667 501.333333 42.666667 354.057399 42.666667 234.666667 171.608659 234.666667 330.666667 234.666667 443.22333 294.453005 540.699038 381.59961 588.07363 125.9882 652.794383 21.333333 855.35859 21.333333 1002.666667L486.175439 1002.666667 1002.666667 1002.666667C1002.666667 815.459407 839.953126 634.458526 621.855287 587.643358Z',
        ),
    )
)
Cal.render_notebook()
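The calendar data above pairs the i-th date with the i-th daily count, which works as long as every day of August has at least one comment. Because a groupby drops days with no rows, a gap would shift later counts onto the wrong dates. A more defensive pairing might look like this sketch, where `counts_by_day` is a hypothetical mapping from day of month to count:

```python
from datetime import date, timedelta

# Hypothetical counts keyed by day of month; day 3 is deliberately missing.
counts_by_day = {1: 1500, 2: 900, 4: 300}

start = date(2021, 8, 1)

# One [date string, count] pair per calendar day, with 0 for missing days
calendar_data = [
    [(start + timedelta(days=i)).isoformat(), counts_by_day.get(i + 1, 0)]
    for i in range(31)
]
print(calendar_data[:4])
```

With the real data, `counts_by_day` could be built from the daily series, for example with `df_day.to_dict()`.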
3.6 Role popularity
Main characters: Xiaobai, Xiaoqing, Xu Xian, Fahai, Sima, Sister Sun, the Niutou Gang leader, the masked man, the Baoqing Workshop owner, and the scholar
roles = ['Xiaobai', 'Xiaoqing', 'Xu Xian', 'Fahai', 'Sima', 'Sister Sun', 'Niutou Gang leader', 'Masked man', 'Baoqing Workshop owner', 'Scholar']
# Join all comments into one string and count how often each role name appears
content = ''.join([str(i) for i in list(df['comment'])])
roles_num = []
for role in roles:
    count = content.count(role)
    roles_num.append((role, count))
roles_num = pd.DataFrame(roles_num)
roles_num.columns = ['name', 'Number of occurrences']
roles_num
index | name | Number of occurrences |
---|---|---|
0 | Xiaobai | 1523 |
1 | Xiaoqing | 2683 |
2 | Xu Xian | 239 |
3 | Fahai | 396 |
4 | Sima | 112 |
5 | Sister Sun | 20 |
6 | Niutou Gang leader | 1 |
7 | Masked man | 3 |
8 | Baoqing Workshop owner | 101 |
9 | Scholar | 4 |
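Counting on the joined text tallies every occurrence, so a comment that names a role twice contributes twice, and joining without a separator could in principle produce a match across the boundary between two comments. As a rough sketch of an alternative, the loop below keeps both the total mentions and the number of comments that mention each role. The comments and role list here are made up for illustration.

```python
from collections import Counter

# Hypothetical comments and role names for illustration only.
comments = [
    "Xiaoqing is great, Xiaoqing carries the film",
    "Fahai was scary",
    "I liked Xiaobai and Xiaoqing",
]
roles = ["Xiaobai", "Xiaoqing", "Fahai"]

total_mentions = Counter()       # every occurrence of the name
comments_mentioning = Counter()  # comments that contain the name at least once

for text in comments:
    for role in roles:
        n = text.count(role)
        total_mentions[role] += n
        if n:
            comments_mentioning[role] += 1

print(total_mentions)
print(comments_mentioning)
```

Which of the two measures is the better proxy for popularity is a judgment call; the per-comment count is less sensitive to a few very repetitive reviews.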
# Linear gradient
color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0, [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#ed1941'}], false)"""

# Sort roles from most to least mentioned
roles_num = roles_num.sort_values(by='Number of occurrences', ascending=False)
roles_num = roles_num.reset_index(drop=True)
b2 = (
    Bar()
    .add_xaxis(list(roles_num['name']))
    .add_yaxis('frequency', list(roles_num['Number of occurrences']), itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
    .set_global_opts(
        title_opts=opts.TitleOpts(title='Frequency distribution of film review roles', pos_top='2%', pos_left='center'),
        legend_opts=opts.LegendOpts(is_show=False),
        yaxis_opts=opts.AxisOpts(name="frequency", name_location='middle', name_gap=50, name_textstyle_opts=opts.TextStyleOpts(font_size=16)),
    )
)
b2.render_notebook()
3.7 Geographical distribution of viewers
cities = df['city'].values.tolist()
data = Counter(cities).most_common(80)  # the 80 most frequent cities with their counts
geo = (
    Geo(init_opts=opts.InitOpts(width="1000px", height="600px", bg_color="#404a59"))
    .add_schema(
        maptype="china",
        itemstyle_opts={
            'normal': {
                'shadowColor': 'rgba(0, 0, 0, .5)',
                'shadowBlur': 5,
                'shadowOffsetY': 0,
                'shadowOffsetX': 0,
                'borderColor': '#fff',
            }
        },
    )
    .add("Number of comments", data, type_=ChartType.HEATMAP)
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Geographical distribution",
            pos_top="2%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
        visualmap_opts=opts.VisualMapOpts(
            is_show=True,
            is_piecewise=True,
            min_=0,
            max_=500,
            split_number=5,
            series_index=0,
            pos_bottom='5%',
            pos_left='5%',
            textstyle_opts=opts.TextStyleOpts(color="#fff"),
            pieces=[
                {'max': 500, 'min': 401, 'label': '401-500', 'color': '#990000'},
                {'max': 400, 'min': 301, 'label': '301-400', 'color': '#CD5C5C'},
                {'max': 300, 'min': 201, 'label': '201-300', 'color': '#F08080'},
                {'max': 200, 'min': 101, 'label': '101-200', 'color': '#FFCC99'},
                {'max': 100, 'min': 0, 'label': '0-100', 'color': '#FFE4E1'},
            ],
        ),
    )
)
geo.render_notebook()
Judging from the geographical distribution map, the audience is mainly concentrated in Beijing, Tianjin, Shanghai, Chongqing, Sichuan, Guangdong, Yunnan and other places.
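The city ranking behind the map comes from `Counter.most_common`. The sketch below shows the idea on a made-up list and also drops values that are not locations before counting. Treating "guest" (which appears in the city column of the sample rows) as a placeholder is an assumption, and the set `not_cities` is hypothetical.

```python
from collections import Counter

# Hypothetical city column; "guest" is assumed to be a placeholder, not a place.
cities = ["Guangzhou", "guest", "Xining", "Guangzhou", "Xi'an", "Guangzhou", "Xining"]
not_cities = {"guest"}

# Remove placeholder values, then rank the remaining cities by frequency
real_cities = [c for c in cities if c not in not_cities]
top = Counter(real_cities).most_common(2)
print(top)
```

`most_common(n)` returns `(value, count)` pairs sorted by count, which is the list-of-pairs shape passed to the map series above.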