A real record of a freelance order: Maoyan movie data visualization, 1,000 yuan for three hours of work

Posted by fmpros on Thu, 03 Feb 2022 05:03:03 +0100

Last weekend I took a 1,200-yuan order. After the customer service's roughly 10% cut I netted 1,000 yuan, and the job was done in two hours. I was very happy. In fact, orders like this are rare: the technical difficulty is low but the price is high, what we commonly call a "fish-picking" order (easy money). Feeling rich, I invited the goddess to dinner, but was ruthlessly rejected!


Effect display

Tool preparation

Data source: Maoyan Movies (maoyan.com)
Development environment: Windows 10, Python 3.7
Development tools: pycharm, Chrome
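
All the third-party packages used below install straight from PyPI; assuming pip is on your PATH, one line covers them:

pip install requests fake-useragent lxml pandas numpy jieba wordcloud matplotlib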

Analysis of project ideas

First, collect the movie information from Maoyan, taking the Maoyan Top 100 board as an example.
For each film, grab:

  • Movie title
  • Rating
  • Detail-page link
  • Genre
  • Region
  • Place of release
  • Duration
  • Release time

Parse the data out of the pages

Analyze the detail-page links on the list page

The rating on the Maoyan detail page is encrypted, so we grab it directly from the list page
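
On the list page the rating is split across two <i> tags, an integer part and a fraction part, which simply get concatenated. A minimal standalone sketch of the idea (the <p class="score"> wrapper is my guess; the integer/fraction class names come from the crawler below):

from lxml import etree

# Toy fragment of the list-page markup
html = etree.HTML('<p class="score"><i class="integer">9.</i>'
                  '<i class="fraction">5</i></p>')
p1 = html.xpath('//i[@class="integer"]/text()')[0]   # '9.'
p2 = html.xpath('//i[@class="fraction"]/text()')[0]  # '5'
print(p1 + p2)  # 9.5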

Extract the remaining data from the detail page

Save the data as a CSV for the visualization step
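
Each film becomes one comma-separated row. A small sketch of the row format with dummy values (the crawler below uses a plain write(); the csv module shown here would additionally escape commas that appear inside fields):

import csv

# One row per film: title, rating, link, genre, region, duration, release time
row = ['Some Film', '9.5', 'https://maoyan.com/films/0', 'Drama',
       'Chinese Mainland', '120min', '1993-01-01']
with open('maoyan.csv', 'a', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(row)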

Tools needed for data visualization

import pandas as pd
import numpy as np
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# get_ipython().run_line_magic('matplotlib', 'inline')

Rendering display




Source code display:

Crawler:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: June 5, 2021
# @File    : demo4.py

import requests
from fake_useragent import UserAgent
from lxml import etree
import time

# Random request header
ua = UserAgent()

# Build the request headers; copy the Cookie from your own browser. If requests stop working, refresh the page in the browser, pass the verification code, and update the Cookie
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Cookie': '__mta=244176442.1622872454168.1622876903037.1622877097390.7; uuid_n_v=v1; uuid=6FFF6D30C5C211EB8D61CF53B1EFE83FE91D3C40EE5240DCBA0A422050B1E8C0; _csrf=bff9b813020b795594ff3b2ea3c1be6295b7453d19ecd72f8beb9700c679dfb4; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1622872443; _lxsdk_cuid=1770e9ed136c8-048c356e76a22b-7d677965-1fa400-1770e9ed136c8; _lxsdk=6FFF6D30C5C211EB8D61CF53B1EFE83FE91D3C40EE5240DCBA0A422050B1E8C0; ci=59; recentCis=59; __mta=51142166.1622872443578.1622872443578.1622876719906.2; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1622877097; _lxsdk_s=179dafd56bf-06d-403-d81%7C%7C12',
    'User-Agent': str(ua.random)
}


def RequestsTools(url):
    '''
    Crawler request helper
    :param url: request URL
    :return: lxml HTML object, ready for XPath extraction
    '''
    response = requests.get(url, headers=headers).content.decode('utf-8')
    html = etree.HTML(response)
    return html


def Index(page):
    '''
    List-page function (Top 100 board)
    :param page: offset parameter (0, 10, 20, ...)
    :return:
    '''
    url = 'https://maoyan.com/board/4?offset={}'.format(page)
    html = RequestsTools(url)
    # Detail page address suffix
    urls_text = html.xpath('//a[@class="image-link"]/@href')
    # score
    pingfen1 = html.xpath('//i[@class="integer"]/text()')
    pingfen2 = html.xpath('//i[@class="fraction"]/text()')

    for i, p1, p2 in zip(urls_text, pingfen1, pingfen2):
        pingfen = p1 + p2
        detail_url = 'https://maoyan.com' + i
        # Throttle: Maoyan blocks requests that arrive too fast
        time.sleep(2)
        Details(detail_url, pingfen)


def Details(url, pingfen):
    '''Detail-page function: extract the remaining fields and append one CSV row'''
    html = RequestsTools(url)
    dianyan = html.xpath('//h1[@class="name"]/text()')  # movie title
    leixing = html.xpath('//li[@class="ellipsis"]/a/text()')  # genre
    diqu = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[2]/text()')  # region / duration (split below)
    timedata = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[3]/text()')  # release time
    for d, l, b, t in zip(dianyan, leixing, diqu, timedata):
        countyr = b.replace('\n', '').split('/')[0] # region
        shichang = b.replace('\n', '').split('/')[1] # duration
        with open('maoyan.csv', 'a', encoding='utf-8') as f:  # same file the visualization reads
            f.write('{},{},{},{},{},{},{}\n'.format(d, pingfen, url, l, countyr, shichang, t))
        print(d, pingfen, url, l, countyr, shichang, t )


for page in range(0, 101, 10):  # offsets 0, 10, ..., 100: ten films per page
    Index(page)
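
Maoyan rate-limits aggressively (hence the time.sleep(2) above). If you want the crawl to survive occasional network errors, a retry wrapper is an easy addition; this is only a sketch, and fetch_with_retry is my own name, not part of the delivered code:

import random

def fetch_with_retry(url, tries=3):
    # Retry a failed request with a randomized back-off between attempts
    for attempt in range(tries):
        try:
            return RequestsTools(url)
        except Exception:
            time.sleep(random.uniform(2, 5))
    raise RuntimeError('giving up on ' + url)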

Visualization:

#!/usr/bin/env python
# coding: utf-8

# Load the usual data-analysis libraries
import pandas as pd
import numpy as np
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# get_ipython().run_line_magic('matplotlib', 'inline')


# In[3]:


path = './maoyan.csv'
# The notebook assumes the CSV has a header row plus a leading index column;
# pass names=[...] to read_csv if your file has no header
df = pd.read_csv(path, sep=',', encoding='utf-8', index_col=False)
df.drop(df.columns[0], axis=1, inplace=True)  # drop the leading index column
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
df.head(10)





#View the structure of the data
df.info()
print(df.columns)



# In[11]:


#Year & number of releases. Counts for 2018 and later cover only titles announced on Maoyan so far and are incomplete, so exclude those years first
fig,ax=plt.subplots(figsize=(9,6),dpi=70)
df[df[u'Release time']<2018][u'Release time'].value_counts().sort_index().plot(kind='line',ax=ax)
ax.set_xlabel(u'Time (year)')
ax.set_ylabel(u'Number of releases')
ax.set_title(u'Release time&Number of films released')




#Based on the figure above, plot release time against both the number of releases and the average score
#Data before 1980 is sparse and its scores unreliable, so the analysis focuses on 1980-2017
x=df[df[u'Release time']<2018][u'Release time'].value_counts().sort_index().index
y=df[df[u'Release time']<2018][u'Release time'].value_counts().sort_index().values
y2=df[df[u'Release time']<2018].sort_values(by=u'Release time').groupby(u'Release time').mean()[u'score'].values

fig,ax=plt.subplots(figsize=(10,5),dpi=70)
ax.plot(x,y,label=u'Number of releases')
ax.set_xlim(1980,2017)
ax.set_xlabel(u'Release time')
ax.set_ylabel(u'Number of releases')
ax.set_title(u'time&Number of releases&Average score')
ax2=ax.twinx()
ax2.plot(x,y2,c='y',ls='--',label=u'score')
ax.legend(loc=1)
ax2.legend(loc=2)



# Fix Chinese labels rendering as boxes and the minus sign disappearing (ideally set these before any plotting, right after the imports)
plt.rcParams['font.sans-serif'] =['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False


# In[12]:


#World & release time & average score
fig,ax=plt.subplots(figsize=(10,7),dpi=60)
df[df[u'score']>0].groupby(u'Release time').mean()[u'score'].plot(kind='line',ax=ax)
ax.set_ylabel(u'score')
ax.set_title(u'world&Release time&Mean score')


# In[13]:


#Number of films of each genre worldwide
#Split each genre string into individual genres, then count them
types=[]
for tp in df[u'type']:
    ls=tp.split(',')
    for x in ls:
        types.append(x)

tp_df=pd.DataFrame({u'type':types})
fig,ax=plt.subplots(figsize=(9,6),dpi=60)
tp_df[u'type'].value_counts().plot(kind='bar',ax=ax)
ax.set_xlabel(u'type')
ax.set_ylabel(u'quantity')
ax.set_title(u'world&type&number')


# In[14]:


#Distribution of film duration vs. score
#One catch: some films have never been rated, so filter them out here
x=df[df[u'score']>0].sort_values(by=u'duration(min)')[u'duration(min)'].values
y=df[df[u'score']>0].sort_values(by=u'duration(min)')[u'score'].values
fig,ax=plt.subplots(figsize=(9,6),dpi=70)
ax.scatter(x,y,alpha=0.6,marker='o')
ax.set_xlabel(u'duration(min)')
ax.set_ylabel(u'score')
ax.set_title(u'Film duration&Score distribution')
# Scatter of score against duration



# Pick out films whose region includes Chinese Mainland, copying the eight
# columns row by row (a one-line alternative follows this block)
i=0
c0=[]
c1=[]
c2=[]
c3=[]
c4=[]
c5=[]
c6=[]
c7=[]

for x in df[u'region']:
    if u'Chinese Mainland' in x:
        c0.append(df.iat[i, 0])
        c1.append(df.iat[i, 1])
        c2.append(df.iat[i, 2])
        c3.append(df.iat[i, 3])
        c4.append(df.iat[i, 4])
        c5.append(df.iat[i, 5])
        c6.append(df.iat[i, 6])
        c7.append(df.iat[i, 7])
    i=i+1

china_df=pd.DataFrame({u'film':c0, u'score':c1,u'link':c2, u'type':c3,u'region':c4, u'Place of release':c5,u'duration(min)':c6,u'Release time':c7})
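
# A more concise equivalent of the loop above (a sketch; unlike the loop it
# keeps df's original column names instead of renaming them):
# china_df = df[df[u'region'].str.contains(u'Chinese Mainland')].reset_index(drop=True)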



# In[16]:


#China vs. world average score, compared over 1980-2017
x1 = df[df[u'score']>0].groupby(u'Release time').mean()[u'score'].index
y1 = df[df[u'score']>0].groupby(u'Release time').mean()[u'score'].values
    
x2 = china_df[china_df[u'score']>0].groupby(u'Release time').mean()[u'score'].index
y2 = china_df[china_df[u'score']>0].groupby(u'Release time').mean()[u'score'].values
fig,ax=plt.subplots(figsize=(12,9),dpi=60)
ax.plot(x1,y1,ls='-',c='DarkTurquoise',label=u'world')
ax.plot(x2,y2,ls='--',c='Gold',label=u'China')
ax.set_title(u'China&World average score')
ax.set_xlabel(u'time')
ax.set_xlim(1980,2017)
ax.set_ylabel(u'score')
ax.legend()


# In[17]:


#Genre & number of releases: China vs. the world
#Genres come as comma-separated strings, so first write a helper function to split them



# In[18]:


#The splitting helper takes a Series of genre strings and returns a DataFrame of per-genre counts

def Cutting_type(typeS):
    # Split each comma-separated genre string and count every genre
    types = []
    for x in typeS:
        for i in x.split(','):
            types.append(i)

    df = pd.DataFrame({u'type': types})
    return pd.DataFrame(df[u'type'].value_counts().sort_values(ascending=False))


# In[19]:


#Comparison of film types between China and the world
df1=Cutting_type(china_df[u'type'])
df2=Cutting_type(df[u'type'])
trans=pd.concat([df1,df2],axis=1)
trans.dropna(inplace=True)
trans.columns=[u'China',u'world']
fig,ax=plt.subplots(figsize=(15,9),dpi=80)
trans.plot(kind='bar',ax=ax) 
fig.autofmt_xdate(rotation=30)
ax.set_title(u'China&World type comparison chart')
ax.set_xlabel(u'type')
ax.set_ylabel(u'Number of films')


# In[20]:


#Next, the scatter distribution: China & world, duration vs. score
y = df[df[u'score'] > 0].sort_values(by=u'duration(min)')[u'score'].values
x = df[df[u'score'] > 0].sort_values(by=u'duration(min)')[u'duration(min)'].values
y2 = china_df[china_df[u'score'] > 0].sort_values(by=u'duration(min)')[u'score'].values
x2 = china_df[china_df[u'score'] > 0].sort_values(by=u'duration(min)')[u'duration(min)'].values

fig, ax = plt.subplots(figsize=(10,7), dpi=80)
ax.scatter(x, y, c='DeepSkyBlue', alpha=0.6, label=u'world')
ax.scatter(x2, y2, c='Salmon', alpha=0.7, label=u'China')
ax.set_title(u'China&Distribution of world scores')
ax.set_xlabel(u'duration(min)')
ax.set_ylabel(u'score')
ax.legend(loc=4)


# In[25]:


dfs=df[(df[u'Release time']>1980)&(df[u'Release time']<2019)]




# for x in range(0, len(dfs)):
#     print(dfs.iat[x, 0], dfs.iat[x, -1])

top_films = dfs['film'][:15]  # top 15 film titles feed the word cloud

wl = ",".join(top_films.values)
# Optionally write the joined text to a file
# fenciTxt  = open("fenciHou.txt","w+")
# fenciTxt.writelines(wl)
# fenciTxt.close()

# Configure the word cloud
wc = WordCloud(background_color="white",  # background color
               # mask=imread('shen.jpg'),  # optional shape mask
               # max_words=2000,  # cap on the number of words shown
               font_path="C:\\Windows\\Fonts\\simkai.ttf",  # KaiTi font
               # A Chinese font is required: wordcloud's bundled default,
               # DroidSansMono.ttf, cannot render Chinese characters
               max_font_size=60,  # largest font size
               random_state=30,  # seed for the color scheme
               )
myword = wc.generate(wl)  #Generate word cloud
wc.to_file('result.jpg')

# Show word cloud
plt.imshow(myword)
plt.axis("off")
plt.show()
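
An aside: jieba is imported at the top but never actually called, because the cloud is built from whole film titles joined by commas. If you wanted a word-level cloud from Chinese text, the segmentation step would look roughly like this (the example string is arbitrary):

import jieba

text = '这是一段用来演示分词的中文'  # any Chinese string
words = " ".join(jieba.cut(text))  # space-separated tokens; splits depend on jieba's dictionary
wc2 = WordCloud(font_path="C:\\Windows\\Fonts\\simkai.ttf").generate(words)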



Summary

The full source is above, so I won't walk through it line by line. If it helped you, please give it a like, favorite, and comment. Thank you very much. Finally, let me show you the order-receiving process.

PS: always take orders through a third-party platform!!!
PS: always take orders through a third-party platform!!!
PS: always take orders through a third-party platform!!!

If you want to take orders (this is for people who already have the skills, not complete beginners), or just want to exercise your technique, message me privately in the background and note which technologies you know and which kinds of orders you want to take, and I will add you to the group. I have also sorted out some materials for beginners, free for the taking:

① 3,000+ Python e-books
② Python development environment installation tutorials
③ 400 Python self-study videos
④ Common software-development vocabulary
⑤ Python learning roadmap
⑥ Shared project source-code case studies
Everything is free to grab in my QQ technology exchange group (pure technical exchange and resource sharing; no advertising allowed). The group number is 949222410.

Topics: Python, crawler, data analysis, data visualization