Taobao user behavior analysis project

Posted by bookbuyer2000 on Mon, 31 Jan 2022 00:42:19 +0100

Note: the project was built in a Jupyter Notebook. Third-party libraries such as pandas, numpy, matplotlib, and seaborn are used to analyze the data along several dimensions.

1, Project background

Using behavior logs from the Taobao mobile app as the dataset, this report analyzes Taobao user behavior through standard industry metrics in order to uncover users' behavior patterns. The metrics covered are daily PV and daily UV, payment rate, repurchase behavior, funnel loss, and RFM user-value analysis.

2, Data source

Link: https://pan.baidu.com/s/1468_ZeV0N_J1_FSyUgEMeQ
Extraction code: 3c5c

3, Problems to be solved

1. What is the daily PV?

2. What is the daily UV?

3. What is the payment rate?

4. What is the repurchase rate?

5. Where does the funnel lose users?

6. How is user value distributed?

4, Data description

This dataset contains about 12.26 million records of user behavior on the Taobao mobile app from November 18, 2014 to December 18, 2014, with 6 columns:

user_id: user identity (anonymized)

item_id: item ID (anonymized)

behavior_type: user behavior type (1 = click, 2 = favorite, 3 = add to cart, 4 = pay)

user_geohash: geographic location (geohash-encoded)

item_category: category ID (the category the item belongs to)

time: the time at which the user behavior occurred
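Since the numeric behavior codes recur throughout the analysis, a small lookup table can make later output easier to read. A convenience sketch; the name behavior_names is illustrative and not used by the original code:

# Illustrative mapping of behavior_type codes to readable names
behavior_names = {1: 'click', 2: 'favorite', 3: 'add to cart', 4: 'pay'}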

5, Data cleaning

# Import the necessary libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Import the data (note: use the absolute path of the data file on your machine)
data_user=pd.read_csv('E:/PythonData/tianchi_mobile_recommend_train_user.csv')
# View missing values
missingTotal = data_user.isnull().sum()
missingTotal
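In this dataset, only user_geohash is expected to contain missing values (an assumption worth verifying against the output above); since geography is not analyzed here, that column is left untouched. The per-column missing share can be checked directly:

# Fraction of missing values in each column (sanity check)
data_user.isnull().mean()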

# View the first 5 rows of data
data_user.head()

# Split the time field into date and hour
# The date is the part before the space
data_user['date'] = data_user['time'].map(lambda s: s.split(' ')[0])
# The hour is the part after the space
data_user['hour'] = data_user['time'].map(lambda s: s.split(' ')[1])
# View data
data_user.head()

# View data type
data_user.dtypes

# Convert the date and time to the corresponding format
data_user['date']=pd.to_datetime(data_user['date'])
data_user['time']=pd.to_datetime(data_user['time'])

#Convert hour to integer
data_user['hour']=data_user['hour'].astype('int64')
data_user.dtypes

# Sort the data in ascending order through the time field
data_user = data_user.sort_values(by='time',ascending=True)

# After sorting, reset the index to get a new data set
data_user = data_user.reset_index(drop=True)
data_user

# View statistics of data
data_user.describe()

6, User behavior analysis

1. PV and UV analysis

PV (Page View): the number of page views or clicks; every page load or refresh counts once. Here each behavior record is counted as one PV.

UV (Unique Visitor): the number of distinct visitors; each client (here, each distinct user_id) counts once.
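Before grouping by day, the two definitions can be checked against the whole dataset in one line each (a quick sanity-check sketch):

# Overall PV: every behavior record counts as one operation
total_pv = len(data_user)
# Overall UV: number of distinct users
total_uv = data_user['user_id'].nunique()
print(total_pv, total_uv)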

1. Daily traffic analysis
# pv_daily: the number of user operations per day; uv_daily: the number of distinct users per day

# Daily page views (group by date, count user_id rows, reset the index, and rename user_id to pv)
pv_daily=data_user.groupby('date')['user_id'].count().reset_index().rename(columns={'user_id':'pv'})

# Daily unique visitors (same as above, but count deduplicated user_id values)
uv_daily=data_user.groupby('date')['user_id'].apply(lambda x:x.drop_duplicates().count()).reset_index().rename(columns={'user_id':'uv'})

# Use a font that supports Chinese characters in plots
plt.rcParams['font.sans-serif'] = ['SimHei']

# Create drawing
plt.figure(figsize=(20,8),dpi=80)
plt.plot(pv_daily.date,pv_daily.pv)

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('date',size=15)
plt.ylabel('Visits',size=15)

#Add title
plt.title('pv_daily(Daily visits)',size=15)

# Create drawing
plt.figure(figsize=(20,8),dpi=80)
plt.plot(uv_daily.date,uv_daily.uv)

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('date',size=15)
plt.ylabel('Number of visitors',size=15)

#Add title
plt.title('uv_daily',size=15)


As the figures above show, both visits and visitors peak around December 12, the Double 12 shopping festival.

2. Hourly traffic analysis
# pv_hour: the number of user operations per hour; uv_hour: the number of distinct users per hour

# Hourly page views (group by hour, count user_id rows, reset the index, and rename user_id to pv)
pv_hour=data_user.groupby('hour')['user_id'].count().reset_index().rename(columns={'user_id':'pv'})

# Hourly unique visitors (same as above, but count deduplicated user_id values)
uv_hour=data_user.groupby('hour')['user_id'].apply(lambda x:x.drop_duplicates().count()).reset_index().rename(columns={'user_id':'uv'})

# Create drawing
plt.figure(figsize=(20,8),dpi=80)
plt.plot(pv_hour.hour,pv_hour.pv)

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('hour',size=15)
plt.ylabel('Visits per hour',size=15)

#Add title
plt.title('pv_hour (visits per hour)',size=15)


As the figure shows, visits and visitors peak in the evening, around 20:00 (8 p.m.).
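The peak hour can also be read off programmatically rather than from the figure (a small sketch using the frames computed above):

# Hours with the highest PV and UV
print('peak pv hour:', pv_hour.loc[pv_hour['pv'].idxmax(), 'hour'])
print('peak uv hour:', uv_hour.loc[uv_hour['uv'].idxmax(), 'hour'])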

3. PV analysis by behavior type

# Count operations per (behavior_type, hour) pair
pv_detail=data_user.groupby(['behavior_type','hour'])['user_id'].count().reset_index().rename(columns={'user_id':'total_pv'})

# View the first 10 rows of data
pv_detail.head(10)

# Create drawing
plt.figure(figsize=(20,8),dpi=80)

# Plot hourly PV for each behavior type
plt.plot(pv_detail[pv_detail.behavior_type==1].hour,pv_detail[pv_detail.behavior_type==1].total_pv,label='Click',color='r')
plt.plot(pv_detail[pv_detail.behavior_type==2].hour,pv_detail[pv_detail.behavior_type==2].total_pv,label='Favorite',color='b')
plt.plot(pv_detail[pv_detail.behavior_type==3].hour,pv_detail[pv_detail.behavior_type==3].total_pv,label='Add to cart',color='y')
plt.plot(pv_detail[pv_detail.behavior_type==4].hour,pv_detail[pv_detail.behavior_type==4].total_pv,label='Pay',color='g')

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('time/hour',fontsize=12)
plt.ylabel('quantity',fontsize=12)

# Add legend
plt.legend(loc='best',fontsize=15)

# Add title
plt.title('Different user behaviors pv analysis',fontsize=15)

3.1 Comparing PV for favorites, add-to-cart, and payment only

# Create drawing
plt.figure(figsize=(20,8),dpi=80)

# Plot hourly PV for favorites, add-to-cart and payment
plt.plot(pv_detail[pv_detail.behavior_type==2].hour,pv_detail[pv_detail.behavior_type==2].total_pv,label='Favorite',color='b')
plt.plot(pv_detail[pv_detail.behavior_type==3].hour,pv_detail[pv_detail.behavior_type==3].total_pv,label='Add to cart',color='y')
plt.plot(pv_detail[pv_detail.behavior_type==4].hour,pv_detail[pv_detail.behavior_type==4].total_pv,label='Pay',color='g')

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('time/hour',fontsize=15)
plt.ylabel('quantity',fontsize=15)

# Add legend
plt.legend(loc='best',fontsize=20)

# Add title
plt.title('Different types of user behavior pv Analysis 2',size=15)

As the charts above show:

Clicks generate far more PV than the other three behavior types, while all four behaviors follow essentially the same hourly pattern: whatever the behavior, PV peaks in the evening.

Figure 2 also shows that the total PV of add-to-cart exceeds that of favorites, so in the funnel loss analysis below, behavior type 3 (add to cart) is analyzed before type 2 (favorite).

7, Consumer behavior analysis

1. Analysis of user purchase times
# Filter to payment records (type 4), group by user ID, and count payments per user
data_user_buy=data_user[data_user.behavior_type==4].groupby('user_id')['behavior_type'].count()

# The result data_user_buy is a Series indexed by user_id whose values are each user's number of payments
data_user_buy.head(10)

# Draw frequency histogram to check the distribution of purchase times and corresponding number of people
# Create drawing
plt.figure(figsize=(20,8),dpi=80)

# Draw frequency histogram
plt.hist(data_user_buy,bins=50,color = 'steelblue',edgecolor = 'k')

# Adjust scale
plt.xticks(range(100)[::10],size=12)

plt.xlim(0,100)

# Add axis labels (x: number of purchases, y: number of users)
plt.xlabel('Number of purchases',size=15)
plt.ylabel('Number of people',size=15)

# Add title
plt.title('daily_user_buy',size=15)

# Create the figure
plt.figure(figsize=(20,8),dpi=80)

# Draw the histogram again, normalized to frequencies (density=True)
plt.hist(data_user_buy,bins=50,color = 'steelblue',edgecolor = 'k',density = True)

# Adjust scale
plt.xticks(range(100)[::10],size=12)

# Add axis labels
plt.xlabel('Number of purchases',size=15)
plt.ylabel('frequency',size=15)

# Add title
plt.title('daily_user_buy',size=15)

plt.xlim(0,100)
plt.show()

2. Daily ARPPU

ARPPU (Average Revenue Per Paying User): the revenue obtained per paying user; it reflects the average amount paid by each paying user.

ARPPU = total revenue / number of paying users

Since this dataset contains no monetary amounts, purchase counts are used in place of revenue:

Purchases per paying user = total purchases / number of paying users

# Payments per user per day; reset_index turns the grouped result into a DataFrame with a 0,1,2,... index
data_use_buy1=data_user[data_user.behavior_type==4].groupby(['date','user_id'])['behavior_type'].count().reset_index().rename(columns={'behavior_type':'total'})

# View the first 10 rows of data
data_use_buy1.head(10)

# Daily purchases per paying user: total purchases that day / number of paying users that day
data_use_buy1 = data_use_buy1.groupby('date').apply(lambda x:x.total.sum()/x.total.count())

# View data
data_use_buy1.head()

plt.figure(figsize=(20,8),dpi=80)
plt.plot(data_use_buy1)

#Adjust scale
plt.xticks(size=13)
plt.yticks(size=13)

# Add axis labels
plt.xlabel('date',size=20)
plt.ylabel('Purchases per paying user',size=20)

# Add title
plt.title('Daily purchases per paying user',size=20)


The chart shows that daily purchases per paying user fluctuate between 1 and 2, peaking around Double 12.

3. Daily ARPU

ARPU (Average Revenue Per User): average revenue per user, calculated as total revenue / active users; it measures a product's profitability and vitality.

Average purchases per active user = total purchases / number of active users (a user counts as active on a day if they performed any operation that day)

# Mark every record as one operation
data_user['operation']=1
# Count operations per (date, user, behavior type)
data_use_buy2=data_user.groupby(['date','user_id','behavior_type'])['operation'].count().reset_index().rename(columns={'operation':'total'})

# Daily ARPU substitute: purchases that day / number of active users that day
data_use_buy2 = data_use_buy2.groupby('date').apply(lambda x:x[x.behavior_type==4].total.sum()/len(x.user_id.unique()))

# View the first 5 rows of data
data_use_buy2.head()

# Create drawing
plt.figure(figsize=(20,8),dpi=80)
# Draw a line chart
plt.plot(data_use_buy2)

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('date',size=15)
plt.ylabel('Average purchases per active user',size=15)

# Add title
plt.title('Average purchases per active user',size=15)

# display graphics
plt.show()

4. Daily payment rate

Payment rate = number of consumers / number of active users

# Recompute operations per (date, user, behavior type)
data_use_buy2=data_user.groupby(['date','user_id','behavior_type'])['operation'].count().reset_index().rename(columns={'operation':'total'})
# Daily payment rate: number of users who paid that day / number of active users that day
daily_afford_rate = data_use_buy2.groupby('date').apply(lambda x:x[x.behavior_type==4].total.count()/len(x.user_id.unique()))

# View the first 5 rows of data
daily_afford_rate.head()

# Create drawing
plt.figure(figsize=(20,8),dpi=80)
# Draw a line chart
plt.plot(daily_afford_rate)

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('date',size=15)
plt.ylabel('Payment rate',size=15)

# Add title
plt.title('Daily payment rate',size=15)

# display graphics
plt.show()

5. Distribution of purchase counts per user within the same hour

# Purchases per user per (date, hour)
data_user_buy3=data_user[data_user.behavior_type==4].groupby(['user_id','date','hour'])['operation'].sum().rename('buy_count')

plt.figure(figsize=(20,8),dpi=80)
plt.hist(data_user_buy3,bins=50,density = True)

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('Number of purchases',size=15)
plt.ylabel('frequency',size=15)

# Add title
plt.title('Distribution of purchase counts per user within the same hour')

print('Most common number of purchases per user: {}'.format(data_user_buy3.mode()[0]))

Most common number of purchases per user: 1

8, Analysis of repurchase

1. Repurchase rate

Repurchase: purchasing on two or more distinct days; multiple purchases within the same day count as one.
Repurchase rate = number of users who repurchased / total number of users who purchased

# Number of distinct purchase days per user
date_rebuy=data_user[data_user.behavior_type==4].groupby('user_id')['date'].apply(lambda x:len(x.unique())).rename('rebuy_count')

# Repurchase screening: users with purchases on 2 or more days; round to 4 decimal places
print('Repurchase rate:',round(date_rebuy[date_rebuy>=2].count()/date_rebuy.count(),4))

Repurchase rate: 0.8717

2. Distribution of repurchase time intervals
# Purchases per user per day
data_day_buy=data_user[data_user.behavior_type==4].groupby(['user_id','date']).operation.count().reset_index()
# For each user, sort the purchase dates and take day-to-day differences (the repurchase intervals)
data_user_buy4=data_day_buy.groupby('user_id').date.apply(lambda x:x.sort_values().diff(1).dropna())
# Convert the Timedelta values to whole days
data_user_buy4=data_user_buy4.map(lambda x:x.days)
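To make the diff(1) chain above concrete, here is a toy example of the per-user interval logic (the dates are hypothetical):

# Toy example: one user's purchase dates -> gaps between consecutive purchase days
s = pd.Series(pd.to_datetime(['2014-12-01', '2014-12-03', '2014-12-10']))
print(s.sort_values().diff(1).dropna().dt.days.tolist())  # [2, 7]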

# Create drawing
plt.figure(figsize=(20,8),dpi=80)

# Draw bar chart
data_user_buy4.value_counts().plot(kind='bar')

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('gap_day',size=15)
plt.ylabel('gap_count',size=15)

# Add title
plt.title('time_gap',size=15)


The repurchase rate is 0.8717. Purchase counts fall steadily as the repurchase interval grows: most repurchases happen within 1-10 days, and repurchases after 10 days are rare. Attention should therefore focus on encouraging Taobao users to repurchase within a 10-day window.

3. Distribution of each user's average repurchase interval

# Create drawing
plt.figure(figsize=(20,8),dpi=80)

# Draw the average repurchase time distribution histogram
sns.distplot(data_user_buy4.reset_index().groupby('user_id').date.mean())

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('gap_day',size=15)
plt.ylabel('density',size=15)

# Add title
plt.title('time_gap',size=15)


As the figure shows, users' average repurchase intervals are roughly normally distributed, with a tail that trends gradually downward; for most Taobao users the average repurchase interval falls within 1-5 days.

9, Funnel loss analysis

Funnel analysis is a process-oriented analysis method that reflects user behavior at each stage and the conversion rate from the entry point to the end point.

1. Calculate the loss rate of each stage

Note: as shown in the PV analysis by behavior type above, the total PV of add-to-cart exceeds that of favorites, so in the funnel analysis behavior type 3 (add to cart) is placed before type 2 (favorite).

# Count records of each behavior type
# (1: click, 2: favorite, 3: add to cart, 4: pay)
pv_all=data_user['user_id'].count()
pv_1 = data_user[data_user.behavior_type==1]['user_id'].count()
pv_2 = data_user[data_user.behavior_type==2]['user_id'].count()
pv_3 = data_user[data_user.behavior_type==3]['user_id'].count()
pv_4 = data_user[data_user.behavior_type==4]['user_id'].count()

print(pv_all,pv_1,pv_2,pv_3,pv_4)

12256906 11550581 242556 343564 120205

# Loss rate at each funnel stage
Wastage_rate1 = round(1-pv_1/pv_all,4)
Wastage_rate2 = round(1-pv_3/pv_1,4)
Wastage_rate3 = round(1-pv_2/pv_3,4)
Wastage_rate4 = round(1-pv_4/pv_2,4)

# Print each loss rate
print('Loss rate from total views to clicks: %s%%'%(Wastage_rate1*100))
print('Loss rate from clicks to add-to-cart: %s%%'%(Wastage_rate2*100))
print('Loss rate from add-to-cart to favorites: %s%%'%(Wastage_rate3*100))
print('Loss rate from favorites to purchases: %s%%'%(Wastage_rate4*100))
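The complements of these loss rates give the step-to-step conversion funnel directly (a sketch reusing the counts above):

# Step-to-step conversion rates along the funnel click -> add to cart -> favorite -> pay
print('click -> add to cart:', round(pv_3/pv_1,4))
print('add to cart -> favorite:', round(pv_2/pv_3,4))
print('favorite -> pay:', round(pv_4/pv_2,4))
# End-to-end: share of clicks that become payments
print('click -> pay:', round(pv_4/pv_1,4))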

10, Analysis on the relationship between user behavior and commodity types

1. Conversion rate by commodity category

# Per-category counts of clicks, add-to-cart and purchases (favorites are excluded here)
data_category=data_user[data_user.behavior_type!=2].groupby(['item_category','behavior_type']).operation.count().unstack(1).rename(columns={1:'Hits',3:'Number of shopping carts added',4:'Purchase volume'}).fillna(0)

# View the first 10 rows of data
data_category.head(10)

# Conversion rate per category: purchases / clicks
data_category['Conversion rate']=data_category['Purchase volume']/data_category['Hits']
data_category.head(10)

# Outlier handling: fill NaN with 0 and drop categories whose conversion rate exceeds 1
data_category=data_category.fillna(0)
data_category=data_category[data_category['Conversion rate']<=1]

# Create drawing
plt.figure(figsize=(20,8),dpi=80)

# Plot the conversion-rate distribution
sns.distplot(data_category['Conversion rate'])

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('Conversion rate',fontsize=15)
plt.ylabel('density',fontsize=15)

# Add title
plt.title('Conversion distribution',size=15)


The chart shows that the conversion rate is below 0.1 for almost all categories; most users who click do not buy. The causes of this deserve focused analysis and improvement.

2. Interest ratio

data_category['Interest ratio']=data_category['Number of shopping carts added']/data_category['Hits']
data_category.head(10)

# Outlier handling: drop categories whose interest ratio exceeds 1
data_category=data_category[data_category['Interest ratio']<=1]

# Create drawing
plt.figure(figsize=(20,8),dpi=80)
sns.distplot(data_category['Interest ratio'])

# Adjust scale
plt.xticks(size=12)
plt.yticks(size=12)

# Add axis labels
plt.xlabel('Interest ratio',fontsize=15)
plt.ylabel('density',fontsize=15)

# Add title
plt.title('Distribution of interest ratio',size=15)


The chart shows that the interest ratio behaves much like the conversion rate, mostly below 0.1: users are not interested in the vast majority of the items they click, so the recommendation system deserves attention and tuning.

11, 80/20 rule analysis

The 80/20 rule (Pareto principle): in any group of things, the most important part is a small minority, about 20%, while the remaining 80%, though the majority, is secondary.

# 80/20 rule vs. the long tail
data_category=data_category[data_category['Purchase volume']>0]
# 80% of total purchases, and the total itself
value_8=data_category['Purchase volume'].sum()*0.8
value_10=data_category['Purchase volume'].sum()
# Sort categories by purchase volume and accumulate
data_category=data_category.sort_values(by='Purchase volume',ascending=False)
data_category['Cumulative purchase']=data_category['Purchase volume'].cumsum()
# Label the categories that contribute the first 80% of purchases vs. the last 20%
data_category['classification']=data_category['Cumulative purchase'].map(lambda x:'Top 80%' if x<=value_8 else 'Last 20%')
data_category.head()

# Share of categories in each classification
data_category.groupby('classification')['classification'].count()/data_category['classification'].count()


The result shows that the top 80% of sales are contributed by only about 15% of commodity categories, which is close to the 80/20 rule. Conversely, the remaining ~20% of sales are spread across ~85% of categories.

For traditional retail, high costs mean only the top ~20% of goods are profitable.

For e-commerce, shelf-space cost drops to nearly zero, so the long tail of remaining goods can also sell. Optimizing and recommending long-tail goods can therefore bring greater returns.

12, Analysis of user value RFM model

Meaning of RFM:

R (Recency): the time since the customer's last transaction. The larger R is, the longer ago the last purchase; the smaller, the more recent.

F (Frequency): the customer's number of transactions in the recent period. The larger F is, the more frequently the customer buys; the smaller, the less active.

M (Monetary): the customer's transaction amount in the recent period. The larger M is, the higher the customer's value; the smaller, the lower.

RFM analysis is a method to segment customer value according to the contribution of customer activity and transaction amount.

from datetime import datetime
# Reference date: 2014-12-20, two days after the data window ends
datenow=datetime(2014,12,20)

# Days since each user's most recent purchase
recent_buy_time=data_user[data_user.behavior_type==4].groupby('user_id').date.apply(lambda x:datenow-x.sort_values().iloc[-1]).reset_index().rename(columns={'date':'recent'})
recent_buy_time.recent=recent_buy_time.recent.map(lambda x:x.days)

# Purchase count per user (frequency)
buy_freq=data_user[data_user.behavior_type==4].groupby('user_id').date.count().reset_index().rename(columns={'date':'freq'})
# Merge recency and frequency into one table
rfm=pd.merge(recent_buy_time,buy_freq,left_on='user_id',right_on='user_id',how='outer')

# Score each dimension into two tiers at the median; the higher the score, the better
# (labels are reversed for recency: a smaller recency means a more recent purchase, hence score '2')
rfm['recent_value']=pd.qcut(rfm.recent,2,labels=['2','1'])
rfm['freq_value']=pd.qcut(rfm.freq,2,labels=['1','2'])
# Concatenate the two scores into an RF code such as '22'
rfm['rfm']=rfm['recent_value'].str.cat(rfm['freq_value'])
rfm.head(10)
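A toy illustration of why the recency labels are reversed (the values are hypothetical): pd.qcut splits at the median, and the lower half of recency values, i.e. the most recent buyers, should receive the higher score.

# Smaller recency values fall in the first bin and get the label '2'
print(pd.qcut(pd.Series([1, 2, 30, 40]), 2, labels=['2', '1']).tolist())  # ['2', '2', '1', '1']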

The table shows: since this dataset provides no monetary amounts, only R and F are used for the value analysis. By the RF scores, users scored '22' are the key users to focus on; users scored '21' (high loyalty but low purchase frequency) can be offered discounts or bundled sales to raise their purchase frequency; users scored '12' (strong purchasing power but low loyalty) call for studying their shopping habits and applying precision marketing.
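To make the RF codes easier to read, the four combinations can be mapped to named segments. A sketch; the segment names paraphrase the discussion above and are not part of the original analysis:

# Illustrative segment names for the four RF score combinations
segment_names = {
    '22': 'key users: recent and frequent',
    '21': 'loyal but infrequent: offer discounts or bundles',
    '12': 'strong purchasing power, low loyalty: precision marketing',
    '11': 'low value: win back or deprioritize',
}
rfm['segment'] = rfm['rfm'].map(segment_names)
rfm['segment'].value_counts()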

Topics: Python, Data Analysis, Data Visualization