Implementation of e-commerce website user operation analysis using Hive tool in Hadoop environment

Posted by gljaber on Wed, 02 Feb 2022 06:13:48 +0100

1, Analysis target

1. Establish daily operation index system;
2. Analyze the composition of existing users and count the composition of daily users;
3. Analyze the re purchase situation of users and guide the subsequent re purchase operation.

2, Data description

User_info user information table:

user_action user behavior table:

3, Implementation steps

First, it is necessary to build an operation index system so that operators can see the user scale and the change status of the scale; Then we need to distinguish users according to user stages (new users, activated users, sleep users and lost users), and count the number of users in different stages every day, so that the operation can understand the user scale in different stages and design more appropriate user operation schemes.

Table creation:

create table user_info
(user_id string,
age_between string,
sex int,
user_level int,
reg_time date
) 
row format delimited fields terminated by '\t'
tblproperties("skip.header.line.count"="1"); 

load data local inpath '/home/hadoop/datas/HomeWork/user_info_sample.txt' into table
user_info;

ALTER TABLE user_info SET SERDEPROPERTIES ('serialization.encoding'='GBK');

create table User_action
(user_id string,
sku_id string,
action_time Timestamp,
model_id int,
type int,
cate int,
brand int
) 
row format delimited fields terminated by '\t'
tblproperties("skip.header.line.count"="1"); --Skip the first line when reprinting
--Load data
load data local inpath '/home/hadoop/datas/HomeWork/user_action_sample.txt' into
table user_action;

1. Establish user index system

We need to gradually establish the user index system according to the user activity index:

  • Number of active users per day: DAU, that is, the weight removal statistics of active users per day;
select
    date(action_time) action_time,
    count(distinct user_id) DAU
from
    user_action
group by
    date(action_time);
  • Number of newly registered users per day: DNU, that is, the weight removal statistics of newly registered users per day;
select
    reg_time,
    count(distinct user_id)
from user_info
group by reg_time;
  • Daily consumption conversion rate: that is, the percentage between the number of users who have placed orders and the number of active users every day;
select
    date(action_time) action_time,
    concat(round(count(distinct if(type=4,user_id,Null))/count(distinct user_id)*100,2),"%") as per_con_rate
from user_action
group by date(action_time)
order by date(action_time);
  • Daily consumption conversion rate of new users: that is, the percentage between the number of users who place orders and the number of new registered users every day;
select
    reg_time,
    concat(round(count(distinct if(type=4,user_info.user_id,Null))/count(distinct user_info.user_id)*100,2),"%") as new_per_con_rate
from user_info
left join user_action on user_info.reg_time=date(user_action.action_time)
and user_info.user_id=user_action.user_id
group by reg_time
order by reg_time;

Analyze the change characteristics of various indicators in the near future:

select action_time,
  concat(round((DAU - DAU_1)/DAU_1 *100,2),"%") as DAU_DOD,
  concat(round((DNU - DNU_1)/DNU_1 *100,2),"%") as DNU_DOD,
  concat(round((per_con_rate - per_con_rate_1) / per_con_rate_1 *100,2),"%") as per_con_rate_DOD,
  concat(round((new_per_con_rate - new_per_con_rate_1) / new_per_con_rate_1 *100,2),"%") as new_per_con_rate_DOD,
  concat(round((DAU - DAU_7)/DAU_7 *100,2),"%") as DAU_WOW,
  concat(round((DNU - DNU_7)/DNU_7 *100,2),"%") as DNU_WOW,
  concat(round((per_con_rate - per_con_rate_7) / per_con_rate_7 *100,2),"%") as per_con_rate_WOW,
  concat(round((new_per_con_rate - new_per_con_rate_7) / new_per_con_rate_7 *100,2),"%") as new_per_con_rate_WOW
from(
  select *,
    lag(DAU,1) over(order by action_time) as DAU_1,
    lag(DNU,1) over(order by action_time) as DNU_1,
    lag(per_con_rate,1) over(order by action_time) as per_con_rate_1,
    lag(new_per_con_rate,1) over(order by action_time) as new_per_con_rate_1,
    lag(DAU,7) over(order by action_time) as DAU_7,
    lag(DNU,7) over(order by action_time) as DNU_7,
    lag(per_con_rate,7) over(order by action_time) as per_con_rate_7,
    lag(new_per_con_rate,7) over(order by action_time) as new_per_con_rate_7
  from(
        select date(action_time) action_time,
             count(distinct user_id) DAU,
             concat(round(count(distinct if(type=4,user_id,Null))/count(distinct user_id)*100,2),"%") as per_con_rate
        from user_action
        group by date(action_time) ) as table_a
   left join(
        select reg_time,
              count(distinct user_info.user_id) as DNU,
              concat(round(count(distinct if(type=4,user_info.user_id,Null))/count(distinct user_info.user_id)*100,2),"%") as new_per_con_rate
        from  user_info
        left join user_action 
        on user_info.user_id=user_action.user_id
        and reg_time=date(action_time)
        group by reg_time) as table_b on table_a.action_time=table_b.reg_time) as table_c
order by action_time;

2. Establish phased user labels

  • Take the user on the day of registration as a new user
riqi=reg_time
  • Users who have been registered for less than one week are regarded as inactive users (to avoid new users becoming jump out users)
(max_time-reg_time<7 or max_time is Null) and riqi-reg_time<14
  • On the 8th day after registration, the user enters the activation period. The activation period is from the 8th to the 14th day after registration. The user who visits back during the activation period becomes the activated user. (to make an inactive user an active user)
max_time-reg_time>7 and riqi-max_time<7
  • If the user is not active for a week after becoming an active user; Or users who are not activated during the activation period (days 8 to 14) after registration and are not active after day 14 are regarded as sleep users. (mainly for wake-up)
(max_time-reg_time>7 and riqi-max-time>7 and riqi-max_time<=21)
or(riqi-reg_time>14 and riqi-reg_time<=28 and (max_time-reg_time<7 or max_time is Null))
  • After becoming a sleep user, there are no active users for two weeks as lost users. (although it is in the same wake-up range, the wake-up difficulty of lost users is greater than that of sleeping users, which will not be considered in wake-up activities with less budget.)
(max_time-reg_time>7 and riqi-max_time>21) or
(riqi-reg_time>28 and (max_time-reg_time<7 or max_time is Null))

The reason why we need to count users at different stages is that it is difficult for us to observe the composition of users in our user pool only from DAU, DNU or MAU data. Have users settled down and become more and more? Or is it decreasing? These can only be counted after we break down users in different stages.

2.1 count the user size at different stages of the day according to user behavior

When the number of new users, inactive users and active users increases, and the number of sleep users and lost users decreases, our user scale is on the rise; On the contrary, it shows a downward trend. Analyze the recent change trend of our user scale and judge the health of our user pool.

-- Prepare statistical date data (statistics are made every day, so the time in the user behavior table is pulled out separately as the statistical time)
create table riqi as
select date(action_time) riqi from user_action group by date(action_time);

-- Prepare the last user behavior time (since the statistics date is the last behavior time, the last behavior time on the statistics date is constrained as the last behavior time). riqi The table is a continuous time pulled out separately, not the last behavior time of the user, riqi Table time (independent of user behavior time) 
create table base_info as
select 
    riqi,
    user_id,
    reg_time,
    (select max(action_time) max_time from user_action where date(action_time)<riqi.riqi and user_id=user_info.user_id) as max_time
from riqi,user_info;

select riqi,
       user_type,
       count(*)
from(
    select *,
          case when riqi=reg_time then "new_user"
          when ( day(max_time-reg_time)<7 or max_time is Null) and day(riqi-reg_time)<14 then "inaction_user"
          when day(max_time-reg_time)>7 and day(riqi-max_time)<7 then "action_user"
          when (day(max_time-reg_time)>7 and day(riqi-max_time)>7 and day(riqi-max_time)<=21) or (day(riqi-reg_time)>14 and day(riqi-reg_time)<=28 and (day(max_time-reg_time)<7 or max_time is Null)) then "sleep_user"
          when (day(max_time-reg_time)>7 and day(riqi-max_time)>21) or (day(riqi-reg_time)>28 and (day(max_time-reg_time)<7 or max_time is Null)) then "lose_user"
          end as user_type
     from base_info
) as a
group by riqi,user_type;

2.2 analyze the changes of users in different stages in the last two weeks by week

Changes in the number of users, the number of views per capita, the number of clicks per capita, the number of shopping carts per capita, the number of orders per capita, and the purchase conversion rate.

select if(date(action_time)>"2016-04-08","week_1","week_2") as week,
       count(distinct user_id) DAU_week,
       round(count(if(type=1,user_id,Null))/count(distinct user_id),2) avg_user1,
       round(count(if(type=6,user_id,Null))/count(distinct user_id),2) avg_user6,
       round(count(if(type=2,user_id,Null))/count(distinct user_id),2) avg_user2,
       round(count(if(type=4,user_id,Null))/count(distinct user_id),2) avg_user4,
       concat(round(count(distinct if(type=4,user_id,Null))/count(distinct user_id)*100,2),"%") buy_rate
from(
select * from user_action where date(action_time)>"2016-04-01"
) as a
group by date(action_time)>"2016-04-08";

2.3 compare and analyze the characteristics of crowd portrait and behavior data of new user group and activated user group

Briefly describe whether the new user group introduced by the product is healthy.

-- New user groups
select age_between,sex,
  count(distinct a.user_id) count_user,
  round(count(if(type=1,a.user_id,Null))/count(distinct a.user_id),2) avg_user1,
  round(count(if(type=6,a.user_id,Null))/count(distinct a.user_id),2) avg_user6,
  round(count(if(type=2,a.user_id,Null))/count(distinct a.user_id),2) avg_user2,
  round(count(if(type=4,a.user_id,Null))/count(distinct a.user_id),2) avg_user4,
  concat(round(count(distinct if(type=4,a.user_id,Null))/count(distinct a.user_id)*100,2),"%") buy_rate 
from
(select * from user_info where reg_time>"2016-04-08" and reg_time<="2016-04-15") as a
left join 
(select * from user_action where date(action_time)>"2016-04-08") as b
on a.user_id=b.user_id
group by age_between,sex
grouping sets(age_between,sex);

-- Activate user groups
select age_between,sex,
  count(distinct a.user_id) count_user,
  round(count(if(type=1,a.user_id,Null))/count(distinct a.user_id),2) avg_user1,
  round(count(if(type=6,a.user_id,Null))/count(distinct a.user_id),2) avg_user6,
  round(count(if(type=2,a.user_id,Null))/count(distinct a.user_id),2) avg_user2,
  round(count(if(type=4,a.user_id,Null))/count(distinct a.user_id),2) avg_user4,
  concat(round(count(distinct if(type=4,a.user_id,Null))/count(distinct a.user_id)*100,2),"%") buy_rate 
from
(select * from user_info where reg_time<="2016-04-08") as a
inner join 
(select * from user_action where date(action_time)>"2016-04-08") as b
on a.user_id=b.user_id
group by age_between,sex
grouping sets(age_between,sex);

3. User repurchase analysis

For user operation, due to the high cost of acquiring customers, improving user repurchase is the key to maximize user value.

When should we pay attention to repurchase? Kevin hilstrom gives reference in lean data analysis:

  • The repeat purchase rate reaches 1% ~ 15% within 90 days; Indicates that you are in user acquisition mode; Put more energy and resources into the acquisition and transformation of new users;

  • The repeat purchase rate reaches 15 ~ 30% within 90 days; Indicates that you are in mixed mode; Balance the energy and resources spent on the transformation of new users and the retention and re purchase of old users;

  • The repeat purchase rate reaches more than 30% within 90 days; Indicates that you are in loyalty mode; Put more energy and resources into user repurchase;

3.1 calculate the re purchase of users from February 1 to April 30

  • Re purchase rate of users = number of re purchase users / number of active users;
  • Number of re purchase users: the number of users who have purchased twice or more in a period of time

After statistics, the repurchase rate of users is 8.87%, indicating that the product is in the user acquisition mode.

select 
count(if(buy_count>=2,user_id,Null)) re_buy,
count(distinct user_id) AU,
concat(round(count(if(buy_count>=2,user_id,Null))/count(distinct user_id)*100,2),"%") as re_buy_rate
from 
(select user_id,
     count(if(type=4,user_id,Null)) buy_count
from user_action
where date(action_time)>="2016-02-01"
group by user_id) as a;

3.2 compare the user portrait and behavior characteristics of re purchase users and non re purchase users

Combined with the user information table and user purchase behavior table, we can see the characteristics of re purchase users: re purchase users are mostly men aged 26-35.

-- Re purchase user
select age_between,sex,
count(if(buy_count>=2,a.user_id,Null)) re_buy,
count(distinct a.user_id) AU,
concat(round(count(if(buy_count>=2,a.user_id,Null))/count(distinct a.user_id)*100,2),"%") as re_buy_rate
from 
(select user_id,count(if(type=4,user_id,Null)) buy_count from user_action
group by user_id) as a
left join
(select user_id,age_between,sex from user_info) as b
on a.user_id=b.user_id
group by age_between,sex
grouping sets(age_between,sex);

-- Non repurchase user
select age_between,sex,
count(if(buy_count<2,a.user_id,Null)) not_re_buy,
count(distinct a.user_id) AU,
concat(round(count(if(buy_count<2,a.user_id,Null))/count(distinct a.user_id)*100,2),"%") as not_re_buy_rate
from 
(select user_id,count(if(type=4,user_id,Null)) buy_count from user_action
group by user_id) as a
left join
(select user_id,age_between,sex from user_info) as b
on a.user_id=b.user_id
group by age_between,sex
grouping sets(age_between,sex);

3.3 analyze the user activation and loss of recent re purchase users in combination with the phased user labels in the last week

select user_type,count(distinct user_id) user_num
from(
select distinct a.user_id as user_id,
          case when reg_time="2016-04-15" then "new_user"
          when (day(max_time-reg_time)<=7 or max_time is Null)and reg_time>"2016-04-01" then "inaction_user"
          when day(max_time-reg_time)>7 and max_time>"2016-04-08" then "action_user"
          when (day(max_time-reg_time)>7 and max_time<"2016-04-08" and max_time>="2016-03-25") or (reg_time<"2016-04-01" and reg_time>="2016-03-18" and (day(max_time-reg_time)<7 or max_time is Null)) then "sleep_user"
          when (day(max_time-reg_time)>7 and max_time<"2016-03-25") or (reg_time<"2016-03-18" and (day(max_time-reg_time)<7 or max_time is Null)) then "lose_user"
          end as user_type
from
(select a1.user_id user_id,a1.action_time action_time,max(a1.action_time)over(partition by a1.user_id) max_time
from(
    select user_id,date(action_time) action_time,count(user_id)over(partition by user_id) buy_count
    from user_action where type=4) as a1 
    where buy_count>=2)as a
left join
(select user_id,reg_time from user_info) as b
on a.user_id=b.user_id
) as c
group by user_type; 

4. Report

https://gitee.com/Ariel999/data-analysis-case/blob/master/Hive%E5%AD%A6%E4%B9%A0/%E7%94%A8%E6%88%B7%E8%BF%90%E8%90%A5%E5%89%8D%E6%9C%9F%E5%88%86%E6%9E%90%E6%8A%A5%E5%91%8A-%E5%BC%A0%E8%8D%A3%E8%8D%A3.pptx

Topics: Big Data Hadoop hive HQL