The number of consecutive check-in days of hive users and the maximum number of consecutive check-in days in history

Posted by borris_uk on Thu, 10 Mar 2022 05:54:34 +0100

demand

Table temp_user_login, mark whether the user signs in every day (Note: this table contains the attendance records of all users on all working days), including three fields: date user id user_id if_login 0 not checked in 1 checked in
Question 1: count the number of consecutive sign in days of each user up to now (the output table only includes all users who sign in on the same day, and calculate their consecutive sign in days)
Question 2: count the maximum continuous check-in days of each user in history (the output table is all the users in the user check-in table, and calculate the maximum continuous check-in days in History)

Data preparation

create table temp_user_login
(
    day_id   string,
    user_id  int,
    if_login int
) stored as orc tblproperties ('orc.compress' = 'snappy');
insert into temp_user_login
values ('2022-03-01', 1, 1),
       ('2022-03-01', 2, 0),
       ('2022-03-01', 3, 0),
       ('2022-03-02', 1, 1),
       ('2022-03-02', 2, 1),
       ('2022-03-02', 3, 0),
       ('2022-03-03', 1, 1),
       ('2022-03-03', 2, 0),
       ('2022-03-03', 3, 1),
       ('2022-03-04', 1, 1),
       ('2022-03-04', 2, 0),
       ('2022-03-04', 3, 0),
       ('2022-03-05', 1, 1),
       ('2022-03-05', 2, 1),
       ('2022-03-05', 3, 1);
select * from temp_user_login;
date
user_id
if_login
2022/3/1
1
1
2022/3/1
2
0
2022/3/1
3
0
2022/3/2
1
1
2022/3/2
2
1
2022/3/2
3
0
2022/3/3
1
1
2022/3/3
2
0
2022/3/3
3
1
2022/3/4
1
1
2022/3/4
2
0
2022/3/4
3
0
2022/3/5
1
1
2022/3/5
2
1
2022/3/5
3
1

Solution - demand I

select *
from (select user_id, datediff('2022-03-05', date_max) date_cz
      from (select t.user_id, max(case when t.if_login = 0 then t.day_id else date_sub(t1.login_min, 1) end) date_max
            from temp_user_login t
                     left join
                 (select user_id, min(day_id) login_min
                  from temp_user_login
                  where if_login = 1
                  group by user_id) t1
            group by t.user_id) t) t
where date_cz != 0;
user_id
days_cz
1
3
2
1
3
1

Solution - demand II

Topic 2 can be used to find the maximum number of consecutive check-in days in each user's history. The problem is simplified as follows: find the longest check-in sequence of user 1;
user
Sign in
1
1
1
0
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
The data of 0 and 1 in the check-in table will be marked separately. We will distinguish and regroup the data of different events. The specific algorithm is as follows
  1. Determine whether the check-in columns of the current row and the previous row are equal. If the inequality is set to 1, the equality is 0, which is used as the flag column
  2. Accumulate the flag column and take the value as the bucket id
  3. Group according to the accumulated value (bucket id).
  4. Reorder within group
user
Sign in
flag
Cumulative value
1
1
0
0
1
0
1
1
1
0
0
1
1
1
1
2
1
0
1
3
1
1
1
4
1
1
0
4
1
1
0
4
1
1
0
4
1
1
0
4
1
0
1
5
1
1
1
6
The cumulative value in the last column of the above table is the id to be grouped. Through this algorithm, the data belonging to the same category can be distinguished. It is obvious that the background color marked in the above table is the longest sequence and the longest number of consecutive check-in days. The above marked distinction is written as SQL as follows:
1. Determine whether the check-in columns of the current row and the previous row are equal. If the inequality is set to 1, the equality is 0, which is used as the flag column
select *,
       case
           when lag(if_login, 1, if_login) over (partition by user_id order by day_id) != if_login then 1
           else 0 end flag
from temp_user_login

2. Accumulate the flag column and take the value as the bucket id

select day_id, user_id, if_login, flag, sum(flag) over (partition by user_id order by day_id) tong_id
from (select *,
             case
                 when lag(if_login, 1, if_login) over (partition by user_id order by day_id) != if_login then 1
                 else 0 end flag
      from temp_user_login) t;

3. Group according to the accumulated value (bucket id)

select user_id, sum(if_login) sum_logs_days
from (select day_id, user_id, if_login, flag, sum(flag) over (partition by user_id order by day_id) tong_id
      from (select *,
                   case
                       when lag(if_login, 1, if_login) over (partition by user_id order by day_id) != if_login
                           then 1
                       else 0 end flag
            from temp_user_login) t) t
group by user_id, tong_id;

4. Reorder within group

select user_id, max(sum_logs_days) max_length_days
from (select user_id, sum(if_login) sum_logs_days
      from (select day_id, user_id, if_login, flag, sum(flag) over (partition by user_id order by day_id) tong_id
            from (select *,
                         case
                             when lag(if_login, 1, if_login) over (partition by user_id order by day_id) != if_login
                                 then 1
                             else 0 end flag
                  from temp_user_login) t) t
      group by user_id, tong_id) t
group by user_id;

final result

user_id
max_length_days
2
1
3
1
1
5

Topics: hive