demand
Table temp_user_login, mark whether the user signs in every day (Note: this table contains the attendance records of all users on all working days), including three fields: date user id user_id if_login 0 not checked in 1 checked in
Question 1: count the number of consecutive sign in days of each user up to now (the output table only includes all users who sign in on the same day, and calculate their consecutive sign in days)
Question 2: count the maximum continuous check-in days of each user in history (the output table is all the users in the user check-in table, and calculate the maximum continuous check-in days in History)
Data preparation
create table temp_user_login ( day_id string, user_id int, if_login int ) stored as orc tblproperties ('orc.compress' = 'snappy'); insert into temp_user_login values ('2022-03-01', 1, 1), ('2022-03-01', 2, 0), ('2022-03-01', 3, 0), ('2022-03-02', 1, 1), ('2022-03-02', 2, 1), ('2022-03-02', 3, 0), ('2022-03-03', 1, 1), ('2022-03-03', 2, 0), ('2022-03-03', 3, 1), ('2022-03-04', 1, 1), ('2022-03-04', 2, 0), ('2022-03-04', 3, 0), ('2022-03-05', 1, 1), ('2022-03-05', 2, 1), ('2022-03-05', 3, 1); select * from temp_user_login;
date
|
user_id
|
if_login
|
2022/3/1
|
1
|
1
|
2022/3/1
|
2
|
0
|
2022/3/1
|
3
|
0
|
2022/3/2
|
1
|
1
|
2022/3/2
|
2
|
1
|
2022/3/2
|
3
|
0
|
2022/3/3
|
1
|
1
|
2022/3/3
|
2
|
0
|
2022/3/3
|
3
|
1
|
2022/3/4
|
1
|
1
|
2022/3/4
|
2
|
0
|
2022/3/4
|
3
|
0
|
2022/3/5
|
1
|
1
|
2022/3/5
|
2
|
1
|
2022/3/5
|
3
|
1
|
Solution - demand I
select * from (select user_id, datediff('2022-03-05', date_max) date_cz from (select t.user_id, max(case when t.if_login = 0 then t.day_id else date_sub(t1.login_min, 1) end) date_max from temp_user_login t left join (select user_id, min(day_id) login_min from temp_user_login where if_login = 1 group by user_id) t1 group by t.user_id) t) t where date_cz != 0;
user_id
|
days_cz
|
1
|
3
|
2
|
1
|
3
|
1
|
Solution - demand II
Topic 2 can be used to find the maximum number of consecutive check-in days in each user's history. The problem is simplified as follows: find the longest check-in sequence of user 1;
user
|
Sign in
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
The data of 0 and 1 in the check-in table will be marked separately. We will distinguish and regroup the data of different events. The specific algorithm is as follows
- Determine whether the check-in columns of the current row and the previous row are equal. If the inequality is set to 1, the equality is 0, which is used as the flag column
- Accumulate the flag column and take the value as the bucket id
- Group according to the accumulated value (bucket id).
- Reorder within group
user
|
Sign in
|
flag
|
Cumulative value
|
1
|
1
|
0
|
0
|
1
|
0
|
1
|
1
|
1
|
0
|
0
|
1
|
1
|
1
|
1
|
2
|
1
|
0
|
1
|
3
|
1
|
1
|
1
|
4
|
1
|
1
|
0
|
4
|
1
|
1
|
0
|
4
|
1
|
1
|
0
|
4
|
1
|
1
|
0
|
4
|
1
|
0
|
1
|
5
|
1
|
1
|
1
|
6
|
The cumulative value in the last column of the above table is the id to be grouped. Through this algorithm, the data belonging to the same category can be distinguished. It is obvious that the background color marked in the above table is the longest sequence and the longest number of consecutive check-in days. The above marked distinction is written as SQL as follows:
1. Determine whether the check-in columns of the current row and the previous row are equal. If the inequality is set to 1, the equality is 0, which is used as the flag column
select *, case when lag(if_login, 1, if_login) over (partition by user_id order by day_id) != if_login then 1 else 0 end flag from temp_user_login
2. Accumulate the flag column and take the value as the bucket id
select day_id, user_id, if_login, flag, sum(flag) over (partition by user_id order by day_id) tong_id from (select *, case when lag(if_login, 1, if_login) over (partition by user_id order by day_id) != if_login then 1 else 0 end flag from temp_user_login) t;
3. Group according to the accumulated value (bucket id)
select user_id, sum(if_login) sum_logs_days from (select day_id, user_id, if_login, flag, sum(flag) over (partition by user_id order by day_id) tong_id from (select *, case when lag(if_login, 1, if_login) over (partition by user_id order by day_id) != if_login then 1 else 0 end flag from temp_user_login) t) t group by user_id, tong_id;
4. Reorder within group
select user_id, max(sum_logs_days) max_length_days from (select user_id, sum(if_login) sum_logs_days from (select day_id, user_id, if_login, flag, sum(flag) over (partition by user_id order by day_id) tong_id from (select *, case when lag(if_login, 1, if_login) over (partition by user_id order by day_id) != if_login then 1 else 0 end flag from temp_user_login) t) t group by user_id, tong_id) t group by user_id;
final result
user_id
|
max_length_days
|
2
|
1
|
3
|
1
|
1
|
5
|