1. Description of index caliber
Retained on the next day: the number of users who use the product on the first day and still use it on the second day
Multi day retention: the number of users who use the product on the first day and still use it after many days
Number of people continuously active: how many people continuously use products
2. The original scheme shall be retained on the next day, retained for multiple days, and continuously active users
2.1 table field description:
The table has four fields: uid, action, product, date
uid: user id, unique identification of the user
action: the user's behavior in a product
Product: a product id used by the user
Date: the date when the user uses a product
2.2 retain the original calculation scheme - left connection
Idea:
- In uid and date granularity de duplication, use the group by uid and date method
- The left connection is made according to the id field and date field. The date of the right table must be greater than the date of the left table
- Group and aggregate according to the ID and date in the left table to calculate the multi-day retention. For example, date in the right table - date in the left table = 1, that is, the user is still active the next day on the basis of date in the left table, that is, what we call retention the next day
- You can get the required multi-day retention respectively
This explanation will realize the view functions based on mysql in turn
Step 1:
CREATE VIEW retention1 AS SELECT uid, date FROM temp GROUP BY uid,date
Step 2:
CREATE VIEW retention2 AS SELECT r1.uid AS r1_uid, r1.date AS r1_date, r2.date AS r2_date FROM retention1 AS r1 LEFT JOIN retention1 AS r2 ON r1.uid = r2.uid AND T2.date > T1.date
Step 3:
CREATE VIEW retention3 AS SELECT t1_date AS date ,COUNT(DISTINCT r1_uid) AS day0_count ,COUNT(DISTINCT IF(DATEDIFF(t2_date,t1_date)=1,r1_uid,NULL)) AS day1_count ,COUNT(DISTINCT IF(DATEDIFF(t2_date,t1_date)=3,r1_uid,NULL)) AS day3_count ,COUNT(DISTINCT IF(DATEDIFF(t2_date,t1_date)=7,r1_uid,NULL)) AS day7_count FROM retention2 GROUP BY t1_date
Where, day0_count,day1_count,day3_count,day7_count refers to the number of people on the first day, the number of people retained on the next day, the number of people retained on the third day and the number of people retained on the seventh day.
Disadvantages: this implementation scheme will make Cartesian product of complete table data, which has large data magnitude, high time cost and occupies more resources.
Note: the reason why the left connection is used here is that we use the left table as the base date table for calculation retention. Therefore, the data in the right table may not exist after connection, but the date in the left table needs to be left. If inner join is used, the final calculated number of retained persons will be equal to the number of active persons on the first day, that is, the retention rate = 100%, which is incorrect.
2.3 original scheme for continuous active population calculation - row_number calculate tag
Detailed description to be continued
Implementation method:
--Log in for three consecutive days SELECT DISTINCT uid, COUNT(1) FROM ( SELECT uid, date - ROW_NUMBER() OVER(PARTITION BY uid ORDER BY date) as tag FROM table WHERE date > CUR_DATE() - 30 ) t1 GROUP BY uid,tag HAVING COUNT(1) = 3;
3. Optimization scheme for continuous active users with next day retention and multi day retention
3.1 ・ retention calculation optimization scheme 1 - screening advance based on the original scheme
As mentioned earlier, the disadvantage of the original scheme is that it uses complete data for Cartesian product. However, in normal work, we only need to retain on a specific date, that is, we only need to determine a standard day and calculate the multi-day retention according to this standard day.
Assuming June 18 as the base date, the multi-day retention is calculated as follows:
- As before, in the granularity of uid and date, use the group by uid and date method
- Take June 18 in the left table as the base date, and connect with June 19 and June 21 in the right table as the retention days of the next day and three days to be calculated
- count calculates the number of people retained. If uid in the right table is blank, it means that the user has not used products on the next day or three days with June 18 as the base date.
The first step is shown above
Step 2:
( SELECT uid FROM retention1 WHERE date = "2021-06-18" ) AS left_table LEFT JOIN ( SELECT uid FROM retention1 WHERE date in ("2021-06-19","2021-06-21") ) AS right_table
Step 3:
SELECT a.date AS day0, COUNT( IF(b.date = "2021-06-19" AND b.uid IS NOT NULL, a.uid, NULL) ) AS day1_count, COUNT( IF(b.date = "2021-06-21" AND b.uid IS NOT NULL, a.uid, NULL) ) AS day3_count FROM ( SELECT uid date, FROM retention1 WHERE date = "2021-06-18" ) AS a LEFT JOIN ( SELECT uid date, FROM retention1 WHERE date in ("2021-06-19","2021-06-21") ) AS b
After being implemented in this way, compared with the original scheme, it is obvious that the data level computing resources are much smaller, because we only calculate the retention date we need.
3.2 # retention calculation optimization scheme 2 - based on the idea of bitmap
3.2.1 concept explanation
Friends who don't understand bitmap don't need to worry, and it doesn't affect their understanding.
If you want to know friends, you can also refer to https://zhuanlan.zhihu.com/p/54783053.
First, you need to optimize the table structure and add a mapset field with the field type varchar or text
The data in the assumption table are as follows:
uid | mapset | date |
1 | 1,1,1,0,1,0,1,0 | 2021/6/18 |
2 | 0,0,1,1,0,1,1,0 | 2021/6/18 |
3 | 1,1,1,1,0,1,1,1 | 2021/6/18 |
Date is the date before January based on the current day (assuming today is July 18)
In the mapset field, 1 indicates active and 0 indicates inactive.
Take the user with uid of 1 as an example. The mapset field values of 1,1,1,0,1,0,1 respectively indicate that the user is active from June 18 to June 20, stops being active for one day on June 21, and so on. (the first number on the left represents whether it is active on June 18, and so on)
3.2.2 retention calculation
After understanding this idea, let's continue to count. How to calculate the retention on the next day and the seventh day of June 18-- Pause for 1 minute
Did you think of it? As long as the first two numbers of mapset are 1, it means that the user is the user retained the next day, and then count(uid).
The implementation is as follows:
Step 1: find out the users who are active on the first day and are active on the next day or the seventh day respectively
SELECT uid, bigint(split(mapset,",")[0]) as active0, bigint(split(mapset,",")[1]) as active1, bigint(split(mapset,",")[7]) as active7, FROM temp WHERE date = "2021/06/18" and bigint(split(mapset,",")[0])=1 GROUP BY uid,mapset
Step 2: sum active0, active1 and active7 respectively through sum() to obtain the number of active people, three-day retained people and seven-day retained people on the base day.
SELECT SUM(active0), SUM(active1), SUM(active7) FROM ( SELECT uid, bigint(split(mapset,",")[0]) as active0, bigint(split(mapset,",")[1]) as active1, bigint(split(mapset,",")[7]) as active7, FROM temp WHERE date = "2021/06/18" and bigint(split(mapset,",")[0])=1 GROUP BY uid,mapset ) AS t1
You can refer to this example and think about how to calculate the 30 day retention-- Pause for 1 minute
3.3 optimization scheme for the number of active people for multiple consecutive days - based on the idea of bitmap
3.3.1 active users for more than consecutive days
How to calculate the number of people who have been active for many consecutive days-- Pause thinking for 3 minutes
All right, let's go on. The answer is very simple. Remember the commonly used fuzzy matching like?
If we need to calculate that there are more than 3 consecutive active users in a month, we only need to like "%1,1,1%"
The method is as follows:
SELECT COUNT(DISTINCT uid) AS consecutive_3days_count FROM temp WHERE date = "2020/06/18" AND mapset like "%1,1,1%"
If it is a user who has been active for more than 7 consecutive days in a month, it can also be obtained by analogy.
3.3.2 active users for several consecutive days
Another extended question is, what if we need to calculate only the users who have been active for only 3 days this month? Or users who have only been active for 7 days? How should we calculate-- Pause for 3 minutes
Did you think of it? Here we also need to go back to our previous example - the number of active users for more than several consecutive days
Assuming that we only need to calculate users who have been active for 3 days, we only need [3,30] - [4,30] (respectively distinguish active users for more than 3 consecutive days and active users for more than 4 consecutive days, 30 represents the highest value of the interval). By subtracting the interval, we will get users who have only been active for 3 days in 30 days.
The method is as follows:
SELECT ( SELECT COUNT(DISTINCT uid) FROM temp WHERE date = "2020/06/18" AND mapset like "%1,1,1%" - SELECT COUNT(DISTINCT uid) FROM temp WHERE date = "2020/06/18" AND mapset like "%1,1,1,1%" )
4. Other instructions
1. When calculating retention, you should confirm the calculation caliber with the leader. For example, the same company has two products, and users use different products for two days. Is such a user retained? Or a user uses the company's products in Guangdong Province on the first day and the company's products in Hainan Province on the second question. Is such a user retained?
2. Generally, the count(distinct xxx) in sql is easy to cause data skew, and the group by conversion method needs to be used to realize de counting. I will write another blog post to explain the detailed process later. Just remember this.
At the end of the article, I hope this article will help you and wish you a higher level every day.