SQL retention analysis (next day retention, multi day retention) and optimization scheme for continuous active population

Posted by kontesto on Wed, 26 Jan 2022 21:18:10 +0100

1. Description of index caliber

Retained on the next day: the number of users who use the product on the first day and still use it on the second day

Multi day retention: the number of users who use the product on the first day and still use it after many days

Number of people continuously active: how many people continuously use products

2. The original scheme shall be retained on the next day, retained for multiple days, and continuously active users

2.1 table field description:

The table has four fields: uid, action, product, date

uid: user id, unique identification of the user

action: the user's behavior in a product

Product: a product id used by the user

Date: the date when the user uses a product

2.2 retain the original calculation scheme - left connection

Idea:

  1. In uid and date granularity de duplication, use the group by uid and date method
  2. The left connection is made according to the id field and date field. The date of the right table must be greater than the date of the left table
  3. Group and aggregate according to the ID and date in the left table to calculate the multi-day retention. For example, date in the right table - date in the left table = 1, that is, the user is still active the next day on the basis of date in the left table, that is, what we call retention the next day
  4. You can get the required multi-day retention respectively

This explanation will realize the view functions based on mysql in turn

Step 1:

CREATE VIEW retention1 AS
SELECT
    uid,
    date
FROM
    temp
GROUP BY
    uid,date

Step 2:

CREATE VIEW retention2 AS
SELECT
    r1.uid AS r1_uid,
    r1.date AS r1_date,
    r2.date AS r2_date
FROM
    retention1 AS r1
LEFT JOIN
    retention1 AS r2 ON r1.uid = r2.uid AND T2.date > T1.date

Step 3:

CREATE VIEW retention3 AS
SELECT
	t1_date AS date
	,COUNT(DISTINCT r1_uid) AS day0_count
	,COUNT(DISTINCT IF(DATEDIFF(t2_date,t1_date)=1,r1_uid,NULL)) AS day1_count
	,COUNT(DISTINCT IF(DATEDIFF(t2_date,t1_date)=3,r1_uid,NULL)) AS day3_count
	,COUNT(DISTINCT IF(DATEDIFF(t2_date,t1_date)=7,r1_uid,NULL)) AS day7_count
FROM
	retention2
GROUP BY
	t1_date

Where, day0_count,day1_count,day3_count,day7_count refers to the number of people on the first day, the number of people retained on the next day, the number of people retained on the third day and the number of people retained on the seventh day.

Disadvantages: this implementation scheme will make Cartesian product of complete table data, which has large data magnitude, high time cost and occupies more resources.

Note: the reason why the left connection is used here is that we use the left table as the base date table for calculation retention. Therefore, the data in the right table may not exist after connection, but the date in the left table needs to be left. If inner join is used, the final calculated number of retained persons will be equal to the number of active persons on the first day, that is, the retention rate = 100%, which is incorrect.

2.3 original scheme for continuous active population calculation - row_number calculate tag

Detailed description to be continued

Implementation method:

--Log in for three consecutive days
SELECT 
    DISTINCT uid,
    COUNT(1)
FROM
(    
    SELECT 
	    uid,
	    date - ROW_NUMBER() OVER(PARTITION BY uid ORDER BY date) as tag
    FROM
        table
    WHERE
        date > CUR_DATE() - 30
) t1
GROUP BY
    uid,tag 
HAVING COUNT(1) = 3;

3. Optimization scheme for continuous active users with next day retention and multi day retention

3.1 ・ retention calculation optimization scheme 1 - screening advance based on the original scheme

As mentioned earlier, the disadvantage of the original scheme is that it uses complete data for Cartesian product. However, in normal work, we only need to retain on a specific date, that is, we only need to determine a standard day and calculate the multi-day retention according to this standard day.

Assuming June 18 as the base date, the multi-day retention is calculated as follows:

  1. As before, in the granularity of uid and date, use the group by uid and date method
  2. Take June 18 in the left table as the base date, and connect with June 19 and June 21 in the right table as the retention days of the next day and three days to be calculated
  3. count calculates the number of people retained. If uid in the right table is blank, it means that the user has not used products on the next day or three days with June 18 as the base date.

The first step is shown above

Step 2:

(
SELECT
    uid
FROM
    retention1
WHERE
    date = "2021-06-18"
) AS left_table
LEFT JOIN
(
SELECT
    uid
FROM
    retention1
WHERE
    date in ("2021-06-19","2021-06-21")
) AS right_table

Step 3:

SELECT
    a.date AS day0,
    COUNT(
        IF(b.date = "2021-06-19" AND b.uid IS NOT NULL, a.uid, NULL)
    ) AS day1_count,
    COUNT(
        IF(b.date = "2021-06-21" AND b.uid IS NOT NULL, a.uid, NULL)
    ) AS day3_count
FROM
    (
        SELECT
            uid
            date,
        FROM
            retention1
        WHERE
            date = "2021-06-18"
    ) AS a
LEFT JOIN
    (
        SELECT
            uid
            date,
        FROM
            retention1
        WHERE
            date in ("2021-06-19","2021-06-21")
    ) AS b

After being implemented in this way, compared with the original scheme, it is obvious that the data level computing resources are much smaller, because we only calculate the retention date we need.

3.2 # retention calculation optimization scheme 2 - based on the idea of bitmap

3.2.1 concept explanation

Friends who don't understand bitmap don't need to worry, and it doesn't affect their understanding.

If you want to know friends, you can also refer to https://zhuanlan.zhihu.com/p/54783053.

First, you need to optimize the table structure and add a mapset field with the field type varchar or text

The data in the assumption table are as follows:

uidmapsetdate
11,1,1,0,1,0,1,02021/6/18
20,0,1,1,0,1,1,02021/6/18
31,1,1,1,0,1,1,12021/6/18

Date is the date before January based on the current day (assuming today is July 18)

In the mapset field, 1 indicates active and 0 indicates inactive.

Take the user with uid of 1 as an example. The mapset field values of 1,1,1,0,1,0,1 respectively indicate that the user is active from June 18 to June 20, stops being active for one day on June 21, and so on. (the first number on the left represents whether it is active on June 18, and so on)

3.2.2 retention calculation

After understanding this idea, let's continue to count. How to calculate the retention on the next day and the seventh day of June 18-- Pause for 1 minute

Did you think of it? As long as the first two numbers of mapset are 1, it means that the user is the user retained the next day, and then count(uid).

The implementation is as follows:

Step 1: find out the users who are active on the first day and are active on the next day or the seventh day respectively

SELECT
    uid,
    bigint(split(mapset,",")[0]) as active0,
    bigint(split(mapset,",")[1]) as active1,
    bigint(split(mapset,",")[7]) as active7,
FROM
    temp
WHERE
    date = "2021/06/18" and bigint(split(mapset,",")[0])=1
GROUP BY
    uid,mapset

Step 2: sum active0, active1 and active7 respectively through sum() to obtain the number of active people, three-day retained people and seven-day retained people on the base day.

SELECT
    SUM(active0),
    SUM(active1),
    SUM(active7)
FROM
(
    SELECT
        uid,
        bigint(split(mapset,",")[0]) as active0,
        bigint(split(mapset,",")[1]) as active1,
        bigint(split(mapset,",")[7]) as active7,
    FROM
        temp
    WHERE
        date = "2021/06/18" and bigint(split(mapset,",")[0])=1
    GROUP BY
        uid,mapset
) AS t1

You can refer to this example and think about how to calculate the 30 day retention-- Pause for 1 minute

3.3 optimization scheme for the number of active people for multiple consecutive days - based on the idea of bitmap

3.3.1 active users for more than consecutive days

How to calculate the number of people who have been active for many consecutive days-- Pause thinking for 3 minutes

All right, let's go on. The answer is very simple. Remember the commonly used fuzzy matching like?

If we need to calculate that there are more than 3 consecutive active users in a month, we only need to like "%1,1,1%"

The method is as follows:

SELECT
    COUNT(DISTINCT uid) AS consecutive_3days_count
FROM
    temp
WHERE
    date = "2020/06/18"
    AND mapset like "%1,1,1%"

If it is a user who has been active for more than 7 consecutive days in a month, it can also be obtained by analogy.

3.3.2 active users for several consecutive days

Another extended question is, what if we need to calculate only the users who have been active for only 3 days this month? Or users who have only been active for 7 days? How should we calculate-- Pause for 3 minutes

Did you think of it? Here we also need to go back to our previous example - the number of active users for more than several consecutive days

Assuming that we only need to calculate users who have been active for 3 days, we only need [3,30] - [4,30] (respectively distinguish active users for more than 3 consecutive days and active users for more than 4 consecutive days, 30 represents the highest value of the interval). By subtracting the interval, we will get users who have only been active for 3 days in 30 days.

The method is as follows:

SELECT
(
    SELECT
        COUNT(DISTINCT uid)
    FROM
        temp
    WHERE
        date = "2020/06/18"
        AND mapset like "%1,1,1%"
    -
    SELECT
        COUNT(DISTINCT uid)
    FROM
        temp
    WHERE
        date = "2020/06/18"
        AND mapset like "%1,1,1,1%"
)

4. Other instructions

1. When calculating retention, you should confirm the calculation caliber with the leader. For example, the same company has two products, and users use different products for two days. Is such a user retained? Or a user uses the company's products in Guangdong Province on the first day and the company's products in Hainan Province on the second question. Is such a user retained?

2. Generally, the count(distinct xxx) in sql is easy to cause data skew, and the group by conversion method needs to be used to realize de counting. I will write another blog post to explain the detailed process later. Just remember this.

At the end of the article, I hope this article will help you and wish you a higher level every day.

Topics: Database MySQL SQL Data Analysis