HiveSQL interview question 29 -- find the maximum number of people online at the same time and the peak time period [accumulator idea, time-series analysis]

Posted by Mzor on Sat, 22 Jan 2022 04:45:39 +0100

Contents

0 Requirement analysis

1 Data preparation

2 Data analysis

3 Summary

0 Requirement analysis

The data records live-streaming sessions: id is the anchor ID, stt is the start time, and edt is the end time.

id      stt                     edt
1001    2021-06-14 12:12:12     2021-06-14 18:12:12
1003    2021-06-14 13:12:12     2021-06-14 16:12:12
1004    2021-06-14 13:15:12     2021-06-14 20:12:12
1002    2021-06-14 15:12:12     2021-06-14 16:12:12
1005    2021-06-14 15:18:12     2021-06-14 20:12:12
1001    2021-06-14 20:12:12     2021-06-14 23:12:12
1006    2021-06-14 21:12:12     2021-06-14 23:15:12
1007    2021-06-14 22:12:12     2021-06-14 23:10:12

Requirements:

  • (1) What is the maximum number of anchors online on the platform at the same time on a given day?
  • (2) What is the peak time period?

1 Data preparation

(1) Data

vim play.txt

id	stt			        edt
1001	2021-06-14 12:12:12	        2021-06-14 18:12:12
1003	2021-06-14 13:12:12		2021-06-14 16:12:12
1004	2021-06-14 13:15:12		2021-06-14 20:12:12
1002	2021-06-14 15:12:12		2021-06-14 16:12:12
1005	2021-06-14 15:18:12		2021-06-14 20:12:12
1001	2021-06-14 20:12:12		2021-06-14 23:12:12
1006	2021-06-14 21:12:12		2021-06-14 23:15:12
1007	2021-06-14 22:12:12		2021-06-14 23:10:12

(2) Build table

create table if not exists play(
    id  string,
    stt string,
    edt string
)
row format delimited
fields terminated by '\t';

(3) Load data

load data local inpath "/home/centos/dan_test/play.txt" into table play;

(4) Query data

hive> select * from play;
OK
1001	2021-06-14 12:12:12	2021-06-14 18:12:12
1003	2021-06-14 13:12:12	2021-06-14 16:12:12
1004	2021-06-14 13:15:12	2021-06-14 20:12:12
1002	2021-06-14 15:12:12	2021-06-14 16:12:12
1005	2021-06-14 15:18:12	2021-06-14 20:12:12
1001	2021-06-14 20:12:12	2021-06-14 23:12:12
1006	2021-06-14 21:12:12	2021-06-14 23:15:12
1007	2021-06-14 22:12:12	2021-06-14 23:10:12
Time taken: 0.087 seconds, Fetched: 8 row(s)

2 Data analysis

Problem 1 Analysis:

If it is hard to attack the problem directly in SQL, we can change our thinking: suppose the records arrive one by one, what would we do in a Java program? This is essentially the accumulator idea (like a Spark accumulator). First, split each record into two events, one for going online and one for going offline, and label each event: +1 when an anchor starts broadcasting, -1 when an anchor stops. Then let the events flow into the accumulator in chronological order and keep a running total. The maximum value the running total ever reaches is the answer. The essence is the accumulator idea applied to a stream of events ordered by time.

The above idea can be summarized as follows:

  • (1) Split each record into two events, one at the start time and one at the end time
  • (2) Label the events: +1 for going online, -1 for going offline
  • (3) Sort the events by time
  • (4) Feed the events into the accumulator and accumulate them
  • (5) Take the maximum value reached by the accumulated total

With these steps in mind, let's translate them into SQL.

(1) Split the data: convert the start time and the end time into separate records, i.e. a column-to-row conversion, using the familiar UNION operation.

select id,stt dt from play
union
select id,edt dt from play
--------------------------------------------------------------------------------
OK
1001	2021-06-14 12:12:12
1001	2021-06-14 18:12:12
1001	2021-06-14 20:12:12
1001	2021-06-14 23:12:12
1002	2021-06-14 15:12:12
1002	2021-06-14 16:12:12
1003	2021-06-14 13:12:12
1003	2021-06-14 16:12:12
1004	2021-06-14 13:15:12
1004	2021-06-14 20:12:12
1005	2021-06-14 15:18:12
1005	2021-06-14 20:12:12
1006	2021-06-14 21:12:12
1006	2021-06-14 23:15:12
1007	2021-06-14 22:12:12
1007	2021-06-14 23:10:12
Time taken: 20.502 seconds, Fetched: 16 row(s)

(2) Label the data. We can attach the flag directly in the UNION above. (If the online/offline events were already stored as separate rows, a case when could assign the flag instead; a sketch of that variant follows the query output below.)

select id,stt dt , 1 flag from play
union
select id,edt dt ,-1 flag from play
--------------------------------------------------------------------------------
OK
1001	2021-06-14 12:12:12	1
1001	2021-06-14 18:12:12	-1
1001	2021-06-14 20:12:12	1
1001	2021-06-14 23:12:12	-1
1002	2021-06-14 15:12:12	1
1002	2021-06-14 16:12:12	-1
1003	2021-06-14 13:12:12	1
1003	2021-06-14 16:12:12	-1
1004	2021-06-14 13:15:12	1
1004	2021-06-14 20:12:12	-1
1005	2021-06-14 15:18:12	1
1005	2021-06-14 20:12:12	-1
1006	2021-06-14 21:12:12	1
1006	2021-06-14 23:15:12	-1
1007	2021-06-14 22:12:12	1
1007	2021-06-14 23:10:12	-1
Time taken: 7.408 seconds, Fetched: 16 row(s)
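
As an aside, here is a minimal sketch of the case when variant mentioned in step (2). It assumes a hypothetical event-style table play_event(id, dt, event_type) in which each row is already a single online or offline event; the table and column names are illustrative and not part of this data set:

-- Hypothetical table play_event(id string, dt string, event_type string),
-- where event_type is 'start' or 'end'. case when assigns the +1/-1 flag.
select id
      ,dt
      ,case when event_type = 'start' then 1 else -1 end as flag
from play_event;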

(3) Sort the events by time and accumulate them (sorting by time is the key). In SQL, the running total sum(flag) over(order by dt) plays the role of the accumulator.

select id
     ,dt
     ,sum(flag) over(order by dt) as cur_cnt
from(
     select id,stt dt , 1 flag from play
     union
     select id,edt dt ,-1 flag from play
    ) t
--------------------------------------------------------------------------------
OK
1001	2021-06-14 12:12:12	1
1003	2021-06-14 13:12:12	2
1004	2021-06-14 13:15:12	3
1002	2021-06-14 15:12:12	4
1005	2021-06-14 15:18:12	5
1002	2021-06-14 16:12:12	3
1003	2021-06-14 16:12:12	3
1001	2021-06-14 18:12:12	2
1001	2021-06-14 20:12:12	1
1004	2021-06-14 20:12:12	1
1005	2021-06-14 20:12:12	1
1006	2021-06-14 21:12:12	2
1007	2021-06-14 22:12:12	3
1007	2021-06-14 23:10:12	2
1001	2021-06-14 23:12:12	1
1006	2021-06-14 23:15:12	0
Time taken: 8.133 seconds, Fetched: 16 row(s)

(4) Take the maximum value of the running total, which is the maximum number of anchors broadcasting at the same time.

select max(cur_cnt)
from(
	select id
		 ,dt
		 ,sum(flag) over(order by dt) as cur_cnt
	from(
		select id,stt dt , 1 flag from play
		union
		select id,edt dt ,-1 flag from play
	) t
) m
--------------------------------------------------------------------------------
OK
5
Time taken: 13.087 seconds, Fetched: 1 row(s)

Problem 2 Analysis:

The second question asks for the time period in which the peak occurs, i.e. the start time and end time of the peak (from which the duration can also be derived; a sketch for that follows the final result below).

Based on the results of the first question:

select *,max(cur_cnt) over()
from(
     select id
           ,dt
           ,sum(flag) over(order by dt) as cur_cnt
     from(
          select id,stt dt , 1 flag from play
          union
          select id,edt dt ,-1 flag from play
     ) t
) m

From the running counts computed above, the peak lasts from the record at which the count reaches its maximum until the next record, where the count starts to change again. In other words, the peak period is the interval between the time of the peak record and the time of the record that follows it, and the lead() function gives us that next time directly. The SQL is as follows:

select max_cur_cnt
      ,dt as start_time
      ,lead_dt as end_time
from(
     select *
           ,lead(dt,1,dt) over(order by dt) lead_dt
     from(
          select *
                ,max(cur_cnt) over() as max_cur_cnt
          from(
               select id
                     ,dt
                     ,flag
                     ,sum(flag) over(order by dt) as cur_cnt
               from(
                    select id,stt dt , 1 flag from play
                    union
                    select id,edt dt ,-1 flag from play
               ) t
          ) m
     ) n
) p
where cur_cnt=max_cur_cnt

The calculation results are as follows:

--------------------------------------------------------------------------------
OK
5	2021-06-14 15:18:12	2021-06-14 16:12:12
Time taken: 17.513 seconds, Fetched: 1 row(s)
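
If the duration of the peak is also wanted, a minimal sketch (not part of the original solution) is to add a peak_minutes column on top of the same query, converting the two timestamps with unix_timestamp(), which parses Hive's default 'yyyy-MM-dd HH:mm:ss' string format into epoch seconds:

-- Same peak-period query as above, with the peak duration in minutes added (a sketch).
select max_cur_cnt
      ,dt as start_time
      ,lead_dt as end_time
      ,(unix_timestamp(lead_dt) - unix_timestamp(dt)) / 60 as peak_minutes
from(
     select *
           ,lead(dt,1,dt) over(order by dt) lead_dt
     from(
          select *
                ,max(cur_cnt) over() as max_cur_cnt
          from(
               select id
                     ,dt
                     ,flag
                     ,sum(flag) over(order by dt) as cur_cnt
               from(
                    select id,stt dt , 1 flag from play
                    union
                    select id,edt dt ,-1 flag from play
               ) t
          ) m
     ) n
) p
where cur_cnt=max_cur_cnt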

3 Summary

This article analyzed how to count the number of people online at the same time in SQL using the accumulator idea. The problem is reduced to time-series analysis (the common techniques: label the events, form a sequence, then analyze the sequence), and sum() over() accumulates the labels to obtain the number of people online at each moment. The way of thinking here is worth learning, and it can be compared with the earlier article on counting the number and duration of power failures ( https://blog.csdn.net/godlovedaniel/article/details/118027222 ) to form a unified approach to this kind of problem. The key points are converting the data into a time series and applying the accumulator idea; I hope readers can master them.
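
As a takeaway, the pattern generalizes to any table of time intervals. The sketch below uses a hypothetical table some_intervals(start_ts, end_ts) purely for illustration; note that without an id column to keep the rows distinct, union all should be used instead of union so that identical timestamps are not collapsed:

-- Generic skeleton (a sketch over a hypothetical table some_intervals(start_ts string, end_ts string)):
-- label the endpoints, order them by time, accumulate, and take the maximum.
select max(cur_cnt) as max_concurrent
from(
     select sum(flag) over(order by ts) as cur_cnt
     from(
          select start_ts as ts,  1 as flag from some_intervals
          union all   -- union all keeps duplicate timestamps that union would merge
          select end_ts   as ts, -1 as flag from some_intervals
     ) e
) s;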

Topics: hive SQL Algorithm Data Analysis