Analysis and implementation of window TopN

Posted by pbaker on Sat, 18 Dec 2021 23:24:22 +0100

Note: this series of blog posts is compiled from SGG's videos and is well suited for study. Some articles were collected via crawlers and other technical means for the purpose of learning and sharing; if there is any copyright issue, please leave a message and it will be removed at any time.


TopN requirements are common in both offline and real-time computation: popular products in e-commerce, the top N ads by clicks in advertising, the top N search terms in search. TopN also comes in two flavors, global topN and grouped topN. For example, popular products can be ranked directly by each product's total sales, or first grouped by region and then ranked by each product's total sales within that region. Taking popular products as the example, this article computes, in real time, the top 10 products by sales in each region every 10 minutes.

This requirement can be broken down into the following steps:

  • Extract the order time from the data as the event time
  • Count the sales within each 10-minute window, keyed by region + product
  • For each region, compute the top 10 products by sales

Time extraction

The data source is Kafka. Each record is an order containing the order ID, order time, product ID, region ID and order amount (fields such as the user ID are omitted here):

case class Order(orderId: String, orderTime: Long, gdsId: String, amount: Double, areaId: String)
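
For reference, a hypothetical map from the raw comma-separated Kafka records (the format used in the test input at the end of this article) into Order; the stream name kafkaStream and the parsing itself are assumptions, not part of the original code:

// Sketch: "orderId02,1573483405000,gdsId01,500,beijing" -> Order(...)
val ds: DataStream[Order] = kafkaStream.map { line =>
  val f = line.split(",")
  Order(f(0), f(1).toLong, f(2), f(3).toDouble, f(4))
}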

We want to count sales within each 10-minute interval based on the actual order time, so event time is used. To cope with possible out-of-order data, the maximum allowed delay is set to 30 s.

val orderStream = ds.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[Order](Time.seconds(30)) {
    // the order time carried in the record is used as the event time
    override def extractTimestamp(element: Order): Long = element.orderTime
  })
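
As an aside, on newer Flink versions (1.11+), where BoundedOutOfOrdernessTimestampExtractor is deprecated, an equivalent assignment can be sketched with WatermarkStrategy (this is an aside about the newer API, not part of the original code):

import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}

val orderStream = ds.assignTimestampsAndWatermarks(
  WatermarkStrategy
    .forBoundedOutOfOrderness[Order](Duration.ofSeconds(30)) // 30 s of allowed out-of-orderness
    .withTimestampAssigner(new SerializableTimestampAssigner[Order] {
      override def extractTimestamp(element: Order, recordTimestamp: Long): Long = element.orderTime
    }))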

Sales statistics

Count the sales every 10 minutes, e.g. [9:00, 9:10), [9:10, 9:20), and so on, which corresponds to a tumbling event-time window in Flink.

val amountStream = orderStream
  .keyBy(x => x.areaId + "_" + x.gdsId)   // key by region + product
  .timeWindow(Time.minutes(10))
  .reduce(new ReduceFunction[Order] {
    override def reduce(value1: Order, value2: Order): Order =
      // keep one order's fields and sum the two amounts
      Order(value1.orderId, value1.orderTime, value1.gdsId, value1.amount + value2.amount, value1.areaId)
  })

First, the keyBy operation groups the stream by region areaId and product gdsId, so that records with the same key flow into the window of the same task for computation. Window functions include WindowFunction, ReduceFunction and AggregateFunction. Since all we need is an aggregation, there is no need to buffer the individual elements of the window, so a ReduceFunction is used to aggregate incrementally as data arrives, reducing memory usage. In the ReduceFunction, the amounts of two orders are simply added to produce a new Order.
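
If a different output type or more flexibility were needed, the same incremental aggregation could also be written as an AggregateFunction; a minimal sketch, not part of the original code:

import org.apache.flink.api.common.functions.AggregateFunction

// Sketch: incrementally sum the amount, using an Order as the accumulator.
class SumAmount extends AggregateFunction[Order, Order, Order] {
  override def createAccumulator(): Order = Order("", 0L, "", 0.0, "")
  override def add(value: Order, acc: Order): Order =
    Order(value.orderId, value.orderTime, value.gdsId, acc.amount + value.amount, value.areaId)
  override def getResult(acc: Order): Order = acc
  override def merge(a: Order, b: Order): Order = a.copy(amount = a.amount + b.amount)
}

// usage: orderStream.keyBy(...).timeWindow(Time.minutes(10)).aggregate(new SumAmount)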

Top 10 products by sales per region

So far we have amountStream, the sales of each product in each region within each 10-minute window. Next we need to group by region and compute the top 10 products by sales, which raises two questions:

  • How to obtain all the data of a 10-minute window
  • How to sort it

Let's first look at how to obtain the data of the 10-minute window, that is, the output of each window of amountStream. This is actually described on the official Flink website as well: simply chain another window of the same size behind it, and the downstream window will receive all the data of the upstream window. The code is as follows (the apply function is filled in further below):

amountStream.keyBy(_.areaId)
      .timeWindow(Time.minutes(10))
      .apply(...)

At first the author was puzzled by this: why does chaining an identical window yield the output of the previous window? It only becomes clear after reading the source code. The triggering of an event-time window depends on the watermark:

//In AbstractStreamOperator
public void processWatermark(Watermark mark) throws Exception {
        if (timeServiceManager != null) {
            timeServiceManager.advanceWatermark(mark);
        }
        output.emitWatermark(mark);
    }

As analyzed in the earlier time-system series, advanceWatermark triggers every window whose end time has been reached and emits its results, and only then is the watermark itself forwarded downstream. There is a very important ordering here: the watermark is emitted after the window's data. But how does the next window decide that the outputs of one upstream window belong in the same downstream window? By time, of course. So what timestamp does a window assign to the data it emits?

//WindowOperator
private void emitWindowContents(W window, ACC contents) throws Exception {
        timestampedCollector.setAbsoluteTimestamp(window.maxTimestamp());
        processContext.window = window;
        userFunction.process(triggerContext.key, window, processContext, contents, timestampedCollector);
    }

As you can see, the TimestampedCollector's timestamp is set to the window's maxTimestamp (the window end time minus 1 ms), i.e. every record a window emits carries the window's end as its event time. All outputs of the same upstream window therefore share the same timestamp and are assigned to the same downstream window; for example, every record emitted for the [9:00, 9:10) window carries 9:09:59.999 and lands in the downstream [9:00, 9:10) window. And since the upstream window emits the watermark right after its data, that watermark is exactly what triggers the corresponding downstream window.

Up to now, we can obtain the sales of all products in each region. The next step is the sorting. The obvious candidates are the sorted data structure TreeSet or a priority queue: a TreeSet is implemented as a red-black tree, a priority queue as a min/max heap, and either can meet the requirement. Which one to choose? Insertion and deletion in a red-black tree cost O(log N); a heap can be built over N elements in O(N) and its top read in O(1), but its strength lies in building over a fixed data set, whereas here data has to be inserted continuously. So, relatively speaking, the red-black tree is the better fit, and it is what we use (the topN in Flink SQL is in fact also based on a red-black tree, a TreeMap).
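
For comparison only, the same "keep the N largest" pruning could be done with a min-heap (java.util.PriorityQueue); this is a self-contained sketch, not the article's implementation, which uses a TreeSet and follows below:

import java.util.{Comparator, PriorityQueue}
import scala.collection.JavaConverters._

// Sketch: keep the n orders with the largest amount using a min-heap
// (n is the topN size, the N of the window function below).
def topNWithHeap(orders: Iterable[Order], n: Int): List[Order] = {
  val heap = new PriorityQueue[Order](n, new Comparator[Order] {
    override def compare(o1: Order, o2: Order): Int = java.lang.Double.compare(o1.amount, o2.amount)
  })
  orders.foreach { x =>
    if (heap.size() < n) heap.add(x)
    else if (x.amount > heap.peek().amount) { heap.poll(); heap.add(x) } // evict the current minimum
  }
  heap.asScala.toList.sortBy(_.amount) // heap iteration order is not sorted
}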

Finally, do we need to keep and sort all the data? Clearly not. If the TreeSet is kept in ascending order, its first element is the minimum. Once the TreeSet already holds N elements, compare the incoming record with that minimum: if the minimum is larger, discard the incoming record; if the incoming record is larger, remove the first element from the TreeSet and insert the new one. The TreeSet then always holds exactly the topN we need. The implementation inside apply looks like this:

import java.util
import java.util.Comparator
import scala.collection.JavaConverters._

new WindowFunction[Order, Order, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Order], out: Collector[Order]): Unit = {
    println("==area===" + key)
    val N = 10 // size of the topN (set to 3 in the test below)
    val topMap = new util.TreeSet[Order](new Comparator[Order] {
      // compare by amount; note that two orders with exactly the same amount collide in a TreeSet
      override def compare(o1: Order, o2: Order): Int = java.lang.Double.compare(o1.amount, o2.amount)
    })
    input.foreach(x => {
      if (topMap.size() >= N) {
        val min = topMap.first() // current minimum
        if (x.amount > min.amount) {
          topMap.pollFirst() // discard the minimum
          topMap.add(x)
        }
      } else {
        topMap.add(x)
      }
    })
    // simply print the result here
    topMap.asScala.foreach(println)
  }
}
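
For completeness, a minimal sketch of a driver that wires these pieces together; the broker address, topic name and consumer group are assumptions, not from the original post:

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object WindowTopN {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // use event time

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // assumed broker address
    props.setProperty("group.id", "window-topn")             // assumed consumer group

    // "order-topic" is an assumed topic name
    val kafkaStream: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer[String]("order-topic", new SimpleStringSchema(), props))

    // parse into Order, assign timestamps/watermarks, build amountStream and the
    // per-area topN window exactly as in the snippets above
    // ...

    env.execute("window topN")
  }
}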

Finally, run the main function. To make the test simple, only the top 3 are kept and the windows are shortened to 1 minute. The Kafka input data:

orderId02,1573483405000,gdsId01,500,beijing
orderId03,1573483408000,gdsId02,200,beijing
orderId03,1573483408000,gdsId03,300,beijing
orderId03,1573483408000,gdsId04,400,beijing
orderId07,1573483600000,gdsId01,600,beijing //trigger

Final results (the last record advances the watermark past the end of the first 1-minute window plus the 30 s delay, so the windows fire):

==area===beijing
Order(orderId03,1573483408000,gdsId03,300.0,beijing)
Order(orderId03,1573483408000,gdsId04,400.0,beijing)
Order(orderId02,1573483405000,gdsId01,500.0,beijing)

Summary

With that, the window topN function is complete. In my view the key points are how to obtain a window's aggregated output, by chaining a window of the same size behind the aggregation, and how to sort it, by keeping only the top N with a minimum-pruning strategy similar to a min-heap (a TreeSet here).
