Common indicators of Apache druid calculation

Posted by naveenbj on Thu, 28 Nov 2019 17:12:13 +0100

Druid supports SQL

[ EXPLAIN PLAN FOR ]
[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]
SELECT [ ALL | DISTINCT ] { * | exprs }
FROM table
[ WHERE expr ]
[ GROUP BY exprs ]
[ HAVING expr ]
[ ORDER BY expr [ ASC | DESC ], expr [ ASC | DESC ], ... ]
[ LIMIT limit ]
[ UNION ALL <another query> ]

Duration data type

The duration data type specifies the time interval.

The time interval is specified in the format "PnYnMnDTnHnMnS", where:

P for period (required)
nY is the number of years
nM is the number of months
nD for days
T is the start of the time period (required if you want to specify hours, minutes or seconds)
nH for hours
nM for minutes
nS for seconds
Current "timestamp: the current timestamp in the connection time zone.
Current date: the current date in the connection time zone.
INTERVAL:

**Event selection: * * you can select any event to view the corresponding data

Number of times (pv): the number of times the user does this, > [count()]
Number of users (uv): the number of users doing this, > [?]
Per capita times (pv/uv): the average number of times a person does this, or the combination of events can be used to check (e.g. event total video play success)
Per capita times of all active users: total number of active users, how many times per person on average, can also be checked with combined events
- We can mainly check the number of advertisements per capita
[how to count active users]
Press... Sum: sum the number of a parameter. The statistical time length is often used
- You can check the time length by summing playTime. Pay attention to remove the outliers and check the activity, time length and retention of a game
- play_game: play time, sum by play time (MS)
  - Playtime > 2000 (only when the time length is more than 2s)
  - Playtime < 2160000 (less than abnormal value)
Press... Average: not commonly used
Press... Calculate per capita value: not commonly used
Press... Quantile:
Press... To repeat
- To de duplicate the event parameters, you can use this to check the number of games pulled by our platform every day, and the total number of games pulled by our platform -- [press... To de duplicate count(distinct)]
- Play \
Press... De duplication with users
- The event parameters and users who have done the event are de duplicated. If there are 100 people playing game A, 200 people playing game B and 300 people playing game C in the platform, the number found in this way is 100 + 200 + 300. With this, the number of games per capita in A certain version / experiment can be checked
- Play \
Press... View distribution
- You can see the frequency of an event and check the frequency distribution of an event
- Event? Scratcher? Show? Reward? Dialog: interactive game general reward pop-up

**New indicators and combined indicators: * * several indicators can be added, subtracted, multiplied and divided

Data query:

Time interval: (month, week, day, hour, 5 minutes)
Data query scheme (sql + json),
- Query event: (play_game: directly pull up the game I played)

index

Times (pv)

Total number of play game events

-- SQL 
SELECT count(*) as "PV" 
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY

// json
{ 
   "queryType":"groupBy", 
   "dataSource":"currency_record", 
   "granularity":"hour", 
   "aggregations":[ 
     { 
         "type":"count", 
         "name":"count" 
     } 
   ], 
   "intervals":[ 
      "2000-01-01T00:00Z/3000-01-01T00:00Z" 
   ] 
}

Number of people (uv)

-- SQL
SELECT count(DISTINCT "android_id") as "UV"
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY

// json
{ 
   "queryType":"groupBy", 
   "dataSource":"currency_record", 
   "granularity":"day", 
   "aggregations":[ 
     { 
         "type":"count", 
         "name":"count" 
     } 
   ], 
   "intervals":[ 
      "2000-01-01T00:00Z/3000-01-01T00:00Z" 
   ] 
}

Times per capita (pv/uv)

-- sql
SELECT SUM("recordtype") as "sum_recordtype", COUNT(DISTINCT "imei" ) as "imei_count", TRUNCATE(SUM("recordtype")/(COUNT(DISTINCT "imei" )*1.0), 2) as "avg_recordtype"
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY

{ 
// Query daily UV. 51s. 30s. 25s average. 353s
// --The query type corresponds to the aggregation type with 3 type values,
// --Timeseries topN groupbyif you want to group and aggregate a single dimension and indicator, it is recommended to use topN query,
// --Higher performance can be achieved, grouping is suitable for multi-dimensional, multi index aggregate query
// --timeseries time series query returns the aggregated result set for the specified time period according to the query rules,
// --In query rules, you can set query strength, result sorting method and filter criteria. The filter criteria can use nested filtering and support post aggregation
   "queryType":"timeseries", 
   "dataSource":"currency_record", 
   "granularity":"day",
   "aggregations":[ 
     { 
        "type":"count", 
        "name":"pv",
        "fieldName":"uuid",
     },
     {
      // thetaSketch
         "type":"thetaSketch", 
         "name":"uv",
         "fieldName":"uuid",
     }
   ],
   "postAggregations": [
   {
            "type": "arithmetic",
            "name": "click_rate",
            "fn": "/",
            "fields": [
               {
                    "type": "hyperUniqueCardinality",
                    "name": "click_cnt",
                    "fieldName": "pv"
               },
               {
                // If the type uses fieldAccess, an error will be reported, indicating that it cannot be converted to Number
                    "type": "hyperUniqueCardinality",
                    "name": "pv_cnt",
                    "fieldName": "uv"
               },
           ],
            
   }
   ]
   "intervals":[ 
      "2000-01-01T00:00Z/3000-01-01T00:00Z" 
   ] 
}

Press... Summation

-- sql
SELECT SUM("recordtype")
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY AND "recordtype" <= 40 AND "recordtype" > 25

// json
{ 
   "queryType":"timeseries", 
   "dataSource":"currency_record", 
   "granularity":"day",
   "aggregations":[ 
     { 
        "type":"count", 
        "name":"recordtype",
        "fieldName":"recordtype",
     },
   ],
  "filter":
     {
          "type": "bound",
          "dimension": "recordtype",
          "lower": 25,
          "upper": 40
          "alphaNumeric": true
     },
   "intervals":[ 
      "2000-01-01T00:00Z/3000-01-01T00:00Z" 
   ] 
}

Press... Average value

-- sql
SELECT SUM("recordtype") as "sum_recordtype", COUNT(*) as "count", AVG("recordtype") as "avg_recordtype"
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY AND "recordtype" <= 40 AND "recordtype" > 25

// json
{ 
   "queryType":"timeseries", 
   "dataSource":"currency_record", 
   "granularity":"month",
   "aggregations":[ 
     { 
        "type":"longSum", 
        "name":"recordtype",
        "fieldName":"recordtype",
     },
     {
        "type":"count", 
        "name":"pv",
        "fieldName":"uuid",
     }
   ],
   "postAggregations": [
     {
        "type": "arithmetic",
            "name": "avg",
            "fn": "/",
            "fields": [
               {
                    "type": "hyperUniqueCardinality",
                    "name": "click_cnt",
                    "fieldName": "recordtype"
               },
               {
                // If the type uses fieldAccess, an error will be reported, indicating that it cannot be converted to Number
                    "type": "hyperUniqueCardinality",
                    "name": "pv_cnt",
                    "fieldName": "pv"
               },
           ],  
     }
   ]
  "filter":
     {
          "type": "bound",
          "dimension": "recordtype",
          "lower": 25,
          "upper": 40
          "alphaNumeric": true
     },
   "intervals":[ 
      "2000-01-01T00:00Z/3000-01-01T00:00Z" 
   ] 
}

Press... Seek per capita value

-- SQL
SELECT SUM("recordtype") as "sum_recordtype", COUNT(DISTINCT "imei" ) as "imei_count", SUM("recordtype")/COUNT(DISTINCT "imei" ) as "avg_recordtype"
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY AND "recordtype" <= 40 AND "recordtype" > 25

// JSON

{ 
   "queryType":"timeseries", 
   "dataSource":"currency_record", 
   "granularity":"month",
   "aggregations":[ 
     { 
        "type":"longSum", 
        "name":"recordtype",
        "fieldName":"recordtype",
     },
     {
        "type":"thetaSketch", 
        "name":"imeiCount",
        "fieldName":"imei",
     }
   ],
   "postAggregations": [
     {
        "type": "arithmetic",
            "name": "avg",
            "fn": "/",
            "fields": [
               {
                    "type": "hyperUniqueCardinality",
                    "name": "click_cnt",
                    "fieldName": "recordtype"
               },
               {
                // If the type uses fieldAccess, an error will be reported, indicating that it cannot be converted to Number
                    "type": "hyperUniqueCardinality",
                    "name": "pv_cnt",
                    "fieldName": "imeiCount"
               },
           ],  
     }
   ]
  "filter":
     {
          "type": "bound",
          "dimension": "recordtype",
          "lower": 25,
          "upper": 40
          "alphaNumeric": true
     },
   "intervals":[ 
      "2000-01-01T00:00Z/3000-01-01T00:00Z" 
   ] 
}

Press... Quantile

-- SQL
SELECT "Time", sum("=1") AS "=1" ,sum("2~3") AS "2~3" ,sum("4~8") AS "4~8", sum("9~16") AS "9~16", sum("17~25") AS "17~25"
from (
    SELECT 
    "Time",
     case WHEN "Count1" = 1 then 1 ELSE 0  end as "=1",
     case WHEN "Count1" > 1 and "Count1" >= 3 then 1 ELSE 0  end as "2~3",
     case WHEN "Count1" > 3 and "Count1" <= 8 then 1 ELSE 0  end as "4~8",
     case WHEN "Count1" > 8 and "Count1" <= 16 then 1  ELSE 0  end as "9~16",
     case WHEN "Count1" > 16 and "Count1" <= 25 then 1  ELSE 0  end as "17~25",
     case WHEN "Count1" > 25 then 1  ELSE 0  end as ">25"
    from (
          SELECT
            TIME_FLOOR("__time", 'PT5M') AS "Time", "uuid", COUNT(*) AS "Count1"
          FROM "app_time"
          WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR AND "__time" <= CURRENT_TIMESTAMP 
          GROUP BY 1, 2
          ORDER BY "Time" ASC ))
GROUP BY 1
ORDER BY "Time"

// JSON

Type conversion

Ask for crowd classification

SELECT SUM("recordtype") as "sum_recordtype", COUNT(DISTINCT "imei" ) as "imei_count", TRUNCATE(SUM("recordtype")/(COUNT(DISTINCT "imei" )*1.0), 2) as "avg_recordtype"
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY AND "uuid" IN (SELECT DISTINCT "uuid"
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY AND  "uuid" LIKE '%b31%')

Press... To view the distribution

SELECT "Time", SUM("A"), SUM("B"), SUM("C"), SUM("D")
FROM (
    SELECT 
    "Time",
    (case when "Count" <= 2 then 1 ELSE 0 end) as A,
    (case when "Count" > 2 AND "Count" <= 6 then 1  ELSE 0 end) as B,
    (case when "Count" > 6 AND "Count" <= 10 then 1  ELSE 0 end) as C,
    (case when "Count" > 10  then 1  ELSE 0 end) as D
    FROM (
        SELECT
          TIME_FLOOR("__time", 'PT5M') AS "Time", "uuid", COUNT(*) AS "Count"
        FROM "app_time"
        WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR AND "__time" <= CURRENT_TIMESTAMP 
        GROUP BY 1, "uuid"
    ) )
GROUP BY 1
ORDER BY "Time"

SELECT count(*) as "PV" ,  count(DISTINCT "android_id") as "UV", count(*)/count(DISTINCT "android_id") as "PV/UV"
FROM "currency_record"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY

Topics: Programming SQL JSON Druid less

Programmer Think

Common indicators of Apache druid calculation

Druid supports SQL

Duration data type

Data query:

index

Hot Topics