A primer on MySQL histograms.
0. What is a histogram
1. How histograms work
2. What happens when there are both indexes and histograms
3. How to improve the statistical accuracy of histograms
0. What is a histogram
A histogram is a type of basic statistics (besides indexes) maintained by the database to describe the data distribution of a column. Its most typical use is estimating the selectivity of query predicates so the optimizer can choose an appropriate execution plan.
In other words, a column without an index can still have a histogram, which also helps improve query efficiency. MySQL 8.0 introduced histogram support, which is a big step forward.
A histogram records the data distribution statistics of a column. For example, if a column contains 10,000 values, the histogram can divide them into 100 buckets and record, for each bucket, how those values are distributed along with the minimum value, maximum value, proportion, and other information.
Although an index can also be used to optimize SQL efficiency, its maintenance cost is higher: the index has to be kept up to date on every write, whereas a histogram can be refreshed manually, on demand.
Index statistics can also be unreliable, for example when data is skewed or the statistics are stale.
In addition, when necessary you can create a histogram on every column that needs one, whereas creating that many single-column indexes at the same time would be far too expensive (the maintenance statements are sketched below).
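For reference, building or refreshing histograms is a single on-demand statement per table, and a histogram can be dropped just as easily. This is only a syntax sketch: t1 and the column names c1, c2, c3 are placeholders, and the bucket count is up to you (1 to 1024, default 100).

-- Build/refresh histograms on several columns at once (placeholder names)
ANALYZE TABLE t1 UPDATE HISTOGRAM ON c1, c2, c3 WITH 64 BUCKETS;
-- Drop a histogram that is no longer needed
ANALYZE TABLE t1 DROP HISTOGRAM ON c2;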
For example, look at the following execution plan:
[root@yejr.run]> explain select * from t1 where seq = 1234;
+------+---------------+------+---------+------+--------+----------+-------------+
| type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+------+---------------+------+---------+------+--------+----------+-------------+
| ALL  | NULL          | NULL | NULL    | NULL | 299876 |    10.00 | Using where |
+------+---------------+------+---------+------+--------+----------+-------------+

[root@yejr.run]> select * from t1 where seq = 1234;
Empty set (0.097 sec)
Before creating the histogram there is no index on the seq column, so the query does a full table scan. Note that the filtered column is 10%.
After creating the histogram, look at the execution plan:
[root@yejr.run]> explain select * from t1 where seq = 1234;
+------+---------------+------+---------+------+--------+----------+-------------+
| type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+------+---------------+------+---------+------+--------+----------+-------------+
| ALL  | NULL          | NULL | NULL    | NULL | 299876 |     0.00 | Using where |
+------+---------------+------+---------+------+--------+----------+-------------+

[root@yejr.run]> select * from t1 where seq = 1234;
Empty set (0.079 sec)
Note that the filtered value becomes 0%, and the actual elapsed time drops to about 81% of the original (0.079s vs. 0.097s). The absolute saving is not large, but it is still nearly 20% faster than the original full table scan.
So histograms are definitely worthwhile. Of course, a histogram cannot replace an index; it is only useful in certain specific scenarios.
1. How histograms work
MySQL supports two histogram types: singleton (equal-width) and equi-height. A singleton histogram stores one value per bucket together with that value's cumulative frequency. An equi-height histogram stores, for each bucket, the lower and upper bounds, the cumulative frequency, and the number of distinct values. MySQL picks the histogram type automatically; it does not need to be, and cannot be, specified. Roughly speaking, when the column has many distinct values an equi-height histogram is used, and when it has relatively few a singleton histogram is used.
Histogram statistics are stored in the physical table column_statistics in the mysql tablespace, which cannot be read or written directly; the results can instead be viewed through the information_schema.COLUMN_STATISTICS view.
[root@yejr.run]> show create view column_statistics\G
*************************** 1. row ***************************
                View: COLUMN_STATISTICS
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`mysql.infoschema`@`localhost`
                      SQL SECURITY DEFINER VIEW `COLUMN_STATISTICS` AS
                      select `mysql`.`column_statistics`.`schema_name` AS ...
character_set_client: utf8
collation_connection: utf8_general_ci
Each record corresponds to one histogram, stored in JSON format.
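The stored JSON can be queried directly with MySQL's JSON operators, for example to see which histogram type the server picked for each column. A minimal query (it assumes at least one histogram already exists):

-- Which type and how many buckets MySQL chose for each histogram
SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME,
       HISTOGRAM->>'$."histogram-type"'              AS histogram_type,
       HISTOGRAM->>'$."number-of-buckets-specified"' AS buckets
FROM information_schema.COLUMN_STATISTICS;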
In addition, the parameter histogram_generation_max_mem_size controls how much memory is available when creating histograms. This parameter matters a lot and is discussed further below.
As of MySQL 8.0.19, histograms support a wide range of data types and scenarios, even including virtual columns. However, the following are not supported:
- Encrypted tables and temporary tables.
- JSON and spatial data types.
- Columns covered by a single-column unique index.
Let's take an example of creation failure:
[root@yejr.run]> analyze table t2 update histogram on nu;
+---------+-----------+----------+-----------------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                                  |
+---------+-----------+----------+-----------------------------------------------------------+
| yejr.t2 | histogram | Error    | The column 'nu' is covered by a single-part unique index. |
+---------+-----------+----------+-----------------------------------------------------------+
MySQL simply refuses to go along with this rather senseless request.
Of course, if the column only has a non-unique secondary index, a histogram can be created on it without any conflict.
Now let's create a histogram properly:
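The statement behind the output below is presumably along these lines; the 100-bucket count matches the "number-of-buckets-specified" value in the JSON shown further down:

ANALYZE TABLE t1 UPDATE HISTOGRAM ON seq WITH 100 BUCKETS;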
+---------+-----------+----------+------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                       |
+---------+-----------+----------+------------------------------------------------+
| yejr.t1 | histogram | status   | Histogram statistics created for column 'seq'. |
+---------+-----------+----------+------------------------------------------------+
Now look again at the statistics stored in COLUMN_STATISTICS:
[root@yejr.run]> select SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)
                 from COLUMN_STATISTICS\G
*************************** 1. row ***************************
           SCHEMA_NAME: yejr   -- schema
            TABLE_NAME: t1     -- table
           COLUMN_NAME: seq    -- column
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [                           -- statistics for the first bucket
      1,                        -- minimum value
      9710,                     -- maximum value
      0.009996666666666666,     -- cumulative proportion, about 0.99%
      2571                      -- number of values accumulated in the first bucket
    ],
    ...
    [                           -- statistics for the 100th bucket
      989875,
      999994,
      0.9996666666666667,       -- not 1.0 (100%) because the column contains some NULL values
      2580
    ]
  ],
  "data-type": "int",                           -- the column's data type is INT
  "null-values": 0.0003333333333333333,         -- proportion of NULL values in the column
  "collation-id": 8,
  "last-updated": "2020-04-21 07:21:53.084054", -- when the histogram was last updated
  "sampling-rate": 1.0,                         -- sampling rate, here 100%
  "histogram-type": "equi-height",              -- equi-height histogram
  "number-of-buckets-specified": 100            -- 100 buckets in total
}
The equi-height histogram above has 100 buckets. Each bucket holds between 2,571 and 2,620 values, about 259,550 in total, accounting for 99.9667% of the rows; the remaining roughly 0.033% are NULL records.
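Individual buckets or fields can also be pulled straight out of the stored JSON with JSON_EXTRACT; a small example, reusing the schema, table, and column names from the output above:

-- First bucket and NULL fraction of the seq histogram shown above
SELECT JSON_EXTRACT(HISTOGRAM, '$.buckets[0]')    AS first_bucket,
       JSON_EXTRACT(HISTOGRAM, '$."null-values"') AS null_fraction
FROM information_schema.COLUMN_STATISTICS
WHERE SCHEMA_NAME = 'yejr' AND TABLE_NAME = 't1' AND COLUMN_NAME = 'seq';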
And here is a singleton (equal-width) histogram for comparison:
"buckets": [ [ 5, --Store each value 0.07692307692307693 ---Frequency of this value ], [ 25, 0.15384615384615385 ], ...... ], "data-type": "int", "null-values": 0.0, "collation-id": 8, "last-updated": "2019-07-02 06:44:15.134130", "sampling-rate": 1.0, --Data sampling scale "histogram-type": "singleton", --Equal width "number-of-buckets-specified": 100 }
In a singleton histogram, each bucket records the distribution information of a single value.
2. What happens when there are both indexes and histograms
One day, someone in our chat group asked: if a column has both a histogram and an index, which one will the optimizer choose?
Before reading on, take a few seconds to think about it.
In fact, in that case the optimizer chooses the index rather than the histogram. The index's statistics are relatively "fresher" and "more accurate", whereas a histogram has to be updated manually, so its timeliness cannot be guaranteed. To be clear, I did not check the source code for this; it is a guess that I confirmed through experiments.
Suppose the seq column has the histogram created above and also has an index. After turning on optimizer_trace you can see the difference between the two execution plans (only part of the trace is shown).
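For anyone who wants to reproduce this, the trace can be captured roughly as follows. This is just a sketch: k1 is simply the index name that appears in the second trace excerpt, and the ALTER is only needed for the "with index" case.

SET SESSION optimizer_trace = 'enabled=on';
SELECT * FROM t1 WHERE seq = 1234;
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE\G
SET SESSION optimizer_trace = 'enabled=off';
-- for the second experiment, add the index first:
-- ALTER TABLE t1 ADD KEY k1 (seq);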
- Without the index, the histogram is used:
"considered_execution_plans": [ { "table": "`t1`", "chosen": true, "plan_prefix": [], "cost_for_plan": 35080, "rows_for_plan": 1, "best_access_path": { "considered_access_paths": [ { "cost": 35080, --cost It looks so tall "chosen": true, --But he was chosen "access_type": "scan", --Full table scan "rows_to_scan": 347352, --Estimated number of lines to scan "resulting_rows": 1, --Expected number of rows returned "filtering_effect": [ { "condition": "(`t1`.`seq` = 1234)", "histogram_selectivity": 0.000001 --Histogram filtering effect } ], "final_filtering_effect": 0.0000029 } ] }, "condition_filtering_pct": 100 }
Although a full table scan still looks unavoidable, the result comes back quickly thanks to the histogram.
- With the index present, the index is preferred:
"considered_execution_plans": [ { "table": "`t1`", "chosen": true, "plan_prefix": [], "cost_for_plan": 0.35, "rows_for_plan": 1, "best_access_path": { "considered_access_paths": [ { "cost": 0.35, --Equivalent to the histogram above cost,This is very low, so there is no suspense about choosing the index "rows": 1, "index": "k1", "chosen": true, "access_type": "ref" }, { "cause": "heuristic_index_cheaper", "chosen": false, "access_type": "range", "range_details": { "used_index": "k1" } } ] }, "condition_filtering_pct": 100 }
If any source-code experts are reading this, please confirm whether that is really how it works.
3. How to improve the statistical accuracy of histograms
We mentioned the parameter histogram_generation_max_mem_size earlier; it controls how much memory can be used when creating or updating a histogram.
The maximum amount of memory available for generating histogram statistics.
The default value of this parameter is 20000000 (a little under 20MB) and the minimum is 1000000 (about 976KB). The memory is allocated per session, every time a histogram is created or updated, and released once the statement finishes.
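A quick way to confirm the value in effect (trivial, but the session scope is the one that matters, since the allocation happens per session):

-- Check the session-level value; 20000000 is the default mentioned above
SELECT @@session.histogram_generation_max_mem_size;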
With that background in place, let's focus on how to improve the histogram's statistical accuracy.
When InnoDB data pages are scanned to gather histogram statistics, the steps are roughly as follows:
- Estimate the length of the column's data type, recorded as row_size_bytes.
- Divide the available memory by that per-record length to get the number of rows that can be sampled: rows_in_memory = histogram_generation_max_mem_size / row_size_bytes.
- Compute the sampling ratio: sample_percentage = rows_in_memory / rows_in_table, where rows_in_table is the estimated total number of rows in the table.
- Scan data pages according to the sampling ratio; for example, with a 10% ratio, scan one page, skip the next nine, and continue like that. These steps are based on my own rough reading of the source code; if anything is off, please leave a comment and correct me. The corresponding code is in sql/histograms/histogram.cc, around line 868, near the update_histogram function.
MySQL currently estimates this data length in a very coarse-grained way; it only distinguishes the cases below, so the memory actually charged per value for a histogram column can be larger than the declared type length. You can see the definition in the source code:
vim sql/histograms/histogram.cc +113

/**
  Convert from enum_field_types to Value_map_type.

  @param field_type   the field type
  @param is_unsigned  whether the field type is unsigned or not. This is only
                      considered if the field type is LONGLONG

  @return A Value_map_type. May be INVALID if the Value_map does not support
          the field type.
*/
static Value_map_type field_type_to_value_map_type(
    const enum_field_types field_type, const bool is_unsigned) {
  switch (field_type) {
    case MYSQL_TYPE_DECIMAL:
    case MYSQL_TYPE_NEWDECIMAL:
      return Value_map_type::DECIMAL;
    case MYSQL_TYPE_TINY:
    case MYSQL_TYPE_SHORT:
    case MYSQL_TYPE_LONG:
    case MYSQL_TYPE_INT24:
    case MYSQL_TYPE_YEAR:
    case MYSQL_TYPE_BIT:
      return Value_map_type::INT;
    case MYSQL_TYPE_ENUM:
      return Value_map_type::ENUM;
    case MYSQL_TYPE_SET:
      return Value_map_type::SET;
    case MYSQL_TYPE_LONGLONG:
      return is_unsigned ? Value_map_type::UINT : Value_map_type::INT;
    case MYSQL_TYPE_FLOAT:
    case MYSQL_TYPE_DOUBLE:
      return Value_map_type::DOUBLE;
    case MYSQL_TYPE_TIME:
    case MYSQL_TYPE_TIME2:
      return Value_map_type::TIME;
    case MYSQL_TYPE_DATE:
    case MYSQL_TYPE_NEWDATE:
      return Value_map_type::DATE;
    case MYSQL_TYPE_DATETIME:
    case MYSQL_TYPE_TIMESTAMP:
    case MYSQL_TYPE_TIMESTAMP2:
    case MYSQL_TYPE_DATETIME2:
      return Value_map_type::DATETIME;
    case MYSQL_TYPE_TINY_BLOB:
    case MYSQL_TYPE_MEDIUM_BLOB:
    case MYSQL_TYPE_LONG_BLOB:
    case MYSQL_TYPE_BLOB:
    case MYSQL_TYPE_VAR_STRING:
    case MYSQL_TYPE_STRING:
    case MYSQL_TYPE_VARCHAR:
      return Value_map_type::STRING;
    case MYSQL_TYPE_JSON:
    case MYSQL_TYPE_GEOMETRY:
    case MYSQL_TYPE_NULL:
    default:
      return Value_map_type::INVALID;
  }
}
As you can see, TINYINT (1 byte), SMALLINT (2 bytes) and so on are all mapped to Value_map_type::INT, which is stored as a longlong and ends up costing 16 bytes per value.
In addition, each record carries an extra overhead of roughly 42 bytes (larger than most of the data types themselves, which is a bit embarrassing).
vim sql/histograms/value_map.h +262

/// @return the overhead in bytes for each distinct value stored in the
///         Value_map. The value 32 is obtained from both GCC 8.2 and
///         Clang 8.0 (same as sizeof(value_map_type::node_type) in C++17).
size_t element_overhead() const override {
  // TODO: Replace this with sizeof(value_map_type::node_type) when we have
  // full C++17 support.
  return sizeof(typename value_map_type::value_type) +
         sizeof(typename value_map_type::key_type) + 32;
}
Several other data types are handled just as coarsely; hopefully this will be refined in future versions.
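Putting these numbers together gives a rough sanity check against the sampling rate observed in the example below. This is only back-of-the-envelope arithmetic based on the figures quoted above (16 bytes per INT value plus roughly 42 bytes of overhead) and the optimizer's earlier row estimate of about 300,000 rows; it is not meant to reproduce the server's exact internal calculation:

-- rows_in_memory = budget / row_size_bytes; sample_percentage = rows_in_memory / rows_in_table
SELECT 1000000 / (16 + 42)          AS rows_in_memory,     -- about 17,241 rows fit in the 1MB minimum
       1000000 / (16 + 42) / 300000 AS sample_percentage;  -- about 0.057, close to the ~0.0595 measured below

The result landing in the same ballpark as the measured sampling rate suggests this is roughly how the sampling ratio comes about.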
If histogram_generation_max_mem_size is not large enough, the sampling ratio will be low, which hurts accuracy. So how large should it reasonably be set? You can follow this method:
- First set the parameter to its minimum value.
- Run one histogram collection (one sampling pass).
- Check the resulting sampling rate.
- Work backwards to the memory needed for a full sample. Of course, if the table is particularly large there is no need to sample everything; after all, that costs more memory and more physical reads. Here is an example.
# Set histogram_generation_max_mem_size to its minimum value
[root@yejr.run]> set session histogram_generation_max_mem_size = 1000000;

# Enable the InnoDB 'sampled%' metric counters
[root@yejr.run]> SET GLOBAL innodb_monitor_enable = 'sampled%';

# Create the histogram once
[root@yejr.run]> analyze table t1 update histogram on seq;
1 row in set (0.16 sec)    -- took 0.16 seconds

# Check the sampling rate
[root@yejr.run]> SELECT HISTOGRAM->>'$."sampling-rate"' FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+---------------------------------+
| HISTOGRAM->>'$."sampling-rate"' |
+---------------------------------+
| 0.059548422871929935            |
+---------------------------------+

# Check the InnoDB data-page scan statistics
[root@yejr.run]> SELECT NAME, COUNT FROM information_schema.INNODB_METRICS WHERE NAME LIKE 'sampled%';
+-----------------------+-------+
| NAME                  | COUNT |
+-----------------------+-------+
| sampled_pages_read    |    51 |   -- 51 pages were scanned
| sampled_pages_skipped |   760 |   -- 760 pages were skipped
+-----------------------+-------+
Meanwhile, in another session, I queried the memory usage of the thread creating the histogram, before and after the operation.
# First query
[root@yejr.run]> select * from sys.x$memory_by_thread_by_current_bytes where thread_id = 286;
+-----------+-------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+
| thread_id | user                          | current_count_used | current_allocated | current_avg_alloc | current_max_alloc | total_allocated |
+-----------+-------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+
|       286 | thread_pool/tp_one_connection |                163 |            345036 |         2116.7853 |            260280 |       212710814 |
+-----------+-------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+
1 row in set (0.07 sec)

# Second query
[root@yejr.run]> select * from sys.x$memory_by_thread_by_current_bytes where thread_id = 286;
+-----------+-------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+
| thread_id | user                          | current_count_used | current_allocated | current_avg_alloc | current_max_alloc | total_allocated |
+-----------+-------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+
|       286 | thread_pool/tp_one_connection |                165 |            346502 |         2100.0121 |            261712 |       213946226 |
+-----------+-------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+
The difference in total_allocated between the two queries is 1,235,412 bytes (about 1.17MB), which is mostly the memory needed to create the histogram.
From this we can estimate the memory required for a full (100%) sample:
[root@yejr.run]> select 1235412 / 0.059548422871929935;
+--------------------------------+
| 1235412 / 0.059548422871929935 |
+--------------------------------+
|                  20746342.8991 |
+--------------------------------+
Now increase the memory setting and run it again:
# Raise histogram_generation_max_mem_size above the ~20.75MB estimated above
[root@yejr.run]> set session histogram_generation_max_mem_size = 21000000;

[root@yejr.run]> SET GLOBAL innodb_monitor_enable = 'sampled%';

[root@yejr.run]> analyze table t1 update histogram on seq;
1 row in set (3.78 sec)    -- 3.78 seconds, about 23.6 times the 0.16 seconds above

[root@yejr.run]> SELECT HISTOGRAM->>'$."sampling-rate"' FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+---------------------------------+
| HISTOGRAM->>'$."sampling-rate"' |
+---------------------------------+
| 1.0                             |
+---------------------------------+

[root@yejr.run]> SELECT NAME, COUNT FROM information_schema.INNODB_METRICS WHERE NAME LIKE 'sampled%';
+-----------------------+-------+
| NAME                  | COUNT |
+-----------------------+-------+
| sampled_pages_read    |   811 |
| sampled_pages_skipped |     0 |   -- indeed, no pages were skipped this time
+-----------------------+-------+
Again, not every page has to be sampled; otherwise the cost may become unbearable...
End of the full text.