version: postgresql10.17
0 summary
summary
- leaf page
- If the level 0 page (the bottom page) has a right brother, the first one is the first one of the right brothers (it should be the last one on this page. Note that this one does not belong to this page, but the smallest one on the next page is worth copying). Note: the leftmost page of level 0 page also has this copy
- If there is no data on the bottom page (0), the first page is the bottom page
- root / branch
- The data stored in the root/branch page is the minimum value pointing to the page (normally the second one, and the first one is generally the copy of the second "minimum value of current page" pointing to the right brother of the page). In particular, the data that the root/branch points to the leftmost page does not record information.
- How to find the leftmost and rightmost of the intermediate node (branch/root)?
- The data on the leftmost page is empty
- The first piece of leaf pointed to by the rightmost node is the minimum value of the current page (the first piece of other pages saves the copy of the minimum value of right sibling)
-------- | meta | -------- | root | -------- | left branch | branch | branch | ..... | branch | -------------------------------------------------------- | leaf | leaf | leaf | leaf | leaf | right leaf | -------------------------------------------------------- ! left branch of data Empty ! branch of data Is the minimum value that points to the page ! right leaf The first entry is the minimum value of the current page ! leaf The first one is right sibling Minimum value of copy
1 related system table
What index types are supported?
postgres=# \d pg_am Table "pg_catalog.pg_am" Column | Type | Collation | Nullable | Default -----------+---------+-----------+----------+--------- amname | name | | not null | amhandler | regproc | | not null | amtype | "char" | | not null | postgres=# select * from pg_am; amname | amhandler | amtype --------+-------------+-------- btree | bthandler | i hash | hashhandler | i gist | gisthandler | i gin | ginhandler | i spgist | spghandler | i brin | brinhandler | i
PS. the system table before 9.6 contains a lot of attribute information of method. In the new version, these attributes are saved in C code.
amhandler's correlation function is in PG_ In proc, all are C functions.
select proname, prosrc from pg_proc where oid in (330,331,332,333,334,335); proname | prosrc -------------+------------- bthandler | bthandler hashhandler | hashhandler gisthandler | gisthandler ginhandler | ginhandler spghandler | spghandler brinhandler | brinhandler
example:
postgres=# \d t2 Table "public.t2" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- id | integer | | not null | info | integer | | | Indexes: "pk_t2_a" PRIMARY KEY, btree (id)
System table meaning:
postgres=# select * from pg_index where indexrelid='pk_t2_a'::regclass; -[ RECORD 1 ]--+------- indexrelid | 126817 indrelid | 126814 indnatts | 1 -- Number of columns indisunique | t -- unique index indisprimary | t -- primary key indisexclusion | f -- support exclusion constraint indimmediate | t -- Force every time insert Check uniqueness when indisclustered | f -- Clustering is done according to this index indisvalid | t -- Used for query indcheckxmin | f -- Index tuples need to be checked when querying xmin<Transactional xmin Can use indisready | t -- Whether the index can be inserted or updated indislive | t -- f Indicates that the index is being drop,Other operations should ignore this index indisreplident | f -- ALTER TABLE ... REPLICA IDENTITY USING INDEX indkey | 1 -- Which columns were built with indcollation | 0 indclass | 1978 indoption | 0 indexprs | indpred | -- indisreplident Subscribed update and delete,Need to specify REPLICA IDENTITY To perform update deletion postgres=# select oid,* from pg_class order by oid desc limit 1; -[ RECORD 1 ]-------+-------- oid | 126817 relname | pk_t2_a relnamespace | 2200 reltype | 0 reloftype | 0 relowner | 10 relam | 403 relfilenode | 126817 reltablespace | 0 relpages | 1 reltuples | 0 relallvisible | 0 reltoastrelid | 0 relhasindex | f relisshared | f relpersistence | p relkind | i relnatts | 1 relchecks | 0 relhasoids | f relhaspkey | f relhasrules | f relhastriggers | f relhassubclass | f relrowsecurity | f relforcerowsecurity | f relispopulated | t relreplident | n relispartition | f relfrozenxid | 0 relminmxid | 0 relacl | reloptions | relpartbound |
2 data structure
|-----------------------------------------------------| | linp0(page header) | linp1 | linp2 | | |-----------------------------------------------------| | | | | |-----------------------------------------------------| | | itup2 | itup1 | special space(BTMetaPageData)| |-----------------------------------------------------|
BTPageOpaqueData maintains the structure of B-linked-tree in special space
typedef struct BTPageOpaqueData { BlockNumber btpo_prev; /* The block number of the left brother is convenient for reverse scanning */ BlockNumber btpo_next; /* Right brother's block number, forward scanning*/ union { uint32 level; /* level in the index tree*/ TransactionId xact; /* Delete useless ID S */ } btpo; uint16 btpo_flags; /* Page type */ BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */ } BTPageOpaqueData; /* Bits defined in btpo_flags */ #define BTP_LEAF (1 << 0) /* This flag indicates that there are leaf pages*/ #define BTP_ROOT (1 << 1) /* Root page*/ #define BTP_DELETED (1 << 2) /* Pages deleted from the index tree*/ #define BTP_META (1 << 3) /* Meta page*/ #define BTP_HALF_DEAD (1 << 4) /* Empty page, but keep it*/ #define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */ #define BTP_ HAS_ GARBAGE (1 << 6) /* page has LP_DEAD tuples, LP -- DEAD tuples*/ #define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */ // BTMetaPageData typedef struct BTMetaPageData { uint32 btm_magic; /* should contain BTREE_MAGIC */ uint32 btm_version; /* should contain BTREE_VERSION */ BlockNumber btm_root; /* current root location */ uint32 btm_level; /* tree level of the root page */ BlockNumber btm_fastroot; /* current "fast" root location */ uint32 btm_fastlevel; /* tree level of the "fast" root page */ } BTMetaPageData;
3 observe the storage structure
summary
- If the level 0 page (the bottom page) has a right brother, the first one is the first one of the right brothers (it should be the last one on this page. Note that this one does not belong to this page, but the smallest one on the next page is worth copying). Note: the leftmost page of level 0 page also has this copy
- If there is no right brother in the level 0 page (the bottom page), the first item is the starting data
- The data stored in the root page is the minimum value pointing to the page (normally the second one, and the first one is the copy of the right slicing minimum value). This first one is generally the copy of the second one (the minimum value of the current page) pointing to the right brother of the page. In particular, the data pointing to the leftmost page does not record information.
Observe the storage structure:
---------------- | meta page | ---------------- | root page | ----------------
Create test table
create table t3(id int primary key, info text);
meta page
The 0th page of the test table is meta page, pointing to the root page
postgres=# \d t3 Table "public.t3" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- id | integer | | not null | info | text | | | Indexes: "t3_pkey" PRIMARY KEY, btree (id) postgres=# select * from bt_page_stats('t3_pkey',0); ERROR: block 0 is a meta page postgres=# select * from bt_page_items('t3_pkey',0); ERROR: block 0 is a meta page postgres=# select * from bt_metap('t3_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 1 | 0 | 1 | 0
Scenario 1: 100 entries | meta page | root | branch | leaf
create table t3(id int primary key, info text); insert into t3 select generate_series(1,100), md5(random()::text);
level = 0
select * from bt_metap('t3_pkey');
btpo = grade
btpo_flags = type
postgres=# select * from bt_page_stats('t3_pkey',1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 100 | 0 | 16 | 8192 | 6148 | 0 | 0 | 0 | 3 (1 row)
bt_page_stats data is taken from btpageopera of specialspan. Refer to btpageopera data structure for meaning
bt_page_stats GetBTPageStatistics BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); ... stat->btpo_prev = opaque->btpo_prev; stat->btpo_next = opaque->btpo_next; stat->btpo.level = opaque->btpo.level; stat->btpo_flags = opaque->btpo_flags; stat->btpo_cycleid = opaque->btpo_cycleid; ...
https://developer.aliyun.com/article/53701
The hierarchy can be from BT_ page_ The btpo of stats is obtained, which represents the level of the current index page.
Note that the level is not unique. For example, the level with btpo=3 may be divided into several levels.
btpo=0 indicates the lowest level. The items(ctid) stored in index pages at this level points to the heap page.
The category and level are not linked. There can be multiple levels in the category, but only the ctid content stored in the index page with level = 0 points to the heap page
The ctid content stored in index pages of other levels refers to other index pages (two-way linked list) of the same level or index pages of lower levels.
0 layer structure, only meta and root pages.
The maximum number of item s that can be stored in the root page depends on the length of the index field data and the size of the index page.
current
btpo = grade = 0
btpo_flags = type = 3 = BTP_LEAF | BTP_ROOT
#define BTP_LEAF (1 << 0) /* leaf page, i.e. not internal page */ #define BTP_ROOT (1 << 1) /* root page (has no parent) */ #define BTP_DELETED (1 << 2) /* page has been deleted from tree */ #define BTP_META (1 << 3) /* meta-page */ #define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */ #define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */ #define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */ #define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
btpo knows that this is a level 0 node, and the level 0 node points to the heap tuple
postgres=# select * from bt_page_items('t3_pkey',1) limit 10; itemoffset | ctid | itemlen | nulls | vars | data ------------+--------+---------+-------+------+------------------------- 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 3 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00 4 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00 5 | (0,5) | 16 | f | f | 05 00 00 00 00 00 00 00 6 | (0,6) | 16 | f | f | 06 00 00 00 00 00 00 00 7 | (0,7) | 16 | f | f | 07 00 00 00 00 00 00 00 8 | (0,8) | 16 | f | f | 08 00 00 00 00 00 00 00 9 | (0,9) | 16 | f | f | 09 00 00 00 00 00 00 00 10 | (0,10) | 16 | f | f | 0a 00 00 00 00 00 00 00 postgres=# select * from t3 where ctid='(0,10)'; id | info ----+---------------------------------- 10 | 986dd76dcec6d1b705ea163fe6e11f37
Scenario 2: 1000 | meta | root | branch | leaf
create table t4(id int primary key, info text); insert into t4 select generate_series(1,1000), md5(random()::text);
level = 1 (only leaf, meta and root are excluded)
select * from bt_metap('t4_pkey');
1. root page ID=3
postgres=# select * from bt_metap('t4_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 3 | 1 | 3 | 1
2. Page stats
-- btpo=0: It's already at the bottom -- btpo_flags=1: BTP_LEAF postgres=# select * from bt_page_stats('t4_pkey',1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1 -- btpo=0: It's already at the bottom -- btpo_flags=1: BTP_LEAF postgres=# select * from bt_page_stats('t4_pkey',2); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1 -- btpo=1: Not the bottom -- btpo_flags=2: BTP_ROOT postgres=# select * from bt_page_stats('t4_pkey',3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2 -- btpo=0: It's already at the bottom -- btpo_flags=1: BTP_LEAF postgres=# select * from bt_page_stats('t4_pkey',4); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1 postgres=# select * from bt_page_stats('t4_pkey',5); ERROR: block number out of range
3. root page items
Point to three pages. data stores the minimum value stored in this leaf page
postgres=# select * from bt_page_items('t4_pkey',3); itemoffset | ctid | itemlen | nulls | vars | data ------------+-------+---------+-------+------+------------------------- 1 | (1,1) | 8 | f | f | 2 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00 3 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00
First level 0 page:
postgres=# select * from bt_page_items('t4_pkey',1); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00 2 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 4 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00 ... 365 | (3,4) | 16 | f | f | 6c 01 00 00 00 00 00 00 366 | (3,5) | 16 | f | f | 6d 01 00 00 00 00 00 00 367 | (3,6) | 16 | f | f | 6e 01 00 00 00 00 00 00
Second level 0 page:
postgres=# select * from bt_page_items('t4_pkey',2); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (6,13) | 16 | f | f | dd 02 00 00 00 00 00 00 2 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00 3 | (3,8) | 16 | f | f | 70 01 00 00 00 00 00 00 4 | (3,9) | 16 | f | f | 71 01 00 00 00 00 00 00 ... 365 | (6,10) | 16 | f | f | da 02 00 00 00 00 00 00 366 | (6,11) | 16 | f | f | db 02 00 00 00 00 00 00 367 | (6,12) | 16 | f | f | dc 02 00 00 00 00 00 00
Fourth level 0 page:
postgres=# select * from bt_page_items('t4_pkey',4); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (6,13) | 16 | f | f | dd 02 00 00 00 00 00 00 2 | (6,14) | 16 | f | f | de 02 00 00 00 00 00 00 3 | (6,15) | 16 | f | f | df 02 00 00 00 00 00 00 4 | (6,16) | 16 | f | f | e0 02 00 00 00 00 00 00 ... 266 | (8,38) | 16 | f | f | e6 03 00 00 00 00 00 00 267 | (8,39) | 16 | f | f | e7 03 00 00 00 00 00 00 268 | (8,40) | 16 | f | f | e8 03 00 00 00 00 00 00
Scenario 3: 10000000 | meta | root | branch | leaf
Summary:
- The first number of ctid indicates the page ID, and the second number is the offset
- btpo_flags = 0 is branch page, btpo_flags=2 is the root page
create table t5(id int primary key, info text); insert into t5 select trunc(random()*10000000), md5(random()::text) from generate_series(1,1000000) on conflict on constraint t5_pkey do nothing;
level = 2 (with branch and leaf) root is 412
select * from bt_metap('t5_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 412 | 2 | 412 | 2
Check the root (the data stores the maximum value of the left brother)
btpo = 2 (on the second floor)
btpo_flags = 2 = BTP_ROOT
postgres=# select * from bt_page_stats('t5_pkey', 412); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 412 | r | 11 | 0 | 15 | 8192 | 7936 | 0 | 0 | 2 | 2 postgres=# select * from bt_page_items('t5_pkey', 412); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (2714,1) | 16 | f | f | 66 07 0a 00 00 00 00 00 3 | (1233,1) | 16 | f | f | a0 8c 16 00 00 00 00 00 4 | (2381,1) | 16 | f | f | 0c 62 23 00 00 00 00 00 5 | (583,1) | 16 | f | f | 4d 71 30 00 00 00 00 00 6 | (2286,1) | 16 | f | f | 6b 80 3d 00 00 00 00 00 7 | (1062,1) | 16 | f | f | 59 fb 4b 00 00 00 00 00 8 | (2046,1) | 16 | f | f | de b2 5a 00 00 00 00 00 9 | (411,1) | 16 | f | f | d6 df 6a 00 00 00 00 00 10 | (2006,1) | 16 | f | f | 06 6d 79 00 00 00 00 00 11 | (1380,1) | 16 | f | f | 8c 34 8a 00 00 00 00 00
It seems that there are 11 branch page s, including 3, 2714, 1233 and 2381.
postgres=# select * from bt_page_stats('t5_pkey', 3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | i | 214 | 0 | 15 | 8192 | 3876 | 0 | 2714 | 1 | 0 postgres=# select * from bt_page_stats('t5_pkey', 2714); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 2714 | i | 273 | 0 | 15 | 8192 | 2696 | 3 | 1233 | 1 | 0
Let's take a look at the content of the branch, for example, 3 pages:
postgres=# select * from bt_page_items('t5_pkey', 3); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (1852,1) | 16 | f | f | 66 07 0a 00 00 00 00 00 2 | (1,1) | 8 | f | f | 3 | (3176,1) | 16 | f | f | 5c 08 00 00 00 00 00 00 4 | (1612,1) | 16 | f | f | ef 11 00 00 00 00 00 00 5 | (3045,1) | 16 | f | f | 2d 1a 00 00 00 00 00 00 6 | (724,1) | 16 | f | f | e2 24 00 00 00 00 00 00 7 | (2822,1) | 16 | f | f | ae 2d 00 00 00 00 00 00 8 | (1369,1) | 16 | f | f | 96 38 00 00 00 00 00 00 9 | (2612,1) | 16 | f | f | ac 42 00 00 00 00 00 00 10 | (315,1) | 16 | f | f | 0d 4e 00 00 00 00 00 00 11 | (2458,1) | 16 | f | f | d6 58 00 00 00 00 00 00 ... ...
What does branch data store?
Check the content of leaf page 1852. The first item is the copy of the minimum value of the right brother; The second is the current minimum value.
So data is the smallest data to save to the page.
postgres=# select * from bt_page_items('t5_pkey', 1852); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (151,77) | 16 | f | f | 2b 18 0a 00 00 00 00 00 2 | (1231,33) | 16 | f | f | 66 07 0a 00 00 00 00 00 <----------- 3 | (4098,24) | 16 | f | f | 67 07 0a 00 00 00 00 00 4 | (2373,32) | 16 | f | f | 73 07 0a 00 00 00 00 00 5 | (7888,95) | 16 | f | f | 9e 07 0a 00 00 00 00 00
How to find the minimum value?
Check the root and find that only (3,1) has no data. Data should have saved the minimum value of the right brother. No description (3,1) is the first page.
postgres=# select * from bt_page_items('t5_pkey', 412); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (2714,1) | 16 | f | f | 66 07 0a 00 00 00 00 00 3 | (1233,1) | 16 | f | f | a0 8c 16 00 00 00 00 00 4 | (2381,1) | 16 | f | f | 0c 62 23 00 00 00 00 00
View branch
postgres=# select * from bt_page_items('t5_pkey', 3); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (1852,1) | 16 | f | f | 66 07 0a 00 00 00 00 00 2 | (1,1) | 8 | f | f | 3 | (3176,1) | 16 | f | f | 5c 08 00 00 00 00 00 00 4 | (1612,1) | 16 | f | f | ef 11 00 00 00 00 00 00 5 | (3045,1) | 16 | f | f | 2d 1a 00 00 00 00 00 00 6 | (724,1) | 16 | f | f | e2 24 00 00 00 00 00 00 7 | (2822,1) | 16 | f | f | ae 2d 00 00 00 00 00 00 8 | (1369,1) | 16 | f | f | 96 38 00 00 00 00 00 00 9 | (2612,1) | 16 | f | f | ac 42 00 00 00 00 00 00
Check the leaf. The first one is the minimum value of the right sibling, and the second one is the minimum value of the current page.
postgres=# select * from bt_page_items('t5_pkey', 1); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (212,57) | 16 | f | f | 5c 08 00 00 00 00 00 00 2 | (7238,62) | 16 | f | f | 2a 00 00 00 00 00 00 00 3 | (430,12) | 16 | f | f | 2b 00 00 00 00 00 00 00 4 | (5535,81) | 16 | f | f | 3e 00 00 00 00 00 00 00 5 | (1278,47) | 16 | f | f | 3f 00 00 00 00 00 00 00 6 | (4255,45) | 16 | f | f | 46 00 00 00 00 00 00 00
View data
postgres=# select * from bt_page_items('t5_pkey', 1); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (212,57) | 16 | f | f | 5c 08 00 00 00 00 00 00 2 | (7238,62) | 16 | f | f | 2a 00 00 00 00 00 00 00 3 | (430,12) | 16 | f | f | 2b 00 00 00 00 00 00 00 4 | (5535,81) | 16 | f | f | 3e 00 00 00 00 00 00 00 5 | (1278,47) | 16 | f | f | 3f 00 00 00 00 00 00 00 6 | (4255,45) | 16 | f | f | 46 00 00 00 00 00 00 00 postgres=# select * from t5 where ctid='(7238,62)'; id | info ----+---------------------------------- 42 | faca6c2f77cfb6eea5e9ee6bf2740578 postgres=# select min(id) from t5; min ----- 42
ref
https://developer.aliyun.com/article/53701
The next article begins to turn to the source code