Facing the problem of lock waiting, how to realize the second level positioning and analysis of the problem

Posted by james_kirk on Mon, 21 Feb 2022 12:38:42 +0100

Abstract: GaussDB(DWS) provides two cluster level views to quickly identify and query lock waiting and distributed deadlock information, which can locate and analyze the second level problems of this kind of problems.

This article is shared from Huawei cloud community< GaussDB(DWS) operation and maintenance -- one click lock waiting and distributed deadlock detection >, author: Chen Lige file.

Lock is the key element for GaussDB(DWS) to realize concurrent management. GaussDB(DWS) lock categories include table level lock, partition level lock (consistent with table level lock), transaction lock, consultation lock, etc. at present, table level lock, partition level lock (consistent with table level lock) and transaction lock are the most commonly used in business. When executing different SQL statements, you need to apply for and hold corresponding locks. When these lock resources are mutually exclusive, the corresponding business SQL will wait; This kind of waiting will have the following consequences:

  1. The party holding the lock releases the lock (generally, the corresponding action is the submission of the thing holding the lock), waits for the party holding the lock to apply for the lock, and then continues to execute
  2. The party holding the lock has not submitted for a long time, and the party waiting for the lock reports an error due to the lock waiting timeout
  3. On instance A, the roles of the things holding the lock and the things applying for the lock are exchanged on instance B, resulting in distributed deadlock (see the introduction below for details). In this scenario, it is necessary to first release the lock resource when an error is reported and the lock waiting timeout event is rolled back, and then another event can be applied normally

From the above description, it can be seen that lock waiting, especially distributed deadlock, has a great impact on the business. At least, waiting will lead to business performance jitter and decline, and even business error reporting. Dwsdb provides two levels of fast query and deadlock analysis.

1) Lock wait detection view pgxc_lock_conflicts

[function] query the lock waiting information on different nodes in the current library

[parse] execute the following query results

postgres=# SELECT * FROM pgxc_lock_conflicts ORDER BY nodename,dbname,locktype,nspname,relname,partname;
 locktype  | nodename |  dbname  | nspname |        relname        | partname | page | tuple | transactionid | username  |   gxid   |           xactstart           |      queryid       |                          query                           |       pid       |        mode         | granted
-----------+----------+----------+---------+-----------------------+----------+------+-------+---------------+-----------+----------+-------------------------------+--------------------+----------------------------------------------------------+-----------------+---------------------+---------
 partition | cn_5001  | postgres | public  | table_partition_num_3 | p1       |      |       |               | dfm           | 24097147 | 2022-02-17 17:56:03.113194+08 | 104145741383084190 | alter table table_partition_num_3 truncate partition p1; | 140160505136896 | AccessExclusiveLock | f
 partition | cn_5001  | postgres | public  | table_partition_num_3 | p1       |      |       |               | dfm           | 24102679 | 2022-02-17 18:41:36.580348+08 |                  0 | alter table table_partition_num_3 truncate partition p1; | 140160568055552 | AccessExclusiveLock | t
 relation  | cn_5002  | postgres | public  | xxx                   |          |      |       |               | dfm           | 24102679 | 2022-02-17 18:41:36.580348+08 | 175921860444402398 | truncate xxx;                                            | 140418767369984 | AccessShareLock     | f
 relation  | cn_5002  | postgres | public  | xxx                   |          |      |       |               | dfm           | 24097147 | 2022-02-17 17:56:03.113194+08 |                  0 | truncate xxx;                                            | 140420489144064 | AccessExclusiveLock | t
(4 rows)

SQL shown above

  • On node cn_5001 postgres table public table_ partition_ num_ There is a partition level lock conflict on partition P1 of 3. In the current lock conflict, thread 140160568055552 holds the lock (mode = true), the lock level is AccessExclusiveLock, and the execution statement is alter table_ partition_ num_ 3 truncate partition p1. Thread 140160568055552 is waiting for (mode = false)AccessExclusiveLock lock, and the statement waiting for lock is also alter table_ partition_ num_ 3 truncate partition p1.
  • On node cn_ Table in postgres of 5002 http://public.xxx There is a table level lock conflict on. Thread 140420489144064 holds the lock AccessExclusiveLock(mode = true), and thread 140418767369984 is waiting for (mode = false)AccessShareLock lock

2) Distributed lock wait detection view pgxc_deadlock

[function] query the distributed deadlock information on different nodes in the current library

[parse] execute the following query results

postgres=# SELECT * FROM pgxc_deadlock ORDER BY nodename,dbname,locktype,nspname,relname,partname;
 locktype | nodename |  dbname  | nspname | relname | partname | page | tuple | transactionid | waitusername | waitgxid |         waitxactstart         |    waitqueryid     |                      waitquery                      |     waitpid     |    waitmode     | holdusername | holdgxid |         holdxactstart         | holdqueryid |  holdquery   |     holdpid     |      holdmode
----------+----------+----------+---------+---------+----------+------+-------+---------------+--------------+----------+-------------------------------+--------------------+-----------------------------------------------------+-----------------+-----------------+--------------+----------+-------------------------------+-------------+--------------+-----------------+---------------------
 relation | cn_5001  | postgres | public  | t2      |          |      |       |               | j00565968    | 24112406 | 2022-02-17 20:01:57.421532+08 | 104145741383110084 | EXECUTE DIRECT ON(dn_6003_6004) 'SELECT * FROM t2'; | 140160505136896 | AccessShareLock | j00565968    | 24112465 | 2022-02-17 20:02:24.220656+08 |           0 | TRUNCATE t2; | 140160421234432 | AccessExclusiveLock
 relation | cn_5002  | postgres | public  | t1      |          |      |       |               | j00565968    | 24112465 | 2022-02-17 20:02:24.220656+08 | 175921860444446866 | EXECUTE DIRECT ON(dn_6001_6002) 'SELECT * FROM t1'; | 140418784151296 | AccessShareLock | j00565968    | 24112406 | 2022-02-17 20:01:57.421532+08 |           0 | TRUNCATE t1; | 140421763163904 | AccessExclusiveLock
(2 rows)

As shown in the SQL above, it is in the postgres library

  • Node cn_5001 upper

Transaction 24112465 holds the table public through thread 140160421234432 AccessExclusiveLock lock of T2

Transaction 24112406 is waiting for the application form public through thread 140160505136896 AccessShareLock lock of T2

  • Node cn_5002 upper

Transaction 24112465 is waiting for the application form public through thread 140418784151296 AccessShareLock lock for T1

Transaction 24112406 holds the table public through thread 140421763163904 AccessExclusiveLock lock of T1

If we define a line of defense according to the holding of resources to the application, the following table can be formed

As can be seen from the above, transaction 24112465 is on node cn_5001 holding form public T2's AccessExclusiveLock lock, waiting for the application form public AccessShareLock lock of T1; Transaction 24112406 on node cn_5002 holds table public T1's AccessExclusiveLock lock, waiting for the application form public AccessShareLock lock of T2; Transaction 24112406 and transaction 24112465 can apply for lock resources only when they wait for each other to submit and let them continue to execute. This distributed waiting relationship on multiple instances forms a ring. We call this phenomenon distributed deadlock.

3) The difference between lock waiting and distributed deadlock

For distributed deadlocks, only one transaction is allowed because the lock is waiting (parameter lockwait_timeout )When the timeout rollback occurs, another transaction can proceed; Or manually intervene to kill or cancel one transaction and let the other transaction continue.

For lock waiting without distributed deadlock, this generally does not require manual intervention. After the normal execution of the lock holding transaction is completed, another transaction can be executed normally; However, if the transaction holds the lock for more than the lock wait timeout parameter (parameter) lockwait_timeout ), a transaction waiting for a lock will fail because of the lock wait timeout.

 

Click follow to learn about Huawei's new cloud technology for the first time~

Topics: PostgreSQL GaussDB