PG Database Kernel Analysis study notes: log recovery strategy
In PostgreSQL, StartupXLOG is the entry function for log recovery. It is called once at every startup, which includes the restart after a crash.
    // xlog.c
    /*
     * This must be called ONCE during postmaster or standalone-backend startup
     */
    void
    StartupXLOG(void)
    {
        XLogCtlInsert *Insert;
        CheckPoint  checkPoint;
        bool        wasShutdown;
        bool        reachedStopPoint = false;
        bool        haveBackupLabel = false;
        bool        haveTblspcMap = false;
        XLogRecPtr  RecPtr,
                    checkPointLoc,
                    EndOfLog;
        TimeLineID  EndOfLogTLI;
        TimeLineID  PrevTimeLineID;
        XLogRecord *record;
        TransactionId oldestActiveXID;
        bool        backupEndRequired = false;
        bool        backupFromStandby = false;
        DBState     dbstate_at_startup;
        XLogReaderState *xlogreader;
        XLogPageReadPrivate private;
        bool        fast_promoted = false;
        struct stat st;
The function first reads the control file, global/pg_control, to obtain the system's control information.
    /*
     * Read control file and check XLOG status looks valid.
     *
     * Note: in most control paths, *ControlFile is already valid and we need
     * not do ReadControlFile() here, but might as well do it to be sure.
     */
    ReadControlFile();
It then checks that the XLOG directory structure (pg_wal and pg_wal/archive_status) is complete.
    /*
     * Verify that pg_wal and pg_wal/archive_status exist. In cases where
     * someone has performed a copy for PITR, these directories may have been
     * excluded and need to be re-created.
     */
    ValidateXLOGDirectoryStructure();
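It then reads the latest checkpoint record. The excerpt here is the branch taken when a backup_label file is present; otherwise the checkpoint location is taken from pg_control.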
    /*
     * When a backup_label file is present, we want to roll forward from
     * the checkpoint it identifies, rather than using pg_control.
     */
    record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0, true);
    if (record != NULL)
Next, the function compares log positions (LSNs) to determine whether the system was left in an abnormal state. If it was, the recovery mechanism is triggered. After recovery, the function creates a new checkpoint, initializes the XLogCtl control information, and then starts the transaction commit log and the related auxiliary log modules.
Recovery is required in any of three situations:
1) A backup_label file is found.
    if (read_backup_label(&checkPointLoc, &backupEndRequired,
                          &backupFromStandby))
    ....
        /*
         * When a backup_label file is present, we want to roll forward from
         * the checkpoint it identifies, rather than using pg_control.
         */
        record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0, true);
        if (record != NULL)
2) No log record can be read at the latest checkpoint location recorded in ControlFile.
3) The checkpoint location recorded in ControlFile is inconsistent with the Redo position stored in the checkpoint record found at that location.
PostgreSQL's logging strategy is Redo logging with an improved non-quiescent checkpoint scheme; recovery finds the latest valid checkpoint and redoes the log from there.
The recovery operation proceeds as follows:
1) Write the updated control information to the ControlFile.
2) Initialize the resource managers used for log recovery.
3) Read a log record, starting from the REDO position stored in the checkpoint record.
4) Select the RMGR that matches the resource manager number in the log record, and let that RMGR replay the operation described by the record (the REDO operation).
5) Repeat steps 3 and 4 until no more log records can be read.
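Steps 2 to 5 amount to a loop that dispatches each record through a table of redo callbacks. The following is a minimal sketch under stated assumptions: the log is modeled as an in-memory array, and LogRecord, RedoFn, rmgr_table, replay_from, and the two resource manager ids are hypothetical stand-ins, not PostgreSQL's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resource manager ids; the real list is much longer. */
enum { RM_XLOG = 0, RM_HEAP = 1, RM_COUNT = 2 };

/* Hypothetical, simplified log record. */
typedef struct LogRecord
{
    int rmid;    /* resource manager number stored in the record */
    int payload; /* stand-in for the record body */
} LogRecord;

/* A redo callback applies one record to some recovered state. */
typedef void (*RedoFn)(const LogRecord *rec, int *state);

static void
xlog_redo_sketch(const LogRecord *rec, int *state)
{
    state[RM_XLOG] += rec->payload;
}

static void
heap_redo_sketch(const LogRecord *rec, int *state)
{
    state[RM_HEAP] += rec->payload;
}

/* Step 2: the table of resource managers, indexed by rmid. */
static const RedoFn rmgr_table[RM_COUNT] = {
    xlog_redo_sketch,
    heap_redo_sketch
};

/*
 * Steps 3 to 5: read records from the redo position until none are left,
 * dispatching each one to its resource manager. Returns the number of
 * records replayed.
 */
static size_t
replay_from(const LogRecord *log, size_t nrecords, size_t redo_pos, int *state)
{
    size_t replayed = 0;

    assert(log != NULL && state != NULL);
    for (size_t i = redo_pos; i < nrecords; i++)
    {
        const LogRecord *rec = &log[i];

        assert(rec->rmid >= 0 && rec->rmid < RM_COUNT);
        rmgr_table[rec->rmid](rec, state);   /* step 4: RMGR-specific redo */
        replayed++;
    }
    return replayed;
}
```

The table lookup is the point of the design: the loop itself knows nothing about heap pages or B-Tree nodes, and each record type is handled by the module that wrote it.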
In this process, the REDO operation in step 4 does different work for different log record types:
1) Redo of Database records. This type of record has no backup block. The possible operations are Create and Drop. For Create, first force all buffers to be flushed, then copy the source database directory named in the record to the new database directory. For Drop, discard the buffers that belong to the database.
2) Redo of Heap records. First, use the log sequence number (LSN) and the log record to determine whether the record carries a backup block. If it does, restore the backup block to the Page and mark the Page "dirty". Then select the operation according to the flag bits in the record; typical operations are INSERT, DELETE, and UPDATE. Each of these first checks whether the record has a backup block. If it has one, the page has already been restored and the routine returns immediately. If not, it reads the record and rebuilds the HeapTuple.
3) Redo of B-Tree records. The B-Tree is a more complex index structure, with operations such as leaf insertion and node splits that differ by position (root node, leaf node, left and right subtree). The kind of redo to perform is therefore chosen from the flag information in the record.
4) Redo of Xlog records. Because a crash can happen at any time, operations on the Xlog itself must also be logged. Xlog records cover operations such as recording the next assignable OID and creating checkpoints. Recovery is relatively simple: the logged information is copied back.
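The backup-block rule described for Heap records can be sketched as follows. This is an illustration only, assuming a toy 16-byte page and a one-byte "tuple"; PageSketch, HeapRecordSketch, and heap_insert_redo_sketch are hypothetical names and do not correspond to functions in the PostgreSQL source.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 16   /* toy page size, chosen for the example */

typedef struct PageSketch
{
    char data[PAGE_SIZE_SKETCH];
    bool dirty;
} PageSketch;

/* Hypothetical heap INSERT record. */
typedef struct HeapRecordSketch
{
    bool has_backup_block;                /* record carries a full page image */
    char backup_block[PAGE_SIZE_SKETCH];  /* the image, if present */
    int  offset;                          /* where the tuple belongs on the page */
    char tuple_byte;                      /* stand-in for the tuple data */
} HeapRecordSketch;

typedef enum { REDO_RESTORED_IMAGE, REDO_REBUILT_TUPLE } RedoOutcome;

static RedoOutcome
heap_insert_redo_sketch(PageSketch *page, const HeapRecordSketch *rec)
{
    assert(page != NULL && rec != NULL);

    if (rec->has_backup_block)
    {
        /* Restoring the image already contains the insert: return at once. */
        memcpy(page->data, rec->backup_block, PAGE_SIZE_SKETCH);
        page->dirty = true;
        return REDO_RESTORED_IMAGE;
    }

    /* No backup block: rebuild the tuple from the record contents. */
    assert(rec->offset >= 0 && rec->offset < PAGE_SIZE_SKETCH);
    page->data[rec->offset] = rec->tuple_byte;
    page->dirty = true;
    return REDO_REBUILT_TUPLE;
}
```

The early return matters because the page image was taken after the change was applied, so replaying the tuple on top of it would apply the same change twice.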
When recovery completes, the function creates a new checkpoint and reinitializes the XLogCtl structure.
    // xlog.c
    /*
     * Total shared-memory state for XLOG.
     */
    typedef struct XLogCtlData
    {
        XLogCtlInsert Insert;

        /* Protected by info_lck: */
        XLogwrtRqst LogwrtRqst;
        XLogRecPtr  RedoRecPtr;     /* a recent copy of Insert->RedoRecPtr */
        uint32      ckptXidEpoch;   /* nextXID & epoch of latest checkpoint */
        TransactionId ckptXid;
        XLogRecPtr  asyncXactLSN;   /* LSN of newest async commit/abort */
        XLogRecPtr  replicationSlotMinLSN;  /* oldest LSN needed by any slot */

        XLogSegNo   lastRemovedSegNo;   /* latest removed/recycled XLOG segment */

        /* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
        XLogRecPtr  unloggedLSN;
        slock_t     ulsn_lck;

        /* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
        pg_time_t   lastSegSwitchTime;
        XLogRecPtr  lastSegSwitchLSN;

        /*
         * Protected by info_lck and WALWriteLock (you must hold either lock to
         * read it, but both to update)
         */
        XLogwrtResult LogwrtResult;

        /*
         * Latest initialized page in the cache (last byte position + 1).
         *
         * To change the identity of a buffer (and InitializedUpTo), you need to
         * hold WALBufMappingLock. To change the identity of a buffer that's
         * still dirty, the old page needs to be written out first, and for that
         * you need WALWriteLock, and you need to ensure that there are no
         * in-progress insertions to the page by calling
         * WaitXLogInsertionsToFinish().
         */
        XLogRecPtr  InitializedUpTo;

        /*
         * These values do not change after startup, although the pointed-to pages
         * and xlblocks values certainly do. xlblock values are protected by
         * WALBufMappingLock.
         */
        char       *pages;          /* buffers for unwritten XLOG pages */
        XLogRecPtr *xlblocks;       /* 1st byte ptr-s + XLOG_BLCKSZ */
        int         XLogCacheBlck;  /* highest allocated xlog buffer index */

        /*
         * Shared copy of ThisTimeLineID. Does not change after end-of-recovery.
         * If we created a new timeline when the system was started up,
         * PrevTimeLineID is the old timeline's ID that we forked off from.
         * Otherwise it's equal to ThisTimeLineID.
         */
        TimeLineID  ThisTimeLineID;
        TimeLineID  PrevTimeLineID;

        /*
         * archiveCleanupCommand is read from recovery.conf but needs to be in
         * shared memory so that the checkpointer process can access it.
         */
        char        archiveCleanupCommand[MAXPGPATH];

        /*
         * SharedRecoveryInProgress indicates if we're still in crash or archive
         * recovery. Protected by info_lck.
         */
        bool        SharedRecoveryInProgress;

        /*
         * SharedHotStandbyActive indicates if we're still in crash or archive
         * recovery. Protected by info_lck.
         */
        bool        SharedHotStandbyActive;

        /*
         * WalWriterSleeping indicates whether the WAL writer is currently in
         * low-power mode (and hence should be nudged if an async commit occurs).
         * Protected by info_lck.
         */
        bool        WalWriterSleeping;

        /*
         * recoveryWakeupLatch is used to wake up the startup process to continue
         * WAL replay, if it is waiting for WAL to arrive or failover trigger file
         * to appear.
         */
        Latch       recoveryWakeupLatch;

        /*
         * During recovery, we keep a copy of the latest checkpoint record here.
         * lastCheckPointRecPtr points to start of checkpoint record and
         * lastCheckPointEndPtr points to end+1 of checkpoint record. Used by the
         * checkpointer when it wants to create a restartpoint.
         *
         * Protected by info_lck.
         */
        XLogRecPtr  lastCheckPointRecPtr;
        XLogRecPtr  lastCheckPointEndPtr;
        CheckPoint  lastCheckPoint;

        /*
         * lastReplayedEndRecPtr points to end+1 of the last record successfully
         * replayed. When we're currently replaying a record, ie. in a redo
         * function, replayEndRecPtr points to the end+1 of the record being
         * replayed, otherwise it's equal to lastReplayedEndRecPtr.
         */
        XLogRecPtr  lastReplayedEndRecPtr;
        TimeLineID  lastReplayedTLI;
        XLogRecPtr  replayEndRecPtr;
        TimeLineID  replayEndTLI;
        /* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
        TimestampTz recoveryLastXTime;

        /*
         * timestamp of when we started replaying the current chunk of WAL data,
         * only relevant for replication or archive recovery
         */
        TimestampTz currentChunkStartTime;
        /* Are we requested to pause recovery? */
        bool        recoveryPause;

        /*
         * lastFpwDisableRecPtr points to the start of the last replayed
         * XLOG_FPW_CHANGE record that instructs full_page_writes is disabled.
         */
        XLogRecPtr  lastFpwDisableRecPtr;

        slock_t     info_lck;       /* locks shared variables shown above */
    } XLogCtlData;
It then calls StartupCLOG, StartupSUBTRANS, and StartupMultiXact in turn.
    // clog.c
    /*
     * This must be called ONCE during postmaster or standalone-backend startup,
     * after StartupXLOG has initialized ShmemVariableCache->nextXid.
     */
    void
    StartupCLOG(void)
    {
        TransactionId xid = ShmemVariableCache->nextXid;
        int         pageno = TransactionIdToPage(xid);

        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);

        /*
         * Initialize our idea of the latest page number.
         */
        ClogCtl->shared->latest_page_number = pageno;

        LWLockRelease(CLogControlLock);
    }
    // subtrans.c
    /*
     * This must be called ONCE during postmaster or standalone-backend startup,
     * after StartupXLOG has initialized ShmemVariableCache->nextXid.
     *
     * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
     * if there are none.
     */
    void
    StartupSUBTRANS(TransactionId oldestActiveXID)
    {
        int         startPage;
        int         endPage;

        /*
         * Since we don't expect pg_subtrans to be valid across crashes, we
         * initialize the currently-active page(s) to zeroes during startup.
         * Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
         * the new page without regard to whatever was previously on disk.
         */
        LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);

        startPage = TransactionIdToPage(oldestActiveXID);
        endPage = TransactionIdToPage(ShmemVariableCache->nextXid);

        while (startPage != endPage)
        {
            (void) ZeroSUBTRANSPage(startPage);
            startPage++;
            /* must account for wraparound */
            if (startPage > TransactionIdToPage(MaxTransactionId))
                startPage = 0;
        }
        (void) ZeroSUBTRANSPage(startPage);

        LWLockRelease(SubtransControlLock);
    }
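The loop in StartupSUBTRANS visits every page from the page of oldestActiveXID through the page of nextXid, wrapping to page 0 after the last page. A small standalone sketch of that traversal, assuming toy constants (4 XIDs per page, XIDs 0 to 31) and recording page numbers instead of zeroing pages; xid_to_page and zero_active_pages are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Toy constants for the example; the real values are far larger. */
#define XACTS_PER_PAGE_SKETCH 4u
#define MAX_XID_SKETCH        31u   /* gives pages 0..7 */

static uint32_t
xid_to_page(uint32_t xid)
{
    assert(xid <= MAX_XID_SKETCH);
    return xid / XACTS_PER_PAGE_SKETCH;
}

/*
 * Records, in order, each page the startup loop would zero.
 * Returns how many pages were visited.
 */
static int
zero_active_pages(uint32_t oldest_active_xid, uint32_t next_xid,
                  uint32_t *visited, int max_visited)
{
    uint32_t page = xid_to_page(oldest_active_xid);
    uint32_t end_page = xid_to_page(next_xid);
    int      n = 0;

    while (page != end_page)
    {
        assert(n < max_visited);
        visited[n++] = page;
        page++;
        /* must account for wraparound */
        if (page > xid_to_page(MAX_XID_SKETCH))
            page = 0;
    }
    /* The end page itself is always visited, as in the original loop. */
    assert(n < max_visited);
    visited[n++] = page;
    return n;
}
```

With these toy constants, an oldest active XID of 26 and a next XID of 5 visit pages 6, 7, 0, 1 in that order; when both XIDs fall on the same page, only that page is visited.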
    // multixact.c
    /*
     * This must be called ONCE during postmaster or standalone-backend startup.
     *
     * StartupXLOG has already established nextMXact/nextOffset by calling
     * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact, and the oldestMulti
     * info from pg_control and/or MultiXactAdvanceOldest, but we haven't yet
     * replayed WAL.
     */
    void
    StartupMultiXact(void)
    {
        MultiXactId multi = MultiXactState->nextMXact;
        MultiXactOffset offset = MultiXactState->nextOffset;
        int         pageno;

        /*
         * Initialize offset's idea of the latest page number.
         */
        pageno = MultiXactIdToOffsetPage(multi);
        MultiXactOffsetCtl->shared->latest_page_number = pageno;

        /*
         * Initialize member's idea of the latest page number.
         */
        pageno = MXOffsetToMemberPage(offset);
        MultiXactMemberCtl->shared->latest_page_number = pageno;
    }
This completes the startup of the transaction commit log and the other auxiliary log modules. If the system log shows that no recovery is needed, the recovery step is skipped and the function goes straight to initializing the transaction commit log and the related modules.
References
PG Database Kernel Analysis, Section 7.11.5, XLOG log manager