Design and implementation of Redis 7.0 Multi Part AOF

Posted by Thoaren on Mon, 14 Feb 2022 10:18:08 +0100

Introduction: This article explains in detail several shortcomings of the existing AOF mechanism in Redis, and the design and implementation of Multi Part AOF, which was introduced in Redis 7.0.

Redis is a very popular in-memory database. It achieves very high read and write performance by keeping its data in memory. However, once the process exits, all of that in-memory data is lost.

To solve this problem, Redis provides two persistence schemes, RDB and AOF, which save in-memory data to disk to avoid data loss. This article focuses on the AOF persistence scheme and its existing problems. It then discusses the design and implementation of Multi Part AOF in Redis 7.0 (RC1 has been released). Multi Part AOF is referred to below as MP-AOF, and it was contributed by the Alibaba Cloud database Tair team.

AOF

AOF (append-only file) persistence records every write command in a separate log file. When Redis starts, it replays the commands in that file to recover the data.
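The sketch below shows how one command is recorded in an AOF file. It is encoded as a RESP array, which is the Redis wire protocol. The function name encodeAppendCommand is hypothetical, and this is not the Redis implementation. It also assumes that arguments are NUL-terminated strings, whereas Redis itself is binary-safe.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encode argv as a RESP array, the format in which
 * AOF stores each write command. Assumes NUL-terminated arguments (real
 * Redis is binary-safe). Returns the encoded length, or -1 if `cap` is
 * too small. */
int encodeAppendCommand(char *out, size_t cap, int argc, const char **argv) {
    size_t used;
    int n = snprintf(out, cap, "*%d\r\n", argc);          /* array header */
    if (n < 0 || (size_t)n >= cap) return -1;
    used = (size_t)n;
    for (int i = 0; i < argc; i++) {
        /* Each argument is a bulk string: $<len>\r\n<bytes>\r\n */
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

For example, encoding SET key value yields `*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n`. Replaying an AOF means parsing such arrays back and executing them in order.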

Because AOF appends every Redis write command, the file keeps growing as Redis processes more writes, and replay time grows with it. To address this, Redis introduced the AOF rewrite mechanism (hereinafter AOFRW). AOFRW removes redundant write commands by generating a new, equivalent AOF file, which reduces the size of the AOF.
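The idea behind AOFRW can be shown with a toy model. All names here are hypothetical, and commands are kept as plain text rather than RESP. The model replays a log of SET/DEL commands into a small keyspace, then emits one SET per surviving key. However many commands touched a key, the rewritten output needs only one. Real AOFRW instead walks the dataset in a forked child process.

```c
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 16
#define MAX_LEN  32

/* Toy keyspace with fixed-size slots (hypothetical, for illustration). */
typedef struct {
    char key[MAX_KEYS][MAX_LEN];
    char val[MAX_KEYS][MAX_LEN];
    int  used[MAX_KEYS];
} Keyspace;

/* Find the slot holding `key`; if absent and `create` is set, return a free slot. */
static int findSlot(const Keyspace *ks, const char *key, int create) {
    int free_slot = -1;
    for (int i = 0; i < MAX_KEYS; i++) {
        if (ks->used[i] && strcmp(ks->key[i], key) == 0) return i;
        if (!ks->used[i] && free_slot < 0) free_slot = i;
    }
    return create ? free_slot : -1;
}

/* Replay one logged command: "SET key val" or "DEL key". */
void replay(Keyspace *ks, const char *cmd, const char *key, const char *val) {
    if (strcmp(cmd, "SET") == 0) {
        int i = findSlot(ks, key, 1);
        if (i < 0) return;                       /* toy store is full */
        ks->used[i] = 1;
        snprintf(ks->key[i], MAX_LEN, "%s", key);
        snprintf(ks->val[i], MAX_LEN, "%s", val);
    } else if (strcmp(cmd, "DEL") == 0) {
        int i = findSlot(ks, key, 0);
        if (i >= 0) ks->used[i] = 0;
    }
}

/* "Rewrite": one SET per live key. Returns the number of commands
 * written into `out`, or -1 if `cap` is too small. */
int rewriteKeyspace(const Keyspace *ks, char *out, size_t cap) {
    size_t used = 0;
    int n = 0;
    if (cap == 0) return -1;
    out[0] = '\0';
    for (int i = 0; i < MAX_KEYS; i++) {
        if (!ks->used[i]) continue;
        int w = snprintf(out + used, cap - used, "SET %s %s\n",
                         ks->key[i], ks->val[i]);
        if (w < 0 || (size_t)w >= cap - used) return -1;
        used += (size_t)w;
        n++;
    }
    return n;
}
```

Five logged commands (three SETs on one key, plus a SET and a DEL on another) rewrite to a single `SET a 3`.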

AOFRW

Figure 1 shows how AOFRW works. When AOFRW is triggered, Redis first forks a child process to perform the rewrite in the background. The child writes a snapshot of all Redis data, as of the moment of the fork, into a temporary AOF file named temp-rewriteaof-bg-<pid>.aof.

Because the child process performs the rewrite in the background, the main process can keep serving user commands normally during the rewrite. The child also needs the incremental changes that the main process makes during the rewrite. For this reason, the main process writes each executed write command to aof_buf and also writes a copy into aof_rewrite_buf. During the later stage of the rewrite, the main process sends the data accumulated in aof_rewrite_buf to the child through a pipe, and the child appends this data to the temporary AOF file (see here for the detailed principle).

When the main process handles heavy write traffic, a lot of data may accumulate in aof_rewrite_buf, and the child process may be unable to consume all of it during the rewrite. In that case, the main process handles the data left in aof_rewrite_buf when the rewrite finishes.

When the child process finishes the rewrite and exits, the main process handles the remaining work in backgroundRewriteDoneHandler. First, it appends to the temporary AOF file any data in aof_rewrite_buf that was not consumed during the rewrite. Second, once everything is ready, Redis uses rename to atomically rename the temporary AOF file to server.aof_filename, which overwrites the original AOF file. This completes the AOFRW process.
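The final step relies on rename(2) being atomic when the source and target are on the same POSIX file system. Anyone who opens the target sees either the old file or the complete new one. Below is a minimal sketch of this write-temp-then-rename pattern. replaceFileAtomically is a hypothetical name, and the real backgroundRewriteDoneHandler does considerably more bookkeeping around file descriptors.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `data` to `tmp`, force it to disk, then
 * atomically rename it over `target`. Returns 0 on success, -1 on error
 * (in which case `target` is left untouched and `tmp` is removed). */
int replaceFileAtomically(const char *target, const char *tmp, const char *data) {
    size_t len = strlen(data);
    FILE *f = fopen(tmp, "w");
    if (f == NULL) return -1;
    /* Flush stdio buffers, then ask the kernel to persist the bytes. */
    if (fwrite(data, 1, len, f) != len || fflush(f) != 0 ||
        fsync(fileno(f)) != 0) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    /* Only a fully written temp file is ever renamed over the target. */
    if (fclose(f) != 0 || rename(tmp, target) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

Calling it twice with the same target leaves the target holding only the second payload. If the process dies before the rename, the target still holds the first payload.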


Figure 1: Implementation principle of AOFRW

Problems in AOFRW

Memory overhead

As Figure 1 shows, during AOFRW the main process writes every data change made after the fork into aof_rewrite_buf. Most of the content of aof_rewrite_buf duplicates the content of aof_buf, so this introduces redundant memory overhead.
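The duplication can be pictured with a simplified model of the pre-7.0 write path, which in Redis lives in feedAppendOnlyFile. The Buffer type and feedCommand function below are hypothetical. The model also ignores two details: aof_buf is flushed to disk on every event-loop iteration, and aof_rewrite_buf is gradually drained to the child. It only shows that, while a rewrite child exists, every command is copied twice.

```c
#include <stdlib.h>
#include <string.h>

/* Growable byte buffer standing in for an sds string (hypothetical). */
typedef struct {
    char  *buf;
    size_t len, cap;
} Buffer;

static int bufferAppend(Buffer *b, const char *s, size_t n) {
    if (n == 0) return 0;
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap * 2 : 64;
        while (cap < b->len + n) cap *= 2;
        char *p = realloc(b->buf, cap);
        if (p == NULL) return -1;
        b->buf = p;
        b->cap = cap;
    }
    memcpy(b->buf + b->len, s, n);
    b->len += n;
    return 0;
}

/* Every write goes to aof_buf; while a rewrite child is running, a second
 * copy of the same bytes also goes to the rewrite buffer. */
int feedCommand(Buffer *aof_buf, Buffer *rewrite_buf, int child_active,
                const char *cmd, size_t n) {
    if (bufferAppend(aof_buf, cmd, n) != 0) return -1;
    if (child_active && bufferAppend(rewrite_buf, cmd, n) != 0) return -1;
    return 0;
}
```

With a child active, both buffers grow by exactly the same number of bytes. This is why aof_buffer_length and aof_rewrite_buffer_length are so close in the INFO sample below.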

The aof_rewrite_buffer_length field in Redis INFO shows how much memory aof_rewrite_buf currently occupies. As shown below, under heavy write traffic aof_rewrite_buffer_length is almost as large as aof_buffer_length, so the buffered data takes nearly twice the memory it needs.

aof_pending_rewrite:0
aof_buffer_length:35500
aof_rewrite_buffer_length:34000
aof_pending_bio_fsync:0

When the memory occupied by aof_rewrite_buf exceeds a certain threshold, the following messages appear in the Redis log. Here, aof_rewrite_buf occupies 100 MB of memory, and about 2135 MB of data was transferred from the main process to the child process. The child also pays for an internal read buffer when it reads this data through the pipe. For an in-memory database like Redis, this is a significant cost.

3351:M 25 Jan 2022 09:55:39.655 * Background append only file rewriting started by pid 6817
3351:M 25 Jan 2022 09:57:51.864 * AOF rewrite child asks to stop sending diffs.
6817:C 25 Jan 2022 09:57:51.864 * Parent agreed to stop sending diffs. Finalizing AOF...
6817:C 25 Jan 2022 09:57:51.864 * Concatenating 2135.60 MB of AOF diff received from parent.
3351:M 25 Jan 2022 09:57:56.545 * Background AOF buffer size: 100 MB

The memory overhead of AOFRW can suddenly push Redis to its maxmemory limit, which disrupts normal write commands. It can even get Redis killed by the operating system's OOM killer, leaving it unable to serve requests.

CPU overhead

The CPU overhead comes mainly from three places, explained below:

During AOFRW, the main process spends CPU time writing data into aof_rewrite_buf, and it uses the event loop to send the data in aof_rewrite_buf to the child process:

/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
    // Omit other details here
  
    /* Install a file event to send data to the rewrite child if there is
     * not one already. */
    if (!server.aof_stop_sending_diff &&
        aeGetFileEvents(server.el,server.aof_pipe_write_data_to_child) == 0)
    {
        aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child,
            AE_WRITABLE, aofChildWriteDiffData, NULL);
    } 
  
    // Omit other details here
}

In the later stage of its rewrite, the child process reads the incremental data sent by the main process from the pipe in a loop and appends it to the temporary AOF file:

int rewriteAppendOnlyFile(char *filename) {
    // Omit other details here
  
    /* Read again a few times to get more data from the parent.
     * We can't read forever (the server may receive data from clients
     * faster than it is able to send data to the child), so we try to read
     * some more data in a loop as soon as there is a good chance more data
     * will come. If it looks like we are wasting time, we abort (this
     * happens after 20 ms without new data). */
    int nodata = 0;
    mstime_t start = mstime();
    while(mstime()-start < 1000 && nodata < 20) {
        if (aeWait(server.aof_pipe_read_data_from_parent, AE_READABLE, 1) <= 0)
        {
            nodata++;
            continue;
        }
        nodata = 0; /* Start counting from zero, we stop on N *contiguous*
                       timeouts. */
        aofReadDiffFromParent();
    }
    // Omit other details here
}

After the child process completes the rewrite, the main process finishes the work in backgroundRewriteDoneHandler. One of its tasks is to write the data in aof_rewrite_buf that was not consumed during the rewrite into the temporary AOF file. If a lot of data remains in aof_rewrite_buf, this also consumes CPU time.

void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    // Omit other details here
  
    /* Flush the differences accumulated by the parent to the rewritten AOF. */
    if (aofRewriteBufferWrite(newfd) == -1) {
        serverLog(LL_WARNING,
                "Error trying to flush the parent diff to the rewritten AOF: %s", strerror(errno));
        close(newfd);
        goto cleanup;
    }

    // Omit other details here
}

The CPU overhead of AOFRW can cause response time (RT) jitter while Redis executes commands, and it can even cause client timeouts.

Disk IO overhead

As mentioned earlier, during AOFRW the main process writes each executed write command to aof_buf and also writes a copy to aof_rewrite_buf. The data in aof_buf is eventually written to the old AOF file currently in use, which generates disk IO. The data in aof_rewrite_buf is also written to the new AOF file produced by the rewrite, which generates disk IO again. The same data is therefore written to disk twice.

Code complexity

Redis uses the six pipes shown below for data transfer and control interaction between the main process and the child process. This makes the whole AOFRW logic more complex and harder to understand.

/* AOF pipes used to communicate between parent and child during rewrite. */
 int aof_pipe_write_data_to_child;
 int aof_pipe_read_data_from_parent;
 int aof_pipe_write_ack_to_parent;
 int aof_pipe_read_ack_from_child;
 int aof_pipe_write_ack_to_child;
 int aof_pipe_read_ack_from_parent;

MP-AOF implementation

Solution overview

As the name suggests, MP-AOF splits the original single AOF file into multiple AOF files. MP-AOF divides these files into three types:

  • BASE: the base AOF, usually produced by the child process during a rewrite. There is at most one BASE file.
  • INCR: an incremental AOF, usually created when AOFRW starts. There may be several INCR files.
  • HISTORY: a historical AOF, which was previously a BASE or INCR AOF. Each time an AOFRW completes successfully, the BASE and INCR AOFs that existed before it become HISTORY, and Redis deletes HISTORY AOFs automatically.

To manage these AOF files, we introduced a manifest file that tracks them. To make AOF backup and copying easier, all AOF files and the manifest file are placed in a separate directory. Its name is set by the appenddirname configuration, a new configuration item in Redis 7.0.


Figure 2: Principle of MP-AOF rewrite

Figure 2 shows the general flow of one AOFRW under MP-AOF. As before, Redis forks a child process to perform the rewrite. At the same time, the main process opens a new INCR AOF file, and all data changes made during the child's rewrite are written to it. The child's rewrite is completely independent, with no data or control interaction with the main process. The rewrite finally produces a BASE AOF. Together, the new BASE AOF and the newly opened INCR AOF represent all of Redis's data at that point. At the end of AOFRW, the main process updates the manifest file. It adds the new BASE AOF (the new INCR AOF was already recorded when it was opened) and marks the previous BASE AOF and INCR AOFs as HISTORY. Redis deletes these HISTORY AOFs asynchronously. Once the manifest file has been updated, the AOFRW process is complete.
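The bookkeeping in Figure 2 can be modelled with a tiny in-memory manifest. This sketch is hypothetical: Entry, ManifestModel, rewriteStart and rewriteDone are not Redis names, and errors are reduced to a return code. Starting a rewrite only appends a new INCR record. Finishing it adds a new BASE and turns the old BASE and every older INCR into HISTORY, while the INCR opened at the start stays live.

```c
#define MAX_FILES 16

/* One manifest record: type is 'b' (BASE), 'i' (INCR) or 'h' (HISTORY). */
typedef struct {
    long long seq;
    int is_base;    /* remembers the original kind after it becomes HISTORY */
    char type;
} Entry;

typedef struct {
    Entry e[MAX_FILES];
    int n;
    long long base_seq;   /* last BASE sequence number handed out */
    long long incr_seq;   /* last INCR sequence number handed out */
} ManifestModel;

/* AOFRW starts: the parent opens a new INCR AOF for all new writes. */
int rewriteStart(ManifestModel *m) {
    if (m->n == MAX_FILES) return -1;
    m->e[m->n++] = (Entry){ ++m->incr_seq, 0, 'i' };
    return 0;
}

/* AOFRW succeeds: the child's output becomes the new BASE. The old BASE and
 * every INCR except the one opened by rewriteStart become HISTORY. */
int rewriteDone(ManifestModel *m) {
    if (m->n == MAX_FILES) return -1;
    for (int i = 0; i < m->n; i++) {
        Entry *x = &m->e[i];
        if (x->type == 'b' || (x->type == 'i' && x->seq != m->incr_seq))
            x->type = 'h';
    }
    m->e[m->n++] = (Entry){ ++m->base_seq, 1, 'b' };
    return 0;
}
```

Start from BASE 1 and INCR 1. After rewriteStart and rewriteDone, BASE 1 and INCR 1 are HISTORY, and BASE 2 and INCR 2 are live. This matches the manifest example in the AOFRW Crash Safety section.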

As Figure 2 shows, aof_rewrite_buf is no longer needed during AOFRW, so its memory consumption disappears. There is also no data transfer or control interaction between the main process and the child process, so the corresponding CPU overhead disappears too. In addition, writes made during the rewrite go only to the new INCR AOF, so they are no longer written to disk twice. Accordingly, the six pipes mentioned above and their code have been deleted, making the AOFRW logic simpler and clearer.

Key implementation

Manifest

Representation in memory

MP-AOF depends heavily on the manifest file. In memory, the manifest is represented by the following structures, where:

  • aofInfo: describes one AOF file. Currently it contains only the file name, file sequence number and file type.
  • base_aof_info: the BASE AOF information. It is NULL when there is no BASE AOF.
  • incr_aof_list: stores the information of all INCR AOF files, in the order in which they were opened.
  • history_aof_list: stores HISTORY AOF information. Its elements are moved from base_aof_info and incr_aof_list.
typedef struct {
    sds           file_name;  /* file name */
    long long     file_seq;   /* file sequence */
    aof_file_type file_type;  /* file type */
} aofInfo;
typedef struct {
    aofInfo     *base_aof_info;       /* BASE file information. NULL if there is no BASE file. */
    list        *incr_aof_list;       /* INCR AOFs list. We may have multiple INCR AOF when rewrite fails. */
    list        *history_aof_list;    /* HISTORY AOF list. When the AOFRW success, The aofInfo contained in
                                         `base_aof_info` and `incr_aof_list` will be moved to this list. We
                                         will delete these AOF files when AOFRW finish. */
    long long   curr_base_file_seq;   /* The sequence number used by the current BASE file. */
    long long   curr_incr_file_seq;   /* The sequence number used by the current INCR file. */
    int         dirty;                /* 1 Indicates that the aofManifest in the memory is inconsistent with
                                         disk, we need to persist it immediately. */
} aofManifest;

To make atomic modification and rollback easier, the redisServer structure references aofManifest through a pointer.

struct redisServer {
    // Omit other details here
    aofManifest *aof_manifest;       /* Used to track AOFs. */
    // Omit other details here
};

Representation on disk

The manifest is essentially a text file containing multiple lines of records. Each line describes one AOF file as a series of key/value pairs, which is easy for Redis to process, read and modify. Here is a possible manifest file:

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i

The manifest format itself needs to be extensible so that other features can be supported in the future. For example, new key/value pairs and comments (similar to annotations in AOF) can easily be added, which gives good forward compatibility. The example below shows the same kind of manifest with keys in a different order, an extra key/value pair, and a comment line:

file appendonly.aof.1.base.rdb seq 1 type b newkey newvalue
file appendonly.aof.1.incr.aof type i seq 1 
# this is annotations
seq 2 type i file appendonly.aof.2.incr.aof
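Below is one way to parse such lines that tolerates reordered keys, extra keys and comments. ParsedAofInfo and parseManifestLine are hypothetical names, and this is not Redis's own manifest loader. It only illustrates why a key/value layout lets older readers keep working when new keys are added.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char file_name[128];
    long long file_seq;
    char file_type;            /* 'b', 'i' or 'h' */
} ParsedAofInfo;

/* Returns 1 for a record, 0 for a blank/comment line, -1 if malformed. */
int parseManifestLine(const char *line, ParsedAofInfo *info) {
    char buf[512];
    char *argv[32];
    int argc = 0;
    int have_file = 0, have_seq = 0, have_type = 0;

    while (*line == ' ' || *line == '\t') line++;
    if (*line == '\0' || *line == '\r' || *line == '\n' || *line == '#') return 0;
    if (strlen(line) >= sizeof buf) return -1;
    strcpy(buf, line);

    /* Split into whitespace-separated tokens, in place. */
    for (char *p = buf; *p != '\0'; ) {
        while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') *p++ = '\0';
        if (*p == '\0') break;
        if (argc == 32) return -1;
        argv[argc++] = p;
        while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\r' && *p != '\n') p++;
    }
    if (argc % 2 != 0) return -1;          /* keys and values must pair up */

    for (int i = 0; i < argc; i += 2) {
        const char *k = argv[i], *v = argv[i + 1];
        if (strcmp(k, "file") == 0) {
            if (strlen(v) >= sizeof info->file_name) return -1;
            strcpy(info->file_name, v);
            have_file = 1;
        } else if (strcmp(k, "seq") == 0) {
            char *end;
            info->file_seq = strtoll(v, &end, 10);
            if (*end != '\0') return -1;
            have_seq = 1;
        } else if (strcmp(k, "type") == 0) {
            if (strlen(v) != 1 || strchr("bih", v[0]) == NULL) return -1;
            info->file_type = v[0];
            have_type = 1;
        }
        /* Unknown keys are skipped so newer manifests stay readable. */
    }
    return (have_file && have_seq && have_type) ? 1 : -1;
}
```

All four lines of the extended example above parse: the three records return 1, and the comment line returns 0.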

File naming rules

Before MP-AOF, the AOF file name was the value of the appendfilename parameter (appendonly.aof by default).

In MP-AOF, the AOF files are named in the form basename.suffix. The value of appendfilename is used as the basename, and the suffix consists of three parts in the form seq.type.format, where:

  • seq is the file's sequence number, which increases monotonically from 1. BASE and INCR files have independent sequence numbers.
  • type is the AOF type, indicating whether the file is a BASE or an INCR AOF.
  • format is the internal encoding of the AOF. Because Redis supports the RDB preamble mechanism, a BASE AOF may be encoded in either RDB or AOF format.

The corresponding suffixes are defined as follows:

#define BASE_FILE_SUFFIX           ".base"
#define INCR_FILE_SUFFIX           ".incr"
#define RDB_FORMAT_SUFFIX          ".rdb"
#define AOF_FORMAT_SUFFIX          ".aof"
#define MANIFEST_NAME_SUFFIX       ".manifest"

Therefore, with the default appendfilename, the BASE, INCR and manifest files may be named as follows:

appendonly.aof.1.base.rdb // RDB preamble enabled
appendonly.aof.1.base.aof // RDB preamble disabled
appendonly.aof.1.incr.aof
appendonly.aof.2.incr.aof
appendonly.aof.manifest
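The naming rule can be written as a small helper. makeAofFileName and makeManifestName are hypothetical names (not the Redis functions) that reuse the suffix macros above. The sketch assumes the RDB preamble setting only affects BASE files, as the listing shows.

```c
#include <stdio.h>
#include <string.h>

#define BASE_FILE_SUFFIX     ".base"
#define INCR_FILE_SUFFIX     ".incr"
#define RDB_FORMAT_SUFFIX    ".rdb"
#define AOF_FORMAT_SUFFIX    ".aof"
#define MANIFEST_NAME_SUFFIX ".manifest"

/* Build "<basename>.<seq><type><format>", e.g. appendonly.aof.1.base.rdb.
 * Returns the snprintf result, or -1 for an empty basename. */
int makeAofFileName(char *out, size_t cap, const char *base_name,
                    long long seq, int is_base, int rdb_preamble) {
    if (base_name == NULL || strlen(base_name) == 0) return -1;
    const char *type = is_base ? BASE_FILE_SUFFIX : INCR_FILE_SUFFIX;
    const char *format = (is_base && rdb_preamble) ? RDB_FORMAT_SUFFIX
                                                   : AOF_FORMAT_SUFFIX;
    return snprintf(out, cap, "%s.%lld%s%s", base_name, seq, type, format);
}

/* The manifest carries no sequence number: "<basename>.manifest". */
int makeManifestName(char *out, size_t cap, const char *base_name) {
    if (base_name == NULL || strlen(base_name) == 0) return -1;
    return snprintf(out, cap, "%s%s", base_name, MANIFEST_NAME_SUFFIX);
}
```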

Compatibility with upgrades from older versions

Since MP-AOF depends heavily on the manifest file, Redis loads AOF files at startup strictly according to the manifest. However, when Redis is upgraded from a version before 7.0 to Redis 7.0, there is no manifest file yet. Redis must still recognize that this is an upgrade and load the old AOF correctly and safely, so this capability has to be supported.

Recognizing the upgrade is the first step. Before loading AOF files, Redis checks whether a file named server.aof_filename exists in the Redis working directory. If it does, Redis may be upgrading from an older version. The startup is then treated as an upgrade if any one of the following three conditions holds:

  • The appenddirname directory does not exist.
  • The appenddirname directory exists, but it contains no manifest file.
  • The appenddirname directory exists and contains a manifest file, and all three of the following are true: the manifest contains only a BASE AOF record; that BASE AOF is named server.aof_filename; and no file named server.aof_filename exists in the appenddirname directory.
/* Load the AOF files according the aofManifest pointed by am. */
int loadAppendOnlyFiles(aofManifest *am) {
    // Omit other details here
  
    /* If the 'server.aof_filename' file exists in dir, we may be starting
     * from an old redis version. We will use enter upgrade mode in three situations.
     *
     * 1. If the 'server.aof_dirname' directory not exist
     * 2. If the 'server.aof_dirname' directory exists but the manifest file is missing
     * 3. If the 'server.aof_dirname' directory exists and the manifest file it contains
     *    has only one base AOF record, and the file name of this base AOF is 'server.aof_filename',
     *    and the 'server.aof_filename' file not exist in 'server.aof_dirname' directory
     * */
    if (fileExist(server.aof_filename)) {
        if (!dirExists(server.aof_dirname) ||
            (am->base_aof_info == NULL && listLength(am->incr_aof_list) == 0) ||
            (am->base_aof_info != NULL && listLength(am->incr_aof_list) == 0 &&
             !strcmp(am->base_aof_info->file_name, server.aof_filename) && !aofFileExist(server.aof_filename)))
        {
            aofUpgradePrepare(am);
        }
    }
  
    // Omit other details here
}
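If the inputs are flattened into flags, the startup check above becomes a pure predicate, which makes the crash-safety argument easy to test. UpgradeProbe and shouldUpgrade are hypothetical names. Their fields correspond to the conditions in loadAppendOnlyFiles. A missing manifest is represented as a manifest with no BASE and no INCR records, which is how the code above detects it.

```c
/* Facts inspected at startup, flattened into flags (hypothetical). */
typedef struct {
    int old_aof_in_workdir;     /* server.aof_filename exists in the working dir */
    int aof_dir_exists;         /* appenddirname directory exists */
    int has_base;               /* manifest has a BASE record */
    int incr_count;             /* number of INCR records in the manifest */
    int base_named_like_old;    /* BASE file name == server.aof_filename */
    int base_file_in_aof_dir;   /* that BASE file exists inside appenddirname */
} UpgradeProbe;

/* Returns 1 when startup should take the upgrade path. */
int shouldUpgrade(const UpgradeProbe *p) {
    if (!p->old_aof_in_workdir) return 0;
    if (!p->aof_dir_exists) return 1;                         /* case 1 */
    if (!p->has_base && p->incr_count == 0) return 1;         /* case 2 */
    if (p->has_base && p->incr_count == 0 &&
        p->base_named_like_old && !p->base_file_in_aof_dir)   /* case 3 */
        return 1;
    return 0;
}
```

Consider a crash after the manifest has been persisted but before the old file has been moved. That state matches case 3, so the next start retries the upgrade. Once the move has happened, the old file is no longer in the working directory and the check returns 0.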

Once an upgrade startup is recognized, the aofUpgradePrepare function prepares the upgrade.

Upgrade preparation consists of three parts:

  • Construct a BASE AOF record whose file name is server.aof_filename.
  • Persist this BASE AOF record to the manifest file.
  • Use rename to move the old AOF file into the appenddirname directory.
void aofUpgradePrepare(aofManifest *am) {
    // Omit other details here
  
    /* 1. Manually construct a BASE type aofInfo and add it to aofManifest. */
    if (am->base_aof_info) aofInfoFree(am->base_aof_info);
    aofInfo *ai = aofInfoCreate();
    ai->file_name = sdsnew(server.aof_filename);
    ai->file_seq = 1;
    ai->file_type = AOF_FILE_TYPE_BASE;
    am->base_aof_info = ai;
    am->curr_base_file_seq = 1;
    am->dirty = 1;
    /* 2. Persist the manifest file to AOF directory. */
    if (persistAofManifest(am) != C_OK) {
        exit(1);
    }
    /* 3. Move the old AOF file to AOF directory. */
    sds aof_filepath = makePath(server.aof_dirname, server.aof_filename);
    if (rename(server.aof_filename, aof_filepath) == -1) {
        sdsfree(aof_filepath);
        exit(1);
    }
  
    // Omit other details here
}

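The upgrade preparation is crash-safe. If a crash occurs during any of the three steps above, the next startup correctly recognizes the situation and retries the whole upgrade. For example, the manifest is persisted before the old AOF file is moved. A crash between those two steps therefore leaves exactly the state described by the third condition above.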

Multi-file loading and progress calculation

Redis records its progress while loading AOF and exposes it through the loading_loaded_perc field of Redis INFO. In MP-AOF, the loadAppendOnlyFiles function loads AOF files according to the aofManifest passed in. Before loading, it must compute the total size of all AOF files to be loaded and pass it to startLoading. loadSingleAppendOnlyFile then keeps reporting progress as it loads.

Next, loadAppendOnlyFiles loads the BASE AOF and the INCR AOFs according to aofManifest. When all AOF files have been loaded, stopLoading ends the loading state.

int loadAppendOnlyFiles(aofManifest *am) {
    // Omit other details here
    /* Here we calculate the total size of all BASE and INCR files in
     * advance, it will be set to `server.loading_total_bytes`. */
    total_size = getBaseAndIncrAppendOnlyFilesSize(am);
    startLoading(total_size, RDBFLAGS_AOF_PREAMBLE, 0);
    /* Load BASE AOF if needed. */
    if (am->base_aof_info) {
        aof_name = (char*)am->base_aof_info->file_name;
        updateLoadingFileName(aof_name);
        loadSingleAppendOnlyFile(aof_name);
    }
    /* Load INCR AOFs if needed. */
    if (listLength(am->incr_aof_list)) {
        listNode *ln;
        listIter li;
        listRewind(am->incr_aof_list, &li);
        while ((ln = listNext(&li)) != NULL) {
            aofInfo *ai = (aofInfo*)ln->value;
            aof_name = (char*)ai->file_name;
            updateLoadingFileName(aof_name);
            loadSingleAppendOnlyFile(aof_name);
        }
    }
  
    server.aof_current_size = total_size;
    server.aof_rewrite_base_size = server.aof_current_size;
    server.aof_fsync_offset = server.aof_current_size;
    stopLoading();
    
    // Omit other details here
}
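Computing the total up front lets a single percentage span several files. The hypothetical helper below shows the arithmetic, assuming files are loaded in manifest order (BASE first, then each INCR). Redis itself keeps running byte counters rather than recomputing them, and reports the result as loading_loaded_perc.

```c
#include <stddef.h>

/* Percentage of all bytes loaded so far, given the sizes of the files in
 * load order, how many files are fully loaded, and the offset reached in
 * the file currently being loaded. */
int loadingPercent(const long long *sizes, size_t nfiles,
                   size_t files_done, long long offset_in_current) {
    long long total = 0, loaded = offset_in_current;
    for (size_t i = 0; i < nfiles; i++) total += sizes[i];
    for (size_t i = 0; i < files_done && i < nfiles; i++) loaded += sizes[i];
    if (total <= 0) return 100;          /* nothing to load */
    return (int)(loaded * 100 / total);
}
```

With files of 600, 300 and 100 bytes, being 150 bytes into the first INCR file gives 75%. Without the precomputed total, progress would restart from 0% at every file.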

AOFRW Crash Safety

When the child process completes the rewrite, it produces a temporary AOF file named temp-rewriteaof-bg-<pid>.aof. At this point the file is still invisible to Redis because it has not been added to the manifest file. For Redis to recognize it and load it correctly at startup, the file must also be renamed according to the naming rules described above, and its information must be added to the manifest file.

Renaming the AOF file and modifying the manifest file are two independent operations. Even so, the pair must behave atomically so that Redis loads the correct AOF files at startup. MP-AOF uses two design choices to achieve this:

  • The BASE AOF name contains the file sequence number, so a newly created BASE AOF never collides with a previous BASE AOF.
  • The AOF file is renamed first and the manifest file is modified afterwards. A crash in between leaves the old manifest in effect, and it still references only the old files.

For ease of illustration, we assume that before AOFRW starts, the contents of the manifest file are as follows:

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i

After AOFRW starts executing, the contents of the manifest file are as follows:

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i

After the child process finishes the rewrite, the main process renames temp-rewriteaof-bg-<pid>.aof to appendonly.aof.2.base.rdb and adds it to the manifest. At the same time, the previous BASE and INCR AOFs are marked as HISTORY. The manifest now contains:

file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.1.base.rdb seq 1 type h
file appendonly.aof.1.incr.aof seq 1 type h
file appendonly.aof.2.incr.aof seq 2 type i

At this time, the results of this AOFRW are visible to Redis, and the HISTORY AOF will be asynchronously cleaned up by Redis.

The backgroundRewriteDoneHandler function implements the above logic in seven steps:

  • Before modifying server.aof_manifest in memory, duplicate it into a temporary manifest structure, and apply all the following modifications to that copy. If any later step fails, the temporary manifest can simply be destroyed to roll back the whole operation without polluting the global server.aof_manifest.
  • Get the new BASE AOF file name (new_base_filename) from the temporary manifest, and mark the previous BASE AOF (if any) as HISTORY.
  • Rename the temporary file temp-rewriteaof-bg-<pid>.aof generated by the child process to new_base_filename.
  • In the temporary manifest, mark as HISTORY every INCR AOF covered by this rewrite, that is, all INCR AOFs except the one opened when the rewrite started.
  • Persist the temporary manifest to disk. persistAofManifest guarantees that the manifest file itself is modified atomically.
  • If all the steps above succeed, server.aof_manifest can safely be pointed at the temporary manifest structure, and the previous structure is freed. From this moment, the whole modification is visible to Redis.
  • Clean up the HISTORY AOFs. This step is allowed to fail because it cannot cause data consistency problems.
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    snprintf(tmpfile, 256, "temp-rewriteaof-bg-%d.aof",
        (int)server.child_pid);
    /* 1. Dup a temporary aof_manifest for subsequent modifications. */
    temp_am = aofManifestDup(server.aof_manifest);
    /* 2. Get a new BASE file name and mark the previous (if we have)
     * as the HISTORY type. */
    new_base_filename = getNewBaseFileNameAndMarkPreAsHistory(temp_am);
    /* 3. Rename the temporary aof file to 'new_base_filename'. */
    if (rename(tmpfile, new_base_filename) == -1) {
        aofManifestFree(temp_am);
        goto cleanup;
    }
    /* 4. Change the AOF file type in 'incr_aof_list' from AOF_FILE_TYPE_INCR
     * to AOF_FILE_TYPE_HIST, and move them to the 'history_aof_list'. */
    markRewrittenIncrAofAsHistory(temp_am);
    /* 5. Persist our modifications. */
    if (persistAofManifest(temp_am) == C_ERR) {
        bg_unlink(new_base_filename);
        aofManifestFree(temp_am);
        goto cleanup;
    }
    /* 6. We can safely let `server.aof_manifest` point to 'temp_am' and free the previous one. */
    aofManifestFreeAndUpdate(temp_am);
    /* 7. We don't care about the return value of `aofDelHistoryFiles`, because the history
     * deletion failure will not cause any problems. */
    aofDelHistoryFiles();
}
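Steps 1, 5 and 6 form a copy-on-write pattern. All fallible work happens on a private copy, and the change is published by a single pointer assignment. The sketch below reduces the manifest to two counters. MiniManifest, persistMiniManifest and commitRewrite are hypothetical, and the simulate_persist_ok flag exists only to exercise the failure path.

```c
#include <stdlib.h>

/* Two-field stand-in for aofManifest (hypothetical). */
typedef struct {
    long long curr_base_file_seq;
    int history_count;
} MiniManifest;

/* Stand-in for persistAofManifest(); `simulate_ok` forces success/failure.
 * A real implementation must make the on-disk update atomic. */
static int persistMiniManifest(const MiniManifest *am, int simulate_ok) {
    (void)am;
    return simulate_ok ? 0 : -1;
}

/* Publish a finished rewrite. `*current` plays the role of
 * server.aof_manifest and is replaced only after persistence succeeds. */
int commitRewrite(MiniManifest **current, int simulate_persist_ok) {
    MiniManifest *tmp = malloc(sizeof *tmp);
    if (tmp == NULL) return -1;
    *tmp = **current;                          /* 1. dup */
    tmp->curr_base_file_seq++;                 /* 2. new BASE ... */
    tmp->history_count++;                      /*    ... old BASE -> HISTORY */
    if (persistMiniManifest(tmp, simulate_persist_ok) != 0) {
        free(tmp);                             /* roll back: *current untouched */
        return -1;
    }
    free(*current);                            /* 6. swap in the new manifest */
    *current = tmp;
    return 0;
}
```

A failed commit leaves the published manifest exactly as it was. A successful one replaces it in one step.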

Support AOF truncate

When the process crashes, the AOF file may be left incompletely written. For example, Redis may crash after writing the MULTI of a transaction but before writing EXEC. Redis cannot load such an incomplete AOF as-is, but it supports AOF truncation, controlled by the aof-load-truncated configuration. The idea is to track the last valid offset of the AOF in server.aof_current_size and then use ftruncate to delete all content after that offset. Some data may be lost, but the integrity of the AOF is preserved.

In MP-AOF, server.aof_current_size no longer represents the size of a single AOF file but the total size of all AOF files. Only the last INCR AOF can be incompletely written, so we introduced a separate field, server.aof_last_incr_size, to track the size of the last INCR AOF file. When the last INCR AOF is incompletely written, only the content after server.aof_last_incr_size needs to be deleted.

if (ftruncate(server.aof_fd, server.aof_last_incr_size) == -1) {
    // Omit other details here
}
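The truncation itself is a single POSIX ftruncate call on the last INCR AOF. The sketch below uses the hypothetical helpers appendComplete and demoTruncateTornWrite. It writes one complete command, then simulates a write torn in the middle of a command. Finally, it cuts the file back to the size recorded after the last complete write, which is the role server.aof_last_incr_size plays.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Append a complete command; return the new last-known-good size, or -1. */
static long long appendComplete(int fd, const char *cmd, long long good_size) {
    size_t len = strlen(cmd);
    if (write(fd, cmd, len) != (ssize_t)len) return -1;
    return good_size + (long long)len;
}

/* Demo: one complete command, a torn write, then truncation back to the
 * last good size. Returns the final file size, or -1 on error. */
long long demoTruncateTornWrite(const char *path) {
    const char *set_cmd = "*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$1\r\nv\r\n";
    const char *torn = "*1\r\n$4\r\nEX";          /* crash mid-command */
    long long result = -1, good;
    struct stat st;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    good = appendComplete(fd, set_cmd, 0);
    if (good >= 0 &&
        write(fd, torn, strlen(torn)) == (ssize_t)strlen(torn) &&
        ftruncate(fd, (off_t)good) == 0 &&       /* drop the torn tail */
        fstat(fd, &st) == 0)
        result = (long long)st.st_size;
    close(fd);
    unlink(path);
    return result;
}
```

The result is 27, the length of the complete SET command, because the partial bytes after it are discarded.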

AOFRW rate limiting

Redis can run AOFRW automatically when the AOF size exceeds a certain threshold. When AOFRW fails, for example because of a disk failure or a code bug, Redis retries AOFRW repeatedly until it succeeds. Before MP-AOF, this was not a serious problem: at most it cost some CPU time and fork overhead. In MP-AOF, however, every AOFRW opens a new INCR AOF. The previous INCR and BASE AOFs become HISTORY and are deleted only when an AOFRW succeeds. Consecutive AOFRW failures therefore inevitably leave multiple INCR AOFs side by side. In extreme cases, with frequent retries, there may be hundreds of INCR AOF files.

We therefore introduced an AOFRW rate-limiting mechanism. Once AOFRW has failed three times in a row, the next automatic AOFRW is forcibly delayed by 1 minute. If that one also fails, the delay becomes 2 minutes, then 4, 8, 16 minutes, and so on. The maximum delay is currently 1 hour.
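Read literally, the schedule imposes no delay for the first two consecutive failures. After that, the delays are 1, 2, 4, 8, 16 and 32 minutes, capped at 60. The function below encodes that reading. Its name and the exact way failures are counted are assumptions, not taken from the Redis source.

```c
/* Minutes to wait before the next automatic AOFRW, given the number of
 * consecutive failures so far (interpretation of the text above). */
int aofrwDelayMinutes(int consecutive_failures) {
    const int threshold = 3, max_delay = 60;
    if (consecutive_failures < threshold) return 0;
    int delay = 1;
    /* Double once per failure beyond the threshold, stopping at the cap. */
    for (int i = threshold; i < consecutive_failures && delay < max_delay; i++)
        delay *= 2;
    return delay < max_delay ? delay : max_delay;
}
```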

While AOFRW is rate-limited, the BGREWRITEAOF command can still be used to run AOFRW immediately.

if (server.aof_state == AOF_ON &&
    !hasActiveChildProcess() &&
    server.aof_rewrite_perc &&
    server.aof_current_size > server.aof_rewrite_min_size &&
    !aofRewriteLimited())
{
    long long base = server.aof_rewrite_base_size ?
        server.aof_rewrite_base_size : 1;
    long long growth = (server.aof_current_size*100/base) - 100;
    if (growth >= server.aof_rewrite_perc) {
        rewriteAppendOnlyFileBackground();
    }
}
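The trigger condition is plain integer arithmetic. With the defaults auto-aof-rewrite-percentage 100 and auto-aof-rewrite-min-size 64mb, an AOF that was 64 MB after the last rewrite triggers a new one at 128 MB. These configs back server.aof_rewrite_perc and server.aof_rewrite_min_size. The hypothetical shouldAutoRewrite below isolates the size test from the code above and leaves out the AOF state, child-process and rate-limit checks.

```c
/* Size part of the automatic AOFRW trigger. Sizes are in bytes. */
int shouldAutoRewrite(long long current_size, long long base_size,
                      long long min_size, int rewrite_perc) {
    if (rewrite_perc == 0 || current_size <= min_size) return 0;
    long long base = base_size ? base_size : 1;      /* avoid dividing by 0 */
    long long growth = (current_size * 100 / base) - 100;
    return growth >= rewrite_perc;
}
```

At 127 MB the growth is 12700 / 64 - 100 = 98 in integer arithmetic, so no rewrite is triggered. At 128 MB it reaches exactly 100.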

AOFRW rate limiting also avoids the CPU and fork overhead of high-frequency AOFRW retries. This matters because many RT jitters in Redis are related to fork.

Summary

MP-AOF removes the adverse impact that AOFRW's memory and CPU overhead had on Redis instances, and even on business traffic. While solving these problems, we also ran into many unexpected challenges, mainly because of Redis's huge user base and diverse usage scenarios. We had to consider the problems users might meet with MP-AOF in each scenario: compatibility, ease of use, and keeping changes to the Redis code as unintrusive as possible. This is the top priority when evolving features in the Redis community.

MP-AOF also opens up new possibilities for Redis data persistence. For example, when aof-use-rdb-preamble is enabled, the BASE AOF is essentially an RDB file. A full backup then no longer needs a separate BGSAVE, because the BASE AOF can be backed up directly. MP-AOF can also turn off the automatic cleanup of HISTORY AOFs, so historical AOFs can be retained. Redis already supports adding timestamp annotations to AOF. Building on all of this, we could even implement a simple PITR (point-in-time recovery) capability.

The design of MP-AOF is based on the binlog implementation of Tair for Redis Enterprise Edition, a core feature proven in Alibaba Cloud's Tair service. On top of this feature, Alibaba Cloud Tair has built enterprise-grade capabilities such as global active-active deployment and PITR, which meet users' needs in more business scenarios. We have now contributed this core capability to the Redis community. We hope community users can also benefit from these enterprise-grade features and use them to better optimize and build their own business code. For more details about MP-AOF, please refer to the relevant PR (#9788), which contains more of the original design and the complete code.

This article is the original content of Alibaba cloud and cannot be reproduced without permission.
