Analysis of redis cluster source code

Posted by tony.j.jackson@o2.co.uk on Wed, 19 Jan 2022 16:14:09 +0100

This analysis of the Redis Cluster source code is split into two parts. The first part looks at how, in cluster mode, a command is routed to the server that should process it. The second part analyzes the cluster failover code.

This article covers
1. How a command request is processed in cluster mode
2. What to watch out for in cluster mode
In Redis cluster mode, each instance handles only part of the command requests. For example, if we want to look up a key, which instance should process that command? To answer this, we first need the data structures of Redis Cluster.

// Node data structure in cluster mode
struct clusterNode {
    // When the node was created
    mstime_t ctime; /* Node object creation time. */

    // The name of the node, consisting of 40 hexadecimal characters
    // For example, 68eef66df23420a5862208ef5b1a7005b806f2ff
    char name[REDIS_CLUSTER_NAMELEN]; /* Node name, hex string, sha1-size */

    // Node flags
    // Flag values record the role of the node (such as master or slave)
    // and its current state (such as online or offline).
    int flags;      /* REDIS_NODE_... */

    // The node's current configuration epoch, used for failover
    uint64_t configEpoch; /* Last configEpoch observed for this node */

    // The slots handled by this node
    // The array is REDIS_CLUSTER_SLOTS / 8 bytes long
    // Each bit of each byte records the state of one slot
    // A value of 1 means the slot is handled by this node, and 0 means it is not
    // For example, the first bit of slots[0] records slot 0,
    // the second bit of slots[0] records slot 1, and so on
    unsigned char slots[REDIS_CLUSTER_SLOTS/8]; /* slots handled by this node */

    // The number of slots processed by this node
    int numslots;   /* Number of slots handled by this node */

    // If this node is the master node, use this attribute to record the number of slave nodes
    int numslaves;  /* Number of slave nodes, if this is a master */

    // Pointer array, pointing to each slave node
    struct clusterNode **slaves; /* pointers to slave nodes */

    // If this is a slave node, point to the master node
    struct clusterNode *slaveof; /* pointer to the master node */

    // The last time the PING command was sent
    mstime_t ping_sent;      /* Unix time we sent latest ping */

    // Timestamp of the last time a PONG reply was received
    mstime_t pong_received;  /* Unix time we received the pong */

    // The last time it was set to FAIL
    mstime_t fail_time;      /* Unix time when FAIL flag was set */

    // The last time we voted for a slave of this master
    mstime_t voted_time;     /* Last time we voted for a slave of this master */

    // The last time the replication offset was received from this node
    mstime_t repl_offset_time;  /* Unix time we received offset for this node */

    // Replication offset of this node
    long long repl_offset;      /* Last known repl offset for this node. */

    // IP address of the node
    char ip[REDIS_IP_STR_LEN];  /* Latest known IP address of this node */

    // Port number of the node
    int port;                   /* Latest known port of this node */

    // Save the relevant information required to connect to the node
    clusterLink *link;          /* TCP/IP link with this node */

    // A linked list that records the failure reports other nodes have made about this node
    list *fail_reports;         /* List of nodes signaling this as failing */

};

Pay attention to the slots array in clusterNode, which records the slots handled by the node. It is a bitmap: if the bit for a slot is 1, the node is responsible for that slot; otherwise it is not.
About slots: in cluster mode, the key space is divided into 16384 slots in total. Each node in the cluster (note: in this article, node = instance) is responsible for some of the slots.
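
As a rough illustration of the bitmap described above, here is a minimal sketch of setting, clearing and testing slot bits. The helper names (slot_bitmap_set and friends) are made up for this example, and CLUSTER_SLOTS stands in for REDIS_CLUSTER_SLOTS. It assumes the layout given in the struct comments, taking "the first bit of slots[0]" to mean the lowest bit of that byte.

```c
#include <assert.h>

#define CLUSTER_SLOTS 16384  /* stand-in for REDIS_CLUSTER_SLOTS */

/* Mark 'slot' as handled by the node that owns this bitmap. */
static void slot_bitmap_set(unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    bitmap[slot >> 3] |= (unsigned char)(1 << (slot & 7));
}

/* Mark 'slot' as no longer handled by this node. */
static void slot_bitmap_clear(unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    bitmap[slot >> 3] &= (unsigned char)~(1 << (slot & 7));
}

/* Return 1 if 'slot' is handled by this node, 0 otherwise. */
static int slot_bitmap_test(const unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    return (bitmap[slot >> 3] & (1 << (slot & 7))) != 0;
}

/* Count the slots set in the bitmap (what numslots caches in clusterNode). */
static int slot_bitmap_count(const unsigned char *bitmap) {
    int slot, count = 0;
    for (slot = 0; slot < CLUSTER_SLOTS; slot++)
        count += slot_bitmap_test(bitmap, slot);
    return count;
}
```

With slots 0 and 9 set, slots[0] holds 0x01 and slots[1] holds 0x02. The whole bitmap takes 16384 / 8 = 2048 bytes no matter how many slots the node handles.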

The data structure that holds the cluster state in cluster mode is given below

// Cluster state. Each node keeps its own copy, recording the cluster as seen from that node.
// Although this structure mainly records cluster-wide attributes, to save resources
// some node-related attributes, such as slots_to_keys and failover_auth_count,
// are also stored in it.
typedef struct clusterState {

    // Pointer to the current node
    clusterNode *myself;  /* This node */

    // The cluster's current epoch, used for failover
    uint64_t currentEpoch;

    // Current status of the cluster: online or offline
    int state;            /* REDIS_CLUSTER_OK, REDIS_CLUSTER_FAIL, ... */

    // The number of nodes in the cluster that handle at least one slot.
    int size;             /* Num of master nodes with at least one slot */

    // Table of cluster nodes (including myself)
    // The dictionary key is the node name, and the dictionary value is the clusterNode structure
    dict *nodes;          /* Hash table of name -> clusterNode structures */

    // Node blacklist, used by the CLUSTER FORGET command
    // Prevents a node removed by FORGET from being added back to the cluster right away
    // (but it doesn't seem to be in use now. Has it been abandoned? Or hasn't it been implemented yet?)
    dict *nodes_black_list; /* Nodes we don't re-add for a few seconds. */

    // Records the slots being migrated from this node, and the target node of each migration
    // migrating_slots_to[i] = NULL means slot i is not being migrated
    // migrating_slots_to[i] = clusterNode_A means slot i is being migrated from this node to node A
    clusterNode *migrating_slots_to[REDIS_CLUSTER_SLOTS];

    // Records the slots being imported into this node, and the source node of each import
    // importing_slots_from[i] = NULL means slot i is not being imported
    // importing_slots_from[i] = clusterNode_A means slot i is being imported from node A
    clusterNode *importing_slots_from[REDIS_CLUSTER_SLOTS];

    // The node that handles each slot
    // For example, slots[i] = clusterNode_A means slot i is handled by node A
    clusterNode *slots[REDIS_CLUSTER_SLOTS];

    // Skip list that uses slots as scores and keys as members, keeping keys ordered by slot
    // It is convenient when range operations have to be performed on the keys of some slots
    // The specific operations are defined in db.c
    zskiplist *slots_to_keys;

    /* The following fields are used to take the slave state on elections. */
    // The following fields are used for failover elections

    // The time of the last election or the next election
    mstime_t failover_auth_time; /* Time of previous or next election. */

    // Number of votes obtained by the node
    int failover_auth_count;    /* Number of votes received so far. */

    // If the value is 1, it indicates that this node has sent a voting request to other nodes
    int failover_auth_sent;     /* True if we already asked for votes. */

    int failover_auth_rank;     /* This slave rank for current auth request. */

    uint64_t failover_auth_epoch; /* Epoch of the current election. */

    /* Manual failover state in common. */
    /* Shared manual failover status */

    // Time limit for manual failover execution
    mstime_t mf_end;            /* Manual failover time limit (ms unixtime).
                                   It is zero if there is no MF in progress. */
    /* Manual failover state of master. */
    /* Manual failover state of the master */
    clusterNode *mf_slave;      /* Slave performing the manual failover. */
    /* Manual failover state of slave. */
    /* Manual failover state of the slave */
    long long mf_master_offset; /* Master offset the slave needs to start MF
                                   or zero if stil not received. */
    // Flag value indicating whether manual failover can begin
    // A non-zero value means the node can start requesting votes from the masters
    int mf_can_start;           /* If non-zero signal that the manual failover
                                   can start requesting masters vote. */

    /* The followign fields are uesd by masters to take state on elections. */
    /* The following fields are used by masters to record state during elections */

    // The epoch of the last vote this node granted
    uint64_t lastVoteEpoch;     /* Epoch of the last vote granted. */

    // Flags recording what must be done before entering the next event loop iteration
    int todo_before_sleep; /* Things to do in clusterBeforeSleep(). */

    // Number of messages sent through the cluster bus
    long long stats_bus_messages_sent;  /* Num of msg sent via cluster bus. */

    // Number of messages received through the cluster bus
    long long stats_bus_messages_received; /* Num of msg rcvd via cluster bus.*/

} clusterState;

Note that the clusterState structure references the clusterNode structure. clusterState also contains a slots array, which records which node slot i is assigned to. The difference between the two arrays is this: clusterNode.slots records only the slots handled by that one node, while clusterState.slots records, for every slot i, the node it is assigned to. Two arrays are kept for efficiency: clusterState.slots answers "which node handles slot i?" with a single array lookup, while the clusterNode.slots bitmap describes all the slots of one node in REDIS_CLUSTER_SLOTS / 8 bytes.
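
To make the relationship between the two arrays concrete, here is a small sketch that keeps both views in sync when a slot is assigned. The toy_node and toy_state types and the toy_* functions are invented for this illustration and keep only the slot-related fields; the real bookkeeping in cluster.c handles more cases.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384  /* stand-in for REDIS_CLUSTER_SLOTS */

/* Toy version of clusterNode: bitmap of the slots this node handles. */
typedef struct toy_node {
    unsigned char slots[CLUSTER_SLOTS / 8];
    int numslots;
} toy_node;

/* Toy version of clusterState: for every slot, the node that handles it. */
typedef struct toy_state {
    toy_node *slots[CLUSTER_SLOTS];
} toy_state;

/* Assign 'slot' to node 'n', updating both views.
 * Returns 0 on success, -1 if the slot already has an owner. */
static int toy_add_slot(toy_state *st, toy_node *n, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    if (st->slots[slot] != NULL) return -1;
    n->slots[slot >> 3] |= (unsigned char)(1 << (slot & 7));
    n->numslots++;
    st->slots[slot] = n;
    return 0;
}

/* Cluster-wide view: which node handles this slot? One array lookup. */
static toy_node *toy_slot_owner(const toy_state *st, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    return st->slots[slot];
}

/* Per-node view: does this node handle the slot? One bit test. */
static int toy_node_owns(const toy_node *n, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    return (n->slots[slot >> 3] & (1 << (slot & 7))) != 0;
}
```

The cluster-wide table costs one pointer per slot but makes routing a command a single lookup, which is what getNodeByQuery() relies on when it reads server.cluster->slots[slot].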
Now let's get to the point. In cluster mode, which server processes a given command? Remember that every command request goes through the function processCommand(redisClient *c).

int processCommand(redisClient *c) {

    // This is the entrance to the cluster
    /* If cluster is enabled perform the cluster redirection here.
     *
     * If cluster mode is enabled, perform the redirection here.
     *
     * However we don't perform the redirection if:
     *
     * That is, the node does not redirect in the following cases:
     *
     * 1) The sender of this command is our master.
     *    The sender of the command is the master of this node
     *
     * 2) The command has no key arguments. 
     *    The command has no key parameter
     */
    if (server.cluster_enabled &&
        !(c->flags & REDIS_MASTER) &&
        !(c->cmd->getkeys_proc == NULL && c->cmd->firstkey == 0))
    {
        int hashslot;

        // Cluster offline
        if (server.cluster->state != REDIS_CLUSTER_OK) {
            flagTransaction(c);
            addReplySds(c,sdsnew("-CLUSTERDOWN The cluster is down. Use CLUSTER INFO for more information\r\n"));
            return REDIS_OK;

        // The cluster operates normally
        } else {
            int error_code;
            clusterNode *n = getNodeByQuery(c,c->cmd,c->argv,c->argc,&hashslot,&error_code);
            // The command cannot be served: its keys are in different slots, or the slot is unstable
            if (n == NULL) {
                flagTransaction(c);
                if (error_code == REDIS_CLUSTER_REDIR_CROSS_SLOT) {
                    addReplySds(c,sdsnew("-CROSSSLOT Keys in request don't hash to the same slot\r\n"));
                } else if (error_code == REDIS_CLUSTER_REDIR_UNSTABLE) {
                    /* The request spawns mutliple keys in the same slot,
                     * but the slot is not "stable" currently as there is
                     * a migration or import in progress. */
                    addReplySds(c,sdsnew("-TRYAGAIN Multiple keys request during rehashing of slot\r\n"));
                } else {
                    redisPanic("getNodeByQuery() unknown error.");
                }
                return REDIS_OK;

            // The slot targeted by the command is not handled by this node, so redirect
            } else if (n != server.cluster->myself) {
                flagTransaction(c);
                // -<ASK or MOVED> <slot> <ip>:<port>
                // For example: -ASK 10086 127.0.0.1:12345
                addReplySds(c,sdscatprintf(sdsempty(),
                    "-%s %d %s:%d\r\n",
                    (error_code == REDIS_CLUSTER_REDIR_ASK) ? "ASK" : "MOVED",
                    hashslot,n->ip,n->port));

                return REDIS_OK;
            }

            // If we get here, the slot of the key is handled by this node,
            // or the command has no key arguments
        }
    }

The code above is the entry point for handling command requests in cluster mode. The call getNodeByQuery(c,c->cmd,c->argv,c->argc,&hashslot,&error_code) finds the Redis instance that should process the command.
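
On the client side, the redirection reply built above ("-%s %d %s:%d") has to be parsed again. The following is a minimal sketch of such a parser, not code taken from Redis or from any client library: the redirection type and the parse_redirection() function are hypothetical names, and the host part is assumed to contain no ':' (for example an IPv4 address).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { REDIR_NONE, REDIR_MOVED, REDIR_ASK } redir_kind;

typedef struct {
    redir_kind kind;
    int slot;
    char host[64];
    int port;
} redirection;

/* Parse an error line such as "-MOVED 3999 127.0.0.1:6381\r\n".
 * Returns 1 and fills 'out' for a MOVED or ASK reply, 0 for anything else. */
static int parse_redirection(const char *line, redirection *out) {
    char kind[8];

    assert(line != NULL && out != NULL);
    out->kind = REDIR_NONE;

    /* Expected shape: -<ASK or MOVED> <slot> <ip>:<port> */
    if (sscanf(line, "-%7s %d %63[^:]:%d",
               kind, &out->slot, out->host, &out->port) != 4)
        return 0;

    if (strcmp(kind, "MOVED") == 0) out->kind = REDIR_MOVED;
    else if (strcmp(kind, "ASK") == 0) out->kind = REDIR_ASK;
    else return 0;
    return 1;
}
```

A client would typically treat the two kinds differently: after MOVED it updates its slot-to-node table and resends the command to the new node, while after ASK it sends ASKING followed by the command to the given node for this one request only, without updating its table.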
Note the output parameter error_code, which is assigned inside this function. Which values can it take, and what does each one mean? The enumeration values are listed below.

/* Redirection errors returned by getNodeByQuery(). */
// The node can handle this command
#define REDIS_CLUSTER_REDIR_NONE 0          /* Node can serve the request. */
// The keys are in different slots
#define REDIS_CLUSTER_REDIR_CROSS_SLOT 1    /* Keys in different slots. */
// The slot the keys belong to is being resharded
#define REDIS_CLUSTER_REDIR_UNSTABLE 2      /* Keys in slot resharding. */
// ASK redirection is required
#define REDIS_CLUSTER_REDIR_ASK 3           /* -ASK redirection required. */
// MOVED redirection is required
#define REDIS_CLUSTER_REDIR_MOVED 4         /* -MOVED redirection required. */
We only list these values here; the source code that follows shows how each one is used.
Let's start analyzing the getNodeByQuery() function.

```c
/* Return the pointer to the cluster node that is able to serve the command.
 *
 * For the function to succeed the command should only target either:
 *
 * 1) A single key (even multiple times like RPOPLPUSH mylist mylist).
 * 2) Multiple keys in the same hash slot, while the slot is stable (no
 *    resharding in progress).
 *
 * On success the function returns the node that is able to serve the request.
 * If the node is not 'myself' a redirection must be performed. The kind of
 * redirection (ASK or MOVED) is specified by setting the integer passed by
 * reference 'error_code', which will be set to REDIS_CLUSTER_REDIR_ASK or
 * REDIS_CLUSTER_REDIR_MOVED.
 *
 * When the node is 'myself' 'error_code' is set to REDIS_CLUSTER_REDIR_NONE.
 *
 * If the command fails NULL is returned, and the reason of the failure is
 * provided via 'error_code', which will be set to:
 *
 * REDIS_CLUSTER_REDIR_CROSS_SLOT if the request contains multiple keys that
 * don't belong to the same hash slot.
 *
 * REDIS_CLUSTER_REDIR_UNSTABLE if the request contains multiple keys
 * belonging to the same slot, but the slot is not stable (in migration or
 * importing state, likely because a resharding is in progress). */
clusterNode *getNodeByQuery(redisClient *c, struct redisCommand *cmd, robj **argv, int argc, int *hashslot, int *error_code) {

    // Initialize to NULL;
    // if the command has no key arguments, n stays NULL
    clusterNode *n = NULL;

    robj *firstkey = NULL;
    int multiple_keys = 0;
    multiState *ms, _ms;
    multiCmd mc;
    int i, slot = 0, migrating_slot = 0, importing_slot = 0, missing_keys = 0;

    /* Set error code optimistically for the base case. */
    if (error_code) *error_code = REDIS_CLUSTER_REDIR_NONE;

    /* We handle all the cases as if they were EXEC commands, so we have
     * a common code path for everything */
    // The cluster can execute transactions,
    // but all the keys used by the commands in the transaction must hash to the same slot
    // This if and the for loop below perform that validity check
    if (cmd->proc == execCommand) {
        /* If REDIS_MULTI flag is not set EXEC is just going to return an
         * error. */
        if (!(c->flags & REDIS_MULTI)) return myself;
        ms = &c->mstate;
    } else {
        /* In order to have a single codepath create a fake Multi State
         * structure if the client is not in MULTI/EXEC state, this way
         * we have a single codepath below. */
        ms = &_ms;
        _ms.commands = &mc;
        _ms.count = 1;
        mc.argv = argv;
        mc.argc = argc;
        mc.cmd = cmd;
    }

    /* Check that all the keys are in the same hash slot, and obtain this
     * slot and the node associated. */
    for (i = 0; i < ms->count; i++) {
        struct redisCommand *mcmd;
        robj **margv;
        int margc, *keyindex, numkeys, j;

        mcmd = ms->commands[i].cmd;
        margc = ms->commands[i].argc;
        margv = ms->commands[i].argv;

        // Locate the positions of the keys in the command's arguments
        keyindex = getKeysFromCommand(mcmd,margv,margc,&numkeys);
        // Traverse all keys in the command
        for (j = 0; j < numkeys; j++) {
            robj *thiskey = margv[keyindex[j]];
            int thisslot = keyHashSlot((char*)thiskey->ptr,
                                       sdslen(thiskey->ptr));

            if (firstkey == NULL) {
                // This is the first key to be processed in the transaction
                // Get the slot of the key and the node responsible for processing the slot
                /* This is the first key we see. Check what is the slot
                 * and node. */
                firstkey = thiskey;
                slot = thisslot;
                n = server.cluster->slots[slot];
                redisAssertWithInfo(c,firstkey,n != NULL);
                /* If we are migrating or importing this slot, we need to check
                 * if we have all the keys in the request (the only way we
                 * can safely serve the request, otherwise we return a TRYAGAIN
                 * error). To do so we set the importing/migrating state and
                 * increment a counter for every missing key. */
                if (n == myself &&
                    server.cluster->migrating_slots_to[slot] != NULL)
                {
                    migrating_slot = 1;
                } else if (server.cluster->importing_slots_from[slot] != NULL) {
                    importing_slot = 1;
                }
            } else {
                /* If it is not the first key, make sure it is exactly
                 * the same key as the first we saw. */
                if (!equalStringObjects(firstkey,thiskey)) {
                    if (slot != thisslot) {
                        /* Error: multiple keys from different slots. */
                        getKeysFreeResult(keyindex);
                        if (error_code)
                            *error_code = REDIS_CLUSTER_REDIR_CROSS_SLOT;
                        return NULL;
                    } else {
                        /* Flag this request as one with multiple different
                         * keys. */
                        multiple_keys = 1;
                    }
                }
            }

            /* Migarting / Improrting slot? Count keys we don't have. */
            if ((migrating_slot || importing_slot) &&
                lookupKeyRead(&server.db[0],thiskey) == NULL)
            {
                missing_keys++;
            }
        }
        getKeysFreeResult(keyindex);
    }  // end for

    /* No key at all in command? then we can serve the request
     * without redirections or errors. */
    if (n == NULL) return myself;

    /* Return the hashslot by reference. */
    if (hashslot) *hashslot = slot;

    /* This request is about a slot we are migrating into another instance?
     * Then if we have all the keys. */

    /* If we don't have all the keys and we are migrating the slot, send
     * an ASK redirection. */
    if (migrating_slot && missing_keys) {
        if (error_code) *error_code = REDIS_CLUSTER_REDIR_ASK;
        return server.cluster->migrating_slots_to[slot];
    }

    /* If we are receiving the slot, and the client correctly flagged the
     * request as "ASKING", we can serve the request. However if the request
     * involves multiple keys and we don't have them all, the only option is
     * to send a TRYAGAIN error. */
    if (importing_slot &&
        (c->flags & REDIS_ASKING || cmd->flags & REDIS_CMD_ASKING))
    {
        if (multiple_keys && missing_keys) {
            if (error_code) *error_code = REDIS_CLUSTER_REDIR_UNSTABLE;
            return NULL;
        } else {
            return myself;
        }
    }

    /* Handle the read-only client case reading from a slave: if this
     * node is a slave and the request is about an hash slot our master
     * is serving, we can reply without redirection. */
    if (c->flags & REDIS_READONLY &&
        cmd->flags & REDIS_CMD_READONLY &&
        nodeIsSlave(myself) &&
        myself->slaveof == n)
    {
        return myself;
    }

    /* Base case: just return the right node. However if this node is not
     * myself, set error_code to MOVED since we need to issue a rediretion. */
    if (n != myself && error_code) *error_code = REDIS_CLUSTER_REDIR_MOVED;

    // Return n, the node responsible for the slot
    return n;
}
```

That is the whole of getNodeByQuery(). It is actually fairly simple, and reading the comments a couple of times is enough: it finds the cluster node that should process the command, and if there is none, it stores the reason in error_code. The caller then acts according to error_code.

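2. What should we pay attention to in cluster mode?
(1) Avoid multi-key commands as much as possible: keys that hash to different slots are rejected with a CROSSSLOT error, and even keys in the same slot can get TRYAGAIN while that slot is being resharded
(2) Pay attention to how keys hash to slots, and keep the slots balanced across the nodes of the cluster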
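
Both points come down to how a key is mapped to a slot. The sketch below follows the rule as I understand it from the Redis Cluster documentation: the slot is CRC16 of the key modulo 16384, and if the key contains a non-empty {...} hash tag, only the tag is hashed. The function names crc16_xmodem() and key_hash_slot() are chosen for this example; the function actually called in the source above is keyHashSlot().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CRC16, XMODEM variant: polynomial 0x1021, initial value 0, no reflection. */
static uint16_t crc16_xmodem(const char *buf, size_t len) {
    uint16_t crc = 0;
    size_t i;
    int bit;
    for (i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) crc = (uint16_t)((crc << 1) ^ 0x1021);
            else crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Map a key to a slot in the range 0..16383. If the key contains '{' followed
 * later by '}' with at least one character in between, only that part is hashed. */
static unsigned int key_hash_slot(const char *key, size_t keylen) {
    size_t s, e;

    for (s = 0; s < keylen; s++)
        if (key[s] == '{') break;
    /* No '{' at all: hash the whole key. */
    if (s == keylen) return crc16_xmodem(key, keylen) & 0x3FFF;

    for (e = s + 1; e < keylen; e++)
        if (key[e] == '}') break;
    /* No closing '}' or an empty "{}": hash the whole key. */
    if (e == keylen || e == s + 1) return crc16_xmodem(key, keylen) & 0x3FFF;

    /* Hash only what is between the first '{' and the next '}'. */
    return crc16_xmodem(key + s + 1, e - s - 1) & 0x3FFF;
}
```

With this rule, {user1000}.following and {user1000}.followers land in the same slot, so a multi-key command on them does not fail with CROSSSLOT. The trade-off is that keys sharing a tag always live on the same node, so overusing one tag works against a balanced cluster.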

Postscript: references are the book Redis Design and Implementation and the Redis 3.0 source code.
