Distributed ID generation algorithm snowflake algorithm

Posted by EriRyoutan on Tue, 16 Nov 2021 10:42:40 +0100

Reason: why do we need snowflake algorithm

Why do you need distributed globally unique IDs and business requirements of Distributed IDS? How to ensure the generation of distributed unique global ID in the case of high cluster concurrency?

In complex distributed systems, it is often necessary to uniquely identify a large number of data and messages. For example, in the systems of finance, payment, catering, hotel, cat's eye film and other products commented by meituan, the data is growing day by day. After the data is divided into databases and tables, a unique ID is required to identify a data or message. In particular, orders, riders and coupons should also be identified with a unique ID. At this time, a system that can generate globally unique IDs is very necessary.

Some hard requirements for ID generation rules

  • Globally unique: duplicate ID numbers cannot appear. Since they are unique - identification, this is the most basic requirement
  • Increasing trend: clustered indexes are used in MySQL's InnoDB engine. Since most RDBMS use Btree's data structure to store index data, we should try our best to use ordered primary keys in the selection of primary keys to ensure write performance.
  • Monotonic increment: ensure that the next ID is greater than the previous ID, such as transaction version number, IM increment message, sorting and other special requirements
  • Information security: if the ID is continuous, it is very easy for malicious users to steal. You can download the specified URL directly in order. If it is the order number, it is even more dangerous. The competitor can directly know our daily order quantity. Therefore, in some application scenarios, the ID needs to be irregular, which makes it easy for competitors to guess.
  • Including time stamp: in this way, we can quickly understand the generation time of this distributed id in development.

Availability requirements of ID number generation system

  • High availability: when sending a request to obtain a distributed ID, the server must ensure that a unique distributed ID is created for me in 99.999% of the cases.
  • Low latency: Send a request to obtain the distributed ID, and the server will be fast and extremely fast.
  • High QPS: if 100000 requests to create distributed IDS are killed simultaneously, the server should withstand and successfully create 100000 distributed IDS at once.

General scheme

UUID

The standard type of UUID (universal unique ldentifer) contains 32 hexadecimal digits, which are divided into five segments with hyphens and 36 characters in the form of 8-4-4-4-12. Example: 550e8400-e29b-41d4-a716-446655440000

Very high performance: locally generated, no network consumption

If you only consider uniqueness, choose it

However, the database performance is poor

Why does disordered UUID s lead to poor warehousing performance?

  • Disorder, unable to predict his generation order, unable to generate incrementally ordered numbers. First of all, the distributed ID is generally used as the primary key, but the official recommendation for installing MySQL is that the primary key should be as short as possible. Each UUID is very long, so it is not recommended.
  • When the ID is used as the primary key, there will be some problems in a specific environment. For example, in the case of a DB primary key, the UUID is very unsuitable. MySQL officials have clearly suggested that the primary key should be as short as possible. A UUID with a length of 36 characters does not meet the requirements.
  • Index. Since the distributed ID is the primary key, and then the primary key contains the index, and then the MySQL index is implemented through the B + tree, every time a new UUID data is inserted, the B + tree at the bottom of the index will be modified for query optimization. Because the UUID data is disordered, every UUID data insertion will greatly modify the B + tree of the primary key, This is very bad. Inserting completely out of order will not only lead to the splitting of some intermediate nodes, but also create many unsaturated nodes in vain, which greatly reduces the performance of database insertion.

All indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index.

If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.

link

Database auto increment primary key

stand-alone

In a stand-alone machine, the main principle of the self increasing ID mechanism of the database is that the self increasing ID of the database and the replace into of the MySQL database are implemented.

REPLACE INTO means to insert a record. If the value of the unique index in the table encounters a conflict, the old data will be replaced.

The function of replace into here is similar to that of inset. The difference is that replace into first tries to insert the data into the data list. If it is found that this row of data already exists in the table (judged according to the primary key or unique index), it is deleted first and then inserted. Otherwise, insert new data directly.

CREATE TABLE t_test(
    id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    stub CHAR(1) NOT NULL DEFAULT '',
    UNIQUE KEY stub(stub)
)

SELECT * FROMt_ test;

REPLACE INTO t_test (stub) VALUES('b');

SELECT LAST_INSERT_ID();

Cluster distributed

Is the database self increment ID mechanism suitable for distributed ID? The answer is not very suitable

1: It is difficult to expand the system horizontally. For example, after defining the step size and the number of machines, what should I do if I want to add machines? Suppose there is only one machine with the number of 1, 2, 3, 4 and 5 (step size is 1), which

One machine needs to be expanded at a time. You can do this: set the initial value of the second machine much more than the first one. It seems good. Now imagine if we have 100 machines on the line

What should I do when I want to expand capacity? It is a nightmare, so the system level expansion scheme is complex and difficult to achieve.

2: There is still a lot of pressure on the database. You have to read and write the database every time you get the ID, which greatly affects the performance. It does not comply with the rules of low latency and high QPS in the distributed ID (if you get the ID in the database at high speed, it will greatly affect the performance)

Generate global ID policy based on Redis

Because Redis is single wire and naturally ensures atomicity, it can be realized by atomic operations INCR and INCRBY

Note: in the case of redis cluster, the same as MySQL, different growth steps need to be set, and the validity period must be set. Redis cluster can be used to obtain higher throughput.

Suppose there are 5 Redis in a cluster. Each Redis can be initialized with values of 1, 2, 3, 4 and 5 respectively, and then the step size is 5.

The ID s generated by each Redis are:

A: 1, 6, 11, 16, 21 B: 2, 7 , 12, 17, 22 C: 3, 8, 13, 18, 23 D: 4, 9, 14, 19, 24 E: 5, 10, 15, 20, 25

source

Twitter's distributed self increasing ID algorithm snowflake

summary

Twitter's snowflake solves this need. Initially, twitter migrated the storage system from MySQL to Cassandra (an open-source distributed NoSQL database system developed by Facebook). Because Cassandra has no sequential ID generation mechanism, it has developed such a set of globally unique generation services.

Twitter's distributed snowflake algorithm snowflake has been tested to generate 260000 self increasing and sortable ID S per second

  1. Twitter's SnowFlake ID can be generated in chronological order.
  2. The result of ID generated by SnowFlake algorithm is an integer of 64bit size, which is a Long type (the maximum length after conversion to string is 19).
  3. ID collision will not occur in the distributed system (distinguished by datacenter and worker LD) and the efficiency is high.

In distributed systems, there are some scenarios that need to use globally unique IDs. The basic requirements for generating IDs are as follows:

In a distributed environment, it must be global and unique.

Generally, it needs to increase monotonically, because generally, the unique ID will be saved to the database, and the feature of Innodb is to store the content in the leaf node on the primary key index tree, and it increases from left to right, so

Considering the database performance, the generated ID should also be monotonically increasing. In order to prevent ID conflicts, 36 bit UUIDs can be used, but UUIDs have some disadvantages. First, they are relatively long. In addition, UUIDs are generally unordered.

It may also be necessary to have no rules, because if the unique ID is used as the order number, this rule is required in order to prevent others from knowing the order quantity of a day.

structure

Several core components of snowflake algorithm:

SnowFlake guarantees:

  • All generated ID s are incremented over time.
  • There will be no duplicate IDS in the whole distributed system (because there are datacenter id and worker LD to distinguish)

Source code

The following codes are for learning only:

/**
 * Twitter_Snowflake
 * SnowFlake The structure of the is as follows (each part is separated by -)
 * 0 - 0000000000 0000000000 0000000000 0000000000 0 - 00000 - 00000 - 000000000000
 * 1 Bit identification. Since the basic type of long is signed in Java, the highest bit is the sign bit, the positive number is 0 and the negative number is 1, the id is generally a positive number and the highest bit is 0
 * 41 Bit timestamp (in milliseconds). Note that the 41 bit timestamp does not store the timestamp of the current time, but stores the difference of the timestamp (current timestamp - start timestamp)
 * The start timestamp here is generally the time when our id generator starts to use, which is specified by our program (such as the startTime attribute of SnowflakeIdWorker class in the following program). The 41 bit timestamp can be used for 69 years, and the year t = (1L < < 41) / (1000L * 60 * 60 * 24 * 365) = 69
 * 10 It can be deployed in 1024 nodes, including 5-bit datacenter ID and 5-bit workerId
 * 12 Bit sequence, counting in milliseconds, and 12 bit counting sequence number. Each node can generate 4096 ID sequence numbers per millisecond (the same machine, the same timestamp)
 * Add up to just 64 bits, which is a Long type.
 */
public class SnowflakeIdWorker {
    /** Start timestamp (2015-01-01) */
    private final long twepoch = 1420041600000L;

    /** Number of digits occupied by machine id */
    private final long workerIdBits = 5L;

    /** Number of bits occupied by data id */
    private final long datacenterIdBits = 5L;

    /** The maximum machine id supported is 31 (this shift algorithm can quickly calculate the maximum decimal number represented by several binary numbers) */
    private final long maxWorkerId = -1L ^ (-1L << workerIdBits);

    /** The maximum supported data id is 31 */
    private final long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);

    /** The number of bits the sequence occupies in the id */
    private final long sequenceBits = 12L;

    /** The machine ID shifts 12 bits to the left */
    private final long workerIdShift = sequenceBits;

    /** The data id shifts 17 bits to the left (12 + 5) */
    private final long datacenterIdShift = sequenceBits + workerIdBits;

    /** Shift the timestamp 22 bits to the left (5 + 5 + 12) */
    private final long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;

    /** The mask of the generated sequence is 4095 (0b111111 = 0xfff = 4095) */
    private final long sequenceMask = -1L ^ (-1L << sequenceBits);

    /** Work machine ID(0~31) */
    private long workerId;

    /** Data center ID(0~31) */
    private long datacenterId;

    /** Sequence in milliseconds (0 ~ 4095) */
    private long sequence = 0L;

    /** Timestamp of last generated ID */
    private long lastTimestamp = -1L;

    //==============================Constructors=====================================
    /**
     * Constructor
     * @param workerId Job ID (0~31)
     * @param datacenterId Data center ID (0~31)
     */
    public SnowflakeIdWorker(long workerId, long datacenterId) {
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(String.format("worker Id can't be greater than %d or less than 0", maxWorkerId));
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException(String.format("datacenter Id can't be greater than %d or less than 0", maxDatacenterId));
        }
        this.workerId = workerId;
        this.datacenterId = datacenterId;
    }

    // ==============================Methods==========================================
    /**
     * Get the next ID (the method is thread safe)
     * @return SnowflakeId
     */
    public synchronized long nextId() {
        long timestamp = timeGen();

        //If the current time is less than the timestamp generated by the last ID, it indicates that the system clock has fallback, and an exception should be thrown at this time
        if (timestamp < lastTimestamp) {
            throw new RuntimeException(
                    String.format("Clock moved backwards.  Refusing to generate id for %d milliseconds", lastTimestamp - timestamp));
        }

        //If it is generated at the same time, the sequence within milliseconds is performed
        if (lastTimestamp == timestamp) {
            sequence = (sequence + 1) & sequenceMask;
            //Sequence overflow in milliseconds
            if (sequence == 0) {
                //Block to the next millisecond and get a new timestamp
                timestamp = tilNextMillis(lastTimestamp);
            }
        }
        //Timestamp change, sequence reset in milliseconds
        else {
            sequence = 0L;
        }

        //Timestamp of last generated ID
        lastTimestamp = timestamp;

        //Shift and put together by or operation to form a 64 bit ID
        return ((timestamp - twepoch) << timestampLeftShift) //
                | (datacenterId << datacenterIdShift) //
                | (workerId << workerIdShift) //
                | sequence;
    }

    /**
     * Block to the next millisecond until a new timestamp is obtained
     * @param lastTimestamp Timestamp of last generated ID
     * @return Current timestamp
     */
    protected long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }

    /**
     * Returns the current time in milliseconds
     * @return Current time (MS)
     */
    protected long timeGen() {
        return System.currentTimeMillis();
    }

    /** test */
    public static void main(String[] args) {
        System.out.println("Start:"+System.currentTimeMillis());
        SnowflakeIdWorker idWorker = new SnowflakeIdWorker(0, 0);
        for (int i = 0; i < 50; i++) {
            long id = idWorker.nextId();
            System.out.println(id);
//            System.out.println(Long.toBinaryString(id));
        }
        System.out.println("end:"+System.currentTimeMillis());
    }
}

Project landing experience

Snowflake documentation for Hutool

Add dependency

<dependency>
    <groupId>cn.hutool</groupId>
    <artifactId>hutool-captcha</artifactId>
    <version>4.6.8</version>
</dependency>

Sample program:

import cn.hutool.core.lang.Snowflake;
import cn.hutool.core.net.NetUtil;
import cn.hutool.core.util.IdUtil; 
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;

@Slf4j
@Component
public class IdGeneratorSnowflake{
    private long workerId = 0;
    private long datacenterId = 1;
    private Snowflake snowflake = IdUtil.createSnowflake(workerId, datacenterId);

    public synchronized long snowflakeId(){
        return snowflake.nextId();
    }

    public synchronized long snowflakeId(long workerId, long datacenterId){
        Snowflake snowflake = IdUtil.createSnowflake(workerId, datacenterId);
        return snowflake.nextId();
    }

    public static void main(String[] args){
        IdGeneratorSnowflake idGenerator = new IdGeneratorSnowflake();
        System.out.println(idGenerator.snowflakeId());
        
        ExecutorService threadPool = Executors.newFixedThreadPool(5);
        for (int i = 1; i <= 20; i++){
            threadPool.submit(() -> {
                System.out.print1n(idGenerator.snowflakeId());
            });
        }
        
        threadPool.shutdown();

    }
}

Advantages and disadvantages

advantage:

The number of milliseconds is in the high order, the self increasing sequence is in the low order, and the whole ID is increasing in trend.

It does not rely on third-party systems such as databases. It is deployed in the form of services, which has higher stability and high performance of ID generation.

bit bits can be allocated according to their own business characteristics, which is very flexible.

Disadvantages:

Depending on the machine clock, if the machine clock is dialed back, it will lead to duplicate ID generation.

It is incremental on a single machine, but because it is designed in a distributed environment, the clock on each machine cannot be fully synchronized, and sometimes it is not globally incremental.

(it doesn't matter. Generally, distributed ID S only require increasing trend, not strictly. 90% of the requirements only require increasing trend)

Other supplements

Baidu open source distributed unique ID generator UidGenerator

Meituan comment distributed ID generation system Leaf

Author: other shore dance

Time: 2021 \ 11 \ 15

Content about: SpringBoot

This article comes from the network, only technology sharing, without any responsibility