Table and database sharding (storing tens of billions of rows)

Posted by rawb on Tue, 11 Jan 2022 06:42:01 +0100

NewLife.XCode is an open-source data middleware with a 15-year history. It supports netcore/net45/net40 and has been developed and maintained by the NewLife team (2002 ~ 2019). It is referred to below simply as XCode.
This tutorial series offers in-depth analysis backed by extensive sample code and run logs, distilled from many years of development experience; its flagship case is a real-time computing project handling tens of billions of rows.
Open source address: https://github.com/NewLifeX/X (star, 938 +)

XCode follows a strongly rich domain model. It centers on single-table operations and does not support multi-table JOINs; complex conditions can only go into the WHERE clause, and every SELECT must target a single table. This makes it a natural fit for table sharding!
!! Before reading this article, it is recommended to review the earlier article on ten-billion-row performance, whose "complete index" chapter covers the key points of very large data tables in detail.

In fact, ten billion rows is not that much data. A common sharding model:
MySQL with 8 master servers (each mirrored by a slave), 8 databases per server and 16 tables per database gives 8 × 8 × 16 = 1024 tables (plus another 1024 on the slaves). At 10 to 50 million rows per table, that is 10 to 50 billion rows in total!

Sample walkthrough
Sample location: https://github.com/NewLifeX/X...
Create a new console project, reference NewLife.XCode via NuGet, then create an entity model (edit Model.xml):
<Tables Version="9.12.7136.19046" NameSpace="STOD.Entity" ConnName="STOD" Output="" BaseClass="Entity" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://www.newlifex.com https://raw.githubusercontent.com/NewLifeX/X/master/XCode/ModelSchema.xsd" xmlns="http://www.newlifex.com/ModelSchema.xsd">

  <Table Name="History" Description="History">
    <Columns>
      <Column Name="Id" DataType="Int32" Identity="True" PrimaryKey="True" Description="Number" />
      <Column Name="Category" DataType="String" Description="Category" />
      <Column Name="Action" DataType="String" Description="Action" />
      <Column Name="UserName" DataType="String" Description="Username" />
      <Column Name="CreateUserID" DataType="Int32" Description="User number" />
      <Column Name="CreateIP" DataType="String" Description="IP address" />
      <Column Name="CreateTime" DataType="DateTime" Description="Time" />
      <Column Name="Remark" DataType="String" Length="500" Description="Details" />
    </Columns>
    <Indexes>
      <Index Columns="CreateTime" />
    </Indexes>
  </Table>

</Tables>
Right-click Build.tt and run the custom tool to generate the entity classes "History.cs" and "History.Biz.cs". No code changes are needed; this entity class is used below to demonstrate table and database sharding.
For convenience we will use SQLite, so no database connection needs to be configured: when XCode detects that no connection string named STOD exists, it falls back to SQLite by default.
Alternatively, you can use a database other than SQLite by supplying a connection string named STOD.

Sharding by numeric hash
For high-volume data such as orders and users, a CRC16 hash of the key can route each row. Here we split the entity data into 4 databases with 4 tables each, 16 tables in total:
static void TestByNumber()
{

XTrace.WriteLine("Table and database by number");

// The connection string of each library is prepared in advance and added dynamically. It can also be written in the configuration file
for (var i = 0; i < 4; i++)
{
    var connName = $"HDB_{i + 1}";
    DAL.AddConnStr(connName, $"data source=numberData\\{connName}.db", null, "sqlite");
    History.Meta.ConnName = connName;

    // Create 4 tables in each library. This step is optional; tables are also created automatically on first read/write
    //for (var j = 0; j < 4; j++)
    //{
    //    History.Meta.TableName = $"History_{j + 1}";

    //    //Initialize data table
    //    History.Meta.Session.InitData();
    //}
}

//!!!  Write data test

// 4 Libraries
for (var i = 0; i < 4; i++)
{
    var connName = $"HDB_{i + 1}";
    History.Meta.ConnName = connName;

    // 4 tables per library
    for (var j = 0; j < 4; j++)
    {
        History.Meta.TableName = $"History_{j + 1}";

        // Insert a batch of data
        var list = new List<History>();
        for (var n = 0; n < 1000; n++)
        {
            var entity = new History
            {
                Category = "transaction",
                Action = "transfer accounts",
                CreateUserID = 1234,
                CreateTime = DateTime.Now,
                Remark = $"[{Rand.NextString(6)}]towards[{Rand.NextString(6)}]transfer accounts[¥{Rand.Next(1_000_000) / 100d}]"
            };

            list.Add(entity);
        }

        // Batch insert. The two expressions are equivalent
        //list.BatchInsert();
        list.Insert(true);
    }
}

}
Register the connection string of each library with the system dynamically via DAL.AddConnStr:
var connName = $"HDB_{i + 1}";

DAL.AddConnStr(connName, $"data source=numberData\\{connName}.db", null, "sqlite");
Each connection name must be unique and follow a pattern, since it is used again later; the database names should follow a rule too.
Meta.ConnName selects the connection for subsequent operations, and Meta.TableName selects the table. Both settings are per-thread and do not interfere with other threads.
var connName = $"HDB_{i + 1}";
History.Meta.ConnName = connName;
History.Meta.TableName = $"History_{j + 1}";
Note that once ConnName/TableName are set, they stay in effect until changed to a new connection or table name.
After setting them, the current thread can keep using them; the sample then uses batch-insert to write a batch of rows into each table.
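The sample loops over every shard to fill them with data. In production, a write path would instead pick the library and table from the record's key. The following is a hedged sketch of that routing, not part of the sample: the CRC16 variant (Modbus polynomial 0xA001) and the modulo layout over 4 libraries × 4 tables are illustrative assumptions.

```csharp
// Hedged sketch: route a row to one of 16 tables by hashing its key.
static UInt16 Crc16(Byte[] data)
{
    UInt16 crc = 0xFFFF;
    foreach (var b in data)
    {
        crc ^= b;
        for (var k = 0; k < 8; k++)
            crc = (crc & 1) != 0 ? (UInt16)((crc >> 1) ^ 0xA001) : (UInt16)(crc >> 1);
    }
    return crc;
}

static void RouteAndInsert(Int64 userId, History entity)
{
    var slot = Crc16(BitConverter.GetBytes(userId)) % 16; // 16 tables in total
    History.Meta.ConnName = $"HDB_{slot / 4 + 1}";        // 4 libraries
    History.Meta.TableName = $"History_{slot % 4 + 1}";   // 4 tables per library
    entity.Insert();
}
```

Because the hash depends only on the key, reads for the same user land on the same shard without any lookup table.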

Running the sample produces the following:

Under the numberData directory specified by the connection string, four databases are generated, each with four tables, and 1000 rows are inserted into each table.
When a nonexistent database or table is referenced, XCode's reverse engineering creates it automatically, one of its distinctive features. (Because the operation is asynchronous, intensive database/table creation can occasionally fail; simply retry.)

Sharding by time series
Log-style time-series data is especially well suited to sharded storage. The split chosen here is one database per month and one table per day.
static void TestByDate()
{

XTrace.WriteLine("According to time, there are tables and libraries, one library per month and one table per day");

// The connection string of each library is prepared in advance and added dynamically. It can also be written in the configuration file
var start = DateTime.Today;
for (var i = 0; i < 12; i++)
{
    var dt = new DateTime(start.Year, i + 1, 1);
    var connName = $"HDB_{dt:yyMM}";
    DAL.AddConnStr(connName, $"data source=timeData\\{connName}.db", null, "sqlite");
}

// One library per month and one table per day
start = new DateTime(start.Year, 1, 1);
for (var i = 0; i < 365; i++)
{
    var dt = start.AddDays(i);
    History.Meta.ConnName = $"HDB_{dt:yyMM}";
    History.Meta.TableName = $"History_{dt:yyMMdd}";

    // Insert a batch of data
    var list = new List<History>();
    for (var n = 0; n < 1000; n++)
    {
        var entity = new History
        {
            Category = "transaction",
            Action = "transfer accounts",
            CreateUserID = 1234,
            CreateTime = DateTime.Now,
            Remark = $"[{Rand.NextString(6)}]towards[{Rand.NextString(6)}]transfer accounts[¥{Rand.Next(1_000_000) / 100d}]"
        };

        list.Add(entity);
    }

    // Batch insert. The two expressions are equivalent
    //list.BatchInsert();
    list.Insert(true);
}

}
The time-series split looks simpler than the numeric hash, and the sharding logic is clear.

The sample traverses 365 days of the current year, generates 12 monthly databases under the timeData directory specified by the connection strings, creates one table per day in each library, and inserts 1000 rows of simulated data into each table.

To sum up, sharding is simply a matter of setting Meta.ConnName / Meta.TableName before the database operation; everything else stays the same!

Split table query
When sharding comes up, many people's first question is: what about cross-table queries?
Sorry, not supported!
You can only query the tables one by one; if the system design is poor, you may even have to query all of them.
Building UNION views is not recommended, as it never ends. It is better to keep the business logic in code and let the database focus on storage and basic computation.
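When a query must span shards, the fan-out lives in application code. A hedged sketch of that pattern, assuming the monthly connection strings were already registered via DAL.AddConnStr and using a 7-day window purely as an example:

```csharp
// Hedged sketch: fan one query out across several daily tables and merge
// the results in application code rather than in SQL.
var result = new List<History>();
for (var i = 0; i < 7; i++)
{
    var dt = DateTime.Today.AddDays(-i);
    History.Meta.ConnName = $"HDB_{dt:yyMM}";
    History.Meta.TableName = $"History_{dt:yyMMdd}";

    // The same single-table query runs on each shard; aggregation happens here
    result.AddRange(History.FindAll());
}
```

Sorting, paging or summing over the merged list is then ordinary in-memory work, which is why keeping each shard's result set small matters.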

Querying a sharded table works the same way as inserting, updating and deleting:
static void SearchByDate()
{

// The connection string of each library is prepared in advance and added dynamically. It can also be written in the configuration file
var start = DateTime.Today;
for (var i = 0; i < 12; i++)
{
    var dt = new DateTime(start.Year, i + 1, 1);
    var connName = $"HDB_{dt:yyMM}";
    DAL.AddConnStr(connName, $"data source=timeData\\{connName}.db", null, "sqlite");
}

// Random date. Batch operation
start = new DateTime(start.Year, 1, 1);
{
    var dt = start.AddDays(Rand.Next(0, 365));
    XTrace.WriteLine("Query date:{0}", dt);

    History.Meta.ConnName = $"HDB_{dt:yyMM}";
    History.Meta.TableName = $"History_{dt:yyMMdd}";

    var list = History.FindAll();
    XTrace.WriteLine("Data:{0}", list.Count);
}

// Random date. Individual operation
start = new DateTime(start.Year, 1, 1);
{
    var dt = start.AddDays(Rand.Next(0, 365));
    XTrace.WriteLine("Query date:{0}", dt);
    var list = History.Meta.ProcessWithSplit(
        $"HDB_{dt:yyMM}",
        $"History_{dt:yyMMdd}",
        () => History.FindAll());

    XTrace.WriteLine("Data:{0}", list.Count);
}

}

Again, sharding is achieved by setting Meta.ConnName / Meta.TableName; the log output shows which library and table each query hit.
History.Meta.ProcessWithSplit is simply a shortcut: it applies the given connection and table name inside the callback and restores the previous values on exit.
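The set-and-restore behavior can be pictured as follows. This is only an illustrative sketch of the pattern, not XCode's actual implementation:

```csharp
// Hedged sketch of the pattern behind ProcessWithSplit: apply a
// connection/table name for the duration of a callback, then restore.
static T WithSplit<T>(String connName, String tableName, Func<T> callback)
{
    var oldConn = History.Meta.ConnName;
    var oldTable = History.Meta.TableName;
    History.Meta.ConnName = connName;
    History.Meta.TableName = tableName;
    try
    {
        return callback();
    }
    finally
    {
        // Restore the previous shard even if the callback throws
        History.Meta.ConnName = oldConn;
        History.Meta.TableName = oldTable;
    }
}
```

The try/finally guarantees the thread's shard settings are never left pointing at the wrong table after an exception.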

After sharding, the most common mistake is forgetting to set the table name, querying the wrong table and finding nothing.

Table splitting strategy
Based on years of experience:
● Oracle suits 10 million to 100 million rows per table and needs partitioning;
● MySQL suits 10 to 50 million rows per table, and few people use MySQL partitioning.
If sharding is done uniformly at the application layer and the database is only responsible for storage, the scheme above applies to any database.
A related frequently asked question: given the per-table limit, how many tables should there be? Within the system's life cycle (generally 1 to 2 years), it is best to keep each table's total at around 10 million rows after splitting.
Following the ten-billion-row performance article, the common sharding strategies are:
● Log/time-series tables: split by month if monthly volume is under 10 million rows, otherwise by day. The drawback is a very pronounced data hotspot, which suits a tiered hot-table / cold-table / archive-table architecture; the advantage is excellent batch write and extraction performance;
● Status tables (orders, users, etc.): split by CRC16 hash, sizing the table count at about 10 million rows per table and rounding up to a power of two (for easy modulo computation). Hot and cold data spread evenly, which favors single-row lookups and updates; the drawback is that batch writes and extraction suffer;
● Mixed sharding: an order table can be hashed by CRC16 of the order number, making single-row lookup and update easy, and serve as a wide table carrying all detail fields. A redundant set of time-series tables keyed by order time can store just the essential fields such as the order number. This resolves the tension between sharding by primary key and querying by time. The drawback is that order data is written twice, though the time-series table only needs inserts of the order number and never sees updates.
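To make the sizing rule concrete, a small helper (an illustration, not part of XCode) that derives the hash table count from expected lifetime volume, keeping each table around 10 million rows and rounding up to a power of two:

```csharp
// Illustration only: estimate how many hash tables to create so that each
// holds roughly 10 million rows, rounded up to a power of two.
static Int32 EstimateTableCount(Int64 totalRows, Int64 rowsPerTable = 10_000_000)
{
    var needed = (totalRows + rowsPerTable - 1) / rowsPerTable; // ceiling division
    var count = 1;
    while (count < needed) count <<= 1;                         // next power of two
    return count;
}

// e.g. 3 billion rows over the system lifetime -> 300 tables needed -> 512
```

The power-of-two rounding lets the router use a cheap bitmask (`hash & (count - 1)`) instead of a modulo.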
Whether to also split into multiple databases depends mainly on storage space and performance requirements.

Sharded tables versus partitions
Another very common question: why shard tables instead of using partitions?
The major databases Oracle, MSSQL and MySQL all support partitioning; the first two use it widely, while MySQL deployments lean toward sharded tables.
There is no essential difference between partitioning and sharding: both split massive data by some strategy to optimize writes and queries. The main differences:
● besides local per-partition indexes, a partitioned table can have a global index, which sharded tables cannot;
● partitions support cross-partition queries, but they are very, very slow, and a careless query scans every partition;
● a sharded-table architecture extends naturally into sharded databases spread across multiple servers, whereas partitioning can only demand an ever stronger, bigger database server;
● partitioning is mainly operated by DBAs, while sharding is controlled by the programmer.

!!! One project using XCode's sharding has run in production for three and a half years, growing by 40 to 50 million rows per day with over 200 million inserts/updates/deletes daily, and tens of billions of rows in total.

Topics: C#