MyCat sub-database and sub-table (sharding)

Posted by patriklko on Mon, 17 Jan 2022 23:36:49 +0100

Sub database and sub table

Sub database: the process of splitting a single database into multiple databases, scattering data across multiple databases.
Sub table: the process of splitting a single table into multiple tables, scattering data across multiple tables.

Why split databases and tables?

Keywords: improve performance and increase availability.

In terms of performance
As the amount of data in a single database grows and query QPS rises, the time required to read and write the database also increases. The read-write performance of the database can become the bottleneck of business development, so the database needs to be optimized accordingly. In this article we only discuss database-level optimization, not application-level methods such as caching.

If the query QPS of the database is too high, consider splitting the database to spread the connection pressure across multiple databases. For example, if the query QPS is 3500 and a single database can support 1000 connections, you can split into multiple databases to disperse the connection pressure.

If the amount of data in a single table is too large, then once it exceeds a certain order of magnitude, traditional database-level optimizations such as indexing may still leave performance problems, whether for data queries or data updates. This is a qualitative change caused by quantitative change, and it requires a change of approach: solve the problem at the source, where data is produced and processed. Since the data volume is large, divide and conquer. This leads to table splitting, which distributes the data into multiple tables according to certain rules, solving access-performance problems that cannot be solved in a single-table environment.

In terms of availability
If an accident happens to a single database, all data is likely to be lost. Especially in the cloud age, many databases run on virtual machines; an accident on the virtual machine or host may cause irreparable losses. Therefore, besides traditional deployment-level reliability solutions such as master-slave and master-master replication, we can also address the problem at the data-splitting level.

Here we take database downtime as an example:

In the case of single database deployment, if the database goes down, the impact of the failure is 100%, and the recovery may take a long time.
If we split it into two libraries and deploy them on different machines, and one library goes down, the impact of the failure is 50%, and 50% of the data can continue to serve.
If we split it into four libraries and deploy them on different machines, and one library goes down, the impact of the failure is 25%, and 75% of the data can continue to serve, and the recovery time will be very short.
Of course, we cannot split databases indefinitely. Splitting trades storage resources for performance and availability; after all, resources are always limited.
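The availability figures above follow directly from splitting: when data is spread evenly across n databases, one instance going down affects roughly 1/n of the data. A minimal sketch:

```python
# Failure blast radius when one of n evenly-loaded databases goes down:
# the affected fraction is 1/n; the rest keeps serving.
for n in (1, 2, 4):
    impact = 1 / n
    print(f"{n} database(s): {impact:.0%} of data affected, "
          f"{1 - impact:.0%} still serving")
```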

How to divide databases and tables
Sub database? Sub table? Or both?
Based on the first part, there are three scenarios, each mapping to a scheme:

  • Database read/write QPS is too high and database connections are insufficient → sub database
  • Only some tables are affected: the amount of data in a single table is too large and storage performance hits a bottleneck → sub table
  • Both: insufficient connections plus excessive data volume → sub database and sub table

How do we choose our own segmentation scheme?
If tables need to be divided, how many tables are appropriate?

Since all technologies serve the business, let's first review the business background from the perspective of data.

For example, our business system handles members' consulting demands: we serve members through our XSpace customer service platform. At present, we mainly use synchronized offline work-order data as the source from which we build our own data.

It is assumed that each offline work order generates a corresponding member consultation question (we refer to it as a question ticket). Suppose:

Online channel: 3w chat sessions per day; assuming 50% of sessions generate an offline work order, that is 3w × 50% = 1.5w work orders per day;
Hotline channel: 2.5w calls per day; assuming 80% of calls generate a work order, that is 2.5w × 80% = 2w work orders per day;
Offline channel: assume 3w work orders are generated directly per day;
Total: 1.5w + 2w + 3w = 6.5w per day (w = 万, i.e. 10,000)

Considering new business scenarios that may be covered in the future, some expansion space needs to be reserved in advance. Here, we assume 8w question tickets are generated every day.

In addition to the question ticket table, there are two other commonly used business tables: the user operation log table and the user-submitted form data table.

Each question ticket generates multiple user operation logs. According to historical statistics, each ticket generates approximately 8 operation logs on average; reserving some headroom, we assume about 10 user operation logs per ticket on average.

If the design service life of the system is 5 years, the question ticket data volume is about 5 years × 365 days/year × 8w/day = 146 million rows. The estimated numbers of tables are as follows:

Question tickets need 146 million / 500w = 29.2 tables, so we segment into 32 tables;
Operation logs need 32 × 10 = 320 tables, so we segment into 32 × 16 = 512 tables.
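The estimate can be reproduced as a quick sketch, rounding table counts up to the next power of two as the article does (the 500w-rows-per-table target and the 10-logs-per-ticket ratio come from the text above):

```python
import math

DAILY_TICKETS = 8 * 10_000        # 8w question tickets per day (with headroom)
YEARS = 5
ROWS_PER_TABLE = 5 * 1_000_000    # roughly 500w rows per table

total_rows = DAILY_TICKETS * 365 * YEARS               # 146 million
# 29.2 tables needed, rounded up to the next power of two -> 32
ticket_tables = 2 ** math.ceil(math.log2(total_rows / ROWS_PER_TABLE))
# ~10 operation logs per ticket: 320, rounded up to the next power of two -> 512
log_tables = 2 ** math.ceil(math.log2(ticket_tables * 10))

print(total_rows, ticket_tables, log_tables)  # 146000000 32 512
```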

If databases need to be divided, how many are appropriate?
Besides the usual peak read/write QPS of the business, the possible peak during the Double 11 promotion period also needs to be estimated in advance.

According to our actual business scenario, question ticket queries mainly come from the Alibaba customer service Xiaomi home page, so the load can be evaluated from historical QPS, RT and similar data. Assuming we need at most 3500 database connections, and a single database can bear at most 1000 connections, we can split into four databases.
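As a sanity check on that number, using the 3500-connection peak and the assumed 1000-connection ceiling per database:

```python
import math

peak_connections = 3500   # estimated peak from historical QPS/RT data
per_db_ceiling = 1000     # assumed maximum a single database can bear

# round up: 3.5 databases is not deployable, so we need 4
num_databases = math.ceil(peak_connections / per_db_ceiling)
print(num_databases)  # 4
```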

How to segment data?
According to industry practice, segmentation is usually performed in two ways: horizontal segmentation and vertical segmentation. Of course, some complex business scenarios may also choose the combination of the two.

(1) Horizontal segmentation

This method segments data horizontally by a business dimension. A common example is segmenting by the member dimension: data belonging to different members is scattered across different database tables according to certain rules. Because our business scenario reads and writes data from the member's perspective, we choose horizontal segmentation.

(2) Vertical segmentation

Vertical segmentation can be simply understood as splitting different fields of a table into different tables.

For example, suppose there is a small e-commerce business, which puts the commodity information, buyer and seller information and payment information related to an order in a large table. You can consider splitting commodity information, buyer information, seller information and payment information into separate tables through vertical segmentation, and associating them with the basic order information through the order number.

In another case, if a table has 10 fields and only 3 of them are modified frequently, you can consider splitting those 3 fields into a sub-table, so that the row locks taken while modifying them do not affect queries on the remaining seven fields.

Sub database

Database preparation

Here we simulate a commodity management system in the cloud: when database connection pressure becomes too high, the database is split. The plan is to split it into two databases:
a commodity library and a user/advertising library.
According to the business table relationships above, the commodity table and category table are associated, so they should be placed on one data node; the other two tables (the user table and the ad table) are placed on another data node for testing.
Create the data tables in the two databases.

Configure mycat

Sub database rules:

dn1: department table, user table, register (patient registration) table
dn2: drugs table, disease table
Modify MyCat's schema.xml to configure this rule:

<?xml version="1.0"?>
<!DOCTYPE mycat:schema SYSTEM "schema.dtd">
<mycat:schema xmlns:mycat="http://io.mycat/">
    <schema name="TESTDB" checkSQLschema="false" sqlMaxLimit="100" dataNode="dn1">
        <table name="drugs" dataNode="dn2"></table>
        <table name="disease" dataNode="dn2"></table>
    </schema>

    <dataNode name="dn1" dataHost="host1" database="his_mycat"/>
    <dataNode name="dn2" dataHost="host2" database="his_mycat"/>

    <dataHost name="host1" maxCon="1000" minCon="10" balance="0"
              writeType="0" dbType="mysql" dbDriver="native" switchType="1"
              slaveThreshold="100">
        <heartbeat>select user()</heartbeat>
        <!-- can have multi write hosts -->
        <writeHost host="hostM1" url="172.17.0.2:3306" user="root" password="root">
        </writeHost>
    </dataHost>
    <dataHost name="host2" maxCon="1000" minCon="10" balance="0"
              writeType="0" dbType="mysql" dbDriver="native" switchType="1"
              slaveThreshold="100">
        <heartbeat>select user()</heartbeat>
        <!-- can have multi write hosts -->
        <writeHost host="hostM2" url="172.17.0.3:3306" user="root"  password="root">
        </writeHost>
    </dataHost>

</mycat:schema>

Start mycat
Execute in the mycat/bin directory

./mycat console

Log in to MyCat and create the table structures

Verification
After creation, view the table distribution from the MyCat client, the dn1 node and the dn2 node respectively.

Sub table

Compared with vertical splitting, horizontal splitting does not classify tables by function; instead, it distributes the rows of one table across multiple databases according to certain rules on a field, with each table holding part of the data. In short, horizontal segmentation is segmentation of data rows: some rows of the table go into one database, and other rows into other databases.

The number of rows a single MySQL table can store efficiently has a bottleneck: at around 10 million rows, query efficiency begins to suffer, and horizontal splitting (table splitting) is needed for optimization. This system predicts more than 50 million rows within 5 years, so the solution is to split horizontally by id.

Split table implementation (modulo)

This rule applies a modulo operation to the sharding column:

<tableRule name="mod-long">
    <rule>
        <columns>user_id</columns>
        <algorithm>mod-long</algorithm>
    </rule>
</tableRule>

<function name="mod-long" class="io.mycat.route.function.PartitionByMod">
    <!-- how many data nodes -->
    <property name="count">3</property>
</function>

The columns element above identifies the table field to shard by, and algorithm names the sharding function.
This configuration is straightforward: the data node is computed by taking the id modulo the node count. Compared with fixed-length hash sharding, a batch insert may scatter its rows across multiple shards within one transaction, which increases the difficulty of transaction consistency (but this method's implementation is the simplest, so it is described first).
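As a rough sketch of what the mod-long function computes (a hypothetical re-implementation for illustration, not MyCat's actual PartitionByMod code): the target data node is simply the sharding value modulo the node count.

```python
def mod_long(sharding_value: int, count: int = 3) -> int:
    """Mimics mod-long routing: node index = sharding value % node count."""
    return sharding_value % count

# With count=3 as in the tableRule above, consecutive ids are
# spread evenly across the three data nodes:
routes = {user_id: mod_long(user_id) for user_id in range(6)}
print(routes)  # {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
```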
The complete rule.xml after configuration is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!-- - - Licensed under the Apache License, Version 2.0 (the "License"); 
    - you may not use this file except in compliance with the License. - You 
    may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 
    - - Unless required by applicable law or agreed to in writing, software - 
    distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT 
    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the 
    License for the specific language governing permissions and - limitations 
    under the License. -->
<!DOCTYPE mycat:rule SYSTEM "rule.dtd">
<mycat:rule xmlns:mycat="http://io.mycat/">
    <tableRule name="customer_rule">
        <rule>
            <columns>cid</columns>
            <algorithm>mod-long</algorithm>
        </rule>
    </tableRule>
    <tableRule name="rule1">
        <rule>
            <columns>id</columns>
            <algorithm>func1</algorithm>
        </rule>
    </tableRule>

    <tableRule name="rule2">
        <rule>
            <columns>user_id</columns>
            <algorithm>func1</algorithm>
        </rule>
    </tableRule>

    <tableRule name="sharding-by-intfile">
        <rule>
            <columns>sharding_id</columns>
            <algorithm>hash-int</algorithm>
        </rule>
    </tableRule>
    <tableRule name="auto-sharding-long">
        <rule>
            <columns>id</columns>
            <algorithm>rang-long</algorithm>
        </rule>
    </tableRule>
    <tableRule name="mod-long">
        <rule>
            <columns>id</columns>
            <algorithm>mod-long</algorithm>
        </rule>
    </tableRule>
    <tableRule name="sharding-by-murmur">
        <rule>
            <columns>id</columns>
            <algorithm>murmur</algorithm>
        </rule>
    </tableRule>
    <tableRule name="crc32slot">
        <rule>
            <columns>id</columns>
            <algorithm>crc32slot</algorithm>
        </rule>
    </tableRule>
    <tableRule name="sharding-by-month">
        <rule>
            <columns>create_time</columns>
            <algorithm>partbymonth</algorithm>
        </rule>
    </tableRule>
    <tableRule name="latest-month-calldate">
        <rule>
            <columns>calldate</columns>
            <algorithm>latestMonth</algorithm>
        </rule>
    </tableRule>

    <tableRule name="auto-sharding-rang-mod">
        <rule>
            <columns>id</columns>
            <algorithm>rang-mod</algorithm>
        </rule>
    </tableRule>

    <tableRule name="jch">
        <rule>
            <columns>id</columns>
            <algorithm>jump-consistent-hash</algorithm>
        </rule>
    </tableRule>

    <function name="murmur"
        class="io.mycat.route.function.PartitionByMurmurHash">
        <property name="seed">0</property><!-- default is 0 -->
        <property name="count">2</property><!-- the number of database nodes to shard across; must be specified, otherwise sharding is impossible -->
        <property name="virtualBucketTimes">160</property><!-- each physical database node is mapped to this many virtual nodes; the default is 160, i.e. there are 160 times as many virtual nodes as physical nodes -->
        <!-- <property name="weightMapFile">weightMapFile</property> Node weights. A node with no specified weight defaults to 1. The file uses properties format: the integer node index from 0 to count-1 is the key and the node weight is the value. All weights must be positive integers, otherwise 1 is used instead -->
        <!-- <property name="bucketMapPath">/etc/mycat/bucketMapPath</property> 
            Used to observe the distribution of physical and virtual nodes during testing. If this attribute is specified, the mapping from each virtual node's murmur hash value to its physical node is written to this file line by line. There is no default; if unspecified, nothing is output -->
    </function>

    <function name="crc32slot"
              class="io.mycat.route.function.PartitionByCRC32PreSlot">
        <property name="count">2</property><!-- the number of database nodes to shard across; must be specified, otherwise sharding is impossible -->
    </function>
    <function name="hash-int"
        class="io.mycat.route.function.PartitionByFileMap">
        <property name="mapFile">partition-hash-int.txt</property>
    </function>
    <function name="rang-long"
        class="io.mycat.route.function.AutoPartitionByLong">
        <property name="mapFile">autopartition-long.txt</property>
    </function>
    <function name="mod-long" class="io.mycat.route.function.PartitionByMod">
        <!-- how many data nodes -->
        <property name="count">2</property>
    </function>

    <function name="func1" class="io.mycat.route.function.PartitionByLong">
        <property name="partitionCount">8</property>
        <property name="partitionLength">128</property>
    </function>
    <function name="latestMonth"
        class="io.mycat.route.function.LatestMonthPartion">
        <property name="splitOneDay">24</property>
    </function>
    <function name="partbymonth"
        class="io.mycat.route.function.PartitionByMonth">
        <property name="dateFormat">yyyy-MM-dd</property>
        <property name="sBeginDate">2015-01-01</property>
    </function>

    <function name="rang-mod" class="io.mycat.route.function.PartitionByRangeMod">
            <property name="mapFile">partition-range-mod.txt</property>
    </function>

    <function name="jump-consistent-hash" class="io.mycat.route.function.PartitionByJumpConsistentHash">
        <property name="totalBuckets">3</property>
    </function>
</mycat:rule>

Cross-shard join in Mycat

Join is probably the most commonly used feature in relational databases. However, in a distributed environment, cross-shard join is the most complex problem and the hardest to solve.

Mycat performance recommendations
Try to avoid LEFT JOIN and RIGHT JOIN; use INNER JOIN instead.
When using LEFT JOIN or RIGHT JOIN, ON is evaluated first and the WHERE condition last; therefore, during use,
put conditions in the ON clause where possible to reduce WHERE processing.
Use fewer subqueries; use joins instead.
The current version of Mycat supports cross-shard join, with four main implementations:
global table, ER sharding, catlet (artificial intelligence) and ShareJoin. ShareJoin is supported in the development version; the first three are supported in 1.3.0.1.

ER sharding
MyCat borrows the design ideas of FoundationDB, a newcomer in the NewSQL field. FoundationDB creatively proposed the concept of a TableGroup, in which the storage location of a child table depends on the parent table and the two are physically adjacent, completely solving the efficiency and performance problems of join. Following this idea, MyCat proposes a data sharding strategy based on E-R relationships: records of the child table are stored on the same data shard as the associated parent-table records.
The customer table (mobile phone account table) is segmented by customer id modulo onto spt1 and spt2. The calllog table (call record table) depends on the parent table for segmentation. The association between the two tables is customer.id = calllog.phone_id.
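The co-location property behind ER sharding can be sketched in a few lines (a toy model with two shards, not MyCat's implementation): the child row reuses the parent's routing, so a join on customer.id = calllog.phone_id never needs to cross shards.

```python
SHARDS = 2  # spt1, spt2

def route_customer(customer_id: int) -> int:
    # parent table: modulo on its primary key (customer.id)
    return customer_id % SHARDS

def route_calllog(phone_id: int) -> int:
    # child table: calllog.phone_id references customer.id,
    # so the child simply reuses the parent's routing rule
    return route_customer(phone_id)

# every calllog row lands on the same shard as its customer row
for cid in (1, 2, 3, 1000, 1001):
    assert route_calllog(cid) == route_customer(cid)
print("co-located")
```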

<?xml version="1.0"?>
<!DOCTYPE mycat:schema SYSTEM "schema.dtd">
<mycat:schema xmlns:mycat="http://io.mycat/">

    <schema name="TESTDB" checkSQLschema="false" sqlMaxLimit="100" dataNode="spt1">
        <!-- 
            rule="customer_rule"  Custom sharding rule; must be defined in rule.xml
        -->
        <table name="customer" dataNode="spt1,spt2" rule="customer_rule" >
            <!-- 
                name="calllog"          child table name
                primaryKey="id"         child table primary key
                joinKey="phone_id"      join column in the child table (calllog.phone_id)
                parentKey="id"          referenced column in the parent table (customer.id)
            -->
            <childTable name="calllog" primaryKey="id" joinKey="phone_id" parentKey="id" />
        </table>
    </schema>

    <dataNode name="spt1" dataHost="host1" database="boss"/>
    <dataNode name="spt2" dataHost="host2" database="boss"/>

    <dataHost name="host1" maxCon="1000" minCon="10" balance="0"
              writeType="0" dbType="mysql" dbDriver="native" switchType="1"
              slaveThreshold="100">
        <heartbeat>select user()</heartbeat>
        <!-- can have multi write hosts -->
        <writeHost host="hostM1" url="172.17.0.2:3306" user="root" password="root">
        </writeHost>
    </dataHost>

    <dataHost name="host2" maxCon="1000" minCon="10" balance="0"
              writeType="0" dbType="mysql" dbDriver="native" switchType="1"
              slaveThreshold="100">
        <heartbeat>select user()</heartbeat>
        <!-- can have multi write hosts -->
        <writeHost host="hostM2" url="172.17.0.3:3306" user="root"  password="root">
        </writeHost>
    </dataHost>

</mycat:schema>

Run a join query on MyCat, and then on data nodes spt1 and spt2 respectively:

SELECT
    cus.id,
    cus.name     AS 'customer name',
    log.type     AS 'call type',
    log.othernum AS 'other phone number'
FROM customer cus
INNER JOIN calllog log ON cus.id = log.phone_id
ORDER BY cus.id;

Topics: MySQL