Doris basic use guide

Posted by Drizzt321 on Tue, 01 Feb 2022 06:26:49 +0100

Basic use guide

Doris uses the MySQL protocol for communication, so users can connect to a Doris cluster with the MySQL client or MySQL JDBC. When choosing a MySQL client version, 5.1 or later is recommended, because versions before 5.1 do not support user names longer than 16 characters. Taking the MySQL client as an example, this article walks through the basic usage of Doris in one complete workflow.

1 create user

1.1 Root user login and password modification

Doris has built-in root and admin users, and the password is empty by default. After starting the Doris program, you can connect to the Doris cluster through root or admin user. Log in to Doris with the following command:

mysql -h FE_HOST -P9030 -uroot

FE_HOST is the IP address of any FE node. 9030 is the query_port configured in fe.conf.

After logging in, you can change the root password with the following command:

SET PASSWORD FOR 'root' = PASSWORD('your_password');

1.2 create a new user

Create a normal user with the following command.

CREATE USER 'test' IDENTIFIED BY 'test_passwd';

Subsequent logins can then use the following connection command.

mysql -h FE_HOST -P9030 -utest -ptest_passwd

Newly created ordinary users have no permissions by default. To grant permissions, see the account authorization section below.

2 data table creation and data import

2.1 creating database

Initially, you can create a database through root or admin users:

CREATE DATABASE example_db;

For any command, you can use 'HELP command;' to see detailed syntax help. For example: HELP CREATE DATABASE;

If you do not know the full name of a command, you can use 'HELP' plus a keyword for fuzzy matching. For example, typing 'HELP CREATE' matches commands such as CREATE DATABASE, CREATE TABLE, and CREATE USER.

After the database is created, you can view it with SHOW DATABASES;.

MySQL> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| example_db         |
| information_schema |
+--------------------+
2 rows in set (0.00 sec)

information_schema exists for compatibility with the MySQL protocol. In practice, the information in it may not be fully accurate, so it is recommended to obtain information about a specific database by querying that database directly.

2.2 account authorization

After example_db is created, the root/admin account can grant read and write permissions on example_db to an ordinary account, such as test. After authorization, you can log in with the test account and operate on example_db.

GRANT ALL ON example_db TO test;

2.3 table building

Use the CREATE TABLE command to create a table. More detailed parameters can be viewed with:

HELP CREATE TABLE;

First switch the database:

USE example_db;

Doris supports two table creation methods: single partition and composite partition.

In a composite partition:

The first level is called Partition. You can specify one dimension column as the partition column (currently only integer and time type columns are supported) and specify the value range of each partition.

The second level is called Distribution, i.e. bucketing. You can specify one or more dimension columns and a bucket count; data is distributed by HASH on these columns.

Composite partitions are recommended in the following scenarios:

If there is a time dimension or a similar dimension with ordered values, such a column can be used as the partition column. The partition granularity can be chosen based on the import frequency and the amount of data per partition.
Historical data deletion: if you need to delete historical data (for example, keep only the last N days), composite partitions let you do this by dropping historical partitions. You can also delete data by sending a DELETE statement against a specified partition.
Data skew: each partition can have its own bucket count. For example, when partitioning by day and the daily data volume varies greatly, you can balance the data by setting a suitable bucket count per partition. Columns with high cardinality are recommended as bucket columns.
Users can also choose a single partition instead of a composite partition; the data is then only HASH-distributed.
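The bucketing step above can be sketched in Python. This is only a conceptual model: the hash function below is a stand-in, since Doris uses its own internal hash, not Python's.

```python
# Conceptual sketch of HASH bucketing: each row is routed to one of
# `buckets` tablets based on the hash of its bucket column (siteid here).
# NOTE: the modulo below is a stand-in for Doris's internal hash function.

def bucket_of(siteid: int, buckets: int = 10) -> int:
    return siteid % buckets

rows = [1, 2, 3, 4, 5, 11, 12]             # sample siteid values
placement = {s: bucket_of(s) for s in rows}
print(placement)   # siteids 1 and 11 land in the same bucket
```

Rows with the same bucket-column value always land in the same bucket, which is why a high-cardinality bucket column spreads data more evenly.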

Next, take the aggregation model as an example to demonstrate the table creation statements for the two partitioning methods.

Single partition

Create a logical table named table1. The bucket column is siteid, and the number of buckets is 10.

The schema of this table is as follows:

siteid: the type is INT (4 bytes), and the default value is 10
citycode: the type is SMALLINT (2 bytes)
username: the type is VARCHAR, the maximum length is 32, and the default value is an empty string
pv: the type is BIGINT (8 bytes), and the default value is 0; this is a metric column, which Doris aggregates internally. The aggregation method of this column is SUM
The table creation statement is as follows:

CREATE TABLE table1
(
    siteid INT DEFAULT '10',
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, citycode, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");
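The effect of AGGREGATE KEY with SUM can be sketched as follows: rows sharing the same key columns (siteid, citycode, username) are merged into one row, and their pv values are summed. The rows and helper below are illustrative only, not Doris internals.

```python
# Sketch of the aggregate model: rows with identical key columns
# (siteid, citycode, username) collapse into one row, summing pv.

def aggregate(rows):
    merged = {}
    for siteid, citycode, username, pv in rows:
        key = (siteid, citycode, username)
        merged[key] = merged.get(key, 0) + pv   # SUM aggregation on pv
    return merged

# Two imports carrying the same key (1, 1, 'jim'):
rows = [(1, 1, 'jim', 2), (2, 1, 'grace', 2), (1, 1, 'jim', 3)]
print(aggregate(rows))  # {(1, 1, 'jim'): 5, (2, 1, 'grace'): 2}
```

This is why queries against an aggregate-model table only ever see the merged values of the metric columns.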

Composite partition

Create a logical table named table2.

The schema of this table is as follows:

event_day: the type is DATE and there is no default value
siteid: the type is INT (4 bytes), and the default value is 10
citycode: the type is SMALLINT (2 bytes)
username: the type is VARCHAR, the maximum length is 32, and the default value is an empty string
pv: the type is BIGINT (8 bytes), and the default value is 0; this is a metric column, which Doris aggregates internally. The aggregation method of this column is SUM
We use the event_day column as the partition column and create three partitions: p201706, p201707 and p201708

p201706: Range [minimum value, 2017-07-01)
p201707: Range [2017-07-01, 2017-08-01)
p201708: Range [2017-08-01, 2017-09-01)

Note that the interval is left closed and right open.

Each partition uses siteid for hash bucketing, with 10 buckets.

The table creation statement is as follows:

CREATE TABLE table2
(
    event_day DATE,
    siteid INT DEFAULT '10',
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(event_day, siteid, citycode, username)
PARTITION BY RANGE(event_day)
(
    PARTITION p201706 VALUES LESS THAN ('2017-07-01'),
    PARTITION p201707 VALUES LESS THAN ('2017-08-01'),
    PARTITION p201708 VALUES LESS THAN ('2017-09-01')
)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");
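The left-closed, right-open ranges above can be checked with a small sketch. The helper name is ours, not a Doris API; it simply mirrors table2's partition bounds.

```python
# Route an event_day value to its partition, mirroring table2's ranges.
# Ranges are left-closed, right-open: a day equal to an upper bound
# falls into the NEXT partition.

PARTITIONS = [              # (name, exclusive upper bound)
    ("p201706", "2017-07-01"),
    ("p201707", "2017-08-01"),
    ("p201708", "2017-09-01"),
]

def partition_of(event_day: str):
    for name, upper in PARTITIONS:
        if event_day < upper:   # ISO dates compare correctly as strings
            return name
    return None                 # no partition covers this day

print(partition_of("2017-06-30"))  # p201706
print(partition_of("2017-07-01"))  # p201707 (boundary goes right)
print(partition_of("2017-09-01"))  # None
```

Note the last case: a row whose event_day is outside all ranges has no partition to land in, which is one reason to add new partitions before loading new data.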

After the tables are created, you can view the table information in example_db:

MySQL> SHOW TABLES;
+----------------------+
| Tables_in_example_db |
+----------------------+
| table1               |
| table2               |
+----------------------+
2 rows in set (0.01 sec)

MySQL> DESC table1;
+----------+-------------+------+-------+---------+-------+
| Field    | Type        | Null | Key   | Default | Extra |
+----------+-------------+------+-------+---------+-------+
| siteid   | int(11)     | Yes  | true  | 10      |       |
| citycode | smallint(6) | Yes  | true  | N/A     |       |
| username | varchar(32) | Yes  | true  |         |       |
| pv       | bigint(20)  | Yes  | false | 0       | SUM   |
+----------+-------------+------+-------+---------+-------+
4 rows in set (0.00 sec)

MySQL> DESC table2;
+-----------+-------------+------+-------+---------+-------+
| Field     | Type        | Null | Key   | Default | Extra |
+-----------+-------------+------+-------+---------+-------+
| event_day | date        | Yes  | true  | N/A     |       |
| siteid    | int(11)     | Yes  | true  | 10      |       |
| citycode  | smallint(6) | Yes  | true  | N/A     |       |
| username  | varchar(32) | Yes  | true  |         |       |
| pv        | bigint(20)  | Yes  | false | 0       | SUM   |
+-----------+-------------+------+-------+---------+-------+
5 rows in set (0.00 sec)

Notes:

The tables above are created as single-replica tables by setting replication_num. Doris recommends keeping the default 3-replica setting to ensure high availability.
You can dynamically add or delete partitions to the composite Partition table. For details, see the relevant section of Partition in HELP ALTER TABLE.
Data import can import the specified Partition. See HELP LOAD for details.
You can dynamically modify the Schema of the table.
Rollup can be added to the Table to improve query performance. For this part, please refer to the description of rollup in the advanced user guide.
Table columns are nullable (Null is true) by default, which has some impact on query performance.
2.4 importing data

Doris supports a variety of data import methods. Please refer to the data import document for details. Here we use streaming import and Broker import as examples.

Streaming import

Streaming import transmits data to Doris over the HTTP protocol and can import local data directly, without relying on other systems or components. For detailed syntax help, see HELP STREAM LOAD;.

Example 1: with "table1_20170707" as the Label, import the local file table1_data into table1.

curl --location-trusted -u test:test_passwd -H "label:table1_20170707" -H "column_separator:," -T table1_data http://FE_HOST:8030/api/example_db/table1/_stream_load
FE_HOST is the IP address of any FE node, and 8030 is the http_port configured in fe.conf.
You can also import via the IP of any BE node and the webserver_port configured in be.conf, for example: BE_HOST:8040.
The local file table1_data uses ',' as the column separator. Its contents are as follows:

1,1,jim,2
2,1,grace,2
3,2,tom,2
4,3,bush,3
5,3,helen,3

Example 2: with "table2_20170707" as the Label, import the local file table2_data into table2.

curl --location-trusted -u test:test_passwd -H "label:table2_20170707" -H "column_separator:|" -T table2_data http://127.0.0.1:8030/api/example_db/table2/_stream_load
The local file table2_data uses '|' as the column separator. Its contents are as follows:

2017-07-03|1|1|jim|2
2017-07-05|2|1|grace|2
2017-07-12|3|2|tom|2
2017-07-15|4|3|bush|3
2017-07-12|5|3|helen|3

Notes:

For streaming import, it is recommended to limit the file size to 10GB. Oversized files are more likely to fail, and retrying them is more expensive.
Each batch of imported data needs a Label, preferably a string related to that batch of data for easier reading and management. Based on the Label, Doris guarantees that the same batch of data is imported successfully at most once within a database. Labels of failed tasks can be reused.
Streaming import is a synchronous command. A successful return indicates that the data has been imported; a failed return indicates that the batch has not been imported.
Broker import

Broker import reads data from external storage through the deployed broker process for import. For more help, please refer to HELP BROKER LOAD;

Example: with "table1_20170708" as the Label, import a file on HDFS into table1.

LOAD LABEL table1_20170708
(
    DATA INFILE("hdfs://your.namenode.host:port/dir/table1_data")
    INTO TABLE table1
)
WITH BROKER hdfs 
(
    "username"="hdfs_user",
    "password"="hdfs_password"
)
PROPERTIES
(
    "timeout"="3600",
    "max_filter_ratio"="0.1"
);

Broker import is an asynchronous command. Successful execution of the statement above only means the task was submitted successfully; whether the import itself succeeded must be checked with SHOW LOAD. For example:

SHOW LOAD WHERE LABEL = "table1_20170708";

In the returned result, if the State field is FINISHED, the import is successful.

For more instructions on SHOW LOAD, please refer to HELP SHOW LOAD;

Asynchronous import tasks can be cancelled before they are finished:

CANCEL LOAD WHERE LABEL = "table1_20170708";

3 data query

3.1 simple query

Example:

MySQL> SELECT * FROM table1 LIMIT 3;
+--------+----------+----------+------+
| siteid | citycode | username | pv   |
+--------+----------+----------+------+
|      2 |        1 | 'grace'  |    2 |
|      5 |        3 | 'helen'  |    3 |
|      3 |        2 | 'tom'    |    2 |
+--------+----------+----------+------+
3 rows in set (0.01 sec)

MySQL> SELECT * FROM table1 ORDER BY citycode;
+--------+----------+----------+------+
| siteid | citycode | username | pv   |
+--------+----------+----------+------+
|      2 |        1 | 'grace'  |    2 |
|      1 |        1 | 'jim'    |    2 |
|      3 |        2 | 'tom'    |    2 |
|      4 |        3 | 'bush'   |    3 |
|      5 |        3 | 'helen'  |    3 |
+--------+----------+----------+------+
5 rows in set (0.01 sec)

3.2 Join query

Example:

MySQL> SELECT SUM(table1.pv) FROM table1 JOIN table2 WHERE table1.siteid = table2.siteid;
+--------------------+
| sum(`table1`.`pv`) |
+--------------------+
|                 12 |
+--------------------+
1 row in set (0.20 sec)
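The result 12 can be reproduced by hand from the sample files imported earlier. A sketch over the rows of table1_data and table2_data:

```python
# Recompute SUM(table1.pv) for the join on siteid, using the sample
# rows imported earlier. Each siteid appears exactly once in
# table2_data, so a simple membership test matches the join semantics.

table1 = [  # (siteid, citycode, username, pv)
    (1, 1, 'jim', 2), (2, 1, 'grace', 2), (3, 2, 'tom', 2),
    (4, 3, 'bush', 3), (5, 3, 'helen', 3),
]
table2_siteids = {1, 2, 3, 4, 5}   # siteid column of table2_data

total = sum(pv for siteid, _, _, pv in table1 if siteid in table2_siteids)
print(total)  # 12, matching the query result
```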

3.3 subquery

Example:

MySQL> SELECT SUM(pv) FROM table2 WHERE siteid IN (SELECT siteid FROM table1 WHERE siteid > 2);
+-----------+
| sum(`pv`) |
+-----------+
|         8 |
+-----------+
1 row in set (0.13 sec)
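Similarly, the result 8 follows from table2's rows whose siteid matches the subquery:

```python
# Recompute SUM(pv) FROM table2 WHERE siteid IN (SELECT siteid FROM
# table1 WHERE siteid > 2), using the sample rows of table2_data.

table2 = [  # (event_day, siteid, citycode, username, pv)
    ('2017-07-03', 1, 1, 'jim', 2), ('2017-07-05', 2, 1, 'grace', 2),
    ('2017-07-12', 3, 2, 'tom', 2), ('2017-07-15', 4, 3, 'bush', 3),
    ('2017-07-12', 5, 3, 'helen', 3),
]
matching = {3, 4, 5}   # siteids from table1 where siteid > 2

total = sum(pv for _, siteid, _, _, pv in table2 if siteid in matching)
print(total)  # 8, matching the query result
```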
