Data Source Management | OLAP Query Engine, ClickHouse Cluster Management

Posted by bettydailey on Mon, 18 May 2020 02:49:32 +0200

Source code for this article: GitHub and GitEE (addresses in section 4).

1. Introduction to Columnar Databases

ClickHouse is a column-oriented database management system (DBMS) open-sourced by Russia's Yandex in 2016. It is mainly used for OLAP (online analytical processing) queries and can produce real-time analytical reports from SQL queries.

Column storage

Row storage and column storage organize data on disk in fundamentally different ways. For analytical queries, a row store has to scan every column of every row in the table, while a column store only reads the columns involved in the query; for example, aggregating a single column of a 100-column table touches roughly 1% of the data. This is why columnar databases are better suited to analytical workloads on large, wide tables.

Note that this comparison applies specifically to data analysis and computation scenarios.

2. Cluster Configuration

1. Basic Environment

A standalone ClickHouse service is assumed to be already installed on each node.

2. Raise the file and process limits

vim /etc/security/limits.conf
vim /etc/security/limits.d/90-nproc.conf
# Append to the end of both files
* soft nofile 65536 
* hard nofile 65536 
* soft nproc 131072 
* hard nproc 131072

3. Disable SELinux

Set SELINUX=disabled in /etc/selinux/config and then restart.

4. Cluster Configuration File

Add the cluster configuration on each service node: vim /etc/metrika.xml

<yandex>
<clickhouse_remote_servers>
    <clickhouse_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>192.168.72.133</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>192.168.72.136</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>192.168.72.137</host>
                <port>9000</port>
            </replica>
        </shard>
    </clickhouse_cluster>
</clickhouse_remote_servers>

<zookeeper-servers>
  <node index="1">
    <host>192.168.72.133</host>
    <port>2181</port>
  </node>
  <node index="2">
    <host>192.168.72.136</host>
    <port>2181</port>
  </node>
  <node index="3">
    <host>192.168.72.137</host>
    <port>2181</port>
  </node>
</zookeeper-servers>

<macros>
    <replica>192.168.72.133</replica>
</macros>

<networks>
   <ip>::/0</ip>
</networks>

<clickhouse_compression>
<case>
  <min_part_size>10000000000</min_part_size>
  <min_part_size_ratio>0.01</min_part_size_ratio>
  <method>lz4</method>
</case>
</clickhouse_compression>
</yandex>

Note the <macros> section here:

<macros>
    <replica>192.168.72.133</replica>
</macros>

Set <replica> to the node's own IP address; the value differs on every server.

5. Start Cluster

Start the ClickHouse service on all three nodes:

service clickhouse-server start

6. Log in with the Client to Verify

Log in to any one of the nodes:

clickhouse-client
en-master :) select * from system.clusters

Note the cluster name, clickhouse_cluster, which will be used later.

7. Basic Environment Test

Create the same local table on all three nodes:

CREATE TABLE ontime_local (FlightDate Date,Year UInt16) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);

Create the distributed table on the 133 node:

CREATE TABLE ontime_all AS ontime_local ENGINE = Distributed(clickhouse_cluster, default, ontime_local, rand());

Write data into the local table on any node:

insert into ontime_local (FlightDate,Year) values ('2020-03-12',2020);

Query the distributed table:

select * from ontime_all;

Write to the distributed table, and the data is spread across the local tables on each shard:

insert into ontime_all (FlightDate,Year)values('2001-10-12',2001);
insert into ontime_all (FlightDate,Year)values('2002-10-12',2002);
insert into ontime_all (FlightDate,Year)values('2003-10-12',2003);

If any one node is shut down, queries against the distributed table fail immediately, because each shard in this layout has only a single replica.

3. Integrating with the Cluster Environment

1. Basic Configuration

url: the full list of node addresses, used mainly to manage table structures and run batch operations on every node;

cluster: the entry address used for cluster queries, which can also point to an Nginx proxy in front of the nodes;

spring:
  datasource:
    type: com.alibaba.druid.pool.DruidDataSource
    click:
      driverClassName: ru.yandex.clickhouse.ClickHouseDriver
      url: jdbc:clickhouse://127.0.0.1:8123/default,jdbc:clickhouse://127.0.0.1:8123/default,jdbc:clickhouse://127.0.0.1:8123/default
      cluster: jdbc:clickhouse://127.0.0.1:8123/default
      initialSize: 10
      maxActive: 100
      minIdle: 10
      maxWait: 6000
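
The JdbcParamConfig class referenced later is part of the sample repository and is not listed in this article. Below is a minimal sketch of how these properties could be bound with Spring's @ConfigurationProperties; the prefix spring.datasource.click, the field names, and the getUrlList helper are assumptions based on the YAML above, not the repository's actual code.

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

import java.util.Arrays;
import java.util.List;

// Hypothetical binding of the YAML block above; field names mirror the keys shown there.
@Component
@ConfigurationProperties(prefix = "spring.datasource.click")
public class JdbcParamConfig {

    private String driverClassName;
    private String url;       // comma-separated list of node addresses
    private String cluster;   // single entry address for cluster queries
    private Integer initialSize;
    private Integer maxActive;
    private Integer minIdle;
    private Integer maxWait;

    /** Split the comma-separated url property into one JDBC URL per node. */
    public List<String> getUrlList() {
        return Arrays.asList(url.split(","));
    }

    // Getters and setters are required for property binding.
    public String getDriverClassName() { return driverClassName; }
    public void setDriverClassName(String driverClassName) { this.driverClassName = driverClassName; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getCluster() { return cluster; }
    public void setCluster(String cluster) { this.cluster = cluster; }
    public Integer getInitialSize() { return initialSize; }
    public void setInitialSize(Integer initialSize) { this.initialSize = initialSize; }
    public Integer getMaxActive() { return maxActive; }
    public void setMaxActive(Integer maxActive) { this.maxActive = maxActive; }
    public Integer getMinIdle() { return minIdle; }
    public void setMinIdle(Integer minIdle) { this.minIdle = minIdle; }
    public Integer getMaxWait() { return maxWait; }
    public void setMaxWait(Integer maxWait) { this.maxWait = maxWait; }
}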

2. Management Interface

Create the tables and write data on each single-node service:

data_shard (local per-node table)

data_all (distributed table)

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource;
import java.util.List;

@RestController
public class DataShardWeb {

    @Resource
    private JdbcFactory jdbcFactory ;

    /**
     * Create the local table and the distributed table on every node
     */
    @GetMapping("/createTable")
    public String createTable (){
        List<JdbcTemplate> jdbcTemplateList = jdbcFactory.getJdbcList();
        for (JdbcTemplate jdbcTemplate : jdbcTemplateList){
            jdbcTemplate.execute("CREATE TABLE data_shard (FlightDate Date,Year UInt16) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)");
            jdbcTemplate.execute("CREATE TABLE data_all AS data_shard ENGINE = Distributed(clickhouse_cluster, default, data_shard, rand())");
        }
        return "success";
    }

    /**
     * Write data into the local table on every node
     */
    @GetMapping("/insertData")
    public String insertData (){
        List<JdbcTemplate> jdbcTemplateList = jdbcFactory.getJdbcList();
        for (JdbcTemplate jdbcTemplate : jdbcTemplateList){
            jdbcTemplate.execute("insert into data_shard (FlightDate,Year) values ('2020-04-12',2020)");
        }
        return "success";
    }
}
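
The JdbcFactory injected above also comes from the sample repository and is not shown in this article. A minimal sketch, assuming it simply builds one JdbcTemplate per address in the url list (everything except the getJdbcList method name is illustrative):

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DriverManagerDataSource;
import org.springframework.stereotype.Component;

import javax.annotation.Resource;
import java.util.ArrayList;
import java.util.List;

// Hypothetical factory: one JdbcTemplate per configured node, for per-node DDL and writes.
@Component
public class JdbcFactory {

    @Resource
    private JdbcParamConfig jdbcParamConfig;

    /** Build a JdbcTemplate for every node address in the comma-separated url property. */
    public List<JdbcTemplate> getJdbcList() {
        List<JdbcTemplate> templates = new ArrayList<>();
        for (String url : jdbcParamConfig.getUrlList()) {
            DriverManagerDataSource dataSource = new DriverManagerDataSource();
            dataSource.setDriverClassName(jdbcParamConfig.getDriverClassName());
            dataSource.setUrl(url);
            templates.add(new JdbcTemplate(dataSource));
        }
        return templates;
    }
}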

3. Cluster Query

Once the steps above are complete, connect through the cluster entry address and query the distributed table, which in turn reads the local tables on each shard.

Druid-based Connection

import com.alibaba.druid.pool.DruidDataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import javax.annotation.Resource;
import javax.sql.DataSource;

@Configuration
public class DruidConfig {

    @Resource
    private JdbcParamConfig jdbcParamConfig ;

    /**
     * Druid data source pointing at the cluster entry address
     */
    @Bean
    public DataSource dataSource() {
        DruidDataSource datasource = new DruidDataSource();
        datasource.setUrl(jdbcParamConfig.getCluster());
        datasource.setDriverClassName(jdbcParamConfig.getDriverClassName());
        datasource.setInitialSize(jdbcParamConfig.getInitialSize());
        datasource.setMinIdle(jdbcParamConfig.getMinIdle());
        datasource.setMaxActive(jdbcParamConfig.getMaxActive());
        datasource.setMaxWait(jdbcParamConfig.getMaxWait());
        return datasource;
    }
}

Query via a MyBatis mapper

<mapper namespace="com.ckhouse.cluster.mapper.DataAllMapper">

    <resultMap id="BaseResultMap" type="com.ckhouse.cluster.entity.DataAllEntity">
        <result column="FlightDate" jdbcType="VARCHAR" property="flightDate" />
        <result column="Year" jdbcType="INTEGER" property="year" />
    </resultMap>

    <select id="getList" resultMap="BaseResultMap" >
        select * from data_all where Year=2020
    </select>
</mapper>
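
The DataAllMapper interface and DataAllEntity class named in the mapper above belong to the sample repository. The controller below is only an illustrative sketch of how the getList query might be exposed, assuming the mapper is registered with MyBatis (for example via @MapperScan):

import com.ckhouse.cluster.entity.DataAllEntity;
import com.ckhouse.cluster.mapper.DataAllMapper;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource;
import java.util.List;

// Hypothetical endpoint: runs the mapper query against the distributed table data_all,
// through the cluster DataSource configured in DruidConfig.
@RestController
public class DataAllWeb {

    @Resource
    private DataAllMapper dataAllMapper;

    @GetMapping("/queryData")
    public List<DataAllEntity> queryData() {
        // Executes "select * from data_all where Year=2020" from the mapper XML
        return dataAllMapper.getList();
    }
}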

4. Source code address

GitHub address: https://github.com/cicadasmile/data-manage-parent
GitEE address: https://gitee.com/cicadasmile/data-manage-parent

Topics: Database JDBC github vim SELinux