OLAP engine: cross-data-source analysis with Presto

Posted by daredevil14 on Fri, 04 Mar 2022 16:21:36 +0100

1, Presto overview

1. Introduction to Presto

Presto is an open source distributed SQL query engine designed for interactive analytic queries over data volumes ranging from gigabytes to petabytes. Although Presto parses and executes SQL, it is a query engine rather than a database in the conventional sense.

Presto queries data where it lives, including Hive, relational databases, and proprietary data stores. A single Presto query can combine data from multiple data sources, enabling analysis across an entire organization. Presto mainly targets scenarios where the expected response time ranges from under one second to a few minutes.

2. Presto architecture

The Presto query engine is a distributed system with a master-slave architecture that runs across multiple servers. It consists of one Coordinator node and multiple Worker nodes. The Coordinator parses SQL statements, generates execution plans, and distributes tasks to the Worker nodes; the Worker nodes actually execute the query tasks.

Coordinator node

The Coordinator server parses query statements, analyzes them and plans their execution, manages Presto's Worker nodes, tracks the activity of each Worker, and coordinates query execution. For each query, the Coordinator builds a logical model made up of multiple Stages; each Stage is then translated into Tasks that are distributed to different Workers for execution. The Coordinator communicates with Workers and clients over a REST API. Every Presto installation must have a Coordinator node.

Worker node

Workers execute query tasks and process data: they fetch data from connectors and exchange intermediate data with one another. The Coordinator collects the results from the Workers and returns the final result to the client. When a Worker starts, it announces itself to the discovery service in the Coordinator, which makes it available for task execution. This communication is also based on a REST API. A Presto installation usually has multiple Worker nodes.
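To make the division of labor concrete, here is a toy Python model of the flow described above: workers register with a coordinator, which splits a query into stages, turns each stage into tasks, and hands the tasks to the registered workers. The class names, the stage names, and the round-robin assignment are assumptions made purely for illustration; Presto's real scheduler is far more involved and talks to workers over REST rather than through in-process calls.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Worker:
    name: str
    tasks: list = field(default_factory=list)

    def run(self, task: str) -> str:
        # A real worker would read from a connector and process data here.
        self.tasks.append(task)
        return f"{task} done on {self.name}"


@dataclass
class Coordinator:
    workers: list = field(default_factory=list)

    def register(self, worker: Worker) -> None:
        # Stands in for a worker announcing itself at startup.
        self.workers.append(worker)

    def execute(self, stages: list, tasks_per_stage: int = 2) -> list:
        if not self.workers:
            raise RuntimeError("no workers registered")
        results = []
        assignment = cycle(self.workers)  # naive round-robin, illustration only
        for stage in stages:
            for i in range(tasks_per_stage):
                worker = next(assignment)
                results.append(worker.run(f"{stage}.task{i}"))
        return results


coordinator = Coordinator()
for name in ("worker-1", "worker-2"):  # hypothetical worker names
    coordinator.register(Worker(name))

# The coordinator gathers the task results and would return them to the client.
for line in coordinator.execute(["scan", "aggregate"]):
    print(line)
```

The point of the sketch is only the shape of the interaction: registration first, then one query fanned out as several tasks, with results flowing back through the coordinator.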

Data source adaptation

Presto can connect to and interact with many different data sources through connectors. It addresses a table by its fully qualified name, catalog.schema.table: the Catalog corresponds to a data source, the Schema to a database, and the Table to a data table.
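The naming scheme can be illustrated with a small Python helper that splits a fully qualified name into its three parts. This is only a sketch: the function and the table name orders are hypothetical, and it does not handle quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str  # the data source, e.g. mysql or hive
    schema: str   # the database inside that source
    table: str    # the table inside that database


def parse_table_name(qualified: str) -> TableName:
    """Split 'catalog.schema.table' into its three parts."""
    parts = qualified.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {qualified!r}")
    return TableName(*parts)


# 'orders' is a made-up table name used only for this example.
name = parse_table_name("mysql.sq_export.orders")
print(name.catalog, name.schema, name.table)
```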

The smallest unit of data processed in Presto is a Page object. A Page contains multiple Block objects, and each Block is a byte array that stores the values of one field for several rows. Reading the same position across all Blocks of a Page yields one complete row of data.
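The Page and Block layout is easier to see in a few lines of Python. The sketch below is a simplified model, not Presto's actual classes: it uses plain Python lists where Presto uses byte arrays, and the class and method names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Values of one field (column) for a run of rows."""
    values: list


@dataclass
class Page:
    """A group of equally long Blocks, one per field."""
    blocks: list

    def __post_init__(self):
        lengths = {len(block.values) for block in self.blocks}
        if len(lengths) > 1:
            raise ValueError("all blocks in a page must hold the same number of rows")

    @property
    def row_count(self) -> int:
        return len(self.blocks[0].values) if self.blocks else 0

    def row(self, position: int) -> tuple:
        # Reading the same position from every block yields one full row.
        return tuple(block.values[position] for block in self.blocks)


page = Page([Block([1, 2, 3]), Block(["a", "b", "c"])])
print(page.row(1))  # (2, 'b')
```

Storing data column by column lets an operator that needs only one field touch a single Block instead of every row.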

2, Presto installation

1. Installation package management

[root@hop01 presto]# pwd
/opt/presto
[root@hop01 presto]# ll
presto-cli-0.196-executable.jar
presto-server-0.189.tar.gz
[root@hop01 presto]# tar -zxvf presto-server-0.189.tar.gz

2. Configuration management

Create an etc folder in the Presto installation directory and add the configuration files described below:

/opt/presto/presto-server-0.189/etc

Node properties

Environment configuration specific to each node: etc/node.properties.

[root@hop01 etc]# vim node.properties
node.environment=production
node.id=presto01
node.data-dir=/opt/presto/data

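These settings define the environment name, which must be identical on every node in the cluster; the unique ID of this node; and the data directory, where Presto stores logs and other data.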
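When more nodes are added, the two rules that are easiest to get wrong are a shared environment name and a unique node.id. A minimal Python sketch of such a check follows; the function name and the second host hop02 are hypothetical, and the property values are passed in as plain dictionaries rather than read from disk.

```python
def check_nodes(nodes: dict) -> list:
    """Return a list of problems found in per-host node.properties values."""
    problems = []

    # Every node in one cluster should share the same environment name.
    environments = {props.get("node.environment") for props in nodes.values()}
    if len(environments) > 1:
        problems.append("node.environment differs between nodes")

    # Every node needs its own node.id.
    seen = {}
    for host, props in sorted(nodes.items()):
        node_id = props.get("node.id")
        if node_id in seen:
            problems.append(
                f"node.id {node_id!r} is used by both {seen[node_id]} and {host}"
            )
        else:
            seen[node_id] = host
    return problems


nodes = {
    "hop01": {"node.environment": "production", "node.id": "presto01"},
    # hop02 is a hypothetical second node that copied hop01's file unchanged.
    "hop02": {"node.environment": "production", "node.id": "presto01"},
}
print(check_nodes(nodes))
```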

JVM configuration

The command-line options used to launch the Java virtual machine, one per line: etc/jvm.config.

[root@hop01 etc]# vim jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Configuration properties

This file holds the configuration of the Presto server itself. Every Presto server can act as both a coordinator and a worker, but on larger clusters dedicating one machine to coordination alone gives the best performance. Here the Presto server is both the coordinator and a worker node: etc/config.properties.

[root@hop01 etc]# vim config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8083
query.max-memory=3GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://hop01:8083

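Here coordinator=true means that the current Presto instance acts as the coordinator, and node-scheduler.include-coordinator=true allows query work to be scheduled on the coordinator as well, so this single node serves both roles. discovery-server.enabled=true runs the discovery service inside the coordinator, and discovery.uri tells the node where to find it.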
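For a cluster with more than one machine, the same file differs by role. The following Python sketch renders config.properties text for a combined node, a dedicated coordinator, or a worker. The function name and the role labels are assumptions made for this example; the property names and default values are the ones used in this article.

```python
def render_config(role: str, discovery_uri: str, http_port: int = 8083,
                  max_memory: str = "3GB", max_memory_per_node: str = "1GB") -> str:
    """Render config.properties text for 'single', 'coordinator' or 'worker'."""
    if role not in ("single", "coordinator", "worker"):
        raise ValueError(f"unknown role: {role!r}")

    lines = [f"coordinator={'false' if role == 'worker' else 'true'}"]
    if role != "worker":
        # Only a combined node lets the scheduler place work on the coordinator.
        include = "true" if role == "single" else "false"
        lines.append(f"node-scheduler.include-coordinator={include}")
    lines += [
        f"http-server.http.port={http_port}",
        f"query.max-memory={max_memory}",
        f"query.max-memory-per-node={max_memory_per_node}",
    ]
    if role != "worker":
        # The discovery service runs inside the coordinator.
        lines.append("discovery-server.enabled=true")
    lines.append(f"discovery.uri={discovery_uri}")
    return "\n".join(lines) + "\n"


print(render_config("worker", "http://hop01:8083"))
```

Calling render_config("single", "http://hop01:8083") reproduces the file shown above, while the worker variant drops the scheduler and discovery-server lines and sets coordinator=false.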

Log configuration

[root@hop01 etc]# vim log.properties
com.facebook.presto=INFO

Catalog properties

/opt/presto/presto-server-0.189/etc/catalog

Configure the Hive connector:

[root@hop01 catalog]# vim hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://192.168.37.133:9083

Configure the MySQL connector:

[root@hop01 catalog]# vim mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://192.168.37.133:3306
connection-user=root
connection-password=123456
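Each file in etc/catalog defines one catalog, and the file name without the .properties suffix becomes the catalog name used in queries. As a rough illustration, the Python sketch below parses the text of a catalog file and reports which connector backs which catalog. The helper names are hypothetical, and the parser handles only simple key=value lines, not the full Java properties syntax.

```python
from pathlib import PurePosixPath


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values such as JDBC URLs stay intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def catalog_connector(path: str, text: str) -> tuple:
    """Return (catalog name, connector name) for one catalog file."""
    catalog = PurePosixPath(path).stem  # hive.properties -> hive
    return catalog, parse_properties(text)["connector.name"]


hive_text = "connector.name=hive-hadoop2\nhive.metastore.uri=thrift://192.168.37.133:9083\n"
print(catalog_connector("etc/catalog/hive.properties", hive_text))
```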

3. Running services

Start command

[root@hop01 /]# /opt/presto/presto-server-0.189/bin/launcher run

Startup log

Once the startup log shows no errors and reports that the server has started, Presto is running.
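Besides reading the log, the running server can be probed over HTTP. The Python sketch below requests the /v1/info endpoint on the port configured in config.properties. The function name and the injectable fetch parameter are assumptions made so the example can be tried without a live server, and the exact fields in the response should be treated as version dependent.

```python
import json
from urllib.request import urlopen


def server_info(base_url: str, fetch=None) -> dict:
    """Return the JSON document served at <base_url>/v1/info."""
    url = base_url.rstrip("/") + "/v1/info"
    if fetch is None:
        # Default: a real HTTP GET with a short timeout.
        def fetch(u):
            with urlopen(u, timeout=5) as response:
                return response.read()
    return json.loads(fetch(url))


# Against the server configured in this article one would call:
#   print(server_info("http://hop01:8083"))
```

If the call raises a connection error, the server is not listening yet; a parsed JSON document means the HTTP service is up.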

3, Client installation

1. Jar package management

[root@hop01 presto-cli]# pwd
/opt/presto/presto-cli
[root@hop01 presto-cli]# ll
presto-cli-0.196-executable.jar
[root@hop01 presto-cli]# mv presto-cli-0.196-executable.jar presto-cli.jar

2. Connect to MySQL

java -jar presto-cli.jar --server ip:8083 --catalog mysql --schema sq_export
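Because both catalogs are registered on the same server, one query can join a MySQL table with a Hive table. The Python sketch below assembles such a CLI invocation, using the CLI's --execute option to run a single statement non-interactively. The helper name and the tables user_info and visit_log are hypothetical, and the server address hop01:8083 is taken from the config.properties shown earlier.

```python
import shlex


def presto_cli_command(server: str, catalog: str, schema: str,
                       sql: str = None, jar: str = "presto-cli.jar") -> list:
    """Build the argument list for the Presto CLI."""
    argv = ["java", "-jar", jar,
            "--server", server, "--catalog", catalog, "--schema", schema]
    if sql is not None:
        # Run one statement and exit instead of opening the interactive shell.
        argv += ["--execute", sql]
    return argv


# Hypothetical tables: user_info lives in MySQL, visit_log lives in Hive.
query = (
    "SELECT u.id, count(*) AS visits "
    "FROM mysql.sq_export.user_info u "
    "JOIN hive.default.visit_log v ON v.user_id = u.id "
    "GROUP BY u.id"
)
print(shlex.join(presto_cli_command("hop01:8083", "mysql", "sq_export", query)))
```

The fully qualified names in the query are what make the cross-source join possible: the --catalog and --schema options only set defaults for unqualified table names.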

4, Source code address

GitHub address
https://github.com/cicadasmile/big-data-parent
Gitee address
https://gitee.com/cicadasmile/big-data-parent

