1, Deployment planning
For the Hadoop high-availability cluster deployment, see: Hadoop 3.x distributed high-availability cluster deployment
1.1 version description
Software | Version
---|---
Operating system | CentOS Linux release 7.8.2003 (Core)
Java | jdk-8u271-linux-x64
Hadoop | hadoop-3.2.2
Hive | hive-3.1.2
1.2 cluster planning
- Hive remote mode with HiveServer2 high availability
hostname | IP | components
---|---|---
master | 172.16.20.200 | NameNode, hive-metastore
secondmaster | 172.16.20.201 | NameNode, hive-metastore
slave1 | 172.16.20.202 | ZooKeeper, DataNode, NodeManager, HiveServer2
slave2 | 172.16.20.203 | ZooKeeper, DataNode, NodeManager, HiveServer2
slave3 | 172.16.20.204 | ZooKeeper, DataNode, NodeManager, MySQL
1.3 hive storage concepts
- Hive user interfaces:
  - Command-line interface (CLI): enter SQL statements on the command line to operate on data.
  - Web interface: access through a browser.
  - Remote service mode: access through JDBC and similar interfaces.
- Metadata store:
  Metadata is stored in a relational database (MySQL or Derby). It includes table attributes, table names, table columns, partitions and their attributes, and the directory where the table data is located.
- Interpreter, compiler, optimizer:
  These perform lexical analysis, syntax analysis, compilation, optimization, and query-plan generation for SQL query statements. The generated query plan is stored in HDFS and then executed by MapReduce.
- Data storage:
  Hive has no special data storage format and builds no indexes on the data; all Hive data is stored in HDFS. Hive contains the following data models: tables, external tables, partitions, and buckets.
1.4 role of metadata and Metastore
- Metadata concept:
  Metadata is the meta-information about the databases, tables, and so on created with Hive. It is stored in a relational database such as Derby or MySQL.
- Metastore function:
  Clients connect to the metastore service, and the metastore in turn connects to the MySQL database to access the metadata. With a metastore service, multiple clients can connect at the same time, and these clients do not need to know the username and password of the MySQL database; they only need to be able to reach the metastore service.
1.5 Hive metadata storage (three configuration modes of Metastore)
- Embedded mode
  - The Hive service and the metastore service run in the same process, and the Derby database runs in that process as well. Embedded mode uses the embedded Derby database to store metadata, so no separate metastore service is needed.
  - This is the default and simplest configuration, but only one client can connect at a time, so it is suitable for experiments rather than production environments.
- Local mode: install MySQL locally instead of Derby to store metadata
  - Instead of the embedded Derby, another database such as MySQL is used as the metadata store. The Hive service and the metastore service still run in the same process; MySQL runs as a separate process, on the same machine or on a remote one.
  - This is a multi-user mode that allows multiple user clients to connect to one database, and it is how Hive is typically run inside a company. Every user must have access to MySQL; that is, every client user needs to know the MySQL username and password.
- Remote mode: install MySQL remotely instead of Derby to store metadata (see the sketch after this list)
  - The Hive service and the metastore service run in different processes, possibly on different machines. This mode requires hive.metastore.local to be set to false and hive.metastore.uris to be set to the metastore server URI.
  - Remote metadata storage requires a standalone metastore service; each client is then configured to connect to that service in its configuration file. The metastore is started as a separate service, clients connect through Beeline, and they do not need to know the database password.
  - Merely connecting to a remote MySQL does not qualify as "remote mode"; "remote" refers to whether the metastore service and the Hive service run in the same process.
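A minimal sketch of the client-side fragment that makes a deployment "remote mode", written in the same cat-heredoc style used throughout this guide (the complete client file appears in section 3.6):

```bash
# Sketch only: point clients at the standalone metastore service instead of MySQL
cat > $HIVE_HOME/conf/hive-site.xml << 'EOF'
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083</value>
    </property>
</configuration>
EOF
```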
2, MYSQL deployment
- slave3 node
2.1 download and decompress
Download address: https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-8.0.26-linux-glibc2.12-x86_64.tar.xz
tar -xf mysql-8.0.26-linux-glibc2.12-x86_64.tar.xz -C /usr/local/
ln -s /usr/local/mysql-8.0.26-linux-glibc2.12-x86_64 /usr/local/mysql
echo 'export PATH=/usr/local/mysql/bin:$PATH' >> /etc/profile
source /etc/profile
2.2 environment configuration
## Create the mysql user
useradd mysql
## Create the data directories
mkdir -pv /data/mysql/{data,logs,binlog}
chown mysql.mysql /data/mysql -R
2.3 create the configuration file
Note: the source configuration set innodb_log_file_size twice (512M, then 2G); only the effective value is kept below.
cat > /data/mysql/my.cnf << 'EOF'
[client]
port = 3306
socket = /data/mysql/mysqld.sock
default-character-set = utf8mb4

[mysql]
prompt="\u@mysqldb \R:\m:\s [\d]> "
no-auto-rehash
default-character-set = utf8mb4

[mysqld]
user = mysql
port = 3306
socket = /data/mysql/mysqld.sock
skip-name-resolve
# Character encoding
character-set-server = utf8
collation-server = utf8_general_ci
# Default time zone
#default-time_zone = '+8:00'
server-id = 1
# Directories
basedir = /usr/local/mysql
datadir = /data/mysql/data
secure_file_priv = /data/mysql/
pid-file = /data/mysql/mysql.pid
max_connections = 1024
max_connect_errors = 100
wait_timeout = 100
max_allowed_packet = 128M
table_open_cache = 2048
back_log = 600
default-storage-engine = innodb
log_bin_trust_function_creators = 1
# General log
general_log = off
#general_log_file = /data/mysql/logs/mysql.log
log-error = /data/mysql/logs/error.log
# binlog
log-bin = /data/mysql/binlog/mysql-binlog
binlog_format = mixed
# Slow query log
slow_query_log = 1
slow_query_log_file = /data/mysql/logs/slow.log
long_query_time = 2
log_output = FILE
log_queries_not_using_indexes = 0
# Global buffers
innodb_buffer_pool_size = 2G
innodb_log_buffer_size = 16M
innodb_flush_log_at_trx_commit = 2
key_buffer_size = 64M
innodb_log_file_size = 2G
innodb_log_files_in_group = 2
innodb_data_file_path = ibdata1:20M:autoextend
sql_mode = STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION

[mysqldump]
quick
max_allowed_packet = 32M
EOF
ln -s /data/mysql/my.cnf /etc/my.cnf
2.4 initialize MYSQL
cd /usr/local/mysql
bin/mysqld --initialize --user=mysql --datadir=/data/mysql/data/ --basedir=/usr/local/mysql
2.5 start MYSQL
cp support-files/mysql.server /etc/init.d/mysqld
chmod +x /etc/init.d/mysqld
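The two commands above only install the init script; MySQL itself still needs to be started. A minimal sketch, assuming the standard SysV tooling on CentOS 7:

```bash
# Start MySQL via the freshly installed init script
/etc/init.d/mysqld start
# Optionally enable start on boot (assumption: chkconfig is present, as on CentOS 7)
chkconfig --add mysqld
chkconfig mysqld on
```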
2.6 login to MYSQL
## View the mysql initialization password
cat /data/mysql/logs/error.log | grep password | awk -F':' '{print $NF}'
5>G=3sJ6Cy2i
## Change the password
mysqladmin -uroot -p'5>G=3sJ6Cy2i' password 123456
## Log in to mysql
mysql -uroot -p123456
2.7 hive authorization
create database hive;
create user "hive"@"%" identified by "Passw0rd";
grant all privileges on hive.* to "hive"@"%";
flush privileges;
2.8 fix the mysql time zone
select now();
show variables like '%time_zone%';
set global time_zone = '+8:00';
Note that set global does not survive a restart; to make the setting permanent, uncomment the default-time_zone = '+8:00' line in my.cnf above.
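As a quick check that the grants from 2.7 work over the network, the hive account can be tested from any other cluster node; a sketch, assuming the mysql client is installed there and slave3 resolves as in the cluster plan:

```bash
# Should list the freshly created hive database without touching root credentials
mysql -uhive -pPassw0rd -h slave3 -e 'show databases;'
```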
3, Hive deployment
- master node operation
3.1 download and unzip
Download address: https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar -zxf apache-hive-3.1.2-bin.tar.gz -C /opt/hadoop/
ln -s /opt/hadoop/apache-hive-3.1.2-bin /usr/local/hive
Configure the environment variables by appending to /etc/profile
cat >> /etc/profile << 'EOF'
#HIVE
HIVE_HOME=/usr/local/hive
PATH=$HIVE_HOME/bin:$PATH
export PATH HIVE_HOME
EOF
source /etc/profile
3.2 modify the configuration
cd $HIVE_HOME/conf
hive-log4j2.properties
cp hive-log4j2.properties.template hive-log4j2.properties
hive-env.sh
cat > hive-env.sh << 'EOF'
export HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf
export HIVE_AUX_JARS_PATH=/usr/local/hive/lib
EOF
3.3 synchronize the configuration
rsync -av /opt/hadoop/apache-hive-3.1.2-bin root@sm:/opt/hadoop/
rsync -av /opt/hadoop/apache-hive-3.1.2-bin root@s1:/opt/hadoop/
rsync -av /opt/hadoop/apache-hive-3.1.2-bin root@s2:/opt/hadoop/
Then create a soft link on each of those nodes (a loop version is sketched below):
ln -s /opt/hadoop/apache-hive-3.1.2-bin /usr/local/hive
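To avoid logging in to each node one by one, the same link can be created over SSH in a loop; a sketch, assuming sm, s1, and s2 are the SSH aliases used in the rsync commands above:

```bash
# Create the soft link on all three nodes in one pass
for h in sm s1 s2; do
  ssh root@$h 'ln -s /opt/hadoop/apache-hive-3.1.2-bin /usr/local/hive'
done
```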
3.4 metastore configuration
- Operate on both the master and secondmaster nodes
Create the directories on HDFS
hdfs dfs -mkdir -p /data/hive/{warehouse,tmp,logs}
hdfs dfs -chmod -R 775 /data/hive/
cd $HIVE_HOME/conf
3.4.1 metastore-site.xml
cat > metastore-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hive.metastore.local</name>
        <value>true</value>
    </property>
    <!-- Root scratch directory on HDFS for Hive jobs -->
    <property>
        <name>hive.exec.scratchdir</name>
        <value>/data/hive/tmp</value>
    </property>
    <!-- Write permissions for the scratch directory -->
    <property>
        <name>hive.scratch.dir.permission</name>
        <value>775</value>
    </property>
    <!-- Hive warehouse (table data) location on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/data/hive/warehouse</value>
    </property>
    <!-- Metadata database connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://slave3:3306/hive?createDatabaseIfNotExist=true</value>
    </property>
    <!-- Metadata database JDBC driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <!-- Metadata database username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <!-- Metadata database password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>Passw0rd</value>
    </property>
    <!-- Metastore service address -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083</value>
    </property>
</configuration>
EOF
The configuration on secondmaster is the same, except that hive.metastore.uris must be changed to thrift://secondmaster:9083; a one-liner for this is sketched below.
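A sketch of that edit, assuming the file on secondmaster was created exactly as above:

```bash
# On secondmaster: point hive.metastore.uris at the local metastore instance
sed -i 's|thrift://master:9083|thrift://secondmaster:9083|' $HIVE_HOME/conf/metastore-site.xml
```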
3.4.2 download the mysql driver package
Driver page: https://mvnrepository.com/artifact/mysql/mysql-connector-java
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.26/mysql-connector-java-8.0.26.jar -P $HIVE_HOME/lib/
3.4.3 initialize the metastore schema in MySQL
- Just execute it once
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
Possible error:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
The guava versions shipped with hadoop and hive are inconsistent. The solution is to delete the lower version and copy in the higher one, as follows:
ll $HIVE_HOME/lib/guava*; ll $HADOOP_HOME/share/hadoop/common/lib/guava*
-rw-r--r-- 1 root root 2308517 Sep 27  2018 /usr/local/hive/lib/guava-19.0.jar
-rw-r--r-- 1 1000 1000 2747878 Jan  3  2021 /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar
rm -f $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
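After swapping the jar, rerun the one-time initialization; schematool can then confirm the schema version stored in MySQL (the -info option is part of the stock schematool):

```bash
# Re-run the schema initialization, then verify what landed in MySQL
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
$HIVE_HOME/bin/schematool -dbType mysql -info
```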
3.4.4 start the metastore
mkdir -pv $HIVE_HOME/logs
nohup $HIVE_HOME/bin/hive --service metastore >> $HIVE_HOME/logs/metastore.log 2>&1 &
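A quick way to confirm the metastore is up is to check its log and the Thrift port from hive.metastore.uris; a sketch:

```bash
# 9083 is the port configured in hive.metastore.uris above
ss -tnlp | grep 9083
tail -n 20 $HIVE_HOME/logs/metastore.log
```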
3.5 hiveserver2 configuration
- Operate on both the slave1 and slave2 nodes in the same way
3.5.1 hiveserver2-site.xml
cat > hiveserver2-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083,thrift://secondmaster:9083</value>
    </property>
    <!-- HiveServer2 high availability -->
    <property>
        <name>hive.server2.support.dynamic.service.discovery</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.active.passive.ha.enable</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.zookeeper.namespace</name>
        <value>hiveserver2_zk</value>
    </property>
    <property>
        <name>hive.zookeeper.quorum</name>
        <value>slave1:2181,slave2:2181,slave3:2181</value>
    </property>
    <property>
        <name>hive.zookeeper.client.port</name>
        <value>2181</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10001</value>
    </property>
    <!-- Set to the local hostname: slave1 or slave2 -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>slave1</value>
    </property>
</configuration>
EOF
Note: set hive.server2.thrift.bind.host to the hostname of the local machine, as in the sketch below for slave2.
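A sketch of that edit, assuming the file on slave2 was written exactly as above:

```bash
# On slave2: rebind HiveServer2 to the local hostname
sed -i 's|<value>slave1</value>|<value>slave2</value>|' $HIVE_HOME/conf/hiveserver2-site.xml
```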
3.5.2 modify the hadoop configuration
Add the following configuration in core-site.xml
<!-- If port 10001 cannot be connected to -->
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
The properties are hadoop.proxyuser.xxx.hosts and hadoop.proxyuser.xxx.groups, where xxx is the user that starts HiveServer2 (root here).
If this is not configured, the following error is reported when starting hiveserver2:
WARN [main] metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. getCurrentNotificationEventId org.apache.thrift.TApplicationException: Internal error processing get_current_notificationEventId
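After distributing the modified core-site.xml to all Hadoop nodes, the proxy-user settings can be reloaded without a full restart using Hadoop's standard refresh commands:

```bash
# Re-read the hadoop.proxyuser.* settings on the NameNode and ResourceManager
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration
```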
3.5.3 start hiveserver2
Start hiveserver2
mkdir -pv $HIVE_HOME/logs
nohup $HIVE_HOME/bin/hive --service hiveserver2 >> $HIVE_HOME/logs/hiveserver2.log 2>&1 &
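Once both instances are running, they should have registered in ZooKeeper; a sketch using the stock ZooKeeper CLI against the namespace from 3.5.1:

```bash
# Each live HiveServer2 registers an ephemeral znode; note that with
# active/passive HA enabled, Hive may register under a separate registry namespace
zkCli.sh -server slave1:2181 ls /hiveserver2_zk
```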
3.5.4 connection test
$HIVE_HOME/bin/beeline -u jdbc:hive2://slave1:10001
$HIVE_HOME/bin/beeline -u jdbc:hive2://slave2:10001
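The two commands above test each instance directly. To exercise the high-availability path, Beeline can instead connect through ZooKeeper service discovery (standard HiveServer2 JDBC syntax; with hive.server2.active.passive.ha.enable=true the mode may need to be zooKeeperHA rather than zooKeeper):

```bash
$HIVE_HOME/bin/beeline -u 'jdbc:hive2://slave1:2181,slave2:2181,slave3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2_zk'
```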
3.5.5 web UI
http://172.16.20.202:10002/
http://172.16.20.203:10002/
3.6 hive client configuration
hive-site.xml
cd $HIVE_HOME/conf
cat > hive-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083,thrift://secondmaster:9083</value>
    </property>
    <!-- Display column names in query results -->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <!-- Display the current database name in the prompt -->
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
</configuration>
EOF
Fix the guava version inconsistency, in the same way as in section 3.4.3
rm -f $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
Start the client and test the connection
$HIVE_HOME/bin/hive
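A short smoke test confirms that the client can reach the metastore and write to the warehouse directory; the table name below is just a hypothetical example:

```bash
# Create and drop a throwaway table end to end (test_ha is a made-up name)
$HIVE_HOME/bin/hive -e "show databases; create table test_ha (id int); show tables; drop table test_ha;"
```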