Hive 3.x high availability deployment

Posted by ernielou on Mon, 04 Oct 2021 19:56:52 +0200

1, Deployment planning

For deploying the Hadoop high availability cluster, see: Hadoop 3.x distributed high availability cluster deployment

1.1 version description

Software            Version
Operating system    CentOS Linux release 7.8.2003 (Core)
Java                jdk-8u271-linux-x64
Hadoop              hadoop-3.2.2
Hive                hive-3.1.2

1.2 cluster planning

  • Hive remote mode && hiveserver2 high availability

hostname       IP              components
master         172.16.20.200   NameNode, hive-metastore
secondmaster   172.16.20.201   NameNode, hive-metastore
slave1         172.16.20.202   Zookeeper, DataNode, NodeManager, hiveserver2
slave2         172.16.20.203   Zookeeper, DataNode, NodeManager, hiveserver2
slave3         172.16.20.204   Zookeeper, DataNode, NodeManager, mysql

1.3 hive storage concept

  1. Hive user interfaces:
    Command-line interface (CLI): enter SQL statements on the command line to operate on data.
    Web interface: access through a web browser.
    Remote service mode: access through JDBC and similar interfaces.
  2. Metadata store
    Metadata is stored in a relational database (MySQL or Derby). It includes table attributes, table names, columns, partitions and their attributes, and the directory where the table data is located.
  3. Interpreter, compiler, optimizer
    These components perform lexical analysis, syntax analysis, compilation, optimization, and query plan generation for SQL statements. The generated query plan is stored in HDFS and subsequently executed by MapReduce.
  4. Data storage
    Hive has no special storage format and builds no indexes on the data; all Hive data is stored in HDFS. Hive includes the following data models: tables, external tables, partitions, and buckets.
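As an illustration of these data models, a managed table can combine partitions and buckets; the table and column names below are hypothetical:

```sql
-- Hypothetical example: a managed table that is both partitioned and bucketed.
-- Each dt value becomes an HDFS subdirectory under the warehouse directory;
-- rows are hashed on uid into 4 bucket files per partition.
CREATE TABLE logs (uid BIGINT, msg STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (uid) INTO 4 BUCKETS
STORED AS ORC;
```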

1.4 role of metadata and Metastore

  1. Metadata concept:
    Metadata is the meta information of the databases, tables, and so on created with Hive. It is stored in a relational database such as Derby or MySQL.

  2. Metastore function:
    Clients connect to the metastore service, and the metastore in turn connects to the MySQL database to access the metadata. With a metastore service, multiple clients can connect at the same time, and they do not need to know the MySQL user name and password; they only need to be able to reach the metastore service.

1.5 Hive metadata storage (three configuration modes of Metastore)

  1. Embedded mode
    The Hive service, the metastore service, and the Derby database all run in the same process. Embedded mode stores metadata in the embedded Derby database, so no separate metastore service is needed.
    This is the default and simplest configuration, but only one client can connect at a time; it is suitable for experiments, not for production.
  2. Local mode: metadata stored in a locally installed MySQL instead of Derby
    • Instead of the embedded Derby, another database such as MySQL stores the metadata. The Hive service and the metastore service still run in the same process; MySQL runs as a separate process, on the same machine or a remote one.
    • This is a multi-user mode: multiple client users connect to one database, which is how Hive is commonly run inside a company. Every client must be able to access MySQL, i.e. every client user needs the MySQL user name and password.
  3. Remote mode: metadata stored in a remotely installed MySQL instead of Derby
    • The Hive service and the metastore run in different processes, possibly on different machines. This mode requires hive.metastore.local to be set to false and hive.metastore.uris to point at the metastore server.
    • Remote metadata storage requires a standalone metastore service, and each client is then configured to connect to it. Clients of all kinds connect through beeline and do not need to know the database password.
    • Merely connecting to a remote MySQL does not make a deployment "remote mode"; "remote" refers to whether the metastore and the Hive service run in the same process.
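As a sketch, the switch to remote mode described in point 3 comes down to these client-side properties (the metastore address here matches the cluster planning above):

```xml
<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://master:9083</value>
</property>
```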

2, MySQL deployment

  • slave3 node

2.1 download and extract

Download address: https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-8.0.26-linux-glibc2.12-x86_64.tar.xz

tar -xf mysql-8.0.26-linux-glibc2.12-x86_64.tar.xz -C /usr/local/
ln -s /usr/local/mysql-8.0.26-linux-glibc2.12-x86_64 /usr/local/mysql
echo 'export PATH=/usr/local/mysql/bin:$PATH' >> /etc/profile
source /etc/profile

2.2 environment configuration

## Create user
useradd mysql
## Create directory
mkdir -pv /data/mysql/{data,logs,binlog}
chown mysql.mysql /data/mysql -R

2.3 create the configuration file

cat > /data/mysql/my.cnf << 'EOF'
[client]
port = 3306
socket = /data/mysql/mysqld.sock
default-character-set = utf8mb4

[mysql]
prompt="\u@mysqldb \R:\m:\s [\d]> "
no-auto-rehash
default-character-set = utf8mb4

[mysqld]
user = mysql
port = 3306
socket = /data/mysql/mysqld.sock
skip-name-resolve

# Set character encoding
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci

# Set default time zone
#default-time_zone='+8:00'

server-id = 1

# Directory
basedir = /usr/local/mysql
datadir = /data/mysql/data
secure_file_priv = /data/mysql/
pid-file = /data/mysql/mysql.pid


max_connections       = 1024
max_connect_errors    = 100
wait_timeout          = 100
max_allowed_packet    = 128M
table_open_cache      = 2048
back_log              = 600

default-storage-engine = innodb
log_bin_trust_function_creators = 1

# Log
general_log=off
#general_log_file =  /data/mysql/logs/mysql.log
log-error = /data/mysql/logs/error.log

# binlog
log-bin = /data/mysql/binlog/mysql-binlog
binlog_format=mixed

#slowlog slow query log
slow_query_log = 1
slow_query_log_file = /data/mysql/logs/slow.log
long_query_time = 2
log_output = FILE
log_queries_not_using_indexes = 0

#global_buffers
innodb_buffer_pool_size = 2G
innodb_log_buffer_size = 16M
innodb_flush_log_at_trx_commit = 2
key_buffer_size = 64M

innodb_log_file_size = 2G
innodb_log_files_in_group = 2
innodb_data_file_path = ibdata1:20M:autoextend

sql_mode = STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION

[mysqldump]
quick
max_allowed_packet = 32M
EOF
ln -s /data/mysql/my.cnf /etc/my.cnf

2.4 initialize MySQL

cd /usr/local/mysql
bin/mysqld --initialize --user=mysql --datadir=/data/mysql/data/ --basedir=/usr/local/mysql

2.5 start MySQL

cp /usr/local/mysql/support-files/mysql.server /etc/init.d/mysqld
chmod +x /etc/init.d/mysqld
/etc/init.d/mysqld start

2.6 log in to MySQL

## View mysql initialization password
cat /data/mysql/logs/error.log |grep password|awk -F':' '{print $NF}'
5>G=3sJ6Cy2i

## Change Password
mysqladmin -uroot -p'5>G=3sJ6Cy2i' password 123456

## Login to mysql
mysql -uroot -p123456

2.7 Hive authorization

create database hive;
create user 'hive'@'%' identified by 'Passw0rd';
grant all privileges on hive.* to 'hive'@'%';
flush privileges;

2.8 fix the MySQL time zone

select now();
show variables like '%time_zone%';
set global time_zone='+8:00';
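Setting the variable this way does not survive a restart; to persist it, the default-time_zone line that is commented out in the my.cnf above can be enabled instead:

```ini
# /data/mysql/my.cnf, [mysqld] section
default-time_zone = '+8:00'
```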

3, Hive deployment

  • master node operation

3.1 download and extract

Download address: https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

tar -zxf apache-hive-3.1.2-bin.tar.gz -C /opt/hadoop/
ln -s /opt/hadoop/apache-hive-3.1.2-bin /usr/local/hive

Configure the environment variables by appending the following to /etc/profile

cat >> /etc/profile << 'EOF'
#HIVE
HIVE_HOME=/usr/local/hive
PATH=$HIVE_HOME/bin:$PATH
export PATH HIVE_HOME

EOF
source /etc/profile

3.2 modify configuration

cd $HIVE_HOME/conf

hive-log4j2.properties

cp hive-log4j2.properties.template hive-log4j2.properties

hive-env.sh

cat > hive-env.sh << 'EOF'
export HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf
export HIVE_AUX_JARS_PATH=/usr/local/hive/lib
EOF

3.3 synchronize the installation to the other nodes

rsync -av /opt/hadoop/apache-hive-3.1.2-bin root@sm:/opt/hadoop/
rsync -av /opt/hadoop/apache-hive-3.1.2-bin root@s1:/opt/hadoop/
rsync -av /opt/hadoop/apache-hive-3.1.2-bin root@s2:/opt/hadoop/

Then create the symbolic link on each of those nodes

ln -s /opt/hadoop/apache-hive-3.1.2-bin /usr/local/hive

3.4 metastore configuration

  • master and secondmaster node operations

Create directory

hdfs dfs -mkdir -p /data/hive/{warehouse,tmp,logs}
hdfs dfs -chmod -R 775 /data/hive/
cd $HIVE_HOME/conf

3.4.1 metastore-site.xml

cat > metastore-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
      <name>hive.metastore.local</name>
      <value>true</value>
    </property>
    <!--Hive Operational HDFS Root directory location -->
    <property>
      <name>hive.exec.scratchdir</name>
      <value>/data/hive/tmp</value>
    </property>
    <!--Hive Operational HDFS Create write permissions for the root directory -->
    <property>
      <name>hive.scratch.dir.permission</name>
      <value>775</value>
    </property>
    <!--hdfs upper hive Metadata storage location -->
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/data/hive/warehouse</value>
    </property>
    <!--Connection database address, name -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://slave3:3306/hive?createDatabaseIfNotExist=true</value>
    </property>
    <!--Connect database driver -->
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <!--Connection database user name -->
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <!--Connect database user password -->
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>Passw0rd</value>
    </property>
    <!-- appoint metastore Connection address -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://master:9083</value>
    </property>
</configuration>
EOF

The configuration on secondmaster is the same, except that hive.metastore.uris must be changed to thrift://secondmaster:9083

3.4.2 download the MySQL driver package

Driver page: https://mvnrepository.com/artifact/mysql/mysql-connector-java

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.26/mysql-connector-java-8.0.26.jar -P $HIVE_HOME/lib/

3.4.3 initialize the metastore schema

  • Just execute it once
$HIVE_HOME/bin/schematool -dbType mysql -initSchema

Possible error:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V

The guava versions in Hadoop and Hive conflict. The fix is to delete the older jar (from whichever of Hadoop or Hive carries it) and copy over the newer one, as follows:

ll $HIVE_HOME/lib/guava*; ll $HADOOP_HOME/share/hadoop/common/lib/guava*
-rw-r--r-- 1 root root 2308517 Sep 27  2018 /usr/local/hive/lib/guava-19.0.jar
-rw-r--r-- 1 1000 1000 2747878 Jan  3  2021 /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar

rm -f $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

3.4.4 start the metastore

mkdir -pv $HIVE_HOME/logs
nohup $HIVE_HOME/bin/hive --service metastore >> $HIVE_HOME/logs/metastore.log 2>&1 &

3.5 hiveserver2 configuration

  • Operate the same way on the slave1 and slave2 nodes

3.5.1 hiveserver2-site.xml

cat > hiveserver2-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://master:9083,thrift://secondmaster:9083</value>
    </property>     
    <!--hiveserver2 High availability-->
    <property>
      <name>hive.server2.support.dynamic.service.discovery</name>
      <value>true</value>
    </property>

    <property>
      <name>hive.server2.active.passive.ha.enable</name>
      <value>true</value>
    </property>
    
    <property>
      <name>hive.server2.zookeeper.namespace</name>
      <value>hiveserver2_zk</value>
    </property>

    <property>
      <name>hive.zookeeper.quorum</name>
      <value>slave1:2181,slave2:2181,slave3:2181</value>
    </property>
    
    <property>
      <name>hive.zookeeper.client.port</name>
      <value>2181</value>
    </property>
    
    <property>
      <name>hive.server2.thrift.port</name>
      <value>10001</value>
    </property>

    <!-- Fill in the hostname of the local node: slave1 or slave2 -->
    <property>
      <name>hive.server2.thrift.bind.host</name>
      <value>slave1</value>
    </property>
</configuration>
EOF

Note: modify hive.server2.thrift.bind.host to the hostname of the local machine
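On slave2 the property therefore reads:

```xml
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>slave2</value>
</property>
```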

3.5.2 modify the Hadoop configuration

Add the following configuration in core-site.xml

    <!-- Needed if connections to port 10001 are refused -->
    <property>     
      <name>hadoop.proxyuser.root.hosts</name>     
      <value>*</value>
    </property> 
    <property>     
      <name>hadoop.proxyuser.root.groups</name>    
      <value>*</value> 
    </property>

In hadoop.proxyuser.xxx.hosts and hadoop.proxyuser.xxx.groups, xxx is the user that starts HiveServer2.

If these are not set, an error is reported when HiveServer2 starts:

WARN [main] metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. getCurrentNotificationEventId
org.apache.thrift.TApplicationException: Internal error processing get_current_notificationEventId
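For example, if HiveServer2 were started by a different (here hypothetical) user named hive, the properties would become:

```xml
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>
```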

3.5.3 start hiveserver2

Start hive

mkdir -pv $HIVE_HOME/logs
nohup $HIVE_HOME/bin/hive --service hiveserver2 >> $HIVE_HOME/logs/hiveserver2.log 2>&1 &

3.5.4 connection test

$HIVE_HOME/bin/beeline -u jdbc:hive2://slave1:10001
$HIVE_HOME/bin/beeline -u jdbc:hive2://slave2:10001
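Since both instances register in ZooKeeper, a client can also connect through service discovery instead of naming a single host. The sketch below only builds and prints the JDBC URL; the namespace and quorum are taken from the hiveserver2-site.xml above:

```shell
# Build the HA JDBC URL for ZooKeeper service discovery (zooKeeperHA mode
# matches hive.server2.active.passive.ha.enable=true configured above).
ZK_QUORUM="slave1:2181,slave2:2181,slave3:2181"
JDBC_URL="jdbc:hive2://${ZK_QUORUM}/;serviceDiscoveryMode=zooKeeperHA;zooKeeperNamespace=hiveserver2_zk"
echo "${JDBC_URL}"
# To connect: $HIVE_HOME/bin/beeline -u "${JDBC_URL}"
```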

3.5.5 web UI

http://172.16.20.202:10002/

http://172.16.20.203:10002/

3.6 hive client configuration

hive-site.xml

cd $HIVE_HOME/conf
cat > hive-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://master:9083,thrift://secondmaster:9083</value>
    </property> 
    <!-- Displays the column name of the table -->
    <property>
      <name>hive.cli.print.header</name>
      <value>true</value>
    </property>
    <!-- Display database name -->
    <property>
      <name>hive.cli.print.current.db</name>
      <value>true</value>
    </property>
</configuration>
EOF

Fix the guava version mismatch in the same way as before

rm -f $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

Start client

$HIVE_HOME/bin/hive


Topics: Big Data Hadoop hive HA