Cluster construction of hadoop, spark, hive and azkaban under ubuntu

Posted by mattal999 on Sat, 02 Nov 2019 11:47:52 +0100


Initial preparation:

1. jdk installation

Do the following on all three machines (depending on the number of machines you have):

1) Install the JDK through apt-get (run whereis java afterwards to find the installation path), or download a JDK package manually and install it.

2) Add the following variables to the ~/.bashrc file or the /etc/profile file:

    export JAVA_HOME=your_path
    export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
    export PATH=${JAVA_HOME}/bin:$PATH

3) Run source ~/.bashrc or source /etc/profile (whichever file you modified) on the command line to make the variables take effect.

4) Finally, run java -version to check whether the Java installation succeeded.
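For reference, a typical sequence on Ubuntu might look like the following sketch (it assumes OpenJDK 8 installed from apt; adjust the package name and JAVA_HOME path to whichever JDK you actually use):

    sudo apt-get update
    sudo apt-get install -y openjdk-8-jdk
    whereis java                       # locate the installation path
    echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
    echo 'export PATH=${JAVA_HOME}/bin:$PATH' >> ~/.bashrc
    source ~/.bashrc
    java -version                      # verify the installation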

2. SSH password-free login

You also need to perform the following operations on all three machines: 1) switch to the home directory with cd ~; 2) run ssh-keygen in this directory. Then distribute the public key:

  • You can send the public key to the target machine with ssh-copy-id:

    ssh-copy-id -i ~/.ssh/id_rsa.pub -p 22 root@172.16.22.132

  • You can also send the public key to the target machine with scp, then append it to the authorized_keys file on that machine:

    scp id_rsa.pub root@172.16.22.132:/root/.ssh
    cd /root/.ssh/
    cat id_rsa.pub >> authorized_keys
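After distributing the keys, it is worth checking that password-free login really works; a quick test using the example address above is:

    ssh -p 22 root@172.16.22.132 hostname
    # should print the remote hostname without prompting for a password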

Part I: hadoop cluster building

Reference link: https://www.cnblogs.com/charles1ee/p/11239375.html

hadoop version: 2.7.7

This is a fully distributed version of hadoop, with one master node and two slave nodes

1. Modify the /etc/hostname file on the three machines to master, slave1 and slave2 respectively, then save and restart each machine.

2. Modify the /etc/hosts file on the master machine to map each hostname to its IP address. After the modification, copy the same entries to slave1 and slave2 (or add them to the existing file on each machine), as shown in the following figure:
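The original post shows the mapping as a screenshot; an illustrative /etc/hosts might look like the following (the IP addresses are placeholders for your own machines):

    172.16.22.131   master
    172.16.22.132   slave1
    172.16.22.133   slave2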

3. Create a new workspace directory under Documents, where all of the following components will be placed. Extract hadoop-2.7.7.tar.gz there, rename the extracted folder to hadoop, and open up its permissions, e.g. sudo chmod -R 777 hadoop.

4. Modify the configuration files on the master

  • Modify the hadoop-env.sh file: export JAVA_HOME=your_path

  • Modify the hdfs-site.xml file

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>2</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/data/hadoop_data/hdfs/namenode</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/data/hadoop_data/hdfs/datanode</value>
        </property>
        <!-- Set the secondary NameNode to the slave1 machine -->
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>slave1:50090</value>
        </property>
    </configuration>
  • Modify the core-site.xml file

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/data/hadoop_data/hadoop_tmp</value>
        </property>
    </configuration>
  • Modify mapred-site.xml file

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
  • Modify the yarn-site.xml file

    <configuration>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>master:8050</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>master:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>master:8025</value>
        </property>
        <!-- When running pyspark on Hadoop YARN, an error is reported if the following two parameters are not added -->
        <property>
            <name>yarn.nodemanager.pmem-check-enabled</name>
            <value>false</value>
        </property>
        <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
        </property>
        <!-- The following three parameters fix Azkaban scheduled tasks getting stuck in the Preparing state because the machine does not have enough memory -->
        <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>12240</value>
        </property>
        <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>100</value>
        </property>
        <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>12240</value>
        </property>
    </configuration>

5. Modify the configuration files on the slaves. The slave nodes use roughly the same configuration as the master, except for the mapred-site.xml file, which is configured as follows:

  • Modify mapred-site.xml file

    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>master:54311</value>
        </property>
    </configuration>

6. Modify the logging configuration. If executing hadoop commands reports a native-library WARNING, the solution is to edit log4j.properties with vim and add: log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR

7. After installation and configuration, return to the master node to format the namenode

    cd /data/hadoop_data/hdfs/namenode

    hadoop namenode -format

Then start and stop the cluster from the master node:

    start-all.sh    # start
    stop-all.sh     # stop
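To confirm that the daemons came up, you can run jps on each node; with the configuration above the output should look roughly like this:

    jps
    # on master:  NameNode, ResourceManager, Jps
    # on slave1:  DataNode, NodeManager, SecondaryNameNode, Jps
    # on slave2:  DataNode, NodeManager, Jps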

8. The following is the web interface (note: when entering the URL, replace hadoop-master-001 with the hostname master defined above)
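Assuming the default ports of Hadoop 2.7, the web interfaces should be reachable at addresses like these:

    http://master:50070    # HDFS NameNode UI
    http://master:8088     # YARN ResourceManager UI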

Part II: spark cluster construction

The Spark component can be obtained either by compiling the source code or by downloading it directly from the official website (http://spark.apache.org/downloads), where you can find the version you need.

The version used in this build is 2.1.0. Reference link: http://dblab.xmu.edu.cn/blog/1086-2/

Spark construction steps:

  1. Compile from source a Spark version that supports Hive:

    1. Enter the download directory and execute the command tar -zxvf spark-2.1.0.tar.gz to extract the spark package

    2. Enter the root directory of Spark and run ./dev/make-distribution.sh --name 2.7.3 --tgz -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.7 -Phadoop-provided -Phive -Phive-thriftserver -DskipTests. This command compiles the Spark source code and finally generates a tgz package, spark-2.1.0-bin-2.7.3.tgz.

    3. The parameters are explained as follows:

      • -DskipTests: do not execute the test cases, but still compile the test classes and generate the corresponding class files under target/test-classes.

      • -Dhadoop.version and -Phadoop-2.7: the version of Hadoop on the local machine. Without these parameters, the Hadoop version defaults to 1.0.4.

      • -Pyarn: whether to support Hadoop YARN. YARN is not supported without this parameter.

      • -Phive and -Phive-thriftserver: whether Hive is supported in Spark SQL. Hive is not supported without these parameters.

      • --with-tachyon: whether the in-memory file system Tachyon is supported. Without this parameter, Tachyon is not supported.

      • --tgz: generate spark-$VERSION-bin.tgz in the root directory. Without this parameter, no tgz file is generated, only the /dist directory.

      • --name: in combination with --tgz, generates a deployment package named spark-$VERSION-bin-$NAME.tgz. Without this parameter, NAME defaults to the Hadoop version number.

  2. Copy the spark-2.1.0-bin-2.7.3.tgz package to one of the machines and unpack it. Rename the unpacked folder to spark and enter its conf subdirectory, which holds all of Spark's configuration files. You need to modify three files here: slaves, hive-site.xml and spark-env.sh.

    1. Modify ~/.bashrc

        export SPARK_HOME=/home/benhairui/Documents/workspace/spark 
        export PATH=$PATH:$SPARK_HOME/bin

    2. Modify the slaves configuration file, adding the slave nodes at the end of the file; here they are slave1 and slave2:

        slave1 
        slave2
    3. Modify the spark-env.sh file

    # The IP address the master instance binds to, e.g. a public IP address
    export SPARK_MASTER_HOST=master
    # The port the master instance binds to (default is 7077)
    export SPARK_MASTER_PORT=7077
    # scala installation path
    export SCALA_HOME=/usr/share/scala-2.11
    # java installation path
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    # hadoop installation path
    export HADOOP_HOME=/home/benhairui/Documents/workspace/hadoop
    # hadoop configuration file path
    export HADOOP_CONF_DIR=/home/benhairui/Documents/workspace/hadoop/etc/hadoop
    # Official explanation of this parameter: Spark uses the Hadoop client libraries to access HDFS and YARN. The "hadoop free"
    # builds available since Spark 1.4 allow users to connect to a cluster running any Hadoop version. To use such a build, set
    # SPARK_DIST_CLASSPATH in spark-env.sh so that it includes the Hadoop jar packages.
    # https://spark.apache.org/docs/latest/hadoop-provided.html has a detailed explanation of the parameter
    export SPARK_DIST_CLASSPATH=$(/home/benhairui/Documents/workspace/hadoop/bin/hadoop classpath)
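Note that the configured spark directory also needs to exist at the same path on the slave nodes. One way to distribute it, assuming identical paths on every machine and the SSH setup from the preparation section, is:

    scp -r /home/benhairui/Documents/workspace/spark slave1:/home/benhairui/Documents/workspace/
    scp -r /home/benhairui/Documents/workspace/spark slave2:/home/benhairui/Documents/workspace/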

Spark test: after the installation and configuration above, you can simply verify whether the installation succeeded as follows:

1. Start the hadoop cluster through the command line, start-all.sh

2. Start the Spark cluster from the command line with Spark's own start-all.sh. After Spark is started, jps on the master node shows an additional "Master" process compared with before, as follows:

Each slave node shows an additional "Worker" process, as follows:

For further confirmation, you can connect to the Spark cluster through spark-shell and access HDFS data:

./spark-shell --master spark://master:7077
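As an additional illustrative check (not from the original post), you can pipe a small job into spark-shell and confirm that it runs on the cluster; the standalone master's web UI at http://master:8080 should also list both workers:

    echo 'sc.parallelize(1 to 1000).sum()' | ./spark-shell --master spark://master:7077
    # expect a result line like: res0: Double = 500500.0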

Part III: hive building (non-cluster, only configured on the master node of hadoop)

Reference: https://blog.csdn.net/dante_/article/details/72867493

The version of hive is 2.3.6

Because Hive on Spark interacts with the Spark cluster, some Spark-related configuration files also need to be modified, mainly hive-site.xml and spark-env.sh.

  • Copy the scala-library-2.11.8.jar, spark-network-common_2.11-2.1.1.jar and spark-core_2.11-2.1.1.jar packages from Spark to Hive's lib folder; download the MySQL JDBC connector package mysql-connector-java-5.1.38.jar and put it into Hive's lib folder as well.
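For reference, the copy commands might look roughly like this (a sketch only; the exact jar names depend on the Spark build you produced, and the download location of the MySQL connector is a placeholder):

    cd /home/benhairui/Documents/workspace/spark/jars
    cp scala-library-2.11.8.jar spark-network-common_2.11-2.1.1.jar spark-core_2.11-2.1.1.jar \
       /home/benhairui/Documents/workspace/hive/lib/
    cp ~/Downloads/mysql-connector-java-5.1.38.jar /home/benhairui/Documents/workspace/hive/lib/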

  • Modify hive-site.xml and copy it to the conf directory of spark

<configuration>
							<!--jdbc -->
						<property>
						    <name>javax.jdo.option.ConnectionURL</name>
						    <value>jdbc:mysql://192.168.5.111:3306/hive2?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
						</property>
						<property>
						    <name>javax.jdo.option.ConnectionDriverName</name>
						    <value>com.mysql.jdbc.Driver</value>
						</property>
						<property>
						    <name>javax.jdo.option.ConnectionUserName</name>
						    <value>benhairui</value>
						</property>
						<property>
						    <name>javax.jdo.option.ConnectionPassword</name>
						    <value>Mahairui123**</value>
						    <description>password to use against metastore database</description>
						</property>

						<!--spark engine -->
						<property>
						    <name>hive.execution.engine</name>
						    <value>spark</value>
						</property>
						<property>
						    <name>hive.enable.spark.execution.engine</name>
						    <value>true</value>
						</property>
						<!--sparkcontext -->
						<property>
						    <name>spark.master</name>
						    <value>yarn-cluster</value>
						</property>
						<property>
						    <name>spark.serializer</name>
						    <value>org.apache.spark.serializer.KryoSerializer</value>
						</property>
						<!--The following configuration is based on the actual situation -->
						<property>
						    <name>spark.executor.instances</name>
						    <value>3</value>
						</property>
						<property>
						    <name>spark.executor.cores</name>
						    <value>4</value>
						</property>
						<property>
						    <name>spark.executor.memory</name>
						    <value>10240m</value>
						</property>
						<property>
						    <name>spark.driver.cores</name>
						    <value>2</value>
						</property>
						<property>
						    <name>spark.driver.memory</name>
						    <value>4096m</value>
						</property>
						<property>
						    <name>spark.yarn.queue</name>
						    <value>default</value>
						</property>
						<property>
						    <name>spark.app.name</name>
						    <value>myInceptor</value>
						</property>

						<!--Transactional correlation -->
						<property>
						    <name>hive.support.concurrency</name>
						    <value>true</value>
						</property>
						<property>
						    <name>hive.enforce.bucketing</name>
						    <value>true</value>
						</property>
						<property>
						    <name>hive.exec.dynamic.partition.mode</name>
						    <value>nonstrict</value>
						</property>
						<property>
						    <name>hive.txn.manager</name>
						    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
						</property>
						<property>
						    <name>hive.compactor.initiator.on</name>
						    <value>true</value>
						</property>
						<property>
						    <name>hive.compactor.worker.threads</name>
						    <value>1</value>
						</property>
						<property>
						    <name>spark.executor.extraJavaOptions</name>
						    <value>-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
						    </value>
						</property>
						<!--Other -->
						<property>
						    <name>hive.server2.enable.doAs</name>
						    <value>false</value>
						</property>
						<!-- Specify the port exposed by the hive service; the default is 10000. Sometimes that does not take effect, so you can set it explicitly:
						<property>
						    <name>hive.server2.thrift.port</name>
						    <value>11000</value>
						</property>

						<property>
						     <name>hive.server2.thrift.bind.host</name>
						    <value>localhost</value>
						</property>
						-->				
						</configuration>
  • Modify hive-env.sh

    export HADOOP_HOME=/home/benhairui/Documents/workspace/hadoop 
    export HIVE_CONF_DIR=/home/benhairui/Documents/workspace/hive/conf 
    export HIVE_AUX_JARS_PATH=/home/benhairui/Documents/workspace/hive/lib
  • Modify ~/.bashrc

    export HIVE_HOME=/home/benhairui/Documents/workspace/hive 
    export PATH=$PATH:$HIVE_HOME/bin
  • Modify Spark's spark-env.sh and add the following configuration to it:

    export HIVE_CONF_DIR=/home/benhairui/Documents/workspace/hive/conf 
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/benhairui/Documents/workspace/hive/lib/mysql-connector-java-5.1.40-bin.jar

Initialize database

/opt/hive/bin/schematool -initSchema -dbType mysql

When the following message appears, the initialization is successful

    Starting metastore schema initialization to 2.0.0
    Initialization script hive-schema-2.0.0.mysql.sql
    Initialization script completed
    schemaTool completed

Hive test: start Hive by executing the command

    hiveserver2 (./hive --service hiveserver2)

Connect to HiveServer2 with beeline

    beeline -u jdbc:hive2://localhost:10000

After entering the command line, you can use Hive statements to create and query tables. Here are my query statements and results:
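As an illustration (the author's actual statements are only shown in the original screenshots), a beeline session might look like this; the table name is a placeholder:

    beeline -u jdbc:hive2://localhost:10000 -n benhairui
    0: jdbc:hive2://localhost:10000> CREATE TABLE IF NOT EXISTS demo (id INT, name STRING);
    0: jdbc:hive2://localhost:10000> INSERT INTO demo VALUES (1, 'hello');
    0: jdbc:hive2://localhost:10000> SELECT * FROM demo;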

Of course, you can also go to the mysql database where the meta information is located to check the meta store information.
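For example, a quick look at the metastore (assuming the database name hive2 from the JDBC URL above) could be:

    mysql -h 192.168.5.111 -u benhairui -p
    mysql> USE hive2;
    mysql> SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS;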

Part IV: azkaban building (non-cluster)

Reference: https://blog.csdn.net/weixin_/article/details/79327996

Azkaban is a task scheduling system; what is built here is a stand-alone version of Azkaban.

  1. First, clone the Azkaban source code from GitHub: git clone https://github.com/azkaban/azkaban.git

  2. Azkaban is managed by Gradle, so use Gradle to compile and package the source code. Enter the root directory and execute ./gradlew clean and then ./gradlew distTar. Afterwards you can find the release packages under build/distributions inside the various module folders in the root directory.

  3. Copy the tar packages under azkaban-exec-server and azkaban-web-server to the target folder, extract them, and rename them to executor and server respectively; also copy the SQL files under the ./azkaban-db/build/sql folder. Azkaban depends on a relational database, so find a MySQL instance and create the schema:

        mysql> create database azkaban;
        mysql> use azkaban;
        Database changed
        mysql> source /home/hadoop/azkaban-2.5.0/create-all-sql-2.5.0.sql;

  4. Create the SSL configuration with the command: keytool -keystore keystore -alias jetty -genkey -keyalg RSA

  5. After running this command, you will be prompted for a password and some information about the keystore being generated. Remember the password you enter. After entering and re-entering the keystore password, the prompts continue as follows:
    What is your first and last name?  
    [Unknown]:  
    What is the name of your organizational unit?  
    [Unknown]:  
    What is the name of your organization?  
    [Unknown]:  
    What is the name of your city or area?  
    [Unknown]:  
    What is the name of your state or province?  
    [Unknown]:  
    What is the two letter country code for the unit
    [Unknown]: CN 
    Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=CN correct?  
    [no]:y
    Enter key password for <jetty>
    	(RETURN if same as keystore password):
    After the above work is completed, the keystore certificate file is generated in the current directory. Move the keystore into the bin directory under the root of the azkaban web server, for example: cp keystore azkaban/server/bin
    Configuration files
    Note: first, configure the time zone on the server node:
    1. Use the interactive command tzselect to generate the Asia/Shanghai time zone configuration
    2. Copy the time zone file over the system's local time zone configuration (see the example below)
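    A possible command sequence for the time zone step on Ubuntu (a sketch; tzselect only walks you through selecting the zone, while the cp actually applies it system-wide):

        tzselect
        sudo cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime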

  6. Enter the executor/conf directory and modify azkaban.properties:

# Azkaban Personalization Settings
                    #Server name
                    azkaban.name=Test
                    azkaban.label=My Local Azkaban
                    azkaban.color=#FF3601
                    azkaban.default.servlet.path=/index
                    #Default root
                    web.resource.dir=web/
                    default.timezone.id=Asia/Shanghai
                    # Azkaban UserManager class
                    user.manager.class=azkaban.user.XmlUserManager
                    user.manager.xml.file=conf/azkaban-users.xml
                    # Loader for projects
                    executor.global.properties=conf/global.properties
                    azkaban.project.dir=projects
                    # Velocity dev mode
                    velocity.dev.mode=false
                    # Azkaban Jetty server properties.
                    jetty.use.ssl=true
                    jetty.maxThreads=25
                    jetty.port=8081
                    # Where the Azkaban web server is located
                    azkaban.webserver.url=http://master:8081
                    # mail settings
                    mail.sender=
                    mail.host=
                    # User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
                    # enduser -> myazkabanhost:443 -> proxy -> localhost:8081
                    # when this parameters set then these parameters are used to generate email links.
                    # if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
                    # azkaban.webserver.external_hostname=myazkabanhost.com
                    # azkaban.webserver.external_ssl_port=443
                    # azkaban.webserver.external_port=8081
                    job.failure.email=
                    job.success.email=
                    lockdown.create.projects=false
                    cache.directory=cache
                    # JMX stats
                    jetty.connector.stats=true
                    executor.connector.stats=true
                    # Azkaban plugin settings
                    azkaban.jobtype.plugin.dir=plugins/jobtypes
                    # Azkaban mysql settings by default. Users should configure their own username and password.
                    database.type=mysql
                    mysql.port=3306
                    mysql.host=192.168.5.111
                    mysql.database=azkaban
                    mysql.user=benhairui
                    mysql.password=Mahairui123**
                    mysql.numconnections=100
                    # Azkaban Executor settings
                    executor.maxThreads=50
                    executor.flow.threads=30
                    executor.port=12321

7. Enter the server/conf directory and modify azkaban.properties:

# Azkaban Personalization Settings
					azkaban.name=Test
					azkaban.label=My Local Azkaban
					azkaban.color=#FF3601
					azkaban.default.servlet.path=/index
					web.resource.dir=web/
					default.timezone.id=Asia/Shanghai
					# Azkaban UserManager class
					user.manager.class=azkaban.user.XmlUserManager
					user.manager.xml.file=conf/azkaban-users.xml
					# Loader for projects
					executor.global.properties=conf/global.properties
					azkaban.project.dir=projects
					# Velocity dev mode
					velocity.dev.mode=false
					# Azkaban Jetty server properties.
					jetty.use.ssl=true
					jetty.maxThreads=25
					jetty.port=8081
					jetty.ssl.port=8443
					jetty.keystore=keystore
					jetty.password=mahairui
					jetty.keypassword=mahairui
					jetty.truststore=keystore
					jetty.trustpassword=mahairui
					# Azkaban Executor settings
					executor.port=12321
					# mail settings
					mail.sender=
					mail.host=
					# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
					# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
					# when this parameters set then these parameters are used to generate email links.
					# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
					# azkaban.webserver.external_hostname=myazkabanhost.com
					# azkaban.webserver.external_ssl_port=443
					# azkaban.webserver.external_port=8081
					job.failure.email=
					job.success.email=
					lockdown.create.projects=false
					cache.directory=cache
					# JMX stats
					jetty.connector.stats=true
					executor.connector.stats=true
					# Azkaban mysql settings by default. Users should configure their own username and password.
					database.type=mysql
					mysql.port=3306
					mysql.host=192.168.5.111
					mysql.database=azkaban
					mysql.user=benhairui
					mysql.password=Mahairui123**
					mysql.numconnections=100
					#Multiple Executor
					azkaban.use.multiple.executors=true
					#azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
					azkaban.executorselector.filters=StaticRemainingFlowSize,CpuStatus
					azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
					azkaban.executorselector.comparator.Memory=1
					azkaban.executorselector.comparator.LastDispatched=1
					azkaban.executorselector.comparator.CpuUsage=1

Note: in the azkaban.executorselector.filters parameter, the MinimumFreeMemory filter checks whether the executor host has more than 6 GB of free memory; if it has less, the web server will not dispatch tasks to that host, so the filter can be removed (as done above).

8. User configuration: modify the azkaban-users.xml file under the web/conf directory. You can add users here, or leave it unchanged.

<azkaban-users>
						<user username="azkaban" password="azkaban" roles="admin" groups="azkaban" />
						<user username="metrics" password="metrics" roles="metrics"/>
						<!-- Add the line below -->
						<user username="admin" password="admin" roles="admin,metrics" />
						<role name="admin" permissions="ADMIN" />
						<role name="metrics" permissions="METRICS"/>
					</azkaban-users>

9. Startup test

Execute under web

    ./bin/start-web.sh

Execute under executor

    ./bin/start-exec.sh

After startup, enter https://<server address>:8443 (here the server is master) in the browser to access the UI, as shown below:

Then enter the username and password from azkaban-users.xml above to log in. After entering, you can upload your own scheduling tasks, as shown in the following figure:

Finally, click "Execute Flow" to run your own dependent tasks. You can also see the corresponding DAG chart and, after completion, the logs, as shown in the following figure:

Note: during task scheduling, you may see the error "executor became inactive before setting up the flow 11". This is a known issue; the executor needs to be activated manually by executing the following command on the command line:

curl http://${executorHost}:${executorPort}/executor?action=activate

End

So far, the construction of the related components is complete. If you have any questions, you can leave a message directly. Thank you very much for reading!

Topics: Big Data Spark Hadoop hive MySQL