Hadoop environment installation

Posted by phelpsa on Sun, 26 Dec 2021 10:23:42 +0100

Hadoop distributed environment

0. Preliminary preparation

Create normal user

# Create fzk user
useradd fzk
# Modify fzk user's password
passwd fzk
# Give the fzk user root privileges so that sudo can be used later; add the following line under the %wheel entry in /etc/sudoers
fzk     ALL=(ALL)       NOPASSWD:ALL



Passwordless ssh login

# Generate public and private keys
ssh-keygen -t rsa

# Copy the public key to the target machine for passwordless login
ssh-copy-id hadoop152
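
In a three-node cluster every node needs passwordless access to the other nodes (and to itself) for the start scripts to work. A minimal sketch, assuming the key pair has already been generated on the node you run it from:

    # Copy the local public key to all three hosts, including this one
    for host in hadoop151 hadoop152 hadoop153
    do
        ssh-copy-id "$host"
    done

Run ssh-keygen and this loop on each of the three machines (and for the root user as well, if the daemons are run as root).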



1. Building the basic distributed environment

Cluster deployment planning

             hadoop151               hadoop152                        hadoop153
    HDFS     NameNode, DataNode      DataNode                         SecondaryNameNode, DataNode
    YARN     NodeManager             ResourceManager, NodeManager     NodeManager



Environment setup steps

  • Step 1: prepare 3 machines (firewall off, static IP, hostnames set)

    192.168.37.151 hadoop151
    192.168.37.152 hadoop152
    192.168.37.153 hadoop153

  • Step 2: install the JDK, install Hadoop, and configure the environment variables (a sketch of the environment variables appears after this list)

  • Step 3: modify the configuration file

    • hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, workers, bin/hdfs

    • etc/hadoop/hadoop-env.sh

      export JAVA_HOME=/opt/software/jdk/jdk1.8.0_281
      export HDFS_NAMENODE_USER=root
      export HDFS_DATANODE_USER=root
      export HDFS_SECONDARYNAMENODE_USER=root
      export YARN_RESOURCEMANAGER_USER=root
      export YARN_NODEMANAGER_USER=root
      export HADOOP_SECURE_DN_USER=root
      
    • etc/hadoop/core-site.xml

      <configuration>
          <!-- Specify the NameNode address -->
          <property>
              <name>fs.defaultFS</name>
              <value>hdfs://hadoop151:8020</value>
          </property>
          <!-- Specify the Hadoop data storage directory -->
          <property>
              <name>hadoop.tmp.dir</name>
              <value>/opt/software/hadoop/data</value>
          </property>
          <!-- Configure the static user for HDFS web UI login as root -->
          <property>
              <name>hadoop.http.staticuser.user</name>
              <value>root</value>
          </property>
      </configuration>
      
    • etc/hadoop/hdfs-site.xml

      <configuration>
          <!-- NameNode web UI address -->
          <property>
              <name>dfs.namenode.http-address</name>
              <value>hadoop151:9870</value>
          </property>
          <!-- SecondaryNameNode web UI address -->
          <property>
              <name>dfs.namenode.secondary.http-address</name>
              <value>hadoop153:9868</value>
          </property>
      </configuration>
      
    • etc/hadoop/yarn-site.xml

      <configuration>
          <!-- Specify the shuffle service for MapReduce -->
          <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
          </property>
          <!-- Specify the ResourceManager address -->
          <property>
              <name>yarn.resourcemanager.hostname</name>
              <value>hadoop152</value>
          </property>
          <!-- Inheritance of environment variables -->
          <property>
              <name>yarn.nodemanager.env-whitelist</name>
              <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
          </property>
      </configuration>
      
    • etc/hadoop/mapred-site.xml

      <configuration>
          <!-- Run MapReduce programs on YARN -->
          <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
      </configuration>
      
    • etc/hadoop/workers

      hadoop151
      hadoop152
      hadoop153
      
    • bin/hdfs

      # Change HADOOP_SHELL_EXECNAME="hdfs" to:
      HADOOP_SHELL_EXECNAME="root"
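
As mentioned in Step 2 above, the environment variables can be set in a profile script. A minimal sketch, reusing the JDK path from hadoop-env.sh and the Hadoop install path used later in the HA section of this article; the file /etc/profile.d/hadoop_env.sh is just an assumed location (adjust paths to your installation):

    # /etc/profile.d/hadoop_env.sh -- assumed location; re-login or `source` it afterwards
    export JAVA_HOME=/opt/software/jdk/jdk1.8.0_281
    export HADOOP_HOME=/opt/software/hadoop/hadoop-3.1.4
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    # Quick check that both are picked up
    java -version
    hadoop version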
      



Start cluster

  • Step 1: if the cluster is started for the first time, format the NameNode on the hadoop151 node
    • Command: hdfs namenode -format
    • Note: formatting the NameNode generates a new cluster ID. If it no longer matches the cluster ID recorded by the DataNodes, the cluster cannot find its previous data. If the cluster fails at runtime and the NameNode must be reformatted, be sure to stop the NameNode and DataNode processes and delete the data and logs directories on every machine before formatting (a cleanup sketch follows this list).
  • Step 2: start HDFS on the machine where the NameNode is configured
    • Command: sbin/start-dfs.sh
  • Step 3: start YARN on the machine where the ResourceManager is configured
    • Command: sbin/start-yarn.sh
  • Step 4: view the NameNode of HDFS on the Web side
    • http://192.168.37.151:9870
  • Step 5: view YARN's ResourceManager on the Web side
    • http://192.168.37.152:8088
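
A minimal sketch of the cleanup-and-reformat procedure mentioned in the note above, run from hadoop151; it assumes the data directory configured in core-site.xml (/opt/software/hadoop/data) and assumes the logs live in the default logs directory under the Hadoop install path:

    # 1. Stop all running daemons first
    ssh hadoop152 "stop-yarn.sh"          # ResourceManager host
    ssh hadoop151 "stop-dfs.sh"           # NameNode host

    # 2. On every node, remove the old data and logs so the new cluster ID cannot conflict
    for host in hadoop151 hadoop152 hadoop153
    do
        ssh "$host" "rm -rf /opt/software/hadoop/data /opt/software/hadoop/hadoop-3.1.4/logs"
    done

    # 3. Reformat the NameNode on hadoop151 only
    ssh hadoop151 "hdfs namenode -format"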





2. Configure the history server and log aggregation

Environment setup

  • Step 1: build a distributed basic environment

  • Step 2: add the following configuration to etc/hadoop/mapred-site.xml and copy the changed file to the other two hosts (an scp sketch follows this list)

        <!-- History server address -->
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>hadoop151:10020</value>
        </property>
        <!-- History server web UI address -->
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>hadoop151:19888</value>
        </property>
    
  • Step 3: add the following configuration to etc/hadoop/yarn-site.xml and copy the changed file to the other two hosts

        <!-- Enable log aggregation -->
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <!-- Set log aggregation server address -->
        <property>
            <name>yarn.log.server.url</name>
            <value>http://hadoop151:19888/jobhistory/logs</value>
        </property>
        <!-- Set the log retention time to 7 days -->
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
        </property>
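
As mentioned in Steps 2 and 3, the changed files have to reach the other two hosts. A minimal scp sketch run from hadoop151, assuming Hadoop is installed at the same path on all three machines and that HADOOP_HOME is set as in the environment-variable sketch earlier:

    # Push the modified configuration files to the other two hosts
    for host in hadoop152 hadoop153
    do
        scp "$HADOOP_HOME"/etc/hadoop/mapred-site.xml \
            "$HADOOP_HOME"/etc/hadoop/yarn-site.xml \
            "$host":"$HADOOP_HOME"/etc/hadoop/
    done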
    



Startup

  • Start on the machine with the JobHistory node configured
    • mapred --daemon start historyserver
  • Viewing JobHistory on the Web
    • http://192.168.37.151:19888/jobhistory
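
To verify the history server and log aggregation end to end, run any MapReduce job and then open it in the JobHistory UI. A minimal sketch, assuming the example jar that ships with Hadoop, HADOOP_HOME set as earlier, and hypothetical /input and /output HDFS paths:

    # Upload some input and run the bundled wordcount example
    hadoop fs -mkdir -p /input
    hadoop fs -put "$HADOOP_HOME"/etc/hadoop/*.xml /input
    hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /input /output

    # The finished job and its aggregated logs should then appear at
    # http://hadoop151:19888/jobhistory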




3. Summary of cluster start/stop methods

Start/stop each module separately (passwordless ssh must be configured)

  • Overall start / stop HDFS
    • start-dfs.sh
    • stop-dfs.sh
  • Overall start / stop of YARN
    • start-yarn.sh
    • stop-yarn.sh



Start/stop each service component one by one

  • Start / stop HDFS components separately
    • hdfs --daemon start/stop namenode/datanode/secondarynamenode
  • Start / stop YARN
    • yarn --daemon start/stop resourcemanager/nodemanager
  • Start / stop JobHistory
    • mapred --daemon start/stop historyserver




4. Common scripts for the Hadoop cluster

Hadoop cluster startup and shutdown script

  • Start/stop HDFS, YARN, and the history server: myhadoop.sh

    #!/bin/bash
    if [ $# -lt 1 ]
    then
        echo "No Args Input..."
        exit ;
    fi
    
    case $1 in
    "start")
        echo " =================== start-up hadoop colony ==================="
        echo " --------------- start-up hdfs ---------------"
        ssh hadoop151 "start-dfs.sh"
        echo " --------------- start-up yarn ---------------"
        ssh hadoop152 "start-yarn.sh"
        echo " --------------- start-up historyserver ---------------"
        ssh hadoop151 "mapred --daemon start historyserver"
        ;;
    "stop")
        echo " =================== close hadoop colony ==================="
        echo " --------------- close historyserver ---------------"
        ssh hadoop151 "mapred --daemon stop historyserver"
        echo " --------------- close yarn ---------------"
        ssh hadoop152 "stop-yarn.sh"
        echo " --------------- close hdfs ---------------"
        ssh hadoop151 "stop-dfs.sh"
        ;;
    *)
    	echo "Input Args Error..."
    	;;
    esac
    



Script to view the Java processes on the three servers

  • jps: jpsall.sh

    #!/bin/bash
    for host in hadoop151 hadoop152 hadoop153
    do
        echo =============== $host ===============
        ssh $host jps 
    done
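
To call these scripts from anywhere, make them executable and put them on the PATH. A small usage sketch, assuming they are saved under ~/bin (any directory on the PATH works):

    chmod +x ~/bin/myhadoop.sh ~/bin/jpsall.sh

    # Start the whole cluster, check the Java processes on all three nodes, then stop it
    myhadoop.sh start
    jpsall.sh
    myhadoop.sh stop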
    




5. Common port numbers

    Port name                               Hadoop 2.x     Hadoop 3.x
    NameNode internal communication port    8020 / 9000    8020 / 9000 / 9820
    NameNode HTTP UI                        50070          9870
    MapReduce task execution view port      8088           8088
    History server communication port       19888          19888





6. High availability environment (HA)

Cluster planning

    hadoop151          hadoop152          hadoop153
    NameNode           NameNode           NameNode
    JournalNode        JournalNode        JournalNode
    DataNode           DataNode           DataNode
    ZK                 ZK                 ZK
    ResourceManager    ResourceManager
    NodeManager        NodeManager        NodeManager



Configure HDFS-HA cluster

  • Step 1: core-site.xml configuration

        <!-- Combine the NameNode addresses into a cluster named mycluster -->
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://mycluster</value>
        </property>
        <!-- Specify the storage directory for files Hadoop generates at runtime -->
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/opt/software/hadoop/hadoop-3.1.4/data/tmp</value>
        </property>
        <!-- Declare the JournalNode server storage directory -->
        <property>
            <name>dfs.journalnode.edits.dir</name>
            <value>/opt/software/hadoop/hadoop-3.1.4/data/tmp/journalnode</value>
        </property>
    
  • Step 2: hdfs-site.xml configuration

        <!-- Fully distributed cluster name -->
        <property>
            <name>dfs.nameservices</name>
            <value>mycluster</value>
        </property>
        <!-- The NameNode nodes in the cluster -->
        <property>
            <name>dfs.ha.namenodes.mycluster</name>
            <value>nn1,nn2,nn3</value>
        </property>
        <!-- RPC addresses of nn1, nn2 and nn3 -->
        <property>
            <name>dfs.namenode.rpc-address.mycluster.nn1</name>
            <value>hadoop151:8020</value>
        </property>
        <property>
            <name>dfs.namenode.rpc-address.mycluster.nn2</name>
            <value>hadoop152:8020</value>
        </property>
        <property>
            <name>dfs.namenode.rpc-address.mycluster.nn3</name>
            <value>hadoop153:8020</value>
        </property>
        <!-- HTTP addresses of nn1, nn2 and nn3 -->
        <property>
            <name>dfs.namenode.http-address.mycluster.nn1</name>
            <value>hadoop151:9870</value>
        </property>
        <property>
            <name>dfs.namenode.http-address.mycluster.nn2</name>
            <value>hadoop152:9870</value>
        </property>
        <property>
            <name>dfs.namenode.http-address.mycluster.nn3</name>
            <value>hadoop153:9870</value>
        </property>
        <!-- Specify where the NameNode metadata is stored on the JournalNodes -->
        <property>
            <name>dfs.namenode.shared.edits.dir</name>
            <value>qjournal://hadoop151:8485;hadoop152:8485;hadoop153:8485/mycluster</value>
        </property>
        <!-- Access proxy class: used by clients to find the Active NameNode and switch over automatically on failure -->
        <property>
            <name>dfs.client.failover.proxy.provider.mycluster</name>
            <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
        </property>
        <!-- Disable permission checking (do not simply disable it in production; configure it properly, see the official documentation) -->
        <property>
            <name>dfs.permissions.enabled</name>
            <value>false</value>
        </property>
        <!-- Configure the fencing mechanism so that only one NameNode can respond at a time -->
        <property>
            <name>dfs.ha.fencing.methods</name>
            <value>sshfence</value>
        </property>
        <!-- Passwordless ssh login is required when using the sshfence mechanism -->
        <property>
            <name>dfs.ha.fencing.ssh.private-key-files</name>
            <value>/root/.ssh/id_rsa</value>
        </property>
    
  • Step 3: distribute the core-site.xml and hdfs-site.xml files to the other machines with scp

  • Step 4: start the JournalNode on all nodes

    • hdfs --daemon start journalnode
  • Step 5: format the NameNode on one of the nodes (for example nn1 on hadoop151)

    • hdfs namenode -format
  • Step 6: on each of the other (unformatted) NameNodes, synchronize the metadata from the formatted one

    • hdfs namenode -bootstrapStandby
  • Step 7: start the NameNode and DataNode on all nodes

    • hdfs --daemon start namenode
    • hdfs --daemon start datanode
  • Step 8: set one of the NameNodes to Active state (for example nn1); the whole sequence is collected into the sketch after this list

    • hdfs haadmin -transitionToActive nn1
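
The whole manual bring-up, collected into one sketch; the comments note which host each command runs on, and nn1 (hadoop151) is the node chosen for formatting:

    # On every node: start a JournalNode first, so the format can write shared edits
    hdfs --daemon start journalnode

    # On hadoop151 (nn1) only: format and start the first NameNode
    hdfs namenode -format
    hdfs --daemon start namenode

    # On hadoop152 and hadoop153: sync metadata from nn1, then start their NameNodes
    hdfs namenode -bootstrapStandby
    hdfs --daemon start namenode

    # On every node: start the DataNode
    hdfs --daemon start datanode

    # On any node: make nn1 the active NameNode
    hdfs haadmin -transitionToActive nn1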



Configuring Zookeeper clusters

Deploy Zookeeper on the hadoop151, hadoop152 and hadoop153 nodes

  • Step 1: install jdk

  • Step 2: copy the Zookeeper archive to the machine and extract it

  • Step 3: create zkData directory under Zookeeper installation directory to store data

  • Step 4: create a myid file in the zkData directory to identify each Zookeeper server

    • Write 1, 2 and 3 into the myid files on hadoop151, hadoop152 and hadoop153 respectively
  • Step 5: in the conf directory under the Zookeeper installation path, copy zoo_sample.cfg to zoo.cfg and modify it as follows

    # Modify data storage path configuration
    dataDir=<Zookeeper installation path>/zkData
    
    # Add Zookeeper cluster configuration
    # Format: server.A=B:C:D
    # A is a number: the server id, matching the number in that server's myid file
    # B is the hostname or IP address of the server
    # C is the port this server uses to exchange information with the cluster Leader
    # D is the port used to elect a new Leader when the current Leader goes down
    server.1=hadoop151:2888:3888
    server.2=hadoop152:2888:3888
    server.3=hadoop153:2888:3888
    
  • Step 6: distribute the entire Zookeeper installation directory to the hadoop152 and hadoop153 machines

  • Step 7: after the distribution, make sure the myid files on hadoop151, hadoop152 and hadoop153 contain 1, 2 and 3 respectively

  • Step 8: start Zookeeper on each of the three machines (a scripted version is sketched after this list)

    • bin/zkServer.sh start
  • Step 9: check the status; one machine should report Mode: leader and the other two Mode: follower

    • bin/zkServer.sh status
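
Once passwordless ssh is in place, Steps 7 to 9 can be driven from hadoop151. A minimal sketch, assuming Zookeeper is unpacked at the same path on all three machines (ZK_HOME below is a placeholder for that path):

    ZK_HOME=/opt/software/zookeeper    # assumed install path, adjust to yours

    id=1
    for host in hadoop151 hadoop152 hadoop153
    do
        # Write the matching myid (1, 2, 3) and start Zookeeper on each host
        ssh "$host" "echo $id > $ZK_HOME/zkData/myid && $ZK_HOME/bin/zkServer.sh start"
        id=$((id + 1))
    done

    # One node should report Mode: leader, the other two Mode: follower
    for host in hadoop151 hadoop152 hadoop153
    do
        ssh "$host" "$ZK_HOME/bin/zkServer.sh status"
    done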



Configure HDFS-HA automatic failover

  • Step 1: add the following configuration to hdfs-site.xml

        <!-- Turn on automatic failover -->
        <property>
            <name>dfs.ha.automatic-failover.enabled</name>
            <value>true</value>
        </property>
    
  • Step 2: add the following configuration to core-site.xml

        <!-- Specify the Zookeeper quorum used for automatic failover -->
        <property>
            <name>ha.zookeeper.quorum</name>
            <value>hadoop151:2181,hadoop152:2181,hadoop153:2181</value>
        </property>
    
  • Step 3: hadoop-env.sh configuration

    export JAVA_HOME=/opt/software/jdk/jdk1.8.0_281
    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root
    export HADOOP_SECURE_DN_USER=root
    export HDFS_ZKFC_USER=root
    export HDFS_JOURNALNODE_USER=root
    
  • Step 4: distribute the core-site.xml, hdfs-site.xml and hadoop-env.sh files to the other machines with scp

  • Step 5: start Zookeeper on three machines respectively

    • bin/zkServer.sh start
  • Step 6: initialize ZKFC

    • hdfs zkfc -formatZK
  • Step 7: start the cluster

    • start-dfs.sh
  • Step 8: run jps and check that all the processes are started; each node should show the six processes below (a failover check is sketched after this list)

    Jps
    QuorumPeerMain
    NameNode
    DataNode
    JournalNode
    DFSZKFailoverController
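
A quick way to confirm that automatic failover actually works (referenced in Step 8): query each NameNode's HA state, kill the active one, and check that another NameNode becomes active within a few seconds. The nn1/nn2/nn3 ids are the ones configured in hdfs-site.xml above.

    # Check the current state of each NameNode
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2
    hdfs haadmin -getServiceState nn3

    # On the host of the active NameNode (e.g. hadoop151), kill its process
    kill -9 "$(jps | awk '$2 == "NameNode" {print $1}')"

    # One of the remaining NameNodes should now report "active"
    hdfs haadmin -getServiceState nn2
    hdfs haadmin -getServiceState nn3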
    



Configure YARN-HA cluster

  • Step 1: yarn-site.xml configuration

        <!-- Specify the shuffle service for MapReduce -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <!-- Enable ResourceManager HA -->
        <property>
            <name>yarn.resourcemanager.ha.enabled</name>
            <value>true</value>
        </property>
        <!-- Declare the addresses of the two ResourceManagers -->
        <property>
            <name>yarn.resourcemanager.cluster-id</name>
            <value>cluster1</value>
        </property>
        <property>
            <name>yarn.resourcemanager.ha.rm-ids</name>
            <value>rm1,rm2</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname.rm1</name>
            <value>hadoop151</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname.rm2</name>
            <value>hadoop152</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address.rm1</name>
            <value>hadoop151:8088</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address.rm2</name>
            <value>hadoop152:8088</value>
        </property>
        <!-- Specify the address of the Zookeeper cluster -->
        <property>
            <name>hadoop.zk.address</name>
            <value>hadoop151:2181,hadoop152:2181,hadoop153:2181</value>
        </property>
        <!--Enable automatic recovery--> 
        <property>
            <name>yarn.resourcemanager.recovery.enabled</name>
            <value>true</value>
        </property>
        <!-- Inheritance of environment variables -->
        <property>
            <name>yarn.nodemanager.env-whitelist</name>
            <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>
        <!-- Enable log aggregation -->
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <!-- Set the log retention time to 7 days -->
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
        </property>
    
  • Step 2: distribute the yarn-site.xml file to the other machines with scp

  • Step 3: start YARN (provided Zookeeper and HDFS are already running); a quick state check is sketched after this list

    • start-yarn.sh
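
After start-yarn.sh, both ResourceManagers should be running, one active and one standby. A minimal check using the rm1/rm2 ids configured above:

    # Query the HA state of each ResourceManager
    yarn rmadmin -getServiceState rm1
    yarn rmadmin -getServiceState rm2

    # Either web UI works; the standby one redirects to the active ResourceManager:
    # http://hadoop151:8088  and  http://hadoop152:8088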



Start and stop the cluster after installation

Startup

  • Step 1: start Zookeeper on three machines respectively
    • bin/zkServer.sh start
  • Step 2: start NameNode, DataNode, JournalNode, ZKFC, etc. from any one machine
    • start-all.sh

Shutdown

  • Step 1: stop NameNode, DataNode, JournalNode, ZKFC, etc. from any one machine
    • stop-all.sh
  • Step 2: stop Zookeeper on each of the three machines
    • bin/zkServer.sh stop

Topics: Big Data Hadoop