Big data, Spark, Hadoop, Python
Big data Hadoop installation and configuration
1. Hadoop pseudo-distributed configuration
1. Create Hadoop user:
sudo useradd -m hadoop -s /bin/bash   # Create the hadoop user
sudo passwd hadoop                    # Set the password
sudo adduser hadoop sudo              # Grant administrator privileges
Log out and log back in as the hadoop user, then update apt and install vim:
sudo apt-get update        # Update apt
sudo apt-get install vim   # Install vim
Install SSH and configure passwordless login:
sudo apt-get install openssh-server   # Ubuntu installs the SSH client by default; only the SSH server needs to be installed
cd ~
mkdir .ssh                            # The directory may already exist; that is fine
cd ~/.ssh/
ssh-keygen -t rsa                     # Press Enter at every prompt
cat id_rsa.pub >> authorized_keys     # Add the key to the authorized list
Now entering ssh localhost logs you in without a password; without this key setup, you would have to enter the hadoop user's password each time after installing SSH.
Note: both cluster and single-node modes require SSH login (similar to a remote login: you log in to a Linux host and run commands on it). Ubuntu installs the SSH client by default; the SSH server needs to be installed separately.
2. Configure Java environment
Hadoop requires a Java environment. For JDK configuration, refer to this blog post: https://blog.csdn.net/Acegem/article/details/120852985?spm=1001.2014.3001.5502.
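In brief, a minimal sketch of the JDK setup in ~/.bashrc (the JDK path below is only an example, reusing the default-java path mentioned later in this post; substitute your actual installation directory):
# Append to ~/.bashrc; adjust the path to your JDK installation
export JAVA_HOME=/usr/lib/jvm/default-java
export PATH=$JAVA_HOME/bin:$PATH
Then reload the configuration and verify:
source ~/.bashrc   # Apply the change to the current shell
java -version      # Verify the JDK is found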
3. Install Hadoop
Hadoop can be downloaded from the official site https://dlcdn.apache.org/hadoop/common/. Here we download hadoop-3.1.3; the download link is https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz.
Alternatively, download from a domestic mirror such as http://mirrors.cnnic.cn/apache/hadoop/common/.
Unzip the downloaded hadoop-3.1.3 package into /usr/local/:
sudo tar -zxvf hadoop-3.1.3.tar.gz -C /usr/local   # Extract into /usr/local
cd /usr/local/
sudo mv ./hadoop-3.1.3/ ./hadoop                   # Rename the folder to hadoop
sudo chown -R hadoop ./hadoop                      # Give the hadoop user ownership
Test:
cd /usr/local/hadoop
./bin/hadoop version
4. Pseudo-distributed configuration
Hadoop's configuration files are located in /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying two configuration files: core-site.xml and hdfs-site.xml.
Modify the configuration file core-site.xml:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Modify the configuration file hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
5. Start Hadoop
cd /usr/local/hadoop
bin/hdfs namenode -format   # Format the NameNode
sbin/start-dfs.sh           # Start the daemons
jps                         # Check whether startup succeeded
If it starts successfully, the following processes will be listed: NameNode, DataNode and SecondaryNameNode
Notes:
Error 1: if this step reports "JAVA_HOME is not set and could not be found", the JAVA_HOME environment variable was not set properly beforehand.
If you have already set JAVA_HOME in ~/.bashrc following the earlier tutorial and the error still appears, go to the Hadoop installation directory and edit the configuration file /usr/local/hadoop/etc/hadoop/hadoop-env.sh: find the line "export JAVA_HOME=${JAVA_HOME}" and change it to the concrete path of your Java installation, for example "export JAVA_HOME=/usr/lib/jvm/default-java", then start Hadoop again.
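For reference, a minimal sketch of the resulting line in hadoop-env.sh (reusing the example path above; your actual JDK path may differ):
# In /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/default-java   # A concrete JDK path instead of ${JAVA_HOME}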
Error 2: if starting Hadoop produces many errors of the form "ssh: Could not resolve hostname xxx":
This is not an SSH problem; it can be solved by setting Hadoop environment variables. In ~/.bashrc, add the following two lines (the procedure is the same as for the JAVA_HOME variable; HADOOP_HOME is the Hadoop installation directory):
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
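After saving ~/.bashrc, the new variables can be applied to the current shell before restarting the daemons; a minimal check:
source ~/.bashrc    # Reload the configuration
echo $HADOOP_HOME   # Should print /usr/local/hadoop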
Test:
Run the grep example from the MapReduce examples jar:
bin/hdfs dfs -mkdir -p /user/hadoop        # Create the HDFS user directory
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put etc/hadoop/*.xml input   # Use the configuration files as input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
bin/hdfs dfs -cat output/*                 # View the output
You can also retrieve the running results locally:
rm -r ./output                        # Delete the local output folder first (if it exists)
./bin/hdfs dfs -get output ./output   # Copy the output folder from HDFS to the local machine
cat ./output/*
When Hadoop runs a program, the output directory must not already exist; otherwise it reports the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". To run the example again, first delete the output folder:
./bin/hdfs dfs -rm -r output # Delete output folder
2. Hadoop cluster configuration
Suppose there are two machines:
Master   192.168.1.121
Slave1   192.168.1.122
Hadoop cluster configuration process:
1. Select one machine as the Master.
2. Configure the network mapping (hostnames and IPs) on all hosts.
3. On the Master host, configure the hadoop user, install the SSH server, and install the Java environment.
4. On the Master host, install Hadoop and complete its configuration.
5. On the other hosts, configure the hadoop user, install the SSH server, and install the Java environment.
6. Copy the Hadoop directory from the Master host to the other hosts.
7. Start and use Hadoop.
Configure the hadoop user, install the SSH server, and install the Java environment on all hosts:
sudo useradd -m hadoop -s /bin/bash   # Create the hadoop user
sudo passwd hadoop                    # Set the password
sudo adduser hadoop sudo              # Grant administrator privileges
# Log out and log back in as the hadoop user
sudo apt-get update                   # Update apt
sudo apt-get install vim              # Install vim
sudo apt-get install openssh-server   # Install the SSH server
Then configure the Java JDK environment.
Configure network mapping for all hosts:
sudo vim /etc/hostname   # Set the host name
sudo vim /etc/hosts      # Set the hostname-to-IP mapping
sudo reboot              # Reboot so the network configuration takes effect
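For the two example machines above, /etc/hostname holds just the host name (Master on one machine, Slave1 on the other), and /etc/hosts on every host would contain mapping entries like the following (using the example IP addresses given earlier):
192.168.1.121   Master
192.168.1.122   Slave1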
On the Master host:
cd ~/.ssh
ssh-keygen -t rsa                                   # Press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys               # Authorize the key locally
scp ~/.ssh/id_rsa.pub hadoop@Slave1:/home/hadoop/   # Transfer the public key to Slave1
Then execute on the Slave1 node:
cd ~
mkdir .ssh
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
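Back on the Master host, the passwordless login can now be verified; it should not prompt for a password:
ssh Slave1   # Log in to Slave1 from the Master without a password
exit         # Return to the Master shell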
Configure the Hadoop cluster on the Master node (located in /usr/local/hadoop/etc/hadoop):
File workers (named slaves in Hadoop 2.x):
Delete the original localhost and list the host names of all Slave nodes, one per line.
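For the two-machine example above, the file would contain a single line:
Slave1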
File core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Abase for other temporary directories.</description>
</property>
File hdfs-site.xml:
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:50090</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
File mapred-site.xml (in Hadoop 2.x, first run cp mapred-site.xml.template mapred-site.xml; Hadoop 3.x already ships mapred-site.xml):
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
File yarn-site.xml:
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>Master</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
After the configuration is done, pack the Hadoop directory on the Master host and copy it to each node:
cd /usr/local
rm -r ./hadoop/tmp                        # Delete Hadoop temporary files
sudo tar -zcf ./hadoop.tar.gz ./hadoop    # Pack the Hadoop directory
scp ./hadoop.tar.gz Slave1:/home/hadoop   # Copy it to Slave1
Execute on Slave1:
sudo tar -zxf ~/hadoop.tar.gz -C /usr/local     # Extract into /usr/local
sudo chown -R hadoop:hadoop /usr/local/hadoop   # Give the hadoop user ownership
Finally, start Hadoop on the Master host:
cd /usr/local/hadoop/
bin/hdfs namenode -format   # Format the NameNode (only needed on first start)
sbin/start-dfs.sh
sbin/start-yarn.sh
jps                         # Check whether startup succeeded
If startup succeeds, the Master node runs the NameNode, SecondaryNameNode, and ResourceManager processes, and each Slave node runs the DataNode and NodeManager processes.
Run the MapReduce grep example on the Master host:
bin/hdfs dfs -mkdir -p /user/hadoop   # Create the HDFS user directory
bin/hdfs dfs -put etc/hadoop input    # Use the configuration files as input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
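As in the pseudo-distributed case, the results can be viewed on HDFS or copied back to the local file system:
bin/hdfs dfs -cat output/*          # View the output on HDFS
bin/hdfs dfs -get output ./output   # Or copy the output folder to the local machine
cat ./output/*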