1 Environment
This article uses the HDP release HDP-3.1.5.0-centos7-rpm.tar.gz as an example to walk through the installation and configuration of Hive 3 in detail. The OS can be CentOS 7 or CentOS 8 (the tarball is built for CentOS 7, but the RPMs used here also work on CentOS 8).
Before installation, manually install and configure the following (a sketch of these steps follows this list):
- JDK 8
- SSH (passwordless login between nodes)
- firewall (disabled, or with the required ports opened)
- hostname / network (on CentOS 8, configure with nmcli)
- SELinux (disabled)
- NTP (on CentOS 8, install chrony)
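A minimal sketch of these preparatory steps on CentOS 7/8; the hostname, network interface, and addresses (node01, ens33, 192.168.1.x) are placeholders for your own environment:

# disable the firewall and SELinux (SELinux change is permanent after reboot)
systemctl disable --now firewalld
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
setenforce 0

# hostname and static addressing (CentOS 8: nmcli)
hostnamectl set-hostname node01
nmcli connection modify ens33 ipv4.method manual ipv4.addresses 192.168.1.101/24 ipv4.gateway 192.168.1.1 ipv4.dns 192.168.1.1
nmcli connection up ens33

# passwordless SSH between nodes
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
ssh-copy-id root@node01

# time synchronization (CentOS 8 uses chrony; see section 2)
yum -y install chrony
systemctl enable --now chronyd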
Component builds and versions used in this installation:
- Hadoop | 3.1.1.3.1.5.0-152
- ZooKeeper | 3.4.6.3.1.5.0-152
- Hive | 3.1.0.3.1.5.0-152
- Tez | 0.9.1.3.1.5.0-152
- Spark | 2.3.2.3.1.5.0-152
2 Installation
# 0 unpack the HDP package
tar -zxf HDP-3.1.5.0-centos7-rpm.tar.gz

# 1 dependencies (optional)
cd HDP/centos7/3.1.5.0-152/bigtop-tomcat
rpm -ivh bigtop-tomcat-7.0.94-1.noarch.rpm
cd ../bigtop-jsvc
rpm -ivh bigtop-jsvc-1.0.15-152.x86_64.rpm
cd ../hdp-select
rpm -ivh hdp-select-3.1.5.0-152.el7.noarch.rpm
yum install -y redhat-lsb

# 2 time synchronization
yum -y install chrony
## add "server ntp1.aliyun.com iburst"
vim /etc/chrony.conf
## check
chronyc sourcestats -v
timedatectl

# 3 ZooKeeper (HA and some components, e.g. Hive, depend on ZK)
cd ../zookeeper
rpm -ivh *.rpm
ln -s /usr/hdp/3.1.5.0-152 /usr/hdp/current

# 4 Hadoop
cd ../hadoop
rpm -ivh *.rpm --nodeps

# 5 Tez (Hive execution engine)
cd ../tez
rpm -ivh *.rpm

# 6 Hive
cd ../hive
rpm -ivh *.rpm --nodeps

# 7 Spark 2 (optional)
## required only if spark2_shuffle is added to yarn.nodemanager.aux-services
cd ../spark2
rpm -ivh spark2_3_1_5_0_152-yarn-shuffle-2.3.2.3.1.5.0-152.noarch.rpm
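A quick way to confirm that the packages landed where the later configuration expects them (hdp-select was installed above; the exact subcommand output may differ by build):

rpm -qa | grep 3_1_5_0_152        # HDP 3.1.5.0-152 packages just installed
ls /usr/hdp/3.1.5.0-152           # hadoop, hive, tez, zookeeper, ...
hdp-select versions               # should report 3.1.5.0-152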
3 Configuration
The memory sizes that appear below should be adjusted to your own servers; set them as generously as the hardware reasonably allows. After each configuration file, reference values from a production environment are given.
3.1 /etc/hadoop/conf/workers
List the Hadoop worker (DataNode/NodeManager) hostnames, one per line. A sample file is sketched below.
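A minimal example of the workers file, assuming a single worker named node01 (add one hostname per line for additional workers):

cat > /etc/hadoop/conf/workers <<'EOF'
node01
EOF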
3.2 /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk8
export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_HOME=/usr/hdp/current/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
export JSVC_HOME=/usr/lib/bigtop-utils
export HADOOP_HEAPSIZE="1024"
export HADOOP_NAMENODE_INIT_HEAPSIZE="-Xms1024m"
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
USER="$(whoami)"
HADOOP_JOBTRACKER_OPTS="-server -XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dmapred.audit.logger=INFO,MRAUDIT -Dhadoop.mapreduce.jobsummary.logger=INFO,JSA ${HADOOP_JOBTRACKER_OPTS}"
HADOOP_TASKTRACKER_OPTS="-server -Xmx1024m -Dhadoop.security.logger=ERROR,console -Dmapred.audit.logger=ERROR,console ${HADOOP_TASKTRACKER_OPTS}"
SHARED_HDFS_NAMENODE_OPTS="-server -XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=256m -XX:MaxNewSize=256m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT"
export HDFS_NAMENODE_OPTS="${SHARED_HDFS_NAMENODE_OPTS} -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop/bin/kill-name-node\" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 ${HDFS_NAMENODE_OPTS}"
export HDFS_DATANODE_OPTS="-server -XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop/bin/kill-data-node\" -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HDFS_DATANODE_OPTS} -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
export HDFS_SECONDARYNAMENODE_OPTS="${SHARED_HDFS_NAMENODE_OPTS} -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop/bin/kill-secondary-name-node\" ${HDFS_SECONDARYNAMENODE_OPTS}"
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m $HADOOP_CLIENT_OPTS"
HDFS_NFS3_OPTS="-Xmx1024m -Dhadoop.security.logger=ERROR,DRFAS ${HDFS_NFS3_OPTS}"
HADOOP_BALANCER_OPTS="-server -Xmx1024m ${HADOOP_BALANCER_OPTS}"
export HDFS_DATANODE_SECURE_USER=${HDFS_DATANODE_SECURE_USER:-""}
export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o SendEnv=HADOOP_CONF_DIR"
export HADOOP_LOG_DIR=/var/log/hadoop/$USER
export HADOOP_SECURE_LOG_DIR=${HADOOP_SECURE_LOG_DIR:-/var/log/hadoop/$HDFS_DATANODE_SECURE_USER}
export HADOOP_PID_DIR=/var/run/hadoop/$USER
export HADOOP_SECURE_PID_DIR=${HADOOP_SECURE_PID_DIR:-/var/run/hadoop/$HDFS_DATANODE_SECURE_USER}
YARN_RESOURCEMANAGER_OPTS="-Dyarn.server.resourcemanager.appsummary.logger=INFO,RMSUMMARY"
export HADOOP_IDENT_STRING=$USER

# Add database libraries
JAVA_JDBC_LIBS=""
if [ -d "/usr/share/java" ]; then
  for jarFile in `ls /usr/share/java | grep -E "(mysql|ojdbc|postgresql|sqljdbc)" 2>/dev/null`
  do
    JAVA_JDBC_LIBS=${JAVA_JDBC_LIBS}:$jarFile
  done
fi
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}${JAVA_JDBC_LIBS}
export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop/libexec
export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/usr/hdp/current/hadoop/lib/native/Linux-amd64-64
export HADOOP_OPTS="-Dhdp.version=$HDP_VERSION $HADOOP_OPTS"

if [ "$command" == "datanode" ] && [ "$EUID" -eq 0 ] && [ -n "$HDFS_DATANODE_SECURE_USER" ]; then
  ulimit -n 128000
fi
- HADOOP_NAMENODE_INIT_HEAPSIZE can be raised, e.g. to "-Xms20480m" (see the sketch after this list).
- HADOOP_JOBTRACKER_OPTS: the number of parallel GC threads can be raised to 8 (-XX:ParallelGCThreads=8); -XX:ErrorFile and -Xloggc can point to a path on a data disk.
- SHARED_HDFS_NAMENODE_OPTS: the number of parallel GC threads can be raised to 8 (-XX:ParallelGCThreads=8); -XX:ErrorFile and -Xloggc can point to a path on a data disk; the young-generation size -XX:NewSize=2560m -XX:MaxNewSize=2560m can be set somewhat larger (but below -Xms).
- HDFS_DATANODE_OPTS: the number of parallel GC threads can be set to 4 (-XX:ParallelGCThreads=4); -XX:ErrorFile and -Xloggc can point to a path on a data disk; the initial and maximum heap can be raised to -Xms13568m -Xmx13568m.
- HADOOP_LOG_DIR can be placed on a data disk.
- HADOOP_SECURE_LOG_DIR can be placed on a data disk.
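A sketch of those production-sized overrides in hadoop-env.sh, assuming a data disk mounted at /data01 (the heap sizes and GC-thread counts are the reference values from the list above; the remaining JVM flags stay as in the full file):

export HADOOP_NAMENODE_INIT_HEAPSIZE="-Xms20480m"
# DataNode: 4 parallel GC threads, larger heap, GC/error logs on the data disk
export HDFS_DATANODE_OPTS="-server -XX:ParallelGCThreads=4 \
  -XX:ErrorFile=/data01/log/hadoop/$USER/hs_err_pid%p.log \
  -Xloggc:/data01/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` \
  -Xms13568m -Xmx13568m ${HDFS_DATANODE_OPTS}"
# log directories on the data disk
export HADOOP_LOG_DIR=/data01/log/hadoop/$USER
export HADOOP_SECURE_LOG_DIR=/data01/log/hadoop/$HDFS_DATANODE_SECURE_USER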
3.3 /etc/hadoop/conf/core-site.xml
<property> <name>fs.defaultFS</name> <value>hdfs://node01:8020</value> <final>true</final> </property> <property> <name>fs.trash.interval</name> <value>360</value> </property> <property> <name>hadoop.http.cross-origin.allowed-headers</name> <value>X-Requested-With,Content-Type,Accept,Origin,WWW-Authenticate,Accept-Encoding,Transfer-Encoding</value> </property> <property> <name>hadoop.http.cross-origin.allowed-methods</name> <value>GET,PUT,POST,OPTIONS,HEAD,DELETE</value> </property> <property> <name>hadoop.http.cross-origin.allowed-origins</name> <value>*</value> </property> <property> <name>hadoop.http.cross-origin.max-age</name> <value>1800</value> </property> <property> <name>hadoop.http.filter.initializers</name> <value>org.apache.hadoop.security.AuthenticationFilterInitializer,org.apache.hadoop.security.HttpCrossOriginFilterInitializer</value> </property> <property> <name>hadoop.security.auth_to_local</name> <value>DEFAULT</value> </property> <property> <name>hadoop.security.authentication</name> <value>simple</value> </property> <property> <name>hadoop.security.authorization</name> <value>false</value> </property> <property> <name>hadoop.security.instrumentation.requires.admin</name> <value>false</value> </property> <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>io.file.buffer.size</name> <value>4096</value> </property> <property> <name>io.serializations</name> <value>org.apache.hadoop.io.serializer.WritableSerialization</value> </property> <property> <name>ipc.client.connect.max.retries</name> <value>10</value> </property> <property> <name>ipc.client.connection.maxidletime</name> <value>10000</value> </property> <property> <name>ipc.client.idlethreshold</name> <value>4000</value> </property> <property> <name>ipc.server.tcpnodelay</name> <value>true</value> </property> <property> <name>mapreduce.jobtracker.webinterface.trusted</name> <value>false</value> </property> <!--<property> <name>net.topology.script.file.name</name> <value>/etc/hadoop/conf/topology_script.py</value> </property>--> <property> <name>hadoop.proxyuser.hdfs.groups</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.hdfs.hosts</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.hive.groups</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.hive.hosts</name> <!--<value>bdm0,bdm1</value>--> <value>*</value> </property> <property> <name>hadoop.proxyuser.hue.groups</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.hue.hosts</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.impala.groups</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.impala.hosts</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.livy.groups</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.livy.hosts</name> <value>*</value> </property> <!--<property> <name>hadoop.proxyuser.oozie.groups</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.oozie.hosts</name> <value>bdm0</value> </property>--> <property> <name>hadoop.proxyuser.root.groups</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.root.hosts</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.yarn.hosts</name> <value>*</value> </property> <!-- <property> <name>ha.failover-controller.active-standby-elector.zk.op.retries</name> 
<value>120</value> </property> <property> <name>ha.zookeeper.quorum</name> <value>es1:2181,es2:2181,bdm0:2181,bdm1:2181,etl1:2181</value> </property> -->
- fs.defaultFS: if HA is configured, this can be written as hdfs://nameservice, where "nameservice" can be any legal name of your choosing; the later HA-related settings must then use the same name consistently.
- For other configuration items, see the official core-default.xml documentation.
3.4 /etc/hadoop/conf/hdfs-site.xml
<property> <name>dfs.permissions.superusergroup</name> <value>hdfs</value> </property> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.namenode.http-address</name> <value>node01:50070</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>/hadoop/hdfs/namenode</value> <final>true</final> </property> <property> <name>dfs.datanode.data.dir</name> <value>/hadoop/hdfs/sda,/hadoop/hdfs/sdb</value> <final>true</final> </property> <property> <name>dfs.namenode.checkpoint.dir</name> <value>/hadoop/hdfs/namesecondary</value> </property> <property> <name>dfs.namenode.secondary.http-address </name> <value>node01:50090</value> </property> <property> <name>dfs.permissions.enabled</name> <value>false</value> </property> <property> <name>dfs.client.read.shortcircuit</name> <value>true</value> </property> <property> <name>dfs.block.access.token.enable</name> <value>true</value> </property> <property> <name>dfs.blockreport.initialDelay</name> <value>120</value> </property> <property> <name>dfs.blocksize</name> <value>134217728</value> </property> <property> <name>dfs.client.failover.proxy.provider.nameservice</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> </property> <property> <name>dfs.client.read.shortcircuit.streams.cache.size</name> <value>256</value> </property> <property> <name>dfs.client.retry.policy.enabled</name> <value>false</value> </property> <property> <name>dfs.cluster.administrators</name> <value>hdfs</value> </property> <property> <name>dfs.content-summary.limit</name> <value>5000</value> </property> <property> <name>dfs.datanode.address</name> <value>0.0.0.0:50010</value> </property> <property> <name>dfs.datanode.balance.bandwidthPerSec</name> <value>6250000</value> </property> <property> <name>dfs.datanode.data.dir.perm</name> <value>750</value> </property> <property> <name>dfs.datanode.du.reserved</name> <value>1340866560</value> </property> <property> <name>dfs.datanode.failed.volumes.tolerated</name> <value>0</value> <final>true</final> </property> <property> <name>dfs.datanode.http.address</name> <value>0.0.0.0:50075</value> </property> <property> <name>dfs.datanode.https.address</name> <value>0.0.0.0:50475</value> </property> <property> <name>dfs.datanode.ipc.address</name> <value>0.0.0.0:8010</value> </property> <property> <name>dfs.datanode.max.transfer.threads</name> <value>4096</value> </property> <property> <name>dfs.domain.socket.path</name> <value>/var/lib/hadoop-hdfs/dn_socket</value> </property> <property> <name>dfs.encrypt.data.transfer.cipher.suites</name> <value>AES/CTR/NoPadding</value> </property> <property> <name>dfs.heartbeat.interval</name> <value>3</value> </property> <property> <name>dfs.hosts.exclude</name> <value>/etc/hadoop/conf/dfs.exclude</value> </property> <property> <name>dfs.http.policy</name> <value>HTTP_ONLY</value> </property> <property> <name>dfs.https.port</name> <value>50470</value> </property> <property> <name>dfs.namenode.accesstime.precision</name> <value>0</value> </property> <property> <name>dfs.namenode.acls.enabled</name> <value>true</value> </property> <property> <name>dfs.namenode.audit.log.async</name> <value>true</value> </property> <property> <name>dfs.namenode.avoid.read.stale.datanode</name> <value>true</value> </property> <property> <name>dfs.namenode.avoid.write.stale.datanode</name> <value>true</value> </property> <property> <name>dfs.namenode.checkpoint.edits.dir</name> <value>${dfs.namenode.checkpoint.dir}</value> 
</property> <property> <name>dfs.namenode.checkpoint.period</name> <value>21600</value> </property> <property> <name>dfs.namenode.checkpoint.txns</name> <value>1000000</value> </property> <property> <name>dfs.namenode.fslock.fair</name> <value>false</value> </property> <property> <name>dfs.namenode.handler.count</name> <value>800</value> </property> <property> <name>dfs.namenode.name.dir.restore</name> <value>true</value> </property> <property> <name>dfs.namenode.safemode.threshold-pct</name> <value>0.99</value> </property> <property> <name>dfs.namenode.stale.datanode.interval</name> <value>30000</value> </property> <property> <name>dfs.namenode.startup.delay.block.deletion.sec</name> <value>3600</value> </property> <property> <name>dfs.namenode.write.stale.datanode.ratio</name> <value>1.0f</value> </property> <property> <name>dfs.permissions.ContentSummary.subAccess</name> <value>false</value> </property> <property> <name>dfs.replication.max</name> <value>50</value> </property> <property> <name>dfs.webhdfs.enabled</name> <value>true</value> <final>true</final> </property> <property> <name>fs.permissions.umask-mode</name> <value>022</value> </property> <property> <name>hadoop.caller.context.enabled</name> <value>true</value> </property> <property> <name>hadoop.http.authentication.type</name> <value>simple</value> </property> <property> <name>manage.include.files</name> <value>false</value> </property> <property> <name>nfs.exports.allowed.hosts</name> <value>* rw</value> </property> <property> <name>nfs.file.dump.dir</name> <value>/tmp/.hdfs-nfs</value> </property> <!-- <property> <name>dfs.ha.automatic-failover.enabled</name> <value>true</value> </property> <property> <name>dfs.ha.fencing.methods</name> <value>shell(/bin/true)</value> </property> <property> <name>dfs.ha.namenodes.nameservice</name> <value>nn1,nn2</value> </property> <property> <name>dfs.internal.nameservices</name> <value>nameservice</value> </property> <property> <name>dfs.journalnode.edits.dir</name> <value>/hadoop/hdfs/journal</value> </property> <property> <name>dfs.journalnode.http-address</name> <value>0.0.0.0:8480</value> </property> <property> <name>dfs.journalnode.https-address</name> <value>0.0.0.0:8481</value> </property> <property> <name>dfs.namenode.http-address.nameservice.nn1</name> <value>bdm0:50070</value> </property> <property> <name>dfs.namenode.http-address.nameservice.nn2</name> <value>bdm1:50070</value> </property> <property> <name>dfs.namenode.https-address.nameservice.nn1</name> <value>bdm0:50470</value> </property> <property> <name>dfs.namenode.https-address.nameservice.nn2</name> <value>bdm1:50470</value> </property> <property> <name>dfs.namenode.rpc-address.nameservice.nn1</name> <value>bdm0:8020</value> </property> <property> <name>dfs.namenode.rpc-address.nameservice.nn2</name> <value>bdm1:8020</value> </property> <property> <name>dfs.namenode.shared.edits.dir</name> <value>qjournal://bdm0:8485;bdm1:8485;etl1:8485/nameservice</value> </property> <property> <name>dfs.nameservices</name> <value>nameservice</value> </property> -->
- dfs.replication: the number of block replicas. In production it is recommended to set this to at least 3.
- Enabling HA depends on ZooKeeper. With HA, the SecondaryNameNode process is no longer needed; instead, an odd number of JournalNode services must be started.
- dfs.namenode.http-address: with HA enabled, configure it per NameNode, e.g. dfs.namenode.http-address.nameservice.nn1=bdm0:50070 and dfs.namenode.http-address.nameservice.nn2=bdm1:50070, where "nameservice" is the name specified by dfs.nameservices.
- dfs.namenode.https-address: with HA enabled, configure dfs.namenode.https-address.nameservice.nn1=bdm0:50470 and dfs.namenode.https-address.nameservice.nn2=bdm1:50470.
- dfs.domain.socket.path=/var/lib/hadoop-hdfs/dn_socket: the user that starts the DataNode must have permission on this path; ideally the path is owned by that user, and the user belongs to the path's group (a sketch of preparing the directory follows this list).
- For other configuration items, see the official hdfs-default.xml documentation.
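A minimal sketch of preparing the short-circuit-read socket directory, assuming the DataNode runs as user hdfs in group hadoop (adjust to your actual startup user):

mkdir -p /var/lib/hadoop-hdfs
chown hdfs:hadoop /var/lib/hadoop-hdfs
chmod 750 /var/lib/hadoop-hdfs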
3.5 /etc/hadoop/conf/mapred-env.sh
HDP_VERSION="3.1.5.0-152"
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=900
export HADOOP_LOGLEVEL=${HADOOP_LOGLEVEL:-INFO}
export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-INFO,console}
export HADOOP_DAEMON_ROOT_LOGGER=${HADOOP_DAEMON_ROOT_LOGGER:-${HADOOP_LOGLEVEL},RFA}
export HADOOP_OPTS="-Dhdp.version=$HDP_VERSION $HADOOP_OPTS"
#export HADOOP_OPTS="-Djava.io.tmpdir=/var/lib/ambari-server/data/tmp/hadoop_java_io_tmpdir $HADOOP_OPTS"
export JAVA_LIBRARY_PATH="${JAVA_LIBRARY_PATH}:/var/lib/ambari-server/data/tmp/hadoop_java_io_tmpdir"
export HADOOP_LOG_DIR=/var/log/hadoop-mapreduce/$USER
export HADOOP_PID_DIR=/var/run/hadoop-mapreduce/$USER
3.6 /etc/hadoop/conf/mapred-site.xml
<property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> <property> <name>mapred.local.dir</name> <value>/hadoop/mapred</value> </property> <property> <name>mapreduce.map.java.opts</name> <value>-Xmx1024m</value> </property> <property> <name>mapreduce.map.memory.mb</name> <value>256</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx1024m</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>256</value> </property> <property> <name>yarn.app.mapreduce.am.command-opts</name> <value>-Xmx512m -Dhdp.version=${hdp.version}</value> </property> <property> <name>mapreduce.jobhistory.address</name> <value>node01:10020</value> </property> <property> <name>mapreduce.jobhistory.webapp.address</name> <value>node01:19888</value> </property> <property> <name>mapreduce.jobhistory.webapp.https.address</name> <value>node01:19890</value> </property> <property> <name>hadoop.http.authentication.type</name> <value>simple</value> </property> <property> <name>mapred.map.tasks.speculative.execution</name> <value>true</value> </property> <property> <name>mapred.reduce.tasks.speculative.execution</name> <value>true</value> </property> <property> <name>mapreduce.admin.map.child.java.opts</name> <value>-server -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version}</value> </property> <property> <name>mapreduce.admin.reduce.child.java.opts</name> <value>-server -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version}</value> </property> <property> <name>mapreduce.admin.user.env</name> <value>LD_LIBRARY_PATH=/usr/hdp/current/hadoop/lib/native:/usr/hdp/current/hadoop/lib/native/Linux-amd64-64</value> </property> <property> <name>mapreduce.cluster.acls.enabled</name> <value>false</value> </property> <property> <name>mapreduce.am.max-attempts</name> <value>2</value> </property> <property> <name>mapreduce.application.classpath</name> <value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/current/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure</value> </property> <property> <name>mapreduce.application.framework.path</name> <value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value> </property> <property> <name>mapreduce.cluster.administrators</name> <value>hadoop</value> </property> <property> <name>mapreduce.job.acl-modify-job</name> <value> </value> </property> <property> <name>mapreduce.job.acl-view-job</name> <value> </value> </property> <property> <name>mapreduce.job.counters.max</name> <value>130</value> </property> <property> <name>mapreduce.job.emit-timeline-data</name> <value>true</value> </property> <property> <name>mapreduce.job.queuename</name> <value>default</value> </property> <property> <name>mapreduce.job.reduce.slowstart.completedmaps</name> <value>0.05</value> </property> <property> <name>mapreduce.jobhistory.admin.acl</name> <value>*</value> </property> <property> <name>mapreduce.jobhistory.bind-host</name> <value>0.0.0.0</value> </property> <!--<property> <name>mapreduce.jobhistory.done-dir</name> <value>/mr-history/done</value> </property>--> <property> 
<name>mapreduce.jobhistory.http.policy</name> <value>HTTP_ONLY</value> </property> <!--<property> <name>mapreduce.jobhistory.intermediate-done-dir</name> <value>/mr-history/tmp</value> </property>--> <property> <name>mapreduce.jobhistory.recovery.enable</name> <value>true</value> </property> <property> <name>mapreduce.jobhistory.recovery.store.class</name> <value>org.apache.hadoop.mapreduce.v2.hs.HistoryServerLeveldbStateStoreService</value> </property> <property> <name>mapreduce.jobhistory.recovery.store.leveldb.path</name> <value>/hadoop/mapreduce/jhs</value> </property> <property> <name>mapreduce.map.log.level</name> <value>INFO</value> </property> <property> <name>mapreduce.map.output.compress</name> <value>false</value> </property> <property> <name>mapreduce.map.sort.spill.percent</name> <value>0.7</value> </property> <property> <name>mapreduce.map.speculative</name> <value>false</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress</name> <value>false</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress.type</name> <value>BLOCK</value> </property> <property> <name>mapreduce.reduce.input.buffer.percent</name> <value>0.0</value> </property> <property> <name>mapreduce.reduce.log.level</name> <value>INFO</value> </property> <property> <name>mapreduce.reduce.shuffle.fetch.retry.enabled</name> <value>1</value> </property> <property> <name>mapreduce.reduce.shuffle.fetch.retry.interval-ms</name> <value>1000</value> </property> <property> <name>mapreduce.reduce.shuffle.fetch.retry.timeout-ms</name> <value>30000</value> </property> <property> <name>mapreduce.reduce.shuffle.input.buffer.percent</name> <value>0.7</value> </property> <property> <name>mapreduce.reduce.shuffle.merge.percent</name> <value>0.66</value> </property> <property> <name>mapreduce.reduce.shuffle.parallelcopies</name> <value>30</value> </property> <property> <name>mapreduce.reduce.speculative</name> <value>false</value> </property> <property> <name>mapreduce.shuffle.port</name> <value>13562</value> </property> <property> <name>mapreduce.task.io.sort.factor</name> <value>100</value> </property> <property> <name>mapreduce.task.io.sort.mb</name> <value>256</value> </property> <property> <name>mapreduce.task.timeout</name> <value>300000</value> </property> <property> <name>yarn.app.mapreduce.am.admin-command-opts</name> <value>-Dhdp.version=${hdp.version}</value> </property> <property> <name>yarn.app.mapreduce.am.log.level</name> <value>INFO</value> </property> <property> <name>yarn.app.mapreduce.am.resource.mb</name> <value>512</value> </property> <property> <name>yarn.app.mapreduce.am.staging-dir</name> <value>/user</value> </property>
- mapreduce.map.java.opts=-Xmx12697m: the JVM heap size for map tasks. If it is set too small, map tasks will throw OutOfMemory errors at runtime.
- mapreduce.map.memory.mb=12288: the maximum memory for each map container. The default is -1, in which case it is inferred from mapreduce.job.heap.memory-mb.ratio (0.8 by default) and mapreduce.map.java.opts (see the sketch after this list). When a container's memory usage exceeds this value, the NodeManager kills the container.
- mapreduce.reduce.java.opts=-Xmx16384m: analogous to mapreduce.map.java.opts, but for reduce tasks.
- mapreduce.reduce.memory.mb=12288: analogous to mapreduce.map.memory.mb, but for reduce containers.
- yarn.app.mapreduce.am.command-opts=-Xmx8192m -Dhdp.version=${hdp.version}: the MapReduce ApplicationMaster heap size; the default is -Xmx1024m. If many jobs are submitted and this value is set too large, the memory occupied by AMs may exceed yarn.scheduler.capacity.maximum-am-resource-percent, and newly submitted jobs will wait in the queue.
- For other configuration items, see the official mapred-default.xml documentation.
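A quick arithmetic sketch of the heap-to-container inference mentioned above, using the 12288 MB container and the default 0.8 ratio as example inputs:

CONTAINER_MB=12288
HEAP_RATIO=0.8    # mapreduce.job.heap.memory-mb.ratio
# heap the framework would infer when mapreduce.map.java.opts is not set
awk -v c="$CONTAINER_MB" -v r="$HEAP_RATIO" 'BEGIN { printf "-Xmx%dm\n", c * r }'
# => -Xmx9830m; an explicitly configured -Xmx should likewise fit inside the container limit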
3.7 /etc/hadoop/conf/yarn-env.sh
export HADOOP_YARN_HOME=/usr/hdp/current/hadoop-yarn
export HADOOP_LOG_DIR=/var/log/hadoop-yarn/yarn
export HADOOP_SECURE_LOG_DIR=/var/log/hadoop-yarn/yarn
export HADOOP_PID_DIR=/var/run/hadoop-yarn/yarn
export HADOOP_SECURE_PID_DIR=/var/run/hadoop-yarn/yarn
export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop/libexec
export JAVA_HOME=/usr/local/jdk8
#export JAVA_LIBRARY_PATH="${JAVA_LIBRARY_PATH}:/var/lib/ambari-server/data/tmp/hadoop_java_io_tmpdir"
export HADOOP_LOGLEVEL=${HADOOP_LOGLEVEL:-INFO}
export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-INFO,console}
export HADOOP_DAEMON_ROOT_LOGGER=${HADOOP_DAEMON_ROOT_LOGGER:-${HADOOP_LOGLEVEL},EWMA,RFA}

# User for YARN daemons
export HADOOP_YARN_USER=${HADOOP_YARN_USER:-yarn}

# some Java parameters
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
if [ "$JAVA_HOME" != "" ]; then
  #echo "run java in $JAVA_HOME"
  JAVA_HOME=$JAVA_HOME
fi
if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m
YARN_HEAPSIZE=1024

# check envvars which might override default args
if [ "$YARN_HEAPSIZE" != "" ]; then
  JAVA_HEAP_MAX="-Xmx""$YARN_HEAPSIZE""m"
fi

export YARN_RESOURCEMANAGER_HEAPSIZE=1024
export YARN_NODEMANAGER_HEAPSIZE=1024
export YARN_TIMELINESERVER_HEAPSIZE=1024

IFS=

# default log directory and file
if [ "$HADOOP_LOG_DIR" = "" ]; then
  HADOOP_LOG_DIR="$HADOOP_YARN_HOME/logs"
fi
if [ "$HADOOP_LOGFILE" = "" ]; then
  HADOOP_LOGFILE='yarn.log'
fi

# default policy file for service-level authorization
if [ "$YARN_POLICYFILE" = "" ]; then
  YARN_POLICYFILE="hadoop-policy.xml"
fi

# restore ordinary behaviour
unset IFS

HADOOP_OPTS="$HADOOP_OPTS -Dyarn.id.str=$YARN_IDENT_STRING"
HADOOP_OPTS="$HADOOP_OPTS -Dyarn.policy.file=$YARN_POLICYFILE"
#HADOOP_OPTS="$HADOOP_OPTS -Djava.io.tmpdir=/var/lib/ambari-server/data/tmp/hadoop_java_io_tmpdir"
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -Dnm.audit.logger=INFO,NMAUDIT"
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Dyarn.server.resourcemanager.appsummary.logger=INFO,RMSUMMARY -Drm.audit.logger=INFO,RMAUDIT"
export YARN_REGISTRYDNS_SECURE_USER=yarn
export YARN_REGISTRYDNS_SECURE_EXTRA_OPTS="-jvm server"
- HADOOP_LOG_DIR and HADOOP_SECURE_LOG_DIR can be set to a path on a data disk.
- YARN_RESOURCEMANAGER_HEAPSIZE can be increased as appropriate, e.g. to 3072.
- YARN_NODEMANAGER_HEAPSIZE can be increased as appropriate, e.g. to 3072.
- YARN_TIMELINESERVER_HEAPSIZE can be increased as appropriate, e.g. to 8072.
3.8 /etc/hadoop/conf/capacity-scheduler.xml
In the capacity scheduler configuration file, the setting below is the main one to watch; keep the default or adjust it to your situation. It sets the maximum fraction of total YARN memory that ApplicationMasters may use (default 0.1). If many streaming jobs or concurrent jobs run on the cluster, increase it, e.g. to 0.4 or 0.5.
<property> <name>yarn.scheduler.capacity.maximum-am-resource-percent</name> <value>0.4</value> </property>
3.9 /etc/hadoop/conf/yarn-site.xml
<property> <name>yarn.resourcemanager.hostname</name> <value>node01</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <!--<value>mapreduce_shuffle,spark2_shuffle,timeline_collector</value>--> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property> <property> <name>yarn.nodemanager.local-dirs</name> <value>/hadoop/yarn/local</value> </property> <property> <name>yarn.nodemanager.log-dirs</name> <value>/hadoop/yarn/log</value> </property> <property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property> <property> <name>yarn.log.server.url</name> <value>http://node01:19888/jobhistory/logs</value> </property> <property> <name>yarn.nodemanager.vmem-check-enabled</name> <value>false</value> </property> <property> <name>yarn.resourcemanager.webapp.https.address</name> <value>node01:8090</value> </property> <property> <name>yarn.application.classpath</name> <value> $HADOOP_CONF_DIR, /usr/hdp/current/hadoop/*, /usr/hdp/current/hadoop/lib/*, /usr/hdp/current/hadoop-hdfs/*, /usr/hdp/current/hadoop-hdfs/lib/*, /usr/hdp/current/hadoop-yarn/*, /usr/hdp/current/hadoop-yarn/lib/* </value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>4096</value> </property> <property> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>2</value> </property> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>256</value> </property> <property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>1</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>5000</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>8</value> </property> <property> <name>hadoop.http.authentication.type</name> <value>simple</value> </property> <!--<property> <name>hadoop.http.cross-origin.allowed-origins</name> <value>regex:.*[.]bdm1[.]com(:\d*)?</value> </property>--> <property> <name>hadoop.registry.dns.bind-address</name> <value>0.0.0.0</value> </property> <property> <name>hadoop.registry.dns.bind-port</name> <value>53</value> </property> <property> <name>hadoop.registry.dns.domain-name</name> <value>EXAMPLE.COM</value> </property> <property> <name>hadoop.registry.dns.enabled</name> <value>true</value> </property> <property> <name>hadoop.registry.dns.zone-mask</name> <value>255.255.255.0</value> </property> <property> <name>hadoop.registry.dns.zone-subnet</name> <value>172.17.0.0</value> </property> <property> <name>hadoop.registry.zk.quorum</name> <value>node01:2181</value> </property> <property> <name>manage.include.files</name> <value>false</value> </property> <property> <name>yarn.acl.enable</name> <value>false</value> </property> <property> <name>yarn.admin.acl</name> <value>activity_analyzer,yarn</value> </property> <property> <name>yarn.client.nodemanager-connect.max-wait-ms</name> <value>60000</value> </property> <property> <name>yarn.client.nodemanager-connect.retry-interval-ms</name> <value>10000</value> </property> <property> <name>yarn.http.policy</name> <value>HTTP_ONLY</value> </property> <property> <name>yarn.log-aggregation.retain-seconds</name> <value>2592000</value> </property> <property> <name>yarn.log.server.web-service.url</name> <value>http://node01:8188/ws/v1/applicationhistory</value> </property> <property> <name>yarn.node-labels.enabled</name> <value>false</value> </property> <property> 
<name>yarn.node-labels.fs-store.retry-policy-spec</name> <value>2000, 500</value> </property> <property> <name>yarn.node-labels.fs-store.root-dir</name> <value>/system/yarn/node-labels</value> </property> <property> <name>yarn.nodemanager.address</name> <value>0.0.0.0:45454</value> </property> <property> <name>yarn.nodemanager.admin-env</name> <value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX</value> </property> <property> <name>yarn.nodemanager.aux-services.spark2_shuffle.class</name> <value>org.apache.spark.network.yarn.YarnShuffleService</value> </property> <property> <name>yarn.nodemanager.aux-services.spark2_shuffle.classpath</name> <value>/usr/hdp/current/spark2/aux/*</value> </property> <property> <name>yarn.nodemanager.aux-services.spark_shuffle.class</name> <value>org.apache.spark.network.yarn.YarnShuffleService</value> </property> <property> <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name> <value>/usr/hdp/current/spark/aux/*</value> </property> <property> <name>yarn.nodemanager.aux-services.timeline_collector.class</name> <value>org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService</value> </property> <property> <name>yarn.nodemanager.bind-host</name> <value>0.0.0.0</value> </property> <property> <name>yarn.nodemanager.container-executor.class</name> <value>org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor</value> </property> <property> <name>yarn.nodemanager.container-metrics.unregister-delay-ms</name> <value>60000</value> </property> <property> <name>yarn.nodemanager.container-monitor.interval-ms</name> <value>3000</value> </property> <property> <name>yarn.nodemanager.delete.debug-delay-sec</name> <value>0</value> </property> <property> <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name> <value>90</value> </property> <property> <name>yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb</name> <value>1000</value> </property> <property> <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name> <value>0.25</value> </property> <property> <name>yarn.nodemanager.health-checker.interval-ms</name> <value>135000</value> </property> <property> <name>yarn.nodemanager.health-checker.script.timeout-ms</name> <value>60000</value> </property> <property> <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name> <value>false</value> </property> <property> <name>yarn.nodemanager.linux-container-executor.group</name> <value>hadoop</value> </property> <property> <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name> <value>true</value> </property> <property> <name>yarn.nodemanager.log-aggregation.compression-type</name> <value>gz</value> </property> <property> <name>yarn.nodemanager.log-aggregation.debug-enabled</name> <value>false</value> </property> <property> <name>yarn.nodemanager.log-aggregation.num-log-files-per-app</name> <value>30</value> </property> <property> <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name> <value>3600</value> </property> <property> <name>yarn.nodemanager.log.retain-seconds</name> <value>1209600</value> </property> <property> <name>yarn.nodemanager.recovery.dir</name> <value>/var/log/hadoop-yarn/nodemanager/recovery-state</value> </property> <property> <name>yarn.nodemanager.recovery.enabled</name> <value>true</value> </property> <property> <name>yarn.nodemanager.recovery.supervised</name> <value>true</value> </property> <property> 
<name>yarn.nodemanager.remote-app-log-dir</name> <value>/hadoop/app-logs</value> </property> <property> <name>yarn.nodemanager.remote-app-log-dir-suffix</name> <value>logs</value> </property> <property> <name>yarn.nodemanager.resource-plugins</name> <value></value> </property> <property> <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name> <value></value> </property> <property> <name>yarn.nodemanager.resource-plugins.gpu.docker-plugin</name> <value></value> </property> <property> <name>yarn.nodemanager.resource-plugins.gpu.docker-plugin.nvidiadocker-v1.endpoint</name> <value></value> </property> <property> <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name> <value></value> </property> <property> <name>yarn.nodemanager.resource.pcores-vcores-multiplier</name> <value>2</value> </property> <property> <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name> <value>80</value> </property> <property> <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name> <value>default,docker</value> </property> <property> <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name> <value>host,none,bridge</value> </property> <property> <name>yarn.nodemanager.runtime.linux.docker.capabilities</name> <value> CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP, SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE</value> </property> <property> <name>yarn.nodemanager.runtime.linux.docker.default-container-network</name> <value>host</value> </property> <property> <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.acl</name> <value></value> </property> <property> <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.allowed</name> <value>false</value> </property> <property> <name>yarn.nodemanager.vmem-pmem-ratio</name> <value>2.1</value> </property> <property> <name>yarn.nodemanager.webapp.cross-origin.enabled</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.address</name> <value>node01:8050</value> </property> <property> <name>yarn.resourcemanager.admin.address</name> <value>node01:8141</value> </property> <property> <name>yarn.resourcemanager.am.max-attempts</name> <value>2</value> </property> <property> <name>yarn.resourcemanager.bind-host</name> <value>0.0.0.0</value> </property> <property> <name>yarn.resourcemanager.cluster-id</name> <value>yarn-cluster</value> </property> <property> <name>yarn.resourcemanager.connect.max-wait.ms</name> <value>900000</value> </property> <property> <name>yarn.resourcemanager.connect.retry-interval.ms</name> <value>30000</value> </property> <property> <name>yarn.resourcemanager.display.per-user-apps</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.fs.state-store.retry-policy-spec</name> <value>2000, 500</value> </property> <property> <name>yarn.resourcemanager.fs.state-store.uri</name> <value> </value> </property> <property> <name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name> <value>/yarn-leader-election</value> </property> <property> <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.enabled</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval</name> <value>15000</value> </property> <property> <name>yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor</name> <value>1</value> </property> <property> 
<name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name> <value>0.1</value> </property> <!--<property> <name>yarn.resourcemanager.nodes.exclude-path</name> <value>/etc/hadoop/conf/yarn.exclude</value> </property>--> <property> <name>yarn.resourcemanager.placement-constraints.handler</name> <value>scheduler</value> </property> <property> <name>yarn.resourcemanager.recovery.enabled</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.resource-tracker.address</name> <value>node01:8025</value> </property> <property> <name>yarn.resourcemanager.scheduler.address</name> <value>node01:8030</value> </property> <property> <name>yarn.resourcemanager.scheduler.class</name> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> </property> <property> <name>yarn.resourcemanager.scheduler.monitor.enable</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.state-store.max-completed-applications</name> <value>${yarn.resourcemanager.max-completed-applications}</value> </property> <property> <name>yarn.resourcemanager.store.class</name> <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value> </property> <property> <name>yarn.resourcemanager.system-metrics-publisher.dispatcher.pool-size</name> <value>10</value> </property> <property> <name>yarn.resourcemanager.system-metrics-publisher.enabled</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.webapp.address</name> <value>node01:8088</value> </property> <property> <name>yarn.resourcemanager.webapp.cross-origin.enabled</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled</name> <value>false</value> </property> <property> <name>yarn.resourcemanager.work-preserving-recovery.enabled</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name> <value>10000</value> </property> <property> <name>yarn.resourcemanager.zk-acl</name> <value>world:anyone:rwcda</value> </property> <property> <name>yarn.resourcemanager.zk-address</name> <value>node01:2181</value> </property> <property> <name>yarn.resourcemanager.zk-num-retries</name> <value>1000</value> </property> <property> <name>yarn.resourcemanager.zk-retry-interval-ms</name> <value>1000</value> </property> <property> <name>yarn.resourcemanager.zk-state-store.parent-path</name> <value>/rmstore</value> </property> <property> <name>yarn.resourcemanager.zk-timeout-ms</name> <value>10000</value> </property> <property> <name>yarn.rm.system-metricspublisher.emit-container-events</name> <value>true</value> </property> <property> <name>yarn.scheduler.capacity.ordering-policy.priority-utilization.underutilized-preemption.enabled</name> <value>true</value> </property> <property> <name>yarn.service.framework.path</name> <value>/hdp/apps/${hdp.version}/hadoop-yarn/lib/service-dep.tar.gz</value> </property> <property> <name>yarn.service.system-service.dir</name> <value>/services</value> </property> <property> <name>yarn.system-metricspublisher.enabled</name> <value>true</value> </property> <property> <name>yarn.timeline-service.address</name> <value>node01:10200</value> </property> <property> <name>yarn.timeline-service.bind-host</name> <value>0.0.0.0</value> </property> <property> <name>yarn.timeline-service.client.max-retries</name> <value>30</value> </property> <property> 
<name>yarn.timeline-service.client.retry-interval-ms</name> <value>1000</value> </property> <property> <name>yarn.timeline-service.enabled</name> <value>false</value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.active-dir</name> <value>/ats/active/</value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.app-cache-size</name> <value>10</value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.cleaner-interval-seconds</name> <value>3600</value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.done-dir</name> <value>/ats/done/</value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.group-id-plugin-classes</name> <value>org.apache.hadoop.yarn.applications.distributedshell.DistributedShellTimelinePlugin</value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.group-id-plugin-classpath</name> <value></value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.retain-seconds</name> <value>604800</value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.scan-interval-seconds</name> <value>60</value> </property> <property> <name>yarn.timeline-service.entity-group-fs-store.summary-store</name> <value>org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore</value> </property> <property> <name>yarn.timeline-service.generic-application-history.save-non-am-container-meta-info</name> <value>false</value> </property> <property> <name>yarn.timeline-service.generic-application-history.store-class</name> <value>org.apache.hadoop.yarn.server.applicationhistoryservice.NullApplicationHistoryStore</value> </property> <property> <name>yarn.timeline-service.hbase-schema.prefix</name> <value>prod.</value> </property> <property> <name>yarn.timeline-service.hbase.configuration.file</name> <value>file:///usr/hdp/${hdp.version}/hadoop/conf/embedded-yarn-ats-hbase/hbase-site.xml</value> </property> <property> <name>yarn.timeline-service.hbase.coprocessor.jar.hdfs.location</name> <value>file:///usr/hdp/${hdp.version}/hadoop-yarn/timelineservice/hadoop-yarn-server-timelineservice-hbase-coprocessor.jar</value> </property> <property> <name>yarn.timeline-service.http-authentication.proxyuser.root.groups</name> <value>*</value> </property> <property> <name>yarn.timeline-service.http-authentication.proxyuser.root.hosts</name> <value>node01</value> </property> <property> <name>yarn.timeline-service.http-authentication.simple.anonymous.allowed</name> <value>true</value> </property> <property> <name>yarn.timeline-service.http-authentication.type</name> <value>simple</value> </property> <property> <name>yarn.timeline-service.http-cross-origin.enabled</name> <value>true</value> </property> <property> <name>yarn.timeline-service.leveldb-state-store.path</name> <value>/hadoop/yarn/timeline</value> </property> <property> <name>yarn.timeline-service.leveldb-timeline-store.path</name> <value>/hadoop/yarn/timeline</value> </property> <property> <name>yarn.timeline-service.leveldb-timeline-store.read-cache-size</name> <value>104857600</value> </property> <property> <name>yarn.timeline-service.leveldb-timeline-store.start-time-read-cache-size</name> <value>10000</value> </property> <property> <name>yarn.timeline-service.leveldb-timeline-store.start-time-write-cache-size</name> <value>10000</value> </property> <property> <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name> <value>300000</value> </property> <property> 
<name>yarn.timeline-service.reader.webapp.address</name> <value>node01:8198</value> </property> <property> <name>yarn.timeline-service.reader.webapp.https.address</name> <value>node01:8199</value> </property> <property> <name>yarn.timeline-service.recovery.enabled</name> <value>true</value> </property> <property> <name>yarn.timeline-service.state-store-class</name> <value>org.apache.hadoop.yarn.server.timeline.recovery.LeveldbTimelineStateStore</value> </property> <property> <name>yarn.timeline-service.store-class</name> <value>org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore</value> </property> <property> <name>yarn.timeline-service.ttl-enable</name> <value>true</value> </property> <property> <name>yarn.timeline-service.ttl-ms</name> <value>2678400000</value> </property> <property> <name>yarn.timeline-service.version</name> <value>2.0f</value> </property> <property> <name>yarn.timeline-service.versions</name> <value>1.5f,2.0f</value> </property> <property> <name>yarn.timeline-service.webapp.address</name> <value>node01:8188</value> </property> <property> <name>yarn.timeline-service.webapp.https.address</name> <value>node01:8190</value> </property> <property> <name>yarn.webapp.api-service.enable</name> <value>true</value> </property> <property> <name>yarn.webapp.ui2.enable</name> <value>true</value> </property> <!-- <property> <name>yarn.resourcemanager.ha.enabled</name> <value>true</value> </property> <property> <name>yarn.resourcemanager.ha.rm-ids</name> <value>rm1,rm2</value> </property> <property> <name>yarn.resourcemanager.hostname.rm1</name> <value>bdm0</value> </property> <property> <name>yarn.resourcemanager.hostname.rm2</name> <value>bdm1</value> </property> <property> <name>yarn.resourcemanager.webapp.address.rm1</name> <value>bdm0:8088</value> </property> <property> <name>yarn.resourcemanager.webapp.address.rm2</name> <value>bdm1:8088</value> </property> <property> <name>yarn.resourcemanager.webapp.https.address.rm1</name> <value>bdm0:8090</value> </property> <property> <name>yarn.resourcemanager.webapp.https.address.rm2</name> <value>bdm1:8090</value> </property> -->
- yarn.nodemanager.resource.cpu-vcores=64: the number of vcores each NodeManager can use. The default is -1. Here it is set to twice the number of physical cores as an example.
- yarn.nodemanager.resource.memory-mb=131072: the amount of memory each NodeManager can use. With the value -1 and yarn.nodemanager.resource.detect-hardware-capabilities=true it is calculated automatically; otherwise the default is 8192 MB. Here it is set to 128 GB as an example.
- yarn.scheduler.minimum-allocation-vcores=1: the minimum number of vcores a single container can request, e.g. 1.
- yarn.scheduler.minimum-allocation-mb=2048: the minimum amount of memory a single container can request, e.g. 2048.
- yarn.scheduler.maximum-allocation-vcores=8: the maximum number of vcores a single container can request, e.g. 8.
- yarn.scheduler.maximum-allocation-mb=30720: the maximum amount of memory a single container can request, e.g. 30720 (a rough container-count sketch follows this list).
- yarn.timeline-service.enabled=false: whether to enable the timeline service; it is left disabled here.
- yarn.nodemanager.log-aggregation.compression-type=gz: the compression algorithm for aggregated logs; the default is none, set to gz here.
- For other configuration items, see the official yarn-default.xml documentation.
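A back-of-the-envelope sketch of how these limits bound the number of containers per node, using the production reference values above (131072 MB and 64 vcores per NodeManager, 2048 MB / 1 vcore minimum allocation); this is rough arithmetic, not the scheduler's exact calculation:

NM_MEM_MB=131072;  NM_VCORES=64
MIN_ALLOC_MB=2048; MIN_ALLOC_VCORES=1
echo "$(( NM_MEM_MB / MIN_ALLOC_MB )) containers by memory"     # 64
echo "$(( NM_VCORES / MIN_ALLOC_VCORES )) containers by vcores" # 64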
3.10 /etc/zookeeper/conf/zookeeper-env.sh
export JAVA_HOME=/usr/local/jdk8
export ZOOKEEPER_HOME=/usr/hdp/current/zookeeper
export ZOO_LOG_DIR=/var/log/zookeeper
export ZOOPIDFILE=/var/run/zookeeper/zookeeper_server.pid
export SERVER_JVMFLAGS=-Xmx256m
export JAVA=$JAVA_HOME/bin/java
export CLASSPATH=$CLASSPATH:/usr/share/zookeeper/*
- ZOO_LOG_DIR can be set to a path on a data disk.
- SERVER_JVMFLAGS: the maximum heap can be raised, e.g. to -Xmx1024m.
3.11 /etc/zookeeper/conf/zoo.cfg
tickTime=2000
maxClientCnxns=50
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
autopurge.snapRetainCount=5
autopurge.purgeInterval=24
admin.enableServer=false
server.1=node01:2887:3887
#...
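Each ZooKeeper server also needs a myid file under dataDir that matches its server.N line before it is started; a minimal sketch for node01 (server.1), assuming a zookeeper user and the ZOOKEEPER_HOME set in zookeeper-env.sh above:

mkdir -p /var/lib/zookeeper /var/log/zookeeper /var/run/zookeeper
echo 1 > /var/lib/zookeeper/myid
chown -R zookeeper:zookeeper /var/lib/zookeeper /var/log/zookeeper /var/run/zookeeper
su - zookeeper -c "/usr/hdp/current/zookeeper/bin/zkServer.sh start"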
3.12 /etc/tez/conf/tez-env.sh
export TEZ_CONF_DIR=/etc/tez/conf/
export HADOOP_HOME=${HADOOP_HOME:-/usr}
export JAVA_HOME=/usr/local/jdk8
3.13 /etc/tez/conf/tez-site.xml
<property> <name>tez.task.resource.memory.mb</name> <value>512</value> </property> <property> <name>tez.am.resource.memory.mb</name> <value>512</value> </property> <property> <name>tez.counters.max</name> <value>10000</value> </property> <property> <name>tez.lib.uris</name> <value>/hdp/apps/3.1.5.0-152/tez/tez.tar.gz</value> </property> <property> <name>tez.runtime.io.sort.mb</name> <value>256</value> </property> <property> <name>tez.am.java.opts</name> <value>-server -Xmx512m -Djava.net.preferIPv4Stack=true</value> </property> <property> <name>tez.am.launch.env</name> <value>LD_LIBRARY_PATH=/usr/hdp/current/hadoop/lib/native:/usr/hdp/current/hadoop/lib/native/Linux-amd64-64</value> </property> <property> <name>tez.cluster.additional.classpath.prefix</name> <value>/usr/hdp/current/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure</value> </property> <property> <name>tez.task.launch.cmd-opts</name> <value>-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB</value> </property> <property> <name>tez.task.launch.env</name> <value>LD_LIBRARY_PATH=/usr/hdp/current/hadoop/lib/native:/usr/hdp/current/hadoop/lib/native/Linux-amd64-64</value> </property> <property> <name>tez.am.am-rm.heartbeat.interval-ms.max</name> <value>250</value> </property> <property> <name>tez.am.container.idle.release-timeout-max.millis</name> <value>20000</value> </property> <property> <name>tez.am.container.idle.release-timeout-min.millis</name> <value>10000</value> </property> <property> <name>tez.am.container.reuse.enabled</name> <value>true</value> </property> <property> <name>tez.am.container.reuse.locality.delay-allocation-millis</name> <value>250</value> </property> <property> <name>tez.am.container.reuse.non-local-fallback.enabled</name> <value>false</value> </property> <property> <name>tez.am.container.reuse.rack-fallback.enabled</name> <value>true</value> </property> <property> <name>tez.am.launch.cluster-default.cmd-opts</name> <value>-server -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version}</value> </property> <property> <name>tez.am.launch.cmd-opts</name> <value>-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB</value> </property> <property> <name>tez.am.log.level</name> <value>INFO</value> </property> <property> <name>tez.am.max.app.attempts</name> <value>2</value> </property> <property> <name>tez.am.maxtaskfailures.per.node</name> <value>10</value> </property> <property> <name>tez.am.tez-ui.history-url.template</name> <value>__HISTORY_URL_BASE__?viewPath=%2F%23%2Ftez-app%2F__APPLICATION_ID__</value> </property> <property> <name>tez.am.view-acls</name> <value>*</value> </property> <property> <name>tez.counters.max.groups</name> <value>3000</value> </property> <property> <name>tez.generate.debug.artifacts</name> <value>false</value> </property> <property> <name>tez.grouping.max-size</name> <value>1073741824</value> </property> <property> <name>tez.grouping.min-size</name> <value>16777216</value> </property> <property> <name>tez.grouping.split-waves</name> <value>1.7</value> </property> <property> <name>tez.history.logging.proto-base-dir</name> <value>/warehouse/tablespace/external/hive/sys.db</value> </property> <property> <name>tez.history.logging.service.class</name> <value>org.apache.tez.dag.history.logging.proto.ProtoHistoryLoggingService</value> </property> <property> <name>tez.history.logging.timeline-cache-plugin.old-num-dags-per-group</name> <value>5</value> </property> <property> 
<name>tez.queue.name</name> <value>default</value> </property> <property> <name>tez.runtime.compress</name> <value>true</value> </property> <property> <name>tez.runtime.compress.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>tez.runtime.convert.user-payload.to.history-text</name> <value>false</value> </property> <property> <name>tez.runtime.optimize.local.fetch</name> <value>true</value> </property> <property> <name>tez.runtime.pipelined.sorter.sort.threads</name> <value>2</value> </property> <property> <name>tez.runtime.shuffle.fetch.buffer.percent</name> <value>0.6</value> </property> <property> <name>tez.runtime.shuffle.memory.limit.percent</name> <value>0.25</value> </property> <property> <name>tez.runtime.sorter.class</name> <value>PIPELINED</value> </property> <property> <name>tez.runtime.unordered.output.buffer.size-mb</name> <value>768</value> </property> <property> <name>tez.session.am.dag.submit.timeout.secs</name> <value>600</value> </property> <property> <name>tez.session.client.timeout.secs</name> <value>-1</value> </property> <property> <name>tez.shuffle-vertex-manager.max-src-fraction</name> <value>0.4</value> </property> <property> <name>tez.shuffle-vertex-manager.min-src-fraction</name> <value>0.2</value> </property> <property> <name>tez.staging-dir</name> <value>/tmp/${user.name}/staging</value> </property> <property> <name>tez.task.am.heartbeat.counter.interval-ms.max</name> <value>4000</value> </property> <property> <name>tez.task.generate.counters.per.io</name> <value>true</value> </property> <property> <name>tez.task.get-task.sleep.interval-ms.max</name> <value>200</value> </property> <property> <name>tez.task.launch.cluster-default.cmd-opts</name> <value>-server -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version}</value> </property> <property> <name>tez.task.max-events-per-heartbeat</name> <value>500</value> </property> <property> <name>tez.use.cluster.hadoop-libs</name> <value>false</value> </property> <property> <name>yarn.timeline-service.enabled</name> <value>false</value> </property>
- tez.task.resource.memory.mb=8192. The default is 1024. This sets the memory (in MB) requested by each Tez task container; 8192 is the reference production value, and increasing it appropriately helps performance (a per-session override is shown in the sketch after this list).
- tez.am.resource.memory.mb=5120. The default is 1024. This sets the memory (in MB) used by the ApplicationMaster of a Tez job; 5120 is the reference production value.
- tez.counters.max=10000. An advanced setting; the default is 1200. It limits the number of counters a DAG (ApplicationMaster and tasks) may use; it is set to 10000 here.
- tez.lib.uris=/hdp/apps/3.1.5.0-152/tez/tez.tar.gz. Required; this is an HDFS path. Upload /usr/hdp/current/tez/lib/tez.tar.gz to the HDFS path configured here.
- tez.runtime.io.sort.mb=2703. The size (in MB) of the in-memory sort buffer used by the Tez runtime; 2703 is the reference production value.
- tez.am.java.opts=-server -Xmx8192m -Djava.net.preferIPv4Stack=true. JVM options for the ApplicationMaster; keep -Xmx consistent with tez.am.resource.memory.mb.
- For other configuration items, see the official TezConfiguration.html documentation.
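These memory settings can also be overridden per session for a quick test before changing tez-site.xml cluster-wide. A minimal sketch, assuming the node01 HiveServer2 used in the examples below and that these tez.* properties are not in HiveServer2's restricted list; the values are only illustrative:
# Override the Tez memory settings for this beeline session only and print the effective values
/usr/hdp/current/hive/bin/beeline -u jdbc:hive2://node01:10000/default -n hive \
  --hiveconf tez.task.resource.memory.mb=2048 \
  --hiveconf tez.am.resource.memory.mb=2048 \
  --hiveconf tez.runtime.io.sort.mb=512 \
  -e "set tez.task.resource.memory.mb; set tez.am.resource.memory.mb; set tez.runtime.io.sort.mb;"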
3.14 /etc/hive/conf/hive-env.sh
if [ "$SERVICE" = "metastore" ]; then export HADOOP_HEAPSIZE=12288 # Setting for HiveMetastore export HADOOP_OPTS="$HADOOP_OPTS -Xloggc:/var/log/hive/hivemetastore-gc-%t.log -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hive/hms_heapdump.hprof -Dhive.log.dir=/var/log/hive -Dhive.log.file=hivemetastore.log -Duser.timezone=Asia/Shanghai" fi if [ "$SERVICE" = "hiveserver2" ]; then export HADOOP_HEAPSIZE=12288 # Setting for HiveServer2 and Client export HADOOP_OPTS="$HADOOP_OPTS -Xloggc:/var/log/hive/hiveserver2-gc-%t.log -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hive/hs2_heapdump.hprof -Dhive.log.dir=/var/log/hive -Dhive.log.file=hiveserver2.log -Duser.timezone=Asia/Shanghai" fi export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xmx${HADOOP_HEAPSIZE}m" export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS" HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop} export HIVE_HOME=${HIVE_HOME:-/usr/hdp/current/hive} export HIVE_CONF_DIR=${HIVE_CONF_DIR:-/usr/hdp/current/hive/conf} if [ "${HIVE_AUX_JARS_PATH}" != "" ]; then if [ -f "${HIVE_AUX_JARS_PATH}" ]; then export HIVE_AUX_JARS_PATH=${HIVE_AUX_JARS_PATH} elif [ -d "/usr/hdp/current/hive-hcatalog/share/hcatalog" ]; then export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar fi elif [ -d "/usr/hdp/current/hive-hcatalog/share/hcatalog" ]; then export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar fi export METASTORE_PORT=9083
- HADOOP_HEAPSIZE sets the heap size of the metastore and hiveserver2 processes and can be increased appropriately; for example, it is set to 12288 here (a quick way to verify the options on the running processes is sketched below).
- HADOOP_OPTS adds -Duser.timezone=Asia/Shanghai for both metastore and hiveserver2 to avoid time-zone problems, fixing the JVM time zone to China Standard Time (UTC+8).
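A minimal sketch for confirming that the heap size and time-zone options from hive-env.sh took effect on the running services, using only standard ps/grep tooling:
# Dump the JVM arguments of the running HiveServer2 and Metastore processes
# and check for the heap size and time zone set in hive-env.sh
ps -ef | grep -E 'HiveServer2|HiveMetaStore' | grep -v grep \
  | tr ' ' '\n' | grep -E '^-Xmx|^-Duser\.timezone|^-XX:\+UseG1GC'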
3.15 /etc/hive/conf/hive-exec-log4j2.properties
Refer to /etc/hive/conf/hive-exec-log4j2.properties.template
3.16 /etc/hive/conf/hive-log4j2.properties
Refer to /etc/hive/conf/hive-log4j2.properties.template
3.17 /etc/hive/conf/hive-site.xml
<property> <name>hive.server2.thrift.bind.host</name> <value>node01</value> </property> <property> <name>hive.metastore.uris</name> <value>thrift://node01:9083</value> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/warehouse/tablespace/managed/hive</value> </property> <property> <name>hive.metastore.warehouse.external.dir</name> <value>/warehouse/tablespace/external/hive</value> </property> <property> <name>hive.metastore.db.type</name> <value>mysql</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> </property> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://node01:3306/hive?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=UTF-8&useSSL=false</value> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>hive</value> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>123456</value> </property> <property> <name>hive.server2.thrift.port</name> <value>10000</value> </property> <property> <name>hive.tez.container.size</name> <value>512</value> </property> <property> <name>hive.heapsize</name> <value>512</value> </property> <property> <name>hive.server2.logging.operation.log.location</name> <value>/tmp/hive/operation_logs</value> </property> <property> <name>datanucleus.schema.autoCreateAll</name> <value>true</value> </property> <property> <name>hive.metastore.schema.verification</name> <value>false</value> </property> <property> <name>hive.exec.local.scratchdir</name> <value>/hadoop/hive/exec/${user.name}</value> </property> <property> <name>hive.downloaded.resources.dir</name> <value>/hadoop/hive/${hive.session.id}_resources</value> </property> <property> <name>hive.querylog.location</name> <value>/hadoop/hive/log</value> </property> <property> <name>hive.server2.logging.operation.log.location</name> <value>/hadoop/hive/server2/${user.name}/operation_logs</value> </property> <property> <name>hive.exec.dynamic.partition.mode</name> <value>nonstrict</value> </property> <property> <name>hive.server2.authentication</name> <value>NONE</value> </property> <property> <name>hive.server2.thrift.client.user</name> <value>hive</value> </property> <property> <name>hive.server2.thrift.client.password</name> <value>hive</value> </property> <property> <name>hive.cluster.delegation.token.store.zookeeper.connectString</name> <value>node01:2181</value> </property> <property> <name>hbase.zookeeper.quorum</name> <value>node01</value> </property> <property> <name>hive.cluster.delegation.token.store.class</name> <value>org.apache.hadoop.hive.thrift.ZooKeeperTokenStore</value> </property> <property> <name>hive.cluster.delegation.token.store.zookeeper.znode</name> <value>/hive/cluster/delegation</value> </property> <property> <name>hive.server2.zookeeper.namespace</name> <value>hiveserver2</value> </property> <property> <name>hive.zookeeper.client.port</name> <value>2181</value> </property> <property> <name>hive.zookeeper.namespace</name> <value>hive_zookeeper_namespace</value> </property> <property> <name>hive.zookeeper.quorum</name> <value>node01:2181</value> </property> <property> <name>atlas.hook.hive.maxThreads</name> <value>1</value> </property> <property> <name>atlas.hook.hive.minThreads</name> <value>1</value> </property> <property> <name>datanucleus.autoCreateSchema</name> <value>false</value> </property> <property> <name>datanucleus.cache.level2.type</name> <value>none</value> </property> <property> 
<name>datanucleus.fixedDatastore</name> <value>true</value> </property> <property> <name>hive.auto.convert.join</name> <value>true</value> </property> <property> <name>hive.auto.convert.join.noconditionaltask</name> <value>true</value> </property> <property> <name>hive.auto.convert.join.noconditionaltask.size</name> <value>10737418240</value> </property> <property> <name>hive.auto.convert.sortmerge.join</name> <value>true</value> </property> <property> <name>hive.auto.convert.sortmerge.join.to.mapjoin</name> <value>true</value> </property> <property> <name>hive.cbo.enable</name> <value>true</value> </property> <property> <name>hive.cli.print.header</name> <value>false</value> </property> <property> <name>hive.compactor.abortedtxn.threshold</name> <value>1000</value> </property> <property> <name>hive.compactor.check.interval</name> <value>300</value> </property> <property> <name>hive.compactor.delta.num.threshold</name> <value>10</value> </property> <property> <name>hive.compactor.delta.pct.threshold</name> <value>0.1f</value> </property> <property> <name>hive.compactor.initiator.on</name> <value>true</value> </property> <property> <name>hive.compactor.worker.threads</name> <value>4</value> </property> <property> <name>hive.compactor.worker.timeout</name> <value>86400</value> </property> <property> <name>hive.compute.query.using.stats</name> <value>true</value> </property> <property> <name>hive.convert.join.bucket.mapjoin.tez</name> <value>false</value> </property> <property> <name>hive.create.as.insert.only</name> <value>true</value> </property> <property> <name>hive.default.fileformat</name> <value>TextFile</value> </property> <property> <name>hive.default.fileformat.managed</name> <value>ORC</value> </property> <property> <name>hive.driver.parallel.compilation</name> <value>true</value> </property> <property> <name>hive.enforce.sortmergebucketmapjoin</name> <value>true</value> </property> <property> <name>hive.exec.compress.intermediate</name> <value>false</value> </property> <property> <name>hive.exec.compress.output</name> <value>false</value> </property> <property> <name>hive.exec.dynamic.partition</name> <value>true</value> </property> <property> <name>hive.exec.failure.hooks</name> <value>org.apache.hadoop.hive.ql.hooks.HiveProtoLoggingHook</value> </property> <property> <name>hive.exec.max.created.files</name> <value>100000</value> </property> <property> <name>hive.exec.max.dynamic.partitions</name> <value>5000</value> </property> <property> <name>hive.exec.max.dynamic.partitions.pernode</name> <value>2000</value> </property> <property> <name>hive.exec.orc.split.strategy</name> <value>HYBRID</value> </property> <property> <name>hive.exec.parallel</name> <value>false</value> </property> <property> <name>hive.exec.parallel.thread.number</name> <value>8</value> </property> <property> <name>hive.exec.post.hooks</name> <value>org.apache.hadoop.hive.ql.hooks.HiveProtoLoggingHook</value> </property> <property> <name>hive.exec.pre.hooks</name> <value>org.apache.hadoop.hive.ql.hooks.HiveProtoLoggingHook</value> </property> <property> <name>hive.exec.reducers.bytes.per.reducer</name> <value>4294967296</value> </property> <property> <name>hive.exec.reducers.max</name> <value>1009</value> </property> <property> <name>hive.exec.scratchdir</name> <value>/tmp/hive</value> </property> <property> <name>hive.exec.submit.local.task.via.child</name> <value>true</value> </property> <property> <name>hive.exec.submitviachild</name> <value>false</value> </property> <property> 
<name>hive.execution.mode</name> <value>container</value> </property> <property> <name>hive.fetch.task.aggr</name> <value>false</value> </property> <property> <name>hive.fetch.task.conversion</name> <value>none</value> </property> <property> <name>hive.fetch.task.conversion.threshold</name> <value>1073741824</value> </property> <property> <name>hive.hook.proto.base-directory</name> <value>{hive_metastore_warehouse_external_dir}/sys.db/query_data/</value> </property> <property> <name>hive.limit.optimize.enable</name> <value>true</value> </property> <property> <name>hive.limit.pushdown.memory.usage</name> <value>0.04</value> </property> <property> <name>hive.load.data.owner</name> <value>hive</value> </property> <property> <name>hive.lock.manager</name> <value></value> </property> <property> <name>hive.log.explain.output</name> <value>true</value> </property> <property> <name>hive.map.aggr</name> <value>true</value> </property> <property> <name>hive.map.aggr.hash.force.flush.memory.threshold</name> <value>0.9</value> </property> <property> <name>hive.map.aggr.hash.min.reduction</name> <value>0.5</value> </property> <property> <name>hive.map.aggr.hash.percentmemory</name> <value>0.5</value> </property> <property> <name>hive.mapjoin.bucket.cache.size</name> <value>10000</value> </property> <property> <name>hive.mapjoin.hybridgrace.hashtable</name> <value>false</value> </property> <property> <name>hive.mapjoin.optimized.hashtable</name> <value>true</value> </property> <property> <name>hive.mapred.reduce.tasks.speculative.execution</name> <value>true</value> </property> <property> <name>hive.materializedview.rewriting.incremental</name> <value>false</value> </property> <property> <name>hive.merge.mapfiles</name> <value>true</value> </property> <property> <name>hive.merge.mapredfiles</name> <value>false</value> </property> <property> <name>hive.merge.orcfile.stripe.level</name> <value>true</value> </property> <property> <name>hive.merge.rcfile.block.level</name> <value>true</value> </property> <property> <name>hive.merge.size.per.task</name> <value>256000000</value> </property> <property> <name>hive.merge.smallfiles.avgsize</name> <value>100000000</value> </property> <property> <name>hive.merge.tezfiles</name> <value>false</value> </property> <property> <name>hive.metastore.authorization.storage.checks</name> <value>false</value> </property> <property> <name>hive.metastore.cache.pinobjtypes</name> <value>Table,Database,Type,FieldSchema,Order</value> </property> <property> <name>hive.metastore.client.connect.retry.delay</name> <value>5s</value> </property> <property> <name>hive.metastore.client.socket.timeout</name> <value>1800s</value> </property> <property> <name>hive.metastore.connect.retries</name> <value>24</value> </property> <property> <name>hive.metastore.dml.events</name> <value>true</value> </property> <property> <name>hive.metastore.event.listeners</name> <value></value> </property> <property> <name>hive.metastore.execute.setugi</name> <value>true</value> </property> <property> <name>hive.metastore.failure.retries</name> <value>24</value> </property> <property> <name>hive.metastore.pre.event.listeners</name> <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value> </property> <property> <name>hive.metastore.sasl.enabled</name> <value>false</value> </property> <property> <name>hive.metastore.server.max.threads</name> <value>100000</value> </property> <property> <name>hive.metastore.transactional.event.listeners</name> 
<value>org.apache.hive.hcatalog.listener.DbNotificationListener</value> </property> <property> <name>hive.optimize.bucketmapjoin</name> <value>true</value> </property> <property> <name>hive.optimize.bucketmapjoin.sortedmerge</name> <value>false</value> </property> <property> <name>hive.optimize.constant.propagation</name> <value>true</value> </property> <property> <name>hive.optimize.cp</name> <value>true</value> </property> <property> <name>hive.optimize.dynamic.partition.hashjoin</name> <value>true</value> </property> <property> <name>hive.optimize.index.filter</name> <value>true</value> </property> <property> <name>hive.optimize.metadataonly</name> <value>true</value> </property> <property> <name>hive.optimize.null.scan</name> <value>true</value> </property> <property> <name>hive.optimize.reducededuplication</name> <value>true</value> </property> <property> <name>hive.optimize.reducededuplication.min.reducer</name> <value>4</value> </property> <property> <name>hive.optimize.sort.dynamic.partition</name> <value>false</value> </property> <property> <name>hive.orc.compute.splits.num.threads</name> <value>10</value> </property> <property> <name>hive.orc.splits.include.file.footer</name> <value>false</value> </property> <property> <name>hive.prewarm.enabled</name> <value>false</value> </property> <property> <name>hive.prewarm.numcontainers</name> <value>3</value> </property> <property> <name>hive.repl.cm.enabled</name> <value></value> </property> <property> <name>hive.repl.cmrootdir</name> <value></value> </property> <property> <name>hive.repl.rootdir</name> <value></value> </property> <property> <name>hive.security.authorization.createtable.owner.grants</name> <value>ALL</value> </property> <property> <name>hive.security.metastore.authenticator.manager</name> <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value> </property> <property> <name>hive.security.metastore.authorization.auth.reads</name> <value>false</value> </property> <property> <name>hive.security.metastore.authorization.manager</name> <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value> </property> <property> <name>hive.server2.allow.user.substitution</name> <value>true</value> </property> <property> <name>hive.server2.enable.doAs</name> <value>true</value> </property> <property> <name>hive.server2.idle.operation.timeout</name> <value>6h</value> </property> <property> <name>hive.server2.idle.session.timeout</name> <value>1d</value> </property> <property> <name>hive.server2.logging.operation.enabled</name> <value>true</value> </property> <property> <name>hive.server2.max.start.attempts</name> <value>5</value> </property> <property> <name>hive.server2.support.dynamic.service.discovery</name> <value>true</value> </property> <property> <name>hive.server2.table.type.mapping</name> <value>CLASSIC</value> </property> <property> <name>hive.server2.tez.default.queues</name> <value>default</value> </property> <property> <name>hive.server2.tez.initialize.default.sessions</name> <value>false</value> </property> <property> <name>hive.server2.tez.sessions.per.default.queue</name> <value>1</value> </property> <property> <name>hive.server2.thrift.http.path</name> <value>cliservice</value> </property> <property> <name>hive.server2.thrift.http.port</name> <value>10001</value> </property> <property> <name>hive.server2.thrift.max.worker.threads</name> <value>1200</value> </property> <property> <name>hive.server2.thrift.sasl.qop</name> <value>auth</value> </property> 
<property> <name>hive.server2.transport.mode</name> <value>binary</value> </property> <property> <name>hive.server2.use.SSL</name> <value>false</value> </property> <property> <name>hive.server2.webui.cors.allowed.headers</name> <value>X-Requested-With,Content-Type,Accept,Origin,X-Requested-By,x-requested-by</value> </property> <property> <name>hive.server2.webui.enable.cors</name> <value>true</value> </property> <property> <name>hive.server2.webui.port</name> <value>10002</value> </property> <property> <name>hive.server2.webui.use.ssl</name> <value>false</value> </property> <property> <name>hive.service.metrics.codahale.reporter.classes</name> <value>org.apache.hadoop.hive.common.metrics.metrics2.JsonFileMetricsReporter,org.apache.hadoop.hive.common.metrics.metrics2.JmxMetricsReporter,org.apache.hadoop.hive.common.metrics.metrics2.Metrics2Reporter</value> </property> <property> <name>hive.smbjoin.cache.rows</name> <value>10000</value> </property> <property> <name>hive.stats.autogather</name> <value>true</value> </property> <property> <name>hive.stats.dbclass</name> <value>fs</value> </property> <property> <name>hive.stats.fetch.column.stats</name> <value>true</value> </property> <property> <name>hive.stats.fetch.partition.stats</name> <value>true</value> </property> <property> <name>hive.strict.managed.tables</name> <value>false</value> </property> <property> <name>hive.support.concurrency</name> <value>true</value> </property> <property> <name>hive.tez.auto.reducer.parallelism</name> <value>true</value> </property> <property> <name>hive.tez.bucket.pruning</name> <value>true</value> </property> <property> <name>hive.tez.cartesian-product.enabled</name> <value>true</value> </property> <property> <name>hive.tez.cpu.vcores</name> <value>-1</value> </property> <property> <name>hive.tez.dynamic.partition.pruning</name> <value>true</value> </property> <property> <name>hive.tez.dynamic.partition.pruning.max.data.size</name> <value>104857600</value> </property> <property> <name>hive.tez.dynamic.partition.pruning.max.event.size</name> <value>1048576</value> </property> <property> <name>hive.tez.exec.print.summary</name> <value>true</value> </property> <property> <name>hive.tez.input.format</name> <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value> </property> <property> <name>hive.tez.input.generate.consistent.splits</name> <value>true</value> </property> <property> <name>hive.tez.java.opts</name> <value>-server -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps</value> </property> <property> <name>hive.tez.log.level</name> <value>INFO</value> </property> <property> <name>hive.tez.max.partition.factor</name> <value>2.0</value> </property> <property> <name>hive.tez.min.partition.factor</name> <value>0.25</value> </property> <property> <name>hive.tez.smb.number.waves</name> <value>0.5</value> </property> <property> <name>hive.txn.manager</name> <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value> </property> <property> <name>hive.txn.max.open.batch</name> <value>1000</value> </property> <property> <name>hive.txn.strict.locking.mode</name> <value>false</value> </property> <property> <name>hive.txn.timeout</name> <value>300</value> </property> <property> <name>hive.user.install.directory</name> <value>/user/</value> </property> <property> <name>hive.vectorized.execution.enabled</name> <value>true</value> </property> <property> <name>hive.vectorized.execution.mapjoin.minmax.enabled</name> 
<value>true</value> </property> <property> <name>hive.vectorized.execution.mapjoin.native.enabled</name> <value>true</value> </property> <property> <name>hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled</name> <value>true</value> </property> <property> <name>hive.vectorized.execution.reduce.enabled</name> <value>true</value> </property> <property> <name>hive.vectorized.groupby.checkinterval</name> <value>4096</value> </property> <property> <name>hive.vectorized.groupby.flush.percent</name> <value>0.1</value> </property> <property> <name>hive.vectorized.groupby.maxentries</name> <value>100000</value> </property> <property> <name>mapred.max.split.size</name> <value>256000000</value> </property> <property> <name>mapred.min.split.size.per.node</name> <value>128000000</value> </property> <property> <name>mapred.min.split.size.per.rack</name> <value>128000000</value> </property> <property> <name>metastore.create.as.acid</name> <value>true</value> </property> <property> <name>hive.metastore.kerberos.keytab.file</name> <value>/etc/security/keytabs/hive.service.keytab</value> </property> <property> <name>hive.metastore.kerberos.principal</name> <value>hive/_HOST@EXAMPLE.COM</value> </property> <property> <name>hive.server2.authentication.spnego.keytab</name> <value>HTTP/_HOST@EXAMPLE.COM</value> </property> <property> <name>hive.server2.authentication.spnego.principal</name> <value>/etc/security/keytabs/spnego.service.keytab</value> </property> <!-- <property> <name>hive.kudu.master.addresses.default</name> <value>bdd11:7051,bdd12:7051,bdd13:7051,app1:7051,es2:7051</value> </property> <property> <name>hive.server2.authentication</name> <value>LDAP</value> </property> <property> <name>hive.server2.authentication.ldap.baseDN</name> <value>ou=bigdata,dc=gdh,dc=yore,dc=com</value> </property> <property> <name>hive.server2.authentication.ldap.url</name> <value>ldap://bdm0:389</value> </property> <property> <name>hive.cluster.delegation.token.store.zookeeper.connectString</name> <value>bdm0:2181,bdm1:2181,etl1:2181,es1:2181,es2:2181</value> </property> <property> <name>hbase.zookeeper.quorum</name> <value>bdm0,bdm1,etl1,es1,es2</value> </property> <property> <name>hbase.zookeeper.property.clientPort</name> <value>2181</value> </property> <property> <name>hive.cluster.delegation.token.store.class</name> <value>org.apache.hadoop.hive.thrift.ZooKeeperTokenStore</value> </property> <property> <name>hive.cluster.delegation.token.store.zookeeper.znode</name> <value>/hive/cluster/delegation</value> </property> <property> <name>hive.server2.zookeeper.namespace</name> <value>hiveserver2</value> </property> <property> <name>hive.zookeeper.client.port</name> <value>2181</value> </property> <property> <name>hive.zookeeper.namespace</name> <value>hive_zookeeper_namespace</value> </property> <property> <name>hive.zookeeper.quorum</name> <value>node01:2181</value> </property> <property> <name>zookeeper.znode.parent</name> <value>/hbase-unsecure</value> </property> -->
- hive.metastore.warehouse.dir=/warehouse/tablespace/managed/hive and hive.metastore.warehouse.external.dir=/warehouse/tablespace/external/hive specify the HDFS paths where managed (internal) tables and external tables are stored.
- If Hive metadata is stored in MySQL, configure hive.metastore.db.type=mysql, javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver, javax.jdo.option.ConnectionURL=jdbc:mysql://node01:3306/hive?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=UTF-8&useSSL=false (inside hive-site.xml each & in the URL must be escaped as &amp;), javax.jdo.option.ConnectionUserName=hive, javax.jdo.option.ConnectionPassword=****** and so on. For production it is recommended to keep the metastore database in a relational database outside the cluster; a sketch for preparing the MySQL database and user follows this list.
- hive.tez.container.size should be kept consistent with tez.task.resource.memory.mb in the Tez configuration; the reference production value is 8192.
- hive.heapsize can be increased appropriately; the reference production value is 2048.
- hive.insert.into.multilevel.dirs=true allows multi-level directories to be created when writing data; otherwise the parent directory must already exist.
- hive.exec.stagingdir defaults to .hive-staging, meaning staging files are written under the table's own directory; it can instead be set to an HDFS path outside the table directories, e.g. /tmp/hive/. Some synchronization tools used in production, such as DataX, speed up reads by scanning the files under a table's HDFS path directly; if Hive is writing staging files there at the same time, the synchronized data may end up duplicated or inflated.
- If HA is used or LDAP user authentication is enabled, refer to the configuration items commented out at the end.
- For other configuration items, see the official AdminManual+Configuration documentation.
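Before schematool is run in section 4.3, the metastore database and account referenced above must already exist. A minimal sketch, assuming MySQL 5.7+ on node01 and the hive / 123456 credentials from hive-site.xml (use your own password in practice):
# Create the metastore database and a 'hive' account reachable from the Hive nodes
mysql -h node01 -u root -p <<'SQL'
CREATE DATABASE IF NOT EXISTS hive DEFAULT CHARACTER SET utf8;
CREATE USER IF NOT EXISTS 'hive'@'%' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
FLUSH PRIVILEGES;
SQL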
3.18 distribute the corresponding configuration and packages to other nodes
Send the same configuration to the other nodes and adjust it appropriately (host-specific values such as hostnames and directories).
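One way to do this, sketched with rsync and placeholder hostnames node02/node03 (use scp if rsync is not installed, and include any other conf directories you changed):
# Push the modified configuration directories to the other nodes
for host in node02 node03; do
  rsync -av /etc/hadoop/conf/ "$host":/etc/hadoop/conf/
  rsync -av /etc/hive/conf/   "$host":/etc/hive/conf/
  rsync -av /etc/tez/conf/    "$host":/etc/tez/conf/
done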
3.19 some problems
If the services are started as non-root users, permission problems are common: according to the error in the log, change the owner or group of the affected directory so that the startup user has access. A test environment may also run into resource problems; scale down some of the memory settings above according to the actual machine.
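A typical fix, assuming the hdfs/yarn/hive service users created by the RPMs; substitute the paths that the log actually complains about:
# Example: give the startup users ownership of their log, pid and data directories
chown -R hdfs:hadoop /var/log/hadoop /var/run/hadoop /hadoop
chown -R hive:hadoop /var/log/hive /hadoop/hive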
If spark2_shuffle is added to yarn.nodemanager.aux-services, starting YARN may report the following error:
ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, No such file or directory]
When Hadoop starts, /usr/hdp/3.1.5.0-152/hadoop-hdfs/lib/leveldbjni-all-1.8.jar is loaded,
so the libleveldbjni.so matching the spark-2.3.2.3.1.5.0-152-yarn-shuffle.jar installed by spark2_3_1_5_0_152-yarn-shuffle-2.3.2.3.1.5.0-152.noarch.rpm needs to be available on java.library.path. The current java.library.path of the system can be viewed as follows:
java -XshowSettings:properties -version
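One way to provide the library, sketched under the assumption that the native binary sits at META-INF/native/linux64/libleveldbjni.so inside leveldbjni-all-1.8.jar (check with jar tf first) and that the Hadoop native directory appears in the java.library.path output above:
# Extract the native library from the jar and copy it into a directory on java.library.path
cd /tmp
jar tf /usr/hdp/3.1.5.0-152/hadoop-hdfs/lib/leveldbjni-all-1.8.jar | grep libleveldbjni
jar xf /usr/hdp/3.1.5.0-152/hadoop-hdfs/lib/leveldbjni-all-1.8.jar META-INF/native/linux64/libleveldbjni.so
cp META-INF/native/linux64/libleveldbjni.so /usr/hdp/current/hadoop/lib/native/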
4 start
4.1 Hadoop
/usr/hdp/current/hadoop/bin/hdfs namenode -format chown -R hdfs:hadoop /hadoop mkdir /hadoop/{yarn,mapred,mapreduce} hadoop fs -mkdir /{home,user,tmp} hadoop fs -mkdir -p /hdp/apps/3.1.5.0-152/{mapreduce,tez} hadoop fs -put /usr/hdp/current/hadoop/mapreduce.tar.gz /hdp/apps/3.1.5.0-152/mapreduce/ chmod 755 /usr/hdp/3.1.5.0-152/hadoop-yarn/bin/container-executor usermod -G hadoop hdfs usermod -G hadoop yarn usermod -G hdfs yarn chown root:hadoop /var/lib/{hadoop-hdfs,hadoop-mapreduce,hadoop-yarn} hdfs dfsadmin -safemode get hdfs dfsadmin -safemode leave #su - hdfs /usr/hdp/current/hadoop/bin/hdfs --config /etc/hadoop/conf --daemon start namenode /usr/hdp/current/hadoop/bin/hdfs --config /etc/hadoop/conf --daemon start secondarynamenode /usr/hdp/current/hadoop/bin/hdfs --config /etc/hadoop/conf --daemon start datanode #su - yarn /usr/hdp/current/hadoop/bin/yarn --config /etc/hadoop/conf --daemon start nodemanager /usr/hdp/current/hadoop/bin/yarn --config /etc/hadoop/conf --daemon start resourcemanager
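Note that the hadoop fs and hdfs dfsadmin commands above only succeed once the NameNode and DataNodes are running. After the daemons are up, a quick health check might look like this:
# Confirm the HDFS and YARN daemons are running and registered
jps                    # expect NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager
hdfs dfsadmin -report  # live DataNodes and capacity
yarn node -list        # registered NodeManagers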
4.2 ZooKeeper
mkdir -p /var/lib/zookeeper echo "1" > /var/lib/zookeeper/myid /usr/hdp/current/zookeeper/bin/zkServer.sh start /etc/zookeeper/conf/zoo.cfg
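A quick liveness check (assuming nc is installed; ZooKeeper 3.4 answers the four-letter commands by default):
echo ruok | nc localhost 2181   # expect "imok"
echo srvr | nc localhost 2181   # version, mode and connection statistics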
4.3 Hive
wget https://repo.huaweicloud.com/repository/maven/mysql/mysql-connector-java/5.1.47/mysql-connector-java-5.1.47.jar -P /usr/hdp/current/hive/lib/ /usr/hdp/current/hive/bin/schematool -dbType mysql -initSchema hadoop fs -put /usr/hdp/current/tez/lib/tez.tar.gz /hdp/apps/3.1.5.0-152/tez/ /usr/hdp/current/hive/bin/hive --service metastore >/dev/null 2>&1 & /usr/hdp/current/hive/bin/hive --service hiveserver2 >/dev/null 2>&1 &
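Metastore and HiveServer2 can take a little while to come up. To verify, check that the ports from hive-site.xml are listening and look at the logs configured in hive-env.sh:
# 9083 = metastore, 10000 = HiveServer2 thrift, 10002 = HiveServer2 web UI
ss -lntp | grep -E ':(9083|10000|10002)'
tail -n 50 /var/log/hive/hivemetastore.log /var/log/hive/hiveserver2.log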
5 test
5.1 Hadoop
hadoop fs -mkdir -p /tmp/input hadoop fs -put /usr/hdp/current/hadoop/src/dev-support/README.md /tmp/input hadoop jar /usr/hdp/current/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /tmp/input /tmp/output
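The job output can then be inspected (the output directory must not exist before the job runs; delete it first when re-running):
hadoop fs -ls /tmp/output                       # expect _SUCCESS and part-r-00000
hadoop fs -cat /tmp/output/part-r-00000 | head  # word counts from README.md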
5.2 ZooKeeper
/usr/hdp/current/zookeeper/bin/zkServer.sh status /etc/zookeeper/conf/zoo.cfg
5.3 Hive
/usr/hdp/current/hive/bin/beeline --color=true -u jdbc:hive2://node01:10000/default -n hive
0: jdbc:hive2://node01:10000/default> set hive.execution.engine; +----------------------------+ | set | +----------------------------+ | hive.execution.engine=tez | +----------------------------+ 1 row selected (0.403 seconds) -- Build table CREATE TABLE `visit_t01` ( uid string, visit_date string, visit_count bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; -- Insert test data INSERT INTO visit_t01 VALUES ('u01', '2019/11/21', 5),('u02', '2019/11/23', 6), ('u03', '2019/11/22', 8),('u04', '2019/11/20', 3),('u01', '2019/11/23', 6), ('u01', '2019/12/21', 8),('u02', '2019/11/23', 6),('u01', '2019/12/22', 4); -- Query inserted data 0: jdbc:hive2://node01:10000/default> SELECT * FROM visit_t01 LIMIT 10; +----------------+-----------------------+------------------------+ | visit_t01.uid | visit_t01.visit_date | visit_t01.visit_count | +----------------+-----------------------+------------------------+ | u01 | 2019/11/21 | 5 | | u02 | 2019/11/23 | 6 | | u03 | 2019/11/22 | 8 | | u04 | 2019/11/20 | 3 | | u01 | 2019/11/23 | 6 | | u01 | 2019/12/21 | 8 | | u02 | 2019/11/23 | 6 | | u01 | 2019/12/22 | 4 | +----------------+-----------------------+------------------------+ 8 rows selected -- Count the monthly visits and cumulative visits of each user 0: jdbc:hive2://node01:10000/default> SELECT B.uid,B.visit_date2,B.v_count `Monthly`, . . . . . . . . . . . . . . . . . . > SUM(v_count) OVER(PARTITION BY uid ORDER BY visit_date2) `Cumulative` FROM ( . . . . . . . . . . . . . . . . . . > SELECT uid,visit_date2,SUM(visit_count) AS v_count FROM ( . . . . . . . . . . . . . . . . . . > SELECT uid,date_format(regexp_replace(visit_date, '/','-'),'yyyy-MM') visit_date2,visit_count . . . . . . . . . . . . . . . . . . > FROM visit_t01 . . . . . . . . . . . . . . . . . . > ) A GROUP BY uid,visit_date2 . . . . . . . . . . . . . . . . . . > ) B; ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 20.42 s ---------------------------------------------------------------------------------------------- +--------+----------------+-----+-----+ | b.uid | b.visit_date2 | Monthly | Cumulative | +--------+----------------+-----+-----+ | u01 | 2019-11 | 11 | 11 | | u01 | 2019-12 | 12 | 23 | | u03 | 2019-11 | 8 | 8 | | u02 | 2019-11 | 12 | 12 | | u04 | 2019-11 | 3 | 3 | +--------+----------------+-----+-----+ 5 rows selected (25.988 seconds)