Posted by tomdude48 on Fri, 31 Dec 2021 12:40:01 +0100

Big data engineering practice reference manual

Restart the virtual machine after executing the following commands in sequence

sudo apt-get autoremove open-vm-tools
sudo apt-get install open-vm-tools
sudo apt-get install open-vm-tools-desktop

Passwordless ssh login

First, check whether ssh is installed

sudo ps -e | grep ssh

If two ssh processes are listed (typically ssh-agent and sshd), ssh is installed. Otherwise, run the following command to install it

sudo apt-get install openssh-server

It is recommended to delete the existing ~/.ssh directory first and then configure it from scratch

rm -r  ~/.ssh

Run the following command to generate the public and private keys, pressing Enter at each prompt to accept the defaults

ssh-keygen -t rsa -P ""
# Parameters: -t selects the key algorithm; -P sets the passphrase, and "" means no passphrase is required

Add the public key to the authorized_keys file

cat ~/.ssh/ >> ~/.ssh/authorized_keys

Finally, test the setup by connecting to the local machine over ssh. On the first connection, answer yes at the prompt

ssh localhost

Unable to init server: unable to connect

Run the following command

$ xhost local:gedit

If the following error is reported

xhost: unable to open display ""

set the DISPLAY variable first

$ export DISPLAY=:0

Then run the command again

$ xhost local:gedit

If the following message appears

non-network local connections being added to access control list

the change was successful

Hadoop installation

In addition to the pseudo-distributed configuration from the blog referenced above, also configure the hadoop-env.sh file

 vim /usr/local/hadoop/etc/hadoop/

Add the following line

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162

Error Prevention

HBase installation

Python script

Replace the user name in the file paths (pabu in the original, s0109 below) with your own user name, and create a data folder in advance. Save the script in the data folder. It writes the log to a path such as /home/s0109/data/click.log

import random
import time

url_paths = ["class/112.html",
             # add the remaining URL paths here
             ]

ip_slices = [132, 156, 124, 10, 29, 167, 143, 187, 30, 46,
             55, 63, 72, 87, 98, 168, 192, 134, 111, 54, 64, 110, 43]

http_referers = ["{query}", "{query}",
                 "{query}", "{query}", ]

search_keyword = ["Spark SQL actual combat", "Hadoop Basics", "Storm actual combat",
                  "Spark Streaming actual combat", "10 Hour entry big data", "SpringBoot actual combat", "Linux Advanced ", "Vue.js"]

status_codes = ["200", "404", "500", "403"]

def sample_url():
    return random.sample(url_paths, 1)[0]

def sample_ip():
    slice = random.sample(ip_slices, 4)
    return ".".join([str(item) for item in slice])

def sample_referer():
    if random.uniform(0, 1) > 0.5:
        return "-"
    refer_str = random.sample(http_referers, 1)
    query_str = random.sample(search_keyword, 1)
    return refer_str[0].format(query=query_str[0])

def sample_status_code():
    return random.sample(status_codes, 1)[0]

def generate_log(count=10):
    # Every line written in one run shares the same timestamp
    time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    # "w+" truncates the file on each run; the with-block closes it afterwards
    with open("/home/s0109/data/click.log", "w+") as f:
        while count >= 1:
            query_log = "{ip}\t{local_time}\t\"GET /{url} HTTP/1.1\"\t{status_code}\t{referer}".format(
                url=sample_url(), ip=sample_ip(), referer=sample_referer(),
                status_code=sample_status_code(), local_time=time_str)
            f.write(query_log + "\n")
            count = count - 1

if __name__ == '__main__':
    # Uses the default count of 10 lines per run
    generate_log()
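
To make the log format concrete, here is a minimal sketch that parses one line in the format the generator writes: five tab-separated fields (IP, timestamp, request, status code, referer). The function name parse_click_log and the sample line are illustrative assumptions, not part of the original script.

```python
def parse_click_log(line):
    """Split one tab-separated click-log line into its five fields.

    Returns None when the line does not have exactly five fields.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 5:
        return None
    ip, local_time, request, status_code, referer = parts
    # The request field looks like: "GET /class/112.html HTTP/1.1"
    url = request.strip('"').split(" ")[1]
    return {
        "ip": ip,
        "time": local_time,
        "url": url,
        "status": int(status_code),
        "referer": referer,
    }


# Hypothetical sample line in the generator's format
sample = '132.156.124.10\t2021-12-29 10:00:00\t"GET /class/112.html HTTP/1.1"\t200\t-'
record = parse_click_log(sample)
print(record["url"], record["status"])
```

The streaming job later in this manual relies on the same five-field layout, so a line that fails this check would also be discarded there.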

Setting up a scheduled task on Ubuntu

When you run crontab -e for the first time, you are asked to choose an editor. If you have installed vim, selecting 2 is recommended. Selecting 1 uses the nano editor, where Ctrl+O saves and Ctrl+X exits. To change the choice later, run the select-editor command and choose again

The script path in the scheduled task must also be changed to your own path

Log data collection using Flume and Kafka

In the configuration file, remember to change the location of the log file to your own path

exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel

exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -F /home/s0109/data/click.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c

exec-memory-kafka.channels.memory-channel.type = memory

exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.brokerList = localhost:9092
exec-memory-kafka.sinks.kafka-sink.topic = streamtopic
exec-memory-kafka.sinks.kafka-sink.batchSize = 10
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1

exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel

When you start zookeeper and Kafka, you can add & after the command to make it run in the background

Consumer fails to receive data

Check the Flume configuration file for errors
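
One common mistake is the channel wiring: a Flume source lists its channels under "channels" (plural), while a sink names exactly one under "channel" (singular). The following is a rough sketch of a checker for that wiring. The helper names parse_properties and check_channel_wiring are made up for this example, and the parser only handles simple key = value lines.

```python
def parse_properties(text):
    """Parse simple key = value lines, ignoring blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_channel_wiring(props, agent):
    """Return a list of problems found in the agent's channel wiring."""
    problems = []
    channels = props.get(agent + ".channels", "").split()
    for source in props.get(agent + ".sources", "").split():
        key = "{}.sources.{}.channels".format(agent, source)
        if key not in props:
            problems.append("missing " + key)
        for ch in props.get(key, "").split():
            if ch not in channels:
                problems.append("{} refers to unknown channel {}".format(key, ch))
    for sink in props.get(agent + ".sinks", "").split():
        key = "{}.sinks.{}.channel".format(agent, sink)
        if props.get(key) not in channels:
            problems.append("{} must name one declared channel".format(key))
    return problems


good = """
exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel
exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel
"""
print(check_channel_wiring(parse_properties(good), "exec-memory-kafka"))
```

An empty list means the wiring is consistent. It does not prove that Kafka is reachable or that the log path exists.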

An error occurs while creating the HBase table

Restart HBase

After the restart, open the hbase shell and run a command such as list. If the command returns without hanging, create the table. If it hangs, keep restarting until it responds

Building back-end projects

Note: from this step on, take a virtual machine snapshot before running any code that operates on HBase. Running incorrect code can crash the HBase environment!

Installing IntelliJ IDEA

In short, there are two steps: extract the archive and run the program

sudo tar -zxvf ideaIU-2020.2.3.tar.gz -C /opt

Change to the directory containing the downloaded archive before extracting. /opt can be replaced with any directory you want to extract to, and the archive name with the version you downloaded

After installation, open Plugins, search for Scala, and install the plug-in. The download is slow, so wait patiently

Reference for older versions

Then create the project: select Scala and click Create

Select 2.11.12 and click Download. Wait patiently. The download is extremely slow even though the files are only about 40 MB. If it is too slow, look for an offline installation method

When setting up the Maven environment, the repository directory does not exist yet, so create it yourself


Dependencies

            <name>Maven Aliyun Mirror</name>
            <releases> <enabled>true</enabled> </releases>
            <snapshots> <enabled>false</enabled> </snapshots>
            <version>2.11.8</version> </dependency>
        <dependency> <groupId>org.apache.hadoop</groupId>
        <dependency> <groupId>org.apache.hbase</groupId>
        <dependency> <groupId>org.apache.spark</groupId>

After filling these in, on older versions click Enable Auto Import to update the dependencies


Note: change the ZooKeeper server parameter, replacing s0109 below with your own account name

package com.spark.streaming.project.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HTable;

public class HBaseUtils {
    private Configuration configuration = null;
    private Connection connection = null;
    private static HBaseUtils instance = null;

    private HBaseUtils(){
        try {
            configuration = new Configuration();
            // Specify the ZooKeeper server to access
            configuration.set("hbase.zookeeper.quorum", "s0109:2181");
            // Get the HBase connection
            connection = ConnectionFactory.createConnection(configuration);
        }catch(Exception e){
            e.printStackTrace();
        }
    }

    /**
     * Get the HBaseUtils singleton instance
     */
    public static synchronized HBaseUtils getInstance(){
        if(instance == null){
            instance = new HBaseUtils();
        }
        return instance;
    }

    /**
     * Get a table instance from the table name
     * @param tableName name of the HBase table
     * @return the table, or null if it could not be opened
     */
    public HTable getTable(String tableName) {
        HTable hTable = null;
        try {
            hTable = (HTable)connection.getTable(TableName.valueOf(tableName));
        }catch (Exception e){
            e.printStackTrace();
        }
        return hTable;
    }
}


package com.spark.streaming.project.utils

import org.apache.commons.lang3.time.FastDateFormat

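/**
 * Date formatting utility
 */
object DateUtils {
  // Input date format (HH is the 24-hour clock, matching the timestamps in the log)
  val YYYYMMDDHMMSS_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
  // Output format
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMddHHmmss")

  // Parse a time String in the input format and return it as epoch milliseconds (Long)
  def getTime(time: String) = {
    YYYYMMDDHMMSS_FORMAT.parse(time).getTime
  }

  // Convert a time String from the input format to the output format
  def parseToMinute(time: String) = {
    // Call getTime
    TARGET_FORMAT.format(getTime(time))
  }
}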
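
As a quick way to check the conversion outside the JVM, the sketch below does the same reformatting with Python's standard library. It assumes the 24-hour timestamps written by the log generator. The function name parse_to_minute simply mirrors the Scala method and is not part of the project.

```python
from datetime import datetime


def parse_to_minute(time_str):
    """Convert 'yyyy-MM-dd HH:mm:ss' into the compact 'yyyyMMddHHmmss' form."""
    parsed = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")


stamp = parse_to_minute("2021-12-29 13:05:09")
print(stamp)      # 20211229130509
print(stamp[:8])  # 20211229, the date prefix used in the HBase row keys
```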


package com.spark.streaming.project.dao

import com.spark.streaming.project.domain.CourseClickCount
import com.spark.streaming.project.utils.HBaseUtils

import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable.ListBuffer

object CourseClickCountDao {
  val tableName = "ns1:courses_clickcount" //Table name
  val cf = "info" //Column family
  val qualifer = "click_count" //column

  /**
   * Save data to HBase
   * @param list (day_course: String, click_count: Int) // total hits of each course on a given day
   */
  def save(list: ListBuffer[CourseClickCount]): Unit = {
    // Call the method of HBaseUtils to obtain the HBase table instance
    val table = HBaseUtils.getInstance().getTable(tableName)
    for (item <- list) {
      // Call HBase's increment method; day_course is the row key
      table.incrementColumnValue(Bytes.toBytes(item.day_course),
        Bytes.toBytes(cf), Bytes.toBytes(qualifer),
        item.click_count) // the amount must be a Long; an Int is converted automatically
    }
  }
}
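
The point of using an increment instead of a plain write is that counts from successive batches accumulate in the same cell rather than overwriting each other. The toy sketch below illustrates that behaviour with an in-memory dictionary. InMemoryCounterTable is a hypothetical stand-in written for this example, not the HBase client.

```python
from collections import defaultdict


class InMemoryCounterTable:
    """Toy stand-in for an HBase table that only supports counter increments."""

    def __init__(self):
        self._cells = defaultdict(int)

    def increment_column_value(self, row, family, qualifier, amount):
        # Add the amount to the existing cell value and return the new total
        key = (row, family, qualifier)
        self._cells[key] += amount
        return self._cells[key]


table = InMemoryCounterTable()
# Two batches report hits for the same day and course
table.increment_column_value("20211229_112", "info", "click_count", 3)
total = table.increment_column_value("20211229_112", "info", "click_count", 2)
print(total)  # 5
```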


package com.spark.streaming.project.dao

import com.spark.streaming.project.domain.CourseSearchClickCount
import com.spark.streaming.project.utils.HBaseUtils
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable.ListBuffer

object CourseSearchClickCountDao {
  val tableName = "ns1:courses_search_clickcount"
  val cf = "info"
  val qualifer = "click_count"

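  /**
   * Save data to HBase
   * @param list (day_search: String, click_count: Int) // total hits from each search engine on a given day
   */
  def save(list: ListBuffer[CourseSearchClickCount]): Unit = {
    val table = HBaseUtils.getInstance().getTable(tableName)
    for (item <- list) {
      // Field names day_search and click_count are assumed; the domain class is not shown here
      table.incrementColumnValue(Bytes.toBytes(item.day_search),
        Bytes.toBytes(cf), Bytes.toBytes(qualifer),
        item.click_count) // the amount must be a Long; an Int is converted automatically
    }
  }
}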


package com.spark.streaming.project.application

import com.spark.streaming.project.domain.ClickLog
import com.spark.streaming.project.domain.CourseClickCount
import com.spark.streaming.project.domain.CourseSearchClickCount
import com.spark.streaming.project.utils.DateUtils
import com.spark.streaming.project.dao.CourseClickCountDao
import com.spark.streaming.project.dao.CourseSearchClickCountDao
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.ListBuffer

object CountByStreaming {
  def main(args: Array[String]): Unit = {
    /**
     * The program is eventually packaged and run on the cluster.
     * It takes four arguments: the ZooKeeper server address, the Kafka consumer group,
     * the topics, and the number of threads.
     */
    if (args.length != 4) {
      System.err.println("Error: you need to input: <zookeeper> <group> <topics> <threadNum>")
      System.exit(1)
    }
    // Receive the arguments passed to the main function
    val Array(zkAddress, group, topics, threadNum) = args

    /**
     * For a local run, AppName, Master, and other attributes must also be set on the
     * Spark configuration; remove them before packaging for the cluster.
     */
    val sparkConf = new SparkConf()

    // Create the streaming context with a 60-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    // Use Kafka as the data source
    val topicsMap = topics.split(",").map((_, threadNum.toInt)).toMap
    // Create a Kafka discrete stream that consumes data from the Kafka cluster every 60 seconds
    val kafkaInputDS = KafkaUtils.createStream(ssc, zkAddress, group, topicsMap)
    // Get the original log data (the value of each Kafka message)
    val logResourcesDS = kafkaInputDS.map(_._2)
    /**
     * (1) Clean the data and wrap it in ClickLog
     * (2) Filter out invalid data
     */
    val cleanDataRDD = logResourcesDS.map(line => {
      val splits = line.split("\t")
      if (splits.length != 5) {
        // Invalid data gets a default error value, which the filter below removes
        ClickLog("", "", 0, 0, "")
      }
      else {
        val ip = splits(0) // IP address of the user
        val time = DateUtils.parseToMinute(splits(1)) // Access time, reformatted by DateUtils
        val status = splits(3).toInt // Access status code
        val referer = splits(4)
        val url = splits(2).split(" ")(1) // Requested URL
        var courseId = 0
        if (url.startsWith("/class")) {
          val courseIdHtml = url.split("/")(2)
          courseId = courseIdHtml.substring(0, courseIdHtml.lastIndexOf(".")).toInt
        }
        ClickLog(ip, time, courseId, status, referer) // Wrap the cleaned log in ClickLog
      }
    }).filter(x => x.courseId != 0) // Filter out records that are not practical courses

    /**
     * (1) Compute the statistics
     * (2) Write the results into HBase
     */
    cleanDataRDD.map(line => {
      // This defines the RowKey of the HBase table "ns1:courses_clickcount":
      // 'date_courseId' means the number of visits to a course on a given day
      (line.time.substring(0, 8) + "_" + line.courseId, 1) // Map to a tuple
    }).reduceByKey(_ + _) // Aggregate
      .foreachRDD(rdd => { // A DStream contains multiple RDDs
        rdd.foreachPartition(partition => { // An RDD contains multiple partitions
          val list = new ListBuffer[CourseClickCount]
          partition.foreach(item => { // A partition contains multiple records
            list.append(CourseClickCount(item._1, item._2))
          })
          CourseClickCountDao.save(list) // Save to HBase
        })
      })

    /**
     * Count the total hits of practical courses coming from each search engine so far
     * (1) Compute the statistics
     * (2) Write the results into HBase
     */
    cleanDataRDD.map(line => {
      val referer = line.referer
      val time = line.time.substring(0, 8)
      var url = ""
      if (referer == "-") { // No referer: leave url empty so the filter removes it
        (url, time)
      }
      else {
        // Take out the name of the search engine
        url = referer.replaceAll("//", "/").split("/")(1)
        (url, time)
      }
    }).filter(x => x._1 != "").map(line => {
      // This defines the RowKey of the HBase table "ns1:courses_search_clickcount":
      // 'date_searchEngine' means the number of visits through a search engine on a given day
      (line._2 + "_" + line._1, 1) // Map to a tuple
    }).reduceByKey(_ + _) // Aggregate
      .foreachRDD(rdd => {
        rdd.foreachPartition(partition => {
          val list = new ListBuffer[CourseSearchClickCount]
          partition.foreach(item => {
            list.append(CourseSearchClickCount(item._1, item._2))
          })
          CourseSearchClickCountDao.save(list) // Save to HBase
        })
      })

    ssc.start()
    ssc.awaitTermination()
  }
}
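
The two aggregations in the streaming job boil down to building a row key per record and counting occurrences. The sketch below reproduces that logic on a small in-memory list, without Spark. The record tuples, the search.example.com referer, and the helper names are assumptions made for illustration.

```python
from collections import Counter


def search_engine_of(referer):
    """Extract the host part of a referer URL, or '' for the '-' placeholder."""
    if referer == "-":
        return ""
    # Same trick as the Scala code: collapse '//' so the host is the second piece
    return referer.replace("//", "/").split("/")[1]


def count_by_row_key(records):
    """records: iterable of (time, course_id, referer), time as yyyyMMddHHmmss."""
    course_counts = Counter()
    search_counts = Counter()
    for time_str, course_id, referer in records:
        day = time_str[:8]
        course_counts["{}_{}".format(day, course_id)] += 1
        engine = search_engine_of(referer)
        if engine:
            search_counts["{}_{}".format(day, engine)] += 1
    return course_counts, search_counts


records = [
    ("20211229100000", 112, "-"),
    ("20211229100500", 112, "http://search.example.com/s?wd=Spark"),
    ("20211229101000", 128, "http://search.example.com/s?wd=Flink"),
]
course_counts, search_counts = count_by_row_key(records)
print(dict(course_counts))
print(dict(search_counts))
```

In the real job each batch's counts are then added to HBase with an increment, so the totals grow across batches.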


Run configuration

First click Run. In the dialog that pops up, select CountByStreaming (the entry without the $), then follow the environment configuration in the PDF and fill in the parameters under Program arguments

Building the front-end project

Configuration

        </dependency> <dependency> 
            <version>2.4</version> </dependency> 
        <dependency> <groupId></groupId> 
        </dependency> <dependency> 
        <version>4.11</version> </dependency> 

Unable to connect to MySQL

First, make sure that the database spark exists in your MySQL instance


Different versions may report similar errors

Reason: mysql was not started on port 3306
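
Before editing any configuration, it can help to confirm that nothing is actually listening on the port. The following sketch tries a plain TCP connection using only the standard library. The helper name is_port_open is an assumption for this example, and a successful connection only shows that some process accepts connections on that port, not that it is MySQL.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Something is listening on 3306:", is_port_open("127.0.0.1", 3306))
```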

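sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf

Comment out the skip-grant-tables line, then restart MySQL so the change takes effect. In MySQL 8.0, skip-grant-tables also turns on skip-networking, which is why nothing listens on port 3306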


Incompatible JDBC driver version

If you installed MySQL with the command in the PDF without specifying a version, download the latest JDBC driver from the official website. For reference, see the post "Download and use of the MySQL JDBC jar package (Windows)" by "desolate and warm" on Blog Garden



Note: change the date in the testHbase() code to the date on which you ran the back-end code. For example, change 20211001 to 20211229. Before running, take a snapshot of the virtual machine and check that the ZooKeeper server address in the HBaseUtils file is correct

import com.test.utils.HBaseUtils;
import com.test.utils.JdbcUtils;
import org.junit.Test;
import java.sql.*;
import java.util.Map;

public class testSQL {
    @Test
    public void testjdbc() throws ClassNotFoundException {
        String url = "jdbc:mysql://localhost:3306/spark";
        String username = "root";
        String password = "root";
        try {
            Connection conn = DriverManager.getConnection(url, username, password);
            Statement stmt = conn.createStatement();
            ResultSet res = stmt.executeQuery("select * from course");
            while (res.next()) {
                System.out.println(res.getString(1)+" "+res.getString(2));
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    @Test
    public void testJdbcUtils() throws ClassNotFoundException {
        // test body omitted
    }

    @Test
    public void testHbase() {
        Map<String,Long> clickCount = HBaseUtils.getInstance().getClickCount("ns1:courses_clickcount", "20211001");
        for (String x : clickCount.keySet()) {
            System.out.println(x + " " + clickCount.get(x));
        }
    }
}


package com.test.utils;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
public class JdbcUtils {
    private Connection connection = null;
    private static JdbcUtils jdbcUtils = null;
    Statement stmt = null;

    private JdbcUtils() throws ClassNotFoundException {
        String url = "jdbc:mysql://localhost:3306/spark?useSSL=false";
        String username = "root";
        String password = "root";
        try {
            connection = DriverManager.getConnection(url, username, password);
            stmt = connection.createStatement();
        }catch (Exception e){
            e.printStackTrace();
        }
    }

    /**
     * Get the JdbcUtils instance
     * @return the singleton instance
     */
    public static synchronized JdbcUtils getInstance() throws
            ClassNotFoundException {
        if(jdbcUtils == null){
            jdbcUtils = new JdbcUtils();
        }
        return jdbcUtils;
    }

    /**
     * Get the course name according to the course id
     */
    public String getCourseName(String id){
        try {
            ResultSet res = stmt.executeQuery("select * from course where id =\'" + id + "\'");
            while (res.next()) {
                return res.getString(2);
            }
        }catch (Exception e){
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Query statistics results by date
     */
    public Map<String,Long> getClickCount(String tableName, String date){
        Map<String,Long> map = new HashMap<String, Long>();
        try {
            // query logic omitted
        }catch (Exception e){
            e.printStackTrace();
            return null;
        }
        return map;
    }
}


Note: change localhost in the ZooKeeper server address to your user name

package com.test.utils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;
import java.util.HashMap;
import java.util.Map;
public class HBaseUtils {
    private Configuration configuration = null;
    private Connection connection = null;
    private static HBaseUtils hBaseUtil = null;

    private HBaseUtils(){
        try {
            configuration = new Configuration();
            // The address of the ZooKeeper server
            configuration.set("hbase.zookeeper.quorum", "localhost:2181");
            connection = ConnectionFactory.createConnection(configuration);
        }catch (Exception e){
            e.printStackTrace();
        }
    }

    /**
     * Get the HBaseUtils instance
     * @return the singleton instance
     */
    public static synchronized HBaseUtils getInstance(){
        if(hBaseUtil == null){
            hBaseUtil = new HBaseUtils();
        }
        return hBaseUtil;
    }

    /**
     * Get a table object from the table name
     */
    public HTable getTable(String tableName){
        try {
            HTable table = null;
            table = (HTable)connection.getTable(TableName.valueOf(tableName));
            return table;
        }catch (Exception e){
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Query statistics results by date
     */
    public Map<String,Long> getClickCount(String tableName, String date){
        Map<String,Long> map = new HashMap<String, Long>();
        try {
            // Get the table instance
            HTable table = getInstance().getTable(tableName);
            // Column family and column
            String cf = "info";
            String qualifier = "click_count";
            // Prefix filter so that only rows of the given date are scanned
            Filter filter = new PrefixFilter(Bytes.toBytes(date));
            // Define the scanner and attach the filter
            Scan scan = new Scan();
            scan.setFilter(filter);
            ResultScanner results = table.getScanner(scan);
            for(Result result:results){
                // Take out the row key
                String rowKey = Bytes.toString(result.getRow());
                // Take out the hit count (counters are stored as 8-byte longs)
                Long clickCount = Bytes.toLong(result.getValue(Bytes.toBytes(cf), Bytes.toBytes(qualifier)));
                map.put(rowKey, clickCount);
            }
        }catch (Exception e){
            e.printStackTrace();
            return null;
        }
        return map;
    }
}
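
Because the row keys start with the date (for example 20211229_112), a prefix filter on the date returns exactly that day's rows. The sketch below mimics this selection on a plain dictionary. The function scan_with_prefix and the sample counts are invented for illustration, and nothing here talks to HBase.

```python
def scan_with_prefix(rows, prefix):
    """Return the {row_key: value} entries whose key starts with prefix, in key order."""
    return {key: rows[key] for key in sorted(rows) if key.startswith(prefix)}


# Hypothetical table contents: row key -> click count
rows = {
    "20211228_112": 4,
    "20211229_112": 7,
    "20211229_128": 2,
}
print(scan_with_prefix(rows, "20211229"))
# {'20211229_112': 7, '20211229_128': 2}
```

This is also why the date passed to getClickCount has to match the day on which the back-end job wrote its data. With a different date the prefix matches no rows and the result is empty.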

MySQL data

Log in to MySQL from the shell and execute

use spark;
CREATE TABLE `course`  (
  `id` int NOT NULL,
  `course` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci ROW_FORMAT = Dynamic;
INSERT INTO `course` VALUES (112, 'Spark');
INSERT INTO `course` VALUES (127, 'HBase');
INSERT INTO `course` VALUES (128, 'Flink');
INSERT INTO `course` VALUES (130, 'Hadoop');
INSERT INTO `course` VALUES (145, 'Linux');
INSERT INTO `course` VALUES (146, 'Python');
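
To try the lookup logic without a running MySQL server, the sketch below loads the same six rows into an in-memory SQLite database and looks up a course by id with a bound parameter. SQLite is only a stand-in here, and the column types and the helper name get_course_name are assumptions for this example. Binding the id as a parameter avoids building the SQL string by concatenation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (id INTEGER NOT NULL, course TEXT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [(112, "Spark"), (127, "HBase"), (128, "Flink"),
     (130, "Hadoop"), (145, "Linux"), (146, "Python")],
)


def get_course_name(connection, course_id):
    """Look up a course name by id using a bound parameter."""
    row = connection.execute(
        "SELECT course FROM course WHERE id = ?", (course_id,)
    ).fetchone()
    return row[0] if row else None


print(get_course_name(conn, 128))  # Flink
```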

