Big data engineering practice reference manual

Posted by tomdude48 on Fri, 31 Dec 2021 12:40:01 +0100

VMware Tools is installed but copy/paste and drag-and-drop still do not work

Reference: Solving the problem that an Ubuntu virtual machine running under VMware cannot copy and paste with the host, Leled blog, CSDN

Restart the virtual machine after executing the following commands in sequence

sudo apt-get autoremove open-vm-tools
sudo apt-get install open-vm-tools
sudo apt-get install open-vm-tools-desktop

SSH passwordless login

Reference: How to configure SSH passwordless login on Ubuntu, cnblogs.com

First, check whether ssh is installed

sudo ps -e |grep ssh

If two ssh processes are listed (ssh-agent and sshd), SSH is already installed. Otherwise, run the following command to install the SSH server:

sudo apt-get install openssh-server

It is recommended to delete any existing ~/.ssh directory first and reconfigure it:

rm -r  ~/.ssh

Run the following command to generate the public and private keys, pressing Enter at the prompts:

ssh-keygen -t rsa -P ""
# Parameter description: -t selects the key algorithm, -P sets the passphrase; "" means no passphrase is required

Append the public key to the authorized_keys file:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
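
If SSH still prompts for a password afterwards, overly permissive permissions on ~/.ssh are a common cause; tightening them (a standard fix, not part of the original guide) looks like this:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys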

Finally, test by connecting to the local machine over SSH; type yes when prompted on the first connection:

ssh localhost    # or: ssh 127.0.0.1

Unable to init server: unable to connect

Reference: [Error handling] Unable to init server: unable to connect: refused to connect, gloriiaaa blog, CSDN

Use the following command:

$ xhost local:gedit

If the following error is reported

xhost: unable to open display ""

use the following command first:

$ export DISPLAY=:0

Then enter again

$ xhost local:gedit

If the following appears

non-network local connections being added to access control list

This indicates that the modification was successful

Hadoop installation

Reference: Hadoop 3.1.3 installation tutorial (standalone / pseudo-distributed configuration, Hadoop 3.1.3 / Ubuntu 18.04 (16.04)), Xiamen University Database Lab blog (xmu.edu.cn). Creating a dedicated hadoop user is optional.

In addition to the pseudo-distributed configuration described in the blog above, edit the hadoop-env.sh file:

 vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

and add:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162

This prevents JAVA_HOME-related errors when starting Hadoop.
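
After this, the pseudo-distributed setup from the referenced tutorial is typically started and checked as follows (assuming Hadoop is installed under /usr/local/hadoop as in that tutorial):

cd /usr/local/hadoop
./bin/hdfs namenode -format   # format HDFS, only needed the first time
./sbin/start-dfs.sh           # start NameNode, DataNode and SecondaryNameNode
jps                           # the three processes above should appear in the list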

HBase installation

Reference: HBase 2.2.2 installation and programming practice guide, Xiamen University Database Lab blog (xmu.edu.cn)

Python script

Change the username in the paths below (s0109) to your own, and create the data folder in advance; the script and the generated log live in that folder, e.g. /home/s0109/data/click.log.

import random
import time

url_paths = ["class/112.html",
             "class/128.html",
             "class/145.html",
             "class/146.html",
             "class/500.html",
             "class/250.html",
             "class/131.html",
             "class/130.html",
             "class/271.html",
             "class/127.html",
             "learn/821",
             "learn/823",
             "learn/987",
             "learn/500",
             "course/list"]

ip_slices = [132, 156, 124, 10, 29, 167, 143, 187, 30, 46,
             55, 63, 72, 87, 98, 168, 192, 134, 111, 54, 64, 110, 43]

http_referers = ["http://www.baidu.com/s?wd={query}", "https://www.sogou.com/web?query={query}",
                 "http://cn.bing.com/search?q={query}", "https://search.yahoo.com/search?p={query}", ]

search_keyword = ["Spark SQL in Action", "Hadoop Basics", "Storm in Action",
                  "Spark Streaming in Action", "10 Hours to Big Data", "SpringBoot in Action", "Linux Advanced", "Vue.js"]

status_codes = ["200", "404", "500", "403"]


def sample_url():
    return random.sample(url_paths, 1)[0]


def sample_ip():
    slice = random.sample(ip_slices, 4)
    return ".".join([str(item) for item in slice])


def sample_referer():
    if random.uniform(0, 1) > 0.5:
        return "-"
    refer_str = random.sample(http_referers, 1)
    query_str = random.sample(search_keyword, 1)
    return refer_str[0].format(query=query_str[0])


def sample_status_code():
    return random.sample(status_codes, 1)[0]


def generate_log(count=10):
    time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    # overwrite the log file on each run; change the path to match your username
    with open("/home/s0109/data/click.log", "w+") as f:
        while count >= 1:
            query_log = "{ip}\t{local_time}\t\"GET /{url} HTTP/1.1\"\t{status_code}\t{referer}".format(
                url=sample_url(), ip=sample_ip(), referer=sample_referer(),
                status_code=sample_status_code(), local_time=time_str)
            f.write(query_log + "\n")
            count = count - 1


if __name__ == '__main__':
    generate_log(100)
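
To try the script out, run it once by hand; the file name here is only an example, use whatever name you saved it under:

python3 /home/s0109/data/generate_click_log.py
tail -5 /home/s0109/data/click.log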

Set up a scheduled task (crontab) on Ubuntu

Run crontab -e to create a new task. When asked to choose an editor, select 2 if you have vim installed; selecting 1 uses the nano editor (Ctrl+O to save, Ctrl+X to exit). If you want to change the editor afterwards, run the select-editor command and choose again.

The path in the crontab entry also needs to be changed to your own, as in the sketch below.
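
A minimal sketch of a crontab entry that regenerates the click log every minute (the script name is the same assumed one as above; adjust the path and schedule to your setup):

* * * * * python3 /home/s0109/data/generate_click_log.py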

Log data collection using Flume and Kafka

In the configuration file below, remember to change the log file location to match your own setup.

exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel

exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -F /home/s0109/data/click.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c

exec-memory-kafka.channels.memory-channel.type = memory

exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.brokerList = localhost:9092
exec-memory-kafka.sinks.kafka-sink.topic = streamtopic
exec-memory-kafka.sinks.kafka-sink.batchSize = 10
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1

exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel

When starting ZooKeeper and Kafka, you can append & to the commands so that they run in the background.
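
A hedged sketch of the startup sequence (the installation paths, the Kafka config file names and the Flume agent file name are assumptions; the agent name must match the exec-memory-kafka prefix used in the configuration above):

cd /usr/local/kafka
./bin/zookeeper-server-start.sh config/zookeeper.properties &
./bin/kafka-server-start.sh config/server.properties &
# start the Flume agent with the configuration above
flume-ng agent --name exec-memory-kafka \
  --conf /usr/local/flume/conf \
  --conf-file /usr/local/flume/conf/exec-memory-kafka.conf \
  -Dflume.root.logger=INFO,console
# verify that log lines arrive (older Kafka versions use --zookeeper localhost:2181 instead of --bootstrap-server)
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic streamtopic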

consumer failed to receive data

Check the Flume configuration file for errors

An error occurred while creating the Hbase table

Restart hbase

stop-hbase.sh
start-hbase.sh

After restarting, open the hbase shell and run the list command as a check. If it hangs, tables still cannot be created; keep restarting HBase until list responds.
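
Once the shell responds, the namespace and tables used by the code later in this manual can be created as follows (the table and column-family names are taken from the DAO classes below):

hbase shell
create_namespace 'ns1'
create 'ns1:courses_clickcount', 'info'
create 'ns1:courses_search_clickcount', 'info'
list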

Building back-end projects

Note: from this step on, take a virtual machine snapshot before running any of the code that operates on HBase. Running faulty code can crash the HBase environment!!!

Installing IntelliJ IDEA

Reference: Ubuntu 20.04 install IDEA 2020.2 IDE detailed tutorial, liutao43 blog, CSDN. Modify the commands to match the version you downloaded.

In short, there are two steps: extract and run.

sudo tar -zxvf ideaIU-2020.2.3.tar.gz -C /opt  # run this from the directory where the archive was downloaded; /opt can be any extraction location, and the archive name should match the version you downloaded
/opt/ideaIU-2020.2.3/bin/idea.sh 

After installation, open Plugins and search for the Scala plugin to install it. The download is slow; wait patiently.

The reference of the old version

Then Create the project, select Scala and click Create

Select Scala 2.11.12 and download it. Wait patiently; the download is extremely slow even though it is only about 40 MB. If it is too slow, look for an offline installation method yourself.

When setting up the Maven environment, if the local repository directory does not exist, create it yourself.

Code

Dependencies (pom.xml)

    <repositories>
        <repository>
            <id>alimaven</id>
            <name>Maven Aliyun Mirror</name>
            <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
            <releases><enabled>true</enabled></releases>
            <snapshots><enabled>false</enabled></snapshots>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.5.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.2.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.module</groupId>
            <artifactId>jackson-module-scala_2.11</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>net.jpountz.lz4</groupId>
            <artifactId>lz4</artifactId>
            <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.6</version>
        </dependency>
    </dependencies>

After filling these in, click Enable Auto Import (shown by older IDEA versions) so that Maven updates the dependencies.

HBaseUtils

Note: modify the ZooKeeper quorum parameter below, replacing s0109 with your own hostname/account.

package com.spark.streaming.project.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HTable;

public class HBaseUtils {
    private Configuration configuration = null;
    private Connection connection = null;
    private static HBaseUtils instance = null;

    private HBaseUtils(){
        try {
            configuration = new Configuration();
            //Specify the zk server to access
            configuration.set("hbase.zookeeper.quorum", "s0109:2181");
            // Get Hbase connection
            connection = ConnectionFactory.createConnection(configuration);
        }catch(Exception e){
            e.printStackTrace();
        }
    }
    /**
     * Get HBase connection instance
     */

    public static synchronized HBaseUtils getInstance(){
        if(instance == null){
            instance = new HBaseUtils();
        }
        return instance;
    }

    /**
     *Get an instance of a table from the table name
     * @param tableName
     * @return
     */
    public HTable getTable(String tableName) {
        HTable hTable = null; 
        try {
            hTable = (HTable)connection.getTable(TableName.valueOf(tableName));
        }catch (Exception e){
            e.printStackTrace();
        }
        return hTable;
    }
}

DateUtils

package com.spark.streaming.project.utils

import org.apache.commons.lang3.time.FastDateFormat

/**
 * Format date tool class
 */
object DateUtils {
  //Specifies the date format to enter
    val YYYYMMDDHMMSS_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd hh:mm:ss");
  //Specify output format
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMddhhmmss")

  // Enter String to return the result converted to log in this format
  def getTime(time: String) = {
    YYYYMMDDHMMSS_FORMAT.parse(time).getTime
  }

  def parseToMinute(time: String) = {
    //Call getTime
    TARGET_FORMAT.format(getTime(time))
  }
}
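
The DAO and streaming code below reference three case classes in com.spark.streaming.project.domain that are not listed in this manual. A minimal sketch, with field names and types inferred from how they are used (treat it as an assumption, not the original source):

package com.spark.streaming.project.domain

//One cleaned log record: ip, formatted time, course id, HTTP status code, referer
case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)

//RowKey "date_courseId" and its accumulated click count
case class CourseClickCount(day_course: String, click_count: Long)

//RowKey "date_searchEngine" and its accumulated click count (field name spelled as used in the DAO)
case class CourseSearchClickCount(day_serach_course: String, click_count: Long)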

CourseClickCountDao

package com.spark.streaming.project.dao

import com.spark.streaming.project.domain.CourseClickCount
import com.spark.streaming.project.utils.HBaseUtils

import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable.ListBuffer

object CourseClickCountDao {
  val tableName = "ns1:courses_clickcount" //Table name
  val cf = "info" //Column family
  val qualifer = "click_count" //column

  /**
   * Save data to Hbase
   * @param list (day_course:String,click_count:Int) //Count the total hits of each course on the same day
   */
  def save(list: ListBuffer[CourseClickCount]): Unit = {
    //Call the method of HBaseUtils to obtain the HBase table instance
    val table = HBaseUtils.getInstance().getTable(tableName)
    for (item <- list) {
      //Call a self increasing method of Hbase
      table.incrementColumnValue(Bytes.toBytes(item.day_course),
        Bytes.toBytes(cf), Bytes.toBytes(qualifer),
        item.click_count) //If the value is Long, it will be automatically converted
    }
  }
}

CourseSearchClickCountDao

package com.spark.streaming.project.dao

import com.spark.streaming.project.domain.CourseSearchClickCount
import com.spark.streaming.project.utils.HBaseUtils
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable.ListBuffer

object CourseSearchClickCountDao {
  val tableName = "ns1:courses_search_clickcount"
  val cf = "info"
  val qualifer = "click_count"

  /**
   * Save data to Hbase
   * @param list (day_course:String,click_count:Int) //Count the total hits of each course on the same day 
   */
  def save(list: ListBuffer[CourseSearchClickCount]): Unit = {
    val table = HBaseUtils.getInstance().getTable(tableName)
    for (item <- list) {
      table.incrementColumnValue(Bytes.toBytes(item.day_serach_course),
        Bytes.toBytes(cf), Bytes.toBytes(qualifer),
        item.click_count
      ) //If the value is Long, it will be automatically converted 
    }
  }
}

CountByStreaming

package com.spark.streaming.project.application

import com.spark.streaming.project.domain.ClickLog
import com.spark.streaming.project.domain.CourseClickCount
import com.spark.streaming.project.domain.CourseSearchClickCount
import com.spark.streaming.project.utils.DateUtils
import com.spark.streaming.project.dao.CourseClickCountDao
import com.spark.streaming.project.dao.CourseSearchClickCountDao
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.ListBuffer

object CountByStreaming {
  def main(args: Array[String]): Unit = {
    /**
     * Finally, the program will be packaged and run on the cluster, 
     * Several parameters need to be received: the ip address of the zookeeper server, the kafka consumption group, 
     * Topic, and number of threads 
     */
    if (args.length != 4) {
      System.err.println("Error:you need to input:<zookeeper> <group> <toplics> <threadNum>")
      System.exit(1)
    }
    //Receive the parameters of the main function and pass the parameters outside 
    val Array(zkAdderss, group, toplics, threadNum) = args

    /**
     * When creating the Spark context for a local run, you need to set AppName,
     * Master and other attributes; remove these before packaging for the cluster
     */
    val sparkConf = new SparkConf()
      .setAppName("CountByStreaming")
      .setMaster("local[4]")

    //Create a Spark discrete stream and receive data every 60 seconds 
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    //Using kafka as the data source 
    val topicsMap = toplics.split(",").map((_, threadNum.toInt)).toMap
    //Create a kafka discrete stream and consume the data of the kafka cluster every 60 seconds 
    val kafkaInputDS = KafkaUtils.createStream(ssc, zkAdderss, group, topicsMap)
    //Get the original log data 
    val logResourcesDS = kafkaInputDS.map(_._2)
    /**
     * (1)Clean the data and package it into ClickLog 
     * (2)Filter out illegal data 
     */
    val cleanDataRDD = logResourcesDS.map(line => {
      val splits = line.split("\t")
      if (splits.length != 5) {
        //Illegal data is directly encapsulated and given an error value by default, and the filter will filter it 
        ClickLog("", "", 0, 0, "")
      }
      else {
        val ip = splits(0) //Get the ip address of the user in the log 
        val time = DateUtils.parseToMinute(splits(1)) //Obtain the access time of the user in the log and call DateUtils to format the time 
        val status = splits(3).toInt //Get access status code 
        val referer = splits(4)
        val url = splits(2).split(" ")(1) //Get search url
        var courseId = 0
        if (url.startsWith("/class")) {
          val courseIdHtml = url.split("/")(2)
          courseId = courseIdHtml.substring(0, courseIdHtml.lastIndexOf(".")).toInt
        }
        ClickLog(ip, time, courseId, status, referer) //Encapsulate the cleaned log into ClickLog
      }
    }).filter(x => x.courseId != 0) //Filter out records whose courseId is 0 (not course pages)
    /**
     * (1)statistical data 
     * (2)Write the calculation results into HBase 
     */
    cleanDataRDD.map(line => {
      //This is equivalent to defining the RowKey of the HBase table "ns1:courses_clickcount":
      // using "date_courseId" as the RowKey means the number of visits to a course on a given day
      (line.time.substring(0, 8) + "_" + line.courseId, 1) //Map to tuple 
    }).reduceByKey(_ + _) //aggregate by key
      .foreachRDD(rdd => { //There are multiple RDD S in a DStream 
        rdd.foreachPartition(partition => { //There are multiple partitions in an RDD
          val list = new ListBuffer[CourseClickCount]
          partition.foreach(item => { //There are multiple records in a Partition 
            list.append(CourseClickCount(item._1, item._2))
          })
          CourseClickCountDao.save(list) //Save to HBase 
        })
      })

    /**
     * Count the total hits of practical courses from various search engines so far 
     * (1)statistical data 
     * (2)Write the statistical results into HBase 
     */
    cleanDataRDD.map(line => {
      val referer = line.referer
      val time = line.time.substring(0, 8)
      var url = ""
      if (referer == "-") { //Filter illegal URLs 
        (url, time)
      }
      else {
        //Take out the name of the search engine 
        url = referer.replaceAll("//", "/").split("/")(1)
        (url, time)
      }
    }).filter(x => x._1 != "").map(line => {
      //This is equivalent to defining the RowKey of the HBase table "ns1:courses_search_clickcount":
      // using "date_searchEngineName" as the RowKey means the number of times courses were reached through that search engine on a given day
      (line._2 + "_" + line._1, 1) //Map to tuple 
    }).reduceByKey(_ + _) //aggregate by key
      .foreachRDD(rdd => {
        rdd.foreachPartition(partition => {
          val list = new ListBuffer[CourseSearchClickCount]
          partition.foreach(item => {
            list.append(CourseSearchClickCount(item._1, item._2))
          })
          CourseSearchClickCountDao.save(list)
        })
      })
    ssc.start()
    ssc.awaitTermination()
  }
}
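
The comments above note that the job is eventually packaged and submitted to the cluster. A hedged sketch of such a submission (the jar name and consumer-group name are placeholders, and --packages simply mirrors the Kafka 0.8 integration version from the pom; remember to remove setMaster before packaging):

spark-submit --class com.spark.streaming.project.application.CountByStreaming \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1 \
  project.jar s0109:2181 streaming_group streamtopic 1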

Running environment configuration

Click Run once; in the pop-up dialog select CountByStreaming (the entry without the $), then follow the environment configuration in the PDF and fill in the Program arguments, for example as sketched below.
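
A possible set of Program arguments matching <zookeeper> <group> <topics> <threadNum> (the consumer-group name is arbitrary; the ZooKeeper address and topic come from the earlier configuration):

s0109:2181 streaming_group streamtopic 1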

Build front end projects

Configuration (pom.xml)

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.2.2</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
        </dependency>
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>3.1.0</version>
        </dependency>
        <dependency>
            <groupId>net.sf.json-lib</groupId>
            <artifactId>json-lib</artifactId>
            <classifier>jdk15</classifier>
            <version>2.4</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.78</version>
        </dependency>
    </dependencies>

Unable to connect to mysql

First, make sure the spark database exists in your MySQL instance.

ERROR: 1

Similar errors may be reported in different versions

Cause: MySQL is not listening on port 3306

sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
# comment out the skip-grant-tables line
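
After commenting it out, restart MySQL so the change takes effect; on a systemd-based Ubuntu this is typically:

sudo systemctl restart mysql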

ERROR: 2

JDBC driver version incompatible

If you installed MySQL using the command from the PDF without pinning a version, download the latest JDBC driver (Connector/J) from the official website. Reference: MySQL JDBC jar package download and use (Windows), cnblogs.com.
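
The front-end pom above does not list the MySQL driver. If you prefer Maven over a manually downloaded jar, a dependency along these lines can be added (the version is an assumption; match it to your MySQL installation):

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.27</version>
</dependency>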

code

testSQL

Note: modify the date in the testHbase() code to the date on which you ran the back-end code; for example, 20211001 becomes 20211229. Before running it, take a virtual machine snapshot and check that the ZooKeeper server address in the HBaseUtils file is correct.

import com.test.utils.HBaseUtils;
import com.test.utils.JdbcUtils;
import org.junit.Test;
import java.sql.*;
import java.util.Map;

public class testSQL {
    @Test
    public void testjdbc() throws ClassNotFoundException {
        Class.forName("com.mysql.cj.jdbc.Driver");
        String url = "jdbc:mysql://localhost:3306/spark";
        String username = "root";
        String password = "root";
        try {
            Connection conn = DriverManager.getConnection(url, username,
                    password);
            Statement stmt = conn.createStatement();
            ResultSet res = stmt.executeQuery("select * from course");
            while (res.next())
                System.out.println(res.getString(1)+" "+res.getString(2));
            conn.close();
            stmt.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
    @Test
    public void testJdbcUtils() throws ClassNotFoundException {
        System.out.println(JdbcUtils.getInstance().getCourseName("128"));
        System.out.println(JdbcUtils.getInstance().getCourseName("112"));
    }
    @Test
    public void testHbase() {    	
        Map<String,Long>clickCount=HBaseUtils.getInstance().getClickCount("ns1:courses_clickcount", "20211001");
        for (String x : clickCount.keySet())
            System.out.println(x + " " +clickCount.get(x));
    }
}

JdbcUtils

package com.test.utils;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
public class JdbcUtils {
    private Connection connection = null;
    private static JdbcUtils jdbcUtils = null;
    Statement stmt = null;
    private JdbcUtils() throws ClassNotFoundException {
        Class.forName("com.mysql.jdbc.Driver");
        String url = "jdbc:mysql://localhost:3306/spark?useSSL=false";
        String username = "root";
        String password = "root";
        try {
            connection = DriverManager.getConnection(url, username, password);
            stmt = connection.createStatement();
        }catch (Exception e){
            e.printStackTrace();
        }
    }
    /**
     * Get JdbcUtil instance
     * @return
     */
    public static synchronized JdbcUtils getInstance() throws
            ClassNotFoundException {
        if(jdbcUtils == null){
            jdbcUtils = new JdbcUtils();
        }
        return jdbcUtils;
    }
    /**
     * Get the course name according to the course id
     */
    public String getCourseName(String id){
        try {
            ResultSet res = stmt.executeQuery("select * from course where id =\'" + id + "\'");
            while (res.next())
                return res.getString(2);
        }catch (Exception e){
            e.printStackTrace();
        }
        return null;
    }
    /**
     * Query statistics results by date
     */
    public Map<String,Long> getClickCount(String tableName, String date){
        Map<String,Long> map = new HashMap<String, Long>();
        try {
        }catch (Exception e){
            e.printStackTrace();
            return null;
        }
        return map;
    }
}

HBaseUtils

Note: change localhost in the ZooKeeper server address below to your own hostname/account (e.g. s0109).

package com.test.utils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;
import java.util.HashMap;
import java.util.Map;
public class HBaseUtils {
    private Configuration configuration = null;
    private Connection connection = null;
    private static HBaseUtils hBaseUtil = null;
    private HBaseUtils(){
        try {
            configuration = new Configuration();
            //The address of the zookeeper server
            configuration.set("hbase.zookeeper.quorum","localhost:2181");
            connection = ConnectionFactory.createConnection(configuration);
        }catch (Exception e){
            e.printStackTrace();
        }
    }
    /**
     * Get HBaseUtil instance
     * @return
     */
    public static synchronized HBaseUtils getInstance(){
        if(hBaseUtil == null){
            hBaseUtil = new HBaseUtils();
        }
        return hBaseUtil;
    }
    /**
     * Get table objects from table names
     */
    public HTable getTable(String tableName){
        try {
            HTable table = null;
            table = (HTable)connection.getTable(TableName.valueOf(tableName));
            return table;
        }catch (Exception e){
            e.printStackTrace();
        }
        return null;
    }
    /**
     * Query statistics results by date
     */
    public Map<String,Long> getClickCount(String tableName, String date){
        Map<String,Long> map = new HashMap<String, Long>();
        try {
            //Get table instance
            HTable table = getInstance().getTable(tableName);
            //Column family
            String cf = "info";
            //Column
            String qualifier = "click_count";
            //Define a prefix filter so the scanner only reads rows for the given date
            Filter filter = new PrefixFilter(Bytes.toBytes(date));
            //Define the scanner
            Scan scan = new Scan();
            scan.setFilter(filter);
            ResultScanner results = table.getScanner(scan);
            for(Result result:results){
                //Take out the rowKey
                String rowKey = Bytes.toString(result.getRow());
                //Take out the click count
                Long clickCount =
                        Bytes.toLong(result.getValue(cf.getBytes(),qualifier.getBytes()));
                map.put(rowKey,clickCount);
            }
        }catch (Exception e){
            e.printStackTrace();
            return null;
        }
        return map;
    }
}

Mysql data

Log in to MySQL from the shell (e.g. mysql -u root -p) and execute:

use spark;
DROP TABLE IF EXISTS `course`;
CREATE TABLE `course`  (
  `id` int NOT NULL,
  `course` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci ROW_FORMAT = Dynamic;
INSERT INTO `course` VALUES (112, 'Spark');
INSERT INTO `course` VALUES (127, 'HBase');
INSERT INTO `course` VALUES (128, 'Flink');
INSERT INTO `course` VALUES (130, 'Hadoop');
INSERT INTO `course` VALUES (145, 'Linux');
INSERT INTO `course` VALUES (146, 'Python');

The original tutorial ends here; the rest is up to you.

Topics: Linux Operation & Maintenance Ubuntu