Three-level Practical Subject Report

Posted by rebelo on Sun, 19 Dec 2021 20:30:54 +0100

Three-level Practical Subject Report

Project Name: Website log analysis system for the Canva online graphic design software

Major: Data Science and Big Data Technology

Class: = = 201

Student No.:============

Student Name: CS daydream

Instructor:===

December 2021

Abstract

With the development of the Internet, network data is growing exponentially. IDC data shows that global enterprise data grows at a rate of 55% per year. Big data contains huge commercial value and has attracted extensive attention from enterprises. However, big data also brings problems and difficulties to data synchronization, storage and statistical analysis, and existing tools are gradually becoming unable to deal with them effectively. Google first proposed MapReduce to meet its own demand for big data processing; Hadoop, an open-source implementation of MapReduce, has gradually become a core part of the basic computing platform of many Internet companies.

Based on the demand analysis of this system, our group designed an architecture built on Hadoop and Hive clusters that integrates the data source layer, storage layer and computing layer, and designed and implemented offline log statistical analysis, task scheduling, visualization and other functions.

This paper makes comprehensive use of a variety of open-source technologies from the Hadoop ecosystem, including Flume NG, Sqoop, HDFS, MapReduce and Hive. From log collection, through log storage and computation, to the final visual interface, it covers the typical processes and technologies of Hadoop log statistical analysis.

The development languages are Java and shell, and the development tools include IDEA and Xshell. A Hadoop cluster is built on three CentOS machines for distributed storage and computing.

Keywords: big data; Hadoop; MapReduce; log statistical analysis

Chapter I Introduction

1.1 Topic Introduction

In Internet applications, logs are very important data. Because Internet projects usually require 7×24 uninterrupted operation, obtaining and analyzing the log data related to system operation and monitoring is very important. Website traffic statistics is one of the important means of improving website services: by obtaining and analyzing users' behavior data on the website, we can extract valuable information and improve the website based on it.

1.2 Subject Source

For the future development of the Internet, big data log analysis is becoming increasingly important. In order to consolidate their big data knowledge and application ability, the team members completed this massive log analysis system.

1.3 Subject Requirements

1. Collect and monitor logs with Flume

2. Clean the collected raw logs and upload them to Hive

3. Compute the following indicators through log analysis:

  • PV, UV: by day / week / month
  • Number of login users per day: a login user is identified by a POST /api/login request
  • Number of student visits in each time period: by day, week and month
  • Most frequently visited pages: Top 10, by day / week / month
  • Most frequent visitors: Top 10, by day / week / month
  • System environment of visits: by day / week / month [operating system version, browser and other information]
  • IP sources of visits [the IP location can be queried by calling an external interface; hint: it can be obtained directly with the HTTP request tool in Hutool]
  • Total IP count: by day / week / month

4. Export the statistical result tables to MySQL with Sqoop

5. Finally, display the results with front-end technology

Chapter II Implementation Technology

2.1 Main Technologies of Data Processing

2.1.1 Sqoop

Sqoop is an open-source offline data transfer tool mainly used for moving data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL). It can import data from a relational database into HDFS, or export data from HDFS into a relational database.
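As an illustration, a minimal Sqoop sketch might look like the following; the host name, database, table and HDFS paths are placeholders, not the project's actual values:

# Hypothetical sketch: import a MySQL table into HDFS
sqoop import \
  --connect jdbc:mysql://node03:3306/testdb \
  --username root --password 123456 \
  --table demo_table \
  --target-dir /sqoop/demo_table \
  --num-mappers 1

# Hypothetical sketch: export an HDFS directory back into a MySQL table
sqoop export \
  --connect jdbc:mysql://node03:3306/testdb \
  --username root --password 123456 \
  --table demo_table \
  --export-dir /sqoop/demo_table \
  --input-fields-terminated-by '\t'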

2.1.2 Flume

Flume is an open-source framework for real-time data collection. It is a highly available, highly reliable, distributed system for massive log collection, aggregation and transmission, originally provided by Cloudera and now a top-level Apache project. Flume can collect data such as logs and events and store them centrally for downstream consumers (especially stream-processing frameworks such as Storm). A similar framework is Scribe (Facebook's open-source log collection system, which provides a scalable, fault-tolerant and simple scheme for distributed log collection and unified processing).
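As a sketch only, a minimal Flume NG agent that tails an access log and writes it to HDFS could be configured roughly as follows; the agent name, file paths and HDFS address are assumptions, not the project's actual configuration:

# Hypothetical sketch: write a minimal Flume agent configuration, then start the agent
cat > /opt/flume/conf/access-log.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/logs/access.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node01:8020/flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
EOF

flume-ng agent --name a1 --conf /opt/flume/conf \
  --conf-file /opt/flume/conf/access-log.conf -Dflume.root.logger=INFO,console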

2.1.3 MapReduce

MapReduce is Google's core computing model. It abstracts the complex parallel computing process running on large-scale clusters into two functions: map and reduce. The greatest value of MapReduce is that it gives ordinary developers the ability to process big data: they can run their own programs on a distributed system to process massive data even without any distributed programming knowledge.

2.1.4 Hive

MapReduce gives the ability to process big data to ordinary developers, while Hive further extends that ability to actual data users (data development engineers, data analysts, algorithm engineers and business analysts).

Hive was developed by Facebook and contributed to the Hadoop open-source community. It is a layer of SQL abstraction on top of the Hadoop architecture. Hive provides tools for processing, querying and analyzing data sets stored in Hadoop files, and supports a query language similar to the SQL of traditional RDBMSs, which helps users who are familiar with SQL to process and query data in Hadoop. This query language is called Hive SQL (HiveQL). A Hive SQL statement is first parsed by the SQL parser, then translated by the Hive framework into an executable MapReduce plan; MapReduce tasks are generated according to the plan and handed over to the Hadoop cluster for processing.
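For example, a SQL-like query such as the following sketch (table and column names are invented for illustration) is translated by Hive into MapReduce jobs and executed on the cluster:

# Hypothetical sketch: a HiveQL query submitted from the command line
hive -e "
SELECT dt, COUNT(*) AS pv, COUNT(DISTINCT ip) AS uv
FROM demo_access_log
GROUP BY dt;
"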

2.2 Main Data Storage Technology: HDFS

Hadoop Distributed File System (HDFS for short) is a distributed file system. It has a high degree of fault tolerance and high-throughput data access, and is very suitable for applications on large-scale data sets. HDFS provides a massive data storage solution with high fault tolerance and high throughput.

In the whole Hadoop architecture, HDFS provides support for file storage and access during MapReduce task processing; MapReduce distributes, tracks and executes tasks on top of HDFS and collects the results. The two interact to jointly complete the main tasks of a Hadoop distributed cluster.
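In daily use, HDFS is operated through the hdfs dfs commands, for example (paths are illustrative only):

hdfs dfs -mkdir -p /logs/raw                          # create a directory for raw logs
hdfs dfs -put access_2021_12_19.log /logs/raw         # upload a local log file
hdfs dfs -ls /logs/raw                                # list the uploaded files
hdfs dfs -cat /logs/raw/access_2021_12_19.log | head  # peek at the first lines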

2.3 Main Technology of Data Application: Java

Java is the collective name for the object-oriented programming language and the Java platform launched by Sun Microsystems in May 1995. It was developed by James Gosling and colleagues and officially released in 1995. Sun was later acquired by Oracle, and Java became an Oracle product.

Java is divided into three editions:

· Java SE (J2SE, Java 2 Platform, Standard Edition)

· Java EE (J2EE, Java 2 Platform, Enterprise Edition)

· Java ME (J2ME, Java 2 Platform, Micro Edition)

At the JavaOne conference in June 2005, Sun unveiled Java SE 6. By then the various editions of Java had been renamed to drop the number "2": J2EE became Java EE, J2SE became Java SE, and J2ME became Java ME.

Chapter III Implementation Process

3.1 Simple Process Display

3.2 Environment Construction

  1. Install Hadoop on all three virtual machines

  2. Install Hive on node01 and node02; node01 acts as the server and node02 as the client

  3. Use node03 as the storage node: install MySQL on node03 and grant MySQL access to node01 and node02 (a sketch of the grant statements is given after this list)

  4. Install Flume on node01 to monitor and collect the logs

  5. Install Sqoop on node02 to export the data tables in Hive to MySQL on node03
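For step 3, a sketch of the grant statements run on node03 (MySQL 5.x syntax; the user name and password are placeholders):

# Hypothetical sketch: allow node01 and node02 to connect to MySQL on node03
mysql -uroot -p -e "
GRANT ALL PRIVILEGES ON *.* TO 'root'@'node01' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'node02' IDENTIFIED BY '123456';
FLUSH PRIVILEGES;
"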

3.3 Collecting Log Files

The following is the main class for generating logs:

import cn.hutool.core.date.DateTime;
import cn.hutool.core.date.DateUtil;
import cn.hutool.core.util.RandomUtil;
import lombok.extern.slf4j.Slf4j;
import java.io.*;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
@Slf4j
public class CreateData {
    private static List<String> urls;
    private static List<String> ips;
    public static void create(int days,int maxSize,String dir) throws Exception {
        Date date = DateUtil.date();
        String dateString;
        for (int j = 0; j < days; j++) {//How many days of logs are generated
            DateTime day = DateUtil.offsetDay(date, -j);
            dateString = DateUtil.format(day,"yyyy_MM_dd");
            File file = new File(dir+File.separator + "access_" +dateString + ".log");
            if (!file.exists()) {
                file.createNewFile();
            }
            try {
                FileWriter writer = new FileWriter(file);
                BufferedWriter bw = new BufferedWriter(writer);
                String timeStr = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH).format(day);
                while (file.length() / (1024 * 1024) < maxSize) {//Write until the file reaches maxSize MB
                    //Assemble one access-log record: ip - - [time +0800] request
                    String str = getIp() + " - - [" + timeStr + " +0800] " + getUrl();
                    bw.write(str);
                    bw.newLine();//Line feed
                    bw.flush();//Flush so that file.length() reflects what has already been written
                }

                bw.close();
                writer.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            //}
        }
    }

    public static void main(String[] args) throws InterruptedException {
        //Continuously emit one access-log record every 0.1-2 seconds through the logging framework
        while (true){
            Thread.sleep(RandomUtil.randomLong(100,2*1000));
            log.info("{} - - [{} +0800] {}",getIp(),DateUtil.format(DateUtil.date(),"dd/MM/yyyy:HH:mm:ss"),getUrl());
        }
    }
    private static String getIp(){
        return ips.get(RandomUtil.randomInt(ips.size()));
    }
    private static String getUrl(){
        return urls.get(RandomUtil.randomInt(urls.size()));
    }
    public void createLog(){

    }
    static {
        init();
    }
    public static void init() {
        //Get URL and IP collection
        parseData();
    }
    private static void parseData() {
        try {
            Set<String> url = new HashSet<>();
            Set<String> ip = new HashSet<>();
            InputStream is = CreateData.class.getResourceAsStream("/access.log");
            InputStreamReader reader = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(reader);
            String line;
            String sep = "+0800]";
            int index;
            while ((line = br.readLine()) != null) {
                //Resolve URL
                index = line.indexOf(sep);
                url.add(line.substring(index+sep.length()).trim());
                //Resolve IP
                index = line.indexOf(" ");
                ip.add(line.substring(0,index).trim());
            }
            br.close();
            reader.close();
            is.close();
            urls = new ArrayList<>(url);
            ips = new ArrayList<>(ip);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

1. Install and configure Maven in IDEA

2. Package the log generation class CreateData into a jar and upload it to node01, as shown in the figure:

3. After uploading to node01, execute the operation shown in the figure (a command sketch for steps 3 and 4 is given after this list):

Part of the log data generated by the command is shown in the figure

4. Start the Flume monitoring system so that it monitors the log data as it is generated

5. Stop the program after enough data has been generated

6. Access port node01:9870 in the browser, as shown in the figure:

7. Enter the flume directory, as shown:
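A sketch of the commands for steps 3 and 4; the jar name, log path and agent/configuration names are assumptions rather than the project's actual values:

# Step 3 (sketch): run the log generator on node01 in the background,
# writing the generated records to the file that Flume monitors
nohup java -jar create-data.jar > /opt/logs/access.log 2>&1 &

# Step 4 (sketch): start the Flume agent that monitors the log file
flume-ng agent --name a1 --conf /opt/flume/conf \
  --conf-file /opt/flume/conf/access-log.conf -Dflume.root.logger=INFO,console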

3.4 Data Cleaning

1. Write a MapReduce program in IDEA to clean the data:

import cn.edu.zut.level.util.CollectionUtil;
import cn.edu.zut.level.util.LoggerParse;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class LogCleaner {
    public static void main(String[] args) throws Exception {
        run(args[0], args[1]);
    }

    public static void run(String inPath,String outPath) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "LogCleaner");

        job.setJarByClass(LogCleaner.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setMapperClass(CleanerMapper.class);
        //Map-only job: the cleaning step needs no reduce phase
        job.setNumReduceTasks(0);
        FileInputFormat.setInputPaths(job, new Path(inPath));
        FileOutputFormat.setOutputPath(job, new Path(outPath));

        job.waitForCompletion(true);
    }
}

class CleanerMapper extends Mapper<LongWritable, Text,NullWritable, Text>{
    private LoggerParse parse = new LoggerParse();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, NullWritable, Text>.Context context) throws IOException, InterruptedException {
        //Parse one raw log line into its fields (LoggerParse is a project utility class)
        String[] strs = this.parse.parse(value.toString());
        String uri = strs[4];
        //Skip requests for static resources so that only page and API accesses remain
        if(uri.startsWith("/static") ||
                uri.endsWith("_navbar.md") ||
                uri.endsWith(".jpg") ||
                uri.endsWith(".png") ||
                uri.endsWith(".PNG") ||
                uri.endsWith(".jpeg") ||
                uri.endsWith(".gif") ||
                uri.endsWith(".svg") ||
                uri.endsWith(".js") ||
                uri.endsWith(".css")
        ) return;
        context.write(NullWritable.get(), new Text(CollectionUtil.mkString(strs,"\t")));
    }
}

When writing the log cleaning program, note that the input and output paths are not hard-coded but are passed in from the command line. In this way, when the job runs on node01, log files of different days can be selected for cleaning and the output can be written to whatever location you want.

2. Package the log cleaning class LogCleaner into a jar and upload it to node01, as shown in the figure:

3. Run the jar after uploading, as shown in the figure below (a command sketch is given after this list):

4. After the data cleaning jar has run, a directory named out is generated, which stores the cleaned data.
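A sketch of the run command in step 3; the jar name, main class package and HDFS paths are assumptions:

# Hypothetical sketch: clean one day's raw logs in HDFS and write the result to /out
hadoop jar log-cleaner.jar cn.edu.zut.level.LogCleaner \
  /flume/logs/2021-12-19 \
  /out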

3.5 Importing Data into Hive

1. After the raw log data has been cleaned, upload it into a Hive table

2. For convenience of operation, IDEA is used to connect to Hive remotely, and the operations are performed in IDEA, as shown in the figure:


3. Create the Hive table and import the cleaned data into it, as shown in the figure (a HiveQL sketch is given after this list):

4. Right-click each part of the code to run it

5. The effect after running is shown in the figure:
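A HiveQL sketch of step 3; the table name and columns are assumptions based on the tab-separated fields produced by the cleaning job:

# Hypothetical sketch: create the table and load the cleaned data
# (the same statements can equally be run from the IDEA Hive console)
hive -e "
CREATE TABLE IF NOT EXISTS access_log (
    ip        STRING,
    identity  STRING,
    username  STRING,
    log_time  STRING,
    request   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/out' INTO TABLE access_log;
"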

3.6 Importing the Final Tables into MySQL

1. Statistics on the data:

Before importing into MySQL, the following statistics are computed on the data:

  • PV, UV: by day / week / month
  • Number of login users per day: a login user is identified by a POST /api/login request
  • Number of student visits in each time period: by day, week and month
  • Most frequently visited pages: Top 10, by day / week / month
  • Most frequent visitors: Top 10, by day / week / month
  • System environment of visits: by day / week / month [operating system version, browser and other information]
  • IP sources of visits
  • Total IP count: by day / week / month

PV, UV: day / week / month:

The code is shown in the figure below:
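A minimal HiveQL sketch of such a statistic (the daily case; table and column names follow the assumed access_log sketch above):

hive -e "
-- Hypothetical sketch: PV and UV per day; the day key is cut out of the time field
SELECT substr(log_time, 1, 10)  AS day,
       COUNT(*)                 AS pv,
       COUNT(DISTINCT ip)       AS uv
FROM access_log
GROUP BY substr(log_time, 1, 10);
"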

Count the number of login users per day (a login user is identified by a POST /api/login request):

The code is shown in the figure below:
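A HiveQL sketch for this statistic, approximating "login users" by the distinct IPs that issued a POST /api/login request (an assumption, since the cleaned fields carry no explicit user ID):

hive -e "
-- Hypothetical sketch: login users per day, identified by POST /api/login requests
SELECT substr(log_time, 1, 10)  AS day,
       COUNT(DISTINCT ip)       AS login_users
FROM access_log
WHERE request LIKE 'POST /api/login%'
GROUP BY substr(log_time, 1, 10);
"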

Count the number of student visits in each time period: day, week, month:

The code is shown in the figure below:

Statistics of the most frequently visited pages: Top 10. Day / week / month

The code is shown in the figure below:

Most frequent visitors: Top 10, day / week / month:

The code is shown in the figure below:

System environment for statistical access: day / week / month [operating system version, browser and other information]

The code is shown in the figure below:

Statistics of IP sources accessed

The code is shown in the figure below:

Total IP count: day / week / month

The code is shown in the figure below:

2. After the above statistics are completed, a partitioned table is created to store the eight statistical results

3. Import data into MySQL using Sqoop:

Complete the operation as shown in the code below to export the data
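A sketch of such an export for one result table; the database, table names, HDFS path and delimiter are placeholders and must match how the Hive result table actually stores its data:

# Hypothetical sketch: export the daily PV/UV result table from Hive's warehouse directory to MySQL on node03
sqoop export \
  --connect jdbc:mysql://node03:3306/log_analysis \
  --username root --password 123456 \
  --table pv_uv_day \
  --export-dir /user/hive/warehouse/pv_uv_day \
  --input-fields-terminated-by '\001'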

4. Create a new table for each of the eight statistical tasks to store the statistical results

That is, prefix the SQL statement of each statistic with CREATE TABLE <table name> AS

5. After the tables are created, export them one by one with Sqoop

6. Note:

  • Sqoop is a transfer tool between Hadoop and relational databases. With Sqoop you can import data from relational database management systems such as MySQL or Oracle into the Hadoop distributed file system, transform the data with Hadoop MapReduce, and then export it back to an RDBMS.
  • Sqoop does not automatically create databases or tables, so the corresponding database and tables must be created in MySQL before exporting data
  • Download and install Navicat to connect remotely to the MySQL database on node03
  • When creating a table, make sure the field types and character encodings are consistent, otherwise the data export will fail

3.7 Web Visualization

1. The data obtained from the analysis is visualized, generally through charts, so that operational decision-makers can obtain the data more conveniently and understand it faster and more easily.

2. Here I recorded several parts of the visualization, as shown in the following figure:


3. In order to complete the visualization simply, quickly and more attractively, I used the ECharts library

4. ECharts website: https://echarts.apache.org/zh/index.html

5. After downloading it, you get the JS file you want to use

Chapter IV Result Analysis

In general, through continuous learning, the whole process realizes a website log analysis system based on big data technology and addresses the problem that existing tools are gradually unable to handle massive data effectively. The tools mentioned in this article are: Nginx for setting the log data format; Flume for collecting, aggregating and transmitting log data; the Hadoop distributed file system HDFS for storing log data; MapReduce for cleaning log data; the Hadoop data warehouse tool Hive for storing, querying and analyzing log data; and Sqoop for transferring data between Hive and a traditional database. As long as such a big-data-based website log analysis system can be realized, the optimization of enterprise websites, the maintenance of information security and the maintenance of the websites themselves should be better guaranteed than before.

The general relationship between modules is summarized as follows:

Summary

During this project, with the help of teachers and classmates, the practical project was completed well and all functions were basically realized. The deficiencies are that the amount of log data is small, the computing tasks could not be completed to a fully satisfactory standard, and the final visualization is relatively simple.

At the initial stage of the project, due to a lack of proficiency in big-data-related knowledge, my thinking was always very vague. However, with the joint efforts of the team members, and by constantly consulting classmates and teachers, we gradually got into it, which made me understand how important cooperation with team members is. Not only that, this topic has also strengthened my logical thinking and greatly improved my ability to apply big-data-related knowledge.

Many problems were encountered during this practice, including the Hive client failing to start, Flume failing to monitor files, Sqoop failing to export data, and the jar package failing to run because the main class could not be found or because of null pointers. But I solved them one by one by asking classmates, looking up information and asking teachers.

This topic not only reflects the cooperation among the team members, but also shows each of our individual abilities. I also fully experienced the gap between myself and others, which is of great significance and help to my future study.


Course Evaluation Transcript

Grade bands: A excellent (100 > x ≥ 90), B good (90 > x ≥ 80), C medium (80 > x ≥ 70), D pass (70 > x ≥ 60), E fail (x < 60)

Learning attitude (10 points)
  • A: Serious learning attitude; rigorous scientific style; the design time is strictly guaranteed and all work is carried out according to the progress specified in the assignment
  • B: Serious learning attitude; scientific and rigorous style; able to successfully complete the tasks specified in the assignment on schedule
  • C: Good learning attitude; abides by organizational discipline; basically ensures completion of all work within the design time
  • D: Good learning attitude; able to abide by organizational discipline and complete tasks on schedule
  • E: Sloppy learning, lax discipline and lax work style; cannot guarantee the design time and progress

Defence (25 points)
  • A: The oral defence is fluent and the project introduction is complete, fully reflecting the personal work and experience in the project; questions are answered accurately and in detail; the defence time meets the requirements
  • B: The oral defence is fluent and the project introduction is complete, reflecting the work done by the individual in the project; the defence time meets the requirements and questions are answered fairly accurately
  • C: The expression in the defence is basically fluent, reflecting the work done by the individual in the project; the defence time meets the requirements and most questions are answered basically accurately
  • D: The defence is fluent, reflecting the work done by the individual in the project, and some questions are answered accurately
  • E: The expression in the defence is not fluent and the questions basically cannot be answered

Item score (40 points)
  • A: The design is reasonable, the functions are complete with almost no errors, the responsible module is complete with basically no errors, the code comments are clear and standardized, the visualization is effective and the interaction is reasonable, showing strong practical ability; literature citation and investigation are very reasonable and credible
  • B: The design is reasonable, the functions are complete with a few errors, the responsible module is complete with a few errors, the code comments are clear and standardized, the visualization is effective, showing strong practical ability; literature citation and investigation are reasonable and credible
  • C: The design is reasonable, the functions are basically complete with a few errors, the responsible module is basically complete with a few errors, the code comments are clear and standardized, the visualization is acceptable, showing a certain practical ability; the main literature citation and investigation are relatively reliable
  • D: The design is basically reasonable, the project design is basically completed and some code is written and annotated, but it cannot be run or tested
  • E: The design is unreasonable, basically no program is written or the program is plagiarized, practical ability is poor, and there are major problems with literature citation and investigation

Report score (25 points)
  • A: Rigorous structure, strong logic, clear organization, accurate language, fluent text; fully meets the standardization requirements; clear and neat
  • B: Reasonable structure, logical, well-layered, accurate language, fluent text; meets the standardization requirements; neat and clear
  • C: Reasonable structure, clear organization, fluent writing; basically meets the standardization requirements; relatively neat and clear
  • D: Basically reasonable structure and logic, fluent wording; barely meets the standardization requirements
  • E: Chaotic structure, unclear expression, many typos; does not meet the standardization requirements

Final score:

Instructor signature:

Date:

Topics: Big Data, Hadoop, MapReduce