Big data offline processing project: collecting website log files, splitting the logs, collecting them to HDFS, and preprocessing

Posted by erth on Tue, 30 Nov 2021 12:59:03 +0100

Introduction:

This article covers the first stage of a big data offline processing project: data collection.

Main contents:

1) Collect the website log file data into access.log

2) Write a shell script to split the collected log file (otherwise access.log grows too large) and rename it to access_yyyy-MM-dd-HH-mm.log. The script runs once a minute.

3) Collect the split and renamed log files to HDFS with flume

4) Move the log files on HDFS into the preprocessing working directory on HDFS

1. Collect log data files and split log files

To install crontab (crontab operation instructions are at the end of this article), switch to the root user:

yum install crontabs

Script to split the log file:

#!/bin/sh
# Directory where nginx writes its log files (and where the split files end up)
logs_path="/home/hadoop/nginx/logs/"
# File holding the process id of the running nginx master
pid_path="/home/hadoop/nginx/logs/nginx.pid"
filepath=${logs_path}"access.log"
echo $filepath
# Rename the current log file; the date is shifted back one day (see the note below)
mv ${logs_path}access.log ${logs_path}access_$(date -d '-1 day' '+%Y-%m-%d-%H-%M').log
# Signal nginx to reopen its log files, which creates a fresh, empty access.log
kill -USR1 `cat ${pid_path}`

Of which:

logs_path is the directory where the split log files are stored

pid_path is the file that records the process id of the running nginx

filepath is the path of the log file to be split

Note: the -1 day in the rename is because this is offline processing: today's run processes yesterday's data, so the file name carries yesterday's date.
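For reference, this is the date expression the rename uses; the output shown is hypothetical, assuming the script runs on 2021-11-30 at 12:00:

date -d '-1 day' '+%Y-%m-%d-%H-%M'
# prints: 2021-11-29-12-00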

Add the crontab entry (*/1 means every minute):

*/1 * * * * sh /home/hadoop/bigdatasoftware/project1/nginx_log.sh
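The entry can be registered and checked with the standard crontab commands (a minimal sketch, run as the user that owns the script):

crontab -e    # paste the */1 line above, then save and exit
crontab -l    # confirm the entry is listed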

Restart crontab service:

service crond restart

Reload the configuration to make the scheduled task take effect:

service crond reload

Start nginx and visit the pages a.html and b.html served by nginx.

Log data is now generated continuously into access.log and then split by the scheduled script.

The log data was successfully collected, split, and renamed:

Note that after the split, the new access.log is empty:
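To generate some traffic and verify the split by hand, something like the following can be used (a minimal sketch; the localhost URL and page names are assumptions based on this setup):

# request the test pages repeatedly to produce log entries
for i in $(seq 1 100); do
  curl -s http://localhost/a.html > /dev/null
  curl -s http://localhost/b.html > /dev/null
done
# after a minute, the split access_*.log files should appear next to an empty access.log
ls -l /home/hadoop/nginx/logs/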

2. Collect the successfully split log files to HDFS

Create a new configuration file in the flume/job directory

touch job/TaildirSource-hdfs.conf

Write flume configuration file:

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
#Monitor newly appended content of multiple files in a directory
agent1.sources.source1.type = TAILDIR
#Save the consumed offset of each file as json, so consumption does not restart from the beginning
agent1.sources.source1.positionFile = /home/hadoop/taildir_position.json
agent1.sources.source1.filegroups = f1
#The file name part is a regular expression; match the split access_<date>.log files
agent1.sources.source1.filegroups.f1 = /home/hadoop/nginx/logs/access_.*\.log

agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname
#Configure the sink as hdfs
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path=hdfs://hadoop121:8020/weblog/flume-collection/%Y-%m-%d/%H-%M_%{hostname}
#File name prefix for the files written to HDFS
agent1.sinks.sink1.hdfs.filePrefix = access_log
#Number of events written to HDFS per batch
agent1.sinks.sink1.hdfs.batchSize= 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat =Text
#Roll the file when it reaches 1 MB
agent1.sinks.sink1.hdfs.rollSize = 1048576
#Roll the file after 1,000,000 events
agent1.sinks.sink1.hdfs.rollCount = 1000000
#Roll the file every 30 seconds
agent1.sinks.sink1.hdfs.rollInterval = 30
#agent1.sinks.sink1.hdfs.round = true
#agent1.sinks.sink1.hdfs.roundValue = 10
#agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

#Using memory type channel
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

Start the flume agent from the flume directory:

bin/flume-ng agent --conf conf/ --name agent1 --conf-file job/TaildirSource-hdfs.conf -Dflume.root.logger=INFO,console

Then keep hitting a.html and b.html to generate more log entries.
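Once split files start appearing, the sink output can be checked on HDFS (a minimal sketch, assuming the hadoop client is configured on this machine and the events carry today's timestamp):

# list today's collection directory created by the hdfs sink
hadoop fs -ls -R /weblog/flume-collection/$(date +%Y-%m-%d)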

The final results are as follows:

 

 

3. Move the log data files to the preprocessing working directory

Create a new script file, movetopreworkdir.sh, in the /home/hadoop/bigdatasoftware/project1 directory:

 

Script:

#!/bin/bash

#
# ===========================================================================
# Program name:
# Function Description:   Move files to the preprocessing working directory
# Input parameters:       Run date
# Destination path:       /data/weblog/preprocess/input
# Data source:            Flume collection data storage path: /weblog/flume-collection
# Code review:
# Modified by:
# Modification date:
# Reason for modification:
# Modify list:
# ===========================================================================

# Directory where flume stores the collected log files
log_flume_dir=/weblog/flume-collection

# Working directory of the preprocessing step
log_pre_input=/data/weblog/preprocess/input


# Compute the date information (offline processing: handle yesterday's data)
day_01="2013-09-18"              # sample value, overwritten by the line below
day_01=`date -d'-1 day' +%Y-%m-%d`
syear=`date --date=$day_01 +%Y`
smonth=`date --date=$day_01 +%m`
sday=`date --date=$day_01 +%d`


# Check the flume output directory for yesterday's files before moving them
files=`hadoop fs -ls $log_flume_dir | grep $day_01 | wc -l`
if [ $files -gt 0 ]; then
  hadoop fs -mv ${log_flume_dir}/${day_01} ${log_pre_input}
  echo "successfully moved ${log_flume_dir}/${day_01} to ${log_pre_input} ....."
fi

Before execution:

 

Execute script:
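A minimal invocation looks like this (the ls afterwards simply confirms the move into the preprocessing input directory):

sh /home/hadoop/bigdatasoftware/project1/movetopreworkdir.sh
hadoop fs -ls /data/weblog/preprocess/input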

Run successfully and view the results:

 

That's it for the data collection stage~

crontab service operation

Start service: service crond start

Stop the service: service crond stop

Restart the service: service crond restart

Reload configuration: service crond reload

View crontab service status: service crond status

Manually start the crontab service: service crond start

Check whether the crond service is set to start at boot: chkconfig --list

Enable start at boot: chkconfig --level 35 crond on

Edit the current user's crontab: crontab -e

List the current crontab entries: crontab -l

Remove the current crontab: crontab -r

Configuration description

Basic format:
*  *  *  *  *  command
Minute  Hour  Day  Month  Weekday  command
Column 1: minute, 0 ~ 59; * or */1 means every minute
Column 2: hour, 0 ~ 23 (0 is midnight); e.g. 7-9 means at 7:00, 8:00 and 9:00
Column 3: day of month, 1 ~ 31
Column 4: month, 1 ~ 12
Column 5: day of week, 0 ~ 6 (0 is Sunday)
Column 6: the command to run
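For example, a hypothetical entry that runs the move script every day at 01:30 would look like this:

30 1 * * * sh /home/hadoop/bigdatasoftware/project1/movetopreworkdir.sh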

Topics: Big Data Hadoop hdfs flume