Flume cluster installation and deployment, and Flume entry-level cases: the official case of monitoring port data, and real-time monitoring of multiple appended files in a specified directory

Posted by 3.grosz on Tue, 28 Dec 2021 09:57:51 +0100

Introduction: This is a learning-notes blog about installing and deploying Flume. The main contents are: Flume installation and deployment, and two Flume entry-level cases, namely the official case of monitoring port data and real-time tracking of changes to multiple appended files in a specified directory. If there are mistakes, please point them out and correct me!!!

Flume in brief

Flume is a highly available, reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. It is also a lightweight service framework based on a streaming architecture that is flexible and simple.
It is mainly used in the data-transport stage of big data pipelines.

Flume installation and deployment

Installation-related addresses

Flume official website address
Document viewing address
Download address

Installation and deployment

1. Copy the installation package to your own package directory. Mine is in: /home/lqs/software/
2. Extract it

[lqs@bdc112 software]$ pwd
/home/lqs/software
[lqs@bdc112 software]$ tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /home/lqs/module/

3. Rename the directory

[lqs@bdc112 software]$ cd ../module/
[lqs@bdc112 module]$  mv apache-flume-1.9.0-bin/ flume-1.9.0

4. Remove guava-11.0.2.jar from the lib directory to make Flume compatible with Hadoop 3.1.3

[lqs@bdc112 module]$ rm flume-1.9.0/lib/guava-11.0.2.jar
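Flume 1.9.0 ships Guava 11.0.2, while Hadoop 3.1.3 carries a much newer Guava; leaving the old jar on Flume's classpath typically ends in a NoSuchMethodError as soon as the HDFS sink is used, which is why the jar is deleted here. If you want to confirm which Guava version your Hadoop installation provides, a quick check (assuming $HADOOP_HOME points at the Hadoop 3.1.3 install; on 3.1.3 it normally shows guava-27.0-jre.jar) is:

[lqs@bdc112 module]$ ls $HADOOP_HOME/share/hadoop/common/lib/ | grep guava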

5. Modify the log4j.properties file under conf

[lqs@bdc112 conf]$ pwd
/home/lqs/module/flume-1.9.0/conf
[lqs@bdc112 conf]$ vim log4j.properties

Modify it as follows
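The exact edit is not reproduced here; a minimal sketch of the kind of change usually made at this step, assuming the stock Flume 1.9.0 log4j.properties (the property names exist in that file, the values are assumptions based on the log directory used later in this post), is:

# write INFO-level logs to the log file
flume.root.logger=INFO,LOGFILE
# keep the log file under the Flume installation directory
flume.log.dir=/home/lqs/module/flume-1.9.0/logs
flume.log.file=flume.log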

Flume introductory cases

Case 1: Official case of monitoring port data

Introduction and requirements

The official case of monitoring port data. Requirement: use Flume to listen on a port, collect the data sent to that port, and print it to the console.

Implementation steps

1. Install the netcat tool (skip this step if it is already installed)

[lqs@bdc112 flume-1.9.0]$ sudo yum install -y nc

2. Create the file nc-flume-log.conf in Flume's job directory

[lqs@bdc112 flume-1.9.0]$ mkdir job
[lqs@bdc112 flume-1.9.0]$ touch job/nc-flume-log.conf
[lqs@bdc112 flume-1.9.0]$ vim job/nc-flume-log.conf

Add the following:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Source: http://flume.apache.org/FlumeUserGuide.html
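One optional tweak, not part of the original configuration: by default the logger sink prints only the first 16 bytes of each event body, so longer test messages show up truncated on the console. If you want to see more of the body, the sink accepts a maxBytesToLog setting, for example:

# optional: print up to 256 bytes of each event body instead of the default 16
a1.sinks.k1.maxBytesToLog = 256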

3. Start Flume so that it listens on the port
Writing method 1

[lqs@bdc112 flume-1.9.0]$ bin/flume-ng agent -c conf/ -n a1 -f job/nc-flume-log.conf -Dflume.root.logger=INFO,console

Writing method 2

[lqs@bdc112 flume-1.9.0]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/nc-flume-log.conf -Dflume.root.logger=INFO,console

Parameter description:
--conf: the directory where Flume's configuration files are kept; by default everything is under ${FLUME_HOME}/conf/. Can be abbreviated as -c.
--name: the name given to the agent; in this configuration file it is a1. Can be abbreviated as -n.
--conf-file: the configuration file Flume reads for this run. By default it would be looked for under conf/; mine is under job/. Can be abbreviated as -f.

Note:
-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property dynamically while Flume is running, here setting the console log level to INFO. Log levels include debug, info, warn, and error. Since the log parameters were already changed in log4j.properties above, this option no longer has to be typed every time.

4. Use the netcat tool to send content to port 44444 of the local machine
To check whether port 44444 is already in use: sudo netstat -nlp | grep 44444

[lqs@bdc112 flume-1.9.0]$ nc localhost 44444
ninhao
OK
flume
OK

# The other window (the one running Flume) displays
2021-12-27 14:14:11,148 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:166)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
2021-12-27 14:14:41,155 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 6E 69 6E 68 61 6F                               ninhao }
2021-12-27 14:14:46,816 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 66 6C 75 6D 65                                  flume }

Case 2

Description: monitor multiple appended files in a specified directory in real time

The Taildir source is suited to tailing multiple files that are being appended to in real time, and it can resume from where it left off (breakpoint resume).

Requirements and analysis

Use Flume to monitor the files appended to in real time across a whole directory and upload them to HDFS.
The requirement analysis is shown in the figure below.

Implementation configuration steps

1. Create the configuration file taildir-flume-hdfs.conf in the job directory

[lqs@bdc112 job]$ pwd
/home/lqs/module/flume-1.9.0/job
[lqs@bdc112 job]$ vim taildir-flume-hdfs.conf

And enter the following:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1 f2
# Each entry must point to specific files; a regular expression can be used to match multiple files
a1.sources.r1.filegroups.f1 = /home/lqs/module/flume-1.9.0/files1/.*file.*
a1.sources.r1.filegroups.f2 = /home/lqs/module/flume-1.9.0/files2/.*log.*
# Location of the position file used for breakpoint resume; the default location also works if left unchanged
a1.sources.r1.positionFile = /home/lqs/module/flume-1.9.0/taildir_position.json

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://bdc112:8020/flume-1.9.0/%Y%m%d/%H
#Prefix of uploaded file
a1.sinks.k1.hdfs.filePrefix = log-

#Use the local timestamp (required because the HDFS path above contains time escape sequences)
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#Number of events to accumulate before flushing to HDFS once
a1.sinks.k1.hdfs.batchSize = 100

#Set the file type; DataStream writes plain text (use CompressedStream if compression is needed)
a1.sinks.k1.hdfs.fileType = DataStream

#How often (in seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 30
#Roll each file at roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
#Rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
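Before starting the agent, make sure the two monitored directories referenced above actually exist; the post does not show this preparation step, so the command below is an assumed prerequisite:

[lqs@bdc112 flume-1.9.0]$ mkdir -p files1 files2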

2. Start the agent that monitors the folders

[lqs@bdc112 flume-1.9.0]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/taildir-flume-hdfs.conf

Append data to the monitored files

[lqs@bdc112 files1]$ pwd
/home/lqs/module/flume-1.9.0/files1
[lqs@bdc112 files1]$ echo test >> file7.txt
[lqs@bdc112 files1]$ echo demo >> file8.txt

3. View data
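The post does not show the command used to view the result; assuming the Hadoop client is available on this node, the uploaded files can be listed with something like the following, where the date/hour subdirectories follow the %Y%m%d/%H pattern configured in hdfs.path:

[lqs@bdc112 flume-1.9.0]$ hadoop fs -ls -R /flume-1.9.0/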

Note:
The area of a Linux file system where file metadata is stored is called the inode. Each inode has a number, and Unix/Linux systems use inode numbers, not file names, to identify files. The Taildir source identifies a file by its inode together with its full path, so after a file is renamed, the source treats it as a new file and, if the new name still matches the configured expression, the file's data will be read again.
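For reference, the position file configured earlier records exactly this inode/path/offset information as a JSON array; the inode numbers, offsets, and file names below are made-up illustration values:

[{"inode":1324871,"pos":12,"file":"/home/lqs/module/flume-1.9.0/files1/file7.txt"},{"inode":1324872,"pos":5,"file":"/home/lqs/module/flume-1.9.0/files2/log1.log"}]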

Check the Flume log

[lqs@bdc112 logs]$ pwd
/home/lqs/module/flume-1.9.0/logs
[lqs@bdc112 logs]$ tail -n -5 flume.log

Topics: Big Data Hadoop Distribution hdfs flume