Flume Agent Component Matching

Posted by sohdubom on Sun, 21 Nov 2021 19:51:24 +0100

1. Agent Components

An Agent is made up of three components: Source, Channel, and Sink.

1.1 Source

The Source component can handle various types and formats of log data.

Common sources in Flume:

  • avro
  • exec
  • netcat
  • spooling directory
  • taildir
Common Categories | Description
----------------- | -----------
avro | Listens on an Avro port and receives Events from external Avro client streams
exec | Runs a given Unix command at startup, such as tail -F <file>
netcat | Listens on a given port and converts each line of text into an Event
spooling directory | Monitors an entire directory and reads all files in it; does not support recursion or resuming from a breakpoint
taildir | Monitors changes to multiple files or directories and records reading progress

1.2 Channel

Channel is a buffer between Source and Sink that stores Events.

Channel is thread-safe and can handle write operations from multiple Sources and read operations from multiple Sinks simultaneously.

Channels commonly used in Flume:

  • Memory Channel
  • File Channel
  • Kafka Channel
Common Categories | Description
----------------- | -----------
Memory | Events are stored in an in-memory queue
File | Events are stored in files, so data loss is not a concern
Kafka | Events are stored in a Kafka cluster; Kafka provides high availability

1.3 Sink

Sink writes Events from the Channel in bulk to a storage or indexing system, or sends them to another Flume Agent.

Common sinks in Flume:

  • logger
  • hdfs
  • avro
  • HBase
Common Categories | Description
----------------- | -----------
logger | Console output at log level INFO; usually used for testing/debugging
hdfs | Events are written to HDFS
avro | Flume Events sent to the Sink are converted into Avro events and sent to the configured hostname/port
HBase | Events are written to HBase

2. Agent Component Matching

Agent components can be combined as needed, using different Sources, Sinks, and Channels.

Here are a few simple combinations.

exec2logger stands for "exec to logger": the 2 is read as "to", the common shorthand of substituting numbers for words, as in p2p or log4j.

You can start the Hadoop cluster first with start-dfs.sh.

The full Flume startup command is:

flume-ng agent --conf <Flume configuration path> --conf-file <absolute path to the agent configuration file> --name <agent name> -Dflume.root.logger=INFO,console

The above command can be abbreviated as:

flume-ng agent -c <Flume configuration path> -f <absolute path to the agent configuration file> -n <agent name> -Dflume.root.logger=INFO,console

2.1 Source

2.1.1 exec2logger

Common Configuration Items | Default value | Description
-------------------------- | ------------- | -----------
type | - | Set the type to exec
command | - | The command to execute, such as tail -F <file>

1️⃣ Step 1: Create a directory of your own under the Flume directory (mine is the agents directory) and add the configuration file exec2logger.conf

vim exec2logger.conf
# exec reads log files and sends data to logger sink

# Name the components on this agent
exec2logger.sources = r1
exec2logger.sinks = k1
exec2logger.channels = c1

# Describe/configure the source
exec2logger.sources.r1.type = exec
exec2logger.sources.r1.command = tail -F /root/log.txt

# Describe the sink
exec2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
exec2logger.channels.c1.type = memory
exec2logger.channels.c1.capacity = 1000
exec2logger.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
exec2logger.sources.r1.channels = c1
exec2logger.sinks.k1.channel = c1

2️⃣ Step 2: Start Flume exec2logger

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/exec2logger.conf -n exec2logger -Dflume.root.logger=INFO,console

3️⃣ Step 3: Test and view the console output. (Rather than creating the file with touch, you can use >, which is output redirection and creates the file directly if it does not exist.)
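
For example, a quick test session might look like this (the file path follows the config above; the log lines are arbitrary):

echo "hello flume" > /root/log.txt     # > creates the file if it does not exist
echo "hello again" >> /root/log.txt    # >> appends; tail -F picks up the new line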

2.1.2 spoolDir2logger

Common Configuration Items | Default value | Description
-------------------------- | ------------- | -----------
type | - | Set the type to spooldir
spoolDir | - | Absolute path of the directory to read

1️⃣ Step 1: Add configuration file spooldir2logger.conf

# spooldir reads multiple log files in the same directory and sends data to logger sink

# Name the components on this agent
spooldir2logger.sources = r1
spooldir2logger.sinks = k1
spooldir2logger.channels = c1

# Describe/configure the source
spooldir2logger.sources.r1.type = spooldir
spooldir2logger.sources.r1.spoolDir = /root/log

# Describe the sink
spooldir2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
spooldir2logger.channels.c1.type = memory
spooldir2logger.channels.c1.capacity = 1000
spooldir2logger.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
spooldir2logger.sources.r1.channels = c1
spooldir2logger.sinks.k1.channel = c1

2️⃣ Step 2: Start Flume spooldir2logger

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/spooldir2logger.conf -n spooldir2logger -Dflume.root.logger=INFO,console

3️⃣ Step 3: Test: once a file's content has been read completely, the suffix COMPLETED is appended to its name to mark it as finished.

4️⃣ Step 4: If we append content to a finished file, the changes are no longer picked up; see the sketch below.
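
A hypothetical test session (the directory follows the config above; the file name is illustrative):

echo "spooled line" > /root/log/app1.txt
# After Flume processes the file, it is renamed:
ls /root/log
# app1.txt.COMPLETED
# Appending to the completed file is NOT picked up:
echo "ignored line" >> /root/log/app1.txt.COMPLETED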

2.1.3 tailDir2logger

Common Configuration Items | Default value | Description
-------------------------- | ------------- | -----------
type | - | Set the type to TAILDIR
filegroups | - | Space-separated list of file group names
filegroups.<filegroupName> | - | File or directory pattern for the named file group
positionFile | ~/.flume/taildir_position.json | Absolute path of the file in which tailDir records its reading progress

1️⃣ Step 1: Add configuration file taildir2logger.conf

# tailDir reads multiple log files in the same directory and sends data to logger sink

# Name the components on this agent
taildir2logger.sources = r1
taildir2logger.sinks = k1
taildir2logger.channels = c1

# Describe/configure the source
taildir2logger.sources.r1.type = TAILDIR
taildir2logger.sources.r1.filegroups = g1 g2
taildir2logger.sources.r1.filegroups.g1 = /root/log.txt
taildir2logger.sources.r1.filegroups.g2 = /root/log/.*.txt

# Describe the sink
taildir2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
taildir2logger.channels.c1.type = memory
taildir2logger.channels.c1.capacity = 1000
taildir2logger.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
taildir2logger.sources.r1.channels = c1
taildir2logger.sinks.k1.channel = c1

2️⃣ Step 2: Start Flume taildir2logger

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/taildir2logger.conf -n taildir2logger -Dflume.root.logger=INFO,console

3️⃣ Step 3: Test 1: append content to a file and observe that the change is detected

4️⃣ Step 4: Test 2: changes to .txt files inside the directory are also detected

5️⃣ Step 5: taildir does not append COMPLETED to file names; instead it keeps its progress in a JSON file

6️⃣ Step 6: Formatting the JSON file shows an array of JSON objects; each object stores the inode that uniquely identifies a file, the pos where reading stopped, and the absolute path of the file
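
A formatted position file might look like this (the inode and pos values are illustrative):

[
    {"inode": 67493958, "pos": 1560, "file": "/root/log.txt"},
    {"inode": 67493960, "pos": 320, "file": "/root/log/a.txt"}
]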


7️⃣ Step 7: Append content to /root/log.txt and start Flume again. Flume first loads the last read position, resumes reading from there, and then records the new position in the JSON file

Interview Question: Which Source is commonly used for log collection?

Answer: We usually use the tailDir Source. We tried exec and spoolDir at first, but as the application grew there were more directories to watch, and tailDir resumes each read from where the last one left off. The main reason for choosing tailDir is that it records its read position in a JSON file as it goes; if the agent goes down, simply restart it and it continues from the last recorded position.

2.2 Channel

2.2.1 memory channel

Common Configuration Items | Default value | Description
-------------------------- | ------------- | -----------
type | - | Set the type to memory
capacity | 100 | Maximum number of Events held in the channel
transactionCapacity | 100 | Maximum number of Events the channel takes from a source or gives to a sink per transaction

capacity is estimated from the amount of data the channel may need to buffer (typically 20,000-200,000 events) and the JVM memory available to the agent (20 MB by default).

The default maximum heap is 20 MB. If one log entry is 512 B, roughly 40,000 events fit (20 × 1024 × 1024 ÷ 512 = 40,960), but capacity is usually not set that close to the limit, since a nearly full channel can easily run out of memory.

❓ The default JVM heap is small; how can it be increased?

⭐ You can enlarge it by modifying the Flume launch script: open /opt/flume-1.9.0/bin/flume-ng with vim and edit the JAVA_OPTS entry:
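
In the flume-ng script the heap is set via JAVA_OPTS; raising it might look like this (512 MB is an illustrative value):

# /opt/flume-1.9.0/bin/flume-ng (excerpt)
# JAVA_OPTS="-Xmx20m"     # original: 20 MB maximum heap
JAVA_OPTS="-Xmx512m"      # enlarged maximum heap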

2.2.2 File channel

File-type Channels are used when data-security requirements are high, to avoid data loss when the program crashes.

Common Configuration Items | Default value | Description
-------------------------- | ------------- | -----------
type | - | Set the type to file
checkpointDir | ~/.flume/file-channel/checkpoint | Absolute path of the checkpoint directory

If the program crashed before this start, the checkpoint in checkpointDir is read at startup, and the Sink continues from the last unread Event.
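
A minimal file channel sketch, assuming an agent named a1 and illustrative paths:

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume-1.9.0/data/checkpoint
a1.channels.c1.dataDirs = /opt/flume-1.9.0/data/log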

2.3 Sink

2.3.1 logger sink

Logs Events at the INFO level, usually for testing/debugging. Typically, when configuring Flume, you first use a logger sink to output Events to the console to check whether the Source has a problem, and only then switch the Sink type.

Common Configuration Items | Default value | Description
-------------------------- | ------------- | -----------
type | - | Set the type to logger

2.3.2 tailDir2hdfs

Common Configuration Items | Default value | Description
-------------------------- | ------------- | -----------
type | - | Set the type to hdfs
hdfs.path | - | HDFS path for the logs
hdfs.fileType | SequenceFile | File type; can be set to DataStream or CompressedStream
hdfs.writeFormat | Writable | Serialization format; can be set to Text
hdfs.useLocalTimeStamp | false | Whether to use the local timestamp
hdfs.rollInterval | 30 | Roll period, in seconds
hdfs.rollSize | 1024 | Roll the file once it reaches the specified size, in bytes
hdfs.rollCount | 10 | Roll after this many events have been written; 0 disables this condition

1️⃣ Step 1: Add configuration file taildir2hdfs.conf

# taildir reads multiple log files in the same directory and sends data to hdfs sink

# Name the components on this agent
taildir2hdfs.sources = r1
taildir2hdfs.sinks = k1
taildir2hdfs.channels = c1

# Describe/configure the source
taildir2hdfs.sources.r1.type = TAILDIR
taildir2hdfs.sources.r1.filegroups = g1 g2
taildir2hdfs.sources.r1.filegroups.g1 = /root/log.txt
taildir2hdfs.sources.r1.filegroups.g2 = /root/log/.*.txt

# Describe the sink
taildir2hdfs.sinks.k1.type = hdfs
taildir2hdfs.sinks.k1.hdfs.path = hdfs://node1:8020/flume/%y-%m-%d-%H-%M
# Change hdfs output file type to data stream
taildir2hdfs.sinks.k1.hdfs.fileType = DataStream
# Change output formatting to text formatting
taildir2hdfs.sinks.k1.hdfs.writeFormat = Text
taildir2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
taildir2hdfs.channels.c1.type = memory
taildir2hdfs.channels.c1.capacity = 30000
taildir2hdfs.channels.c1.transactionCapacity = 3000

# Bind the source and sink to the channel
taildir2hdfs.sources.r1.channels = c1
taildir2hdfs.sinks.k1.channel = c1

2️⃣ Step 2: Typing data by hand is too cumbersome, so write an infinite-loop script that appends data to the file; stop it after a while and check the file size (one way to run it is shown after the listing)

[root@node1 ~]# vim printer.sh
[root@node1 ~]# chmod a+x printer.sh
[root@node1 ~]# cat printer.sh
while true
do
        echo '123123123123123123123123123123123123123123123123123123123123' >> log.txt
done
[root@node1 ~]# ll -h
total 17M
drwxr-xr-x. 3 root root  65 Nov 21 16:51 log
-rw-r--r--. 1 root root 14M Nov 21 23:29 log.txt
-rwxr-xr-x. 1 root root 134 Nov 21 23:22 printer.sh
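
One way to run it, assuming the script is started from /root:

./printer.sh &     # run the generator in the background
# ...let it write for a while...
kill %1            # stop the background job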

3️⃣ Step 3: Start HDFS with start-dfs.sh, then start Flume taildir2hdfs; you can see it is already shipping logs at full speed

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/taildir2hdfs.conf -n taildir2hdfs -Dflume.root.logger=INFO,console

4️⃣ Step 4: Look at the web UI: /flume/ contains a directory named after the current time, and inside it are nothing but small files

5️⃣ Step 5: The HDFS block size is 128 MB, so configuration needs to be added to make fuller use of each block.

# Set the file roll conditions -------------------------------
# Roll period, in seconds
taildir2hdfs.sinks.k1.hdfs.rollInterval = 300
# Roll once the file reaches the specified size; 128 MB is 134,217,728 B, but a block is normally not filled completely
taildir2hdfs.sinks.k1.hdfs.rollSize = 130000000
# Roll after this many records are written; 0 disables this condition
taildir2hdfs.sinks.k1.hdfs.rollCount = 0

6️⃣ Step 6: Restart Flume after the modification and view the changes in the web UI

2.3.3 tailDir2avro-avro2hdfs

Two servers are required here. If only one has Flume configured, you can copy the installation to the other server via scp -rq <file> <hostname>:<path>.

scp -rq /opt/flume-1.9.0/ node2:/opt/

Implement the following topology:

avro sink common configuration items | Default value | Description
------------------------------------ | ------------- | -----------
type | - | Set the type to avro
hostname | - | Hostname or IP address to send to
port | - | Port number to send to

avro source common configuration items | Default value | Description
--------------------------------------- | ------------- | -----------
type | - | Set the type to avro
bind | - | Hostname or IP address to listen on
port | - | Port number to listen on

1️⃣ Step 1: On the node1 host, add configuration file tailDir2avro.conf

# taildir reads multiple log files in the same directory and sends data to avro sink

# Name the components on this agent
taildir2avro.sources = r1
taildir2avro.sinks = k1
taildir2avro.channels = c1

# Describe/configure the source
taildir2avro.sources.r1.type = TAILDIR
taildir2avro.sources.r1.filegroups = g1 g2
taildir2avro.sources.r1.filegroups.g1 = /root/log.txt
taildir2avro.sources.r1.filegroups.g2 = /root/log/.*.txt

# Describe the sink
taildir2avro.sinks.k1.type = avro
taildir2avro.sinks.k1.hostname = node2
taildir2avro.sinks.k1.port = 12345

# Use a channel which buffers events in memory
taildir2avro.channels.c1.type = memory
taildir2avro.channels.c1.capacity = 30000
taildir2avro.channels.c1.transactionCapacity = 3000

# Bind the source and sink to the channel
taildir2avro.sources.r1.channels = c1
taildir2avro.sinks.k1.channel = c1

2️⃣ Step 2: On the node2 host, add configuration file avro2hdfs.conf

# avro listens on a port and sends data to hdfs sink

# Name the components on this agent
avro2hdfs.sources = r1
avro2hdfs.sinks = k1
avro2hdfs.channels = c1

# Describe/configure the source
avro2hdfs.sources.r1.type = avro
avro2hdfs.sources.r1.bind = node2
avro2hdfs.sources.r1.port = 12345

# Describe the sink
avro2hdfs.sinks.k1.type = hdfs
avro2hdfs.sinks.k1.hdfs.path = hdfs://node1:8020/flume/%y-%m-%d-%H-%M
avro2hdfs.sinks.k1.hdfs.fileType = DataStream
avro2hdfs.sinks.k1.hdfs.writeFormat = Text
avro2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
avro2hdfs.sinks.k1.hdfs.rollInterval = 300
avro2hdfs.sinks.k1.hdfs.rollSize = 130000000
avro2hdfs.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
avro2hdfs.channels.c1.type = memory
avro2hdfs.channels.c1.capacity = 30000
avro2hdfs.channels.c1.transactionCapacity = 3000

# Bind the source and sink to the channel
avro2hdfs.sources.r1.channels = c1
avro2hdfs.sinks.k1.channel = c1

3️⃣ Step 3: Start Flume on node2

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/avro2hdfs.conf -n avro2hdfs -Dflume.root.logger=INFO,console

4️⃣ Step 4: Check whether port 12345 on node2 is being listened on, and make sure it is before proceeding
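
One way to check, assuming net-tools (or iproute2) is installed on node2:

netstat -tlnp | grep 12345
# or, with ss:
ss -tlnp | grep 12345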

5️⃣ Step 5: Run the printer.sh script on node1 for a while, then start Flume on node1

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/taildir2avro.conf -n taildir2avro -Dflume.root.logger=INFO,console

6️⃣ Step 6: View the node2 console output: node1 is successfully sending data to port 12345 on node2

7️⃣ Step 7: Check on the web side whether the data has arrived

It's running!

3. Closing Remarks

When you start a program, it sometimes occupies the terminal, and closing the terminal kills the program. So we sometimes need to send the program to the background for execution by appending an & to the command.

If the program prints to the console and you don't want that output on screen all the time, you can prefix the command with nohup and redirect the output to a file so the program runs silently.
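
For example, a background, silenced Flume launch might look like this (the log file path is illustrative):

nohup flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/exec2logger.conf -n exec2logger > /root/flume.log 2>&1 &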


❤️ END ❤️

Topics: Linux CentOS Hadoop flume