Big data learning tutorial SD version Chapter 9 [Flume]

Posted by mikeylikesyou on Tue, 28 Dec 2021 16:52:10 +0100

Flume is a log collection tool; the focus here is mainly on using it, since it is just a tool!

A distributed framework for collecting, aggregating, and moving streaming data.

Data is collected by writing a collection scheme, i.e. a configuration file; the available configuration options are described in the official documentation.

1. Flume architecture

  • Agent: a JVM process
  1. Source: receives data
  2. Channel: buffer
  3. Sink: outputs data
  • Event: the unit of transmission

2. Flume installation

With the Java and Hadoop environment variables configured in advance, just decompress the archive and use it!
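
A minimal sketch of the install steps, assuming Flume 1.9.0 and the paths below (the archive name, /opt/module, and the JDK path are placeholders; adjust them to your environment):

# Decompress to the install directory (paths are assumptions)
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
cd /opt/module/apache-flume-1.9.0-bin
# Point Flume at the local JDK via flume-env.sh (created from the template)
cp conf/flume-env.sh.template conf/flume-env.sh
echo 'export JAVA_HOME=/opt/module/jdk1.8.0_212' >> conf/flume-env.sh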

3. Flume official example

The official documentation provides example configurations for the different sources, channels, and sinks.

# example.conf : port -> console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command

bin/flume-ng agent -c conf -f jobs/example.conf -n a1 -Dflume.root.logger=INFO,console

Send data

# yum install -y nc
nc localhost 44444

4. Flume examples

4.1 File New Content -> HDFS

Append newly written file content to HDFS; this approach cannot resume from a breakpoint.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/test.log

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.fileType = DataStream

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Startup

bin/flume-ng agent -c conf -f jobs/log2hdfs.conf -n a1

4.2 Dir New File -> HDFS

Collect new files that appear in a directory into HDFS; changes to the contents of existing files are not monitored.

a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = c1
a1.sources.src-1.spoolDir = /data/data1
a1.sources.src-1.fileHeader = true

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

Startup

bin/flume-ng agent -c conf -f jobs/file2hdfs.conf -n a1

4.3 Dir New File And Content -> HDFS

Monitors both new files and content changes of files in multiple directories and writes them to HDFS, with support for resuming from a breakpoint. Note that log4j renames its log file on rollover, and a renamed file is uploaded again.

a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /data/data2/.*file.*
a1.sources.r1.filegroups.f2 = /data/data3/.*log.*
a1.sources.r1.maxBatchCount = 1000

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events2/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

Startup

bin/flume-ng agent -c conf -f jobs/dir2hdfs.conf -n a1

[{"inode": 786450, "pos": 1501, "file": "/ data/data2/file1.txt"}] the source code locates a file according to inode and file

If the problem of renaming the file is handled, modify tailfile Java 123 and reliabletaildireventreader Repackage Java 256 and replace the jar package of taildersource under libs

5. Flume transactions

The Source pushes events into the Channel and the Sink pulls events from the Channel; both sides first go through a temporary buffer (putList / takeList).

  1. Source -> Channel: doPut writes events into the putList; on rollback the buffered data in the putList is simply discarded, so data may be lost (unless the source keeps a position record, e.g. Taildir, in which case it will not). A sketch using the Transaction API follows this list.

  2. Channel -> Sink: doTake pulls events into the takeList; on rollback the pulled events are written back into the channel queue, so duplicate data is possible.
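
A put-side sketch of the transaction described in item 1, using the public Transaction API directly rather than Flume's internal code (the class and method names are illustrative):

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.event.EventBuilder;

import java.nio.charset.StandardCharsets;

public class PutTransactionDemo {

    // doPut side: the event is first buffered; commit moves it into the channel
    // queue, rollback discards the buffered data, so data can be lost unless the
    // source keeps a position record (e.g. Taildir).
    public static void putOne(Channel channel, String line) {
        Event event = EventBuilder.withBody(line, StandardCharsets.UTF_8);
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            channel.put(event);   // doPut: buffered in the putList
            txn.commit();         // putList is flushed into the channel queue
        } catch (Throwable t) {
            txn.rollback();       // buffered events are dropped
            throw t;
        } finally {
            txn.close();
        }
    }
}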

6. Flume Agent principle

  1. The Source receives data
  2. Source -> Channel Processor: processes events
  3. Channel Processor -> Interceptor: event interception and filtering
  4. Channel Processor -> Channel Selector: Replicating (default) or Multiplexing
  5. Channel Processor -> Channel(s): write the event to the channel(s)
  6. Channel -> Sink Processor: three types: Default (one Sink), LoadBalancing, and Failover (see the config sketch after this list)
  7. Sink Processor -> Sink: the sink writes the data out
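
As a sketch of the sink processor in step 6, a Failover group is declared through a sink group (the names a1, g1, k1, k2 and the priority values below are illustrative):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# The sink with the higher priority is used first; on failure Flume fails over to the next one
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

Load balancing uses the same wiring with processor.type = load_balance and an optional processor.selector = round_robin or random.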

7. Flume topology

Connect multiple flume agents with Avro

Polling strategy: if a Sink fails to pull data, another Sink is tried.

  1. Simple chaining: Sink -> Source
  2. Replicating and multiplexing: one Source -> multiple Channels -> multiple Sinks
  3. Load balancing and failover: one Channel -> multiple Sinks
  4. Aggregation: multiple Sinks -> one Source

8. Flume custom Interceptor

Customize an Interceptor to implement multiplexing:

  1. Events enter different Channels based on their header information

  2. Events whose content contains Error or Exception enter one Channel; all others enter another Channel

  3. The Sink of each Channel outputs to the console

  1. Write the custom Interceptor
package com.ipinyou.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Custom interceptor: tags each event with a "type" header ("error" or "normal")
 * so that a multiplexing channel selector can route it to different channels.
 */
public class TypeInterceptor implements Interceptor {

    private List<Event> eventList;

    @Override
    public void initialize() {
        eventList = new ArrayList<>();
    }

    @Override
    public Event intercept(Event event) {
        // Single-event interception: tag the event header based on its body
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        if (body.contains("Error") || body.contains("Exception")) {
            headers.put("type", "error");
        } else {
            headers.put("type", "normal");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        // Batch interception: apply the single-event logic to each event
        eventList.clear();
        for (Event event : list) {
            eventList.add(intercept(event));
        }
        return eventList;
    }

    @Override
    public void close() {

    }

    public static class Builder implements Interceptor.Builder{

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}
  2. Package the jar and upload it to Flume's lib directory
  3. Write the collection scheme

flume-s1-s2.conf

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1 c2
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.ipinyou.flume.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.normal = c2

a1.channels = c1 c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 10000
a1.channels.c2.byteCapacityBufferPercentage = 20
a1.channels.c2.byteCapacity = 800000


a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 7771
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 7772

flume-console1.conf

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 7771

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

flume-console2.conf

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 7772

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Startup

# Start in order: hadoop103, hadoop104, hadoop102
bin/flume-ng agent -c conf -f jobs/flume-console1.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-console2.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-s1-s2.conf -n a1

9. Flume custom Source

  • Coding implementation (see the sketch after this list)
  1. The custom class extends AbstractSource and implements Configurable and PollableSource
  2. Implement configure(): read the configuration file
  3. Implement process(): receive external data, wrap it into events, and write them to the channel(s)
  • Package the jar and put it under Flume's lib directory
  • Write the configuration file

In the configuration file, set the source type to the fully qualified class name.

  • Startup
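
A minimal sketch of such a custom source, assuming the hypothetical class com.ipinyou.flume.source.MySource and a made-up prefix property; it simply generates one line of data per second:

package com.ipinyou.flume.source;

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

import java.nio.charset.StandardCharsets;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;   // hypothetical property read from the collection scheme

    @Override
    public void configure(Context context) {
        // 2. Read the configuration file
        prefix = context.getString("prefix", "log-");
    }

    @Override
    public Status process() throws EventDeliveryException {
        try {
            // 3. Receive (here: generate) external data, wrap it into an event
            //    and hand it to the channel processor, which writes it to the channel(s)
            String data = prefix + System.currentTimeMillis();
            getChannelProcessor().processEvent(
                    EventBuilder.withBody(data, StandardCharsets.UTF_8));
            Thread.sleep(1000);
            return Status.READY;
        } catch (Exception e) {
            return Status.BACKOFF;
        }
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 1000;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 5000;
    }
}

In the collection scheme this source would then be used as a1.sources.r1.type = com.ipinyou.flume.source.MySource together with a1.sources.r1.prefix = mydata- (both names are hypothetical).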

10. Flume custom Sink

  • Coding implementation (see the sketch after this list)
  1. The custom class extends AbstractSink and implements Configurable

  2. Implement configure(): read the configuration file

  3. Implement process(): take data from the Channel inside a transaction and write it to the corresponding destination

  • The remaining steps are the same as for the custom Source above
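
A minimal sketch of such a custom sink, assuming the hypothetical class com.ipinyou.flume.sink.MySink; it takes one event per call from the channel inside a transaction and just logs the body in place of a real destination:

package com.ipinyou.flume.sink;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {

    private static final Logger LOG = LoggerFactory.getLogger(MySink.class);

    private String prefix;   // hypothetical property read from the collection scheme

    @Override
    public void configure(Context context) {
        // 2. Read the configuration file
        prefix = context.getString("prefix", "sink-");
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();                          // 3. Open the transaction
        try {
            Event event = channel.take();     // pull one event from the channel
            if (event == null) {
                txn.commit();
                return Status.BACKOFF;        // nothing to take, back off
            }
            // Write to the "corresponding location"; here we just log the body
            LOG.info("{}{}", prefix, new String(event.getBody()));
            txn.commit();
            return Status.READY;
        } catch (Throwable t) {
            txn.rollback();                   // taken events go back into the channel
            throw new EventDeliveryException("failed to deliver event", t);
        } finally {
            txn.close();
        }
    }
}

The remaining steps match the list above: package the class, drop the jar into lib, and set a1.sinks.k1.type = com.ipinyou.flume.sink.MySink (hypothetical class name) in the collection scheme.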

11. Flume monitoring

Use the third-party open-source tool Ganglia.

Ganglia: the web module displays the data, gmetad stores the data, and gmond collects the data.

11.1 Ganglia installation

  1. Install
# 102 103 104
yum install -y epel-release
# 102
yum install -y ganglia-gmetad
yum install -y ganglia-web
yum install -y ganglia-gmond
# 103 104
yum install -y ganglia-gmond
  2. Modify the configuration files

/etc/httpd/conf.d/ganglia.conf

# Under the Location block, allow the IP of the machine whose browser will access the web UI (e.g. the Windows host)
Require ip 192.168.xxx.xxx

/etc/ganglia/gmetad.conf

data_source "my cluster" hadoop102

/etc/ganglia/gmond.conf: distribute to hadoop102, hadoop103, and hadoop104

# Modify the following configuration
name = "my cluster"
host = hadoop102
bind = 0.0.0.0

Disable SELinux in /etc/selinux/config (takes effect after a restart), or make it take effect temporarily:

SELINUX=disabled
# Take effect temporarily
setenforce 0

11.2 Ganglia startup

# If permissions are insufficient, adjust them
chmod -R 777 /var/lib/ganglia
# hadoop102
systemctl start gmond
systemctl start httpd
systemctl start gmetad

# hadoop103 hadoop104
systemctl start gmond

Browser open Web UI:

http://hadoop102/ganglia

11.3 Flume start

bin/flume-ng agent -n a1 -c conf -f jobs/xxx \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649
