Big data journey for beginners who fight monsters and level up <Flume advanced>

Posted by xgab on Tue, 25 Jan 2022 12:08:08 +0100

Xiaobai's big data journey (73)

Flume advanced

Recap of the previous chapter

The previous chapter introduced Flume's internal principles. This chapter covers Flume's extended features; the focus is understanding and learning to use Flume's custom (user-defined) components.

Custom components

Since the previous chapter introduced Flume's internal principles, we can now build custom components based on them. For example, the multiplexing channel selector mentioned in the previous chapter needs to be used together with a custom Interceptor.

Custom Interceptor

In actual development, custom interceptors are a common technique. Combined with a channel selector, they let us classify log information and store it separately. The following case simulates and implements this function.

Case requirements:

  • Using Flume to collect server local logs
  • You need to send different kinds of logs to different analysis systems according to different log types

Case study:

  • In actual development, there may be many types of logs generated by a server, and different types of logs may need to be sent to different analysis systems
  • Multiplexing in the Flume topology is used
  • The principle of Multiplexing is to send different events to different channels according to the value of a key in the event Header. Therefore, we need to customize an Interceptor to assign different values to the keys in the headers of different types of events
  • In this case, we use port data to simulate logs, and use lines that contain "hello" versus lines that do not to represent different log types. We customize an interceptor to distinguish the log contents and send them to different analysis systems (channels)


Case realization:

  • Since it is a custom interceptor, we need to write Java code, create a Maven project, and then import dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
    </dependency>
</dependencies>
  • Custom interceptor class
package com.company.myinterceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MyInterceptor implements Interceptor {
    // Collection that stores the processed events
    private List<Event> addHeaderEvents;

    // Initialize interceptor
    @Override
    public void initialize() {
        // Initialize collection
        addHeaderEvents = new ArrayList<>();
    }

    // Handling a single event
    @Override
    public Event intercept(Event event) {
        // Get the header information of the event
        Map<String, String> headers = event.getHeaders();

        // Get the body information of the event
        String eventBody = new String(event.getBody());

        if (eventBody.contains("hello")){
            headers.put("type","hello");
        }else {
            headers.put("type","other");
        }
        return event;
    }

    // Batch processing events
    @Override
    public List<Event> intercept(List<Event> events) {
        // Empty collection
        addHeaderEvents.clear();

        // Traverse events and add header information for each event
        for (Event event : events) {
            // Add header information
            addHeaderEvents.add(intercept(event));

        }

        // Return the processed events
        return addHeaderEvents;
    }

    // close resource
    @Override
    public void close() {

    }

    // Builder class that Flume calls to construct the interceptor (referenced in the configuration file)
    public static class MyBuilder implements Interceptor.Builder{

        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}
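
Before packaging the jar, it can help to sanity-check the interceptor with a plain main method. The snippet below is only a sketch and is not part of the original case: the test class name is made up, and it relies only on the MyInterceptor class above plus the EventBuilder helper that ships with flume-ng-core.

package com.company.myinterceptor;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

import java.nio.charset.StandardCharsets;

// Hypothetical local check for MyInterceptor (not part of the original case)
public class MyInterceptorTest {
    public static void main(String[] args) {
        MyInterceptor interceptor = new MyInterceptor();
        interceptor.initialize();

        // Build two sample events with EventBuilder from flume-ng-core
        Event helloEvent = EventBuilder.withBody("hello world", StandardCharsets.UTF_8);
        Event otherEvent = EventBuilder.withBody("hive", StandardCharsets.UTF_8);

        // After interception, the "type" header decides which channel the event goes to
        System.out.println(interceptor.intercept(helloEvent).getHeaders()); // {type=hello}
        System.out.println(interceptor.intercept(otherEvent).getHeaders()); // {type=other}

        interceptor.close();
    }
}

If the two headers print as {type=hello} and {type=other}, the multiplexing selector in the configuration below will route the events to c1 and c2 respectively.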

Package the written interceptor class, rename the jar to myInterceptor.jar, and put it under /opt/module/flume/lib

Configure the Flume files: first create a folder on hadoop102, hadoop103 and hadoop104 to store the configuration files

mkdir /opt/module/flume/job/group4

Flume1

vim /opt/module/flume/job/group4/flume1-netcat.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.company.myinterceptor.MyInterceptor$MyBuilder



a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.hello = c1
a1.sources.r1.selector.mapping.other = c2




# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4142

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100


# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Flume2

vim /opt/module/flume/job/group4/flume2-console1.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 4141

a2.sinks.k1.type = logger

a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1

Flume3

vim /opt/module/flume/job/group4/flume3-console2.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4142

a3.sinks.k1.type = logger

a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

a3.sinks.k1.channel = c1
a3.sources.r1.channels = c1

Run Flume

# flume3
flume-ng agent -n a3 -c /opt/module/flume/conf/ -f /opt/module/flume/job/group4/flume3-console2.conf -Dflume.root.logger=INFO,console

# flume2
flume-ng agent -n a2 -c /opt/module/flume/conf/ -f /opt/module/flume/job/group4/flume2-console1.conf -Dflume.root.logger=INFO,console

# flume1
flume-ng agent -c /opt/module/flume/conf/ -f /opt/module/flume/job/group4/flume1-netcat.conf -n a1 -Dflume.root.logger=INFO,console

test data

# Simulate data generation: use netcat to send data to port 44444
nc localhost 44444
hello
hello world
hive
flume

Custom source

Flume officially provides many sources, such as the exec source, avro source, Taildir source, and so on. If the official sources cannot meet our needs, we can customize a source (so far I have not run into a need that the official sources could not solve).

Description document of custom source: https://flume.apache.org/FlumeDeveloperGuide.html#source

Custom source case implementation

Case requirements:

  • Use Flume to receive data, add a prefix to each piece of data, and output it to the console
  • The prefix can be configured from the flume configuration file

Demand analysis:

  • Because Flume completes data transmission mainly through its configuration files, to customize a Source we need to know which class to implement and which methods to override
  • From the source code, a custom Source extends the abstract class AbstractSource and implements the Configurable and PollableSource interfaces
  • Its main methods are:
    • configure(Context context) // initialize the context (read the contents of the configuration file)
    • process() // wrap the collected data into events and write them to the channel; this method is called in a loop

Case realization:
  • The first step is to create a maven project and import dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
    </dependency>
</dependencies>

MySource code

package com.company;

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.util.HashMap;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    //Define the fields that the configuration file will read in the future
    private Long delay;
    private String field;

    //Initialize configuration information
    @Override
    public void configure(Context context) {
        delay = context.getLong("delay");
        field = context.getString("field", "Hello!");
    }

    @Override
    public Status process() throws EventDeliveryException {

        try {
            //Create event header information
            HashMap<String, String> headerMap = new HashMap<>();
            //Create event
            SimpleEvent event = new SimpleEvent();
            //Loop encapsulation event
            for (int i = 0; i < 5; i++) {
                //Set header information for events
                event.setHeaders(headerMap);
                //Set content for events
                event.setBody((field + i).getBytes());
                //Write events to channel
                getChannelProcessor().processEvent(event);
                Thread.sleep(delay);
            }
        } catch (Exception e) {
            e.printStackTrace();
            return Status.BACKOFF;
        }
        return Status.READY;
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }
}
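
As with the interceptor, MySource can be exercised locally before it is packaged. The harness below is only a sketch and is not part of the original case (the class name and the delay/field values are made up): it wires the source to a MemoryChannel through a ReplicatingChannelSelector and a ChannelProcessor, which are standard flume-ng-core classes, calls process() once, and reads the events back.

package com.company;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.channel.MemoryChannel;
import org.apache.flume.channel.ReplicatingChannelSelector;
import org.apache.flume.conf.Configurables;

import java.util.Collections;

// Hypothetical local check for MySource (not part of the original case)
public class MySourceTest {
    public static void main(String[] args) throws Exception {
        // Memory channel that will receive the events produced by MySource
        Channel channel = new MemoryChannel();
        Configurables.configure(channel, new Context());
        channel.start();

        // Wire the channel to the source through a replicating selector
        ReplicatingChannelSelector selector = new ReplicatingChannelSelector();
        selector.setChannels(Collections.singletonList(channel));

        MySource source = new MySource();
        source.setChannelProcessor(new ChannelProcessor(selector));

        // Same keys as in the flume configuration file (values are examples)
        Context context = new Context();
        context.put("delay", "100");
        context.put("field", "test-");
        source.configure(context);

        // One call to process() writes 5 events into the channel
        source.process();

        // Take the events back out inside a transaction and print them
        Transaction txn = channel.getTransaction();
        txn.begin();
        Event event;
        while ((event = channel.take()) != null) {
            System.out.println(new String(event.getBody())); // test-0 ... test-4
        }
        txn.commit();
        txn.close();
        channel.stop();
    }
}

This mirrors what a running agent does internally, so it also confirms that the delay and field keys read in configure() match the ones used in the configuration file below.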

Package the written Source class, rename the jar to mySource.jar, and put it under /opt/module/flume/lib

Configure the Flume file: first create a folder on hadoop102, hadoop103 and hadoop104 to store the configuration file

mkdir /opt/module/flume/job/group5

This requirement is relatively simple; only one Flume agent is needed

vim /opt/module/flume/job/group5/mysource.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = com.company.MySource
a1.sources.r1.delay = 1000
#a1.sources.r1.field = atguigu

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run flume

flume-ng agent -c /opt/module/flume/conf/ -f /opt/module/flume/job/group5/mysource.conf -n a1 -Dflume.root.logger=INFO,console

Custom sink

The same goes for a custom sink: when the built-in sinks cannot meet business needs, we customize one. For example, a sink that writes data to MySQL (Sqoop, which is specifically used for moving data into databases, will be introduced later).

Description document address of custom sink: https://flume.apache.org/FlumeDeveloperGuide.html#sink

Custom Sink case implementation

Case requirements

  • Use Flume to receive data, add a prefix and suffix to each piece of data at the Sink end, and output it to the console
  • The prefix and suffix can be configured in the flume task configuration file

Demand analysis:

  • Similarly, we need to understand how a Sink is implemented. From the source code, a custom Sink extends the AbstractSink class and implements the Configurable interface
  • The main methods to implement are:
    • configure(Context context) // initialize the context (read the contents of the configuration file)
    • process() // read data (events) from the Channel; this method is called in a loop

Case realization

  • The first step is to create a maven project and import dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
    </dependency>
</dependencies>

MySink code

package com.company;

import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {

    //Create Logger object
    private static final Logger LOG = LoggerFactory.getLogger(MySink.class);

    private String prefix;
    private String suffix;

    @Override
    public Status process() throws EventDeliveryException {

        //Declare return value status information
        Status status;

        //Gets the Channel bound by the current Sink
        Channel ch = getChannel();

        //Get transaction
        Transaction txn = ch.getTransaction();

        //Declare event
        Event event;

        //Open transaction
        txn.begin();

        //Take from the Channel in a loop until an event is read
        while (true) {
            event = ch.take();
            if (event != null) {
                break;
            }
        }
        try {
            //Handling events (printing)
            LOG.info(prefix + new String(event.getBody()) + suffix);

            //Transaction commit
            txn.commit();
            status = Status.READY;
        } catch (Exception e) {

            //Exception encountered, transaction rolled back
            txn.rollback();
            status = Status.BACKOFF;
        } finally {

            //Close transaction
            txn.close();
        }
        return status;
    }

    @Override
    public void configure(Context context) {

        //Read the contents of the configuration file, with default values
        prefix = context.getString("prefix", "hello:");

        //Read the contents of the configuration file, no default value
        suffix = context.getString("suffix");
    }
}
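
MySink can also be checked locally before packaging. The snippet below is a sketch that is not part of the original case (the class name and the prefix/suffix values are invented): it pre-loads one event into a MemoryChannel, points the sink at that channel and calls process() once. An SLF4J binding (for example the log4j jars that ship with Flume) has to be on the classpath for the LOG.info line to actually print.

package com.company;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Transaction;
import org.apache.flume.channel.MemoryChannel;
import org.apache.flume.conf.Configurables;
import org.apache.flume.event.EventBuilder;

import java.nio.charset.StandardCharsets;

// Hypothetical local check for MySink (not part of the original case)
public class MySinkTest {
    public static void main(String[] args) throws Exception {
        // Memory channel that feeds the sink
        Channel channel = new MemoryChannel();
        Configurables.configure(channel, new Context());
        channel.start();

        // Put one event into the channel inside a transaction
        Transaction txn = channel.getTransaction();
        txn.begin();
        channel.put(EventBuilder.withBody("flume", StandardCharsets.UTF_8));
        txn.commit();
        txn.close();

        // Configure the sink with the same keys used in mysink.conf (values are examples)
        MySink sink = new MySink();
        Context context = new Context();
        context.put("prefix", "start:");
        context.put("suffix", ":end");
        sink.configure(context);
        sink.setChannel(channel);

        // process() takes the event from the channel and logs "start:flume:end"
        sink.process();

        channel.stop();
    }
}

The expected output is a single log line containing start:flume:end.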

Package the written Sink class, rename the jar to mySink.jar, and put it under /opt/module/flume/lib

Configure the Flume file: first create a folder on hadoop102, hadoop103 and hadoop104 to store the configuration file

mkdir /opt/module/flume/job/group6

This requirement is relatively simple; only one Flume agent is needed

vim /opt/module/flume/job/group6/mysink.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = com.company.MySink
#a1.sinks.k1.prefix = atguigu:
a1.sinks.k1.suffix = :atguigu

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run flume

flume-ng agent -c /opt/module/flume/conf/ -f /opt/module/flume/job/group6/mysink.conf -n a1 -Dflume.root.logger=INFO,console

Monitor Flume using Ganglia

Ganglia allows us to easily monitor the running status of our Flume through web pages
Ganglia consists of gmond, gmetad and gweb

  • Gmond (Ganglia Monitoring Daemon) is a lightweight service installed on each node host that needs to collect indicator data. Using gmond, you can easily collect a lot of system index data, such as CPU, memory, disk, network and active process data.
  • gmetad (Ganglia Meta Daemon) is a service that integrates all information and stores it to disk in RRD format
  • gweb (Ganglia Web) Ganglia visualization tool. gweb is a PHP front end that uses a browser to display the data stored in gmetad. In the Web interface, a variety of different indicator data collected under the running state of the cluster are displayed in the form of charts

Ganglia installation and deployment

Step 1: install the EPEL repository on all three nodes

sudo yum install -y epel-release

Step 2: install the web, meta and monitor packages on hadoop102

sudo yum -y install ganglia-gmetad ganglia-web ganglia-gmond

Step 3: install the monitor package on hadoop103 and hadoop104

sudo yum -y install ganglia-gmond

Step 4: modify the ganglia configuration file on hadoop102

sudo vim /etc/httpd/conf.d/ganglia.conf

<Location /ganglia>
  Require ip 192.168.5.1
  Require all granted
</Location>

Step 5: modify the gmetad configuration file on hadoop102

sudo vim /etc/ganglia/gmetad.conf

data_source "hadoop102" hadoop102

Step 6: modify the gmond configuration file on hadoop102

sudo vim /etc/ganglia/gmond.conf

cluster {
  name = "hadoop102"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71
  host = hadoop102
  port = 8649
  ttl = 1
}
udp_recv_channel {
  # mcast_join = 239.2.11.71
  port = 8649
  bind = 0.0.0.0
  retry_bind = true
}

Step 7: synchronize the modified configuration files to the other nodes

xsync  /etc/httpd
xsync  /etc/ganglia

Step 8: modify the SELinux configuration file on hadoop102 (the change only takes effect after a reboot, hence the next step)

sudo vim /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted

Step 9: restart each node server

sudo reboot 

Step 10: start ganglia

# hadoop102
sudo systemctl start httpd
sudo systemctl start gmetad
sudo systemctl start gmond
# hadoop103/hadoop104
sudo systemctl start gmond

Monitor Flume using ganglia

Open in a browser: http://192.168.1.102/ganglia

Let's run a previous example, taking the netcat flume job as an example

vim /opt/module/flume/job/netcat-flume.conf
# Define Name
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Specify source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Specify sink
a1.sinks.k1.type = logger

# Specifies the Channel buffer type
a1.channels.c1.type = memory

# Maximum channel capacity
a1.channels.c1.capacity = 1000

# Maximum number of events per transaction
a1.channels.c1.transactionCapacity = 100

# Bind source,sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start flume

cd /opt/module/flume
flume-ng agent \
--conf conf/ \
--name a1 \
--conf-file job/netcat-flume.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649

Send data to port 44444

nc localhost 44444
hello
world
... keep sending, then view the effect on the Ganglia web page


Legend Description:

Field (chart name) and its meaning:

  • EventPutAttemptCount: total number of events the source attempted to write to the channel
  • EventPutSuccessCount: total number of events successfully written to the channel and committed
  • EventTakeAttemptCount: total number of times the sink attempted to take events from the channel
  • EventTakeSuccessCount: total number of events the sink successfully read
  • StartTime: the time the channel started (ms)
  • StopTime: the time the channel stopped (ms)
  • ChannelSize: the number of events currently in the channel
  • ChannelFillPercentage: percentage of the channel capacity in use
  • ChannelCapacity: the capacity of the channel

summary

That is all for Flume's knowledge points. In actual use, the main work is writing the conf files correctly, understanding Flume's architecture, and following the official examples when customization is needed. Once again, Flume's official documentation is recommended; it is really easy to use.

Topics: Hadoop hdfs flume