Kafka installation and start-up

Posted by Osiris Beato on Wed, 18 Sep 2019 05:49:10 +0200

Kafka's background has been covered at length elsewhere, so let's get practical. This walkthrough assumes you have neither a Kafka nor a ZooKeeper environment yet.

Step 1: Download the code

Download the 2.3.0 release and un-tar it.

> tar -xzf kafka_2.12-2.3.0.tgz
> cd kafka_2.12-2.3.0

Step 2: Start the service

Running Kafka requires ZooKeeper, so you need to start ZooKeeper first. If you don't have a ZooKeeper installation, you can use the single-node ZooKeeper instance that is packaged and pre-configured with Kafka.

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...
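
The bundled ZooKeeper configuration is a minimal single-node setup. In recent Kafka distributions, config/zookeeper.properties contains roughly the following (paths and ports may differ in your copy):

config/zookeeper.properties:
    dataDir=/tmp/zookeeper
    clientPort=2181
    maxClientCnxns=0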

Now start the Kafka server:

> bin/kafka-server-start.sh config/server.properties &
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...
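
For reference, the defaults in the bundled config/server.properties that matter for the rest of this walkthrough look roughly like this (the listeners line is commented out by default, so the broker listens on port 9092):

config/server.properties:
    broker.id=0
    log.dirs=/tmp/kafka-logs
    zookeeper.connect=localhost:2181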

Step 3: Create a topic

Create a topic named "test" with a single partition and a single replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

Once created, you can list the existing topics by running the following command:

> bin/kafka-topics.sh --list --zookeeper localhost:2181
test

Alternatively, instead of creating topics manually, you can configure your brokers to auto-create topics when a non-existent topic is published to.
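
The broker setting that controls this behaviour is auto.create.topics.enable (enabled by default); if you rely on auto-creation, you may also want to adjust the default partition count, for example:

config/server.properties:
    auto.create.topics.enable=true
    num.partitions=1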

Step 4: Send a message

Kafka comes with a command-line producer that reads messages from a file or from standard input and sends them to the Kafka cluster. Each line becomes a separate message.
Run the producer, then type a few messages into the console to send them to the server.

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message

Step 5: Consume messages

Kafka also comes with a command-line consumer that prints the stored messages to standard output.

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message

If you run the producer and the consumer in two different terminals, then every message you type into the producer terminal shows up in the consumer terminal.
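
The console tools are thin wrappers around the standard Java clients. As a rough sketch (assuming the kafka-clients library is on the classpath and the broker from step 2 is listening on localhost:9092; the class name is made up for illustration), the same message could be sent programmatically:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        // Connection and serialization settings; localhost:9092 is the broker started in step 2.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Send a single message to the "test" topic created in step 3.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test", "This is a message from the Java client"));
            producer.flush();
        }
    }
}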

Step 6: Setting up multiple broker clusters

So far we've only run a single broker, which isn't very interesting. To Kafka, a single broker is just a cluster of size one, so let's start a few more brokers.

First, create a configuration file for each broker:

> cp config/server.properties config/server-1.properties 
> cp config/server.properties config/server-2.properties

Now edit these new files and set the following properties:

config/server-1.properties: 
    broker.id=1 
    listeners=PLAINTEXT://:9093 
    log.dir=/tmp/kafka-logs-1

config/server-2.properties: 
    broker.id=2 
    listeners=PLAINTEXT://:9094 
    log.dir=/tmp/kafka-logs-2

broker.id is the unique and permanent name of each node in the cluster. We override the port and the log directory only because we are running all brokers on the same machine, and we want to keep them from registering on the same port or overwriting each other's data.

ZooKeeper and our original Kafka node are already running, so we only need to start the two new nodes:

> bin/kafka-server-start.sh config/server-1.properties &
... 
> bin/kafka-server-start.sh config/server-2.properties &
...

Now let's create a new topic with a replication factor of 3:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Good, now we have a cluster, but how do we know which broker is doing what? Run the "describe topics" command:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic    PartitionCount:1    ReplicationFactor:3    Configs:
Topic: my-replicated-topic    Partition: 0    Leader: 1    Replicas: 1,2,0    Isr: 1,2,0

Interpreting the output: the first line is a summary of all partitions, and each following line describes one partition. Since this topic has only one partition, there is only one such line. (A programmatic equivalent of this command is sketched after the list below.)

  • leader: the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.
  • replicas: the list of nodes that replicate the log for this partition, regardless of whether they are the leader or even currently alive.
  • isr: the set of "in-sync" replicas, i.e. the subset of the replicas that are alive and caught up with the leader.

Let's run the same command against the topic we created at the beginning:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Topic:test    PartitionCount:1    ReplicationFactor:1    Configs:
Topic: test    Partition: 0    Leader: 0    Replicas: 0    Isr: 0

No surprise here: the original topic has no extra replicas and sits on server "0", the only server in the cluster at the time the topic was created.

Let's publish a few messages to our new topic:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
 ...
my test message 1
my test message 2
^C

Now let's consume these messages:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
 ...
my test message 1
my test message 2
^C

Now let's test the cluster's fault tolerance. Broker 1 is currently the leader, so let's kill it:

> ps | grep server-1.properties
7564 ttys002    0:15.91 /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin/java... 
> kill -9 7564

On Windows:

> wmic process where "caption = 'java.exe' and commandline like '%server-1.properties%'" get processid
ProcessId
6016
> taskkill /pid 6016 /f

Leadership has switched to one of the followers, and broker 1 is no longer in the in-sync replica set:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic    PartitionCount:1    ReplicationFactor:3    Configs:
Topic: my-replicated-topic    Partition: 0    Leader: 2    Replicas: 1,2,0    Isr: 2,0

However, the messages are not lost:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
...
my test message 1
my test message 2
^C

Step 7: Import/export data using Kafka Connect

Writing data from the console and reading it back on the console is a convenient place to start, but you will probably want to import data from other sources into Kafka, or export data from Kafka to other systems. For most systems you can use Kafka Connect instead of writing custom integration code.

Kafka Connect is a tool for importing and exporting data. It is an extensible tool that runs connectors, which implement the custom logic for interacting with external systems. In this quick start we will see how to run Kafka Connect with simple connectors that import data from a file into a Kafka topic and then export data from a Kafka topic back to a file.

First, we create some seed data to test with:

echo -e "foo\nbar" > test.txt

On Windows:

> echo foo> test.txt
> echo bar>> test.txt

Next, we start two connectors in standalone mode, which means they run in a single, local, dedicated process. We provide three configuration files as parameters. The first is the configuration for the Kafka Connect process itself, containing common settings such as the Kafka brokers to connect to and the serialization format for data. Each remaining configuration file specifies a connector to create, including the connector's unique name, the connector class to instantiate, and any other configuration the connector requires.

> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties

Kafka ships with the configuration files for these examples. They use the local cluster configuration we started earlier and create two connectors: the first is a source connector that reads lines from an input file and publishes each one to a Kafka topic, and the second is a sink connector that reads messages from a Kafka topic and writes each one as a line in an output file.
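
For reference, the two connector configuration files shipped with Kafka contain roughly the following (exact values may vary slightly between versions):

config/connect-file-source.properties:
    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=test.txt
    topic=connect-test

config/connect-file-sink.properties:
    name=local-file-sink
    connector.class=FileStreamSink
    tasks.max=1
    file=test.sink.txt
    topics=connect-test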

During startup you will see a number of log messages, including some indicating that the connectors are being instantiated. Once the Kafka Connect process has started, the source connector should begin reading lines from test.txt and writing them to the topic connect-test, and the sink connector should begin reading messages from the topic connect-test and writing them to the file test.sink.txt.

We can verify that the data has made it through the whole pipeline by examining the contents of the output file:

> more test.sink.txt
 foo
 bar

Note that the imported data is also stored in the Kafka topic connect-test, so we can run a console consumer to view it:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
 {"schema":{"type":"string","optional":false},"payload":"foo"}
{"schema":{"type":"string","optional":false},"payload":"bar"}
...

The connectors keep processing data, so we can append lines to the file and watch them move through the pipeline:

echo "Another line" >> test.txt

You should see the new line appear in the console consumer output and in the sink file.

Step 8: Process data using Kafka Streams

Kafka Streams is a client library for real-time stream processing and analysis of data stored in Kafka brokers. This quick-start example demonstrates how to run a streaming application using the bundled WordCountDemo example (the snippet below uses Java 8 lambda expressions for readability):

KTable<String, Long> wordCounts = textLines
    // Split each text line, by whitespace, into words.
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))

    // Ensure the words are available as record keys for the next aggregate operation.
    .map((key, value) -> new KeyValue<>(value, value))

    // Count the occurrences of each word (record key) and store the results into a table named "Counts".
    .countByKey("Counts")

It implements the WordCount algorithm, which computes how often each word occurs in the input text. Unlike the WordCount examples you may have seen before, which operate on bounded data, this demo behaves slightly differently because it is designed to operate on an infinite, unbounded stream of data. Like the bounded variant, it is a stateful algorithm that tracks and updates the counts of words. But since it must assume potentially unbounded input, it periodically emits its current state and results while continuing to process more data, because it cannot know when it has seen "all" of the input.
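
The snippet above uses the DSL from early Kafka Streams releases. For orientation, a complete application written against the 2.x Streams DSL looks roughly like the sketch below; the bundled WordCountDemo differs in its details, so treat this as an illustration rather than the demo's actual source:

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> textLines = builder.stream("streams-file-input");

        // Same word-count logic as above, expressed with the newer DSL.
        KTable<String, Long> wordCounts = textLines
                .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count();

        // Write the running counts to the output topic read in the next step.
        wordCounts.toStream().to("streams-wordcount-output",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}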

Now let's prepare some input data for a Kafka topic, which the Kafka Streams application will then process.

> echo -e "all streams lead to kafka\nhello kafka streams\njoin kafka summit" > file-input.txt

Next, we create the input topic and use the console producer to send the input data to it (streams-file-input). (In practice, stream data would usually keep flowing into Kafka while the application is up and running.)

> bin/kafka-topics.sh --create \
            --zookeeper localhost:2181 \
            --replication-factor 1 \
            --partitions 1 \
            --topic streams-file-input
> cat file-input.txt | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-file-input

Now, we run WordCount to process the input data:

> bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

The demo runs for a few seconds and then, unlike a typical stream processing application, terminates automatically. There is no STDOUT output apart from log entries; the results are continuously written back to another topic, streams-wordcount-output.

Now we can inspect the output of the WordCountDemo application by reading from its output topic:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
            --topic streams-wordcount-output \
            --from-beginning \
            --formatter kafka.tools.DefaultMessageFormatter \
            --property print.key=true \
            --property print.value=true \
            --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
            --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

The output data is printed to the console (you can stop it with Ctrl-C):

all     1
streams 1
lead    1
to      1
kafka   1
hello   1
kafka   2
streams 2
join    1
kafka   3
summit  1
^C

The first column is the message key and the second column is the message value. Note that the output is really a continuous stream of updates, where each record (i.e. each line above) is the latest count for a single word, such as the record key "kafka". For keys that appear multiple times, later records update the counts in earlier ones.

 
 

Topics: Programming kafka Zookeeper Apache Java