Kafka Topics and Partitions

Posted by thefarhan on Mon, 24 Jan 2022 03:07:10 +0100

1: Topic

A topic is a logical division of Kafka messages; it can be understood as a category name.

Kafka classifies messages by topic, and each topic is consumed by the consumers that subscribe to it.

When a topic holds so many messages that storing them takes several terabytes (messages are persisted to log files), problems inevitably arise.

To solve the problem of log files growing too large, Kafka introduced the concept of the partition.

2: Partition

2.1 concept description


Diagram description:

  1. The topic is divided into three partitions for partitioned storage.
  2. Each partition stores a segment of the topic's messages.
  3. Advantages: producers can write in parallel; read/write (production and consumption) throughput improves; the data can be stored in a distributed fashion.
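The parallel-write advantage comes from each message being routed to one of the topic's partitions. A minimal sketch of keyed routing follows; it is not Kafka's real partitioner (which applies a murmur2 hash to the key bytes), and md5 here is just an illustrative stand-in:

```python
import hashlib

NUM_PARTITIONS = 3  # the topic in the diagram has three partitions

def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to one of the partitions."""
    # md5 stands in for Kafka's murmur2 key hash; only determinism matters here.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands in the same partition, preserving per-key order;
# different keys spread across partitions, enabling parallel writes.
assert choose_partition("user-42") == choose_partition("user-42")
print(choose_partition("user-42"))
```

Because every message with a given key maps to a single partition, ordering is guaranteed per key but not across the whole topic.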

2.2 creating a partitioned topic

./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic <topic-name>

2.3 viewing partition information

./kafka-topics.sh --describe --zookeeper localhost:2181 --topic <topic-name>

2.4 practice

# create the topic
[root@localhost bin]# ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1  --partitions 2 --topic test1
Created topic test1.
# describe the topic
[root@localhost bin]# ./kafka-topics.sh --describe --zookeeper localhost:2181 --topic test1
Topic: test1    PartitionCount: 2       ReplicationFactor: 1    Configs: 
        Topic: test1    Partition: 0    Leader: 0       Replicas: 0     Isr: 0
        Topic: test1    Partition: 1    Leader: 0       Replicas: 0     Isr: 0
[root@localhost bin]# 

2.5 viewing log files

[root@localhost bin]# cd /root/kafka/data/kafka-logs/
[root@localhost kafka-logs]# ll
 total 16
...
drwxr-xr-x. 2 root root  141 Jan 21 15:34 test1-0
drwxr-xr-x. 2 root root  141 Jan 21 15:34 test1-1
drwxr-xr-x. 2 root root  141 Jan  6 20:11 userlog-0
[root@localhost kafka-logs]#
[root@localhost kafka-logs]# cd test1-0
[root@localhost test1-0]# ll
 total 4
-rw-r--r--. 1 root root 10485760 Jan 21 15:34 00000000000000000000.index
-rw-r--r--. 1 root root        0 Jan 21 15:34 00000000000000000000.log
-rw-r--r--. 1 root root 10485756 Jan 21 15:34 00000000000000000000.timeindex
-rw-r--r--. 1 root root        8 Jan 21 15:34 leader-epoch-checkpoint
[root@localhost test1-0]# ll ../test1-1
 total 4
-rw-r--r--. 1 root root 10485760 Jan 21 15:34 00000000000000000000.index
-rw-r--r--. 1 root root        0 Jan 21 15:34 00000000000000000000.log
-rw-r--r--. 1 root root 10485756 Jan 21 15:34 00000000000000000000.timeindex
-rw-r--r--. 1 root root        8 Jan 21 15:34 leader-epoch-checkpoint
[root@localhost test1-0]# 

Among the files above, the .index and .timeindex files serve the following purposes:

Suppose the log file holds 10 million messages and we need to find those sent in the last 24 hours:

  1. The .index file is a sparse index from message offsets to positions in the .log file: for example, where the first message is located, where the 1000th message is located, and so on. Indexing at intervals like this lets Kafka locate a message quickly without indexing every entry.
  2. The .timeindex file indexes by time: for example, it records which offset corresponds to a given point in time, so a time-based search such as "the last 24 hours" can start from the right place.
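The sparse-index lookup described above can be sketched with a sorted list and binary search. The offsets and byte positions below are invented for illustration; real index entries are written roughly every log.index.interval.bytes of appended data:

```python
import bisect

# A sparse index over a segment: (message offset, byte position in the .log file).
# These numbers are made up for illustration.
sparse_index = [(0, 0), (1000, 52_430), (2000, 104_877), (3000, 157_012)]

def locate(target_offset: int) -> int:
    """Return the byte position from which to scan the .log file forward."""
    offsets = [offset for offset, _ in sparse_index]
    # Find the last indexed offset <= target_offset.
    i = bisect.bisect_right(offsets, target_offset) - 1
    return sparse_index[i][1]

# To find message 2500, jump to the indexed entry for offset 2000, then scan:
print(locate(2500))  # 104877
```

The index answers "where do I start scanning?" rather than "where exactly is the message?", which is why it can stay small even for millions of messages.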

2.6 details of log storage

The data is actually stored in the 00000000000000000000.log files under data/kafka-logs/test1-0 and data/kafka-logs/test1-1.

  1. 000...000.log: stores the messages themselves.

  2. __consumer_offsets:

    Kafka internally creates the __consumer_offsets topic with 50 partitions. This topic stores the offsets at which each consumer group has consumed each topic.

    1. Which partition an offset is committed to is determined by: hash(consumerGroupId) % (number of partitions of the __consumer_offsets topic)
    2. The record committed to this topic has key consumerGroupId+topic+partition number, and value the current offset.
  3. Messages in the log files are retained for 7 days by default; after seven days they are deleted.

Consumers regularly commit the offsets of the partitions they consume to Kafka's internal topic __consumer_offsets. The committed record's key is consumerGroupId+topic+partition number, and its value is the current offset. Kafka periodically cleans up the messages in this topic, keeping only the latest data.

When a consumer crashes, the positions it has consumed remain stored in __consumer_offsets under the corresponding partition. When another consumer in the same consumer group takes over consuming the producer's messages, it fetches the crashed consumer's last consumed position from that partition and resumes from there.
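A toy model of this recovery, with a plain dict standing in for the __consumer_offsets topic (the latest value per key wins, similar in spirit to Kafka's log compaction; names like order-service are hypothetical):

```python
# (consumerGroupId, topic, partition) -> last committed offset
committed = {}

def commit(group: str, topic: str, partition: int, offset: int) -> None:
    committed[(group, topic, partition)] = offset  # newer commit overwrites older

def resume_position(group: str, topic: str, partition: int) -> int:
    """Where a consumer taking over this partition should start reading."""
    return committed.get((group, topic, partition), 0)

commit("order-service", "test1", 0, 120)
commit("order-service", "test1", 0, 350)  # later commit supersedes the earlier one
print(resume_position("order-service", "test1", 0))  # 350
```

Because only the latest offset per key matters, Kafka can safely compact away the older commits, which is exactly the cleanup behavior described above.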

Because __consumer_offsets may receive highly concurrent requests, Kafka allocates 50 partitions to it by default (configurable via offsets.topic.num.partitions), so the load can be absorbed by adding machines.

The partition of __consumer_offsets to which a consumer's offset is committed is chosen by the formula: hash(consumerGroupId) % (number of partitions of the __consumer_offsets topic).
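Since Kafka evaluates this formula in Java, hash() here means Java's String.hashCode(). A sketch reproducing the computation in Python (note that Kafka's own absolute-value helper, Utils.abs, masks the sign bit rather than calling Math.abs, so extreme hash values can differ slightly from this sketch):

```python
OFFSETS_TOPIC_PARTITIONS = 50  # default of offsets.topic.num.partitions

def java_string_hashcode(s: str) -> int:
    """Re-implementation of Java's String.hashCode() with 32-bit signed overflow."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to Java's signed int range.
    return h - 0x1_0000_0000 if h >= 0x8000_0000 else h

def offsets_partition_for(group_id: str) -> int:
    # hash(consumerGroupId) % number of __consumer_offsets partitions
    return abs(java_string_hashcode(group_id)) % OFFSETS_TOPIC_PARTITIONS

# "abc".hashCode() in Java is 96354, which checks the re-implementation:
assert java_string_hashcode("abc") == 96354
print(offsets_partition_for("my-group"))
```

Every commit from the same consumer group lands in the same __consumer_offsets partition, so a group's offsets are always read back from one place.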

Topics: Java, Kafka, Distributed