Message Sending Mechanism of the Kafka Producer

Posted by emopoops on Fri, 20 Sep 2019 11:22:46 +0200

A picture up front makes any reader happier, and a structure diagram even more so.

This schematic diagram was introduced in the previous article, "kafka Producer's Storage Pool Mechanism". That article covered the message collection process shown in the diagram (what we called the "Storage Pool" mechanism). This article covers the other part of the diagram: the message sending mechanism.

1.1. Sender Running Process

All messages are sent by the Sender thread, which is a daemon thread, so we start with Sender's run method. The outermost run() is simply a main loop that keeps calling the overloaded run(long now), which holds the actual logic. Let's look at run(long now):

 void run(long now) {
        // Transaction management for the producer. This chapter does not analyze it in detail; a later chapter is dedicated to it, so a rough understanding is enough for now.
        if (transactionManager != null) {
            try {
                if (transactionManager.shouldResetProducerStateAfterResolvingSequences())
                    // Check if the previous run expired batches which requires a reset of the producer state.
                    transactionManager.resetProducerId();

                if (!transactionManager.isTransactional()) {
                    // this is an idempotent producer, so make sure we have a producer id
                    maybeWaitForProducerId();
                } else if (transactionManager.hasUnresolvedSequences() && !transactionManager.hasFatalError()) {
                    transactionManager.transitionToFatalError(new KafkaException("The client hasn't received acknowledgment for " +
                            "some previously sent messages and can no longer retry them. It isn't safe to continue."));
                } else if (transactionManager.hasInFlightTransactionalRequest() || maybeSendTransactionalRequest(now)) {
                    // as long as there are outstanding transactional requests, we simply wait for them to return
                    client.poll(retryBackoffMs, now);
                    return;
                }

                // do not continue sending if the transaction manager is in a failed state or if there
                // is no producer id (for the idempotent case).
                if (transactionManager.hasFatalError() || !transactionManager.hasProducerId()) {
                    RuntimeException lastError = transactionManager.lastError();
                    if (lastError != null)
                        maybeAbortBatches(lastError);
                    client.poll(retryBackoffMs, now);
                    return;
                } else if (transactionManager.hasAbortableError()) {
                    accumulator.abortUndrainedBatches(transactionManager.lastError());
                }
            } catch (AuthenticationException e) {
                // This is already logged as error, but propagated here to perform any clean ups.
                log.trace("Authentication exception while processing transactional request: {}", e);
                transactionManager.authenticationFailed(e);
            }
        }
        // Send the actual data requests and process the server-side responses
        long pollTimeout = sendProducerData(now);
        client.poll(pollTimeout, now);
    }
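
The relationship between the outer run() and the per-iteration run(long now) can be sketched roughly as follows. This is a simplified illustration rather than the Kafka source: the class name SenderLoopSketch, the running flag, the step callback, and the runIterations helper are all assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongConsumer;

// Hypothetical sketch of a Sender-style loop: an outer run() keeps calling
// a per-iteration step until the loop is asked to stop.
final class SenderLoopSketch implements Runnable {
    private volatile boolean running = true;
    private final LongConsumer step; // stands in for run(long now)

    SenderLoopSketch(LongConsumer step) {
        this.step = step;
    }

    @Override
    public void run() {
        while (running) {
            try {
                step.accept(System.currentTimeMillis());
            } catch (RuntimeException e) {
                // One failed iteration must not kill the loop.
                System.err.println("Uncaught error in sender loop: " + e);
            }
        }
    }

    void initiateClose() {
        running = false;
    }

    // Runs the loop on a daemon thread and stops it after maxIterations steps.
    static int runIterations(int maxIterations) {
        AtomicInteger count = new AtomicInteger();
        SenderLoopSketch[] holder = new SenderLoopSketch[1];
        holder[0] = new SenderLoopSketch(now -> {
            if (count.incrementAndGet() >= maxIterations)
                holder[0].initiateClose();
        });
        Thread thread = new Thread(holder[0], "sender-sketch");
        thread.setDaemon(true);
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println(runIterations(3));
    }
}
```

The point of the split is that each iteration receives a single timestamp and does a bounded amount of work, while the outer loop only decides whether to keep going.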

Next, we look at this in two parts: sending messages, and processing the responses that come back.

1.2. Message Sending

Let's first look at the logic of sendProducerData:

private long sendProducerData(long now) {
        //Getting Cluster Information
        Cluster cluster = metadata.fetch();

        // Get the nodes whose partitions have messages ready to send
        RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

        // If any of these partitions has no known leader, force a metadata update
        if (!result.unknownLeaderTopics.isEmpty()) {
            // A leader may be missing during a leader election or after a topic failure. Because we still need to send to these topics, re-add them to the metadata and request an update from the server.
            for (String topic : result.unknownLeaderTopics)
                this.metadata.add(topic);
            this.metadata.requestUpdate();
        }

        // Iterate over the ready nodes and check each one's network connection state; nodes that are not ready are removed from the set.
        Iterator<Node> iter = result.readyNodes.iterator();
        long notReadyTimeout = Long.MAX_VALUE;
        while (iter.hasNext()) {
            Node node = iter.next();
            // Check the node's connection state; if the node is not ready but a connection is allowed, client.ready initiates one
            if (!this.client.ready(node, now)) {
                // Remove nodes that are not ready
                iter.remove();
                notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
            }
        }

        // Drain the batches to send, grouped by the id of the leader node they go to
        Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes,
                this.maxRequestSize, now);
                
        // If strict message ordering is required, record (mute) each partition being sent to, so that a partition never has more than one unfinished batch in flight at a time.
        if (guaranteeMessageOrder) {
            // Add each batch's partition to the mute set. It is implemented as a Set, so a TopicPartition that is already present is not added again.
            for (List<ProducerBatch> batchList : batches.values()) {
                for (ProducerBatch batch : batchList)
                    this.accumulator.mutePartition(batch.topicPartition);
            }
        }

        // Collect batches that have expired locally, fail them with a TimeoutException, and free their buffer space
        List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(this.requestTimeout, now);
        // Handle the expired batches
        if (!expiredBatches.isEmpty())
            log.trace("Expired {} batches in accumulator", expiredBatches.size());
        for (ProducerBatch expiredBatch : expiredBatches) {
            failBatch(expiredBatch, -1, NO_TIMESTAMP, expiredBatch.timeoutException(), false);
            if (transactionManager != null && expiredBatch.inRetry()) {
                // This ensures that no new batches are drained until the current in flight batches are fully resolved.
                transactionManager.markSequenceUnresolved(expiredBatch.topicPartition);
            }
        }
        //Update Metric Information
        sensors.updateProduceRequestMetrics(batches);

        // Set pollTimeout. If any node has data ready to send, use 0 so that the request goes out immediately, which shortens the time the remaining messages sit in the buffer and avoids a backlog.
        long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
        if (!result.readyNodes.isEmpty()) {
            log.trace("Nodes with data ready to send: {}", result.readyNodes);
            pollTimeout = 0;
        }
        // Call NetworkClient to send the messages to the server
        sendProduceRequests(batches, now);

        return pollTimeout;
}
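
The rule for the return value is easy to lose among the other steps, so here it is in isolation. This is a minimal sketch that assumes the three inputs have already been computed; the class name PollTimeoutSketch and the hasReadyNodes parameter are made up for the example.

```java
// Hypothetical helper that isolates the pollTimeout rule from sendProducerData.
final class PollTimeoutSketch {
    // nextReadyCheckDelayMs: how long until some batch may become sendable
    // notReadyTimeout: how long until a not-yet-connected node may be retried
    // hasReadyNodes: whether any node has data that can be sent right now
    static long pollTimeout(long nextReadyCheckDelayMs, long notReadyTimeout, boolean hasReadyNodes) {
        if (hasReadyNodes)
            return 0L; // do not block: more data may be sendable immediately
        return Math.min(nextReadyCheckDelayMs, notReadyTimeout);
    }

    public static void main(String[] args) {
        System.out.println(pollTimeout(50L, Long.MAX_VALUE, false));
        System.out.println(pollTimeout(50L, 20L, false));
        System.out.println(pollTimeout(50L, 20L, true));
    }
}
```

In other words, the Sender blocks in poll only as long as nothing can change: until a batch's wait time is up or a node's connection can be retried.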

In summary, the core steps of sendProducerData are:
1. Call accumulator.ready to get the nodes whose partitions have data ready to send
2. Call client.ready to check the connection state of each of those nodes
3. Call accumulator.drain to get the batches to send, grouped by leader node
4. When strict per-partition ordering must be guaranteed, call accumulator.mutePartition to add the partitions to the mute set
5. Call sendProduceRequests to send the produce requests

The steps are explained one by one below.
The accumulator.ready method returns the nodes whose partitions have data ready to send:

 public ReadyCheckResult ready(Cluster cluster, long nowMs) {
        //Collection of nodes that accept messages
        Set<Node> readyNodes = new HashSet<>();
        long nextReadyCheckDelayMs = Long.MAX_VALUE;
        // Topics for which no leader replica was found
        Set<String> unknownLeaderTopics = new HashSet<>();
        
        // Are there threads waiting for BufferPool to allocate space?
        boolean exhausted = this.free.queued() > 0;
        // Iterate over every partition that has batches queued and determine its leader
        for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {
            TopicPartition part = entry.getKey();
            Deque<ProducerBatch> deque = entry.getValue();
            
            // Get the node that hosts the leader replica of the current topic partition
            Node leader = cluster.leaderFor(part);
            synchronized (deque) {
                if (leader == null && !deque.isEmpty()) {
                    // The partition's leader is unknown but there are messages queued for it, so record the topic. The caller will then force a metadata update request to the server.
                    unknownLeaderTopics.add(part.topic());
                } 
                // Skip partitions in the mute set. To preserve ordering, a partition that still has a batch in flight must not send another one.
                else if (!readyNodes.contains(leader) && !muted.contains(part)) {
                    ProducerBatch batch = deque.peekFirst();
                    if (batch != null) {
                        long waitedTimeMs = batch.waitedTimeMs(nowMs);
                        // Is this batch a retry that is still within its backoff period?
                        boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
                        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                        boolean full = deque.size() > 1 || batch.isFull();
                        boolean expired = waitedTimeMs >= timeToWaitMs;
                        // Whether data for this leader can be sent now
                        boolean sendable = full // 1. The queue holds more than one batch, or the first batch is full
                        || expired // 2. The batch has waited at least lingerMs (or retryBackoffMs when it is a retry)
                        || exhausted // 3. Other threads are waiting for the BufferPool to allocate space, that is, the local buffer is full
                        || closed // 4. The producer has been closed
                        || flushInProgress();// 5. A thread is waiting for a flush to complete
                        if (sendable && !backingOff) {
                        // The leader is added to the ready nodes when the batch is sendable and is not backing off before a retry.
                            readyNodes.add(leader);
                        } else {
                            long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                            // Update the delay until the next ready check
                            nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
                        }
                    }
                }
            }
        }
        //Return the check result
        return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}
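
The per-batch decision inside ready can be pulled out into a pure function, which makes the interaction between lingerMs and retryBackoffMs easier to see. This is a sketch under assumptions: the class ReadyCheckSketch and its flattened parameter list are invented for the example, and the real method reads these values from the batch and the accumulator.

```java
// Hypothetical, side-effect-free version of the "sendable" decision in ready().
final class ReadyCheckSketch {
    static boolean isSendable(int attempts, long waitedTimeMs, long retryBackoffMs, long lingerMs,
                              boolean full, boolean exhausted, boolean closed, boolean flushInProgress) {
        // A retried batch must sit out retryBackoffMs before it can go again.
        boolean backingOff = attempts > 0 && waitedTimeMs < retryBackoffMs;
        // While backing off we wait for the backoff; otherwise for the linger time.
        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
        boolean expired = waitedTimeMs >= timeToWaitMs;
        boolean sendable = full || expired || exhausted || closed || flushInProgress;
        return sendable && !backingOff;
    }

    public static void main(String[] args) {
        // Fresh batch, linger not yet elapsed, nothing forcing a send.
        System.out.println(isSendable(0, 3, 100, 5, false, false, false, false));
        // Same batch once lingerMs has elapsed.
        System.out.println(isSendable(0, 5, 100, 5, false, false, false, false));
        // Full batch that is a retry still inside its backoff window.
        System.out.println(isSendable(1, 50, 100, 5, true, false, false, false));
    }
}
```

Note that backing off always wins: even a full batch, or a producer that is closing, does not make a retried batch sendable before its backoff has elapsed.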

Call client.ready to check the connection state of each of the nodes obtained:

 public boolean ready(Node node, long now) {
        if (node.isEmpty())
            throw new IllegalArgumentException("Cannot connect to empty node " + node);
        // If the node is already ready according to connectionStates, return true directly
        if (isReady(node, now))
            return true;
        
        // The connection state allows a new connection attempt
        if (connectionStates.canConnect(node.idString(), now))
            // Initiate the connection through the selector
            initiateConnect(node, now);

        return false;
}
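
The important property of client.ready is that it never blocks: it either reports that the node is usable now, or it starts a connection attempt and returns false so that a later loop iteration can try again. A minimal sketch of that behavior follows. The class ConnectionReadySketch, its three-state enum, and the reconnectBackoffMs rule are simplifying assumptions for the example; the real client tracks more states and more conditions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a non-blocking ready() check.
final class ConnectionReadySketch {
    enum State { DISCONNECTED, CONNECTING, READY }

    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Long> lastAttemptMs = new HashMap<>();
    private final long reconnectBackoffMs;

    ConnectionReadySketch(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    // Returns true only if the node can be used right now. Otherwise it may
    // start a connection attempt, but it still returns false for this call.
    boolean ready(String nodeId, long nowMs) {
        if (stateOf(nodeId) == State.READY)
            return true;
        if (canConnect(nodeId, nowMs)) {
            states.put(nodeId, State.CONNECTING);
            lastAttemptMs.put(nodeId, nowMs);
        }
        return false;
    }

    private boolean canConnect(String nodeId, long nowMs) {
        if (stateOf(nodeId) != State.DISCONNECTED)
            return false;
        Long last = lastAttemptMs.get(nodeId);
        return last == null || nowMs - last >= reconnectBackoffMs;
    }

    State stateOf(String nodeId) {
        return states.getOrDefault(nodeId, State.DISCONNECTED);
    }

    void connectionEstablished(String nodeId) {
        states.put(nodeId, State.READY);
    }

    void disconnected(String nodeId) {
        states.put(nodeId, State.DISCONNECTED);
    }
}
```

This is why sendProducerData removes not-ready nodes from the set and records a timeout for them: the connection completes in the background during poll, and the node is picked up on a later pass.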

Call accumulator.drain to get the batches to send, grouped by the leader node they go to:

 public Map<Integer, List<ProducerBatch>> drain(Cluster cluster,
                                                   Set<Node> nodes,
                                                   int maxSize,
                                                   long now) {
        if (nodes.isEmpty())
            return Collections.emptyMap();
            
        // The result: node id -> batches to send to that node
        Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
        //Traversing through every connected node
        for (Node node : nodes) {
            int size = 0;
            List<PartitionInfo> parts = cluster.partitionsForNode(node.id());
            List<ProducerBatch> ready = new ArrayList<>();
            /* drainIndex records where the previous drain stopped, so this drain continues from that position.
            * If every drain started from index 0, partitions later in the list could be starved. This is a simple load-balancing strategy.
            */
            int start = drainIndex = drainIndex % parts.size();
            do {
                PartitionInfo part = parts.get(drainIndex);
                TopicPartition tp = new TopicPartition(part.topic(), part.partition());
                //If strict ordering is required, no batch may be drained from a muted partition, otherwise messages could be reordered.
                if (!muted.contains(tp)) {
                    // Get the RecordBatch set corresponding to the current partition
                    Deque<ProducerBatch> deque = getDeque(tp);
                    if (deque != null) {
                        synchronized (deque) {
                            ProducerBatch first = deque.peekFirst();
                            if (first != null) {
                                //Is the first batch a retry that is still within its backoff period?
                                boolean backoff = first.attempts() > 0 && first.waitedTimeMs(now) < retryBackoffMs;
                                // Not a retry, or the retry backoff has elapsed
                                if (!backoff) {
                                    if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty()) {
                                         // Adding this batch would exceed the request size limit (maxSize), so stop draining for this node. This keeps a single request from growing too large.
                                        break;
                                    } 
                                    //Otherwise, take the batch
                                    else {
                                        //(Transaction-related handling is omitted here)
                                        //Nodes are visited one by one, each node's partitions are visited round-robin from the saved start index, and only the first batch of each queue is taken per pass. All of this balances the load and keeps sending fair across partitions.
                                        ProducerBatch batch = deque.pollFirst();
                                        //close() closes the batch for writes: it can still be read but no longer appended to.
                                        batch.close();
                                        size += batch.records().sizeInBytes();
                                        ready.add(batch);
                                        batch.drained(now);
                                    }
                                }
                            }
                        }
                    }
                }
                //Update drainIndex
                this.drainIndex = (this.drainIndex + 1) % parts.size();
            } while (start != drainIndex);
            batches.put(node.id(), ready);
        }
        return batches;
}
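
The role of drainIndex is easiest to see on a toy model. The sketch below is an assumption-laden simplification: DrainSketch models a single node, represents each batch by its size in bytes, and returns the indices of the partitions it drained from, none of which matches the real types.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical single-node model of the round-robin drain with a saved index.
final class DrainSketch {
    private final List<Deque<Integer>> partitions; // each element is a batch size in bytes
    private int drainIndex = 0;

    DrainSketch(List<Deque<Integer>> partitions) {
        this.partitions = partitions;
    }

    static DrainSketch of(int partitionCount, int batchesPerPartition, int batchSize) {
        List<Deque<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < partitionCount; p++) {
            Deque<Integer> deque = new ArrayDeque<>();
            for (int b = 0; b < batchesPerPartition; b++)
                deque.addLast(batchSize);
            partitions.add(deque);
        }
        return new DrainSketch(partitions);
    }

    // Takes at most one batch per partition until maxSize would be exceeded.
    // Returns the indices of the partitions a batch was taken from.
    List<Integer> drain(int maxSize) {
        List<Integer> drainedFrom = new ArrayList<>();
        if (partitions.isEmpty())
            return drainedFrom;
        int size = 0;
        int start = drainIndex = drainIndex % partitions.size();
        do {
            Deque<Integer> deque = partitions.get(drainIndex);
            Integer first = deque.peekFirst();
            if (first != null) {
                // Stop when the request is full, but always allow one batch.
                if (size + first > maxSize && !drainedFrom.isEmpty())
                    break;
                size += deque.pollFirst();
                drainedFrom.add(drainIndex);
            }
            drainIndex = (drainIndex + 1) % partitions.size();
        } while (start != drainIndex);
        return drainedFrom;
    }

    public static void main(String[] args) {
        DrainSketch sketch = DrainSketch.of(3, 2, 10);
        System.out.println(sketch.drain(20)); // partitions 0 and 1
        System.out.println(sketch.drain(20)); // resumes at partition 2
    }
}
```

With three partitions and room for two batches per request, the second drain starts at partition 2 instead of going back to partition 0, which is exactly the starvation the saved index is meant to prevent.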

Call accumulator.mutePartition to add partitions to the mute set. The process is simple: iterate over the batches about to be sent and, if strict message ordering must be guaranteed, record each batch's partition in the mute set. Before any batch is sent, the mute set is checked; if the partition is already in it, nothing is sent for that partition this round. When a send completes, the partition is removed from the mute set so that its next batch can be sent.
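
The mute-set bookkeeping described here can be sketched in a few lines. The class MuteSketch, the string partition names, and the trySend/onSendCompleted methods are assumptions for the example; in the real code the check and the mute happen in separate places (ready and drain check the set, and sendProducerData mutes after draining).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: at most one unfinished batch per partition.
final class MuteSketch {
    private final Set<String> muted = new HashSet<>();

    // Returns true if a batch for this partition may be sent now.
    // On success the partition is muted until the send completes.
    boolean trySend(String topicPartition) {
        // Set.add returns false when the element is already present.
        return muted.add(topicPartition);
    }

    // Called when the in-flight batch for the partition has completed.
    void onSendCompleted(String topicPartition) {
        muted.remove(topicPartition);
    }

    public static void main(String[] args) {
        MuteSketch sketch = new MuteSketch();
        System.out.println(sketch.trySend("topic-0")); // first batch goes out
        System.out.println(sketch.trySend("topic-0")); // second batch must wait
        sketch.onSendCompleted("topic-0");
        System.out.println(sketch.trySend("topic-0")); // next batch may go
    }
}
```

Because a retried batch can never be overtaken by a later batch of the same partition, ordering within the partition is preserved, at the cost of throughput for that partition.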

Call sendProduceRequests to send the produce requests. It calls sendProduceRequest once for each destination node:

private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
        if (batches.isEmpty())
            return;

        Map<TopicPartition, MemoryRecords> produceRecordsByPartition = new HashMap<>(batches.size());
        final Map<TopicPartition, ProducerBatch> recordsByPartition = new HashMap<>(batches.size());

        // Iterate over all batches to find the lowest magic value (message format version) in use
        byte minUsedMagic = apiVersions.maxUsableProduceMagic();
        for (ProducerBatch batch : batches) {
            if (batch.magic() < minUsedMagic)
                minUsedMagic = batch.magic();
        }

        // Iterate over the batches and fill produceRecordsByPartition and recordsByPartition
        for (ProducerBatch batch : batches) {
            TopicPartition tp = batch.topicPartition;
            MemoryRecords records = batch.records();

            // If the records do not match minUsedMagic, rebuild them as MemoryRecords in the older format (down-conversion) for backward compatibility.
            if (!records.hasMatchingMagic(minUsedMagic))
                records = batch.records().downConvert(minUsedMagic, 0, time).records();
            produceRecordsByPartition.put(tp, records);
            recordsByPartition.put(tp, batch);
        }

        String transactionalId = null;
        if (transactionManager != null && transactionManager.isTransactional()) {
            transactionalId = transactionManager.transactionalId();
        }
        
        // Create the ProduceRequest builder from produceRecordsByPartition
        ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
                produceRecordsByPartition, transactionalId);
        // Create callback objects for handling responses, and recordsByPartition for response callback processing
        RequestCompletionHandler callback = new RequestCompletionHandler() {
            public void onComplete(ClientResponse response) {
                handleProduceResponse(response, recordsByPartition, time.milliseconds());
            }
        };

        String nodeId = Integer.toString(destination);
        // Create the ClientRequest. If acks is not 0, a response from the server is expected
        ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0, callback);
        //Call NetworkClient to send the request
        client.send(clientRequest, now);
        log.trace("Sent produce request to {}: {}", nodeId, requestBuilder);
}
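
The magic-version handling at the top of sendProduceRequest amounts to taking a minimum and then checking each batch against it. Here is a small sketch of just that step. The class MinMagicSketch and the use of a plain byte array for the batches' magic values are assumptions for the example, and the mismatch test is a simplification of hasMatchingMagic.

```java
// Hypothetical sketch of choosing the request's magic value.
final class MinMagicSketch {
    // Start from the highest magic the broker can accept, then lower it to
    // the smallest magic found among the batches in this request.
    static byte minUsedMagic(byte maxUsableMagic, byte[] batchMagics) {
        byte min = maxUsableMagic;
        for (byte magic : batchMagics) {
            if (magic < min)
                min = magic;
        }
        return min;
    }

    // A batch written with a different magic must be converted before sending.
    static boolean needsDownConversion(byte batchMagic, byte minUsedMagic) {
        return batchMagic != minUsedMagic;
    }

    public static void main(String[] args) {
        byte min = minUsedMagic((byte) 2, new byte[] {2, 1, 2});
        System.out.println(min);
        System.out.println(needsDownConversion((byte) 2, min));
    }
}
```

All partitions in one produce request share a single format version, which is why one older batch forces every newer batch in the same request to be down-converted.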

Next, we need to understand how NetworkClient sends the request. The send call ends up in the doSend method:

private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now, AbstractRequest request) {
        //Get the target node id
        String nodeId = clientRequest.destination();
        RequestHeader header = clientRequest.makeHeader(request.version());
        //(Logging omitted)
        Send send = request.toSend(nodeId, header);
        //Create an InFlightRequest and add it to inFlightRequests
        InFlightRequest inFlightRequest = new InFlightRequest(
                header,
                clientRequest.createdTimeMs(),
                clientRequest.destination(),
                clientRequest.callback(),
                clientRequest.expectResponse(),
                isInternalRequest,
                request,
                send,
                now);
        this.inFlightRequests.add(inFlightRequest);
        //Network Message Sending
        selector.send(inFlightRequest.send);
}
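
The inFlightRequests collection is what later lets the client match a response to its request. As far as we understand it, it behaves like a per-node double-ended queue: new requests go in at the front, and responses, which arrive in send order on a connection, complete the request at the back. The sketch below models that idea. The class InFlightSketch and the use of plain strings for requests are assumptions for the example, although the method names mirror the calls that appear in the listings in this article.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a per-node queue of requests awaiting completion.
final class InFlightSketch {
    private final Map<String, Deque<String>> requests = new HashMap<>();

    // Newest request goes to the front of the node's queue.
    void add(String nodeId, String request) {
        requests.computeIfAbsent(nodeId, k -> new ArrayDeque<>()).addFirst(request);
    }

    // The most recently sent request, without removing it.
    String lastSent(String nodeId) {
        return queue(nodeId).peekFirst();
    }

    // Completes the most recently sent request (used when no response is expected).
    String completeLastSent(String nodeId) {
        return queue(nodeId).pollFirst();
    }

    // Completes the oldest request (used when a response arrives).
    String completeNext(String nodeId) {
        return queue(nodeId).pollLast();
    }

    int count(String nodeId) {
        Deque<String> deque = requests.get(nodeId);
        return deque == null ? 0 : deque.size();
    }

    private Deque<String> queue(String nodeId) {
        Deque<String> deque = requests.get(nodeId);
        if (deque == null || deque.isEmpty())
            throw new IllegalStateException("There are no in-flight requests for node " + nodeId);
        return deque;
    }
}
```

The two completion paths correspond to the two cases covered later: handleCompletedSends uses lastSent and completeLastSent, and handleCompletedReceives uses completeNext.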

That completes the explanation of message sending. Next we look at how responses are received and processed.

1.3. Message Response Handling

Response handling starts with the poll method of NetworkClient, whose logic is as follows:

 public List<ClientResponse> poll(long timeout, long now) {
        ensureActive();

        if (!abortedSends.isEmpty()) {
            // Sends that were aborted because of a disconnect or an unsupported version must be handled first.
            List<ClientResponse> responses = new ArrayList<>();
            handleAbortedSends(responses);
            completeResponses(responses);
            return responses;
        }
        //Update metadata if needed
        long metadataTimeout = metadataUpdater.maybeUpdate(now);
        try {
            //selector.poll handles all network I/O: connecting, disconnecting, starting new sends, and progressing in-flight sends and receives. Received responses end up in completedReceives.
            this.selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs));
        } catch (IOException e) {
            log.error("Unexpected error during I/O", e);
        }

        // Handle all completed operations and responses
        long updatedNow = this.time.milliseconds();
        List<ClientResponse> responses = new ArrayList<>();
        handleCompletedSends(responses, updatedNow);
        handleCompletedReceives(responses, updatedNow);
        handleDisconnections(responses, updatedNow);
        handleConnections();
        handleInitiateApiVersionRequests(updatedNow);
        handleTimedOutRequests(responses, updatedNow);
        completeResponses(responses);

        return responses;
}

The core of response processing is the set of handle* methods. Let's go through them one by one.
The handleCompletedSends method iterates over all sends that have completed and, for requests that do not expect a response, creates a local response and adds it to the response list:

private void handleCompletedSends(List<ClientResponse> responses, long now) {
        // Traverse through all send-completed send objects
        for (Send send : this.selector.completedSends()) {
            //Find the most recently sent request for this node in inFlightRequests
            InFlightRequest request = this.inFlightRequests.lastSent(send.destination());
            //For requests that were sent successfully but expect no response from the server, create a local response and add it to the list
            if (!request.expectResponse) {
                //In-flight requests are added when sent and removed once completed.
                this.inFlightRequests.completeLastSent(send.destination());
                // Add to the local response queue
                responses.add(request.completed(null, now));
            }
        }
    }

The handleCompletedReceives method takes the responses received from the server and dispatches them by type: metadata responses and API version responses are handled internally, and all other responses are added to the response list.

private void handleCompletedReceives(List<ClientResponse> responses, long now) {
        //Iterate over completedReceives, which was filled by the preceding selector.poll
        for (NetworkReceive receive : this.selector.completedReceives()) {
            //Get the node ID of the return response
            String source = receive.source();
            //Take the matching request from the inFlightRequests collection
            InFlightRequest req = inFlightRequests.completeNext(source);
            //Parse the response
            Struct responseStruct = parseStructMaybeUpdateThrottleTimeMetrics(receive.payload(), req.header,
                throttleTimeSensor, now);
           //(Logging omitted)
            AbstractResponse body = AbstractResponse.parseResponse(req.header.apiKey(), responseStruct);
            if (req.isInternalRequest && body instanceof MetadataResponse)
                //Processing update response information for metadata
                metadataUpdater.handleCompletedMetadataResponse(req.header, now, (MetadataResponse) body);
            else if (req.isInternalRequest && body instanceof ApiVersionsResponse)
                 // If the response is to update the API version, update the API version information supported by the locally cached target node
                handleApiVersionsResponse(responses, req, now, (ApiVersionsResponse) body);
            else
                //Add to the local response queue
                responses.add(req.completed(body, now));
        }
    }

The handleDisconnections method calls Selector#disconnected to get the set of node IDs whose connections were closed, updates the connection state of each such node to DISCONNECTED, clears the locally cached data for the node, and creates a disconnected ClientResponse that is added to the result list. If any disconnections were found in this step, it also flags the locally cached cluster metadata as needing an update.

The handleConnections method calls Selector#connected to get the set of node IDs whose connections were just established. If a node is being connected to for the first time, the API versions it supports must be fetched, so the method sets the node's connection state to CHECKING_API_VERSIONS and adds the node ID to the NetworkClient#nodesNeedingApiVersionsFetch collection. For other nodes, it updates the connection state to READY.

The handleInitiateApiVersionRequests method processes the nodes that handleConnections flagged as needing their supported API versions fetched, that is, the nodes recorded in the NetworkClient#nodesNeedingApiVersionsFetch collection. It iterates over those nodes and, for each node that is currently able to accept a request, builds an ApiVersionsRequest to fetch the API versions the node supports. The request is wrapped in a ClientRequest object and sent on the next Selector#poll.

The handleTimedOutRequests method collects the nodes that have timed-out requests in inFlightRequests and treats those nodes as disconnected. For each of them it creates a disconnected ClientResponse, adds it to the result list, and flags the locally cached cluster metadata as needing an update.
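
Finding the nodes with timed-out requests does not require scanning every request: because requests on a connection are answered in order, it is enough to look at the oldest one per node. The sketch below illustrates that idea under assumptions. The class TimeoutSketch, the representation of requests as send timestamps, and the demo data are all invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: find nodes whose oldest in-flight request has timed out.
final class TimeoutSketch {
    // sendTimesByNode maps a node id to the send times (ms) of its in-flight
    // requests, newest first, so the oldest request is at the tail.
    static List<String> nodesWithTimedOutRequests(Map<String, Deque<Long>> sendTimesByNode,
                                                  long nowMs, long requestTimeoutMs) {
        List<String> nodeIds = new ArrayList<>();
        for (Map.Entry<String, Deque<Long>> entry : sendTimesByNode.entrySet()) {
            Long oldestSendTimeMs = entry.getValue().peekLast();
            if (oldestSendTimeMs != null && nowMs - oldestSendTimeMs > requestTimeoutMs)
                nodeIds.add(entry.getKey());
        }
        return nodeIds;
    }

    // Fixed demo data: node-1 sent requests at 100 ms and 900 ms, node-2 at 800 ms.
    static List<String> demo(long nowMs) {
        Map<String, Deque<Long>> sendTimes = new LinkedHashMap<>();
        Deque<Long> node1 = new ArrayDeque<>();
        node1.addFirst(100L);
        node1.addFirst(900L);
        sendTimes.put("node-1", node1);
        Deque<Long> node2 = new ArrayDeque<>();
        node2.addFirst(800L);
        sendTimes.put("node-2", node2);
        return nodesWithTimedOutRequests(sendTimes, nowMs, 500L);
    }

    public static void main(String[] args) {
        System.out.println(demo(1000L)); // only node-1 has a request older than 500 ms
    }
}
```

Treating the whole node as disconnected, rather than failing the single request, means every in-flight request for that node is failed together and the connection is rebuilt.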

The last one is completeResponses, whose process is very simple: it invokes the callback of each response, which is how the producer learns of the server's response.

  private void completeResponses(List<ClientResponse> responses) {
        for (ClientResponse response : responses) {
            try {
                //Iterate over the responses added by the handle* methods above and invoke each one's callback, so that the producer receives the server's response.
                response.onComplete();
            } catch (Exception e) {
                log.error("Uncaught error in request completion:", e);
            }
        }
    }
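
The try/catch inside the loop matters more than it looks: one misbehaving callback must not prevent the remaining responses from being delivered. The following sketch demonstrates that isolation. The class CompleteResponsesSketch and the use of Runnable in place of ClientResponse are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: run every callback even if some of them throw.
final class CompleteResponsesSketch {
    // Returns how many callbacks failed.
    static int completeAll(List<Runnable> callbacks) {
        int failed = 0;
        for (Runnable callback : callbacks) {
            try {
                callback.run();
            } catch (Exception e) {
                failed++;
                System.err.println("Uncaught error in request completion: " + e);
            }
        }
        return failed;
    }

    // Three callbacks, the middle one throws; the other two must still run.
    static String demo() {
        List<String> delivered = new ArrayList<>();
        List<Runnable> callbacks = new ArrayList<>();
        callbacks.add(() -> delivered.add("a"));
        callbacks.add(() -> { throw new IllegalStateException("boom"); });
        callbacks.add(() -> delivered.add("c"));
        int failed = completeAll(callbacks);
        return delivered + " failed=" + failed;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Since these callbacks run on the Sender thread, the same reasoning explains why slow user callbacks delay all subsequent sends.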

Topics: Java network kafka