Background
This article analyzes, at the source-code level, how the Flink runtime assigns tasks to each TaskExecutor for execution and which thread model a task runs under. It covers how the JobMaster deploys tasks to the TaskExecutor, how the TaskExecutor executes a task step by step, and the mailbox thread model of a task.
JobMaster deploys the task, TaskManager starts the Task thread
The JobMaster looks up the TaskManager that owns the allocated slot and submits the task to it. The TaskExecutor then creates a Task and starts its thread; the thread's run method executes the main computing logic (StreamTask).
- The JobMaster obtains the TaskManager gateway for the slot and submits the task through an RPC call
public void deploy() throws JobException {
    assertRunningInJobMasterMainThread();

    // Get the slot that was assigned to this execution
    final LogicalSlot slot = assignedResource;

    // ... a series of state checks is carried out here ...

    try {
        // The TaskDeploymentDescriptor describes the task to be deployed to the TaskManager
        final TaskDeploymentDescriptor deployment = TaskDeploymentDescriptorFactory
            .fromExecutionVertex(vertex, attemptNumber)
            .createDeploymentDescriptor(
                slot.getAllocationId(),
                slot.getPhysicalSlotNumber(),
                taskRestore,
                producedPartitions.values());

        // null taskRestore to let it be GC'ed
        taskRestore = null;

        // Get the gateway of the TaskManager that owns the slot
        final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();

        final ComponentMainThreadExecutor jobMasterMainThreadExecutor =
            vertex.getExecutionGraph().getJobMasterMainThreadExecutor();

        // We run the submission in the future executor so that the serialization of large TDDs does not block
        // the main thread and sync back to the main thread once submission is completed.
        // The RPC call is asynchronous and submits the task to the corresponding TaskExecutor.
        CompletableFuture.supplyAsync(() -> taskManagerGateway.submitTask(deployment, rpcTimeout), executor)
            .thenCompose(Function.identity())
            .whenCompleteAsync(
                (ack, failure) -> {
                    // only respond to the failure case
                    // ... callbacks that handle the RPC result ...
                },
                jobMasterMainThreadExecutor);
    } catch (Throwable t) {
        // ...
    }
}
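The asynchronous part of deploy() is easier to follow in isolation. Below is a minimal sketch of the same CompletableFuture pattern, not Flink code: DeploySketch, its submitTask stand-in, and the two executors named "io-executor" and "jobmaster-main" are assumptions made up for illustration. It shows that supplyAsync wraps a call that already returns a future, that thenCompose(Function.identity()) flattens the nested future, and that the callback always runs on the single "main" thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Hypothetical stand-in for the deploy() pattern; none of these names are Flink classes.
class DeploySketch {

    // Pretends to be the RPC call: it returns a future, like a gateway's submitTask would.
    static CompletableFuture<String> submitTask(String descriptor) {
        return CompletableFuture.completedFuture("ack:" + descriptor);
    }

    // Submits off the main thread, handles the result on the main thread,
    // and returns the name of the thread that ran the callback.
    static String deploy(String descriptor) {
        ExecutorService ioExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "io-executor"));
        ExecutorService mainThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "jobmaster-main"));
        AtomicReference<String> callbackThread = new AtomicReference<>();
        try {
            CompletableFuture
                    // supplyAsync produces a future of a future ...
                    .supplyAsync(() -> submitTask(descriptor), ioExecutor)
                    // ... which thenCompose(Function.identity()) flattens.
                    .thenCompose(Function.identity())
                    // The callback is handed to the single "main" thread.
                    .whenCompleteAsync(
                            (ack, failure) ->
                                    callbackThread.set(Thread.currentThread().getName()),
                            mainThread)
                    .join();
            return callbackThread.get();
        } finally {
            ioExecutor.shutdown();
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(deploy("task-0")); // prints "jobmaster-main"
    }
}
```

Because every callback is funnelled back onto one thread, state that only that thread touches can be updated without locks, which appears to be the point of passing jobMasterMainThreadExecutor in the excerpt above.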
- The TaskManager creates and runs a Task thread based on the submitted task information
public class TaskExecutor extends RpcEndpoint implements TaskExecutorGateway {

    @Override
    public CompletableFuture<Acknowledge> submitTask(
            TaskDeploymentDescriptor tdd,
            JobMasterId jobMasterId,
            Time timeout) {

        try {
            // Create the Task (it owns the thread that will execute it)
            Task task = new Task(...);

            // ...

            boolean taskAdded;
            try {
                taskAdded = taskSlotTable.addTask(task);
            } catch (SlotNotFoundException | SlotNotActiveException e) {
                throw new TaskSubmissionException("Could not submit task.", e);
            }

            if (taskAdded) {
                // The task was added to the slot table, so start its thread
                task.startTaskThread();
                // ...
            } else {
                // ...
            }
        } catch (TaskSubmissionException e) {
            return FutureUtils.completedExceptionally(e);
        }
    }
}
- Task implements the Runnable interface. Each Task owns a dedicated thread, which startTaskThread() starts and which then executes run(). The main logic of the Task thread is in the doRun() method and has two parts: 1. initialize the Task execution environment; 2. start the computation logic of the AbstractInvokable class (this is where the invoke method of StreamTask is called).
public class Task implements Runnable, TaskSlotPayload, TaskActions, PartitionProducerStateProvider, CheckpointListener, BackPressureSampleableTask {

    // Entry point of the thread that executes this Task
    @Override
    public void run() {
        try {
            doRun();
        } finally {
            terminationFuture.complete(executionState);
        }
    }

    private void doRun() {
        // ----------------------------
        //  Initialize the Task running environment
        // ----------------------------
        // ...

        AbstractInvokable invokable = null;

        // Load the invokable class of the Task via reflection and instantiate it as an AbstractInvokable
        invokable = loadAndInstantiateInvokable(userCodeClassLoader, nameOfInvokableClass, env);

        // ----------------------------------------------------------------
        //  actual task core work
        // ----------------------------------------------------------------
        this.invokable = invokable;

        // ...

        // Execute the invoke method of the AbstractInvokable object (StreamTask)
        invokable.invoke();
    }
}
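The run()/doRun() lifecycle can be condensed into a small sketch. This is an illustration under stated assumptions, not Flink's Task: TaskSketch, its State enum, and runToCompletion are hypothetical names, the invokable is reduced to a plain Runnable, and the task is assumed to run on a thread it creates for itself. What it keeps from the excerpt is the shape: the task is a Runnable, and the termination future is completed in a finally block whether the invokable succeeds or fails.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical miniature of the Task lifecycle; names are made up for illustration.
class TaskSketch implements Runnable {
    enum State { CREATED, RUNNING, FINISHED, FAILED }

    private final Runnable invokable;      // stands in for AbstractInvokable.invoke()
    private final Thread executingThread;  // the thread this task runs on
    private final CompletableFuture<State> terminationFuture = new CompletableFuture<>();
    private volatile State state = State.CREATED;

    TaskSketch(String name, Runnable invokable) {
        this.invokable = invokable;
        // The thread is created here but not started yet.
        this.executingThread = new Thread(this, name);
    }

    void startTaskThread() {
        executingThread.start();
    }

    CompletableFuture<State> getTerminationFuture() {
        return terminationFuture;
    }

    @Override
    public void run() {
        try {
            state = State.RUNNING;
            invokable.run();               // the "actual task core work"
            state = State.FINISHED;
        } catch (Throwable t) {
            state = State.FAILED;
        } finally {
            // Always signal termination, on success and on failure alike.
            terminationFuture.complete(state);
        }
    }

    // Starts a task and waits for its final state.
    static State runToCompletion(Runnable invokable) {
        TaskSketch task = new TaskSketch("sketch-task", invokable);
        task.startTaskThread();
        return task.getTerminationFuture().join();
    }

    public static void main(String[] args) {
        System.out.println(runToCompletion(() -> { }));  // prints "FINISHED"
        System.out.println(runToCompletion(() -> {
            throw new IllegalStateException("boom");
        }));                                             // prints "FAILED"
    }
}
```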
StreamTask execution process
StreamTask extends the AbstractInvokable abstract class. A StreamTask ultimately runs in the Task thread of the TaskExecutor, and it defines the logic that is executed within that thread.
- Main structure of StreamTask
[headOperator]: the head operator of StreamTask, which is the first operator in the operatorChain (StreamTask transmits data to the head operator through DataOutput).
[operatorChain]: a series of operators executed in a Task.
[StreamInputProcessor]: reads and processes data from the network or a data source; its processInput() is what the default action of the mailboxProcessor runs.
[stateBackend]: the state backend used by StreamTask.
[mailboxProcessor]: all data reading (through processInput() of the StreamInputProcessor object) and all event handling (checkpoints, etc.) are carried out serially through this object, which turns Task execution into a single thread plus a blocking queue. This mailbox mechanism, borrowed from the Actor model, replaces the previous multi-threaded model, so no lock is needed when processing checkpoints and other events.
- Execution process of StreamTask
StreamTask reads data and events through the mailboxProcessor and executes operator logic
public abstract class StreamTask<OUT, OP extends StreamOperator<OUT>> extends AbstractInvokable implements AsyncExceptionHandler {

    protected StreamTask(
            Environment environment,
            @Nullable TimerService timerService,
            Thread.UncaughtExceptionHandler uncaughtExceptionHandler,
            StreamTaskActionExecutor.SynchronizedStreamTaskActionExecutor actionExecutor,
            TaskMailbox mailbox) {

        super(environment);

        // ...

        // The default action of the mailboxProcessor in StreamTask is StreamTask.processInput()
        this.mailboxProcessor = new MailboxProcessor(this::processInput, mailbox, actionExecutor);
    }

    @Override
    public final void invoke() throws Exception {
        try {
            beforeInvoke();

            // ...

            // Run the task in mailbox mode
            isRunning = true;
            runMailboxLoop();

            // ...

            afterInvoke();
        } finally {
            cleanUpInvoke();
        }
    }

    private void runMailboxLoop() throws Exception {
        try {
            // The mailboxProcessor reads data (operator logic) and Mail (checkpoints and other events)
            mailboxProcessor.runMailboxLoop();
        } catch (Exception e) {
            // ...
        }
    }

    // Default action executed by the mailbox loop (read data)
    protected void processInput(MailboxDefaultAction.Controller controller) throws Exception {
        // The data is read through the StreamInputProcessor and passed to the headOperator
        InputStatus status = inputProcessor.processInput();
        // ...
    }
}
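To make the "single thread + blocking queue" idea concrete, here is a minimal sketch under stated assumptions. MailboxModelSketch and all of its members are hypothetical and much simpler than Flink's MailboxProcessor: the mailbox is a plain LinkedBlockingQueue of Runnables, the default action just counts records, and a "coordinator" thread plays the role of whoever triggers a checkpoint. Other threads only enqueue mail; the mailbox thread alone runs both the mail and the default action, so the counter and the log need no lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of "single thread + blocking queue"; not Flink's MailboxProcessor.
class MailboxModelSketch {
    private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    // Plain, unsynchronized state: only the mailbox thread ever touches it.
    private final List<String> log = new ArrayList<>();
    private int processed = 0;
    private boolean running = true;

    // May be called from any thread; it only enqueues and never runs the mail itself.
    void submit(Runnable mail) {
        mailbox.add(mail);
    }

    // The default action: handle one record on the mailbox thread.
    private void processInput(int totalRecords) {
        processed++;
        log.add("record-" + processed);
        if (processed == totalRecords) {
            running = false; // input exhausted, stop the loop
        }
    }

    // The mailbox loop: events first, then one step of data processing.
    List<String> runLoop(int totalRecords) {
        while (running) {
            Runnable mail;
            while ((mail = mailbox.poll()) != null) {
                mail.run();
            }
            processInput(totalRecords);
        }
        return log;
    }

    static List<String> demo() {
        MailboxModelSketch task = new MailboxModelSketch();
        // Another thread asks for a "checkpoint" by submitting mail.
        Thread coordinator = new Thread(
                () -> task.submit(() -> task.log.add("checkpoint@" + task.processed)));
        coordinator.start();
        try {
            // Joining first only makes the demo deterministic; it is not part of the model.
            coordinator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return task.runLoop(3);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[checkpoint@0, record-1, record-2, record-3]"
    }
}
```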
StreamTask thread model -- MailboxProcessor
- Main components of the StreamTask thread model
[Mail]: encapsulates the execution logic of an event in its run method (for example triggerCheckpoint or notifyCheckpointComplete); its priority parameter controls the order in which Mail is executed, which prevents deadlocks.
[MailboxDefaultAction]: default data processing logic (StreamTask.processInput())
[MailboxExecutor]: provides the operation of submitting Mail to Mailbox
[TaskMailbox]: stores Mail in two queues: queue, a blocking queue, and batch, a non-blocking queue. createBatch() moves the Mail currently in queue into batch, and tryTakeFromBatch() then takes Mail from batch one at a time.
- Execution logic of MailboxProcessor
The MailboxProcessor first processes the Mail events in the mailbox and then reads data through the StreamInputProcessor object.
public class MailboxProcessor implements Closeable {

    public void runMailboxLoop() throws Exception {
        final TaskMailbox localMailbox = mailbox;

        // ...

        // Loop: process Mail first, then read data
        while (processMail(localMailbox)) {
            // Runs StreamTask.processInput(), which calls the StreamInputProcessor object to process data
            mailboxDefaultAction.runDefaultAction(defaultActionContext);
        }
    }

    // Processes the Mail currently in the mailbox, then returns whether the loop should keep running
    private boolean processMail(TaskMailbox mailbox) throws Exception {
        // Move all Mail from the blocking queue into the non-blocking batch
        if (!mailbox.createBatch()) {
            return true;
        }

        Optional<Mail> maybeMail;
        while (isMailboxLoopRunning() && (maybeMail = mailbox.tryTakeFromBatch()).isPresent()) {
            // Execute the Mail through its run method
            maybeMail.get().run();
        }

        // While the default action is unavailable, block and wait for Mail
        while (isDefaultActionUnavailable() && isMailboxLoopRunning()) {
            mailbox.take(MIN_PRIORITY).run();
        }

        return isMailboxLoopRunning();
    }
}
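The split between queue and batch can be sketched as follows. This is an illustration, not Flink's TaskMailbox: BatchMailboxSketch is a hypothetical class that borrows only the method names createBatch() and tryTakeFromBatch() from the excerpt, and the "default-action" log entry is a placeholder for processInput(). The sketch suggests one reason for batching: Mail that arrives while a batch is being processed waits for the next round, so the default action still gets its turn.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical two-level mailbox; method names follow the excerpt, the rest is made up.
class BatchMailboxSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Deque<Runnable> queue = new ArrayDeque<>(); // shared, guarded by lock
    private final Deque<Runnable> batch = new ArrayDeque<>(); // mailbox thread only, no lock

    // Any thread may put mail; this takes the lock.
    void put(Runnable mail) {
        lock.lock();
        try {
            queue.addLast(mail);
        } finally {
            lock.unlock();
        }
    }

    // Moves everything queued so far into the batch; true if the batch has mail.
    boolean createBatch() {
        lock.lock();
        try {
            Runnable mail;
            while ((mail = queue.pollFirst()) != null) {
                batch.addLast(mail);
            }
            return !batch.isEmpty();
        } finally {
            lock.unlock();
        }
    }

    // Lock-free: only the mailbox thread reads the batch.
    Optional<Runnable> tryTakeFromBatch() {
        return Optional.ofNullable(batch.pollFirst());
    }

    static List<String> demo() {
        BatchMailboxSketch mailbox = new BatchMailboxSketch();
        List<String> log = new ArrayList<>();
        mailbox.put(() -> {
            log.add("mail-1");
            mailbox.put(() -> log.add("mail-3")); // arrives while batch 1 is running
        });
        mailbox.put(() -> log.add("mail-2"));

        for (int round = 1; mailbox.createBatch(); round++) {
            log.add("batch-" + round);
            Optional<Runnable> mail;
            while ((mail = mailbox.tryTakeFromBatch()).isPresent()) {
                mail.get().run();
            }
            log.add("default-action"); // placeholder for processInput()
        }
        return log;
    }

    public static void main(String[] args) {
        // prints "[batch-1, mail-1, mail-2, default-action, batch-2, mail-3, default-action]"
        System.out.println(demo());
    }
}
```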
StreamInputProcessor processes data
- Structure of StreamInputProcessor
[StreamTaskInput]: obtains data from outside the Task (from the network or a data source) and, while doing so, updates watermarks and aligns barriers.
[DataOutput]: sends the data read by StreamTaskInput to the headOperator of the current Task for processing.
[OperatorChain]: the series of operators running in the same Task, together with the RecordWriters, which partition each record and buffer it for downstream tasks to pull.
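The relationship between the three parts can be sketched roughly as below. Everything here is a simplified assumption for illustration: InputProcessorSketch, IteratorInput, and the two-value InputStatus are hypothetical, the "operator chain" is reduced to two composed functions, and the list named downstream stands in for the RecordWriter. The sketch only shows the direction of the data flow: the input pushes each record into a DataOutput, which hands it to the head operator, which passes it along the chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical, heavily simplified model of input -> DataOutput -> operator chain.
class InputProcessorSketch {
    enum InputStatus { MORE_AVAILABLE, END_OF_INPUT }

    // Receives each record read by the input.
    interface DataOutput<T> {
        void emitRecord(T record);
    }

    // Stand-in for StreamTaskInput: pushes one record per call into the output.
    static final class IteratorInput<T> {
        private final Iterator<T> source;

        IteratorInput(Iterator<T> source) {
            this.source = source;
        }

        InputStatus emitNext(DataOutput<T> output) {
            if (!source.hasNext()) {
                return InputStatus.END_OF_INPUT;
            }
            output.emitRecord(source.next());
            return source.hasNext() ? InputStatus.MORE_AVAILABLE : InputStatus.END_OF_INPUT;
        }
    }

    static List<String> demo() {
        List<String> downstream = new ArrayList<>();           // stands in for the RecordWriter
        Function<String, String> headOperator = String::toUpperCase;
        Function<String, String> chain = headOperator.andThen(s -> s + "!");

        // The DataOutput hands every record to the head of the chain.
        DataOutput<String> output = record -> downstream.add(chain.apply(record));

        IteratorInput<String> input = new IteratorInput<>(Arrays.asList("a", "b").iterator());
        // One call corresponds to one run of the default action.
        while (input.emitNext(output) == InputStatus.MORE_AVAILABLE) {
            // keep reading
        }
        return downstream;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[A!, B!]"
    }
}
```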
- Data processing flow of StreamInputProcessor
This is analyzed in detail in the next article, "Analysis of flink network communication (network stack)".