Research on Storm source code analysis

Posted by skizzay on Thu, 09 Dec 2021 05:40:07 +0100

2021SC@SDUSC

Bolt node analysis of Trident

SubtopologyBolt is the basic execution unit in Trident, but it is not a real Bolt node by itself. Trident adapts a SubtopologyBolt by wrapping it in a TridentBoltExecutor.

TridentBoltExecutor implements the IRichBolt interface and is the Bolt node that actually runs in a Trident topology. It provides functionality similar to a coordinated Bolt, synchronizing the nodes by sending coordination messages.

SubtopologyBolt mainly abstracts the execution of TridentProcessors. This article discusses the implementation of SubtopologyBolt and TridentBoltExecutor.
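
As a rough illustration of this adapter relationship, the following self-contained sketch (hypothetical interfaces, not Storm's actual classes) shows how a per-tuple executor can wrap a batch-oriented bolt and use a coordination signal to decide when a batch is finished:

import java.util.ArrayList;
import java.util.List;

public class BatchAdapterSketch {
    // Hypothetical stand-in for ITridentBatchBolt: thinks in terms of whole batches.
    interface BatchBolt {
        Object initBatchState(Object batchId);
        void execute(Object batchState, String tuple);
        void finishBatch(Object batchState);
    }

    // Hypothetical stand-in for TridentBoltExecutor: receives individual tuples plus a
    // coordination signal, and translates them into batch lifecycle calls.
    static class Executor {
        private final BatchBolt delegate;
        private Object currentBatchState;

        Executor(BatchBolt delegate) {
            this.delegate = delegate;
        }

        void onTuple(Object batchId, String tuple) {
            if (currentBatchState == null) {
                currentBatchState = delegate.initBatchState(batchId);
            }
            delegate.execute(currentBatchState, tuple);
        }

        // Called once the coordination message says every tuple of the batch has arrived.
        void onCoordination() {
            delegate.finishBatch(currentBatchState);
            currentBatchState = null;
        }
    }

    public static void main(String[] args) {
        BatchBolt collecting = new BatchBolt() {
            @Override
            public Object initBatchState(Object batchId) {
                return new ArrayList<String>();
            }

            @Override
            @SuppressWarnings("unchecked")
            public void execute(Object batchState, String tuple) {
                ((List<String>) batchState).add(tuple);
            }

            @Override
            public void finishBatch(Object batchState) {
                System.out.println("batch finished: " + batchState);
            }
        };
        Executor executor = new Executor(collecting);
        executor.onTuple("batch-1", "a");
        executor.onTuple("batch-1", "b");
        executor.onCoordination(); // prints: batch finished: [a, b]
    }
}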

SubtopologyBolt.java

The class is defined as:

public class SubtopologyBolt implements ITridentBatchBolt {
    private static final long serialVersionUID = 1475508603138688412L;
    @SuppressWarnings("rawtypes")
    final DirectedGraph<Node, IndexedEdge> graph;
    final Set<Node> nodes;
    final Map<String, InitialReceiver> roots = new HashMap<>();
    final Map<Node, Factory> outputFactories = new HashMap<>();
    final Map<String, List<TridentProcessor>> myTopologicallyOrdered = new HashMap<>();
    final Map<Node, String> batchGroups;


graph: the directed graph corresponding to the entire topology.
nodes: the processing nodes contained in this Bolt, a subset of the nodes in graph.
roots: each input stream corresponds to an InitialReceiver object, which specifies how messages of that stream are processed.
outputFactories: each processing node corresponds to an output Factory.
myTopologicallyOrdered: the key is the batch group identifier and the value is the list of TridentProcessors belonging to that batch group, in topological order.
batchGroups: a reverse index indicating which batch group each node belongs to; a batch group corresponds to a maximal subgraph of graph whose nodes are processed as part of the same batch (see the sketch below).
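
To make the "reverse index" concrete, here is a minimal sketch (with made-up node and group names) of inverting a batch-group-to-nodes mapping into the node-to-batch-group form stored in batchGroups:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchGroupIndexSketch {
    public static void main(String[] args) {
        // Forward mapping: each batch group and the nodes it contains (hypothetical names).
        Map<String, List<String>> groupToNodes = new HashMap<>();
        groupToNodes.put("bg0", Arrays.asList("spout-node", "each-node"));
        groupToNodes.put("bg1", Arrays.asList("aggregate-node"));

        // Reverse index: lets prepare() answer "which batch group does this node belong to?"
        // with a single lookup, which is how batchGroups.get(n) is used.
        Map<String, String> nodeToGroup = new HashMap<>();
        for (Map.Entry<String, List<String>> e : groupToNodes.entrySet()) {
            for (String node : e.getValue()) {
                nodeToGroup.put(node, e.getKey());
            }
        }

        System.out.println(nodeToGroup.get("each-node")); // bg0
    }
}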

Main methods:

    public void prepare(Map<String, Object> conf, TopologyContext context, BatchOutputCollector batchCollector) {
        int thisComponentNumTasks = context.getComponentTasks(context.getThisComponentId()).size();
        for (Node n : nodes) {
            if (n.stateInfo != null) {
                State s = n.stateInfo.spec.stateFactory.makeState(conf, context, context.getThisTaskIndex(), thisComponentNumTasks);
                context.setTaskData(n.stateInfo.id, s);
            }
        }
        DirectedSubgraph<Node, ?> subgraph = new DirectedSubgraph<>(graph, nodes, null);
        TopologicalOrderIterator<Node, ?> it = new TopologicalOrderIterator<>(subgraph);
        int stateIndex = 0;
        while (it.hasNext()) {
            Node n = it.next();
            if (n instanceof ProcessorNode) {
                ProcessorNode pn = (ProcessorNode) n;
                String batchGroup = batchGroups.get(n);
                if (!myTopologicallyOrdered.containsKey(batchGroup)) {
                    myTopologicallyOrdered.put(batchGroup, new ArrayList<>());
                }
                myTopologicallyOrdered.get(batchGroup).add(pn.processor);
                List<String> parentStreams = new ArrayList<>();
                List<Factory> parentFactories = new ArrayList<>();
                for (Node p : TridentUtils.getParents(graph, n)) {
                    parentStreams.add(p.streamId);
                    if (nodes.contains(p)) {
                        parentFactories.add(outputFactories.get(p));
                    } else {
                        if (!roots.containsKey(p.streamId)) {
                            roots.put(p.streamId, new InitialReceiver(p.streamId, getSourceOutputFields(context, p.streamId)));
                        }
                        roots.get(p.streamId).addReceiver(pn.processor);
                        parentFactories.add(roots.get(p.streamId).getOutputFactory());
                    }
                }
                List<TupleReceiver> targets = new ArrayList<>();
                boolean outgoingNode = false;
                for (Node cn : TridentUtils.getChildren(graph, n)) {
                    if (nodes.contains(cn)) {
                        targets.add(((ProcessorNode) cn).processor);
                    } else {
                        outgoingNode = true;
                    }
                }
                if (outgoingNode) {
                    targets.add(new BridgeReceiver(batchCollector));
                }

                TridentContext triContext = new TridentContext(
                    pn.selfOutFields,
                    parentFactories,
                    parentStreams,
                    targets,
                    pn.streamId,
                    stateIndex,
                    batchCollector
                );
                pn.processor.prepare(conf, context, triContext);
                outputFactories.put(n, pn.processor.getOutputFactory());
            }
            stateIndex++;
        }
    }

for (Node n : nodes):
In this loop, check whether the node's stateInfo is null; if it is not, initialize the stored State object. The initialized State object is stored in the taskData of the TopologyContext with stateInfo.id as the key. stateInfo.id is a string unique within the topology, prefixed with "state".
subgraph:
A subgraph is obtained from the nodes contained in this SubtopologyBolt.
it:
The subgraph is sorted topologically. The it variable, of type TopologicalOrderIterator, traverses the subgraph in topological order (a self-contained ordering sketch follows this list).
if (n instanceof ProcessorNode):
SubtopologyBolt operates only on processing nodes; each processing node contains a TridentProcessor. Spout nodes and partition nodes are outside the scope of SubtopologyBolt.
pn.processor.prepare(conf, context, triContext):
Calls the prepare method of the TridentProcessor, passing the newly constructed TridentContext object as a parameter.
outputFactories.put(n, pn.processor.getOutputFactory()):
Adds the output Factory of the TridentProcessor to the outputs of the SubtopologyBolt, so that downstream processing nodes can use it as their parent Factory.
stateIndex variable:
Used to uniquely identify each node within the SubtopologyBolt.
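
The central idea of prepare is to visit the nodes of this bolt in topological order, so that each processor is prepared only after its parents. Below is a self-contained sketch of that ordering (Kahn's algorithm over a plain adjacency map; Storm itself uses JGraphT's DirectedSubgraph and TopologicalOrderIterator, and the node names here are hypothetical):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopoOrderSketch {
    // Returns the nodes of a DAG in topological order (every parent before its children).
    static List<String> topoOrder(Map<String, List<String>> children) {
        Map<String, Integer> indegree = new HashMap<>();
        for (String n : children.keySet()) {
            indegree.putIfAbsent(n, 0);
        }
        for (List<String> cs : children.values()) {
            for (String c : cs) {
                indegree.merge(c, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : indegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            order.add(n);
            for (String c : children.getOrDefault(n, Collections.emptyList())) {
                if (indegree.merge(c, -1, Integer::sum) == 0) {
                    ready.add(c);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical node names standing in for the nodes of a SubtopologyBolt.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("root-stream", Arrays.asList("each"));
        graph.put("each", Arrays.asList("aggregate"));
        graph.put("aggregate", Collections.emptyList());
        System.out.println(topoOrder(graph)); // [root-stream, each, aggregate]
    }
}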

    @Override
    public void execute(BatchInfo batchInfo, Tuple tuple) {
        String sourceStream = tuple.getSourceStreamId();
        InitialReceiver ir = roots.get(sourceStream);
        if (ir == null) {
            throw new RuntimeException("Received unexpected tuple " + tuple.toString());
        }
        ir.receive((ProcessorContext) batchInfo.state, tuple);
    }

First, based on the source stream id of the input tuple, the corresponding InitialReceiver object is looked up in roots and its receive method is called.
The receive method invokes the execute method of every TridentProcessor that consumes that stream.
Inside the execute method of a TridentProcessor, the execute methods of the downstream TridentProcessors are called in turn, forming a call chain until the SubtopologyBolt finishes processing the message (a minimal sketch of this chain follows).
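
The sketch below (hypothetical Receiver/Step types, not Storm's TupleReceiver implementations) shows the idea: each step transforms the value and immediately invokes its downstream targets, so one incoming tuple traverses the whole subgraph within a single call stack:

import java.util.Arrays;
import java.util.List;

public class ReceiverChainSketch {
    interface Receiver {
        void execute(String value);
    }

    // A processing step that transforms the value and forwards it to its targets.
    static class Step implements Receiver {
        private final String name;
        private final List<Receiver> targets;

        Step(String name, Receiver... targets) {
            this.name = name;
            this.targets = Arrays.asList(targets);
        }

        @Override
        public void execute(String value) {
            String out = value + "->" + name;
            if (targets.isEmpty()) {
                // A leaf step: in Storm this role is played by a BridgeReceiver that
                // emits the tuple out of the SubtopologyBolt.
                System.out.println(out);
            } else {
                for (Receiver t : targets) {
                    t.execute(out);
                }
            }
        }
    }

    public static void main(String[] args) {
        Receiver leaf = new Step("project");
        Receiver middle = new Step("filter", leaf);
        Receiver root = new Step("each", middle);
        root.execute("tuple"); // prints: tuple->each->filter->project
    }
}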

    @Override
    public void finishBatch(BatchInfo batchInfo) {
        for (TridentProcessor p : myTopologicallyOrdered.get(batchInfo.batchGroup)) {
            p.finishBatch((ProcessorContext) batchInfo.state);
        }
    }

    @Override
    public Object initBatchState(String batchGroup, Object batchId) {
        ProcessorContext ret = new ProcessorContext(batchId, new Object[nodes.size()]);
        for (TridentProcessor p : myTopologicallyOrdered.get(batchGroup)) {
            p.startBatch(ret);
        }
        return ret;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        for (Node n : nodes) {
            declarer.declareStream(n.streamId, TridentUtils.fieldsConcat(new Fields("$batchId"), n.allOutputFields));
        }
    }

In the initBatchState method, the data of the ProcessorContext is initialized and the ProcessorContext object is returned. Aggregators in Trident are built on top of the per-batch data stored in the state array of the ProcessorContext (see the sketch below).
The declareOutputFields method declares the output stream of each node in the SubtopologyBolt, with $batchId as the first field. Although the SubtopologyBolt exists as a whole, the output of any of its nodes may become the final output.
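
The following simplified sketch (hypothetical classes modeled on ProcessorContext, not Storm's own) shows how per-batch state can live in an Object[] sized to the number of nodes and indexed by each processor's stateIndex, which is the kind of mechanism an aggregator can build on:

public class BatchStateSketch {
    // Hypothetical analogue of ProcessorContext: one batch id plus a state slot per node.
    static class Context {
        final Object batchId;
        final Object[] state;

        Context(Object batchId, int numNodes) {
            this.batchId = batchId;
            this.state = new Object[numNodes];
        }
    }

    // An aggregator-style processor that accumulates a count in its own state slot.
    static class CountProcessor {
        private final int stateIndex;

        CountProcessor(int stateIndex) {
            this.stateIndex = stateIndex;
        }

        void startBatch(Context ctx) {
            ctx.state[stateIndex] = 0L;
        }

        void execute(Context ctx) {
            ctx.state[stateIndex] = (Long) ctx.state[stateIndex] + 1;
        }

        void finishBatch(Context ctx) {
            System.out.println("batch " + ctx.batchId + " count = " + ctx.state[stateIndex]);
        }
    }

    public static void main(String[] args) {
        Context ctx = new Context("batch-7", 3);
        CountProcessor counter = new CountProcessor(2); // this node's stateIndex is 2
        counter.startBatch(ctx);
        counter.execute(ctx);
        counter.execute(ctx);
        counter.finishBatch(ctx); // prints: batch batch-7 count = 2
    }
}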

Topics: Big Data, Storm