1 Distributed Algorithms
1.1 Introduction
The difference between parallel processing and distributed processing:
- Parallel processing: the goal is to use all processors cooperatively on one large task.
- Distributed processing: there is a higher degree of uncertainty and independence of behavior; each processor has its own independent task.
Functions of a distributed system:
- resource sharing
- performance improvement
- reliability improvement
Models:
- Asynchronous shared-memory model: for tightly coupled machines
- Asynchronous message-passing model: for loosely coupled machines and WANs
- Synchronous message-passing model: the upper bound on message delay is known, and the execution is divided into rounds; a special case of the asynchronous system
1.2 Basic algorithms in message passing
Each processor \(p_i\) can be modeled as a state machine with state set \(Q_i\).
Initial states: \(Q_i\) contains a special subset of initial states; in an initial state every inbuf must be empty, while the outbufs need not be.
Transition function: takes the current (accessible) state and the messages in the inbufs, and produces a new state and messages in the outbufs.
Configuration: the global state of the distributed system at some point of the algorithm's run.
Events: computation events and delivery events.
Execution: a sequence of alternating configurations and events.
An execution must satisfy safety and liveness conditions.
- Safety: a property holds in every reachable configuration of every execution; "bad things never happen".
- Liveness: a property holds in some reachable configuration of every execution; "eventually something good happens".
A sequence satisfying the safety conditions is called an execution; an execution that also satisfies the liveness conditions is called an admissible execution.
1.2.1 Transition systems, safety, and liveness
Transition system: a triple \(S=(C,\rightarrow,I)\), where \(C\) is the set of configurations, \(I \subseteq C\) is the set of initial configurations, and \(\rightarrow\) is a binary transition relation on \(C\).
Reachable: \(\delta\) is reachable if there exists a sequence \(\gamma_0 \rightarrow \gamma_1 \rightarrow \cdots \rightarrow \gamma_k = \delta\) with \(\gamma_0 \in I\).
Safety in \(S=(C,\rightarrow,I)\): an assertion (property) \(P\) holds in every reachable configuration.
\(\{P\} \rightarrow \{Q\}\): for every transition \(\gamma \rightarrow \delta\), if \(P(\gamma)\) then \(Q(\delta)\); that is, if P holds before the transition, Q holds after it.
Definition: \(P\) is an invariant of \(S\) if \(P(\gamma)\) holds for every \(\gamma \in I\) and \(\{P\} \rightarrow \{P\}\); then P always holds, and P is a safety property.
Theorem: if \(P\) is an invariant of \(S\), then \(P\) holds in every configuration of every execution of \(S\).
Liveness: in every execution of the algorithm, \(P\) eventually holds in some configuration.
How to establish liveness:
- norm function
- absence of deadlock / normal termination of the system
Norm function: for a transition system \(S\) and assertion \(P\), a function \(f\) from the configuration set \(C\) to a well-founded set \(\Omega\) that strictly decreases with every transition taken while \(P\) does not yet hold.
Well-founded set: a partially ordered set \((\Omega, <)\) is well-founded if and only if it contains no infinite decreasing sequence, i.e., every descending chain reaches a minimal element.
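A minimal worked example of a norm-function argument (my own illustration, not from the notes): take \(C=\mathbb{N}\), \(I=\{n_0\}\), and the transition \(k \rightarrow k-1\) for every \(k>0\). With \(f\) the identity map into the well-founded set \((\mathbb{N},<)\), \(f\) strictly decreases with every transition, so no execution can keep transitioning forever; every execution reaches the terminal configuration \(0\). This is how a norm function certifies a liveness property such as termination.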
1.2.2 Systems
Asynchronous systems
Asynchrony: there is no upper bound on message delivery time or on the time between two successive steps of a processor.
Execution fragment: a finite or infinite sequence of alternating configurations and events.
Execution: an execution fragment that starts from an initial configuration.
Schedule: the sequence of events in an execution.
Admissible execution: every processor has an infinite number of computation events (this means the processor never fails, not that it takes infinitely many useful steps), and every message sent is eventually delivered.
Synchronous systems
In each round, every processor may send a message to each neighbor, and each processor computes as soon as it has received its messages.
Round: the sequence of configurations and events can be partitioned into disjoint rounds; each round consists of delivery events followed by computation events.
Differences between synchronous and asynchronous systems:
- In a failure-free synchronous system, the execution of an algorithm depends only on the initial configuration.
- In an asynchronous system, the same algorithm may have different executions from the same initial configuration (which does not necessarily mean different results).
1.2.3 Complexity measures
Message complexity: the maximum number of messages sent over all admissible executions.
Time complexity:
- Synchronous systems: the maximum number of rounds until termination.
- Asynchronous systems: the maximum, over all timed admissible executions, of the time until termination (with each message delay normalized to at most one time unit).
1.2.4 Broadcast and convergecast on a spanning tree
Spanning tree (ST): a subgraph that contains all the vertices and has no cycles.
Minimum spanning tree (MST): the spanning tree whose total edge weight is minimum.
Broadcast
Basic steps:
- The root \(p_r\) sends \(M\) to all its children.
- When a node receives \(M\) from its parent, it sends \(M\) to all of its children.
```
upon receiving no message:
    if i = r then {
        send <M> to all children;
        terminate;
    }

upon receiving <M> from p_j:
    send <M> to all children;
    terminate;
```
Message complexity: \(O(n)\)
Time complexity: \(O(h)\), where \(h\) is the height of the spanning tree
Convergecast
Basic steps:
- Each leaf node sends a message to its parent.
- Each non-leaf node waits for messages from all its children before sending a message to its parent.
Message complexity: \(O(n)\)
Time complexity: \(O(h)\), where \(h\) is the height of the spanning tree
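As a concrete illustration, here is a minimal sequential Python sketch of both patterns (my own example; the tree shape, node ids, and the max aggregation are assumptions, not from the notes). Broadcast pushes \(M\) down level by level; convergecast aggregates leaf values upward.

```python
# Sequential sketch of broadcast and convergecast on a fixed spanning tree.
# children maps each node to its children; values are assumed per-node inputs.

children = {0: [1, 2], 1: [3, 4], 2: [5], 3: [], 4: [], 5: []}
values = {0: 7, 1: 3, 2: 9, 3: 1, 4: 8, 5: 2}

def broadcast(root, msg):
    """Root sends msg to its children; each node forwards it on receipt."""
    received = {root: msg}
    frontier = [root]
    while frontier:                      # each level is one "round": O(h) time
        nxt = []
        for node in frontier:
            for child in children[node]:
                received[child] = msg    # one message per tree edge: n-1 in total
                nxt.append(child)
        frontier = nxt
    return received

def convergecast(node):
    """Each node waits for all children, then reports the aggregate (here: max) up."""
    return max([values[node]] + [convergecast(c) for c in children[node]])

print(broadcast(0, "M"))   # every node receives M
print(convergecast(0))     # 9: max over all values, computed leaves-first
```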
1.2.5 Constructing a spanning tree
Flooding algorithm
Message complexity: \(O(2m-(n-1))\)
Time complexity: \(O(D)\), where \(D\) is the network diameter
Basic idea:
- The root node sends the message \(M\) to all its neighbors.
- When \(p_i\) receives \(M\) from \(p_j\) and it is the first time \(p_i\) receives \(M\): \(p_i\) sends \(<parent>\) to \(p_j\), sends \(<reject>\) to any other node from which it later receives \(M\), and forwards \(M\) to all neighbors other than \(p_j\).
- \(p_i\) receiving a \(<parent>\) message from \(p_j\) means that \(p_i\) is \(p_j\)'s parent.
```
Code for p_i (0 <= i <= n-1)
Initially: parent = nil; children = ∅; other = ∅

upon receiving no message:
    if i = r and parent = nil then {     // the root has not yet sent M
        send M to all neighbors;
        parent := i;                     // the root takes itself as its parent
    }

upon receiving M from neighbor p_j:
    if parent = nil then {               // M is the first message p_i has received
        parent := j;
        send <parent> to p_j;            // p_j is p_i's parent
        send M to all neighbors except p_j;
    } else
        send <reject> to p_j;            // not the first copy: p_j cannot be p_i's parent

upon receiving <parent> from neighbor p_j:
    children := children ∪ {j};          // p_j is a child of p_i
    if children ∪ other contains all neighbors except parent then terminate;

upon receiving <reject> from neighbor p_j:
    other := other ∪ {j};                // (i, j) is a non-tree edge
    if children ∪ other contains all neighbors except parent then terminate;
```
In the asynchronous model, the algorithm constructs some spanning tree rooted at \(p_r\).
In the synchronous model, the constructed tree is necessarily a BFS tree.
A BFS tree cannot be guaranteed in an asynchronous system because message delay has no upper bound; in the worst case the tree may degenerate into a single chain.
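A round-by-round Python sketch of flooding (my own illustration on an assumed graph). Because the synchronous model moves \(M\) exactly one hop per round, the first copy of \(M\) reaches each node along a shortest path, so the parent pointers form a BFS tree:

```python
from collections import deque

# Assumed example graph as an adjacency map.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}

def flood(root):
    """Synchronous flooding: a node adopts as parent the neighbor whose copy
    of M arrives first, which here is always along a shortest path."""
    parent = {root: root}
    frontier = deque([root])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in parent:          # first receipt of M: v answers <parent>
                parent[v] = u
                frontier.append(v)
            # otherwise v answers <reject>: (u, v) becomes a non-tree edge
    return parent

print(flood(0))   # {0: 0, 1: 0, 2: 0, 3: 1, 4: 2}: a BFS tree rooted at 0
```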
1.2.6 Constructing a DFS spanning tree with a specified root
Basic idea:
- The root node \(p_r\) sends the message \(M\) to one adjacent node.
- When the \(M\) that \(p_i\) receives from \(p_j\) is the first message \(p_i\) has received from any neighbor, \(p_i\) takes \(p_j\) as its parent and sends \(<reject>\) to any node that sends it \(M\) later.
- \(p_i\) sends \(M\) to one neighbor it has not yet contacted, and waits for the reply.
- Once \(p_i\) has exchanged messages with all its neighbors, \(p_i\) terminates.
```
Code for p_i (0 <= i <= n-1)
Initially: parent = nil; children = ∅; unexplored = all neighbors of p_i

upon receiving no message:
    if i = r and parent = nil then {     // p_i is the root and M has not been sent yet
        parent := i;                     // set parent to itself
        choose some p_j ∈ unexplored and remove it;
        send M to p_j;
    }

upon receiving M from neighbor p_j:
    if parent = nil then {               // first time p_i receives M
        parent := j;
        remove p_j from unexplored;
        if unexplored ≠ ∅ then {
            choose some p_k ∈ unexplored and remove it;
            send M to p_k;
        } else
            send <parent> to parent;
    } else
        send <reject> to p_j;            // p_j has already been visited

upon receiving <parent> or <reject> from neighbor p_j:
    if received <parent> then add j to children;     // p_j is p_i's child
    if unexplored = ∅ then {             // all of p_i's neighbors have been visited
        if parent ≠ i then send <parent> to parent;  // not the root: return the token
        terminate;                       // the DFS subtree rooted at p_i is complete
    } else {                             // explore an unvisited neighbor
        choose some p_k ∈ unexplored and remove it;
        send M to p_k;
    }
```
Message complexity and time complexity: \(O(m)\)
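A sequential Python sketch of the traversal (my own illustration; the graph is assumed). The token \(M\) visits one neighbor at a time, and each edge carries only a constant number of messages, which is where the \(O(m)\) bounds come from:

```python
# Sequential sketch of rooted DFS spanning-tree construction (graph assumed).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

parent = {0: 0}   # the root (node 0) is its own parent
msgs = 0

def dfs(u):
    """p_u forwards the token M to one neighbor at a time; a visited neighbor
    answers <reject>, an unvisited one becomes a child and recurses."""
    global msgs
    for v in adj[u]:
        msgs += 1                 # M over edge (u, v), answered by <parent>/<reject>
        if v not in parent:       # v had no parent yet: u becomes its parent
            parent[v] = u
            dfs(v)

dfs(0)
print(parent, msgs)               # parent pointers of the DFS tree; msgs = 2m here
```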
1.2.7 Constructing a DFS spanning tree without a specified root
Like leader election, this is a symmetry-breaking problem.
Basic idea:
- Each node may wake up spontaneously and try to construct a DFS spanning tree rooted at itself. If two DFS trees try to link the same node (not necessarily at the same time), the tree whose root has the larger ID wins and the node joins it.
- Each node keeps a leader variable holding the root ID of the DFS tree it currently belongs to.
- When a node wakes up spontaneously, it sends its leader value to one neighbor.
```
Code for p_i (0 <= i <= n-1)
Initially: parent = nil; leader = 0; children = ∅; unexplored = all neighbors of p_i

upon receiving no message:                       // spontaneous wake-up
    if parent = nil then {                       // otherwise p_i already joined a tree and lost
        leader := id;
        parent := i;                             // set parent to itself
        choose some p_j ∈ unexplored and remove it;
        send <leader> to p_j;
    }

upon receiving <new_id> from neighbor p_j:
    if leader < new_id then {                    // merge p_i's tree into p_j's tree
        leader := new_id;
        parent := j;
        unexplored := all neighbors of p_i except p_j;  // restart exploration under the new id
        if unexplored ≠ ∅ then {
            choose some p_k ∈ unexplored and remove it;
            send <leader> to p_k;
        } else
            send <parent> to parent;
    } else if leader = new_id then
        send <already> to p_j;
    // if leader > new_id, the message is swallowed

upon receiving <parent> or <already> from neighbor p_j:
    if received <parent> then add j to children;
    if unexplored = ∅ then {                     // all of p_i's neighbors have been visited
        if parent ≠ i then send <parent> to parent;     // not the root: report back
        else terminate as the root of the DFS tree;
    } else {                                     // explore an unvisited neighbor
        choose some p_k ∈ unexplored and remove it;
        send <leader> to p_k;
    }
```
For a network with m edges and n nodes in which p nodes start spontaneously, let t be the start time of the node with the largest ID.
Message complexity: \(O(pn^2)\)
Time complexity: \(O(t+m)\)
1.3 Election algorithms on rings
Anonymous: the processors in the ring have no unique identifiers; every processor runs the same state machine.
Uniform: the number of processors n is unknown to the algorithm; non-uniform: n is known.
Illustrating uniform vs. non-uniform:
Algorithm 1: forward the message n steps to the right, then terminate. => non-uniform
Algorithm 2: forward the message to the right until its original sender receives it back. => uniform
In an anonymous uniform algorithm, all processors run one single state machine.
In an anonymous non-uniform algorithm, there is one state machine for each value of n.
- There is no anonymous uniform leader election algorithm for synchronous ring systems.
- There is no anonymous leader election algorithm for asynchronous ring systems.
1.3.1 Asynchronous rings
Open schedule: a finite schedule of algorithm A in which there is some edge (the open edge) on which no message is delivered.
(An open schedule need not be admissible, but it is a finite prefix of some admissible execution.)
Leader election on asynchronous rings has a message-complexity lower bound of \(\Omega(n \log n)\).
The \(O(n^2)\) algorithm
Basic idea:
- Each processor sends a message carrying its identifier to its left neighbor, and then waits for messages from its right neighbor.
- When a processor receives such a message, it forwards it to the left neighbor if the received id is greater than its own.
- If a processor receives its own identifier, it declares itself leader, sends a termination message to the left neighbor, and then terminates.
- If a processor receives the termination message, it forwards it to the left and then terminates.
Core idea: processors send their identifiers leftward and compare ids; only the message carrying the largest id travels all the way around and returns to its sender.
```
Code for p_i (0 <= i <= n-1)
Initially: asleep = true; id = i

upon receiving no message:
    if asleep then {
        asleep := false;
        send <id> to left neighbor;
    }

upon receiving <j> from right neighbor:
    if id < j then send <j> to left neighbor;    // forward the larger id
    if id = j then {                             // own id came back: p_i is the leader
        send <leader, i> to left neighbor;
        terminate as leader;
    }
    // if id > j, the message is swallowed

upon receiving <leader, j> from right neighbor:
    send <leader, j> to left neighbor;
    terminate as non-leader;
```
Message complexity: \(O(n^2)\)
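A round-based Python simulation of this algorithm (my own sketch; the ids are an assumed placement, and "left" is taken to be index + 1 mod n):

```python
ids = [3, 7, 2, 9, 4, 1]          # assumed ids placed around the ring
n = len(ids)

def elect():
    """Every node sends its id left; a node forwards only ids larger than its
    own, so only the maximum id survives a full trip around the ring."""
    pending = [(i, ids[i]) for i in range(n)]   # (current position, carried id)
    msgs = 0
    while pending:
        nxt = []
        for pos, j in pending:
            dst = (pos + 1) % n                 # deliver to the left neighbor
            msgs += 1
            if j == ids[dst]:                   # own id returned: dst is the leader
                return ids[dst], msgs
            if j > ids[dst]:                    # larger id: forwarded next round
                nxt.append((dst, j))
            # smaller ids are swallowed at dst
        pending = nxt

print(elect())    # (9, 14): the maximum id wins; the worst case is Theta(n^2) messages
```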
The \(O(n \log n)\) algorithm
k-neighborhood: the k processors to the left and the k processors to the right, \(2k+1\) processors in total (including itself).
Basic idea:
- Phase 0: each node sends a probe carrying its id to both 1-neighbors. A neighbor replies if its own id is smaller than the id in the probe; otherwise the probe is swallowed. A node that receives replies from both neighbors considers itself a temporary leader and enters phase 1.
- Phase \(l\): each processor that became a temporary leader in the previous phase sends probes carrying its id out to its \(2^l\)-neighborhood. If it receives replies from both directions, it again considers itself a temporary leader.
- A processor that receives its own probe back terminates as the leader.
Core idea: in phase \(l\), a processor tries to become the temporary leader of its \(2^l\)-neighborhood; only the winners of phase \(l\) continue into phase \(l+1\).
```
Code for p_i (0 <= i <= n-1)
Initially: asleep = true

upon receiving no message:
    if asleep then {
        asleep := false;                  // each node runs this only once, on wake-up
        send <probe, id, 0, 0> to left and right;
    }

upon receiving <probe, j, l, d> from left (resp. right):
    if j = id then {                      // own probe returned: p_i is the leader
        send <leader, id> to left neighbor;   // (termination details simplified)
        terminate as leader;
    }
    if j > id and d < 2^l then            // forward the probe
        send <probe, j, l, d+1> to right (resp. left);
    if j > id and d >= 2^l then           // probe reached the edge of the neighborhood unswallowed
        send <reply, j, l> to left (resp. right);   // answer it
    // if j < id, the probe is swallowed

upon receiving <reply, j, l> from left (resp. right):
    if j ≠ id then                        // not the originator: pass the reply along
        send <reply, j, l> to right (resp. left);
    else if already received <reply, j, l> from right (resp. left) then
        // replies from both directions: p_i is the temporary leader of phase l
        send <probe, id, l+1, 0> to left and right;   // continue to the next phase

upon receiving <leader, idj> from right:
    send <leader, idj> to left;
    terminate as non-leader;
```
In phase \(k\), at most \(\frac{n}{2^k+1}\) processors remain temporary leaders, and each phase initiator sends at most \(4 \cdot 2^k\) messages, so every phase costs \(O(n)\) messages and there are \(O(\log n)\) phases.
Message complexity: \(O(n \log n)\)
1.3.2 Synchronous rings
Message complexity: upper bound \(O(n)\); lower bound \(\Omega(n \log n)\) for comparison-based algorithms (the restricted lower bound below).
Upper bound \(O(n)\):
- Non-uniform: requires all nodes in the ring to start in the same round.
- Uniform: nodes may start in different rounds.
The non-uniform algorithm achieving the upper bound
The ring size n must be known, and all nodes must start in the same round.
The processor with the smallest id is elected leader.
Basic idea: the algorithm runs in phases, each consisting of n rounds. In phase \(i\), if some processor has id \(i\), it is elected leader and the algorithm then terminates.
The nodes do not need to know each other's id values.
Message complexity: exactly n messages are sent, \(O(n)\)
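The round and message counts can be checked with a tiny Python sketch (my own illustration; the ids are assumed):

```python
ids = [5, 2, 8, 11]               # assumed ids; n is known to all nodes
n = len(ids)

def elect_nonuniform():
    """Phases of n rounds each; in phase i the node with id i (if any) sends a
    message around the ring and everyone learns the leader. Since phases are
    tried in increasing order, the minimum id wins with exactly n messages."""
    winner = min(ids)
    rounds = winner * n + n       # phases 0..winner-1 pass silently, then one full trip
    messages = n                  # one message per edge on that trip
    return winner, rounds, messages

print(elect_nonuniform())         # (2, 12, 4): leader 2, elected in phase 2, 4 messages
```

The time complexity is the price of the O(n) messages: the algorithm needs \(O(n \cdot id_{min})\) rounds, which is not bounded by any function of n alone.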
The uniform algorithm achieving the upper bound
The ring size need not be known.
A processor can wake up spontaneously or be woken passively.
Basic idea:
- A message originating at the node with \(id=i\) is delayed \(2^i-1\) rounds at each node before being forwarded.
- Each spontaneously woken node first sends a wake-up message around the ring, which is forwarded without delay.
- If a node receives a wake-up message before it has started, it only relays messages and does not participate.
Only nodes that wake up spontaneously can be elected leader.
```
Code for p_i
Initially: asleep = true; waiting = ∅; S = ∅   // S: the messages to send this round
R = the set of messages received in this computation event

if asleep then {
    asleep := false;
    if R = ∅ then {          // p_i received nothing yet: it wakes up spontaneously
        min := id;           // and participates in the election
        S := S ∪ {<id>};     // ready to send its own id
    } else {                 // p_i was woken by a message: it does not participate,
        min := ∞;            // it only acts as a relay
        // (note: could a relay flag let pure relays forward without delay?)
    }
}
for each <m> in R do {       // processing m removes it from R
    if m < min then {
        become not elected;                  // a smaller id exists: p_i cannot win
        add <m> to waiting, recording the round in which m arrived;  // delay forwarding
        min := m;
    }
    // if m > min, the message is swallowed
    if m = id then become elected;           // p_i's own id has come all the way around
}
for each <m> in waiting do
    if <m> was received 2^m − 1 rounds ago then
        remove <m> from waiting and add it to S;
send S to left;
```
Messages sent fall into three categories:
- Category 1: messages of the first phase (wake-up messages).
- Category 2: second-phase messages sent before the eventual leader's message enters its own second phase (sent on behalf of other nodes).
- Category 3: second-phase messages sent after the eventual leader's message enters its second phase (including the messages carrying the leader's id).
The first category contains at most \(n\) messages.
The second category contains at most \(n\) messages.
The third category contains at most \(2n\) messages in total.
Message complexity: at most \(4n\) messages, i.e. \(O(n)\)
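A much-simplified Python simulation of the slow-down idea (my own sketch: all nodes are assumed to wake spontaneously in round 0, and the wake-up and termination messages are omitted). The message carrying id \(j\) advances one edge per \(2^j\) rounds, so the smallest id laps the ring first:

```python
ids = [3, 0, 2]                         # assumed ids; every node wakes in round 0
n = len(ids)

def elect_uniform():
    """Id j is held 2^j - 1 rounds at each node, so the smallest id circles
    the ring fastest and swallows everything else along the way."""
    minv = ids[:]                       # smallest id seen so far at each node
    msgs = [(i, ids[i], 0) for i in range(n)]   # (position, carried id, rounds waited)
    rnd = 0
    while True:
        rnd += 1
        nxt = []
        for pos, j, w in msgs:
            if w + 1 < 2 ** j:          # still delayed at pos
                nxt.append((pos, j, w + 1))
                continue
            dst = (pos + 1) % n         # forward one edge to the left
            if j == ids[dst]:
                return j, rnd           # own id returned: dst wins
            if j < minv[dst]:           # smaller than anything dst has seen: keep going
                minv[dst] = j
                nxt.append((dst, j, 0))
            # otherwise the message is swallowed at dst
        msgs = nxt

print(elect_uniform())                  # (0, 3): id 0 wins after 3 rounds here
```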
The restricted lower bound \(\Omega(n \log n)\)
Order equivalence: two rings \(x_0,x_1,\cdots,x_{n-1}\) and \(y_0,y_1,\cdots,y_{n-1}\) are order-equivalent if \(x_i < x_j\) if and only if \(y_i < y_j\).
An algorithm is comparison based if its behavior depends only on the relative order of the identifiers on the ring, not on the concrete id values.
For every \(n \ge 8\) that is a power of 2, there is a ring \(S_n\) of size n on which every comparison-based synchronous leader election algorithm A sends \(\Omega(n \log n)\) messages in the admissible execution.
Constructing \(S_n\):
- Define a ring \(R_n^{rev}\) of size n in which the id of \(p_i\) is \(rev(i)\), the bit-reversal of the binary representation of \(i\).
- If the ring is divided into consecutive segments of length \(j\), where \(j\) is a power of 2, then these segments are pairwise order-equivalent.
- The number of segments is \(\frac{n}{j}\).
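The construction is easy to check in Python (my own sketch, for \(n=8\)):

```python
n = 8
bits = n.bit_length() - 1          # 3 bits for n = 8

def rev(i):
    """Reverse the low `bits` bits of i: the id of p_i in R_n^rev."""
    return int(format(i, f"0{bits}b")[::-1], 2)

ring = [rev(i) for i in range(n)]
print(ring)                        # [0, 4, 2, 6, 1, 5, 3, 7]

# Split into segments of length j = 2 and compare their relative orders.
j = 2
segments = [ring[k:k + j] for k in range(0, n, j)]
orders = [tuple(sorted(range(j), key=seg.__getitem__)) for seg in segments]
print(orders)                      # all identical: the segments are order-equivalent
```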
1.4 Computation model
What makes the computation model complex:
- the system consists of concurrent components
- there is no global clock
- possible failures of components must be captured
Countermeasures:
- causal relations
- consistent states
- global states
Why do distributed systems lack a global system state?
- Non-instantaneous message delivery (propagation delay, resource contention, retransmission of lost messages).
- Relativity effects (the physical clocks of most computers drift, and clock synchronization remains an open problem).
- Interrupts (it is impossible to observe the global state of a distributed system at a single instant).
Order: \(e_1 < e_2\) means event \(e_1\) occurs before \(e_2\).
Ordering rules:
- Events on the same node satisfy a total order: \(e_1 <_p e_2\).
- If \(e_1\) is the sending of a message and \(e_2\) is the receipt of that message, then \(e_1 <_m e_2\).
Happens-before relation: \(<_H\), a partial order.
Concurrent events: two events that cannot be ordered by \(<_H\).
The happens-before relation is often depicted as a directed acyclic graph.

How to extend the partial order \(<_H\) into a total order:
- topological ordering of the directed acyclic graph (DAG)
- Lamport's algorithm
1.4.1 Lamport timestamps
Basic idea:
- Each event e carries a timestamp: e.TS
- Each node keeps a local timestamp: my_TS
- Each message m carries a timestamp: m.TS
- When a node executes an event, it stamps the event with its current timestamp.
- When a node sends a message, it stamps the message with its current timestamp.
- When a message is received, the local timestamp is first updated to the maximum of the local and message timestamps.
```
Initially: my_TS = 0

On event e:
    if e is the receipt of message m then
        my_TS := max(m.TS, my_TS);   // take the larger of the message and local timestamps
    my_TS := my_TS + 1;
    e.TS := my_TS;                   // timestamp event e
    if e is the sending of message m then
        m.TS := my_TS;               // timestamp the message
```
The timestamp of each event is greater than the timestamps of all its predecessor events.
Problem: different events may get the same timestamp (concurrent events).
Improved algorithm: append the node address as the low-order part of the timestamp, e.g. \(1.1, 2.2\); comparing the labels lexicographically then yields a total order.
Problem: timestamps alone cannot determine whether two events are causally related.
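A compact Python version of the algorithm including the address tie-break (my own sketch; the class and method names are assumptions):

```python
class LamportClock:
    """Lamport logical clock; (counter, node_id) pairs give a total order."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.my_ts = 0

    def local_event(self):
        self.my_ts += 1
        return (self.my_ts, self.node_id)        # e.TS with the address tie-break

    def send(self):
        self.my_ts += 1
        return self.my_ts                        # m.TS carried on the message

    def receive(self, m_ts):
        self.my_ts = max(m_ts, self.my_ts) + 1   # merge, then count the receive event
        return (self.my_ts, self.node_id)

a, b = LamportClock(1), LamportClock(2)
m = a.send()                 # a's clock is now 1
print(b.receive(m))          # (2, 2): the receive is ordered after the send
print(a.local_event())       # (2, 1): concurrent with (2, 2); the address breaks the tie
```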
1.4.2 Vector timestamps
Vector timestamp VT: \(e.VT\) is an array; \(e.VT[i]=k\) means that k events on node \(i\) (including \(e\) itself, if \(e\) is on node \(i\)) causally precede or equal \(e\).
Basic idea:
- Each node keeps a local vector timestamp: my_VT
- Each event e carries a vector timestamp: e.VT
- Each message m carries a vector timestamp: m.VT
```
Initially: my_VT = [0, 0, ..., 0]

On event e:
    if e is the receipt of message m then
        for i := 1 to M do           // each component only ever increases
            my_VT[i] := max(m.VT[i], my_VT[i]);
    my_VT[self] := my_VT[self] + 1;  // self is the index of this node
    e.VT := my_VT;                   // timestamp event e
    if e is the sending of message m then
        m.VT := my_VT;               // timestamp message m
```
Comparing vector timestamps:
e1.VT = (5,4,1,3), e2.VT = (3,6,4,2), e3.VT = (0,0,1,3)
- If neither of e1.VT and e2.VT is componentwise less than or equal to the other, e1 and e2 are concurrent.
- e3 causally precedes e1, because e3.VT ≤ e1.VT componentwise.
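The comparison rule in runnable form (my own Python sketch, using the three example timestamps above):

```python
def leq(u, v):
    """u causally precedes (or equals) v iff u <= v componentwise."""
    return all(x <= y for x, y in zip(u, v))

def concurrent(u, v):
    return not leq(u, v) and not leq(v, u)

e1 = (5, 4, 1, 3)
e2 = (3, 6, 4, 2)
e3 = (0, 0, 1, 3)

print(concurrent(e1, e2))   # True: neither timestamp dominates the other
print(leq(e3, e1))          # True: e3 causally precedes e1
```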
1.4.3 Causal message delivery
A processor cannot choose when messages arrive, but it can hold back messages that arrive too early.
Compare FIFO delivery in TCP.
Basic idea:
- Hold back a message m received from node P until it can be concluded that no undelivered message causally precedes m.
- State kept on each node:
  - \(earliest[1 \ldots M]\): for each sender, a lower bound on the timestamp of the next message that may be delivered from it
  - \(blocked[1 \ldots M]\): an array of queues, one queue of blocked messages per sender
Causal message delivery:

```
Timestamps: with Lamport timestamps, l_k = 1;
            with vector timestamps, l_k = (0,...,0,1,0,...,0), with the 1 in position k.

Initially: earliest[k] = l_k, k = 1..M   // lower bound on the next deliverable timestamp, per sender
           blocked[k]  = {}, k = 1..M    // per-sender queues of blocked messages, initially empty

On the receipt of message m from node p:
    delivery_list := {};
    if blocked[p] is empty then
        earliest[p] := m.timestamp;
    append m to blocked[p];
    // drain the blocked queues
    while ∃k such that blocked[k] is non-empty and
          for all i = 1..M (i ≠ k, i ≠ self): not_earliest(earliest[i], earliest[k], i)
          // i.e., no undelivered message from another sender can precede the head of blocked[k]
    {
        dequeue the head m' of blocked[k] and append it to delivery_list;
        if blocked[k] is non-empty then
            earliest[k] := timestamp of the new head of blocked[k];
        else
            increment earliest[k] by l_k;
    }
    deliver the messages in delivery_list;   // in causal order
```
Deadlock is possible (a message the algorithm is waiting for may simply never be sent by some node).
Causal delivery algorithms are often used in multicast.
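For comparison, here is a Python sketch of causal broadcast delivery in its common vector-clock formulation (deliver m from sender p once m is the next message from p and all of m's other causal predecessors have been delivered). This is my own illustration of the same delivery condition, not the earliest/blocked bookkeeping above:

```python
class CausalReceiver:
    """Causal broadcast delivery via vector timestamps: a message is held
    back until every message that causally precedes it has been delivered."""
    def __init__(self, m_nodes, self_id):
        self.vt = [0] * m_nodes      # timestamps of messages delivered so far
        self.self_id = self_id
        self.held = []               # messages received but not yet deliverable

    def _deliverable(self, sender, m_vt):
        return (m_vt[sender] == self.vt[sender] + 1 and
                all(m_vt[k] <= self.vt[k]
                    for k in range(len(self.vt)) if k != sender))

    def receive(self, sender, m_vt):
        self.held.append((sender, m_vt))
        delivered = []
        progress = True
        while progress:              # drain every message that has become ready
            progress = False
            for msg in list(self.held):
                s, vt = msg
                if self._deliverable(s, vt):
                    self.vt[s] = vt[s]          # account for the delivery
                    self.held.remove(msg)
                    delivered.append(msg)
                    progress = True
        return delivered

r = CausalReceiver(3, self_id=2)
print(r.receive(0, [2, 0, 0]))   # []: message 1 from node 0 is missing, so hold back
print(r.receive(0, [1, 0, 0]))   # both messages delivered, in causal order
```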