University of Science and Technology - Algorithm Analysis and Design - distributed algorithms review notes

Posted by coops on Tue, 04 Jan 2022 16:29:55 +0100

1. Distributed algorithms


The difference between parallel processing and distributed processing:

  • Parallel processing goal: use all processors to perform a large task.
  • Distributed processing has a higher degree of uncertainty and behavior independence. Each processor has its own independent task.

Distributed system functions:

  • Share resources
  • Improve performance
  • Improve reliability


  • Asynchronous shared-memory model: for tightly coupled machines
  • Asynchronous message-passing model: for loosely coupled machines and WANs
  • Synchronous message-passing model: an upper bound on message delay is known and the execution proceeds in rounds; it is a special case of the asynchronous system

1.2 basic algorithms in message passing

Each processor \(p_i\) can be modeled as a state machine with a state set \(Q_i\).

Initial state: \(Q_i\) contains a distinguished subset of initial states. In an initial state every inbuf must be empty, while the outbufs need not be empty.

Transition function: takes the accessible state (the local state together with the inbuf contents) and produces a new state in which the inbufs are emptied and new messages are added to the outbufs.

Configuration: the global state of the distributed algorithm at a given moment, made up of the states of all processors.

Events: computation events and delivery events.

Execution: a sequence of configurations alternating with events.

Safety and liveness conditions must be met.

  1. Safety: a property holds in every reachable configuration of every execution. "Bad things never happen"

  2. Liveness: a property holds in some reachable configuration of every execution. "Eventually something good happens"

A sequence that satisfies the safety conditions is called an execution. An execution that also satisfies the liveness conditions is called an admissible execution.

1.2.1 transition systems, safety and liveness

Transition system: a triple \(S=(C,\rightarrow,I)\), where \(C\) is the set of configurations, \(I \subseteq C\) is the set of initial configurations, and \(\rightarrow\) is a binary transition relation on \(C\).

Reachable: \(\delta\) is reachable if there exists a sequence \(\gamma_0 \rightarrow \gamma_1 \rightarrow \cdots \rightarrow \gamma_k = \delta\) with \(\gamma_0 \in I\) and \(\gamma_i \rightarrow \gamma_{i+1}\) for all \(0 \le i \le k-1\).

Safety: for \(S=(C,\rightarrow,I)\), an assertion (property) \(P\) always holds.

\(\{P\} \rightarrow \{Q\}\): for every transition \(\gamma \rightarrow \delta\), if \(P(\gamma)\) then \(Q(\delta)\). That is, if \(P\) holds before the transition, \(Q\) holds after it.

Definition: \(P\) is an invariant of \(S\) if \(P(\gamma)\) holds for all \(\gamma \in I\) and \(\{P\} \rightarrow \{P\}\). An invariant always holds, so it is a safety condition.

Theorem: if \(P\) is an invariant of \(S\), then \(P\) holds in every configuration of every execution of \(S\).
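
The definition and theorem above can be checked mechanically on a small finite system. The sketch below is only an illustration: the toy system (a counter modulo 6 that moves in steps of 2) and the helper names `reachable` and `is_invariant` are assumptions made for this example, not part of the course material.

```python
from collections import deque

def reachable(initial, step):
    """Return the set of configurations reachable from `initial` via `step`."""
    seen = set(initial)
    queue = deque(initial)
    while queue:
        gamma = queue.popleft()
        for delta in step(gamma):
            if delta not in seen:
                seen.add(delta)
                queue.append(delta)
    return seen

def is_invariant(configs, initial, step, P):
    """P holds in every initial configuration and {P} -> {P}."""
    holds_initially = all(P(g) for g in initial)
    preserved = all(P(d) for g in configs if P(g) for d in step(g))
    return holds_initially and preserved

# Hypothetical toy system: C = {0..5}, I = {0}, gamma -> (gamma + 2) mod 6.
C = range(6)
I = [0]
step = lambda g: [(g + 2) % 6]
is_even = lambda g: g % 2 == 0

print(is_invariant(C, I, step, is_even))            # the invariant check
print(all(is_even(g) for g in reachable(I, step)))  # the theorem's conclusion
```

Checking `{P} -> {P}` over all of `C`, not only the reachable part, is what makes the invariant method usable without enumerating executions.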

Liveness: in every execution of the algorithm there is some configuration in which \(P\) holds, that is, \(P\) eventually becomes true.

Proof methods:

  • Norm function
  • No deadlock / proper termination

Norm function: given a transition system \(S\) and an assertion \(P\), a function \(f: C \rightarrow W\) from the configuration set into a well-founded set \(W\) such that every transition \(\gamma \rightarrow \delta\) satisfies \(f(\gamma) > f(\delta)\) or \(P(\delta)\).

Well-founded set: a partially ordered set \((W, <)\) is well-founded if and only if it contains no infinite decreasing sequence, so every decreasing chain reaches a minimal element.

1.2.2 system

Asynchronous system

Asynchrony: there is no fixed upper bound on message delivery time or on the time between two successive steps of a processor.

Execution fragment: a finite or infinite sequence of alternating configurations and events.

Execution: an execution fragment that starts from an initial configuration.

Schedule: the sequence of events in an execution.

Admissible execution: every processor has infinitely many computation events (this means the processor does not fail, not that the algorithm runs forever), and every message sent is eventually delivered.

Synchronous system

In each round, every processor can send a message to each neighbor, the messages are delivered, and every processor then computes on the messages it has just received.

Round: the sequence of configurations and events can be partitioned into disjoint rounds. Each round consists of the delivery events for all pending messages followed by one computation event per processor.

Difference between synchronous and asynchronous systems:

  1. In a failure-free synchronous system, the execution of an algorithm depends only on the initial configuration.
  2. In an asynchronous system, the same algorithm may have many different executions (which does not necessarily mean different results).

1.2.3 complexity measurement

Message complexity: the maximum, over all admissible executions, of the total number of messages sent

Time complexity:

  • Synchronous system: the maximum number of rounds until termination
  • Asynchronous system: the maximum time until termination over all timed admissible executions

1.2.4 broadcast and convergecast on a spanning tree

Spanning tree ST: a connected subgraph that contains all vertices of the graph and has no cycles.

Minimum spanning tree MST: the spanning tree with the minimum total edge weight.

Broadcast

Basic steps:

  1. Root \(p_r\) sends \(M\) to all its children.
  2. When a node receives \(M\) from its parent, it sends \(M\) to all its children.
upon receiving no msg:
    if i = r then { send <M> to all children; terminate; }
upon receiving <M> from parent pj:
    send <M> to all children;
    terminate;

Message complexity: \(O(n)\) (exactly \(n-1\) messages, one per tree edge)

Time complexity: \(O(h)\), where \(h\) is the height of the spanning tree
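
The two bounds can be observed in a small round-by-round simulation. This is a sketch under assumptions: the tree is given as a hypothetical `children` dictionary, and `broadcast` is an illustrative helper name.

```python
def broadcast(children, root):
    """Synchronous broadcast of M down a rooted tree.

    Returns (messages_sent, rounds, nodes_that_hold_M).
    """
    frontier = [root]
    informed = {root}
    messages = rounds = 0
    while True:
        next_frontier = []
        for p in frontier:
            for c in children.get(p, []):
                messages += 1          # p sends <M> to child c
                informed.add(c)
                next_frontier.append(c)
        if not next_frontier:          # nobody sent anything: broadcast is over
            break
        rounds += 1
        frontier = next_frontier
    return messages, rounds, informed

# Hypothetical 7-node tree of height 3 rooted at node 0.
tree = {0: [1, 2], 1: [3, 4], 2: [5], 5: [6]}
messages, rounds, informed = broadcast(tree, 0)
print(messages, rounds)   # n - 1 messages, h rounds
```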

Convergecast

Basic steps:

  1. Each leaf node sends messages to its parents
  2. Each non leaf node waits for messages from all children before sending messages to parents

Message complexity: \(O(n)\)

Time complexity: \(O(h)\), where \(h\) is the height of the spanning tree
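
Convergecast is typically used to aggregate a value at the root. The sketch below sums one value per node; the aggregation by addition, the `children` dictionary and the name `convergecast` are assumptions for illustration only.

```python
def convergecast(children, root, value):
    """Aggregate value[p] over the tree; returns (result at root, messages sent)."""
    messages = 0

    def collect(p):
        nonlocal messages
        total = value[p]
        for c in children.get(p, []):
            total += collect(c)   # p waits for the message of every child
            messages += 1         # one message from child c to parent p
        return total

    return collect(root), messages

# Hypothetical 7-node tree; every node contributes 1, so the root learns n.
tree = {0: [1, 2], 1: [3, 4], 2: [5], 5: [6]}
value = {p: 1 for p in range(7)}
print(convergecast(tree, 0, value))
```

The recursion mirrors the rule that a non-leaf node sends to its parent only after hearing from all of its children.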

1.2.5 constructing spanning tree

Flooding algorithm

Message complexity: \(M\) is sent \(2m-(n-1)\) times, so the total is \(O(m)\)

Time complexity: \(O(D)\), where \(D\) is the network diameter

Basic idea:

  1. The root node sends a message \(M\) to all its neighbors.

  2. When \(P_i\) receives \(M\) from \(P_j\) and this is the first time it receives \(M\), \(P_i\) sends \(<parent>\) to \(P_j\) and sends \(M\) to all neighbors other than \(P_j\). To any node from which it later receives \(M\), \(P_i\) sends \(<reject>\).

  3. When \(P_i\) receives a \(<parent>\) message from \(P_j\), \(P_i\) is the parent of \(P_j\).

Code for Pi (0<=i<=n-1)
Initial value: parent = nil; the sets children and other are both empty

upon receiving no message:
    if i = r and parent = nil then { // the root has not yet sent M
        send M to all neighbors;
        parent := i; } // the root is its own parent
upon receiving M from neighbor pj:
    if parent = nil then { // pi has not received M before; this is the first msg pi receives
        parent := j;
        send <parent> to pj; // pj is the parent of pi
        send M to all neighbors except pj;
    } else // pj cannot be the parent of pi; this M is not the first msg pi receives
        send <reject> to pj;
upon receiving <parent> from neighbor pj:
    children := children ∪ { j }; // pj is a child of pi, add j to the child set
    if children ∪ other contains all neighbors except parent then terminate;
upon receiving <reject> from neighbor pj:
    other := other ∪ { j }; // add j to other; the msg travelled over a non-tree edge
    if children ∪ other contains all neighbors except parent then terminate;

In the asynchronous model, the algorithm constructs a spanning tree rooted at \(p_r\).

In the synchronous model, the tree constructed is always a BFS tree.

A BFS tree is not guaranteed in an asynchronous system because message delay has no upper bound; in the worst case the tree degenerates into a single chain.
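
The synchronous case can be sketched as a round-by-round simulation. It only tracks the \(M\) messages and the parent pointers; the `<parent>` and `<reject>` replies are left out, and the adjacency-list graph and the name `flood_spanning_tree` are assumptions for this example.

```python
def flood_spanning_tree(adj, root):
    """Synchronous flooding. Returns (parent map, number of M messages sent)."""
    parent = {root: root}
    in_transit = [(root, q) for q in adj[root]]   # (sender, receiver) pairs
    m_count = len(in_transit)
    while in_transit:
        next_round = []
        for sender, p in in_transit:
            if p not in parent:        # first M received: adopt sender as parent
                parent[p] = sender
                next_round.extend((p, q) for q in adj[p] if q != sender)
            # otherwise p would answer <reject>, which is not modelled here
        m_count += len(next_round)
        in_transit = next_round
    return parent, m_count

# Hypothetical graph: n = 4 nodes, m = 5 edges.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
parent, m_count = flood_spanning_tree(adj, 0)
print(parent, m_count)
```

Because all messages of a round are delivered together, every node is first reached along a shortest path from the root, which is why the resulting tree is a BFS tree in this model.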

1.2.6 constructing a DFS spanning tree with a specified root

Basic idea:

  1. The root node \(P_r\) sends a message \(M\) to one of its neighbors.
  2. When the \(M\) that \(P_i\) receives from \(P_j\) is the first message it gets from any neighbor, \(P_i\) takes \(P_j\) as its parent. To any node that sends it \(M\) later, \(P_i\) replies \(<reject>\).
  3. \(P_i\) sends \(M\) to a neighbor it has not yet sent a message to and waits for the reply.
  4. Once \(P_i\) has sent a message to every neighbor, \(P_i\) terminates.
Code for Pi (0<=i<=n-1)
Initial value:
parent = nil;
children = Φ;
unexplored = all neighbors of Pi

upon receiving no message:
    if i = r and parent = nil then { // Pi is the root and has not yet sent M
        parent := i; // the root is its own parent
        pick an arbitrary Pj ∈ unexplored;
        remove Pj from unexplored;
        send M to Pj; } //endif
upon receiving M from neighbor Pj:
    if parent = nil then { // Pi has not received M before
        parent := j;
        remove Pj from unexplored;
        if unexplored != Φ then {
            pick an arbitrary Pk ∈ unexplored;
            remove Pk from unexplored;
            send M to Pk;
        } else send <parent> to parent;
    } else send <reject> to Pj; // Pi has already been visited
upon receiving <parent> or <reject> from neighbor Pj:
    if received <parent> then add j to children; // Pj is a child of Pi
    if unexplored = Φ then { // all neighbors of Pi have been visited
        if parent != i then send <parent> to parent; // Pi is not the root: return to the parent
        terminate; // the DFS subtree rooted at Pi is complete
    } else { // pick a neighbor that Pi has not yet visited
        pick an arbitrary Pk ∈ unexplored;
        remove Pk from unexplored;
        send M to Pk; }

Message complexity and time complexity: both \(O(m)\)
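
Since only one message is in transit at any time, the algorithm can be simulated with a single loop that follows that message. The sketch below assumes neighbors are tried in list order (the algorithm allows any order) and that the graph is a hypothetical adjacency list; `dfs_spanning_tree` is an illustrative name.

```python
def dfs_spanning_tree(adj, root):
    """Follow the single message in transit; returns (parent, children, messages)."""
    parent = {p: None for p in adj}
    children = {p: [] for p in adj}
    unexplored = {p: list(adj[p]) for p in adj}
    messages = 0

    def explore(p):
        """p sends M to its next unexplored neighbor, or reports to its parent."""
        nonlocal messages
        if unexplored[p]:
            q = unexplored[p].pop(0)
            messages += 1
            return ("M", p, q)
        if parent[p] != p:
            messages += 1
            return ("parent", p, parent[p])
        return None                       # the root has explored everything

    parent[root] = root
    msg = explore(root)
    while msg is not None:
        kind, src, dst = msg
        if kind == "M":
            if parent[dst] is None:       # first M at dst: src becomes its parent
                parent[dst] = src
                unexplored[dst].remove(src)
                msg = explore(dst)
            else:                         # dst was already visited
                messages += 1
                msg = ("reject", dst, src)
        else:                             # <parent> or <reject> arrives at dst
            if kind == "parent":
                children[dst].append(src)
            msg = explore(dst)
    return parent, children, messages

# Hypothetical triangle graph (m = 3 edges).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
parent, children, messages = dfs_spanning_tree(adj, 0)
print(parent, children, messages)
```

Each edge carries a bounded number of messages in each direction, which is where the \(O(m)\) bound comes from.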

1.2.7 constructing a DFS spanning tree without a specified root

Similar to the leader election problem, it is a symmetry breaking problem

Basic idea:

  1. Each node can wake up spontaneously and try to construct a DFS spanning tree with itself as its root. If two DFS trees try to link the same node (not necessarily at the same time), select the DFS tree with larger root ID to join.

  2. Each node sets a leader variable, which is the root ID of the current DFS tree

  3. When a node wakes up spontaneously, it sends its leader to a neighbor

Code for Pi (0<=i<=n-1)
Initial value:
parent = nil;
leader = 0;
children = Φ;
unexplored = all neighbors of Pi

upon receiving no message: // spontaneous wake-up
    if parent = nil then { // if parent is not nil, Pi is already in some tree and has lost its chance to run
        leader := id; parent := i; // Pi is its own parent
        pick an arbitrary Pj ∈ unexplored;
        remove Pj from unexplored;
        send <leader> to Pj; } //endif
upon receiving <new_id> from neighbor Pj:
    if leader < new_id then { // merge the tree of Pi into the tree of Pj
        leader := new_id; parent := j;
        unexplored := all neighbors of Pi except Pj; // reset the unexplored neighbor set
        if unexplored != Φ then { // propagate the new id through the tree Pi used to belong to
            pick an arbitrary Pk ∈ unexplored;
            remove Pk from unexplored;
            send <leader> to Pk;
        } else send <parent> to parent;
    } else if leader = new_id then send <already> to Pj;
    // if leader > new_id, the message is swallowed
upon receiving <parent> or <already> from neighbor Pj:
    if received <parent> then add j to children;
    if unexplored = Φ then { // all neighbors of Pi have been visited
        if parent != i then send <parent> to parent; // Pi is not the root: return to the parent
        else terminate as root of the DFS tree; // the DFS tree rooted at Pi is complete
    } else { // pick a neighbor that Pi has not yet visited
        pick an arbitrary Pk ∈ unexplored;
        remove Pk from unexplored;
        send <leader> to Pk; }

Suppose the network has m edges and n nodes, p nodes start spontaneously, and the node with the largest ID among them starts at time t.

Message complexity: \(O(pn^2)\)

Time complexity: \(O(t+m)\)

1.3 leader election on a ring

Anonymous: the processors in the ring do not have unique identifiers, and every processor has the same state machine.

Uniform: the algorithm does not know the number of processors.

Example of uniform and non-uniform algorithms:

Algorithm 1: forward a msg to the right for n steps, then terminate. => non-uniform

Algorithm 2: forward a msg to the right until the sender of the msg receives it back. => uniform

In an anonymous uniform algorithm, there is a single state machine shared by all processors, whatever the ring size.

In an anonymous non-uniform algorithm, there is one state machine for each value of n.

  • There is no anonymous uniform leader election algorithm for synchronous rings
  • There is no anonymous leader election algorithm for asynchronous rings

1.3.1 asynchronous ring

Open schedule: a schedule of algorithm A on a ring is open if there is an edge on which no message is delivered; such an edge is called an open edge.

(An open schedule is not necessarily admissible; it is a finite prefix of an admissible schedule.)

Leader election on an asynchronous ring has a message complexity lower bound of \(\Omega(n \lg n)\)

\(O(n^2)\) algorithm

Basic idea:

  1. Each processor sends a msg carrying its identifier to its left neighbor, then waits for msgs from its right neighbor.
  2. When a processor receives a msg, it forwards the msg to its left neighbor if the received id is greater than its own; otherwise it swallows the msg.
  3. If a processor receives its own identifier, it declares itself leader, sends a termination msg to its left neighbor, and terminates.
  4. If a processor receives the termination msg, it forwards the msg to the left and then terminates.

Core idea: every processor sends its identifier to the left, and ids are compared along the way. Only the msg carrying the largest id travels all the way around the ring and returns to its sender.

Code for Pi (0<=i<=n-1)
Initial value: asleep = true; id = i;

While (receiving no message) do
  (1) if asleep do
      (1.1) asleep = false
      (1.2) send <id> to left-neighbor
   end if
End while
While (receiving <j> from right-neighbor) do
 (1) if id < j then send <j> to left-neighbor
       end if
 (2) if id = j then
      (2.1) send <Leader, id> to left-neighbor
      (2.2) terminate as Leader
   end if
End while
While (receiving <Leader, j> from right-neighbor) do
 (1) send <Leader, j> to left-neighbor
 (2) terminate as non-Leader
End while

Message complexity: \(O(n^2)\)
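
The quadratic worst case is easy to reproduce. The sketch below simulates the algorithm with one FIFO queue of pending messages; the convention that position `k` sends to position `(k + 1) % n`, the assumption of distinct ids, and the name `ring_election` are choices made for this example.

```python
from collections import deque

def ring_election(ids):
    """Simulate the O(n^2) election; returns (leader id, total messages sent)."""
    n = len(ids)
    # Every processor wakes up and sends its own id to its left neighbor.
    queue = deque(((k + 1) % n, ids[k]) for k in range(n))
    messages = n
    leader = None
    while queue:
        pos, j = queue.popleft()
        if j > ids[pos]:                       # larger id: forward it
            queue.append(((pos + 1) % n, j))
            messages += 1
        elif j == ids[pos]:                    # own id came back: elected
            leader = j
        # smaller id: swallowed
    messages += n                              # <Leader, j> travels once around the ring
    return leader, messages

print(ring_election([3, 2, 1, 0]))   # ids decrease along the direction of travel
print(ring_election([0, 1, 2, 3]))   # ids increase along the direction of travel
```

When ids decrease in the direction of travel, the message with id \(i\) is sent \(i+1\) times, which sums to \(\Theta(n^2)\).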

\(O(n \lg n)\) algorithm

k-neighborhood: the k-neighborhood of a processor consists of the k processors to its left, the k processors to its right and the processor itself, \(2k+1\) in total.

Basic idea:

  1. Phase 0: each node sends a msg carrying its id to both of its immediate neighbors (its 1-neighborhood). A neighbor replies if its own id is smaller than the id in the msg; otherwise it swallows the msg. A node that receives replies from both neighbors considers itself a temporary leader and enters phase 1.
  2. Phase \(l\): each processor that became a temporary leader in the previous phase sends id msgs across its \(2^l\)-neighborhood. If it receives replies from both the left and the right, it is again a temporary leader.
  3. A processor that receives its own probe msg terminates as the leader.

Core idea: in phase \(l\), a processor tries to become the temporary leader of its \(2^l\)-neighborhood. Only processors that become temporary leaders in phase \(l\) continue to phase \(l+1\).

Code for Pi (0<=i<=n-1)
Initial value: asleep = true;

upon receiving no msg:
    if asleep then {
        asleep := false; // once a node has woken up it never executes this branch again
        send <probe, id, 0, 1> to left and right; }
upon receiving <probe, j, l, d> from left (resp., right):
    if j = id then { // own probe came back: elected
        send <leader, id> to left neighbour;
        terminate as the leader; }
    if j > id and d < 2^l then // forward the probe msg
        send <probe, j, l, d+1> to right (resp., left);
    if j > id and d ≥ 2^l then // reached the last processor of the neighborhood without being swallowed
        send <reply, j, l> to left (resp., right); // answer
    // if j < id, the probe msg is swallowed
upon receiving <reply, j, l> from left (resp., right):
    if j ≠ id then // the reply has not yet reached the processor that sent the probe
        send <reply, j, l> to right (resp., left); // forward the reply
    else // j = id: Pi has received its reply from one direction
        if already received <reply, j, l> from right (resp., left) then // the reply from the other direction has also arrived
            send <probe, id, l+1, 1> to left and right; // Pi is a temporary leader of phase l and continues to the next phase
upon receiving <leader, idj> from right:
    send <leader, idj> to left;
    terminate as nonleader;

At the end of phase \(k\) there are at most \(\frac{n}{2^k+1}\) temporary leaders, and each processor that starts phase \(k\) causes at most \(4 \cdot 2^k\) messages.

Message complexity: \(O(n \lg n)\)
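
Combining the two bounds above gives the total. The summation below is a sketch under assumptions that are not spelled out in these notes: phase 0 is started by all \(n\) processors, phase \(k \ge 1\) is started only by the winners of phase \(k-1\), there are about \(\lg n\) phases, and the final \(<leader>\) message costs \(n\) more messages.

```latex
\[
\underbrace{4n}_{\text{phase } 0}
\;+\; \sum_{k=1}^{\lceil \lg n \rceil + 1} 4 \cdot 2^{k} \cdot \frac{n}{2^{k-1}+1}
\;+\; \underbrace{n}_{\langle leader \rangle}
\;<\; 5n + 8n\,(\lceil \lg n \rceil + 1)
\;=\; O(n \lg n)
\]
```

The inequality uses \(\frac{2^k}{2^{k-1}+1} < 2\), so every phase contributes fewer than \(8n\) messages.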

1.3.2 synchronous ring

Message complexity: upper bound \(O(n)\), restricted lower bound \(\Omega(n \lg n)\)

Upper bound \(O(n)\):

  • Non-uniform: all nodes in the ring must start in the same round
  • Uniform: nodes may start in different rounds

Non-uniform algorithm achieving the upper bound

The ring size n must be known, and all nodes start in the same round.

The processor with the smallest id is elected leader.

Basic idea: the algorithm runs in phases, each consisting of n rounds. In phase \(i\), if a processor with id \(i\) exists, it is elected leader and sends a message around the ring, after which the algorithm terminates.

Nodes do not know each other's id values.

Message complexity: exactly n messages are sent, so \(O(n)\)
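
The price of sending only n messages is running time that depends on the id values. The sketch below counts rounds explicitly; it assumes ids are distinct non-negative integers and that the winner's announcement takes one full phase of n rounds. The name `nonuniform_sync_election` is illustrative.

```python
def nonuniform_sync_election(ids):
    """Returns (leader id, messages sent, rounds used) on a ring of known size."""
    n = len(ids)
    messages = 0
    round_no = 0
    while True:
        phase, offset = divmod(round_no, n)
        if offset == 0 and phase in ids:
            # The processor whose id equals the phase number has heard nothing
            # so far, so it elects itself; its message is relayed n times.
            messages += n
            return phase, messages, round_no + n
        round_no += 1

print(nonuniform_sync_election([5, 3, 7, 4]))
```

The number of rounds is \(n \cdot (id_{min} + 1)\), so it is unbounded in terms of n alone.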

Uniform algorithm achieving the upper bound

The ring size does not need to be known.

A processor either wakes up spontaneously or is woken up by an incoming message.

Basic idea:

  1. A msg that originates at the node with \(id=i\) is delayed \(2^i-1\) rounds at each node before being forwarded.
  2. Each node that wakes up spontaneously sends a wake-up msg around the ring; this msg is forwarded without delay.
  3. If a node receives a wake-up msg before it has started, it acts only as a relay and forwards msgs.

Only nodes that wake up spontaneously can be elected leader.

Initial value: asleep = true; waiting = Φ; R = set of msgs received in this computation event; S = Φ; // S: msgs to be sent

if asleep then {
    asleep := false;
    if R = Φ then { // Pi has received no msg: spontaneous wake-up
        min := id;        // Pi takes part in the election
        S := S + {<id>};  // ready to send
    } else { // a msg arrived before Pi started, so Pi does not take part in the election
        min := ∞; // set min to ∞ so that Pi acts as a relay node
        // relay := true; ?
    }
}
for each <m> in R do { // processing m amounts to removing it from R
    if m < min then { // the received id is smaller than any seen so far, so it passes
        become not elected;  // Pi is not elected
        // Could a relay flag be used so that relay nodes forward without delay?
        add <m> to waiting and record the round in which it was added; // m is forwarded after a delay
        min := m;
    } // if m > min, m is swallowed
    if m = id then become elected; // Pi is elected
} //endfor
for each <m> in waiting do
    if <m> was received 2^m-1 rounds ago then
        remove <m> from waiting and add it to S;
send S to left;

Messages sent:

  • First category: first-phase msgs (wake-up msgs)
  • Second category: second-phase msgs sent before the eventual leader's msg enters its own second phase (these are sent by other nodes)
  • Third category: second-phase msgs sent after the eventual leader's msg enters its own second phase (including the msgs sent by the leader)

The first category contains at most \(n\) msgs

The second category contains at most \(n\) msgs

The third category contains at most \(2n\) msgs

Message complexity: at most \(4n\), that is \(O(n)\)

Restricted lower bound \(\Omega(n \lg n)\)

Order equivalence: two rings \(x_0,x_1,\cdots,x_{n-1}\) and \(y_0,y_1,\cdots,y_{n-1}\) are order equivalent if, for all \(i\) and \(j\), \(x_i < x_j\) if and only if \(y_i < y_j\)

If an algorithm depends only on the relative order of the identifiers on the ring and not on their specific values, it is called comparison-based.

For each \(n \ge 8\) that is a power of 2, there is a ring \(S_n\) of size n such that every comparison-based synchronous leader election algorithm A sends \(\Omega(n \lg n)\) msgs in some admissible execution on \(S_n\).

Construct \(S_n\)

  1. Define a ring \(R_n^{rev}\) of size n in which the id of \(P_i\) is \(rev(i)\), the integer whose binary representation is the reverse of the binary representation of \(i\).
  2. If the ring is divided into consecutive segments of length \(j\), where \(j\) is a power of 2, then these segments are pairwise order equivalent.
  3. The number of segments is \(\frac{n}{j}\)
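
The construction is small enough to check by hand or by program. The sketch below builds the ids of \(R_n^{rev}\) and tests the order equivalence of its segments; the helper names `rev`, `reversal_ring` and `order_equivalent` are assumptions for this example, and n is assumed to be a power of 2.

```python
def rev(i, bits):
    """Integer whose `bits`-bit binary representation is that of i reversed."""
    return int(format(i, f"0{bits}b")[::-1], 2)

def reversal_ring(n):
    """Ids of R_n^rev in ring order; n is assumed to be a power of two."""
    bits = n.bit_length() - 1
    return [rev(i, bits) for i in range(n)]

def order_equivalent(xs, ys):
    """x_i < x_j if and only if y_i < y_j, for all i and j."""
    size = len(xs)
    return all((xs[i] < xs[j]) == (ys[i] < ys[j])
               for i in range(size) for j in range(size))

ring = reversal_ring(8)
print(ring)

for j in (2, 4):   # segment lengths that are powers of 2
    segments = [ring[s:s + j] for s in range(0, len(ring), j)]
    print(j, all(order_equivalent(segments[0], seg) for seg in segments))
```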

1.4 computation model

Why the computation model is complex:

  • The system consists of concurrent components
  • There is no global clock
  • Possible failures of components must be captured


  • Causal relationship
  • Consistent state
  • Global state

Why do distributed systems lack global system state?

  1. Non-instantaneous communication (propagation delay, resource contention, retransmission of lost msgs)

  2. Relativistic effects (the physical clocks of most computers drift, and clock synchronization is still a problem)

  3. Interruptions (it is impossible to observe the global state of a distributed system at a single instant)

Order: \(e_1 < e_2\) means that event \(e_1\) occurs before \(e_2\)


  • Events on the same node are totally ordered: \(e_1 <_p e_2\)
  • If \(e_1\) sends a message and \(e_2\) receives that message, then \(e_1 <_m e_2\)

Happens-before relation: \(<_H\), a partial order

Concurrent events: two events that cannot be ordered by \(<_H\)

The happens-before relation is sometimes described as a directed acyclic graph

How to extend the partial order \(<_H\) to a total order:

  • Topological ordering on directed acyclic graph DAG
  • Lamport algorithm

1.4.1 Lamport timestamp

Basic idea:

  • Every event e has an attached timestamp: e.TS
  • Every node has a local timestamp: my_TS
  • Every msg has an attached timestamp: m.TS
  • When a node executes an event, it advances its local timestamp and assigns it to the event.
  • When a node sends a msg, it attaches its local timestamp to the msg.
  • When a msg is received, the local timestamp is first raised to the maximum of the local and the msg timestamp.
Initially: my_TS = 0;
On event e:
    if (e == receive message m) then
        my_TS = max(m.TS, my_TS); // take the larger of the msg timestamp and the node timestamp
    my_TS++; // every event advances the local clock
    e.TS = my_TS; // timestamp event e
    if (e == send message m) then
        m.TS = my_TS; // timestamp the message

The timestamp of each event is greater than the timestamps of all events that precede it.

Problem: different events may have the same timestamp (concurrent events)

Improved algorithm: use the node address as the low-order part of the timestamp, for example \(1.1, 2.2\). With this labeling, lexicographic order gives a total order.

Problem: timestamps alone cannot tell whether two events are causally related
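
A small class makes both the rule and the tie-break concrete. This is a sketch that assumes the usual convention of incrementing the counter on every event; the class name `LamportNode` and the use of `(timestamp, node address)` tuples for the lexicographic order are choices made for this example.

```python
class LamportNode:
    """One node with a Lamport clock; events are labeled (timestamp, node address)."""

    def __init__(self, address):
        self.address = address
        self.ts = 0

    def local_event(self):
        self.ts += 1
        return (self.ts, self.address)

    def send(self):
        self.ts += 1
        return self.ts                      # timestamp carried by the message

    def receive(self, msg_ts):
        self.ts = max(self.ts, msg_ts) + 1  # jump past the sender's clock
        return (self.ts, self.address)

p, q = LamportNode(1), LamportNode(2)
a = p.local_event()      # event on node 1
m = p.send()             # node 1 sends a message
b = q.local_event()      # concurrent event on node 2, same counter value as a
c = q.receive(m)         # node 2 receives the message

print(a, b, c)
print(sorted([c, b, a]))  # total order by (timestamp, address)
```

Events `a` and `b` share the counter value 1 and are ordered only by the address, which illustrates why the total order does not reveal causality.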

1.4.2 vector timestamp

Vector timestamp VT: \(e.VT\) is an array, and \(e.VT[i]=k\) means that k events on node i causally precede e (counting e itself if e occurs on node i).

Basic idea:

  • Node has local vector timestamp: my_VT

  • Event E has vector timestamp: e.VT

  • msg vector timestamp: m.VT

Initially: my_VT = [0,...,0];
On event e:
    if (e == receive message m) then
        for i = 1 to M do // each component of the vector timestamp only ever grows
            my_VT[i] = max(m.VT[i], my_VT[i]);
    my_VT[self]++; // self is the index of this node
    e.VT = my_VT; // timestamp event e
    if (e == send message m) then
        m.VT = my_VT; // timestamp message m

Vector timestamp comparison:

e1.VT = (5,4,1,3)
e2.VT = (3,6,4,2)
e3.VT = (0,0,1,3)
  1. e1 and e2 are concurrent: neither vector is less than or equal to the other in every component.

  2. e3 precedes e1 in causal order: every component of e3.VT is less than or equal to the corresponding component of e1.VT.
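
The comparison rule can be written in a few lines and checked against the three timestamps above. The function names `happened_before` and `concurrent` are illustrative.

```python
def happened_before(u, v):
    """u precedes v causally: u <= v in every component and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def concurrent(u, v):
    """Neither timestamp precedes the other."""
    return u != v and not happened_before(u, v) and not happened_before(v, u)

e1 = (5, 4, 1, 3)
e2 = (3, 6, 4, 2)
e3 = (0, 0, 1, 3)

print(concurrent(e1, e2))        # e1 and e2 are concurrent
print(happened_before(e3, e1))   # e3 precedes e1
```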

1.4.3 causal communication

A processor cannot choose when msgs arrive, but it can hold back msgs that arrive too early.

FIFO delivery in TCP is an example.

Basic idea:

  • Hold back a message m sent from P until it can be concluded that no message causally earlier than m can still arrive
  • On each node:
    • \(earliest[1 \ldots M]\): for each node, a lower bound on the timestamps of the messages that can still be delivered from that node
    • \(blocked[1 \ldots M]\): an array of blocking queues, one queue per node
Causal Msg Delivery
    timestamp lk:
        with Lamport timestamps, lk = 1;
        with vector timestamps, lk = (0,...,0,1,0,...,0), where the k-th component is 1;
    earliest[k] = lk, k = 1,...,M // lower bound on the timestamps of messages that can be delivered from each node
    blocked[k] = {}, k = 1,...,M // every blocking queue is empty
On the receipt of msg <m> from node p:
    delivery_list = {}
    if (blocked[p] == empty) then
        earliest[p] = m.timestamp;
    blocked[p].push_back(m); // queue the received message
    // process the blocking queues
    while (∃k such that blocked[k] is not empty &&
        for all i = 1,...,M (except k and self): not_earliest(earliest[i], earliest[k], i))
        // no message earlier than the head of blocked[k] can still arrive
        remove the head element m' of blocked[k] and append it to delivery_list;
        if (blocked[k] != empty) then
            earliest[k] = timestamp of the new head of blocked[k];
        else
            increment earliest[k] by lk;
    deliver the msgs in delivery_list; // in causal order

Deadlock may occur: if some node sends no msg for a long time, the msgs waiting on it stay blocked.

The causal communication algorithm is often used in multicast.
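
For the multicast case, a commonly used alternative is to hold back a broadcast until its vector timestamp shows that everything it depends on has been delivered. The sketch below uses that vector-clock rule, which is a different and simpler technique than the earliest/blocked bookkeeping above. It assumes every message is a broadcast to all nodes and that `ts[k]` counts the broadcasts of node k known to the sender; the class name `CausalReceiver` is hypothetical.

```python
class CausalReceiver:
    """Holds back broadcasts until all causally earlier ones are delivered."""

    def __init__(self, n):
        self.vc = [0] * n       # vc[k] = broadcasts from node k delivered here
        self.blocked = []       # messages that arrived too early
        self.delivered = []     # payloads in delivery order

    def _deliverable(self, sender, ts):
        next_from_sender = ts[sender] == self.vc[sender] + 1
        deps_met = all(ts[k] <= self.vc[k]
                       for k in range(len(ts)) if k != sender)
        return next_from_sender and deps_met

    def receive(self, sender, ts, payload):
        self.blocked.append((sender, ts, payload))
        progress = True
        while progress:                      # keep draining while something unblocks
            progress = False
            for item in list(self.blocked):
                s, t, p = item
                if self._deliverable(s, t):
                    self.blocked.remove(item)
                    self.vc[s] += 1
                    self.delivered.append(p)
                    progress = True

# Node 0 broadcasts "a"; node 1 delivers "a" and then broadcasts "b".
# The receiver (a third node) gets "b" first and must hold it back.
r = CausalReceiver(3)
r.receive(1, [1, 1, 0], "b")
print(r.delivered)              # "b" is blocked
r.receive(0, [1, 0, 0], "a")
print(r.delivered)              # "a" first, then "b"
```

As with the algorithm above, a message stays blocked indefinitely if one of its causal predecessors never arrives.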