Review points of parallel and distributed computing

Posted by web_noob on Sat, 08 Jan 2022 03:35:27 +0100

Concepts

Computer architecture

SISD

  • Single Instruction stream, Single Data stream
  • Single instruction, single data: the serial computer
  • In any clock cycle the CPU executes only one instruction stream and consumes only one data stream as input
  • Deterministic execution

SIMD

  • Single Instruction stream, Multiple Data stream
  • Single instruction, multiple data: a parallel computer
  • In any clock cycle every processing unit executes the same instruction, but each unit may operate on a different data element
  • Well suited to highly regular problems such as image processing
  • Synchronous and deterministic execution

MISD

  • Multiple Instruction stream, Single Data stream
  • Multiple instruction, single data: a parallel computer of which few practical examples exist
  • A single data stream is fed to multiple processing units, each of which operates on it through its own independent instruction stream

MIMD

  • Multiple Instruction stream, Multiple Data stream
  • Multiple instruction, multiple data: the most common kind of parallel computer
  • Every processing unit executes its own instruction stream and works on its own data stream
  • Execution may be synchronous or asynchronous, deterministic or non-deterministic

Shared memory

A memory organization for parallel computers in which all processors share a global address space

  • Characteristics
    All processors access all memory as a single global address space
    Multiple processors operate independently but share the same memory resources
    A change made to a memory location by one processor is visible to all other processors
  • Advantages
    The global address space makes programming easy
    Memory is close to the CPUs, so data sharing between tasks is fast and uniform
  • Disadvantages
    Poor scalability between memory and CPUs: adding processors increases traffic on the shared memory-CPU path
    Expensive to design and produce at large processor counts

Shared-memory machines are classified as UMA or NUMA according to memory access time

UMA (uniform memory access)

  • Uniform Memory Access
  • All processors have equal access time to all memory
  • When one processor updates a shared memory location, all the other processors learn of the change; this is cache coherence
  • Typically realized as SMP (Symmetric Multi-Processor) machines

NUMA (non-uniform memory access)

  • Non-Uniform Memory Access
  • Often built by physically linking two or more SMPs
  • One SMP can directly access the memory of another, but not all processors have equal access time to every memory
  • Memory access across the link is slower
  • Cache coherence can still be maintained (CC-NUMA)

Distributed memory

The other main memory organization for parallel computers

  • Characteristics
    A communication network is required to connect the processors' memories
    Each processor has its own local memory; there is no global address space
    Processors operate independently, and cache coherence does not apply
    When a processor needs data held by another processor, the programmer must explicitly define how and when the data is transferred, and must synchronize the tasks
    The network structures used for data transfer vary widely
  • Advantages
    Memory scales with the number of processors
    Each processor accesses its own memory quickly, without interference and without the overhead of keeping caches coherent
    Cost-effective
  • Disadvantages
    The programmer is responsible for many of the details of communication
    With no global address space, it is difficult to map existing global-memory data structures onto distributed memory

Parallel computing model

PRAM

  • Parallel Random Access Machine
  • A shared-memory SIMD model
    There is a centralized shared memory (SM) and a single instruction controller; processors exchange data through reads and writes (R/W) to SM, computing with implicit synchronization.
  • Concurrent read/write control strategies
    Classified by whether processors may read or write the same shared memory cell simultaneously (E: Exclusive, C: Concurrent):
    EREW: exclusive read, exclusive write
    CREW: concurrent read, exclusive write
    CRCW: concurrent read, concurrent write
  • The computing power of the strategies increases from EREW (weakest) through CREW to CRCW (strongest)

APRAM

  • Asynchronous PRAM (Asynchronous Parallel Random Access Machine), also known as split-phase PRAM
  • An MIMD model
    Each processor has its own local memory, local clock and local program; there is no global clock, and the processors execute asynchronously; processors communicate through the shared memory (SM); dependencies between processors must be enforced by explicit synchronization barriers in the parallel program.

BSP

  • Bulk Synchronous Parallel: a bulk-synchronous model, i.e. an asynchronous MIMD-DM (distributed memory) model
  • Computation proceeds in supersteps: asynchronous parallel execution within a superstep, explicit barrier synchronization between supersteps (see the cost sketch below)
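
In the standard BSP cost model (a textbook companion to the model; the symbols are defined here, not taken from the original notes), one superstep costs

    T = max(w_i) + g * h + l

where w_i is process i's local computation, h is the maximum amount of data any process sends or receives in the superstep, g is the communication cost per data unit, and l is the cost of the barrier synchronization.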

General design process of parallel method PCAM

  • PCAM: Partitioning, Communication, Agglomeration, Mapping
  • Partition the computation, define the communication, agglomerate tasks, and map them onto processors

MPI parallel programming

  • Message Passing Interface
  • A cross-language parallel programming technique based on message passing that supports both point-to-point and broadcast (collective) communication. The Message Passing Interface is a programming interface standard, not a programming language: MPI is a specification, and many implementations of the standard exist today

Parallel performance evaluation

Common evaluation criteria

Amdahl's Law (fixed load)
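
With the load fixed, let f be the serial fraction of the work and p the number of processors; the law's standard form gives the speedup as

    S(p) = 1 / (f + (1 - f) / p)

so S(p) → 1/f as p → ∞: the serial fraction caps the achievable speedup.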


Gustafson's law

With a fixed load, adding ever more processors quickly becomes pointless in practice, so as the number of processors grows the load must be scaled up with it (see the formula below)
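
With the parallel part of the load scaled up alongside p (serial fraction f), the scaled speedup in its standard form is

    S(p) = f + p(1 - f) = p - f(p - 1)

which grows roughly linearly with the number of processors.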

Speedup and efficiency

Efficiency E = S / p
(speedup divided by the number of processors)
If the problem size stays fixed while the number of processors increases, efficiency falls
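For example (illustrative numbers): if p = 4 processors achieve a speedup of S = 3.2, the efficiency is E = 3.2 / 4 = 0.8; if the same fixed-size problem on p = 16 processors reaches only S = 8, the efficiency has fallen to E = 0.5.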

⭐⭐ The three modes of message passing

Processes achieve data sharing and synchronization through message passing

  • Synchronous message passing
    Synchronous Message Passing
    Send starts: once both parties have reached the send point and the receive point
    Send return and receive return: once the message has been received
    Receive starts: once both parties have reached the send point and the receive point
    Waiting time: longest
  • Blocking message passing
    Blocking Message Passing
    Send starts: once the sender reaches the send point
    Send return: once the message has been sent out
    Receive starts: once the receiver reaches the receive point
    Receive return: once the message has been received
    Waiting time: in between
  • Non-blocking message passing
    Nonblocking Message Passing
    Send starts: once the sender reaches the send point
    Send return: immediately, after notifying the system of the message to send
    Receive starts: once the receiver reaches the receive point
    Receive return: immediately, after notifying the system of the message to receive
    Waiting time: shortest

Similarities and differences of the three modes

All three achieve data sharing and process synchronization through messages; they differ in when a call returns and how long the caller waits. Synchronous passing waits for both parties and for delivery (longest wait); blocking passing waits only until the local send or receive completes; non-blocking passing returns immediately and defers completion to a later check (shortest wait). A rough MPI mapping is sketched below.
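In MPI the three modes correspond roughly to MPI_Ssend, MPI_Send, and MPI_Isend/MPI_Irecv. A minimal sketch (buf, count, dest and tag are assumed to be set up already):

MPI_Ssend(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);  /* synchronous: returns only after the matching receive has started */
MPI_Send(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);   /* blocking: returns once buf may be reused (sent or buffered) */
MPI_Request req;
MPI_Isend(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);  /* non-blocking: returns at once */
/* ... overlap computation here ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);  /* completion check before buf is reused */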

MPI point-to-point and collective communication

Point-to-point communication

Example: a pipeline process Q repeatedly
Q receives the next X for the coming calculation
Q sends the Y it calculated last time
and then calculates a new Y from the X received previously
The code (originally shown in a figure) is a process pipeline; overlapping the receives and sends with computation requires double buffering

MPI_Isend and MPI_Irecv are non-blocking: they return immediately after being called, so a wait function must be called before their buffers are touched again
X, Y, Xbuf0, Xbuf1, Ybuf0, Ybuf1, Xin and Yout are all pointers
Xin and Yout point to the buffers currently being received into and sent from
X and Y point to the buffers used by the current calculation; a sketch of the loop follows
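
A minimal sketch of the double-buffered pipeline loop under the naming above (N, STEPS, prev, next, tag and compute() are assumptions, not from the original figure):

double Xbuf0[N], Xbuf1[N], Ybuf0[N], Ybuf1[N];
double *X = Xbuf0, *Xin = Xbuf1;    /* X: computed on now; Xin: being received into */
double *Y = Ybuf0, *Yout = Ybuf1;   /* Y: produced now; Yout: being sent out */
MPI_Request rreq, sreq;
double *tmp;

for (int step = 0; step < STEPS; step++) {
    MPI_Irecv(Xin, N, MPI_DOUBLE, prev, tag, MPI_COMM_WORLD, &rreq);   /* receive the next X */
    MPI_Isend(Yout, N, MPI_DOUBLE, next, tag, MPI_COMM_WORLD, &sreq);  /* send the last Y */
    compute(X, Y, N);                       /* calculate this step's Y from this step's X */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);     /* the next X has arrived */
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);     /* the last Y has been sent */
    tmp = X; X = Xin; Xin = tmp;            /* swap the X double buffers */
    tmp = Y; Y = Yout; Yout = tmp;          /* swap the Y double buffers */
}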

Collective communication


All processes in the communicator must call the collective routine

  • Broadcast
    MPI_Bcast(Address, Count, DataType, Root, Comm)
    The process designated Root sends the same message to every process in the communicator Comm
  • Scatter / Gather
MPI_Scatter(SendAddress, SendCount, SendDataType, RecvAddress, RecvCount, RecvDataType, Root, Comm);
MPI_Gather(SendAddress, SendCount, SendDataType, RecvAddress, RecvCount, RecvDataType, Root, Comm);

Scatter: the root process sends a different message to each process; the outgoing messages are stored in order in the root process's send buffer
Gather: the root process receives a message from each process; the incoming messages are stored in order in the root process's receive buffer
Scatter and gather are inverse operations; a usage sketch follows
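
A minimal usage sketch (NPROCS is an assumed constant equal to the number of processes; one int per process):

int sendbuf[NPROCS], item, result[NPROCS];
/* on the root, sendbuf[i] holds the value destined for rank i */
MPI_Scatter(sendbuf, 1, MPI_INT, &item, 1, MPI_INT, 0, MPI_COMM_WORLD);
item *= 2;   /* each process works on its own piece */
MPI_Gather(&item, 1, MPI_INT, result, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* result[] on the root now holds every rank's doubled value, in rank order */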

  • Extended gather-and-broadcast (Allgather)
MPI_Allgather(SendAddress, SendCount, SendDataType, RecvAddress, RecvCount, RecvDataType, Comm);

Every process gathers the messages of all processes, as if a gather were followed by a broadcast of the result

  • Global exchange (Alltoall)
MPI_Alltoall(SendAddress, SendCount, SendDataType, RecvAddress, RecvCount, RecvDataType, Comm);

Each of the n processes sends a distinct message to every process, with the outgoing messages stored in order in its send buffer
Equivalently, each process receives a message from every one of the n processes, with the incoming messages stored in order in its receive buffer
A global exchange is equivalent to n gathers, so it involves n^2 message transfers

  • Reduction
    There are two reduction operations: reduce and scan
    Reduce: each process stores the value to be reduced at SendAddress; all processes combine these values with the operation Op into a final result stored at RecvAddress of the Root process
MPI_Reduce(SendAddress, RecvAddress, Count, DataType, Op, Root, Comm);

Scan (prefix reduction): there is no root; the n values are combined into n partial results, and the result over processes 0..i is stored at RecvAddress of process i. The combining operation is Op; a small sketch follows.

MPI_Scan(SendAddress, RecvAddress, Count, DataType, Op, Comm);
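
A prefix-sum sketch (each rank contributes rank + 1; MPI_Scan is inclusive):

int my_val = my_rank + 1, prefix;
MPI_Scan(&my_val, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* rank 0 gets 1, rank 1 gets 1 + 2 = 3, rank 2 gets 1 + 2 + 3 = 6, ... */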
  • Barrier
MPI_Barrier(Comm);

All processes in the communicator synchronize: each process waits until every process has reached its Barrier call

⭐⭐ Computing the value of π with MPI

#include <stdio.h>
#include <mpi.h>
#include <math.h>
long    n,    	/*number of slices         */
        i;    	/* slice counter           */
double sum, 	/* running sum             */
        pi,     /* approximate value of pi */
        mypi,
        x,      /* independent var.        */
        h;      /* base of slice           */
int group_size,my_rank;
 
int main(int argc, char* argv[])
{
	MPI_Init(&argc,&argv);
	MPI_Comm_rank( MPI_COMM_WORLD, &my_rank);
	MPI_Comm_size( MPI_COMM_WORLD, &group_size);
	 
	if (my_rank == 0) n = 2000;   /* root chooses the number of slices */
	/* Broadcast n to all other nodes */
	MPI_Bcast(&n,1,MPI_LONG,0,MPI_COMM_WORLD); 
	h = 1.0/(double) n;
	sum = 0.0;
	//Each process calculates a portion
	for (i = my_rank; i < n; i += group_size) {
		x = h*(i+0.5);
		sum = sum +4.0/(1.0+x*x);
	}
	mypi = h*sum;
	/* Reduce every process's partial sum mypi into pi on the root, using summation */ 
	MPI_Reduce(&mypi,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD); 
	if(my_rank==0) {        /* Node 0 handles the output */
		printf("pi is approximately : %.16lf\n",pi);
	}
	MPI_Finalize();
	return 0;
}

MapReduce

Model principle

For big data whose parts do not depend on one another, the best route to parallelism is divide and conquer.
MPI and similar approaches lack a high-level parallel programming model: the programmer must specify storage, partitioning and computation, and with no unified parallel framework must attend to many details.
MapReduce therefore designs and provides a unified computing framework that hides most system-level details from the programmer, offering a high-level parallel programming abstraction built from the Map and Reduce functions.

A large data set is divided into many independent splits, which are processed in parallel by multiple Map tasks.
Each input record is a pair <k1, v1>, where k1 is the key and v1 the data; Map processing generates many intermediate results, i.e. list(<k2, v2>).
Reduce merges the intermediate results, combining the values with equal keys into <k2, list(v2)>, and finally produces the results <k3, v3>.

⭐⭐ WordCount example

Flow: the input is split; each Map task emits <word, 1> for every word in its split; the shuffle groups the pairs by word; each Reduce task sums the counts and emits <word, total>.
Core pseudo code:

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
	private final static IntWritable one = new IntWritable(1);
	private final Text word = new Text();
	/*
	key and value are the input key-value pair (byte offset, one line of text)
	context collects the output
	*/
	public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
		//Each call receives one line of input; split it on spaces
		String lineContent = value.toString();	//Take out the line of text
		String[] words = lineContent.split(" ");	//Split the line into words
		
		for (String w : words) {	//For each word, emit the pair <word, 1>
			word.set(w);
			context.write(word, one);
		}
	}
}

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	/*
	key and values are the intermediate results output by map, grouped by key
	context collects the reducer's final output
	*/
	public void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int sum = 0;	//Total number of occurrences of this word
		for (IntWritable count : values) {
			sum += count.get();
		}
		context.write(key, new IntWritable(sum));
	}
}

public static void main(String[] args) throws Exception {
	//Make the relevant configuration
	Configuration configuration = new Configuration();
	//Create the job object
	Job job = Job.getInstance(configuration);
	//Set the class that carries this job
	job.setJarByClass(Myjob.class);
	//Set the mapper class
	job.setMapperClass(WordCountMapper.class);
	//Set the reducer class
	job.setReducerClass(WordCountReducer.class);
	//Set the key type of the output
	job.setOutputKeyClass(Text.class);
	//Set the value type of the output
	job.setOutputValueClass(IntWritable.class);
	//Set the location of the input file
	FileInputFormat.addInputPath(job, new Path("/hadoop/hadoop.txt"));
	//Set the location of the output
	FileOutputFormat.setOutputPath(job, new Path("/hadoop/out"));
	System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Fault-tolerance strategies (4 kinds)

Fault tolerance: the system continues to provide service even when failures occur

Checkpoint fault tolerance

Process: periodic backup and rollback recovery
When a task fails, computation can resume from the last successful checkpoint

Trade-off: backup overhead and recovery efficiency cannot both be optimized; frequent checkpoints recover quickly but cost more, infrequent checkpoints the reverse

Lineage fault tolerance

Dependencies between input and output:
Narrow dependency: one-to-one
Wide dependency: one-to-many

Lineage fault tolerance records the dependencies (lineage) of intermediate results, and after a failure recomputes only the lost part according to the lineage information
If the lineage chain grows too long, recomputation becomes expensive, so checkpoints are still needed to avoid excessive recomputation

Recomputing F1 from its input D1 recovers the data exactly, but this is efficient only for narrow dependencies

Speculative execution

If a task runs too slowly, a backup copy of the task is launched, and the result of whichever finishes first, original or backup, is used.
Speculative execution is off by default, because duplicated tasks lower overall cluster efficiency.
It trades space for time, and is generally enabled only when resources are idle and the job is already largely complete

Compensated fault tolerance

No advance preparation: when a failure occurs, the lost data is reset and recomputed
Advantage: no redundancy (the previous three strategies all carry redundancy, of either data or computation)
Disadvantage: very limited applicability; suitable only for certain algorithms

Note: please point out any errors!