Paper: http://nil.csail.mit.edu/6.824/2021/schedule.html
MapReduce principle
- MapReduce first splits the input files into pieces of 16-64 MB each, then starts copies of the user program on a cluster of machines.
- One copy becomes the master and the rest become workers. The master picks idle workers and assigns each one a map or reduce task (M map tasks and R reduce tasks in total).
- A map worker reads its input split, runs the user's Map function over it (see the word-count sketch after this list), and buffers the resulting intermediate pairs in memory.
- The buffered pairs are periodically written to local disk, partitioned into R regions (one per reducer). The locations of these files are reported back to the master, which forwards them to the reduce workers.
- A reduce worker receives the locations of its intermediate files and reads them via RPC. Once it has read everything, it sorts the intermediate <K, V> pairs by key and groups the values that share a key.
- The reduce worker iterates over the sorted data, handing each key and its grouped values to the user's Reduce function, and writes the final results to that reducer's output file (one output partition per reducer).
- After all map and reduce tasks are completed, the master wakes up the user program.
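For concreteness, here is what the user-supplied pair of functions can look like for word count, modeled on the lab's mrapps/wc.go plugin. The local KeyValue type below is a stand-in for the one the lab defines in the mr package; this is a sketch, not the walkthrough's own code:

```go
package main

import (
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is the intermediate pair type passed between Map and Reduce.
type KeyValue struct {
	Key   string
	Value string
}

// Map emits a <word, "1"> pair for every word in its input split.
func Map(filename string, contents string) []KeyValue {
	// split on anything that is not a letter
	words := strings.FieldsFunc(contents, func(r rune) bool { return !unicode.IsLetter(r) })
	kva := []KeyValue{}
	for _, w := range words {
		kva = append(kva, KeyValue{Key: w, Value: "1"})
	}
	return kva
}

// Reduce receives one word together with all of its "1" values and
// returns the total count as a string.
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}
```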
MapReduce implementation process
Master
The paper mentions that each (Map or Reduce) task is in one of three states: idle, in-progress, or completed.
```go
// MasterTaskStatus enumerates a task's execution stage; per the paper,
// a task is idle, in progress, or completed.
type MasterTaskStatus int

const (
	Idle MasterTaskStatus = iota
	InProgress
	Completed
)
```
The Master records the following metadata for each Task:
```go
// MasterTask is the Task metadata the Master records: the execution
// phase, the start time, and a pointer to the Task itself.
type MasterTask struct {
	TaskStatus    MasterTaskStatus // task execution phase
	StartTime     time.Time        // when the task was handed out
	TaskReference *Task            // the Task this metadata describes
}
```
The Master also stores the locations of the intermediate files generated by the Map tasks: each of the M Map tasks produces R files, one per reducer.
```go
// Master is the coordinator node.
type Master struct {
	TaskQueue     chan *Task          // task queue, implemented with a buffered channel
	TaskMeta      map[int]*MasterTask // metadata for every task in the system, keyed by taskId
	MasterPhase   State               // the Master's current phase
	NReduce       int                 // number of reduce tasks (R)
	InputFiles    []string            // input file names
	Intermediates [][]string          // R rows; row r collects the intermediate file paths destined for reduce task r (M paths once all maps finish)
}
```
Map and Reduce share a single Task structure, which is general enough to cover both phases.
```go
// Task is the unit of work handed to a worker; it serves both phases.
type Task struct {
	Input         string   // name of the input file (Map tasks)
	TaskState     State    // task status
	NReducer      int      // number of reducers (R)
	TaskNumber    int      // taskId
	Intermediates []string // intermediate file paths: the R files a Map task wrote, or the files a Reduce task must read
	Output        string   // output file name (Reduce tasks)
}
```
The states of the Task and of the Master are merged into one State type:
```go
// State enumerates the phases shared by the Master and the Tasks.
type State int

const (
	Map State = iota // iota enumerates from 0
	Reduce
	Exit
	Wait
)
```
Implementing Map and Reduce
1. Start the master
```go
// MakeMaster creates a Master; main/mrmaster.go calls this function.
// nReduce is the number of reduce tasks to use. The Master acts as the
// service registry and scheduler, handing tasks out to workers.
func MakeMaster(files []string, nReduce int) *Master {
	m := Master{
		// task queue; a buffered channel gives first-in-first-out ordering for free
		TaskQueue: make(chan *Task, max(nReduce, len(files))),
		// task metadata, looked up by taskId
		TaskMeta: make(map[int]*MasterTask),
		// both the Master and the Tasks start in the Map phase
		MasterPhase: Map,
		NReduce:     nReduce,
		InputFiles:  files,
		// intermediate file paths produced in the Map phase, one row per reducer
		Intermediates: make([][]string, nReduce),
	}

	// TODO: split the input files into 16MB-64MB pieces

	// create the Map tasks
	m.createMapTask()

	// start the RPC server so workers can reach the Master's methods
	m.server()

	// crash recovery: a goroutine keeps checking for timed-out tasks
	go m.catchTimeOut()
	return &m
}
```
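One detail: before Go 1.21 the language has no built-in max for ints, so the channel sizing above assumes a small helper like this:

```go
// max returns the larger of two ints, so the task queue can hold
// whichever phase has more tasks.
func max(a, b int) int {
	if a > b {
		return a
	}
	return b
}
```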
Create Map task
```go
// createMapTask creates one Map task per input file.
func (m *Master) createMapTask() {
	for idx, fileName := range m.InputFiles {
		// build a Task for this file
		taskMeta := Task{
			Input:      fileName,
			TaskState:  Map,
			NReducer:   m.NReduce,
			TaskNumber: idx,
		}
		// enqueue the task
		m.TaskQueue <- &taskMeta
		// record the task's metadata, keyed by taskId
		m.TaskMeta[idx] = &MasterTask{
			TaskStatus:    Idle,
			TaskReference: &taskMeta,
		}
	}
}
```
A goroutine continually checks for timed-out tasks, so a crashed or straggling worker cannot stall the job:
```go
// catchTimeOut runs in its own goroutine and re-queues tasks whose
// workers appear to have crashed.
func (m *Master) catchTimeOut() {
	for {
		time.Sleep(5 * time.Second)
		// lock: m.MasterPhase and m.TaskMeta are shared with the RPC handlers
		mu.Lock()
		// once the Master reaches the Exit phase, stop checking
		if m.MasterPhase == Exit {
			mu.Unlock()
			return
		}
		for _, masterTask := range m.TaskMeta {
			// a task that has been in progress for more than 10 seconds is
			// assumed lost and is re-queued for another worker to pick up
			if masterTask.TaskStatus == InProgress && time.Now().Sub(masterTask.StartTime) > 10*time.Second {
				m.TaskQueue <- masterTask.TaskReference
				masterTask.TaskStatus = Idle
			}
		}
		mu.Unlock()
	}
}
```
2. The master listens to worker RPC calls and assigns tasks
```go
// AssignTask is the RPC handler workers call to request work.
func (m *Master) AssignTask(args *ExampleArgs, reply *Task) error {
	// lock the shared Master state
	mu.Lock()
	defer mu.Unlock()
	if len(m.TaskQueue) > 0 {
		// a task is waiting in the queue: hand a copy of it to the worker
		*reply = *<-m.TaskQueue
		// record that the task is now running and when it started
		m.TaskMeta[reply.TaskNumber].TaskStatus = InProgress
		m.TaskMeta[reply.TaskNumber].StartTime = time.Now()
	} else if m.MasterPhase == Exit {
		// the queue is empty and the Master is shutting down:
		// return an Exit task so the worker terminates
		*reply = Task{TaskState: Exit}
	} else {
		// the queue is empty but the job is not done: ask the worker to wait
		*reply = Task{TaskState: Wait}
	}
	return nil
}
```
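m.server() itself is not shown in this walkthrough; it is essentially the lab skeleton's RPC boilerplate, roughly like the sketch below (imports: log, net, net/http, net/rpc, os, strconv):

```go
// server starts a thread that listens for RPCs from the workers.
func (m *Master) server() {
	rpc.Register(m)
	rpc.HandleHTTP()
	sockname := masterSock()
	os.Remove(sockname)
	l, e := net.Listen("unix", sockname)
	if e != nil {
		log.Fatal("listen error:", e)
	}
	go http.Serve(l, nil)
}

// masterSock returns a unique UNIX-domain socket name in /var/tmp,
// so this master cannot collide with other users' masters.
func masterSock() string {
	s := "/var/tmp/824-mr-"
	s += strconv.Itoa(os.Getuid())
	return s
}
```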
3. Start the worker
```go
// Worker is called by main/mrworker.go. It loops forever, asking the
// Master for work and dispatching on the task's state.
func Worker(mapf func(string, string) []KeyValue, reducef func(string, []string) string) {
	for {
		// fetch an idle task from the Master via RPC
		task := getTask()
		// dispatch on the task's current state
		switch task.TaskState {
		case Map:
			mapper(&task, mapf)
		case Reduce:
			reducer(&task, reducef)
		case Wait:
			time.Sleep(5 * time.Second)
		case Exit:
			return
		}
	}
}
```
4. The worker sends an RPC request to the master
```go
// getTask asks the Master for an idle task via RPC.
func getTask() Task {
	args := ExampleArgs{}
	reply := Task{}
	// call the Master's AssignTask service; the reply is the assigned task
	call("Master.AssignTask", &args, &reply)
	return reply
}
```
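call is the lab skeleton's generic RPC helper; a sketch of it for completeness (imports: fmt, log, net/rpc):

```go
// call sends an RPC request to the master and waits for the response.
// It returns false if something went wrong.
func call(rpcname string, args interface{}, reply interface{}) bool {
	c, err := rpc.DialHTTP("unix", masterSock())
	if err != nil {
		log.Fatal("dialing:", err)
	}
	defer c.Close()

	err = c.Call(rpcname, args, reply)
	if err == nil {
		return true
	}
	fmt.Println(err)
	return false
}
```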
5. The worker obtains the MapTask and submits it to the mapper for processing
```go
// mapper executes one Map task.
func mapper(task *Task, mapf func(string, string) []KeyValue) {
	// read the input file assigned to this task
	content, err := ioutil.ReadFile(task.Input)
	if err != nil {
		log.Fatal("Failed to read file: "+task.Input, err)
	}
	// run the user-supplied map function (e.g. the one in wc.go) to get
	// the intermediate key/value pairs
	intermediates := mapf(task.Input, string(content))

	// partition the intermediate pairs into NReducer buckets in memory
	buffer := make([][]KeyValue, task.NReducer)
	for _, intermediate := range intermediates {
		// hash the key to pick which of the NReducer buckets gets this pair
		slot := ihash(intermediate.Key) % task.NReducer
		buffer[slot] = append(buffer[slot], intermediate)
	}

	// spill each bucket to an intermediate file on local disk
	mapOutput := make([]string, 0)
	for i := 0; i < task.NReducer; i++ {
		mapOutput = append(mapOutput, writeToLocalFile(task.TaskNumber, i, &buffer[i]))
	}
	// record the NReducer file paths on the task so the Master can collect them
	task.Intermediates = mapOutput
	// report completion to the Master
	TaskCompleted(task)
}
```
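ihash and writeToLocalFile are used above without being shown. ihash is the hash from the lab skeleton; writeToLocalFile is a hypothetical sketch that JSON-encodes one bucket into a temp file and atomically renames it to the lab's usual mr-X-Y name (imports: encoding/json, fmt, hash/fnv, io/ioutil, log, os, path/filepath):

```go
// ihash(key) % NReduce picks the reduce bucket for a given key.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// writeToLocalFile spills one bucket of intermediate pairs to local disk
// and returns the file's path. Writing to a temp file and renaming it
// keeps readers from ever seeing a half-written file.
func writeToLocalFile(x int, y int, kvs *[]KeyValue) string {
	dir, _ := os.Getwd()
	tempFile, err := ioutil.TempFile(dir, "mr-tmp-*")
	if err != nil {
		log.Fatal("Failed to create temp file", err)
	}
	enc := json.NewEncoder(tempFile)
	for _, kv := range *kvs {
		if err := enc.Encode(&kv); err != nil {
			log.Fatal("Failed to write kv pair", err)
		}
	}
	tempFile.Close()
	outputName := fmt.Sprintf("mr-%d-%d", x, y)
	os.Rename(tempFile.Name(), outputName)
	return filepath.Join(dir, outputName)
}
```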
6. Notify the master after the worker completes the task
```go
// TaskCompleted notifies the Master that this task has finished.
func TaskCompleted(task *Task) {
	reply := ExampleReply{}
	call("Master.TaskCompleted", task, &reply)
}
```
7. The master receives the completed Task
```go
// TaskCompleted is the RPC handler that marks a task completed, with
// checks against stale and duplicate reports.
func (m *Master) TaskCompleted(task *Task, reply *ExampleReply) error {
	mu.Lock()
	defer mu.Unlock()
	// fault tolerance: drop reports from a stale phase, or for a task
	// that some other worker already finished
	if task.TaskState != m.MasterPhase || m.TaskMeta[task.TaskNumber].TaskStatus == Completed {
		// duplicate report, discard it
		return nil
	}
	m.TaskMeta[task.TaskNumber].TaskStatus = Completed
	go m.processTaskResult(task)
	return nil
}
```
- If all Map tasks have completed, the Master creates the Reduce tasks and moves to the Reduce phase; if all Reduce tasks have completed, it moves to the Exit phase.
```go
// processTaskResult runs in its own goroutine and folds a completed
// task's results into the Master's state.
func (m *Master) processTaskResult(task *Task) {
	mu.Lock()
	defer mu.Unlock()
	switch task.TaskState {
	case Map:
		// collect the Map task's intermediate file paths in memory:
		// row reduceTaskId gathers every path destined for that reducer
		for reduceTaskId, filePath := range task.Intermediates {
			m.Intermediates[reduceTaskId] = append(m.Intermediates[reduceTaskId], filePath)
		}
		// once every Map task is done, enter the Reduce phase
		if m.allTaskDone() {
			m.createReduceTask()
			m.MasterPhase = Reduce
		}
	case Reduce:
		// once every Reduce task is done, enter the Exit phase
		if m.allTaskDone() {
			m.MasterPhase = Exit
		}
	}
}
```
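allTaskDone and createReduceTask never appear in the walkthrough; a minimal sketch of both, assuming the caller (processTaskResult) already holds mu:

```go
// allTaskDone reports whether every task in the current phase has completed.
func (m *Master) allTaskDone() bool {
	for _, task := range m.TaskMeta {
		if task.TaskStatus != Completed {
			return false
		}
	}
	return true
}

// createReduceTask resets the metadata table and enqueues one Reduce
// task per row of Intermediates, i.e. one per reducer.
func (m *Master) createReduceTask() {
	m.TaskMeta = make(map[int]*MasterTask)
	for idx, files := range m.Intermediates {
		taskMeta := Task{
			TaskState:     Reduce,
			NReducer:      m.NReduce,
			TaskNumber:    idx,
			Intermediates: files,
		}
		m.TaskQueue <- &taskMeta
		m.TaskMeta[idx] = &MasterTask{
			TaskStatus:    Idle,
			TaskReference: &taskMeta,
		}
	}
}
```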
8. The worker obtains a Reduce task and submits it to the reducer for processing
```go
// reducer executes one Reduce task.
func reducer(task *Task, reducef func(string, []string) string) {
	// read this reducer's intermediate files back from local disk
	intermediate := *readFromLocalFile(task.Intermediates)
	// sort the pairs lexicographically by key
	sort.Sort(ByKey(intermediate))

	dir, _ := os.Getwd()
	// write to a temp file first so a crash never leaves a partial output
	tempFile, err := ioutil.TempFile(dir, "mr-2021-tmp-*")
	if err != nil {
		log.Fatal("Failed to create temp file", err)
	}
	i := 0
	for i < len(intermediate) {
		// advance j to the end of the run of identical keys
		j := i + 1
		for j < len(intermediate) && intermediate[i].Key == intermediate[j].Key {
			j++
		}
		// gather every value that belongs to this key
		values := []string{}
		for k := i; k < j; k++ {
			values = append(values, intermediate[k].Value)
		}
		// hand the key and its grouped values to the user's reduce function
		output := reducef(intermediate[i].Key, values)
		// append the result line to the temp file
		fmt.Fprintf(tempFile, "%v %v\n", intermediate[i].Key, output)
		i = j
	}
	tempFile.Close()
	// atomically rename the temp file to the final output name
	oname := fmt.Sprintf("mr-2021-out-%d", task.TaskNumber)
	os.Rename(tempFile.Name(), oname)
	task.Output = oname
	TaskCompleted(task)
}
```
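ByKey and readFromLocalFile are also used without being shown. ByKey is the sort.Interface from the lab's sequential example; readFromLocalFile is a hypothetical counterpart to the writeToLocalFile sketch above, decoding the same JSON stream (imports: encoding/json, log, os):

```go
// ByKey sorts a slice of KeyValue pairs by key.
type ByKey []KeyValue

func (a ByKey) Len() int           { return len(a) }
func (a ByKey) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a ByKey) Less(i, j int) bool { return a[i].Key < a[j].Key }

// readFromLocalFile loads every intermediate file assigned to this
// reducer back into memory.
func readFromLocalFile(files []string) *[]KeyValue {
	kva := []KeyValue{}
	for _, filePath := range files {
		file, err := os.Open(filePath)
		if err != nil {
			log.Fatal("Failed to open file "+filePath, err)
		}
		dec := json.NewDecoder(file)
		for {
			var kv KeyValue
			if err := dec.Decode(&kv); err != nil {
				break
			}
			kva = append(kva, kv)
		}
		file.Close()
	}
	return &kva
}
```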
9. The master confirms that all Reduce tasks have completed, enters the Exit phase, and the master and worker goroutines terminate
```go
// Done is called periodically by main/mrmaster.go to find out
// whether the entire job has finished.
func (m *Master) Done() bool {
	mu.Lock()
	defer mu.Unlock()
	return m.MasterPhase == Exit
}
```
10. Concurrency
The Master holds all Task-related state and talks to many workers at once, so that shared state is modified concurrently while workers execute tasks and must be protected by a lock.
```go
// Master is the coordinator node (repeated from above).
type Master struct {
	TaskQueue     chan *Task          // task queue, implemented with a buffered channel
	TaskMeta      map[int]*MasterTask // metadata for every task in the system, keyed by taskId
	MasterPhase   State               // the Master's current phase
	NReduce       int                 // number of reduce tasks (R)
	InputFiles    []string            // input file names
	Intermediates [][]string          // R rows; row r collects the intermediate file paths destined for reduce task r
}
```
Of these fields, TaskQueue, TaskMeta, MasterPhase, and Intermediates are all both read and written after startup. TaskQueue is built on a channel, which carries its own synchronization, so only operations touching Intermediates, TaskMeta, and MasterPhase need the explicit lock. InputFiles and NReduce are written exactly once, when the Master is created, so they never see a concurrent write.
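The lock itself never appears in the snippets; a package-level sync.Mutex is assumed throughout:

```go
import "sync"

// mu guards the Master state shared between the RPC handlers and the
// timeout-checking goroutine: TaskMeta, MasterPhase, and Intermediates.
var mu sync.Mutex
```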
11. Fault tolerance
- The master pings every worker periodically (heartbeat detection).
- If a worker does not respond for some amount of time, the master marks it as failed.
- When a worker fails, its completed Map tasks are re-marked as idle, because their output lives on the failed machine's local disk; completed Reduce tasks need no change, because their output is already in the global file system.
- In-progress tasks that time out are put back on the queue to be re-executed by other workers, as the code below shows.
```go
// catchTimeOut runs in its own goroutine and re-queues tasks whose
// workers appear to have crashed (repeated from above).
func (m *Master) catchTimeOut() {
	for {
		time.Sleep(5 * time.Second)
		// lock: m.MasterPhase and m.TaskMeta are shared with the RPC handlers
		mu.Lock()
		// once the Master reaches the Exit phase, stop checking
		if m.MasterPhase == Exit {
			mu.Unlock()
			return
		}
		for _, masterTask := range m.TaskMeta {
			// a task that has been in progress for more than 10 seconds is
			// assumed lost and is re-queued for another worker to pick up
			if masterTask.TaskStatus == InProgress && time.Now().Sub(masterTask.StartTime) > 10*time.Second {
				m.TaskQueue <- masterTask.TaskReference
				masterTask.TaskStatus = Idle
			}
		}
		mu.Unlock()
	}
}
```