[Source Parsing] Deep Learning Distributed Training Framework horovod(20) - Elastic Training Operator
0x00 Summary
Horovod is a distributed training framework based on AllReduce. Thanks to its support for mainstream deep learning frameworks such as TensorFlow and PyTorch, and its communication optimizations, Horovod is widely used in data-parallel training.
This is the last article about horovod on k8s. It looks at how MPI-Operator might be improved, mainly by studying the source code of the Elastic Training Operator together with the blog posts of its author team, so this article quotes a large amount of source code.
The links to other articles in this series are as follows:
[Source Parsing] Deep Learning Distributed Training Framework horovod(1) - Basic Knowledge
[Source Parsing] Deep Learning Distributed Training Framework horovod(3) - What's Behind horovodrun
[Source Parsing] Deep Learning Distributed Training Framework horovod(5) - Fusion Framework
[Source Parsing] Deep Learning Distributed Training Framework horovod(7) - Distributed Optimizer
[Source Parsing] Deep Learning Distributed Training Framework horovod(8) - on spark
[Source Parsing] Deep Learning Distributed Training Framework horovod(9) - Start on spark
[Source Parsing] Deep Learning Distributed Training Framework horovod(10) - Run on spark
[Source Parsing] Deep Learning Distributed Training Framework horovod(11) - on spark - GLOO Scheme
[Source Parsing] Deep Learning Distributed Training Framework horovod(13) - Driver of Elastic Training
[Source Parsing] Deep Learning Distributed Training Framework horovod(15) - Broadcast & Notification
[Source Parsing] Deep Learning Distributed Training Framework horovod(17) - Fault Tolerance of Elastic Training
[Source Parsing] Deep Learning Distributed Training Framework horovod(18) - kubeflow tf-operator
0x01 Background Knowledge
Sections 0x01 and 0x02 are both based on the Elastic Training Operator team's blog posts, which are really excellent.
1.1 Elastic
Kubernetes and cloud computing provide agility and scalability. With components such as cluster-autoscaler we can set up elasticity policies for training tasks, use the elasticity of Kubernetes to create GPU instances on demand, and reduce idle resources.
However, this scaling mode is slightly inadequate for offline tasks such as training:
- Fault tolerance is not supported: when some Workers fail because of device problems, the whole task has to stop and start over.
- Training tasks usually run for a long time, occupy a large amount of compute, and lack flexibility: when resources are tight, they cannot release resources to other workloads on demand unless the task is terminated.
- Training tasks take a long time and do not support dynamically changing the number of workers, so preemptible (spot) instances cannot be used safely, and the best cost-performance ratio on the cloud cannot be achieved.
Making training tasks elastic is therefore the key to improving cost-effectiveness. Recently, distributed frameworks such as Horovod have gradually added support for Elastic Training: a training job can dynamically scale its workers out or in while it is running, without interrupting the training task. This requires a small amount of adaptation in the training code; see https://horovod.readthedocs.io/en/stable/elastic_include.html for reference.
1.2 Disadvantages of mpi-operator
In mpi-operator, the Workers participating in training are designed and maintained as static resources. Supporting the elastic training mode makes jobs more flexible, but it also brings challenges to the operations layer, for example:
- The horovodrun provided by horovod must be used as the entry point. The launcher in horovod logs into the workers via ssh, so passwordless ssh between the launcher and the workers needs to be set up.
- Horovod's Elastic Driver module, which is responsible for elasticity, obtains the latest worker topology by invoking the user-specified discover_host script and then starts or stops worker instances accordingly. When the workers change, the return value of the discover_host script must be updated first.
- In scenarios such as preemption or spot pricing, it is sometimes necessary to scale in specific workers; native K8s orchestration primitives such as Deployment and StatefulSet cannot remove a specified instance.
To address these issues, we designed and developed et-operator, which provides a TrainingJob CRD to describe training tasks and ScaleOut / ScaleIn CRDs to describe scale-out and scale-in operations; combined, they make our training jobs much more flexible. The project is open source, and issues, discussion, and feedback are all welcome.
Open source solution address: https://github.com/AliyunContainerService/et-operator
0x02 Overall Architecture
TrainingJob Controller has the following main functions:
- Maintain the creation/deletion lifecycle of TrainingJob as well as subresource management.
- Perform a scaling operation.
- Fault tolerance: when a worker is evicted, create a new worker to rejoin the training.
2.1 Resource Creation
TrainingJob subresources are created in the following order:
- Create an ssh key pair and store it in a Secret.
- Create the workers, including Services and Pods, and mount the Secret holding the public key.
- Create a ConfigMap containing the discover_host script and the hostfile.
- Create the launcher and mount the ConfigMap. Since the hostfile will later be rewritten as the topology changes, it is copied from the ConfigMap into a separate writable directory by an init container.
TrainingJob related resources:
2.2 Roles
The configuration of the TrainingJob CR is divided into a Launcher part and a Worker part. The Launcher specifies the image and the startup command of the job. By default, et-operator generates a hostfile and a discover_host script according to the worker allocation; the discover_host script is mounted into the Launcher at /etc/edl/discover_hosts.sh and is passed to horovodrun in the entry script via the --host-discovery-script parameter. The Worker settings specify the worker image and GPU usage, and maxReplicas / minReplicas define the allowed range for the number of worker replicas.
2.3 Main Procedures
The main program diagrams are as follows:
0x03 Entry
Learning et-operator is really about learning how it scales out and scales in, but to get there we first have to walk through the program logic.
Readers who are not familiar with K8S can also use this as an example of how CRDs are used.
3.1 Creation
The entry point is the main function in main.go. From the entry we can see that:
- A controller-runtime Manager is created;
- Using this Manager, three Reconcilers are built: TrainingJobReconciler, ScaleInReconciler, and ScaleOutReconciler;
- Then the Manager is started.
func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:             scheme,
        MetricsBindAddress: metricsAddr,
        LeaderElection:     enableLeaderElection,
        Port:               9443,
    })

    const jobPollInterval = "5s"

    if err = controllers.NewReconciler(mgr, parseDurationOrPanic(jobPollInterval)).SetupWithManager(mgr); err != nil {
        os.Exit(1)
    }

    if err = controllers.NewScaleOutReconciler(mgr, parseDurationOrPanic(jobPollInterval)).SetupWithManager(mgr); err != nil {
        os.Exit(1)
    }

    if err = controllers.NewScaleInReconciler(mgr, parseDurationOrPanic(jobPollInterval)).SetupWithManager(mgr); err != nil {
        os.Exit(1)
    }

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}
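parseDurationOrPanic is not quoted in this article; judging from its name and call site it just converts the "5s" poll interval string into a time.Duration. A minimal sketch (an assumption, not the actual et-operator code) might look like this:

// parseDurationOrPanic (sketch): turn a string such as "5s" into a
// time.Duration and panic on malformed input so that startup fails fast.
func parseDurationOrPanic(s string) time.Duration {
    d, err := time.ParseDuration(s)
    if err != nil {
        panic(fmt.Sprintf("invalid duration %q: %v", s, err))
    }
    return d
}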
3.2 Settings
The setup here registers, for each controller, which CRs it watches and responds to.
- In addition to TrainingJob, et-operator supports two more CRDs, ScaleOut and ScaleIn, which scale training tasks out and in.
- When a ScaleOut CR is submitted, the ScaleOutController triggers its Reconcile. The work here is simple: based on the Selector field in the ScaleOut CR, find the TrainingJob that the Scaler targets and set it in the CR's OwnerReferences.
- The TrainingJobController watches updates of the ScaleOut CRs owned by a TrainingJob; this triggers the TrainingJob's Reconcile, which traverses and filters the ScaleIn and ScaleOut CRs whose OwnerReference points to the TrainingJob and decides, based on creation time and status, which scaling operation to execute.
- When scaling in, the workers to remove can be specified with the spec.toDelete.count or spec.toDelete.podNames field of the ScaleIn CR. When count is configured, the workers to remove are computed by index, from highest to lowest.
func (r *ScaleInReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&kaiv1alpha1.ScaleIn{}).
        Complete(r)
}

func (r *ScaleOutReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&kaiv1alpha1.ScaleOut{}).
        Complete(r)
}

func (r *TrainingJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&kaiv1alpha1.TrainingJob{}).
        Owns(&kaiv1alpha1.ScaleIn{}).
        Owns(&kaiv1alpha1.ScaleOut{}).
        Owns(&corev1.Pod{}).
        Owns(&corev1.Service{}).
        Owns(&corev1.ConfigMap{}).
        Owns(&corev1.Secret{}).
        // Ignore status-only and metadata-only updates
        //WithEventFilter(predicate.GenerationChangedPredicate{}).
        Complete(r)
}
0x04 TrainingJobReconciler
Follow the code to find the subtleties of its design ideas.
4.1 Reconcile
In a k8s operator, the reconcile method is driven by a continuous watch: whenever a watched resource changes, reconcile is triggered, and in theory it may be executed any number of times, so it must be idempotent.
The Reconcile method is called when a message comes.
func (r *TrainingJobReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    // Fetch latest training job instance.
    sharedTrainingJob := &kaiv1alpha1.TrainingJob{}
    err := r.Get(context.Background(), req.NamespacedName, sharedTrainingJob)
    trainingJob := sharedTrainingJob.DeepCopy()
    // Check reconcile is required.
    // No need to do reconcile or job has been deleted.
    r.Scheme.Default(trainingJob)
    return r.ReconcileJobs(trainingJob)
}
4.2 ReconcileJobs
Since the status in this first message is "", initializeJob is run and then reconcileResource is executed.
func (r *TrainingJobReconciler) ReconcileJobs(job *kaiv1alpha1.TrainingJob) (result reconcile.Result, err error) {
    oldJobStatus := job.Status.DeepCopy()

    defer func() {
        latestJob := &kaiv1alpha1.TrainingJob{}
        err := r.Get(context.Background(), types.NamespacedName{
            Name:      job.Name,
            Namespace: job.Namespace,
        }, latestJob)
        if err == nil {
            if latestJob.ObjectMeta.ResourceVersion != job.ObjectMeta.ResourceVersion {
                latestJob.Status = job.Status
                job = latestJob
            }
        }
        r.updateObjectStatus(job, oldJobStatus)
    }()

    switch job.Status.Phase {
    case commonv1.JobSucceeded, commonv1.JobFailed:
        err = r.cleanup(job)
    case "", commonv1.JobCreated:
        // Initialize if state is empty or JobCreated
        r.initializeJob(job)
        err = r.reconcileResource(job)
    case commonv1.JobRunning:
        err = r.reconcileJobRunning(job)
    case commonv1.Scaling:
        err = r.executeScaling(job)
    }

    if err != nil {
        if IsRequeueError(err) {
            return RequeueAfterInterval(r.PollInterval, nil)
        }
        return RequeueAfterInterval(r.PollInterval, err)
    }

    return NoRequeue()
}
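The requeue helpers used above are thin wrappers around ctrl.Result: RequeueAfterInterval is quoted later in section 5.3, while NoRequeue and RequeueImmediately (used in the ScaleOut controller below) are not shown. A minimal sketch of what they might look like, inferred from their call sites (an assumption, not the actual et-operator code):

// NoRequeue (sketch): return an empty result, the request is not requeued.
func NoRequeue() (ctrl.Result, error) {
    return ctrl.Result{}, nil
}

// RequeueImmediately (sketch): ask controller-runtime to requeue right away.
func RequeueImmediately() (ctrl.Result, error) {
    return ctrl.Result{Requeue: true}, nil
}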
4.3 reconcileResource
reconcileResource simply calls doSteps, which drives a small state machine to continue the initialization.
func (r *TrainingJobReconciler) reconcileResource(job *kaiv1alpha1.TrainingJob) error {
    steps := r.newSteps()
    err := r.doSteps(job, steps)
    return err
}
4.4 doSteps
newSteps builds a simple state machine: a list of initialization steps that are executed in order, while doSteps branches according to the current state.
There are a few points to explain:
- The states after Created are, in order: WorkersCreated --> WorkersReady --> LauncherCreated --> JobRunning.
- Each of these is a post-condition: the state that should be reached after the corresponding action has completed.
- In the for loop, states that the Job has already reached are skipped until an unfinished state is found, and the corresponding action is executed. So in theory the job walks from WorkersCreated all the way to JobRunning.
- Inside the action belonging to a state, the Job is set to that completion state once the action finishes.
The code is as follows:
func (r *TrainingJobReconciler) newSteps() []Step {
    return []Step{
        Step{
            JobCondition: commonv1.WorkersCreated,
            Action:       r.createTrainingJobWorkers,
        },
        Step{
            JobCondition: commonv1.WorkersReady,
            Action:       r.waitWorkersRunning,
        },
        Step{
            JobCondition: commonv1.LauncherCreated,
            Action:       r.createLauncher,
        },
        Step{
            JobCondition: commonv1.JobRunning,
            Action:       r.syncLauncherState,
        },
    }
}

func (r *TrainingJobReconciler) doSteps(job *kaiv1alpha1.TrainingJob, steps []Step) error {
    for _, step := range steps {
        if hasCondition(*job.GetJobStatus(), step.JobCondition) {
            continue
        }
        err := step.Action(job)
        break
    }
    return nil
}
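hasCondition is not quoted in this article; doSteps only needs it to tell whether a given condition has already been recorded on the job status. A minimal sketch, assuming the kubeflow-style common JobCondition type (the real helper may differ):

// hasCondition (sketch): report whether the job status already carries the
// given condition type with Status == ConditionTrue.
func hasCondition(status commonv1.JobStatus, condType commonv1.JobConditionType) bool {
    for _, cond := range status.Conditions {
        if cond.Type == condType && cond.Status == corev1.ConditionTrue {
            return true
        }
    }
    return false
}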
The concrete flow so far is as follows:
K8S Request("") --> Reconcile --> ReconcileJobs
    ReconcileJobs branches on phase: "" / JobCreated, JobRunning, Scaling
    "" / JobCreated --> reconcileResource --> doSteps:
        WorkersCreated  --> createTrainingJobWorkers
        WorkersReady    --> waitWorkersRunning
        LauncherCreated --> createLauncher
        JobRunning      --> syncLauncherState
4.5 createTrainingJobWorkers
doSteps starts with the createTrainingJobWorkers action, which sets the Job status to WorkersCreated.
func (r *TrainingJobReconciler) createTrainingJobWorkers(job *kaiv1alpha1.TrainingJob) error {
    if job.GetAttachMode() == kaiv1alpha1.AttachModeSSH {
        if cm, err := r.GetOrCreateSecret(job); cm == nil || err != nil {
            updateStatus(job.GetJobStatus(), common.JobFailed, trainingJobFailedReason, msg)
            return nil
        }
    }

    workers := getJobReplicasWorkers(job)
    job.Status.TargetWorkers = workers

    // Create worker
    if err := r.CreateWorkers(job, workers); err != nil {
        updateStatus(job.GetJobStatus(), common.JobFailed, trainingJobFailedReason, msg)
        return nil
    }

    // Set new state
    updateJobConditions(job.GetJobStatus(), common.WorkersCreated, "", msg)
    return nil
}
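getJobReplicasWorkers is not quoted here. Based on the "<job>-worker-<index>" naming visible in the ScaleIn example later, a rough sketch of how the target worker list might be generated (the Replicas field path is an assumption) is:

// getJobReplicasWorkersSketch (hypothetical): derive the target worker pod
// names from the configured replica count, e.g. "elastic-training-worker-0"
// through "elastic-training-worker-N-1" for N replicas.
func getJobReplicasWorkersSketch(job *kaiv1alpha1.TrainingJob) []string {
    replicas := int(*job.Spec.ETReplicaSpecs.Worker.Replicas) // assumed field
    workers := make([]string, 0, replicas)
    for i := 0; i < replicas; i++ {
        workers = append(workers, fmt.Sprintf("%s-worker-%d", job.Name, i))
    }
    return workers
}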
4.5.1 CreateWorkers
CreateWorkers creates the workers which, as described earlier in this article, consist of Services and Pods, so the creation process is as follows:
- Call the lower-case function of the same name, createWorkers, which in turn creates the worker Service.
- Call newWorker to create the Pod.
func (r *TrainingJobReconciler) CreateWorkers(job *kaiv1alpha1.TrainingJob, workers []string) error {
    return r.createWorkers(job, workers, func(name string, index string) *corev1.Pod {
        worker := newWorker(job, name, index)
        return worker
    })
}
4.5.1.1 createWorkers
createWorker is called iteratively to generate a series of workers based on the configuration.
func (r *TrainingJobReconciler) createWorkers(job *kaiv1alpha1.TrainingJob, workers []string, newPod PodTplGenerator) error {
    // Traverse, create
    for _, podName := range workers {
        index, err := getWorkerIndex(job.Name, podName)
        if err != nil {
            return err
        }
        _, err = r.createWorker(job, int32(index), newPod)
        if err != nil {
            return err
        }
    }
    return nil
}
4.5.1.2 createWorker
Here the worker Pod is looked up by name; if it does not exist, the Pod is created, and the corresponding Service is created in the same way.
func (r *TrainingJobReconciler) createWorker(job *kaiv1alpha1.TrainingJob, index int32, workerPodTempl PodTplGenerator) (*corev1.Pod, error) {
    name := getWorkerName(job.Name, int(index))
    indexStr := strconv.Itoa(int(index))
    pod := &corev1.Pod{}
    nsn := types.NamespacedName{
        Name:      name,
        Namespace: job.Namespace,
    }
    err := r.Get(context.Background(), nsn, pod)
    if err != nil {
        // If the worker Pod doesn't exist, we'll create it.
        if errors.IsNotFound(err) {
            // If you don't have a pod, you can also create a pod here
            worker := workerPodTempl(name, indexStr)
            if job.GetAttachMode() == kaiv1alpha1.AttachModeSSH {
                util.MountRsaKey(worker, job.Name)
            }
            if err = r.Create(context.Background(), worker); err != nil {
                return nil, err
            }
        }
    }

    service := &corev1.Service{}
    err = r.Get(context.Background(), nsn, service)
    if errors.IsNotFound(err) {
        // Call newService for specific creation
        err = r.Create(context.Background(), newService(job, name, indexStr))
    }

    return nil, nil
}
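getWorkerName and getWorkerIndex, used above and in createWorkers, are simple name/index converters. A hedged sketch consistent with the "<job>-worker-<index>" naming seen in the ScaleIn example (the real implementations may differ slightly):

// Sketches of the name/index helpers.
func getWorkerName(jobName string, index int) string {
    return fmt.Sprintf("%s-worker-%d", jobName, index)
}

// getWorkerIndex is the reverse mapping: strip the "<job>-worker-" prefix
// and parse the remaining index. Error handling is simplified in this sketch.
func getWorkerIndex(jobName string, podName string) (int, error) {
    return strconv.Atoi(strings.TrimPrefix(podName, jobName+"-worker-"))
}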
4.5.1.3 newService
After a long chain of calls we finally arrive here to create the Service.
func newService(obj interface{}, name string, index string) *corev1.Service {
    job, _ := obj.(*kaiv1alpha1.TrainingJob)
    labels := GenLabels(job.Name)
    labels[labelTrainingRoleType] = worker
    labels[replicaIndexLabel] = index

    // Specific creation
    return &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      name,
            Namespace: job.Namespace,
            Labels:    labels,
            OwnerReferences: []metav1.OwnerReference{
                *metav1.NewControllerRef(job, kaiv1alpha1.SchemeGroupVersionKind),
            },
        },
        Spec: corev1.ServiceSpec{
            ClusterIP: "None",
            Selector:  labels,
            Ports: []corev1.ServicePort{
                {
                    Name: "ssh-port",
                    Port: 22,
                },
            },
        },
    }
}
4.5.2 newWorker
newWorker builds the Pod; this is a fairly standard routine.
func newWorker(obj interface{}, name string, index string) *corev1.Pod {
    job, _ := obj.(*kaiv1alpha1.TrainingJob)
    labels := GenLabels(job.Name)
    labels[labelTrainingRoleType] = worker
    labels[replicaIndexLabel] = index
    podSpec := job.Spec.ETReplicaSpecs.Worker.Template.DeepCopy()

    // keep the labels which are set in PodTemplate
    if len(podSpec.Labels) == 0 {
        podSpec.Labels = make(map[string]string)
    }
    for key, value := range labels {
        podSpec.Labels[key] = value
    }

    // RestartPolicy=Never
    setRestartPolicy(podSpec)

    container := podSpec.Spec.Containers[0]

    // if we want to use ssh, will start sshd service firstly.
    if len(container.Command) == 0 {
        if job.GetAttachMode() == kaiv1alpha1.AttachModeSSH {
            container.Command = []string{"sh", "-c", "/usr/sbin/sshd && sleep 365d"}
        } else {
            container.Command = []string{"sh", "-c", "sleep 365d"}
        }
    }
    podSpec.Spec.Containers[0] = container

    // Created pod
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:        name,
            Namespace:   job.Namespace,
            Labels:      podSpec.Labels,
            Annotations: podSpec.Annotations,
            OwnerReferences: []metav1.OwnerReference{
                *metav1.NewControllerRef(job, kaiv1alpha1.SchemeGroupVersionKind),
            },
        },
        Spec: podSpec.Spec,
    }
}
The logic is as follows:
K8S Request("") --> Reconcile --> ReconcileJobs
    ReconcileJobs branches on phase: "" / JobCreated, JobRunning, Scaling
    "" / JobCreated --> reconcileResource --> doSteps:
        WorkersCreated  --> createTrainingJobWorkers --> CreateWorkers --> createWorkers --> createWorker
                                                          --> newWorker / newService
                            ==> state WorkersCreated
        WorkersReady    --> waitWorkersRunning
        LauncherCreated --> createLauncher
        JobRunning      --> syncLauncherState
4.8 createLauncher
Once the workers are set up, the Launcher needs to be created, so we continue with createLauncher.
func (r *TrainingJobReconciler) createLauncher(job *kaiv1alpha1.TrainingJob) error {
    if _, err := r.GetOrCreateLauncherServiceAccount(job); err != nil {
        updateStatus(job.GetJobStatus(), commonv1.JobFailed, trainingJobFailedReason, msg)
        return nil
    }
    if _, err := r.GetOrCreateLauncherRole(job, 0); err != nil {
        updateStatus(job.GetJobStatus(), commonv1.JobFailed, trainingJobFailedReason, msg)
        return nil
    }
    if _, err := r.GetLauncherRoleBinding(job); err != nil {
        updateStatus(job.GetJobStatus(), commonv1.JobFailed, trainingJobFailedReason, msg)
        return nil
    }
    if cm, err := r.CreateHostConfigMap(job); cm == nil || err != nil {
        updateStatus(job.GetJobStatus(), commonv1.JobFailed, trainingJobFailedReason, msg)
        return nil
    }

    launcher, err := r.GetLauncherJob(job)
    if launcher == nil {
        if _, err := r.CreateLauncher(job); err != nil {
            updateStatus(job.GetJobStatus(), commonv1.JobFailed, trainingJobFailedReason, msg)
            return nil
        }
    }

    updateJobConditions(job.GetJobStatus(), commonv1.LauncherCreated, "", msg)
    return nil
}
Let's look at two key steps.
4.8.1 CreateHostConfigMap
This creates the host-related configuration (the hostfile and the discover_host script) as a ConfigMap.
func (r *TrainingJobReconciler) CreateHostConfigMap(job *kaiv1alpha1.TrainingJob) (*corev1.ConfigMap, error) {
    return r.createConfigMap(job, newHostfileConfigMap)
}

func (r *TrainingJobReconciler) createConfigMap(job *kaiv1alpha1.TrainingJob, newCm func(job *kaiv1alpha1.TrainingJob) *corev1.ConfigMap) (*corev1.ConfigMap, error) {
    cm := &corev1.ConfigMap{}
    name := ctrl.Request{}
    name.NamespacedName.Namespace = job.GetNamespace()
    name.NamespacedName.Name = job.GetName() + configSuffix
    err := r.Get(context.Background(), name.NamespacedName, cm)
    if errors.IsNotFound(err) {
        if err = r.Create(context.Background(), newCm(job)); err != nil {
            return cm, err
        }
    }
    return cm, nil
}
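newHostfileConfigMap itself is not quoted in this article. Based on the keys consumed by newLauncher below (hostfileName, discoverHostName, kubexeclFileName), the worker list in job.Status.TargetWorkers, and the helpers getHostfileContent, getHostfilePath and getSlots that appear in section 5.6.3, a rough sketch of what it might generate is (an assumption, not the actual implementation; the kubexec script key is omitted here):

// newHostfileConfigMapSketch (hypothetical): a hostfile with one
// "<worker>:<slots>" line per worker, plus a discover_hosts.sh script that
// simply prints that hostfile for horovodrun --host-discovery-script.
func newHostfileConfigMapSketch(job *kaiv1alpha1.TrainingJob) *corev1.ConfigMap {
    return &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      job.Name + configSuffix,
            Namespace: job.Namespace,
        },
        Data: map[string]string{
            hostfileName:     getHostfileContent(job.Status.TargetWorkers, getSlots(job)),
            discoverHostName: "#!/bin/sh\ncat " + getHostfilePath(job) + "\n",
        },
    }
}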
4.8.2 Create pod
4.8.2.1 CreateLauncher
The launcher Pod is created here.
func (r *TrainingJobReconciler) CreateLauncher(obj interface{}) (*corev1.Pod, error) {
    job, ok := obj.(*kaiv1alpha1.TrainingJob)
    launcher := newLauncher(job) // Create pod
    if job.GetAttachMode() == kaiv1alpha1.AttachModeSSH {
        util.MountRsaKey(launcher, job.Name)
    }
    err := r.Create(context.Background(), launcher)
    return launcher, nil
}
4.8.2.2 newLauncher
Here's how to build a Pod.
func newLauncher(obj interface{}) *corev1.Pod {
    job, _ := obj.(*kaiv1alpha1.TrainingJob)
    launcherName := job.Name + launcherSuffix
    labels := GenLabels(job.Name)
    labels[labelTrainingRoleType] = launcher
    podSpec := job.Spec.ETReplicaSpecs.Launcher.Template.DeepCopy()

    // copy the labels and annotations to pod from PodTemplate
    if len(podSpec.Labels) == 0 {
        podSpec.Labels = make(map[string]string)
    }
    for key, value := range labels {
        podSpec.Labels[key] = value
    }

    podSpec.Spec.InitContainers = append(podSpec.Spec.InitContainers, initContainer(job))
    //podSpec.Spec.InitContainers = append(podSpec.Spec.InitContainers, kubedeliveryContainer())

    container := podSpec.Spec.Containers[0]
    container.VolumeMounts = append(container.VolumeMounts,
        corev1.VolumeMount{
            Name:      hostfileVolumeName,
            MountPath: hostfileMountPath,
        },
        corev1.VolumeMount{
            Name:      configVolumeName,
            MountPath: configMountPath,
        },
        corev1.VolumeMount{
            Name:      kubectlVolumeName,
            MountPath: kubectlMountPath,
        })

    if job.GetAttachMode() == kaiv1alpha1.AttachModeKubexec {
        container.Env = append(container.Env, corev1.EnvVar{
            Name:  "OMPI_MCA_plm_rsh_agent",
            Value: getKubexecPath(),
        })
    }

    podSpec.Spec.Containers[0] = container
    podSpec.Spec.ServiceAccountName = launcherName
    setRestartPolicy(podSpec)

    hostfileMode := int32(0444)
    scriptMode := int32(0555)
    podSpec.Spec.Volumes = append(podSpec.Spec.Volumes,
        corev1.Volume{
            Name: hostfileVolumeName,
            VolumeSource: corev1.VolumeSource{
                EmptyDir: &corev1.EmptyDirVolumeSource{},
            },
        },
        corev1.Volume{
            Name: kubectlVolumeName,
            VolumeSource: corev1.VolumeSource{
                EmptyDir: &corev1.EmptyDirVolumeSource{},
            },
        },
        corev1.Volume{
            Name: configVolumeName,
            VolumeSource: corev1.VolumeSource{
                ConfigMap: &corev1.ConfigMapVolumeSource{
                    LocalObjectReference: corev1.LocalObjectReference{
                        Name: job.Name + configSuffix,
                    },
                    Items: []corev1.KeyToPath{
                        {
                            Key:  hostfileName,
                            Path: hostfileName,
                            Mode: &hostfileMode,
                        },
                        {
                            Key:  discoverHostName,
                            Path: discoverHostName,
                            Mode: &hostfileMode,
                        },
                        {
                            Key:  kubexeclFileName,
                            Path: kubexeclFileName,
                            Mode: &scriptMode,
                        },
                    },
                },
            },
        })

    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:        launcherName,
            Namespace:   job.Namespace,
            Labels:      podSpec.Labels,
            Annotations: podSpec.Annotations,
            OwnerReferences: []metav1.OwnerReference{
                *metav1.NewControllerRef(job, kaiv1alpha1.SchemeGroupVersionKind),
            },
        },
        Spec: podSpec.Spec,
    }
}
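newLauncher appends initContainer(job), which is not listed in this article. As described in section 2.1, its job is to copy the hostfile from the read-only ConfigMap mount into a writable emptyDir so that later scale operations can rewrite it. A rough sketch under those assumptions (image, container name, and the exact paths are hypothetical):

// initContainerSketch (hypothetical): copy the hostfile from the ConfigMap
// volume (configMountPath, read-only) into the writable emptyDir volume
// (hostfileMountPath) so executeScaleScript can overwrite it later.
func initContainerSketch(job *kaiv1alpha1.TrainingJob) corev1.Container {
    return corev1.Container{
        Name:    "init-hostfile",
        Image:   "busybox",
        Command: []string{"sh", "-c", fmt.Sprintf("cp %s/%s %s", configMountPath, hostfileName, hostfileMountPath)},
        VolumeMounts: []corev1.VolumeMount{
            {Name: configVolumeName, MountPath: configMountPath},
            {Name: hostfileVolumeName, MountPath: hostfileMountPath},
        },
    }
}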
At this point a new training job is up and running; the logic extends as follows:
K8S Request("") --> Reconcile --> ReconcileJobs
    ReconcileJobs branches on phase: "" / JobCreated, JobRunning, Scaling
    "" / JobCreated --> reconcileResource --> doSteps:
        WorkersCreated  --> createTrainingJobWorkers --> CreateWorkers --> createWorkers --> createWorker
                                                          --> newWorker / newService
                            ==> state WorkersCreated
        WorkersReady    --> waitWorkersRunning
        LauncherCreated --> createLauncher --> CreateHostConfigMap --> CreateLauncher --> newLauncher
        JobRunning      --> syncLauncherState
Having finished creating a new job, let's now look at the key technical points of this article: ScaleOut and ScaleIn.
0x05 ScaleOut
5.1 Ideas
A ScaleOut task is described by its own CR.
When a ScaleOut CR is issued, the ScaleOutController triggers its Reconcile. The work here is simple: based on the Selector field in the ScaleOut CR, find the TrainingJob that the Scaler targets and set it in the CR's OwnerReferences.
Take the following ScaleOut operation as an example:
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  creationTimestamp: "2020-11-04T13:54:26Z"
  name: scaleout-ptfnk
  namespace: default
  ownerReferences:
  - apiVersion: kai.alibabacloud.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: TrainingJob
    name: elastic-training          # points to the TrainingJob being scaled
    uid: 075b9c4a-22f9-40ce-83c7-656b329a2b9e
spec:
  selector:
    name: elastic-training
  toAdd:
    count: 2
5.2 Reconcile
A ScaleOut CR is sent and the ScaleOutController triggers the Reconcile. The main thing is to call setScalingOwner.
func (r *ScaleOutReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    scaleOut, err := getScaleOut(req.NamespacedName, r.Client)
    if err != nil {
        // Error reading the object - requeue the request.
        return RequeueImmediately()
    }
    if scaleOut == nil || scaleOut.DeletionTimestamp != nil {
        return NoRequeue()
    }
    if isScaleFinished(*scaleOut.GetJobStatus()) {
        return NoRequeue()
    }
    return setScalingOwner(r, scaleOut, r.PollInterval)
}
5.3 setScalingOwner
setScalingOwner is one of the key functions.
What it mainly does is handle the case where the ScaleOut CR does not yet have OwnerReferences set.
The logic is: based on the Selector field in the ScaleOut CR, find the TrainingJob that the Scaler targets and set it in the CR's OwnerReferences.
func setScalingOwner(r client.Client, scaler Scaler, pollInterval time.Duration) (ctrl.Result, error) {
    ownerRefs := scaler.GetOwnerReferences()
    if len(ownerRefs) == 0 {
        trainingJob := &kaiv1alpha1.TrainingJob{}
        nsn := types.NamespacedName{}
        nsn.Namespace = scaler.GetNamespace()
        nsn.Name = scaler.GetSelector().Name
        err := r.Get(context.Background(), nsn, trainingJob)

        gvk := kaiv1alpha1.SchemeGroupVersionKind
        ownerRefs = append(ownerRefs, *metav1.NewControllerRef(trainingJob,
            schema.GroupVersionKind{Group: gvk.Group, Version: gvk.Version, Kind: gvk.Kind}))
        scaler.SetOwnerReferences(ownerRefs)

        initializeJobStatus(scaler.GetJobStatus())
        updateJobConditions(scaler.GetJobStatus(), v1.JobCreated, "", msg)
        err = r.Status().Update(context.Background(), scaler)
        err = r.Update(context.Background(), scaler)
    }
    return NoRequeue()
}

// RequeueAfterInterval requeues after a duration when duration > 0 is specified.
func RequeueAfterInterval(interval time.Duration, err error) (ctrl.Result, error) {
    return ctrl.Result{RequeueAfter: interval}, err
}
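setScalingOwner takes a Scaler, the common abstraction over ScaleOut and ScaleIn. Its definition is not quoted in this article; a sketch inferred from the methods called on it here and in the scaling code below (so the real definition may well differ, and the Selector/ScriptSpec type names are assumptions) looks roughly like:

// Scaler (sketch): both ScaleOut and ScaleIn satisfy it, which lets
// setScalingOwner, executeScaleScript, etc. treat them uniformly.
type Scaler interface {
    metav1.Object  // GetNamespace, Get/SetOwnerReferences, GetCreationTimestamp, ...
    runtime.Object // so the controller-runtime client can Update it

    GetSelector() *kaiv1alpha1.Selector     // which TrainingJob this scaler targets
    GetJobStatus() *commonv1.JobStatus      // status block shared with TrainingJob
    GetFullName() string                    // identifier stored in job.Status.CurrentScaler
    GetPodNames() []string                  // pods to add (ScaleOut) or delete (ScaleIn)
    GetScriptSpec() kaiv1alpha1.ScriptSpec  // optional user-provided scale script
}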
5.4 TrainingJobController
The TrainingJobController watches updates of the ScaleOut CRs owned by a TrainingJob; this triggers the TrainingJob's Reconcile, which traverses and filters the ScaleIn and ScaleOut CRs whose OwnerReference points to the TrainingJob and decides, based on creation time and status, which scaling operation to execute.
5.4.1 Reconcile
func (r *TrainingJobReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    rlog := r.Log.WithValues("trainingjob", req.NamespacedName)
    // Fetch latest training job instance.
    sharedTrainingJob := &kaiv1alpha1.TrainingJob{}
    err := r.Get(context.Background(), req.NamespacedName, sharedTrainingJob)
    trainingJob := sharedTrainingJob.DeepCopy()
    // Check reconcile is required.
    // No need to do reconcile or job has been deleted.
    r.Scheme.Default(trainingJob)
    return r.ReconcileJobs(trainingJob)
}
5.4.2 ReconcileJobs
func (r *TrainingJobReconciler) ReconcileJobs(job *kaiv1alpha1.TrainingJob) (result reconcile.Result, err error) {
    oldJobStatus := job.Status.DeepCopy()
    logger.Infof("jobName: %v, phase %s", job.Name, job.Status.Phase)

    defer func() {
        latestJob := &kaiv1alpha1.TrainingJob{}
        err := r.Get(context.Background(), types.NamespacedName{
            Name:      job.Name,
            Namespace: job.Namespace,
        }, latestJob)
        if err == nil {
            if latestJob.ObjectMeta.ResourceVersion != job.ObjectMeta.ResourceVersion {
                latestJob.Status = job.Status
                job = latestJob
            }
        }
        r.updateObjectStatus(job, oldJobStatus)
    }()

    switch job.Status.Phase {
    case commonv1.JobSucceeded, commonv1.JobFailed:
        err = r.cleanup(job)
    case "", commonv1.JobCreated:
        r.initializeJob(job)
        err = r.reconcileResource(job)
    case commonv1.JobRunning:
        err = r.reconcileJobRunning(job)
    case commonv1.Scaling:
        err = r.executeScaling(job)
    default:
        logger.Warnf("job %s unknown status %s", job.Name, job.Status.Phase)
    }

    if err != nil {
        if IsRequeueError(err) {
            return RequeueAfterInterval(r.PollInterval, nil)
        }
        return RequeueAfterInterval(r.PollInterval, err)
    }

    return NoRequeue()
}
Depending on the current job status there are two relevant branches here: JobRunning and Scaling.
Let's analyze them one by one.
5.5 JobRunning
The first step is the JobRunning state, so let's see what is done there.
5.5.1 reconcileJobRunning
func (r *TrainingJobReconciler) reconcileJobRunning(job *kaiv1alpha1.TrainingJob) error {
    if err := r.syncLauncherState(job); err != nil {
        return err
    }
    if err := r.syncWorkersState(job); err != nil {
        return err
    }

    if job.Status.Phase == commonv1.JobRunning {
        // Now that we are in the JobRunning state, we can start setting up scalers
        return r.setTrainingJobScaler(job)
    }

    return nil
}
5.5.2 setTrainingJobScaler
First the available scalers are collected via availableScaleOutList and availableScaleInList, then the latest one is selected and applied.
func (r *TrainingJobReconciler) setTrainingJobScaler(job *kaiv1alpha1.TrainingJob) error {
    scaleOut, err := r.availableScaleOutList(job) // Find scaleout list
    scaleIn, err := r.availableScaleInList(job)   // Find scaleIn list
    scalerList := append(scaleOut, scaleIn...)    // merge

    // Select the latest scaling job
    r.updateLatestScaler(job, scalerList) // Start setting
    return nil
}
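availableScaleOutList and availableScaleInList are not quoted; conceptually they list the ScaleOut / ScaleIn CRs owned by this TrainingJob and drop the ones that have already finished. A hedged sketch of the ScaleOut side (the ScaleOutList type name and exact filtering are assumptions):

// availableScaleOutListSketch (hypothetical): list ScaleOut CRs in the job's
// namespace, keep those owned by this TrainingJob and not yet finished.
func (r *TrainingJobReconciler) availableScaleOutListSketch(job *kaiv1alpha1.TrainingJob) ([]Scaler, error) {
    list := &kaiv1alpha1.ScaleOutList{}
    if err := r.List(context.Background(), list, client.InNamespace(job.Namespace)); err != nil {
        return nil, err
    }
    result := []Scaler{}
    for i := range list.Items {
        item := &list.Items[i]
        if metav1.IsControlledBy(item, job) && !isScaleFinished(*item.GetJobStatus()) {
            result = append(result, item)
        }
    }
    return result, nil
}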
5.5.3 updateLatestScaler
Find the latest Scaler based on the creation time.
func (r *TrainingJobReconciler) updateLatestScaler(job *kaiv1alpha1.TrainingJob, scalers []Scaler) error {
    var latestScaler Scaler
    if len(scalers) == 0 {
        return nil
    }
    for i, _ := range scalers {
        scalerItem := scalers[i]
        // Find the latest Scaler based on creation time
        if latestScaler == nil || latestScaler.GetCreationTimestamp().Time.Before(scalerItem.GetCreationTimestamp().Time) {
            latestScaler = scalerItem
        }
    }
    return r.updateCurrentScaler(job, latestScaler)
}
5.5.4 updateCurrentScaler
Set the scaler found.
func (r *TrainingJobReconciler) updateCurrentScaler(job *kaiv1alpha1.TrainingJob, scaleItem Scaler) error {
    job.Status.CurrentScaler = scaleItem.GetFullName()
    msg := fmt.Sprintf("trainingJobob(%s/%s) execute %s", job.Namespace, job.Name, scaleItem.GetFullName())

    // Set state
    r.updateScalerState(scaleItem, job, newCondition(common.Scaling, scalingStartReason, msg))

    if err := r.updateObjectStatus(scaleItem, nil); err != nil {
        return err
    }
    return nil
}
5.5.5 updateScalerState
This sets the phase to common.Scaling, so the next reconcile run will take the Scaling branch.
func (r *TrainingJobReconciler) updateScalerState(scaleObj Scaler, trainingJob *kaiv1alpha1.TrainingJob, condition common.JobCondition) error {
    // Set common.Scaling, so the next reconcile will take the Scaling branch
    jobPhase := common.Scaling
    currentJob := scaleObj.GetFullName()
    if condition.Type == common.ScaleSucceeded || condition.Type == common.ScaleFailed {
        jobPhase = common.JobRunning
        currentJob = ""
    }

    setCondition(trainingJob.GetJobStatus(), condition)
    updateStatusPhase(trainingJob.GetJobStatus(), jobPhase)
    updateTrainingJobCurrentScaler(trainingJob.GetJobStatus(), currentJob)
    setCondition(scaleObj.GetJobStatus(), condition)
    updateStatusPhase(scaleObj.GetJobStatus(), condition.Type)
    return nil
}
The logic is as follows:
1 K8S Request("")    --> Reconcile --> ReconcileJobs
2 K8S ScaleOut CR    --> Reconcile --> ReconcileJobs
    ReconcileJobs branches on phase:
    (1) "" / JobCreated --> reconcileResource --> doSteps (WorkersCreated / WorkersReady / LauncherCreated / JobRunning)
    (2) JobRunning      --> reconcileJobRunning --> setTrainingJobScaler --> updateScalerState --> phase = common.Scaling
    (3) Scaling         --> handled in the next reconcile (see below)
5.6 Scaling
5.6.1 executeScaling
The handling differs depending on the scaler type (ScaleIn or ScaleOut).
func (r *TrainingJobReconciler) executeScaling(job *kaiv1alpha1.TrainingJob) error {
    if err := r.syncLauncherState(job); err != nil {
        return err
    }
    if job.Status.CurrentScaler == "" {
        updateStatusPhase(job.GetJobStatus(), common.JobRunning)
        return nil
    }
    if isFinished(*job.GetJobStatus()) {
        return nil
    }

    scalerType, scalerName := getScalerName(job.Status.CurrentScaler)

    // Processing differs depending on in or out
    if scalerType == "ScaleIn" {
        scaleIn, err := getScaleIn(scalerName, r)
        if scaleIn == nil || isScaleFinished(*scaleIn.GetJobStatus()) {
            finishTrainingScaler(job.GetJobStatus())
            return nil
        }
        oldStatus := scaleIn.Status.DeepCopy()
        defer r.updateObjectStatus(scaleIn, oldStatus)

        // Perform the specific scale-in operation
        if err = r.executeScaleIn(job, scaleIn); err != nil {
            return err
        }
    } else if scalerType == "ScaleOut" {
        scaleOut, err := getScaleOut(scalerName, r)
        if scaleOut == nil || isScaleFinished(*scaleOut.GetJobStatus()) {
            finishTrainingScaler(job.GetJobStatus())
            return nil
        }
        oldStatus := scaleOut.Status.DeepCopy()
        defer r.updateObjectStatus(scaleOut, oldStatus)

        // Perform the specific scale-out operation
        if err = r.executeScaleOut(job, scaleOut); err != nil {
        }
    }
    return nil
}
5.6.2 executeScaleOut
Scale out.
- Use setScaleOutWorkers to record the new pods in scaleOut.Status.AddPods.
- Use workersAfterScaler to get the final worker list.
- Use executeScaleScript to execute the scaling script.
func (r *TrainingJobReconciler) executeScaleOut(job *kaiv1alpha1.TrainingJob, scaleOut *kaiv1alpha1.ScaleOut) error {
    initializeJobStatus(scaleOut.GetJobStatus())

    if err := r.validateScaleOut(scaleOut); err != nil {
        r.updateScalerFailed(scaleOut, job, err.Error())
        return err
    }

    if err := r.setScaleOutWorkers(job, scaleOut); err != nil {
        return err
    }

    err := r.ScaleOutWorkers(job, scaleOut)
    if err != nil {
        msg := fmt.Sprintf("%s create scaleout workers failed, error: %v", scaleOut.GetFullName(), err)
        r.ScaleOutFailed(job, scaleOut, msg)
        return err
    }

    scaleOutWorkers, err := r.getScalerOutWorkers(job, scaleOut)
    workerStatuses, _ := r.workerReplicasStatus(scaleOut.GetJobStatus(), scaleOutWorkers)
    if workerStatuses.Active < *scaleOut.Spec.ToAdd.Count {
        if IsScaleOutTimeout(scaleOut) {
            msg := fmt.Sprintf("scaleout job %s execution timeout", scaleOut.GetFullName())
            r.ScaleOutFailed(job, scaleOut, msg)
        }
        return NewRequeueError(fmt.Errorf("wait for workers running"))
    }

    hostWorkers := r.workersAfterScaler(job.Status.CurrentWorkers, scaleOut)

    // Execute scale script
    if err := r.executeScaleScript(job, scaleOut, hostWorkers); err != nil {
        msg := fmt.Sprintf("%s execute script failed, error: %v", scaleOut.GetFullName(), err)
        r.ScaleOutFailed(job, scaleOut, msg)
        return err
    } else {
        job.Status.TargetWorkers = r.workersAfterScaler(job.Status.TargetWorkers, scaleOut)
        r.updateScalerSuccessd(scaleOut, job)
    }

    return nil
}
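workersAfterScaler computes the worker list after applying a scaler: for a ScaleOut it appends the new pod names to the current list, for a ScaleIn it removes the pods to be deleted. The function is not quoted here; a hedged sketch of the idea (the real code distinguishes ScaleOut/ScaleIn via the Scaler's concrete type rather than a flag):

// workersAfterScalerSketch (hypothetical): derive the post-scaling worker list.
// ScaleOut -> current workers plus the scaler's pods;
// ScaleIn  -> current workers minus the scaler's pods.
func workersAfterScalerSketch(current []string, scaler Scaler, isScaleIn bool) []string {
    if !isScaleIn {
        return append(append([]string{}, current...), scaler.GetPodNames()...)
    }
    toDelete := map[string]bool{}
    for _, name := range scaler.GetPodNames() {
        toDelete[name] = true
    }
    result := []string{}
    for _, name := range current {
        if !toDelete[name] {
            result = append(result, name)
        }
    }
    return result
}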
5.6.3 executeScaleScript
At this point hostfileUpdateScript is called to build the script that rewrites the hostfile, and executeOnLauncher is finally called to run that script on the launcher.
func (r *TrainingJobReconciler) executeScaleScript(trainingJob *kaiv1alpha1.TrainingJob, scaler Scaler, workers []string) error {
    if isScriptExecuted(*scaler.GetJobStatus()) {
        return nil
    }

    msg := fmt.Sprintf("trainingjob(%s/%s): execute script on launcher for %s", trainingJob.Namespace, trainingJob.Name, scaler.GetFullName())

    slots := getSlots(trainingJob)
    scriptSpec := scaler.GetScriptSpec()

    // Get the script
    var script string
    if scriptSpec.Script != "" {
        script = scalerScript(scriptSpec.GetTimeout(), scriptSpec.Env, scriptSpec.Script, scaler.GetPodNames(), slots)
    } else {
        hostfilePath := getHostfilePath(trainingJob)
        script = hostfileUpdateScript(hostfilePath, workers, slots)
    }

    // Execute script
    _, _, err := r.executeOnLauncher(trainingJob, script)

    updateJobConditions(scaler.GetJobStatus(), common.ScriptExecuted, "", msg)
    return nil
}
5.6.3.1 hostfileUpdateScript
Get the final script string.
func hostfileUpdateScript(hostfile string, workers []string, slot int) string {
    return fmt.Sprintf(
        `echo '%s' > %s`, getHostfileContent(workers, slot), hostfile)
}
5.6.3.2 getHostfileContent
Get the hostfile content. For example, with workers ["elastic-training-worker-0", "elastic-training-worker-1"] and slot = 1, the content is "elastic-training-worker-0:1\nelastic-training-worker-1:1\n".
func getHostfileContent(workers []string, slot int) string {
    var buffer bytes.Buffer
    for _, worker := range workers {
        buffer.WriteString(fmt.Sprintf("%s:%d\n", worker, slot))
    }
    return buffer.String()
}
5.6.3.3 executeOnLauncher
Execute the script on the launcher pod.
func (r *TrainingJobReconciler) executeOnLauncher(trainingJob *kaiv1alpha1.TrainingJob, script string) (string, string, error) {
    var err error
    var launcherPod *corev1.Pod
    if launcherPod, err = r.GetLauncherJob(trainingJob); err != nil {
    }

    if launcherPod != nil {
        stdOut, stdErr, err := kubectlOnPod(launcherPod, script)
        return stdOut, stdErr, nil
    }

    return "", "", nil
}
5.6.3.4 kubectlOnPod
Run the command inside the launcher pod's container.
func kubectlOnPod(pod *corev1.Pod, cmd string) (string, string, error) {
    cmds := []string{
        "/bin/sh",
        "-c",
        cmd,
    }
    stdout, stderr, err := util.ExecCommandInContainerWithFullOutput(pod.Name, pod.Spec.Containers[0].Name, pod.Namespace, cmds)
    if err != nil {
        return stdout, stderr, err
    }
    return stdout, stderr, nil
}
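util.ExecCommandInContainerWithFullOutput is not quoted in this article; it is the usual client-go "exec into a pod" pattern. A hedged sketch of how such a helper is typically written with k8s.io/client-go/tools/remotecommand and kubernetes/scheme (assumed parameters; not the et-operator code, which may obtain its clientset and rest.Config differently):

// execInContainerSketch (hypothetical): run cmds in the given container and
// return the captured stdout/stderr. clientset and restConfig are assumed to
// be initialized elsewhere (e.g. from ctrl.GetConfigOrDie()).
func execInContainerSketch(clientset kubernetes.Interface, restConfig *rest.Config,
    podName, containerName, namespace string, cmds []string) (string, string, error) {

    req := clientset.CoreV1().RESTClient().Post().
        Resource("pods").Name(podName).Namespace(namespace).SubResource("exec").
        VersionedParams(&corev1.PodExecOptions{
            Container: containerName,
            Command:   cmds,
            Stdout:    true,
            Stderr:    true,
        }, scheme.ParameterCodec)

    exec, err := remotecommand.NewSPDYExecutor(restConfig, "POST", req.URL())
    if err != nil {
        return "", "", err
    }

    var stdout, stderr bytes.Buffer
    err = exec.Stream(remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr})
    return stdout.String(), stderr.String(), err
}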
The logic is as follows:
1 K8S Request("")    --> Reconcile --> ReconcileJobs
2 K8S ScaleOut CR    --> Reconcile --> ReconcileJobs
    ReconcileJobs branches on phase:
    (1) "" / JobCreated --> reconcileResource --> doSteps
    (2) JobRunning      --> reconcileJobRunning --> setTrainingJobScaler --> updateScalerState --> phase = common.Scaling
    (3) Scaling         --> executeScaling --> executeScaleOut --> executeScaleScript
                            --> hostfileUpdateScript --> executeOnLauncher --> kubectlOnPod
0x06 ScaleIn
6.1 Ideas
The ScaleIn task CR is as follows:
When scaling in, the workers to remove can be specified with the spec.toDelete.count or spec.toDelete.podNames field of the ScaleIn CR.
When count is configured, the workers to remove are computed by index, from highest to lowest.
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    count: 1
If you want to scale in specific Workers, you can configure podNames:
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    podNames:
    - elastic-training-worker-1
Run an example that scales in a specified number of workers:
kubectl create -f examples/scale_in_count.yaml
6.2 Reconcile
When a ScaleIn CR is issued, the ScaleInController triggers Reconcile. The main thing it does is call setScalingOwner.
func (r *ScaleInReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    //silog := r.Log.WithValues("scalein", req.NamespacedName)
    scaleIn, err := getScaleIn(req.NamespacedName, r.Client)
    if isScaleFinished(*scaleIn.GetJobStatus()) {
        return NoRequeue()
    }
    // The above are basically all kinds of checks
    return setScalingOwner(r, scaleIn, r.PollInterval)
}
6.3 setScalingOwner
setScalingOwner is one of the key functions.
What it mainly does is handle the case where the ScaleIn CR does not yet have OwnerReferences set.
The logic is: based on the Selector field in the ScaleIn CR, find the TrainingJob that the Scaler targets and set it in the CR's OwnerReferences.
The various error check codes are removed below.
func setScalingOwner(r client.Client, scaler Scaler, pollInterval time.Duration) (ctrl.Result, error) {
    ownerRefs := scaler.GetOwnerReferences()
    if len(ownerRefs) == 0 {
        trainingJob := &kaiv1alpha1.TrainingJob{}
        nsn := types.NamespacedName{}
        nsn.Namespace = scaler.GetNamespace()
        nsn.Name = scaler.GetSelector().Name
        err := r.Get(context.Background(), nsn, trainingJob)

        gvk := kaiv1alpha1.SchemeGroupVersionKind
        ownerRefs = append(ownerRefs, *metav1.NewControllerRef(trainingJob,
            schema.GroupVersionKind{Group: gvk.Group, Version: gvk.Version, Kind: gvk.Kind}))
        scaler.SetOwnerReferences(ownerRefs)

        initializeJobStatus(scaler.GetJobStatus())
        updateJobConditions(scaler.GetJobStatus(), v1.JobCreated, "", msg)
        err = r.Status().Update(context.Background(), scaler)
        err = r.Update(context.Background(), scaler)
    }
    return NoRequeue()
}
6.4 executeScaleIn
The JobRunning state is handled the same way as for ScaleOut, so we skip it and look directly at executeScaleIn.
When scaling in, the workers to remove can be specified with the spec.toDelete.count or spec.toDelete.podNames field of the ScaleIn CR.
When count is configured, the workers to remove are computed by index, from highest to lowest.
The code combines the following steps:
- setsSaleInToDelete decides which workers to delete;
- executeScaleScript executes the scaling script;
- DeleteWorkers deletes the workers.
func (r *TrainingJobReconciler) executeScaleIn(job *kaiv1alpha1.TrainingJob, scaleIn *kaiv1alpha1.ScaleIn) error {
    if scaleIn.DeletionTimestamp != nil || isScaleFinished(*scaleIn.GetJobStatus()) {
        logger.Info("reconcile cancelled, scalein does not need to do reconcile or has been deleted")
        return nil
    }

    initializeJobStatus(scaleIn.GetJobStatus())

    //TODO: Validate the scalein count for minSize
    err := r.setsSaleInToDelete(job, scaleIn)

    currentWorkers := r.workersAfterScaler(job.Status.CurrentWorkers, scaleIn)

    // execute scalein script
    if err := r.executeScaleScript(job, scaleIn, currentWorkers); err != nil {
        msg := fmt.Sprintf("%s execute script failed, error: %v", scaleIn.GetFullName(), err)
        r.updateScalerFailed(scaleIn, job, msg)
        return nil
    }

    toDeleteWorkers := scaleIn.GetPodNames()
    remainWorkers := false
    if scaleIn.Spec.Script == "" {
        if shutdownWorkers, err := r.checkWorkerShutdown(job, toDeleteWorkers); err != nil {
            return err
        } else {
            if len(toDeleteWorkers) != len(shutdownWorkers) {
                remainWorkers = true
                toDeleteWorkers = shutdownWorkers
            }
        }
    }

    if err := r.DeleteWorkers(job, toDeleteWorkers); err != nil {
        msg := fmt.Sprintf("%s delete resource failed, error: %v", scaleIn.GetFullName(), err)
        r.updateScalerFailed(scaleIn, job, msg)
        return nil
    }

    // wait pods deleted
    deleted, _ := r.isWorkersDeleted(job.Namespace, scaleIn.GetPodNames())
    if deleted {
        job.Status.TargetWorkers = r.workersAfterScaler(job.Status.TargetWorkers, scaleIn)
        job.Status.CurrentWorkers = currentWorkers
        r.updateScalerSuccessd(scaleIn, job)
        return nil
    }

    if remainWorkers {
        msg := "wait for workers process shutdown"
        logger.Info(msg)
        return NewRequeueError(fmt.Errorf(msg))
    }

    return nil
}
6.5 setsSaleInToDelete
The workers to scale in are determined from spec.toDelete.count or spec.toDelete.podNames of the ScaleIn CR.
func (r *TrainingJobReconciler) setsSaleInToDelete(job *kaiv1alpha1.TrainingJob, scaleIn *kaiv1alpha1.ScaleIn) error {
    podNames := scaleIn.Status.ToDeletePods
    if len(podNames) != 0 {
        return /*filterPodNames(workers, podNames, false), */ nil
    }

    workers, err := r.GetWorkerPods(job)

    toDelete := scaleIn.Spec.ToDelete
    if toDelete.PodNames != nil {
        workers = filterPodNames(workers, toDelete.PodNames, false)
    } else if toDelete.Count > 0 {
        if toDelete.Count < len(workers) {
            allPodNames := getSortPodNames(job.Name, workers)
            deletePodNames := allPodNames[len(workers)-toDelete.Count:]
            workers = filterPodNames(workers, deletePodNames, false)
        }
    }

    for _, worker := range workers {
        scaleIn.Status.ToDeletePods = append(scaleIn.Status.ToDeletePods, worker.Name)
    }

    return nil
}
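filterPodNames and getSortPodNames are not quoted: getSortPodNames returns the worker names sorted by index (so the slice above takes the highest-index workers), and filterPodNames keeps or drops the pods whose names appear in a given list. A hedged sketch of the latter, based on how it is called with a third argument of false above and in DeleteWorkerPods (the real semantics of that flag may differ):

// filterPodNamesSketch (hypothetical): when exclude is false, keep only the
// pods whose names are in `names`; when exclude is true, keep the others.
func filterPodNamesSketch(pods []corev1.Pod, names []string, exclude bool) []corev1.Pod {
    wanted := map[string]bool{}
    for _, n := range names {
        wanted[n] = true
    }
    result := []corev1.Pod{}
    for _, p := range pods {
        if wanted[p.Name] != exclude { // keep matches when exclude == false
            result = append(result, p)
        }
    }
    return result
}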
6.6 DeleteWorkers
Concretely delete the worker Services and Pods.
func (r *TrainingJobReconciler) DeleteWorkers(trainingJob *kaiv1alpha1.TrainingJob, workers []string) error {
    if err := r.DeleteWorkerServices(trainingJob, workers); err != nil {
        return fmt.Errorf("delete services failed: %++v", err)
    }

    if err := r.DeleteWorkerPods(trainingJob, workers); err != nil {
        return fmt.Errorf("delete pods failed: %++v", err)
    }
    return nil
}
6.7 DeleteWorkerPods
Delete pods.
func (r *TrainingJobReconciler) DeleteWorkerPods(job *kaiv1alpha1.TrainingJob, pods []string) error {
    workerPods, err := r.GetWorkerPods(job)
    if pods != nil {
        workerPods = filterPodNames(workerPods, pods, false)
    }
    for _, pod := range workerPods {
        deleteOptions := &client.DeleteOptions{GracePeriodSeconds: utilpointer.Int64Ptr(0)}
        if err := r.Delete(context.Background(), &pod, deleteOptions); err != nil && !errors.IsNotFound(err) {
            r.recorder.Eventf(job, corev1.EventTypeWarning, trainingJobFailedReason,
                "Error deleting worker %s: %v", pod.Name, err)
            //return err
        }
        r.recorder.Eventf(job, corev1.EventTypeNormal, trainingJobSucceededReason,
            "Deleted pod %s", pod.Name)
    }
    return nil
}
The logic is as follows:
1 K8S Request("")             --> Reconcile --> ReconcileJobs
2 K8S ScaleOut / ScaleIn CR   --> Reconcile --> ReconcileJobs
    ReconcileJobs branches on phase:
    (1) "" / JobCreated --> reconcileResource --> doSteps
    (2) JobRunning      --> reconcileJobRunning --> setTrainingJobScaler --> updateScalerState --> phase = common.Scaling
    (3) Scaling --> executeScaling --> executeScaleOut --> executeScaleScript --> hostfileUpdateScript --> executeOnLauncher --> kubectlOnPod
    (4) Scaling --> executeScaling --> executeScaleIn  --> executeScaleScript --> DeleteWorkers --> DeleteWorkerPods --> Delete
With this, the analysis of the Horovod series is complete; look forward to the upcoming articles on parameter servers.
0xEE Personal Information
Thoughts on Life and Technology
WeChat Public Account: Rosie's Thoughts
Follow the account if you want timely notifications of new articles or the technical material I recommend.