Merging a Deep Learning Batch Task Scheduler with the Default Kubernetes Scheduler

Posted by jrobles on Wed, 08 May 2019 14:45:03 +0200


What is a batch task?

Deep learning has many multi-machine, multi-GPU jobs. Such a job runs several pods at the same time, but all of these pods belong to the same task.

So there's a problem.

A task starts 100 pods and each pod needs one GPU card, 100 cards in total, while the cluster has only 99 idle GPU cards. What will the default k8s scheduler do?

Because the default scheduler schedules pod by pod, it only checks whether the resources for a single pod are sufficient. The first 99 pods are scheduled successfully and the last pod fails.

This is very likely to result in:

  1. The task cannot run
  2. The first 99 pods keep their GPUs, which are not released, so new tasks cannot be scheduled
  3. In severe cases the whole cluster deadlocks: resources are held but nothing runs on them

So scheduling has to check the resources needed by the whole task. When the cluster as a whole does not have enough resources, not a single pod of the task should be scheduled.
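The difference between the two behaviours can be sketched with plain integers. This is only an illustration: the function names are hypothetical, and the cluster is treated as one pool of GPUs, ignoring how they are spread over nodes.

```go
package main

import "fmt"

// schedulePerPod mimics a per-pod scheduler: each pod is placed as long as
// enough GPUs are free for that one pod, with no knowledge of its task.
// It returns how many pods were placed.
func schedulePerPod(freeGPUs, pods, gpusPerPod int) int {
	placed := 0
	for i := 0; i < pods; i++ {
		if freeGPUs < gpusPerPod {
			break
		}
		freeGPUs -= gpusPerPod
		placed++
	}
	return placed
}

// scheduleAllOrNothing checks the whole task first and places either every
// pod of the task or none of them.
func scheduleAllOrNothing(freeGPUs, pods, gpusPerPod int) int {
	if freeGPUs < pods*gpusPerPod {
		return 0
	}
	return pods
}

func main() {
	// 99 idle GPUs, a task of 100 pods, one GPU per pod.
	fmt.Println("per-pod placed:", schedulePerPod(99, 100, 1))
	fmt.Println("all-or-nothing placed:", scheduleAllOrNothing(99, 100, 1))
}
```

With per-pod placement 99 GPUs end up held by a task that can never start; with the all-or-nothing check they stay free for other work.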

The community provides a scheduler that supports this feature.
But that scheduler does not work well together with the native scheduler.

  1. The biggest problem is that both schedulers keep a cache, and the contents of the two caches conflict, which leads to scheduling confusion.
  2. That scheduler cannot work at the same time as the native scheduler, so once you use the batch scheduler you lose features such as affinity.

So what we did was merge the two sets of features, and we chose to do so by customizing kube-scheduler itself.

The scheduler can in fact be extended through an extender, but the extender is too weak: it can only add its own filtering logic to the predicate (filtering) and priority (scoring) phases, which is far from enough for batch tasks.

Implementation difficulties

Add a batch task check when selecting a node
Get a pod -> if it is a batch pod -> check whether the cluster resources satisfy the whole batch task -> if not, scheduling fails

The other pods in the batch task must be guaranteed to get scheduled

If the cluster resources can satisfy the batch task and the pod goes straight to bind, there is a problem.
Suppose the cluster has three GPUs, the batch task needs three GPUs, and the scheduling queue looks like this:

    Position  Pod            Result
    1         batch A pod    scheduled, cluster resources are sufficient
    2         ordinary pod   scheduled
    3         ordinary pod   scheduled
    4         batch A pod    fails, the GPUs are occupied by the other pods
    5         batch A pod    fails, not enough GPUs

The final result is that batch task A occupies one GPU, yet the task as a whole fails to schedule, and that GPU cannot be released.
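The queue above can be replayed with a small simulation. The types and names below are hypothetical stand-ins, not scheduler code; each pod is assumed to ask for one GPU.

```go
package main

import "fmt"

// queuedPod is a simplified pod: a name, the task it belongs to
// ("" for an ordinary pod) and the number of GPUs it asks for.
type queuedPod struct {
	name string
	task string
	gpus int
}

// scheduleInOrder places pods one at a time, as a per-pod scheduler would.
// It returns the GPUs held per task and the names of the pods that failed.
func scheduleInOrder(freeGPUs int, queue []queuedPod) (held map[string]int, failed []string) {
	held = map[string]int{}
	for _, p := range queue {
		if p.gpus > freeGPUs {
			failed = append(failed, p.name)
			continue
		}
		freeGPUs -= p.gpus
		held[p.task] += p.gpus
	}
	return held, failed
}

func main() {
	queue := []queuedPod{
		{"A-0", "A", 1}, {"pod-1", "", 1}, {"pod-2", "", 1}, {"A-1", "A", 1}, {"A-2", "A", 1},
	}
	held, failed := scheduleInOrder(3, queue)
	fmt.Println("GPUs held by batch task A:", held["A"])
	fmt.Println("pods that failed:", failed)
}
```

Task A ends up holding one GPU while two of its three pods fail, which is exactly the stuck state described above.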

So should we change the order of the pod scheduling queue and let the pods of batch A be scheduled back to back? It is not that simple.

Pods are scheduled concurrently in goroutines, so even adjusting the order of pods in the queue does not guarantee that the other pods of the batch task are scheduled first.

go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)

Once a batch pod enters the bind logic, there is no turning back.

One idea: assume all the pods in the batch task first, and if any of them fails, clean up the other pods that have been assumed but not yet actually bound. Then either throw all the pods back into the queue, or report the scheduling failure and clean up the task's pods so that the upper layer triggers the task again?

The scheduler flow, that is, the scheduleOne logic in scheduler/scheduler.go:

Select a node -> assume the pod on that node in the cache -> start a goroutine to bind

So checking at assume time whether the task can be satisfied does not work, because earlier pods of the batch task may already have been bound. Only once the last pod of the batch task has been confirmed can the earlier pods go on to bind.

Reservation strategy

When the first pod of a batch task arrives, check whether the cluster has enough resources. If it does, reserve them: mark several other nodes so that later ordinary pods cannot occupy them, which guarantees that the remaining pods of the batch task have nodes available when they arrive.

This brings us back to the problem that the first pod must not be bound yet...

There are two issues:

  1. How do we know which nodes the other pods in the batch task need? If all the pods are identical, the problem becomes simpler.
  2. If a later pod fails, the first pod has still been bound, so the same problem appears.

Ultimately, no pod may be bound until all the pods of the task have been assumed.

To sum up, several places need handling:

  1. Preferably use a priority queue and raise the priority of the pods associated with the pod being scheduled.
  2. When selecting a node, check whether the cluster resources are enough.
  3. When assuming the pod on the selected node, check again. If this pod or its pod group cannot be satisfied, do not go on to bind.

The problem is that the earlier pods have already gone through the bind process, so the key question is how to avoid binding them, that is, how to delay the bind.
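The first item, raising the priority of sibling pods, can be sketched with the standard container/heap package. This is a minimal illustration: the item type, the boostGroup helper and the priority value are all hypothetical, and the real scheduler queue is more involved. As noted earlier, reordering alone only helps and guarantees nothing.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a hypothetical queue entry: pod name, group, priority and
// insertion order (used as a FIFO tie-breaker).
type item struct {
	pod      string
	group    string
	priority int
	seq      int
}

// podQueue implements heap.Interface; the highest priority pops first.
type podQueue []*item

func (q podQueue) Len() int { return len(q) }
func (q podQueue) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority
	}
	return q[i].seq < q[j].seq
}
func (q podQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *podQueue) Push(x interface{}) { *q = append(*q, x.(*item)) }
func (q *podQueue) Pop() interface{} {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

// newQueue builds a queue, numbering the items in insertion order.
func newQueue(items ...item) *podQueue {
	q := &podQueue{}
	for i := range items {
		it := items[i]
		it.seq = i
		heap.Push(q, &it)
	}
	return q
}

// boostGroup raises the priority of every queued pod in the given group and
// then restores the heap invariant.
func boostGroup(q *podQueue, group string, priority int) {
	for _, it := range *q {
		if it.group == group {
			it.priority = priority
		}
	}
	heap.Init(q)
}

// drain pops everything and returns the pod names in scheduling order.
func drain(q *podQueue) []string {
	var order []string
	for q.Len() > 0 {
		order = append(order, heap.Pop(q).(*item).pod)
	}
	return order
}

func main() {
	q := newQueue(
		item{pod: "pod-1"}, item{pod: "pod-2"},
		item{pod: "A-1", group: "A"}, item{pod: "A-2", group: "A"},
	)
	// A pod of group A is being scheduled: let its siblings jump the queue.
	boostGroup(q, "A", 10)
	fmt.Println(drain(q))
}
```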

Final Solution - Delayed Binding

Solution: handle batch tasks specially at bind time

  1. A pod of a batch task is thrown into the task cache and is not bound.
  2. When the last pod of the batch task is thrown into the task cache, the task is ready and is put into the bind queue.
  3. Tasks are taken from the bind queue and bound. Task binding is mutually exclusive with ordinary pod binding.

A pod that belongs to a batch task carries two annotations:

        scheduling.k8s.io/group-name: qj-1
        scheduling.k8s.io/group-pod-num: 3

Pods with the same group-name annotation belong to the same task, and group-pod-num says how many pods the task has.
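Reading the two annotations could look like the sketch below. The helper name batchMeta is hypothetical and works on a plain annotation map rather than a *v1.Pod; treating a malformed count as "not a batch pod" is an assumption, not something the article specifies.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	groupNameKey   = "scheduling.k8s.io/group-name"
	groupPodNumKey = "scheduling.k8s.io/group-pod-num"
)

// batchMeta reads the two annotations. A pod with a missing group name, or a
// missing, malformed or non-positive pod count, is treated as an ordinary
// pod and yields "", 0.
func batchMeta(annotations map[string]string) (group string, podNum int) {
	group = annotations[groupNameKey]
	n, err := strconv.Atoi(annotations[groupPodNumKey])
	if group == "" || err != nil || n <= 0 {
		return "", 0
	}
	return group, n
}

func main() {
	group, n := batchMeta(map[string]string{
		groupNameKey:   "qj-1",
		groupPodNumKey: "3",
	})
	fmt.Println(group, n)
}
```

Annotation values are strings in Kubernetes, which is why the count is parsed with strconv.Atoi; in a YAML manifest it would normally be written quoted.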

The original plan was to define a CRD to describe the task. That would reduce coupling, but it is more troublesome to implement because one more CRD has to be watched, so, being lazy, we did not do it.


Delayed binding process:

  • If it is an ordinary pod, it is assumed and then bound directly once a node has been found.
  • If it is a pod of a batch task, it is thrown into the batch cache and the scheduler returns.
  • A coordinator keeps checking the batch cache for tasks that are complete.
  • Complete tasks are thrown into the bind queue, and a worker takes them and binds them as a batch. This binding is mutually exclusive with ordinary pod binding.
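The process above can be sketched with a cache, a channel as the bind queue and a mutex. Everything here is a hypothetical stand-in for the real batch scheduler: pods are plain strings, and for brevity readiness is checked when a pod is added instead of by a separate coordinator loop.

```go
package main

import (
	"fmt"
	"sync"
)

// taskCache collects the assumed pods of each task until the task is complete.
type taskCache struct {
	mu      sync.Mutex
	pending map[string][]string // group name -> assumed pod names
}

func newTaskCache() *taskCache {
	return &taskCache{pending: map[string][]string{}}
}

// add records an assumed batch pod without binding it. When the last pod of
// the group arrives it returns the complete task with ready=true and removes
// the group from the cache.
func (c *taskCache) add(group, pod string, groupPodNum int) (task []string, ready bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[group] = append(c.pending[group], pod)
	if len(c.pending[group]) < groupPodNum {
		return nil, false
	}
	task = c.pending[group]
	delete(c.pending, group)
	return task, true
}

// binder serialises binding so that a batch task and an ordinary pod never
// bind at the same time.
type binder struct {
	mu    sync.Mutex
	bound []string
}

func (b *binder) bindAll(pods ...string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.bound = append(b.bound, pods...)
}

func main() {
	cache := newTaskCache()
	b := &binder{}
	bindQueue := make(chan []string, 1)

	// Worker: takes ready tasks from the bind queue and binds them as a batch.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for task := range bindQueue {
			b.bindAll(task...)
		}
	}()

	b.bindAll("ordinary-pod") // ordinary pods bind right after assume
	for _, pod := range []string{"qj-1-0", "qj-1-1", "qj-1-2"} {
		if task, ready := cache.add("qj-1", pod, 3); ready {
			bindQueue <- task
		}
	}
	close(bindQueue)
	wg.Wait()
	fmt.Println(b.bound)
}
```

The first two batch pods only sit in the cache; nothing of task qj-1 is bound until the third pod arrives.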

Batch scheduler interface and members

Run starts a goroutine that checks for tasks that are complete and pushes them into the bind queue.
RunBind starts the goroutine that binds tasks.
PodQuePriority dynamically adjusts the priority of the pod queue so that pods belonging to the same task are scheduled first.

Execution process:

Delayed binding


    //fanux if it is a batch pod, return
    if sched.Config.BatchScheduler.IsBatchPod(assumedPod) {
        err = sched.Config.BatchScheduler.HandleBatchPod(assumedPod)
        if err != nil {
            // log the pod as namespace/name together with the error
            glog.Errorf("schedule batch pod %v/%v failed: %v", assumedPod.Namespace, assumedPod.Name, err)
        }
        return
    }

Add a bind mutex so that batch tasks and ordinary pods do not bind at the same time:

    go func() {
        //fanux add bind mutex
        defer sched.Config.BatchScheduler.UnLock()

        err := sched.bind(assumedPod, &v1.Binding{

CheckResourceIsEnough: checking whether resources are sufficient

This should not use filterFunc; it needs the node list.


package util

import "k8s.io/api/core/v1"

//CheckResourceIsEnough is a stub at this point; the full check is shown further down
func CheckResourceIsEnough(pod *v1.Pod, nodes []*v1.Node) (bool, error) {
    return false, nil
}


    //fanux add checkBatchPodResource
    flag, err := util.CheckResourceIsEnough(pod, filteredNodes)
    if !flag || err != nil {
        return "", err
    }


Handling insufficient resources

    suggestedHost, err := sched.schedule(pod)

    //fanux add handle if resource not enough
    // check err != nil first: calling err.Error() on a nil error panics
    if err != nil && strings.Contains(err.Error(), common.BatchResourceNotEnough) {
    } else if err != nil {

How to get the number of GPUs already allocated on a node

In nodeInfo, allocatableResource - requestedResource is the available resource.

    requestedResource *Resource
    nonzeroRequest    *Resource
    allocatableResource *Resource

GPUs are stored in ScalarResources, and the resource name is NVIDIAGPUResourceName = "nvidia.com/gpu".

type Resource struct {
    MilliCPU         int64
    Memory           int64
    EphemeralStorage int64
    // We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
    // explicitly as int, to avoid conversions and improve performance.
    AllowedPodNumber int
    // ScalarResources
    ScalarResources map[v1.ResourceName]int64
}

Add a pod updater to update the pod condition status

    batchScheduler := batch.NewBatchScheduler(c.schedulerCache, c.podQueue, &binder{c.client}, &podConditionUpdater{c.client})

The resource check in generic_scheduler needs the batch scheduler's cache

It has to know which pods of the task have already been assumed, and subtract that number from the number of pods the batch task still needs GPUs for.


    //fanux add batch Cache
    //check batch pod resource is enough need batch scheduler cache
    BatchCache common.TaskCache
    //fanux add checkBatchPodResource
    flag, err := common.CheckResourceIsEnough(pod, filteredNodes, g.cachedNodeInfoMap, g.BatchCache)


    //fanux check batch resource is enough need batch scheduler cache
    batchCache := batchScheduler.GetTaskCache()

    algo := core.NewGenericScheduler(

Then, inside the resource check:

    //should not use the annotation value directly, need annotation value - assumed pod num in batch cache
    _, podNum := GetPodBathMeta(pod)
    podNum -= batchCache.GetTaskAssumedPodNum(pod)

Detailed algorithms for checking resource adequacy:

There are many details.

//How many GPUs does a pod need? Add up the GPU limits of the containers in the pod.
func GetPodGPUCount(pod *v1.Pod) (count int) {
    for _, c := range pod.Spec.Containers {
        limit, ok := c.Resources.Limits[NVIDIAGPUResourceName]
        l, okay := limit.AsInt64()
        if !ok || !okay {
            // no GPU limit on this container, or not an integer value
            continue
        }
        count += int(l)
    }

    glog.Infof("Pod [%s] need GPU [%d]", pod.GetName(), count)

    return
}


//To get the idle GPUs of a node, subtract the requested GPUs from the allocatable ones.
func GetNodeFreeGPU(nodeInfo *cache.NodeInfo) int {
    if nodeInfo == nil {
        return 0
    }

    allocatable, ok := nodeInfo.AllocatableResource().ScalarResources[NVIDIAGPUResourceName]
    if !ok {
        glog.Errorf("can't fetch allocatable GPU : %v", nodeInfo)
        return 0
    }
    glog.Infof("node [%s] allocatable GPU [%d]", nodeInfo.Node().Name, allocatable)

    requested, ok := nodeInfo.RequestedResource().ScalarResources[NVIDIAGPUResourceName]
    if !ok {
        //glog.Errorf("can't fetch requested GPU : %v", nodeInfo)
        //return 0
        requested = 0
    }
    glog.Infof("node [%s] requested GPU [%d]", nodeInfo.Node().Name, requested)

    available := allocatable - requested

    glog.Infof("available node [%s] GPU : [%d]", nodeInfo.Node().Name, available)

    return int(available)
}

//The key point: take the total number of pods in the task from the annotation and subtract the batch pods that have already been assumed. The difference is what is really still needed.
func CheckResourceIsEnough(pod *v1.Pod, nodes []*v1.Node, cachedNodeInfoMap map[string]*cache.NodeInfo, batchCache TaskCache) (bool, error) {
    //if is not batch pod, return true,nil
    if !IsBatch(pod) {
        glog.Infof("pod %s is not batch pod", pod.GetName())
        return true, nil
    }

    //should not use metadata directly, need metadata - ready pod num in batch cache
    _, podNum := GetPodBathMeta(pod)
    podNum -= batchCache.GetTaskAssumedPodNum(pod)

    everyPodNeedsGPU := GetPodGPUCount(pod)
    if everyPodNeedsGPU == 0 {
        glog.Infof("pod %s require 0 GPU", pod.GetName())
        return true, nil
    }

    // TODO maybe check nodes[1:], node[0] already allocate a pod, CPU and other metric may reach limit
    for _, node := range nodes {
        nodeInfo, ok := cachedNodeInfoMap[node.Name]
        if !ok {
            // no cached info for this node, so it contributes no free GPUs
            continue
        }
        nodeFree := GetNodeFreeGPU(nodeInfo)
        // integer division: a pod cannot be split across nodes
        podNum -= nodeFree / everyPodNeedsGPU
        glog.Infof("pod: [%s] node: [%s] podNum [%d] nodeFree [%d] podNeed [%d]", pod.GetName(), node.Name, podNum, nodeFree, everyPodNeedsGPU)
        if podNum <= 0 {
            return true, nil
        }
    }

    return false, fmt.Errorf("BatchResourceNotEnough : pod name is %s", pod.GetName())
}

//Is it a batch pod?
func IsBatch(pod *v1.Pod) bool {
    g, n := GetPodBathMeta(pod)
    if g == "" || n == 0 {
        glog.Infof("The pod's group name is empty string,pod name is %v.", pod.GetName())
        return false
    }
    return true
}
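The arithmetic of the check is easier to see with plain numbers. The sketch below reproduces it without Kubernetes types; the function names are hypothetical and the node free-GPU counts are made-up inputs.

```go
package main

import "fmt"

// podsThatFit returns how many pods needing gpusPerPod GPUs each fit on
// nodes with the given free GPU counts. The division is done per node,
// because a pod cannot be split across nodes.
func podsThatFit(nodeFreeGPUs []int, gpusPerPod int) int {
	fit := 0
	for _, free := range nodeFreeGPUs {
		fit += free / gpusPerPod
	}
	return fit
}

// enoughForTask mirrors the check: groupPodNum comes from the annotation and
// assumed is the number of pods of the task already assumed in the batch cache.
func enoughForTask(groupPodNum, assumed, gpusPerPod int, nodeFreeGPUs []int) bool {
	if gpusPerPod == 0 {
		return true
	}
	remaining := groupPodNum - assumed
	return podsThatFit(nodeFreeGPUs, gpusPerPod) >= remaining
}

func main() {
	// Two nodes with 3 free GPUs each, 3 pods needing 2 GPUs each:
	// 6 GPUs are free in total, but only one pod fits per node.
	fmt.Println(enoughForTask(3, 0, 2, []int{3, 3}))
	// Nodes with 4 and 2 free GPUs hold 2 + 1 pods.
	fmt.Println(enoughForTask(3, 0, 2, []int{4, 2}))
	// After one pod has been assumed on the first node, 2 pods remain
	// and the free counts are 2 and 2.
	fmt.Println(enoughForTask(3, 1, 2, []int{2, 2}))
}
```

The first case shows why summing free GPUs over the cluster is not enough: the total matches the demand, yet the task does not fit.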

On using and discovering GPUs

Resource bundle

It includes docker, nvidia-docker and the GPU device plugin


[root@compute-gpu006 ~]# cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

kubectl describe node xxx:

Capacity:
 cpu:                72
 ephemeral-storage:  222779Mi
 hugepages-1Gi:      0
 hugepages-2Mi:      2Gi
 memory:             791014684Ki
 nvidia.com/gpu:     2                # Here you can see the GPU.
 pods:               110
Allocatable:
 cpu:                72
 ephemeral-storage:  210240641086
 hugepages-1Gi:      0
 hugepages-2Mi:      2Gi
 memory:             788815132Ki
 nvidia.com/gpu:     2
 pods:               110


The original scheduler is designed to handle pods one at a time, so developing this feature was quite difficult, and it still needed an elegant solution.
A clean architecture takes more effort. I thought for a long time before arriving at this less intrusive scheme. Discussion is welcome.
