Chaos engineering tool - principles and installation of Alibaba ChaosBlade to simulate CPU & IO exceptions

Posted by bosco500 on Sat, 22 Jan 2022 11:41:18 +0100

1. General

Starting with this article, I introduce an interesting area of testing: chaos engineering.

2. Introduction to chaos engineering

2.1. Chaos engineering definition

  • Original English definition

The Principles of Chaos Engineering define it as follows:
Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system's capability
to withstand turbulent conditions in production.

  • Chinese translation

The Chinese translation, rendered back into English, reads:
Chaos engineering is the discipline of experimenting on distributed systems in order to build confidence in the system's ability to withstand runaway conditions in the production environment.

  • Principles

The principles are:
Build a hypothesis around steady-state behavior
Vary real-world events
Run experiments in production
Automate experiments to run continuously
Minimize blast radius

It's interesting to see some new terms here. Some people also distinguish chaos engineering from exception testing and fault testing, but the concepts still need to be reconciled. A concept should precede the technology and guide it. However, it always takes some time for a concept to be put into practice.
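To make the first and last principles more concrete, here is a minimal sketch of an experiment loop: sample a steady-state metric, inject a fault, sample again, and always roll back. The `Experiment` type, its fields, and the 10% tolerance are hypothetical names chosen for illustration. They are not part of chaosblade or of the Principles of Chaos Engineering.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment is a hypothetical container for one chaos experiment.
type Experiment struct {
	Measure   func() float64 // samples the steady-state metric, e.g. requests/s
	Inject    func() func()  // starts the fault and returns its rollback
	Tolerance float64        // allowed relative deviation, e.g. 0.10 for 10%
}

// Run reports whether the metric stayed within Tolerance of its baseline
// while the fault was active. The fault is rolled back before Run returns.
func (e Experiment) Run() bool {
	baseline := e.Measure()
	rollback := e.Inject()
	defer rollback() // keep the blast radius small: never leave the fault running
	during := e.Measure()
	return math.Abs(during-baseline) <= e.Tolerance*math.Abs(baseline)
}

func main() {
	throughput := 1000.0 // stand-in for a real metric source
	exp := Experiment{
		Measure: func() float64 { return throughput },
		Inject: func() func() {
			throughput = 950.0 // pretend the fault costs 5% of throughput
			return func() { throughput = 1000.0 }
		},
		Tolerance: 0.10,
	}
	fmt.Println("hypothesis held:", exp.Run())
}
```

In a real experiment, `Inject` would start something like a chaosblade fault and `Measure` would query monitoring. Both are faked here so the sketch runs on its own.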

3. Installation of tools

3.1. chaosblade installation

Installation is very simple: download the release archive and extract it.

[gaolou@7dgroup2 ~]$ wget -c https://github.com/chaosblade-io/chaosblade/releases/download/v0.2.0/chaosblade-0.2.0.linux-amd64.tar.gz
[gaolou@7dgroup2 ~]$ tar zxvf chaosblade-0.2.0.linux-amd64.tar.gz

4. Use of chaosblade

4.1. Simulate CPU load

1. CPU load simulation command

[gaolou@7dgroup2 chaosblade-0.2.0]$ ./blade  create cpu fullload
{"code":200,"success":true,"result":"cb6300fd4899c537"}
[gaolou@7dgroup2 chaosblade-0.2.0]$
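The command prints a small JSON document. Its `result` field is the experiment ID, which is what `./blade destroy` takes to end the experiment. The following is a minimal sketch of extracting that ID in Go, for example from a wrapper script. The struct models only the three fields visible in the output above, and the type and function names are made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bladeResponse mirrors the three fields visible in the output above.
type bladeResponse struct {
	Code    int    `json:"code"`
	Success bool   `json:"success"`
	Result  string `json:"result"`
}

// parseExperimentUID returns the "result" field of a successful response.
func parseExperimentUID(raw string) (string, error) {
	var resp bladeResponse
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		return "", fmt.Errorf("decode blade output: %w", err)
	}
	if !resp.Success {
		return "", fmt.Errorf("blade reported failure, code %d", resp.Code)
	}
	return resp.Result, nil
}

func main() {
	uid, err := parseExperimentUID(`{"code":200,"success":true,"result":"cb6300fd4899c537"}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Print the command that would stop this experiment.
	fmt.Println("./blade destroy", uid)
}
```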

2. View the simulation effect

As the figure shows, user-space (us) CPU utilization is indeed driven up.

3. Implementation principle of CPU simulation


The CPU burn is launched by the following method. The key source code is:

func runBurnCpu(ctx context.Context, cpuCount int, cpuPercent int, pidNeeded bool, processor string) int {
  args := fmt.Sprintf(`%s --nohup --cpu-count %d --cpu-percent %d`,
    path.Join(util.GetProgramPath(), burnCpuBin), cpuCount, cpuPercent)
  if pidNeeded {
    args = fmt.Sprintf("%s --cpu-processor %s", args, processor)
  }
  args = fmt.Sprintf(`%s > /dev/null 2>&1 &`, args)
  response := channel.Run(ctx, "nohup", args)
  if !response.Success {
    stopBurnCpuFunc()
    bin.PrintErrAndExit(response.Err)
  }
  if pidNeeded {
    // parse pid
    newCtx := context.WithValue(context.Background(), exec.ProcessKey, fmt.Sprintf("cpu-processor %s", processor))
    pids, err := exec.GetPidsByProcessName(burnCpuBin, newCtx)
    if err != nil {
      stopBurnCpuFunc()
      bin.PrintErrAndExit(fmt.Sprintf("bind cpu core failed, cannot get the burning program pid, %v", err))
    }
    if len(pids) > 0 {
      // return the first one
      pid, err := strconv.Atoi(pids[0])
      if err != nil {
        stopBurnCpuFunc()
        bin.PrintErrAndExit(fmt.Sprintf("bind cpu core failed, get pid failed, pids: %v, err: %v", pids, err))
      }
      return pid
    }
  }
  return -1
}

The other related code is not shown here. In short, chaosblade ships a small program whose only job is to consume CPU. A simple busy loop is enough to do this.
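As an illustration of the busy-loop idea, here is a minimal sketch that spins one goroutine per core for a fixed time. It is not chaosblade's own burn program. The function name `burnForMillis` and the two-second duration are assumptions made for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// burnForMillis spins `workers` goroutines in a busy loop for roughly ms
// milliseconds and returns the total number of loop iterations performed.
func burnForMillis(ms int, workers int) uint64 {
	deadline := time.Now().Add(time.Duration(ms) * time.Millisecond)
	var total uint64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var n uint64
			for time.Now().Before(deadline) {
				n++ // busy work only: the loop never sleeps or blocks
			}
			atomic.AddUint64(&total, n)
		}()
	}
	wg.Wait()
	return total
}

func main() {
	cores := runtime.NumCPU()
	iterations := burnForMillis(2000, cores)
	fmt.Printf("burned %d cores for about 2s, %d iterations\n", cores, iterations)
}
```

While it runs, a tool such as top should show user CPU time rising on every core, which is the same effect the `cpu fullload` experiment produces.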

4.2. Simulate high IO

1. Simulation command

[root@7dgroup2 chaosblade-0.2.0]# ./blade create disk burn --write --read  --size 10 --count 1024  --timeout 300
{"code":200,"success":true,"result":"f026b3510722685d"}

2. View the simulation effect

[root@7dgroup2 chaosblade-0.2.0]#

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    91.00  250.00  815.00 84892.00 92588.00   333.30    43.92   39.27   41.60   38.56   0.93  99.50
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               1.00   105.00  496.00  865.00 98012.00 92692.00   280.24    43.72   34.02   33.40   34.37   0.73  99.40
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.99   106.93  259.41  675.25 99853.47 91750.50   410.00    36.22   38.53   47.09   35.24   1.06  98.81
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    80.00  241.00 1103.00 116340.00 82296.00   295.59    44.06   33.03   47.92   29.78   0.74  99.90
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

The results above show that IO is indeed saturated: %util on vda stays at around 99%. Let's see how chaosblade achieves this.

3. Implementation principle of high IO utilization

TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
24036 be/4 root      104.55 M/s    0.00 B/s  0.00 % 99.99 % dd if=/dev/vda1 of=/dev/null~ iflag=dsync,direct,fullblock
24034 be/4 root        0.00 B/s  104.55 M/s  0.00 % 68.17 % dd if=/dev/zero of=/tmp/chao~bs=10M count=1024 oflag=dsync

Listing the processes with the highest IO shows two dd processes, one reading and one writing. In other words, chaosblade calls dd to simulate high IO. The key implementation code is as follows:

// write burn
func burnWrite(size, count string) {
  for {
    args := fmt.Sprintf(`if=/dev/zero of=%s bs=%sM count=%s oflag=dsync`, tmpDataFile, size, count)
    response := channel.Run(context.Background(), "dd", args)
    channel.Run(context.Background(), "rm", fmt.Sprintf(`-rf %s`, tmpDataFile))
    if !response.Success {
      bin.PrintAndExitWithErrPrefix(response.Err)
      return
    }
  }
}
// read burn
func burnRead(fileSystem, size, count string) {
  for {
    // "if" arg in dd command is file system value, but "of" arg value is related to mount point
    args := fmt.Sprintf(`if=%s of=/dev/null bs=%sM count=%s iflag=dsync,direct,fullblock`, fileSystem, size, count)
    response := channel.Run(context.Background(), "dd", args)
    if !response.Success {
      bin.PrintAndExitWithErrPrefix(fmt.Sprintf("The file system named %s is not supported or %s", fileSystem, response.Err))
    }
  }
}
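For comparison, the write side can be approximated without dd. The sketch below writes zero-filled blocks to a temporary file and calls fsync after each block, which is roughly what `oflag=dsync` asks dd to do. It is an illustration rather than chaosblade's implementation. The function name `burnWriteOnce` and the block sizes are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// burnWriteOnce writes count blocks of blockBytes zero bytes to a temporary
// file in dir (the default temp directory when dir is ""), syncing after
// every block, then removes the file. It returns the number of bytes written.
func burnWriteOnce(dir string, blockBytes, count int) (int64, error) {
	f, err := os.CreateTemp(dir, "io-burn-*")
	if err != nil {
		return 0, err
	}
	// Deferred calls run in reverse order: Close first, then Remove.
	defer os.Remove(f.Name())
	defer f.Close()

	block := make([]byte, blockBytes) // zero-filled, like reading /dev/zero
	var written int64
	for i := 0; i < count; i++ {
		n, err := f.Write(block)
		written += int64(n)
		if err != nil {
			return written, err
		}
		// Flush to the device so the page cache does not absorb the writes.
		if err := f.Sync(); err != nil {
			return written, err
		}
	}
	return written, nil
}

func main() {
	// One pass of 4 blocks of 1 MiB; a real burn would repeat this in a loop.
	n, err := burnWriteOnce("", 1<<20, 4)
	if err != nil {
		fmt.Println("write burn failed:", err)
		return
	}
	fmt.Println("bytes written:", n)
}
```

Like the `burnWrite` loop above, each pass deletes its file afterwards, so disk space is not consumed permanently.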

5. ChaosBlade summary

chaosblade can really be regarded as a tool set that integrates various small utilities.
The label "chaos" is still a little big for this tool. To simulate faults across thousands of nodes with it, you would also need integration and configuration management, remote execution, and other supporting tools.
Look back at the principles of chaos engineering listed above. Are these simulations consistent with them? If you have experience handling production problems, you will know that this kind of simulation actually differs from how high CPU and high IO arise in a real environment.

  • When we ask whether an application stays robust under high CPU, we usually mean one of two things:
    - 1. Whether the program under test stays robust when other programs consume a lot of CPU.
    - 2. Whether the program under test stays robust when its own code consumes a lot of CPU.

Anyone who has dealt with similar production problems knows that the first situation almost never occurs, except under an unreasonable deployment. That is the situation chaosblade actually simulates. The second situation is something chaosblade cannot simulate yet.
But the second situation is the real focus of testing.
In fact, "chaos" in English simply means disorder or confusion, which is a very different concept from the Chinese word now used to translate it. Using that word for this concept really cheapens the Chinese word itself.