Chaos engineering tools: installing Alibaba's ChaosBlade and how it simulates CPU and IO faults
1. General
Starting with this article, I introduce an interesting area of testing: chaos engineering.
2. Introduction to chaos engineering
2.1. Chaos engineering definition
- Original English definition
The Principles of Chaos Engineering define it as follows:
Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system's capability
to withstand turbulent conditions in production.
- Definition in the Chinese translation
The Chinese translation reads as follows:
Chaos engineering is the discipline of experimenting on distributed systems, with the aim of building the capability and the confidence for the system to withstand out-of-control conditions in production.
- Principles
The principles are:
Build a hypothesis around steady-state behavior
Vary real-world events
Run experiments in production
Automate experiments to run continuously
Minimize blast radius
New terminology is always interesting. Some people distinguish chaos engineering from exception testing and fault testing, but the concepts still have to be reconciled. Ideally a concept runs ahead of the technology and guides it; in reality, it always takes some time before the concept is put into practice.
3. Installation of tools
3.1. chaosblade installation
Installation is simple: download the release archive and extract it.
[gaolou@7dgroup2 ~]$ wget -c https://github.com/chaosblade-io/chaosblade/releases/download/v0.2.0/chaosblade-0.2.0.linux-amd64.tar.gz
[gaolou@7dgroup2 ~]$ tar zxvf chaosblade-0.2.0.linux-amd64.tar.gz
4. Use of chaosblade
4.1. Simulate CPU load
1. CPU load simulation command
[gaolou@7dgroup2 chaosblade-0.2.0]$ ./blade create cpu fullload
{"code":200,"success":true,"result":"cb6300fd4899c537"}
[gaolou@7dgroup2 chaosblade-0.2.0]$
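When the command is driven from a script, the JSON reply is the only handle on the experiment, so it is worth decoding. The following is a minimal sketch, assuming the reply always has the three fields shown above and that result is a plain string on success. The type bladeResponse and the function parseBladeResponse are names made up for this illustration; they are not part of chaosblade.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bladeResponse mirrors the three fields visible in the output above.
// Hypothetical type for illustration, not taken from the chaosblade source.
type bladeResponse struct {
	Code    int    `json:"code"`
	Success bool   `json:"success"`
	Result  string `json:"result"`
}

// parseBladeResponse decodes the JSON printed by "blade create" and
// returns the result string (the experiment ID) when success is true.
func parseBladeResponse(raw string) (string, error) {
	var resp bladeResponse
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		return "", fmt.Errorf("decode blade response: %w", err)
	}
	if !resp.Success {
		return "", fmt.Errorf("blade reported failure, code %d", resp.Code)
	}
	return resp.Result, nil
}

func main() {
	uid, err := parseBladeResponse(`{"code":200,"success":true,"result":"cb6300fd4899c537"}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("experiment ID:", uid)
}
```

The returned ID is what you would hand to the tool's destroy subcommand to stop the experiment once you have finished observing it.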
2. View the simulation effect
As can be seen from the figure, user-mode (us) CPU utilization is indeed driven up.
3. Implementation principle of CPU simulation
The CPU burn is started by the function below. The key source code is as follows:
func runBurnCpu(ctx context.Context, cpuCount int, cpuPercent int, pidNeeded bool, processor string) int {
    args := fmt.Sprintf(`%s --nohup --cpu-count %d --cpu-percent %d`,
        path.Join(util.GetProgramPath(), burnCpuBin), cpuCount, cpuPercent)
    if pidNeeded {
        args = fmt.Sprintf("%s --cpu-processor %s", args, processor)
    }
    args = fmt.Sprintf(`%s > /dev/null 2>&1 &`, args)
    response := channel.Run(ctx, "nohup", args)
    if !response.Success {
        stopBurnCpuFunc()
        bin.PrintErrAndExit(response.Err)
    }
    if pidNeeded {
        // parse pid
        newCtx := context.WithValue(context.Background(), exec.ProcessKey, fmt.Sprintf("cpu-processor %s", processor))
        pids, err := exec.GetPidsByProcessName(burnCpuBin, newCtx)
        if err != nil {
            stopBurnCpuFunc()
            bin.PrintErrAndExit(fmt.Sprintf("bind cpu core failed, cannot get the burning program pid, %v", err))
        }
        if len(pids) > 0 {
            // return the first one
            pid, err := strconv.Atoi(pids[0])
            if err != nil {
                stopBurnCpuFunc()
                bin.PrintErrAndExit(fmt.Sprintf("bind cpu core failed, get pid failed, pids: %v, err: %v", pids, err))
            }
            return pid
        }
    }
    return -1
}
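The rest of the related code is not shown here. In short, the tool ships a small program whose only job is to consume CPU, and the function above launches it in the background. The same effect can be achieved with a simple while loop.
4.2. Simulate high IO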
1. Simulation command
[root@7dgroup2 chaosblade-0.2.0]# ./blade create disk burn --write --read --size 10 --count 1024 --timeout 300
{"code":200,"success":true,"result":"f026b3510722685d"}
2. View the simulation effect
[root@7dgroup2 chaosblade-0.2.0]#
Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda        0.00  91.00 250.00  815.00  84892.00 92588.00   333.30    43.92 39.27   41.60   38.56  0.93 99.50
dm-0       0.00   0.00   0.00    0.00      0.00     0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
dm-7       0.00   0.00   0.00    0.00      0.00     0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda        1.00 105.00 496.00  865.00  98012.00 92692.00   280.24    43.72 34.02   33.40   34.37  0.73 99.40
dm-0       0.00   0.00   0.00    0.00      0.00     0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
dm-7       0.00   0.00   0.00    0.00      0.00     0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda        0.99 106.93 259.41  675.25  99853.47 91750.50   410.00    36.22 38.53   47.09   35.24  1.06 98.81
dm-0       0.00   0.00   0.00    0.00      0.00     0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
dm-7       0.00   0.00   0.00    0.00      0.00     0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda        0.00  80.00 241.00 1103.00 116340.00 82296.00   295.59    44.06 33.03   47.92   29.78  0.74 99.90
dm-0       0.00   0.00   0.00    0.00      0.00     0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
dm-7       0.00   0.00   0.00    0.00      0.00     0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
The results above show that the disk is indeed saturated: %util on vda stays close to 100. Let's see how the tool achieves this.
3. Implementation principle of high IO utilization
  TID  PRIO  USER   DISK READ   DISK WRITE  SWAPIN      IO>  COMMAND
24036  be/4  root  104.55 M/s     0.00 B/s  0.00 %  99.99 %  dd if=/dev/vda1 of=/dev/null~ iflag=dsync,direct,fullblock
24034  be/4  root    0.00 B/s   104.55 M/s  0.00 %  68.17 %  dd if=/dev/zero of=/tmp/chao~bs=10M count=1024 oflag=dsync
Looking at the processes with the highest IO, you can see two dd processes, one reading and one writing. In other words, chaosblade calls dd to simulate high IO. The key implementation code is as follows:
// write burn
func burnWrite(size, count string) {
    for {
        args := fmt.Sprintf(`if=/dev/zero of=%s bs=%sM count=%s oflag=dsync`, tmpDataFile, size, count)
        response := channel.Run(context.Background(), "dd", args)
        channel.Run(context.Background(), "rm", fmt.Sprintf(`-rf %s`, tmpDataFile))
        if !response.Success {
            bin.PrintAndExitWithErrPrefix(response.Err)
            return
        }
    }
}

// read burn
func burnRead(fileSystem, size, count string) {
    for {
        // "if" arg in dd command is file system value, but "of" arg value is related to mount point
        args := fmt.Sprintf(`if=%s of=/dev/null bs=%sM count=%s iflag=dsync,direct,fullblock`, fileSystem, size, count)
        response := channel.Run(context.Background(), "dd", args)
        if !response.Success {
            bin.PrintAndExitWithErrPrefix(fmt.Sprintf("The file system named %s is not supported or %s", fileSystem, response.Err))
        }
    }
}
5. ChaosBlade summary
chaosblade is best regarded as a tool set that bundles various small utilities.
The "chaos" label is still a little too grand for this tool. To simulate faults across thousands of nodes, it also needs to be combined with other tools for integration, configuration, remote execution and so on.
Let's look back at the principles of chaos engineering listed above. Are these simulations consistent with them? If you have experience troubleshooting production environments, you will know that this kind of simulation differs from how high CPU and high IO actually arise in a real environment.
- When we ask whether an application stays robust under high CPU, we can mean two things:
- 1. Whether the program under test stays robust when other programs consume a lot of CPU.
- 2. Whether the program under test stays robust when its own code consumes a lot of CPU.
Anyone who has dealt with similar production problems knows that the first situation almost never occurs, except under an unreasonable deployment. Yet this is the situation chaosblade actually simulates. The second situation is something chaosblade cannot simulate at present.
But the second situation is the one that testing should focus on.
In fact, the English word chaos simply means disorder or confusion, which is a very different concept from the Chinese word 混沌 (hundun) that is used to translate it. Rendering the term this way really cheapens the meaning of the Chinese word itself.