Write the correct micro benchmark

Posted by phpnewby1918 on Mon, 07 Mar 2022 16:56:22 +0100

JMH can help us better understand the code we write, but if the benchmark method itself is flawed, it offers little guidance and may even be misleading. So how do we avoid writing incorrect micro benchmark methods?

Modern Java virtual machines have become increasingly intelligent. They can optimize our code at compile time, at class-loading time, and later at run time: eliminating Dead Code, folding constants, unrolling loops, and even applying profile-guided optimizations. To master writing good micro benchmark methods, we first need to know what kinds of benchmark code are problematic.

1. Avoid DCE (Dead Code Elimination)

The so-called Dead Code Elimination means that the JVM erases code whose result is irrelevant to the surrounding context and will never be used:

public void test(){
    int x=10;
    int y=10;
    int z=x+y;
}

Since x, y, and z are used nowhere outside the test method and the result z never escapes, the JIT compiler can conclude that the addition has no observable effect. It therefore erases the computation of z, along with the assignments to x and y, entirely.
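To see the idea outside JMH, here is a minimal sketch (the class and method names are mine, purely illustrative) contrasting a method whose whole body is a DCE candidate with one whose result escapes to the caller:

```java
// Sketch: compute() has no observable effect, so the JIT may compile it
// down to an empty method. withResult() returns its value, so the
// computation must actually be performed.
public class DeadCodeSketch {

    // candidate for dead-code elimination: z never escapes this method
    static void compute() {
        int x = 10;
        int y = 10;
        int z = x + y; // result unused -> the whole body can be erased
    }

    // not dead code: the result is returned to the caller
    static int withResult() {
        int x = 10;
        int y = 10;
        return x + y;
    }

    public static void main(String[] args) {
        compute();                        // may cost nothing after JIT
        System.out.println(withResult()); // prints 20
    }
}
```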

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

import static java.lang.Math.PI;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(5)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class JMHExample13 {


    /**
     * An empty method, used as the baseline measurement.
     */
    @Benchmark
    public void test1(){
       // do nothing
    }

    /**
     * Although the log operation is performed, the result is neither reused nor returned.
     */
    @Benchmark
    public void test2(){
        Math.log(PI);
    }

    /**
     * Also performs the log operation. Although the first result is used as
     * the input to the second call, the second result is never used afterwards.
     */
    @Benchmark
    public void test3(){
        double res = Math.log(PI);
        Math.log(res);
    }

    /**
     * This method returns the operation result.
     * @return the result of the second log call
     */
    @Benchmark
    public double test4(){
        double res = Math.log(PI);
        return Math.log(res);
    }
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(JMHExample13.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}

Benchmark test results:

Benchmark           Mode  Cnt  Score    Error  Units
JMHExample13.test1  avgt   10  0.001 ±  0.001  us/op
JMHExample13.test2  avgt   10  0.001 ±  0.001  us/op
JMHExample13.test3  avgt   10  0.001 ±  0.001  us/op
JMHExample13.test4  avgt   10  0.006 ±  0.001  us/op

It can be seen that there is basically no difference among test1, test2, and test3, because the bodies of test2 and test3 have been erased. Such code is called Dead Code (code fragments whose results are not used elsewhere). test4 is different: because it returns its result, Math.log(PI) cannot be treated as dead code, so it consumes a measurable amount of CPU time.

If you want to write a well-behaved micro benchmark method, do not leave dead code in it; ideally, every benchmark method returns a value.

2. Use Blackhole

Suppose a benchmark method needs to expose two computation results; how should we do it? The first idea might be to store the results in an array or container and return that, but the write operations on the array or container also take CPU time and would interfere with the measurement.

JMH provides a class called Blackhole that can prevent dead-code elimination even when the method returns nothing. Blackhole literally translates as "black hole":

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

import static java.lang.Math.PI;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(5)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class JMHExample14 {
    /**
     * Dead code will be generated: although both intermediate results are used,
     * the final sum is never used.
     */
    @Benchmark
    public void test1(){
        double res1 = Math.log(PI);
        double res2 = Math.log(PI);
        double res = res1 + res2;
    }

    /**
     * The sum of the two results is returned, so no dead code is generated.
     */
    @Benchmark
    public double test2(){
        double res1 = Math.log(PI);
        double res2 = Math.log(PI);
        return res1 + res2;
    }

    /**
     * When there is no return value, use Blackhole to consume the results.
     */
    @Benchmark
    public void test3(Blackhole blackhole){
        blackhole.consume(Math.log(PI));
        blackhole.consume(Math.log(PI));
    }
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(JMHExample14.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}
Benchmark           Mode  Cnt  Score    Error  Units
JMHExample14.test1  avgt   10  0.001 ±  0.001  us/op
JMHExample14.test2  avgt   10  0.006 ±  0.001  us/op
JMHExample14.test3  avgt   10  0.011 ±  0.001  us/op

Blackhole can help you avoid the occurrence of DC (Dead Code) in benchmark methods without return value.

3. Avoid Constant Folding

Constant folding is an early, compile-time optimization of the Java compiler. While javac compiles the source file, its analysis can discover that certain constant expressions can be folded: the result is computed once and stored directly in the declaration, so nothing needs to be recomputed at execution time.

private final int x = 10;
private final int y = x*20;

In the compilation phase, y is assigned the value 200 directly; this is called constant folding.
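A minimal runnable sketch of this (class and field names are mine, purely illustrative): because X is a compile-time constant, javac computes X * 20 itself and stores 200 directly, so no multiplication happens at run time.

```java
// Sketch of constant folding: Y's value is computed by javac, not at run time.
public class FoldDemo {
    static final int X = 10;
    static final int Y = X * 20; // folded to 200 at compile time

    public static void main(String[] args) {
        System.out.println(Y); // prints 200
    }
}
```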

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(5)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class JMHExample15 {

    private final int x1 = 200;

    private final int x2 = 200;

    private int y1 = 200;

    private int y2 = 200;

    /**
     * Directly return the constant result, as a baseline for comparison.
     * @return the constant 200
     */
    @Benchmark
    public int test1(){

        return 200;
    }

    /**
     * Compute Math.sqrt(x1 * x2); x1 and x2 are final constants, so the
     * expression can be folded and needs no computation at run time.
     * @return the folded result
     */
    @Benchmark
    public double test2(){
        return Math.sqrt(x1 * x2);
    }

    /**
     * Compute Math.sqrt(y1 * y2); y1 and y2 are not final, so no folding occurs.
     * @return the computed result
     */
    @Benchmark
    public double test3(){
        return Math.sqrt(y1 * y2);
    }
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(JMHExample15.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}
Benchmark           Mode  Cnt  Score    Error  Units
JMHExample15.test1  avgt   10  0.006 ±  0.001  us/op
JMHExample15.test2  avgt   10  0.006 ±  0.001  us/op
JMHExample15.test3  avgt   10  0.012 ±  0.001  us/op

It can be seen that test1 and test2 score the same: in test2, constant folding happened at compile time, so nothing needs to be computed at run time. In test3, y1 and y2 are ordinary (non-final) variables, so no folding occurs and the computation must be performed at run time.

This is only a simple demonstration; the test case is too small to be fully persuasive on its own, but it illustrates the effect.

4. Avoid loop unrolling

Avoid or reduce loops in the benchmark method as much as possible, because loop code is likely to be optimized by the JIT compiler at run time (JVM late-stage optimization). This optimization is called loop unrolling:

int sum=0;
for(int i = 0;i<100;i++){
    sum+=i;
}

In the example above, sum += i is executed 100 times; that is, the JVM issues this computation instruction to the CPU 100 times. This seems harmless, but the JVM designers realized that such a loop can (possibly) be optimized into the following form:

int sum=0;
for(int i = 0;i<100; i+=5){
    sum+=i;
    sum+=i+1;
    sum+=i+2;
    sum+=i+3;
    sum+=i+4;
}

After the optimization, the instructions in the loop body are issued to the CPU in batches, which improves computational efficiency. Suppose a single addition takes 1 nanosecond of CPU time; we might expect 10 iterations to take 10 nanoseconds, but after unrolling the real cost can be noticeably less than that.
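The two loops above can be checked to be equivalent with a small sketch (class and method names are mine, purely illustrative):

```java
// Sketch: the simple loop and its manually unrolled form compute the same sum.
public class LoopUnrollDemo {

    // the original loop: one addition issued per iteration
    static int sumSimple() {
        int sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += i;
        }
        return sum;
    }

    // manually unrolled by a factor of 5: five additions per iteration
    static int sumUnrolled() {
        int sum = 0;
        for (int i = 0; i < 100; i += 5) {
            sum += i;
            sum += i + 1;
            sum += i + 2;
            sum += i + 3;
            sum += i + 4;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumSimple());   // prints 4950
        System.out.println(sumUnrolled()); // prints 4950
    }
}
```

The JIT performs a transformation of this kind automatically; the sketch only shows that the unrolled form is semantically identical.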

5. Use Fork to avoid profile-guided optimizations

Although Java supports multithreading, a single JVM runs everything in one process, so all the benchmark code executes in the same process. The same code run at different times may be affected by profile-guided optimizations accumulated in an earlier stage, and may even pick up profiling data gathered from other code, which can make our micro benchmark inaccurate.

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@Fork(0) // Fork set to 0
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(5)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class JMHExample16 {

    // The implementation of Inc1 and Inc2 is exactly the same
    interface Inc {
        int inc();
    }

    public static class Inc1 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    public static class Inc2 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    private Inc inc1 = new Inc1();
    private Inc inc2 = new Inc2();

    private int measure(Inc inc) {
        int result = 0;
        for (int i = 0; i < 10; i++) {
            result += inc.inc();
        }
        return result;
    }

    @Benchmark
    public int measure_inc_1() {
        return this.measure(inc1);
    }

    @Benchmark
    public int measure_inc_2() {
        return this.measure(inc2);
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(JMHExample16.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}

Test results:

Benchmark                   Mode  Cnt  Score   Error  Units
JMHExample16.measure_inc_1  avgt   10  0.007 ± 0.001  us/op
JMHExample16.measure_inc_2  avgt   10  0.034 ± 0.004  us/op

measure_inc_1 and measure_inc_2 are implemented identically, yet there is a big gap in their scores. This is caused by the JVM's profile-guided optimizations: since all the benchmark methods share the JVM process of the test program, profiling data from the whole process inevitably gets mixed in. When Fork is set to 1, a fresh JVM process is started for each benchmark run, so the benchmarks no longer interfere with one another.

Test results with Fork set to 1:

Benchmark                   Mode  Cnt  Score   Error  Units
JMHExample16.measure_inc_1  avgt   10  0.011 ± 0.001  us/op
JMHExample16.measure_inc_2  avgt   10  0.010 ± 0.001  us/op

Of course, you can set Fork to a value greater than 1; the benchmark will then be run several times in different processes. Generally, though, setting Fork to 1 is sufficient.