Although JMH can help us better understand the code we write, a badly written JMH benchmark method provides little useful guidance and may even be misleading. So how do we avoid writing incorrect microbenchmark methods?
The modern Java virtual machine has become increasingly intelligent. It optimizes our code during compilation, during class loading, and again at runtime: it eliminates dead code, folds constants, unrolls loops, and even performs profile-guided optimizations. To write good microbenchmark methods, we first need to know what kinds of benchmark code are problematic.
1. Avoid DCE (Dead Code Elimination)
So-called Dead Code Elimination means that the JVM erases code whose result is never used in the surrounding context:
public void test() {
    int x = 10;
    int y = 10;
    int z = x + y;
}
Since z is never used anywhere, the JVM can determine that computing z, and therefore x and y as well, has no effect on the outcome of the method. After dead code elimination, the test method no longer performs the addition at all; the code that computes z is simply erased.
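In other words, after this optimization the JIT compiler may effectively reduce test() to an empty method. The following is only an illustrative sketch of that outcome; what actually happens depends on the JVM:

// A sketch of what the JIT may effectively turn test() into after dead code elimination:
// z is never read, so the computation of z (and of x and y) can be erased entirely.
public void test() {
    // nothing left to execute
}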
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

import static java.lang.Math.PI;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(5)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class JMHExample13 {

    /**
     * An empty method, used as a baseline.
     */
    @Benchmark
    public void test1() {
        // do nothing
    }

    /**
     * A log operation is performed, but the result is neither reused nor returned.
     */
    @Benchmark
    public void test2() {
        Math.log(PI);
    }

    /**
     * Two log operations are performed; the first result feeds the second call,
     * but the final result is never used afterwards.
     */
    @Benchmark
    public void test3() {
        double res = Math.log(PI);
        Math.log(res);
    }

    /**
     * This method returns the result of the computation.
     */
    @Benchmark
    public double test4() {
        double res = Math.log(PI);
        return Math.log(res);
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(JMHExample13.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}
Benchmark test results:
Benchmark           Mode  Cnt  Score   Error  Units
JMHExample13.test1  avgt   10  0.001 ± 0.001  us/op
JMHExample13.test2  avgt   10  0.001 ± 0.001  us/op
JMHExample13.test3  avgt   10  0.001 ± 0.001  us/op
JMHExample13.test4  avgt   10  0.006 ± 0.001  us/op
It can be seen that there is basically no difference between test1, test2, and test3, because the code in test2 and test3 has been erased. Such code is called dead code (code whose result is not used anywhere else). test4 is different: because it returns its result, Math.log(PI) is not treated as dead code, so it consumes a measurable amount of CPU time.
If you want to write a correct microbenchmark method, do not let it contain dead code; ideally, every benchmark method should return a value.
2. Use Blackhole
Suppose that in a benchmark method we want to use two computed results as the return value. How should we do that? The first idea that comes to mind is probably to store the results in an array or container and return that, but writing to the array or container also takes CPU time and therefore interferes with the performance statistics.
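For example, a first attempt might look like the hypothetical sketch below (the method name is illustrative and this pattern is not recommended); the array allocation and stores are measured together with the two log operations:

// Hypothetical anti-pattern: returning both results in an array keeps them alive,
// but the array allocation and writes are measured together with the log operations.
@Benchmark
public double[] testReturnArray() {
    double[] results = new double[2];
    results[0] = Math.log(Math.PI);
    results[1] = Math.log(Math.PI);
    return results;
}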
JMH provides a class called Blackhole that lets you avoid dead code elimination without returning anything from the benchmark method. Blackhole literally means "black hole": whatever is handed to it is consumed and cannot be optimized away.
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

import static java.lang.Math.PI;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(5)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class JMHExample14 {

    /**
     * Dead code will be generated: although both intermediate results are used,
     * the final result is never consumed.
     */
    @Benchmark
    public void test1() {
        double res1 = Math.log(PI);
        double res2 = Math.log(PI);
        double res = res1 + res2;
    }

    /**
     * The sum of the two results is returned, so no dead code is generated.
     */
    @Benchmark
    public double test2() {
        double res1 = Math.log(PI);
        double res2 = Math.log(PI);
        return res1 + res2;
    }

    /**
     * When there is no return value, a Blackhole can consume the results instead.
     */
    @Benchmark
    public void test3(Blackhole blackhole) {
        blackhole.consume(Math.log(PI));
        blackhole.consume(Math.log(PI));
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(JMHExample14.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}
Benchmark           Mode  Cnt  Score   Error  Units
JMHExample14.test1  avgt   10  0.001 ± 0.001  us/op
JMHExample14.test2  avgt   10  0.006 ± 0.001  us/op
JMHExample14.test3  avgt   10  0.011 ± 0.001  us/op
Blackhole helps you avoid dead code elimination in benchmark methods that have no return value.
3. Avoid Constant Folding
Constant folding is an early compiler optimization. While javac compiles the source file, it can discover that some expressions consist only of constants and fold them, that is, compute the result once and store it directly in the generated code, so nothing needs to be recomputed at execution time.
private final int x = 10;
private final int y = x * 20;
In the compilation phase, y is assigned the value 200 directly; this is constant folding.
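In other words, the compiled class behaves as if the fields had been written in the following folded form (shown here only for illustration):

// Equivalent form after javac has folded the constant expression:
private final int x = 10;
private final int y = 200; // x * 20 computed at compile time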
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(5)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class JMHExample15 {

    private final int x1 = 200;
    private final int x2 = 200;

    private int y1 = 200;
    private int y2 = 200;

    /**
     * Directly returns the constant 200.
     */
    @Benchmark
    public int test1() {
        return 200;
    }

    /**
     * Computes Math.sqrt(x1 * x2); both operands are final.
     */
    @Benchmark
    public double test2() {
        return Math.sqrt(x1 * x2);
    }

    /**
     * Computes Math.sqrt(y1 * y2); the operands are not final.
     */
    @Benchmark
    public double test3() {
        return Math.sqrt(y1 * y2);
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(JMHExample15.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}
Benchmark           Mode  Cnt  Score   Error  Units
JMHExample15.test1  avgt   10  0.006 ± 0.001  us/op
JMHExample15.test2  avgt   10  0.006 ± 0.001  us/op
JMHExample15.test3  avgt   10  0.012 ± 0.001  us/op
It can be seen that the results of test1 and test2 are the same: in test2, constant folding happens at compile time, so there is nothing left to compute while the benchmark runs.
In test3, y1 and y2 are not final, so constant folding cannot occur and the computation must be performed at runtime.
This is only a demonstration; the test case is so simple that it may not be entirely convincing on its own.
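If you want to be more certain that the operands cannot be folded, one option is to initialize them at runtime through the standard @State/@Setup mechanism rather than in the field declarations, since javac cannot fold values that are only assigned at runtime. The class and method names below are illustrative:

@State(Scope.Thread)
public static class FoldingState {
    int y1;
    int y2;

    // Values assigned at runtime cannot be folded by javac.
    @Setup(Level.Trial)
    public void setUp() {
        y1 = 200;
        y2 = 200;
    }
}

@Benchmark
public double measureSqrt(FoldingState state) {
    return Math.sqrt(state.y1 * state.y2);
}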
4. Avoid Loop Unrolling
Avoid, or at least reduce, loops in benchmark methods as much as possible, because loop code is very likely to be optimized by the JIT compiler at runtime. This optimization is called loop unrolling.
int sum = 0;
for (int i = 0; i < 100; i++) {
    sum += i;
}
In the example above, sum += i is executed 100 times; that is, the JVM issues this computation to the CPU 100 times. This looks harmless, but the JVM can optimize the loop into something like the following form (one possibility):
int sum = 0;
for (int i = 0; i < 100; i += 5) {
    sum += i;
    sum += i + 1;
    sum += i + 2;
    sum += i + 3;
    sum += i + 4;
}
After this optimization, the computations in the loop body are issued to the CPU in batches, which improves efficiency. Suppose one addition takes 1 nanosecond of CPU time; we would expect a loop of 10 additions to take 10 nanoseconds, but after unrolling the real cost can be noticeably less than that.
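If a loop inside the benchmark method is unavoidable, JMH provides the @OperationsPerInvocation annotation to tell the harness how many operations one invocation actually performs, so the reported score is per operation rather than per invocation. Note that this only corrects the per-operation arithmetic; it does not stop the JIT from unrolling or otherwise optimizing the loop. A minimal sketch (the method name and loop body are illustrative):

@Benchmark
@OperationsPerInvocation(100)
public int measureLoop() {
    int sum = 0;
    // One invocation performs 100 additions; JMH divides the measured time by 100.
    for (int i = 0; i < 100; i++) {
        sum += i;
    }
    return sum;
}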
5. Use Fork to Avoid Profile-guided Optimizations
Java supports multiple threads but not multiple processes within one JVM, so all of the benchmark code runs in a single process. When the same code runs at different times, it may be affected by profiling data gathered earlier in that process, or even by profile-guided optimizations triggered by other code, which can make the microbenchmarks we write inaccurate.
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@Fork(0) // Fork set to 0
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(5)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class JMHExample16 {

    // The implementations of Inc1 and Inc2 are exactly the same
    interface Inc {
        int inc();
    }

    public static class Inc1 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    public static class Inc2 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    private Inc inc1 = new Inc1();
    private Inc inc2 = new Inc2();

    private int measure(Inc inc) {
        int result = 0;
        for (int i = 0; i < 10; i++) {
            result += inc.inc();
        }
        return result;
    }

    @Benchmark
    public int measure_inc_1() {
        return this.measure(inc1);
    }

    @Benchmark
    public int measure_inc_2() {
        return this.measure(inc2);
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(JMHExample16.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}
Test results:
Benchmark                   Mode  Cnt  Score   Error  Units
JMHExample16.measure_inc_1  avgt   10  0.007 ± 0.001  us/op
JMHExample16.measure_inc_2  avgt   10  0.034 ± 0.004  us/op
The implementations behind measure_inc_1 and measure_inc_2 are exactly the same, yet there is a large gap in their measured performance.
This is caused by the JVM's profile-guided optimizations. Because all of the benchmark methods share the JVM process of the test program itself, profiling data from earlier benchmarks inevitably leaks into later ones. When Fork is set to 1, a fresh JVM process is started for each benchmark run, so the benchmarks no longer interfere with one another.
Test results with Fork set to 1:
Benchmark                   Mode  Cnt  Score   Error  Units
JMHExample16.measure_inc_1  avgt   10  0.011 ± 0.001  us/op
JMHExample16.measure_inc_2  avgt   10  0.010 ± 0.001  us/op
Of course, you can set Fork to a value greater than 1, in which case the benchmark runs multiple times in separate processes, but in general setting Fork to 1 is enough.
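The fork count can also be set programmatically when building the options, instead of through the annotation. A sketch, assuming the same JMHExample16 class as above:

Options options = new OptionsBuilder()
        .include(JMHExample16.class.getSimpleName())
        .forks(1) // run the benchmarks in a freshly forked JVM process
        .build();
new Runner(options).run();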