Performance optimization

Posted by garrywinkler on Sat, 09 Oct 2021 08:14:00 +0200

Performance optimization related concepts

How to understand JDK, JRE, and JVM?

  • JDK (Java Development Kit): the Java development toolkit. It includes the compiler (javac), which compiles source code into bytecode (.class files), not into platform-specific machine code.

  • JRE (Java Runtime Environment): the Java runtime environment. It can only run .class files; it cannot compile source code.

    JRE + the Java tools (javac, java, jdb, etc.) + the basic Java class library (the Java API, including rt.jar) is equivalent to the JDK.
    Here rt.jar is the basic class library (as packaged up to JDK 8).
    
  • JVM (Java Virtual Machine): the Java virtual machine. It loads and executes compiled .class files.

    JVM + the class library (lib) is equivalent to the JRE.
    
  • Simply put, the JDK contains the JRE, and the JRE contains the JVM.
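
The JDK/JRE/JVM layering described above can be observed from inside a running program. The following is a minimal sketch, not a definitive check: the class name RuntimeInfo and its helper methods are made up for illustration, while System.getProperty and javax.tools.ToolProvider are standard library APIs. On a full JDK a system compiler is normally available; on a bare JRE it typically is not.

```java
import javax.tools.ToolProvider;

public class RuntimeInfo {
    /** Returns the system property, or "unknown" when the JVM does not define it. */
    static String property(String key) {
        return System.getProperty(key, "unknown");
    }

    /** True when a Java compiler is available at run time (usually a JDK, not a bare JRE). */
    static boolean hasCompiler() {
        return ToolProvider.getSystemJavaCompiler() != null;
    }

    public static void main(String[] args) {
        // Where the runtime is installed and which Java version it implements.
        System.out.println("java.home    = " + property("java.home"));
        System.out.println("java.version = " + property("java.version"));
        // The virtual machine that is actually executing the bytecode.
        System.out.println("java.vm.name = " + property("java.vm.name"));
        // A compiler is part of the JDK tooling, not of the JVM itself.
        System.out.println("compiler available: " + hasCompiler());
    }
}
```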

How should we understand Java's "compile once, run anywhere"?

  • Machine code (native code): the language that the CPU can read directly. It sits at the bottom layer, so it is fast but hard for humans to read.

  • Instruction sets and system interfaces differ between platforms, so machine code built for one operating system and CPU does not run on another. (Different "languages"!)

  • The JDK (Java Development Kit) is platform-specific: each operating system has its own JDK build, targeting that platform's machine code.

  • JVM (Java Virtual Machine): it runs on top of the operating system. Together with bytecode files, it hides the differences between the underlying hardware and instructions of different platforms at the software level.

  • Java bytecode file (.class file): binary intermediate code. The Java compiler converts source code into bytecode, and the interpreter embedded in the virtual machine converts bytecode into machine code. Bytecode runs on the JVM, has a unified format, and complies with the JVM specification.

  • Therefore, operating system <-> machine code <-> JDK is a one-to-one correspondence, while bytecode can be understood as a higher-level abstract standard (the JVM specification) that unifies the machine instructions of different platforms. The JVM is the translator of bytecode, and each platform's JDK ships the JVM that translates bytecode into that platform's instructions.

  • Executing instructions has changed from every platform speaking its own dialect to everyone speaking Mandarin. The cost is an extra bytecode layer (a performance penalty). The benefit is that code is compiled once and runs everywhere (a convenience bonus). It is a trade-off, and a kind of wisdom.
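
To make the "unified standard" of bytecode concrete, here is a small sketch that reads the header of a .class file. Every class file, on every platform, starts with the same magic number 0xCAFEBABE followed by a minor and a major format version. The class name ClassFileHeader and the method readHeader are illustrative names, not part of any library; the example reads the class file of java.lang.String because it is present in every runtime.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ClassFileHeader {
    /** Returns {magic, minorVersion, majorVersion} read from the class file of the given type. */
    static long[] readHeader(Class<?> type) {
        String resource = "/" + type.getName().replace('.', '/') + ".class";
        try (InputStream in = type.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("class file not found: " + resource);
            }
            DataInputStream data = new DataInputStream(in);
            long magic = data.readInt() & 0xFFFFFFFFL; // first four bytes of every class file
            int minor = data.readUnsignedShort();      // minor format version
            int major = data.readUnsignedShort();      // major format version
            return new long[] {magic, minor, major};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] header = readHeader(String.class);
        System.out.printf("magic=0x%X major=%d minor=%d%n", header[0], header[2], header[1]);
    }
}
```

The magic number is identical on every operating system; only the JVM that interprets the rest of the file is platform-specific.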

Is Java compiled or interpreted?

  • Compiled execution: the source code of a high-level language is compiled as a whole into a target program in machine language, and that target program is what gets executed. It feels like batch preprocessing: all the preparation is done once up front, so the execution phase is fast. In the JVM, the compiler gradually comes into play as the program runs: guided by hot spot detection, it compiles the bytecode that is worth the effort into native machine instructions in exchange for higher execution efficiency.
  • Interpreted execution: the code is translated and executed statement by statement, and no target program is produced. In Java, the source is first compiled as a whole into a bytecode file, and the bytecode is then translated into machine code statement by statement. The advantages are "compile once, run anywhere" (a convenience bonus), no compilation delay at startup so execution begins immediately, and a smaller memory footprint. The disadvantages are the extra bytecode layer (a performance penalty) and, on the security side, that bytecode files are easier to decompile and crack, because bytecode is easier to understand than machine instructions. When a program needs to start quickly, the interpreter goes first. It is also the fallback: when an aggressive compiler optimization turns out to be invalid, the JVM can deoptimize and return to interpreted execution.
  • Java uses both compiled and interpreted execution. Why? Read on.

What is JIT?

  • If generating .class files is regarded as compilation (even though it does not produce machine instructions or an object file), the process can loosely be understood as compiled execution. However, the .class files themselves are interpreted, that is, translated statement by statement.
  • If a given piece of a .class file is interpreted and executed only once, the cost is negligible.
  • However, some code is executed many times, such as loop bodies and other hot code blocks. Repeating the translation from bytecode to machine instructions every time is wasteful. How to solve this? Cache.
  • Since the code keeps being called, compile it directly into machine code, cache it, and reuse it on every call.
  • This is just-in-time compilation, JIT for short.
  • So the question is: how do we decide whether something is worth caching? How far should JIT go? (The cache is a resource too!)
  • We cannot know in advance which code is hot, so we can only collect statistics and decide during execution.
  • For example, in a for loop, once the iteration count reaches a certain threshold (tracked by a counter), the loop is recognized as hot code, compiled into machine code, and cached. After that, the remaining iterations use the cached machine code directly instead of being interpreted each time. The scenario in this example is on-stack replacement (OSR). The counter that counts iterations of for or while loops is called the back-edge counter; the counter that counts how many times a method is called is called the method invocation counter. The default threshold is 1500 in Client mode and 10000 in Server mode, and it can be changed manually (with -XX:CompileThreshold). Detecting hot spots with counters is called counter-based hot spot detection, and it is the method used by the HotSpot virtual machine.
  • By contrast, some other JVMs detect hot spots by periodically sampling the top of each thread's stack. If a method is often found at the top of the stack, in other words it is called frequently and keeps being pushed and popped, it is considered hot code. The drawback is that this cannot measure precisely how hot a method is, and it is easily disturbed by thread blocking and other factors. This is called sample-based hot spot detection.
  • In fact, the HotSpot virtual machine has two built-in just-in-time compilers, the Client Compiler and the Server Compiler, or C1 and C2 for short. By default, HotSpot runs the interpreter together with one of the JIT compilers; which compiler is chosen depends on the virtual machine's running mode. You can also use the -client and -server parameters to force Client mode or Server mode. This combination of interpreter and compiler is called Mixed Mode. The user can pass -Xint to force Interpreted Mode, in which the compiler does not take part at all, or -Xcomp to force Compiled Mode, in which compilation is preferred but the interpreter still steps in when compilation cannot be carried out. You can view the current default running mode with the java -version command.
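
The warm-up effect of counter-based hot spot detection can be observed with a small experiment. The sketch below is illustrative only: HotLoopDemo, sumOfSquares, and jitName are made-up names, and the timings depend entirely on the machine and JVM. It calls a small method many times and prints the time per round; on a typical HotSpot JVM in mixed mode the later rounds tend to be faster than the first, while running the same class with -Xint usually keeps every round slow. Adding -XX:+PrintCompilation shows when the method gets compiled.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class HotLoopDemo {
    /** A small method whose loop back edge is taken n times per call. */
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    /** Name of the JIT compiler, or "none" when the JVM reports no compilation system. */
    static String jitName() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        return jit == null ? "none" : jit.getName();
    }

    public static void main(String[] args) {
        System.out.println("JIT compiler: " + jitName());
        long checksum = 0;
        for (int round = 1; round <= 5; round++) {
            long start = System.nanoTime();
            // Many calls push the invocation counter past the compile threshold.
            for (int call = 0; call < 20_000; call++) {
                checksum += sumOfSquares(1_000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("round " + round + ": " + micros + " us");
        }
        // Printing the checksum keeps the work from being optimized away entirely.
        System.out.println("checksum: " + checksum);
    }
}
```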

Why does Java need compilation and execution?

Ahead-of-time compilation (Ahead Of Time, AOT)
Just-in-time compilation (Just In Time, JIT)
Interpreter
Compiler
Client compiler (Client Compiler, C1)
Server compiler (Server Compiler, C2, also called the Opto compiler)
Graal compiler (appeared in JDK 10 as an experimental replacement for C2)
Mixed mode (Mixed Mode): the default
Interpreted mode (Interpreted Mode): -Xint
Compiled mode (Compiled Mode): -Xcomp

Interpreted execution: fast startup.
C1 compiled execution: faster warm-up, faster than the interpreter once running, and fast compilation. (basic compilation optimization)
C2 compiled execution: slower warm-up, but faster at run time and relatively better code quality. (thorough compilation optimization)

Java programs are initially executed by the interpreter. When the virtual machine finds that a method or code block is executed very frequently, it recognizes this code as "hot code" (Hot Spot Code). To make hot code run more efficiently, the virtual machine compiles it into native machine code at run time and optimizes it as much as possible by various means. The back-end compiler that does this work at run time is called the just-in-time compiler.

Static compilation: one-time compilation. All modules are compiled at build time.
Dynamic compilation: on-demand compilation. While the program runs, whichever module is used gets compiled.
  • Tiered compilation: before tiered compilation appeared, the HotSpot virtual machine usually paired the interpreter directly with one of the compilers. To strike the best balance between startup responsiveness and running efficiency, HotSpot added tiered compilation to its compilation subsystem. (It controls how much optimization is done up front.)

    level 0: interpreted execution by the interpreter.
    level 1: C1 compilation, no profiling (performance monitoring).
    level 2: C1 compilation, profiling only method invocation counts and loop back-edge counts.
    level 3: C1 compilation; in addition to the level 2 profiling, it also profiles branches (for branch-jump bytecodes) and receiver types (for member method calls and type checks, such as the checkcast, instanceof, and aastore bytecodes).
    level 4: C2 compilation.
    

What is escape analysis?

  • What is escape analysis? Put simply, it is the analysis of the scope and lifetime of a pointer.

    When a variable (or object) is allocated in a subroutine, a pointer to it may escape to other threads of execution or be returned to the calling subroutine. If a subroutine allocates an object and returns a pointer to it, it can no longer be determined where in the program the object will be accessed, so the pointer has "escaped". If the pointer is stored in a global variable or another data structure, the pointer also escapes, because the global variable can be accessed outside the current subroutine.

    Escape analysis determines all the places where a pointer can be stored, and whether the pointer's lifetime can be guaranteed to stay within the current procedure or thread.

  • When is escape analysis performed?

    In principle, escape analysis can be done at compile time, but Java's separate compilation and dynamic loading make escape analysis during early static compilation harder or less profitable. Therefore, Java's escape analysis currently happens only during JIT compilation: by then enough run-time data has been collected for the JVM to judge more reliably whether an object escapes.

  • What is the purpose of escape analysis? Once it is established that an object does not escape, the compiler can use the results of escape analysis for several code optimizations.

    Optimization 1: convert heap allocation into stack allocation.
    Explanation: when the method finishes, its stack frame is popped and the object is reclaimed automatically. There is no need to wait for memory to fill up and trigger a collection. Memory is reclaimed efficiently, GC runs less often, and program performance improves. The precondition is that the pointer to the object never escapes.
    
    Optimization 2: lock elimination (also called synchronization elision).
    Explanation: if an object is found to be accessible from only one thread, operations on that object do not need to be synchronized.
    
    Optimization 3: object decomposition, or scalar replacement.
    Explanation: put simply, the object is broken down into primitive-typed fields, which are no longer allocated on the heap but on the stack. This has the following advantages:
        1. Lower memory usage, because no object header is generated.
        2. Memory is reclaimed efficiently and GC runs less often. Overall, the effect is similar to that of stack allocation above.
        3. Some objects do not need to exist as a contiguous memory structure in order to be accessed. Part (or all) of such an object can then be kept in CPU registers instead of memory.
    
    Note: Java's escape analysis works at the method level, because JIT compilation works at the method level.
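
The three cases discussed above (no escape, escape to the caller, escape to a global) can be sketched in code. EscapeDemo, Point, and the method names below are illustrative; whether the JIT actually removes an allocation or a lock depends on the JVM, its flags, and how hot the code is. On HotSpot the related options are -XX:+DoEscapeAnalysis, -XX:+EliminateAllocations, and -XX:+EliminateLocks.

```java
public class EscapeDemo {
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    /** A static field outlives any single method call. */
    static Point lastPoint;

    /** The Point never leaves this method, so it is a candidate for scalar replacement. */
    static int noEscape(int x, int y) {
        Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }

    /** The Point is returned to the caller, so it escapes this method. */
    static Point escapesByReturn(int x, int y) {
        return new Point(x, y);
    }

    /** The Point is stored in a static field, so it escapes globally. */
    static void escapesToField(int x, int y) {
        lastPoint = new Point(x, y);
    }

    /** The StringBuffer is visible to one thread only, so its synchronization is a candidate for lock elimination. */
    static String localLock(String a, String b) {
        StringBuffer buffer = new StringBuffer();
        buffer.append(a).append(b); // append is synchronized, but no other thread can see buffer
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(noEscape(3, 4));
        System.out.println(escapesByReturn(1, 2).y);
        escapesToField(5, 6);
        System.out.println(lastPoint.x);
        System.out.println(localLock("es", "cape"));
    }
}
```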
    
  • Differences between the stack and the heap:

    The stack and the heap are both places in RAM where Java stores data. Unlike C++, Java manages the stack and the heap automatically; programmers cannot set up the stack or the heap directly.
    
    The advantage of the stack is that access is faster than the heap, second only to registers, and stack data can be shared.
    The disadvantage is that the size and lifetime of data on the stack must be known in advance, so it lacks flexibility.
    
    The stack mainly stores variables of primitive types (int, short, long, byte, float, double, boolean, char) and object references.
    For an object, what the stack holds is the starting address of the object in the heap, that is, a reference variable.
    
    Heap memory allocated in Java is automatically initialized. Storage for all Java objects is allocated on the heap, but the reference to the object is allocated on the stack. That is, creating an object allocates memory in two places: the memory allocated on the heap holds the object itself, while the memory allocated on the stack holds only a pointer (reference variable) to that heap object. Memory allocated on the heap is managed by the Java virtual machine's automatic garbage collector.
    
    Reference variables are ordinary variables: they are allocated on the stack when defined and released once execution leaves their scope. Arrays and objects themselves are allocated on the heap. Even after execution leaves the code block containing the new statement that created them, the memory they occupy is not released. An array or object becomes garbage only when no reference variable points to it any longer. At that point it can no longer be used, but it still occupies memory until the garbage collector reclaims it at some indeterminate later time.
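
The difference between a primitive on the stack and a reference to a heap object shows up when values are passed to a method. In this minimal sketch (StackHeapDemo, bump, and run are illustrative names), the primitive is copied, so the caller's variable is unchanged, while the copied reference still points at the same heap array, so the change is visible to the caller.

```java
public class StackHeapDemo {
    /** Receives a copy of the primitive value; the caller's variable is unaffected. */
    static void bump(int value) {
        value++;
    }

    /** Receives a copy of the reference; both copies point at the same heap array. */
    static void bump(int[] values) {
        values[0]++;
    }

    /** Returns {local, onHeap[0]} after both bump calls. */
    static int[] run() {
        int local = 1;        // primitive value, lives in the stack frame
        int[] onHeap = {1};   // the array lives on the heap, onHeap is a reference
        int[] alias = onHeap; // a second reference to the same heap object
        bump(local);
        bump(alias);
        return new int[] {local, onHeap[0]};
    }

    public static void main(String[] args) {
        int[] result = run();
        System.out.println("local = " + result[0] + ", onHeap[0] = " + result[1]);
    }
}
```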
    

What is method inlining?

  • Method inlining means that at run time the JVM replaces a call to a method, once its call count reaches a certain threshold, with the method body itself. This eliminates the call overhead and lays the groundwork for further optimizations.

  • Method inlining is done by the JIT compiler at run time. Since compilation is involved, method inlining has its own overhead, in both CPU time and memory.

  • With inline functions in general, the compiler replaces each call expression with the function body at compile time. Because the body is copied into the program at every call site, the target code grows, and so does the space overhead. Inlining trades larger object code for saved time.
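
What inlining does can be shown by writing the result by hand. In the sketch below (InlineDemo and its methods are illustrative names), totalArea calls a small accessor-style method in a loop, and totalAreaInlinedByHand is roughly what the code looks like after the JIT has substituted the method body at the call site. On HotSpot, inlining decisions can be printed with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining; small methods (by default up to 35 bytes of bytecode, see -XX:MaxInlineSize) are typical candidates.

```java
public class InlineDemo {
    private final int width;
    private final int height;

    InlineDemo(int width, int height) {
        this.width = width;
        this.height = height;
    }

    /** A tiny method: a typical inlining candidate once the call site is hot. */
    int area() {
        return width * height;
    }

    /** Calls area() for every element; each call has invocation overhead until inlined. */
    static long totalArea(InlineDemo[] shapes) {
        long sum = 0;
        for (InlineDemo shape : shapes) {
            sum += shape.area();
        }
        return sum;
    }

    /** The same loop with the method body substituted at the call site by hand. */
    static long totalAreaInlinedByHand(InlineDemo[] shapes) {
        long sum = 0;
        for (InlineDemo shape : shapes) {
            sum += shape.width * shape.height;
        }
        return sum;
    }

    public static void main(String[] args) {
        InlineDemo[] shapes = {new InlineDemo(2, 3), new InlineDemo(4, 5)};
        System.out.println(totalArea(shapes) + " " + totalAreaInlinedByHand(shapes));
    }
}
```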

What are CMS and G1?

  • CMS (Concurrent Mark Sweep): a collector whose goal is the shortest possible collection pause. It is based on a concurrent "mark-sweep" algorithm.

    Steps:
    1. Initial mark: takes exclusive use of the CPU (STW). Marks only the objects directly reachable from the GC roots. (one step out from the root nodes: initial diffusion)
    2. Concurrent mark: runs concurrently with user threads and marks all reachable objects. (overall diffusion)
    3. Remark: takes exclusive use of the CPU (STW). Corrects the marks for objects that changed because user threads kept running during concurrent marking. (catching up on changes made during diffusion)
    4. Concurrent sweep: runs concurrently with user threads and cleans up garbage. (cleaning)
    
    Advantages:
    1. Concurrency.
    2. Low pause.
    
    Disadvantages:
    1. Very sensitive to CPU resources: the concurrent phases do not pause user threads, but they slow the application down because they occupy some of the threads. (heavy CPU use)
    2. Cannot handle floating garbage: during the final concurrent sweep, user threads keep running and produce new garbage. Because this garbage appears after marking, it has to wait for the next GC. It is called floating garbage.
    3. CMS uses the "mark-sweep" approach, which produces a lot of space fragmentation. With too many fragments, allocating large objects becomes troublesome: the old generation often has plenty of free space overall, yet no contiguous block large enough for the current object, so a Full GC has to be triggered early. To solve this, CMS provides a switch parameter (-XX:+UseCMSCompactAtFullCollection) that makes it merge and compact memory fragments when it can no longer cope and has to perform a Full GC. The compaction cannot run concurrently, so the fragmentation goes away but the pause gets longer. (Mark-sweep causes fragmentation, which affects the old generation: no contiguous space is left for allocation, and single-threaded compaction causes a noticeable pause.)
    
    Reasons CMS falls back to a Full GC:
    1. The old generation does not have enough contiguous space for objects promoted from the young generation, most likely because of memory fragmentation.
    2. During the concurrent phase, the JVM estimates that the heap will fill up before the concurrent cycle finishes, so it triggers a Full GC early.
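
The mark and sweep steps above can be illustrated with a toy, single-threaded model. This is only a sketch of the mark-sweep idea, not of how CMS is implemented: there is no concurrency, no remark phase, and no generations, and the names ToyMarkSweep, Node, allocate, mark, and sweep are invented for the example. It does show why unreachable cycles are collected and why sweeping without compaction leaves holes where objects used to be.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ToyMarkSweep {
    static final class Node {
        final String name;
        final List<Node> references = new ArrayList<>();
        boolean marked;

        Node(String name) {
            this.name = name;
        }
    }

    final List<Node> heap = new ArrayList<>();  // every allocated object
    final List<Node> roots = new ArrayList<>(); // stand-in for the GC roots

    Node allocate(String name) {
        Node node = new Node(name);
        heap.add(node);
        return node;
    }

    /** Mark phase: flag everything reachable from the roots. */
    void mark() {
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            if (node.marked) {
                continue; // already visited, also protects against cycles
            }
            node.marked = true;
            pending.addAll(node.references);
        }
    }

    /** Sweep phase: drop unmarked objects, reset marks, and return how many were reclaimed. */
    int sweep() {
        int before = heap.size();
        heap.removeIf(node -> !node.marked);
        for (Node node : heap) {
            node.marked = false;
        }
        return before - heap.size();
    }

    /** Builds a -> b (reachable) and c <-> d (unreachable cycle), then collects. */
    static int demo() {
        ToyMarkSweep gc = new ToyMarkSweep();
        Node a = gc.allocate("a");
        Node b = gc.allocate("b");
        Node c = gc.allocate("c");
        Node d = gc.allocate("d");
        gc.roots.add(a);
        a.references.add(b);
        c.references.add(d);
        d.references.add(c);
        gc.mark();
        return gc.sweep();
    }

    public static void main(String[] args) {
        System.out.println("reclaimed objects: " + demo());
    }
}
```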
    
  • G1: a garbage collector aimed at server-side applications.

    Characteristics:
    1. Parallelism and concurrency: G1 can take full advantage of multi-CPU, multi-core hardware, using multiple CPUs (CPUs or CPU cores) to shorten stop-the-world pauses. Some GC actions that would require other collectors to pause Java threads can be performed by G1 concurrently, so the Java program keeps running.
    2. Generational collection: the generational concept is still preserved in G1. Although G1 can manage the whole GC heap on its own without cooperating with other collectors, it handles newly created objects differently from old objects that have existed for a while and survived multiple GCs, in order to get better collection results. In other words, G1 manages both the young generation and the old generation by itself.
    3. Space compaction: because G1 uses the concept of independent regions (Region), it is based on the "mark-compact" algorithm when viewed as a whole and on the "copying" algorithm when viewed locally (between two Regions). Either way, both algorithms mean that G1 does not produce memory fragmentation while running.
    4. Predictable pauses: this is another advantage of G1 over CMS. Reducing pause time is a goal that G1 and CMS share, but besides pursuing low pauses, G1 can also build a predictable pause-time model that lets users explicitly specify that within any time slice of M milliseconds, no more than N milliseconds may be spent on garbage collection.
    
    Compared with other collectors, the big change in G1 is that it divides the whole Java heap into multiple independent regions (Region) of equal size. The concepts of young generation and old generation are retained, but the two are no longer physically separated: each is a set of Regions, which need not be contiguous. To avoid full-heap scans, G1 uses Remembered Sets to manage the relevant object reference information. During memory reclamation, adding the Remembered Set to the enumeration range of the GC roots ensures nothing is missed without scanning the whole heap.
    
    Leaving aside the operations that maintain the Remembered Set, the operation of the G1 collector can be roughly divided into the following steps:
    1. Initial marking (Initial Marking)
    2. Concurrent marking (Concurrent Marking)
    3. Final marking (Final Marking)
    4. Screening and evacuation (Live Data Counting and Evacuation)
    
    This looks somewhat similar to how the CMS collector operates, and indeed it is.
    1. The initial marking stage only marks the objects directly reachable from the GC roots and updates the value of TAMS (Next Top at Mark Start), so that when user programs run concurrently in the next stage they can create new objects in the correct available Region. This stage pauses threads, but only very briefly.
    2. The concurrent marking stage starts from the GC roots, analyzes the reachability of objects in the heap, and finds the live objects. This stage takes a long time, but it runs concurrently with user threads.
    3. The final marking stage merges the data from the Remembered Set Logs into the Remembered Set. This stage pauses threads, but it can be executed in parallel.
    4. Finally, the screening and evacuation stage sorts the Regions by reclamation value and cost, then builds a collection plan based on the GC pause time the user expects. This stage also pauses threads. Sun disclosed that this stage could in fact run concurrently with user programs, but since pausing threads greatly improves collection efficiency, pausing was chosen.
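
To see which collector a given JVM is actually using, the standard management API can be queried at run time. This is a small sketch; GcInfo and collectorNames are illustrative names, and the bean names that get printed depend on the JVM version and flags. On HotSpot, G1 is selected with -XX:+UseG1GC and its pause goal is set with -XX:MaxGCPauseMillis, while CMS was selected with -XX:+UseConcMarkSweepGC until it was removed in JDK 14.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcInfo {
    /** Names of the garbage collectors registered in this JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and accumulated time of collections since JVM start (-1 if undefined).
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
    }
}
```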
    

What is ASM (Assembly)?

  • JDK dynamic proxy: easy to use, but it can only proxy interfaces.

  • CGLIB: dynamic proxying implemented with ASM.

  • JIT: just-in-time compilation, which caches the machine code of hot code to avoid the waste of repeatedly interpreting bytecode.

  • If JIT is a trick for skipping repeated bytecode interpretation, ASM gives you full control over the bytecode itself.

  • ASM can modify existing class files or generate class files dynamically. It is a general-purpose Java bytecode manipulation and analysis framework.

    ASM is an open source Java framework for analyzing, creating, and modifying bytecode. It can dynamically generate stub classes or other proxy classes in binary form, or dynamically modify classes before the Java virtual machine loads them into memory.
    
    Source code analysis: ASM's API is built on the Visitor pattern.
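
For comparison with bytecode-level tools, here is a minimal sketch of the JDK dynamic proxy mentioned in the first bullet above. It uses only java.lang.reflect.Proxy from the standard library; the interface Greeter and the names ProxyDemo, logged, and calls are made up for the example. Note that the proxied type has to be an interface, which is exactly the limitation that ASM-based tools such as CGLIB work around by generating subclasses.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

public class ProxyDemo {
    /** JDK proxies can only implement interfaces, so the target type is an interface. */
    public interface Greeter {
        String greet(String name);
    }

    /** Records the name of every method invoked through a proxy. */
    static final List<String> calls = new ArrayList<>();

    /** Wraps the target in a proxy that logs each call before delegating to it. */
    static Greeter logged(Greeter target) {
        InvocationHandler handler = (proxy, method, args) -> {
            calls.add(method.getName());
            return method.invoke(target, args);
        };
        return (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[] {Greeter.class},
                handler);
    }

    public static void main(String[] args) {
        Greeter greeter = logged(name -> "Hello, " + name);
        System.out.println(greeter.greet("JVM"));
        System.out.println("intercepted calls: " + calls);
    }
}
```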
    

Remarks:

  1. Install the ASM Bytecode Outline plug-in in IDEA. Open a class file, right-click, and select Show Bytecode Outline to view the generated bytecode in the tool window on the right.
  2. Compile to a class file with javac -g Test.java, then inspect the class file format with javap -verbose Test.class.

The JVM uses a stack-based instruction set. How does it compare with a register-based instruction set?

1. The main advantage of a stack-based instruction set is portability. The disadvantage is slower execution, since the same operation takes more instructions.

2. Registers are provided directly by the hardware, so programs that depend directly on hardware registers are inevitably constrained by that hardware.

3. Although stack-architecture code is very compact, it generally needs more instructions than register-architecture code to do the same work, because the push and pop operations themselves add a considerable number of instructions. More importantly, the stack is implemented in memory, so frequent stack access means frequent memory access, and relative to the processor, memory is always the bottleneck for execution speed.
  • Features of the stack-based architecture:

    1. Simpler to design and implement; suitable for resource-constrained systems.
    2. It avoids the register allocation problem by using zero-address instructions.
    3. Most instructions in the instruction stream are zero-address instructions whose execution relies on the operand stack. The instruction set is smaller and the compiler is easy to implement.
    4. No hardware support is required, so portability is better and cross-platform support is easier.
    
  • Features of the register-based architecture:

    1. Typical examples are the x86 binary instruction set used by traditional PCs and Android's Dalvik virtual machine.
    2. The instruction set architecture depends entirely on the hardware and has poor portability.
    3. Excellent performance and more efficient execution.
    4. It takes fewer instructions to complete an operation.
    5. In most cases, instruction sets based on the register architecture are dominated by one-address, two-address, and three-address instructions, while instruction sets based on the stack architecture are dominated by zero-address instructions.
    6. Stack-architecture instructions are 8 bits wide, while register-architecture instructions are 16 bits wide. Each stack-architecture instruction is therefore smaller, but the register architecture needs fewer instructions overall.
    
    Note: a zero-address instruction has only an opcode and no operands.
    Although the zero-address instructions used by a stack-based virtual machine are more compact, completing an operation inevitably takes more push and pop instructions, which means more instruction dispatches and more memory reads/writes. Since memory access is a major bottleneck for execution speed, a register-based virtual machine, although each of its instructions takes more space, can generally complete an operation with fewer instructions, and so needs fewer instruction dispatches and memory reads/writes.
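
A tiny stack machine makes the zero-address idea concrete. The sketch below is a toy, not the JVM's instruction set: TinyStackVm and the opcodes PUSH, ADD, and MUL are invented for illustration. ADD and MUL carry no operands at all, because their inputs are implicitly the top two stack entries, which is why (2 + 3) * 4 takes five instructions here. A three-address register machine could express the same computation in two instructions, along the lines of "add r3, r1, r2" and "mul r5, r3, r4", at the cost of wider instructions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class TinyStackVm {
    static final int PUSH = 0; // followed by one literal operand in the code stream
    static final int ADD = 1;  // zero-address: pops two values, pushes their sum
    static final int MUL = 2;  // zero-address: pops two values, pushes their product

    /** Interprets the code stream and returns the value left on top of the stack. */
    static int run(int... code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc++];
            switch (opcode) {
                case PUSH:
                    stack.push(code[pc++]);
                    break;
                case ADD:
                    stack.push(stack.pop() + stack.pop());
                    break;
                case MUL:
                    stack.push(stack.pop() * stack.pop());
                    break;
                default:
                    throw new IllegalArgumentException("unknown opcode: " + opcode);
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // (2 + 3) * 4 as stack code: five instructions, three of them only to load operands.
        int result = run(PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL);
        System.out.println(result);
    }
}
```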
    

Other

JVM memory area diagram

JVM performance tuning method

JVM garbage collection mechanism

Specific cases of performance optimization

Java memory usage details

  1. CMS Old Gen
  2. Par Eden Space 8
  3. MetaSpace
  4. Compressed Class Space
  5. Code Cache
  6. Par Survivor Space 1
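
The entries in this list correspond to the memory pools that the JVM reports through its management API. The following sketch lists them for the running JVM; MemoryPools, describe, and poolCount are illustrative names, and the pool names that appear depend on the collector in use (names such as "Par Eden Space" and "CMS Old Gen" show up with the ParNew/CMS combination, for example).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MemoryPools {
    /** One line per pool: name, type (heap or non-heap), and used kilobytes. */
    static String describe(MemoryPoolMXBean pool) {
        MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
        long usedKb = usage == null ? -1 : usage.getUsed() / 1024;
        return pool.getName() + " [" + pool.getType() + "] used=" + usedKb + " KB";
    }

    /** Number of memory pools reported by this JVM. */
    static int poolCount() {
        return ManagementFactory.getMemoryPoolMXBeans().size();
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(describe(pool));
        }
    }
}
```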

Performance optimization related tools

  • java dump tool, ZProfile

