Introduction
In Java, buffered IO classes such as BufferedReader and BufferedInputStream are the usual choice for processing large files. Java NIO, however, introduced a way of working with large files based on MappedByteBuffer, which offers high read and write performance. Compared with the traditional blocking IO (BIO) stream model, it can handle considerably larger files and more of them.
Memory management of OS
Technical terms and concepts at the memory level
- MMU: memory management unit of CPU.
- Physical memory: the memory space of the memory module.
- Virtual memory: a memory management technique of the computer system. It makes an application believe it has contiguous available memory (a continuous and complete address space), while in fact that memory is usually split into multiple fragments of physical memory, with some parts temporarily stored on disk and swapped in when needed.
- Page file: a file created by the operating system to hold the part of virtual memory that lives on the hard disk. On Windows it is the pagefile.sys file: once physical memory is full, temporarily unused data is moved to the hard disk.
- Page fault (page missing interrupt): an interrupt raised by the MMU when a program tries to access a page that is mapped in the virtual address space but not loaded into physical memory. If the operating system determines that the access is valid, it tries to load the relevant page from the virtual memory file into physical memory.
Virtual memory and physical memory
The memory required by a running process may be greater than the total capacity of the memory modules. For example, if the memory module holds 256M but the program needs to create a 2G data area, not all of the data can be loaded into physical memory, so some of it must be placed on another medium (such as the hard disk). When the process needs to access that part of the data, it is loaded back into physical memory. In this scenario, the hard disk space that holds the swapped-out data is what we call virtual memory.
MappedByteBuffer
So what is MappedByteBuffer? In terms of inheritance, MappedByteBuffer extends ByteBuffer, so it has all the capabilities of ByteBuffer, such as moving the position and limit pointers and wrapping itself as a view buffer of another type. Internally it maintains a logical address.
MappedByteBuffer improves read and write speed
- Why is it fast? Because it reads and writes file contents through a direct buffer, a technique properly known as memory mapping. The system's underlying cache is accessed directly, with no copying between the JVM and the system, so efficiency improves greatly. Because it is so fast, it can also be used to pass messages between processes (or threads), achieving much the same effect as a "shared memory page", except that it is backed by a real file.
- It also lets you read and write files that are too large to fit into memory. You can work as if the whole file were in memory (in fact, a large file is spread across physical memory and virtual memory) and access it much like a very large array, which greatly simplifies modifying large files and similar operations.
Case usage of MappedByteBuffer
FileChannel provides a map method that maps a file to a MappedByteBuffer: MappedByteBuffer map(FileChannel.MapMode mode, long position, long size). It maps size bytes of the file, starting at position, into a MappedByteBuffer. The mode parameter selects one of three ways to access the memory-mapped file:
- MapMode.READ_ONLY: attempting to modify the resulting buffer will result in a ReadOnlyBufferException being thrown.
- MapMode.READ_WRITE: changes to the resulting buffer will eventually be written to the file; however, the changes are not necessarily visible to other programs that have mapped the same file (the ubiquitous "consistency problem" arises again).
- MapMode.PRIVATE: the buffer is readable and writable, but modifications are not written to the file. Only the buffer's own private copy changes. This behavior is called "copy on write".
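The three modes can be compared side by side with a small experiment. The following is a minimal sketch, not a definitive implementation: the class name MapModeDemo and its helper methods are hypothetical, and it assumes a small temporary file that can be mapped in full.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.ReadOnlyBufferException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; names are illustrative only.
final class MapModeDemo {

    // Returns true if writing to a READ_ONLY mapping throws ReadOnlyBufferException.
    static boolean readOnlyRejectsWrites(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            try {
                buf.put(0, (byte) 'X');
                return false;
            } catch (ReadOnlyBufferException expected) {
                return true;
            }
        }
    }

    // Writes through a PRIVATE mapping, then returns the first byte stored in the file.
    static byte firstByteAfterPrivateWrite(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.PRIVATE, 0, ch.size());
            buf.put(0, (byte) 'X'); // changes only the private copy
        }
        return Files.readAllBytes(file)[0];
    }

    // Writes through a READ_WRITE mapping, forces it, then returns the file's first byte.
    static byte firstByteAfterSharedWrite(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, ch.size());
            buf.put(0, (byte) 'X');
            buf.force(); // push the change to the file
        }
        return Files.readAllBytes(file)[0];
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapmode-demo", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "hello".getBytes());
        System.out.println("READ_ONLY rejects writes: " + readOnlyRejectsWrites(file));
        System.out.println("After PRIVATE write:    " + (char) firstByteAfterPrivateWrite(file));
        System.out.println("After READ_WRITE write: " + (char) firstByteAfterSharedWrite(file));
    }
}
```

Note that both PRIVATE and READ_WRITE mappings require a channel opened for reading and writing, while READ_ONLY only needs read access.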
MappedByteBuffer has three new methods compared with ByteBuffer
- force(): when the buffer is in READ_WRITE mode, this method forces any changes made to the buffer's content to be written to the file
- load() loads the contents of the buffer into memory and returns a reference to the buffer
- isLoaded() returns true if the contents of the buffer are in physical memory, otherwise false
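A short sketch of how these three methods might be used together is shown below. The class ForceLoadDemo and its method are hypothetical names introduced for illustration, and the sketch assumes the target file is empty or new so that its content afterwards is exactly the text written.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; names are illustrative only.
final class ForceLoadDemo {

    // Writes text through a READ_WRITE mapping and returns what the file then contains.
    static String writeAndForce(Path file, String text) throws IOException {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping beyond the current end of a writable file extends the file.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            buf.load();                       // best-effort: bring the pages into physical memory
            boolean resident = buf.isLoaded(); // only a hint; may be false even after load()
            buf.put(data);
            buf.force();                      // write the changes back to the file
        }
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("force-demo", ".txt");
        file.toFile().deleteOnExit();
        System.out.println(writeAndForce(file, "mapped"));
    }
}
```

The return value of isLoaded() should be treated as advisory, since the operating system may page data out again at any time.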
Use FileChannel to build related MappedByteBuffer
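    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedByteBufferExample {
        public static void main(String[] args) throws IOException {
            // One byte per position: 0x8FFFFFF bytes is about 144 MB of data in the file
            int length = 0x8FFFFFF;
            // src/c.txt must already exist, because CREATE is not among the open options
            try (FileChannel channel = FileChannel.open(Paths.get("src/c.txt"),
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer mapBuffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, length);
                for (int i = 0; i < length; i++) {
                    mapBuffer.put((byte) 0);
                }
                for (int i = length / 2; i < length / 2 + 4; i++) {
                    // Access like an array
                    System.out.println(mapBuffer.get(i));
                }
            }
        }
    }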
Comparing the read and write performance of the different approaches
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class TestMappedByteBuffer {
        // 0x2FFFFFFF bytes is about 768 MB, somewhat short of a full 1 GB
        private static int length = 0x2FFFFFFF;

        private abstract static class Tester {
            private String name;

            public Tester(String name) {
                this.name = name;
            }

            // Runs the test and prints the elapsed wall-clock time
            public void runTest() {
                System.out.print(name + ": ");
                long start = System.currentTimeMillis();
                test();
                System.out.println(System.currentTimeMillis() - start + " ms");
            }

            public abstract void test();
        }

        private static Tester[] testers = {
            new Tester("Stream RW") {
                public void test() {
                    try (FileInputStream fis = new FileInputStream("src/a.txt");
                         DataInputStream dis = new DataInputStream(fis);
                         FileOutputStream fos = new FileOutputStream("src/a.txt");
                         DataOutputStream dos = new DataOutputStream(fos)) {
                        byte b = (byte) 0;
                        // Unbuffered: every writeByte goes straight to the OS, one byte at a time
                        for (int i = 0; i < length; i++) {
                            dos.writeByte(b);
                            dos.flush();
                        }
                        while (dis.read() != -1) {
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            },
            new Tester("Mapped RW") {
                public void test() {
                    // src/b.txt must already exist
                    try (FileChannel channel = FileChannel.open(Paths.get("src/b.txt"),
                            StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                        MappedByteBuffer mapBuffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, length);
                        for (int i = 0; i < length; i++) {
                            mapBuffer.put((byte) 0);
                        }
                        mapBuffer.flip();
                        while (mapBuffer.hasRemaining()) {
                            mapBuffer.get();
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            },
            new Tester("Mapped PRIVATE") {
                public void test() {
                    // src/c.txt must already exist
                    try (FileChannel channel = FileChannel.open(Paths.get("src/c.txt"),
                            StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                        MappedByteBuffer mapBuffer = channel.map(FileChannel.MapMode.PRIVATE, 0, length);
                        for (int i = 0; i < length; i++) {
                            mapBuffer.put((byte) 0);
                        }
                        mapBuffer.flip();
                        while (mapBuffer.hasRemaining()) {
                            mapBuffer.get();
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        };

        public static void main(String[] args) {
            for (Tester tester : testers) {
                tester.runTest();
            }
        }
    }
Test results
- Stream RW: the traditional stream approach is by far the slowest. The test writes one byte at a time through an unbuffered stream and flushes after every byte, so with this much data it could not complete the test at all.
- MapMode.READ_WRITE: the speed varies greatly from run to run, fluctuating between 0.6s and 8s, and is very unstable.
- MapMode.PRIVATE: surprisingly stable, always between 1.1s and 1.2s.
Whichever mode is used, the speed is impressive. MappedByteBuffer does have a shortcoming, though: with very small amounts of data its performance is relatively poor, because a direct buffer takes a long time to initialize. It is therefore best to use MappedByteBuffer only when the amount of data is large.
map process
FileChannel provides a map method to map files to virtual memory. Generally, the whole file can be mapped. If the file is large, it can be mapped in segments.
Several variables in FileChannel:
- MapMode: how the memory-mapped file is accessed, that is, one of the three modes described above.
- Position: the starting position of file mapping.
- allocationGranularity: Memory allocation size for mapping buffers, initialized by the native function initIDs.
Next, by analyzing the source code, we can understand the internal implementation of the map process. Obtain FileChannel through RandomAccessFile.
    public final FileChannel getChannel() {
        synchronized (this) {
            if (channel == null) {
                channel = FileChannelImpl.open(fd, path, true, rw, this);
            }
            return channel;
        }
    }
As the implementation above shows, the synchronized block ensures that only one thread can initialize the FileChannel. The FileChannel.map method then maps the file into virtual memory and returns the logical address (address). The implementation is as follows:
    public MappedByteBuffer map(MapMode mode, long position, long size) throws IOException {
        int pagePosition = (int)(position % allocationGranularity);
        long mapPosition = position - pagePosition;
        long mapSize = size + pagePosition;
        try {
            addr = map0(imode, mapPosition, mapSize);
        } catch (OutOfMemoryError x) {
            System.gc();
            try {
                Thread.sleep(100);
            } catch (InterruptedException y) {
                Thread.currentThread().interrupt();
            }
            try {
                addr = map0(imode, mapPosition, mapSize);
            } catch (OutOfMemoryError y) {
                // After a second OOME, fail
                throw new IOException("Map failed", y);
            }
        }
        int isize = (int)size;
        Unmapper um = new Unmapper(addr, mapSize, isize, mfd);
        if ((!writable) || (imode == MAP_RO)) {
            return Util.newMappedByteBufferR(isize, addr + pagePosition, mfd, um);
        } else {
            return Util.newMappedByteBuffer(isize, addr + pagePosition, mfd, um);
        }
    }
As can be seen from the above code, the final map completes the file mapping through the native function map0.
- If the first file mapping leads to OOM, garbage collection will be triggered manually. After sleeping for 100ms, try mapping again. If it fails, an exception will be thrown.
- Initialize the MappedByteBuffer instance through the newMappedByteBuffer method, but the final return is the DirectByteBuffer instance. The implementation is as follows:
    static MappedByteBuffer newMappedByteBuffer(int size, long addr,
                                                FileDescriptor fd, Runnable unmapper) {
        MappedByteBuffer dbb;
        if (directByteBufferConstructor == null)
            initDBBConstructor();
        // Exception handling around newInstance is omitted in this excerpt
        dbb = (MappedByteBuffer) directByteBufferConstructor.newInstance(
                new Object[] { new Integer(size), new Long(addr), fd, unmapper });
        return dbb;
    }

    // Access rights
    private static void initDBBConstructor() {
        AccessController.doPrivileged(new PrivilegedAction<Void>() {
            public Void run() {
                // Exception handling around the reflective lookup is omitted in this excerpt
                Class<?> cl = Class.forName("java.nio.DirectByteBuffer");
                Constructor<?> ctor = cl.getDeclaredConstructor(
                        new Class<?>[] { int.class, long.class, FileDescriptor.class, Runnable.class });
                ctor.setAccessible(true);
                directByteBufferConstructor = ctor;
                return null;
            }
        });
    }
Because FileChannelImpl and DirectByteBuffer are not in the same package, the DirectByteBuffer constructor is not directly accessible. It is therefore obtained by reflection inside an AccessController.doPrivileged block and then used for instantiation.
DirectByteBuffer is a subclass of MappedByteBuffer, which implements direct operation on memory.
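The alignment arithmetic at the start of the map method can be worked through with concrete numbers. The sketch below simply repeats those three lines in isolation. The class name MapAlignmentDemo is hypothetical, and the granularity of 4096 bytes is an assumption chosen for illustration (the real value is platform dependent and set by initIDs).

```java
// Hypothetical demo class; names are illustrative only.
final class MapAlignmentDemo {

    // Returns {pagePosition, mapPosition, mapSize} using the same formulas as map().
    static long[] align(long position, long size, long allocationGranularity) {
        int pagePosition = (int) (position % allocationGranularity); // offset inside the aligned block
        long mapPosition = position - pagePosition;                  // aligned start passed to map0
        long mapSize = size + pagePosition;                          // size grows by the same offset
        return new long[] { pagePosition, mapPosition, mapSize };
    }

    public static void main(String[] args) {
        // Assumed granularity of 4096 bytes; request 100 bytes starting at offset 10000.
        long[] r = align(10_000, 100, 4096);
        System.out.println("pagePosition = " + r[0]); // 1808
        System.out.println("mapPosition  = " + r[1]); // 8192
        System.out.println("mapSize      = " + r[2]); // 1908
    }
}
```

The buffer handed back to the caller starts at addr + pagePosition, so the extra bytes mapped in front of the requested position stay hidden.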
get process
The get method of MappedByteBuffer is ultimately implemented by DirectByteBuffer.get.
    public byte get() {
        return ((unsafe.getByte(ix(nextGetIndex()))));
    }

    public byte get(int i) {
        return ((unsafe.getByte(ix(checkIndex(i)))));
    }

    private long ix(int i) {
        return address + (i << 0);
    }
- The map0() function returns an address, so the file can be accessed through that address without calling the read or write methods. At the lowest level, the unsafe.getByte method fetches the data at the given memory location (address + offset).
- The first access to the memory region that address points to triggers a page fault (page missing interrupt). The fault handler looks for the corresponding page in the swap area. If it is not found (that is, the file has never been read into memory), the required page of the file is read from the hard disk into physical memory (outside the JVM heap).
- If physical memory runs short while the data is being copied, the virtual memory mechanism swaps temporarily unused physical pages out to virtual memory on the hard disk.
Performance analysis
At the code level, reading a file from the hard disk into memory always involves copying data through the file system, and that copy is carried out by the file system and the hardware driver. In theory, the copy itself is equally efficient whichever API is used.
However, accessing a file on the hard disk through memory mapping is more efficient than using the read and write system calls:
- read() is a system call. The file data is first copied from the hard disk into a buffer in kernel space and then copied again into user space, so the data is copied twice.
- map() also ends in a system call (mmap on POSIX systems), but the call itself copies no data. When a page fault occurs, the file data is copied from the hard disk directly into memory visible to user space, so only one copy is made.
As a result, memory-mapped reads and writes are more efficient than traditional read/write.
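The two access paths described above look like this in code. This is only a sketch of the two call patterns, not a benchmark: the class ReadVsMapDemo and its methods are hypothetical names, and it assumes a file small enough to fit in a single byte array.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
final class ReadVsMapDemo {

    // Explicit read calls: data is copied from the kernel into our heap buffer.
    static byte[] readWithChannel(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep reading until the buffer is full or end of file is reached
            }
            return buf.array();
        }
    }

    // Memory mapping: no read call per access; pages are faulted in on first touch.
    static byte[] readWithMapping(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("read-vs-map", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] { 1, 2, 3, 4, 5 });
        System.out.println(Arrays.equals(readWithChannel(file), readWithMapping(file)));
    }
}
```

Both methods return the same bytes; the difference lies in how the data reaches the program, which is what the copy counts above describe.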
Use RandomAccessFile to build related MappedByteBuffer
Read files through MappedByteBuffer
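    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.Scanner;

    public class MappedByteBufferTest {
        public static void main(String[] args) {
            File file = new File("D://data.txt");
            long len = file.length();
            byte[] ds = new byte[(int) len];
            // try-with-resources closes the file and its channel when done
            try (RandomAccessFile raf = new RandomAccessFile(file, "r");
                 FileChannel channel = raf.getChannel()) {
                MappedByteBuffer mappedByteBuffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, len);
                // Copy the mapped content into a byte array
                for (int offset = 0; offset < len; offset++) {
                    ds[offset] = mappedByteBuffer.get();
                }
                // Print the space-separated tokens of the file
                Scanner scan = new Scanner(new ByteArrayInputStream(ds)).useDelimiter(" ");
                while (scan.hasNext()) {
                    System.out.print(scan.next() + " ");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }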
Summary
MappedByteBuffer uses virtual memory, so the size of the mapped (allocated) memory is not limited by the JVM's -Xmx parameter. There is still a size limit, however: a single call to map cannot cover more than Integer.MAX_VALUE bytes (just under 2G).
If the file exceeds that limit, you can map the later parts of the file separately through the position parameter.
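Mapping a file piece by piece through the position parameter could look like the sketch below. The class SegmentedMapDemo, its method, and the idea of summing the bytes are hypothetical and only serve to show the loop; the segment size is a parameter so that the same code can be tried on small files.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; names are illustrative only.
final class SegmentedMapDemo {

    // Sums every byte of the file, mapping at most segmentSize bytes at a time.
    // segmentSize must be positive.
    static long sumBytes(Path file, int segmentSize) throws IOException {
        long sum = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long fileSize = ch.size();
            for (long pos = 0; pos < fileSize; pos += segmentSize) {
                long len = Math.min(segmentSize, fileSize - pos); // last segment may be shorter
                MappedByteBuffer segment = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (segment.hasRemaining()) {
                    sum += segment.get() & 0xFF; // treat each byte as unsigned
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segments", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 });
        System.out.println(sumBytes(file, 4)); // three segments: 4 + 4 + 2 bytes
    }
}
```

For a really large file the segment size would be something like several hundred megabytes; the position does not need to be aligned by the caller, because map handles alignment internally.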
MappedByteBuffer does deliver high performance when processing large files, but it also has problems, such as memory usage and the unpredictable moment at which the file is released. A file mapped by MappedByteBuffer is released only when the buffer is garbage collected, and that point in time is uncertain.
The javadoc also says: "A mapped byte buffer and the file mapping that it represents remain valid until the buffer itself is garbage-collected."
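One consequence of this lifetime rule is that closing the channel does not end the mapping. The sketch below illustrates that under simple assumptions: the class MappingOutlivesChannelDemo and its method are hypothetical names, and the file is a small temporary one.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; names are illustrative only.
final class MappingOutlivesChannelDemo {

    // Maps the whole file and returns the buffer after the channel has been closed.
    static MappedByteBuffer mapThenClose(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        } // the channel is closed here, but the mapping remains valid
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("outlive", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "abc".getBytes());
        MappedByteBuffer buf = mapThenClose(file);
        System.out.println((char) buf.get(0)); // still readable after close
    }
}
```

This is also why the mapped region keeps occupying address space until the buffer becomes unreachable and is collected.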