SXSSF (stream user model API)

Posted by ClosetGeek on Tue, 02 Nov 2021 12:05:58 +0100

The data exported by SXFFS is not easy to cause memory overflow

SXSSF is an API compatible stream extension of XSSF. It is used when a very large spreadsheet must be generated and the heap space is limited. SXSSF achieves its low memory footprint by restricting access to rows in the sliding window, while XSSF allows access to all rows in the document. Old lines that no longer appear in the window cannot be accessed because they are written to disk. You can specify the window size through new SXSSFWorkbook(int windowSize) or set it for each page through sxssfssheet #setrandomaccesswindowsize (int windowSize). When a new row is created through createRow() and the total number of records that have not been refreshed will exceed the specified window size, the row with the lowest index value will be refreshed and can no longer be refreshed through getRow() visit. The default window size is 100 and is determined by SXSSFWorkbook.DEFAULT_WINDOW_SIZE definition. A windowSize of - 1 indicates unrestricted access. In this case, all records that have not been refreshed by the call to flushRows() can be used for random access.

Note: SXSSF allocates temporary files that you must always explicitly clear by calling the dispose method.

SXSSFWorkbook uses inline strings instead of shared string tables by default. This is very effective because there is no need to save the document content in memory, but it is also known that documents incompatible with some clients will be generated. When shared strings are enabled, all unique strings in the document must be saved in memory. Depending on your document content, this may use more resources than disabling shared strings.

Note: according to the functions you use, some contents may consume a lot of memory, such as merge areas, hyperlinks, comments... Are still only stored in memory, so a lot of memory may be required if widely used.

Before deciding whether to enable shared strings, carefully check your memory budget and compatibility requirements. The example of face writes a worksheet with a 100 row window. When the number of rows reaches 101, the row with rownum=0 is refreshed to disk and deleted from memory. When rownum reaches 102, the row with rownum=1 is refreshed, and so on.

import junit.framework.Assert;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import  org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
   public static void main(String[] args) throws Throwable {
       SXSSFWorkbook wb = new SXSSFWorkbook(100); // keep 100 rows in memory, exceeding rows will be flushed to disk
       Sheet sh = wb.createSheet();
       for(int rownum = 0; rownum < 1000; rownum++){
           Row row = sh.createRow(rownum);
           for(int cellnum = 0; cellnum < 10; cellnum++){
               Cell cell = row.createCell(cellnum);
               String address = new CellReference(cell).formatAsString();
               cell.setCellValue(address);
           }
       }
       // Rows with rownum < 900 are flushed and not accessible
       for(int rownum = 0; rownum < 900; rownum++){
         Assert.assertNull(sh.getRow(rownum));
       }
       // ther last 100 rows are still in memory
       for(int rownum = 900; rownum < 1000; rownum++){
           Assert.assertNotNull(sh.getRow(rownum));
       }
       FileOutputStream out = new FileOutputStream("/temp/sxssf.xlsx");
       wb.write(out);
       out.close();
       // dispose of temporary files backing this workbook on disk
       wb.dispose();
   }

The next example turns off automatic refresh (windowSize=-1), and the code manually controls how to write some data to disk

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
   public static void main(String[] args) throws Throwable {
       SXSSFWorkbook wb = new SXSSFWorkbook(-1); // turn off auto-flushing and accumulate all rows in memory
       Sheet sh = wb.createSheet();
       for(int rownum = 0; rownum < 1000; rownum++){
           Row row = sh.createRow(rownum);
           for(int cellnum = 0; cellnum < 10; cellnum++){
               Cell cell = row.createCell(cellnum);
               String address = new CellReference(cell).formatAsString();
               cell.setCellValue(address);
           }
          // manually control how rows are flushed to disk
          if(rownum % 100 == 0) {
               ((SXSSFSheet)sh).flushRows(100); // retain 100 last rows and flush all others
               // ((SXSSFSheet)sh).flushRows() is a shortcut for ((SXSSFSheet)sh).flushRows(0),
               // this method flushes all rows
          }
       }
       FileOutputStream out = new FileOutputStream("/temp/sxssf.xlsx");
       wb.write(out);
       out.close();
       // dispose of temporary files backing this workbook on disk
       wb.dispose();
  }

SXSSF refreshes the worksheet data in the temporary file (one temporary file per worksheet). The size of these temporary files may grow to very large values. For example, for 20 MB of csv data, the size of temporary xml becomes more than 1 GB. If there is a problem with the size of the temporary file, you can tell SXSSF to use gzip compression:

SXSSFWorkbook wb = new SXSSFWorkbook();
wb.setCompressTempFiles(true); // Temporary files will be compressed

reference resources

Class SXSSFWorkbook
How to use HSSF API
Introduction to generating SXSSFWorkbook from POI massive data / big data files
Comparison between POI and EasyExcel in processing large quantities of data

Topics: Java