java parsing docx format documents, XWPFTable parsing line breaks, and wps and office reading and writing docx documents have different paragraph structures

Posted by usvpn on Sat, 01 Jan 2022 17:02:43 +0100

demand

One came from the front docx document, the back-end parses the tables in the document and processes the data in it.

(I don't know why I have to parse the tables in word documents instead of directly transferring them to excel.)

realization

1. Front end elementui-upload , take the whole The docx file is passed to the backend.

2. The back-end controller receives this document. Note that the received format is MultipartFile, not file, otherwise it will report 500.

3. The backend service parses this document:

public Map < String, Object > dealWord(MultipartFile file) throws IOException {
		// 3.1 parsing the entire document
		XWPFDocument xwpf = new XWPFDocument(file.getInputStream());
		// 3.2 iterators for getting tables from documents
		Iterator < XWPFTable > it = xwpf.getTablesIterator();
		// 3.3 traverse all tables and parse them one by one
		while (it.hasNext()) {
			// 3.3.1 get the current table
			XWPFTable table = it.next();
			// 3.3.2 get row data
			List < XWPFTableRow > rows = table.getRows();
			// 3.3.3 traverse each row and get each cell
			for (int i = 1; i < rows.size(); i++) {
				XWPFTableRow row = rows.get(i);
				List<XWPFTableCell> cells = row.getTableCells();
				// 3.3.4 get the text in each cell
				for(int j = 0;j<cells.size();j++){
					String text = cells.get(j).getText()
				}
			}
		}

If you only need to get the text content in each cell without considering line wrapping, then this is the end.

If you need to parse line breaks in cells, please continue to look down.

Next, let's introduce the big pit.

pit

At first, I thought XWPFTable could not parse line breaks. Later, I thought it was impossible. I looked at the data structure in XWPFTable.

But I have read and written in wps and office docx document, the parsed paragraph structure is different!

But it's good to be compatible.

Here by the way, let's introduce the parsed structure!

Take the following table as an example (WPS read-write version):

The data parsed from this table is similar to this:

Field interpretation:

I don't know how to explain the runs here I don't quite understand the official documents. I hope friends who know something can add.

The difference between wps and word is in runs.

As you can see in the figure, there is a run object in paragrah, which is cell 1
Each of the three paragraphs has a run object.

However, in the table read and written by office, there may be multiple run objects, that is, cell 1 may be split into multiple run objects.

You can compare:

Here, cell 1 is disassembled into cell 1 and two run objects.

Explore office

Idle is also idle. Explore the rules for the number of run objects in office.

Speculation: split the text of the whole paragraph by data type. The table is string text and 1 is number.

Now we want to parse the offcie read / write table:

Parsed results:

Compare:

I don't understand With my limited knowledge, I don't know why I split 3 and l, but merged 4L% into a run object

terms of settlement

This problem is easy to solve:

To get the first row of content in a cell:
Get tableCell, get all paragrahs, get the first paragrah, and traverse to get all runs in this paragrah

In this way, it can adapt to wps and office at the same time.

Previous code:

public Map < String, Object > dealWord(MultipartFile file) throws IOException {
		// 3.1 parsing the entire document
		XWPFDocument xwpf = new XWPFDocument(file.getInputStream());
		// 3.2 iterators for getting tables from documents
		Iterator < XWPFTable > it = xwpf.getTablesIterator();
		// 3.3 traverse all tables and parse them one by one
		while (it.hasNext()) {
			// 3.3.1 get the current table
			XWPFTable table = it.next();
			// 3.3.2 get row data
			List < XWPFTableRow > rows = table.getRows();
			// 3.3.3 traverse each row and get each cell
			for (int i = 1; i < rows.size(); i++) {
				XWPFTableRow row = rows.get(i);
				List<XWPFTableCell> cells = row.getTableCells();
				// 3.3.4 get the text in each cell
				for(int j = 0;j<cells.size();j++){
					String text = cells.get(j).getText()
				}
				// 3.3.5 get the first paragraph in the cell
				// If you want to get the contents of all segments, use ArrayList to receive them. Ha, there is no demonstration here
				for(int k = 0;k<cells.getParagraphs().size();k++){
					String text = "";
					// Traverse all the run s in the first paragraph. Note that only the paragraph with 0 is got here
					for(int p = 0;p<cells.getParagraphs().get(0).getRuns().size();p++){
						text = text + cells.getParagraphs().get(0).getRuns().get(p);
					}
					System.out.prrintln("The contents of the first row of cells are:"+text);
				}
			}
		}

It's a very simple traversal. It's easy to adapt. It's written here.