Preface: When reading a PDF book, taking notes is easy if the pages contain real text. If the pages are scanned images, however, the text cannot be selected or edited, which makes note-taking painful. I ran into exactly this problem, so naturally I had to learn how to solve it myself.
I. Status Quo
To avoid reinventing the wheel, we should first check whether a solution already exists on the market; if one does, it makes sense to simply use it.
First, there are online PDF image-to-text converters. They limit the file size to 2 MB (many file-processing services seem to use this limit), and anything larger costs money.
Second, WPS offers PDF image-to-text conversion. Never mind a size limit: it charges outright.
II. Scheme Realization
2.1 Get the AppID, API Key, and Secret Key from the Baidu AI Platform
The platform limits the number of calls, but the quota is generally enough for individual developers.
Java SDK Document Usage Instructions: https://ai.baidu.com/docs#/OCR-Java-SDK/top
If anything is unclear, refer to the documentation.
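Because the platform caps the number of calls, it can help to keep a local count so a large PDF does not burn through the quota unnoticed. The following is only a minimal sketch: `CallBudget` and its two parameters are hypothetical names introduced here, and the numbers you pass in are your own assumptions, not figures published by Baidu.

```java
// Hypothetical helper: tracks how many OCR calls have been made locally.
class CallBudget {
    private final int maxCalls;           // assumed quota chosen by the caller
    private final long minIntervalMillis; // assumed minimum spacing between calls
    private int used = 0;
    private long lastCallAt = 0L;

    CallBudget(int maxCalls, long minIntervalMillis) {
        this.maxCalls = maxCalls;
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns true if a call may be made now; false once the budget is spent. */
    synchronized boolean tryAcquire() {
        if (used >= maxCalls) {
            return false;
        }
        // Wait until the minimum interval since the previous call has passed.
        long wait = lastCallAt + minIntervalMillis - System.currentTimeMillis();
        if (used > 0 && wait > 0) {
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        lastCallAt = System.currentTimeMillis();
        used++;
        return true;
    }

    /** Number of calls still available in this budget. */
    synchronized int remaining() {
        return maxCalls - used;
    }
}
```

In the image loop, one would check `budget.tryAcquire()` before each `client.basicGeneral(...)` call and stop processing once it returns false.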
2.2 Code Implementation
Approach: read the PDF file, extract the images it contains, send each image to the Baidu AI platform for recognition, and parse the returned results.
Step 1: Create a new demo Maven project
Omitted. (I trust everyone knows how to do this.)
Step 2: Add the dependencies to the POM
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>demo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>demo</name>
    <description>
        Demo project for PDF image-to-text conversion.
        Follow the WeChat official account: Java Technical Dry Goods
    </description>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <!-- Baidu AI SDK -->
        <dependency>
            <groupId>com.baidu.aip</groupId>
            <artifactId>java-sdk</artifactId>
            <version>4.8.0</version>
        </dependency>
        <!-- PDF toolkit -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox-app</artifactId>
            <version>2.0.16</version>
        </dependency>
    </dependencies>
</project>
Step 3: Create a new class with a main method
package com.example.demo;

import com.baidu.aip.ocr.AipOcr;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.text.PDFTextStripper;
import org.json.JSONObject;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.HashMap;

public class DemoApplication {

    // Set APPID/AK/SK
    public static final String APP_ID = "Your APP_ID";
    public static final String API_KEY = "Your API_KEY";
    public static final String SECRET_KEY = "Your SECRET_KEY";
    public static final String DATE_FORMAT = "yyyy-MM-dd HH:mm:ss";

    /**
     * Parses a PDF document and runs OCR on every embedded image.
     *
     * @param pdfPath path to the PDF document
     * @throws Exception if the document cannot be loaded or read
     */
    public static void pdfParse(String pdfPath) throws Exception {
        // try-with-resources closes the document even if parsing fails
        try (PDDocument document = PDDocument.load(new File(pdfPath))) {
            // Document properties
            PDDocumentInformation info = document.getDocumentInformation();
            System.out.println("Title: " + info.getTitle());
            System.out.println("Subject: " + info.getSubject());
            System.out.println("Author: " + info.getAuthor());
            System.out.println("Keywords: " + info.getKeywords());
            System.out.println("Creator: " + info.getCreator());
            System.out.println("Producer: " + info.getProducer());
            System.out.println("Trapped: " + info.getTrapped());
            System.out.println("Creation time: " + dateFormat(info.getCreationDate()));
            System.out.println("Modification time: " + dateFormat(info.getModificationDate()));

            // Text content (empty for PDFs that contain only images)
            PDFTextStripper pts = new PDFTextStripper();
            String content = pts.getText(document);
            System.out.println("Content: " + content);

            // Page information
            PDPageTree pages = document.getDocumentCatalog().getPages();
            System.out.println("Page count: " + pages.getCount());

            // Initialize an AipOcr client
            AipOcr client = new AipOcr(APP_ID, API_KEY, SECRET_KEY);
            // Optional: set network connection parameters
            client.setConnectionTimeoutInMillis(2000);
            client.setSocketTimeoutInMillis(60000);

            int count = 0; // number of images recognized so far
            for (PDPage page : pages) {
                PDResources res = page.getResources();
                if (res == null) {
                    continue;
                }
                for (COSName key : res.getXObjectNames()) {
                    if (!res.isImageXObject(key)) {
                        continue;
                    }
                    try {
                        PDImageXObject image = (PDImageXObject) res.getXObject(key);
                        BufferedImage bimage = image.getImage();

                        // Convert the BufferedImage to a PNG byte array
                        ByteArrayOutputStream out = new ByteArrayOutputStream();
                        ImageIO.write(bimage, "png", out);
                        byte[] barray = out.toByteArray();
                        out.close();

                        // Send the image recognition request
                        JSONObject json = client.basicGeneral(barray, new HashMap<String, String>());
                        System.out.println(json.toString(2));
                        count++;
                        System.out.println("Images recognized so far: " + count);
                    } catch (Exception e) {
                        // Report the failure instead of silently swallowing it, then move on
                        System.err.println("Failed to recognize image " + key.getName() + ": " + e.getMessage());
                    }
                }
            }
        }
    }

    /**
     * Formats a calendar value as a date-time string.
     *
     * @param calendar the time to format; may be null
     * @return the formatted time, or an empty string if calendar is null
     */
    public static String dateFormat(Calendar calendar) {
        if (calendar == null) {
            return "";
        }
        return new SimpleDateFormat(DATE_FORMAT).format(calendar.getTime());
    }

    public static void main(String[] args) throws Exception {
        // Path of the PDF file to read
        String path = "C:\\Users\\fl\\Desktop\\a.pdf";
        pdfParse(path);
    }
}
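Scanned pages can be very large, and OCR services usually reject images above a certain size. As a rough sketch of how an image could be shrunk before it is sent, the helper below scales an image down so that its longest side fits a limit. The class name `OcrImagePrep`, its methods, and the 4096-pixel value are assumptions made for illustration; check the Baidu documentation for the actual limits.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the Baidu SDK or PDFBox.
class OcrImagePrep {
    // Assumed limit on the longest side in pixels; verify against the service docs.
    static final int MAX_SIDE = 4096;

    /** Returns the image unchanged if it fits, otherwise a proportionally scaled copy. */
    static BufferedImage fitWithin(BufferedImage src, int maxSide) {
        int w = src.getWidth();
        int h = src.getHeight();
        int longest = Math.max(w, h);
        if (longest <= maxSide) {
            return src;
        }
        double scale = (double) maxSide / longest;
        int newW = Math.max(1, (int) Math.round(w * scale));
        int newH = Math.max(1, (int) Math.round(h * scale));
        BufferedImage dst = new BufferedImage(newW, newH, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // White background so transparent areas do not turn black
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, newW, newH);
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, newW, newH, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    /** Encodes the image as PNG bytes, ready to pass to an OCR request. */
    static byte[] toPng(BufferedImage image) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the loop above, this would replace the manual conversion with something like `byte[] barray = OcrImagePrep.toPng(OcrImagePrep.fitWithin(bimage, OcrImagePrep.MAX_SIDE));`. Scaling down loses detail, so very small print may be recognized less accurately.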
Step 4: Comparison of Recognition Results
Example 1: Cover recognition
Before recognition:
After recognition:
Example 2: Text recognition
Before recognition:
After recognition:
Summary
It took an hour or two to get familiar with this functionality, and the results were satisfactory, although some formatting was lost. Still, being able to recognize the text means not having to retype it by hand, which makes reading and note-taking more efficient.
If you found this useful, feel free to follow or like.