Converting image-based PDF pages to text in Java with Baidu AI OCR

Posted by sigmon on Mon, 22 Jul 2019 11:21:18 +0200

Preface: When reading a PDF book, taking notes is easy if the pages contain real text. If the pages are just images, you cannot select or edit anything, and note-taking becomes painful. I ran into exactly this, so of course I had to learn how to solve the problem myself.

I. Status Quo

To avoid reinventing the wheel, I first checked whether something on the market already does this; if so, I would simply use it.

First, the online PDF image-to-text converters: they limit the file size to 2 MB (a lot of online file-processing services seem to use this number), and anything larger is charged.

Second, the PDF image-to-text feature in WPS: never mind a size limit, it charges from the start.

II. Implementation

2.1 Get the AppID, API Key, and Secret Key from the Baidu AI platform

The platform limits the number of calls, but the allowance is basically enough for an individual developer.
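
Because the calls are limited, it can help to space the requests out on the client side instead of firing one per image as fast as the loop runs. The following is only a minimal sketch: `RequestPacer` and the 500 ms interval used in the note below are my own hypothetical choices, not part of the Baidu SDK, and the real limits should be checked in the console.

```java
/** Spaces out requests so they stay under an assumed rate limit (hypothetical helper). */
final class RequestPacer {
    private final long minIntervalMillis;
    // Earliest time at which the next request may start.
    private long nextAllowedMillis = Long.MIN_VALUE;

    RequestPacer(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Reserves the next request slot and returns how many milliseconds
     * the caller should wait, measured from nowMillis.
     */
    synchronized long reserve(long nowMillis) {
        long startAt = Math.max(nowMillis, nextAllowedMillis);
        nextAllowedMillis = startAt + minIntervalMillis;
        return startAt - nowMillis;
    }

    /** Blocks until the reserved slot is reached. */
    void await() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

With `RequestPacer pacer = new RequestPacer(500);` created once, calling `pacer.await()` right before each recognition request keeps consecutive requests at least 500 ms apart.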

Java SDK documentation: https://ai.baidu.com/docs#/OCR-Java-SDK/top

If anything is unclear, consult that documentation.
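
Once you have the three values, one option is to keep them out of the source code and read them from environment variables. This is a sketch under my own assumptions: the class `OcrCredentials` and the variable names `BAIDU_OCR_APP_ID`, `BAIDU_OCR_API_KEY`, and `BAIDU_OCR_SECRET_KEY` are hypothetical, not something the SDK defines.

```java
import java.util.Map;

/** Holds the three credentials read from an environment map (hypothetical helper). */
final class OcrCredentials {
    final String appId;
    final String apiKey;
    final String secretKey;

    private OcrCredentials(String appId, String apiKey, String secretKey) {
        this.appId = appId;
        this.apiKey = apiKey;
        this.secretKey = secretKey;
    }

    /** Builds the credentials from the given map, e.g. System.getenv(). */
    static OcrCredentials from(Map<String, String> env) {
        return new OcrCredentials(
                require(env, "BAIDU_OCR_APP_ID"),
                require(env, "BAIDU_OCR_API_KEY"),
                require(env, "BAIDU_OCR_SECRET_KEY"));
    }

    /** Returns the trimmed value, or fails with a clear message if it is missing. */
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value.trim();
    }
}
```

Usage would look like `OcrCredentials c = OcrCredentials.from(System.getenv());`, with `c.appId`, `c.apiKey`, and `c.secretKey` then passed to the `AipOcr` constructor. Trimming also guards against a stray space pasted along with a key.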

2.2 Code Implementation

Approach: read the PDF file, extract the images it contains, send each image to the Baidu AI platform for recognition, and parse the returned results.
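
The recognition and collection part of that flow can be sketched without any PDF or network code. Everything here is hypothetical naming on my part: `OcrPipeline` is not from any library, and the `recognizer` function stands in for whatever wraps the OCR call and pulls the text out of its response.

```java
import java.util.List;
import java.util.function.Function;

/** Sketch of the flow: extracted images in, recognized text out (hypothetical helper). */
final class OcrPipeline {
    /**
     * Runs each extracted image through the recognizer and joins the
     * non-empty results, one block of text per image.
     */
    static String recognizeAll(List<byte[]> images, Function<byte[], String> recognizer) {
        StringBuilder text = new StringBuilder();
        for (byte[] image : images) {
            String recognized = recognizer.apply(image);
            if (recognized == null || recognized.isEmpty()) {
                continue; // nothing recognized for this image
            }
            if (text.length() > 0) {
                text.append(System.lineSeparator());
            }
            text.append(recognized);
        }
        return text.toString();
    }
}
```

Keeping the recognizer as a parameter means the flow can be tried with a fake function first, without spending any of the platform's call allowance.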

Step 1: Create a new demo Maven project

Omitted. (I trust everyone knows how to do this.)

Step 2: Add the dependencies to the POM

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>demo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>demo</name>
    <description>
        Demo project for converting PDF page images to text
    </description>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency><!--Baidu AI SDK-->
            <groupId>com.baidu.aip</groupId>
            <artifactId>java-sdk</artifactId>
            <version>4.8.0</version>
        </dependency>
        <dependency><!--PDF Operational Toolkit-->
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox-app</artifactId>
            <version>2.0.16</version>
        </dependency>
    </dependencies>
</project>

Step 3: Create a new class with a main method

package com.example.demo;

import com.baidu.aip.ocr.AipOcr;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.text.PDFTextStripper;
import org.json.JSONObject;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.*;
import java.text.SimpleDateFormat;
import java.util.*;

public class DemoApplication {
    // Set your own APP_ID, API_KEY, and SECRET_KEY from the Baidu AI platform
    public static final String APP_ID = "Your APP_ID";
    public static final String API_KEY = "Your API_KEY";
    public static final String SECRET_KEY = "Your SECRET_KEY";
    public static final String DATE_FORMAT = "yyyy-MM-dd HH:mm:ss";
    
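    /**
     * Prints the PDF's metadata and text layer, then sends every embedded
     * image to Baidu OCR and prints the recognition result.
     *
     * @param pdfPath path to the PDF document
     * @throws Exception if the file cannot be read or parsed
     */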
    public static void pdfParse(String pdfPath) throws Exception {
        InputStream input = null;
        File pdfFile = new File(pdfPath);
        PDDocument document = null;
        try {
            input = new FileInputStream(pdfFile);
            // Load the PDF document
            document = PDDocument.load(input);

            /** Document metadata **/
            PDDocumentInformation info = document.getDocumentInformation();
            System.out.println("Title:" + info.getTitle());
            System.out.println("Subject:" + info.getSubject());
            System.out.println("Author:" + info.getAuthor());
            System.out.println("Keywords:" + info.getKeywords());

            System.out.println("Creator application:" + info.getCreator());
            System.out.println("PDF producer:" + info.getProducer());

            System.out.println("Trapped:" + info.getTrapped());

            System.out.println("Creation time:" + dateFormat(info.getCreationDate()));
            System.out.println("Modification time:" + dateFormat(info.getModificationDate()));

            // Extract the text layer (empty or nearly empty for image-only PDFs)
            PDFTextStripper pts = new PDFTextStripper();
            String content = pts.getText(document);
            System.out.println("Content:" + content);

            /** Page information **/
            PDDocumentCatalog cata = document.getDocumentCatalog();
            PDPageTree pages = cata.getPages();
            System.out.println("Page count:" + pages.getCount());
            int count = 0; // number of images recognized so far

            // Initialize the AipOcr client
            AipOcr client = new AipOcr(APP_ID, API_KEY, SECRET_KEY);

            // Optional: set network timeouts
            client.setConnectionTimeoutInMillis(2000);
            client.setSocketTimeoutInMillis(60000);

            for (PDPage page : pages) {
                PDResources res = page.getResources();
                if (res == null) {
                    continue; // page has no resources, so no images
                }
                for (COSName key : res.getXObjectNames()) {
                    if (!res.isImageXObject(key)) {
                        continue; // skip forms and other non-image XObjects
                    }
                    try {
                        PDImageXObject image = (PDImageXObject) res.getXObject(key);
                        BufferedImage bimage = image.getImage();
                        // Encode the BufferedImage as PNG bytes
                        ByteArrayOutputStream out = new ByteArrayOutputStream();
                        ImageIO.write(bimage, "png", out);
                        byte[] barray = out.toByteArray();
                        // Send the image recognition request
                        JSONObject json = client.basicGeneral(barray, new HashMap<String, String>());
                        System.out.println(json.toString(2));
                        count++;
                        System.out.println("Images processed: " + count);
                    } catch (Exception e) {
                        // Report the failure and continue with the next image
                        System.err.println("Failed to recognize image " + key.getName() + ": " + e.getMessage());
                    }
                }
            }
        } catch (Exception e) {
            throw e;
        } finally {
            if (null != input)
                input.close();
            if (null != document)
                document.close();
        }
    }

    /**
     * Formats a calendar value using DATE_FORMAT.
     *
     * @param calendar the time to format; may be null
     * @return the formatted time, or null if calendar is null
     */
    public static String dateFormat(Calendar calendar) {
        if (null == calendar)
            return null;
        SimpleDateFormat format = new SimpleDateFormat(DATE_FORMAT);
        return format.format(calendar.getTime());
    }

    public static void main(String[] args) throws Exception {

        // Path to the PDF to process; change it to a file on your machine
        String path = "C:\\Users\\fl\\Desktop\\a.pdf";
        pdfParse(path);

    }

}
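
One thing the loop above does not deal with is image size. As far as I understand the OCR documentation, the service restricts how large an uploaded image may be; the 4096-pixel maximum side used below is my assumption and should be checked against the current documentation. A minimal sketch of shrinking a `BufferedImage` before encoding it, with `ImageLimits` and `fitWithin` being hypothetical names of my own:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Shrinks oversized page images before upload (hypothetical helper). */
final class ImageLimits {
    // Assumed maximum side length in pixels; verify against the OCR documentation.
    static final int MAX_SIDE = 4096;

    /** Returns the image unchanged if it fits, otherwise a proportionally scaled copy. */
    static BufferedImage fitWithin(BufferedImage source, int maxSide) {
        int longest = Math.max(source.getWidth(), source.getHeight());
        if (longest <= maxSide) {
            return source;
        }
        double scale = (double) maxSide / longest;
        int width = Math.max(1, (int) Math.round(source.getWidth() * scale));
        int height = Math.max(1, (int) Math.round(source.getHeight() * scale));
        // TYPE_INT_RGB drops transparency, which is fine for scanned pages.
        BufferedImage scaled = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }
}
```

In the demo this would sit between `image.getImage()` and `ImageIO.write(...)`, for example `bimage = ImageLimits.fitWithin(bimage, ImageLimits.MAX_SIDE);`.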

Step 4: Compare the recognition results

Example 1: Cover recognition

Before recognition:

After recognition:

Example 2: Text recognition

Before recognition:

After recognition:

Summary

It took me an hour or two to get familiar with this area, and the results were satisfactory, although some formatting is lost. Still, having the text recognized automatically means I do not have to retype it by hand, which makes reading and note-taking more efficient.


Topics: Java Apache Maven SDK