Today a Tsinghua senior recommends a Python library for converting PDF to docx, and it is very easy to use

Posted by parijat_php on Mon, 20 Dec 2021 12:33:55 +0100

environment

preface

Converting PDF files to Word documents is a very common task. Most people's free solution is probably an online conversion service, but uploading documents to a third party carries a risk of data leakage. This article introduces pdf2docx, a free, open-source tool that performs the conversion locally.

Installing pdf2docx

Installation is simple. Just run the following pip command:

pip install pdf2docx

After a successful installation you get, in addition to the Python library itself, an executable that is also named pdf2docx.

For everyday use, you can convert PDF to docx directly with the executable; if you need the conversion inside Python code, you can use the API the library provides.
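
To confirm that both the executable and the library are available, a small check along the following lines can help. This is only a minimal sketch using the standard library; the function name check_pdf2docx_install is made up for this illustration and is not part of pdf2docx.

```python
import shutil
from importlib import metadata


def check_pdf2docx_install():
    """Report where the pdf2docx executable is and which version pip installed.

    Either value is None when the corresponding piece cannot be found.
    (Hypothetical helper for illustration, not part of pdf2docx.)
    """
    # shutil.which searches PATH much like the shell does
    # (on Windows it also tries extensions such as .exe).
    cli_path = shutil.which("pdf2docx")

    try:
        version = metadata.version("pdf2docx")
    except metadata.PackageNotFoundError:
        version = None

    return {"cli_path": cli_path, "version": version}


if __name__ == "__main__":
    info = check_pdf2docx_install()
    print("executable:", info["cli_path"] or "not found on PATH")
    print("library version:", info["version"] or "not installed")
```

If the executable is reported as missing while the library is installed, the scripts directory of the Python environment is probably not on PATH.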

Use of command line

You can view the help information by running pdf2docx --help on the command line:

INFO: Showing help with the command 'pdf2docx -- --help'.

NAME
    pdf2docx - Command line interface for ``pdf2docx``.

SYNOPSIS
    pdf2docx COMMAND | -

DESCRIPTION
    Command line interface for ``pdf2docx``.

COMMANDS
    COMMAND is one of the following:

     convert
       Convert pdf file to docx file.

     debug
       Convert one PDF page and plot layout information for debugging.

     gui
       Simple user interface.

     table
       Extract table content from pdf pages.

The help output above lists the commands pdf2docx supports. Here we focus on convert and gui.

  • convert

    This is the core function. convert takes a number of parameters of its own, which you can view with pdf2docx convert --help. The same pattern works for the other commands, so they are not listed in detail here.

    (base) PS C:\Users\Administrator> pdf2docx.exe convert --help
    INFO: Showing help with the command 'pdf2docx convert -- --help'.
    
    NAME
        pdf2docx convert - Convert pdf file to docx file.
    
    SYNOPSIS
        pdf2docx convert PDF_FILE <flags>
    
    DESCRIPTION
        Convert pdf file to docx file.
    
    POSITIONAL ARGUMENTS
        PDF_FILE
            Type: str
            PDF filename to read from.
    
    FLAGS
        --docx_file=DOCX_FILE
            Type: Optional[str]
            Default: None
            docx filename to write to. Defaults to None.
        --password=PASSWORD
            Type: Optional[str]
            Default: None
            Password for encrypted pdf. Default to None if not encrypted.
        --start=START
            Type: int
            Default: 0
            First page to process. Defaults to 0.
        --end=END
            Type: Optional[int]
            Default: None
            Last page to process. Defaults to None.
        --pages=PAGES
            Type: Optional[list]
            Default: None
            Range of pages. Defaults to None.
        Additional flags are accepted.
            Configuration parameters.
    
            .. note
    
    NOTES
        You can also use flags syntax for POSITIONAL ARGUMENTS

    As the help output above shows, converting all pages of a PDF only takes:

    pdf2docx.exe convert test.pdf test.docx

    From page 3 to the end:

    pdf2docx.exe convert test.pdf test.docx --start=2

    From the first page through page 10:

    pdf2docx.exe convert test.pdf test.docx --end=10

    From page 2 through page 5:

    pdf2docx.exe convert test.pdf test.docx --start=1 --end=5

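    Note that start and end are zero-based page indexes, so --start=2 means the third page. The end index itself is excluded, which is why --end=10 stops after the tenth page.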

    Non-consecutive pages can also be converted in a single run, for example the first, third and fifth pages:

    pdf2docx.exe convert test.pdf test.docx --pages=0,2,4

    If the PDF is encrypted, pass its password like this:

    pdf2docx.exe convert test.pdf test.docx --password=PASSWORD
  • gui

    If you are not comfortable with the command line, pdf2docx also provides a simple graphical interface, which you can launch by typing pdf2docx gui in cmd. It is quite rough, and the button text is not fully displayed, but it works.
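
The zero-based page options of convert are easy to get wrong by one. The sketch below builds the command line from page numbers counted the way a reader counts them (starting at 1). The helper names build_convert_command and run_convert are hypothetical; only the flags shown in the help output above (--start, --end, --pages, --password) are used, and it assumes the pdf2docx executable is on PATH.

```python
import subprocess


def build_convert_command(pdf_file, docx_file, first_page=None, last_page=None,
                          pages=None, password=None):
    """Build the argument list for `pdf2docx convert`.

    first_page, last_page and pages are 1-based page numbers; they are
    translated into the 0-based values the command line expects.
    (Hypothetical helper for illustration, not part of pdf2docx.)
    """
    cmd = ["pdf2docx", "convert", str(pdf_file), str(docx_file)]
    if first_page is not None:
        # page 3 -> --start=2
        cmd.append(f"--start={first_page - 1}")
    if last_page is not None:
        # "up to page 10" -> --end=10, matching the examples above
        cmd.append(f"--end={last_page}")
    if pages:
        # pages 1, 3, 5 -> --pages=0,2,4
        cmd.append("--pages=" + ",".join(str(p - 1) for p in pages))
    if password is not None:
        cmd.append(f"--password={password}")
    return cmd


def run_convert(*args, **kwargs):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_convert_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_convert(...) to execute one.
    print(" ".join(build_convert_command("test.pdf", "test.docx", first_page=2, last_page=5)))
    print(" ".join(build_convert_command("test.pdf", "test.docx", pages=[1, 3, 5])))
```

Passing the arguments to subprocess.run as a list avoids shell quoting problems with file names that contain spaces.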

Use of API

If you want to perform the PDF-to-docx conversion from Python code, pdf2docx provides a complete API. Let's look at the simplest example:

from pdf2docx import Converter


if __name__ == "__main__":
    pdf_file = "test.pdf"
    docx_file = "test.docx"

    # Open the PDF, convert every page (start=0, end=None), then release the file
    conv = Converter(pdf_file)
    conv.convert(docx_file, start=0, end=None)
    conv.close()

For more detailed API documentation, see dothinking.github.io/pdf2docx/mo...
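
In the simplest example, conv.close() is never reached if the conversion raises an exception. A slightly more defensive variant might look like the sketch below. The wrapper convert_pdf and its default output name are choices made for this illustration, not part of pdf2docx; only Converter, convert and close come from the library, and the pages argument is assumed to mirror the --pages flag of the command line.

```python
from pathlib import Path


def convert_pdf(pdf_file, docx_file=None, start=0, end=None, pages=None):
    """Convert a PDF to docx and always close the converter afterwards.

    Hypothetical wrapper for illustration. If docx_file is omitted, the
    output is written next to the PDF with a .docx suffix (a choice made
    here, not a documented pdf2docx behaviour).
    """
    # Imported inside the function so this module can be loaded
    # even where pdf2docx is not installed.
    from pdf2docx import Converter

    pdf_path = Path(pdf_file)
    docx_path = Path(docx_file) if docx_file else pdf_path.with_suffix(".docx")

    conv = Converter(str(pdf_path))
    try:
        # start/end are zero-based; pages is a list of zero-based indexes.
        conv.convert(str(docx_path), start=start, end=end, pages=pages)
    finally:
        # Runs whether or not convert() raised.
        conv.close()
    return docx_path


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print("written:", convert_pdf(sys.argv[1]))
```

The try/finally block is the only real difference from the simplest example: the file handle is released even when a page fails to convert.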

limitations

The current version of pdf2docx only works with text-based PDFs and assumes a left-to-right reading order. Keep this in mind when using it.
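
One way to find out in advance whether a PDF is text-based is to check whether any text can be extracted from it. The sketch below assumes PyMuPDF (imported as fitz), which pdf2docx itself depends on, and the get_text method name used by recent PyMuPDF releases. The helper names and the 20-character threshold are arbitrary choices for illustration, and the result is only a heuristic.

```python
def looks_text_based(page_texts, min_chars=20):
    """Return True if at least one page yields min_chars or more of text.

    page_texts is an iterable with the extracted text of each page.
    (Hypothetical heuristic for illustration; the threshold is arbitrary.)
    """
    return any(len(text.strip()) >= min_chars for text in page_texts)


def extract_page_texts(pdf_file):
    """Extract the plain text of every page using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the heuristic works without it

    with fitz.open(pdf_file) as doc:
        return [page.get_text() for page in doc]


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        if looks_text_based(extract_page_texts(sys.argv[1])):
            print("text layer found")
        else:
            print("no text layer found; the file may be a scanned PDF")
```

A file that fails this check is likely made of scanned images and would need OCR before conversion.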

Topic on Python practical modules

For more useful Python modules, see

xugaoxiang.com/category/py...

