Environment
Preface
Converting PDF files to Word documents is a very common task. Most people's free solution is probably an online conversion service, but uploading files to such a service carries a risk of data leakage. This article introduces pdf2docx, a free, open source tool that performs the conversion locally.
Installing pdf2docx
Installation is simple. Run the following pip command:
pip install pdf2docx
After a successful installation, pdf2docx provides not only the library itself but also an executable named pdf2docx.
For everyday use, you can convert PDF to docx directly with the executable. If you need the conversion inside Python code, you can use the API it provides.
Use of command line
You can view specific help information on the command line through pdf2docx --help
INFO: Showing help with the command 'pdf2docx -- --help'.

NAME
    pdf2docx - Command line interface for ``pdf2docx``.

SYNOPSIS
    pdf2docx COMMAND | -

DESCRIPTION
    Command line interface for ``pdf2docx``.

COMMANDS
    COMMAND is one of the following:

     convert
       Convert pdf file to docx file.

     debug
       Convert one PDF page and plot layout information for debugging.

     gui
       Simple user interface.

     table
       Extract table content from pdf pages.
The help output above lists the commands that pdf2docx supports. Here we focus on convert and gui.
- convert
This is the core function. convert itself accepts many parameters, which you can view with pdf2docx convert --help. The same pattern works for the other commands, so they are not listed in detail below.
(base) PS C:\Users\Administrator> pdf2docx.exe convert --help
INFO: Showing help with the command 'pdf2docx convert -- --help'.

NAME
    pdf2docx convert - Convert pdf file to docx file.

SYNOPSIS
    pdf2docx convert PDF_FILE <flags>

DESCRIPTION
    Convert pdf file to docx file.

POSITIONAL ARGUMENTS
    PDF_FILE
        Type: str
        PDF filename to read from.

FLAGS
    --docx_file=DOCX_FILE
        Type: Optional[str]
        Default: None
        docx filename to write to. Defaults to None.
    --password=PASSWORD
        Type: Optional[str]
        Default: None
        Password for encrypted pdf. Default to None if not encrypted.
    --start=START
        Type: int
        Default: 0
        First page to process. Defaults to 0.
    --end=END
        Type: Optional[int]
        Default: None
        Last page to process. Defaults to None.
    --pages=PAGES
        Type: Optional[list]
        Default: None
        Range of pages. Defaults to None.
    Additional flags are accepted.
        Configuration parameters. .. note

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS
It follows from the help output that converting every page of a PDF only requires
pdf2docx.exe convert test.pdf test.docx
From page 3 to the end
pdf2docx.exe convert test.pdf test.docx --start=2
From the first page through page 10
pdf2docx.exe convert test.pdf test.docx --end=10
From page 2 through page 5
pdf2docx.exe convert test.pdf test.docx --start=1 --end=5
Note that start and end are zero-based page indexes and, as the examples above show, end is exclusive: --start=1 --end=5 covers indexes 1 to 4, that is, pages 2 through 5.
Non-consecutive pages can also be converted in a single run, for example the first, third, and fifth pages
pdf2docx.exe convert test.pdf test.docx --pages=0,2,4
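Because the page arguments are zero-based, it is easy to be off by one when working from the page numbers shown in a PDF viewer. The following is a minimal sketch of the conversion, not part of pdf2docx: the helper names to_zero_based_range and to_zero_based_pages are hypothetical, and the mapping assumes the behavior seen in the examples above (start is a zero-based index, end is exclusive).

```python
def to_zero_based_range(first_page, last_page=None):
    """Map 1-based, inclusive page numbers to (start, end) values.

    Assumption (taken from the examples in this article): `start` is a
    zero-based index and `end` is exclusive, so a 1-based inclusive last
    page can be passed as `end` unchanged.
    """
    if first_page < 1:
        raise ValueError("first_page must be 1 or greater")
    if last_page is not None and last_page < first_page:
        raise ValueError("last_page must not be smaller than first_page")
    start = first_page - 1
    end = last_page  # None means "until the last page"
    return start, end


def to_zero_based_pages(page_numbers):
    """Map a list of 1-based page numbers to zero-based indexes for --pages."""
    return [number - 1 for number in page_numbers]


# Pages 2 through 5 correspond to --start=1 --end=5.
print(to_zero_based_range(2, 5))
# Pages 1, 3 and 5 correspond to --pages=0,2,4.
print(to_zero_based_pages([1, 3, 5]))
```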
If the PDF is encrypted, pass its password like this
pdf2docx.exe convert test.pdf test.docx --password=PASSWORD
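For batch jobs, the same commands can be driven from Python. The sketch below is only an illustration: build_convert_command and convert_folder are hypothetical helper names, the default output name (same name with a .docx extension) is my own choice, and it assumes the pdf2docx executable is on PATH and accepts the flags shown in the help output above.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, docx_path=None, start=None, end=None,
                          pages=None, password=None):
    """Build the argument list for `pdf2docx convert` (zero-based pages)."""
    pdf_path = Path(pdf_path)
    # Assumed default: write the docx next to the PDF, same file stem.
    docx_path = Path(docx_path) if docx_path else pdf_path.with_suffix(".docx")
    cmd = ["pdf2docx", "convert", str(pdf_path), str(docx_path)]
    if start is not None:
        cmd.append(f"--start={start}")
    if end is not None:
        cmd.append(f"--end={end}")
    if pages:
        cmd.append("--pages=" + ",".join(str(page) for page in pages))
    if password is not None:
        cmd.append(f"--password={password}")
    return cmd


def convert_folder(folder, **options):
    """Run the CLI once per PDF in `folder`; raises if a conversion fails."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_convert_command(pdf_path, **options), check=True)


# Show the command for pages 2 through 5 without running it.
print(build_convert_command("test.pdf", start=1, end=5))
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.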
- gui
If you are not comfortable with the command line, pdf2docx also provides a simple graphical interface, which you can launch by typing pdf2docx gui in cmd. It is quite rough, and the button text is not fully displayed, but it works.
Use of API
If you want to perform the PDF to docx conversion in Python, pdf2docx provides a complete API. Let's look at the simplest example
from pdf2docx import Converter

if __name__ == "__main__":
    pdf_file = "test.pdf"
    docx_file = "test.docx"

    # Open the PDF, convert all pages (start=0, end=None), then release the file
    conv = Converter(pdf_file)
    conv.convert(docx_file, start=0, end=None)
    conv.close()
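In a longer script it can be convenient to wrap these calls in one function that always closes the converter, even when the conversion fails. The following is a sketch under stated assumptions, not the library's recommended pattern: convert_pdf is a hypothetical helper name, the default output path is my own choice, and the start, end, pages and password arguments are assumed to mirror the command line flags shown earlier.

```python
from pathlib import Path


def convert_pdf(pdf_file, docx_file=None, start=0, end=None, pages=None,
                password=None):
    """Convert `pdf_file` to docx and return the output path.

    Page arguments are zero-based, as on the command line.
    """
    # Imported here so this module still loads where pdf2docx is not installed.
    from pdf2docx import Converter

    pdf_path = Path(pdf_file)
    # Assumed default: write the docx next to the PDF, same file stem.
    docx_path = Path(docx_file) if docx_file else pdf_path.with_suffix(".docx")

    conv = Converter(str(pdf_path), password=password)
    try:
        conv.convert(str(docx_path), start=start, end=end, pages=pages)
    finally:
        # Release the PDF file handle whether or not the conversion succeeded.
        conv.close()
    return docx_path
```

For example, convert_pdf("test.pdf", start=1, end=5) would convert pages 2 through 5 into test.docx, and convert_pdf("test.pdf", pages=[0, 2, 4]) would convert the first, third, and fifth pages.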
For more detailed API documentation, please refer to the link dothinking.github.io/pdf2docx/mo...
Limitations
The current version of pdf2docx only works with text-based PDFs and assumes a left-to-right reading direction. Keep this in mind when using it.
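Before converting, it can help to check whether a PDF contains any extractable text at all, since a scanned document is unlikely to convert well. The sketch below is a rough heuristic of my own, not a feature of pdf2docx: the helper names and the 20-character threshold are arbitrary assumptions, and it assumes a recent PyMuPDF (the fitz module, which pdf2docx builds on) where page.get_text() is available.

```python
def has_extractable_text(page_texts, min_chars=20):
    """Return True if any page has at least `min_chars` non-whitespace characters.

    `min_chars` is an arbitrary threshold chosen for illustration.
    """
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def read_page_texts(pdf_file):
    """Return the plain text of every page in `pdf_file`."""
    # Imported here so this module still loads where PyMuPDF is not installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_file) as doc:
        return [page.get_text() for page in doc]
```

A call such as has_extractable_text(read_page_texts("test.pdf")) returning False suggests the file is image-based and falls outside what the tool is designed for.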