Python | paging pdf files

Posted by pragma on Sun, 31 Oct 2021 12:38:27 +0100

I don't know if you have ever encountered such a situation, such as an e-book in pdf format. We often browse part of it, and the e-book has a large number of pages. Whenever you need to browse, you need to turn to the corresponding page number, which is a little cumbersome.

In other cases, for example, if we want to share part of the contents of the pdf file to others, we also need to use pdf paging, which can not only analyze the specified contents, but also reduce the size of the sent file O(∩∩) O ha~

In short, pdf pagination is something we will inevitably encounter in real life. When you encounter, how do you solve it?

In Python, pdf pagination is extremely simple. You only need to run a few lines of code to achieve it. No matter how large your pdf file is, let's take a look~
Note: I annotate the function of each part of the code, and then come back to the back of the code. Pay attention to check it~

from PyPDF2 import PdfFileReader, PdfFileWriter
import os


def split(path, page_num):
    try:  # The capture value is an abnormal error, that is, when only one page number is entered
        page_start, page_end = page_num.split()
    except ValueError:
        page_start = page_num
        page_end = page_num
    output_name = os.path.splitext(path)[0] + f'({page_start}-{page_end}).pdf'  # Output split pdf file
    page_start, page_end = int(page_start), int(page_end)  # Convert page number to integer type
    if os.path.splitext(path)[1] == '':  # Determine whether the file format ends in. pdf
        path = path + ".pdf"
    try:
        pdf = PdfFileReader(path)  # Read pdf
    except FileNotFoundError:
        print("Error:Please confirm your input pdf Whether the file exists!")
        return
    pdf_writer = PdfFileWriter()  # pdf write object

    if page_start < 1 or page_end > pdf.getNumPages():  # Judge whether the page number is within a reasonable range, that is, whether it is less than or greater than the number of pages of pdf file
        print("Error:The page number exceeds the reasonable range. Please make sure that the range of pages you entered is within the reasonable range!")
        return
    for page in range(page_start-1, page_end):  # The range of pdf pages to be divided. Since the number of pages read starts from 0, it is reduced by 1
        pdf_writer.addPage(pdf.getPage(page))  # pdf pages are read, stored in memory and not written
    with open(output_name, 'wb') as output_pdf:
        pdf_writer.write(output_pdf)  # Start writing pdf in the specified page range
    print("After paging, please check:" + output_name)


if __name__ == '__main__':
    source_path = input("Please enter the to be split pdf File:")  # pdf file to be split
    pages = input("Please enter the start page and end page of the split,Separate with spaces:")
    split(source_path, pages)

In fact, the main code is as follows:

pdf = PdfFileReader(path)  # Read pdf
pdf_writer = PdfFileWriter()  # pdf write object
for page in range(page_start, page_end):  # pdf page number interval to be split
    pdf_writer.addPage(pdf.getPage(page))  # pdf pages are read, stored in memory and not written
with open(output_name, 'wb') as output_pdf:
    pdf_writer.write(output_pdf)  # Start writing pdf in the specified page range

I just added some code for exception error capture and resolution on the basis of it, and realized the recycling of the code, not just for a pdf file.

Code usage display:
1. Source file

The number of pages in the book is still quite large

2. Target page range

Note that although the page number here is 271, it does not mean that it is 271, because the page number of the read pdf file is the page number of the whole file, such as the cover of the book and the number of pages in the table of contents.

Therefore, we can simply calculate the difference between the first page of the book page and the cover page. The difference here is 13 pages, so 271 + 13 = 284 is used as the starting page of segmentation. The number of pages to be intercepted corresponds to the page number on the book is 276276 + 13 = 289, so the ending page is 289

3. Code operation

4. Code display of exception capture:
The start page entered is less than 1:

The pdf page number is only 600 in total, and then the end page is 700:

5. Others
The interval is required, but you can enter 1,1:

The code can continue to be improved. Interested partners can start their smart head, melon seeds haha

Topics: Python Pycharm