Crawler series: read CSV, PDF, Word documents

Posted by Timma on Fri, 24 Dec 2021 07:06:51 +0100

In the last issue, we explained how to read document encodings using Python. In this issue, we will explain how to use Python to process CSV, PDF and Word documents.

CSV

When collecting web pages, you may encounter CSV files, or your project may need to save data to CSV files. Python has a great standard library module, csv, for reading and writing CSV files. Although this library can handle many CSV variants, we focus on the standard CSV format here.

Read CSV file

Python's csv library is mainly designed for local files, that is, it assumes your CSV files are saved on your computer. When collecting network data, however, many files are online. There are several ways to solve this problem:

  • Manually download the CSV file to the local computer, and then point Python at the file's location;
  • Write a Python program that downloads the file, reads it, and deletes the downloaded copy afterwards;
  • Read the file directly from the Internet into a string, and then wrap it in a StringIO object so that it behaves like a file.
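As a rough illustration of the second option, the following sketch writes some bytes to a temporary directory, reads them back with csv, and lets the directory be cleaned up automatically. The sample bytes and the file name country.csv are assumptions standing in for a real download (for example response.content from requests).

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for the bytes of a downloaded CSV file.
csv_bytes = "name,capital\nFrance,Paris\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "country.csv")

    # Save the "downloaded" bytes to disk.
    with open(local_path, "wb") as f:
        f.write(csv_bytes)

    # Read the local copy back with the csv module.
    with open(local_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

# The temporary directory and the file inside it are removed at this point.
file_still_exists = os.path.exists(local_path)

print(rows)
print(file_still_exists)
```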

Although the first two methods work, you can just as easily keep the CSV file in memory, so there is no need to download it and take up hard disk space. Read the file directly into a string and wrap it in a StringIO object; Python then treats it as a file, and nothing has to be saved to disk. The following program obtains a CSV file from the Internet and prints each row to the command line:

import requests
from io import StringIO
import csv


class ProcessCSVPDFDOCX(object):
    def __init__(self):
        self._csv_path = 'https://image.pdflibr.com/crawler/blog/country.CSV'
        self._session = requests.Session()

    def read_csv(self):
        response = self._session.get(self._csv_path)
        # Set text to utf-8 encoding
        response.encoding = 'utf-8'
        response_text = response.text
        data_file = StringIO(response_text)
        dict_reader = csv.DictReader(data_file)

        print(dict_reader.fieldnames)

        for row in dict_reader:
            print(row)


if __name__ == '__main__':
    ProcessCSVPDFDOCX().read_csv()

csv.DictReader returns each row of the CSV file as a dictionary object rather than a list, and stores the list of field names in dict_reader.fieldnames. The field names are also used as the keys of each dictionary.
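To make the difference concrete, here is a small sketch that parses the same text with csv.reader and with csv.DictReader. The CSV text is made-up sample data, not the contents of the country.CSV file used above.

```python
import csv
from io import StringIO

# Hypothetical sample data standing in for the downloaded CSV text.
csv_text = "name,capital\nFrance,Paris\nJapan,Tokyo\n"

# csv.reader yields every row as a list, including the header row.
list_rows = list(csv.reader(StringIO(csv_text)))

# csv.DictReader uses the first row as field names and yields
# one dictionary per remaining row.
dict_reader = csv.DictReader(StringIO(csv_text))
fieldnames = dict_reader.fieldnames
dict_rows = [dict(row) for row in dict_reader]

print(list_rows[0])
print(fieldnames)
print(dict_rows[0])
```

With csv.reader you would have to skip or interpret the header row yourself; with csv.DictReader each value can be looked up by its column name.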

PDF

In a sense, Adobe's invention of the PDF format (Portable Document Format) in 1993 was a technological revolution. PDF allows users to view pictures and text documents in the same way on different systems, regardless of which system they were created on.

Although displaying PDFs on web pages is outdated (you can already display content as HTML, so why would you want this static, slow-loading format?), PDF is still everywhere, especially when dealing with business reports and forms.

At present, many PDF parsing libraries were written for Python 2.x and have not been migrated to Python 3.x. However, because PDF is a relatively simple and open document format, there are some good Python libraries that can read PDF files and support Python 3.x.

PDFMiner3K is a very useful library (a Python 3.x port of PDFMiner). It is very flexible and can be used from the command line or integrated into your code. It can also handle different language encodings, and it is very convenient for processing files from the network.

You can download the source package of this module (https://pypi.org/project/pdfminer3k/), unzip it, and install it with the following command:

python setup.py install

We can also use pip to install:

pip install pdfminer3k

The following example reads a PDF from the Internet and extracts its text into a string, using a StringIO object as the output buffer:

import requests
from io import StringIO
import csv
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from urllib.request import urlopen


class ProcessCSVPDFDOCX(object):
    def __init__(self):
        self._session = requests.Session()
        self._pdf_path = 'https://image.pdflibr.com/crawler/blog/markdown-cheatsheet-online.pdf'


    def read_pdf(self, pdf_file):
        # The resource manager holds shared resources such as fonts
        rscmgr = PDFResourceManager()
        # Extracted text is written into this in-memory buffer
        retstr = StringIO()
        laparams = LAParams()
        device = TextConverter(rscmgr, retstr, laparams=laparams)
        process_pdf(rscmgr, device, pdf_file)
        device.close()

        content = retstr.getvalue()
        retstr.close()
        return content

    def read_pdf_main(self):
        pdf_file = urlopen(self._pdf_path)
        output_string = self.read_pdf(pdf_file)
        print(output_string)
        pdf_file.close()


if __name__ == '__main__':
    ProcessCSVPDFDOCX().read_pdf_main()

The biggest advantage of read_pdf is that it works on any file object: if your PDF file is on your computer, you can simply replace pdf_file, the object returned by urlopen, with an ordinary open() file object (opened in binary mode).

The output may not be perfect, especially when the file contains pictures, unusual text formatting, or tables and charts. However, for most PDFs that contain only plain text, the output is no different from a plain text file.

Microsoft Word and docx

Many Internet users like to complain about Word. Its special talent seems to be turning content that could have been a simple TXT or PDF file into a large, slow, hard-to-open monster that often runs into compatibility problems when moving between systems and versions, and that for some reason stays editable even after the content has been finalized. Word documents were never designed to be passed around frequently. Nevertheless, they are very popular on some websites and hold important documents, information, even charts and multimedia; in short, content that should really be published as HTML.

Before 2008, Microsoft Office products used the .doc file format for Word. This binary format is difficult to read, and little software other than Word could read it. In order to keep up with the times and bring its software in line with the standards of mainstream software, Microsoft decided to adopt an open, XML-based format standard (Office Open XML). After that, new versions of Word became compatible with other word processing software. This format is .docx.

However, Python's support for the .docx format used by Google Docs, Open Office and Microsoft Office is not good enough. Although there is a python-docx library, it only supports creating documents and reading some basic data, including file size and file title, and does not support reading the text. If we want to read the body content of Microsoft Office files, we need to find a way ourselves.

The first step is to read the XML from the file:

import requests
from io import StringIO
import csv
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from urllib.request import urlopen
from io import open, BytesIO
from zipfile import ZipFile


class ProcessCSVPDFDOCX(object):
    def __init__(self):
        self._csv_path = 'https://image.pdflibr.com/crawler/blog/country.CSV'
        self._session = requests.Session()
        self._pdf_path = 'https://image.pdflibr.com/crawler/blog/markdown-cheatsheet-online.pdf'
        self._docx_path = 'https://image.pdflibr.com/crawler/blog/test_document.docx'

  
    def convert_docx_to_xml(self):
        word_file = urlopen(self._docx_path).read()
        word_file = BytesIO(word_file)
        document = ZipFile(word_file)
        xml_content = document.read('word/document.xml')
        print(xml_content.decode('utf-8'))


if __name__ == '__main__':
    ProcessCSVPDFDOCX().convert_docx_to_xml()

This code reads the remote Word document into a binary file object (BytesIO is similar to the StringIO used above), decompresses it using the Python standard library zipfile (all .docx files are zip-compressed to save space), and then reads the decompressed word/document.xml member, which is XML.
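The BytesIO and ZipFile steps can be tried without a network connection. The sketch below builds a tiny stand-in archive in memory and reads a member back from it; the archive contents are invented for illustration and are not a valid Word document.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small zip archive in memory. The member contents are hypothetical
# placeholders, not a real .docx payload.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document><w:t>Hello</w:t></w:document>")
    archive.writestr("docProps/core.xml", "<coreProperties/>")

# These bytes play the role of urlopen(...).read() in the code above.
docx_bytes = buffer.getvalue()

# Wrap the bytes in BytesIO so ZipFile can treat them as a file,
# then list the members and read one of them.
with ZipFile(BytesIO(docx_bytes)) as document:
    member_names = document.namelist()
    xml_content = document.read("word/document.xml").decode("utf-8")

print(member_names)
print(xml_content)
```

Calling namelist() on a real .docx file is a quick way to see which other parts the archive contains besides word/document.xml.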

The decompressed XML file contains a lot of information. Fortunately, all the text content, including the title, is contained in <w:t> tags, which makes it much easier to handle.

import requests
from io import StringIO
import csv
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from urllib.request import urlopen
from io import open, BytesIO
from zipfile import ZipFile
from bs4 import BeautifulSoup


class ProcessCSVPDFDOCX(object):
    def __init__(self):
        self._csv_path = 'https://image.pdflibr.com/crawler/blog/country.CSV'
        self._session = requests.Session()
        self._pdf_path = 'https://image.pdflibr.com/crawler/blog/markdown-cheatsheet-online.pdf'
        self._docx_path = 'https://image.pdflibr.com/crawler/blog/test_document.docx'


    def convert_docx_to_xml(self):
        word_file = urlopen(self._docx_path).read()
        word_file = BytesIO(word_file)
        document = ZipFile(word_file)
        xml_content = document.read('word/document.xml')
        print(xml_content.decode('utf-8'))
        # html.parser treats "w:t" as an ordinary tag name, so it can be searched directly
        word_obj = BeautifulSoup(xml_content.decode('utf-8'), features="html.parser")
        text_string = word_obj.find_all("w:t")
        for text_ele in text_string:
            print(text_ele.text)


if __name__ == '__main__':
    ProcessCSVPDFDOCX().convert_docx_to_xml()

The result of this code may not be perfect, but it is almost there: it prints the text of each <w:t> tag on its own line.
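Because a single paragraph is often split across several <w:t> tags, printing one tag per line can break sentences apart. As an alternative to the BeautifulSoup approach (not the method used above), the following sketch uses the standard library's xml.etree.ElementTree to join the <w:t> text inside each <w:p> paragraph. The function name paragraphs_from_document_xml and the sample XML are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the "w:" prefix refers to in WordprocessingML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, heavily simplified stand-in for word/document.xml.
sample_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>A Word </w:t></w:r><w:r><w:t>Document</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)


def paragraphs_from_document_xml(xml_text):
    """Return one string per <w:p> element, joining its <w:t> text nodes."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        pieces = [t.text or "" for t in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return paragraphs


paragraphs = paragraphs_from_document_xml(sample_xml)
for line in paragraphs:
    print(line)
```

This is only a sketch: real documents also contain tables, text boxes and other structures that may need separate handling.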

Summary

This article mainly explains how to use Python to process online CSV, PDF and Word documents. Because there is no good library for docx documents, we also showed a workaround: unzipping the docx file and parsing its XML ourselves. With the techniques in this article you can process most document content found on the Internet.

All the source code for this article is hosted on GitHub: https://github.com/sycct/Scra...

If you have any questions, please open an issue.

Topics: Python csv pdf word