
How to Read PDF Files with Python

TheAutomatic.net

Contributor:
TheAutomatic.net

By:

Blogger, TheAutomatic.net, and Senior Data Scientist

Background

In a previous article, we talked about how to scrape tables from PDF files with Python. In this post, we’ll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages – pdfminer and pytesseract.

pdfminer (specifically pdfminer.six, which is a more up-to-date fork of pdfminer) is an effective package to use if you’re handling PDFs that are typed and you’re able to highlight the text. On the other hand, to read scanned-in PDF files with Python, the pytesseract package comes in handy, which we’ll see later in the post.

Scraping highlightable text

For the first example, let’s scrape a 10-K form from Apple (see here). First, download the file to a local directory and save it as “apple_10k.pdf”. The first package we’ll use to extract text is pdfminer. You can install the version we need with pip (note that we’re installing pdfminer.six):

pip install pdfminer.six

Next, let’s import the extract_text function from pdfminer.high_level. This module provides higher-level functions for scraping text from PDF files. As the code below shows, extract_text lets us pull the text out of a PDF with one line of code (not counting the import)! This is an advantage of pdfminer over some other packages, such as PyPDF2.

from pdfminer.high_level import extract_text
 
text = extract_text("apple_10k.pdf")
 
print(text)

The code above will extract the text from each page in the PDF. To limit extraction to specific pages, pass them to extract_text using the page_numbers parameter. Page numbers are zero-indexed, so 0 refers to the first page.

# extract text from the first 10 pages
text10 = extract_text("apple_10k.pdf", page_numbers = range(10))
 
# get text from pages 0, 2, and 4
text_pages = extract_text("apple_10k.pdf", page_numbers = [0, 2, 4])
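
Because page_numbers is zero-indexed, it is easy to be off by one when working from the page numbers shown in a PDF viewer. The sketch below shows one way to convert an inclusive, 1-based page range into the indices extract_text expects. The helper names to_page_indices and extract_page_range are hypothetical and not part of pdfminer.

```python
def to_page_indices(first_page, last_page):
    """Convert an inclusive, 1-based page range into zero-based indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return list(range(first_page - 1, last_page))


def extract_page_range(pdf_path, first_page, last_page):
    """Extract text from pages first_page..last_page (1-based, inclusive)."""
    # Imported here so the conversion helper can be used without pdfminer.six.
    from pdfminer.high_level import extract_text

    indices = to_page_indices(first_page, last_page)
    return extract_text(pdf_path, page_numbers=indices)


# Pages 1 through 3 in a PDF viewer map to indices 0, 1, and 2.
print(to_page_indices(1, 3))
```

With this in place, extract_page_range("apple_10k.pdf", 1, 10) should request the same pages as the range(10) example above.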

Scraping a password-protected PDF

If the PDF we want to scrape is password-protected, we just need to pass the password to the same function using the password parameter.

text = extract_text("apple_10k.pdf", password = "top secret password")
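
Hardcoding a password in a script is convenient for a demo but risky in practice. As a minimal sketch, the password could instead be read from an environment variable. The variable name PDF_PASSWORD and both helper functions are assumptions made for illustration.

```python
import os


def get_pdf_password(env_var="PDF_PASSWORD"):
    """Return the password stored in env_var, or an empty string if unset."""
    return os.environ.get(env_var, "")


def extract_protected_text(pdf_path, env_var="PDF_PASSWORD"):
    """Extract text from a password-protected PDF using a password from the environment."""
    # Imported here so the password lookup can be used without pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, password=get_pdf_password(env_var))
```

An empty string matches the default value of the password parameter, so the same helper also works for PDFs that are not protected.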

Scraping text from scanned-in images

If a PDF contains scanned-in images of text, it can still be scraped, but doing so requires a few additional steps. In this case, we’ll use two other Python packages: pytesseract and Wand. Wand converts PDFs into image files, while pytesseract extracts text from images.

Since pytesseract doesn’t work directly on PDFs, we have to first convert our sample PDF into an image (or collection of image files).
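
It is not always obvious up front whether a PDF holds real text or only scanned images. One rough heuristic, offered here as an assumption rather than a rule, is to run the text extraction from the first half of this post and treat the file as scanned if almost nothing comes back. The function name looks_scanned and the min_chars threshold are hypothetical.

```python
def looks_scanned(extracted_text, min_chars=20):
    """Heuristic: a PDF that yields almost no text is probably scanned images."""
    # strip() removes whitespace, including the form feeds that often
    # separate pages in extracted text.
    return len(extracted_text.strip()) < min_chars


print(looks_scanned(""))
print(looks_scanned("Apple Inc. Form 10-K annual report"))
```

A threshold this simple can misfire on nearly blank documents, so it is best treated as a hint about which approach to try first.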

Initial setup

Let’s get started by setting up the Wand package. Wand can be installed using pip:

pip install Wand

This package also requires a tool called ImageMagick to be installed (see here for more details).

There are other packages that convert PDFs into image files. For example, pdf2image is another choice, but we’ll use Wand in this tutorial.

Additionally, let’s go ahead and install pytesseract. This package can also be installed using pip:

pip install pytesseract

pytesseract depends upon tesseract being installed (see here for instructions). tesseract is an underlying utility that performs OCR (Optical Character Recognition) on images to extract text.
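
Before going further, it can help to confirm that the tesseract executable is reachable. The sketch below uses only the standard library and assumes the executable is named tesseract and is on the PATH, which is where pytesseract looks by default. The helper name missing_tools is hypothetical.

```python
import shutil


def missing_tools(names):
    """Return the executables from names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(["tesseract"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("tesseract is available")
```

This only checks for the command-line tool. Wand loads the ImageMagick library directly, so a check like this says nothing about whether Wand can find ImageMagick.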

Converting PDFs into image files

Now that our setup is complete, we can convert a PDF into a collection of image files by converting each page into its own image file. In addition to Wand, we’ll import the os package to help build the name of each output image file.

For this example, we’ll use a scanned-in version of the first three pages of the 10-K form from earlier in this post.

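from wand.image import Image
import os

pdf_file = "scanned_apple_10k_snippet.pdf"

# collect the name of each page image we write out
files = []
with Image(filename=pdf_file, resolution=500) as conn:
    # conn.sequence holds one frame per page of the PDF
    for index, image in enumerate(conn.sequence):
        image_name = os.path.splitext(pdf_file)[0] + str(index + 1) + '.png'
        # copy the page into its own Image and close it once saved
        with Image(image) as page:
            page.save(filename=image_name)
        files.append(image_name)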

In the with statement above, we open a connection to the PDF file. The resolution parameter specifies the DPI we want for the image outputs – in this case 500. Within the for loop, we specify the output filename, save the image using Image.save, and lastly append the filename to the list of image files. This way, we can loop over the list of image files, and scrape the text from each.
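
The naming scheme in the loop can also be pulled out into a small helper, which makes it easier to check on its own. This is a sketch only. The name page_image_name and the optional width argument (for zero-padding, so that page 10 sorts after page 9) are assumptions added for illustration.

```python
import os


def page_image_name(pdf_file, index, extension=".png", width=0):
    """Build the output image name for the zero-based page index of pdf_file."""
    stem = os.path.splitext(pdf_file)[0]
    # Pages are numbered from 1 in the file name; zfill pads with zeros if asked.
    page_label = str(index + 1).zfill(width)
    return stem + page_label + extension


print(page_image_name("scanned_apple_10k_snippet.pdf", 0))
```

With the default width of 0, the helper produces the same names as the loop above.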

Running the conversion code on the three-page sample PDF should create three separate image files:

["scanned_apple_10k_snippet1.png", 
 "scanned_apple_10k_snippet2.png", 
 "scanned_apple_10k_snippet3.png"]

Using pytesseract on each image file

Next, we can use pytesseract to extract the text from each image file. In the code below, we store the extracted text from each page as a separate element in a list.

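import pytesseract
# pytesseract works with PIL images; alias the import so it does not
# shadow the Image class imported from wand earlier
from PIL import Image as PILImage

all_text = []
for file in files:
    text = pytesseract.image_to_string(PILImage.open(file))
    all_text.append(text)

Alternatively, we can use a list comprehension:

all_text = [pytesseract.image_to_string(PILImage.open(file)) for file in files]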
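
Once the text of each page is in a list, a common next step is to join the pages into a single string. The sketch below is one possible way to do that. The function name combine_pages and the blank-line separator are assumptions, not something pytesseract provides.

```python
def combine_pages(page_texts, separator="\n\n"):
    """Join per-page text into one string, skipping pages with no text."""
    cleaned = [text.strip() for text in page_texts]
    return separator.join(text for text in cleaned if text)


# Stand-in values for the list of per-page OCR results.
sample_pages = ["Page one\n", "", "  Page two  "]
print(combine_pages(sample_pages))
```

Passing the list of extracted page texts to combine_pages gives one string that can be searched or written to a file.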

Visit TheAutomatic.net to learn more about this topic.

This material is from TheAutomatic.net and is being posted with permission from TheAutomatic.net.