Working with PDF Files
Welcome back, Agent. Often you will have to deal with PDF files. There are many libraries in Python for working with PDFs, each with its pros and cons; the most common one is pypdf. You can install it with pip:
pip install pypdf
Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, use a special encoding, are encrypted, or were created with a program that doesn't work well with pypdf may not be readable. If you find yourself in this situation, try the libraries linked above, but keep in mind that these may not work either. PDFs have many parameters and their settings can be quite non-standard; for example, text may be stored as an image rather than as encoded characters.
As far as this lecture is concerned, we will only use pypdf to read the text from a PDF document. We won't be grabbing images or other media files from a PDF.
Working with pypdf
Let's begin by showing the basics of the pypdf library.
Reading PDFs
Similar to the csv library, we open a PDF, then create a reader object for it. Notice how we use the binary read mode, 'rb', instead of just 'r'.
We can then extract the text:
Adding to PDFs
We cannot simply write text into a PDF from Python the way we would into a text file. A Python string is a single plain type, while text in a PDF carries fonts, placements, and many other parameters.
What we can do is copy pages and append pages to the end.
Now we have copied a page and added it to another new document!
Simple Example
Let's try to grab all the text from this PDF file:
Excellent work! That is all for pypdf for now. Remember that this won't work with every PDF file, and that what we covered here is limited to the text of PDFs.