GitHub Repository: pierian-data/complete-python-3-bootcamp
Path: blob/master/15-PDFs-and-Spreadsheets/03-PDFs-Spreadsheets-Puzzle-Solution.ipynb
Kernel: Python 3 (ipykernel)


Content Copyright by Pierian Data

PDFs and Spreadsheets Puzzle Exercise

You will need to work with two files for this exercise and solve the following tasks:

  • Task One: Grab the Google Drive link from the .csv file. (Hint: It's along the diagonal.)

  • Task Two: Download the PDF from the Google Drive link (we already downloaded it for you just in case you can't download from Google Drive) and find the phone number that is in the document. Note: There are different ways of formatting a phone number!

import csv

Grab all the lines of data.

data = open('Exercise_Files/find_the_link.csv', encoding="utf-8")
csv_data = csv.reader(data)
data_lines = list(csv_data)

We can see it's along the diagonal, which means each value sits at the index position that matches its row's number. So the 1st letter is the 1st item in the 1st row, the 2nd letter is the 2nd item in the 2nd row, the 3rd letter is the 3rd item in the 3rd row, and so on. We can use enumerate to track the row number and simply index into each row of data_lines.

Method One

link_list = []
for row_num, data in enumerate(data_lines):
    link_list.append(data[row_num])
''.join(link_list)
'https://drive.google.com/open?id=1G6SEgg018UB4_4xsAJJ5TdzrhmXipr4Q'

Method Two

link_str = ''
for row_num, data in enumerate(data_lines):
    link_str += data[row_num]
link_str
'https://drive.google.com/open?id=1G6SEgg018UB4_4xsAJJ5TdzrhmXipr4Q'
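
Both methods can also be condensed into a single comprehension. The following is only a minimal sketch: `sample_csv` is a made-up in-memory stand-in for the exercise file, and the names `handle`, `rows`, and `hidden` are illustrative, not part of the original solution.

```python
import csv
import io

# Hypothetical stand-in for find_the_link.csv: the hidden word runs down the diagonal.
sample_csv = "h,x,x,x\nx,t,x,x\nx,x,t,x\nx,x,x,p\n"

# csv.reader accepts any iterable of lines, so an in-memory buffer behaves like a file.
# The with-block closes the buffer once the rows have been read.
with io.StringIO(sample_csv) as handle:
    rows = list(csv.reader(handle))

# Take item i from row i and join the pieces in one expression.
hidden = ''.join(row[i] for i, row in enumerate(rows))
print(hidden)  # http
```

With the real file, the same pattern should work by swapping the buffer for `open('Exercise_Files/find_the_link.csv', encoding='utf-8', newline='')`, which also makes sure the file gets closed.
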
import PyPDF2
f = open('Exercise_Files/Find_the_Phone_Number.pdf','rb')
pdf = PyPDF2.PdfReader(f)
len(pdf.pages)
17

Phone Number Matching

There are lots of ways to do this, but you had to figure out that the phone number was in the format ###.###.####.

Hint: https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python

import re
pattern = r'\d{3}'
all_text = ''
for n in range(len(pdf.pages)):
    page = pdf.pages[n]
    page_text = page.extract_text()
    all_text = all_text + ' ' + page_text
for match in re.finditer(pattern, all_text):
    print(match)
<re.Match object; span=(650, 653), match='000'>
<re.Match object; span=(18270, 18273), match='000'>
<re.Match object; span=(35890, 35893), match='000'>
<re.Match object; span=(42919, 42922), match='505'>
<re.Match object; span=(42923, 42926), match='503'>
<re.Match object; span=(42927, 42930), match='445'>
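
The last three matches sit exactly one character apart (42922 to 42923, then 42926 to 42927), which hints at a single separator character between the digit groups. One way to find out what that separator is, sketched here on a made-up `sample_text` rather than the real PDF text, is to look at the characters between neighbouring matches:

```python
import re

# Made-up text standing in for the text extracted from the PDF.
sample_text = "lorem ipsum 000 dolor sit amet 505.503.4455 consectetur"

hits = list(re.finditer(r'\d{3}', sample_text))

# Keep the text between consecutive matches that are exactly one character apart.
separators = [
    sample_text[prev.end():cur.start()]
    for prev, cur in zip(hits, hits[1:])
    if cur.start() - prev.end() == 1
]
print(separators)  # ['.', '.']
```
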

Once you know the correct pattern:

import re
# Escape the dots so they match a literal '.' rather than any character
pattern = r'\d{3}\.\d{3}\.\d{4}'
for n in range(len(pdf.pages)):
    page = pdf.pages[n]
    page_text = page.extract_text()
    match = re.search(pattern, page_text)
    if match:
        print(match.group())
505.503.4455
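
Since the exercise notes that phone numbers can be formatted in different ways, a more tolerant pattern may be useful when the format is not known in advance. This is a rough sketch under assumptions: `sample` is invented text that repeats the number above in a few common styles, and `flexible` covers only dot, dash, space, and parenthesised area-code variants.

```python
import re

# Invented text: the same number written in several common styles.
sample = "Call 505.503.4455, 505-503-4455, (505) 503-4455 or 505 503 4455."

# Optional parentheses around the area code, then an optional
# dot, dash, or whitespace separator between the digit groups.
flexible = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'

found = re.findall(flexible, sample)
print(found)
# ['505.503.4455', '505-503-4455', '(505) 503-4455', '505 503 4455']
```

Because both separators are optional, this pattern would also match ten digits in a row, so it can produce false positives on text that contains other long numbers.
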