Feb-01-2023, 03:38 PM
(This post was last modified: Feb-01-2023, 03:38 PM by standenman.)
I am trying to split a PDF of medical records by date of treatment. In this PDF, "Visit Date: ##/##/####" marks the beginning of one or more pages of notes for that given date. I want to split the PDF into separate PDFs, one for each treatment date. The code below runs and prints a series of lines to the terminal, each either saying "You Failed" or something of this form:
[0, IndirectObject(612, 0, 2464980264080)]
unknown widths :
No PDF files are created, as far as I can find. What am I doing wrong?
import re
import pypdf
# Open the PDF file
pdf_file = pypdf.PdfReader(open("Documents/VisitDate.pdf", "rb"))
# Define the regex pattern
pattern = re.compile("Visit Date: ^[0-9]{1,2}\\/[0-9]{1,2}\\/[0-9]{4}$")
# Loop through each page of the PDF
for i in range(len(pdf_file.pages)):
    page = pdf_file.pages[i]
    text = page.extract_text()
    # Check if the regex value is in the page text
    if pattern.search(text):
        # If the regex value is found, create a new PDF file
        output_pdf = pypdf.PdfFileWriter()
        output_pdf.addPage(page)
        with open("output_{}.pdf".format(i), "wb") as output_file:
            output_pdf.write(output_file)
    else:
        print("You Failed")
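
One likely cause, offered as a sketch rather than a confirmed diagnosis: the pattern places `^` and `$` in the middle of the expression, so `pattern.search(text)` can never match text that follows "Visit Date: ", every page falls into the `else` branch, and no file is ever written. In addition, `PdfFileWriter` and `addPage` are the older PyPDF2-era names; current pypdf uses `PdfWriter` and `add_page`. The version below drops the anchors and groups pages by date, assuming that a page without a "Visit Date:" line continues the most recent dated page. The names `VISIT_DATE`, `group_pages_by_visit`, `split_pdf_by_visit`, and the `visit_` output prefix are hypothetical, chosen only for illustration.

```python
import re

# No ^ or $ anchors: the date appears in the middle of the page text.
VISIT_DATE = re.compile(r"Visit Date:\s*(\d{1,2}/\d{1,2}/\d{4})")


def group_pages_by_visit(page_texts):
    """Map each visit date to the indices of the pages that belong to it.

    Assumption: a page with no "Visit Date:" line continues the most
    recent dated page. Pages before the first dated page are skipped.
    """
    groups = {}
    current = None
    for index, text in enumerate(page_texts):
        match = VISIT_DATE.search(text or "")
        if match:
            current = match.group(1)
        if current is not None:
            groups.setdefault(current, []).append(index)
    return groups


def split_pdf_by_visit(src_path, out_prefix="visit_"):
    """Write one PDF per visit date found in src_path."""
    # Imported here so the grouping helper above has no third-party dependency.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    texts = [page.extract_text() for page in reader.pages]
    for date, indices in group_pages_by_visit(texts).items():
        writer = PdfWriter()
        for i in indices:
            writer.add_page(reader.pages[i])
        # Slashes are not valid in file names, so 1/02/2023 becomes 1-02-2023.
        out_name = "{}{}.pdf".format(out_prefix, date.replace("/", "-"))
        with open(out_name, "wb") as output_file:
            writer.write(output_file)
```

Calling `split_pdf_by_visit("Documents/VisitDate.pdf")` would then write files such as `visit_1-02-2023.pdf` into the current working directory. If the same date appears in two separate places in the records, this sketch merges those pages into one file, which may or may not be what is wanted.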
