Jun-15-2021, 06:22 PM
Dear Python users,
I am currently learning Python and am using Python 3.
I am trying to convert several PDF files into one CSV file.
pdfminer seems to be the best package for converting PDFs. Here is the code that I have written so far:
import io
import os
from IPython.core.display import display
from pdfminer3.converter import TextConverter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfpage import PDFPage
import pandas as pd # possibly necessary to convert into csv
import csv
# Setting up the pdf file for processing and extracting the text in it into a string
resource_manager = PDFResourceManager()
out_text = io.StringIO()
converter = TextConverter(resource_manager, out_text, laparams=LAParams())  # write extracted text into out_text
page_interpreter = PDFPageInterpreter(resource_manager, converter)
def searchpdf():
    pathextension = r'/where I have the pdfs saved'  # -----> Folder where all the files are stored
    for path in os.listdir(pathextension):
        full_path = os.path.join(pathextension, path)
        # Checks that the entry is a file and that its extension is .pdf
        if os.path.isfile(full_path) and os.path.splitext(path)[1] == ".pdf":
            # Opens each path and associated pdf file
            with open(full_path, 'rb') as searchfullpdf:
                # Running scans over each file
                for page in PDFPage.get_pages(searchfullpdf, caching=True, check_extractable=True):
                    page_interpreter.process_page(page)
                textfound = out_text.getvalue()  # Returns the values found in each file

My question is how I should continue in order to save my results to a CSV file. Adding [input_file = csv.DictReader(open("pdfdata.csv"))] does not work and seems too trivial.
My objective is to obtain a CSV file that looks like:
file; text;
file-xxx; Here is some information;
file-yyy; Here is more information;
...
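One way to get that layout, as a minimal sketch only: collect one (file name, text) pair per PDF and pass them to csv.writer with delimiter=';'. The function name write_pdf_texts is hypothetical, and the extraction step is left out here. Note that csv.writer does not add the trailing ';' shown in the sample above.

```python
import csv


def write_pdf_texts(rows, csv_path):
    """Write (file name, text) pairs to a semicolon-delimited CSV file.

    `rows` is any iterable of 2-tuples; the helper name is illustrative only.
    """
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=";")
        writer.writerow(["file", "text"])  # header row
        for name, text in rows:
            # Collapse whitespace so multi-line page text stays on one CSV line
            writer.writerow([name, " ".join(text.split())])
```

For example, write_pdf_texts([("file-xxx", "Here is some information")], "pdfdata.csv") would write a header line followed by one line for file-xxx.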
To get the names of the files into the CSV file, I have this code:
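# 'w' creates the file if it does not exist; newline='' avoids blank rows on Windows
with open("C:/Users/mydirectory/output.csv", 'w', newline='') as f:
    w = csv.writer(f)
    for path, dirs, files in os.walk("C:/Users/mydirectory"):
        for filename in files:
            w.writerow([filename])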
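
The file names and the extracted text could probably be produced together rather than in two separate loops. A rough sketch, assuming some function that returns the text of a single PDF: it is called extract_text here and is a hypothetical placeholder for the pdfminer code above, ideally using a fresh io.StringIO per file, since a shared buffer keeps the text of earlier files. The name iter_pdf_texts is also made up for illustration.

```python
import os


def iter_pdf_texts(folder, extract_text):
    """Yield (file name without extension, text) for each PDF in `folder`.

    `extract_text` is a hypothetical callable: it takes a path and returns a str.
    """
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        stem, ext = os.path.splitext(name)
        # Skip sub-folders and anything that is not a .pdf (case-insensitive)
        if os.path.isfile(full_path) and ext.lower() == ".pdf":
            yield stem, extract_text(full_path)
```

Each yielded pair is one row of the target CSV file, so the result can be passed directly to a csv.writer loop.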
