Jan-30-2021, 09:39 AM
Hi,
I have been testing pdfplumber and pdfminer, and at this stage I am not sure which one I prefer. Pdfminer does a better job of extracting text from an unstructured PDF, but it does not seem as easy to use. It looks like it takes a lot more code to open a PDF page by page with pdfminer than with pdfplumber.
Does anyone know of a more concise way to do that in pdfminer than the code shown below?
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
fp = open('file', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
device = PDFDevice(rsrcmgr)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
Thanks!
