Python Forum
How to properly extract mathematical equations and images from PDF for a Python RAG c
#1
Hi everyone,
I'm building a local AI RAG chatbot application in Python that should answer strictly from user‑provided documents. I'm running into an issue when extracting content from PDFs. When I use something like pypdf and then split the text into chunks, mathematical equations and images are extracted poorly or not at all.

Does anyone know a reliable way to extract mathematical equations (preferably in a usable format) and images from PDF files, so that I can chunk them and index everything with FAISS for use in a RAG pipeline?
Any recommended libraries, tools, or workflows that handle this better?
#2
On Linux I extract images from PDF files using the pdfimages command. It works very nicely.
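
If you want to drive it from Python, here is a minimal sketch, assuming pdfimages (from poppler-utils) is on your PATH. The helper names build_pdfimages_cmd and extract_images are made up for illustration, and -png is just one of the output-format flags you could pass.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_cmd(pdf_path, out_dir, fmt_flag="-png"):
    """Build the argument list for one pdfimages call (hypothetical helper).

    The output prefix is <out_dir>/<pdf stem>; pdfimages appends a
    counter and a file extension to that prefix.
    """
    pdf_path = Path(pdf_path)
    prefix = Path(out_dir) / pdf_path.stem
    return ["pdfimages", fmt_flag, str(pdf_path), str(prefix)]


def extract_images(pdf_path, out_dir):
    """Run pdfimages and list the files that match the output prefix."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; install poppler-utils")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdfimages exits with an error
    subprocess.run(build_pdfimages_cmd(pdf_path, out_dir), check=True)
    # Note: this also picks up older files that share the same prefix
    return sorted(out_dir.glob(Path(pdf_path).stem + "-*"))
```

Calling extract_images("paper.pdf", "images") would then give you a list of image paths that you can attach to your chunks as metadata.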
« We can solve any problem by introducing an extra level of indirection »
#3
Is the mathematical content stored in the PDF as text or as images?

With PyMuPDF (formerly imported as fitz), text extraction goes like this:

import pymupdf

path2pdf = '/home/peterr/pdfs/pdfs/'
file_name = 'spanish-demonstratives-worksheet.pdf'
names = file_name.split('.')
name = names[0]
savepath = '/home/peterr/temp/images/'

# general treatment of a page
doc = pymupdf.open(path2pdf + file_name)
num_pages = doc.page_count # here 1 page only
# the first page of a pdf is page zero
page = doc.load_page(0)
width, height = page.rect.width, page.rect.height # page is A4 size
d = page.get_text("dict")
for key in d.keys():
    print(key) # 3 keys: width, height, blocks

blocks = d["blocks"]
# text blocks are type 0
textblocks = [b for b in blocks if b["type"] == 0]
len(textblocks) # returns 20 here
# have a look if you want something special
for t in textblocks:
    print(t)

# get all text on the page and save it
for page_number in range(len(doc)): # only 1 page here
    page = doc[page_number]
    # get all text
    text = page.get_text('text')
    num = str(page_number + 1)
    savename = savepath + name + num + '.text'
    with open(savename, 'w', encoding='utf-8') as outfile:
        outfile.write(text)
Images can be extracted like this:

# image blocks are type 1
imgblocks = [b for b in blocks if b["type"] == 1]
len(imgblocks) # returns 3 here

for page_number in range(len(doc)): # only 1 page here
    page = doc[page_number]
    images = page.get_images(full=True) # get all images on the page

    for img_index, img in enumerate(images, start=1):
        ref = img[0]
        base_image = doc.extract_image(ref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        # Save the image locally
        image_name = f"{name}_page_{page_number + 1}_image{img_index}.{image_ext}"
        with open(savepath + image_name, "wb") as image_file:
            image_file.write(image_bytes)
        print(f"image saved as {savepath + image_name}")
I think mathematical formulae can have quite complex formatting, which may cause problems when they are extracted as text, but if the formulae are stored as images, they shouldn't be hard to get!
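
To see which case you are in, and to get something you can chunk, here is a rough sketch that turns one page dict into a flat list of records. It assumes the block layout that page.get_text("dict") returns (blocks with "type" and "bbox", text blocks holding "lines" of "spans" with "text"). The names block_text and page_records are made up for illustration, and the sample dict is hand-written so the code runs without a PDF.

```python
def block_text(block):
    """Join the span texts of one text block, one output line per PDF line."""
    lines = []
    for line in block.get("lines", []):
        lines.append("".join(span["text"] for span in line.get("spans", [])))
    return "\n".join(lines)


def page_records(page_dict, page_number):
    """Turn one page dict into text/image records (hypothetical helper)."""
    records = []
    for block in page_dict["blocks"]:
        record = {"page": page_number, "bbox": tuple(block["bbox"])}
        if block["type"] == 0:  # text block
            text = block_text(block).strip()
            if not text:
                continue  # skip blocks that hold only whitespace
            record.update(kind="text", text=text)
        elif block["type"] == 1:  # image block
            record.update(kind="image", ext=block.get("ext", ""))
        else:
            continue
        records.append(record)
    return records


# Hand-made sample in the assumed shape, not taken from a real PDF
sample = {
    "width": 595,
    "height": 842,
    "blocks": [
        {"type": 0, "bbox": (72, 72, 200, 90),
         "lines": [{"spans": [{"text": "E = "}, {"text": "mc^2"}]}]},
        {"type": 0, "bbox": (72, 100, 200, 110),
         "lines": [{"spans": [{"text": "   "}]}]},
        {"type": 1, "bbox": (72, 120, 300, 250), "ext": "png"},
    ],
}

for rec in page_records(sample, 1):
    print(rec)
```

With a real document you would call page_records(page.get_text("dict"), page_number + 1) inside the page loop. A formula that shows up as a record of kind "image" is stored as a picture, while one that shows up as (possibly scrambled) text is stored as text.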

Post a small example PDF for experimenting on!