How to properly extract mathematical equations and images from PDF for a Python RAG c

IchNar · Jan-26-2026, 09:46 AM

Hi everyone,
I'm building a local AI RAG chatbot application in Python that should answer strictly from user‑provided documents. I'm running into an issue when extracting content from PDFs. When I use something like pypdf and then split the text into chunks, mathematical equations and images are extracted poorly or not at all.

Does anyone know a reliable way to extract mathematical equations (preferably in a usable format) and images from PDF files, so that I can chunk them and index everything with FAISS for use in a RAG pipeline?
Any recommended libraries, tools, or workflows that handle this better?

Gribouillis · Jan-26-2026, 12:39 PM

On Linux I extract images from PDF files using the pdfimages command. It works very nicely.
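If you want to drive pdfimages from the Python app itself, a minimal sketch could look like the following. It assumes pdfimages (part of poppler-utils) is on the PATH; the helper names `build_pdfimages_cmd` and `extract_images` are made up for illustration, and `-png` is just one of the output flags the tool accepts.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_cmd(pdf_path, out_prefix, fmt_flag="-png"):
    """Return the argument list for a pdfimages call (no shell involved)."""
    return ["pdfimages", fmt_flag, str(pdf_path), str(out_prefix)]


def extract_images(pdf_path, out_dir):
    """Run pdfimages and return the image files it wrote, sorted by name."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; it ships with poppler-utils")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # pdfimages names its output <prefix>-NNN.<ext>
    prefix = out_dir / Path(pdf_path).stem
    subprocess.run(build_pdfimages_cmd(pdf_path, prefix), check=True)
    return sorted(out_dir.glob(Path(pdf_path).stem + "-*"))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.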

Pedroski55 · (This post was last modified: Jan-27-2026, 11:53 PM by Pedroski55.)

Is the mathematical stuff saved as text or image?

Using PyMuPDF (which used to be imported as fitz), text extraction goes like this:

import pymupdf

path2pdf = '/home/peterr/pdfs/pdfs/'
file_name = 'spanish-demonstratives-worksheet.pdf'
names = file_name.split('.')
name = names[0]
savepath = '/home/peterr/temp/images/'

# general treatment of a page
doc = pymupdf.open(path2pdf + file_name)
num_pages = doc.page_count # here 1 page only
# the first page of a pdf is page zero
page = doc.load_page(0)
width, height = page.rect.width, page.rect.height # page is A4 size
d = page.get_text("dict")
for key in d.keys():
    print(key) # 3 keys: width, height, blocks

blocks = d["blocks"]
# text blocks are type 0
textblocks = [b for b in blocks if b["type"] == 0]
print(len(textblocks)) # prints 20 here
# have a look if you want something special
for t in textblocks:
    print(t)

# get all text on the page and save it
for page_number in range(len(doc)): # only 1 page here
    page = doc[page_number]
    # get all text
    text = page.get_text('text')
    num = str(page_number + 1)
    savename = savepath + name + num + '.text'
    with open(savename, 'w', encoding='utf-8') as outfile:
        outfile.write(text)
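For the RAG side, the per-page text could be split into overlapping chunks that remember which page they came from, so an answer can point back to its source. This is only a sketch under assumptions: `chunk_pages` and the default sizes are made up for illustration, and splitting on a fixed character count is the simplest option, not necessarily the best one for documents with formulae.

```python
def chunk_pages(pages, max_chars=800, overlap=100):
    """Split each page's text into overlapping chunks tagged with the page number.

    pages: iterable of (page_number, text) pairs.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for page_number, text in pages:
        text = text.strip()
        for start in range(0, len(text), step):
            piece = text[start:start + max_chars]
            chunks.append({"page": page_number, "start": start, "text": piece})
            # stop once this chunk reaches the end of the page text
            if start + max_chars >= len(text):
                break
    return chunks
```

With the doc object from the code above, the input could be built as `[(i + 1, doc[i].get_text('text')) for i in range(len(doc))]`.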

Images like this:

# image blocks are type 1
imgblocks = [b for b in blocks if b["type"] == 1]
print(len(imgblocks)) # prints 3 here

for page_number in range(len(doc)): # only 1 page here
    page = doc[page_number]
    images = page.get_images(full=True) # get all images on the page

    for img_index, img in enumerate(images, start=1):
        ref = img[0]
        base_image = doc.extract_image(ref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        # Save the image locally
        image_name = f"{name}_page_{page_number + 1}_image{img_index}.{image_ext}"
        with open(savepath + image_name, "wb") as image_file:
            image_file.write(image_bytes)
        print(f"image saved as {savepath + image_name}")

I think the formatting of mathematical formulae can be quite complex, so they may come out badly when extracted as text. But if the formulae are stored as images, they shouldn't be hard to get!
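To get a rough idea of whether a given PDF stores its content as text or as images, you could count the block types per page. This is a sketch, not a definitive test: `count_block_types` and `survey_pdf` are made-up helper names, and I assume formulae drawn as vector graphics would show up as neither block type, so a low count does not prove anything on its own.

```python
def count_block_types(page_dict):
    """Count text (type 0) and image (type 1) blocks in a get_text("dict") result."""
    counts = {"text": 0, "image": 0}
    for block in page_dict.get("blocks", []):
        if block.get("type") == 0:
            counts["text"] += 1
        elif block.get("type") == 1:
            counts["image"] += 1
    return counts


def survey_pdf(path):
    """Return one counts dict per page; needs PyMuPDF installed."""
    import pymupdf  # imported here so count_block_types works without it

    with pymupdf.open(path) as doc:
        return [count_block_types(page.get_text("dict")) for page in doc]
```

Pages where the formulae are visible but the image count is zero are the ones to look at more closely in the extracted text.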

Post a small example PDF for experimenting on!
