Jan-26-2026, 09:46 AM
Hi everyone,
I'm building a local AI RAG chatbot application in Python that should answer strictly from user‑provided documents. I'm running into an issue when extracting content from PDFs. When I use something like pypdf and then split the text into chunks, mathematical equations and images are extracted poorly or not at all.
Does anyone know a reliable way to extract mathematical equations (preferably in a usable format) and images from PDF files, so that I can chunk them and index everything with FAISS for use in a RAG pipeline?
Any recommended libraries, tools, or workflows that handle this better?
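For context, a minimal sketch of the kind of pipeline described above (not the exact code in use). It assumes `pypdf` is installed; the function names `extract_pages` and `chunk_text` and the chunk size and overlap values are hypothetical placeholders chosen for illustration.

```python
from typing import Iterator


def extract_pages(pdf_path: str) -> Iterator[str]:
    """Yield the plain text of each page (assumes `pip install pypdf`)."""
    # Imported lazily so chunk_text() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for page in reader.pages:
        # extract_text() returns only the page's text; this is the step
        # where equations come out poorly and images do not appear at all.
        yield page.extract_text() or ""


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    `size` and `overlap` are illustrative defaults, not tuned values.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


if __name__ == "__main__":
    # Small in-memory demo of the chunker; no PDF file is needed for this.
    for chunk in chunk_text("abcdefghij", size=4, overlap=1):
        print(repr(chunk))
```

With a real file, the chunks would come from something like `chunk_text("\n".join(extract_pages("doc.pdf")))`, where `doc.pdf` is a placeholder path, and those chunks are what would then be embedded and added to the FAISS index.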
