Python Forum
Help with local RAG pipeline – poor retrieval quality, wrong page numbers
#1
Hi everyone,

I'm building a fully local RAG application in Python (no cloud APIs) and running into several persistent issues. I'll pin the full source below. Would really appreciate any advice from people who've dealt with similar setups.

---

### Stack overview

- **LLM:** Qwen2.5:7b via Ollama

- **Embeddings:** intfloat/multilingual-e5-base (HuggingFace, offline)

- **Vector store:** FAISS (child chunks) + BM25 (via LangChain)

- **Reranker:** cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

- **Chunking:** Parent-child strategy – MarkdownHeaderTextSplitter for parents, RecursiveCharacterTextSplitter for children

- **PDF extraction:** pymupdf4llm (fast) or MinerU (slow, for LaTeX-heavy docs)

- **Pipeline:** LangGraph with nodes: pre-retrieval → hybrid retrieve → rerank → build context → evaluate evidence → generate

- **UI:** Streamlit

Documents are primarily English-language academic PDFs (e.g. Montgomery's Design and Analysis of Experiments, 720 pages). User queries are always in Slovak.

---

### Problem 1 – Cross-lingual retrieval failure (SK query → EN document)

This is the most painful issue. When a user asks *"čo to je replikácia?"* ("what is replication?"), the FAISS similarity search returns completely irrelevant chunks (confidence ~0.045) even though the word "replication" appears many times in the document.

My current workaround:

1. Detect the document language via langdetect.

2. If an EN document is detected, translate the SK query to EN using the LLM before retrieval.

3. Use the translated query in both FAISS and BM25.

This partially works but is inconsistent – sometimes the LLM translates to "What is replication?", sometimes it doesn't, so results are non-deterministic even at temperature=0.
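One way to at least make the translation step repeatable is to cache it per normalised query, so the same question always goes to retrieval with the same English text. This is only a sketch: `CachedTranslator`, `normalize_query` and `translate_fn` are hypothetical names, and `translate_fn` stands in for whatever function currently calls the LLM.

```python
import unicodedata
from typing import Callable, Dict


def normalize_query(query: str) -> str:
    """Lowercase, trim and collapse whitespace so trivially different inputs share a key."""
    text = unicodedata.normalize("NFC", query)
    return " ".join(text.lower().split())


class CachedTranslator:
    """Wraps a translation callable and memoises its output per normalised query."""

    def __init__(self, translate_fn: Callable[[str], str]) -> None:
        # Hypothetical: translate_fn is e.g. the function that asks the local LLM to translate.
        self._translate_fn = translate_fn
        self._cache: Dict[str, str] = {}

    def translate(self, query: str) -> str:
        key = normalize_query(query)
        if key not in self._cache:
            # Only the first call for a given key reaches the LLM.
            self._cache[key] = self._translate_fn(key)
        return self._cache[key]
```

The cache does not make the first translation any better, it only stops the same question from producing different retrieval results on different runs. Persisting `_cache` to disk would extend that across restarts.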

I also added a rescue BM25 search in evaluate_evidence as a last resort, which helps but retrieves chunks from wrong pages (e.g. page 424 instead of page 13 where the definition actually is).

**Questions:**

- Is multilingual-e5-base simply too weak for SK↔EN cross-lingual retrieval? Should I switch to a different model (e.g. intfloat/multilingual-e5-large, BAAI/bge-m3, or a dedicated cross-lingual model)?

- Is there a better approach than LLM-based query translation? I considered expanding the index with translated chunks but haven't implemented it yet.

- Any experience with mmarco-mMiniLMv2 reranker for non-English content? I suspect it's poorly calibrated for Slovak and the confidence scores are systematically too low (~0.04 instead of expected ~0.3+).
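Before swapping models, one thing that may be worth ruling out: the E5 model cards say every input should start with `query: ` or `passage: `, including non-English text. If the E5Embeddings wrapper in vector_store.py already does this, ignore the sketch below. The `embed` callable is an assumption standing in for that wrapper, and the helper names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: takes a list of strings, returns one vector per string.
Embedder = Callable[[List[str]], List[Sequence[float]]]


def as_query(text: str) -> str:
    """Prefix used for the search query side."""
    return f"query: {text.strip()}"


def as_passage(text: str) -> str:
    """Prefix used for the indexed chunk side."""
    return f"passage: {text.strip()}"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_passages(query: str, passages: List[str], embed: Embedder) -> List[Tuple[float, str]]:
    """Embed query and passages with their prefixes and sort passages by cosine similarity."""
    query_vec = embed([as_query(query)])[0]
    passage_vecs = embed([as_passage(p) for p in passages])
    scored = [(cosine(query_vec, vec), p) for vec, p in zip(passage_vecs, passages)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Running `rank_passages` on a handful of known SK query / EN chunk pairs, once with and once without the prefixes, would show whether the prefixes matter for this index. Note that the chunks have to be re-embedded with `passage: ` for the comparison to be fair.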

---

### Problem 2 – Wrong page numbers in cited sources

My chunker injects <!--PAGE:N--> markers into the markdown before chunking, then detects which page each chunk belongs to by matching text probes against page texts. The logic works reasonably for single-page chunks but breaks in two cases:

**Large parents spanning multiple pages** – when _split_large splits them, all resulting chunks inherit the original parent's page metadata instead of getting re-detected page numbers.

**Dense mathematical/formula-heavy pages** – probes (min 15 chars) often don't match because MinerU reformats LaTeX and the text doesn't align with the original page content.

The cited pages are sometimes off by 5–15 pages, which makes source verification impossible.
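A possible alternative to probe matching, sketched under the assumption that the `<!--PAGE:N-->` markers themselves are placed correctly: strip the markers once, remember at which character offset each page starts, and derive a chunk's pages from its offsets in the cleaned text. Chunks produced by _split_large would then get their own pages instead of inheriting the parent's. The function names below are hypothetical.

```python
import re
from bisect import bisect_right
from typing import List, Tuple

PAGE_MARKER = re.compile(r"<!--PAGE:(\d+)-->")


def strip_markers(markdown: str) -> Tuple[str, List[int], List[int]]:
    """Remove page markers and return (clean_text, page_starts, page_numbers).

    page_starts[i] is the offset in clean_text where page_numbers[i] begins.
    """
    parts: List[str] = []
    starts: List[int] = []
    pages: List[int] = []
    pos = 0        # read position in the original markdown
    clean_len = 0  # length of the cleaned text so far
    for match in PAGE_MARKER.finditer(markdown):
        segment = markdown[pos:match.start()]
        parts.append(segment)
        clean_len += len(segment)
        starts.append(clean_len)
        pages.append(int(match.group(1)))
        pos = match.end()
    parts.append(markdown[pos:])
    return "".join(parts), starts, pages


def pages_for_span(start: int, end: int, starts: List[int], pages: List[int]) -> List[int]:
    """Pages overlapped by the clean-text span [start, end).

    Text before the first marker is attributed to the first page.
    """
    if not starts or end <= start:
        return []
    first = max(bisect_right(starts, start) - 1, 0)
    last = max(bisect_right(starts, end - 1) - 1, 0)
    return pages[first:last + 1]
```

This only works if each chunk's start offset in the cleaned text is known, for example from the splitter or from a search for the chunk text starting at the previous chunk's offset. A chunk that straddles a boundary then reports both pages rather than one guessed page.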

**Questions:**

- Is there a more reliable strategy for page attribution in RAG chunking?

- Would embedding page number tokens directly into chunk text help BM25/FAISS associate chunks with correct pages?

---

### Problem 3 – Poor Slovak output quality

The LLM (Qwen2.5:7b) receives English context and is instructed via system prompt to answer in Slovak. The output Slovak is grammatically broken – literal word-by-word translations, wrong declensions, invented compound words (e.g. "olejová hniloba" for "oil quench", "oholenie vzorku" for "quenching a specimen").

Current system prompt instructs:

- Always answer in Slovak

- Don't translate literally, explain in your own words

- Keep English technical terms in parentheses if unsure

This helps somewhat but the quality is still poor for technical content.

**Questions:**

- Is Qwen2.5:7b simply not good enough for EN→SK technical translation in context? Would a larger model (Qwen2.5:14b, gemma3:12b) make a significant difference?

- Has anyone tried a two-step approach: generate answer in English first, then translate to Slovak as a second LLM call?

- Any prompt engineering tricks that worked for you for multilingual RAG output?
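For reference, the two-step idea from the second question could look roughly like this. It is a sketch, not something I have validated: `llm` is a hypothetical callable that takes a prompt and returns text, and the glossary is a hand-maintained dict of EN→SK terms (only "replication" → "replikácia" is taken from this thread).

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical: prompt in, model text out


def build_translation_prompt(english_answer: str, glossary: Dict[str, str]) -> str:
    """Build the second-step prompt, pinning only glossary terms that occur in the answer."""
    lowered = english_answer.lower()
    pinned = [f"- {en} = {sk}" for en, sk in sorted(glossary.items()) if en.lower() in lowered]
    terms = "\n".join(pinned) if pinned else "- (none)"
    return (
        "Translate the following answer into natural, grammatical Slovak.\n"
        "Use these term translations exactly; keep any other technical term "
        "in English in parentheses if unsure.\n"
        f"{terms}\n\nAnswer:\n{english_answer}"
    )


def answer_in_slovak(question: str, context: str, llm: LLM, glossary: Dict[str, str]) -> str:
    """Step 1: answer in English from the English context. Step 2: translate with the glossary."""
    english = llm(
        "Answer the question in English using only the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(build_translation_prompt(english, glossary))
```

The point of the split is that the first call never has to switch languages while reading the context, and the second call sees fixed translations for the terms that tend to get invented. It costs a second LLM call per answer.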

---

### Problem 4 – Reranker confidence threshold causes false abstentions

The cross-encoder produces confidence scores around 0.04–0.07 for relevant Slovak/English pairs. My threshold is set to 0.15 (already lowered from the original 0.32). When confidence falls below the threshold, the system returns "not found in documents" even when the correct answer is there.

I added a keyword override (check if query words appear in context docs) but it's unreliable for cross-lingual queries because Slovak words don't match English document text.
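Instead of lowering the threshold by hand, one option would be to derive it from a small labelled set of reranker scores. A minimal sketch, assuming I collect scores for query/chunk pairs I know are relevant and for pairs I know are not; `pick_threshold` is a hypothetical helper and the numbers in the usage note are made up apart from the 0.04–0.07 range mentioned above.

```python
from typing import List, Tuple


def pick_threshold(relevant: List[float], irrelevant: List[float]) -> Tuple[float, float]:
    """Return (threshold, balanced_accuracy) for the rule "accept if score >= threshold".

    Every observed score is tried as a candidate; ties keep the lowest threshold.
    """
    if not relevant or not irrelevant:
        raise ValueError("need at least one score in each group")
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(relevant) | set(irrelevant)):
        true_pos_rate = sum(s >= t for s in relevant) / len(relevant)
        true_neg_rate = sum(s < t for s in irrelevant) / len(irrelevant)
        balanced = (true_pos_rate + true_neg_rate) / 2
        if balanced > best_acc:
            best_t, best_acc = t, balanced
    return best_t, best_acc
```

For example, `pick_threshold([0.04, 0.045, 0.05, 0.07], [0.005, 0.01, 0.02, 0.03])` picks 0.04 on this toy data. If the relevant and irrelevant scores overlap heavily on real data, that would suggest the reranker itself is the problem and no threshold will fix it.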

### Code

*(pinning below)*

- document_processor.py – PDF extraction + parent-child chunking: https://pastebin.com/m8egQ7HY

- vector_store.py – FAISS + BM25 + E5Embeddings wrapper: https://pastebin.com/4kkhsg8M

- rag_graph.py – full LangGraph pipeline: https://pastebin.com/P31pGiie

- parent_store.py: https://pastebin.com/xwNeAMnE
#2
Lotta stuff there: Try 1 problem at a time!

Can you first give an actual example of Problem 1?

Quote:This is the most painful issue. When a user asks *"čo to je replikácia?"* ("what is replication?"), the FAISS similarity search returns completely irrelevant chunks (confidence ~0.045) even though the word "replication" appears many times in the document.

Provide a document and examples of what you want returned from said document.

I think you want to search the text of 1 or more PDFs to find key words. Is that correct?

Quote:4. Retrieval – Find the most relevant chunks for a query.

5. Generation – Pass retrieved chunks to an LLM to produce grounded answers.
#3
(Apr-13-2026, 01:25 AM)Pedroski55 Wrote: Lotta stuff there: Try 1 problem at a time!

Thanks for the reply — I’ve already made some progress since my previous message.

Retrieval and generation are now mostly working well. The answers are generally relevant and grounded in the document, so the core RAG pipeline is in a decent state.

Right now I’m dealing with two main issues:
1. Critical issue – incorrect page attribution (chunking problem)

During chunking, I sometimes end up with chunks that contain text from multiple pages, even though those pages belong to different contexts/sections. As a result, the model often produces a correct answer, but the referenced page number is wrong.

This happens mainly when processing PDFs through MinerU.

My current pipeline works roughly like this:

MinerU outputs:
- a .md file (clean text, but no page info)
- a content_list.json (contains page_idx per block)

Since the markdown itself doesn’t contain page metadata, I try to reconstruct page boundaries manually by matching text from the markdown to entries in content_list.json, then injecting page markers (<!--PAGE:X-->) into the markdown, and finally chunking based on that.

So essentially:

MinerU markdown (no pages)
+ content_list.json (page_idx per block)
→ text matching
→ inject page markers
→ chunking

So even when retrieval finds the correct chunk semantically, the page metadata is corrupted.

I’m not sure if the root cause is:

- MinerU’s markdown structure
- LaTeX / tables breaking text alignment
- or just the fact that I’m reconstructing pages from text instead of using a strict structural mapping

If you have a better approach for reliably preserving page boundaries from MinerU output, I’d really appreciate it.
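One structural option I’m considering is to skip the text matching entirely and chunk straight from content_list.json, so each chunk carries the page_idx of the blocks it was built from. A rough sketch: it assumes each block is a dict with a `text` field next to `page_idx`, and that page_idx is zero-based (hence `page_offset=1`); both are assumptions to check against the actual file, and `chunk_blocks` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, List


def chunk_blocks(
    blocks: Iterable[Dict[str, Any]],
    max_chars: int = 1200,
    page_offset: int = 1,  # assumption: page_idx is zero-based, citations are one-based
) -> List[Dict[str, Any]]:
    """Greedily pack consecutive blocks into chunks and record the pages each chunk covers."""
    chunks: List[Dict[str, Any]] = []
    texts: List[str] = []
    pages: List[int] = []
    size = 0

    def flush() -> None:
        if texts:
            chunks.append({"text": "\n\n".join(texts), "pages": sorted(set(pages))})

    for block in blocks:
        text = (block.get("text") or "").strip()  # assumption: text lives under "text"
        if not text:
            continue  # skip blocks without text, e.g. images
        if texts and size + len(text) > max_chars:
            flush()
            texts, pages, size = [], [], 0
        texts.append(text)
        pages.append(int(block["page_idx"]) + page_offset)
        size += len(text)
    flush()
    return chunks
```

The blocks would come from `json.load` on the content_list.json file. Heading-aware parent/child splitting would need to be layered on top, but the page numbers would no longer depend on the markdown lining up with the original page text.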

2. Secondary issue – Slovak language quality

This one is less critical but still noticeable.

- When the source document is Slovak → responses are very good
- When the source document is English → responses are generated in Slovak, but:
  - grammar is sometimes off
  - declension is incorrect
  - some phrases feel like literal translations that don’t fit the context

So the issue is not understanding, but naturalness of Slovak output when translating from English context.
You can check my document_processor.py, which I posted above; it contains the chunking strategy and the page marking.
#4
For getting text from PDFs I find pymupdf, aka fitz, very good. Can't help with translating to Slovak though! If you wanted Chinese, maybe ...

Maybe this link will help you?

Little example of getting text from a PDF using pymupdf

import pymupdf

pdf = '/home/peterr/pdfs_extracted_pages/chinese_to_english_2025_page_5_to_page_10.pdf'
savepath = '/home/peterr/pdfs_extracted_pages/chinese_to_english_2025_page_5_to_page_10.txt'

doc = pymupdf.open(pdf)  # here 6 pages

# pymupdf page indices start at 0
with open(savepath, 'w', encoding='utf8') as outfile:
    # iterate over pdf pages
    for page_index, page in enumerate(doc):
        text = page.get_text()
        # now send text to a function to get the things you want
        # text = search_for_wanted(text)
        page_num = f'This is page {page_index}\n'
        # so you know what page you are viewing in the text file
        outfile.write(page_num)
        # write text from page
        outfile.write(text)
        # so you know what page you're on, again
        outfile.write(page_num)
        # write page delimiter (form feed 0x0C)
        outfile.write('\f')

doc.close()