Stop wasting PDFs – create a RAG that actually understands them


Author(s): Ravi Kumar Tomar

Originally published on Towards AI.

Convert disorganized PDFs into reliable, auditable answers – a production-ready RAG pipeline with OCR, heading-aware chunking, FAISS, cross-encoder reranking, and strictly grounded LLM prompts

Image Source: Google Gemini

TL;DR – For Skimmers

  • Problem: PDFs are messy – scans, tables, and long paragraphs break retrieval.
  • Solution: Ingest → smart chunk → bi-encoder shortlist → cross-encoder re-rank → grounded LLM prompt.
  • Result: Low hallucinations, auditable answers, production-grade retrieval.
  • Ship in a week: Use the included end-to-end script (OCR fallback + FAISS + cross-encoder + LLM) and run the 20-question evaluation recipe.

You’ve spent months crafting product descriptions, compliance documents, and slide decks – yet search over them is still noisy. The frustration is universal: employees copy-paste from PDFs, support teams waste hours, and trust in the knowledge base quietly erodes. This article provides a concise, battle-tested pipeline that turns jumbled PDFs into precise, auditable answers.

Quick vignette: A support team at a medium-sized SaaS company spent three days debating whether an old contract clause applied to renewals. After shipping the pipeline below, they surfaced the exact paragraph, with a verbatim quote, in 30 seconds. Legal leadership stopped asking, “Do we trust the answer?” and started asking, “How can we make more documents searchable?” Small change. Big gain in trust.

Why does regular RAG fail on real PDFs?

  • Scanned pages – no extractable text without OCR.
  • Bad chunking – arbitrary windows split meaning, tables, and code blocks.
  • Ambiguous retrieval – bi-encoders return approximate relevance, not an exact ranking.
  • Redundancy – near-duplicate chunks inflate the shortlist and confuse re-rankers.

Bold Truth: These problems are rarely fixed by upgrading your LLM. Get the ingestion and retrieval right first.

Disciplined, production-ready pipeline (one line)

Extract reliably → segment by meaning → embed and shortlist (bi-encoder) → re-rank with a cross-encoder → prompt the LLM with grounded quotes.

Step by step:

  • Extract + OCR fallback – try text extraction first; OCR blank pages with pdf2image + pytesseract.
  • Heading- and table-aware chunking – split on semantic boundaries; avoid cutting tables mid-cell.
  • Deduplicate + attach metadata – hash chunks to drop duplicates; keep source, page, title, and timestamp.
  • Embed and store – use a fast bi-encoder (for example, all-MiniLM-L6-v2) and store in FAISS/Chroma/Weaviate.
  • Shortlist (top 30-100) – fast, scalable candidate retrieval.
  • Cross-encoder re-rank – score query-document pairs and select the top 3-10.
  • Grounded LLM prompt – require chunk IDs, include exact quotes, return NOT FOUND when the answer is absent.
  • Evaluate – track Precision@K, hallucination rate, and P95 latency.

Micro-Edit for Skimmers – What to Tune First

  • Add an OCR fallback for scanned PDFs.
  • Tune chunk size and overlap per document type (specs vs slides).
  • Deduplicate chunks using SHA-1 plus an embedding-similarity threshold (see the sketch after this list).
  • Retrieve 30-100 candidates, then re-rank down to the top ~10.
  • Force the LLM to return chunk IDs and exact quotes.
  • Monitor Precision@K and hallucination rates weekly.
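
The main script below deduplicates only by exact SHA-1 hash. A minimal sketch of the extra embedding-similarity pass might look like this; the 0.95 cosine threshold and the dedupe_chunks helper are illustrative assumptions, not part of the script:

# Hypothetical two-stage deduplication: exact SHA-1 hash, then cosine-similarity threshold
import hashlib
import numpy as np
from sentence_transformers import SentenceTransformer

def dedupe_chunks(chunks, model_name="sentence-transformers/all-MiniLM-L6-v2", threshold=0.95):
    # Stage 1: drop exact duplicates via content hash
    seen, unique = set(), []
    for text in chunks:
        h = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(text)
    # Stage 2: drop near-duplicates via normalized-embedding dot product (cosine similarity)
    model = SentenceTransformer(model_name)
    embs = model.encode(unique, normalize_embeddings=True)
    kept, kept_embs = [], []
    for text, emb in zip(unique, embs):
        if kept_embs and float(np.max(np.array(kept_embs) @ emb)) >= threshold:
            continue  # too similar to an already-kept chunk
        kept.append(text)
        kept_embs.append(emb)
    return kept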

Complete working code (end-to-end)

This is a single-file, runnable script that implements the full pipeline: OCR fallback, heading-aware chunking, FAISS vector store, bi-encoder retrieval, cross-encoder re-ranking, and a strict LLM answer stage. Save it as rag_pdf_pipeline.py.

"""
rag_pdf_pipeline.py

End-to-end RAG for unstructured PDFs:
- PDF text extraction with OCR fallback
- Heading-aware chunking with overlap
- Deduplication and metadata preservation
- Embeddings (HuggingFace) + FAISS vector store
- Bi-encoder retrieval -> Cross-encoder re-rank
- LLM answer with strict grounding and citations

Dependencies:
pip install langchain sentence-transformers faiss-cpu openai PyPDF2 pdf2image pytesseract python-magic
# Optional: pip install "unstructured[local]" pymupdf
"""

import os
import logging
import hashlib
import re
from typing import List, Tuple, Dict, Any

from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

from sentence_transformers import CrossEncoder
from PyPDF2 import PdfReader
from pdf2image import convert_from_path
import pytesseract

# ---------------- CONFIG ----------------
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
RETRIEVE_K = 50
RERANK_TOP_N = 10
# --------------------------------------

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

_heading_re = re.compile(
    r'^\s*(?:#{1,6}\s*|[A-Z][\w -]{2,}\n[-=]{2,})',
    re.MULTILINE,
)

def sha1_text(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def load_pdf_text(path: str, ocr_if_needed: bool = True) -> List[Tuple[int, str]]:
    """
    Returns [(page_number, text)].
    Uses PyPDF2 first; falls back to OCR per-page if text is missing.
    """
    pages: List[Tuple[int, str]] = []
    try:
        reader = PdfReader(path)
        for i, page in enumerate(reader.pages):
            text = page.extract_text() or ""
            pages.append((i + 1, text.strip() or None))
    except Exception as e:
        logger.warning("PDF parse failed (%s). Falling back to OCR.", e)
        pages = []

    if ocr_if_needed and (not pages or any(t is None for _, t in pages)):
        images = convert_from_path(path, dpi=200)
        for i, img in enumerate(images):
            if i >= len(pages) or pages[i][1] is None:
                ocr_text = pytesseract.image_to_string(img)
                if i < len(pages):
                    pages[i] = (i + 1, ocr_text)
                else:
                    pages.append((i + 1, ocr_text))
    return pages

def simple_chunk(text: str, page: int) -> List[Document]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(len(text), start + CHUNK_SIZE)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(Document(page_content=chunk, metadata={"page": page}))
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks

def smart_chunk_pages(pages: List[Tuple[int, str]]) -> List[Document]:
    """
    Heading-aware chunking with deduplication.
    """
    all_docs: Dict[str, Document] = {}
    for page_num, text in pages:
        if not text:
            continue
        headings = list(_heading_re.finditer(text))
        spans = (
            [m.start() for m in headings] + [len(text)]
            if len(headings) >= 2
            else [0, len(text)]
        )
        for i in range(len(spans) - 1):
            segment = text[spans[i]:spans[i + 1]].strip()
            for doc in simple_chunk(segment, page_num):
                h = sha1_text(doc.page_content)
                if h not in all_docs:
                    all_docs[h] = doc
    return list(all_docs.values())

def build_vectorstore(docs: List[Document], save_path: str = None) -> FAISS:
    embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
    vectorstore = FAISS.from_documents(docs, embeddings)
    if save_path:
        vectorstore.save_local(save_path)
    return vectorstore

def retrieve_and_rerank(
    vectorstore: FAISS,
    query: str,
    reranker: CrossEncoder,
) -> List[Tuple[Document, float]]:
    candidates = vectorstore.similarity_search(query, k=RETRIEVE_K)
    if not candidates:
        return []
    pairs = [(query, d.page_content) for d in candidates]
    scores = reranker.predict(pairs, batch_size=16)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:RERANK_TOP_N]

def format_context(ranked: List[Tuple[Document, float]]) -> Tuple[str, List[str]]:
    blocks, citations = [], []
    for i, (doc, score) in enumerate(ranked):
        cid = f"doc-{i+1}-p{doc.metadata.get('page')}"
        excerpt = doc.page_content[:400].replace("\n", " ").strip()
        blocks.append(f"--- {cid} ---\n{excerpt}")
        citations.append(f"{cid} (score={score:.4f})")
    return "\n\n".join(blocks), citations

def answer_with_llm(query: str, ranked: List[Tuple[Document, float]]) -> Dict[str, Any]:
    system = (
        "You are a strict summarizer. Use ONLY the provided context blocks. "
        "If the answer is not present, reply NOT FOUND. "
        "Include citation IDs and quote exact text using triple backticks."
    )
    context, citations = format_context(ranked)
    user = f"Question:\n{query}\n\nContext:\n{context}"
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    resp = llm([SystemMessage(content=system), HumanMessage(content=user)])
    return {"answer": resp.content, "citations": citations}

def main(pdf_path: str):
    pages = load_pdf_text(pdf_path)
    docs = smart_chunk_pages(pages)
    logger.info("Indexed %d chunks", len(docs))
    vs = build_vectorstore(docs, save_path="faiss_index")
    reranker = CrossEncoder(RERANKER_MODEL_NAME)

    query = "How does the authentication workflow handle token refresh?"
    ranked = retrieve_and_rerank(vs, query, reranker)
    if not ranked:
        print("No relevant context found.")
        return

    result = answer_with_llm(query, ranked)
    print("\n=== ANSWER ===\n", result["answer"])
    print("\n=== SOURCES ===")
    for c in result["citations"]:
        print(c)

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--pdf", required=True, help="Path to PDF file")
    args = parser.parse_args()
    main(args.pdf)
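
Run it from the command line against any PDF (the file name here is just an example); OPENAI_API_KEY must be set, and the OCR fallback needs poppler (for pdf2image) plus a local tesseract install:

python rag_pdf_pipeline.py --pdf product-spec.pdf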

A prompt pattern that reduces hallucinations

System:
“You are a strict summarizer. Use ONLY the provided reference blocks. If the answer is not present, reply NOT FOUND. Include source IDs and exact quotes when used.”

User:
Provide the question along with the top 3-5 re-ranked excerpts (IDs + text).

Assistant:
Short answer with inline citations; exact quotes in triple backticks when referencing sources.

Why: Binding citation IDs and verbatim quotes makes answers auditable – stakeholders can verify claims in seconds.
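
One way to exploit that auditability is to check programmatically that every quoted span really appears in the retrieved context. A minimal sketch, assuming the triple-backtick quoting convention from the prompt above (the verify_quotes helper itself is illustrative):

# Hypothetical post-check: every ```-quoted span in the answer must appear verbatim in the context
import re

def verify_quotes(answer: str, context: str) -> bool:
    quotes = re.findall(r"```(.*?)```", answer, flags=re.DOTALL)
    if not quotes:
        return False  # a grounded answer with no quotes is suspicious
    return all(q.strip() in context for q in quotes)

# Usage after answer_with_llm from the main script:
# context, _ = format_context(ranked)
# if not verify_quotes(result["answer"], context):
#     logger.warning("Ungrounded quote detected – flag the answer for review.")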

What most guides forget (these matter more than the choice of model)

  • OCR fallback – required for real-world PDFs.
  • Heading- and table-aware chunking – preserves semantic units.
  • Deduplication – cuts noise and cost.
  • Metadata filters – apply date/source/type filters before re-ranking (a sketch follows this list).
  • Evaluation as a habit – track Precision@K, hallucination rate, and P95 latency.
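
A minimal sketch of the pre-rerank filter, assuming chunks carry source and year metadata (the main script only stores page numbers, so these fields and the filtered_candidates helper are illustrative); if your vector store supports native metadata filters, prefer those:

# Hypothetical metadata filtering applied before cross-encoder re-ranking
def filtered_candidates(vectorstore, query, source=None, min_year=None, k=50):
    candidates = vectorstore.similarity_search(query, k=k)
    kept = []
    for doc in candidates:
        meta = doc.metadata
        if source and meta.get("source") != source:
            continue  # wrong document source
        if min_year and meta.get("year", 0) < min_year:
            continue  # too old
        kept.append(doc)
    return kept  # pass this reduced list to the cross-encoder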

Bold insight: When chunking quality is low, re-rankers can only choose the best among the bad. Fix ingestion first – that investment often beats a larger re-ranker model.

Evaluation Method (Run It Now)

  • Choose three document types: policy, specification, slide deck.
  • Create 20 gold questions per type with exact ground-truth snippets.
  • Baseline: bi-encoder top-5 → LLM answer. Record Precision@5 and hallucination flags (a measurement sketch follows this list).
  • Experiment: bi-encoder top-50 → re-rank → top-5 → LLM. Compare metrics.
  • Track p50/p95 latency and cost.

Success hint: Precision@K improves by more than 10% and hallucinations drop noticeably.
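
A minimal sketch of the measurement step: with a single gold snippet per question, Precision@K is usually operationalized as a hit rate (did any of the top-K chunks contain the snippet?). The gold-set layout and retrieve_fn signature are assumptions:

# Hypothetical Precision@K (hit-rate form) over (question, ground-truth snippet) pairs
def precision_at_k(gold_set, retrieve_fn, k=5):
    hits = 0
    for question, snippet in gold_set:
        docs = retrieve_fn(question, k)  # returns LangChain Documents
        if any(snippet in d.page_content for d in docs):
            hits += 1
    return hits / max(len(gold_set), 1)

# Baseline vs experiment, wired to the main script:
# baseline = precision_at_k(gold, lambda q, k: vs.similarity_search(q, k=k))
# reranked = precision_at_k(gold, lambda q, k: [d for d, _ in retrieve_and_rerank(vs, q, reranker)[:k]])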

Deployment checklist

  • Batch re-ranker calls; cache scores by (query, doc_id) (sketched after this list).
  • Quantize the re-ranker for CPU or run it through ONNX.
  • Autoscale GPUs for spikes; keep a CPU fallback warm.
  • Re-embed when you upgrade the embedding model; version and backfill the index.
  • Mask or redact PII before third-party calls.
  • Monitor Precision@K, hallucination rate, p95 latency, and cost per 1K queries.
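
As a sketch of the first item, a tiny in-memory score cache keyed by (query, doc_id) could wrap the re-ranker like this; doc_id is assumed to be the chunk's SHA-1 (which the main script already computes via sha1_text), and a production setup would use Redis or similar instead of a dict:

# Hypothetical (query, doc_id) score cache around CrossEncoder.predict
_score_cache = {}

def cached_rerank(reranker, query, docs):
    missing = [d for d in docs if (query, sha1_text(d.page_content)) not in _score_cache]
    if missing:
        scores = reranker.predict([(query, d.page_content) for d in missing], batch_size=16)
        for d, s in zip(missing, scores):
            _score_cache[(query, sha1_text(d.page_content))] = float(s)
    # Sort all docs by their cached scores, best first
    return sorted(docs, key=lambda d: _score_cache[(query, sha1_text(d.page_content))], reverse=True)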

What really matters

Many teams treat re-rankers as a silver bullet. They are not. If sentences or tables are cut into pieces, the re-ranker will at best surface the least-bad garbage. Get ingestion (OCR, chunk boundaries, metadata) right before investing in a larger re-ranker model.

Quick Models and Tools Cheat-Sheet

  • Extraction: PyPDF2, PyMuPDF, Unstructured.
  • OCR: pdf2image + pytesseract.
  • Embedding: sentence-transformers/all-MiniLM-L6-v2 (fast baseline).
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2.
  • Vector DB: FAISS (local), Chroma, Milvus, Weaviate, Pinecone.
  • LLM: use strictly grounded prompts and require quotes.

The core idea

Hold your pipeline accountable before you make it bigger. Reliable search over disorganized PDFs is not a modeling trick – it’s an engineering discipline: extract cleanly, segment meaningfully, shortlist broadly, then re-rank precisely. Implement the two-stage retrieve + re-rank flow, run the evaluation recipe, and you’ll turn frustrating searches into trustworthy answers.

❤️ If this changes your thinking about RAG architecture, clap 👏 and follow along for more practical, battle-tested AI engineering insights.

Published via Towards AI
