We begin this tutorial by configuring a high-performance evaluation environment, focused on integrating the DeepEval framework to bring unit-testing rigor to our LLM applications. Bridging the gap between raw retrieval and final generation, we implement a system that treats model outputs like testable code and uses LLM-as-a-judge metrics to measure performance. We move beyond manual inspection by building a structured pipeline in which each question, retrieved context, and generated response is validated against rigorous, academic-standard metrics.
import sys, os, textwrap, json, math, re
from getpass import getpass
print("🔧 Hardening environment (prevents common Colab/py3.12 numpy corruption)...")
!pip -q uninstall -y numpy || true
!pip -q install --no-cache-dir --force-reinstall "numpy==1.26.4"
!pip -q install -U deepeval openai scikit-learn pandas tqdm
print("✅ Packages installed.")
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    GEval,
)
print("✅ Imports loaded successfully.")
OPENAI_API_KEY = getpass("🔑 Enter OPENAI_API_KEY (leave empty to run without OpenAI): ").strip()
openai_enabled = bool(OPENAI_API_KEY)
if openai_enabled:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
print(f"🔌 OpenAI enabled: {openai_enabled}")
We initialize our environment by pinning key dependencies and installing the DeepEval framework to ensure a robust testing pipeline. We then import specialized metrics such as Faithfulness and Contextual Recall and configure our API credentials to enable automated, high-fidelity evaluation of our LLM responses.
DOCS = [
{
"id": "doc_01",
"title": "DeepEval Overview",
"text": (
"DeepEval is an open-source LLM evaluation framework for unit testing LLM apps. "
"It supports LLM-as-a-judge metrics, custom metrics like G-Eval, and RAG metrics "
"such as contextual precision and faithfulness."
),
},
{
"id": "doc_02",
"title": "RAG Evaluation: Why Faithfulness Matters",
"text": (
"Faithfulness checks whether the answer is supported by retrieved context. "
"In RAG, hallucinations occur when the model states claims not grounded in context."
),
},
{
"id": "doc_03",
"title": "Contextual Precision",
"text": (
"Contextual precision evaluates how well retrieved chunks are ranked by relevance "
"to a query. High precision means relevant chunks appear earlier in the ranked list."
),
},
{
"id": "doc_04",
"title": "Contextual Recall",
"text": (
"Contextual recall measures whether the retriever returns enough relevant context "
"to answer the query. Low recall means key information was missed in retrieval."
),
},
{
"id": "doc_05",
"title": "Answer Relevancy",
"text": (
"Answer relevancy measures whether the generated answer addresses the user's query. "
"Even grounded answers can be irrelevant if they don't respond to the question."
),
},
{
"id": "doc_06",
"title": "G-Eval (GEval) Custom Rubrics",
"text": (
"G-Eval lets you define evaluation criteria in natural language. "
"It uses an LLM judge to score outputs against your rubric (e.g., correctness, tone, policy)."
),
},
{
"id": "doc_07",
"title": "What a DeepEval Test Case Contains",
"text": (
"A test case typically includes input (query), actual_output (model answer), "
"expected_output (gold answer), and retrieval_context (ranked retrieved passages) for RAG."
),
},
{
"id": "doc_08",
"title": "Common Pitfall: Missing expected_output",
"text": (
"Some RAG metrics require expected_output in addition to input and retrieval_context. "
"If expected_output is None, evaluation fails for metrics like contextual precision/recall."
),
},
]
EVAL_QUERIES = [
{
"query": "What is DeepEval used for?",
"expected": "DeepEval is used to evaluate and unit test LLM applications using metrics like LLM-as-a-judge, G-Eval, and RAG metrics.",
},
{
"query": "What does faithfulness measure in a RAG system?",
"expected": "Faithfulness measures whether the generated answer is supported by the retrieved context and avoids hallucinations not grounded in that context.",
},
{
"query": "What does contextual precision mean?",
"expected": "Contextual precision evaluates whether relevant retrieved chunks are ranked higher than irrelevant ones for a given query.",
},
{
"query": "What does contextual recall mean in retrieval?",
"expected": "Contextual recall measures whether the retriever returns enough relevant context to answer the query, capturing key missing information issues.",
},
{
"query": "Why might an answer be relevant but still low quality in RAG?",
"expected": "An answer can address the question (relevant) but still be low quality if it is not grounded in retrieved context or misses important details.",
},
]
We define a structured knowledge base of documentation snippets that serves as the ground-truth reference for our RAG system. We also establish a set of evaluation questions with associated expected outputs to create a "gold dataset," which lets us assess how accurately our model retrieves information and generates grounded responses.
class TfidfRetriever:
    def __init__(self, docs):
        self.docs = docs
        self.texts = [f"{d['title']}\n{d['text']}" for d in docs]
        self.vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
        self.matrix = self.vectorizer.fit_transform(self.texts)

    def retrieve(self, query, k=4):
        qv = self.vectorizer.transform([query])
        sims = cosine_similarity(qv, self.matrix).flatten()
        top_idx = np.argsort(-sims)[:k]
        results = []
        for i in top_idx:
            results.append(
                {
                    "id": self.docs[i]["id"],
                    "score": float(sims[i]),
                    "text": self.texts[i],
                }
            )
        return results

retriever = TfidfRetriever(DOCS)
We implement a custom TF-IDF retriever class that transforms our documents into a searchable vector space using bigram-aware TF-IDF vectorization. This allows us to perform cosine-similarity searches against the knowledge base, ensuring that we can programmatically obtain the top-k most relevant text fragments for any given query.
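As a quick, optional sanity check (a snippet we add here for illustration, reusing the retriever instance defined above and a query borrowed from our gold set), we can confirm that the most relevant documents surface first before wiring the retriever into generation:
# Optional sanity check: inspect the top-ranked documents for a sample query.
sample_hits = retriever.retrieve("What does faithfulness measure in a RAG system?", k=2)
for hit in sample_hits:
    print(f"{hit['id']}  score={hit['score']:.3f}")
    print(hit["text"][:120], "...")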
def extractive_baseline_answer(query, retrieved_contexts):
    """
    Offline fallback: we create a short answer by extracting the most relevant sentences.
    This keeps the notebook runnable even without OpenAI.
    """
    joined = "\n".join(retrieved_contexts)
    sents = re.split(r"(?<=[.!?])\s+", joined)
    keywords = [w.lower() for w in re.findall(r"[a-zA-Z]{4,}", query)]
    scored = []
    for s in sents:
        s_l = s.lower()
        score = sum(1 for k in keywords if k in s_l)
        if len(s.strip()) > 20:
            scored.append((score, s.strip()))
    scored.sort(key=lambda x: (-x[0], -len(x[1])))
    best = [s for sc, s in scored[:3] if sc > 0]
    if not best:
        best = [s.strip() for s in sents[:2] if len(s.strip()) > 20]
    ans = " ".join(best).strip()
    if not ans:
        ans = "I could not find enough context to answer confidently."
    return ans
def openai_answer(query, retrieved_contexts, model="gpt-4.1-mini"):
    """
    Simple RAG prompt for demonstration. DeepEval metrics can still evaluate even if
    your generation prompt differs; the key is we store retrieval_context separately.
    """
    from openai import OpenAI
    client = OpenAI()
    context_block = "\n\n".join(f"[CTX {i+1}]\n{c}" for i, c in enumerate(retrieved_contexts))
    prompt = f"""You are a concise technical assistant.
Use ONLY the provided context to answer the query. If the answer is not in context, say you don't know.

Query:
{query}

Context:
{context_block}

Answer:"""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()
def rag_answer(query, retrieved_contexts):
    if openai_enabled:
        try:
            return openai_answer(query, retrieved_contexts)
        except Exception as e:
            print(f"⚠️ OpenAI generation failed, falling back to extractive baseline. Error: {e}")
            return extractive_baseline_answer(query, retrieved_contexts)
    else:
        return extractive_baseline_answer(query, retrieved_contexts)
We implement a hybrid answering mechanism that prioritizes high-fidelity generation via OpenAI while maintaining a keyword-based extractive baseline as a reliable fallback. By separating the retrieval context from the final generation, we ensure that our DeepEval test cases remain consistent whether the answer is synthesized by the LLM or extracted programmatically.
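Before looping over the full gold set, a quick spot check (an illustrative snippet that simply reuses the retriever and rag_answer helpers defined above) confirms the retrieve-then-generate plumbing end to end:
# Spot-check the full retrieve-then-generate path on a single query.
demo_query = "What is DeepEval used for?"
demo_contexts = [r["text"] for r in retriever.retrieve(demo_query, k=3)]
print(rag_answer(demo_query, demo_contexts))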
print("n🚀 Running RAG to create test cases...")
test_cases = ()
K = 4
for item in tqdm(EVAL_QUERIES):
q = item("query")
expected = item("expected")
retrieved = retriever.retrieve(q, k=K)
retrieval_context = (r("text") for r in retrieved)
actual = rag_answer(q, retrieval_context)
tc = LLMTestCase(
input=q,
actual_output=actual,
expected_output=expected,
retrieval_context=retrieval_context,
)
test_cases.append(tc)
print(f"✅ Built {len(test_cases)} LLMTestCase objects.")
print("n✅ Metrics configured.")
metrics = (
AnswerRelevancyMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
FaithfulnessMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
ContextualRelevancyMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
ContextualPrecisionMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
ContextualRecallMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
GEval(
name="RAG Correctness Rubric (GEval)",
criteria=(
"Score the answer for correctness and usefulness. "
"The answer must directly address the query, must not invent facts not supported by context, "
"and should be concise but complete."
),
evaluation_params=(
LLMTestCaseParams.INPUT,
LLMTestCaseParams.ACTUAL_OUTPUT,
LLMTestCaseParams.EXPECTED_OUTPUT,
LLMTestCaseParams.RETRIEVAL_CONTEXT,
),
model="gpt-4.1",
threshold=0.5,
async_mode=True,
),
)
if not openai_enabled:
    print("\n⚠️ You did NOT provide an OpenAI API key.")
    print("DeepEval's LLM-as-a-judge metrics (AnswerRelevancy/Faithfulness/Contextual* and GEval) require an LLM judge.")
    print("Re-run this cell and provide OPENAI_API_KEY to run DeepEval metrics.")
    print("\n✅ However, your RAG pipeline + test case construction succeeded end-to-end.")
    rows = []
    for i, tc in enumerate(test_cases):
        rows.append({
            "id": i,
            "query": tc.input,
            "actual_output": tc.actual_output[:220] + ("..." if len(tc.actual_output) > 220 else ""),
            "expected_output": tc.expected_output[:220] + ("..." if len(tc.expected_output) > 220 else ""),
            "contexts": len(tc.retrieval_context or []),
        })
    display(pd.DataFrame(rows))
    raise SystemExit("Stopped before evaluation (no OpenAI key).")
We execute the RAG pipeline to generate LLMTestCase objects by combining our retrieved context with model-generated answers and ground-truth expectations. We then configure a comprehensive suite of DeepEval metrics, including G-Eval and specialized RAG indicators, to evaluate system performance using the LLM-as-a-judge approach.
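Before launching the full batch run, it can help to score a single metric against a single case. The snippet below is an optional sketch that assumes DeepEval's standard metric.measure() interface and an active OpenAI key:
# Optional spot check (sketch): score one test case with one metric
# before running the full evaluate(...) sweep. Requires an LLM judge (OpenAI key).
if openai_enabled:
    spot_metric = FaithfulnessMetric(threshold=0.5, model="gpt-4.1", include_reason=True)
    spot_metric.measure(test_cases[0])
    print("Faithfulness score:", spot_metric.score)
    print("Reason:", spot_metric.reason)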
print("n🧪 Running DeepEval evaluate(...) ...")
results = evaluate(test_cases=test_cases, metrics=metrics)
summary_rows = ()
for idx, tc in enumerate(test_cases):
row = {
"case_id": idx,
"query": tc.input,
"actual_output": tc.actual_output(:200) + ("..." if len(tc.actual_output) > 200 else ""),
}
for m in metrics:
row(m.__class__.__name__ if hasattr(m, "__class__") else str(m)) = None
summary_rows.append(row)
def try_extract_case_metrics(results_obj):
    extracted = []
    candidates = []
    for attr in ("test_results", "results", "evaluations"):
        if hasattr(results_obj, attr):
            candidates = getattr(results_obj, attr)
            break
    if not candidates and isinstance(results_obj, list):
        candidates = results_obj
    for case_i, case_result in enumerate(candidates or []):
        item = {"case_id": case_i}
        metrics_list = None
        for attr in ("metrics_data", "metrics", "metric_results"):
            if hasattr(case_result, attr):
                metrics_list = getattr(case_result, attr)
                break
        if isinstance(metrics_list, dict):
            for k, v in metrics_list.items():
                item[f"{k}_score"] = getattr(v, "score", None) if v is not None else None
                item[f"{k}_reason"] = getattr(v, "reason", None) if v is not None else None
        else:
            for mr in metrics_list or []:
                name = getattr(mr, "name", None) or getattr(getattr(mr, "metric", None), "name", None)
                if not name:
                    name = mr.__class__.__name__
                item[f"{name}_score"] = getattr(mr, "score", None)
                item[f"{name}_reason"] = getattr(mr, "reason", None)
        extracted.append(item)
    return extracted
case_metrics = try_extract_case_metrics(results)
df_base = pd.DataFrame([{
    "case_id": i,
    "query": tc.input,
    "actual_output": tc.actual_output,
    "expected_output": tc.expected_output,
} for i, tc in enumerate(test_cases)])
# Fallback keeps the merge on "case_id" valid even if no metric data was extracted.
df_metrics = pd.DataFrame(case_metrics) if case_metrics else pd.DataFrame({"case_id": list(range(len(test_cases)))})
df = df_base.merge(df_metrics, on="case_id", how="left")
score_cols = [c for c in df.columns if c.endswith("_score")]
compact = df[["case_id", "query"] + score_cols].copy()
print("\n📊 Compact score table:")
display(compact)
print("\n🧾 Full details (includes reasons):")
display(df)
print("\n✅ Done. Tip: if contextual precision/recall are low, improve retriever ranking/coverage; if faithfulness is low, tighten generation to only use context.")
We finalize the workflow by executing the evaluation function, which triggers the LLM-as-a-judge process to score each test case against our defined metrics. We then aggregate these scores and their corresponding qualitative reasons into a centralized dataframe, revealing where the RAG pipeline excels and where retrieval or generation needs further optimization.
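If you want a single headline number per metric, a small aggregation over the compact score table (an optional convenience step layered on the dataframe built above) does the job:
# Optional: average each *_score column across test cases for a quick per-metric summary.
if score_cols:
    print(compact[score_cols].mean().round(3).sort_values())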
Finally, we conclude by running our comprehensive evaluation suite, in which DeepEval transforms complex linguistic outputs into actionable data using metrics such as Faithfulness, Contextual Precision, and the G-Eval rubric. This systematic approach allows us to diagnose silent failures in retrieval and hallucinated generation with precision, providing the rationale needed to justify architectural changes. With these results, we move from an experimental prototype to a production-ready RAG system backed by a verifiable, metric-driven safety net.
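To carry the same checks into CI, DeepEval's pytest integration can turn each case into a passing or failing unit test. The sketch below is illustrative only: it assumes the assert_test helper, a hypothetical rag_pipeline module exposing the retriever and rag_answer from this tutorial, and execution via deepeval test run.
# test_rag.py -- a minimal sketch of running the same checks under pytest.
# Assumes OPENAI_API_KEY is set and the tutorial helpers live in a module
# named rag_pipeline (hypothetical). Run with: deepeval test run test_rag.py
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

from rag_pipeline import retriever, rag_answer  # hypothetical module holding the tutorial code

@pytest.mark.parametrize("query,expected", [
    ("What is DeepEval used for?",
     "DeepEval is used to evaluate and unit test LLM applications."),
])
def test_rag_faithfulness(query, expected):
    # Build the test case the same way the tutorial does: retrieve, then generate.
    contexts = [r["text"] for r in retriever.retrieve(query, k=4)]
    tc = LLMTestCase(
        input=query,
        actual_output=rag_answer(query, contexts),
        expected_output=expected,
        retrieval_context=contexts,
    )
    assert_test(tc, [FaithfulnessMetric(threshold=0.5, model="gpt-4.1")])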
