Author(s): michaelczarnecki
Originally published on Towards AI.
The next evaluation technique we will discuss in this section is the trajectory evaluator.
This tool does not just look at the final answer produced by the model. Instead, it examines the entire reasoning process, step by step.
The trajectory evaluator helps us decide whether the model reached the result correctly, or whether it made mistakes along the way and arrived at the correct final answer only by accident.
What does “trajectory” mean?
Imagine a student is solving a math problem.
You can only see the final result – if it matches, the student gets points.
But often what matters more is how the student got there.
Because if the result is correct but the reasoning is completely wrong, the student will fail on the next problem.
It’s the same with language models. Sometimes the final answer is correct, but the logic along the way is weak. And sometimes, when the final answer is wrong, we want to find out exactly which logical step introduced the mistake.
The trajectory evaluator lets us evaluate the quality of the reasoning, not just the final output.
Why does it matter?
Some practical reasons:
- Debugging – when the model gives a wrong answer, you can see where the logic collapsed.
- Safety – in critical applications (medical, legal, compliance), it matters not only what the model answered, but also why it answered that way.
- Training and fine-tuning – if you are training or tuning a model, you want the reasoning process to match your expectations.
- Transparency – users often need to understand the process before they can trust the system.
How does the trajectory evaluator work?
The trajectory evaluator compares the model-generated trajectory – its sequence of logic steps – to a human-generated reference trajectory.
Scoring takes place at two levels:
- Step by step – do the individual reasoning steps match the reference and the context?
- Overall – how far does the full logic path deviate from the expected path?
Under the hood, it can use metrics like ROUGE (text overlap), similarity measures, and analysis of how errors propagate to subsequent stages.
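To make the overlap idea concrete, here is a minimal sketch of a ROUGE-1-style score. This is an illustration of the general technique, not LangChain's internal implementation: each predicted step is scored by unigram recall against the matching reference step, and a trajectory-level score averages the per-step results.

```python
# Illustrative sketch (not LangChain's internal implementation): a simple
# ROUGE-1-style overlap between a predicted step and a reference step,
# plus a trajectory-level score that averages the per-step overlaps.
def step_overlap(predicted: str, reference: str) -> float:
    """Unigram recall: fraction of reference words found in the prediction."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    pred_words = set(predicted.lower().split())
    return len(pred_words & ref_words) / len(ref_words)

def trajectory_overlap(predicted_steps, reference_steps) -> float:
    """Average per-step overlap; missing trailing steps count as 0."""
    if not reference_steps:
        return 0.0
    scores = [step_overlap(p, r) for p, r in zip(predicted_steps, reference_steps)]
    # Pad with zeros if the prediction is shorter than the reference.
    scores += [0.0] * (len(reference_steps) - len(scores))
    return sum(scores) / len(reference_steps)

print(step_overlap("The 5th number is 4", "The 5th number is 5"))  # 0.8
```

Real evaluators are more sophisticated (an LLM judges semantic equivalence rather than word overlap), but the intuition is the same: score each step, then aggregate across the whole path.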
OK – let’s look at a practical example in Jupyter Notebook.
Install and import libraries, load environment variables
!pip install langchain-classic langchain-core langchain-openai python-dotenv
from langchain_classic.evaluation import load_evaluator
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
import json

load_dotenv()
Trajectory evaluation
Trajectory evaluation compares the model's logic steps (predictions) with a reference sequence, assessing the consistency, order, and completeness of each step. It not only detects a wrong final answer, but also pinpoints where the logic chain deviated (for example, a missed or distorted step), making it particularly useful for agents and multi-step tasks.
# LLM used by the evaluator
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
evaluator = load_evaluator("trajectory", llm=llm)

reference_steps = [
    "Fibonacci starts at 0, 1",
    "The next number is the sum of the previous two",
    "The 5th number is 5",
]

prediction_steps = [
    "Fibonacci starts at 0, 1",
    "Subsequent numbers are the sum of the previous ones",
    "The 5th number is 4",
]
result = evaluator.invoke({
    "question": "What is the 5th number of the Fibonacci sequence?",
    "answer": "4",
    "agent_trajectory": prediction_steps,
    "reference": {
        "answer": "5",
        "agent_trajectory": reference_steps
    }
})
print(json.dumps(result, indent=4))
Output:
{
    "question": "What is the 5th number of the Fibonacci sequence?",
    "answer": "4",
    "agent_trajectory": [
        "Fibonacci starts at 0, 1",
        "Subsequent numbers are the sum of the previous ones",
        "The 5th number is 4"
    ],
    "reference": "\n\nThe following is the expected answer. Use this to measure correctness:\n[GROUND_TRUTH]\n{'answer': '5', 'agent_trajectory': ['Fibonacci starts at 0, 1', 'The next number is the sum of the previous two', 'The 5th number is 5']}\n[END_GROUND_TRUTH]\n",
    "score": 0.0,
    "reasoning": "Let's evaluate the AI language model's answer step by step based on the provided criteria:\n\ni. **Is the final answer helpful?**\n - The final answer given by the AI model is \"4\", which is incorrect. The 5th number in the Fibonacci sequence is actually \"5\". Therefore, the answer is not helpful.\n\nii. **Does the AI language model use a logical sequence of tools to answer the question?**\n - The AI model provides a logical sequence in its reasoning by stating how the Fibonacci sequence starts and how subsequent numbers are derived. However, it ultimately arrives at the wrong conclusion.\n\niii. **Does the AI language model use the tools in a helpful way?**\n - The model does not appear to use any specific tools in this case. It relies on its internal reasoning rather than utilizing any external tools to verify or calculate the Fibonacci sequence. This lack of tool usage is a missed opportunity to ensure accuracy.\n\niv. **Does the AI language model use too many steps to answer the question?**\n - The model does not use an excessive number of steps; it provides a concise explanation of the Fibonacci sequence. However, the explanation does not lead to the correct answer.\n\nv. **Are the appropriate tools used to answer the question?**\n - The model does not use any tools at all. Given the straightforward nature of the question, it could have used a calculator tool to verify the Fibonacci sequence or simply relied on its internal knowledge. The absence of tool usage is a significant oversight.\n\n**Judgment:**\nThe AI language model's final answer is incorrect, and it did not utilize any tools to verify its reasoning. While the explanation of the Fibonacci sequence is logical, it ultimately leads to an incorrect answer. Therefore, the performance is poor.\n\n**"
}
correct_prediction_steps = [
    "Fibonacci starts at 0, 1",
    "The next number is the sum of the previous two",
    "The 5th number is 5",
]

result = evaluator.invoke({
    "question": "What is the 5th number of the Fibonacci sequence?",
    "answer": "5",
    "agent_trajectory": correct_prediction_steps,
    "reference": {
        "answer": "5",
        "agent_trajectory": reference_steps
    }
})
print(json.dumps(result, indent=4))
Output:
{
    "question": "What is the 5th number of the Fibonacci sequence?",
    "answer": "5",
    "agent_trajectory": [
        "Fibonacci starts at 0, 1",
        "The next number is the sum of the previous two",
        "The 5th number is 5"
    ],
    "reference": "\n\nThe following is the expected answer. Use this to measure correctness:\n[GROUND_TRUTH]\n{'answer': '5', 'agent_trajectory': ['Fibonacci starts at 0, 1', 'The next number is the sum of the previous two', 'The 5th number is 5']}\n[END_GROUND_TRUTH]\n",
    "score": 1.0,
    "reasoning": "Let's evaluate the AI language model's answer step by step based on the provided criteria:\n\ni. **Is the final answer helpful?**\n - Yes, the final answer \"5\" is correct and directly answers the question about the 5th number in the Fibonacci sequence.\n\nii. **Does the AI language model use a logical sequence of tools to answer the question?**\n - The model's reasoning is logical. It starts with the definition of the Fibonacci sequence and explains how to derive the numbers, leading to the correct answer.\n\niii. **Does the AI language model use the tools in a helpful way?**\n - The model effectively uses the reasoning process to arrive at the answer. However, it does not explicitly mention using any tools, but the reasoning provided is sufficient to understand how it arrived at the answer.\n\niv. **Does the AI language model use too many steps to answer the question?**\n - No, the model does not use too many steps. The explanation is concise and directly leads to the answer without unnecessary elaboration.\n\nv. **Are the appropriate tools used to answer the question?**\n - While the model does not explicitly mention using any tools, the reasoning provided is appropriate for answering the question. The Fibonacci sequence is a well-known mathematical concept, and the model's explanation is accurate.\n\n**Judgment:**\nThe AI language model provided a correct and helpful answer with a logical sequence of reasoning. It did not use any tools explicitly, but the reasoning was sufficient. Given the correctness and clarity of the response, I would rate the model's performance as a 5.\n\n**"
}
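In an automated pipeline you rarely read the reasoning by hand; instead you act on the numeric score. Here is a minimal sketch of that post-processing step, assuming a result dictionary shaped like the outputs above (the `score` key and the 0.7 threshold are illustrative choices, not part of the evaluator's API).

```python
# Minimal post-processing sketch: treat an evaluation as passing when the
# evaluator's score clears a chosen threshold (0.7 here is an arbitrary choice).
def passes(result: dict, threshold: float = 0.7) -> bool:
    """Return True if the trajectory evaluation score meets the threshold."""
    return result.get("score", 0.0) >= threshold

print(passes({"score": 1.0}))  # True
print(passes({"score": 0.0}))  # False
```

A gate like this lets you fail a test suite, flag a trace for review, or trigger a retry whenever an agent's reasoning drifts too far from the reference.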
That’s all for this section dedicated to LLM trajectory evaluation. In the next article we will implement guardrails that prevent LLM-based applications from returning unexpected output.
See the next chapter
See the previous chapter
See the full code from this article in the GitHub repository