Modern conversational AI agents can handle complex, multi-turn tasks such as asking clarifying questions and proactively assisting users. However, they often struggle with long conversations, forgetting interruptions or producing irrelevant responses. These systems require continuous training and feedback to improve, but relying on the “gold standard” of live human testing is expensive, time-consuming, and difficult to scale.
As a scalable alternative, the AI research community is increasingly turning to user simulators – LLM-powered agents explicitly instructed to role-play as human users. However, modern LLM-based simulators still suffer from a significant realism gap, displaying unusual levels of patience or unrealistic, sometimes encyclopedic, knowledge of a domain. Think of it like a pilot training in a flight simulator: the best simulators are as realistic as possible, including unpredictable weather, sudden gusts of wind, and even the occasional bird flying into an engine. To close the realism gap for LLM-based user simulators, we first need to quantify it.
In our recent paper, we introduce ConvApparel, a new dataset of human-AI conversations designed to do exactly that. ConvApparel exposes the hidden flaws in today’s user simulations and offers a path toward building AI-based testers we can trust. To capture the full spectrum of human behavior – from satisfaction to intense irritation – we employed a dual-agent data collection protocol, where participants were randomly assigned to either a helpful “good” agent or an intentionally unhelpful “bad” agent. This setup, paired with a three-pillar validation strategy involving population-level statistics, human-likeness scoring, and counterfactual validation, allows us to move beyond simple surface-level replication.
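To make the dual-agent protocol concrete, here is a minimal sketch of how participants might be randomly split between the two agent conditions. The system prompts, the `bad_fraction` split, and the `run_conversation` placeholder are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical system prompts -- the actual ConvApparel prompts are not
# described in this post.
GOOD_AGENT_PROMPT = (
    "You are a helpful shopping assistant. Ask clarifying questions "
    "and resolve the user's request efficiently."
)
BAD_AGENT_PROMPT = (
    "You are an unhelpful shopping assistant. Misunderstand requests, "
    "suggest irrelevant items, and ignore corrections."
)

def assign_condition(participant_id: str, bad_fraction: float = 0.5) -> str:
    """Randomly assign a participant to the 'good' or 'bad' agent condition.

    Seeding on the participant ID keeps the assignment reproducible
    (an illustrative choice; the paper's assignment mechanism may differ).
    """
    rng = random.Random(participant_id)
    return "bad" if rng.random() < bad_fraction else "good"

def system_prompt_for(condition: str) -> str:
    return BAD_AGENT_PROMPT if condition == "bad" else GOOD_AGENT_PROMPT

# Example: split a small participant pool across the two conditions.
for pid in (f"p{i:03d}" for i in range(10)):
    condition = assign_condition(pid)
    prompt = system_prompt_for(condition)
    print(pid, condition)
    # run_conversation(pid, prompt) would collect the actual dialogue;
    # it is a placeholder here, not an API from the paper.
```

The point of randomizing the condition is that the resulting dataset covers both smooth, satisfied interactions and frustrating ones, giving the downstream validation pillars a realistic range of human behavior to compare simulators against.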