Achieving better intent extraction through decomposition

As AI technologies advance, truly helpful agents will be able to better anticipate user needs. For experiences on mobile devices to deliver on that promise, the underlying models need to understand what the user is doing (or trying to do) while interacting with them. Once current and past actions are understood, the model has more context with which to predict likely next actions. For example, if a user previously searched for music festivals across Europe and is now looking for a flight to London, the agent might offer to find festivals in London on the dates of that trip.

Large multimodal LLMs are already quite good at understanding user intent from user interface (UI) trajectories. But using a large LLM for this task typically requires sending information to a server, which can be slow, expensive, and risks exposing sensitive information.

Our recent paper “Small models, big results: Achieving better intent extraction through decomposition”, presented at EMNLP 2025, addresses the question of how to use small multimodal LLMs (MLLMs) to understand sequences of user interactions on web and mobile devices. By separating the understanding of user intent into two steps, first summarizing each screen independently and then extracting an intent from the sequence of generated summaries, we make the task more tractable for smaller models. We also formalize metrics for evaluating model performance and show that our approach yields results comparable to those of much larger models, demonstrating its potential for on-device applications. This work builds on earlier work from our team on understanding user intent.
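The two-step decomposition can be sketched in a few lines of code. The sketch below is purely illustrative: `summarize_screen` and `extract_intent` are hypothetical placeholders standing in for prompts to a small on-device MLLM, not functions from the paper or any real library.

```python
# Illustrative sketch of the two-stage decomposition pipeline.
# Stage 1: summarize each screen independently.
# Stage 2: extract the user's intent from the ordered summaries.
# Both model calls are stubbed out; a real system would prompt a
# small multimodal LLM at each step.

def summarize_screen(screenshot: str) -> str:
    # Placeholder: a real implementation would send the screenshot to
    # a small MLLM and return a natural-language screen summary.
    return f"summary of {screenshot}"

def extract_intent(summaries: list[str]) -> str:
    # Placeholder: a real implementation would prompt the model with
    # the summary sequence and ask for the overarching user intent.
    return "intent inferred from: " + " -> ".join(summaries)

def intent_from_trajectory(screens: list[str]) -> str:
    # Decomposition: per-screen summarization first, then a single
    # intent-extraction call over the summaries, instead of asking
    # the model to reason over the whole trajectory at once.
    summaries = [summarize_screen(s) for s in screens]
    return extract_intent(summaries)

print(intent_from_trajectory(["search_results.png", "flight_booking.png"]))
```

The design point is that each stage is a smaller, more constrained task than end-to-end intent extraction, which is what makes it feasible for smaller on-device models.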
