Google DeepMind added agentic vision capabilities to its Gemini 3 Flash model this week, making image analysis an active rather than passive task.
While typical multimodal models process an image in a single “glance,” the new agentic capabilities let Gemini actively study a photo and zoom in on specific details, such as a street sign or a serial number on a microchip.
The new feature works by generating and running Python code that zooms, manipulates, and systematically inspects images.
“By combining visual reasoning with code execution, one of the first tools supported by Agent Vision, the model builds a plan by zooming, inspecting, and manipulating images step-by-step, grounding its answers in visual evidence,” Rohan Doshi, product manager at Google DeepMind, wrote in a blog post about the announcement.
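Google does not show the code the model actually generates, but a zoom-and-inspect step of the kind described might look like the following sketch; the file name and crop coordinates are purely illustrative:

```python
from PIL import Image

# Load the photo under inspection (path is illustrative).
image = Image.open("street_scene.jpg")

# Crop the region of interest, e.g. a street sign, then upscale it
# so fine details such as lettering become legible.
left, top, right, bottom = 420, 180, 620, 320  # hypothetical coordinates
sign = image.crop((left, top, right, bottom))
sign_zoomed = sign.resize((sign.width * 4, sign.height * 4), Image.LANCZOS)

# The enlarged crop can then be handed back to the model for a closer look.
sign_zoomed.save("street_sign_zoomed.png")
```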
This feature uses a Think-Act-Observe loop: Gemini 3 Flash studies the user’s query and image and formulates a plan, actively runs Python code to analyze the image, and then observes the results before generating its final response.
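Google has not published the orchestration behind this loop, but the pattern itself is simple to sketch. In the minimal example below, every class and function name is invented for illustration and stands in for the model call and the sandboxed code-execution tool:

```python
from dataclasses import dataclass

# Every name below is illustrative; Google has not published this interface.

@dataclass
class Step:
    is_final: bool        # True once the model is ready to answer
    code: str = ""        # Python snippet to execute in the "Act" phase
    answer: str = ""      # final response once enough evidence is gathered

def plan_step(query: str, observations: list[str]) -> Step:
    """'Think': stand-in for a Gemini call that proposes the next step."""
    # A real system would send the query, the image, and prior observations
    # to the model; this stub simply returns a final answer straight away.
    return Step(is_final=True, answer="(model answer goes here)")

def run_python(code: str) -> str:
    """'Act': stand-in for the sandboxed code-execution tool."""
    return "(execution output goes here)"

def agentic_vision_answer(query: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        step = plan_step(query, observations)        # Think
        if step.is_final:
            return step.answer
        observations.append(run_python(step.code))   # Act, then Observe
    return plan_step(query, observations).answer     # answer with what we have

print(agentic_vision_answer("What does the street sign say?"))
```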
According to Google, the update delivered quality improvements of between 5 and 10 percent on vision benchmarks.
Google said the update has already demonstrated a range of new agentic behaviors in Google AI Studio, such as iterative zooming, direct image annotation, and visual plotting. The latter is said to reduce hallucinations, a common problem in visual math tasks.
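Again, the generated code is not public, but direct image annotation of the sort described could be as simple as drawing the identified evidence onto the image; the file name, coordinates, and label here are illustrative:

```python
from PIL import Image, ImageDraw

# Open the image and draw on it directly (path and coordinates are illustrative).
image = Image.open("circuit_board.jpg").convert("RGB")
draw = ImageDraw.Draw(image)

# Box the region the model identified, e.g. a serial number on a microchip,
# and label it so the final answer can be checked against visual evidence.
draw.rectangle((150, 90, 310, 140), outline="red", width=3)
draw.text((150, 70), "serial number", fill="red")

image.save("circuit_board_annotated.png")
```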
Looking ahead, the company said it plans to build more code-driven behavior into the model, meaning that some capabilities that currently require a specific prompt will become autonomous features.
More features, such as web and reverse image searches, as well as a larger range of model sizes, are expected to be introduced in the future.
