Frontier multimodal models typically process an image in a single pass. If they miss the serial number on a chip or a small symbol on a building plan, they often have to guess. Google's new Agentic Vision capability in Gemini 3 Flash changes this by turning image understanding into an active, tool-using loop grounded in visual evidence.
The Google team reports that enabling code execution with Gemini 3 Flash delivers a 5-10% quality increase across most vision benchmarks, a significant advantage for production vision workloads.
What does Agentic Vision do?
Agentic Vision is a new capability built into Gemini 3 Flash that combines visual reasoning with Python code execution. Instead of treating vision as a fixed embedding stage, the model can:
- Plan how to inspect an image.
- Run Python that manipulates or analyzes that image.
- Re-inspect the transformed image before replying.
The core behavior is to treat image understanding as an active investigation rather than a frozen snapshot. This design matters for tasks that require accurate reading of small text, dense tables, or complex engineering diagrams.
The think, act, observe loop
Agentic Vision introduces a structured think, act, observe loop into image understanding tasks.
- Think: Gemini 3 Flash analyzes the user query and the initial image, then develops a multi-step plan. For example, it may decide to zoom into multiple areas, parse a table, and then calculate a statistic.
- Act: The model generates and executes Python code that manipulates or analyzes the image. Official examples include:
  - Cropping and zooming.
  - Rotating or annotating images.
  - Running calculations.
  - Counting bounding boxes or other detected elements.
- Observe: The transformed images are appended to the model's context window. The model then inspects this new data with more detailed visual context and finally answers the original user query.
In practice, this means the model is not limited to its first view of an image. It can iteratively refine its evidence using external computation and then reason over the updated context.
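The loop described above can be sketched as plain control flow. This is a simplified, hypothetical stand-in: the `think` and `act` functions below are illustrative placeholders, not the actual Gemini 3 Flash internals.

```python
# Hypothetical sketch of a think-act-observe loop for image understanding.
# think() and act() are illustrative stand-ins, not real model internals.

def think(query, views):
    """Plan the next step from the query and the visual context so far."""
    if len(views) == 1:                      # only the original image seen
        return {"action": "crop", "box": (100, 100, 300, 300)}
    return {"action": "answer"}              # enough evidence gathered

def act(plan):
    """Stand-in for executed Python: produce a cropped view."""
    left, top, right, bottom = plan["box"]
    return {"kind": "patch", "size": (right - left, bottom - top)}

def run_agent(query, image):
    views = [image]                          # the growing visual context
    while True:
        plan = think(query, views)
        if plan["action"] == "answer":
            return views                     # answer from all observed views
        views.append(act(plan))              # observe: new view joins context

views = run_agent("read the serial number", {"kind": "image", "size": (2000, 2000)})
```

The key structural point is that `views` only grows: every tool call adds evidence to the context rather than replacing it.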
Zooming and inspecting high-resolution plans
The headline use case is automatic zooming on high-resolution inputs. Gemini 3 Flash is trained to zoom in whenever it detects that fine detail matters for the task.
The Google team highlights PlanCheckSolver.com, an AI-powered building plan verification platform:
- PlanCheckSolver enables code execution with Gemini 3 Flash.
- The model generates Python code to crop and analyze patches of large architectural plans, such as roof eaves or building blocks.
- These cropped patches are added back to the context window as new images.
- Based on these patches, the model checks compliance with complex building codes.
- PlanCheckSolver reports a 5% accuracy improvement after enabling code execution.
This workflow is directly relevant to engineering teams working with CAD exports, structural layouts, or regulatory drawings that cannot be safely downsampled without losing detail.
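As a rough illustration of the kind of cropping code the model might generate for a large plan, here is a minimal sketch that computes overlapping patch windows; the patch size, overlap, and function name are assumptions for illustration, not PlanCheckSolver's actual code.

```python
# Hypothetical sketch: computing overlapping crop windows over a large plan
# image. Pure coordinate math; each box could then be cut out with any
# imaging library and fed back as a new image.

def crop_windows(width, height, patch=1024, overlap=128):
    """Yield (left, top, right, bottom) boxes covering the full image."""
    step = patch - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((left, top,
                          min(left + patch, width),
                          min(top + patch, height)))
    return boxes

# A 3000x2000 px plan sheet is covered by 12 overlapping 1024 px patches.
boxes = crop_windows(3000, 2000)
```

The overlap matters for plans: a roof eave or dimension line that straddles a patch boundary still appears whole in at least one patch.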
Image annotation as a visual scratchpad
Agentic Vision also exposes an annotation capability where Gemini 3 Flash can treat an image as a visual scratchpad.
In the Gemini app example:
- The user asks the model to count the fingers on a hand.
- To reduce counting errors, the model executes Python that:
  - draws bounding boxes around each identified finger.
  - places a numerical label above each digit.
- The annotated image is fed back into the context window.
- The final count is derived from this pixel-aligned annotation.
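A minimal sketch of the scratchpad idea using Pillow: draw numbered boxes on a copy of the image, then hand the annotated copy back for re-inspection. The box coordinates and helper name are invented for illustration, not the model's actual generated code.

```python
# Hypothetical sketch of the "visual scratchpad": annotate a copy of the
# image with numbered bounding boxes, leaving the original untouched.
from PIL import Image, ImageDraw

def annotate(image, boxes):
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for i, (left, top, right, bottom) in enumerate(boxes, start=1):
        draw.rectangle((left, top, right, bottom), outline=(255, 0, 0), width=3)
        draw.text((left, top - 12), str(i), fill=(255, 0, 0))  # label above box
    return annotated

# Placeholder image and made-up finger boxes for illustration.
hand = Image.new("RGB", (400, 300), "white")
finger_boxes = [(40, 50, 80, 200), (100, 30, 140, 200),
                (160, 20, 200, 200), (220, 30, 260, 200),
                (280, 60, 320, 200)]
scratchpad = annotate(hand, finger_boxes)
count = len(finger_boxes)   # the count is read off the labeled boxes
```

Because the labels are rendered into pixels, the model's second look at `scratchpad` is grounded in the same coordinates it will report.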
Visual mathematics and plotting with deterministic code
Large language models often hallucinate when performing multi-step visual arithmetic or reading dense tables from screenshots. Agentic Vision addresses this by offloading computation to a deterministic Python environment.
A Google demo in Google AI Studio shows the following workflow:
- Gemini 3 Flash parses a high-density table from an image.
- It identifies the raw numerical values required for analysis.
- It writes Python code that:
  - normalizes the metrics, for example scaling the SOTA values to 1.0.
  - uses Matplotlib to produce a bar chart of relative performance.
- The generated plot and normalized values are returned as part of the context, and the final answer is based on these computed results.
For data science teams, this creates a clean separation of concerns:
- The model handles perception and planning.
- Python handles numerical computation and plotting.
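That division of labor can be seen in a small sketch: the scores below stand in for values the model has already read off a table image (they are invented for illustration), while Python normalizes them against the SOTA score and renders the bar chart deterministically.

```python
# Hypothetical sketch of the deterministic half of the workflow: normalize
# extracted scores so the best (SOTA) value maps to 1.0, then plot them.
import io

import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import matplotlib.pyplot as plt

# Invented example values, standing in for numbers parsed from a table image.
scores = {"Model A": 71.2, "Model B": 68.5, "SOTA": 74.0}

best = max(scores.values())
relative = {name: round(v / best, 3) for name, v in scores.items()}

fig, ax = plt.subplots()
ax.bar(list(relative.keys()), list(relative.values()))
ax.set_ylabel("score relative to SOTA (SOTA = 1.0)")

buf = io.BytesIO()
fig.savefig(buf, format="png")  # the rendered chart rejoins the context
```

The arithmetic is exact and repeatable, which is the point: the model never has to "eyeball" ratios off the screenshot.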
How developers can use Agentic Vision today
Agentic Vision is now available with Gemini 3 Flash through multiple Google surfaces:
- Gemini API in Google AI Studio: Developers can try the demo applications or use the AI Studio playground, where Agentic Vision is enabled by turning on 'Code execution' under the Tools section.
- Vertex AI: The same capability is available through the Gemini API on Vertex AI, with configuration controlled through the standard model and tool settings.
- Gemini app: Agentic Vision is starting to roll out to the Gemini app. Users can access it by selecting 'Thinking' from the model drop-down.
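For API users, enabling the code-execution tool looks roughly like the following with the `google-genai` Python SDK. The tool configuration follows the public Gemini API docs; the model id string is an assumption, and the request itself is left commented out because it needs an API key.

```python
# Hypothetical sketch: enabling the code-execution tool via the
# google-genai SDK. Check Google AI Studio for the current Gemini 3 Flash
# model identifier; "gemini-3-flash" below is an assumption.
from google import genai
from google.genai import types

config = types.GenerateContentConfig(
    tools=[types.Tool(code_execution=types.ToolCodeExecution())],
)

# client = genai.Client()  # reads GEMINI_API_KEY from the environment
# response = client.models.generate_content(
#     model="gemini-3-flash",                      # assumed model id
#     contents=[image_part,                        # your image as a Part
#               "Read the serial number on the chip."],
#     config=config,
# )
```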
Key takeaways
- Agentic Vision turns Gemini 3 Flash into an active vision agent: Image understanding is no longer just a forward pass. The model can generate plans, call Python tools on images, and re-inspect the transformed results before answering.
- The think, act, observe loop is the main execution pattern: Gemini 3 Flash plans a multi-step visual analysis, executes Python to crop, annotate, or run calculations on images, then observes the new visual context added to its context window.
- Code execution yields 5-10% gains on vision benchmarks: Enabling Python code execution with Agentic Vision produces a 5-10% quality increase across most vision benchmarks, while PlanCheckSolver.com reports a 5% accuracy improvement on building plan validation.
- Deterministic Python handles visual mathematics, tables, and plotting: The model parses tables from images, extracts numerical values, then uses Python and Matplotlib to normalize metrics and generate plots, reducing hallucinations in multi-step visual arithmetic.
Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.

