Google has officially released Android Bench, a new leaderboard and evaluation framework designed to measure how large language models (LLMs) perform specifically on Android development tasks. The dataset, methodology, and test harness are open source and publicly available on GitHub.
Benchmark Methodology and Task Design
Generic coding benchmarks often fail to capture the platform-specific dependencies and nuances of mobile development. Android Bench addresses this by curating tasks directly from real-world, public GitHub Android repositories.
The scenarios evaluated cover varying difficulty levels, including:
- Resolving breaking changes in Android releases.
- Implementing domain-specific functionality, such as networking on Wear OS devices.
- Migrating code to the latest version of Jetpack Compose (Android's modern toolkit for building native user interfaces).
To ensure model-agnostic evaluation, the framework prompts an LLM to fix the reported problem and then verifies the fix using standard developer testing practices:
- Unit tests: Tests that verify small, isolated blocks of code (such as a single function or class) without requiring the Android framework.
- Instrumentation tests: Tests that run on a physical Android device or emulator to verify how the code interacts with the real Android system and APIs.
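To make the unit-test side of this distinction concrete, here is a minimal sketch of a framework-free test: the class, method, and values are all hypothetical, and plain Java assertions stand in for a JUnit runner. Logic like this runs on the JVM alone; only code touching real Android APIs would need an instrumentation test on a device or emulator.

```java
// Hypothetical example of pure-logic code that a unit test can cover
// without the Android framework. Names and values are illustrative.
public class RetryPolicyTest {
    // Code under test: exponential backoff delay, base * 2^attempt.
    static long backoffMillis(int attempt, long baseMillis) {
        if (attempt < 0) throw new IllegalArgumentException("attempt must be >= 0");
        return baseMillis * (1L << attempt);
    }

    public static void main(String[] args) {
        // Unit test: verifies isolated logic on the JVM, no device needed.
        if (backoffMillis(0, 100) != 100) throw new AssertionError("attempt 0");
        if (backoffMillis(3, 100) != 800) throw new AssertionError("attempt 3");
        System.out.println("unit tests passed");
        // An instrumentation test, by contrast, would run on a device or
        // emulator and exercise real Android APIs (Context, storage, etc.).
    }
}
```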
Reducing Data Contamination
A significant challenge in evaluating models on public benchmarks is data contamination. This occurs when an LLM is exposed to the assessment tasks during training, so the model memorizes answers rather than demonstrating genuine reasoning and problem-solving ability.
To ensure the integrity of Android Bench results, the Google team implemented several preventative measures:
- Manual review of agent trajectories: Developers review the step-by-step logic and action path taken by the model to arrive at a solution, ensuring that it is actively solving the problem.
- Canary string integration: A unique, identifiable string of text is embedded in the benchmark dataset. It signals to the web crawlers and data scrapers used by AI companies that this data should be excluded from future model training runs.
Initial Android Bench Leaderboard Results
For its initial release, the benchmark strictly measures base model performance, intentionally omitting complex agentic workflows and tool use.
The score represents the average percentage of the 100 test cases solved across 10 independent runs for each model. Because LLM output may vary between runs, each result includes a confidence interval (CI) at p < 0.05. The CI gives the expected performance range, reflecting the statistical reliability of the model's score.
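A score-plus-CI computation of this kind can be sketched as follows. The run data is invented and the normal-approximation formula is an assumption; the benchmark does not publish its exact CI method.

```java
import java.util.Arrays;

public class ScoreCI {
    // Given per-run pass rates (fraction of the 100 cases solved in each of
    // the 10 runs), return {mean, lower bound, upper bound} of a ~95% CI
    // using a normal approximation (assumption; the real method may differ).
    static double[] meanAnd95CI(double[] runs) {
        double mean = Arrays.stream(runs).average().orElse(0);
        double var = Arrays.stream(runs)
                .map(r -> (r - mean) * (r - mean))
                .sum() / (runs.length - 1);           // sample variance
        double margin = 1.96 * Math.sqrt(var / runs.length); // ~95% margin
        return new double[] { mean, mean - margin, mean + margin };
    }

    public static void main(String[] args) {
        // Hypothetical pass rates for one model across 10 runs.
        double[] runs = {0.70, 0.74, 0.69, 0.75, 0.72,
                         0.71, 0.73, 0.74, 0.70, 0.72};
        double[] r = meanAnd95CI(runs);
        System.out.printf("score=%.1f%% CI=[%.1f%%, %.1f%%]%n",
                r[0] * 100, r[1] * 100, r[2] * 100);
    }
}
```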
In this first release, models completed 16% to 72% of tasks successfully.
| Model | Score (%) | CI Range (%) | Date |
| --- | --- | --- | --- |
| Gemini 3.1 Pro Preview | 72.4 | 65.3-79.8 | 2026-03-04 |
| Claude Opus 4.6 | 66.6 | 58.9-73.9 | 2026-03-04 |
| GPT-5.2-Codex | 62.5 | 54.7-70.3 | 2026-03-04 |
| Claude Opus 4.5 | 61.9 | 53.9-69.6 | 2026-03-04 |
| Gemini 3 Pro Preview | 60.4 | 52.6-67.8 | 2026-03-04 |
| Claude Sonnet 4.6 | 58.4 | 51.1-66.6 | 2026-03-04 |
| Claude Sonnet 4.5 | 54.2 | 45.5-62.4 | 2026-03-04 |
| Gemini 3 Flash Preview | 42.0 | 36.3-47.9 | 2026-03-04 |
| Gemini 2.5 Flash | 16.1 | 10.9-21.9 | 2026-03-04 |
Note: You can try all evaluated models for your Android projects using API keys in the latest stable version of Android Studio.
Key Takeaways
- Platform-specific focus: Android Bench addresses the shortcomings of general coding benchmarks by measuring how well LLMs handle the unique complexities, APIs, and dependencies of the Android ecosystem.
- Based on real-world scenarios: Instead of isolated algorithm tests, the benchmark evaluates models against real challenges pulled from public GitHub repositories. Tasks include resolving breaking API changes, migrating legacy UI code to Jetpack Compose, and handling device-specific networking (for example, on Wear OS).
- Verifiable, model-agnostic testing: Generated code is judged on whether it actually works, not on which model produced it. The framework automatically verifies LLM-proposed fixes using standard Android engineering practices: isolated unit tests and emulator-based instrumentation tests.
- Strict anti-contamination measures: To ensure models are actually reasoning rather than regurgitating memorized training data, the benchmark team manually reviews agent trajectories and embeds 'canary strings' that tell AI web crawlers to exclude the test dataset from training.
- Baseline performance established: The leaderboard's first release focuses solely on base model performance, without external agentic tools. Gemini 3.1 Pro Preview currently leads with a 72.4% success rate, while scores across all tested models ranged widely, from 16.1% to 72.4%.
Check out the repo for the dataset and technical details.
