Google DeepMind releases Quantization-Aware Training (QAT) checkpoints for the Gemma 4 family. The release is aimed at local deployment on edge devices and consumer GPUs. It follows the Gemma 4 launch in April and the 12B model two days earlier.
We compared the available Gemma 4 edge-model formats using only published numbers. The goal was simple: show what each precision level costs in memory, then show what QAT actually changes.
What does QAT actually do?
Quantization shrinks a model by reducing the precision of its weights. Standard post-training quantization (PTQ) compresses a finished model, which often degrades quality. QAT instead simulates quantization during training, so the model learns to compensate for the precision loss.
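As a rough illustration of "simulating quantization", the sketch below rounds weights to a 4-bit integer grid and maps them back to floats (often called fake quantization). It is a generic, minimal example and not Google's training recipe; the function name `fake_quantize`, the symmetric scaling, and the sample weights are assumptions made for illustration.

```python
def fake_quantize(weights, bits=4):
    """Quantize-dequantize: snap weights to a signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for symmetric 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = round(w / scale)                    # nearest integer level
        q = max(-qmax, min(qmax, q))            # clamp to the representable range
        dequantized.append(q * scale)           # back to float, now carrying rounding error
    return dequantized, scale

# Hypothetical weights, chosen only to show the rounding error.
weights = [0.70, -0.35, 0.12, -0.02]
dequantized, scale = fake_quantize(weights, bits=4)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]
print(dequantized, errors)
```

In QAT, a forward pass of this kind runs during training so the loss already reflects the rounding error. PTQ applies the same rounding only once, after training has finished.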
Google’s AI team says its QAT checkpoints deliver higher overall quality than a standard PTQ baseline. Google did not publish Gemma 4 QAT benchmark scores in the announcement. For reference, Gemma 3 QAT reduced the perplexity drop at Q4_0 by 54%, as measured with llama.cpp's perplexity evaluation. We cite this only as an example from the earlier generation.
Comparison task
Compare Gemma 4 E2B and E4B in three formats: BF16, Q4_0 QAT, and the new mobile QAT scheme. Rank them on memory footprint, quality preservation, and on-device accessibility. Use only published data.
Memory results
| Format | E2B | E4B | Source |
|---|---|---|---|
| BF16 (16-bit) | 9.6 GB | 15 GB | Official Gemma 4 Documentation |
| Q4_0 (4-bit, QAT) | 3.2 GB | 5 GB | Official Gemma 4 Documentation |
| Mobile (QAT, E2B) | ~1 GB | n/a | QAT announcement |
The Q4_0 figures match the footprint of PTQ Q4_0. QAT does not change the size of a given format; it improves the quality at that size. The new mobile scheme provides the additional reduction.
Using that mobile scheme, Google reduced Gemma 4 E2B to about 1 GB. Developers can go lower still: a text-only model without per-layer embeddings (PLE), and without the audio and vision encoders, requires less than 1 GB.
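The arithmetic behind the memory table can be checked in a few lines. The sketch below uses only the published footprints plus one fact about the llama.cpp Q4_0 layout (blocks of 32 weights sharing one 16-bit scale); the helper names are hypothetical.

```python
# Published footprints from the memory table, in GB.
footprints_gb = {
    "E2B": {"BF16": 9.6, "Q4_0": 3.2},
    "E4B": {"BF16": 15.0, "Q4_0": 5.0},
}

def reduction_ratio(model):
    """How many times smaller the Q4_0 footprint is than the BF16 footprint."""
    sizes = footprints_gb[model]
    return sizes["BF16"] / sizes["Q4_0"]

def q4_0_bits_per_weight(block_size=32, scale_bytes=2):
    """Nominal storage cost of Q4_0: one fp16 scale plus block_size 4-bit values per block."""
    block_bytes = scale_bytes + block_size * 4 / 8
    return block_bytes * 8 / block_size

for model in footprints_gb:
    print(model, round(reduction_ratio(model), 2))
print("Q4_0 bits per weight:", q4_0_bits_per_weight())
print("Nominal 16-bit to Q4_0 ratio:", round(16 / q4_0_bits_per_weight(), 2))
```

Both models come out at roughly a 3x reduction, somewhat below the nominal ratio for weights stored purely as Q4_0 blocks. The source does not explain the difference, so treat the nominal figure as context only.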
Format-by-format breakdown
BF16 is the quality baseline. E2B requires 9.6 GB and E4B requires 15 GB. This is a reference point, not a phone deployment target.
Q4_0 QAT is the general-purpose local format. E2B drops to 3.2 GB and E4B drops to 5 GB. QAT here retains more quality than PTQ at the same size. This format suits consumer GPUs. Earlier testing also ran E2B on a Raspberry Pi 5 at INT4.
The mobile format is an edge-specialized scheme. It brings E2B to about 1 GB. It uses static activations, channel-wise quantization, and targeted 2-bit compression.
How does the mobile scheme work?
The Google AI team developed four techniques for mobile hardware. Static activations pre-calculate scaling factors during training, reducing on-device work. Channel-wise quantization fits the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV cache optimization shrinks the active memory footprint.
The core reasoning layers remain at higher precision. This protects capability while cutting storage. Developers can also deploy text-only and remove the audio and vision encoders, which further trims memory for use cases that do not need multimodality.
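To make channel-wise quantization concrete, the sketch below compares one shared scale for a whole weight matrix against one scale per output channel. It is a generic illustration, not the Gemma 4 implementation; the toy matrix, the symmetric 4-bit grid, and all function names are assumptions.

```python
def symmetric_scale(values, bits):
    """Scale that maps the largest absolute value onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, bits):
    """Round to the integer grid and clamp to the symmetric range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def mean_abs_error(rows, scales, bits):
    """Average reconstruction error when row i is quantized with scales[i]."""
    total, count = 0.0, 0
    for row, scale in zip(rows, scales):
        for v, q in zip(row, quantize(row, scale, bits)):
            total += abs(v - q * scale)
            count += 1
    return total / count

# Hypothetical weights: each row is one output channel, with very different ranges.
weights = [
    [0.02, -0.01, 0.03, -0.015],
    [1.50, -0.80, 0.40, -1.20],
]
bits = 4
shared = symmetric_scale([v for row in weights for v in row], bits)
per_tensor_scales = [shared] * len(weights)
per_channel_scales = [symmetric_scale(row, bits) for row in weights]

err_tensor = mean_abs_error(weights, per_tensor_scales, bits)
err_channel = mean_abs_error(weights, per_channel_scales, bits)
print(err_tensor, err_channel)
```

With a single shared scale, the small-valued channel collapses to zero because the large channel sets the grid spacing. Giving each channel its own scale keeps the small channel resolvable at the same bit width.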
Dimension-by-dimension breakdown
The scores are a qualitative ranking of the formats for on-device use. Memory is the only axis backed by hard, published measurements. Quality reflects Google’s stated design, not measured Gemma 4 numbers. Each score has a one-line basis.
| Dimension | BF16 | Q4_0 QAT | Mobile QAT |
|---|---|---|---|
| Memory footprint | 1 – Heaviest, 9.6 GB E2B | 4 – 3.2 GB E2B | 5 – ~1 GB E2B text-only |
| Quality preservation | 5 – Full-precision baseline | 4 – QAT-preserved, near baseline | 3 – 2-bit token layers, core kept at higher precision |
| Decode speed | 2 – No quantization speedup | 4 – Faster 4-bit decode | 5 – Mobile-optimized static activations |
| Deployment breadth | 4 – Loadable but heavy | 5 – llama.cpp, Ollama, LM Studio, vLLM, MLX | 3 – LiteRT-LM, Transformers.js, edge-focused |
| On-device access | 1 – Requires larger GPU | 4 – Consumer GPU, Raspberry Pi 5 | 5 – Runs on a phone |
| Total (/25) | 13 | 21 | 21 |
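The totals row can be reproduced by summing the five per-dimension scores from the table. The short dictionary keys below are shorthand labels introduced for this sketch.

```python
# Scores copied from the table above; keys are shorthand for the five dimensions.
scores = {
    "BF16":       {"memory": 1, "quality": 5, "speed": 2, "deployment": 4, "on_device": 1},
    "Q4_0 QAT":   {"memory": 4, "quality": 4, "speed": 4, "deployment": 5, "on_device": 4},
    "Mobile QAT": {"memory": 5, "quality": 3, "speed": 5, "deployment": 3, "on_device": 5},
}

# Sum each format's scores, then collect every format that reaches the top total.
totals = {fmt: sum(dims.values()) for fmt, dims in scores.items()}
best = max(totals.values())
leaders = sorted(fmt for fmt, total in totals.items() if total == best)

print(totals)
print(leaders)
```

The sums match the totals row, with the two QAT formats sharing the top score.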
Winner
The result is a tie by design. Q4_0 QAT and mobile QAT both score 21, but they target different hardware. For phones, the mobile format leads: it brings E2B to approximately 1 GB and targets mobile accelerators directly. For laptops and consumer GPUs, Q4_0 QAT is the practical default. BF16 remains a quality reference, not a local deployment option.
Methodology and limitations
Memory figures come from Google’s Gemma 4 documentation. The ~1GB E2B figure comes from the QAT announcement. Quality is Google’s stated claim. No independent Gemma 4 QAT quality numbers were published at the time of release. We did not run the models locally for this comparison. Developers should test their own quantization and workloads before building.
Key takeaways
- Q4_0 QAT reduces Gemma 4 E2B to 3.2 GB and E4B to 5 GB, down from 9.6 GB and 15 GB in BF16.
- A new mobile QAT scheme brings E2B to about 1 GB; text-only without PLE drops below 1 GB.
- QAT changes the quality at a given size, not the size itself; the additional memory reduction comes from the mobile format.
- Google claims higher quality than PTQ but did not publish any Gemma 4 QAT benchmark numbers at the time of release.
- The checkpoints shipped today on Hugging Face with llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM support.
Check out the model weights (Q4_0 QAT collection, Mobile QAT collection) and the Google blog post (QAT release).