Google DeepMind releases Gemma 4 QAT checkpoints: Q4_0 and a new mobile format cut on-device memory

by ai-intensify

Google DeepMind releases Quantization-Aware Training (QAT) checkpoints for the Gemma 4 family. The release is aimed at local deployment on edge devices and consumer GPUs. It follows the Gemma 4 launch in April and the 12B model two days earlier.

We compared the available Gemma 4 edge-model formats using only published numbers. The goal was simple: show what each precision level costs in memory, then show what QAT actually changes.

What does QAT actually do?

Quantization shrinks a model by reducing the numerical precision of its weights. Standard post-training quantization (PTQ) compresses a model after training is finished, which often degrades quality. QAT instead simulates quantization during training, so the model learns to compensate for the precision loss.
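To make "simulating quantization" concrete, here is a minimal sketch of the fake-quantization step that QAT recipes commonly place in the forward pass. It is an illustration under assumptions, not Google's training code: the function names, the symmetric 4-bit grid and the toy weights are all hypothetical. In a real QAT setup the rounding is usually paired with a straight-through gradient estimator, which this pure-Python example does not model.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric integer grid, then map them back to floats.

    Assumption: one scale for the whole list (per-tensor), chosen from the
    largest absolute weight. Real recipes may use finer-grained scales.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [c * scale for c in codes]               # dequantized values


def mean_abs_error(original, restored):
    """Average absolute rounding error introduced by quantization."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical toy weights, not taken from any Gemma checkpoint.
weights = [0.82, -0.41, 0.07, -0.95, 0.33, 0.58, -0.12, 0.004]
simulated = fake_quantize(weights, bits=4)

print(simulated)
print(mean_abs_error(weights, simulated))
```

During QAT the loss is computed on values like `simulated` rather than on the original weights, so training can steer the weights toward points that survive rounding. PTQ applies the same rounding only once, after training has ended.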

Google’s AI team says its QAT results deliver higher overall quality than a standard PTQ baseline. Google did not publish Gemma 4 QAT benchmark scores in the announcement. For reference, Gemma 3 QAT reduced the perplexity drop at Q4_0 by 54% under llama.cpp's perplexity evaluation. We cite this only as an example from the earlier generation.

Comparison task

Compare Gemma 4 E2B and E4B in three formats: BF16, Q4_0 QAT and the new mobile QAT scheme. Rank them on memory footprint, quality preservation and on-device accessibility. Use only published data.

Memory results

| Format | E2B | E4B | Source |
| --- | --- | --- | --- |
| BF16 (16-bit) | 9.6 GB | 15 GB | Official Gemma 4 documentation |
| Q4_0 (4-bit, QAT) | 3.2 GB | 5 GB | Official Gemma 4 documentation |
| Mobile (QAT, E2B) | ~1 GB | n/a | QAT announcement |

The figures for Q4_0 match the footprint of PTQ Q4_0. QAT does not change size in a given format. It improves the quality at that size. The new mobile scheme provides additional reduction.
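The savings implied by the published figures can be worked out directly. The short sketch below only does arithmetic on the numbers in the table above. The dictionary layout and the `reduction` helper are illustrative names, and the mobile entry uses 1.0 GB as a stand-in for the approximate "~1 GB" figure.

```python
# Published footprints in GB, copied from the memory table.
footprint_gb = {
    "E2B": {"BF16": 9.6, "Q4_0 QAT": 3.2, "Mobile QAT": 1.0},  # mobile figure is approximate
    "E4B": {"BF16": 15.0, "Q4_0 QAT": 5.0},                    # no mobile figure published
}


def reduction(model, fmt):
    """Return (shrink factor, percent saved) relative to the BF16 footprint."""
    base = footprint_gb[model]["BF16"]
    size = footprint_gb[model][fmt]
    return base / size, 100.0 * (1.0 - size / base)


for model, formats in footprint_gb.items():
    for fmt in formats:
        if fmt == "BF16":
            continue
        factor, saved = reduction(model, fmt)
        print(f"{model} {fmt}: {factor:.1f}x smaller, {saved:.0f}% saved")
```

Both Q4_0 QAT rows work out to 3x smaller than BF16, and the approximate 1 GB mobile figure is about 9.6x smaller than the E2B BF16 baseline.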

Using that mobile scheme, Google reduced Gemma 4 E2B to about 1 GB. Developers can go lower still: a text-only model without per-layer embeddings (PLE), with the audio and vision encoders excluded, needs less than 1 GB.

Format-by-format breakdown

BF16 is the quality baseline. E2B requires 9.6 GB and E4B requires 15 GB. This is a reference point, not a phone deployment target.

Q4_0 QAT is the general-purpose local format. E2B shrinks to 3.2 GB and E4B to 5 GB. QAT here retains more quality than PTQ at the same size. This format suits consumer GPUs. Earlier testing also ran E2B on a Raspberry Pi 5 at INT4.
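For readers wondering what "4-bit" costs in practice, the sketch below does the storage arithmetic for a Q4_0-style block as used by llama.cpp: 32 weights share one 16-bit scale, and each weight is stored as a 4-bit code. The 1-billion-parameter count is a hypothetical round number chosen for illustration, not a Gemma 4 figure, and the helper names are our own.

```python
BLOCK_SIZE = 32     # weights per Q4_0 block
SCALE_BITS = 16     # one fp16 scale stored per block
WEIGHT_BITS = 4     # one 4-bit code per weight


def q4_0_bits_per_weight():
    """Effective bits per weight once the per-block scale is included."""
    return (SCALE_BITS + BLOCK_SIZE * WEIGHT_BITS) / BLOCK_SIZE


def size_gb(num_params, bits_per_weight):
    """Storage in GB (10^9 bytes) for a given parameter count and precision."""
    return num_params * bits_per_weight / 8 / 1e9


bpw = q4_0_bits_per_weight()
hypothetical_params = 1_000_000_000   # assumption: illustrative only

print(bpw)
print(size_gb(hypothetical_params, bpw))   # Q4_0-style storage
print(size_gb(hypothetical_params, 16))    # BF16 storage for comparison
```

The per-block scale is why a "4-bit" format costs slightly more than 4 bits per weight. Real checkpoints can also keep some tensors at higher precision, so file sizes will not follow this arithmetic exactly.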

The mobile format is an edge-specialized scheme. It brings E2B to about 1 GB. It uses static activations, channel-wise quantization and targeted 2-bit compression.

How does the mobile scheme work?

The Google AI team developed four techniques for mobile hardware. Static activations precompute scaling factors during training, reducing on-device work. Channel-wise quantization matches the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV cache optimizations shrink the active memory footprint.
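The benefit of channel-wise scales is easiest to see on a toy example. The sketch below compares one shared scale for a whole weight matrix against one scale per output channel. It is a generic illustration of the technique, not Google's implementation: the matrix values, the 4-bit symmetric grid and all function names are assumptions.

```python
def scale_for(values, bits=4):
    """Symmetric scale that maps the largest absolute value to the top code."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_dequantize(values, scale, bits=4):
    """Round to the integer grid defined by `scale`, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def total_abs_error(matrix, restored):
    """Sum of absolute differences between two equally shaped matrices."""
    return sum(abs(a - b)
               for row, rrow in zip(matrix, restored)
               for a, b in zip(row, rrow))


# Hypothetical weights: each row is one output channel with a different range.
weights = [
    [4.0, -3.1, 2.2, -0.9],
    [0.04, -0.03, 0.02, -0.01],
]

# Per-tensor: a single scale shared by every channel.
shared_scale = scale_for([v for row in weights for v in row])
per_tensor = [quantize_dequantize(row, shared_scale) for row in weights]

# Channel-wise: each row gets its own scale.
per_channel = [quantize_dequantize(row, scale_for(row)) for row in weights]

print(total_abs_error(weights, per_tensor))
print(total_abs_error(weights, per_channel))
```

With the shared scale, the small-range channel collapses to zeros because its values fall below half a quantization step. Giving each channel its own scale keeps those values distinguishable at the cost of storing one extra scale per channel.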

The core logic layers stay at higher precision. This protects model capability while cutting storage. Developers can also deploy text-only builds and remove the audio and vision encoders. This trims memory further for use cases that do not need multimodality.
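The static activations mentioned above can be contrasted with dynamic scaling in a few lines. This is a sketch under stated assumptions: the 8-bit activation grid, the calibration batches and every name here are hypothetical, and the calibration loop is a stand-in for whatever statistics Google's training process actually records.

```python
def absmax_scale(values, bits=8):
    """Symmetric scale derived from the largest absolute value."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, bits=8):
    """Map floats to clamped integer codes using the given scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Hypothetical activation samples observed before deployment.
calibration_batches = [
    [0.5, -1.2, 2.0, 0.3],
    [1.8, -0.7, 0.9, -2.4],
]

# Static: the scale is fixed ahead of time and shipped with the model.
STATIC_SCALE = max(absmax_scale(batch) for batch in calibration_batches)


def run_static(activations):
    """No range scan at inference time: reuse the stored scale."""
    return quantize(activations, STATIC_SCALE)


def run_dynamic(activations):
    """Scan every input for its range before quantizing it."""
    return quantize(activations, absmax_scale(activations))


runtime_input = [1.0, -0.6, 2.2, 0.1]
print(run_static(runtime_input))
print(run_dynamic(runtime_input))
```

The static path skips the per-input range scan, which is the on-device work being saved. The trade-off is that inputs outside the precomputed range get clamped, so the stored scale has to be representative of real traffic.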

Dimension-by-dimension breakdown

The scores are a qualitative ranking of the formats for on-device use. Memory is the only axis backed by hard measurements. Quality reflects Google’s stated design, not measured Gemma 4 numbers. Each score has a one-line basis.

| Dimension | BF16 | Q4_0 QAT | Mobile QAT |
| --- | --- | --- | --- |
| Memory footprint | 1 (heaviest, 9.6 GB E2B) | 4 (3.2 GB E2B) | 5 (~1 GB E2B text-only) |
| Quality protection | 5 (full-precision baseline) | 4 (QAT-preserved, near baseline) | 3 (2-bit token layers, core kept at higher precision) |
| Decode speed | 2 (no quantization speedup) | 4 (4-bit weights speed up decode) | 5 (mobile-optimized static activations) |
| Deployment breadth | 4 (loadable but heavy) | 5 (llama.cpp, Ollama, LM Studio, vLLM, MLX) | 3 (LiteRT-LM, Transformers.js, edge-focused) |
| On-device access | 1 (requires a larger GPU) | 4 (consumer GPU, Raspberry Pi 5) | 5 (runs on a phone) |
| Total (/25) | 13 | 21 | 21 |
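The totals in the table can be re-derived from the per-dimension scores. The snippet below copies those scores and sums them. The shortened dimension keys and variable names are our own shorthand for the five rows.

```python
# Scores copied from the table: memory, quality, speed, deployment, access.
scores = {
    "BF16":       {"memory": 1, "quality": 5, "speed": 2, "deployment": 4, "access": 1},
    "Q4_0 QAT":   {"memory": 4, "quality": 4, "speed": 4, "deployment": 5, "access": 4},
    "Mobile QAT": {"memory": 5, "quality": 3, "speed": 5, "deployment": 3, "access": 5},
}

# Sum each format's five scores (maximum possible is 25).
totals = {fmt: sum(dims.values()) for fmt, dims in scores.items()}

# Collect every format that shares the top total, so ties are visible.
best = max(totals.values())
leaders = sorted(fmt for fmt, total in totals.items() if total == best)

print(totals)
print(leaders)
```

The sums reproduce the 13, 21 and 21 shown in the table, with Q4_0 QAT and Mobile QAT sharing the top total.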

Winner

The result is a tie by design. Q4_0 QAT and mobile QAT both score 21, but for different hardware. For phones, the mobile format leads: it reaches about 1 GB on E2B and targets mobile accelerators directly. For laptops and consumer GPUs, Q4_0 QAT is the practical default. BF16 remains a quality reference, not a local deployment option.

Methodology and limitations

Memory figures come from Google’s Gemma 4 documentation. The ~1 GB E2B figure comes from the QAT announcement. Quality is Google’s stated claim. No independent Gemma 4 QAT quality numbers were published at the time of release. We did not run the models locally for this comparison. Developers should test the quantized models on their own workloads before building on them.

Key takeaways

  • Q4_0 QAT reduces Gemma 4 E2B to 3.2 GB and E4B to 5 GB, down from 9.6 GB and 15 GB at BF16.
  • A new mobile QAT scheme brings E2B to about 1 GB; without per-layer embeddings (PLE), a text-only build drops below 1 GB.
  • QAT changes quality at a given size, not the size itself; the mobile format adds further memory reduction.
  • Google claims higher quality than PTQ but did not publish any Gemma 4 QAT benchmark numbers at the time of release.
  • Weights shipped today on Hugging Face with llama.cpp, Ollama, LM Studio, vLLM, MLX and LiteRT-LM support.

MarketTechPost’s visual explainer

MarketTechPost Benchmark

Gemma 4 QAT: comparison of Q4_0 and the new mobile format

Google DeepMind releases quantization-aware training checkpoints for Gemma 4. We compared three edge-model formats on published numbers.

Formats compared

BF16 (16-bit) · Q4_0 QAT (4-bit) · Mobile QAT

5 June 2026

Comparison task

What we ranked

$ compare gemma-4 --models E2B,E4B \
    --formats BF16,Q4_0-QAT,MOBILE-QAT \
    --rank memory,quality,accessibility \
    --source published-only --no-self-run

Memory figures come from the official Gemma 4 docs. Quality follows Google’s stated claim. Neither model was run locally.

Format 1 of 3 · Reference

BF16 (16-bit)

13 / 25

Full-precision quality baseline. E2B requires 9.6 GB and E4B requires 15 GB.

Key point: A reference point, not a phone or laptop deployment target.

Format 2 of 3 · Laptop/GPU

Q4_0 QAT (4-bit)

21 / 25

General-purpose local format. E2B shrinks to 3.2 GB and E4B to 5 GB.

Key point: QAT retains more quality than PTQ at the same 4-bit size.

Format 3 of 3 · Mobile

Mobile QAT

21 / 25

Edge-specific scheme. Brings E2B down to about 1 GB.

Key point: 2-bit quantization on the token layers, with the core logic layers kept at higher precision.

Leaderboard

Full ranking

| Dimension | BF16 | Q4_0 QAT | Mobile QAT |
| --- | --- | --- | --- |
| Memory footprint | 1 | 4 | 5 |
| Quality protection | 5 | 4 | 3 |
| Decode speed | 2 | 4 | 5 |
| Deployment breadth | 4 | 5 | 3 |
| On-device access | 1 | 4 | 5 |
| Total | 13 | 21 | 21 |

Tie by design: Q4_0 QAT wins on laptops and GPUs; the mobile format wins on phones.

Key takeaways

What developers need to know

  • Q4_0 QAT reduces E2B to 3.2 GB and E4B to 5 GB, down from 9.6 GB and 15 GB at BF16.
  • A new mobile QAT scheme brings E2B to about 1 GB; without per-layer embeddings (PLE), a text-only build drops below 1 GB.
  • QAT changes quality at a given size, not the size itself; the mobile format adds further memory reduction.
  • Google claims better quality than PTQ but has not published any Gemma 4 QAT numbers.
  • Weights shipped today on Hugging Face with llama.cpp, Ollama, vLLM and MLX support.


Check out the model weights (Q4_0 QAT collection, Mobile QAT collection) and the Google blog post (QAT release).