The generative AI race has long been a game of ‘bigger is better’. But as the industry runs up against power-consumption and memory constraints, the conversation is shifting from raw parameter counts to architectural efficiency. The Liquid AI team is leading the charge with the release of LFM2-24B-A2B, a 24-billion-parameter model that redefines what we should expect from edge-ready AI.

The ‘A2B’ Architecture: A 1:3 Ratio for Efficiency
The ‘A2B’ in the model name refers to attention-to-base. In a traditional Transformer, every layer uses softmax attention, which scales quadratically (O(N²)) with sequence length. This creates massive KV (key-value) caches that eat up VRAM.
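To make that scaling concrete, here is a back-of-the-envelope sketch. The head dimensions, head counts, and 2-byte precision below are illustrative assumptions, not LFM2's published configuration:

```python
# Rough scaling of softmax attention with sequence length.
# All dimensions here are illustrative assumptions, not LFM2's actual config.

def attention_score_flops(seq_len: int, d_head: int = 128, n_heads: int = 8) -> int:
    """The QK^T score matrix alone costs ~2 * N^2 * d multiply-adds per head: O(N^2)."""
    return 2 * seq_len * seq_len * d_head * n_heads

def kv_cache_bytes(seq_len: int, d_head: int = 128, n_kv_heads: int = 8,
                   n_layers: int = 40, bytes_per_elem: int = 2) -> int:
    """The KV cache grows linearly with context: 2 tensors (K and V) per layer."""
    return 2 * seq_len * d_head * n_kv_heads * n_layers * bytes_per_elem

for n in (1_024, 8_192, 32_768):
    print(f"N={n}: {attention_score_flops(n):.3e} FLOPs, "
          f"{kv_cache_bytes(n) / 2**20:.0f} MiB KV cache")
```

Doubling the context quadruples the score-matrix compute, and if every one of the 40 layers kept a full KV cache, a 32k context would already cost gigabytes of VRAM. GQA reduces this by shrinking `n_kv_heads` relative to the query heads.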
The Liquid AI team sidesteps this with a hybrid structure: the ‘base’ layers are efficient gated short-convolution blocks, while the ‘attention’ layers use Grouped Query Attention (GQA).
In the LFM2-24B-A2B configuration, the model uses a 1:3 ratio:
- Total layers: 40
- Convolution blocks: 30
- Attention blocks: 10
By combining a small number of GQA blocks with mostly gated convolution layers, the model preserves the high-resolution retrieval and logic of a transformer while maintaining the fast prefill and low memory footprint of a linear-complexity model.
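The 1:3 schedule above can be sketched in a few lines. The exact placement of the attention layers is an assumption here; only the 10-attention / 30-convolution split comes from the article:

```python
# Sketch of a 1:3 attention-to-convolution layer schedule (40 layers total).
# Interleaving every 4th layer as attention is our assumption; the article
# only specifies the 10/30 split.

def build_schedule(n_layers: int = 40, ratio: int = 3) -> list:
    """Every (ratio+1)-th layer is attention; the rest are gated convolutions."""
    return ["attn" if i % (ratio + 1) == ratio else "conv" for i in range(n_layers)]

schedule = build_schedule()
print(schedule.count("conv"), "conv /", schedule.count("attn"), "attn")
```

Only the 10 `attn` layers accumulate a KV cache, which is why the hybrid's memory footprint stays close to that of a pure linear-complexity model.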
Sparse MoE: 24B Intelligence on a 2B Budget
The defining feature of LFM2-24B-A2B is its Mixture of Experts (MoE) design. While the model holds 24 billion parameters in total, it activates only 2.3 billion parameters per token.
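A minimal sketch of how sparse routing achieves this: a small router scores the experts and only the top few are actually computed for each token. The expert count and top-k below are illustrative assumptions; the article only states the 2.3B-of-24B activation figure.

```python
import math

# Minimal sparse-MoE routing sketch. Expert count and top_k are illustrative
# assumptions, not LFM2's published configuration.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, top_k=2):
    """Select the top_k experts per token; only their weights are ever touched."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]  # renormalized gate weights

# One token's router scores over 4 hypothetical experts: experts 1 and 3 win.
print(route([0.1, 2.0, -1.0, 1.5], top_k=2))
```

Because the unchosen experts are skipped entirely, per-token compute tracks the active parameter count, not the total.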
This is a game-changer for deployment. Because the active parameter path is so lean, the model can fit into 32GB RAM. This means it can run natively on high-end consumer laptops, desktops with integrated GPUs (iGPUs), and dedicated NPUs without the need for a data-center-grade A100. It effectively provides the knowledge density of a 24B model with the inference speed and energy efficiency of a 2B model.
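As a rough sanity check on those deployment claims (the 8-bit quantization choice is our assumption; the article only states the model fits in 32GB of RAM):

```python
# Back-of-the-envelope deployment math for a 24B-total / 2.3B-active MoE.
# The 1-byte-per-parameter (8-bit) quantization assumption is ours.

TOTAL_PARAMS = 24e9
ACTIVE_PARAMS = 2.3e9

weights_gb_8bit = TOTAL_PARAMS * 1 / 1e9   # ~24 GB of weights at 8-bit
flops_per_token = 2 * ACTIVE_PARAMS        # ~2 FLOPs per active param per token

print(f"8-bit weights: ~{weights_gb_8bit:.0f} GB (fits a 32 GB machine)")
print(f"per-token compute: ~{flops_per_token / 1e9:.1f} GFLOPs, vs "
      f"~{2 * TOTAL_PARAMS / 1e9:.0f} GFLOPs for a dense 24B model")
```

The asymmetry is the whole point: RAM must hold all 24B parameters, but each token only pays the compute bill for 2.3B of them.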


Benchmarks: Punching Up
The Liquid AI team reports that the LFM2 family follows a predictable, log-linear scaling behavior. Despite its smaller active parameter count, the 24B-A2B model consistently outperforms larger competitors.
- Logic and Reasoning: On tests like GSM8K and MATH-500, it rivals dense models twice its size.
- Throughput: When benchmarked with vLLM on a single NVIDIA H100, it reached a total of 26.8K tokens per second at 1,024 concurrent requests, far surpassing GPT-OSS-20B and Qwen3-30B-A3B.
- Long context: The model ships with a 32k-token context window, optimized for privacy-sensitive RAG (retrieval-augmented generation) pipelines and local document analysis.
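It is worth noting that the 26.8K figure is an aggregate across the whole batch; simple arithmetic on the article's own numbers gives the per-stream rate:

```python
# The article's aggregate throughput, divided across concurrent requests.
aggregate_tok_s = 26_800
concurrency = 1_024

per_request = aggregate_tok_s / concurrency
print(f"~{per_request:.1f} tokens/sec per request at {concurrency}-way concurrency")
```

That is still comfortably above human reading speed per stream, which is what makes the aggregate number meaningful for serving rather than just a batch-processing statistic.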
Technical Cheat Sheet
| Property | Specification |
|---|---|
| Total parameters | 24 billion |
| Active parameters | 2.3 billion |
| Architecture | Hybrid (gated conv + GQA) |
| Layers | 40 (30 conv / 10 attention) |
| Context length | 32,768 tokens |
| Training data | 17 trillion tokens |
| License | LFM Open License v1.0 |
| Inference support | llama.cpp, vLLM, SGLang, MLX |
Key Takeaways
- Hybrid ‘A2B’ Architecture: The model uses a 1:3 ratio of Grouped Query Attention (GQA) to gated short-convolution layers. By using linear-complexity ‘base’ layers for 30 of its 40 layers, the model achieves very fast prefill and decode speeds with a significantly smaller memory footprint than traditional all-attention Transformers.
- Sparse MoE Efficiency: Despite its 24 billion total parameters, the model activates only 2.3 billion parameters per token. This sparse Mixture of Experts design lets it deliver the reasoning depth of a larger model while maintaining the inference latency and energy efficiency of a 2B-parameter model.
- True Edge Capability: The architecture was optimized through hardware-in-the-loop search and is designed to fit in 32GB of RAM. This makes it fully deployable on consumer-grade hardware, including laptops with integrated GPUs and NPUs, without expensive data-center infrastructure.
- State-of-the-Art Performance: LFM2-24B-A2B outperforms larger competitors such as Qwen3-30B-A3B and GPT-OSS-20B in throughput. Benchmarks show it hitting roughly 26.8K tokens per second on a single H100, demonstrating the efficiency of its near-linear scaling and its long-context abilities across the 32k-token window.

