Liquid AI releases LFM2.5-1.2B-Thinking: a 1.2B-parameter reasoning model that fits under 1 GB on-device


Liquid AI has released LFM2.5-1.2B-Thinking, a 1.2 billion parameter reasoning model that runs entirely on device and fits in about 900 MB on a modern phone. Workloads that required a data center two years ago can now run offline on consumer hardware. The model focuses on structured reasoning, tool use, and mathematics rather than general chat.

Position in the LFM2.5 family and key specifications

LFM2.5-1.2B-Thinking is part of the LFM2.5 family of Liquid Foundation Models, which extends the earlier LFM2 architecture with enhanced pre-training and multi-stage reinforcement learning for edge deployments.

The model is text only and general purpose with the following configuration:

  • 1.17B parameters, positioned as a 1.2B-class model
  • 16 layers, with 10 double-gated LIV convolution blocks and 6 GQA blocks
  • Training budget of 28T tokens
  • Context length of 32,768 tokens
  • Vocabulary size of 65,536
  • 8 languages: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish

Reasoning-first behavior and thinking traces

The ‘Thinking’ variant is specifically trained for reasoning. At inference time, it generates internal thinking traces before the final answer. These traces are sequences of intermediate steps that the model uses to plan tool calls, verify partial results, and work through multi-step instructions.
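A host application typically needs to separate the trace from the answer before showing output to a user. Assuming the model emits its reasoning between delimiters such as `<think>...</think>` (the exact tag format is an assumption here, not confirmed by the model card), the split can be sketched as:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate an assumed <think>...</think> trace from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()  # no trace found: treat everything as the answer
    trace = match.group(1).strip()
    answer = text[match.end():].strip()
    return trace, answer

raw = "<think>2 + 2 is 4, then doubled is 8.</think>The result is 8."
trace, answer = split_thinking(raw)
print(answer)  # -> The result is 8.
```

Keeping the trace around (rather than discarding it) is what makes the intermediate steps auditable in agent logs.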

The Liquid AI team recommends this model for agentic tasks, data extraction pipelines, and retrieval-augmented generation flows where you want clear reasoning and verifiable intermediate steps. A practical way to think about it: use LFM2.5-1.2B-Thinking as the planning brain inside agents and on-device apps, and reach for other models when you need broader world knowledge or code-heavy workflows.

Benchmarks vs other 1B-class models

The Liquid AI team evaluates LFM2.5-1.2B-Thinking against models of around 1B parameters on a set of reasoning and instruction-following benchmarks.

Source: https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking

Compared to LFM2.5-1.2B-Instruct, three metrics improve sharply: math reasoning on MATH500 rises from approximately 63 to 88, instruction following on Multi-IF from approximately 61 to 69, and tool use on BFCL V3 from approximately 49 to 57.

LFM2.5-1.2B-Thinking competes with Qwen3-1.7B in thinking mode on most reasoning benchmarks while using about 40 percent fewer parameters and fewer output tokens on average. It also outperforms other 1B-class baselines such as Granite-4.0-H-1B, Granite-4.0-1B, Gemma-3-1B-IT, and Llama-3.2-1B-Instruct on many of these tasks.

Training recipe and doom loop mitigation

Reasoning models often suffer from doom looping, where the model repeats pieces of its thought chain instead of finishing the answer. LFM2.5-1.2B-Thinking uses a multi-stage training pipeline to mitigate this.

The process starts with mid-training on reasoning traces so the model learns the ‘reason first, then answer’ pattern. Supervised fine-tuning on synthetic chains then improves chain-of-thought generation. After that, preference alignment and RLVR are applied. In preference alignment, the research team generates 5 temperature-sampled candidates and 1 greedy candidate per prompt and uses an LLM judge to choose preferred and rejected outputs, explicitly labeling looping outputs as rejected. During RLVR, an n-gram repetition penalty is added at the beginning of training. This reduces the doom loop rate from 15.74 percent after mid-training to 0.36 percent after RLVR on a representative set of prompts.
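An n-gram repetition penalty of the kind described above can be sketched as a simple reward-shaping term. The window size and penalty weight below are illustrative assumptions, not Liquid AI's published hyperparameters:

```python
def ngram_repetition_penalty(token_ids, n=4, weight=1.0):
    """Penalty proportional to the fraction of repeated n-grams in a sequence.

    Returns 0.0 for a repetition-free sequence and approaches `weight`
    as every n-gram becomes a duplicate. Window size `n` and `weight`
    are illustrative choices, not published hyperparameters.
    """
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    repeated = len(ngrams) - len(set(ngrams))
    return weight * repeated / len(ngrams)

# A looping sequence is penalized; a non-repeating one is not.
print(ngram_repetition_penalty([1, 2, 3, 4] * 5))   # high penalty
print(ngram_repetition_penalty(list(range(20))))    # -> 0.0
```

In an RLVR loop, a term like this would be subtracted from the verifiable task reward so that looping completions score lower than clean ones.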

The result is a small reasoning model that can generate thinking traces without getting stuck in long repetitive outputs, which matters for interactive agents and on-device UX.

Inference performance and hardware footprint

A key design goal is fast inference with a small memory footprint on CPUs and NPUs. LFM2.5-1.2B-Thinking can decode at about 239 tokens per second on an AMD CPU and about 82 tokens per second on a mobile NPU, while running under 1 GB of memory, with broad day-one support for llama.cpp, MLX, and vLLM.
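At the reported decode rates, the wall-clock time for a fixed decode budget follows directly; the figures below come from the paragraph above, and the 100-token budget matches the hardware table's decode setting:

```python
decode_rates = {"AMD CPU": 239.0, "mobile NPU": 82.0}  # tokens per second, as reported
decode_tokens = 100

for device, rate in decode_rates.items():
    seconds = decode_tokens / rate
    print(f"{device}: {seconds:.2f} s for {decode_tokens} decode tokens")
# -> AMD CPU: 0.42 s, mobile NPU: 1.22 s
```

Sub-second to low-second responses for short completions are what make the model usable in interactive on-device agents.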

The detailed hardware table on the model card (https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking) uses 1K prefill and 100 decode tokens for LFM2.5-1.2B-Thinking.

These figures show that the model fits comfortably under 1 GB on phones and embedded devices while maintaining useful throughput even in long contexts.

Key takeaways

  1. LFM2.5-1.2B-Thinking is a 1.17B-parameter reasoning model with a 32,768-token context length that runs under 1 GB on phones and laptops.
  2. The model is optimized for explicit thinking traces, agentic workflows, data extraction, and RAG.
  3. It reaches strong scores for the 1B class, for example 87.96 on MATH500 and 85.60 on GSM8K, and is competitive with Qwen3-1.7B in thinking mode while using fewer parameters.
  4. The training pipeline uses mid-training on reasoning traces, supervised fine-tuning, preference alignment with 5 sampled candidates plus 1 greedy candidate, and RLVR with an n-gram repetition penalty, reducing the doom loop rate from 15.74 percent to 0.36 percent.
  5. The model runs efficiently on AMD CPUs and Qualcomm NPUs with runtimes such as llama.cpp, FastFlowLM, and NexaML, is available in GGUF, ONNX, and MLX formats, and can be loaded from Hugging Face for on-device deployment.
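The sub-1 GB claim is consistent with back-of-envelope quantization arithmetic. The bits-per-weight values below are common quantization levels, not the exact format Liquid AI ships:

```python
params = 1.17e9  # reported parameter count

# Common quantization levels (illustrative, not the shipped format).
for bits in (16, 8, 4):
    gib = params * bits / 8 / (1024 ** 3)
    print(f"{bits}-bit weights: ~{gib:.2f} GiB")
# -> 16-bit: ~2.18 GiB, 8-bit: ~1.09 GiB, 4-bit: ~0.54 GiB
```

At roughly 4 to 5 bits per weight, the weights plus a modest KV cache land near the ~900 MB on-phone figure quoted earlier.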

Hosting providers and deployment

You can access or host the models through the following providers and platforms:

Cloud and API Providers

Model Repository (self-hosting)

If you want to run the model locally or on your own infrastructure, the weights are available in different formats:

