7 tiny AI models for Raspberry Pi

Introduction

We often talk about small AI models. But what about smaller models that can actually run on a Raspberry Pi with limited CPU power and very little RAM?

Thanks to modern architectures and aggressive quantization, models with 1 to 2 billion parameters can now run on extremely small devices. Once quantized, these models can run almost anywhere, even on your smart fridge. All you need to get started is llama.cpp, a quantized model from the Hugging Face Hub, and a single command.
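A rough rule of thumb makes the RAM math concrete: a quantized model file weighs in at about parameter count × bits-per-weight / 8. The sketch below uses approximate average bits-per-weight for common llama.cpp quantization types (real GGUF files vary slightly, so treat these as estimates):

```python
# Rough GGUF file-size estimate: params * bits_per_weight / 8.
# Bits-per-weight values are approximate averages for common
# llama.cpp quantization types, not exact figures.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,  # popular quality/size trade-off
    "Q2_K": 2.6,
}

def gguf_size_gb(n_params: float, quant: str) -> float:
    """Estimated model file size in GB for a given quantization type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"4B model at {quant}: ~{gguf_size_gb(4e9, quant):.1f} GB")
```

At Q4_K_M, a 4B-parameter model lands around 2.4 GB on disk, which is why these models fit comfortably on an 8 GB (and often even a 4 GB) Raspberry Pi.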

What makes these small models exciting is that they are not watered-down or outdated. Many of them outperform much older large models at real-world text generation. Some also support tool calling, vision understanding, and structured output. These are not small, dumb models; they are small, fast, and surprisingly capable, able to run on devices that fit in the palm of your hand.

In this article, we will explore 7 small AI models that run well on Raspberry Pi and other low-powered machines using llama.cpp. If you want to experiment with local AI without GPUs, cloud costs, or heavy infrastructure, this list is a great place to start.
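To make the workflow concrete, here is roughly what getting started looks like on a Pi. The repository name below is illustrative (check each model's Hugging Face page for its actual GGUF uploads), and the `-hf` flag, which downloads a model directly from the Hub, requires a reasonably recent llama.cpp build:

```shell
# Build llama.cpp from source (CPU-only is fine on a Raspberry Pi).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j4

# Start an interactive chat with a quantized model pulled straight
# from the Hugging Face Hub. The repo name is illustrative -- see the
# model card for the real GGUF repository.
./build/bin/llama-cli -hf Qwen/Qwen3-4B-Instruct-2507-GGUF --ctx-size 4096
```

On a Pi, keeping `--ctx-size` modest matters: the KV cache for a long context can easily dwarf the model file itself in RAM.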

1. Qwen3 4B 2507

Qwen3-4B-Instruct-2507 is a compact but highly capable non-thinking language model that makes a huge leap in performance for its size. With only 4 billion parameters, it shows strong gains in instruction following, logical reasoning, mathematics, science, coding, and tool use, while also expanding long-tail knowledge coverage across multiple languages.


The model demonstrates significantly better alignment with user preferences in subjective and open-ended tasks, resulting in clearer, more useful, and higher-quality text generation. Its impressive native 256K context length allows it to efficiently handle extremely long documents and conversations, making it a practical choice for real-world applications that demand both depth and speed without the overhead of larger models.

2. Qwen3 VL 4B

Qwen3-VL-4B-Instruct is the most advanced vision-language model in the Qwen family yet, packing cutting-edge multimodal intelligence into a highly efficient 4B-parameter form factor. It couples deep visual perception, reasoning, and spatial awareness with superior text understanding and generation, enabling strong performance across images, videos, and long documents.


The model supports a native 256K context window (expandable to 1M), allowing it to process entire books or hours-long videos with accurate recall and granular temporal indexing. Architectural upgrades such as Interleaved-MRoPE, DeepStack visual fusion, and precise text-timestamp alignment significantly improve long-horizon video reasoning, fine-detail detection, and image-text grounding.

Beyond perception, Qwen3‑VL‑4B‑Instruct acts as a visual agent, capable of operating PC and mobile GUIs, invoking tools, generating visual code (HTML/CSS/JS, Draw.io), and handling complex multimodal workflows with logic based on both text and vision.

3. EXAONE 4.0 1.2B

EXAONE 4.0 1.2B is a compact, on-device-friendly language model designed to bring agentic AI and hybrid reasoning to extremely resource-constrained deployments. It integrates both a non-reasoning mode for fast, practical responses and a reasoning mode for complex problem solving, allowing developers to dynamically trade off speed and depth within the same model.


Despite its smaller size, the 1.2B variant supports the use of agentic tools, enables function calling and autonomous task execution, and offers multilingual capabilities in English, Korean, and Spanish, extending its usefulness beyond monolingual edge applications.

Architecturally, it inherits EXAONE 4.0’s advancements such as hybrid attention and improved normalization schemes, while supporting context lengths of up to 64K tokens, making it unusually robust at long-context understanding for a model of this scale.

Optimized for efficiency, it is clearly positioned for on-device and low-cost inference scenarios, where memory footprint and latency matter as much as the quality of the model.

4. Ministral 3B

Ministral-3-3B-Instruct-2512, the smallest member of the Ministral 3 family, is a highly efficient small multimodal language model built for edge and low-resource deployments. It is an FP8 instruction-fine-tuned model, specifically optimized for chat and instruction-following workloads, while maintaining strong adherence to system prompts and structured output.

Architecturally, it combines a 3.4B-parameter language model with a 0.4B vision encoder, enabling text reasoning as well as basic image understanding.


Despite its compact size, the model supports a large 256K context window, strong multilingual coverage across dozens of languages, and native agentic capabilities such as function calling and JSON output, making it suitable for real-time, embedded, and distributed AI systems.
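Function calling in these small models generally follows the same pattern regardless of family: the model is shown a tool schema and answers with a JSON object naming the tool and its arguments, which your code parses and dispatches. A minimal, model-free sketch of the dispatch side (the tool, its data, and the model reply below are all hypothetical):

```python
import json

# A tool the model is allowed to call (hypothetical example with
# stand-in data instead of a real weather API).
def get_temperature(city: str) -> str:
    readings = {"Berlin": "4°C", "Lagos": "31°C"}
    return readings.get(city, "unknown")

TOOLS = {"get_temperature": get_temperature}

# What a function-call reply from the model typically looks like:
# a JSON object naming the tool and its arguments.
model_reply = '{"name": "get_temperature", "arguments": {"city": "Berlin"}}'

# Parse the reply and dispatch to the matching Python function.
call = json.loads(model_reply)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # -> 4°C
```

In a real loop, `result` would be fed back to the model as a tool message so it can compose the final answer; JSON-mode or grammar-constrained decoding in llama.cpp is what keeps the reply reliably parseable.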

Designed to fit within 8GB of VRAM in FP8 (and even less when further quantized), Ministral 3 3B Instruct delivers strong performance per watt and per dollar for production use cases that demand efficiency without sacrificing capability.

5. Jamba Reasoning 3B

Jamba Reasoning 3B is a compact but exceptionally capable 3-billion-parameter reasoning model designed to deliver strong intelligence, long-context processing, and high efficiency in a small footprint.

Its defining innovation is a hybrid Transformer-Mamba architecture, where a small number of attention layers capture complex dependencies while the majority of layers use the Mamba state-space model for highly efficient sequence processing.


This design dramatically reduces memory overhead and improves throughput, enabling the model to run smoothly on laptops, GPUs, and even mobile-class devices without compromising quality.

Despite its size, Jamba Reasoning 3B supports contexts of up to 256K tokens, scaling to very long documents without relying on massive attention caches and making long-context inference practical and cost-effective.
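The "massive attention caches" avoided here are easy to quantify: a standard Transformer stores one key and one value vector per layer per token, so KV-cache memory is 2 × layers × kv_heads × head_dim × context_length × bytes-per-element. A sketch with illustrative dimensions for a 3B-class model (not Jamba's actual configuration):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB for a standard Transformer (FP16 by default).
    Counts one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Illustrative dimensions for a 3B-class model (hypothetical, not
# Jamba's real config): 28 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (8_192, 262_144):  # 8K vs 256K context
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(28, 8, 128, ctx):.1f} GB")
```

At these dimensions an 8K context needs under 1 GB of cache, but a full 256K context needs around 30 GB, which is exactly the cost Mamba's state-space layers sidestep by keeping a fixed-size state instead of a per-token cache.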

On intelligence benchmarks, it outperforms comparable smaller models such as the Gemma 3 4B and Llama 3.2 3B on the combined scores of multiple evaluations, demonstrating unusually strong reasoning abilities for its class.

6. Granite 4.0 Micro

granite-4.0-micro is a 3B-parameter long-context instruction model developed by IBM’s Granite team and designed specifically for enterprise-grade assistants and agentic workflows.

Fine-tuned from Granite-4.0-Micro-Base using a mix of permissively licensed open datasets and high-quality synthetic data, it emphasizes reliable instruction adherence, professional tone, and safe responses, reinforced by default system prompts added in its October 2025 update.


The model supports a very large 128K context window, robust tool-calling and function-execution capabilities, and extensive multilingual support in major European, Middle Eastern, and East Asian languages.

Built on a dense decoder-only Transformer architecture with modern components such as GQA, RoPE, SwiGLU MLPs, and RMSNorm, Granite-4.0-Micro balances robustness and efficiency, making it suitable as a foundation model for commercial applications, RAG pipelines, coding tasks, and LLM agents that must integrate cleanly with external systems under the Apache 2.0 open-source license.

7. Phi-4 Mini

Phi-4-mini-instruct is a lightweight, open 3.8B-parameter language model from Microsoft designed to provide strong reasoning and instruction-following performance under tight memory and compute constraints.

Built on a dense decoder-only Transformer architecture, it is trained primarily on high-quality synthetic “textbook-like” data and carefully filtered public sources, with a deliberate emphasis on logic-dense content over raw factual memorization.


The model supports a 128K token context window, enabling long document comprehension and extended interactions unusual at this scale.

Post-training combines supervised fine-tuning with direct preference optimization (DPO), resulting in accurate instruction following, robust safety behavior, and efficient function calling.

With a large 200K-token vocabulary and broad multilingual coverage, Phi-4-mini-instruct is positioned as a practical building block for research and production systems that must balance latency, cost, and reasoning quality, especially in memory- or compute-constrained environments.

Final Thoughts

Small models have reached the point where size is no longer a hard limit on capability. The Qwen3 series leads this list, delivering performance that rivals much larger language models and even challenges some proprietary systems. If you’re building applications for the Raspberry Pi or other low-power devices, Qwen3 is an excellent starting point and well worth integrating into your setup.

Beyond Qwen, the EXAONE 4.0 1.2B model is particularly strong at reasoning and non-trivial problem solving while being significantly smaller than most alternatives. Ministral 3B is also worth noting as the latest release in its series, offering an updated knowledge cutoff and solid general-purpose performance.

Overall, many of these models are impressive, but if your priorities are speed, accuracy, and tool calling, the Qwen3 LLM and VLM variants are hard to beat. They show clearly how far small, on-device AI has come, and that local inference on small hardware is no longer a compromise.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
