The new LiteRT NeuroPilot Accelerator from Google and MediaTek is a solid step toward running real generative models on phones, laptops, and IoT hardware without sending every request to a data center. It takes the existing LiteRT runtime and wires it directly into MediaTek’s NeuroPilot NPU stack, so developers can deploy LLM and embedding models through a single API surface instead of custom code per chip.
What is the LiteRT NeuroPilot Accelerator?
LiteRT is the successor to TensorFlow Lite. It is a high-performance on-device runtime that runs models in the .tflite FlatBuffer format and can target CPU, GPU, and now NPU backends through an integrated hardware-acceleration layer.
The LiteRT NeuroPilot Accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with a direct integration of the NeuroPilot compiler and runtime. Instead of treating the NPU as a thin pass-through, LiteRT now uses a CompiledModel API that understands both ahead-of-time (AOT) and on-device compilation, and exposes both through the same C++ and Kotlin APIs.
On the hardware side, the integration currently targets the MediaTek Dimensity 7300, 8300, 9000, 9200, 9300 and 9400 SoCs, which together cover a large portion of the Android mid-range and flagship device space.
Why this matters for developers: a unified workflow for fragmented NPUs
Historically, on-device ML stacks were CPU- and GPU-first. NPU SDKs shipped as vendor-specific toolchains that required separate compilation flows per SoC, custom delegates, and manual runtime packaging. The result was a combinatorial explosion of binaries and a lot of device-specific debugging.
The LiteRT NeuroPilot Accelerator replaces this with a three-step workflow that stays the same regardless of which MediaTek NPU is present:
- Convert or load the .tflite model as usual.
- Optionally, use the LiteRT Python tooling to run AOT compilation and generate an AI Pack targeted at one or more SoCs.
- Ship AI Packs via Play for On-device AI (PODAI), then select Accelerator.NPU at runtime. LiteRT handles device targeting and runtime loading, and falls back to the GPU or CPU when the NPU is not available.
For you as an engineer, the main change is that device-targeting logic moves into a structured configuration and Play delivery, while the app code mostly interacts with CompiledModel and Accelerator.NPU.
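If you want that accelerator choice and fallback order to be explicit in C++ rather than relying on automatic targeting, a minimal sketch might look like this. The header paths, the GPU and CPU accelerator constants, and the error handling are assumptions layered on top of the CompiledModel calls shown later in this post:

#include <cstdlib>
#include <utility>

#include "litert/cc/litert_compiled_model.h"  // Header paths are assumed.
#include "litert/cc/litert_environment.h"
#include "litert/cc/litert_model.h"
#include "litert/cc/litert_options.h"

// Sketch: prefer the NPU, then fall back to the GPU and finally the CPU when
// compilation for a backend fails. Error handling is trimmed for clarity.
litert::CompiledModel CompileWithFallback(litert::Environment& env,
                                          litert::Model& model) {
  for (auto hw : {kLiteRtHwAcceleratorNpu, kLiteRtHwAcceleratorGpu,
                  kLiteRtHwAcceleratorCpu}) {
    auto options = litert::Options::Create();
    options->SetHardwareAccelerators(hw);
    if (auto compiled = litert::CompiledModel::Create(env, model, *options)) {
      return std::move(*compiled);  // This backend compiled successfully.
    }
  }
  std::abort();  // No backend available; real code would surface an error.
}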
Both AOT and on device compilation are supported. AOT compiles for a known SoC ahead of time and is recommended for larger models as it removes the cost of compilation on the user device. Compilation on device is better for smaller models and normal models .tflite Delivery at the cost of higher first run latency. The blog shows that for models like the Gemma-3-270M, a pure on-device compilation can take more than 1 minute, making AOT a realistic option for production LLM use.
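To see that cost concretely: on-device compilation happens inside CompiledModel::Create, so wrapping that call in a timer measures exactly the first-run latency an AOT AI Pack removes. A hedged sketch reusing the same calls as the snippet further down (LiteRT includes and the litert namespace assumed as in the other snippets):

#include <chrono>
#include <iostream>

// Sketch: measure how long on-device NPU compilation takes for a model.
// CompiledModel::Create is where just-in-time compilation happens, so its
// wall-clock time approximates the first-run penalty that AOT removes.
void ReportCompileLatency(litert::Environment& env, litert::Model& model) {
  auto options = litert::Options::Create();
  options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

  const auto start = std::chrono::steady_clock::now();
  auto compiled = litert::CompiledModel::Create(env, model, *options);
  const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start);

  // For a model like Gemma-3-270M this can exceed a minute on device.
  std::cout << "On-device NPU compilation took " << ms.count() << " ms\n";
}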
Gemma, Qwen, and embedding models on MediaTek NPUs
The stack is built around open-weight models rather than a single proprietary path. Google and MediaTek list clear, production-oriented support for:
- Qwen3-0.6B, for text generation in markets such as mainland China.
- Gemma-3-270M, a compact base model that is easy to fine-tune for tasks such as sentiment analysis and entity extraction.
- Gemma-3-1B, a multilingual text model for summarization and general reasoning.
- Gemma-3n-E2B, a multimodal model that handles text, audio, and vision for things like real-time translation and visual question answering.
- EmbeddingGemma-300M, a text-embedding model for retrieval-augmented generation, semantic search, and classification.
On the latest Dimensity 9500, which ships in current Vivo flagships, Gemma-3n-E2B reaches over 1,600 tokens per second in prefill and 28 tokens per second in decode at a 4K context, with measured throughput up to 12x CPU and 10x GPU for LLM workloads.
For text-generation use cases, LiteRT-LM sits on top of LiteRT and exposes a stateful engine with a text-in, text-out API. The C++ flow creates ModelAssets, builds an Engine with litert::lm::Backend::NPU, then creates a Session and calls GenerateContent per conversation. For embedding workloads, EmbeddingGemma uses the lower-level LiteRT CompiledModel API in a tensor-in, tensor-out configuration, again with the NPU selected through the hardware accelerator options.
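Based on the class names above, the NPU text-generation flow looks roughly like the following sketch. The factory signatures, the SessionConfig helper, and the InputText wrapper are assumptions rather than verbatim API, so check the LiteRT-LM headers before copying it:

#include <iostream>
#include <string>

// Rough sketch of the LiteRT-LM text-in, text-out flow pinned to the NPU.
// ModelAssets, Engine, Session, Backend::NPU, and GenerateContent come from
// the announcement; the factory helpers and return types are assumptions.
void GenerateOnNpu(const std::string& model_path, const std::string& prompt) {
  // 1. Point the runtime at the packaged model files.
  auto assets = litert::lm::ModelAssets::Create(model_path);

  // 2. Build one Engine per model, selecting the NPU backend.
  auto settings = litert::lm::EngineSettings::CreateDefault(
      *assets, litert::lm::Backend::NPU);
  auto engine = litert::lm::Engine::CreateEngine(*settings);

  // 3. One Session per conversation; it owns the conversation state.
  auto session = (*engine)->CreateSession(
      litert::lm::SessionConfig::CreateDefault());

  // 4. Text in, text out.
  auto responses = (*session)->GenerateContent(
      {litert::lm::InputText(prompt)});
  std::cout << *responses << "\n";  // Assumes the response type is printable.
}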
Developer experience: C++ pipelines and zero-copy buffers
LiteRT introduces a new C++ API that replaces the old C entry points and is explicitly designed around Environment, Model, CompiledModel, and TensorBuffer objects.
For MediaTek NPUs, this API integrates tightly with Android AHardwareBuffer and GPU buffers. You can create input TensorBuffer instances directly from OpenGL or OpenCL buffers with TensorBuffer::CreateFromGlBuffer, which lets image-processing code feed NPU inputs without intermediate copies through CPU memory. This matters for real-time camera and video processing, where multiple copies per frame quickly saturate memory bandwidth.
A typical high-level C++ on-device path looks like this, with error handling omitted for clarity:
// Create the LiteRT environment (error handling omitted throughout)
auto env = Environment::Create({});
// Load a model compiled, AOT or on device, for the NPU
auto model = Model::CreateFromFile("model.tflite");
// Request the NPU accelerator
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
// Create compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);
// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
(*input_buffers)[0].Write(input_span);
compiled->Run(*input_buffers, *output_buffers);
(*output_buffers)[0].Read(output_span);
Whether you are targeting the CPU, GPU, or a MediaTek NPU, the same CompiledModel API is used, which reduces the amount of conditional logic in application code.
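For the camera and video path described earlier, the zero-copy variant replaces the host-side Write with a TensorBuffer wrapped around an existing GL buffer. A sketch, with the exact CreateFromGlBuffer parameter list treated as an assumption:

#include <GLES3/gl31.h>

#include <cstddef>
#include <utility>
#include <vector>

// Sketch: feed an existing OpenGL buffer to the compiled model without a
// copy through CPU memory. TensorBuffer::CreateFromGlBuffer is named in the
// LiteRT docs; the parameter list used here (environment, tensor type,
// GL target and id, size, offset) is an assumption.
void RunFromGlBuffer(litert::Environment& env,
                     litert::CompiledModel& compiled,
                     const litert::RankedTensorType& input_type,
                     GLuint gl_buffer_id, size_t gl_buffer_size_bytes) {
  // Wrap the GL buffer as the model's first input tensor.
  auto input = litert::TensorBuffer::CreateFromGlBuffer(
      env, input_type, GL_SHADER_STORAGE_BUFFER, gl_buffer_id,
      gl_buffer_size_bytes, /*buffer_offset=*/0);

  // Outputs can still use runtime-allocated buffers.
  auto output_buffers = compiled.CreateOutputBuffers();

  std::vector<litert::TensorBuffer> input_buffers;
  input_buffers.push_back(std::move(*input));
  compiled.Run(input_buffers, *output_buffers);
}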
Key takeaways
- The LiteRT NeuroPilot Accelerator is a new, first-class NPU integration between LiteRT and MediaTek NeuroPilot, replacing the legacy TFLite delegate and exposing a unified CompiledModel API with AOT and on-device compilation on supported Dimensity SoCs.
- The stack targets concrete open-weight models, including Qwen3-0.6B, Gemma-3-270M, Gemma-3-1B, Gemma-3n-E2B, and EmbeddingGemma-300M, and runs them via LiteRT and LiteRT-LM on MediaTek NPUs with a single accelerator abstraction.
- AOT compilation is strongly recommended for LLMs; Gemma-3-270M, for example, can take over a minute to compile on device, so production deployments should compile once in the build pipeline and ship the AI Pack via Play for On-device AI.
- On the Dimensity 9500-class NPU, Gemma-3n-E2B can reach over 1,600 tokens per second in prefill and 28 tokens per second in decode at a 4K context, with measured throughput up to 12x CPU and 10x GPU for LLM workloads.
- For developers, the LiteRT C++ and Kotlin APIs provide a common path to select Accelerator.NPU, manage compiled models, and use zero-copy tensor buffers, so CPU, GPU, and MediaTek NPU targets can share one code path and one deployment workflow.
Check out the docs and technical details for more.
Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.
