Maia 200 is Microsoft’s new in-house AI accelerator designed to perform inference in Azure datacenters. It targets the cost of token generation for large language models and other reasoning workloads by combining narrow-precision compute, a dense on-chip memory hierarchy, and an Ethernet-based scale-up fabric.
Why did Microsoft create a dedicated inference chip?
Training and inference stress the hardware in different ways. Training demands heavy all-to-all communication and long-running jobs. Inference cares about tokens per second, latency, and tokens per dollar. Microsoft positions the Maia 200 as its most efficient inference system, with approximately 30 percent better performance per dollar than the latest hardware in its fleet.
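To make the tokens-per-dollar framing concrete, here is a minimal sketch of how such a comparison might be computed. The throughput and hourly price values are illustrative placeholders, not figures from Microsoft.

```python
# Illustrative sketch of a tokens-per-dollar comparison. The throughput and
# hourly price values below are made-up placeholders, not published figures.

def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    """Tokens generated per dollar of accelerator time."""
    return tokens_per_second * 3600 / price_per_hour

baseline = tokens_per_dollar(tokens_per_second=10_000, price_per_hour=12.0)
candidate = tokens_per_dollar(tokens_per_second=13_000, price_per_hour=12.0)
print(f"relative performance per dollar: {candidate / baseline:.2f}x")  # ~1.30x
```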
Maia 200 is part of a heterogeneous Azure stack. It will serve multiple models, including OpenAI’s latest GPT 5.2 models, and power workloads in Microsoft Foundry and Microsoft 365 Copilot. The Microsoft superintelligence team will use the chip for synthetic data generation and reinforcement learning to improve in-house models.
Core silicon and numerical specifications
Each Maia 200 die is manufactured on TSMC’s 3 nanometer process. The chip integrates more than 140 billion transistors.
The compute pipeline is built around native FP8 and FP4 tensor cores. A single chip delivers over 10 petaFLOPS in FP4 and over 5 petaFLOPS in FP8 within a 750W SoC TDP envelope.
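As a rough sanity check, the stated peaks and TDP imply the following compute efficiency at the SoC level. This is simple arithmetic on the published figures, not a measured result.

```python
# Peak compute efficiency implied by the stated specs (lower-bound figures).
FP4_PFLOPS = 10.0   # >10 petaFLOPS FP4
FP8_PFLOPS = 5.0    # >5 petaFLOPS FP8
TDP_WATTS = 750.0   # SoC TDP envelope

print(f"FP4: ~{FP4_PFLOPS * 1000 / TDP_WATTS:.1f} TFLOPS/W")  # ~13.3 TFLOPS/W
print(f"FP8: ~{FP8_PFLOPS * 1000 / TDP_WATTS:.1f} TFLOPS/W")  # ~6.7 TFLOPS/W
```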
Memory is split between stacked HBM and on-die SRAM. The Maia 200 offers 216 GB of HBM3e with approximately 7 TB per second of bandwidth and 272 MB of on-die SRAM. The SRAM is organized into tile-level SRAM and cluster-level SRAM and is entirely software managed, so the compiler and runtime can keep the working sets of attention and GEMM kernels explicitly close to the compute.
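A quick roofline-style calculation shows why the software-managed SRAM matters: dividing the FP4 peak by the HBM bandwidth gives the arithmetic intensity a kernel needs to stay compute-bound when streaming from HBM alone.

```python
# Ridge point: operations needed per byte read from HBM to keep FP4 units busy.
PEAK_FP4_OPS_PER_S = 10e15   # 10 PFLOPS
HBM_BYTES_PER_S = 7e12       # ~7 TB/s

ridge = PEAK_FP4_OPS_PER_S / HBM_BYTES_PER_S
print(f"~{ridge:.0f} ops per HBM byte to be compute-bound")  # ~1400 ops/byte
# Kernels below this intensity are bandwidth-bound; reusing data held in the
# 272 MB of on-die SRAM raises the effective intensity seen by HBM.
```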
Tile-based microarchitecture and memory hierarchy
The Maia 200 microarchitecture is hierarchical. The base unit is the tile, the smallest autonomous compute-and-storage unit on the chip. Each tile includes a tile vector processor, which is a programmable SIMD engine, and a tile tensor unit for high-throughput matrix operations. The tile SRAM feeds both units, and the tile DMA engine moves data in and out of that SRAM without stalling compute. A tile control processor sequences the tensor and DMA tasks.
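Here is a minimal sketch of the double-buffering pattern that a tile DMA engine enables: prefetch the next block into one SRAM buffer while the tensor unit computes on the other. The `dma_load`, `wait`, and `compute` callables are hypothetical placeholders, not a real Maia API.

```python
# Double-buffering sketch: overlap DMA prefetch with compute on the other buffer.
# dma_load, wait, and compute are hypothetical placeholders, not a Maia API.

def run_blocks(blocks, dma_load, wait, compute):
    buffers = [None, None]
    buffers[0] = dma_load(blocks[0])                        # prefetch the first block
    for i in range(len(blocks)):
        if i + 1 < len(blocks):
            buffers[(i + 1) % 2] = dma_load(blocks[i + 1])  # overlap the next load
        wait(buffers[i % 2])                                # ensure this block has landed
        compute(buffers[i % 2])                             # tensor unit reads tile SRAM

# Toy usage with no-op placeholders:
run_blocks([b"a", b"b", b"c"], dma_load=lambda b: b, wait=lambda b: None, compute=lambda b: None)
```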
Multiple tiles form a cluster. Each cluster exposes a large multi-banked cluster SRAM that is shared across all tiles in that cluster. Cluster level DMA engines move data between the cluster SRAM and the co-packaged HBM stack. A cluster core coordinates multi-tile execution and uses redundancy schemes for tiles and SRAM to improve yield while maintaining the same programming model.
This hierarchy allows the software stack to pin different parts of the model to different levels. For example, the attention kernel can hold Q, K, V tensors in tile SRAM, while the collective communication kernel can stage payloads in cluster SRAM and reduce HBM pressure. The design goal is to maintain high utilization as models grow in size and sequence length.
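As a generic illustration of keeping the working set close to the compute, here is a blocked attention kernel in plain NumPy: each query block and its score matrix are sized so they could live in fast local memory. This is the standard blocking pattern, not Maia kernel code.

```python
import numpy as np

# Generic blocked attention: each query block's working set stays small enough
# to fit in fast local memory. Plain NumPy, not Maia kernel code.

def blocked_attention(Q, K, V, block=128):
    out = np.empty_like(Q)
    scale = 1.0 / np.sqrt(Q.shape[1])
    for s in range(0, Q.shape[0], block):
        q = Q[s:s + block]                              # query block "pinned" locally
        scores = q @ K.T * scale                        # (block, seq_len)
        scores -= scores.max(axis=1, keepdims=True)     # numerically stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)
        out[s:s + block] = p @ V
    return out

Q, K, V = (np.random.randn(512, 64) for _ in range(3))
result = blocked_attention(Q, K, V)
```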
On-chip data movement and Ethernet scale-up fabric
Inference is often limited by data movement, not peak FLOPS. Maia 200 uses a custom network on chip with a hierarchy of DMA engines. The network on chip spans tiles, clusters, memory controllers, and I/O units. It has separate planes for large tensor traffic and small control messages, so small synchronization messages are not blocked behind large transfers.
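A toy calculation makes the head-of-line blocking point concrete: on a shared link, a small synchronization message queued behind a bulk tensor transfer pays the full serialization delay of that transfer. The link bandwidth below is a placeholder, not a Maia spec.

```python
# Head-of-line blocking on a shared plane vs. a dedicated control plane.
LINK_BYTES_PER_S = 1e12      # placeholder on-chip link bandwidth (1 TB/s)
bulk_transfer = 64 * 2**20   # 64 MiB tensor payload
control_msg = 64             # 64 B synchronization message

shared = (bulk_transfer + control_msg) / LINK_BYTES_PER_S   # queued behind the tensor
dedicated = control_msg / LINK_BYTES_PER_S                  # own traffic plane
print(f"shared plane: ~{shared * 1e6:.0f} us, dedicated plane: ~{dedicated * 1e9:.2f} ns")
```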
Beyond the chip boundary, Maia 200 integrates its own NIC and an Ethernet-based scale-up network that runs an AI transport layer protocol. The on-die NIC exposes approximately 1.4 TB per second in each direction, or 2.8 TB per second of bidirectional bandwidth, and scales to 6,144 accelerators in a two-tier domain.
Within each tray, four Maia accelerators form a fully connected quad. These four devices have direct, non-switched links to each other. Most tensor-parallel traffic stays inside this group, and only lightly aggregated traffic goes out to the switch. This improves latency and reduces the switch port count needed for a given inference domain.
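To see why keeping tensor-parallel traffic inside the quad matters, here is a back-of-the-envelope estimate of all-reduce volume per generated token for a ring all-reduce across four devices. The model shape is a hypothetical example, not a specific deployed model.

```python
# Per-token all-reduce traffic for tensor parallelism inside a quad (N = 4).
# A ring all-reduce moves 2 * (N - 1) / N of the tensor per device. The model
# shape below is a hypothetical example.

N = 4                        # fully connected quad
hidden = 8192                # hypothetical hidden size
layers = 80                  # hypothetical layer count
bytes_per_element = 2        # BF16 activations
allreduces_per_layer = 2     # one after attention, one after the MLP

per_token = layers * allreduces_per_layer * hidden * bytes_per_element * 2 * (N - 1) / N
print(f"~{per_token / 2**20:.1f} MiB of all-reduce traffic per generated token")  # ~3.8 MiB
```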
Azure systems integration and cooling
At the system level, the Maia 200 follows the same rack, power, and mechanical standards as Azure GPU servers. It supports air-cooled and liquid-cooled configurations and uses a second-generation closed-loop liquid cooling heat exchanger unit for higher-density racks. This allows mixed deployment of GPUs and Maia accelerators in the same datacenter footprint.
The accelerator integrates with the Azure control plane. Firmware management, health monitoring, and telemetry use the same workflow as other Azure compute services. This enables fleet-wide rollout and maintenance without disrupting ongoing AI workloads.
Key takeaways
Here are the key technical takeaways in brief:
- Inference-first design: Maia 200 is Microsoft’s first silicon and systems platform built specifically for AI inference, optimized for large-scale token generation in modern reasoning models and large language models.
- Numerical specifications and memory hierarchy: The chip is manufactured on TSMC’s 3 nm process, integrates approximately 140 billion transistors, and delivers more than 10 PFLOPS of FP4 and more than 5 PFLOPS of FP8, paired with 216 GB of HBM3e at roughly 7 TB per second and 272 MB of on-chip SRAM split into tile SRAM and cluster SRAM and managed in software.
- Performance vs. other cloud accelerators: Microsoft reports 30 percent better performance per dollar compared to the latest Azure inference systems and claims 3x the FP4 performance of third-generation Amazon Trainium and higher FP8 performance than Google TPU v7 at the accelerator level.
- Tile-based architecture and Ethernet fabric: Maia 200 organizes compute into tiles and clusters with local SRAM, DMA engines, and a network on chip, and exposes an integrated NIC with approximately 1.4 TB per second per direction of Ethernet bandwidth that scales to 6,144 accelerators, using fully connected quads as local tensor-parallel domains.
Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.
