Moonshot AI and Tsinghua researchers propose PrfaaS: a cross-datacenter KVCache architecture that rethinks how LLMs are served at scale

by ai-intensify

For years, the way large language models handle inference has been stuck inside a box – literally. The high-bandwidth RDMA networks that underpin modern LLM serving confine both prefill and decode to a single datacenter, sometimes even to the same rack. Moonshot AI and a team of researchers from Tsinghua University argue that this barrier is about to fall – and that the right architecture can already take advantage of the change.

The research team introduces Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefills to a standalone, compute-intensive prefill cluster and ships the resulting KVCache over commodity Ethernet to a local PD cluster for decode. In a case study with an internal 1T-parameter hybrid model, it delivers 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup – while consuming only a fraction of the available cross-datacenter bandwidth. The research team notes that at equal hardware cost the throughput gain is about 15%, indicating that the full 54% gain comes partly from pairing the higher-compute H200 GPU for prefill with the H20 GPU for decode.

https://arxiv.org/pdf/2604.15039v1

Why has existing architecture hit a wall?

To understand what PrfaaS solves, it helps to understand why LLM serving is split into two phases in the first place. Prefill is the step where the model processes all input tokens and produces the KVCache – this is compute-intensive. Decode is where the model generates output tokens one at a time – this is memory-bandwidth-intensive. Prefill-decode (PD) separation runs the two stages on separate hardware, which improves utilization and lets each stage be optimized independently.

The problem is that separating prefill from decode creates a transport problem. Once prefill runs on one set of machines and decode on another, the KVCache produced during prefill must be transferred to the decode side before output generation can begin. In traditional dense-attention models – those using grouped-query attention (GQA) – this KVCache is large. The research team benchmarked MiniMax-M2.5, a representative dense GQA model, and measured KVCache production at approximately 60 Gbps for a 32K-token request on a single 8×H200 instance. Moving that much data without stalling compute requires RDMA-class interconnects, which is why traditional PD separation is tightly bound to a single datacenter-scale network fabric. Splitting prefill and decode across clusters, let alone datacenters, has simply not been feasible.
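To see why dense GQA models land in the RDMA regime, here is a back-of-the-envelope calculation of KV throughput (KVCache size over prefill latency). The model dimensions and the prefill latency below are illustrative assumptions, not the paper's benchmark configuration:

```python
# Back-of-the-envelope KV throughput for a dense GQA model.
# All parameters below are illustrative assumptions, not the paper's configs.

def kv_cache_bytes(num_layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the full-attention KVCache: K and V tensors for every layer."""
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dense model: 60 layers, 8 KV heads (GQA), head_dim 128, FP16.
size = kv_cache_bytes(num_layers=60, kv_heads=8, head_dim=128, seq_len=32_000)
prefill_latency_s = 1.0  # assumed prefill time for a 32K-token request

throughput_gbps = size * 8 / prefill_latency_s / 1e9
print(f"KVCache: {size / 1e9:.2f} GB, KV throughput: {throughput_gbps:.1f} Gbps")
```

With these (assumed) numbers, a single 32K-token request already emits KVCache at tens of gigabits per second – far beyond what a shared inter-datacenter Ethernet link can comfortably absorb per request.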

Hybrid attention changes the math

What makes PrfaaS timely is an architectural shift happening at the model level. A growing class of models – including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T – adopts hybrid attention stacks that combine a large number of linear-complexity or bounded-state layers, such as Kimi Delta Attention (KDA), Multi-Head Latent Attention (MLA), and Sliding Window Attention (SWA), with a small number of full-attention layers. In these architectures, only the full-attention layers generate a KVCache that scales with sequence length; the linear-complexity layers maintain fixed-size recurrent states whose footprint is negligible at long context lengths.

The KV throughput numbers – defined as KVCache size divided by prefill latency – tell the story clearly. At 32K tokens, MiMo-V2-Flash produces KVCache at 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13× reduction. Qwen3.5-397B comes in at 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4× reduction. For Ring-2.5-1T specifically, the paper decomposes the savings: MLA contributes about 4.5× compression over GQA, and the 7:1 hybrid ratio contributes about another 8×, for an overall KV memory saving of roughly 36×. For the internal 1T model used in the case study, KV throughput at 32K tokens is only 3.19 Gbps – a level that modern inter-datacenter Ethernet links can realistically sustain.
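The reported ratios can be checked directly from the article's own figures:

```python
# Sanity-check the reduction ratios from the figures quoted in the article.
kv_gbps = {  # KV throughput at 32K tokens, in Gbps
    "MiniMax-M2.5": 59.93,
    "MiMo-V2-Flash": 4.66,
    "Qwen3-235B": 33.35,
    "Qwen3.5-397B": 8.25,
}
mimo_reduction = kv_gbps["MiniMax-M2.5"] / kv_gbps["MiMo-V2-Flash"]
qwen_reduction = kv_gbps["Qwen3-235B"] / kv_gbps["Qwen3.5-397B"]
print(f"MiMo-V2-Flash vs MiniMax-M2.5: {mimo_reduction:.1f}x")  # ~13x
print(f"Qwen3.5-397B vs Qwen3-235B:   {qwen_reduction:.1f}x")   # ~4x

# Ring-2.5-1T decomposition: MLA compression x hybrid-ratio reduction.
ring_savings = 4.5 * 8
print(f"Ring-2.5-1T overall KV savings: {ring_savings:.0f}x")   # 36x
```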

But the research team is careful to make a distinction that matters to AI developers building real systems: a small KVCache is necessary but not sufficient to make cross-datacenter PD separation practical. Production traffic is heavy, request lengths are heterogeneous, prefix caches are unevenly distributed across nodes, and inter-cluster bandwidth fluctuates. A naive design that routes every prefill to the remote cluster still produces congestion and unstable queues.

What does PrfaaS actually do?

The PrfaaS-PD architecture rests on three subsystems: compute, network, and storage. The compute subsystem divides clusters into two types – local PD clusters that handle end-to-end inference for small requests, and PrfaaS clusters with dedicated high-compute-throughput accelerators for long-context prefills. The network subsystem uses intra-cluster RDMA for fast local transfers and commodity Ethernet for cross-cluster KVCache transport. The storage subsystem builds a distributed hybrid prefix-cache pool that manages linear-attention recurrent states (request-level, fixed-size, exact-match) and full-attention KVCache blocks (block-level, growing linearly with input length, supporting partial prefix matching) in each cluster, backed by a unified block pool.

The main routing mechanism is length-based threshold routing. Let l denote the incremental prefill length of a request after subtracting any cached prefix, and let t be a routing threshold. If l > t, the request goes to the PrfaaS cluster and its KVCache is forwarded over Ethernet to a decode node; if l ≤ t, it stays on the local PD path. In the case study, the optimum is t = 19.4K tokens, which routes about 50% of all requests – the long ones – to the PrfaaS cluster.
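The threshold rule itself is simple enough to sketch in a few lines. The request shape and function names below are hypothetical; the paper's scheduler layers much more logic on top of this core decision:

```python
# Minimal sketch of length-based threshold routing. The Request shape and
# route() helper are illustrative inventions, not the paper's API.
from dataclasses import dataclass

THRESHOLD_TOKENS = 19_400  # optimal t from the case study


@dataclass
class Request:
    input_tokens: int
    cached_prefix_tokens: int = 0  # longest prefix already in the cache pool


def route(req: Request) -> str:
    # l = incremental prefill length after subtracting any cached prefix
    l = req.input_tokens - req.cached_prefix_tokens
    return "prfaas" if l > THRESHOLD_TOKENS else "local_pd"


print(route(Request(input_tokens=64_000)))                               # prfaas
print(route(Request(input_tokens=8_000)))                                # local_pd
# A long request with a long cached prefix stays local:
print(route(Request(input_tokens=40_000, cached_prefix_tokens=30_000)))  # local_pd
```

Note that the decision is made on the *incremental* length, which is why prefix-cache state directly influences where a request lands.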

In practice, low KV throughput alone is not enough to make an Ethernet path reliable. The research team specifies three concrete transport mechanisms: layer-wise prefill pipelining to overlap KVCache generation with transmission, multi-connection TCP transport to fully utilize the available bandwidth, and congestion monitoring integrated with the scheduler to detect loss and retransmission signals quickly and prevent congestion from accumulating.
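The first of these, layer-wise pipelining, is a classic producer-consumer overlap: the KV block for layer i is streamed out while layer i+1 is still computing. Here is a toy simulation of that shape – the queue, thread structure, and layer count are invented for illustration, not the paper's implementation:

```python
# Toy illustration of layer-wise prefill pipelining: each layer's KV block is
# handed to a sender thread as soon as it is produced, so transmission overlaps
# with the remaining prefill compute. Purely a simulation of the dataflow.
import queue
import threading

NUM_LAYERS = 8
kv_queue: "queue.Queue[int | None]" = queue.Queue(maxsize=2)
sent = []

def prefill():
    for layer in range(NUM_LAYERS):
        # ... compute attention for this layer, producing its KV block ...
        kv_queue.put(layer)   # hand the block to the sender immediately
    kv_queue.put(None)        # sentinel: prefill finished

def sender():
    while (layer := kv_queue.get()) is not None:
        # ... stream this layer's KV block over multi-connection TCP ...
        sent.append(layer)

t = threading.Thread(target=sender)
t.start()
prefill()
t.join()
print(sent)  # layers stream out in order: [0, 1, 2, 3, 4, 5, 6, 7]
```

The small `maxsize` bounds how far compute can run ahead of the network, which is also where the congestion signals from the third mechanism would naturally apply backpressure.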

On top of this, the research team introduces a dual-period scheduler. On short time scales, it monitors PrfaaS egress utilization and queue depth, adjusting routing when the link approaches its bandwidth limit. It also handles cache-affinity routing: when bandwidth is scarce, each cluster's prefix cache is evaluated independently; when bandwidth is abundant, the scheduler considers the best cached prefix across all clusters and performs a cross-cluster cache transfer if doing so avoids redundant computation. On long time scales, the scheduler rebalances prefill and decode node counts within the local PD cluster as traffic patterns shift, keeping the system near its throughput-optimal operating point.
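The fast loop can be pictured as a simple feedback controller on the routing threshold: when egress utilization nears the link limit, raise t so fewer requests go remote; when there is ample headroom, lower it. The watermarks and step factor below are invented for illustration – the paper does not specify this exact policy:

```python
# Hedged sketch of the fast loop of the dual-period scheduler. The watermark
# constants and multiplicative step are assumptions, not the paper's values.
def adjust_threshold(threshold_tokens: float, egress_utilization: float,
                     high_water: float = 0.85, low_water: float = 0.5,
                     step: float = 1.1) -> float:
    if egress_utilization > high_water:   # link congested: offload less
        return threshold_tokens * step
    if egress_utilization < low_water:    # ample headroom: offload more
        return max(threshold_tokens / step, 1.0)
    return threshold_tokens               # in band: leave routing alone

t = 19_400.0
t = adjust_threshold(t, egress_utilization=0.90)  # congested -> threshold rises
print(round(t))  # 21340
```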

The numbers

In the case study, a PrfaaS cluster of 32 H200 GPUs is paired with a local PD cluster of 64 H20 GPUs over a VPC network providing approximately 100 Gbps of cross-cluster bandwidth. The total PrfaaS egress load under the optimal configuration is approximately 13 Gbps – only 13% of the available Ethernet capacity – and the paper notes that the PrfaaS cluster remains compute-bound with additional bandwidth headroom. The paper also projects the design to larger deployments: even at the scale of a 10,000-GPU datacenter, the total egress bandwidth required for KVCache transfer is only 1.8 Tbps, well within the capacity of modern inter-datacenter links.

Mean time to first token (TTFT) declined by 50% and P90 TTFT declined by 64% compared to the homogeneous baseline. The naive heterogeneous configuration – all prefill on H200, all decode on H20, with no routing or scheduling logic – achieves only 1.16× throughput over the homogeneous baseline compared to 1.54× for the full PrfaaS-PD system. The difference between 1.16× and 1.54× isolates the contribution of the scheduling layer and shows that it accounts for most of the practical gain.

The research team positions PrfaaS not as a near-future concept but as a design that is viable today for hybrid-architecture models – and argues that as context windows grow, KVCache compression matures, and stage-specific hardware such as NVIDIA's Rubin CPX for prefill and LPU-style chips for decode becomes more widely available, the case for cross-datacenter PD separation will only get stronger.

