In industrial recommendation systems, traditional embedding-based nearest-neighbor search is being replaced by Generative Retrieval (GR) built on large language models (LLMs). These models represent items as Semantic IDs (SIDs), discrete token sequences, and treat retrieval as an autoregressive decoding task. However, industrial applications often require strict adherence to business logic, such as enforcing content freshness or inventory availability. Standard autoregressive decoding cannot natively enforce these constraints, which often leads the model to "hallucinate" invalid or out-of-stock item identifiers.
The Accelerator Bottleneck: Tries vs. TPUs/GPUs
To ensure valid output, developers typically use a prefix tree (trie) to mask invalid tokens at each decoding step. While conceptually simple, traditional trie implementations are fundamentally inefficient on hardware accelerators such as TPUs and GPUs.
The efficiency gap arises from two primary issues:
- Memory latency: Pointer-chasing structures produce non-contiguous, random memory access patterns. This prevents memory accesses from being coalesced and fails to exploit the high-bandwidth memory (HBM) burst capabilities of modern accelerators.
- Compilation incompatibility: Accelerators rely on static computation graphs for machine-learning compilation (for example, Google's XLA). Standard tries use data-dependent control flow and recursive branching, which are incompatible with this paradigm and often force expensive host-device round trips.
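The problem above can be made concrete with a minimal pointer-based trie. This is an illustrative sketch, not code from the paper: each lookup hops through child dictionaries, so both the memory accesses and the control flow depend on the data, which is exactly what an XLA-style compiler cannot trace into one static graph.

```python
# Hypothetical pointer-chasing trie over Semantic ID (SID) prefixes.
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

def insert(root, sid):
    node = root
    for tok in sid:
        node = node.children.setdefault(tok, TrieNode())

def allowed_next_tokens(root, prefix):
    """Return the set of valid next tokens after `prefix`."""
    node = root
    for tok in prefix:
        if tok not in node.children:   # branch depends on runtime data
            return set()
        node = node.children[tok]      # pointer chase: random, non-contiguous reads
    return set(node.children)

root = TrieNode()
for sid in [(1, 5, 2), (1, 5, 7), (3, 0, 4)]:
    insert(root, sid)

print(allowed_next_tokens(root, (1, 5)))  # {2, 7}
```

Every hop is a dictionary lookup at an unpredictable address, so per-step masking with this structure either runs on the host CPU or breaks the accelerator's compiled graph.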

STATIC: A Sparse Transition Matrix-Accelerated Trie Index
Researchers from Google DeepMind and YouTube have introduced STATIC (Sparse Transition-matrix Accelerated Trie Index for Constrained decoding) to address these limitations. Instead of treating the trie as a graph to be traversed, STATIC flattens it into a static Compressed Sparse Row (CSR) matrix. This transformation turns irregular tree traversal into a fully vectorized sparse-matrix operation.
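The flattening idea can be sketched as follows. This is a hedged illustration under assumed naming (the node numbering, array names, and layout are mine, not the paper's): the children of node `n` become the contiguous slice `indptr[n]:indptr[n+1]` of two flat arrays, so a lookup is one sequential read instead of a pointer chase.

```python
import numpy as np

# CSR-like flattening of the trie over SIDs (1,5,2), (1,5,7), (3,0,4).
# Nodes numbered breadth-first (illustrative choice):
# 0=root, 1=(1,), 2=(3,), 3=(1,5), 4=(3,0), 5=(1,5,2), 6=(1,5,7), 7=(3,0,4)
indptr  = np.array([0, 2, 3, 4, 6, 7, 7, 7, 7])  # row offsets, one per node
tokens  = np.array([1, 3, 5, 0, 2, 7, 4])        # edge labels (valid next tokens)
targets = np.array([1, 2, 3, 4, 5, 6, 7])        # child node ids

def children(node):
    # One contiguous slice: HBM-burst-friendly, no data-dependent branching.
    lo, hi = indptr[node], indptr[node + 1]
    return tokens[lo:hi], targets[lo:hi]

toks, kids = children(3)      # children of prefix (1, 5)
print(toks.tolist())          # [2, 7]
```

Because the arrays are fixed at build time, the same gather/slice pattern works for every node, which is what makes the traversal expressible as a static vectorized operation.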
Hybrid Decoding Architecture
STATIC uses a two-stage lookup strategy to balance memory usage and speed:
- Dense masking (layers 1–D): For the first D = 2 layers, where the branching factor is highest, STATIC uses a bit-packed dense Boolean tensor. This allows O(1) lookups during the most computationally expensive initial stages.
- Vectorized Node Transition Kernel (VNTK): For deeper layers (l ≥ 3), STATIC uses a branch-free kernel that executes a "speculative slice" of a fixed number of entries (B_l), corresponding to the maximum branching factor at that level. By using fixed-size slices regardless of the actual number of children, the entire decoding process remains a single, stable computation graph.
This approach achieves O(1) I/O complexity relative to the constraint-set size, whereas previous hardware-accelerated binary-search methods scale logarithmically (O(log |C|)).
Performance and Scalability
Evaluated on a Google TPU v6e accelerator with a 3-billion-parameter model, a batch size of 2, and a beam size (M) of 70, STATIC demonstrated significant performance gains over existing methods.
| Method | Latency overhead per step (ms) | % of total inference time |
| --- | --- | --- |
| STATIC (ours) | +0.033 | 0.25% |
| PPV (approximate) | +1.56 | 11.9% |
| Hash bitmap | +12.3 | 94.0% |
| CPU trie | +31.3 | 239% |
| PPV (exact) | +34.1 | 260% |
STATIC achieved a 948x speedup over CPU-offloaded tries and outperformed the exact binary-search baseline (PPV) by 1,033x. Its latency remains almost constant as the Semantic ID vocabulary size (|V|) increases.
For a vocabulary of 20 million items, STATIC's upper bound on HBM usage is approximately 1.5 GB. In practice, due to the non-uniform distribution and clustering of Semantic IDs, actual usage is generally ≤75% of this bound. The rule of thumb for capacity planning is approximately 90 MB of HBM per 1 million constrained items.
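As a back-of-envelope check, the stated rule of thumb can be applied to the 20-million-item deployment; all figures below come from the article, and the small gap between the raw product and the stated ~1.5 GB bound is expected from rounding in the reported numbers.

```python
# Capacity-planning arithmetic using the article's figures.
mb_per_million_items = 90        # ~90 MB HBM per 1M constrained items
items_in_millions = 20           # 20M-item vocabulary

raw_upper_bound_gb = mb_per_million_items * items_in_millions / 1024
stated_bound_gb = 1.5            # article's stated upper bound
typical_usage_gb = stated_bound_gb * 0.75   # "generally <= 75% of this bound"

print(round(raw_upper_bound_gb, 2), typical_usage_gb)  # ~1.76 GB raw, ~1.12 GB typical
```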
Deployment Results
STATIC was deployed on YouTube to enforce a "past 7 days" freshness constraint for video recommendations. The system served a vocabulary of 20 million fresh items with 100% constraint compliance.
Online A/B testing revealed:
- A +5.1% increase in 7-day fresh video views.
- A +2.9% increase in 3-day fresh video views.
- A +0.15% increase in click-through rate (CTR).
Cold-Start Performance
The framework also addresses the "cold-start" limitation of generative retrieval: recommending items not seen during training. By restricting the model to the cold-start item set on the Amazon Reviews dataset, STATIC significantly improved performance over the unconstrained baseline, which recorded 0.00% Recall@1. These tests used a 1-billion-parameter Gemma architecture with SID length L = 4 tokens and a per-token vocabulary size of |V| = 256.
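The cold-start use case is the same masking mechanism with a different constraint set: the trie is simply built only from the cold-start SIDs, so the beam can never leave that set. A hedged sketch, where the example SIDs are made up and only L = 4 and |V| = 256 match the reported setup:

```python
# Constrained decoding restricted to a hypothetical cold-start item set.
L, V = 4, 256                    # SID length and per-token vocabulary (reported setup)
cold_start_sids = {(7, 12, 0, 255), (7, 12, 9, 3), (100, 4, 4, 4)}  # made-up items

def prefix_allowed(prefix):
    """Valid next tokens after `prefix` when decoding only cold-start SIDs."""
    return {sid[len(prefix)] for sid in cold_start_sids
            if len(prefix) < L and sid[:len(prefix)] == prefix}

print(prefix_allowed((7, 12)))   # {0, 9}: only cold-start continuations survive
```

Because the baseline model assigns essentially zero probability mass to unseen items, forcing the beam into this set is what lifts Recall@1 above 0.00%.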
Key Takeaways
- Vectorized efficiency: STATIC recasts constrained decoding from a graph-traversal problem into hardware-friendly, vectorized sparse-matrix operations by flattening prefix trees into static Compressed Sparse Row (CSR) matrices.
- Massive speedup: The system achieves 0.033 ms per-step latency, a 948x speedup over CPU-offloaded tries and a 47-1,033x speedup over hardware-accelerated binary-search baselines.
- Scalable O(1) complexity: By achieving O(1) I/O complexity relative to the constraint-set size, STATIC maintains high performance with a low memory footprint of approximately 90 MB per 1 million items.
- Production-proven results: Deployment on YouTube showed 100% compliance with business-logic constraints, leading to a 5.1% increase in fresh video views and a 0.15% increase in click-through rate.
- Cold-start solution: The framework enables generative retrieval models to successfully recommend cold-start items, lifting Recall@1 on the Amazon Reviews benchmark from 0.00% to non-trivial levels.

