Tencent Hunyuan has open sourced HPC-Ops, a production grade operator library for large language model inference. HPC-Ops focuses on low-level CUDA kernels for core operators such as attention, grouped GEMM, and fused MoE, and exposes them via compact C++ and Python APIs for integration into existing inference stacks.
HPC-Ops already runs in internal services. In those deployments it delivered about a 30 percent queries-per-minute (QPM) improvement for the Tencent-HY model and about a 17 percent improvement for the DeepSeek model on mainstream inference cards. These gains are reported at the service level, so they reflect the cumulative effect of the faster kernels inside the actual inference pipeline.
HPC-Ops scope and design
HPC-Ops is a production grade, high performance, and easy-to-use operator library for LLM inference, developed by the Tencent Hunyuan AI Infra team. The project does not attempt to replace the serving framework. Instead it provides kernels and clean APIs that can be called from systems that already handle scheduling, KV cache management, batching, and transport.
The API is designed for seamless use inside popular inference frameworks such as vLLM and SGLang. This means that a framework team can swap in an HPC-Ops kernel behind their own abstraction without changing the external behavior of their server.
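The swap-in pattern looks roughly like the following sketch. The `kernel` hook stands in for a registered optimized binding; the names and dispatch mechanism are illustrative assumptions, not the library's actual API, and the fallback is a plain NumPy reference implementation of attention.

```python
import numpy as np

def reference_attention(q, k, v):
    """Plain softmax attention: the computation an optimized kernel replaces."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention(q, k, v, kernel=None):
    """Framework-side abstraction: if an optimized kernel was registered,
    dispatch to it; otherwise run the reference path. Callers see the
    same inputs and outputs either way."""
    return kernel(q, k, v) if kernel is not None else reference_attention(q, k, v)
```

Because the abstraction's inputs and outputs are fixed, the backend can be swapped without touching any calling code, which is the integration property described above.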
HPC-Ops uses C++ and CUDA with CuTe and CUTLASS as building blocks. The kernels are kept relatively small and readable, so they also serve as modern CUDA tutorials.
Kernel performance characteristics
The project publishes maximum observed speedup numbers for each operator relative to established baselines. These are microbenchmarks, and the team emphasizes that performance varies by shape and workload, but they indicate the headroom available from each kernel.
For attention in BF16, compared to FlashInfer, FlashAttention 2, FlashAttention 3, and TensorRT-LLM, HPC-Ops reports up to 1.33x speedup in prefill and up to 2.22x speedup in decode. For attention in FP8, compared to FlashInfer, FlashAttention 3, and TensorRT-LLM, it reports up to 1.12x in prefill and 2.0x in decode.
For FusedMoE in FP8, the maximum observed speedup is up to 1.49x in prefill and 1.14x in decode, compared to TensorRT-LLM and vLLM. For GroupGEMM in FP8, compared to DeepGEMM, the reported gains are up to 1.1x in prefill and 1.88x in decode.
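For intuition, a grouped GEMM is a set of independent matrix multiplies with different row counts (for example, one per expert after MoE routing) that an optimized kernel processes in a single launch. A minimal NumPy model of the operation, ignoring quantization; all shapes here are illustrative:

```python
import numpy as np

def grouped_gemm(a_list, b_list):
    """Reference for grouped GEMM: group i computes a_list[i] @ b_list[i].
    An optimized kernel fuses all groups into one launch, so small groups
    (common in decode, where per-expert batch sizes shrink) do not each
    pay launch overhead."""
    return [a @ b for a, b in zip(a_list, b_list)]

# Three groups with different token counts but a shared hidden size,
# as would arise from routing one batch across three experts.
rng = np.random.default_rng(0)
a_list = [rng.standard_normal((m, 8)) for m in (5, 2, 9)]
b_list = [rng.standard_normal((8, 4)) for _ in range(3)]
outs = grouped_gemm(a_list, b_list)
```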
These numbers matter because decode is typically the latency bottleneck in autoregressive generation, where batch sizes shrink and memory traffic dominates. The fact that attention and GroupGEMM show their largest relative gains in decode suggests that HPC-Ops focuses on the part of the pipeline that users notice most.
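Why a 2x kernel speedup shows up as a smaller service-level gain can be sanity-checked with Amdahl's law. The time fraction below is invented for illustration, not a measured HPC-Ops number:

```python
def end_to_end_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: only the fraction of wall time spent in the
    accelerated kernel benefits from its speedup."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# If decode attention hypothetically took 40% of step time and got
# 2.22x faster, the step itself would speed up by about 1.28x.
print(round(end_to_end_speedup(0.40, 2.22), 2))
```

This is why microbenchmark speedups and service-level QPM gains are reported separately: the latter folds in all the unaccelerated parts of the pipeline.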
Supported kernels and precision
The current release groups its functionality into three operator families:
- Attention kernels cover both prefill and decode and include support for paged attention. Paged attention is the memory layout that frameworks like vLLM use to place key and value cache blocks in a paged structure, which improves memory reuse for long sequences.
- The grouped GEMM is implemented as quantized GroupGEMM with FP8 weights. HPC-Ops supports block-wise and per-tensor scaling, so teams can trade off quantization granularity against parameter storage and calibration cost.
- FusedMoE combines mixture-of-experts routing and expert computation into a single quantized operator. It also uses FP8 expert weights and supports block-wise and per-tensor scaling strategies.
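The paged layout from the first bullet can be illustrated with a toy block table: logical token positions map through a per-sequence table to fixed-size physical cache blocks, so a long sequence never needs one contiguous allocation. Block size and shapes here are illustrative, not the library's actual layout:

```python
import numpy as np

BLOCK = 4  # tokens per physical cache block (illustrative)

# Physical pool of KV blocks: (num_blocks, BLOCK, head_dim)
pool = np.zeros((8, BLOCK, 2))

# One sequence's block table: logical block i lives in physical block table[i].
block_table = [5, 1, 6]

def write_kv(pos, vec):
    """Store the KV vector for logical token position `pos`."""
    blk, off = divmod(pos, BLOCK)
    pool[block_table[blk], off] = vec

def read_kv(pos):
    """Fetch the KV vector for logical token position `pos`."""
    blk, off = divmod(pos, BLOCK)
    return pool[block_table[blk], off]

for p in range(10):            # a 10-token sequence spanning 3 blocks
    write_kv(p, np.array([p, -p]))
```

A paged-attention kernel performs this indirection inside the attention loop, which is why the block table is part of the operator's interface rather than the framework's alone.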
In these kernels, HPC-Ops provides native support for the BF16 and FP8 data types. This matches the current production trend toward lower-precision inference formats that preserve accuracy while reducing memory bandwidth and improving tensor core utilization.
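The difference between per-tensor and block-wise scaling can be sketched with a simulated FP8 E4M3 quantizer (maximum representable magnitude 448). The quantizer below is a deliberate simplification that models only dynamic-range clipping and rounding, not true FP8 mantissa behavior:

```python
import numpy as np

FP8_MAX = 448.0  # E4M3 max normal magnitude

def quantize_dequantize(x, scale):
    """Scale into the FP8 range, clip, round to an integer grid, scale back.
    Real FP8 rounds mantissas; this crude round models the same loss."""
    q = np.clip(x / scale, -FP8_MAX, FP8_MAX)
    return np.round(q) * scale

def per_tensor(x):
    scale = np.abs(x).max() / FP8_MAX    # one scale for the whole tensor
    return quantize_dequantize(x, scale)

def block_wise(x, block=4):
    out = np.empty_like(x)
    for i in range(0, x.size, block):    # one scale per contiguous block
        chunk = x[i:i + block]
        scale = np.abs(chunk).max() / FP8_MAX
        out[i:i + block] = quantize_dequantize(chunk, scale)
    return out

# An outlier-heavy block inflates the per-tensor scale and wipes out the
# small-magnitude block; block-wise scaling contains the damage.
x = np.array([0.01, 0.02, -0.015, 0.03, 100.0, 80.0, -90.0, 70.0])
err_tensor = np.abs(per_tensor(x) - x).mean()
err_block = np.abs(block_wise(x) - x).mean()
```

This is the granularity trade-off mentioned above: finer block-wise scales cost extra scale storage and calibration effort but keep outliers from dominating the quantization error.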
Key takeaways
- Tencent Hunyuan open-sourced HPC-Ops as a production grade operator library for LLM inference on NVIDIA SM90 GPUs, including H20, with C++ and CUDA kernels built on CuTe and CUTLASS.
- In production deployments, HPC-Ops has reportedly delivered about a 30 percent QPM gain for Tencent-HY models and about a 17 percent QPM gain for DeepSeek models on mainstream inference cards.
- Operator microbenchmarks show maximum speedups of up to 2.22x for BF16 attention decode, up to 2.0x for FP8 attention decode, up to 1.49x for FP8 FusedMoE prefill, and up to 1.88x for FP8 GroupGEMM decode, compared to strong baselines such as FlashInfer, FlashAttention, TensorRT-LLM, and DeepGEMM.
- The library covers three operator families, attention with paged attention support, quantized GroupGEMM with FP8 weights, and quantized FusedMoE with FP8 expert weights, all with block-wise and per-tensor scaling and native BF16 plus FP8 precision support.
- HPC-Ops is designed as an operator layer that integrates into existing inference frameworks such as vLLM and SGLang, and the roadmap covers long-context LLMs, extended quantization including 4-bit and 8-bit strategies, and kernels that better overlap computation with multi-GPU communication.

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.