2026 list (January to May)

As some of you know, I have a longtime habit of keeping a running list of research papers that I want to read, revisit, or cite in future articles and projects.

Last year, I shared two organized paper lists, one covering January to June and a second covering July to December.

Many readers told me that these lists were very useful, so, in that spirit, I created a new list for the first half of 2026. This includes the papers I bookmarked from January to May 2026.

Please do not consider this a complete list of everything published this year. So many papers are published every day that a complete list would not be feasible. Instead, it is a curated bibliography of papers I found interesting or relevant to my work. I read titles, abstracts, and subject matter carefully when organizing the list, but I must admit that I read only a subset of the papers in detail.

Why make these lists in the first place? When I work on an article, book section, code example, or lecture, I often remember that I saw a relevant paper somewhere, but finding it again can be surprisingly annoying. A categorized Markdown list solves that problem for me, and I hope it’s useful for you too. (Even in the age of LLM-based web search, keeping a dedicated bibliography is still very useful.)

This year, the list is again heavy on reasoning models, reinforcement learning, and efficient inference, because I’m biased towards bookmarking papers related to things I’m currently working on. However, compared to the 2025 lists, I have bookmarked more papers on agent harnesses, tool use, long contexts, diffusion language models, and practical serving infrastructure, since that is largely what I am working on right now and where the field is headed.

The categories of this research paper list are as follows. (Pro tip: In the web version of this article, you can use the table of contents on the left to jump to the sections that are most relevant to you.)

Architecture and Model Design
Efficient Training and Scaling
Inference Efficiency and KV Cache
Sparse Attention and Long Context
Reasoning and Test-Time Compute
Reinforcement Learning and RLVR
Agent Systems and Tool Use
Coding Agents and Software Engineering
Diffusion Language Models
Model Evaluation and Benchmarks

This first section collects papers on model architectures, model-release technical reports, and documents that help explain why current LLMs look the way they do.

One thing I find interesting about 2026 so far is that architecture work goes beyond making transformers bigger. There is a lot of work around

All these papers are quite interesting, which is why I bookmarked them in the first place. But if I had to pick one must-read, it would probably be Nemotron 3 Super, because the paper is very detailed (no sarcasm intended), and it describes the techniques used in a model that is already in production. It is also one of the best models in its size class.

One of the interesting aspects of Nemotron 3 is its hybrid-architecture design, meaning it alternates between regular attention layers and Mamba-2 (state space model) layers to be more efficient in long contexts. In 2026, long-context efficiency is king as more and more LLM agents get plugged into harnesses (OpenClaw, etc.) that require working with longer and longer contexts.
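To make the efficiency argument concrete, here is a minimal sketch of why alternating layer types helps with long contexts. It assumes that only attention layers keep a key-value cache that grows with sequence length, while state space layers keep a fixed-size state (ignored here). The layer count, the one-in-six attention ratio, the head configuration, and the helper names `hybrid_layer_schedule` and `kv_cache_bytes` are all hypothetical and are not taken from the Nemotron 3 paper.

```python
def hybrid_layer_schedule(num_layers: int, attention_every: int) -> list[str]:
    """Mark every `attention_every`-th layer as attention, the rest as Mamba-2."""
    if num_layers <= 0 or attention_every <= 0:
        raise ValueError("num_layers and attention_every must be positive")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(num_layers)
    ]


def kv_cache_bytes(schedule, seq_len, num_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size for one sequence; only attention layers store keys and values."""
    num_attention_layers = schedule.count("attention")
    # The factor 2 accounts for storing both keys and values.
    return 2 * num_attention_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
full = ["attention"] * 48
hybrid = hybrid_layer_schedule(num_layers=48, attention_every=6)

for name, schedule in [("full attention", full), ("hybrid", hybrid)]:
    size = kv_cache_bytes(schedule, seq_len=256_000, num_kv_heads=8, head_dim=128)
    print(f"{name}: {size / 2**30:.1f} GiB of KV cache")
```

In this toy setup, the cache shrinks in proportion to the number of attention layers that remain, which is the main reason such designs are attractive for long agent sessions.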

That said, the 120B-A12B model may be a bit large for local use on regular consumer hardware, but there is also a Nemotron 3 Nano (4B) version.
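As a rough back-of-the-envelope check, the weights-only memory footprint can be estimated from the parameter count and the number of bits per parameter. The sketch below assumes that all 120B parameters of the mixture-of-experts model have to be held in memory even though only about 12B are active per token, and it ignores the KV cache, activations, and any quantization scaling factors. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weights-only memory in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter counts as mentioned in the text above.
models = {"120B-A12B (total parameters)": 120e9, "4B": 4e9}

for name, num_params in models.items():
    for label, bits in [("16-bit", 16), ("4-bit", 4)]:
        print(f"{name}, {label} weights: ~{weight_memory_gb(num_params, bits):.0f} GB")
```

Under these assumptions, even a 4-bit version of the larger model needs tens of gigabytes for the weights alone, whereas the 4B model fits comfortably on typical consumer GPUs.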

Figure 1: Architecture of Nemotron 3 Super, which is a hybrid architecture using Mamba-2 layers.

Note that two days ago, NVIDIA also released a scaled-up version of this model, Nemotron 3 Ultra (550B-A55B), which scales up the embedding and projection dimensions but otherwise uses the same building blocks. If you’re interested in a visual, I posted about it on Substack Notes here.

This hybrid-architecture trend of alternating attention layers with non-attention layers is a relatively popular development this year. The most popular open-weight LLM series that uses a similar hybrid design is probably Qwen3.6, which uses Gated DeltaNet layers instead of Mamba-2 layers for the non-attention parts. For more information, see my Hybrid Attention (https://sebastianraschka.com/llm-architecture-gallery/hybrid-attention/) write-up, which aggregates information from several of my previous Substack articles on these designs.

Also, in the paper list below, you can see that there are now Mamba-3 and Gated DeltaNet-2 (i.e., new versions of Mamba-2 and Gated DeltaNet), and it would be interesting to see them in upcoming open-weight LLMs (e.g., Nemotron 4 and Qwen4?).

In addition to describing the hybrid-architecture design, the Nemotron 3 Super paper covers several other interesting topics and ablations, for example, multi-token prediction for speculative decoding, NVFP4 versus BF16 pretraining, synthetic MMLU-style data, and post-training quantization recipes, but covering these in detail would be beyond the scope of this overview.

January 1, Deep Delta Learning, https://arxiv.org/abs/2601.00417
January 6, MiMo-V2-Flash Technical Report, https://arxiv.org/abs/2601.02780
January 13, Ministral 3, https://arxiv.org/abs/2601.08584
January 29, Scaling Embeddings Outperforms Scaling Experts in Language Models, https://arxiv.org/abs/2601.21204
January 30, LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs, https://arxiv.org/abs/2602.00462
February 4, ERNIE 5.0 Technical Report, https://arxiv.org/abs/2602.04705
February 8, ViT-5: Vision Transformers for the Mid-2020s, https://arxiv.org/abs/2602.08071 (Most of this article is LLM-centric, but I couldn’t resist including a new major vision transformer design.)
February 11, Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters, https://arxiv.org/abs/2602.10604
February 12, Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts, https://arxiv.org/abs/2602.13367
February 16, Symmetry in language statistics shapes the geometry of model representations, https://arxiv.org/abs/2602.15029
February 17, GLM-5: From Vibe Coding to Agentic Engineering, https://arxiv.org/abs/2602.15763
February 18, Arcee Trinity Large Technical Report, https://www.arxiv.org/abs/2602.17004
March 4, The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks, https://arxiv.org/abs/2603.05498
March 12, Tiny Aya: Bridging Scale and Multilingual Depth, https://arxiv.org/abs/2603.11510
March 15, Attention Residuals, https://arxiv.org/abs/2603.15031
March 16, Mamba-3: Improved Sequence Modeling using State Space Principles, https://arxiv.org/abs/2603.15569
March 31, Attention to Mamba: A Recipe for Cross-Architecture Distillation, https://arxiv.org/abs/2604.14191
April 13, Nemotron 3 Super: Open, Efficient Mixture of Experts Hybrid Mamba-Transformer Models for Agentic Reasoning, https://arxiv.org/abs/2604.12374
May 6, ZAYA1-8B Technical Report, https://arxiv.org/abs/2605.05365
May 13, Delta Attention Residuals, https://arxiv.org/abs/2605.18855
May 21, Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention, https://arxiv.org/abs/2605.22791
May 25, MiniMax-M2 Series: Mini Activations Unleash Max Real-World Intelligence, https://arxiv.org/abs/2605.26494
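Since every entry in this list follows the same "date, title, URL" pattern, the list is easy to search programmatically. The snippet below is a minimal sketch of how one might parse such lines with Python's standard `re` module. The `Entry` type and the `parse_entry` helper are hypothetical names for this example, and the pattern assumes arXiv abstract URLs as used in the entries above.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    date: str
    title: str
    arxiv_id: str


# Date is everything before the first comma; the title may itself contain commas,
# so it is matched greedily up to the last ", https://...arxiv.org/abs/<id>".
ENTRY_RE = re.compile(
    r"^(?P<date>[^,]+),\s*(?P<title>.+),\s*"
    r"https?://(?:www\.)?arxiv\.org/abs/(?P<id>\d{4}\.\d{4,5})"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one 'date, title, URL' line; return None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return Entry(match["date"].strip(), match["title"].strip(), match["id"])


lines = [
    "February 17, GLM-5: From Vibe Coding to Agentic Engineering, https://arxiv.org/abs/2602.15763",
    "April 13, Nemotron 3 Super: Open, Efficient Mixture of Experts Hybrid Mamba-Transformer Models for Agentic Reasoning, https://arxiv.org/abs/2604.12374",
]
entries = [entry for entry in map(parse_entry, lines) if entry is not None]

# Simple keyword search over the parsed titles.
hits = [entry for entry in entries if "mamba" in entry.title.lower()]
for entry in hits:
    print(entry.date, "|", entry.title, "|", entry.arxiv_id)
```

Trailing notes after the URL, such as a parenthetical comment, are ignored by the pattern, so annotated entries parse the same way.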

This section is about training systems, optimization methods, and scaling recipes. These papers are not (all) about pre-training from scratch. Some focus on fine-tuning, distillation, test-time training, or improving training efficiency on limited hardware.

