As some of you know, I have a longtime habit of keeping a running list of research papers that I want to read, revisit, or cite in future articles and projects.
Last year, I shared two organized paper lists, one covering January to June and a second covering July to December.
Many readers told me that these lists were very useful, so, in that spirit, I created a new list for the first half of 2026. This includes the papers I bookmarked from January to May 2026.
Please do not consider this a complete list of everything published this year. So many papers are published every day that covering them all would not be possible. Instead, it is a curated bibliography based on papers I found interesting or relevant to my work. I read titles, abstracts, and subject matter carefully when organizing the list, but I must admit that I read only a subset of the papers in detail.
Why make these lists in the first place? When I work on an article, book section, code example, or lecture, I often remember that I saw the relevant paper somewhere, but having to find it again can be surprisingly annoying. A categorized Markdown list solves that problem for me, and I hope it’s useful for you too. (Even in the age of LLM-based web searching, keeping a specific bibliography is still very useful.)
This year, the list is again heavy on reasoning models, reinforcement learning, and efficient inference, because I’m biased towards bookmarking papers related to things I’m currently working on. However, compared to the 2025 lists, I have bookmarked more papers on agent harnesses, tool use, long contexts, diffusion language models, and practical serving infrastructure, since that is largely what I am working on right now and where the field is headed.
The categories of this research paper list are as follows. (Pro tip: In the web version of this article, you can use the table of contents on the left to jump to the sections that are most relevant to you.)
- Architecture and Model Design
- Efficient Training and Scaling
- Inference Efficiency and KV Cache
- Sparse Attention and Long Context
- Reasoning and Test-Time Compute
- Reinforcement Learning and RLVR
- Agent Systems and Tool Use
- Coding Agents and Software Engineering
- Diffusion Language Models
- Model Evaluation and Benchmarks
This first section collects papers on model architectures, model-release technical reports, and documents that help explain why current LLMs look the way they do.
One thing I find interesting about 2026 so far is that architecture work goes beyond making transformers bigger. There is a lot of work around
All these papers are quite interesting, which is why I bookmarked them in the first place. But if I had to pick one must-read, it would probably be Nemotron 3 Super, because the paper is very nicely detailed (no sarcasm intended), and it describes the techniques used in a model that is already in production. And, after all, it is one of the best models in its size class.
One of the interesting aspects of Nemotron 3 is its hybrid-architecture design, meaning it alternates between regular attention layers and Mamba-2 (state space model) layers to be more efficient in long contexts. In 2026, long-context efficiency is king as more and more LLM agents get plugged into harnesses (OpenClaw, etc.) that require working with longer and longer contexts.
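To make the long-context argument more concrete, here is a minimal back-of-the-envelope sketch. It compares the KV cache size of a pure-attention stack with a hybrid stack in which only some layers are attention layers. All numbers (48 layers, one attention layer per 6 layers, 8 KV heads, head dimension 128, 2 bytes per value) are illustrative assumptions and not the actual Nemotron 3 configuration. The sketch also ignores the small, fixed-size recurrent state that the Mamba-2 layers keep.

```python
def build_layer_pattern(num_layers: int, attention_every: int) -> list[str]:
    """Hypothetical schedule: one attention layer per `attention_every` layers."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(num_layers)
    ]


def kv_cache_bytes(pattern, seq_len, num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Only attention layers store keys and values for every token."""
    num_attention_layers = sum(layer == "attention" for layer in pattern)
    # Factor 2 accounts for storing both keys and values
    return 2 * num_attention_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


# Assumed toy configurations (not taken from any model card)
full = ["attention"] * 48
hybrid = build_layer_pattern(num_layers=48, attention_every=6)

for seq_len in (8_192, 131_072, 1_048_576):
    full_gib = kv_cache_bytes(full, seq_len) / 2**30
    hybrid_gib = kv_cache_bytes(hybrid, seq_len) / 2**30
    print(f"{seq_len:>9} tokens | full: {full_gib:7.1f} GiB | hybrid: {hybrid_gib:7.1f} GiB")
```

In this toy setting, the KV cache still grows linearly with the sequence length in both cases, but the hybrid stack only pays that cost for its attention layers, which is where the savings at long contexts come from.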
That being said, the 120B-A12B model may be a bit large for local use on regular consumer hardware, but there is also a Nemotron 3 Nano (4B) version.
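As a rough illustration of why the larger model is hard to run locally, the sketch below estimates the memory needed just to hold the weights at a few precisions. It assumes the usual reading of "120B-A12B" as 120B total and 12B active parameters. In a mixture-of-experts model, all experts typically need to be resident in memory even though only a fraction is used per token, so the total count is what matters here. The bytes-per-parameter values are simplifications that ignore quantization scales, the KV cache, and activations.

```python
# Approximate storage cost per parameter (assumed, ignores quantization overhead)
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "4-bit": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


models = {
    "120B-A12B (total params)": 120e9,
    "Nano (4B)": 4e9,
}

for name, num_params in models.items():
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(num_params, precision)
        print(f"{name:>26} @ {precision:>5}: {gb:6.1f} GB")
```

Even at 4 bits per parameter, the weights of the larger model alone exceed what typical consumer GPUs offer, whereas the 4B model fits comfortably.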
Note that two days ago, Nvidia also released a scaled-up version of this, the Nemotron 3 Ultra (550B-A55B), which scales up the embedding and projection dimensions but otherwise uses the same building blocks. If you’re interested in a visual, I posted one on Substack Notes.
This hybrid-architecture trend of alternating attention layers with other layer types is a relatively popular development this year. Probably the most popular open-weight LLM series that uses a similar hybrid design is Qwen3.6, which uses Gated DeltaNet layers instead of Mamba-2 layers for the non-attention parts. For more information, see my Hybrid Attention (https://sebastianraschka.com/llm-architecture-gallery/hybrid-attention/) write-up, which aggregates information from several of my previous Substack articles where I wrote about these.
Also, in the paper list below, you can see that there are now Mamba-3 and Gated DeltaNet-2 (i.e., new versions of Mamba-2 and Gated DeltaNet), and it would be interesting to see them in upcoming open-weight LLMs (e.g., Nemotron 4 and Qwen4?).
In addition to describing the hybrid-architecture design, the Nemotron 3 Super paper covers several other interesting ablations, for example, multi-token prediction for speculative decoding, NVFP4 vs. BF16 pretraining, synthetic MMLU-style data, and post-training quantization recipes. Covering these in detail would be beyond the scope of this overview.
- January 1, Deep Delta Learning, https://arxiv.org/abs/2601.00417
- January 6, MiMo-V2-Flash Technical Report, https://arxiv.org/abs/2601.02780
- January 13, Ministral 3, https://arxiv.org/abs/2601.08584
- January 29, Scaling Embeddings Outperforms Scaling Experts in Language Models, https://arxiv.org/abs/2601.21204
- January 30, LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs, https://arxiv.org/abs/2602.00462
- February 4, ERNIE 5.0 Technical Report, https://arxiv.org/abs/2602.04705
- February 8, ViT-5: Vision Transformers for the Mid-2020s, https://arxiv.org/abs/2602.08071 (Most of this article is LLM-centric, but I couldn’t resist including a new major vision transformer design.)
- February 11, Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters, https://arxiv.org/abs/2602.10604
- February 12, Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts, https://arxiv.org/abs/2602.13367
- February 16, Symmetry in Language Statistics Shapes the Geometry of Model Representations, https://arxiv.org/abs/2602.15029
- February 17, GLM-5: From Vibe Coding to Agentic Engineering, https://arxiv.org/abs/2602.15763
- February 18, Arcee Trinity Large Technical Report, https://www.arxiv.org/abs/2602.17004
- March 4, The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks, https://arxiv.org/abs/2603.05498
- March 12, Tiny Aya: Bridging Scale and Multilingual Depth, https://arxiv.org/abs/2603.11510
- March 15, Attention Residuals, https://arxiv.org/abs/2603.15031
- March 16, Mamba-3: Improved Sequence Modeling Using State Space Principles, https://arxiv.org/abs/2603.15569
- March 31, Attention to Mamba: A Recipe for Cross-Architecture Distillation, https://arxiv.org/abs/2604.14191
- April 13, Nemotron 3 Super: Open, Efficient Mixture of Experts Hybrid Mamba-Transformer Models for Agentic Reasoning, https://arxiv.org/abs/2604.12374
- May 6, ZAYA1-8B Technical Report, https://arxiv.org/abs/2605.05365
- May 13, Delta Attention Residuals, https://arxiv.org/abs/2605.18855
- May 21, Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention, https://arxiv.org/abs/2605.22791
- May 25, MiniMax-M2 Series: Mini Activations Unleash Max Real-World Intelligence, https://arxiv.org/abs/2605.26494
This section is about training systems, optimization methods, and scaling recipes. Not all of these papers are about pre-training from scratch. Some focus on fine-tuning, distillation, test-time training, or improving training performance on limited hardware.