Nous Research has introduced NousCoder-14B, a competitive Olympiad programming model trained on top of Qwen3-14B using reinforcement learning (RL) with verifiable rewards. On the LiveCodeBench v6 benchmark, which covers problems from 08/01/2024 to 05/01/2025, the model reaches a Pass@1 accuracy of 67.87 percent, 7.08 percentage points higher than the Qwen3-14B baseline of 60.79 percent on the same benchmark. The research team trained the model on 24k verifiable coding problems using 48 B200 GPUs over 4 days, and released the weights under the Apache 2.0 license on Hugging Face.

Benchmark focus: what does Pass@1 mean?
LiveCodeBench v6 is designed for competitive programming evaluation. The test partition used here has 454 problems. The training set uses the same recipe as Agentica and Together AI’s DeepCoder-14B project, combining problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and LiveCodeBench problems created before 07/31/2024.
The benchmark includes only competitive programming style tasks. For each problem, the solution must respect strict time and memory constraints and pass a large set of hidden input/output tests. Pass@1 is the fraction of problems where the first generated program passes all tests, including the time and memory constraints.
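With one sample per problem, Pass@1 reduces to the fraction of problems whose single generated program clears every hidden test. A minimal sketch, where the `results` list is a hypothetical per-problem pass/fail record rather than the benchmark's actual harness:

```python
def pass_at_1(results):
    """Pass@1 with one sample per problem: the fraction of problems
    whose single generated program passed every hidden test."""
    if not results:
        return 0.0
    return sum(1 for passed in results if passed) / len(results)

# Hypothetical run: 3 of 4 problems solved on the first attempt.
print(pass_at_1([True, True, False, True]))  # 0.75
```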


Dataset creation for performance based RL
All datasets used for training are composed of verifiable code generation problems. Each problem consists of a reference implementation and several test cases. The training set includes 24k problems:
- TACO Verified
- PrimeIntellect Synthetic 1
- LiveCodeBench problems released before 07/31/2024
The test set is LiveCodeBench v6, with 454 problems dated between 08/01/2024 and 05/01/2025.
Each problem is a complete competitive programming task with description, input format, output format, and test cases. This setup is important for RL because it gives a binary reward signal that is cheap to compute after the code is run.
RL environment with Atropos and Modal
The RL environment is built on the Atropos framework. NousCoder-14B is prompted with the standard LiveCodeBench prompt format and generates Python code for each problem. Each rollout receives a scalar reward based on the test-case results:
- +1 when the generated code passes all test cases for that problem
- -1 when the code produces a wrong answer, exceeds the 15-second time limit, or exceeds the 4 GB memory limit on any test case
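The scheme above can be sketched as a verdict-based binary reward, with the time limit enforced by a subprocess timeout. `run_test`, the verdict strings, and the interpreter invocation are illustrative assumptions; the 4 GB memory cap would be enforced by the sandbox container, not by this snippet:

```python
import subprocess

TIME_LIMIT_S = 15  # per-test time limit from the article

def run_test(program_path, stdin_data, expected_stdout):
    """Run one hidden test and return a verdict string (hypothetical
    harness; memory limiting is left to the sandbox container)."""
    try:
        out = subprocess.run(
            ["python", program_path],
            input=stdin_data, capture_output=True,
            text=True, timeout=TIME_LIMIT_S,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if out.returncode != 0:
        return "runtime_error"
    return "pass" if out.stdout.strip() == expected_stdout.strip() else "wrong_answer"

def rollout_reward(verdicts):
    """Binary reward: +1 only if every test passes; any wrong answer,
    timeout, or resource violation yields -1."""
    return 1 if verdicts and all(v == "pass" for v in verdicts) else -1
```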
To execute untrusted code securely and at scale, the team uses Modal as an autoscaling sandbox. In the main design, the system launches one sandbox container per rollout, and each container runs all test cases for that rollout. This separates training computation from verification computation and keeps the RL loop stable.
The research team also pipelines generation and verification. When an inference worker finishes a generation, it sends the completion to a verifier and immediately starts a new generation. With multiple inference workers and a fixed pool of sandbox containers, this design keeps the training loop bounded by inference computation rather than verification.
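A toy version of this decoupling, assuming a thread-plus-queue pipeline in place of Atropos' actual worker pool; `generate` and `score` are stand-ins for the model and the sandbox:

```python
import queue
import threading

def inference_worker(prompts, verify_q, generate):
    """Generate one completion per prompt; hand each completion to the
    verification queue and immediately move on to the next prompt."""
    for p in prompts:
        verify_q.put(generate(p))
    verify_q.put(None)  # sentinel: this worker is done

def verifier(verify_q, rewards, score):
    """Drain completions as they arrive, so scoring runs off the
    inference worker's critical path."""
    while True:
        item = verify_q.get()
        if item is None:
            break
        rewards.append(score(item))

# Hypothetical stand-ins for the model and the sandboxed test runner.
generate = lambda p: p.upper()
score = lambda completion: 1 if completion.endswith("!") else -1

q, rewards = queue.Queue(maxsize=8), []
t = threading.Thread(target=verifier, args=(q, rewards, score))
t.start()
inference_worker(["ok!", "fail", "done!"], q, generate)
t.join()
print(rewards)  # [1, -1, 1]
```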
The research team also discusses verification parallelization strategies: spawning one container per problem, one per rollout, or one per test case. Due to container launch overhead, they avoid the per-test-case setting and instead have each container evaluate multiple test cases, running a small set of the hardest test cases first. If any of these fails, the system can abort verification early.
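The hardest-tests-first early-abort strategy can be sketched as follows; the `(difficulty, case)` representation and the `hardest_first_k` knob are assumptions for illustration:

```python
def verify_with_early_abort(run_test, test_cases, hardest_first_k=3):
    """Evaluate a rollout's tests in one container, probing a small set
    of the hardest tests first and aborting on the first failure.
    `test_cases` is a list of (difficulty, case); `run_test` -> bool."""
    ordered = sorted(test_cases, key=lambda t: -t[0])  # hardest first
    probe, rest = ordered[:hardest_first_k], ordered[hardest_first_k:]
    for _, case in probe + rest:
        if not run_test(case):
            return False  # skip the remaining tests
    return True
```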
GRPO objectives: DAPO, GSPO, and GSPO+
NousCoder-14B uses Group Relative Policy Optimization (GRPO), which does not require a separate value (critic) model. The research team tests three objectives on top of GRPO: Dynamic Sampling Policy Optimization (DAPO), Group Sequence Policy Optimization (GSPO), and a modified GSPO variant called GSPO+.
All three objectives share the same definition of advantage: the advantage of each rollout is its reward normalized by the mean and standard deviation of the rewards within its group. DAPO applies importance weighting and clipping at the token level and introduces three main changes relative to GRPO:
- A clip higher rule that increases exploration for low probability tokens
- A token-level policy gradient loss that gives equal weight to each token
- Dynamic sampling, where groups that are all correct or all incorrect are removed because they have no advantage
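The shared group-normalized advantage and DAPO's dynamic sampling filter can be sketched in plain Python; the epsilon term is an assumed numerical-safety detail:

```python
def group_advantages(rewards, eps=1e-6):
    """Advantage of each rollout = (reward - group mean) / group std,
    computed within one group of rollouts for the same problem."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards):
    """Dynamic sampling: drop groups that are all-correct or all-wrong,
    since their normalized advantages carry no learning signal."""
    return len(set(rewards)) > 1
```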
GSPO moves the importance weight to the sequence level. It defines a sequence importance ratio that aggregates token ratios across the entire program. GSPO+ keeps the sequence level correction, but it rescales the gradients so that tokens are weighted equally regardless of sequence length.
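The token-level versus sequence-level distinction can be illustrated numerically: GSPO's sequence ratio is the length-normalized geometric mean of the token ratios, i.e. the exponential of the mean token log-ratio. A sketch, not the paper's exact implementation:

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios, as used by token-level objectives
    such as DAPO."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def gspo_sequence_ratio(logp_new, logp_old):
    """GSPO-style sequence-level ratio: the length-normalized geometric
    mean of the token ratios, i.e. exp(mean token log-ratio)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))
```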
On LiveCodeBench v6, the difference between these objectives is minor. At a context length of 81,920 tokens, DAPO reaches a Pass@1 of 67.87 percent, while GSPO and GSPO+ reach 66.26 percent and 66.52 percent. At 40,960 tokens, all three objectives converge around 63 percent Pass@1.
Iterative context expansion and overlong filtering
Qwen3-14B supports long contexts, and training follows an iterative context expansion schedule. The team first trains the model with a 32k context window and then continues training at the maximum Qwen3-14B context window of 40k. At each stage they select the checkpoint with the best LiveCodeBench score at 40k context, and then use YaRN context extension at evaluation time to reach 80k tokens, i.e. 81,920 tokens.
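Evaluation-time YaRN extension is typically expressed as a RoPE scaling configuration. A hypothetical fragment for doubling a 40,960-token native window; the exact keys and values vary across transformers/vLLM versions and are not confirmed by the article:

```python
# Hypothetical YaRN RoPE-scaling config for a Qwen3-style model whose
# native window is 40,960 tokens, extended to 81,920 at evaluation time.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,  # 40,960 * 2.0 = 81,920 tokens
    "original_max_position_embeddings": 40960,
}
```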
One key trick is overlong filtering. When a generated program exceeds the maximum context window, they set its advantage to zero. This removes that rollout from the gradient signal rather than penalizing it. The research team reports that this approach avoids pushing the model toward shorter solutions for purely optimization reasons and helps maintain quality when they scale the context length at test time.
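The filtering rule amounts to masking truncated rollouts out of the gradient; the dict layout below is an illustrative assumption, not the actual pipeline's data structure:

```python
def mask_overlong(rollouts, max_len):
    """Overlong filtering: rollouts truncated at the context window are
    removed from the gradient (advantage set to zero) instead of being
    penalized, so the model is not pushed toward shorter solutions."""
    for r in rollouts:
        if r["num_tokens"] >= max_len:
            r["advantage"] = 0.0
    return rollouts
```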
Key takeaways
- NousCoder-14B, a Qwen3-14B based competitive programming model trained with verifiable-reward RL, reaches 67.87 percent Pass@1 on LiveCodeBench v6, 7.08 percentage points higher than the Qwen3-14B baseline of 60.79 percent on the same benchmark.
- The model is trained on 24k verifiable coding problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and LiveCodeBench tasks from before 07/31/2024, and evaluated on a disjoint LiveCodeBench v6 test set of 454 problems from 08/01/2024 to 05/01/2025.
- The RL setup uses Atropos, with Python solutions executed in sandboxed Modal containers, a simple reward of +1 for passing all test cases and -1 for any failure or resource-limit violation, and a pipelined design where inference and verification run asynchronously.
- Group-relative policy optimization objectives DAPO, GSPO, and GSPO+ are used for long-context code RL; all operate on group-normalized rewards and show similar performance, with DAPO reaching the best Pass@1 at the longest context of 81,920 tokens.
- Training uses iterative context expansion, first at 32k and then 40k tokens, with YaRN-based extension to 81,920 tokens at evaluation time, includes overlong rollout filtering for stability, and ships as a fully reproducible open stack with Apache 2.0 weights and RL pipeline code.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of Machine Learning and Deep Learning news that is both technically sound and easily understood by a wide audience. The platform boasts over 2 million monthly views.