Designing algorithms for multi-agent reinforcement learning (MARL) in imperfect-information games – scenarios where players act sequentially and cannot see each other's private information, such as poker – has historically relied on manual iteration. Researchers identify weighting schemes, discounting rules, and meta-strategy solvers through intuition and trial-and-error. Google DeepMind researchers have proposed AlphaEvolve, an LLM-powered evolutionary coding agent that replaces that manual process with automated search.
The research team applies this framework to two established paradigms: Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO). In both cases, the system discovers new algorithm variants that perform competitively with, or better than, the existing hand-designed state-of-the-art baselines. All experiments were run using the OpenSpiel framework.
Background: CFR and PSRO
CFR is an iterative algorithm that decomposes regret minimization into information sets. At each iteration it accumulates ‘counterfactual regrets’ – how much a player would have achieved by playing differently – and derives a new policy proportional to the positive accumulated regrets. Over several iterations, the time-average strategy converges to a Nash Equilibrium (NE). Variants such as DCFR (Discounted CFR) and PCFR+ (Predictive CFR+) improve convergence by applying specific discounting or predictive updating rules, all developed through manual design.
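The core regret-matching step described above – derive the next policy in proportion to the positive accumulated regrets – can be sketched in a few lines. This is a minimal single-information-set illustration, not the paper's implementation:

```python
import numpy as np

def regret_matching_policy(cumulative_regrets: np.ndarray) -> np.ndarray:
    """Derive a policy proportional to the positive cumulative regrets."""
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # No positive regret anywhere: fall back to the uniform policy.
    return np.full_like(cumulative_regrets, 1.0 / len(cumulative_regrets))

# Example: action 0 has accumulated regret 3, action 1 has 1, action 2 is negative.
policy = regret_matching_policy(np.array([3.0, 1.0, -2.0]))
```

Negative-regret actions receive zero probability, which is exactly why variants differ mainly in *how* the cumulative regrets are maintained (clipping, discounting, prediction) rather than in this derivation step.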
PSRO operates at a higher level of abstraction. It maintains a population of policies for each player, builds a payoff tensor (the meta-game) by computing the expected utilities for each combination of population policies, and then uses a meta-strategy solver to produce a probability distribution over the population. Best responses are trained against that distribution and iteratively added to the population. The meta-strategy solver – how the population distribution is computed – is the central design choice that the paper targets for automated search. All experiments use an exact best-response oracle (computed via value iteration) and exact payoff values for all meta-game entries, which removes Monte Carlo sampling noise from the results.
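The PSRO loop described above can be sketched on a toy matrix game. This is an illustrative skeleton, not the paper's code: the game, the uniform meta-solver, and the `psro_step` helper are all stand-ins:

```python
import numpy as np

# Toy zero-sum matrix game (rock-paper-scissors) standing in for the underlying game.
PAYOFF = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)

def best_response(opponent_mix: np.ndarray) -> int:
    """Exact best-response oracle: the pure strategy with highest expected payoff."""
    return int(np.argmax(PAYOFF @ opponent_mix))

def psro_step(population: list, meta_solver) -> list:
    """One PSRO iteration: build the meta-game, solve it, add a best response."""
    # Meta-game: payoffs between every pair of population policies.
    meta_game = PAYOFF[np.ix_(population, population)]
    mix_over_population = meta_solver(meta_game)
    # Expected opponent behaviour in the underlying game.
    opponent_mix = np.zeros(3)
    for policy, weight in zip(population, mix_over_population):
        opponent_mix[policy] += weight
    return population + [best_response(opponent_mix)]

# Uniform meta-solver: the simplest baseline the paper compares against.
uniform_solver = lambda meta_game: np.full(len(meta_game), 1.0 / len(meta_game))
population = [0]                                    # start with "rock" only
population = psro_step(population, uniform_solver)  # adds "paper", the BR to rock
```

Everything in the loop is fixed machinery except `meta_solver` – which is precisely the slot the paper's automated search rewrites.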
AlphaEvolve Framework
AlphaEvolve is a distributed evolutionary system that uses an LLM to mutate source code rather than numerical parameters. The procedure: a population is initialized with a standard implementation (CFR+ as the seed for the CFR experiments; the same seed implementation for both PSRO solver classes). Each generation, a parent algorithm is selected based on fitness; its source code is sent to the LLM (Gemini 2.5 Pro) along with instructions on how to modify it; the resulting candidates are evaluated on proxy games; and valid candidates are added to the population. AlphaEvolve supports multi-objective optimization – if multiple fitness metrics are defined, one is randomly chosen per generation to guide parent selection.
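The generational loop just described can be sketched as follows. This is a skeleton under stated assumptions: `mutate` stands in for the LLM rewriting step, `evaluate` for the proxy-game fitness measurement, and the tournament selection is an illustrative choice, not AlphaEvolve's actual selection rule:

```python
import random

def evolve(seed_program: str, mutate, evaluate, generations: int = 20, rng=random):
    """Skeleton of an LLM-driven evolutionary code search. Population entries
    are (program, fitness) pairs; higher fitness is better."""
    population = [(seed_program, evaluate(seed_program))]
    for _ in range(generations):
        # Select a parent from a small tournament, favouring higher fitness.
        contenders = rng.sample(population, min(3, len(population)))
        parent_code, _ = max(contenders, key=lambda p: p[1])
        child = mutate(parent_code)           # in AlphaEvolve: the LLM call
        if child is not None:                 # only valid candidates survive
            population.append((child, evaluate(child)))
    return max(population, key=lambda p: p[1])

# Toy run: "programs" are strings, fitness is their length, mutation appends a char.
best, fitness = evolve("a", mutate=lambda s: s + "x", evaluate=len, generations=10)
```

The essential point is that the unit of variation is the program text itself, so the search space is "anything the LLM can write that passes validation", not a bounded hyperparameter grid.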
The fitness signal after k iterations is the negative exploitability, evaluated on a fixed set of training games: 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liar's Dice. The final evaluation is performed on a separate test set of larger, unseen games.
For CFR, the evolvable search space consists of three Python classes: RegretAccumulator, PolicyFromRegretAccumulator, and PolicyAccumulator. These govern regret accumulation, current-policy derivation, and average-policy accumulation, respectively. The interface is expressive enough to represent all known CFR variants as special cases. For PSRO, the evolvable components are TrainMetaStrategySolver and EvaluateMetaStrategySolver – the meta-strategy solvers used during best-response (oracle) training and during exploitability evaluation, respectively.
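The three-class CFR interface might look like the sketch below. The class names come from the paper, but the method signatures and the CFR+-style bodies are illustrative reconstructions, not the paper's actual code:

```python
import numpy as np

class RegretAccumulator:
    """How instantaneous regrets fold into cumulative regrets (CFR+ style here)."""
    def update(self, cumulative, instantaneous, iteration):
        return np.maximum(cumulative + instantaneous, 0.0)  # CFR+: clip at zero

class PolicyFromRegretAccumulator:
    """How the current policy is derived from cumulative regrets."""
    def policy(self, cumulative, iteration):
        positive = np.maximum(cumulative, 0.0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.full(len(cumulative), 1.0 / len(cumulative))

class PolicyAccumulator:
    """How the average policy is accumulated (linear weighting, as in CFR+)."""
    def update(self, average, current, iteration):
        return average + iteration * current

# One illustrative call: accumulate instantaneous regrets [0.5, 0.5].
acc = RegretAccumulator()
regrets = acc.update(np.array([1.0, -2.0]), np.array([0.5, 0.5]), iteration=1)
```

Because vanilla CFR, CFR+, LCFR, and DCFR differ only in these three bodies, evolving the bodies lets the search cover the whole known family and everything in between.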
Discovered Algorithm 1: VAD-CFR
The discovered CFR variant is Volatility-Adaptive Discounted CFR (VAD-CFR). Instead of the linear averaging and static discounting used across the CFR family, the search yielded three distinct mechanisms:
- Volatility-adaptive discounting. Instead of fixed discount factors α and β applied to cumulative regrets (as in DCFR), VAD-CFR tracks the instability of the learning process using exponentially weighted moving averages (EWMA) of instantaneous regret magnitudes. When volatility is high, the discount increases so the algorithm forgets volatile history faster; when volatility is low, it retains more history. The EWMA decay factor is 0.1, with base α = 1.5 and base β = −0.1.
- Asymmetric instantaneous boosting. Positive instantaneous regrets are multiplied by a factor of 1.1 before being added to the cumulative regrets. This asymmetry applies to the instantaneous updates, not to the accumulated history, making the algorithm more responsive to currently good actions.
- Hard warm start with regret-magnitude weighting. Policy averaging is postponed entirely until iteration 500; regret accumulation continues normally during this phase. Once averaging begins, policies are weighted by a combination of temporal weights and instantaneous regret magnitudes, prioritizing higher-information iterations when constructing the average strategy. The 500-iteration threshold was generated by the LLM without knowledge of the 1000-iteration evaluation horizon.
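The three mechanisms above might compose as in the sketch below. The constants (0.1 EWMA decay, 1.1 boost, 500-iteration warm start) are from the article, but the exact formula mapping volatility to a discount – here a simple 1/(1 + volatility) damping – is an assumption, not the paper's rule:

```python
import numpy as np

EWMA_DECAY = 0.1   # decay factor reported for the volatility tracker
BOOST = 1.1        # asymmetric boost on positive instantaneous regrets
WARM_START = 500   # iteration at which policy averaging begins

class VADCFRSketch:
    """Illustrative composition of VAD-CFR's three mechanisms (single info set)."""
    def __init__(self, num_actions):
        self.cumulative = np.zeros(num_actions)
        self.average_policy = np.zeros(num_actions)
        self.volatility = 0.0

    def step(self, instantaneous, current_policy, iteration):
        # 1) Track instability via an EWMA of instantaneous regret magnitudes.
        self.volatility = ((1 - EWMA_DECAY) * self.volatility
                           + EWMA_DECAY * np.abs(instantaneous).mean())
        # Higher volatility -> keep less of the accumulated history (assumed form).
        retain = 1.0 / (1.0 + self.volatility)
        # 2) Boost positive instantaneous regrets before accumulation.
        boosted = np.where(instantaneous > 0, BOOST * instantaneous, instantaneous)
        self.cumulative = retain * self.cumulative + boosted
        # 3) Hard warm start: skip policy averaging entirely until iteration 500.
        if iteration >= WARM_START:
            weight = iteration * np.abs(instantaneous).sum()  # regret-magnitude weight
            self.average_policy += weight * current_policy
```

The point of the sketch is the interplay: discounting reacts to measured instability rather than a fixed schedule, and the average strategy simply never sees the noisy early iterations.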
VAD-CFR is benchmarked against standard CFR, CFR+, Linear CFR (LCFR), DCFR, PCFR+, DPCFR+, and HS-PCFR+(30) over K = 1000 iterations. Exploitability is computed exactly. Over the full 11-game evaluation, VAD-CFR matches or exceeds state-of-the-art performance on 10 out of 11 games, with 4-player Kuhn Poker as the only exception.
Also discovered: AOD-CFR. A preliminary run on a different training set (2-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, 4-sided Liar's Dice) produced a second variant, Asymmetric Optimistic Discounted CFR (AOD-CFR). It uses a linear schedule on the cumulative-regret discounts (α ramps from 1.0 → 2.5 and β from 0.5 → 0.0 over 500 iterations), sign-dependent scaling of instantaneous regrets, trend-based policy optimism via an exponential moving average of cumulative regrets, and polynomial policy averaging with an exponent γ scaling from 1.0 → 5.0. The research team reports that it achieves performance competitive with VAD-CFR while using more conventional mechanisms.
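The linear ramps on AOD-CFR's discounts are simple to state in code. The endpoints and the 500-iteration horizon are from the article; the exact interpolation (clamped linear, below) is an assumption:

```python
def linear_schedule(start: float, end: float, iteration: int, horizon: int = 500) -> float:
    """Clamped linear ramp from `start` to `end` over `horizon` iterations,
    as a sketch of AOD-CFR's alpha/beta discount schedules."""
    t = min(iteration, horizon) / horizon
    return start + t * (end - start)

alpha = linear_schedule(1.0, 2.5, 250)  # midway through the ramp
beta = linear_schedule(0.5, 0.0, 500)   # ramp complete
```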
Discovered Algorithm 2: SHOR-PSRO
The discovered PSRO variant is Smoothed Hybrid Optimistic Regret PSRO (SHOR-PSRO). The search produced a hybrid meta-solver that forms its meta-strategy by linearly mixing two components at each internal solver iteration:
- σ_ORM (Optimistic Regret Matching): the regret-minimization component, providing stability. Payoffs are computed, optionally normalized and diversity-adjusted, then used to update the cumulative regrets via regret matching. A momentum term is applied to the payoff updates.
- σ_Softmax (Smoothed Best Pure Strategy): a Boltzmann distribution over pure strategies, biased toward the highest-payoff strategy. A temperature parameter controls the concentration – lower temperature concentrates the distribution on the best pure strategy.
σ_Hybrid = (1 − λ) · σ_ORM + λ · σ_Softmax
The training-time solver uses a dynamic annealing schedule over external PSRO iterations. The blending factor λ decreases from 0.3 → 0.05 (shifting from greedy exploitation to equilibrium finding), the diversity bonus decreases from 0.05 → 0.001 (enabling early population exploration and then late-stage refinement), and the softmax temperature drops from 0.5 → 0.01. The number of internal solver iterations also scales with the population size. The training solver returns the time-averaged strategy over internal iterations for stability.
The evaluation-time solver uses fixed parameters: λ = 0.01, diversity bonus = 0.0, temperature = 0.001. It runs more internal iterations (base 8000, scaling with population size) and returns the last-iteration strategy rather than the average, for a sharp, low-noise exploitability estimate. This training/evaluation asymmetry was itself a product of the search, not a human design choice.
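The hybrid mix and the training-time annealing can be sketched as follows. The mixing formula and the schedule endpoints are from the article; the linear shape of the annealing and the inputs to the softmax component are assumptions, and the optimistic regret-matching inner loop that would produce `sigma_orm` is not reproduced:

```python
import numpy as np

def softmax(x: np.ndarray, temperature: float) -> np.ndarray:
    """Numerically stable Boltzmann distribution over pure-strategy payoffs."""
    z = np.exp((x - x.max()) / temperature)
    return z / z.sum()

def hybrid_meta_strategy(sigma_orm, payoffs, lam, temperature):
    """sigma_hybrid = (1 - lam) * sigma_ORM + lam * sigma_Softmax."""
    sigma_softmax = softmax(payoffs, temperature)
    return (1 - lam) * sigma_orm + lam * sigma_softmax

def annealed(start, end, k, K):
    """Training-time annealing over external PSRO iterations (linear; assumed shape)."""
    return start + (end - start) * k / K

# Training-time mixing at external iteration k of K:
k, K = 50, 100
lam = annealed(0.3, 0.05, k, K)    # blending factor, 0.3 -> 0.05
temp = annealed(0.5, 0.01, k, K)   # softmax temperature, 0.5 -> 0.01
sigma = hybrid_meta_strategy(np.array([0.6, 0.4]), np.array([1.0, 0.0]), lam, temp)
```

Since both components are probability distributions, any convex mixture is again a valid meta-strategy, so λ can be annealed freely without renormalization.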
SHOR-PSRO is benchmarked against Uniform, Nash (via linear programming for 2-player games), AlphaRank, Projected Replicator Dynamics (PRD), and Regret Matching (RM) using K = 100 PSRO iterations. Over the full 11-game evaluation, SHOR-PSRO matches or exceeds state-of-the-art performance on 8 out of 11 games.
Experimental Setup
The evaluation protocol separates training and test games to assess generalization. The training set for both the CFR and PSRO experiments consists of 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liar's Dice. The test sets used in the main body of the paper include 4-player Kuhn Poker, 3-player Leduc Poker, 5-card Goofspiel, and 6-sided Liar's Dice – larger and more complex variants not seen during evolution. The complete sweep of 11 games is included in the appendix. Algorithms are frozen after the training-phase search, before test evaluation begins.
Key Takeaways
- AlphaEvolve automates algorithm design – instead of tuning hyperparameters, it evolves the actual Python source code of MARL algorithms using Gemini 2.5 Pro as the mutation operator, searching for entirely new update rules rather than variations of existing ones.
- VAD-CFR replaces static discounting with volatility awareness – it tracks instantaneous regret magnitudes via EWMA and dynamically adjusts its discount factors, and it delays policy averaging entirely until iteration 500, a threshold the LLM chose without knowledge of the 1000-iteration evaluation horizon.
- SHOR-PSRO automates the exploration-to-exploitation transition – by annealing the blending factor between optimistic regret matching and a softmax best-pure-strategy component during training, it removes the need to manually tune the PSRO meta-solver's shift from population diversity to equilibrium refinement.
- Generalization is tested, not assumed – both algorithms are evolved on a set of four games and evaluated on a separate set of larger, unseen games. VAD-CFR prevails on 10 out of 11 games and SHOR-PSRO on 8 out of 11, with no re-tuning between training and testing.
- The discovered mechanisms are non-intuitive by design – choices like a hard warm start at iteration 500, asymmetric boosting of positive regrets by exactly 1.1, and different training/evaluation solver configurations are not options human researchers typically reach for, which the authors present as the main rationale for automated exploration of this design space.
Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.