A coding tutorial on OpenMythos on Recurrent-Depth Transformer with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing

In this tutorial, we explore the implementation of openmythosA theoretical reconstruction of the Cloud Mythos architecture that enables deeper reasoning through iterative computation rather than increased parameter size. We build and analyze models using both GQA and MLA attention mechanisms, check memory efficiency through KV-cache comparisons, and validate consistency through the spectral properties of periodic updates. We then train the model on a structured parity task and investigate how increasing the depth of the loop on inference improves performance without retraining. Additionally, we also observe adaptive computation via ACT halving and expert usage monitoring in MOE layers, providing a comprehensive, practical understanding of this emerging architecture.

import subprocess, sys
try:
   import open_mythos  # noqa: F401
except ImportError:
   subprocess.check_call((sys.executable, "-m", "pip", "install", "-q",
                          "open-mythos"))


import math, time, copy
from collections import Counter, defaultdict


import numpy as np
import torch, torch.nn as nn, torch.nn.functional as F
import matplotlib.pyplot as plt


from open_mythos.main import (
   OpenMythos, MythosConfig,
   ACTHalting, MoEFFN,
)


torch.manual_seed(0); np.random.seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"▸ device = {device}   |   torch = {torch.__version__}")


def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
               max_loops=8, seq_len=128, vocab=256):
   base = dict(
       vocab_size=vocab, dim=dim, n_heads=n_heads,
       max_seq_len=seq_len, max_loop_iters=max_loops,
       prelude_layers=1, coda_layers=1,
       n_experts=n_experts, n_shared_experts=1,
       n_experts_per_tok=2, expert_dim=dim // 2,
       lora_rank=8, attn_type=attn_type,
   )
   if attn_type == "gqa":
       return MythosConfig(**base, n_kv_heads=2)
   return MythosConfig(
       **base, n_kv_heads=n_heads,
       kv_lora_rank=32, q_lora_rank=64,
       qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
   )


cfg_gqa = make_config("gqa")
cfg_mla = make_config("mla")
m_gqa = OpenMythos(cfg_gqa).to(device)
m_mla = OpenMythos(cfg_mla).to(device)


print("n─── Part 1 ─ model sizes ──────────────────────────────")
print(f"GQA  params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
print(f"MLA  params : {sum(p.numel() for p in m_mla.parameters()):>10,}")

We install and import all the required dependencies and initialize our environment to run OpenMythos. We construct configurations for both GQA and MLA attention mechanisms and instantiate their respective models. We also compare their parameter sizes to understand how architectural differences affect model scale.

def cache_bytes(kv: dict) -> int:
   total = 0
   for entry in kv.values():
       for t in entry.values():
           total += t.element_size() * t.numel()
   return total


x = torch.randint(0, 256, (1, 64), device=device)
ck_gqa, ck_mla = {}, {}
with torch.no_grad():
   m_gqa(x, n_loops=4, kv_cache=ck_gqa)
   m_mla(x, n_loops=4, kv_cache=ck_mla)


gqa_kb = cache_bytes(ck_gqa) / 1024
mla_kb = cache_bytes(ck_mla) / 1024
print("n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
print(f"GQA cache : {gqa_kb:6.2f} KB   ({len(ck_gqa)} layer-keys)")
print(f"MLA cache : {mla_kb:6.2f} KB   ({len(ck_mla)} layer-keys)")
print(f"ratio      : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")


def show_stability(model, tag):
   A = model.recurrent.injection.get_A()
   print(f"{tag:3s}  ρ(A): min={A.min():.4f}  max={A.max():.4f}  "
         f"mean={A.mean():.4f}  stable={bool((A < 1).all() and (A > 0).all())}")


print("n─── Part 3 ─ spectral radius at init ──────────────────")
show_stability(m_gqa, "GQA")
show_stability(m_mla, "MLA")


opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
for _ in range(30):
   loss = m_mla(torch.randint(0, 256, (2, 16), device=device),
                n_loops=2).square().mean()
   opt.zero_grad(); loss.backward(); opt.step()
show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")

We calculate and compare the KV-cache memory footprint for both GQA and MLA attention types during the forward pass. We then observe the stability of the periodic component by analyzing the spectral radius of the matrix A. We further stress-test the model with extreme training conditions to confirm that stability is preserved.

VOCAB = 64
SEQ_LEN = 24


def make_batch(batch=64, seq_len=SEQ_LEN):
   x = torch.randint(1, 3, (batch, seq_len), device=device)
   bits = x - 1
   parity = bits.cumsum(dim=1) % 2
   y = parity + 1
   return x, y


cfg = MythosConfig(
   vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
   max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
   prelude_layers=1, coda_layers=1,
   n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
   expert_dim=32, lora_rank=4, attn_type="gqa",
   act_threshold=0.99,
)
model = OpenMythos(cfg).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
T_TRAIN = 3


print("n─── Part 5 ─ training (T_train = 3) ───────────────────")
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
losses = ()
t0 = time.time()
for step in range(600):
   x, y = make_batch(64)
   logits = model(x, n_loops=T_TRAIN)
   loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
   opt.zero_grad(); loss.backward()
   opt.step()
   losses.append(loss.item())
   if step % 100 == 0 or step == 599:
       with torch.no_grad():
           acc = (logits.argmax(-1) == y).float().mean().item()
       print(f"step {step:3d}   loss={loss.item():.4f}   acc@T3={acc:.3f}")
print(f"training wallclock: {time.time() - t0:.1f}s")

We define a cumulative similarity function to train our model on the structured sequential problem. We initialize the OpenMythos model with a fixed loop depth and train it using cross-entropy loss. Throughout training, we monitor loss and accuracy to evaluate how well the model learns under limited depth.

model.eval()
T_sweep = (1, 2, 3, 4, 6, 8, 10, 12, 14, 16)
accs = ()
with torch.no_grad():
   x_eval, y_eval = make_batch(512)
   for T in T_sweep:
       logits = model(x_eval, n_loops=T)
       accs.append((logits.argmax(-1) == y_eval).float().mean().item())


print("n─── Part 6 ─ depth extrapolation (T_train=3) ──────────")
for T, a in zip(T_sweep, accs):
   bar = "█" * int(a * 40)
   marker = "  ← trained here" if T == T_TRAIN else ""
   print(f"T={T:2d}  acc={a:.3f}  {bar}{marker}")


halt_trace: list(torch.Tensor) = ()
orig_halt = model.recurrent.act.forward


def halt_hook(self, h):
   p = orig_halt(h)
   halt_trace.append(p.detach().cpu())
   return p
model.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)


with torch.no_grad():
   x_h, _ = make_batch(1)
   _ = model(x_h, n_loops=16)


model.recurrent.act.forward = orig_halt


halts = torch.stack(halt_trace, dim=0)(:, 0).numpy()
print(f"n─── Part 7 ─ ACT halting matrix (loops × positions) ───")
print(f"shape: {halts.shape}  |  "
     f"mean halt-prob per loop: "
     f"{', '.join(f'{v:.2f}' for v in halts.mean(1))}")

We evaluate the trained model by varying the number of inference loops to study depth extrapolation. We see how increasing the loop depth improves accuracy without retraining the model. We also use the ACT mechanism to capture the stopping probabilities at each sequence position and iteration.

expert_hits = Counter()
orig_moe = model.recurrent.block.ffn.forward


def moe_hook(self, x):
   flat = x.view(-1, x.shape(-1))
   logits = self.router(flat) + self.router_bias
   scores = F.softmax(logits, dim=-1)
   _, idx = scores.topk(self.topk, dim=-1)
   for e in idx.flatten().tolist():
       expert_hits(e) += 1
   return orig_moe(x)


model.recurrent.block.ffn.forward = moe_hook.__get__(
   model.recurrent.block.ffn, MoEFFN)


with torch.no_grad():
   x_m, _ = make_batch(32)
   _ = model(x_m, n_loops=T_TRAIN)


model.recurrent.block.ffn.forward = orig_moe


print("n─── Part 8 ─ MoE expert utilization ───────────────────")
total = sum(expert_hits.values())
for eid in range(cfg.n_experts):
   share = expert_hits.get(eid, 0) / max(total, 1)
   print(f"expert {eid}: {share*100:5.2f}% of topk slots")


prompt = torch.tensor(((1, 2, 1, 1, 2, 2, 1, 2)), device=device)
print("n─── Part 9 ─ generation ───────────────────────────────")
print(f"prompt (parity pattern): {prompt.tolist()(0)}")
for T_gen in (1, 4, 12):
   with torch.no_grad():
       out = model.generate(prompt, max_new_tokens=8,
                            n_loops=T_gen, temperature=0.1, top_k=2)
   print(f"T_gen={T_gen:2d}  → {out.tolist()(0)}")


fig, axes = plt.subplots(1, 3, figsize=(15, 4))


axes(0).plot(losses)
axes(0).set_title("Training loss (parity task)")
axes(0).set_xlabel("step"); axes(0).set_ylabel("cross-entropy")
axes(0).grid(alpha=0.3)


axes(1).plot(T_sweep, accs, "o-", linewidth=2, markersize=8)
axes(1).axvline(T_TRAIN, color="red", linestyle="--",
               label=f"T_train = {T_TRAIN}")
axes(1).set_title("Depth extrapolation: accuracy vs inference loops")
axes(1).set_xlabel("n_loops at inference"); axes(1).set_ylabel("accuracy")
axes(1).legend(); axes(1).grid(alpha=0.3); axes(1).set_ylim(0, 1.05)


im = axes(2).imshow(halts, aspect="auto", cmap="viridis",
                   vmin=0, vmax=halts.max())
axes(2).set_title("ACT halting probabilityn(loop t × position)")
axes(2).set_xlabel("position"); axes(2).set_ylabel("loop iteration t")
plt.colorbar(im, ax=axes(2), fraction=0.046, pad=0.04)


plt.tight_layout()
plt.savefig("openmythos_tutorial.png", dpi=120, bbox_inches="tight")
plt.show()

We analyze expert usage in the MoE layer by tracking how tokens are sent between experts. We then generate sequences at different loop depths to observe their impact on the output. Finally, we visualize training loss, depth extrapolation performance, and ACT stopping behavior through plots.

In conclusion, we demonstrated that OpenMythos effectively leverages looped computation to achieve depth extrapolation, enabling the model to improve accuracy simply by increasing the number of inference-time loops. We observe that the recurrent mechanism remains stable even under extreme training conditions, and the attention of MLA significantly reduces KV-cache memory usage compared to GQA. We also saw how ACT enables dynamic computation in sequence situations and how MoE routing distributes the workload among experts. Overall, we established that this architecture provides an attractive direction for computation-adaptive reasoning, where we trade additional inference computation for improved performance without modifying the model parameters.

check it out Full code with notebook here. Also, feel free to follow us Twitter And don’t forget to join us 130k+ ML subreddit and subscribe our newsletter. wait! Are you on Telegram? Now you can also connect with us on Telegram.

Do you need to partner with us to promote your GitHub repo or Hugging Face page or product release or webinar, etc? join us

A coding tutorial on OpenMythos on Recurrent-Depth Transformer with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing

Download: Introduction to the Nature Issue

Karen Hao: Are we betting on the wrong AI narrative? (MAICON 2026)

Related Articles

Leave a Comment Cancel Reply