In this tutorial, we build an advanced, practical implementation of the NVIDIA Transformer Engine in Python, focusing on how mixed-precision acceleration can be applied in realistic deep learning workflows. We set up the environment, verify GPU and CUDA readiness, attempt to install the necessary Transformer Engine components, and handle compatibility issues gracefully so that the notebook remains runnable even when the full extensions cannot be built. As we move through each step, we build teacher and student networks, compare the baseline PyTorch path with the Transformer Engine-enabled path, train both models, benchmark their speed and memory usage, and visualize the results, giving us a clear understanding of how performance-oriented training workflows are structured in practice.
import os
import sys
import json
import time
import math
import random
import shutil
import platform
import subprocess
import statistics
def run(cmd, check=True):
    print("\n[RUN]", " ".join(cmd))
    result = subprocess.run(cmd, text=True, capture_output=True)
    if result.stdout.strip():
        print(result.stdout[-4000:])
    if result.returncode != 0 and result.stderr.strip():
        print(result.stderr[-4000:])
    if check and result.returncode != 0:
        raise subprocess.CalledProcessError(result.returncode, cmd)
    return result

def has_cmd(name):
    return shutil.which(name) is not None

run([sys.executable, "-m", "pip", "install", "-q", "--upgrade", "pip"])
run([sys.executable, "-m", "pip", "install", "-q", "ninja", "packaging", "matplotlib"])
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
assert torch.cuda.is_available(), "This notebook needs a GPU runtime in Colab."
gpu_name = torch.cuda.get_device_name(0)
cc_major, cc_minor = torch.cuda.get_device_capability(0)
cuda_runtime = torch.version.cuda
python_version = sys.version.split()[0]
torch_version = torch.__version__
cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
nvcc_path = shutil.which("nvcc") or os.path.join(cuda_home, "bin", "nvcc")
cudnn_header_candidates = [
    os.path.join(cuda_home, "include", "cudnn.h"),
    "/usr/include/cudnn.h",
    "/usr/local/include/cudnn.h",
]
nvcc_exists = os.path.exists(nvcc_path)
cudnn_header_exists = any(os.path.exists(p) for p in cudnn_header_candidates)
print("=" * 120)
print("ENVIRONMENT CHECK")
print("=" * 120)
print(json.dumps({
    "python": python_version,
    "platform": platform.platform(),
    "torch": torch_version,
    "torch_cuda": cuda_runtime,
    "gpu_name": gpu_name,
    "compute_capability": f"{cc_major}.{cc_minor}",
    "cuda_home": cuda_home,
    "nvcc_exists": nvcc_exists,
    "nvcc_path": nvcc_path if nvcc_exists else None,
    "cudnn_header_exists": cudnn_header_exists,
}, indent=2))
print("=" * 120)
We prepare the Colab environment by importing the required Python libraries, defining a helper function to execute shell commands, and installing the key dependencies for the tutorial. We then import PyTorch and Matplotlib, verify that a GPU is available, and collect key environment details, including the GPU name, CUDA version, Python version, and toolkit path. This gives us a clear view of the system status before attempting any Transformer Engine setup or model execution.
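As a side note, FP8 support is tied to the GPU's compute capability reported in the environment check. A small, hypothetical helper (not part of the tutorial code) makes that mapping explicit, based on NVIDIA's published hardware support:

```python
# Hypothetical helper (our addition, not from the tutorial): map a CUDA
# compute capability to the precisions we can reasonably expect.
# FP16 tensor cores arrived with Volta (7.0), BF16 with Ampere (8.0),
# and FP8 with Ada/Hopper (8.9+).
def expected_precisions(cc_major, cc_minor):
    cc = cc_major + cc_minor / 10
    return {"fp16": cc >= 7.0, "bf16": cc >= 8.0, "fp8": cc >= 8.9}

# A Colab T4 (7.5) supports FP16 only; an H100 (9.0) supports all three.
print(expected_precisions(7, 5))
print(expected_precisions(9, 0))
```

This is why the notebook keeps a BF16/FP16 fallback: on most free Colab GPUs, `fp8_available` will come back `False` even when Transformer Engine installs cleanly.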
te_available = False
te_mode = "fallback"
te_import_error = None
try:
    run([sys.executable, "-m", "pip", "install", "-q", "transformer_engine[core_cu12]"])
except Exception as e:
    print("Core wheel install failed:", repr(e))

can_try_te_torch = nvcc_exists and cudnn_header_exists
if can_try_te_torch:
    env = os.environ.copy()
    env["NVTE_FRAMEWORK"] = "pytorch"
    env["MAX_JOBS"] = "1"
    env["NVTE_BUILD_THREADS_PER_JOB"] = "1"
    env["CUDA_PATH"] = cuda_home
    env["CUDA_HOME"] = cuda_home
    try:
        print("\nAttempting to build the PyTorch extension for Transformer Engine...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-q", "--no-build-isolation", "transformer_engine[pytorch]"],
            text=True,
            capture_output=True,
            env=env,
        )
        if result.stdout.strip():
            print(result.stdout[-4000:])
        if result.returncode != 0 and result.stderr.strip():
            print(result.stderr[-4000:])
        if result.returncode == 0:
            import transformer_engine.pytorch as te
            from transformer_engine.common import recipe
            te_available = True
            te_mode = "transformer_engine"
        else:
            te_import_error = result.stderr[-4000:] if result.stderr else "Unknown pip build error"
    except Exception as e:
        te_import_error = repr(e)
else:
    te_import_error = "Missing nvcc or cuDNN headers in this Colab runtime, so the TE PyTorch extension cannot be built here."
if te_available:
    try:
        fp8_available, fp8_reason = te.is_fp8_available(return_reason=True)
    except Exception as e:
        fp8_available, fp8_reason = False, f"Could not query FP8 availability: {e}"
    try:
        bf16_available = te.is_bf16_available()
    except Exception:
        bf16_available = torch.cuda.is_bf16_supported()
else:
    fp8_available = False
    fp8_reason = "Transformer Engine not installed; using fallback PyTorch path."
    bf16_available = torch.cuda.is_bf16_supported()

amp_dtype = torch.bfloat16 if bf16_available else torch.float16

print("\n" + "=" * 120)
print("INSTALL STATUS")
print("=" * 120)
print(json.dumps({
    "te_available": te_available,
    "te_mode": te_mode,
    "fp8_available": fp8_available,
    "fp8_reason": fp8_reason,
    "te_import_error": te_import_error,
    "amp_dtype": str(amp_dtype),
}, indent=2))
print("=" * 120)
device = "cuda"
random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

if te_available:
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

def baseline_autocast():
    return torch.autocast(device_type="cuda", dtype=amp_dtype)

def te_forward_context(use_fp8):
    if te_available and use_fp8:
        return te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe)
    return baseline_autocast()
We try to install the Transformer Engine core package and then check whether the Colab runtime can build the PyTorch extension by verifying the presence of nvcc and the cuDNN headers. If the environment supports it, we install the Transformer Engine PyTorch backend and then check whether FP8 and BF16 are available on the current hardware. We also configure the precision modes and define autocast helpers that later allow us to switch between standard mixed precision and Transformer Engine execution.
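To make the DelayedScaling recipe less of a black box, here is a pure-Python sketch (our own illustration, not Transformer Engine source code) of the idea behind delayed scaling: the FP8 scale factor for a tensor is derived from a rolling history of absolute-maximum values observed in earlier steps, so the quantization range adapts over time without inspecting the current tensor on the critical path.

```python
# Sketch of delayed scaling (our simplification of the concept).
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def delayed_scale(amax_history, margin=0):
    # Reduce the amax history (TE's default reduction is max), then pick a
    # scale so that the historical amax lands near the top of E4M3's range.
    amax = max(amax_history)
    return E4M3_MAX / amax / (2 ** margin)

# With a history of amax values, the largest one sets the scale;
# a nonzero margin leaves extra headroom by halving the scale per bit.
print(delayed_scale([3.2, 4.1, 2.7]))
print(delayed_scale([1.0], margin=1))
```

This is why the recipe in the notebook sets `margin=0` and `Format.E4M3`: for this small distillation workload we use the full E4M3 range with no extra headroom.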
class TeacherNet(nn.Module):
    def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.LayerNorm(hidden_size),
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            ) for _ in range(num_layers)
        ])
        self.head = nn.Linear(hidden_size, hidden_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for layer in self.layers:
            x = x + layer(x)
        return self.head(x)
class BaselineStudent(nn.Module):
    def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(num_layers)])
        self.fc1 = nn.ModuleList([nn.Linear(hidden_size, intermediate_size) for _ in range(num_layers)])
        self.fc2 = nn.ModuleList([nn.Linear(intermediate_size, hidden_size) for _ in range(num_layers)])
        self.head = nn.Linear(hidden_size, hidden_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
            residual = x
            x = ln(x)
            x = fc1(x)
            x = F.gelu(x, approximate="tanh")
            x = fc2(x)
            x = x + residual
        return self.head(x)
if te_available:
    class TEStudent(nn.Module):
        def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.norms = nn.ModuleList([te.LayerNorm(hidden_size) for _ in range(num_layers)])
            self.fc1 = nn.ModuleList([te.Linear(hidden_size, intermediate_size, bias=True) for _ in range(num_layers)])
            self.fc2 = nn.ModuleList([te.Linear(intermediate_size, hidden_size, bias=True) for _ in range(num_layers)])
            self.head = te.Linear(hidden_size, hidden_size, bias=True)

        def forward(self, token_ids, use_fp8=False):
            x = self.embed(token_ids)
            with te_forward_context(use_fp8):
                for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
                    residual = x
                    x = ln(x)
                    x = fc1(x)
                    x = F.gelu(x, approximate="tanh")
                    x = fc2(x)
                    x = x + residual
                x = self.head(x)
            return x
else:
    class TEStudent(nn.Module):
        def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.norms = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(num_layers)])
            self.fc1 = nn.ModuleList([nn.Linear(hidden_size, intermediate_size) for _ in range(num_layers)])
            self.fc2 = nn.ModuleList([nn.Linear(intermediate_size, hidden_size) for _ in range(num_layers)])
            self.head = nn.Linear(hidden_size, hidden_size)

        def forward(self, token_ids, use_fp8=False):
            x = self.embed(token_ids)
            with baseline_autocast():
                for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
                    residual = x
                    x = ln(x)
                    x = fc1(x)
                    x = F.gelu(x, approximate="tanh")
                    x = fc2(x)
                    x = x + residual
                x = self.head(x)
            return x
def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def format_millions(n):
    return f"{n / 1e6:.2f}M"
We define the neural network architectures used throughout the tutorial, including the teacher model, the baseline student model, and the Transformer Engine student path. We keep the model structures aligned so that comparisons remain meaningful, while allowing the TE path to swap in Transformer Engine layers when the extensions are available. We also define small utility functions to count parameters and format the model size, which help us inspect the scale of the models before training begins.
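As a sanity check on those utilities, we can predict what `count_params` should report for the student analytically (our own back-of-envelope arithmetic, using the tutorial's default sizes):

```python
# Expected parameter count for the student at the default configuration.
hidden, inter, layers, vocab = 512, 2048, 3, 4096

embed = vocab * hidden                          # nn.Embedding weight
per_layer = (2 * hidden                         # LayerNorm weight + bias
             + hidden * inter + inter           # fc1 weight + bias
             + inter * hidden + hidden)         # fc2 weight + bias
head = hidden * hidden + hidden                 # output projection
total = embed + layers * per_layer + head

print(total)                  # 8662016
print(f"{total / 1e6:.2f}M")  # 8.66M, matching format_millions
```

If `count_params(baseline_model)` ever disagrees with this figure, a layer was likely added, removed, or resized somewhere.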
hidden_size = 512
intermediate_size = 2048
num_layers = 3
vocab_size = 4096
seq_len = 128
batch_size = 8
steps = 25
benchmark_iters = 20
lr = 2e-4
weight_decay = 1e-2
teacher = TeacherNet(hidden_size, intermediate_size, num_layers, vocab_size).to(device).eval()
baseline_model = BaselineStudent(hidden_size, intermediate_size, num_layers, vocab_size).to(device)
te_model = TEStudent(hidden_size, intermediate_size, num_layers, vocab_size).to(device)
optimizer_baseline = torch.optim.AdamW(baseline_model.parameters(), lr=lr, weight_decay=weight_decay)
optimizer_te = torch.optim.AdamW(te_model.parameters(), lr=lr, weight_decay=weight_decay)
print("Teacher params :", format_millions(count_params(teacher)))
print("Baseline params:", format_millions(count_params(baseline_model)))
print("TE-path params :", format_millions(count_params(te_model)))
def make_batch(batch_size, seq_len, vocab_size, device):
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    with torch.no_grad():
        target = teacher(tokens)
    return tokens, target

def peak_mem_mb():
    return torch.cuda.max_memory_allocated() / (1024 ** 2)
def train_baseline_step():
    baseline_model.train()
    optimizer_baseline.zero_grad(set_to_none=True)
    tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
    with baseline_autocast():
        pred = baseline_model(tokens)
        loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer_baseline.step()
    return float(loss.detach().item())

def train_te_step(use_fp8):
    te_model.train()
    optimizer_te.zero_grad(set_to_none=True)
    tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
    pred = te_model(tokens, use_fp8=use_fp8)
    loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer_te.step()
    return float(loss.detach().item())
We set the main experiment hyperparameters, instantiate all models on the GPU, and create the optimizers used during training. We also print parameter counts to confirm that the baseline and TE-path models are comparable in size. In addition, we define the batch-generation logic, memory-tracking helpers, and individual training-step functions that perform one optimization step for each model path.
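One useful side calculation (our own, using the defaults above): each training step processes `batch_size * seq_len` tokens, so a measured step time converts directly into token throughput, a convenient unit for comparing the two paths later.

```python
# Tokens processed per optimizer step at the tutorial's default sizes,
# and a small helper to turn a step time into throughput.
batch_size, seq_len = 8, 128
tokens_per_step = batch_size * seq_len

def tokens_per_second(step_ms):
    # step_ms is a mean train-step time in milliseconds
    return tokens_per_step / (step_ms / 1000.0)

print(tokens_per_step)          # 1024 tokens per step
print(tokens_per_second(20.0))  # a hypothetical 20 ms step -> 51200 tokens/s
```

The 20 ms figure is purely illustrative; the real step times come from the benchmark section below.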
baseline_losses = []
te_losses = []
mode_name = "TE-FP8" if (te_available and fp8_available) else ("TE-BF16/FP16" if te_available else "Fallback-PyTorch")

print("\n" + "=" * 120)
print("TRAINING")
print("=" * 120)
for step in range(1, steps + 1):
    b_loss = train_baseline_step()
    t_loss = train_te_step(use_fp8=fp8_available)
    baseline_losses.append(b_loss)
    te_losses.append(t_loss)
    if step == 1 or step % 5 == 0 or step == steps:
        print(f"step={step:02d} | baseline_loss={b_loss:.6f} | te_path_loss={t_loss:.6f} | mode={mode_name}")
@torch.no_grad()
def evaluate_model(model, is_te=False, use_fp8=False, eval_batches=8):
    model.eval()
    vals = []
    for _ in range(eval_batches):
        tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
        if is_te:
            pred = model(tokens, use_fp8=use_fp8)
        else:
            with baseline_autocast():
                pred = model(tokens)
        vals.append(float(F.mse_loss(pred, target).item()))
    return sum(vals) / len(vals)

baseline_eval = evaluate_model(baseline_model, is_te=False)
te_eval = evaluate_model(te_model, is_te=True, use_fp8=fp8_available)
def benchmark_train_step(model, optimizer, is_te=False, use_fp8=False, warmup=5, iters=20):
    times_ms = []
    mems_mb = []
    for _ in range(warmup):
        optimizer.zero_grad(set_to_none=True)
        tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
        if is_te:
            pred = model(tokens, use_fp8=use_fp8)
        else:
            with baseline_autocast():
                pred = model(tokens)
        loss = F.mse_loss(pred, target)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    for _ in range(iters):
        torch.cuda.reset_peak_memory_stats()
        optimizer.zero_grad(set_to_none=True)
        tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
        start = time.perf_counter()
        if is_te:
            pred = model(tokens, use_fp8=use_fp8)
        else:
            with baseline_autocast():
                pred = model(tokens)
        loss = F.mse_loss(pred, target)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        end = time.perf_counter()
        times_ms.append((end - start) * 1000.0)
        mems_mb.append(peak_mem_mb())
    return {
        "mean_ms": statistics.mean(times_ms),
        "median_ms": statistics.median(times_ms),
        "max_memory_mb": max(mems_mb),
    }

baseline_bench = benchmark_train_step(baseline_model, optimizer_baseline, is_te=False, use_fp8=False, iters=benchmark_iters)
te_bench = benchmark_train_step(te_model, optimizer_te, is_te=True, use_fp8=fp8_available, iters=benchmark_iters)
We run the main training loop for both the baseline model and the TE path, tracking their losses over multiple steps. We then define and execute an evaluation function to measure how well each model matches the teacher's output after training. Finally, we implement a benchmarking routine to measure per-step runtime and peak CUDA memory usage, enabling a quantitative comparison of performance characteristics.
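The timing discipline inside `benchmark_train_step` can be distilled into a tiny CPU-only harness (a sketch of the pattern, not a replacement for the GPU benchmark): warm up first so one-time costs such as allocator growth and kernel compilation fall outside the measurement, then time only steady-state iterations and report robust summary statistics. On CUDA, the `torch.cuda.synchronize()` calls before reading the clock are essential because kernels launch asynchronously.

```python
import time
import statistics

def bench(fn, warmup=3, iters=10):
    # Warm-up iterations are executed but deliberately not timed.
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    # The median is robust to a single stall; the mean shows average cost.
    return {"mean_ms": statistics.mean(times_ms),
            "median_ms": statistics.median(times_ms)}

stats = bench(lambda: sum(range(100000)))
print(stats)
```

The notebook's version layers the same structure over a real train step and additionally resets and reads peak-memory statistics per iteration.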
summary = {
    "gpu_name": gpu_name,
    "compute_capability": f"{cc_major}.{cc_minor}",
    "te_available": te_available,
    "fp8_available": fp8_available,
    "fp8_reason": fp8_reason,
    "mode": mode_name,
    "baseline_eval_mse": baseline_eval,
    "te_path_eval_mse": te_eval,
    "baseline_mean_step_ms": baseline_bench["mean_ms"],
    "te_path_mean_step_ms": te_bench["mean_ms"],
    "baseline_peak_mem_mb": baseline_bench["max_memory_mb"],
    "te_path_peak_mem_mb": te_bench["max_memory_mb"],
}

print("\n" + "=" * 120)
print("SUMMARY")
print("=" * 120)
print(json.dumps(summary, indent=2))
plt.figure(figsize=(10, 5))
plt.plot(baseline_losses, label="Baseline loss")
plt.plot(te_losses, label=f"{mode_name} loss")
plt.xlabel("Training step")
plt.ylabel("MSE loss")
plt.title("Training Loss Comparison")
plt.legend()
plt.grid(True)
plt.show()
plt.figure(figsize=(8, 5))
plt.bar(["Baseline", mode_name], [baseline_bench["mean_ms"], te_bench["mean_ms"]])
plt.ylabel("Mean train step time (ms)")
plt.title("Speed Comparison")
plt.grid(True, axis="y")
plt.show()
plt.figure(figsize=(8, 5))
plt.bar(["Baseline", mode_name], [baseline_bench["max_memory_mb"], te_bench["max_memory_mb"]])
plt.ylabel("Peak memory (MB)")
plt.title("Peak CUDA Memory Comparison")
plt.grid(True, axis="y")
plt.show()
We collect all the final metrics into a summary dictionary and print the consolidated results of the experiment in a structured format. We then generate visualizations of training loss, average training-step time, and peak memory usage to build an intuitive picture of the differences between the baseline and TE paths. This final section helps us move from raw numbers to practical insights by showing how both implementations behave in terms of accuracy, speed, and memory.
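A natural last step is to reduce the summary to headline ratios. The sketch below shows one way to do that; the numbers in the stand-in dictionary are made up for illustration and your real summary values will differ by GPU and mode.

```python
# Hypothetical post-processing of the summary dict (illustrative numbers).
summary = {"baseline_mean_step_ms": 31.0, "te_path_mean_step_ms": 24.8,
           "baseline_peak_mem_mb": 512.0, "te_path_peak_mem_mb": 410.0}

# Speedup > 1.0 means the TE path completed steps faster than baseline;
# a memory ratio < 100% means it also used less peak CUDA memory.
speedup = summary["baseline_mean_step_ms"] / summary["te_path_mean_step_ms"]
mem_ratio = summary["te_path_peak_mem_mb"] / summary["baseline_peak_mem_mb"]
print(f"TE path speedup: {speedup:.2f}x")
print(f"TE path memory vs baseline: {mem_ratio:.0%}")
```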
Finally, we have built much more than a simple installation walkthrough: a complete experimental pipeline that helps us understand how the NVIDIA Transformer Engine fits into modern GPU-accelerated model training. We tested the runtime environment, worked around Colab limitations, preserved a working fallback path, and then trained, evaluated, and benchmarked the two implementations side by side to see practical differences in efficiency, precision behavior, and resource usage. In the end, we understand how to use the Transformer Engine in a Colab-friendly setting and have gained a reusable foundation that we can extend to larger Transformer architectures, richer benchmarking scenarios, and more production-oriented optimization workflows.
Check out the full code/notebook here.