Zlab Princeton researchers have released the LLM-Pruning Collection, a JAX-based repository that consolidates the major pruning algorithms for large language models into a single, reproducible framework. It targets a concrete goal: making it easy to compare block-level, layer-level, and weight-level pruning methods under a consistent training and evaluation stack on both GPUs and TPUs.
What is included in the LLM-Pruning Collection?
It is described as a JAX-based repo for LLM pruning, organized into three main directories:
- pruning — implementations of several pruning methods: Minitron, ShortGPT, Wanda, SparseGPT, Magnitude, Sheared LLaMA, and LLM-Pruner.
- training — integration with FMS-FSDP for GPU training and MaxText for TPU training.
- eval — built around lm-eval-harness, exposing JAX-compatible evaluation scripts for MaxText checkpoints that give an approximately 2 to 4 times evaluation speedup.
Pruning methods covered
The LLM-Pruning collection spans several families of pruning algorithms with different granularity levels:
Minitron
Minitron is a practical pruning and distillation recipe developed by NVIDIA that compresses Llama 3.1 8B and Mistral NeMo 12B down to 4B and 8B models while preserving performance. It combines width pruning of hidden dimensions, attention heads, and MLP channels with depth pruning, and then recovers quality through knowledge distillation.
In the LLM-Pruning Collection, the pruning/minitron folder provides scripts such as prune_llama3.1-8b.sh, which runs Minitron-style pruning on Llama 3.1 8B.
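The width-pruning step at the heart of such a recipe can be sketched with activation-based neuron importance. This is an illustrative numpy sketch, not the repository's code; the array shapes and the top-half keep ratio are assumptions for the example.

```python
import numpy as np

def neuron_importance(acts: np.ndarray) -> np.ndarray:
    """Activation-based importance in the spirit of Minitron's width pruning:
    aggregate the magnitude of each MLP neuron's activation over a
    calibration batch. acts: (tokens, neurons)."""
    return np.linalg.norm(acts, axis=0)

# Hypothetical calibration activations: 256 tokens, 1024 MLP neurons.
acts = np.random.randn(256, 1024)
imp = neuron_importance(acts)          # one score per neuron
keep = np.argsort(imp)[::-1][:512]     # keep the 512 highest-scoring neurons
```

After scoring, the pruned MLP keeps only the selected rows/columns, shrinking the layer's width before distillation restores quality.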
ShortGPT
ShortGPT is based on the observation that many Transformer layers are redundant. The method defines Block Influence, a metric that measures each layer's contribution by comparing its input and output hidden states, and then prunes the model by directly removing the layers with the lowest influence. Experiments show that ShortGPT outperforms previous pruning methods on both multiple-choice and generative tasks.
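A common formulation of Block Influence is one minus the cosine similarity between a layer's input and output hidden states; a layer that barely changes its input scores near zero and is a removal candidate. The sketch below is illustrative, not the repository's implementation:

```python
import numpy as np

def block_influence(hidden_in: np.ndarray, hidden_out: np.ndarray) -> float:
    """Block Influence: 1 minus the mean cosine similarity between a layer's
    input and output hidden states. Shapes: (tokens, hidden_dim)."""
    num = np.sum(hidden_in * hidden_out, axis=-1)
    den = np.linalg.norm(hidden_in, axis=-1) * np.linalg.norm(hidden_out, axis=-1)
    return float(1.0 - np.mean(num / (den + 1e-8)))

# A near-identity layer has low influence; a sign-flipping layer has high influence.
x = np.random.randn(16, 64)
low = block_influence(x, x + 0.01 * np.random.randn(16, 64))   # close to 0
high = block_influence(x, -x)                                  # close to 2
```

Ranking all layers by this score and deleting the lowest-ranked ones is exactly the "remove the layer outright" strategy the method describes.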
In the repository, ShortGPT is implemented within the Minitron folder via a dedicated script, prune_llama2-7b.sh.
Wanda, SparseGPT, Magnitude
Wanda is a post-training pruning method that scores weights per output based on the product of the weight magnitude and the norm of the corresponding input activation. It prunes the lowest-scoring weights, requires no retraining, and produces sparsity that works well even at billion-parameter scale.
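The scoring rule lends itself to a compact sketch. The following is an illustrative numpy version of the idea (per-output-row pruning of the lowest |weight| × activation-norm scores), not the repository's code; shapes and the 50% sparsity target are assumptions:

```python
import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float) -> np.ndarray:
    """Score each weight by |W| times the L2 norm of its input activation
    over calibration tokens, then zero the lowest-scoring weights per
    output row. W: (out, in), X: (tokens, in)."""
    act_norm = np.linalg.norm(X, axis=0)        # (in,) per-feature norm
    scores = np.abs(W) * act_norm               # per-weight score
    k = int(W.shape[1] * sparsity)              # weights to drop per row
    drop = np.argsort(scores, axis=1)[:, :k]    # k smallest scores per row
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, drop, 0.0, axis=1)
    return W_pruned

W = np.random.randn(8, 32)
X = np.random.randn(128, 32)       # hypothetical calibration activations
Wp = wanda_prune(W, X, sparsity=0.5)
```

Because the rule needs only one pass over a small calibration set, it avoids the retraining step that older pruning pipelines required.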
SparseGPT is another post-training method that uses a second-order reconstruction step to prune large GPT-style models at high sparsity ratios. Magnitude pruning is a classical baseline that removes weights with small absolute values.
All three live under pruning/wanda in the LLM-Pruning Collection, with a shared installation path. The README includes a dense table of Llama 2 7B results that compares Wanda, SparseGPT, and Magnitude on BoolQ, RTE, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA under unstructured and semi-structured sparsity patterns such as 4:8 and 2:4.
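An N:M pattern such as 2:4 means that in every contiguous group of 4 weights, at most 2 are nonzero, which hardware sparse kernels can accelerate. A minimal magnitude-based sketch of the 2:4 pattern (illustrative only, not the repo's code):

```python
import numpy as np

def prune_2_4(W: np.ndarray) -> np.ndarray:
    """2:4 semi-structured magnitude pruning: in each contiguous group of
    4 weights along the input dimension, keep the 2 largest by absolute
    value and zero the other 2."""
    out_dim, in_dim = W.shape
    assert in_dim % 4 == 0
    groups = W.reshape(out_dim, in_dim // 4, 4)
    order = np.argsort(np.abs(groups), axis=-1)      # rank within each group
    mask = np.ones_like(groups)
    np.put_along_axis(mask, order[..., :2], 0.0, axis=-1)  # drop 2 smallest
    return (groups * mask).reshape(out_dim, in_dim)

W = np.random.randn(4, 16)
Wp = prune_2_4(W)    # exactly half the weights in every group of 4 are zero
```

The 4:8 pattern works the same way with groups of 8 and 4 survivors; both trade a small accuracy cost for a predictable, hardware-friendly sparsity layout.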
Sheared LLaMA
Sheared LLaMA is a structured pruning method that learns masks over layers, attention heads, and hidden dimensions and then retrains the pruned architecture. The original release offers models at several scales, including 2.7B and 1.3B.
The pruning/llmshearing directory in the LLM-Pruning Collection integrates this recipe. It uses a RedPajama subset for calibration, accessed through Hugging Face, and a helper script to convert between Hugging Face and MosaicML Composer formats.
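Once the masks are learned, the pruned rows and columns can be physically removed, yielding a smaller dense model that is then retrained. A minimal sketch of that mask-then-slice step, with a fixed hypothetical mask standing in for the learned one:

```python
import numpy as np

# Hypothetical learned binary mask over hidden dimensions (in Sheared LLaMA
# these masks come out of training; here it is fixed for illustration).
hidden_mask = np.array([1, 1, 0, 1, 0, 1, 1, 1], dtype=bool)  # keep 6 of 8 dims

W = np.random.randn(8, 8)  # toy square weight acting on the hidden dimension

# Drop the masked rows and columns to obtain a smaller dense weight, which
# is what gets retrained after pruning.
W_pruned = W[np.ix_(hidden_mask, hidden_mask)]   # shape (6, 6)
```

Slicing to a dense matrix, rather than keeping zeroed entries, is what lets the pruned model run faster with standard dense kernels.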
LLM-Pruner
LLM-Pruner is a framework for structural pruning of large language models. It removes non-critical coupled structures, such as attention heads or MLP channels, using gradient-based importance scores, and then recovers performance with a small LoRA tuning step that uses about 50K samples. The collection includes LLM-Pruner under pruning/LLM-Pruner, with scripts for LLaMA, Llama 2, and Llama 3.1 8B.
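A gradient-based importance score of this kind is often a first-order Taylor estimate: the magnitude of weight × gradient summed over a structure approximates how much the loss would change if that structure were removed. An illustrative sketch (the per-head shapes and random gradients are assumptions, not the repository's code):

```python
import numpy as np

def taylor_importance(weight: np.ndarray, grad: np.ndarray) -> float:
    """First-order Taylor importance for a coupled structure (e.g. one
    attention head): |sum(weight * grad)| estimates the loss change if
    the structure is removed. Low scores mark pruning candidates."""
    return float(np.abs(np.sum(weight * grad)))

# Hypothetical per-head weights and gradients from a calibration batch.
heads_w = [np.random.randn(64, 64) for _ in range(8)]
heads_g = [np.random.randn(64, 64) * 0.01 for _ in range(8)]

scores = [taylor_importance(w, g) for w, g in zip(heads_w, heads_g)]
prune_order = np.argsort(scores)   # least important heads first
```

Because heads and MLP channels are coupled across layers, the real method prunes whole dependency groups at once, then the short LoRA step repairs the accuracy loss.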
Key takeaways
- The LLM-Pruning Collection is a JAX-based, Apache-2.0 repo from Zlab Princeton that integrates modern LLM pruning methods with shared pruning, training, and evaluation pipelines for GPUs and TPUs.
- The codebase implements block-, layer-, and weight-level pruning approaches, including Minitron, ShortGPT, Wanda, SparseGPT, Sheared LLaMA, Magnitude Pruning, and LLM-Pruner, with method-specific scripts for Llama-family models.
- Training integrates FMS-FSDP on GPU and MaxText on TPU, with JAX-compatible evaluation scripts built on lm-eval-harness that deliver approximately 2 to 4 times faster evaluation for MaxText checkpoints.
- The repository reproduces key results from prior pruning work, publishing side-by-side "paper vs. reproduced" tables for methods such as Wanda, SparseGPT, Sheared LLaMA, and LLM-Pruner so that engineers can verify their runs against known baselines.
Shobha is a data analyst with a proven track record of developing innovative machine-learning solutions that drive business value.