Zhipu AI Releases GLM-4.7-Flash: A 30B-A3B MoE Model for Efficient Local Coding and Agents


GLM-4.7-Flash is a new member of the GLM-4.7 family and targets developers who want robust coding and reasoning performance in a practical model they can run locally. Zhipu AI (Z.ai) describes GLM-4.7-Flash as a 30B-A3B MoE model and presents it as the most robust model in the 30B class, designed for lightweight deployments where both performance and efficiency matter.

Model class and position within the GLM 4.7 family

GLM-4.7-Flash is a text generation model with 31B parameters, BF16 and F32 tensor types, and the architecture tag glm4_moe_lite. It supports English and Chinese and is configured for conversational use. GLM-4.7-Flash sits next to the larger GLM-4.7 and GLM-4.7-FP8 models in the GLM-4.7 collection.

Z.ai positions GLM-4.7-Flash as a free-tier and lightweight deployment option relative to the full GLM-4.7 model, while still targeting coding, logic, and general text generation tasks. This makes it interesting for developers who cannot deploy a 358B-class model but still want a modern MoE design and strong benchmark results.

Architecture and context length

In this type of mixture-of-experts architecture, the model stores more parameters than are active for each token; the 30B-A3B naming indicates roughly 30B total parameters with about 3B activated per token. This allows specialization among experts while keeping the effective computation per token close to that of small dense models.

GLM-4.7-Flash supports a context length of 128K tokens and achieves strong performance on coding benchmarks among similar-scale models. This context size is suitable for large codebases, multi-file repositories, and long technical documents, where many existing models would require aggressive chunking.

GLM-4.7-Flash uses a standard causal language modeling interface and a chat template, allowing integration into existing LLM stacks with minimal changes.
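
As a rough sketch of that integration, the snippet below loads the model with Hugging Face Transformers and uses the tokenizer's chat template. The repository id matches the Hugging Face link below; the dtype, device, and generation settings are illustrative assumptions rather than official defaults, and a recent Transformers release that includes the glm4_moe_lite architecture may be required.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id from the Hugging Face collection; dtype and device settings
# are illustrative assumptions, not official recommendations.
model_id = "zai-org/GLM-4.7-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # uses the BF16 weights where supported
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a Python function that merges two sorted lists."}
]

# The chat template converts the message list into the prompt format the model expects.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))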

Benchmark performance in 30B class

The Z.ai team compares GLM-4.7-Flash with Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash is leading or competitive across this mix of mathematics, logic, long-horizon, and coding agent benchmarks.

Benchmark comparison table (source: https://huggingface.co/zai-org/GLM-4.7-Flash)

The table above shows why GLM-4.7-Flash is one of the most robust models in the 30B class, at least among the models included in this comparison. Importantly, GLM-4.7-Flash is not only a compact variant of GLM but also a high-performance model on established coding and agent benchmarks.

Evaluation parameters and thinking mode

For most tasks, the default settings are temperature 1.0, top-p 0.95, and a maximum of 131,072 new tokens. This defines a relatively open sampling regime with a large generation budget.

For Terminal Bench and SWE-Bench Verified, the configuration uses temperature 0.7, top-p 1.0, and a maximum of 16,384 new tokens. For τ²-Bench, the configuration uses temperature 0 and a maximum of 16,384 new tokens. These stricter settings reduce randomness for tasks that require steady tool use and multi-step interactions.
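
To make those settings concrete, here is a minimal sketch that encodes the documented per-task parameters as plain Python dictionaries. The key names follow common generation-API conventions and are an assumption, not an official configuration file.

# Documented evaluation settings for GLM-4.7-Flash, expressed as generic
# generation kwargs (key names are an assumption, not an official format).
EVAL_CONFIGS = {
    "default": {
        "temperature": 1.0,
        "top_p": 0.95,
        "max_new_tokens": 131072,
    },
    "terminal_bench_swe_bench_verified": {
        "temperature": 0.7,
        "top_p": 1.0,
        "max_new_tokens": 16384,
    },
    "tau2_bench": {
        "temperature": 0.0,   # greedy decoding for steady tool use
        "max_new_tokens": 16384,
    },
}

def config_for(task: str) -> dict:
    """Return the generation settings for a task, falling back to the defaults."""
    return EVAL_CONFIGS.get(task, EVAL_CONFIGS["default"])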

The Z.ai team also recommends keeping thinking mode on, with the reasoning content preserved across turns, for multi-turn agentic tasks like τ²-Bench and Terminal Bench 2. Retaining the internal reasoning traces in every turn is useful when you build agents that require long chains of function calls and refinements.
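
One way to approximate that behavior in an agent loop is to keep each turn's full assistant output, reasoning included, in the running message history instead of truncating it. The sketch below is an assumption about how such a loop could be structured and does not reflect an official API; generate_reply is a hypothetical stand-in for whatever inference call is used.

# Hypothetical multi-turn agent loop that preserves reasoning traces across turns.
# `generate_reply` stands in for the actual inference call (Transformers, vLLM, ...).
def run_agent(generate_reply, user_turns):
    history = []
    for user_message in user_turns:
        history.append({"role": "user", "content": user_message})
        reply = generate_reply(history)          # full reply, reasoning included
        # Keep the entire assistant message so later turns can see earlier reasoning.
        history.append({"role": "assistant", "content": reply})
    return history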

How GLM-4.7-Flash fits into the developer workflow

GLM-4.7-Flash combines several properties that are relevant for agentic, coding-centric applications:

  • 30B-A3B MoE architecture with 31B total parameters and a 128K token context length.
  • Strong benchmark results on AIME25, GPQA, SWE-Bench Verified, τ²-Bench and BrowseComp compared to other models in the same table.
  • Documented evaluation parameters and a protected thinking mode for multi-turn agent tasks.
  • First-class support for vLLM, SGLang, and Transformers-based inference with ready-to-use commands (see the serving sketch after this list).
  • A growing set of finetunes and quantizations, including MLX conversions, in the Hugging Face ecosystem.
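
As an illustration of that serving path, the snippet below queries a locally hosted, OpenAI-compatible endpoint of the kind vLLM and SGLang expose. The port, base URL, and prompt are assumptions, and the server must already be running with the GLM-4.7-Flash weights.

from openai import OpenAI

# Assumes a local vLLM or SGLang server is already serving GLM-4.7-Flash behind
# an OpenAI-compatible endpoint; the URL and API key here are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[
        {"role": "user", "content": "Refactor this function to avoid repeated file I/O."}
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)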

Check out the model weights on Hugging Face.


Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.
