OpenAI releases a research preview of GPT‑5.3-Codex-Spark: a 15x faster AI coding model that delivers over 1000 tokens per second on Cerebras hardware


OpenAI recently launched a new research preview called GPT-5.3-Codex-Spark. The model is built for one thing: extreme speed. While the standard GPT-5.3-Codex focuses on deep reasoning, Spark is designed for near-instant response times. It is the result of deep hardware-software integration between OpenAI and Cerebras.

The results are striking: Spark is 15 times faster than the flagship GPT-5.3-Codex and consistently delivers more than 1,000 tokens per second. This speed effectively eliminates the delay between a developer's idea and the model's code output.

Hardware: Wafer-Scale Engineering

The huge jump in performance is driven by the Cerebras Wafer-Scale Engine 3 (WSE-3). Traditional AI models run on clusters of smaller GPUs that must communicate with each other over interconnects, creating a bottleneck that slows the model down.

The WSE-3 is different. It is a single, giant chip the size of an entire silicon wafer. Because the entire model resides on one piece of silicon, there are no off-chip interconnects to slow it down. This architecture provides:

  • Huge on-chip memory.
  • Ultra-high bandwidth.
  • Low latency computing.

By using the Cerebras CS-3 system, OpenAI can run inference at speeds that traditional GPU clusters cannot reach.

Software optimization and low latency

Speed is not just about the chip. OpenAI also re-engineered how the model communicates with your machine, moving away from traditional request/response methods to a persistent WebSocket connection.
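This is not OpenAI's actual protocol, but the pattern can be illustrated with a small self-contained sketch: plain asyncio TCP streams stand in for the WebSocket, and a local echo server stands in for the model. The point is that a long-lived channel pays connection setup costs once, while the traditional pattern pays them on every request.

```python
import asyncio

async def echo_handler(reader, writer):
    # Toy server: echoes each line back (stands in for streamed model output).
    while line := await reader.readline():
        writer.write(line)
        await writer.drain()
    writer.close()

async def per_request(host, port, prompts):
    # Traditional pattern: a fresh connection (and its setup cost) per request.
    replies = []
    for p in prompts:
        reader, writer = await asyncio.open_connection(host, port)
        writer.write(p.encode() + b"\n")
        await writer.drain()
        replies.append((await reader.readline()).decode().strip())
        writer.close()
        await writer.wait_closed()
    return replies

async def persistent(host, port, prompts):
    # Spark-style pattern: one long-lived channel reused for every turn,
    # paying the connection setup cost only once.
    reader, writer = await asyncio.open_connection(host, port)
    replies = []
    for p in prompts:
        writer.write(p.encode() + b"\n")
        await writer.drain()
        replies.append((await reader.readline()).decode().strip())
    writer.close()
    await writer.wait_closed()
    return replies

async def main():
    server = await asyncio.start_server(echo_handler, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    prompts = ["fix the bug", "add a test", "refactor"]
    a = await per_request("127.0.0.1", port, prompts)
    b = await persistent("127.0.0.1", port, prompts)
    server.close()
    await server.wait_closed()
    return a, b

results = asyncio.run(main())
print(results[1])  # both patterns yield the same replies
```

Both functions return identical replies; the difference is purely in how much per-turn overhead each pattern carries, which is what the percentages below quantify.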

This change results in several technical improvements:

  1. Round-Trip Time (RTT): client-server overhead is reduced by 80%.
  2. Time-to-First-Token (TTFT): improved by 50%, so code starts appearing almost as soon as you press Enter.
  3. Per-token overhead: internal processing time per token is reduced by 30%.

These optimizations enable ‘real-time steering’: you can interrupt the model mid-generation and redirect its logic without waiting for the complete block to finish.
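As a rough sketch of what those percentages mean in practice, the calculation below applies them to assumed baseline figures (100 ms of round-trip overhead, 400 ms to first token). The baselines are illustrative assumptions; only the percentage improvements come from the announcement.

```python
# Back-of-envelope latency model. Baseline values are illustrative
# assumptions; the percentage improvements are the ones OpenAI quotes.
BASELINE_RTT_MS = 100.0    # assumed client-server round-trip overhead
BASELINE_TTFT_MS = 400.0   # assumed time to first token

rtt_ms = BASELINE_RTT_MS * (1 - 0.80)    # RTT overhead reduced by 80%
ttft_ms = BASELINE_TTFT_MS * (1 - 0.50)  # TTFT improved by 50%

print(f"RTT overhead: {BASELINE_RTT_MS:.0f} ms -> {rtt_ms:.0f} ms")
print(f"TTFT:         {BASELINE_TTFT_MS:.0f} ms -> {ttft_ms:.0f} ms")
```

At these assumed baselines, first-token latency drops from roughly 400 ms to 200 ms, which is the difference between a perceptible pause and an effectively instant response.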

Trade-Off: Speed vs. Logic

GPT-5.3-Codex-Spark is optimized for throughput, not deep complexity. It is a smaller model than the flagship GPT-5.3-Codex and therefore has less reasoning depth.

https://openai.com/index/introduction-gpt-5-3-codex-spark/

Developers should be aware of these performance differences:

  • Benchmarks: Spark scores lower on SWE-Bench Pro and Terminal-Bench 2.0 than the flagship model, and may struggle with very complex, multi-file architecture changes.
  • Security: Under OpenAI’s Preparedness Framework, the flagship GPT-5.3-Codex is rated ‘High’ capability for cybersecurity. Spark does not meet that threshold and should not be used for sensitive security logic or autonomous authentication functions.

Quick Details and Access

Spark is now available to ChatGPT Pro users and developers. You can access it through the following tools:

  • Codex App: use the model picker to select ‘Spark’.
  • VS Code Extension: integrated directly into Composer.
  • CLI: access it via the command codex --model gpt-5.3-codex-spark.
| Feature | GPT-5.3-Codex-Spark | GPT-5.3-Codex (flagship) |
| --- | --- | --- |
| Tokens per second | 1,000+ | ~70 |
| Context window | 128k | 128k |
| Hardware | Cerebras WSE-3 | NVIDIA GPU cluster |
| Best for | Fast iteration | Deep reasoning/security |
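Taking the table's throughput figures at face value, here is a quick sketch of what the difference means for a single completion; the 500-token completion length is an illustrative assumption, not a figure from the announcement.

```python
# Time to stream one completion at each model's quoted throughput.
TOKENS = 500         # illustrative completion length (assumption)
SPARK_TPS = 1000     # Spark: 1,000+ tokens/second
FLAGSHIP_TPS = 70    # flagship GPT-5.3-Codex: ~70 tokens/second

spark_s = TOKENS / SPARK_TPS
flagship_s = TOKENS / FLAGSHIP_TPS
speedup = flagship_s / spark_s

print(f"Spark:    {spark_s:.2f} s")
print(f"Flagship: {flagship_s:.2f} s")
print(f"Speedup:  {speedup:.1f}x")  # roughly 14x at these figures
```

At these figures, a 500-token answer streams in about half a second on Spark versus roughly seven seconds on the flagship, which is consistent with the quoted 15x speedup.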

Key Takeaways

  • Fast speed: Spark is 15 times faster than the flagship GPT-5.3-Codex, delivering throughput of more than 1,000 tokens per second for near-instant code generation.
  • Custom silicon infrastructure: This is the first OpenAI model to run on Cerebras Wafer-Scale Engine 3 (WSE-3) hardware, using wafer-scale memory instead of traditional NVIDIA GPUs to eliminate data bottlenecks.
  • Sharp latency reduction: A persistent WebSocket connection cuts client-server round-trip overhead by 80% and improves time-to-first-token by 50%.
  • Real-time operation: Designed for ‘micro-iterations’, the model’s speed lets developers interrupt and redirect its logic in real time, shifting the workflow from batch processing to live pair programming.
  • Targeted capability trade-off: While fast, Spark has less reasoning depth than the flagship model and does not meet the ‘High’ capability threshold for cybersecurity under OpenAI’s Preparedness Framework, making it unsuitable for sensitive authentication or security tasks.


