FireRedTeam released FireRed-OCR-2B, using GRPO to solve structural hallucinations in tables and LaTeX for software developers.


Document digitization has long been a multi-step problem: first figure out the layout, then extract the text, and finally try to recreate the structure. For large vision-language models (LVLMs), this often leads to ‘structural hallucinations’ – disorganized lines, invented formulas, or unclosed syntax.

FireRedTeam has released FireRed-OCR-2B, a model designed to treat document parsing as a structural engineering task rather than ‘impressionist’ text generation. Built on the Qwen3-VL-2B-Instruct architecture, the model establishes a new end-to-end state of the art (SOTA), achieving an overall score of 92.94% on the OmniDocBench v1.5 benchmark.

Paradigm Shift: Structural Engineering vs. Text Generation

Developers often find that even the most powerful general VLMs struggle with the dense spatial logic of technical PDFs. When a model ‘sees’ a complex table or a multi-row LaTeX equation, it often fails to maintain the hierarchical relationships between elements.

FireRed-OCR-2B addresses this through a special progressive training pipeline that consists of three distinct phases:

  1. Multi-task pre-alignment: This step establishes spatial grounding by training the model on detection, region-level recognition, and layout-to-Markdown tasks.
  2. Standard SFT (Supervised Fine-Tuning): The model is fine-tuned on high-quality, standardized Markdown datasets to ensure logical consistency and hierarchical expression.
  3. Format-constrained GRPO: The final step uses reinforcement learning to enforce syntactic validity.
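The three-phase schedule above can be written down as a simple configuration; the phase names follow the article, but the task lists and objective labels are illustrative assumptions, not fields from the actual release:

```python
# Hypothetical sketch of FireRed-OCR-2B's progressive training schedule.
# Field values are assumptions for illustration, not from the release.
TRAINING_PHASES = [
    {
        "name": "multi_task_pre_alignment",
        "tasks": ["text_detection", "region_recognition", "layout_to_markdown"],
        "objective": "cross_entropy",
    },
    {
        "name": "supervised_fine_tuning",
        "tasks": ["document_to_markdown"],
        "objective": "cross_entropy",
    },
    {
        "name": "format_constrained_grpo",
        "tasks": ["document_to_markdown"],
        "objective": "grpo_with_format_reward",
    },
]
```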

Core Innovation: Format-Constrained GRPO

The most important technical differentiator for FireRed-OCR is its use of Format-Constrained Group Relative Policy Optimization (GRPO). While traditional fine-tuning focuses on character accuracy, GRPO introduces a reinforcement learning loop that rewards the model for specific structural traits:

  • Formula Syntax: Ensuring that LaTeX equations are mathematically valid.
  • Table Integrity: Maintaining consistent row/column counts and proper HTML/Markdown tagging.
  • Hierarchical Completion: Verifying that all opened structural tags (such as lists or headers) are closed correctly.
  • Text Accuracy: Reducing character-level errors in dense text blocks.
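The traits above lend themselves to cheap, rule-based reward checks. The sketch below is not FireRedTeam's actual reward code – the function name, checks, and weights are all assumptions – but it shows how such a format reward could be scored without a learned critic:

```python
import re

def format_reward(output: str) -> float:
    """Score an OCR output on four structural traits; each adds 0.25.

    Illustrative only: the real reward design is not public.
    """
    score = 0.0

    # Formula syntax: every $$...$$ block must have balanced braces.
    braces_ok = all(
        block.count("{") == block.count("}")
        for block in re.findall(r"\$\$(.*?)\$\$", output, re.DOTALL)
    )
    score += 0.25 if braces_ok else 0.0

    # Table integrity: all Markdown table rows share one column count.
    rows = [ln for ln in output.splitlines() if ln.strip().startswith("|")]
    col_counts = {ln.count("|") for ln in rows}
    score += 0.25 if len(col_counts) <= 1 else 0.0

    # Hierarchical completion: every opened HTML table tag is closed.
    opened = re.findall(r"<(table|tr|td|th)>", output)
    closed = re.findall(r"</(table|tr|td|th)>", output)
    score += 0.25 if sorted(opened) == sorted(closed) else 0.0

    # Text accuracy proxy: penalize obviously truncated endings.
    score += 0.25 if not output.rstrip().endswith(("\\", "|", "<")) else 0.0
    return score
```

A well-formed output (balanced LaTeX, consistent table, closed tags) scores 1.0; a truncated, malformed one scores much lower, so the RL loop pushes the policy toward syntactic validity.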

By eliminating the need for a separate ‘critic’ model – a key advantage of the GRPO algorithm – the FireRed team has optimized the training process to focus exclusively on high-friction areas of document parsing.
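The critic-free mechanism is simple to sketch: for each prompt, a group of completions is sampled and each reward is normalized against the group's own mean and standard deviation, yielding per-sample advantages with no value model. This is a generic GRPO illustration, not FireRed's training code:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Compute GRPO-style advantages from a group of sampled rewards.

    The group itself is the baseline, replacing a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero on uniform groups
    return [(r - mu) / sigma for r in rewards]
```

Completions that beat their group's average get positive advantages and are reinforced; the rest are suppressed.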

Solving the Long-Tail Layout Problem

The ‘long tail’ of document layouts (for example, non-standard legal forms, academic papers with overlapping figures, or handwritten annotations) is where most OCR pipelines break down. FireRed-OCR addresses it with a ‘Geometry + Semantics’ Data Factory.

This approach uses geometric feature clustering and multi-dimensional tagging to synthesize a balanced dataset. By combining geometric awareness with semantic understanding, the model maintains ‘in-the-wild robustness’, outperforming traditional pipelined systems like PaddleOCR on complex, non-standard layouts (as measured on the FireRedBench dataset).
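The data factory itself is not public, but the geometric-clustering idea can be illustrated: assign each page a coarse layout signature from its block geometry, then resample so rare layouts appear as often as common ones. The signature features and bucket sizes below are hypothetical:

```python
import random
from collections import defaultdict

def layout_signature(boxes):
    """Coarse geometric signature for a page given (x0, y0, x1, y1) blocks."""
    count_bucket = min(len(boxes) // 5, 3)  # 0-4, 5-9, 10-14, 15+ blocks
    wide = sum(1 for x0, y0, x1, y1 in boxes if (x1 - x0) > (y1 - y0))
    orientation = "wide" if wide * 2 >= len(boxes) else "tall"
    return (count_bucket, orientation)

def balance_by_layout(pages, per_cluster, seed=0):
    """Resample so every layout cluster contributes the same page count."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for page in pages:
        clusters[layout_signature(page)].append(page)
    balanced = []
    for members in clusters.values():
        balanced.extend(rng.choices(members, k=per_cluster))
    return balanced
```

Upsampling the tail this way keeps unusual layouts from being drowned out by the mass of ordinary single-column pages during training.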

Performance Benchmarks

In a head-to-head comparison on OmniDocBench v1.5, FireRed-OCR-2B (92.94%) significantly outperforms other end-to-end models, including:

  • DeepSeek-OCR 2: 91.09%
  • Gemini-3.0 Pro: 90.33%
  • Qwen3-VL-235B: 89.15%

While some ‘pipeline’ solutions (which use separate models for detection and recognition) achieve slightly higher scores, FireRed-OCR-2B represents the leading performance for a single-model, end-to-end approach. This is particularly relevant for developers who want to reduce system complexity and inference latency in production RAG (Retrieval-Augmented Generation) environments.
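For RAG ingestion in particular, structured Markdown output pays off because chunk boundaries can follow the document's recovered hierarchy instead of fixed token windows. A minimal heading-based chunker (a generic sketch, not tied to FireRed-OCR's specific output format):

```python
import re

def chunk_markdown(doc: str) -> list[str]:
    """Split parsed Markdown on headings so each RAG chunk keeps a
    whole structural unit (heading plus its body)."""
    chunks, current = [], []
    for line in doc.splitlines():
        # Start a new chunk at each ATX heading (# through ######).
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```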

Key Takeaways

I’ve summarized the technical significance and performance metrics of the FireRed-OCR-2B release into key takeaways for AI engineers and data scientists.

Key Points: FireRed-OCR-2B

  • New end-to-end SOTA performance: FireRed-OCR-2B achieves a state-of-the-art (SOTA) score of 92.94% on the OmniDocBench v1.5 benchmark. This makes it the leading single-model solution for document parsing, outperforming much larger models such as Qwen3-VL-235B and Gemini-3.0 Pro in structural accuracy.
  • Architectural Foundation: Built on the Qwen3-VL-2B-Instruct base, the model uses a vision-language-model (VLM) approach. It replaces traditional multi-stage pipelines (separate detection, cropping, and OCR steps) with a unified, end-to-end Transformer architecture that directly outputs structured Markdown.
  • Structural integrity through GRPO: A key technical differentiator is the use of Format-Constrained GRPO (Group Relative Policy Optimization). This reinforcement learning technique rewards the model for maintaining syntactic validity – specifically ensuring that LaTeX formulas, table tags, and Markdown hierarchies are logically closed and mathematically consistent.
  • ‘Geometry + Semantics’ Data Factory: To solve the problem of complex ‘in-the-wild’ layouts, FireRedTeam developed a special data engine. This ‘factory’ synthesizes datasets by balancing geometric layout features with semantic content, enabling the model to handle overlapping figures, multi-column academic papers and non-standard forms more reliably than previous iterations.

Check out the model weights and the repo.

