Jina AI has released Jina-VLM, a 2.4B-parameter vision-language model that targets multilingual visual question answering and document understanding on limited hardware. The model combines a SigLIP2 vision encoder with a Qwen3 language backbone and uses an attention-pooling connector that reduces visual tokens while preserving spatial structure. Among open 2B-scale VLMs, it reaches state-of-the-art results on multilingual benchmarks such as MMMB and multilingual MMBench.

Architecture: Overlapping Tiles with an Attention Pooling Connector
Jina-VLM keeps the standard VLM layout but optimizes the vision side for arbitrary resolutions and low token counts. The vision encoder is SigLIP2 So400M/14 at 384 resolution, a 27-layer vision transformer with about 400M parameters. Each 378×378 crop is processed as a 27×27 grid of 14×14 patches, so each tile produces 729 patch tokens.
To handle high-resolution images, the model does not resize the entire input into a square. Instead, it builds a grid of up to 12 overlapping tiles plus a global thumbnail. Each tile is a 378×378 crop, adjacent tiles overlap by 112 pixels, and tile origins are spaced 266 pixels apart. A 4×3 grid therefore covers an effective resolution of 1176×910 pixels; larger images are downscaled to fit within the tile budget.
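The tile geometry above is simple arithmetic: with a 112-pixel overlap, the stride between tile origins is 378 − 112 = 266, and an n-tile row covers 266·(n − 1) + 378 pixels. A minimal sketch (constant and function names are ours, not from the Jina-VLM code):

```python
# Sketch of the overlapping tile geometry: 378x378 tiles, 112 px overlap.
TILE = 378               # tile side in pixels
OVERLAP = 112            # overlap between adjacent tiles
STRIDE = TILE - OVERLAP  # 266 px between tile origins

def tile_origins(cols: int, rows: int) -> list[tuple[int, int]]:
    """Top-left corner of each tile in a cols x rows overlapping grid."""
    return [(c * STRIDE, r * STRIDE) for r in range(rows) for c in range(cols)]

def effective_resolution(cols: int, rows: int) -> tuple[int, int]:
    """Pixel area covered by the grid: stride * (n - 1) + tile side."""
    return (STRIDE * (cols - 1) + TILE, STRIDE * (rows - 1) + TILE)

origins = tile_origins(4, 3)              # the default 4x3 = 12-tile grid
assert len(origins) == 12
assert effective_resolution(4, 3) == (1176, 910)

# Each tile is cut into 14x14 patches: (378 / 14)^2 = 27 * 27 = 729 tokens.
assert (TILE // 14) ** 2 == 729
```

The assertions reproduce the numbers from the text: a 4×3 grid covers 1176×910 pixels and each tile yields 729 patch tokens.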
The central design choice is the vision-language connector. Instead of using only the last ViT layer, Jina-VLM combines features from two intermediate layers, the third-from-last and ninth-from-last (layers 24 and 18), blending high-level semantics with mid-level spatial detail. The connector then applies attention pooling over 2×2 patch neighborhoods: it computes a mean-pooled query for each 2×2 region, attends over the combined feature map, and outputs a single pooled token per neighborhood. This reduces the 729 visual tokens to 182 tokens per tile, a 4x compression. A SwiGLU projection maps the pooled features to the Qwen3 embedding dimension.
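A minimal NumPy sketch of the pooling idea, under simplifying assumptions: we restrict each query's attention to its own 2×2 window (the actual connector also has learned projections and may attend more broadly), and we use an even-sized toy feature map, since the exact padding the model applies to the odd 27×27 grid to reach 182 tokens is not specified here:

```python
import numpy as np

def attention_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Pool an (H, W, D) patch-feature map to (H//2, W//2, D).

    For each 2x2 neighborhood, the mean of its four tokens is the query;
    softmax attention over those four tokens yields one pooled token.
    Sketch only -- no learned parameters.
    """
    H, W, D = x.shape
    h, w = H // 2, W // 2
    # Group patches into (h, w, 4, D) neighborhoods.
    blocks = x[: h * 2, : w * 2].reshape(h, 2, w, 2, D).transpose(0, 2, 1, 3, 4)
    blocks = blocks.reshape(h, w, 4, D)
    q = blocks.mean(axis=2, keepdims=True)         # (h, w, 1, D) mean query
    scores = (q * blocks).sum(-1) / np.sqrt(D)     # (h, w, 4) dot products
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)            # softmax over the window
    return (attn[..., None] * blocks).sum(axis=2)  # (h, w, D) pooled tokens

feats = np.random.randn(26, 28, 64)    # toy feature map with even dims
pooled = attention_pool_2x2(feats)
assert pooled.shape == (13, 14, 64)    # 13 * 14 = 182 tokens, as in the text
```

Unlike plain average pooling, the attention weights let the pooled token emphasize the most informative patch in each neighborhood, which is why spatial structure survives the 4x compression better.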
With the default 12-tile configuration plus thumbnail, a naive connector would feed 9,477 visual tokens into the language model. Attention pooling reduces this to 2,366 visual tokens. The ViT computation does not change, but the language backbone sees approximately 3.9x fewer prefill FLOPs and a 4x smaller KV cache. Including the shared ViT cost, total FLOPs drop by approximately 2.3x for the default setting.
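The token counts above follow directly from 13 image inputs (12 tiles plus one thumbnail) at 729 raw or 182 pooled tokens each:

```python
# Back-of-the-envelope token counts for the default 12-tile + thumbnail input.
TOKENS_PER_TILE_RAW = 729      # 27 x 27 patch tokens from the ViT
TOKENS_PER_TILE_POOLED = 182   # after 2x2 attention pooling

n_inputs = 12 + 1              # 12 tiles plus one global thumbnail
raw = n_inputs * TOKENS_PER_TILE_RAW
pooled = n_inputs * TOKENS_PER_TILE_POOLED

assert raw == 9_477
assert pooled == 2_366
# LLM prefill work and KV cache scale roughly linearly in visual tokens,
# so pooling cuts the LLM-side cost by ~4x.
assert round(raw / pooled, 1) == 4.0
```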
The language decoder is Qwen3-1.7B-Base. The model introduces special tokens to delimit images and tile sequences and to mark rows in the patch grid. Visual tokens from the connector are interleaved with text embeddings and passed to Qwen3 to generate answers.
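To make the interleaving concrete, here is a hypothetical sketch of how pooled visual tokens might be wrapped with image, tile, and row markers; the token names are illustrative and are not Jina-VLM's actual vocabulary:

```python
# Hypothetical markers -- illustrative names, not Jina-VLM's real special tokens.
def build_visual_sequence(tile_rows: list[list[list[str]]]) -> list[str]:
    """tile_rows: grid rows -> tiles -> pooled visual token ids (as strings)."""
    seq = ["<image_start>"]
    for row in tile_rows:
        for tile in row:
            seq += ["<tile>"] + tile      # each tile's pooled tokens
        seq.append("<row_end>")           # marks a row of the tile grid
    seq.append("<image_end>")
    return seq

grid = [[["v1", "v2"], ["v3", "v4"]],     # toy 2x2 tile grid, 2 tokens per tile
        [["v5", "v6"], ["v7", "v8"]]]
seq = build_visual_sequence(grid)
assert seq[0] == "<image_start>" and seq[-1] == "<image_end>"
assert seq.count("<tile>") == 4 and seq.count("<row_end>") == 2
```

Row markers like these are what lets a 1D decoder recover the 2D layout of the tile grid.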
Training Pipeline and Multilingual Data Blending
Training proceeds in two stages. All components, encoder, connector, and decoder, are updated jointly, with no freezing. The full corpus contains approximately 5M multimodal samples and 12B text tokens across over 30 languages. About half of the text is English; the rest covers high- and medium-resource languages such as Chinese, Arabic, German, Spanish, French, Italian, Japanese, and Korean.
Stage 1 is alignment training. The goal is cross-language visual grounding, not instruction following. The team uses caption-heavy datasets, PixMo-Cap and PangeaIns, which span natural images, documents, diagrams, and infographics. They add 15 percent text-only data from the Pleias Common Corpus to guard against degradation on pure language tasks. The connector uses a higher learning rate and shorter warmup than the encoder and decoder, which speeds up its optimization without destabilizing the backbone.
Stage 2 is instruction fine-tuning. Here the model learns to follow prompts for visual question answering and reasoning. The mix combines LLaVA-OneVision, The Cauldron, Cambrian, PangeaIns, and FineVision, as well as Aya-style multilingual text instructions. The research team first trains for 30,000 steps with single-source batches, then for another 30,000 steps with mixed-source batches. This schedule stabilizes learning under very heterogeneous supervision.
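The single-source-then-mixed-source schedule can be sketched as a simple batch sampler. This is our own illustration of the idea, not the team's training code; dataset names follow the text:

```python
import random

def make_batch(datasets: dict[str, list], batch_size: int, mixed: bool,
               rng: random.Random) -> list:
    """Draw one batch, either from a single source or mixed across sources."""
    if mixed:
        # Mixed-source phase: each sample comes from a randomly chosen dataset.
        return [rng.choice(datasets[rng.choice(list(datasets))])
                for _ in range(batch_size)]
    # Single-source phase: the whole batch comes from one dataset.
    name = rng.choice(list(datasets))
    return [rng.choice(datasets[name]) for _ in range(batch_size)]

rng = random.Random(0)
data = {"LLaVA-OneVision": ["a"] * 10, "Cauldron": ["b"] * 10,
        "PangeaIns": ["c"] * 10}
single = make_batch(data, 8, mixed=False, rng=rng)
assert len(set(single)) == 1          # one source per batch in the first phase
mixed = make_batch(data, 8, mixed=True, rng=rng)
assert len(mixed) == 8                # later batches blend sources freely
```

Single-source batches keep early gradients coherent within one supervision style; switching to mixed batches afterward forces the model to reconcile the heterogeneous formats.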
Across alignment and fine-tuning, the model sees approximately 10B tokens in stage 1 and 37B tokens in stage 2, with a total of approximately 1,300 GPU hours reported for the main experiments.
Benchmark Profile: A 2.4B Model with Multilingual Strength
On standard English VQA tasks covering diagrams, charts, documents, OCR, and mixed scenes, Jina-VLM reaches an average score of 72.3 across 8 benchmarks: AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED-Bench-2-Plus, and CharXiv. This is the best average among the 2B-scale models compared in the paper.
On multimodal and real-world understanding tasks, the model scores 67.4 on the multimodal group, which includes MME, MMBench v1.1, and MMStar. It scores 61.9 on the real-world ensemble, which includes RealWorldQA, MME-RealWorld, and R-Bench, and it reaches 68.2 accuracy on RealWorldQA, the best result among the considered baselines.


Multi-image reasoning is a weak area. On BLINK, MuirBench, and MMT, Jina-VLM reaches an average of 47.3. The research team attributes this to limited multi-image training data. In contrast, hallucination control is stronger: on the POPE benchmark, which measures object hallucination, the model scores 90.3, the best score in the comparison table.
For mathematical and structured reasoning, the model relies on the same architecture with no special additions. It reaches 59.5 on MMMU, and its overall math average is 33.3 across MathVista, MathVision, MathVerse, VMath, and LogicVista. Jina-VLM is comparable to InternVL3-2B on this set and clearly ahead of Qwen2-VL-2B, while InternVL3.5-2B remains stronger, owing to its larger scale and more specialized math training.
On pure text benchmarks, the picture is mixed. The research team reports that Jina-VLM retains most of Qwen3-1.7B's performance on MMLU, GSM8K, ARC-C, and HellaSwag. However, after multimodal tuning, MMLU-Pro drops from 46.4 for the base model to 30.3. The team attributes this to instruction tuning that pushes the model toward much shorter answers, which clashes with the longer multi-step reasoning MMLU-Pro requires.
The main attraction is multilingual multimodal understanding. On MMMB in Arabic, Chinese, English, Portuguese, Russian, and Turkish, Jina-VLM reaches an average of 78.8. On multilingual MMBench in the same languages, it reaches 74.3. The research team reports both as state-of-the-art averages among open 2B-scale VLMs.
Comparison Table
| Model | Params | VQA avg | MMMB | Multi. MMBench | DocVQA | OCRBench |
|---|---|---|---|---|---|---|
| Jina-VLM | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |
Key Takeaways
- Jina-VLM is a 2.4B-parameter VLM that combines SigLIP2 So400M as the vision encoder with Qwen3-1.7B as the language backbone through an attention-pooling connector that compresses visual tokens by 4x while preserving spatial structure.
- The model uses overlapping 378×378 tiles, up to 12 of them plus a global thumbnail, to handle arbitrary-resolution images up to about 4K, then feeds only pooled visual tokens to the LLM, reducing prefill FLOPs and KV cache size by about 4x compared with naive patch-token usage.
- Training uses approximately 5M multimodal samples and 12B text tokens across roughly 30 languages in a two-stage pipeline: first alignment on caption-style data, then instruction fine-tuning on LLaVA-OneVision, The Cauldron, Cambrian, PangeaIns, FineVision, and multilingual instruction sets.
- On English VQA, Jina-VLM reaches a 72.3 average across 8 benchmarks, and on multilingual multimodal benchmarks it leads the open 2B-scale class with 78.8 on MMMB and 74.3 on multilingual MMBench, while keeping text-only performance mostly, but not entirely, competitive.
Check out the paper, the model on Hugging Face, and the GitHub page for tutorials, code, and notebooks.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understood by a wide audience. The platform boasts over 2 million monthly views.