Google AI introduces Gemini Embedding 2: a multimodal embedding model that lets you bring text, images, video, audio, and docs into the embedding space.

by ai-intensify

Google expands its Gemini model family with the release of Gemini Embedding 2. This second-generation model builds on the success of the text-only gemini-embedding-001 and is specifically designed to address the high-dimensional storage and cross-modal retrieval challenges faced by developers building production-grade Retrieval-Augmented Generation (RAG) systems. The Gemini Embedding 2 release marks a significant technical shift in how embedding models are architected, moving away from modality-specific pipelines toward an integrated, natively multimodal latent space.

Native Multimodality and Interleaved Input

The primary architectural advancement in Gemini Embedding 2 is its ability to map five media types (text, image, video, audio, and PDF) into a single, high-dimensional vector space. This eliminates the need for complex pipelines that previously required different models for different data types, such as CLIP for images and BERT-based models for text.

The model supports interleaved input, allowing developers to combine different modalities in a single embedding request. This is especially relevant for use cases where text alone does not provide sufficient context. The technical limits on these inputs are as follows:

  • Text: Up to 8,192 tokens per request.
  • Images: Up to 6 images (PNG, JPEG, WebP, HEIC/HEIF).
  • Video: Up to 120 seconds of video (MP4, MOV, etc.).
  • Audio: Up to 80 seconds of raw audio (MP3, WAV, etc.), with no separate transcription step required.
  • Documents: Up to 6 pages of PDF files.
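As a concrete illustration, the limits above can be enforced client-side before a request is ever sent. This is a minimal sketch, not part of any official SDK; the constants simply mirror the figures quoted above, and the request fields are illustrative placeholders.

```python
# Client-side guard for the interleaved-input limits quoted above.
# The numeric limits mirror the article; the function is a hypothetical helper.

MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_AUDIO_SECONDS = 80
MAX_PDF_PAGES = 6

def validate_request(text_tokens=0, images=0, video_seconds=0.0,
                     audio_seconds=0.0, pdf_pages=0):
    """Return a list of limit violations (an empty list means the request is OK)."""
    errors = []
    if text_tokens > MAX_TEXT_TOKENS:
        errors.append(f"text: {text_tokens} tokens exceeds {MAX_TEXT_TOKENS}")
    if images > MAX_IMAGES:
        errors.append(f"images: {images} exceeds {MAX_IMAGES}")
    if video_seconds > MAX_VIDEO_SECONDS:
        errors.append(f"video: {video_seconds}s exceeds {MAX_VIDEO_SECONDS}s")
    if audio_seconds > MAX_AUDIO_SECONDS:
        errors.append(f"audio: {audio_seconds}s exceeds {MAX_AUDIO_SECONDS}s")
    if pdf_pages > MAX_PDF_PAGES:
        errors.append(f"pdf: {pdf_pages} pages exceeds {MAX_PDF_PAGES}")
    return errors

# An image-plus-caption request within limits passes; a 3-minute video fails.
assert validate_request(text_tokens=120, images=1) == []
assert len(validate_request(video_seconds=180)) == 1
```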

By processing these inputs jointly, Gemini Embedding 2 captures the semantic relationship between, say, a visual frame in a video and the spoken dialogue in its audio track, representing them as a single vector that can be compared with text queries using standard distance metrics such as cosine similarity.
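Because every modality lands in the same space, comparison reduces to a standard distance metric. A minimal pure-Python cosine similarity illustrates this; the vectors here are toy placeholders, not real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a text-query embedding and a video-frame embedding:
query = [0.1, 0.3, 0.5]
frame = [0.2, 0.25, 0.55]
unrelated = [0.9, -0.4, 0.0]

# The frame that matches the query semantically scores higher.
assert cosine_similarity(query, frame) > cosine_similarity(query, unrelated)
```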

Efficiency through Matryoshka Representation Learning (MRL)

Storage and computation costs are often the primary barriers to large-scale vector search. To mitigate this, Gemini Embedding 2 implements Matryoshka Representation Learning (MRL).

Standard embedding models distribute semantic information evenly across all dimensions. If a developer truncates a 3,072-dimension vector to 768 dimensions, accuracy usually collapses because information is lost uniformly. In contrast, Gemini Embedding 2 is trained to pack the most important semantic information into the leading dimensions of the vector.

The model defaults to 3,072 dimensions, but the Google team has optimized three specific tiers for production use:

  1. 3,072: Maximum accuracy for complex legal, medical, or technical datasets.
  2. 1,536: Balance of performance and storage efficiency.
  3. 768: Optimized for low-latency retrieval and low memory footprint.
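Because the leading dimensions carry the most information, moving between these tiers is just slicing and re-normalizing. A minimal sketch, where an 8-dimensional toy vector stands in for a real 3,072-dimensional embedding:

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` dimensions of an MRL embedding and re-normalize
    to unit length so cosine similarity remains well-behaved."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy vector: most of the magnitude sits in the leading dimensions, as MRL trains for.
full = [0.9, 0.4, 0.1, 0.05, 0.02, 0.01, 0.005, 0.001]
short = truncate_embedding(full, 4)

assert len(short) == 4
assert abs(sum(x * x for x in short) - 1.0) < 1e-9  # unit length after truncation
```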

Matryoshka Representation Learning (MRL) enables a ‘shortlisting’ architecture. A system can perform a coarse, high-speed search across millions of items using 768-dimension sub-vectors, then precisely re-rank the top results using the full 3,072-dimension embeddings. This reduces the computational overhead of the initial retrieval stage without compromising the final accuracy of the RAG pipeline.
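The shortlisting pattern can be sketched end to end: a coarse pass over truncated vectors, then a precise re-rank of the survivors at full dimensionality. Everything here is illustrative (toy 6-dimensional “full” vectors and a 3-dimensional coarse tier stand in for 3,072 and 768):

```python
import math

def cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def shortlist_then_rerank(query, corpus, coarse_dims, k):
    """Stage 1: rank every document on the first `coarse_dims` dimensions.
    Stage 2: re-rank the top-k survivors with the full vectors."""
    coarse = sorted(corpus,
                    key=lambda d: cos(query[:coarse_dims], d[:coarse_dims]),
                    reverse=True)[:k]
    return sorted(coarse, key=lambda d: cos(query, d), reverse=True)

query = [0.5, 0.5, 0.1, 0.2, 0.1, 0.0]
corpus = [
    [0.5, 0.4, 0.1, 0.2, 0.1, 0.0],   # close to the query
    [0.0, 0.1, 0.9, 0.0, 0.0, 0.5],   # unrelated
    [0.4, 0.5, 0.2, 0.1, 0.2, 0.1],   # also close
]

top = shortlist_then_rerank(query, corpus, coarse_dims=3, k=2)
assert corpus[1] not in top  # the unrelated document never reaches the re-rank stage
```

In production the coarse pass would run inside a vector database over the 768-dimension tier, with only the shortlisted IDs fetched at full width.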

Benchmarking: MTEB and Long-Context Retrieval

Google AI’s internal evaluations and results on the Massive Text Embedding Benchmark (MTEB) indicate that Gemini Embedding 2 outperforms its predecessor in two specific areas: retrieval accuracy and robustness to domain shift.

Many embedding models suffer from ‘domain drift’, where accuracy degrades when moving from general training data (such as Wikipedia) to specialized domains (such as proprietary codebases). Gemini Embedding 2 was trained with a multi-stage process incorporating diverse datasets to ensure high zero-shot performance on specialized tasks.

The model’s 8,192-token window is particularly significant for RAG. It allows large ‘chunks’ of text to be embedded, preserving the context needed to resolve coreferences and long-range dependencies within a document. This reduces the likelihood of ‘context fragmentation’, a common problem where a retrieved fragment lacks the information the LLM needs to generate a coherent answer.
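In practice, the large window means chunkers can cut far less aggressively. A minimal word-based chunker sketch; a real pipeline would count model tokens rather than whitespace-split words, and the 8,192 figure is simply the window quoted above:

```python
def chunk_words(text, max_words=8192, overlap=200):
    """Split text into overlapping chunks. Word counts stand in for model
    tokens here; a production chunker would use the model's tokenizer."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

doc = ("word " * 10000).strip()  # a 10,000-word toy document
chunks = chunk_words(doc, max_words=8192, overlap=200)
assert len(chunks) == 2               # the whole document fits in two large chunks
assert len(chunks[0].split()) == 8192
```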

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/

Key Takeaways

  1. Native multimodality: Gemini Embedding 2 supports five media types (text, image, video, audio, and PDF) within a unified vector space. This allows interleaved input (for example, an image combined with a text caption) to be processed as a single embedding without separate model pipelines.
  2. Matryoshka Representation Learning (MRL): The model is trained to store the most important semantic information in the leading dimensions of the vector. While it defaults to 3,072 dimensions, it supports efficient truncation to 1,536 or 768 dimensions with minimal loss in accuracy, reducing storage costs and increasing retrieval speed.
  3. Extended context window: The model’s 8,192-token input window allows large text ‘chunks’ in RAG pipelines. It shows significant performance improvements on the Massive Text Embedding Benchmark (MTEB), particularly in retrieval accuracy and in handling specialized domains such as code or technical documentation.
  4. Task-specific optimization: Developers can use the task_type parameter (e.g. RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, or CLASSIFICATION) to signal the intended use to the model. This optimizes the mathematical properties of the vector for a specific operation, improving the ‘hit rate’ in semantic search.
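As a sketch of how that task signal might be wired in, the snippet below builds the request configuration as a plain dictionary. The field names (task_type, output_dimensionality) follow the existing Gemini embedding API for gemini-embedding-001 and are assumed, not confirmed, to carry over to this model; the helper itself is hypothetical.

```python
# Illustrative request configs; field names (task_type, output_dimensionality)
# follow the existing Gemini embedding API and are assumed for this model.

VALID_TASK_TYPES = {"RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT", "CLASSIFICATION"}

def embed_config(task_type, output_dimensionality=3072):
    """Build an embedding request config, restricted to the optimized MRL tiers."""
    if task_type not in VALID_TASK_TYPES:
        raise ValueError(f"unknown task_type: {task_type}")
    if output_dimensionality not in (3072, 1536, 768):
        raise ValueError("use one of the optimized tiers: 3072, 1536, or 768")
    return {"task_type": task_type,
            "output_dimensionality": output_dimensionality}

# Asymmetric retrieval: queries and documents get different task signals.
query_cfg = embed_config("RETRIEVAL_QUERY", 768)
doc_cfg = embed_config("RETRIEVAL_DOCUMENT", 768)
assert query_cfg["task_type"] != doc_cfg["task_type"]
```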

Check out the technical details. Gemini Embedding 2 is available in public preview via the Gemini API and Vertex AI.

