OmniFusion

Maintained by AIRI-Institute

License: Apache 2.0
Paper: ArXiv Link
Base Model: Mistral-7B
Visual Encoders: CLIP-ViT-L, DINOv2

What is OmniFusion?

OmniFusion is a multimodal AI model that extends traditional language processing by integrating additional data modalities. Built on Mistral-7B, it specializes in image understanding through a dual-encoder visual architecture. The latest version (1.1) adds Russian language support and achieves state-of-the-art performance on several vision-language tasks.

Implementation Details

The model combines CLIP-ViT-L and DINOv2 visual encoders with a custom adapter. The adapter maps visual features into the language model's textual embedding space, so that image and text content can be processed as a single token sequence.
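
As a rough illustration of this design, here is a minimal PyTorch sketch, not the released OmniFusion code: the class names, feature widths (1024 for each ViT-L encoder, 4096 for Mistral-7B), and the concatenation-based fusion are assumptions made for the example.

    import torch
    import torch.nn as nn

    class VisualAdapter(nn.Module):
        # Illustrative adapter: fuses features from two frozen visual
        # encoders and projects them into the language model's embedding
        # space. Dimensions are assumptions, not the actual configuration.
        def __init__(self, clip_dim=1024, dino_dim=1024, lm_dim=4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(clip_dim + dino_dim, lm_dim),
                nn.GELU(),
                nn.Linear(lm_dim, lm_dim),
            )

        def forward(self, clip_feats, dino_feats):
            # clip_feats: (n_patches, clip_dim); dino_feats: (n_patches, dino_dim)
            fused = torch.cat([clip_feats, dino_feats], dim=-1)
            return self.proj(fused)  # (n_patches, lm_dim)

    def splice_image_embeddings(text_embeds, image_embeds, img_start, img_end):
        # Place the projected image embeddings between the embeddings of
        # two custom marker tokens, ahead of the text prompt embeddings.
        return torch.cat(
            [img_start.unsqueeze(0), image_embeds, img_end.unsqueeze(0), text_embeds],
            dim=0,
        )

The marker tokens here correspond to the custom visual-data tokens mentioned in the list below.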

  • Two-stage training process: adapter pre-training, then full-model fine-tuning (see the sketch after this list)
  • Custom tokens for visual data marking in text sequences
  • Comprehensive training dataset including caption, VQA, and conversation tasks
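
A minimal sketch of how such a two-stage schedule can be expressed in PyTorch; the attribute names (language_model, visual_encoders, adapter) and the learning rates are placeholders, not OmniFusion's published training configuration.

    import torch

    def set_trainable(module, flag):
        for p in module.parameters():
            p.requires_grad = flag

    def trainable_optimizer(model, lr):
        params = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.AdamW(params, lr=lr)

    def stage1_adapter_pretraining(model):
        # Only the adapter learns; the language model and the visual
        # encoders stay frozen.
        set_trainable(model.language_model, False)
        set_trainable(model.visual_encoders, False)
        set_trainable(model.adapter, True)
        return trainable_optimizer(model, lr=1e-4)

    def stage2_full_finetuning(model):
        # Unfreeze the language model as well, at a lower learning rate;
        # the visual encoders remain frozen.
        set_trainable(model.language_model, True)
        set_trainable(model.adapter, True)
        return trainable_optimizer(model, lr=1e-5)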

Core Capabilities

  • Superior performance on TextVQA (48.93%) and ScienceQA (68.02%)
  • Bilingual support (English and Russian)
  • Advanced visual dialogue capabilities with high NDCG scores (the metric is sketched after this list)
  • Efficient processing of complex visual-textual queries
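
For context, NDCG (normalized discounted cumulative gain) scores a ranking of candidate answers by how high it places the most relevant ones. Below is a generic reference implementation of the metric, not OmniFusion-specific code; the example relevance scores are invented.

    import math

    def dcg(relevances):
        # DCG = sum over ranks i (1-based) of rel_i / log2(i + 1)
        return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))

    def ndcg(ranked_relevances):
        ideal = dcg(sorted(ranked_relevances, reverse=True))
        return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

    # Example: four candidate answers in model-ranked order,
    # with annotator relevance scores in [0, 1].
    print(round(ndcg([1.0, 0.0, 0.5, 0.0]), 3))  # 0.95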

Frequently Asked Questions

Q: What makes this model unique?

OmniFusion's distinctive features are its dual-encoder architecture and specialized adapter mechanism, which enable strong multimodal understanding while maintaining computational efficiency. Support for both English and Russian makes it particularly versatile.

Q: What are the recommended use cases?

The model excels in image-text interaction tasks, including visual question answering, image captioning, and multimodal dialogue. It's particularly suited for applications requiring detailed visual understanding and natural language interaction in both English and Russian.
