# XLM-Roberta-Large-Vit-B-32
| Property | Value |
|---|---|
| Author | M-CLIP |
| Downloads | 12.3M+ |
| Languages Supported | 48 |
| Framework | PyTorch, TensorFlow |
## What is XLM-Roberta-Large-Vit-B-32?
XLM-Roberta-Large-Vit-B-32 is a multilingual extension of OpenAI's CLIP model, designed to bridge vision and language across 48 languages. It pairs an XLM-RoBERTa-Large text encoder with CLIP's ViT-B/32 vision transformer, enabling cross-lingual vision-language understanding.
## Implementation Details
The model architecture consists of two main components: a multilingual text encoder based on XLM-RoBERTa-Large, trained to project text into the CLIP embedding space, and the ViT-B/32 vision encoder from the original CLIP, which is used unchanged for images. In cross-lingual retrieval benchmarks the model reaches an R@10 of 91.8% for English and stays roughly in the 80-90% range for other evaluated languages such as German, Spanish, French, and Chinese.
- Multilingual text encoding supporting 48 languages including English, German, Chinese, Russian, and more
- Compatible with both PyTorch and TensorFlow frameworks
- Demonstrated strong cross-lingual retrieval capabilities
- Easy integration with the multilingual-clip package (see the text-encoding sketch after this list)
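The snippet below is a minimal sketch of that workflow, following the usage shown in the multilingual-clip documentation: load the text encoder and tokenizer, then embed sentences in several languages. The example sentences are illustrative, and the exact API may vary between package versions.

```python
# pip install multilingual-clip torch transformers
from multilingual_clip import pt_multilingual_clip
import transformers

model_name = "M-CLIP/XLM-Roberta-Large-Vit-B-32"

# Load the multilingual text encoder and its tokenizer
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Illustrative sentences in three of the supported languages
texts = [
    "A dog playing in the snow.",        # English
    "Ein Hund, der im Schnee spielt.",   # German
    "一只在雪地里玩耍的狗。",              # Chinese
]

# Encode the sentences into the shared CLIP embedding space
embeddings = model.forward(texts, tokenizer)
print("Text features shape:", embeddings.shape)
```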
## Core Capabilities
- Cross-lingual image-text retrieval (a retrieval sketch follows this list)
- Multilingual zero-shot classification
- Text-to-image search across 48 languages
- Competitive performance with English-only CLIP models
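As a rough illustration of cross-lingual text-to-image retrieval, the sketch below pairs the M-CLIP text encoder with the original CLIP ViT-B/32 image encoder (loaded via OpenAI's clip package) and ranks a handful of local images against a French query by cosine similarity. The image filenames are hypothetical, and the sketch assumes both encoders project into the same shared embedding space.

```python
# pip install multilingual-clip torch transformers
# pip install git+https://github.com/openai/CLIP.git   (ViT-B/32 image encoder)
import torch
import clip
import transformers
from PIL import Image
from multilingual_clip import pt_multilingual_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "M-CLIP/XLM-Roberta-Large-Vit-B-32"

# Text side: multilingual M-CLIP encoder; image side: the matching original CLIP ViT-B/32
text_model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
image_model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical local images to search over
image_paths = ["cat.jpg", "beach.jpg", "mountain.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)

# French query: "A cat sleeping on a sofa"
query = "Un chat qui dort sur un canapé"

with torch.no_grad():
    text_emb = text_model.forward([query], tokenizer)            # (1, d), on CPU
    image_emb = image_model.encode_image(images).float().cpu()   # (N, d)

# Cosine similarity between the query and every image (assumes both encoders share dimension d)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)

best = scores.argmax().item()
print(f"Best match for the French query: {image_paths[best]} (cosine {scores[best]:.3f})")
```

Because only the text encoder was retrained, image embeddings computed once with the standard CLIP ViT-B/32 can be reused for queries in any of the 48 supported languages.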
## Frequently Asked Questions
### Q: What makes this model unique?
This model extends CLIP's capabilities to 48 languages while keeping retrieval performance close to the original English-only model. It is particularly notable for achieving R@10 scores above 88% for most of the evaluated languages.
### Q: What are the recommended use cases?
The model is ideal for multilingual image-text retrieval tasks, cross-lingual visual search systems, and zero-shot classification applications where content needs to be processed in multiple languages.
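To make the zero-shot classification use case concrete, here is a hedged sketch that scores a single image against class prompts written in Spanish. The class names, prompt template, image filename, and softmax temperature are illustrative choices rather than anything prescribed by the model release.

```python
# pip install multilingual-clip torch transformers
# pip install git+https://github.com/openai/CLIP.git
import torch
import clip
import transformers
from PIL import Image
from multilingual_clip import pt_multilingual_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "M-CLIP/XLM-Roberta-Large-Vit-B-32"

text_model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
image_model, preprocess = clip.load("ViT-B/32", device=device)

# Class labels expressed as Spanish prompts (illustrative labels and template)
classes = ["un perro", "un gato", "un coche", "una playa"]
prompts = [f"una foto de {c}" for c in classes]

# Hypothetical input image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    text_emb = text_model.forward(prompts, tokenizer)
    image_emb = image_model.encode_image(image).float().cpu()

text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Softmax over scaled cosine similarities gives pseudo-probabilities over the Spanish labels
probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1).squeeze(0)
for label, p in zip(classes, probs.tolist()):
    print(f"{label}: {p:.2%}")
```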