XLM-Roberta-Large-Vit-B-32

Maintained by: M-CLIP

Property             Value
Author               M-CLIP
Downloads            12.3M+
Languages Supported  48
Framework            PyTorch, TensorFlow

What is XLM-Roberta-Large-Vit-B-32?

XLM-Roberta-Large-Vit-B-32 is a multilingual extension of OpenAI's CLIP model, designed to bridge vision and language across 48 languages. It pairs an XLM-RoBERTa-Large text encoder, trained to map text into CLIP's embedding space, with the ViT-B/32 vision transformer, enabling cross-lingual vision-language understanding.

Implementation Details

The model architecture consists of two main components: a multilingual text encoder based on XLM-RoBERTa-Large and a vision encoder using ViT-B/32. The model performs strongly across languages, reaching an R@10 score of 91.8% for English while maintaining 80-90% R@10 for other languages such as German, Spanish, French, and Chinese.

  • Multilingual text encoding supporting 48 languages including English, German, Chinese, Russian, and more
  • Compatible with both PyTorch and TensorFlow frameworks
  • Demonstrated strong cross-lingual retrieval capabilities
  • Easy integration with the multilingual-clip package, as sketched below
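
A minimal sketch of that integration path, following the usage pattern documented for the multilingual-clip package (the example captions and their translations are placeholders; verify the API against the package's current release):

```python
# pip install multilingual-clip torch transformers
from multilingual_clip import pt_multilingual_clip
import transformers

model_name = "M-CLIP/XLM-Roberta-Large-Vit-B-32"

# Load the multilingual text encoder and its tokenizer from the Hugging Face Hub.
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Example captions in three of the 48 supported languages (placeholder text).
texts = [
    "Three blind horses listening to Mozart.",  # English
    "Drei blinde Pferde hören Mozart zu.",      # German
    "三匹失明的马在听莫扎特。",                  # Chinese
]

# forward() tokenizes the captions and returns embeddings in CLIP's joint space.
embeddings = model.forward(texts, tokenizer)
print(embeddings.shape)  # expected: (3, 512), matching ViT-B/32 image features
```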

Core Capabilities

  • Cross-lingual image-text retrieval
  • Multilingual zero-shot classification (see the sketch after this list)
  • Text-to-image search across 48 languages
  • Competitive performance with English-only CLIP models
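
For zero-shot classification, the multilingual text embeddings can be scored against image features from the original CLIP ViT-B/32 vision encoder. The sketch below uses OpenAI's clip package for the image side; the image path `example.jpg` and the Spanish label prompts are placeholders, and the scoring follows the standard normalized-cosine-plus-softmax recipe rather than any API specific to this model:

```python
# pip install multilingual-clip torch transformers git+https://github.com/openai/CLIP.git
import clip
import torch
import transformers
from PIL import Image
from multilingual_clip import pt_multilingual_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "M-CLIP/XLM-Roberta-Large-Vit-B-32"

# Image side: the original CLIP ViT-B/32 vision encoder and its preprocessing.
image_model, preprocess = clip.load("ViT-B/32", device=device)

# Text side: the multilingual M-CLIP text encoder (runs on CPU here).
text_model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Candidate labels written in Spanish; any of the 48 supported languages works.
labels = ["una foto de un perro", "una foto de un gato", "una foto de un coche"]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = image_model.encode_image(image).float()
    text_features = text_model.forward(labels, tokenizer).float().to(device)

    # Normalize, take cosine similarities, and convert to label probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```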

Frequently Asked Questions

Q: What makes this model unique?

This model extends CLIP's capabilities to 48 languages while maintaining performance comparable to the original English model. It's particularly notable for achieving over 88% R@10 scores across most supported languages.

Q: What are the recommended use cases?

The model is ideal for multilingual image-text retrieval tasks, cross-lingual visual search systems, and zero-shot classification applications where content needs to be processed in multiple languages.
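
As an illustration of the retrieval direction, the sketch below ranks a gallery of precomputed image embeddings against a single query embedding; the random tensors are stand-ins for real ViT-B/32 image features and an M-CLIP text embedding produced as in the examples above:

```python
import torch
import torch.nn.functional as F

# Stand-in for 1,000 precomputed, L2-normalized ViT-B/32 image embeddings
# (in practice these come from image_model.encode_image, as shown earlier).
gallery = F.normalize(torch.randn(1000, 512), dim=-1)

# Stand-in for one query embedding, e.g. a French caption encoded with M-CLIP.
query = F.normalize(torch.randn(1, 512), dim=-1)

# Cosine similarity against every image in the gallery, then the top-5 matches.
scores = (query @ gallery.T).squeeze(0)
top_scores, top_indices = scores.topk(5)
print(top_indices.tolist(), top_scores.tolist())
```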
