AltCLIP
| Property | Value |
|---|---|
| Author | BAAI |
| License | CreativeML OpenRAIL-M |
| Research Paper | arXiv:2211.06679 |
| Primary Task | Text-to-Image Representation |
What is AltCLIP?
AltCLIP is a bilingual CLIP model that extends CLIP's language capabilities to both Chinese and English. Developed by BAAI, it is trained on the WuDao and LAION datasets and offers strong performance on cross-lingual text-image understanding tasks. The model uses a two-phase training approach, combining parallel knowledge distillation with bilingual contrastive learning.
Implementation Details
The training methodology consists of two distinct phases: parallel knowledge distillation over a large parallel text corpus, followed by bilingual contrastive learning on approximately 2 million Chinese-English image-text pairs. This approach yields strong performance in both languages while maintaining high accuracy in text-to-image and image-to-text retrieval tasks.
- Achieves state-of-the-art performance in bilingual image-text retrieval
- Supports both Chinese and English text inputs
- Implements efficient knowledge distillation techniques
- Integrates with Stable Diffusion architecture
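The bilingual contrastive phase optimizes CLIP-style similarity between matched image-text pairs. As a minimal illustration of how this scoring works at inference time (toy feature vectors, not the real model; `logit_scale` stands in for the model's learned temperature):

```python
import numpy as np

def retrieval_scores(image_feats, text_feats, logit_scale=100.0):
    """Score each image against each candidate text with CLIP-style
    cosine similarity, then softmax over the texts."""
    # L2-normalize so the dot product equals cosine similarity.
    image_feats = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    logits = logit_scale * image_feats @ text_feats.T
    # Softmax over the text axis gives per-image probabilities.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy features: one "image" pointing almost exactly at the first "text".
image = np.array([[1.0, 0.1]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = retrieval_scores(image, texts)  # highest probability on text 0
```

The same scoring applies regardless of whether the candidate texts are Chinese or English, which is what makes the shared bilingual embedding space useful.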
Core Capabilities
- Bilingual text-to-image retrieval with high accuracy (R@1: 66.3% for English, 63.7% for Chinese)
- Superior image-to-text retrieval performance (R@1: 85.9% for English, 84.7% for Chinese)
- Seamless integration with AltDiffusion for image generation
- Zero-shot image classification capabilities
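AltCLIP is integrated into the Hugging Face transformers library. A zero-shot classification sketch, assuming the `AltCLIPModel`/`AltCLIPProcessor` classes and the `BAAI/AltCLIP` checkpoint (weights are downloaded on first run):

```python
import requests
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor

model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

# A standard COCO validation image of two cats.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels can mix Chinese and English freely.
inputs = processor(
    text=["a photo of a cat", "一张狗的照片"],
    images=image,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
# logits_per_image holds the image-text similarity scores;
# softmax over the candidate texts turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
```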
Frequently Asked Questions
Q: What makes this model unique?
AltCLIP stands out for its ability to handle both Chinese and English with near-equal proficiency, achieved through its innovative two-phase training approach. It maintains CLIP's strong performance in English while adding robust Chinese language capabilities.
Q: What are the recommended use cases?
The model is ideal for bilingual applications requiring text-image understanding, including cross-lingual image retrieval, zero-shot image classification, and as a foundation for text-to-image generation systems like AltDiffusion.
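For cross-lingual retrieval, the key property is that Chinese and English text embeddings land in the same space, so a query in either language can be matched against the same image index. A sketch, again assuming the transformers `AltCLIPModel`/`AltCLIPProcessor` classes and the `BAAI/AltCLIP` checkpoint:

```python
import torch
from transformers import AltCLIPModel, AltCLIPProcessor

model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

# The same concept expressed in English and in Chinese.
texts = ["a photo of a cat", "一张猫的照片"]
inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    feats = model.get_text_features(**inputs)

# Normalize, then compare: parallel sentences should embed close together.
feats = feats / feats.norm(dim=-1, keepdim=True)
similarity = feats[0] @ feats[1]  # cosine similarity of the two sentences
```

In a retrieval system, precomputed (normalized) image features from `get_image_features` would be ranked against either query embedding the same way.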