AltCLIP

Maintained by: BAAI

Property        Value
Author          BAAI
License         CreativeML OpenRAIL-M
Research Paper  arXiv:2211.06679
Primary Task    Text-to-Image Representation

What is AltCLIP?

AltCLIP is a bilingual CLIP model that accepts both Chinese and English text. Developed by BAAI, it is trained on the WuDao dataset and LAION and targets cross-lingual text-image understanding tasks. Training follows a two-phase approach that combines parallel knowledge distillation with bilingual contrastive learning.

Implementation Details

Training consists of two distinct phases. The first is parallel knowledge distillation on a large parallel text corpus. The second is bilingual contrastive learning on approximately 2 million Chinese and English image-text pairs. This approach is intended to give strong results in both languages while maintaining accuracy on text-to-image and image-to-text retrieval tasks.
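The two training objectives can be illustrated with a small sketch. This is not BAAI's training code: the function names are hypothetical, the distillation objective is assumed to be a mean squared error between teacher and student text embeddings, and the contrastive objective is assumed to be the symmetric cross-entropy loss commonly used for CLIP-style models. The temperature of 0.07 is likewise an assumption.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def distillation_loss(teacher_emb, student_emb):
    """Phase 1 (assumed form): MSE between teacher and student embeddings
    of the same batch of parallel texts."""
    total, count = 0.0, 0
    for t_vec, s_vec in zip(teacher_emb, student_emb):
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    return total / count

def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Phase 2 (assumed form): symmetric contrastive loss over a batch of
    matching image-text pairs (pair i is image_emb[i], text_emb[i])."""
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diagonal(logits)
                  + _cross_entropy_diagonal(logits_t))
```

Under these assumptions, the contrastive loss is close to zero when each image embedding points in the same direction as its paired text embedding, and grows when the pairs are shuffled.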

  • Achieves state-of-the-art performance in bilingual image-text retrieval
  • Supports both Chinese and English text inputs
  • Implements efficient knowledge distillation techniques
  • Integrates with Stable Diffusion architecture

Core Capabilities

  • Bilingual text-to-image retrieval with high accuracy (R@1: 66.3% for English, 63.7% for Chinese)
  • Superior image-to-text retrieval performance (R@1: 85.9% for English, 84.7% for Chinese)
  • Seamless integration with AltDiffusion for image generation
  • Zero-shot image classification capabilities
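For readers unfamiliar with the R@1 figures above, the following sketch shows how Recall@K could be computed from a query-by-candidate similarity matrix. It is a simplified illustration, not the evaluation script behind the reported numbers: the function name is hypothetical, and it assumes exactly one correct candidate per query, stored at the same index as the query.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate is among the top-k scores.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring candidates for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)

# Toy text-to-image scores: rows are text queries, columns are images.
scores = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own image first (hit)
    [0.2, 0.8, 0.1],  # query 1 ranks its own image first (hit)
    [0.7, 0.2, 0.1],  # query 2 ranks image 0 first (miss at k=1)
]
r_at_1 = recall_at_k(scores, k=1)
```

Text-to-image retrieval uses text queries against image candidates, and image-to-text retrieval uses the transposed matrix.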

Frequently Asked Questions

Q: What makes this model unique?

AltCLIP stands out for its ability to handle both Chinese and English with near-equal proficiency, achieved through its innovative two-phase training approach. It maintains CLIP's strong performance in English while adding robust Chinese language capabilities.

Q: What are the recommended use cases?

The model is ideal for bilingual applications requiring text-image understanding, including cross-lingual image retrieval, zero-shot image classification, and as a foundation for text-to-image generation systems like AltDiffusion.
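As a rough illustration of zero-shot image classification with bilingual labels, the sketch below scores one image embedding against a set of label-prompt embeddings and converts the scores to probabilities. The embeddings are toy vectors standing in for the outputs of the model's image and text encoders, the function names are hypothetical, and the logit scale of 100 is an assumption borrowed from common CLIP practice.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs, logit_scale=100.0):
    """Return a probability per label via softmax over scaled similarities.

    label_embs maps a label prompt (in any supported language) to its
    text embedding. logit_scale is an assumed value, not taken from the source.
    """
    labels = list(label_embs)
    logits = [logit_scale * cosine(image_emb, label_embs[l]) for l in labels]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Toy embeddings: one English prompt and one Chinese prompt.
image_embedding = [0.9, 0.1, 0.0]
prompts = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "一张狗的照片": [0.0, 1.0, 0.0],
}
probs = zero_shot_classify(image_embedding, prompts)
best_label = max(probs, key=probs.get)
```

Because both languages share one embedding space, English and Chinese prompts can be mixed in the same label set.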
