siglip-so400m-patch14-384

Maintained By
google

SigLIP-SO400M-Patch14-384

  • Parameter Count: 878M
  • License: Apache 2.0
  • Training Data: WebLI Dataset
  • Resolution: 384x384
  • Paper: Sigmoid Loss for Language Image Pre-Training

What is siglip-so400m-patch14-384?

SigLIP-SO400M-Patch14-384 is a vision-language model that improves on the CLIP architecture by replacing the softmax contrastive loss with a pairwise sigmoid loss. Developed by Google, this shape-optimized model contains 878M parameters and is designed for zero-shot image classification tasks. It processes images at 384x384 resolution with a patch size of 14 pixels.
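A quick way to try zero-shot classification is the Hugging Face transformers pipeline. The sketch below assumes transformers, torch, Pillow, and requests are installed; the image URL and candidate labels are arbitrary examples, not part of the model card.

```python
from transformers import pipeline
from PIL import Image
import requests

# Example image (two cats on a couch, commonly used in Hugging Face docs).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Zero-shot classification: the model scores the image against free-form text labels.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip-so400m-patch14-384",
)
outputs = classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
print(outputs)  # list of {"score": ..., "label": ...} dicts, highest score first
```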

Implementation Details

The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days. Preprocessing resizes images to 384x384 and normalizes them with a mean and standard deviation of 0.5 across all RGB channels, while text is tokenized and padded to a fixed length of 64 tokens (a code sketch follows the list below).

  • Image preprocessing: 384x384 resolution with RGB normalization
  • Text processing: 64-token maximum length
  • Training infrastructure: 16 TPU-v4 chips
  • Architecture: SoViT-400m shape-optimized design
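As a rough illustration of these processing defaults, the sketch below loads the model and processor from the Hugging Face Hub and scores one image against two captions. The image URL and captions are placeholders; `padding="max_length"` pads the text to the 64-token maximum described above.

```python
import torch
import requests
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of 2 cats", "a photo of 2 dogs"]

# The processor resizes to 384x384, normalizes with mean/std 0.5,
# and pads text to the 64-token maximum length.
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image   # image-text similarity scores
probs = torch.sigmoid(logits_per_image)       # sigmoid, not softmax, in SigLIP
print(f"{probs[0][0]:.1%} probability that the image matches the first caption")
```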

Core Capabilities

  • Zero-shot image classification
  • Image-text retrieval (see the retrieval sketch after this list)
  • Efficient batch processing with sigmoid loss function
  • Improved performance over traditional CLIP models
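The retrieval capability can be sketched by embedding a small image gallery and a text query separately, then ranking gallery images by cosine similarity. The file names below are hypothetical placeholders; the embedding methods are those exposed by the transformers SiglipModel class.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# Hypothetical local files: a small gallery of images to search over.
gallery = [Image.open(p) for p in ["cat.jpg", "dog.jpg", "car.jpg"]]
query = "a photo of a dog"

with torch.no_grad():
    image_inputs = processor(images=gallery, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)

    text_inputs = processor(text=[query], padding="max_length", return_tensors="pt")
    text_embeds = model.get_text_features(**text_inputs)

# Cosine similarity between the query and each gallery image; highest score wins.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: image {best} with cosine similarity {scores[best]:.3f}")
```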

Frequently Asked Questions

Q: What makes this model unique?

The model's key innovation lies in its sigmoid loss function, which eliminates the need for global similarity normalization, enabling better scaling and improved performance at various batch sizes. The shape-optimized architecture (SoViT-400m) provides optimal compute efficiency.
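For intuition, a toy re-implementation of the pairwise sigmoid objective (not the model's actual training code) might look like the following. Each image-text pair is scored independently as a binary match/non-match, so no softmax normalization over the batch is needed; the temperature and bias values follow the initialization suggested in the SigLIP paper.

```python
import torch
import torch.nn.functional as F

def siglip_pairwise_loss(image_embeds, text_embeds, temperature=10.0, bias=-10.0):
    """Illustrative sketch of the pairwise sigmoid loss from the SigLIP paper."""
    # Cosine-similarity logits for every image-text combination in the batch.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T * temperature + bias

    # +1 on the diagonal (matching pairs), -1 everywhere else.
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1

    # Binary log-sigmoid loss summed over all pairs, averaged per image.
    return -F.logsigmoid(labels * logits).sum() / n

# Toy usage with random embeddings (batch of 4, embedding dimension 8).
img = torch.randn(4, 8)
txt = torch.randn(4, 8)
print(siglip_pairwise_loss(img, txt))
```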

Q: What are the recommended use cases?

The model excels at zero-shot image classification and image-text retrieval tasks. It's particularly suitable for applications requiring efficient processing of image-text pairs without extensive task-specific training.
