DistilBERT Base Uncased (distilbert-base-uncased)

Maintained by: distilbert

  • Parameter Count: 67M parameters
  • License: Apache 2.0
  • Paper: arXiv:1910.01108
  • Training Data: BookCorpus and Wikipedia

What is distilbert-base-uncased?

DistilBERT is a compact and efficient transformer model that serves as a lighter alternative to BERT. Created through knowledge distillation, it retains 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster. This uncased version treats "English" and "english" identically, making it more flexible for general text processing.

Implementation Details

The model was trained using a triple loss combining language modeling, distillation, and cosine-embedding objectives. It processes text using WordPiece tokenization with a 30,000 token vocabulary and follows BERT's masked language modeling approach where 15% of tokens are masked during training.

  • Architecture: Transformer-based with knowledge distillation
  • Training Infrastructure: 8 × 16 GB V100 GPUs for 90 hours
  • Input Format: [CLS] Sentence A [SEP] Sentence B [SEP]
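
As a quick illustration of this input format and the masked language modeling objective, here is a minimal sketch using the Hugging Face transformers library (assumed to be installed alongside PyTorch); the example sentences are placeholders.

```python
from transformers import AutoTokenizer, pipeline

# Tokenization follows the [CLS] Sentence A [SEP] Sentence B [SEP] format above.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'sentence', 'a', '[SEP]', 'sentence', 'b', '[SEP]']

# Masked language modeling: the model ranks likely fillers for [MASK].
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
for prediction in unmasker("Hello, I'm a [MASK] model."):
    print(prediction["token_str"], round(prediction["score"], 3))
```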

Core Capabilities

  • Masked Language Modeling
  • Feature Extraction for Downstream Tasks (see the sketch after this list)
  • Sentence Classification
  • Token Classification
  • Question Answering
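
To show the feature-extraction capability concretely, a minimal sketch with transformers and PyTorch might look like the following; the input text is an arbitrary placeholder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("DistilBERT produces one vector per token.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, 768);
# these per-token vectors can feed a downstream classifier or pooling layer.
print(outputs.last_hidden_state.shape)
```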

Frequently Asked Questions

Q: What makes this model unique?

DistilBERT stands out for retaining most of BERT's language-understanding performance while substantially reducing model size and inference cost. It achieves this through knowledge distillation from BERT combined with the multi-objective training described above (language modeling, distillation, and cosine-embedding losses).

Q: What are the recommended use cases?

The model excels at tasks that require understanding of complete sentences, such as text classification, named entity recognition, and question answering. However, it is not recommended for text generation, where autoregressive models like GPT-2 are more appropriate.
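
For instance, fine-tuning for sentence classification typically starts from a sketch like the one below (again assuming transformers and PyTorch are installed). Note that the classification head is freshly initialized, so the logits are meaningless until the model is fine-tuned on labeled data; the label count and example text are assumptions for illustration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # e.g. positive / negative
)

inputs = tokenizer("A concise, well-acted film.", return_tensors="pt")
logits = model(**inputs).logits  # untrained head: fine-tune before relying on these
```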
