BERT Base Multilingual Uncased
Property | Value |
---|---|
Parameter Count | 168M parameters |
License | Apache 2.0 |
Languages | 102 languages |
Paper | Original BERT Paper |
Training Data | Wikipedia (102 languages) |
What is bert-base-multilingual-uncased?
BERT-base-multilingual-uncased is a transformer-based language model trained on Wikipedia data from 102 languages. Its tokenization is uncased, meaning it does not differentiate between upper- and lower-case text (accent markers are also stripped during preprocessing). With 168M parameters, it was pretrained with masked language modeling (MLM) and next sentence prediction (NSP) objectives, providing a robust foundation for multilingual NLP tasks.
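As a quick illustration, the checkpoint can be loaded for masked-token prediction with the Hugging Face `transformers` library. The snippet below is a minimal sketch and assumes `transformers` with a PyTorch backend is installed; the example sentences are arbitrary.

```python
from transformers import pipeline

# Load a fill-mask pipeline with the multilingual uncased checkpoint.
unmasker = pipeline("fill-mask", model="bert-base-multilingual-uncased")

# The model predicts the token hidden behind [MASK]; because it is
# multilingual, prompts in any of its 102 training languages work.
print(unmasker("Hello, I'm a [MASK] model."))
print(unmasker("Paris est la capitale de la [MASK]."))
```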
Implementation Details
The model uses a shared WordPiece vocabulary of 110,000 tokens, with special preprocessing for languages such as Chinese, Japanese, and Korean. During pre-training, 15% of the tokens are selected for masking; of those, 80% are replaced by the [MASK] token, 10% by a random token, and 10% are left unchanged (the procedure is sketched in code after the list below).
- Bidirectional context understanding through transformer architecture
- Handles sequences up to 512 tokens
- Balanced training across languages through under/over-sampling
- Special CJK Unicode block handling for Asian languages
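The 80/10/10 masking split can be illustrated with a small, self-contained sketch. This is not the original training code, only an approximation of the procedure using the model's tokenizer; the helper `mask_tokens` is hypothetical.

```python
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

def mask_tokens(token_ids, mask_prob=0.15):
    """Approximate BERT-style masking: select ~15% of tokens, then
    replace 80% of those with [MASK], 10% with a random token,
    and leave 10% unchanged."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 is ignored by the MLM loss
    special = set(tokenizer.all_special_ids)
    for i, tok in enumerate(token_ids):
        if tok in special or random.random() >= mask_prob:
            continue
        labels[i] = tok  # the original token becomes the prediction target
        roll = random.random()
        if roll < 0.8:
            masked[i] = tokenizer.mask_token_id                 # 80%: [MASK]
        elif roll < 0.9:
            masked[i] = random.randrange(tokenizer.vocab_size)  # 10%: random token
        # else: 10% keep the original token
    return masked, labels

ids = tokenizer("BERT handles text in 102 languages.")["input_ids"]
masked_ids, labels = mask_tokens(ids)
print(tokenizer.convert_ids_to_tokens(masked_ids))
```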
Core Capabilities
- Masked language modeling for predicting masked words
- Next sentence prediction for understanding text coherence
- Feature extraction for downstream tasks (see the sketch after this list)
- Cross-lingual transfer learning
- Sequence classification and token classification
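For feature extraction, the base encoder can be loaded directly to obtain contextual embeddings. The sketch below assumes PyTorch and `transformers` are installed; the input sentence is arbitrary.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModel.from_pretrained("bert-base-multilingual-uncased")

# Sequences longer than 512 tokens must be truncated to the model's limit.
inputs = tokenizer(
    "Ceci est un exemple multilingue.",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds one 768-dimensional vector per input token.
print(outputs.last_hidden_state.shape)
```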
Frequently Asked Questions
Q: What makes this model unique?
This model's key strength is its multilingual coverage: it supports 102 languages while remaining relatively compact at 168M parameters. Because it is uncased, it is also robust to inconsistent capitalization in input text.
Q: What are the recommended use cases?
The model excels at tasks that require understanding a whole sentence, such as sequence classification, token classification, and question answering. It is not recommended for text generation, where autoregressive models like GPT-2 are more appropriate.
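As a starting point for such tasks, the pretrained encoder can be loaded with a task-specific head. The sketch below is a minimal, hypothetical setup for a 3-class sequence classification problem; `num_labels=3` and the example text are assumptions, not part of the checkpoint.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A freshly initialized classification head is added on top of the
# pretrained encoder; num_labels=3 is a hypothetical task setting.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-uncased", num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

# Before fine-tuning, the head's weights are random, so the logits are
# meaningless; fine-tune on labeled data (e.g. with the Trainer API or
# a standard PyTorch loop) before using the model for predictions.
inputs = tokenizer("Texte à classer.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 3])
```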