BERT Base Japanese (IPA Dictionary)
| Property | Value |
|---|---|
| License | CC-BY-SA-4.0 |
| Training Data | Japanese Wikipedia (17M sentences) |
| Downloads | 2,491,734 |
| Architecture | BERT Base (12 layers, 768-dim hidden states, 12 attention heads) |
What is bert-base-japanese?
bert-base-japanese is a BERT model pretrained on Japanese text, developed by the Tohoku NLP group. It uses a two-stage tokenization process: word-level segmentation with the MeCab morphological analyzer and the IPA dictionary (IPAdic), followed by WordPiece subword tokenization. This makes it well suited to Japanese language processing tasks.
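A minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as cl-tohoku/bert-base-japanese and that the fugashi and ipadic packages are installed to back the MeCab step:

```python
# pip install transformers fugashi ipadic
from transformers import BertJapaneseTokenizer

# BertJapaneseTokenizer first segments the sentence into words with MeCab
# (IPA dictionary), then splits each word into WordPiece subwords.
tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

tokens = tokenizer.tokenize("吾輩は猫である。")
print(tokens)  # subword tokens; out-of-vocabulary words appear as '##'-prefixed pieces
```

Because the word-level pass happens inside the tokenizer, raw Japanese text can be passed in directly without any pre-segmentation.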
Implementation Details
The model was trained on a Japanese Wikipedia dump from September 2019, comprising approximately 17M sentences (about 2.6GB of text). Input text is first segmented with the MeCab morphological analyzer using the IPA dictionary, then split into subwords with WordPiece over a 32,000-token vocabulary. The main training hyperparameters are listed below, with a configuration sketch after the list.
- Sequence length: 512 tokens per instance
- Batch size: 256 instances
- Training steps: 1M
- Tokenization: Two-stage process (MeCab + WordPiece)
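For orientation, the architecture and sequence length above can be expressed as a transformers BertConfig; this is an illustrative sketch built from the numbers in this card, not the checkpoint's published configuration file:

```python
from transformers import BertConfig

# Values taken from this card; treat as an illustration, not the exact config.
config = BertConfig(
    vocab_size=32000,             # WordPiece vocabulary size
    hidden_size=768,              # hidden state dimension
    num_hidden_layers=12,         # transformer layers
    num_attention_heads=12,       # attention heads per layer
    max_position_embeddings=512,  # maximum tokens per instance
)
print(config)
```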
Core Capabilities
- Fill-mask prediction for Japanese text (see the example after this list)
- Contextual word embeddings for Japanese
- Input sequences of up to 512 tokens
- Tokenization tailored to Japanese text, which has no whitespace word boundaries
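A fill-mask example with the transformers pipeline API, again assuming the cl-tohoku/bert-base-japanese Hub ID; the sample sentence is illustrative and the predictions will vary:

```python
from transformers import pipeline

# The pretrained masked-language-modeling head is used directly, so no fine-tuning is needed.
fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese")

# Mark the position to predict with the tokenizer's [MASK] token.
for prediction in fill_mask("東京は日本の[MASK]です。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```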
Frequently Asked Questions
Q: What makes this model unique?
This model's distinctive feature is its tokenization approach, which combines traditional Japanese word segmentation (MeCab with the IPA dictionary) with modern WordPiece subword tokenization. It is trained on a full Japanese Wikipedia dump, giving broad coverage of contemporary Japanese usage.
Q: What are the recommended use cases?
The model is well suited to a range of Japanese NLP tasks, including text classification, named entity recognition, question answering, and especially fill-mask prediction, which matches its pretraining objective. For the other tasks, the checkpoint needs task-specific fine-tuning with a classification or span-prediction head, as sketched below. It is particularly useful for applications that require an understanding of Japanese language structure and context.
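As a sketch of adapting the checkpoint to one of these downstream tasks, the snippet below attaches a randomly initialized sequence-classification head; the Hub ID, label count, and example sentence are assumptions for illustration, and the head must be fine-tuned on labeled data before its outputs are meaningful:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cl-tohoku/bert-base-japanese"  # assumed Hub ID for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels=2 is a placeholder for a hypothetical binary classification task;
# the new head is randomly initialized and must be trained (e.g. with Trainer).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("この映画は面白かった。", return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]): one score per label
```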