BERT Base Japanese (IPA Dictionary)
| Property | Value |
|---|---|
| License | CC-BY-SA-4.0 |
| Training Data | Japanese Wikipedia (17M sentences) |
| Downloads | 2,491,734 |
| Architecture | BERT Base (12 layers, 768-dim hidden states, 12 attention heads) |
What is bert-base-japanese?
bert-base-japanese is a BERT model pretrained on Japanese text, developed by the Tohoku NLP group. It uses a two-stage tokenization process: word-level segmentation with the MeCab morphological analyzer and the IPA dictionary (IPAdic), followed by WordPiece subword tokenization. This makes it well suited to Japanese language processing tasks.
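A minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as cl-tohoku/bert-base-japanese and that the fugashi and ipadic packages are installed to back the MeCab step:

```python
# pip install transformers fugashi ipadic
from transformers import BertJapaneseTokenizer

# BertJapaneseTokenizer first segments the sentence into words with MeCab
# (IPA dictionary), then splits each word into WordPiece subwords.
tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

tokens = tokenizer.tokenize("吾輩は猫である。")
print(tokens)  # subword tokens; out-of-vocabulary words appear as '##'-prefixed pieces
```

Because the word-level pass happens inside the tokenizer, raw Japanese text can be passed in directly without any pre-segmentation.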
Implementation Details
The model was trained on a Japanese Wikipedia dump from September 2019, comprising approximately 17M sentences (about 2.6GB of text). Input text is first segmented with the MeCab morphological analyzer using the IPA dictionary, then split into subwords with WordPiece over a 32,000-token vocabulary. The main training hyperparameters are listed below, with a configuration sketch after the list.
- Sequence length: 512 tokens per instance
- Batch size: 256 instances
- Training steps: 1M
- Tokenization: Two-stage process (MeCab + WordPiece)
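For orientation, the architecture and sequence length above can be expressed as a transformers BertConfig; this is an illustrative sketch built from the numbers in this card, not the checkpoint's published configuration file:

```python
from transformers import BertConfig

# Values taken from this card; treat as an illustration, not the exact config.
config = BertConfig(
    vocab_size=32000,             # WordPiece vocabulary size
    hidden_size=768,              # hidden state dimension
    num_hidden_layers=12,         # transformer layers
    num_attention_heads=12,       # attention heads per layer
    max_position_embeddings=512,  # maximum tokens per instance
)
print(config)
```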
Core Capabilities
- Fill-mask prediction for Japanese text (see the example after this list)
- Contextual word embeddings for Japanese
- Input sequences of up to 512 tokens
- Tokenization tailored to Japanese text, which has no whitespace word boundaries
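A fill-mask example with the transformers pipeline API, again assuming the cl-tohoku/bert-base-japanese Hub ID; the sample sentence is illustrative and the predictions will vary:

```python
from transformers import pipeline

# The pretrained masked-language-modeling head is used directly, so no fine-tuning is needed.
fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese")

# Mark the position to predict with the tokenizer's [MASK] token.
for prediction in fill_mask("東京は日本の[MASK]です。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```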
Frequently Asked Questions
Q: What makes this model unique?
This model's distinctive feature is its tokenization approach, which combines traditional Japanese word segmentation (MeCab with the IPA dictionary) with modern WordPiece subword tokenization. It is trained on a full Japanese Wikipedia dump, giving broad coverage of contemporary Japanese usage.
Q: What are the recommended use cases?
The model is well suited to a range of Japanese NLP tasks, including text classification, named entity recognition, question answering, and especially fill-mask prediction, which matches its pretraining objective. For the other tasks, the checkpoint needs task-specific fine-tuning with a classification or span-prediction head, as sketched below. It is particularly useful for applications that require an understanding of Japanese language structure and context.
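As a sketch of adapting the checkpoint to one of these downstream tasks, the snippet below attaches a randomly initialized sequence-classification head; the Hub ID, label count, and example sentence are assumptions for illustration, and the head must be fine-tuned on labeled data before its outputs are meaningful:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cl-tohoku/bert-base-japanese"  # assumed Hub ID for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels=2 is a placeholder for a hypothetical binary classification task;
# the new head is randomly initialized and must be trained (e.g. with Trainer).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("この映画は面白かった。", return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]): one score per label
```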