CamemBERT Base Model
| Property | Value |
|---|---|
| Parameter Count | 110M |
| Training Data | OSCAR (138GB) |
| License | MIT |
| Paper | arXiv:1911.03894 |
What is camembert-base?
CamemBERT is a French language model based on the RoBERTa architecture, designed to advance natural language processing for French. This base version, trained on 138GB of text from the OSCAR dataset, was state-of-the-art for French language modeling at its release.
Implementation Details
The model follows the transformer architecture and is implemented in PyTorch, with weights stored as F32 tensors and a dedicated head for masked language modeling. The base variant contains 110M parameters, offering a balanced trade-off between computational efficiency and model performance.
- Built on RoBERTa architecture optimized for French language
- Trained on large-scale OSCAR dataset
- Implements efficient tokenization using SentencePiece
- Supports multiple downstream tasks through fine-tuning
Core Capabilities
- Masked language modeling for French text
- Contextual word embeddings generation
- Support for multiple model variants (base, large, CCNet versions)
- Integration with HuggingFace's transformers library
- Pipeline support for common NLP tasks
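The pipeline support listed above makes masked language modeling a one-liner. A minimal sketch, assuming `transformers` (with a PyTorch backend) is installed; the example sentence is illustrative:

```python
from transformers import pipeline

# The fill-mask pipeline wraps camembert-base for masked language modeling.
fill_mask = pipeline("fill-mask", model="camembert-base")

# Ask the model for the most likely completions of the masked token.
results = fill_mask("Le camembert est <mask> :)")
for r in results:
    print(f"{r['token_str']!r}: {r['score']:.3f}")
```

Each result carries the predicted token string and its probability, sorted from most to least likely, which is convenient for quick qualitative checks of the model.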
Frequently Asked Questions
Q: What makes this model unique?
CamemBERT is specifically optimized for French language processing, trained on a massive French text corpus, making it one of the most comprehensive French language models available. Its architecture is based on RoBERTa, incorporating modern advances in transformer-based models.
Q: What are the recommended use cases?
The model excels in various French NLP tasks including masked language modeling, text classification, named entity recognition, and generation of contextual embeddings. It's particularly useful for applications requiring deep understanding of French language semantics and context.