camembert-base

Maintained By
almanach

CamemBERT Base Model

Property          Value
Parameter Count   110M
Training Data     OSCAR (138GB)
License           MIT
Paper             arXiv:1911.03894

What is camembert-base?

CamemBERT is a French language model based on the RoBERTa architecture, designed to advance natural language processing for French. This base version, trained on 138GB of French text from the OSCAR dataset, achieved state-of-the-art results on several French NLP benchmarks at the time of its release.

Implementation Details

The model is implemented in PyTorch on top of the transformers library, with weights stored as F32 tensors and a masked language modeling head used for pretraining. The base architecture contains 110M parameters, offering a balanced trade-off between computational cost and model quality.

  • Built on RoBERTa architecture optimized for French language
  • Trained on large-scale OSCAR dataset
  • Implements efficient tokenization using SentencePiece
  • Supports multiple downstream tasks through fine-tuning
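The SentencePiece tokenization mentioned above can be inspected directly. A minimal sketch using the HuggingFace transformers library, assuming the `camembert-base` checkpoint name published on the Hub:

```python
from transformers import AutoTokenizer

# Load the SentencePiece-based tokenizer shipped with camembert-base.
tokenizer = AutoTokenizer.from_pretrained("camembert-base")

# SentencePiece marks word boundaries with the "▁" character,
# so subword pieces can be unambiguously rejoined into words.
tokens = tokenizer.tokenize("J'aime le camembert !")
print(tokens)
```

Because the vocabulary was learned on French text, common French words tend to stay whole rather than being split into many subword pieces.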

Core Capabilities

  • Masked language modeling for French text
  • Contextual word embeddings generation
  • Support for multiple model variants (base, large, CCNet versions)
  • Integration with HuggingFace's transformers library
  • Pipeline support for common NLP tasks
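As an illustration of the pipeline support for masked language modeling, here is a minimal sketch; note that CamemBERT uses `<mask>` as its mask token, following RoBERTa:

```python
from transformers import pipeline

# The fill-mask pipeline wraps tokenization, the model forward
# pass, and decoding of the top-scoring candidate tokens.
fill_mask = pipeline("fill-mask", model="camembert-base")

# Ask the model to fill in the masked word in a French sentence.
for prediction in fill_mask("Le camembert est <mask> !"):
    print(f'{prediction["token_str"]!r}: {prediction["score"]:.3f}')
```

Each prediction is a dict containing the candidate token, its score, and the completed sentence; by default the pipeline returns the top 5 candidates.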

Frequently Asked Questions

Q: What makes this model unique?

CamemBERT is specifically optimized for French language processing, trained on a massive French text corpus, making it one of the most comprehensive French language models available. Its architecture is based on RoBERTa, incorporating modern advances in transformer-based models.

Q: What are the recommended use cases?

The model excels in various French NLP tasks including masked language modeling, text classification, named entity recognition, and generation of contextual embeddings. It's particularly useful for applications requiring deep understanding of French language semantics and context.
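For the contextual-embedding use case mentioned above, a minimal sketch that extracts the encoder's final hidden states (768 is the hidden size of the base architecture):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base")

# Encode a French sentence and run it through the encoder.
inputs = tokenizer("J'aime le camembert.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per input token.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (1, sequence_length, 768)
```

These token-level vectors can be pooled (e.g. mean-pooled) into a sentence representation, or fed to a task-specific head during fine-tuning.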
