LegalBERT Large 1.7M-2

Property	Value
Architecture	BERT Large (Uncased)
Training Data	Pile of Law (~256GB)
Vocabulary Size	32,000 tokens
Training Steps	1.7 million
Paper	Pile of Law Paper

What is legalbert-large-1.7M-2?

LegalBERT Large 1.7M-2 is a BERT-large (uncased) language model pretrained on the Pile of Law, a corpus of roughly 256GB of legal and administrative text. It is the second variant of a BERT-large architecture optimized for legal-domain tasks, and it was pretrained with the RoBERTa pretraining objective. Its custom vocabulary combines standard word-pieces with legal terms drawn from Black's Law Dictionary.

Implementation Details

The model was trained on a SambaNova cluster using 8 RDUs for 1.7 million steps. It uses a small learning rate of 5e-6 and a batch size of 128 to keep training stable across the diverse legal sources in the corpus. Pretraining used the masked language modeling (MLM) objective without the Next Sentence Prediction (NSP) loss, following the RoBERTa approach.

Custom vocabulary of 32,000 tokens including 3,000 legal terms
Trained on 512-length sequences throughout
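Preprocesses the corpus with the LexNLP sentence segmenter, which handles legal citations that are easily mistaken for sentence boundaries
Uses the BERT 80-10-10 masking split (80% of selected tokens become [MASK], 10% become a random token, 10% stay unchanged) with a replication rate of 20, so each context is masked in 20 different ways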
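
The 80-10-10 masking split and the replication rate listed above can be illustrated with a short sketch. This is not the training code used for the model: the 15% selection probability is the value from the original BERT recipe and is assumed here, the whitespace tokenizer and toy vocabulary are placeholders, and the helper names `mask_tokens` and `replicate` are hypothetical. The reference BERT pipeline also caps the number of predictions per sequence, which this simplified version omits.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels) using the 80-10-10 split.

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else. mask_prob=0.15 is an assumption
    taken from the original BERT recipe, not from this model card.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return masked, labels


def replicate(tokens, vocab, replication_rate=20, seed=0):
    """Create replication_rate independently masked copies of one context."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, vocab, rng) for _ in range(replication_rate)]


# Toy stand-ins for a word-piece tokenized context and its vocabulary.
tokens = "the court granted the motion to dismiss".split()
vocab = ["court", "motion", "appeal", "statute", "the", "to"]
copies = replicate(tokens, vocab)
print(len(copies))  # prints 20
```

Each of the 20 copies draws its own mask positions, so the model sees the same context with different prediction targets over the course of training.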

Core Capabilities

Masked language modeling for legal text
Legal document understanding and analysis
Foundation for downstream legal NLP tasks
Specialized handling of legal terminology and citations
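
As an illustration of the masked language modeling capability, the sketch below queries the model through the Hugging Face `transformers` fill-mask pipeline. It assumes the checkpoint is published on the Hugging Face Hub as `pile-of-law/legalbert-large-1.7M-2`; that identifier is not stated on this page and may need adjusting. The helpers `check_prompt` and `predict_masked` are hypothetical names introduced for this example.

```python
MODEL_ID = "pile-of-law/legalbert-large-1.7M-2"  # assumed Hub identifier
MASK_TOKEN = "[MASK]"


def check_prompt(text):
    """Ensure the prompt contains exactly one mask token and return it."""
    count = text.count(MASK_TOKEN)
    if count != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN}, found {count}")
    return text


def predict_masked(text, top_k=5):
    """Return the top_k (token, score) candidates for the masked position."""
    # Imported lazily so check_prompt works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline(task="fill-mask", model=MODEL_ID)
    results = fill_mask(check_prompt(text), top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

Calling `predict_masked("The defendant filed a [MASK] to dismiss.")` downloads the weights on first use and returns candidate tokens with their scores. No expected output is shown here because the predictions depend on the checkpoint.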

Frequently Asked Questions

Q: What makes this model unique?

The model is distinguished by its pretraining on the Pile of Law dataset and by a custom vocabulary that includes legal terminology. That exposure to diverse legal text during pretraining makes it well suited to legal-domain tasks.

Q: What are the recommended use cases?

The model is best suited for legal text analysis and document understanding, and it can be fine-tuned for specific legal NLP tasks. It is particularly effective for tasks involving legal terminology and concepts, such as contract analysis, case law understanding, and legal document processing.
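
Contracts and court opinions are often longer than the 512-token sequences the model was trained on. This page does not say how to handle such inputs; the sketch below shows one common workaround, overlapping sliding windows, under stated assumptions: the function name `sliding_windows`, the 128-token overlap, and the two reserved positions for [CLS] and [SEP] are illustrative choices, not part of the model's documented usage.

```python
def sliding_windows(token_ids, max_len=512, stride=128, num_special=2):
    """Split token_ids into overlapping windows that fit within max_len.

    num_special reserves room for the [CLS] and [SEP] tokens that
    BERT-style models add around each sequence. stride is the number of
    tokens shared between consecutive windows.
    """
    body = max_len - num_special
    if body <= 0 or not 0 <= stride < body:
        raise ValueError("need 0 <= stride < max_len - num_special")
    step = body - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + body])
        if start + body >= len(token_ids):
            break  # the last window reached the end of the document
        start += step
    return windows


ids = list(range(1200))  # stand-in for the token ids of a long contract
chunks = sliding_windows(ids)
print(len(chunks), [len(c) for c in chunks])
```

Each window can then be passed to the model separately and the per-window outputs combined, for example by averaging or by taking the highest-scoring window, depending on the downstream task.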