legalbert-large-1.7M-2

Maintained By
pile-of-law

LegalBERT Large 1.7M-2

  • Architecture: BERT Large (Uncased)
  • Training Data: Pile of Law (~256GB)
  • Vocabulary Size: 32,000 tokens
  • Training Steps: 1.7 million
  • Paper: Pile of Law Paper

What is legalbert-large-1.7M-2?

LegalBERT Large 1.7M-2 is a specialized language model trained on a massive corpus of legal and administrative text. It represents the second variant of a BERT-large architecture specifically optimized for legal domain tasks, trained using the RoBERTa pretraining objective. The model leverages a custom vocabulary that combines standard word-pieces with specialized legal terminology from Black's Law Dictionary.

Implementation Details

The model was trained on a SambaNova cluster using 8 RDUs for 1.7 million steps. It employs a carefully tuned learning rate of 5e-6 and a batch size of 128 to ensure stability across diverse legal sources. The training process utilized the masked language modeling (MLM) objective without Next Sentence Prediction (NSP) loss, following the RoBERTa approach.
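A rough back-of-the-envelope from those numbers, assuming every training example is a full 512-token sequence (the card states 512-length sequences throughout), gives the total token count seen during pretraining:

```python
# Back-of-the-envelope: total tokens processed during pretraining,
# assuming every example is a full 512-token sequence.
steps = 1_700_000   # training steps (from the card)
batch_size = 128    # sequences per step (from the card)
seq_len = 512       # tokens per sequence (from the card)

total_tokens = steps * batch_size * seq_len
print(f"{total_tokens:,} tokens (~{total_tokens / 1e9:.0f}B)")
# → 111,411,200,000 tokens (~111B)
```

That is on the order of 111 billion tokens, an upper bound since shorter documents would reduce the effective count.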

  • Custom vocabulary of 32,000 tokens including 3,000 legal terms
  • Trained on 512-length sequences throughout
  • Uses LexNLP sentence segmentation for legal citation handling
  • Implements 80-10-10 masking strategy with 20x replication rate
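The 80-10-10 masking strategy above is the standard BERT scheme: 15% of positions are selected for prediction, and of those, 80% are replaced with [MASK], 10% with a random vocabulary token, and 10% left unchanged. A minimal sketch of the idea (not the actual training code, and `mask_tokens` is a hypothetical helper name):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style 80-10-10 masking: of the positions selected for
    prediction (mask_prob of all tokens), 80% become [MASK], 10% are
    swapped for a random vocabulary token, and 10% stay unchanged."""
    rng = rng or random.Random(0)
    out = list(tokens)
    labels = [None] * len(tokens)  # None = not predicted at this position
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok              # model must recover the original
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"        # 80%: mask token
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: random replacement
            # else 10%: keep the original token unchanged
    return out, labels
```

Under this reading, the 20x replication rate means each sequence appears in the training data 20 times, each copy drawing a fresh random mask, so the model sees different prediction targets over the same text.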

Core Capabilities

  • Masked language modeling for legal text
  • Legal document understanding and analysis
  • Foundation for downstream legal NLP tasks
  • Specialized handling of legal terminology and citations

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its specialized training on the Pile of Law dataset and its custom vocabulary that includes specific legal terminology. It's particularly well-suited for legal domain tasks due to its exposure to diverse legal texts during training.

Q: What are the recommended use cases?

The model is best suited for legal text analysis and document understanding, and it can be fine-tuned for specific legal NLP tasks. It's particularly effective for tasks involving legal terminology and concepts, such as contract analysis, case law understanding, and legal document processing.