GatorTron-Base

Maintained by: UFNLP

  • Parameter Count: 345 million
  • License: Apache 2.0
  • Architecture: BERT (Megatron implementation)
  • Training Data: 91.1B words total (clinical + research)

What is gatortron-base?

GatorTron-base is a specialized clinical language model developed through a collaboration between the University of Florida and NVIDIA. This 345M-parameter model is built on the BERT architecture and trained on one of the largest clinical text corpora assembled to date (91.1B words, including 82B words of de-identified clinical notes), making it particularly valuable for healthcare NLP applications.

Implementation Details

The model leverages NVIDIA's Megatron package and was trained on a diverse dataset comprising 82B words of de-identified clinical notes, 6.1B words from PubMed CC0, 2.5B words from WikiText, and 0.5B words from MIMIC-III. All clinical data was carefully de-identified following HIPAA guidelines.

  • Pre-trained using Megatron-BERT architecture
  • Implements comprehensive PHI de-identification
  • Supports standard Hugging Face integration (see the loading sketch below)
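
As a sketch of that integration, the snippet below loads the model and tokenizer from the Hugging Face Hub and encodes a short synthetic clinical sentence. It assumes the hub repository ID UFNLP/gatortron-base and a recent transformers release with Megatron-BERT support.

```python
# Minimal sketch: loading GatorTron-base via Hugging Face transformers.
# Assumes the hub repository ID "UFNLP/gatortron-base" and a transformers
# version that includes Megatron-BERT support.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortron-base")
model = AutoModel.from_pretrained("UFNLP/gatortron-base")

# Encode a short synthetic clinical sentence and get contextual embeddings.
text = "Patient denies chest pain but reports shortness of breath."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# last_hidden_state has shape (batch, tokens, hidden_size): one contextual
# embedding per word piece, usable as features for downstream clinical tasks.
print(outputs.last_hidden_state.shape)
```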

Core Capabilities

  • Clinical concept extraction (NER; see the fine-tuning sketch after this list)
  • Relation extraction from medical texts
  • Social determinants of health (SDoH) extraction
  • General clinical text understanding
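
GatorTron-base is an encoder, so these capabilities are typically realized by fine-tuning a task head on labeled clinical data rather than used zero-shot. The sketch below wires the pretrained encoder into a token-classification head for clinical concept extraction; the BIO label set and the use of AutoModelForTokenClassification are illustrative assumptions, not a published recipe.

```python
# Illustrative sketch: preparing GatorTron-base for clinical NER fine-tuning.
# The BIO label set below is a hypothetical example, not part of the model.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PROBLEM", "I-PROBLEM", "B-TREATMENT", "I-TREATMENT"]
tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortron-base")
model = AutoModelForTokenClassification.from_pretrained(
    "UFNLP/gatortron-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialized; from here, train with the
# standard Hugging Face Trainer on a token-labeled clinical corpus
# (BIO tags aligned to word pieces).
```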

Frequently Asked Questions

Q: What makes this model unique?

GatorTron-base stands out due to its extensive training on real clinical data (82B+ words) and its specialized focus on healthcare applications, making it particularly effective for clinical NLP tasks.

Q: What are the recommended use cases?

The model is ideal for clinical text analysis, including named entity recognition, relation extraction, and understanding social determinants of health from medical narratives. It's particularly suited for healthcare organizations and researchers working with clinical documents.
