GatorTron-Base

Maintained by: UFNLP

  • Parameter Count: 345 million
  • License: Apache 2.0
  • Architecture: BERT (Megatron implementation)
  • Training Data: 91.1B words total (clinical + research)

What is gatortron-base?

GatorTron-base is a specialized clinical language model developed through a collaboration between the University of Florida and NVIDIA. This 345M-parameter model is built on the BERT architecture and trained on one of the largest clinical text corpora assembled to date (91.1B words, including 82B words of de-identified clinical notes), making it particularly valuable for healthcare NLP applications.

Implementation Details

The model leverages NVIDIA's Megatron package and was trained on a diverse dataset comprising 82B words of de-identified clinical notes, 6.1B words from PubMed CC0, 2.5B words from WikiText, and 0.5B words from MIMIC-III. All clinical data was carefully de-identified following HIPAA guidelines.

  • Pre-trained using Megatron-BERT architecture
  • Implements comprehensive PHI de-identification
  • Supports standard Hugging Face integration (see the loading sketch below)
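
As a sketch of that integration, the snippet below loads the model and tokenizer from the Hugging Face Hub and encodes a short synthetic clinical sentence. It assumes the hub repository ID UFNLP/gatortron-base and a recent transformers release with Megatron-BERT support.

```python
# Minimal sketch: loading GatorTron-base via Hugging Face transformers.
# Assumes the hub repository ID "UFNLP/gatortron-base" and a transformers
# version that includes Megatron-BERT support.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortron-base")
model = AutoModel.from_pretrained("UFNLP/gatortron-base")

# Encode a short synthetic clinical sentence and get contextual embeddings.
text = "Patient denies chest pain but reports shortness of breath."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# last_hidden_state has shape (batch, tokens, hidden_size): one contextual
# embedding per word piece, usable as features for downstream clinical tasks.
print(outputs.last_hidden_state.shape)
```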

Core Capabilities

  • Clinical concept extraction (NER; see the fine-tuning sketch after this list)
  • Relation extraction from medical texts
  • Social determinants of health (SDoH) extraction
  • General clinical text understanding
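
GatorTron-base is an encoder, so these capabilities are typically realized by fine-tuning a task head on labeled clinical data rather than used zero-shot. The sketch below wires the pretrained encoder into a token-classification head for clinical concept extraction; the BIO label set and the use of AutoModelForTokenClassification are illustrative assumptions, not a published recipe.

```python
# Illustrative sketch: preparing GatorTron-base for clinical NER fine-tuning.
# The BIO label set below is a hypothetical example, not part of the model.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PROBLEM", "I-PROBLEM", "B-TREATMENT", "I-TREATMENT"]
tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortron-base")
model = AutoModelForTokenClassification.from_pretrained(
    "UFNLP/gatortron-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialized; from here, train with the
# standard Hugging Face Trainer on a token-labeled clinical corpus
# (BIO tags aligned to word pieces).
```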

Frequently Asked Questions

Q: What makes this model unique?

GatorTron-base stands out due to its extensive training on real clinical data (82B+ words) and its specialized focus on healthcare applications, making it particularly effective for clinical NLP tasks.

Q: What are the recommended use cases?

The model is ideal for clinical text analysis, including named entity recognition, relation extraction, and understanding social determinants of health from medical narratives. It's particularly suited for healthcare organizations and researchers working with clinical documents.
