Distil-Large-v3.5

Maintained by: distil-whisper

  • Parameter Count: 756M parameters
  • License: MIT
  • Paper: arXiv:2311.00430
  • Training Data: 98,000 hours of audio
  • Speed Improvement: 1.46x faster than the base model

What is distil-large-v3.5?

Distil-Large-v3.5 is a knowledge-distilled version of OpenAI's Whisper-Large-v3, designed for efficient speech recognition. It is the latest model in the Distil-Whisper family, trained on 98,000 hours of diverse audio, and runs roughly 1.46x faster than the original model while keeping accuracy competitive.

Implementation Details

The model employs an encoder-decoder architecture optimized for speech recognition tasks. Because the decoder accounts for over 90% of Whisper's inference time, distillation focuses on shrinking it; the streamlined decoder is combined with patient teacher training, aggressive data augmentation, and an extended 80-epoch training schedule.

  • Encoder-decoder architecture with focus on decoder optimization
  • Trained using 64 H100 GPUs on the Jean Zay cluster
  • Supports Flash Attention 2 and Torch SDPA for faster inference (see the usage sketch after this list)
  • Supports both sequential and chunked long-form transcription
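As a rough sketch of how these pieces fit together with 🤗 Transformers (the Hub id distil-whisper/distil-large-v3.5, the audio filename, and the chunking parameters are assumptions following the usual Distil-Whisper usage pattern, not values taken from this page):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3.5"  # assumed Hub id

# attn_implementation can be "sdpa" (Torch SDPA) or "flash_attention_2"
# when the flash-attn package is installed on a supported GPU.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
).to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    # Chunked long-form transcription: split audio into overlapping
    # windows and transcribe them in batches. Omit these two arguments
    # for short-form or sequential long-form decoding.
    chunk_length_s=25,
    batch_size=16,
)

print(pipe("audio.mp3")["text"])
```

Chunked processing trades a small amount of accuracy for throughput, since windows are transcribed in parallel rather than conditioned on one another.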

Core Capabilities

  • Short-form transcription with 7.08% WER on out-of-distribution data
  • Long-form transcription with 11.39% WER on out-of-distribution data
  • Can act as an assistant model for speculative decoding, roughly doubling Whisper-Large-v3 inference speed (see the sketch after this list)
  • Compatible with multiple frameworks including Whisper.cpp, Faster-Whisper, and Candle
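A hedged sketch of the speculative-decoding setup mentioned above, where the distilled model drafts tokens and Whisper-Large-v3 verifies them via the standard Transformers assistant-model mechanism (the Hub ids and audio filename are assumptions, not taken from this page):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Draft model: the distilled checkpoint (assumed Hub id).
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v3.5",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)

# Target model: the full Whisper-Large-v3, which verifies the drafted tokens.
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    # Speculative decoding: the assistant proposes tokens, the target accepts
    # or rejects them, so the output matches Whisper-Large-v3 exactly.
    generate_kwargs={"assistant_model": assistant_model},
)

print(pipe("audio.mp3")["text"])
```

Because the distilled model shares Whisper-Large-v3's encoder, the draft-and-verify loop adds little overhead while preserving the larger model's transcriptions.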

Frequently Asked Questions

Q: What makes this model unique?

The model combines extensive training data (98k hours), patient teacher training, and aggressive data augmentation to achieve superior performance while maintaining high efficiency. It's specifically optimized for both short and long-form transcription tasks.

Q: What are the recommended use cases?

The model is ideal for production environments requiring efficient speech recognition, particularly where a balance between speed and accuracy is crucial. It excels in both short-form transcription tasks and can handle long-form content through either sequential or chunked processing approaches.
