# Distil-Large-v3.5
| Property | Value |
|---|---|
| Parameter Count | 756M |
| License | MIT |
| Paper | arXiv:2311.00430 |
| Training Data | 98,000 hours of audio |
| Speed Improvement | 1.46x faster than the original model |
## What is distil-large-v3.5?
Distil-Large-v3.5 is a knowledge-distilled version of OpenAI's Whisper-Large-v3, designed for efficient speech recognition. A significant step forward in the Distil-Whisper family, it was trained on 98,000 hours of diverse audio and maintains competitive accuracy while offering a 1.46x speed improvement over the original model.
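As a quick reference, transcription can be run through the Hugging Face Transformers pipeline; the sketch below assumes the checkpoint id `distil-whisper/distil-large-v3.5` and a local audio file `sample.wav`, both placeholders for illustration.

```python
# Minimal sketch: short-form transcription with the Transformers pipeline.
# The checkpoint id and the local 16 kHz file "sample.wav" are placeholders.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3.5",
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```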
## Implementation Details
The model retains Whisper's encoder-decoder architecture but streamlines the decoder, which accounts for over 90% of inference time in the original model. Training follows a patient-teacher recipe with aggressive data augmentation and an extended 80-epoch schedule.
- Encoder-decoder architecture with focus on decoder optimization
- Trained using 64 H100 GPUs on the Jean Zay cluster
- Implements Flash Attention 2 and Torch SDPA for improved performance
- Supports both sequential and chunked long-form transcription (a chunked-pipeline sketch follows this list)
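A rough sketch of chunked long-form transcription is shown below. The `attn_implementation` argument selects Torch SDPA (or Flash Attention 2 where the `flash-attn` package is installed); the checkpoint id, chunk length, and batch size are illustrative assumptions rather than prescribed settings.

```python
# Sketch: chunked long-form transcription with Torch SDPA attention.
# Checkpoint id, chunk length, and batch size are illustrative assumptions.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3.5"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # or "flash_attention_2" if flash-attn is installed
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=25,  # split long audio into chunks that are transcribed in parallel
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("long_audio.wav")["text"])
```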
## Core Capabilities
- Short-form transcription with 7.08% WER on out-of-distribution data
- Long-form transcription with 11.39% WER on out-of-distribution data
- Can serve as an assistant model for speculative decoding with Whisper-Large-v3, yielding 2x faster inference (see the sketch after this list)
- Compatible with multiple frameworks including Whisper.cpp, Faster-Whisper, and Candle
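The following sketch shows how the model could act as the assistant (draft) model in Transformers assisted generation alongside `openai/whisper-large-v3`; the checkpoint ids and pipeline settings are assumptions for illustration, not a verified recipe.

```python
# Sketch: speculative decoding, using distil-large-v3.5 as the assistant
# (draft) model for openai/whisper-large-v3. Checkpoint ids are assumptions.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v3.5", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"assistant_model": assistant_model},  # draft tokens are verified by the main model
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```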
## Frequently Asked Questions
Q: What makes this model unique?
The model combines extensive training data (98k hours), patient-teacher training, and aggressive data augmentation to improve accuracy over earlier Distil-Whisper checkpoints while maintaining high efficiency. It is optimized for both short- and long-form transcription tasks.
Q: What are the recommended use cases?
The model is well suited to production environments that need efficient speech recognition, particularly where the balance between speed and accuracy is crucial. It handles short-form transcription directly and long-form content through either sequential or chunked processing (a sequential sketch follows below).
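For completeness, here is a minimal sequential long-form sketch, in which the full audio is handled in a single `generate` call rather than parallel chunks; the checkpoint id, `librosa`-based audio loading, and generation settings are illustrative assumptions.

```python
# Sketch: sequential long-form transcription via a single generate() call.
# Checkpoint id and librosa-based audio loading are illustrative assumptions.
import torch
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3.5"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

audio, _ = librosa.load("long_audio.wav", sr=16_000)  # Whisper expects 16 kHz mono audio

inputs = processor(
    audio,
    sampling_rate=16_000,
    return_tensors="pt",
    truncation=False,            # keep the full audio instead of cutting at 30 s
    padding="longest",
    return_attention_mask=True,
).to(device, torch_dtype)

pred_ids = model.generate(**inputs, return_timestamps=True)
print(processor.batch_decode(pred_ids, skip_special_tokens=True)[0])
```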