# Distil-Large-v3.5
| Property | Value |
|---|---|
| Parameter Count | 756M |
| License | MIT |
| Paper | arXiv:2311.00430 |
| Training Data | 98,000 hours of audio |
| Speed Improvement | 1.46x faster than the original model |
## What is distil-large-v3.5?
Distil-Large-v3.5 is a knowledge-distilled version of OpenAI's Whisper-Large-v3, designed for efficient speech recognition. A significant step forward in the Distil-Whisper family, it was trained on 98,000 hours of diverse audio and maintains competitive accuracy while offering a 1.46x speed improvement over the original model.
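As a quick reference, transcription can be run through the Hugging Face Transformers pipeline; the sketch below assumes the checkpoint id `distil-whisper/distil-large-v3.5` and a local audio file `sample.wav`, both placeholders for illustration.

```python
# Minimal sketch: short-form transcription with the Transformers pipeline.
# The checkpoint id and the local 16 kHz file "sample.wav" are placeholders.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3.5",
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```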
## Implementation Details
The model retains Whisper's encoder-decoder architecture but streamlines the decoder, which accounts for over 90% of inference time in the original model. Training follows a patient-teacher recipe with aggressive data augmentation and an extended 80-epoch schedule.
- Encoder-decoder architecture with focus on decoder optimization
- Trained using 64 H100 GPUs on the Jean Zay cluster
- Implements Flash Attention 2 and Torch SDPA for improved performance
- Supports both sequential and chunked long-form transcription (a chunked-pipeline sketch follows this list)
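A rough sketch of chunked long-form transcription is shown below. The `attn_implementation` argument selects Torch SDPA (or Flash Attention 2 where the `flash-attn` package is installed); the checkpoint id, chunk length, and batch size are illustrative assumptions rather than prescribed settings.

```python
# Sketch: chunked long-form transcription with Torch SDPA attention.
# Checkpoint id, chunk length, and batch size are illustrative assumptions.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3.5"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # or "flash_attention_2" if flash-attn is installed
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=25,  # split long audio into chunks that are transcribed in parallel
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("long_audio.wav")["text"])
```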
## Core Capabilities
- Short-form transcription with 7.08% WER on out-of-distribution data
- Long-form transcription with 11.39% WER on out-of-distribution data
- Can serve as an assistant model for speculative decoding with Whisper-Large-v3, yielding 2x faster inference (see the sketch after this list)
- Compatible with multiple frameworks including Whisper.cpp, Faster-Whisper, and Candle
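The following sketch shows how the model could act as the assistant (draft) model in Transformers assisted generation alongside `openai/whisper-large-v3`; the checkpoint ids and pipeline settings are assumptions for illustration, not a verified recipe.

```python
# Sketch: speculative decoding, using distil-large-v3.5 as the assistant
# (draft) model for openai/whisper-large-v3. Checkpoint ids are assumptions.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v3.5", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"assistant_model": assistant_model},  # draft tokens are verified by the main model
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```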
## Frequently Asked Questions
Q: What makes this model unique?
The model combines extensive training data (98k hours), patient-teacher training, and aggressive data augmentation to improve accuracy over earlier Distil-Whisper checkpoints while maintaining high efficiency. It is optimized for both short- and long-form transcription tasks.
Q: What are the recommended use cases?
The model is well suited to production environments that need efficient speech recognition, particularly where the balance between speed and accuracy is crucial. It handles short-form transcription directly and long-form content through either sequential or chunked processing (a sequential sketch follows below).
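For completeness, here is a minimal sequential long-form sketch, in which the full audio is handled in a single `generate` call rather than parallel chunks; the checkpoint id, `librosa`-based audio loading, and generation settings are illustrative assumptions.

```python
# Sketch: sequential long-form transcription via a single generate() call.
# Checkpoint id and librosa-based audio loading are illustrative assumptions.
import torch
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3.5"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

audio, _ = librosa.load("long_audio.wav", sr=16_000)  # Whisper expects 16 kHz mono audio

inputs = processor(
    audio,
    sampling_rate=16_000,
    return_tensors="pt",
    truncation=False,            # keep the full audio instead of cutting at 30 s
    padding="longest",
    return_attention_mask=True,
).to(device, torch_dtype)

pred_ids = model.generate(**inputs, return_timestamps=True)
print(processor.batch_decode(pred_ids, skip_special_tokens=True)[0])
```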