Whisper Large V3
Property | Value |
---|---|
Parameter Count | 1.54B |
License | Apache 2.0 |
Paper | View Paper |
Supported Languages | 99 |
Model Type | Speech Recognition |
What is whisper-large-v3?
Whisper Large V3 is an OpenAI model for automatic speech recognition (ASR) and speech translation. It keeps the architecture of its predecessors but changes the input representation to 128 Mel frequency bins (up from 80), and it adds support for Cantonese. The model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio.
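To make the change in input size concrete, the sketch below computes the shape of one log-Mel input window for 80 and 128 bins. The 16 kHz sample rate, 160-sample hop, and 30-second window are assumptions based on the public Whisper reference code, not on this page, and `MelFrontEnd` is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MelFrontEnd:
    """Hypothetical description of the audio front end (illustration only)."""

    sample_rate: int = 16_000  # assumption: input audio is resampled to 16 kHz
    hop_length: int = 160      # assumption: one frame every 160 samples (10 ms)
    window_seconds: int = 30   # assumption: fixed 30-second input window
    n_mels: int = 128          # from the text: 128 bins in V3, 80 before

    def feature_shape(self) -> tuple[int, int]:
        """Return (mel bins, frames) for one input window."""
        n_samples = self.sample_rate * self.window_seconds
        n_frames = n_samples // self.hop_length
        return (self.n_mels, n_frames)


v3_front_end = MelFrontEnd()
v2_front_end = MelFrontEnd(n_mels=80)

print(v3_front_end.feature_shape())  # (128, 3000)
print(v2_front_end.feature_shape())  # (80, 3000)
```

Under these assumptions the number of frames per window is unchanged; only the frequency resolution of each frame grows.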
Implementation Details
The model uses a Transformer-based encoder-decoder architecture and is reported to make 10-20% fewer errors than its predecessor, Whisper Large V2. It supports both transcription in the source language and translation to English, and its decoding supports temperature fallback and timestamp generation.
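Temperature fallback can be sketched as a retry loop: decode at a low temperature first, and retry at a higher one only if the output looks degenerate. The snippet is a simplified illustration, not the model's actual decoding code. The `decode` callable, the `Decoded` type, and the threshold values (temperatures from 0.0 to 1.0, a compression-ratio limit of 2.4, a mean log-probability floor of -1.0) are assumptions modeled on commonly used defaults.

```python
import zlib
from typing import Callable, NamedTuple, Sequence


class Decoded(NamedTuple):
    """Hypothetical decoder result used for this illustration."""

    text: str
    avg_logprob: float  # mean token log-probability of the hypothesis


def compression_ratio(text: str) -> float:
    """Highly repetitive text compresses well, so a high ratio flags looping output."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(
    decode: Callable[[float], Decoded],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # assumed schedule
    max_compression_ratio: float = 2.4,  # assumed threshold
    min_avg_logprob: float = -1.0,       # assumed threshold
) -> tuple[Decoded, float]:
    """Try each temperature in order and stop at the first acceptable result."""
    if not temperatures:
        raise ValueError("at least one temperature is required")
    for temperature in temperatures:
        result = decode(temperature)
        too_repetitive = compression_ratio(result.text) > max_compression_ratio
        too_unlikely = result.avg_logprob < min_avg_logprob
        if not (too_repetitive or too_unlikely):
            break
    # If every attempt failed the checks, the last attempt is returned as-is.
    return result, temperature


def fake_decode(temperature: float) -> Decoded:
    """Stand-in decoder: loops at temperature 0.0, behaves at higher values."""
    if temperature == 0.0:
        return Decoded("la " * 200, avg_logprob=-0.2)
    return Decoded("a short clean sentence", avg_logprob=-0.3)


best, used_temperature = decode_with_fallback(fake_decode)
print(used_temperature)  # 0.2
print(best.text)         # a short clean sentence
```

In this toy run the greedy attempt is rejected for being repetitive, and the first sampled attempt is accepted.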
- Supports FP16 (half-precision) tensors for faster, lower-memory inference
- Compatible with Flash Attention 2 and with torch.compile, the latter reported to give speed-ups of up to 4.5x
- Supports chunked processing for long-form audio
- Supports batched inference, so several audio chunks can be decoded together
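Chunked long-form processing and batching can be illustrated with plain arithmetic: split the recording into overlapping windows, then group the windows into batches. This is a minimal sketch, not the algorithm of any particular library. The 30-second window, the 5-second overlap, and the function names `chunk_spans` and `make_batches` are assumptions chosen for the example.

```python
def chunk_spans(
    total_s: float,
    chunk_s: float = 30.0,   # assumed window length in seconds
    overlap_s: float = 5.0,  # assumed overlap between neighbouring windows
) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds that cover the whole recording."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    total_s = float(total_s)
    step = chunk_s - overlap_s
    spans = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += step
    return spans


def make_batches(items: list, batch_size: int) -> list[list]:
    """Group items into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


spans = chunk_spans(70.0)
print(spans)  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
print(len(make_batches(spans, batch_size=2)))  # 2
```

The overlap gives neighbouring windows shared context, so that text near a boundary can be reconciled when the per-chunk transcripts are merged; the merge step itself is not shown here.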
Core Capabilities
- Multilingual speech recognition across 99 languages
- Zero-shot translation to English
- Word-level and sentence-level timestamp generation
- Robust performance across different accents and in the presence of background noise
- Support for both short and long-form audio processing
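One possible way to exercise these capabilities is through the Hugging Face Transformers `pipeline` API. The sketch below is an assumption about tooling, since this page does not name a library: the model ID `openai/whisper-large-v3` and the helper names `build_call_kwargs` and `transcribe` are illustrative. The heavy imports are deferred so that the option-building helper can be used on its own.

```python
def build_call_kwargs(task="transcribe", language=None, timestamps=None):
    """Assemble keyword arguments for one ASR pipeline call (hypothetical helper).

    task:       "transcribe" (source language) or "translate" (to English)
    language:   optional source language, e.g. "french"; None lets the model detect it
    timestamps: None, "sentence", or "word"
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    if timestamps not in (None, "sentence", "word"):
        raise ValueError(f"unknown timestamp level: {timestamps!r}")

    generate_kwargs = {"task": task}
    if language is not None:
        generate_kwargs["language"] = language

    call_kwargs = {"generate_kwargs": generate_kwargs}
    if timestamps == "sentence":
        call_kwargs["return_timestamps"] = True
    elif timestamps == "word":
        call_kwargs["return_timestamps"] = "word"
    return call_kwargs


def transcribe(audio_path, **options):
    """Run the model on one audio file; downloads the weights on first use."""
    # Imported lazily: torch and transformers are only needed for real inference.
    import torch
    from transformers import pipeline

    use_gpu = torch.cuda.is_available()
    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",  # assumed Hub model ID
        torch_dtype=torch.float16 if use_gpu else torch.float32,
        device="cuda:0" if use_gpu else "cpu",
    )
    return asr(audio_path, **build_call_kwargs(**options))
```

A call such as `transcribe("meeting.wav", task="translate", timestamps="word")` would then request an English translation with word-level timestamps, where `meeting.wav` is a placeholder file name.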
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its accuracy gains over previous versions, its larger training dataset, and its 128-bin Mel frequency input. It is particularly notable for holding up well across many languages and in challenging audio conditions.
Q: What are the recommended use cases?
The model is well suited to large-scale speech transcription, multilingual content processing, accessibility tools, and research applications. It fits best in scenarios that require high accuracy across multiple languages or that involve challenging audio conditions.