F5-TTS-Vietnamese-100h
| Property | Value |
|---|---|
| License | CC-BY-NC-SA-4.0 |
| Author | hynt |
| Training Data | 150 hours of Vietnamese speech |
| Base Model | F5-TTS_Base |
| Model URL | https://huggingface.co/hynt/F5-TTS-Vietnamese-100h |
What is F5-TTS-Vietnamese-100h?
F5-TTS-Vietnamese-100h is a Text-to-Speech model fine-tuned for Vietnamese speech synthesis. Built on the F5-TTS base architecture, it was trained on a 150-hour dataset drawn from the VLSP collections (2021-2023), vietTTS, TeacherDinh-UEH, and curated YouTube content.
Implementation Details
The model was trained on an RTX 3090 GPU with a batch size of 3200 frames for 390,000 training steps. Before training, the data was preprocessed in three stages: background music removal using Facebook's Demucs model, length filtering (only clips of 1-30 seconds were kept), and text normalization.
- Comprehensive data cleaning and preprocessing pipeline
- Background music removal with Demucs
- Optimized for production-quality speech synthesis
- Access restricted to institutional research use
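The length filtering and text normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual pipeline: the source does not say which normalization rules were applied, so the Unicode NFC composition, whitespace collapsing, and lowercasing shown here are assumptions, and the names `normalize_text`, `keep_clip`, and `prepare` are hypothetical. The Demucs separation step is left out; the sketch assumes clip durations have already been measured.

```python
import re
import unicodedata

# Duration bounds taken from the description above (1-30 seconds).
MIN_SECONDS = 1.0
MAX_SECONDS = 30.0


def normalize_text(text: str) -> str:
    """Assumed normalization: NFC composition, single spaces, lowercase."""
    # NFC merges base letters and combining marks, so Vietnamese
    # diacritics are represented consistently across sources.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


def keep_clip(duration_seconds: float) -> bool:
    """Return True when the clip length falls inside the 1-30 s window."""
    return MIN_SECONDS <= duration_seconds <= MAX_SECONDS


def prepare(records):
    """Filter and normalize (audio_path, duration_seconds, transcript) tuples."""
    cleaned = []
    for path, duration, transcript in records:
        if not keep_clip(duration):
            continue
        text = normalize_text(transcript)
        if text:  # drop clips whose transcript is empty after cleaning
            cleaned.append((path, text))
    return cleaned


# Hypothetical records: one too short, one valid, one too long.
records = [
    ("a.wav", 0.4, "Xin chào"),
    ("b.wav", 5.2, "  Xin   chào,  Việt Nam! "),
    ("c.wav", 42.0, "Một đoạn ghi âm rất dài"),
]
kept = prepare(records)
```

Punctuation is kept on purpose in this sketch, since the model is described as accepting punctuated text.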
Core Capabilities
- High-quality Vietnamese speech synthesis
- Handles punctuated text input
- Adjustable speech speed control
- Integration with multiple vocoder options
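As a rough illustration of the speed and vocoder options listed above, the sketch below assembles a command line for the upstream F5-TTS inference CLI. It assumes that CLI is installed as `f5-tts_infer-cli` and accepts the `--speed`, `--vocoder_name`, `--ckpt_file`, and `--vocab_file` flags as in the upstream F5-TTS project; check these against your installed version. The helper `build_infer_command`, the model name `F5TTS_Base`, the vocoder choices, and all file paths are placeholders, not values confirmed by this model card.

```python
import shlex

# Assumed vocoder choices, based on the upstream F5-TTS CLI.
SUPPORTED_VOCODERS = ("vocos", "bigvgan")


def build_infer_command(ref_audio, ref_text, gen_text, ckpt_file, vocab_file,
                        speed=1.0, vocoder_name="vocos"):
    """Build (but do not run) an argument list for the inference CLI."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    if vocoder_name not in SUPPORTED_VOCODERS:
        raise ValueError(f"unsupported vocoder: {vocoder_name}")
    return [
        "f5-tts_infer-cli",
        "--model", "F5TTS_Base",        # assumed name of the base config
        "--ref_audio", ref_audio,       # short reference clip of the target voice
        "--ref_text", ref_text,         # transcript of the reference clip
        "--gen_text", gen_text,         # text to synthesize
        "--speed", str(speed),          # values below 1.0 slow the speech down
        "--vocoder_name", vocoder_name,
        "--ckpt_file", ckpt_file,       # fine-tuned Vietnamese checkpoint
        "--vocab_file", vocab_file,     # vocabulary shipped with the checkpoint
    ]


# Placeholder paths; substitute the files downloaded from the model URL.
cmd = build_infer_command(
    ref_audio="ref.wav",
    ref_text="xin chào",
    gen_text="Hôm nay trời đẹp.",
    ckpt_file="model.pt",
    vocab_file="vocab.txt",
    speed=0.8,
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths point at real files.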
Frequently Asked Questions
Q: What makes this model unique?
This model is fine-tuned specifically for Vietnamese, using 150 hours of curated speech from several sources (VLSP 2021-2023, vietTTS, TeacherDinh-UEH, and YouTube). The training audio was cleaned of background music, filtered to clips of 1-30 seconds, and paired with normalized text, all of which is intended to improve output quality.
Q: What are the recommended use cases?
The model is specifically designed for research purposes in academic or institutional settings. It's ideal for Vietnamese TTS research, speech synthesis experiments, and academic studies in computational linguistics.