F5-TTS-Vietnamese-100h

Property	Value
License	CC-BY-NC-SA-4.0
Author	hynt
Training Data	150 hours Vietnamese speech
Base Model	F5-TTS_Base
Model URL	https://huggingface.co/hynt/F5-TTS-Vietnamese-100h

What is F5-TTS-Vietnamese-100h?

F5-TTS-Vietnamese-100h is a Text-to-Speech model fine-tuned for Vietnamese speech synthesis. Built on the F5-TTS base model (F5-TTS_Base), it was trained on a 150-hour dataset drawn from the VLSP collections (2021-2023), vietTTS, TeacherDinh-UEH, and curated YouTube content.

Implementation Details

The model was trained on an RTX 3090 GPU with a batch size of 3200 frames for 390,000 training steps. The training data was preprocessed in three steps: background music was removed with Facebook's Demucs model, clips were filtered to lengths of 1-30 seconds, and the transcripts were normalized.
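To get a feel for what a 3200-frame batch means in seconds of audio, the arithmetic can be sketched as below. The sample rate (24 kHz) and hop length (256 samples) are assumptions taken from commonly used F5-TTS mel-spectrogram settings; this card does not state them, so treat the numbers as illustrative.

```python
# Assumed front-end settings; NOT stated in this model card.
SAMPLE_RATE_HZ = 24_000  # assumption: audio resampled to 24 kHz
HOP_LENGTH = 256         # assumption: 256 samples between mel frames


def frames_to_seconds(n_frames: int,
                      sample_rate: int = SAMPLE_RATE_HZ,
                      hop_length: int = HOP_LENGTH) -> float:
    """Convert a number of mel frames into seconds of audio."""
    return n_frames * hop_length / sample_rate


def seconds_to_frames(seconds: float,
                      sample_rate: int = SAMPLE_RATE_HZ,
                      hop_length: int = HOP_LENGTH) -> int:
    """Convert seconds of audio into a whole number of mel frames."""
    return int(seconds * sample_rate / hop_length)


if __name__ == "__main__":
    # Audio covered by one 3200-frame batch under the assumed settings.
    print(f"{frames_to_seconds(3200):.1f} s per batch")
    # Frames needed by the longest clip allowed by the 1-30 second filter.
    print(f"{seconds_to_frames(30.0)} frames for a 30 s clip")
```

Under these assumed settings, a single clip at the 30-second upper bound of the length filter still fits inside one 3200-frame batch.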

Data cleaning and preprocessing pipeline (length filtering, text normalization)
Background music removal with Demucs
Optimized for production-quality speech synthesis
Access limited to institutions, for research purposes only
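The length-filtering and text-normalization steps described above could look roughly like the following. This is a minimal sketch, not the author's pipeline: the `Clip` type, the function names, and the specific normalization rules (Unicode NFC, lowercasing, whitespace collapsing) are assumptions, since the card does not say how normalization was done. The Demucs separation step is deliberately left out.

```python
import re
import unicodedata
from dataclasses import dataclass

MIN_SECONDS = 1.0   # lower bound of the 1-30 second filter from the card
MAX_SECONDS = 30.0  # upper bound of the 1-30 second filter from the card


@dataclass
class Clip:
    """Hypothetical record for one training utterance."""
    path: str
    duration_sec: float
    text: str


def normalize_text(text: str) -> str:
    """Illustrative normalization (an assumption, not the author's rules)."""
    # NFC composes base letters and combining marks, so Vietnamese
    # diacritics have one consistent code-point representation.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def filter_and_normalize(clips: list[Clip]) -> list[Clip]:
    """Keep clips whose duration is within bounds and normalize their text."""
    kept = []
    for clip in clips:
        if MIN_SECONDS <= clip.duration_sec <= MAX_SECONDS:
            kept.append(Clip(clip.path, clip.duration_sec,
                             normalize_text(clip.text)))
    return kept


if __name__ == "__main__":
    clips = [
        Clip("a.wav", 0.5, "Quá ngắn"),       # dropped: shorter than 1 s
        Clip("b.wav", 5.0, "  Xin   CHÀO "),  # kept and normalized
        Clip("c.wav", 31.0, "Quá dài"),       # dropped: longer than 30 s
    ]
    for clip in filter_and_normalize(clips):
        print(clip.path, repr(clip.text))
```

NFC normalization is included because Vietnamese text often mixes precomposed and decomposed diacritics, which would otherwise look identical on screen but differ as strings.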

Core Capabilities

High-quality Vietnamese speech synthesis
Support for various text inputs with punctuation
Adjustable speech speed control
Integration with multiple vocoder options
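One common way to implement speech speed control in a duration-conditioned model is to scale the estimated output length by the speed factor. The sketch below illustrates that idea only; the function name and the length heuristic (output frames proportional to text length, using the reference clip's frames-per-character ratio) are assumptions, and the actual inference code for this model may differ.

```python
def estimate_total_frames(ref_frames: int,
                          ref_text: str,
                          gen_text: str,
                          speed: float = 1.0) -> int:
    """Estimate total frames (reference + generated) for a given speed.

    Hypothetical helper: the generated part is assumed to take as many
    frames per character as the reference clip, divided by `speed`.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not ref_text:
        raise ValueError("ref_text must not be empty")
    frames_per_char = ref_frames / len(ref_text)
    gen_frames = int(frames_per_char * len(gen_text) / speed)
    return ref_frames + gen_frames


if __name__ == "__main__":
    ref_frames = 400  # illustrative length of the reference audio in frames
    for speed in (0.5, 1.0, 2.0):
        total = estimate_total_frames(ref_frames, "abcd", "abcdefgh", speed)
        print(f"speed={speed}: {total} frames")
```

With this heuristic, a speed above 1.0 shortens the generated segment and a speed below 1.0 lengthens it, while the reference segment is unchanged.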

Frequently Asked Questions

Q: What makes this model unique?

This model is distinguished by its training on 150 hours of curated Vietnamese speech and its tuning specifically for the Vietnamese language. The training data combines several speech sources and was preprocessed (background music removal, length filtering, text normalization) with the aim of improving output quality.

Q: What are the recommended use cases?

The model is intended for research use in academic or institutional settings, in line with its non-commercial CC-BY-NC-SA-4.0 license. It is suited to Vietnamese TTS research, speech synthesis experiments, and academic studies in computational linguistics.