Gemma-3-4b-it-speech
Property | Value |
---|---|
Base Model | google/gemma-3-4b-it |
Parameters | 4B + 596M Speech LoRA adapter |
License | Gemma |
Author | junnei |
Context Length | 128K tokens |
What is gemma-3-4b-it-speech?
Gemma-3-4b-it-speech is a multimodal extension of the Gemma-3 model family that handles text, vision, and speech inputs. It adds a Speech Adapter to the original Gemma architecture, enabling speech recognition and speech translation while retaining the base model's language and vision capabilities.
Implementation Details
The model builds on the google/gemma-3-4b-it base model by adding a 596M-parameter Speech LoRA adapter. It was trained on automatic speech recognition (ASR) and automatic speech translation (AST) tasks for one epoch (12 hours) on a single A100 GPU, using English and Korean data from the CoVoST 2 dataset, restricted to audio clips under 30 seconds.
- Architecture: Multimodal Language Model with Speech Processing capabilities
- Training Data: CoVoST 2 dataset (English and Korean)
- ASR Performance (English): BLEU 85.95, CER 4.47, WER 8.49
- AST Performance (English to Korean): BLEU 29.83
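To make the CER and WER figures above easier to interpret, here is a minimal sketch of how the two error rates are commonly computed from a reference transcript and a model hypothesis. It assumes the reported values are percentages and uses plain whitespace tokenization with no text normalization; the function names are illustrative, and the evaluation behind the reported numbers may have normalized casing or punctuation differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent (assumption: whitespace tokenization)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (assumption: spaces count as characters)."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# One dropped word out of six reference words.
print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.2f}")  # 16.67
```

Lower is better for both metrics. A WER of 8.49 would then correspond to roughly one word-level error per twelve reference words.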
Core Capabilities
- Automatic Speech Recognition (ASR) for English (WER 8.49, CER 4.47)
- Automatic Speech Translation (AST)
- Vision-Language Processing
- Multilingual Support (English and Korean)
- 128K token context window
Frequently Asked Questions
Q: What makes this model unique?
The model combines Gemma's language and vision capabilities with speech processing, so text, vision, and speech inputs are handled within a single architecture. It is one of the few open models that accept all three modalities.
Q: What are the recommended use cases?
The model is best suited for experimental and research purposes, particularly for tasks involving speech recognition and translation of short audio clips (under 30 seconds). It's specifically optimized for English ASR and English-to-Korean translation tasks.
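Because training only covered clips under 30 seconds, longer recordings may need to be split before transcription. The sketch below shows one simple way to do that on a raw sample buffer. The helper name, the 16 kHz sample rate, and the fixed-size chunking strategy are assumptions for illustration and are not part of the model's documented workflow.

```python
def split_into_chunks(samples, sample_rate, max_seconds=30.0):
    """Split a sequence of audio samples into consecutive chunks.

    Each chunk holds at most `max_seconds` of audio. This is a hypothetical
    helper: it cuts at fixed offsets and does not look for pauses.
    """
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    chunk_len = max(1, int(sample_rate * max_seconds))
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# Example: 70 seconds of silence at an assumed 16 kHz sample rate.
sample_rate = 16_000
audio = [0.0] * (70 * sample_rate)
chunks = split_into_chunks(audio, sample_rate)
print([len(c) / sample_rate for c in chunks])  # [30.0, 30.0, 10.0]
```

Fixed-offset cuts can split a word across two chunks, so a pause-aware splitter would likely give cleaner transcripts. Passing a slightly smaller `max_seconds` keeps every chunk strictly under the 30-second limit.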