KcELECTRA-base
| Property | Value |
|---|---|
| Parameter Count | 109M parameters |
| License | MIT |
| Author | beomi |
| Model Type | ELECTRA |
| Languages | Korean, English |
What is KcELECTRA-base?
KcELECTRA-base is a specialized Korean language model trained on user-generated content, specifically comments and replies from Naver news articles. Unlike traditional Korean language models, which focus on formal text, this model excels at processing noisy, informal text full of colloquialisms and internet slang.
Implementation Details
The model was trained on approximately 17GB of text data collected between 2019 and 2021, comprising over 180 million sentences. It uses a BERT WordPiece tokenizer with a vocabulary size of 30,000 tokens and was trained on a TPU v3-8 for approximately 10 days.
- Trained on user comments and replies from news articles
- Implements ELECTRA architecture for efficient training
- Supports both Korean and English text processing
- Includes emoji support and special character handling
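Because the training data consists of raw comments, downstream inputs typically benefit from a light cleanup pass that keeps Hangul, emoji, and basic punctuation while taming character spam. The function below is an illustrative sketch, not the model's published preprocessing pipeline; the character ranges and the collapse-to-three rule are assumptions chosen for demonstration.

```python
import re

# Keep Hangul syllables and jamo, ASCII alphanumerics, basic punctuation,
# and common emoji ranges; everything else becomes a space.
# NOTE: these ranges are illustrative assumptions, not the official pipeline.
_KEEP = re.compile(
    r"[^0-9a-zA-Z .,?!~"
    r"\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F"   # Hangul
    r"\u2600-\u27BF\U0001F300-\U0001FAFF]"       # emoji (partial coverage)
)

def clean(text: str) -> str:
    """Illustrative cleanup for noisy Korean comments."""
    text = _KEEP.sub(" ", text)
    # Collapse runs of 4+ identical characters down to 3 ("ㅋㅋㅋㅋㅋ" -> "ㅋㅋㅋ").
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean("이 영화 진짜 재밌다ㅋㅋㅋㅋㅋㅋ 👍👍"))  # → 이 영화 진짜 재밌다ㅋㅋㅋ 👍👍
```

Collapsing repeats rather than deleting them preserves the emphasis signal ("ㅋㅋㅋ" still reads as laughter) while keeping the token stream compact for the WordPiece tokenizer.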
Core Capabilities
- Sentiment Analysis (91.97% accuracy on NSMC)
- Named Entity Recognition (87.35% F1 score)
- Question-Answer Processing (90.40% F1 score on KorQuAD)
- Text Classification and Paraphrase Detection
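The NER and KorQuAD figures above are F1 scores, i.e. the harmonic mean of precision and recall over predicted spans. A minimal sketch of that computation (the counts in the example are hypothetical, not from the benchmark):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall over predicted spans."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical NER evaluation: 90 correct spans, 10 spurious, 15 missed.
print(round(f1_score(tp=90, fp=10, fn=15), 4))  # → 0.878
```

Unlike plain accuracy (as reported for NSMC), F1 penalizes both spurious and missed entity spans, which is why it is the standard metric for NER and extractive QA.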
Frequently Asked Questions
Q: What makes this model unique?
KcELECTRA-base is specifically designed for processing user-generated content, making it particularly effective for social media text, comments, and informal Korean language that contains neologisms and colloquialisms.
Q: What are the recommended use cases?
The model is best suited for tasks involving informal Korean text analysis, including sentiment analysis, comment classification, and social media content processing. It performs particularly well on noisy text where traditional language models might struggle.