Ruri-base

Property	Value
Parameter Count	111M
Output Dimensions	768
Max Sequence Length	512 tokens
License	Apache 2.0
Paper	arXiv:2409.07737

What is ruri-base?

Ruri-base is a Japanese text embedding model developed by cl-nagoya for general-purpose text understanding and similarity tasks. It performs well across Japanese benchmark tasks, including retrieval, semantic textual similarity, and classification, with an average score of 71.91% on the JMTEB benchmark suite.

Implementation Details

The model is built on the Sentence Transformers framework and uses a BERT-based encoder with mean pooling over token embeddings. It accepts Japanese text of up to 512 tokens and outputs 768-dimensional embeddings. Every input must carry a prefix that indicates its role: "クエリ: " for queries and "文章: " for passages.
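The prefix requirement can be sketched as follows. This is a minimal illustration, not the authors' reference code: the helper names (`as_query`, `as_passage`, `encode_with_ruri`) are hypothetical, and the model identifier `cl-nagoya/ruri-base` is assumed to be the name under which the model is published on the Hugging Face Hub.

```python
from typing import List

QUERY_PREFIX = "クエリ: "    # prefix for search queries
PASSAGE_PREFIX = "文章: "    # prefix for passages / documents


def as_query(text: str) -> str:
    """Prepend the query prefix unless the text already starts with it."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text


def as_passage(text: str) -> str:
    """Prepend the passage prefix unless the text already starts with it."""
    return text if text.startswith(PASSAGE_PREFIX) else PASSAGE_PREFIX + text


def encode_with_ruri(queries: List[str], passages: List[str],
                     model_name: str = "cl-nagoya/ruri-base"):
    """Encode prefixed queries and passages (hypothetical helper).

    The import is deferred so the prefix helpers above work even when
    sentence-transformers is not installed. The model name is an assumption.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    query_vecs = model.encode([as_query(q) for q in queries])
    passage_vecs = model.encode([as_passage(p) for p in passages])
    return query_vecs, passage_vecs


# The prefix helpers need no model download:
print(as_query("瑠璃色はどんな色？"))
print(as_passage("瑠璃色は紫みを帯びた濃い青。"))
```

Keeping the prefixes in two small functions makes it harder to embed a query with the passage prefix by accident, which would likely degrade retrieval quality.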

Achieves an average score of 71.91% on the JMTEB benchmark suite
Uses mean pooling to aggregate token embeddings into a single vector
Uses cosine similarity to compare embeddings
Requires Japanese tokenization dependencies (fugashi, sentencepiece, unidic-lite)
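To make the pooling and similarity steps concrete, here is a small pure-Python sketch of masked mean pooling followed by cosine similarity. It is illustrative only: the function names are hypothetical, the toy 2-dimensional vectors stand in for real 768-dimensional token embeddings, and in practice the Sentence Transformers framework performs these steps internally.

```python
import math
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    if not token_embeddings:
        raise ValueError("no token embeddings given")
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy example: three token vectors, the last one is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)                                   # [2.0, 1.0]
print(cosine_similarity(sentence_vec, sentence_vec))
```

The mask matters because padded positions would otherwise pull the average toward whatever values the padding tokens carry.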

Core Capabilities

Semantic Text Similarity (82.87% on JMTEB)
Document Retrieval (69.82% on JMTEB)
Text Classification (75.58% on JMTEB)
Reranking (92.91% on JMTEB)
Text Clustering (54.16% on JMTEB)

Frequently Asked Questions

Q: What makes this model unique?

Ruri-base offers balanced performance across different NLP tasks while being optimized specifically for Japanese text. At 111M parameters it is moderate in size, which makes it practical for production deployments. Its input convention includes a prefix system that distinguishes query inputs from passage inputs.

Q: What are the recommended use cases?

The model suits applications that require semantic understanding of Japanese text, including document search and retrieval, semantic similarity matching, content recommendation systems, and text classification. It is particularly useful where high-quality text embeddings are needed at a reasonable computational cost.
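As an illustration of the document search use case, the sketch below ranks passages against a query by cosine similarity and returns the top matches. The function names are hypothetical and the 3-dimensional vectors are made-up stand-ins, not real model outputs; real embeddings from the model would be 768-dimensional.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec: Sequence[float],
          passage_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k most similar passages."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(passage_vecs):
        p = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embeddings of one query and three passages.
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [
    [0.9, 0.1, 0.0],   # close to the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.0],   # partially related
]
ranked = top_k(query_vec, passage_vecs, k=2)
for idx, score in ranked:
    print(idx, round(score, 4))
```

For larger collections, passage vectors would typically be normalized once and stored, so each search only needs to normalize the query.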
