ruri-base

Maintained by: cl-nagoya

Property             Value
Parameter Count      111M
Output Dimensions    768
Max Sequence Length  512 tokens
License              Apache 2.0
Paper                arXiv:2409.07737

What is ruri-base?

Ruri-base is a general-purpose Japanese text embedding model developed by cl-nagoya. It is designed for text understanding and similarity tasks, and it is evaluated on benchmark tasks including retrieval, semantic textual similarity, and classification.

Implementation Details

The model is built on the Sentence Transformers framework and uses a BERT-based architecture with a pooling layer over the token outputs. It accepts Japanese text of up to 512 tokens and outputs 768-dimensional embeddings. Each input must carry a prefix that marks its role: "クエリ: " for queries and "文章: " for passages.
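The prefix convention described above can be sketched as follows. This is a minimal illustration, not the maintainers' reference code: the helper names `as_query`, `as_passage`, and `embed` are hypothetical, and the model identifier `cl-nagoya/ruri-base` is assumed to be the Hugging Face Hub name. Running `embed` additionally assumes the `sentence-transformers` package is installed along with fugashi, sentencepiece, and unidic-lite.

```python
QUERY_PREFIX = "クエリ: "    # prefix for search queries (from the text above)
PASSAGE_PREFIX = "文章: "   # prefix for passages/documents (from the text above)


def as_query(text: str) -> str:
    """Mark a string as a query (hypothetical helper)."""
    return QUERY_PREFIX + text


def as_passage(text: str) -> str:
    """Mark a string as a passage (hypothetical helper)."""
    return PASSAGE_PREFIX + text


def embed(texts, model_name="cl-nagoya/ruri-base"):
    """Encode already-prefixed texts into L2-normalized embeddings.

    The import is done lazily so the prefix helpers above remain usable
    without sentence-transformers installed. The model name is an
    assumption about the Hub identifier.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True)
```

Calling `embed([as_query("瑠璃色はどんな色？"), as_passage("瑠璃色は、紫みを帯びた濃い青。")])` would download the model on first use and return one embedding row per input text.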

  • Averages 71.91 across the JMTEB benchmark suite
  • Uses mean pooling to aggregate token embeddings into a single vector
  • Uses cosine similarity to compare embeddings
  • Requires Japanese tokenization packages (fugashi, sentencepiece, unidic-lite)
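To make the pooling and similarity bullets above concrete, here is a small NumPy sketch of mask-aware mean pooling followed by cosine similarity. It runs on toy arrays rather than real model outputs, and the function names `mean_pool` and `cosine_similarity` are illustrative assumptions, not part of the model's API. Sentence Transformers performs the equivalent steps internally.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


# Toy batch: two sequences, three token slots, two dimensions.
# The last slot of the first sequence is padding and must be ignored.
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]],
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)          # one vector per sequence
sims = cosine_similarity(pooled, pooled)  # 2 x 2 similarity matrix
```

Dividing by the count of real tokens, rather than the padded length, keeps short inputs from being pulled toward the padding values.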

Core Capabilities

  • Semantic Textual Similarity (JMTEB score: 82.87)
  • Document Retrieval (JMTEB score: 69.82)
  • Text Classification (JMTEB score: 75.58)
  • Reranking (JMTEB score: 92.91)
  • Text Clustering (JMTEB score: 54.16)

Frequently Asked Questions

Q: What makes this model unique?

Ruri-base stands out for its balanced performance across different NLP tasks while being optimized specifically for Japanese text. It delivers strong results at a moderate model size (111M parameters), which makes it practical for production deployments. The model also uses a prefixing scheme that distinguishes query inputs from passage inputs.

Q: What are the recommended use cases?

The model suits applications that require semantic understanding of Japanese text, including document search and retrieval, semantic similarity matching, content recommendation systems, and text classification. It is particularly useful where high-quality text embeddings are needed at a reasonable computational cost.
