UltraRM-13b
| Property | Value |
|---|---|
| Base Model | LLaMA2-13B |
| License | MIT |
| Paper | UltraFeedback Paper |
| Framework | PyTorch, Transformers |
What is UltraRM-13b?
UltraRM-13b is a state-of-the-art reward model developed by OpenBMB, built on the LLaMA2-13B architecture. It's trained on the UltraFeedback dataset together with a mixture of other high-quality preference datasets, including Anthropic HH-RLHF, Stanford SHP, and OpenAI's summarization feedback data. The model has demonstrated exceptional performance, achieving a 92.30% win rate against text-davinci-003 on the AlpacaEval benchmark.
Implementation Details
The model implements a regression head on top of the LLaMA architecture to produce a scalar reward score for a text completion. It's designed to evaluate the quality of AI-generated responses and can be easily integrated into reinforcement learning pipelines; a minimal sketch of this setup follows the list below.
- Built on LLaMA2-13B architecture
- Trained on UltraFeedback and multiple high-quality feedback datasets
- Implements custom reward modeling architecture
- Provides scalar reward scores for text evaluation
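The sketch below shows one way such a regression-head reward model can be assembled from standard Transformers components. The class name `LlamaRewardModel` and the exact head and pooling logic are illustrative assumptions based on the description above, not the released implementation.

```python
import torch
import torch.nn as nn
from transformers import LlamaConfig, LlamaModel, PreTrainedModel


class LlamaRewardModel(PreTrainedModel):
    """LLaMA backbone with a scalar regression head for reward scoring (illustrative sketch)."""

    config_class = LlamaConfig

    def __init__(self, config):
        super().__init__(config)
        self.model = LlamaModel(config)
        # Project a token's hidden state down to a single scalar reward
        self.regression_head = nn.Linear(config.hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs[0]                                  # (batch, seq_len, hidden)
        rewards = self.regression_head(hidden_states).squeeze(-1)   # (batch, seq_len)
        # Read the reward at the last non-padding token of each sequence
        if attention_mask is not None:
            last_idx = attention_mask.sum(dim=1) - 1
        else:
            last_idx = torch.full(
                (input_ids.size(0),), input_ids.size(1) - 1, device=input_ids.device
            )
        return rewards.gather(1, last_idx.view(-1, 1)).squeeze(-1)  # (batch,)
```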
Core Capabilities
- State-of-the-art performance in preference evaluation
- Effective text quality assessment
- Compatible with standard transformers pipeline
- Supports both direct reward computation and comparative evaluation
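As a sketch of direct reward computation and comparative evaluation, the snippet below scores two completions for the same prompt and compares them. It assumes the `LlamaRewardModel` class from the sketch above, the Hugging Face repo id `openbmb/UltraRM-13b`, and a simple Human/Assistant prompt format; consult the official model card for the exact template and loading code.

```python
import torch
from transformers import LlamaTokenizer

# Assumptions: the LlamaRewardModel sketch above and the repo id "openbmb/UltraRM-13b"
tokenizer = LlamaTokenizer.from_pretrained("openbmb/UltraRM-13b")
model = LlamaRewardModel.from_pretrained("openbmb/UltraRM-13b")  # add torch_dtype/device_map as needed
model.eval()


def score(text: str) -> float:
    """Return the scalar reward the model assigns to a full prompt + completion string."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).item()


prompt = "Human: What is the capital of France?\nAssistant: "
chosen = prompt + "The capital of France is Paris."
rejected = prompt + "France does not have a capital city."

# Direct reward computation for each response, then a comparative check
print(score(chosen), score(rejected))
print("chosen preferred" if score(chosen) > score(rejected) else "rejected preferred")
```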
Frequently Asked Questions
Q: What makes this model unique?
UltraRM-13b stands out for its performance in reward modeling, achieved through training on a diverse mixture of high-quality feedback datasets. It sets a new state of the art among open-source reward models and demonstrates strong capability in judging the quality of generated text.
Q: What are the recommended use cases?
The model is primarily designed to evaluate the quality of language model outputs, making it well suited to reinforcement learning from human feedback (RLHF), quality assessment of generated text, and model comparison studies. It's particularly useful in research and development aimed at building better language models.
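As an example of the model-comparison and reranking use case, the following hypothetical best-of-n snippet reuses the `score()` helper from the earlier sketch to pick the candidate completion the reward model rates highest.

```python
# Hypothetical best-of-n reranking: generate candidates with any policy model,
# then keep the completion that the reward model scores highest.
prompt = "Human: What is the capital of France?\nAssistant: "
candidates = [
    "Paris is the capital of France.",
    "I believe it might be Lyon.",
    "The capital of France is Paris, located on the banks of the Seine.",
]

best = max(candidates, key=lambda completion: score(prompt + completion))
print(best)
```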