# OpenAssistant-Llama2-13B-Orca-8K-3319-GGML
| Property | Value |
|---|---|
| Base Model | LLaMA 2 13B |
| Context Length | 8K tokens |
| License | LLaMA 2 Community License |
| Paper | Orca paper |
| Quantization | GGML (multiple variants) |
## What is OpenAssistant-Llama2-13B-Orca-8K-3319-GGML?
This is a GGML-quantized version of the OpenAssistant LLaMA 2 13B model, fine-tuned on Orca-style data and extended to an 8K context window. The model uses linear RoPE scaling to handle the longer context and is optimized for both CPU and GPU inference.
## Implementation Details
The model is distributed in multiple quantization variants, from 2-bit to 8-bit precision, trading file size (5.74 GB to 13.83 GB) against output quality. It follows the OpenAssistant conversation format, with special tokens marking the system, prompter, and assistant roles.
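As a rough illustration of that format, here is a minimal prompt-building sketch in Python. The `<|system|>`, `<|prompter|>`, and `<|assistant|>` delimiters and the `</s>` separator follow the upstream OpenAssistant convention; the helper function and example strings are hypothetical.

```python
def build_prompt(system: str, user: str) -> str:
    """Assemble a single-turn OpenAssistant-style prompt.

    Delimiters assumed from the OpenAssistant format:
    <|system|>, <|prompter|>, <|assistant|>, separated by </s>.
    """
    return (
        f"<|system|>{system}</s>"
        f"<|prompter|>{user}</s>"
        f"<|assistant|>"
    )

prompt = build_prompt(
    "You are a helpful assistant.",
    "Explain RoPE scaling in one paragraph.",
)
```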
- Trained on the Orca-Chat, RedPajama1T, and FanFics datasets
- Uses linear scaling of RoPE embeddings to reach the 8K context (see the sketch after this list)
- Supports quantization methods from q2_K through q8_0
- Compatible with multiple inference frameworks, including text-generation-webui and llama.cpp
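To make the linear RoPE scaling concrete, the following sketch shows how position interpolation compresses token positions back into the model's native range. It assumes the standard rotary base of 10000 and LLaMA 2's native 4096-token window; all variable names are illustrative and not taken from the model's code.

```python
import numpy as np

# Linear RoPE scaling: positions are multiplied by the ratio of the
# native context (4096 for LLaMA 2) to the target context (8192), so
# an 8K sequence maps onto the position range the model saw in training.
native_ctx, target_ctx = 4096, 8192
scale = native_ctx / target_ctx  # 0.5

dim = 128  # per-head dimension (illustrative)
inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))

position = 6000                                 # beyond the native window
angles_unscaled = position * inv_freq           # out-of-distribution angles
angles_scaled = (position * scale) * inv_freq   # folded back into [0, 4096)

print(angles_scaled[:4])
```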
## Core Capabilities
- Extended context handling up to 8K tokens
- Efficient CPU/GPU inference through GGML quantization
- Instruction following and chat functionality
- Multiple quantization options for different hardware configurations (loading sketch after this list)
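Below is a minimal loading sketch using the llama-cpp-python bindings. Note that recent llama-cpp-python releases only read GGUF files, so a GGML-era version (or llama.cpp directly) is assumed; the file name and parameter values here are illustrative, not prescribed by the model card.

```python
from llama_cpp import Llama

# Assumes a GGML-era llama-cpp-python build; newer releases expect GGUF.
llm = Llama(
    model_path="openassistant-llama2-13b-orca-8k-3319.ggmlv3.q4_K_M.bin",  # illustrative file name
    n_ctx=8192,            # extended 8K context window
    rope_freq_scale=0.5,   # linear RoPE scaling: 4096 native / 8192 target
    n_gpu_layers=40,       # offload layers to GPU; set 0 for CPU-only
)

prompt = (
    "<|system|>You are a helpful assistant.</s>"
    "<|prompter|>What is GGML quantization?</s>"
    "<|assistant|>"
)

out = llm(prompt, max_tokens=256, stop=["</s>"])
print(out["choices"][0]["text"])
```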
## Frequently Asked Questions
**Q: What makes this model unique?**
The model pairs LLaMA 2's capabilities with an extended 8K context and efficient GGML quantization, making it deployable across a wide range of hardware while still handling longer conversations.
**Q: What are the recommended use cases?**
The model is well suited to chatbots, content generation, and applications that require longer-context understanding. The different quantization variants allow deployment on hardware ranging from resource-constrained devices to high-performance systems.