# OpenAssistant-Llama2-13B-Orca-8K-3319-GGML
| Property | Value |
|---|---|
| Base Model | LLaMA 2 13B |
| Context Length | 8K tokens |
| License | LLaMA 2 Community License |
| Paper | Orca paper |
| Quantization | GGML (multiple variants) |
## What is OpenAssistant-Llama2-13B-Orca-8K-3319-GGML?
This is a GGML-quantized version of the OpenAssistant LLaMA 2 13B model, fine-tuned on Orca-style data and extended to an 8K context window. The model uses linear RoPE scaling to handle the longer context and is optimized for both CPU and GPU inference.
## Implementation Details
The model is distributed in multiple quantization variants, from 2-bit to 8-bit precision, trading file size (5.74 GB to 13.83 GB) against output quality. It follows the OpenAssistant conversation format, with special tokens marking the system, prompter, and assistant roles.
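As a rough illustration of that format, here is a minimal prompt-building sketch in Python. The `<|system|>`, `<|prompter|>`, and `<|assistant|>` delimiters and the `</s>` separator follow the upstream OpenAssistant convention; the helper function and example strings are hypothetical.

```python
def build_prompt(system: str, user: str) -> str:
    """Assemble a single-turn OpenAssistant-style prompt.

    Delimiters assumed from the OpenAssistant format:
    <|system|>, <|prompter|>, <|assistant|>, separated by </s>.
    """
    return (
        f"<|system|>{system}</s>"
        f"<|prompter|>{user}</s>"
        f"<|assistant|>"
    )

prompt = build_prompt(
    "You are a helpful assistant.",
    "Explain RoPE scaling in one paragraph.",
)
```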
- Trained on the Orca-Chat, RedPajama1T, and FanFics datasets
- Uses linear scaling of RoPE embeddings to reach the 8K context (see the sketch after this list)
- Supports quantization methods from q2_K through q8_0
- Compatible with multiple inference frameworks, including text-generation-webui and llama.cpp
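To make the linear RoPE scaling concrete, the following sketch shows how position interpolation compresses token positions back into the model's native range. It assumes the standard rotary base of 10000 and LLaMA 2's native 4096-token window; all variable names are illustrative and not taken from the model's code.

```python
import numpy as np

# Linear RoPE scaling: positions are multiplied by the ratio of the
# native context (4096 for LLaMA 2) to the target context (8192), so
# an 8K sequence maps onto the position range the model saw in training.
native_ctx, target_ctx = 4096, 8192
scale = native_ctx / target_ctx  # 0.5

dim = 128  # per-head dimension (illustrative)
inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))

position = 6000                                 # beyond the native window
angles_unscaled = position * inv_freq           # out-of-distribution angles
angles_scaled = (position * scale) * inv_freq   # folded back into [0, 4096)

print(angles_scaled[:4])
```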
## Core Capabilities
- Extended context handling up to 8K tokens
- Efficient CPU/GPU inference through GGML quantization
- Instruction following and chat functionality
- Multiple quantization options for different hardware configurations (loading sketch after this list)
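Below is a minimal loading sketch using the llama-cpp-python bindings. Note that recent llama-cpp-python releases only read GGUF files, so a GGML-era version (or llama.cpp directly) is assumed; the file name and parameter values here are illustrative, not prescribed by the model card.

```python
from llama_cpp import Llama

# Assumes a GGML-era llama-cpp-python build; newer releases expect GGUF.
llm = Llama(
    model_path="openassistant-llama2-13b-orca-8k-3319.ggmlv3.q4_K_M.bin",  # illustrative file name
    n_ctx=8192,            # extended 8K context window
    rope_freq_scale=0.5,   # linear RoPE scaling: 4096 native / 8192 target
    n_gpu_layers=40,       # offload layers to GPU; set 0 for CPU-only
)

prompt = (
    "<|system|>You are a helpful assistant.</s>"
    "<|prompter|>What is GGML quantization?</s>"
    "<|assistant|>"
)

out = llm(prompt, max_tokens=256, stop=["</s>"])
print(out["choices"][0]["text"])
```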
## Frequently Asked Questions
**Q: What makes this model unique?**
The model pairs LLaMA 2's capabilities with an extended 8K context and efficient GGML quantization, making it deployable across a wide range of hardware while still handling longer conversations.
**Q: What are the recommended use cases?**
The model is well suited to chatbots, content generation, and applications that require longer-context understanding. The different quantization variants allow deployment on hardware ranging from resource-constrained devices to high-performance systems.