Llama-3.1-8B-Instruct-FP8
| Property | Value |
|---|---|
| Model Size | 8B parameters |
| License | NVIDIA Open Model License |
| Supported Hardware | NVIDIA Blackwell, Hopper, Lovelace |
| Quantization | FP8 |
| Model URL | huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8 |
What is Llama-3.1-8B-Instruct-FP8?
NVIDIA's Llama-3.1-8B-Instruct-FP8 is a quantized version of Meta's Llama 3.1 8B Instruct model, optimized for efficient inference while preserving accuracy close to the original. FP8 quantization reduces both disk space and GPU memory requirements by approximately 50% relative to the BF16 original.
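The ~50% figure follows directly from the storage cost per parameter: BF16 uses 2 bytes while FP8 uses 1. A back-of-the-envelope sketch, assuming a nominal 8 billion parameters and ignoring non-weight overhead such as the KV cache:

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight storage in GiB (weights only, no runtime overhead)."""
    return num_params * bytes_per_param / 2**30

params = 8_000_000_000                 # nominal 8B parameter count
bf16 = weight_memory_gib(params, 2)    # BF16: 2 bytes per parameter
fp8 = weight_memory_gib(params, 1)     # FP8:  1 byte per parameter

print(f"BF16 weights: ~{bf16:.1f} GiB")
print(f"FP8 weights:  ~{fp8:.1f} GiB")
print(f"Reduction:    {1 - fp8 / bf16:.0%}")
```

Actual on-disk size will differ slightly, since embeddings and some layers are typically kept at higher precision.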
Implementation Details
The model applies FP8 quantization to the weights and activations of linear operators within transformer blocks, yielding roughly a 1.3x throughput speedup on H100 GPUs over the original BF16 version. It can be deployed using either the TensorRT-LLM or vLLM runtime engine and supports context lengths up to 128K tokens.
- Calibrated using CNN/DailyMail dataset
- Evaluated on MMLU, GSM8K, ARC Challenge, and IFEVAL benchmarks
- Maintains strong performance metrics (68.7% on MMLU, 83.1% on GSM8K)
- Achieves 11,062.90 TPS versus the original's 8,579.93 TPS
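The throughput figures above are consistent with the ~1.3x speedup claim, as a quick check shows:

```python
# Ratio of the quoted FP8 and BF16 throughput figures (tokens per second).
fp8_tps = 11_062.90
bf16_tps = 8_579.93
speedup = fp8_tps / bf16_tps

print(f"Speedup: {speedup:.2f}x")  # ~1.29x, i.e. roughly the quoted 1.3x
```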
Core Capabilities
- Efficient inference with reduced memory footprint
- High-performance text generation and instruction following
- Seamless integration with TensorRT-LLM and vLLM
- Support for commercial and non-commercial applications
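For the vLLM integration mentioned above, serving the checkpoint can look like the following sketch. This assumes a host with a supported NVIDIA GPU and vLLM installed; exact flags may vary by vLLM version.

```shell
# Launch vLLM's OpenAI-compatible server with the FP8 checkpoint.
# --max-model-len 131072 requests the full 128K-token context window.
vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --max-model-len 131072
```

Once running, the server accepts standard OpenAI-style chat completion requests against the local endpoint.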
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its optimized FP8 quantization, which significantly reduces resource requirements while maintaining performance within 1-2% of the original model across key benchmarks.
Q: What are the recommended use cases?
The model is ideal for production environments where efficiency is crucial, particularly in applications requiring high-throughput text generation and instruction following. It's especially suitable for deployment on NVIDIA's latest GPU architectures.