Llama-3.1-8B-Instruct-FP8
| Property | Value |
|---|---|
| Model Size | 8B parameters |
| License | NVIDIA Open Model License |
| Supported Hardware | NVIDIA Blackwell, Hopper, Lovelace |
| Quantization | FP8 |
| Model URL | huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8 |
What is Llama-3.1-8B-Instruct-FP8?
NVIDIA's Llama-3.1-8B-Instruct-FP8 is a quantized version of Meta's Llama 3.1 8B Instruct model, optimized for efficient inference while preserving accuracy close to the original. FP8 quantization reduces both disk space and GPU memory requirements by approximately 50% relative to the BF16 original.
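The ~50% figure follows directly from the storage cost per parameter: BF16 uses 2 bytes while FP8 uses 1. A back-of-the-envelope sketch, assuming a nominal 8 billion parameters and ignoring non-weight overhead such as the KV cache:

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight storage in GiB (weights only, no runtime overhead)."""
    return num_params * bytes_per_param / 2**30

params = 8_000_000_000                 # nominal 8B parameter count
bf16 = weight_memory_gib(params, 2)    # BF16: 2 bytes per parameter
fp8 = weight_memory_gib(params, 1)     # FP8:  1 byte per parameter

print(f"BF16 weights: ~{bf16:.1f} GiB")
print(f"FP8 weights:  ~{fp8:.1f} GiB")
print(f"Reduction:    {1 - fp8 / bf16:.0%}")
```

Actual on-disk size will differ slightly, since embeddings and some layers are typically kept at higher precision.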
Implementation Details
The model applies FP8 quantization to the weights and activations of linear operators within transformer blocks, yielding roughly a 1.3x throughput speedup on H100 GPUs over the original BF16 version. It can be deployed using either the TensorRT-LLM or vLLM runtime engine and supports context lengths up to 128K tokens.
- Calibrated using CNN/DailyMail dataset
- Evaluated on MMLU, GSM8K, ARC Challenge, and IFEVAL benchmarks
- Maintains strong performance metrics (68.7% on MMLU, 83.1% on GSM8K)
- Achieves 11,062.90 TPS versus the original's 8,579.93 TPS
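The throughput figures above are consistent with the ~1.3x speedup claim, as a quick check shows:

```python
# Ratio of the quoted FP8 and BF16 throughput figures (tokens per second).
fp8_tps = 11_062.90
bf16_tps = 8_579.93
speedup = fp8_tps / bf16_tps

print(f"Speedup: {speedup:.2f}x")  # ~1.29x, i.e. roughly the quoted 1.3x
```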
Core Capabilities
- Efficient inference with reduced memory footprint
- High-performance text generation and instruction following
- Seamless integration with TensorRT-LLM and vLLM
- Support for commercial and non-commercial applications
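For the vLLM integration mentioned above, serving the checkpoint can look like the following sketch. This assumes a host with a supported NVIDIA GPU and vLLM installed; exact flags may vary by vLLM version.

```shell
# Launch vLLM's OpenAI-compatible server with the FP8 checkpoint.
# --max-model-len 131072 requests the full 128K-token context window.
vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --max-model-len 131072
```

Once running, the server accepts standard OpenAI-style chat completion requests against the local endpoint.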
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its optimized FP8 quantization, which significantly reduces resource requirements while maintaining performance within 1-2% of the original model across key benchmarks.
Q: What are the recommended use cases?
The model is ideal for production environments where efficiency is crucial, particularly in applications requiring high-throughput text generation and instruction following. It's especially suitable for deployment on NVIDIA's latest GPU architectures.