Llama-3.1-8B-Instruct-FP8

Maintained By
nvidia


  • Model Size: 8B parameters
  • License: NVIDIA Open Model License
  • Supported Hardware: NVIDIA Blackwell, Hopper, Lovelace
  • Quantization: FP8
  • Model URL: huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8

What is Llama-3.1-8B-Instruct-FP8?

NVIDIA's Llama-3.1-8B-Instruct-FP8 is a quantized version of Meta's Llama 3.1 8B Instruct model, optimized for efficient inference while keeping accuracy close to the original. FP8 quantization reduces both disk space and GPU memory requirements by approximately 50%.
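The ~50% figure follows directly from the byte widths: FP8 stores each weight in 1 byte versus 2 bytes for BF16. A back-of-the-envelope sketch (weights only; ignores non-quantized layers, the KV cache, and runtime overhead):

```python
# Rough weight-memory estimate for an 8B-parameter model.
PARAMS = 8_000_000_000

def weight_memory_gb(bytes_per_param: int, params: int = PARAMS) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

bf16_gb = weight_memory_gb(2)   # BF16: ~16 GB of weights
fp8_gb = weight_memory_gb(1)    # FP8:  ~8 GB of weights
savings = 1 - fp8_gb / bf16_gb  # ~0.5, i.e. the ~50% reduction
```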

Implementation Details

The model quantizes the weights and activations of the linear operators within its transformer blocks, achieving a roughly 1.3x throughput speedup on H100 GPUs over the original BF16 version. It can be deployed with either the TensorRT-LLM or vLLM runtime engines and supports context lengths up to 128K tokens.

  • Calibrated using CNN/DailyMail dataset
  • Evaluated on MMLU, GSM8K, ARC Challenge, and IFEval benchmarks
  • Maintains strong performance metrics (68.7% on MMLU, 83.1% on GSM8K)
  • Achieves 11,062.90 TPS compared to the original's 8,579.93 TPS
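The quoted 1.3x speedup can be checked directly from the throughput figures above:

```python
# Throughput (tokens per second) reported for each variant.
fp8_tps = 11_062.90
bf16_tps = 8_579.93

# Ratio of FP8 to BF16 throughput, rounded as in the text.
speedup = fp8_tps / bf16_tps
print(f"{speedup:.2f}x")  # ~1.29x, quoted as roughly 1.3x
```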

Core Capabilities

  • Efficient inference with reduced memory footprint
  • High-performance text generation and instruction following
  • Seamless integration with TensorRT-LLM and vLLM
  • Support for commercial and non-commercial applications
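As a sketch of the vLLM path, the checkpoint can be served through vLLM's OpenAI-compatible server. The flag names below are assumptions that may vary across vLLM releases (check `vllm serve --help`); `131072` is the 128K-token context length expressed in tokens:

```shell
# Hypothetical vLLM launch for the FP8 checkpoint; requires a supported
# NVIDIA GPU and a vLLM build that reads ModelOpt FP8 checkpoints.
vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 \
  --quantization modelopt \
  --max-model-len 131072
```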

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimized FP8 quantization, which significantly reduces resource requirements while maintaining performance within 1-2% of the original model across key benchmarks.

Q: What are the recommended use cases?

The model is ideal for production environments where efficiency is crucial, particularly in applications requiring high-throughput text generation and instruction following. It's especially suitable for deployment on NVIDIA's latest GPU architectures.
