
OpenAI Cuts Inference Costs by 75% by Adopting a Novel Data Type

Model Shrinkage: MXFP4 Makes Models Smaller, Faster, and Cheaper for Everyone


MXFP4, a 4-bit floating point data type defined by the Open Compute Project (OCP), is making waves in the world of deep learning and large language models (LLMs). This innovative data type, designed specifically for efficient deep learning computations, is set to significantly reduce VRAM, memory bandwidth, and compute requirements for machine learning models, making them less expensive to run.

What is MXFP4?

MXFP4 consists of 1 sign bit, 2 exponent bits, and 1 mantissa bit, giving it 16 distinct representable values. Unlike traditional FP4 formats, MXFP4 employs a micro-scaling block floating-point scheme: tensors are grouped into small blocks (32 elements in the OCP specification) that share a single scaling factor, which better preserves numerical precision across each block.
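To make the bit layout concrete, here is a minimal Python sketch that decodes the 4-bit E2M1 encoding MXFP4 uses per element (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit) and enumerates all 16 bit patterns. The function name is our own, used purely for illustration.

def decode_e2m1(code: int) -> float:
    """Map a 4-bit code (0..15) to its FP4 E2M1 value."""
    sign = -1.0 if (code >> 3) & 0b1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                      # subnormal: no implicit leading 1
        magnitude = man * 0.5         # 0.0 or 0.5
    else:                             # normal: implicit leading 1, bias = 1
        magnitude = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * magnitude

# All 16 bit patterns decode to the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6},
# each with a positive and a negative sign.
print(sorted(decode_e2m1(c) for c in range(16)))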

The Impact of MXFP4 on Large Language Models

The significance of MXFP4 for LLMs lies in its ability to drastically reduce both memory consumption and computational workload while maintaining accuracy. By using only 4 bits per value (versus 16+ bits in formats like BF16), models require much less storage and bandwidth, enabling very large models to fit into smaller GPU memory footprints.

Moreover, the micro-scaling approach dynamically adjusts scale factors per tensor block, which mitigates the limited numeric range of 4-bit FP formats and results in better utilization of the representational capacity compared to integer 4-bit (INT4) quantization. This allows MXFP4 to handle the long-tail distributions in weights and activations typical of LLMs more effectively than INT4.
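The following NumPy sketch illustrates the micro-scaling idea: each block of 32 values shares one power-of-two scale, and each element is rounded to the nearest E2M1 value. It is a simplified heuristic for illustration, not the exact OCP reference algorithm, and the helper names are our own.

import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-E2M1_GRID[::-1], E2M1_GRID])   # signed values
BLOCK = 32

def mx_quant_dequant(x: np.ndarray) -> np.ndarray:
    """Quantize a 1-D tensor into MXFP4-style blocks, then dequantize."""
    out = np.empty_like(x, dtype=np.float32)
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK].astype(np.float32)
        amax = np.abs(block).max()
        # Shared power-of-two scale so the largest element fits within
        # E2M1's maximum magnitude of 6.0.
        scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
        scaled = block / scale
        # Round every element to the nearest representable E2M1 value.
        idx = np.abs(scaled[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
        out[start:start + BLOCK] = E2M1_GRID[idx] * scale
    return out

w = np.random.randn(128).astype(np.float32)
w_hat = mx_quant_dequant(w)
print("max abs error:", np.abs(w - w_hat).max())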

Compute Savings and Efficiency

The smaller data size and efficient quantization lead to substantial compute savings, because lower precision arithmetic requires fewer hardware resources and less data transfer bandwidth. This translates to faster inference and fine-tuning speeds as well as lower operational cost.

MXFP4's design supports accurate fine-tuning of LLMs even on constrained hardware by combining it with parameter-efficient fine-tuning methods such as LoRA, enabling efficient customization with reduced compute and memory needs.
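A hedged sketch of what that combination looks like: the quantized base weights stay frozen, and only a small low-rank LoRA update is trained in higher precision. The class below is illustrative PyTorch, not OpenAI's code or any particular library's API.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # frozen (quantized) base weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        out_f, in_f = base.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the trainable low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")   # only the LoRA matrices are updated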

MXFP4 in Practice

OpenAI has made the choice for users by releasing only MXFP4 versions of the gpt-oss models, with no BF16 or FP8 variants. Quantized to MXFP4, the weights occupy roughly a quarter of the memory of an equivalently sized model stored in BF16. As a result, the 120 billion parameter gpt-oss model can run on a single GPU with 80GB of VRAM, and the 20 billion parameter version can run on a GPU with 16GB of memory.
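A back-of-the-envelope check of those figures, counting weights only (ignoring activations, KV cache, and any layers kept in higher precision) and assuming roughly 4 bits per parameter plus one shared 8-bit scale per 32-element block:

def weight_gb(params: float, bits_per_param: float) -> float:
    return params * bits_per_param / 8 / 1e9

for name, params in [("gpt-oss-120b", 120e9), ("gpt-oss-20b", 20e9)]:
    print(f"{name}: BF16 ≈ {weight_gb(params, 16):.0f} GB, "
          f"MXFP4 ≈ {weight_gb(params, 4.25):.0f} GB")
# gpt-oss-120b: BF16 ≈ 240 GB, MXFP4 ≈ 64 GB  -> fits on one 80 GB GPU
# gpt-oss-20b:  BF16 ≈ 40 GB,  MXFP4 ≈ 11 GB  -> fits on a 16 GB GPU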

Nvidia's H100s, which were used to train gpt-oss, don't support FP4 natively, yet they can run the models just fine. Nor is MXFP4 limited to OpenAI: it is one of several micro-scaling data types defined by the OCP, alongside MXFP6 and MXFP8.

During inference, MXFP4 values are de-quantized on the fly, and the shared block scale yields more precise effective values than standard FP4. MXFP4 can represent 16 distinct values: eight positive and eight negative.

In summary, MXFP4 represents a key innovation enabling large-scale LLM deployment that is smaller, faster, and cheaper by optimizing numeric representation without severely compromising model quality.

