SuperOffload Unveiled: Revolutionizing Large Language Model Training on NVIDIA's Grace Hopper Superchips

SuperOffload unlocks the full potential of NVIDIA's Superchips. It's a game-changer for training larger, more complex language models.


Researchers have unveiled SuperOffload, a groundbreaking system that significantly enhances the training of large language models (LLMs) on NVIDIA's Grace Hopper Superchips. This innovation, developed by Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, and Minjia Zhang, represents a substantial stride towards more efficient and powerful AI development.

SuperOffload unlocks the full potential of Superchips, processors that integrate a GPU and a CPU in a single package. It achieves up to a 2.5x throughput improvement over state-of-the-art offloading systems, thanks to its adaptive weight offloading and a highly optimised Adam implementation.
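To make the offloading idea concrete, the sketch below is an illustration only, not SuperOffload's actual implementation: the function names and state layout are assumptions. It keeps FP32 master weights and Adam moments in CPU memory and runs the update on the CPU, so only the refreshed weights travel back to the GPU over the chip-to-chip link.

```python
# Illustrative sketch (assumed names and layout, not SuperOffload's code):
# FP32 master weights and Adam moments live in CPU (Grace) memory, the
# update runs on the CPU, and only updated weights return to the GPU.
import torch

def init_state(gpu_param):
    # FP32 master copy plus Adam moments, resident in pinned CPU memory.
    master = gpu_param.detach().to("cpu").float().pin_memory()
    return {"step": 0,
            "master": master,
            "m": torch.zeros_like(master),
            "v": torch.zeros_like(master)}

def cpu_adam_step(gpu_param, state, lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
    # Gradient is computed on the GPU, then moved to CPU memory.
    grad = gpu_param.grad.detach().to("cpu", non_blocking=True).float()

    state["step"] += 1
    m, v = state["m"], state["v"]
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])            # first moment
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # second moment

    bias_c1 = 1 - betas[0] ** state["step"]
    bias_c2 = 1 - betas[1] ** state["step"]
    update = (m / bias_c1) / ((v / bias_c2).sqrt() + eps)

    state["master"].add_(update, alpha=-lr)                    # FP32 master weights on CPU
    # Only the updated weights travel back to the GPU over NVLink-C2C.
    gpu_param.data.copy_(state["master"].to(gpu_param.dtype), non_blocking=True)
```

Keeping optimizer state off the GPU is what frees HBM for larger models; the tight NVLink-C2C link between the Grace CPU and Hopper GPU is what makes the round trip cheap enough to be practical.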

The system enables the training of significantly larger models. It can handle a 25 billion parameter model on a single Superchip, surpassing the capacity of GPU-only solutions by a factor of seven. Furthermore, with ZeRO-style data parallelism, SuperOffload allows for training a 50 billion parameter model using only four Superchips, a 2.5x increase over existing parallel training methods.
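For readers unfamiliar with ZeRO-style data parallelism, the snippet below shows a generic DeepSpeed-style ZeRO-3 configuration with CPU offloading. It is a minimal sketch of the general approach only; the article does not specify the exact options SuperOffload adds, so the values here are illustrative assumptions.

```python
# Hedged sketch: a generic ZeRO-3 configuration with CPU offloading in
# DeepSpeed's config format. Illustrates ZeRO-style data parallelism
# combined with offloading; not SuperOffload's specific settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer states
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Typical usage (assuming a PyTorch `model` is already defined):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```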

SuperOffload-Ulysses supports long-sequence training, achieving 55% model FLOPs utilisation (MFU) while training a 13 billion parameter model with sequences of up to one million tokens on eight GH200 Superchips. This capability opens up new possibilities for training complex, context-rich models.
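MFU is simply the ratio of the FLOPs a training run actually sustains to the hardware's peak. A back-of-the-envelope estimate using the common 6 x parameters x tokens-per-second approximation is sketched below; the throughput and peak-FLOPs numbers are placeholders rather than figures from the paper, and the attention FLOPs that become significant at million-token sequence lengths are omitted.

```python
# Hedged sketch: estimate Model FLOPs Utilization (MFU) with the common
# 6 * N * tokens/sec approximation. All numeric inputs below are assumed
# placeholders, not results reported for SuperOffload-Ulysses.
def estimate_mfu(tokens_per_sec, n_params, n_gpus, peak_flops_per_gpu):
    achieved_flops = 6 * n_params * tokens_per_sec  # model FLOPs per second
    peak_flops = n_gpus * peak_flops_per_gpu        # aggregate hardware peak
    return achieved_flops / peak_flops

# Example with assumed values for a 13B-parameter model on 8 GPUs:
print(estimate_mfu(tokens_per_sec=5_000, n_params=13e9, n_gpus=8,
                   peak_flops_per_gpu=989e12))
```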

SuperOffload optimises LLM training on Superchips by efficiently utilising the combined resources of the Hopper GPU, the Grace CPU, and the NVLink-C2C interconnect that links them. Evaluation alongside established offloading methods has demonstrated its potential as a superior approach for large-scale training, paving the way for more efficient and powerful AI development and the training of larger, more complex models.
