Azure Optimized Stack with DeepSpeed for Hyperscale Model Training

Azure Machine Learning (AzureML) now offers an optimized stack that makes use of the newest NVIDIA GPU expertise with Quantum InfiniBand to effectively prepare and fine-tune giant fashions like Megatron-Turing and GPT-3.

In current years, large-scale transformers-based deep studying fashions skilled on enormous quantities of information are used for new merchandise and several other cognitive duties. These fashions have grown in measurement and magnitude and the purchasers’ wants for coaching and high-quality tuning have grown accordingly. 

The coaching and high-quality tuning of those sorts of fashions require a fancy and distributed structure and the arrange of those architectures require a number of guide and error susceptible steps. With this new optimized stack, AzureML permits a greater expertise by way of usability and performances, offering a easy to make use of coaching pipeline. The AzureML proposed stack contains: {hardware}, OS, VM picture, Docker picture (with optimized PyTorch, DeepSpeed, ONNX Runtime and different Python packages) for efficiency and scalability with out complexity.

Optimized stack for scalable distributed coaching on Azure

A potential experimental setup consists of NDm A100 v4-series that features two socket AMD EPYC 7V12 64-Core CPUs, 1.7TB of important reminiscence and eight A100 80GB GPUS. A balanced PCIe topology to attach 4 GPUs to every CPU is used and every GPU has its personal topology agnostic 200 Gb/s NVIDIA Mellanox HDR InfiniBand. The 1.7TB of important reminiscence and the DeepSpeed library offload capabilities permits the scaling to giant fashions measurement. This setup can be utilized each in AzureML studio and Azure VMSS however, the AzureML studio resolution, is beneficial as a result of is the simplest approach to have the setup up and working in the suitable and simple manner.

Differences between distributed structure and AzureML coaching setup

The AzureML proposed stack permits an environment friendly coaching of 2x bigger mannequin sizes (2 trillion vs. 1 trillion parameters), scaling to 2x extra GPUs (1024 vs. 512), and as much as 1.8x increased compute throughput/GPU (150 TFLOPs vs. 81 TFLOPs). This stack additionally has the potential to supply a near-linear scalability by way of growing the mannequin measurement and the rise of the variety of GPUs. Thanks to DeepSpeed ZeRO-3 with its CPU offloading capabilities and this new AzureML stack, the environment friendly throughput/GPU of 157 TFLOPs is maintained because the mannequin improve from 175 billion to 2 trillion parameters and, given a mannequin measurement (eg 175 Billion within the following graph), a linear scaling is achieved if the variety of GPU improve. 

More detailed outcomes are described within the deepspeed prolonged technical weblog. 

a. throughput/GPU vs mannequin measurement from 175 billion to 2 trillion parameters (BS/GPU=8),

b. Linear will increase efficiency scaling with the rise in variety of GPU units for the 175B mannequin (BS/GPU=16).

Recommended For You