Lessons from the metaseq paper and the "Open Pre-Trained Transformer" (OPT) project. There is another paper by DeepMind, called Gopher, that I haven't read. The large-scale OPT-175B model was trained on roughly 1000 80GB A100 GPUs for around 34-35 days on around 180B tokens ≅ 800GB of text data. Each A100 offers 312 teraFLOP/s of FP16 tensor-core throughput, and the Meta system reached about 147 TFLOP/s of utilization per GPU. The state of the AdamW optimizer is stored in FP32, whereas the model weights are stored in FP16. The different floating-point formats break down as follows.

Format  Sign  Exponent (range)  Mantissa (precision)
FP32    1     8                 23
TF32    1     8                 10
FP16    1     5                 10
BF16    1     8                 7
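
The practical consequence of those bit widths can be checked directly with torch.finfo (a minimal sketch; TF32 is an A100 tensor-core compute mode rather than a storage dtype, so it is omitted):

import torch

# Dynamic range and precision of the storage formats listed above.
for name, dtype in [("FP32", torch.float32),
                    ("FP16", torch.float16),
                    ("BF16", torch.bfloat16)]:
    info = torch.finfo(dtype)
    print(f"{name}: bits={info.bits} eps={info.eps:.2e} "
          f"max={info.max:.2e} smallest_normal={info.tiny:.2e}")
# FP16 tops out at ~6.55e4 with a smallest normal of ~6.10e-5, which is
# why the loss scaling discussed later in this post is needed.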

So the total training compute was about 0.4 yotta floating-point operations. At inference time, the model can be sped up by roughly 30-50% by sparsifying the weights with 2:4 structured sparsity, which the A100's tensor cores support natively.
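
Where the ~0.4 yotta figure comes from, as a back-of-the-envelope sketch using only the numbers quoted above (nothing here is from the paper's own accounting):

# Hardware-side estimate: ~1024 GPUs at the reported ~147 TFLOP/s for ~34 days.
gpus, tflops_per_gpu, days = 1024, 147e12, 34
print(f"{gpus * tflops_per_gpu * days * 24 * 3600:.2e}")  # ~4.4e23, i.e. ~0.4 yotta-FLOPs

# Cross-check with the 6*N*D rule of thumb for transformer training compute.
n_params, n_tokens = 175e9, 180e9
print(f"{6 * n_params * n_tokens:.2e}")                   # ~1.9e23, same order of magnitude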

How to generate a structured sparse network using APEX's Automatic Sparsity (ASP) library (TheModelClass, loss_fn, and dataset are placeholders for your own model, loss, and data):

""" Generate a 2:4 structured sparse network using APEX's Automatic Sparsity library.
"""
import torch
from torch import optim
from torch.utils.data import DataLoader
from apex.contrib.sparsity import ASP

device = torch.device('cuda')
model = TheModelClass().to(device)
model.load_state_dict(torch.load("dense_model.pth"))  # load an existing dense model
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Compute 2:4 sparsity masks for the trained weights and patch the optimizer
# so that pruned weights stay zero during the fine-tuning loop below.
ASP.prune_trained_model(model, optimizer)

dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
for epoch in range(500):                      # fine-tune the pruned model to recover accuracy
    for x, y in dataloader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "pruned_model.pth")

The batch size was 0.5M-4M tokens and the sequence length was 2048. The total model size was about 350GB, and assuming a checkpoint was written after every hour of work, roughly 264TB of disk space would be required just to store the checkpoints. One of the cheapest cloud providers for A100s is lambdalabs, which charges $1.1/hr per A100 GPU, so the training would cost about $892K. OTOH the acquisition cost for 1000 A100s is $11K * 1000 = $11M. If one acquires 100 RTX 3090s or 3080s for personal/commercial use, the hardware cost can come down to about $100K, and the electricity cost to run the A100s, which draw something between 300W and 1kW, would be $0.0726/kWh * 1kW * 24h * 33 days * 1024 GPUs ≈ $59K, i.e. about 1/15th of the cloud rent (see the sketch below). But the 3090/3080 have smaller memory at 24/12GB instead of the 80/40GB of the A100.
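
Spelling out that arithmetic (a sketch using the post's own assumptions: hourly ~350GB checkpoints, lambdalabs pricing, ~33 days on 1024 GPUs, ~1 kW per GPU):

days, gpus = 33, 1024

# Checkpoint storage at one ~350GB checkpoint per hour.
print(350 * 24 * days / 1000, "TB of checkpoints")      # ~277 TB, in the ballpark quoted above

# Cloud rent at $1.1 per A100-hour.
print(1.1 * gpus * 24 * days, "USD of cloud rent")      # ~$892K

# Electricity at $0.0726/kWh, assuming ~1 kW per GPU.
print(0.0726 * 1.0 * 24 * days * gpus, "USD of power")  # ~$59K, about 1/15th of the rent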

In terms of algorithmic decisions, the main things to decide while training large LMs are the following:

  Decision Dimension     Specific to MetaSeq (OPT)
  Activation Function    ReLU (they weren't able to get Swish to work well with mixed precision)
  Seq. Length            2048
  Optimizer              AdamW (see below)
  LR Scheduler           Triangular (Howard and Ruder, fast.ai)
  Batch Size             0.5M - 4M
  Gradient Clipping
  Dropout                0.1
  Dynamic Loss Scaling
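
The same table, as an illustrative Python config dict (the values just mirror the rows above; nothing here is lifted from the actual metaseq config files):

# Rough training configuration implied by the table (illustrative placeholder only).
opt_training_choices = {
    "activation": "relu",               # Swish reportedly didn't work well with mixed precision
    "seq_length": 2048,
    "optimizer": "adamw",               # decoupled weight decay, see below
    "lr_scheduler": "triangular",
    "batch_size_tokens": (0.5e6, 4e6),  # the 0.5M-4M range quoted above
    "gradient_clipping": True,          # threshold not noted in this post
    "dropout": 0.1,
    "dynamic_loss_scaling": True,       # see "Loss Scaling" below
}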

AdamW : Broadly speaking, the decoupled weight decay idea is that instead of computing the momentum from the gradient of the regularized loss, we compute the momentum from the unregularized loss and then apply the weight-decay term directly to the weight update. This is described on page 3 of this paper.
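
A minimal sketch of one AdamW step for a single parameter (illustrative Python, not the metaseq implementation; the names and default hyperparameters are mine):

# The moments are built from the raw, unregularized gradient, and the
# decay term is applied directly to the weight update.
def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad               # first moment of the raw gradient
    v = b2 * v + (1 - b2) * grad ** 2          # second moment of the raw gradient
    m_hat = m / (1 - b1 ** t)                  # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * w)   # decoupled weight decay
    return w, m, v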

Loss Scaling : FP16 numbers can represent exponents ranging from -14 to 15, but gradient values tend to be small in magnitude and mostly occupy the negative-exponent end of that range, so many of them would underflow to zero. The fix is to scale up the loss value before backpropagation (which scales every gradient by the same factor), and then divide the gradients back down by that factor before gradient clipping, the weight update, or any other gradient-related computation takes place.
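
This is what PyTorch's torch.cuda.amp.GradScaler automates, including growing and shrinking the scale factor dynamically (a sketch; model, optimizer, loss_fn, dataloader, and the clipping threshold of 1.0 are placeholders; the APEX route appears in the next section):

import torch

scaler = torch.cuda.amp.GradScaler()               # maintains a dynamic loss-scale factor
for x, y in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                # run the forward pass in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                  # scale up the loss before backprop
    scaler.unscale_(optimizer)                     # divide gradients by the scale factor...
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # ...so clipping sees true values
    scaler.step(optimizer)                         # skips the step if any gradient overflowed
    scaler.update()                                # grow/shrink the scale factor dynamically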

Timelines : GPT-3 details were released in May 2020, FB's OPT was trained around Dec 2021, and the Gopher model details were released in Dec 2021. The FB team was using Azure as its cloud provider.

Automatic Mixed Precision via NVIDIA APEX and AMP

# At the logical level, Amp works by employing a whitelist / blacklist model.
# We divide the universe of functions into three sets:
# Whitelist : functions where we expect a speedup with FP16 math.
# Blacklist : functions for which 16 bits of precision may not be sufficient,
#     so we want to ensure that inputs are in FP32; the most common examples
#     are the neural-net loss functions like softmax with cross entropy.
# Everything else : treated like the blacklist.

from apex import amp

# model, optimizer, and criterion are constructed as usual in FP32; amp.initialize
# patches functions per the lists above (its opt_level argument, "O0"-"O3", controls the casting).
model, optimizer = amp.initialize(model, optimizer)
loss = criterion(…)
# Scaling the loss keeps small FP16 gradients from underflowing; amp
# unscales them again when the scale_loss context manager exits.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()

Systems Stuff

A useful Linux command is pdsh, which can execute commands remotely over ssh on multiple machines at once.

Nvidia Networking Glossary