As deep learning models grow in parameter count, the memory required to train them grows as well. OpenAI's GPT-2 has 1.5B parameters, Google's mT5 scales up to 13B, and OpenAI's GPT-3 reaches 175B. For models of this size, even fine-tuning on a small dataset is hard to attempt casually, because it demands a large amount of GPU memory.
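To see why, consider a rough back-of-the-envelope estimate based on the memory breakdown described in the ZeRO paper: mixed-precision Adam training holds fp16 parameters and gradients plus fp32 master weights, momentum, and variance, roughly 16 bytes per parameter for model states alone, before activations and buffers:

```python
# Model-state memory for mixed-precision Adam, per the ZeRO paper's breakdown:
# 2 bytes (fp16 params) + 2 bytes (fp16 grads)
# + 4 + 4 + 4 bytes (fp32 master params, Adam momentum, Adam variance).
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16

for name, num_params in [("GPT-2", 1.5e9), ("mT5-XXL", 13e9), ("GPT-3", 175e9)]:
    gib = num_params * BYTES_PER_PARAM / 1024**3
    print(f"{name}: ~{gib:,.0f} GiB of model states")
# GPT-2: ~22 GiB, mT5-XXL: ~194 GiB, GPT-3: ~2,608 GiB
```

Even the 1.5B-parameter GPT-2 would need roughly 22 GiB for model states alone, which explains why a single consumer GPU runs out of memory.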
A paper published by Microsoft, “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, tackles this problem by partitioning the optimizer states, gradients, and parameters across data-parallel workers, and it is known to significantly reduce peak memory requirements. Here is a link to the paper:
Microsoft has released the techniques implemented in this paper as an open-source library named DeepSpeed. According to Microsoft, DeepSpeed can cut memory requirements to about a tenth of what they were before, allowing either a 10x larger batch size or a 10x larger model. Microsoft used this library to train Turing-NLG, a model with 17B parameters. Meanwhile, Facebook also released its implementation of the paper's techniques under the name FairScale. Here are links to the DeepSpeed and FairScale GitHub repositories:
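As a minimal sketch of how DeepSpeed wraps an ordinary PyTorch training step: the toy model, optimizer, and the ds_config.json file below are placeholders for illustration, while deepspeed.initialize and the engine's backward/step methods are the library's actual entry points. Such a script would normally be started with the deepspeed command-line launcher.

```python
import torch
import deepspeed

# Toy model and optimizer; any torch.nn.Module works here.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# "ds_config.json" is a placeholder DeepSpeed config that would set the
# batch size, fp16, and the ZeRO optimization stage.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json",
)

# The engine replaces the usual loss.backward() / optimizer.step() pair,
# handling loss scaling and ZeRO's partitioned state updates internally.
inputs = torch.randn(8, 1024).to(model_engine.device)
loss = model_engine(inputs).pow(2).mean()
model_engine.backward(loss)
model_engine.step()
```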
In addition, HuggingFace, which is well known for its natural language processing framework, integrated DeepSpeed and FairScale in version 4.2.0, so that both can be enabled by setting just a few training arguments. For example, the 3B-parameter mT5-3B model hits an out-of-memory error on a single 24GB RTX 3090 even at batch size 1, but it is reported that with these integrations plus FP16, training with a batch size of 20 becomes possible. Here are articles with the relevant analysis:
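A minimal sketch of what that integration looks like with the Trainer API, assuming Transformers 4.2.0 or later: the model name is a small stand-in, train_dataset is assumed to be a tokenized dataset prepared elsewhere, and ds_config.json is a placeholder DeepSpeed config enabling ZeRO and fp16.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")  # small stand-in for mT5-3B
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=20,
    fp16=True,                   # mixed-precision training
    deepspeed="ds_config.json",  # hands the training loop to DeepSpeed
    # sharded_ddp=True,          # alternatively, FairScale's ZeRO-style sharding
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # a tokenized dataset, assumed prepared elsewhere
    tokenizer=tokenizer,
)
trainer.train()
```

Typically only one of the two sharding backends would be enabled at a time; the articles linked above compare them side by side.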