As deep learning models grow exponentially in size, it has become difficult to achieve usable training times on a single machine. GPT-2, a well-known language model, has about 1.5B parameters and was reportedly trained on 8 million web pages. GPT-3 has 175B parameters, more than 100 times as many as GPT-2, and training a model at this scale requires building a large GPU cluster.
Well-known learning frameworks such as recent versions of TensorFlow and PyTorch include distributed training capabilities that use multiple GPU machines. However, the configuration is complex, and achieving adequate efficiency requires difficult customization in many areas such as network configuration, permission management, and data sharing. Because of this, several frameworks aimed at making distributed training easy have emerged.
Horovod is a well-known distributed training framework that supports Keras, TensorFlow, PyTorch, and MXNet:
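The typical pattern is small: initialize Horovod, pin each process to one GPU, wrap the optimizer, and broadcast the initial weights. The sketch below uses Horovod's documented PyTorch API; the model, data, and hyperparameters are placeholders of my own, not from the original text.

```python
# Minimal Horovod + PyTorch sketch (placeholder model and synthetic data).
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()  # start Horovod; one process per GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # pin this process to its GPU

model = nn.Linear(10, 1)  # placeholder model
# A common convention: scale the learning rate by the number of workers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers (ring-allreduce)
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)
# Make sure every worker starts from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 10)                 # synthetic batch
    loss = nn.functional.mse_loss(model(x), torch.zeros(32, 1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Launched with, for example, `horovodrun -np 4 python train.py`, the same script runs unchanged on one machine or across a cluster.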
According to Horovod's published benchmarks, Inception V3 and ResNet-101 achieve a distributed scaling efficiency of about 90% relative to a single node, while VGG-16 achieves about 68%. In other words, with 4 nodes you can expect roughly 3.6 times the single-node throughput for Inception V3, and roughly 2.7 times for VGG-16.
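The arithmetic behind those figures is simply node count times scaling efficiency:

```python
def scaled_speedup(nodes: int, efficiency: float) -> float:
    """Expected speedup over a single node, given per-node scaling efficiency."""
    return nodes * efficiency

# Horovod-reported efficiencies: ~90% for Inception V3 / ResNet-101, ~68% for VGG-16
print(scaled_speedup(4, 0.90))  # 3.6
print(scaled_speedup(4, 0.68))  # 2.72
```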
RaySGD is a framework implemented on top of PyTorch's distributed training functionality, designed to greatly simplify setup. Its weakness is that it is limited to PyTorch, but its distributed training efficiency is somewhat better than Horovod's, reportedly 20% or more over PyTorch's built-in distributed training. In particular, whereas Horovod requires building and configuring external libraries such as MPI or NCCL for each environment, RaySGD lets you write scalable training code after only a simple installation and configuration:
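With RaySGD you describe the model, data, and optimizer through creator functions and hand them to a `TorchTrainer`; scaling out is then a matter of changing `num_workers`. The sketch below follows the RaySGD `TorchTrainer` API as documented in older Ray releases (the API has since changed across Ray versions); the model and synthetic data are placeholders of my own.

```python
# Minimal RaySGD sketch (API of older Ray releases; placeholder model/data).
import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd import TorchTrainer

def model_creator(config):
    return nn.Linear(10, 1)  # placeholder model

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 0.01))

def data_creator(config):
    # Synthetic train/validation loaders for illustration
    ds = TensorDataset(torch.randn(256, 10), torch.zeros(256, 1))
    return DataLoader(ds, batch_size=32), DataLoader(ds, batch_size=32)

ray.init()
trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_workers=4,   # scale out by changing this number
    use_gpu=False,
)
for _ in range(5):
    stats = trainer.train()  # one pass over the training data per call
trainer.shutdown()
```

Note that there is no `mpirun`-style launcher here: Ray itself starts and coordinates the worker processes, which is where much of the setup convenience comes from.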
Both Horovod and RaySGD are free, open-source, and actively developed projects. Using a pre-built GPU cluster such as one on AWS is of course an option, but if your goal is to build your own GPU farm, adopting a framework like these is a good starting point.