The performance improvements shown by Transformer-based language models are remarkable, but as model sizes grow exponentially, serving cost is becoming an important concern. BERT-base and GPT-2 each have on the order of 100 million parameters, so model size, memory bandwidth, and inference time all need substantial optimization.
Two representative techniques for model optimization are distillation and quantization. The shared link presents joint work by HuggingFace and Microsoft: INT8 quantization and ONNX Runtime are applied to HuggingFace models, and performance is analyzed on the SIMD instruction sets supported by recent CPUs (AVX2 and AVX512 VNNI). For reference, AVX2 provides 256-bit registers, while AVX512 provides 512-bit registers plus VNNI (Vector Neural Network Instructions), which accelerates the INT8 multiply-accumulate operations used in neural network inference.
In summary, applying INT8 quantization and ONNX Runtime reduces the model size to about a quarter with little drop in accuracy. Inference speed also improves, by about 1.6x on AVX2 and about 3x on AVX512 VNNI. Of this gain, roughly 85% comes from INT8 quantization, and the remaining roughly 15% comes from switching from PyTorch to ONNX Runtime.
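To make the workflow concrete, here is a minimal sketch of the two steps described above: exporting a HuggingFace model to ONNX and then applying dynamic INT8 quantization with ONNX Runtime. The model name and file paths are illustrative assumptions, not the exact setup from the linked article.

```python
# Sketch: export a HuggingFace model to ONNX, then quantize weights to INT8.
# "bert-base-uncased" and the .onnx file names are placeholders for illustration.
import torch
from transformers import AutoModel, AutoTokenizer
from onnxruntime.quantization import quantize_dynamic, QuantType

model_name = "bert-base-uncased"  # example model with ~110M parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torchscript=True makes the model return plain tuples, which is easier to export.
model = AutoModel.from_pretrained(model_name, torchscript=True)
model.eval()

# Dummy input for tracing: batch size 1, sequence length 128.
dummy = tokenizer("hello world", padding="max_length", max_length=128,
                  return_tensors="pt")

# Export the FP32 model to ONNX.
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model_fp32.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=14,
)

# Dynamic quantization: weights are stored as INT8, so the file on disk
# shrinks to roughly a quarter of the FP32 size.
quantize_dynamic("model_fp32.onnx", "model_int8.onnx",
                 weight_type=QuantType.QInt8)
```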
Measuring inference time directly (batch size 1, sequence length 128, GPT-2, AVX512 VNNI), PyTorch FP32 takes about 58 ms, ONNX Runtime brings this down to 45 ms, and adding INT8 quantization brings it to 20 ms. Considering that these results do not yet include other optimization techniques such as distillation, I think both INT8 quantization and ONNX Runtime are must-have items for serving models in production.
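A rough latency-measurement sketch under the same conditions quoted above (batch size 1, sequence length 128) is shown below. The file names follow the export sketch earlier and are assumptions; the absolute numbers will depend on the CPU and whether it supports AVX2 or AVX512 VNNI.

```python
# Sketch: compare per-inference latency of the FP32 and INT8 ONNX models on CPU.
import time
import numpy as np
import onnxruntime as ort

def benchmark(path, runs=100):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    feed = {
        "input_ids": np.random.randint(0, 30000, size=(1, 128), dtype=np.int64),
        "attention_mask": np.ones((1, 128), dtype=np.int64),
    }
    # Warm up, then time the average over several runs.
    for _ in range(10):
        sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs * 1000.0  # ms per inference

print(f"FP32: {benchmark('model_fp32.onnx'):.1f} ms")
print(f"INT8: {benchmark('model_int8.onnx'):.1f} ms")
```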
In the deep learning training market, GPUs are essential and are unlikely to be replaced by CPUs, but the inference market has not yet settled. NVIDIA's Ampere Multi-Instance GPU, Intel AVX512 VNNI, Google TPU, Qualcomm Neural Processing Engine, Huawei Kirin, Apple Bionic, and so on can each be seen as a vendor's effort to offer its own solution. In this respect, interoperability and standardization across vendors become important, and I think model interchange formats such as ONNX will contribute a lot here. Below is an introduction to, and experimental results from, using HuggingFace and ONNX Runtime together.
Links to the HuggingFace Transformers and ONNX Runtime GitHub repositories are also shared.