Most chatbot systems still operate on rules, but to achieve truly natural conversation you eventually need a more complex language model such as BERT. BERT, however, has a reputation for being heavy and expensive to serve, so here is a brief look at how the game platform company Roblox turned BERT into a production service. Since GPU serving was not cost-effective for their workload, they planned for CPU inference from the beginning.
There are three changes compared to vanilla BERT: (1) DistilBERT, (2) dynamic input shapes, and (3) integer quantization. Roughly speaking, (1) and (2) each bring about a 2x speedup and (3) about an 8x speedup; with all three applied, inference is said to be about 30x faster than vanilla BERT, with latency reduced accordingly. Over 3,000 inferences per second are reportedly possible on a 32-core Xeon, which they say is more than 6x more cost-efficient than a similarly priced V100 GPU. I would expect (3) to matter most on CPU in particular, but that is something to test later. A rough sketch of the three techniques follows.
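To make the three changes concrete, here is a minimal sketch assuming PyTorch and the Hugging Face transformers library. The model name (`distilbert-base-uncased`), the example inputs, and the specific calls are illustrative assumptions on my part, not Roblox's actual code.

```python
# Minimal sketch of the three optimizations, assuming PyTorch and the
# Hugging Face transformers library. Model name and inputs are illustrative.
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# (1) DistilBERT: a distilled model with roughly half the layers of BERT-base,
# giving roughly a 2x speedup at inference time.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.eval()

# (3) Integer quantization: convert the Linear layers' float32 weights to int8.
# Dynamic quantization quantizes activations on the fly at inference, so it
# needs no calibration data and runs on plain CPUs.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# (2) Dynamic input shape: pad each batch only to its own longest sequence
# instead of a fixed maximum length, so short chat messages stay short.
texts = ["hi there", "can you help me find my friend's game?"]
inputs = tokenizer(texts, padding="longest", truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = quantized_model(**inputs).logits
print(logits.argmax(dim=-1))
```

Dynamic padding matters for chat traffic in particular because most messages are short: padding every request to a fixed maximum window wastes most of the compute, while padding each batch only to its longest member keeps the work proportional to the actual text.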