Lip-sync models such as LipGAN create mouth shapes from speech signals. The technique is useful for animating a virtual character's mouth, but its limitation becomes clear in practice: only the lips of an otherwise motionless character move. When humans actually communicate, they use abundant body motion, such as upper-body movements, face direction, and hand gestures, rather than moving only their lips.
To address this problem, the research linked below generates 3D body and hand motion from an audio signal. Specifically, it uses an auto-regressive model such as an LSTM to learn motion from the time-series distribution of body postures: at each step, the previous posture is fed in to predict the next one. By also providing an acoustic feature vector extracted from the voice as input, the model produces a body-posture time series conditioned on the speech signal. In addition, a probabilistic generative model is introduced so that the same utterance does not always produce the same motion, since perfectly deterministic gestures would reduce the system's utility.
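To make the pipeline concrete, here is a minimal sketch of such an auto-regressive pose model in PyTorch. This is not the paper's implementation: the class name, feature dimensions, and the Gaussian sampling head are illustrative assumptions that stand in for whatever probabilistic component the authors actually used.

```python
import torch
import torch.nn as nn

class AudioToPoseLSTM(nn.Module):
    """Sketch: predict the next body pose from the previous pose
    plus the current acoustic feature vector (dimensions assumed)."""

    def __init__(self, audio_dim=128, pose_dim=63, hidden_dim=256):
        super().__init__()
        # Input at each step: previous pose concatenated with audio features
        self.lstm = nn.LSTM(audio_dim + pose_dim, hidden_dim, batch_first=True)
        # Predict mean and log-variance so the next pose can be *sampled*,
        # giving varied motion for the same utterance (the probabilistic part)
        self.mean_head = nn.Linear(hidden_dim, pose_dim)
        self.logvar_head = nn.Linear(hidden_dim, pose_dim)

    def forward(self, audio_feats, init_pose):
        # audio_feats: (batch, T, audio_dim), init_pose: (batch, pose_dim)
        poses, state, prev_pose = [], None, init_pose
        for t in range(audio_feats.size(1)):
            step_in = torch.cat([prev_pose, audio_feats[:, t]], dim=-1)
            out, state = self.lstm(step_in.unsqueeze(1), state)
            mean = self.mean_head(out.squeeze(1))
            std = (0.5 * self.logvar_head(out.squeeze(1))).exp()
            prev_pose = mean + std * torch.randn_like(std)  # sample next pose
            poses.append(prev_pose)
        return torch.stack(poses, dim=1)  # (batch, T, pose_dim)

# Usage: 2 seconds of audio features at 30 fps, 21 joints x 3 coordinates
model = AudioToPoseLSTM()
audio = torch.randn(1, 60, 128)
pose0 = torch.zeros(1, 63)
motion = model(audio, pose0)  # (1, 60, 63)
```

Because the network samples each step from a predicted distribution rather than emitting a fixed output, running it twice on the same audio yields slightly different but equally plausible gesture sequences.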
The code is also open, but I couldn't run it because part of the dataset was not accessible. I'll share the GitHub link and the link to the review article page below.