MIT's Speech2Face is a study that generates a speaker's face from a speech signal. However, it does not perform the speech-to-face transform with a single model; rather, it combines the results of existing studies built for different purposes to produce its impressive results. (The first author is Tae-Hyun Oh, currently a professor at POSTECH, Pohang University of Science and Technology.)
The first component is an existing line of work that extracts facial feature vectors from images. VGG-Face takes the kind of image-classification architecture continuously studied on ImageNet and trains it specifically on facial images. The fc7 layer of VGG-Face (4096-d) is typically used as a feature vector, which can then serve other purposes such as face classification, age recognition, and face retrieval. Speech2Face converts a speech signal into a complex spectrogram (598x257x2) and maps it to a 4096-d vector using a 7-layer CNN with a VGG-like structure. The training objective is that, for each speech-face pair, feeding the speech into this model should reproduce the 4096-d vector that VGG-Face produces for the face image. In other words, Speech2Face can be viewed as spectrogram to VGG-Face feature.
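A minimal sketch of this idea, assuming PyTorch and illustrative layer widths (the exact Speech2Face architecture and loss are more involved): a small CNN maps the complex spectrogram to a 4096-d vector, and training pulls that vector toward the VGG-Face fc7 feature of the paired face image.

```python
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a complex spectrogram (2 x 598 x 257) to a 4096-d face feature.
    Layer counts and widths are illustrative, not the exact Speech2Face architecture."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(512, feat_dim)

    def forward(self, spec):            # spec: (B, 2, 598, 257)
        x = self.conv(spec).flatten(1)  # (B, 512)
        return self.fc(x)               # (B, 4096)

def speech2face_loss(pred_feat, face_feat):
    # face_feat is the fc7 output of a pretrained, frozen VGG-Face model for
    # the paired face image. A plain L1 regression is used here; the actual
    # paper combines several feature-space losses.
    return nn.functional.l1_loss(pred_feat, face_feat)
```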
The second step generates a facial image from a VGG-Face feature vector. This is a separate technique created through a collaboration between Google and MIT, and Speech2Face reuses it. It, too, is not an end-to-end direct generation model like a GAN. It produces two distinct outputs from the facial vector, facial landmarks and facial texture, and then composites them through warping. The two traits are learned separately partly because it is more efficient, but also because the original purpose of the technique was to convert an image of a person looking away and making an expression into a frontal, expressionless face. To do this, it is split into two modules, input image to facial vector and facial vector to output image, and Speech2Face uses only the second module. For reference, facial vector to landmarks uses an MLP, and facial vector to texture uses a CNN.
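A rough sketch of this two-branch decoder, again assuming PyTorch; the layer sizes, landmark count, texture resolution, and the omitted warping step are all illustrative and not the actual Google/MIT implementation.

```python
import torch.nn as nn

class FaceDecoder(nn.Module):
    """Two-branch face decoder sketch: an MLP predicts landmarks, a CNN
    predicts a normalized texture, and the two are later composited by warping."""
    def __init__(self, feat_dim=4096, n_landmarks=68):
        super().__init__()
        # facial vector -> landmark coordinates (x, y) in [-1, 1]
        self.landmark_mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_landmarks * 2), nn.Tanh(),
        )
        # facial vector -> texture image via an upsampling CNN
        self.tex_fc = nn.Linear(feat_dim, 256 * 4 * 4)
        self.tex_cnn = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),  # 64x64 RGB texture
        )

    def forward(self, face_feat):                       # (B, 4096)
        landmarks = self.landmark_mlp(face_feat)
        landmarks = landmarks.view(face_feat.size(0), -1, 2)   # (B, 68, 2)
        tex = self.tex_fc(face_feat).view(-1, 256, 4, 4)
        texture = self.tex_cnn(tex)                     # (B, 3, 64, 64)
        return landmarks, texture

# The real system then warps the texture onto the predicted landmark geometry
# (e.g. a piecewise-affine or thin-plate-spline warp); that step is omitted here.
```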
One thing I felt while reading the paper is that creating new value by recycling and combining basic AI building blocks seems likely to become one of the main trends going forward (fast prototyping, or a micro-service architecture in development terms). The paper above does exactly this by combining speech to facial vector with facial vector to facial image. If you added speech to emotion on top, I think you could generate both human faces and expressions from voice signals.
Currently, AI APIs tend to be designed at a high level for easy application to services (e.g., speech recognition as speech to text). Instead, I think that by designing many low-level APIs per model or concept (e.g., speech to vector) and mashing them up at an upper layer, as sketched below, we could build multiple services more quickly and efficiently.
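A toy illustration of this mash-up idea; every function name, shape, and return value below is a hypothetical placeholder, not a real API.

```python
import numpy as np

def speech_to_face_vector(waveform: np.ndarray) -> np.ndarray:
    """Low-level block #1: speech -> 4096-d face feature (stand-in)."""
    return np.zeros(4096)

def face_vector_to_image(face_vec: np.ndarray) -> np.ndarray:
    """Low-level block #2: face feature -> face image (stand-in)."""
    return np.zeros((224, 224, 3))

def speech_to_emotion(waveform: np.ndarray) -> str:
    """Low-level block #3: speech -> emotion label (stand-in)."""
    return "neutral"

def voice_avatar_service(waveform: np.ndarray):
    """Upper-layer mash-up: a new service composed purely from reusable blocks."""
    face_vec = speech_to_face_vector(waveform)
    face_img = face_vector_to_image(face_vec)
    emotion = speech_to_emotion(waveform)
    return face_img, emotion
```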
The complete source code of Speech2Face has not been released. However, the speech to facial vector part is implemented in (1) below, and for facial vector to face image you can refer to (2) below, although it is a 3D variant and therefore somewhat different.
- (1) Speech2Face's speech to facial vector module
- (2) Facial vector to face image (3D) module (not identical; for reference only)
The related links are attached.