High-Fidelity Facial and Speech Animation for VR HMDs
Kyle Olszewski    Joseph J. Lim    Shunsuke Saito    Hao Li   
University of Southern California    Stanford University    Pinscreen    USC Institute for Creative Technologies

Figure 1: A live demonstration of our system. We are able to obtain high-fidelity animations of the user’s facial expressions in real-time using convolutional neural net regressors. Left: a user wearing our prototype system, which uses cameras attached to the HMD to track the user’s eye and mouth movements. Right: a digital avatar controlled by the user.
Abstract

Significant challenges currently prohibit expressive interaction in virtual reality (VR). Occlusions introduced by head-mounted displays (HMDs) make existing facial tracking techniques intractable, and even state-of-the-art techniques used for real-time facial tracking in unconstrained environments fail to capture subtle details of the user’s facial expressions that are essential for compelling speech animation. We introduce a novel system for HMD users to control a digital avatar in real-time while producing plausible speech animation and emotional expressions. Using a monocular camera attached to an HMD, we record multiple subjects performing various facial expressions and speaking several phonetically-balanced sentences. These images are used with artist-generated animation data corresponding to these sequences to train a convolutional neural network (CNN) to regress images of a user’s mouth region to the parameters that control a digital avatar. To make training this system more tractable, we use audio-based alignment techniques to map images of multiple users making the same utterance to the corresponding animation parameters. We demonstrate that this approach is also feasible for tracking the expressions around the user’s eye region with an internal infrared (IR) camera, thereby enabling full facial tracking. This system requires no user-specific calibration, uses easily obtainable consumer hardware, and produces high-quality animations of speech and emotional expressions. Finally, we demonstrate the quality of our system on a variety of subjects and evaluate its performance against state-of-the-art real-time facial tracking techniques.

Introduction

Science fiction authors have excitedly envisioned immersive technologies that allow us to project our own digital avatars into captivating virtual worlds. Dramatic advancements in computer graphics and mobile display technologies have led to a remarkable revival of virtual reality, with the introduction of low-cost consumer head-mounted displays, such as the Oculus Rift [Oculus VR 2014], the HTC Vive [HTC 2016], and the Google Cardboard [Google 2014]. Beyond immersive gaming and free-viewpoint videos, virtual reality is drawing wide interest from consumers and pushing the boundaries of next-generation social media platforms (e.g., High Fidelity, AltSpaceVR). We could mingle, discuss, collaborate, or watch films remotely with friends all over the world in a shared online virtual space. However, a truly immersive and faithful digital presence is unthinkable without the ability to perform natural face-to-face communication through personalized digital avatars that can convey compelling facial expressions, emotions, and dialogues.

State-of-the-art facial tracking methods commonly rely on explicitly tracked landmarks, depth signals in addition to RGB video, or humans in the loop. However, approaches that use tracked landmarks directly to recover the full facial motion [Li et al. 2015] often suffer from occlusions: the tongue is invisible during many mouth motions, and a large portion of the lips becomes invisible when a user bites his or her lips. In another approach, artists manually draw contours for all frames, and a complex 3D model is then solved to fit these contours [Bhat et al. 2013]. This process is very computationally intensive and also degrades in the presence of occluded regions.

Results

Figure 2: Automatic alignment of training data to a reference sequence with corresponding animation curves.
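
As described in the abstract, each subject's recording of an utterance is aligned to a reference sequence using the audio track, so that the reference animation curves can be transferred to the new subject's video frames as training targets. The sketch below is only a rough illustration of this idea using MFCC features and dynamic time warping from the librosa library; the file names, sampling rate, frame rate, and feature choice are assumptions rather than the exact pipeline used in this work.

    # Hedged sketch of audio-based alignment via dynamic time warping (DTW).
    # The wav file names, 16 kHz sampling rate, and 30 fps video rate are
    # placeholders; MFCCs stand in for whatever audio features the actual
    # pipeline uses.
    import numpy as np
    import librosa

    def align_to_reference(ref_wav, subj_wav, ref_anim_params, fps=30.0, hop=512):
        """Transfer per-frame animation parameters from a reference performance
        to a new subject's recording of the same utterance."""
        y_ref, sr = librosa.load(ref_wav, sr=16000)
        y_sub, _ = librosa.load(subj_wav, sr=16000)

        # Audio features for both recordings (shape: n_mfcc x n_audio_frames).
        F_ref = librosa.feature.mfcc(y=y_ref, sr=sr, n_mfcc=13, hop_length=hop)
        F_sub = librosa.feature.mfcc(y=y_sub, sr=sr, n_mfcc=13, hop_length=hop)

        # DTW yields a warping path of (reference, subject) audio-frame pairs.
        _, wp = librosa.sequence.dtw(X=F_ref, Y=F_sub)

        # Convert audio-frame indices to video-frame indices and copy the
        # reference animation parameters to each matched subject video frame.
        def audio_to_video(i):
            return int(round(i * hop / sr * fps))

        n_sub_frames = audio_to_video(F_sub.shape[1] - 1) + 1
        subj_params = np.zeros((n_sub_frames, ref_anim_params.shape[1]))
        for ref_i, sub_i in wp:
            r = min(audio_to_video(ref_i), ref_anim_params.shape[0] - 1)
            s = min(audio_to_video(sub_i), n_sub_frames - 1)
            subj_params[s] = ref_anim_params[r]
        # A real pipeline would interpolate any video frames the path skips.
        return subj_params

Any monotonic warping between the two recordings would serve the same purpose; the transferred parameters then act as regression targets for the corresponding video frames of the new subject.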

Our networks were implemented using the Caffe framework [Jia et al. 2014], which provides tools that facilitate the design, training, and deployment of CNNs, as well as GPU acceleration of the training process. The system was tested with a variety of subjects under different conditions, including subjects who appeared in the training set and others who did not. For some tests, the user was asked to recite Harvard sentences from sets that were not used in the original training data. For others, users were asked to improvise a variety of facial expressions or statements, or to hold a dialogue with another person. The system was tested in a typical office environment with standard ambient illumination, as well as in a dark room in which the HMD's LED lights were the only source of illumination. Subjects were able to use the system one after another with no user-specific calibration between sessions.
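
For readers unfamiliar with Caffe, the following is a minimal sketch of how a trained regressor of this kind could be run on live camera frames through the pycaffe interface. The network definition file, weight file, input resolution, and blob names ('data', 'fc_params') are hypothetical placeholders, not the actual model used in this work; only the general pycaffe usage pattern is standard.

    # Hypothetical sketch: running a trained Caffe regressor on live HMD
    # camera frames. File names, the 128x128 grayscale input, and blob
    # names are assumptions.
    import cv2
    import numpy as np
    import caffe

    caffe.set_mode_gpu()
    net = caffe.Net('mouth_regressor_deploy.prototxt',   # assumed network definition
                    'mouth_regressor.caffemodel',        # assumed trained weights
                    caffe.TEST)

    cap = cv2.VideoCapture(0)  # camera attached to the HMD
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Preprocess: grayscale, resize to the (assumed) network input size,
        # scale to [0, 1], and add batch and channel dimensions.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        inp = cv2.resize(gray, (128, 128)).astype(np.float32) / 255.0
        net.blobs['data'].data[...] = inp[np.newaxis, np.newaxis, :, :]

        out = net.forward()
        anim_params = out['fc_params'][0]  # regressed animation parameters
        # anim_params would then be streamed to the avatar rig each frame.
    cap.release()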


Figure 3: Capture results for the sticky lip, a deformation that challenges most performance capture techniques.
Conclusion

We have presented a method for animating a digital avatar in real-time based on the facial expressions of an HMD user. Our system is more ergonomic than existing methods such as [Li et al. 2015], makes use of more accessible components, and is more straightforward to implement. Furthermore, it produces higher-fidelity animations than existing methods and requires no user-specific calibration. As such, it represents a significant step towards enabling compelling verbal and emotional communication in VR, an important component of fully immersive social interaction through digital avatars.

Our approach regresses images of the user directly to the animation controls of a digital avatar, and thus avoids the explicit 3D tracking of the subject's face performed by many existing methods for realistic facial performance capture. Our system demonstrates that plausible real-time speech animation is possible with a deep neural network regressor trained on animation parameters that not only capture the appropriate emotional expressions of the training subjects but also make use of an appropriate psychoacoustic data set.
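
For context, the "animation controls" regressed by such a system are typically blendshape weights or similar rig parameters. The generic sketch below is our own illustration of how a vector of regressed weights can drive a blendshape mesh; it is not the specific rig used in this work.

    # Generic blendshape evaluation, shown only to illustrate how regressed
    # animation parameters can drive an avatar mesh.
    import numpy as np

    def apply_blendshapes(neutral, deltas, weights):
        """neutral: (V, 3) rest-pose vertices
        deltas:  (K, V, 3) per-blendshape vertex offsets
        weights: (K,) regressed animation parameters, typically in [0, 1]"""
        weights = np.clip(weights, 0.0, 1.0)
        return neutral + np.tensordot(weights, deltas, axes=1)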

Downloads

Paper
High-Fidelity Facial and Speech Animation for VR HMDs.pdf (17.6 MB)

Video
Download (170 MB)
