Significant challenges currently prohibit expressive interaction in virtual reality (VR). Occlusions introduced by head-mounted displays (HMDs) make existing facial tracking techniques intractable, and even state-of-the-art techniques used for real-time facial tracking in unconstrained environments fail to capture subtle details of the user’s facial expressions that are essential for compelling speech animation. We introduce a novel system for HMD users to control a digital avatar in real-time while producing plausible speech animation and emotional expressions. Using a monocular camera attached to an HMD, we record multiple subjects performing various facial expressions and speaking several phonetically-balanced sentences. These images are used with artist-generated animation data corresponding to these sequences to train a convolutional neural network (CNN) to regress images of a user’s mouth region to the parameters that control a digital avatar. To make training this system more tractable, we use audio-based alignment techniques to map images of multiple users making the same utterance to the corresponding animation parameters. We demonstrate that this approach is also feasible for tracking the expressions around the user’s eye region with an internal infrared (IR) camera, thereby enabling full facial tracking. This system requires no user-specific calibration, uses easily obtainable consumer hardware, and produces high-quality animations of speech and emotional expressions. Finally, we demonstrate the quality of our system on a variety of subjects and evaluate its performance against state-of-the-art real-time facial tracking techniques.
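To make the audio-based alignment step above concrete, the sketch below shows one way such a mapping could be computed (not necessarily the pipeline used in this work): recordings of two subjects speaking the same sentence are aligned with dynamic time warping over MFCC features, and the warping path associates each new video frame with a reference frame whose animation parameters are already known. The file names, frame rate, and use of librosa are illustrative assumptions.

```python
# Hypothetical sketch: align two recordings of the same sentence with DTW on
# MFCCs, then transfer per-frame animation parameters via the warping path.
# File names, the 30 fps camera rate, and the use of librosa are assumptions.
import librosa
import numpy as np

FPS = 30  # assumed video frame rate of the HMD-mounted camera

def mfcc_features(path, sr=16000, hop=512):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop), sr, hop

ref_mfcc, sr, hop = mfcc_features("reference_subject.wav")  # has artist animation
new_mfcc, _, _ = mfcc_features("new_subject.wav")           # only video + audio

# Cumulative cost matrix and warping path (index pairs, returned in reverse order).
D, wp = librosa.sequence.dtw(X=ref_mfcc, Y=new_mfcc)
wp = wp[::-1]

def audio_frame_to_video_frame(audio_idx):
    """Convert an MFCC frame index to the nearest video frame index."""
    t = audio_idx * hop / sr
    return int(round(t * FPS))

# Each of the new subject's video frames inherits the animation parameters
# authored for the aligned reference video frame.
mapping = {audio_frame_to_video_frame(j): audio_frame_to_video_frame(i)
           for i, j in wp}
```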
Science fiction authors have excitedly envisioned immersive technologies that allow
us to project our own digital avatars into captivating virtual worlds. Dramatic
advancements in computer graphics and mobile display technologies have led to a
remarkable revival of virtual reality, with the introduction of low-cost consumer
head-mounted displays such as the Oculus Rift [Oculus VR 2014], the HTC Vive
[HTC 2016], and Google Cardboard [Google 2014]. Beyond immersive gaming and
free-viewpoint videos, virtual reality is drawing wide interest from consumers and
pushing the boundaries of next-generation social media platforms (e.g., High Fidelity,
AltSpaceVR). We could mingle, discuss, collaborate, or watch films remotely with friends
all over the world in a shared online virtual space. However, a truly immersive and
faithful digital presence is unthinkable without the ability to perform natural face-to-face
communication through personalized digital avatars that can convey compelling facial
expressions, emotions, and dialogues.
State-of-the-art facial tracking methods commonly rely on explicitly tracked landmarks,
depth signals in addition to RGB video, or a human in the loop. However, approaches
that use tracked landmarks directly to recover the full facial motion [Li et al. 2015]
often suffer from occlusions: the tongue is invisible during many motions, and a large
portion of the lips becomes invisible when a user bites his or her lips. In another
approach, artists manually draw contours for all frames, and a complex 3D model is
then solved to fit these annotations [Bhat et al. 2013]. This process is computationally
intensive and likewise suffers in occluded regions.
Our networks were implemented using the Caffe framework [Jia et al. 2014], which provides tools facilitating the design, training, and use of CNNs, as well as the use of GPUs to accelerate the training process. The system was tested with a variety of subjects under different circumstances, including subjects who appeared in the training set and others who did not. In some tests, users recited Harvard sentences from sets that were excluded from the training data. In others, users improvised a variety of facial expressions or statements, or held a dialogue with another person. The system was tested in a typical office environment with standard ambient illumination, as well as in a dark room in which the HMD's LED lights were the only source of illumination. Subjects were able to use the system one after another with no user-specific calibration between sessions.
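For illustration, a trained Caffe model of this kind could be queried at runtime from pycaffe roughly as follows. The prototxt and weight file names, the blob names, and the 64x64 input resolution are hypothetical placeholders rather than the actual model described here.

```python
# Minimal pycaffe inference sketch (hypothetical file names, blob names, and
# input size): the trained CNN regresses a mouth crop to animation parameters.
import caffe
import numpy as np

caffe.set_mode_gpu()  # fall back to caffe.set_mode_cpu() without a GPU

# Hypothetical deploy definition and trained weights.
net = caffe.Net("mouth_regressor_deploy.prototxt",
                "mouth_regressor.caffemodel",
                caffe.TEST)

def regress_animation_params(mouth_crop_gray):
    """mouth_crop_gray: float32 array of shape (H, W), already cropped and normalized."""
    # Reshape to the network's expected (N, C, H, W) input; sizes are assumptions.
    blob = mouth_crop_gray[np.newaxis, np.newaxis, :, :].astype(np.float32)
    net.blobs["data"].reshape(*blob.shape)
    net.blobs["data"].data[...] = blob
    out = net.forward()
    return out["fc_params"][0]  # vector of avatar animation parameters

# Example: feed a dummy 64x64 crop (a real input would come from the HMD camera).
params = regress_animation_params(np.random.rand(64, 64).astype(np.float32))
```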
We have presented a method for animating a digital avatar in real time based on the
facial expressions of an HMD user. Our system is more ergonomic than existing approaches
such as that of Li et al. [2015], uses more accessible components, and is more
straightforward to implement. Furthermore, it achieves higher-fidelity animations
than existing methods and requires no user-specific calibration.
As such, it represents a significant step toward compelling verbal and emotional
communication in VR, which is essential for fully immersive social interaction through
digital avatars.
Our approach regresses images of the user directly to the animation controls for a
digital avatar, and thus avoids the need to perform explicit 3D tracking of the subject’s
face, as is done in many existing methods for realistic facial performance capture. Our
system demonstrates that plausible real-time speech animation is possible through the
use of a deep neural network regressor trained with animation parameters that not only
capture the appropriate emotional expressions of the training subjects but also
draw on an appropriate psychoacoustic data set.
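As a rough illustration of how such an image-to-parameter regressor can be expressed in Caffe's Python NetSpec API, the sketch below defines a small convolutional network trained with a Euclidean (L2) loss against artist-generated animation parameters. The layer sizes, input resolution, and number of output parameters are placeholders and do not reproduce the architecture described here.

```python
# Sketch of a convolutional regressor in Caffe's Python NetSpec API.
# Layer sizes, the 64x64 input, and the 30 output parameters are placeholders;
# they do not reproduce the architecture described in the paper.
import caffe
from caffe import layers as L, params as P

def regressor_net(num_params=30, batch=32):
    n = caffe.NetSpec()
    # Training data would normally come from a database of (image, parameters)
    # pairs; DummyData stands in so the definition is self-contained.
    n.data = L.DummyData(shape=[dict(dim=[batch, 1, 64, 64])])
    n.label = L.DummyData(shape=[dict(dim=[batch, num_params])])

    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=32, stride=2,
                            weight_filler=dict(type="xavier"))
    n.relu1 = L.ReLU(n.conv1, in_place=True)
    n.conv2 = L.Convolution(n.relu1, kernel_size=3, num_output=64, stride=2,
                            weight_filler=dict(type="xavier"))
    n.relu2 = L.ReLU(n.conv2, in_place=True)
    n.fc1 = L.InnerProduct(n.relu2, num_output=256,
                           weight_filler=dict(type="xavier"))
    n.relu3 = L.ReLU(n.fc1, in_place=True)
    # Linear output layer: one value per animation control of the avatar rig.
    n.fc_params = L.InnerProduct(n.relu3, num_output=num_params,
                                 weight_filler=dict(type="xavier"))
    # L2 regression loss against artist-generated animation parameters.
    n.loss = L.EuclideanLoss(n.fc_params, n.label)
    return n.to_proto()

with open("mouth_regressor_train.prototxt", "w") as f:
    f.write(str(regressor_net()))
```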