We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting-edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures results in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Building on recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training data, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.
Recent advances in real-time 3D facial performance capture have not only transformed the entertainment
industry with highly scalable animation and affordable production tools [8], but also popularized mobile
social media apps with facial manipulation. Many state-of-the-art techniques have been developed to operate
robustly in natural environments, but pure RGB solutions are still susceptible to occlusions (e.g., caused
by hair, hand-to-face gestures, or accessories), which result in unpleasant visual artifacts or the inability
to correctly initialize facial tracking.
State-of-the-art facial tracking methods commonly use explicitly tracked landmarks, depth signals in addition
to RGB videos, or humans-in-the-loop. However, approaches directly using tracked landmarks to recover the full
facial motion [Li et al. 2015] often suffer from occlusions. The tongue is invisible in many motions, and a large
portion of the lips becomes invisible when a user bites their lips. In another approach, artists manually draw
contours for all frames, and then solve a complex 3D model to fit the data [Bhat et al. 2013]. This is a very
computationally intensive process and also suffers in the case of occluded regions.
We demonstrate successful facial segmentation and tracking on a wide range of examples with a variety of complex occlusions, including hair, hands, headwear, and props. Our convolutional network effectively predicts a dense probability map revealing face regions even when they are blocked by objects with similar skin tones such as hands. In most cases, the boundaries of the visible face regions are correctly estimated. Even when only a small portion of the face is visible, we show that reliable 3D facial fitting is possible when processing input data with clean segmentations. In contrast to most RGB-D based solutions [7], our method works seamlessly in outdoor environments and with any type of video source.
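As a rough illustration of how a dense probability map can be turned into a masked input image, the following sketch thresholds a per-pixel face probability and blanks out everything else. The function name `mask_non_face`, the threshold of 0.5, and the black fill value are assumptions made for this example and are not taken from our implementation.

```python
import numpy as np

def mask_non_face(image, face_prob, threshold=0.5, fill=0):
    """Keep pixels whose face probability is at least `threshold`.

    image:     H x W x 3 array (RGB).
    face_prob: H x W array of per-pixel face probabilities in [0, 1].
    threshold and fill are illustrative defaults, not values from the paper.
    Returns the masked image and the boolean face mask.
    """
    image = np.asarray(image)
    keep = np.asarray(face_prob) >= threshold   # H x W boolean mask
    masked = np.full_like(image, fill)          # start from a blank image
    masked[keep] = image[keep]                  # copy face pixels only
    return masked, keep

# Toy 2 x 2 example: two pixels are confidently face, two are not.
toy_image = np.full((2, 2, 3), 200, dtype=np.uint8)
toy_prob = np.array([[0.9, 0.1],
                     [0.6, 0.4]])
toy_masked, toy_keep = mask_non_face(toy_image, toy_prob)
```

In this toy case the left column is kept and the right column is filled with zeros; a downstream tracker would then only ever see pixels labeled as face.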
Segmentation Evaluation and Comparison. We evaluate the accuracy of our segmentation technique on 437 color test images from the Caltech Occluded Faces in the Wild (COFW) dataset [48]. We use the common intersection over union (IOU) metric between the predicted segmentations and the manually annotated ground truth masks provided by [66] in order to assess over- and under-segmentation. We evaluate our proposed data augmentation strategy as well as the use of negative training samples in Figure 6 and show that the explicit use of composited hands significantly improves the probability map accuracy during hand occlusions. We evaluate the architecture of our network in Table 1 (left) and Figure 6 and compare our results with state-of-the-art out-of-the-box segmentation networks: FCN-8s [11], DeconvNet [12], and the naive ensemble of DeconvNet and FCN (EDeconvNet). Compared to FCN-8s and DeconvNet, our method not only improves the IOU by 12.7% and 1.4%, respectively, but also produces much less noise, as shown in Figure 6. While its accuracy is comparable to that of EDeconvNet, our method achieves nearly double the runtime performance, which enables real-time capabilities (30 fps) on the latest GPU.
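For reference, the IOU between a predicted binary mask and a ground truth mask is the number of pixels labeled face in both, divided by the number labeled face in either. A minimal sketch follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, not part of our evaluation code.

```python
import numpy as np

def intersection_over_union(pred_mask, gt_mask):
    """IOU between two binary masks of identical shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; treating this as perfect agreement
        # is a convention assumed for this sketch.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)

# Toy 4 x 4 example: the prediction over-segments by one column.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True      # 4 ground truth face pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True    # 6 predicted face pixels, 4 of them correct
iou = intersection_over_union(pred, gt)   # 4 / 6
```

Because the union grows with false positives and the intersection shrinks with false negatives, the metric penalizes both over- and under-segmentation.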
We demonstrate that real-time, accurate pixel-level facial segmentation is possible using only unconstrained RGB images with a deep learning approach. Our experiments confirm that a segmentation network with a two-stream deconvolution network and a shared convolution network is not only critical for extracting both the overall shape and fine-scale details effectively in real-time, but also represents the current state of the art in face segmentation. We also found that a carefully designed data augmentation strategy effectively produces sufficiently large training datasets for the CNN to avoid overfitting, especially when only limited ground truth segmentations are available in public datasets. In particular, we demonstrate the first successful facial segmentations for skin-colored occlusions such as hands and arms using composited hand datasets on both positive and negative training samples. Significantly superior tracking accuracy and robustness to occlusion can be achieved by processing images with masked regions as input. Training the DDE regressor with images containing only facial regions and augmenting the dataset with synthetic occlusions ensures continuous tracking in the presence of challenging occlusions (e.g., hair and hands). Although we focus on 3D facial performance capture, we believe the key insight of this paper, reducing the dimensionality using semantic segmentation, is generally applicable to other vision problems beyond facial tracking and regression.
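The idea of augmenting training data with composited occluders can be sketched as follows: an occluder (for example a segmented hand) is pasted onto a face image, and the covered pixels are removed from the ground truth face mask. This is a simplified illustration with a hard binary occluder mask; the function name and all inputs are hypothetical, and our actual compositing pipeline may differ.

```python
import numpy as np

def composite_occluder(image, face_mask, occluder, occluder_mask):
    """Paste an occluder onto a face image and update the face label mask.

    image, occluder:          H x W x 3 arrays of the same shape.
    face_mask, occluder_mask: H x W boolean arrays.
    Returns the composited image and the new face mask, in which pixels
    covered by the occluder are no longer labeled as face.
    """
    image = np.asarray(image)
    occluder = np.asarray(occluder)
    occ = np.asarray(occluder_mask, dtype=bool)
    composited = image.copy()
    composited[occ] = occluder[occ]                       # occluder pixels win
    new_face_mask = np.asarray(face_mask, dtype=bool) & ~occ
    return composited, new_face_mask

# Toy 2 x 2 example: the occluder covers the top-left face pixel.
face_image = np.full((2, 2, 3), 200, dtype=np.uint8)
face_labels = np.array([[True, False],
                        [True, False]])
hand_image = np.full((2, 2, 3), 50, dtype=np.uint8)
hand_labels = np.array([[True, False],
                        [False, False]])
aug_image, aug_labels = composite_occluder(
    face_image, face_labels, hand_image, hand_labels)
```

Each composited pair yields a new training image together with a consistent segmentation label, which is how a small set of annotated faces can be expanded with many occlusion patterns.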