One-Shot Identity-Preserving Portrait Reenactment
Sitao Xiang1    Yuming Gu1    Pengda Xiang1    Mingming He1   
Koki Nagano1    Haiwei Chen1    Hao Li1,2   
University of Southern California/USC Institute for Creative Technologies1    Pinscreen2   



Semantic image synthesis results produced by our method. Our method can not only synthesize images using only semantic segmentation masks as input, but also supports controllable synthesis via a reference style image.
Abstract

We present a deep learning-based framework for portrait reenactment from a single picture of a target (one-shot) and a video of a driving subject. Existing facial reenactment methods suffer from identity mismatch and produce inconsistent identities when a target and a driving subject are different (cross-subject), especially in one-shot settings. In this work, we aim to address identity preservation in cross-subject portrait reenactment from a single picture. We introduce a novel technique that can disentangle identity from expressions and poses, allowing identity preserving portrait reenactment even when the driver's identity is very different from that of the target. This is achieved by a novel landmark disentanglement network (LD-Net), which predicts personalized facial landmarks that combine the identity of the target with expressions and poses from a different subject. To handle portrait reenactment from unseen subjects, we also introduce a feature dictionary-based generative adversarial network (FD-GAN), which locally translates 2D landmarks into a personalized portrait, enabling one-shot portrait reenactment under large pose and expression variations. We validate the effectiveness of our identity disentangling capabilities via an extensive ablation study, and our method produces consistent identities for cross-subject portrait reenactment. Our comprehensive experiments show that our method signifcantly outperforms the state-of-the-art single-image facial reenactment methods. We will release our code and models for academic use.


The training procedure of LD-Net, consisting of two stages.
Landmark Disentanglement Network (LD-Net)

Disentangling landmarks into identity and pose/expression is difficult due to the lack of accurate numerical labeling for pose/expression. Inspired by [1], which can disentangle two complementary factors of variations with only one of them labeled, we propose a landmark disentanglement network (LD-Net) to disentangle identity and pose/expression using data with only the subject's identity labeled. More importantly, our network generalizes well to novel identities (i.e., those unseen during training), unlike previous works (e.g. [1]).

Given 2D facial landmarks from a pair of face images, LD-Net first disentangles the landmarks into a pose/expression latent code and an identity latent code, then combines the target's identity code with the source's pose/expression code to synthesize new landmarks. As shown in Fig. 3, the training procedure of LD-Net is divided into two stages. Stage 1 aims to train a stable pose/expression encoder and Stage 2 generalizes to predict an identity code from landmarks instead of using identity labels so as to handle unseen identities.

Experiments

We first conduct an evaluation and ablation study in Sec. 4.1 on the performance of LD-Net and FD-GAN independently, followed by comparisons of our full method with the state-of-the-art methods on cross-subject face reenactment in Sec. 4.2. For more results tested on unconstrained portrait images, please refer to the supplemental material.

Implementation details. For FD-GAN, the extractor and translator are based on U-Nets, with both networks joined together by dictionary writer/reader modules inserted into the up-convolution modules. The discriminator and classifier for FD-GAN are patch-based and have the same structure as the down-convolution part of the U-Nets. Please refer to the supplemental material for more details concerning the network structures and training strategies.

Performance. Our method takes approximately 0.08s for FD-GAN to generate one image and 0.02s for LD-Net to perform landmark disentanglement on a single NVIDIA TITAN X GPU. Training datasets. The training dataset is built from VoxCeleb video training data [13] which is processed by dlib [36] to crop a 256x256 face image at 25fps and to extract its landmarks. In total, it contains 52,112 videos for 1,000 ran- domly selected subjects.


Qualitative comparison between our method and two 3D-based methods: Face2Face [56] and NeuralTexture [55].
Discussion and Future Work

We have demonstrated a technique for portrait reenactment that only requires a single target picture and 2D landmarks of the target and the driver. The resulting portrait is not only photorealistic but also preserves recognizable facial features of the target. Our comparison shows significantly improved results compared to state-of-the-art single-image portrait manipulation methods. Our extensive evaluations confirm that identity disentanglement of 2D landmarks is effective in preserving the identity when synthesizing a reenacted face. We have shown that our method can handle a wide variety of challenging facial expressions and poses of unseen identities without subject-specific training. This is made possible thanks to our generator, which uses a feature dictionary to translate landmark features into a photorealistic portrait.

A limitation of our method is that the resulting portrait has only a resolution of 256x256, and it is still difficult to capture high-resolution person-specific details such as stubble hair. It could also suffer from some artifacts for non-facial parts and the background region, since we rely on the landmarks to transfer facial appearance but the landmarks contain no structural information about the hair or background. We believe such a limitation could be further addressed by incorporating dense pixel-wise conditioning [40] and segmentation. While our method can produce reasonably stable portrait reenactment results from a frame of target and 2D landmarks, the temporal consistency could be further improved by taking into account temporal information from the entire video.

Footer With Address And Phones