We present a deep learning-based framework for portrait reenactment from a single picture of a target (one-shot) and a video of a driving subject. Existing facial reenactment methods suffer from identity mismatch and produce inconsistent identities when the target and the driving subject differ (cross-subject), especially in one-shot settings. In this work, we address identity preservation in cross-subject portrait reenactment from a single picture. We introduce a novel technique that disentangles identity from expressions and poses, allowing identity-preserving portrait reenactment even when the driver's identity is very different from that of the target. This is achieved by a novel landmark disentanglement network (LD-Net), which predicts personalized facial landmarks that combine the identity of the target with expressions and poses from a different subject. To handle portrait reenactment of unseen subjects, we also introduce a feature dictionary-based generative adversarial network (FD-GAN), which locally translates 2D landmarks into a personalized portrait, enabling one-shot portrait reenactment under large pose and expression variations. We validate the effectiveness of our identity disentanglement via an extensive ablation study, showing that our method produces consistent identities for cross-subject portrait reenactment. Our comprehensive experiments show that our method significantly outperforms state-of-the-art single-image facial reenactment methods. We will release our code and models for academic use.
Disentangling landmarks into identity and pose/expression is difficult due to the lack of accurate numerical labels for pose/expression. Inspired by prior work that disentangles two complementary factors of variation when only one of them is labeled, we propose a landmark disentanglement network (LD-Net) to disentangle identity and pose/expression using data with only the subject's identity labeled. More importantly, unlike previous works, our network generalizes well to novel identities (i.e., those unseen during training).
Given 2D facial landmarks from a pair of face images, LD-Net first disentangles the landmarks into a pose/expression latent code and an identity latent code, and then combines the target's identity code with the source's pose/expression code to synthesize new landmarks. As shown in Fig. 3, the training procedure of LD-Net is divided into two stages: Stage 1 trains a stable pose/expression encoder, and Stage 2 learns to predict the identity code directly from landmarks instead of relying on identity labels, so that the network can handle unseen identities.
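The recombination step above can be sketched as follows. This is a toy illustration of the data flow only: the encoder/decoder names (`pose_encoder`, `id_encoder`, `decoder`) are hypothetical, and fixed random projections stand in for the learned networks of LD-Net.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for LD-Net's learned encoders/decoder: fixed random
# projections chosen only to illustrate the shapes and data flow.
W_pose = rng.standard_normal((136, 32))  # 68 landmarks * 2 -> pose/expr code
W_id = rng.standard_normal((136, 16))    # 68 landmarks * 2 -> identity code
W_dec = rng.standard_normal((48, 136))   # [pose | identity] -> new landmarks

def pose_encoder(landmarks):             # (68, 2) -> (32,)
    return landmarks.reshape(-1) @ W_pose

def id_encoder(landmarks):               # (68, 2) -> (16,)
    return landmarks.reshape(-1) @ W_id

def decoder(pose_code, id_code):         # (32,), (16,) -> (68, 2)
    z = np.concatenate([pose_code, id_code])
    return (z @ W_dec).reshape(68, 2)

source_lms = rng.standard_normal((68, 2))  # driver: provides pose/expression
target_lms = rng.standard_normal((68, 2))  # target: provides identity

# Core LD-Net operation: combine the target's identity code with the
# driver's pose/expression code to synthesize personalized landmarks.
new_lms = decoder(pose_encoder(source_lms), id_encoder(target_lms))
print(new_lms.shape)  # (68, 2)
```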
We first conduct an evaluation and ablation study of LD-Net and FD-GAN independently in Sec. 4.1, followed by comparisons of our full method with state-of-the-art methods on cross-subject face reenactment in Sec. 4.2. For more results on unconstrained portrait images, please refer to the supplemental material.
Implementation details. For FD-GAN, the extractor and translator are based on U-Nets, with both networks joined together by dictionary writer/reader modules inserted into the up-convolution modules. The discriminator and classifier for FD-GAN are patch-based and have the same structure as the down-convolution part of the U-Nets. Please refer to the supplemental material for more details concerning the network structures and training strategies.
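The dictionary writer/reader idea can be illustrated with a minimal sketch. This is a simplification, not the paper's implementation: the writer aggregates target appearance features into a dictionary keyed by landmark region, and the reader looks those features up at the driving pose's locations; the region-label keying and averaging here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def write_dictionary(target_feats, target_labels, n_keys):
    """Writer: average target features per landmark-region label."""
    dim = target_feats.shape[1]
    feat_dict = np.zeros((n_keys, dim))
    for k in range(n_keys):
        mask = target_labels == k
        if mask.any():
            feat_dict[k] = target_feats[mask].mean(axis=0)
    return feat_dict

def read_dictionary(feat_dict, driving_labels):
    """Reader: fetch stored appearance features for each driving location."""
    return feat_dict[driving_labels]

n_keys, dim = 8, 16                               # e.g. 8 regions, 16-d feats
target_feats = rng.standard_normal((100, dim))    # features from the extractor
target_labels = rng.integers(0, n_keys, 100)      # region label per location
driving_labels = rng.integers(0, n_keys, 120)     # driving pose's locations

feat_dict = write_dictionary(target_feats, target_labels, n_keys)
read_feats = read_dictionary(feat_dict, driving_labels)
print(read_feats.shape)  # (120, 16)
```

In FD-GAN these modules sit inside the up-convolution path of the U-Nets and operate on learned feature maps rather than explicit labels.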
Performance. Our method takes approximately 0.08s for FD-GAN to generate one image and 0.02s for LD-Net to perform landmark disentanglement on a single NVIDIA TITAN X GPU.
Training datasets. The training dataset is built from the VoxCeleb video training data, which is processed by dlib to crop a 256x256 face image at 25fps and to extract its landmarks. In total, it contains 52,112 videos of 1,000 randomly selected subjects.
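The landmark-driven cropping step of the preprocessing can be sketched with a small numpy helper. This is a simplification (the actual pipeline uses dlib's face detector and landmark predictor on video frames); the padding ratio and the nearest-neighbor resize are assumptions for illustration.

```python
import numpy as np

def crop_face(frame, landmarks, out_size=256, pad=0.25):
    """Crop a square region around 2D landmarks and resize to out_size.

    frame:     (H, W, 3) uint8 image
    landmarks: (68, 2) array of (x, y) pixel coordinates
    pad:       fractional margin around the landmark bounding box (assumed)
    """
    h, w = frame.shape[:2]
    x0, y0 = landmarks.min(axis=0)
    x1, y1 = landmarks.max(axis=0)
    side = max(x1 - x0, y1 - y0) * (1 + 2 * pad)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    xa = int(np.clip(cx - side / 2, 0, w - 1))
    ya = int(np.clip(cy - side / 2, 0, h - 1))
    xb = int(np.clip(cx + side / 2, 1, w))
    yb = int(np.clip(cy + side / 2, 1, h))
    crop = frame[ya:yb, xa:xb]
    # Nearest-neighbor resize to out_size x out_size using pure numpy.
    ys = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    return crop[np.ix_(ys, xs)]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
lms = np.stack([np.linspace(250, 390, 68), np.linspace(150, 330, 68)], axis=1)
face = crop_face(frame, lms)
print(face.shape)  # (256, 256, 3)
```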
We have demonstrated a technique for portrait reenactment that only requires a
single target picture and 2D landmarks of the target and the driver. The resulting
portrait is not only photorealistic but also preserves recognizable facial features
of the target. Our comparisons show significant improvements over
state-of-the-art single-image portrait manipulation methods. Our extensive
evaluations confirm that identity disentanglement of 2D landmarks is effective
in preserving the identity when synthesizing a reenacted face. We have shown
that our method can handle a wide variety of challenging facial expressions and
poses of unseen identities without subject-specific training. This is made possible
thanks to our generator, which uses a feature dictionary to translate landmark
features into a photorealistic portrait.
A limitation of our method is that the resulting portrait has a resolution of only 256x256, and it remains difficult to capture high-resolution person-specific details such as stubble hair. It can also suffer from artifacts in non-facial parts and the background region, since we rely on landmarks to transfer facial appearance, and the landmarks contain no structural information about the hair or background. We believe this limitation could be addressed by incorporating dense pixel-wise conditioning and segmentation. While our method can produce reasonably stable portrait reenactment results from a single target frame and 2D landmarks, temporal consistency could be further improved by taking into account temporal information from the entire video.