Recent advances in single-view avatar creation have made compelling CG avatars accessible to consumers and scalable to produce, with potential applications in personalized gaming, social VR, and immersive communication. Single input images are easy to obtain and consumer friendly: all that is needed is a selfie. Such technology is also especially useful when a controlled capture is not possible, for example for celebrities or deceased persons. Photorealistic CG characters are notoriously difficult and expensive to produce, especially in conventional production pipelines, where the slightest inaccuracy in modeling, shading, or rendering can trigger the uncanny valley effect. Image-based approaches, in which photographs of subjects are used directly as textures, can achieve convincing results at low rendering cost, but they are not suitable for reproducing complex non-linear facial expressions or for rendering under novel illumination conditions.
Dynamic Texture Synthesis. Given a single face image in a neutral pose, along with a desired blendshape expression and viewpoint, our deep generative network (paGAN) produces a realistic image of the face with the desired expression. The network is conditioned on multiple inputs, including eye gaze and a rendered normal image of the 3D face model fitted to the target image, providing fine-scale control over the output. Notably, we use a single network that works for all identities, whereas the method of [Kim et al. 2018] trains a separate network for each target avatar, requiring a source video of that avatar as training data. Likewise, although Olszewski et al. [2017] can produce animations from a single image of a new subject, they rely on temporally aligned video performances for training. In either case, accommodating new subjects or extending the training set requires a tedious process. Our method handles any number of subjects with a single network and only requires lightweight annotation of the data (namely, which images share the same identity). Once trained, our network can produce fully controllable, temporally stable video performances from any input image.
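The following is a minimal PyTorch sketch of this conditioning scheme, not a reproduction of the actual paGAN architecture: layer counts, channel sizes, and the `ConditionalFaceGenerator` class name are illustrative assumptions. It only shows how a single generator shared across identities can be conditioned on a neutral input image, a rendered normal map of the fitted 3D face in the target expression and viewpoint, and an eye-gaze map.

```python
# Illustrative sketch only; the real paGAN network and losses are more involved.
import torch
import torch.nn as nn

class ConditionalFaceGenerator(nn.Module):
    def __init__(self, cond_channels=3 + 3 + 2, out_channels=3, base=64):
        super().__init__()
        # Encoder: downsample the stacked conditioning channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(cond_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Decoder: upsample back to image resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, out_channels, 4, stride=2, padding=1),
            nn.Tanh(),  # output image in [-1, 1]
        )

    def forward(self, neutral_img, normal_render, gaze_map):
        # neutral_img:   (B, 3, H, W) single neutral-pose input image
        # normal_render: (B, 3, H, W) normals of the fitted 3D face, rendered
        #                in the target blendshape expression and viewpoint
        # gaze_map:      (B, 2, H, W) per-pixel eye-gaze conditioning
        cond = torch.cat([neutral_img, normal_render, gaze_map], dim=1)
        return self.decoder(self.encoder(cond))

# One generator is shared across all identities; animating a new frame only
# changes the rendered normals and the gaze conditioning.
if __name__ == "__main__":
    gen = ConditionalFaceGenerator()
    neutral = torch.randn(1, 3, 256, 256)
    normals = torch.randn(1, 3, 256, 256)
    gaze = torch.randn(1, 2, 256, 256)
    frame = gen(neutral, normals, gaze)
    print(frame.shape)  # torch.Size([1, 3, 256, 256])
```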
We have demonstrated paGAN, an end-to-end deep learning approach for facial expression texture synthesis, on a wide range of challenging examples. We show how individualized and continuous variations of facial expressions, including the mouth interior and eyes, can be synthesized in various poses from a single input image. Unlike [Kim et al. 2018], our technique does not train an individualized network for each identity, so new subjects are easy to process. We also show that paGAN enables performance-based animation with an image-based dynamic avatar, as well as video-driven facial animation, by generating a compressed model representation that runs on a mobile device in real time.