Recent advances in single-view avatar creation have made compelling CG avatars accessible to consumers and scalable to produce, with potential applications in personalized gaming, social VR, and immersive communication. Single input images are easy to obtain and consumer friendly: all that is needed is a selfie. Such technology is also especially useful when a controlled capture is not possible, for example for celebrities or deceased persons. Photorealistic CG characters are notoriously difficult and expensive to produce, especially in conventional production pipelines, where the slightest inaccuracy in modeling, shading, or rendering can trigger the uncanny valley effect. Image-based approaches, in which photographs of subjects are used directly as textures, can achieve convincing results at low rendering cost, but they are not suitable for reproducing complex non-linear facial expressions or for rendering under novel illumination conditions.
Dynamic Texture Synthesis. Given a single face image in a neutral pose, along with a desired blendshape expression and viewpoint, our deep generative network (paGAN) produces a realistic image of the face with the desired expression. The network is conditioned on multiple inputs, including eye gaze and a rendered normal image of the 3D face model fitted to the target image, providing fine-scale control over the output. Notably, we use a single network that works for all identities, whereas the method of [Kim et al. 2018] trains a separate network for each target avatar, requiring a source video of that avatar as training data. Likewise, although Olszewski et al. [2017] can produce animations from a single image of a new subject, they rely on temporally aligned video performances for training. In either case, accommodating new subjects or extending the training set requires a tedious process. Our method handles any number of subjects with a single network and only requires lightweight annotation of the data (namely, which images share the same identity). Once trained, our network can produce fully controllable, temporally stable video performances from any input image.
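The following is a minimal PyTorch sketch of this conditioning scheme, not a reproduction of the actual paGAN architecture: layer counts, channel sizes, and the `ConditionalFaceGenerator` class name are illustrative assumptions. It only shows how a single generator shared across identities can be conditioned on a neutral input image, a rendered normal map of the fitted 3D face in the target expression and viewpoint, and an eye-gaze map.

```python
# Illustrative sketch only; the real paGAN network and losses are more involved.
import torch
import torch.nn as nn

class ConditionalFaceGenerator(nn.Module):
    def __init__(self, cond_channels=3 + 3 + 2, out_channels=3, base=64):
        super().__init__()
        # Encoder: downsample the stacked conditioning channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(cond_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Decoder: upsample back to image resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, out_channels, 4, stride=2, padding=1),
            nn.Tanh(),  # output image in [-1, 1]
        )

    def forward(self, neutral_img, normal_render, gaze_map):
        # neutral_img:   (B, 3, H, W) single neutral-pose input image
        # normal_render: (B, 3, H, W) normals of the fitted 3D face, rendered
        #                in the target blendshape expression and viewpoint
        # gaze_map:      (B, 2, H, W) per-pixel eye-gaze conditioning
        cond = torch.cat([neutral_img, normal_render, gaze_map], dim=1)
        return self.decoder(self.encoder(cond))

# One generator is shared across all identities; animating a new frame only
# changes the rendered normals and the gaze conditioning.
if __name__ == "__main__":
    gen = ConditionalFaceGenerator()
    neutral = torch.randn(1, 3, 256, 256)
    normals = torch.randn(1, 3, 256, 256)
    gaze = torch.randn(1, 2, 256, 256)
    frame = gen(neutral, normals, gaze)
    print(frame.shape)  # torch.Size([1, 3, 256, 256])
```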
We have demonstrated paGAN, an end-to-end deep learning approach for facial expression texture synthesis, on a wide range of challenging examples. We show how individualized and continuous variations of facial expressions, including the mouth interior and eyes, can be synthesized in various poses from a single input image. Unlike [Kim et al. 2018], our technique does not train an individualized network for each identity, so new subjects are easy to process. We also show that paGAN enables performance-based animation with an image-based dynamic avatar, as well as video-driven facial animation, by generating a compressed model representation that runs on a mobile device in real time.