High-quality and personalized digital humans are relevant to a wide range of applications, such as film and game production (e.g., Unreal Engine, Digital Doug) and virtual reality [Fyffe et al. 2014; Lombardi et al. 2018; Wei et al. 2019]. To produce high-fidelity digital doubles, conventional computer graphics pipelines often require complex capture equipment, and the acquired data typically undergoes intensive manual post-processing by a production team. Newer approaches based on deep learning synthesis are promising, as they show how photorealistic faces can be generated directly from captured data [Lombardi et al. 2018; Wei et al. 2019], allowing one to overcome the notorious uncanny valley. However, beyond their intensive GPU compute requirements and their need for large volumes of training data, these deep learning-based methods remain difficult to integrate seamlessly into virtual CG environments: they lack relighting capabilities and fine rendering controls, which prevents their adoption in games and film production. On the other hand, realistic digital doubles in conventional graphics pipelines require months of production and involve large teams of highly skilled digital artists as well as sophisticated scanning techniques [Ghosh et al. 2011]. Building the facial assets of a virtual character typically requires a number of facial expression models, often based on the Facial Action Coding System (FACS), as well as physically-based texture assets (e.g., albedo, specular, and displacement maps) to ensure realistic facial skin reflectance in a virtual environment.
Our system takes a single scanned neutral geometry with an albedo map as input and generates a set of face rig assets and texture attributes for physically-based, production-level rendering. We developed a cascaded framework in which we first estimate a set of personalized blendshape geometries for the input subject using a Blendshape Generation network, followed by a Texture Generation network that infers a set of dynamic maps, including albedo maps, specular intensity maps, and displacement maps. In the final step, we combine the obtained secondary facial components (i.e., teeth, gums, and eye assets) from a set of template shapes to assemble the final face model.
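To make the data flow of this cascade concrete, the following is a minimal sketch of how such a pipeline might be organized. The class and function names (BlendshapeGenerator, TextureGenerator, assemble_face_rig), the tensor sizes, and the random linear maps standing in for the trained networks are illustrative assumptions, not the implementation described here.

```python
# Illustrative sketch of the cascaded asset-generation pipeline; the "networks"
# below are random linear stand-ins, not trained models.
import numpy as np

N_VERTS = 1024        # vertex count of an assumed fixed mesh topology
N_SHAPES = 26         # assumed number of FACS-style expression blendshapes
TEX_RES = 256         # toy texture resolution

class BlendshapeGenerator:
    """Stand-in for the Blendshape Generation network: neutral geometry in,
    per-expression vertex offsets (deltas) out."""
    def __init__(self, rng, latent=64):
        self.enc = rng.standard_normal((latent, N_VERTS * 3)) * 1e-2
        self.dec = rng.standard_normal((N_SHAPES * N_VERTS * 3, latent)) * 1e-2

    def __call__(self, neutral_verts):
        z = self.enc @ neutral_verts.reshape(-1)                 # latent code
        return (self.dec @ z).reshape(N_SHAPES, N_VERTS, 3)      # per-shape deltas

class TextureGenerator:
    """Stand-in for the Texture Generation network: neutral albedo in,
    per-expression albedo / specular-intensity / displacement maps out."""
    def __call__(self, neutral_albedo):
        rng = np.random.default_rng(0)
        return {
            "albedo":       np.repeat(neutral_albedo[None], N_SHAPES, axis=0),
            "specular":     rng.random((N_SHAPES, TEX_RES, TEX_RES, 1)),
            "displacement": rng.random((N_SHAPES, TEX_RES, TEX_RES, 1)),
        }

def assemble_face_rig(neutral_verts, neutral_albedo, template_parts):
    """Run the two-stage cascade, then attach template secondary components."""
    rng = np.random.default_rng(0)
    deltas = BlendshapeGenerator(rng)(neutral_verts)     # stage 1: blendshapes
    textures = TextureGenerator()(neutral_albedo)        # stage 2: dynamic maps
    return {
        "neutral": neutral_verts,
        "blendshapes": neutral_verts[None] + deltas,     # per-expression meshes
        "textures": textures,
        "secondary": template_parts,                     # teeth, gums, eye assets
    }

rig = assemble_face_rig(
    neutral_verts=np.zeros((N_VERTS, 3)),
    neutral_albedo=np.zeros((TEX_RES, TEX_RES, 3)),
    template_parts={"teeth": None, "gums": None, "eyes": None},
)
```

The point of the sketch is only the structure: the second stage is conditioned on the same neutral input as the first, and the secondary components are not predicted but transferred from templates.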
Expression Reconstruction / Face Tracking. In Fig. 22, we compare our generated personalized blendshapes against other methods when fitting performance capture sequences. As shown in Fig. 18, smaller fitting errors indicate better personalization of the blendshapes. The results show that our generated personalized blendshapes outperform the baseline methods (the template blendshapes and the optimization-based method of Li et al. [2010]) in accuracy on the face tracking task when using the same solver. To provide stronger quantitative evidence, we evaluate face reconstruction on 2,548 expressions in the training set and 626 expressions in the testing set; the results are listed in Table 2. Both the blendshapes optimized with Li et al. [2010] and ours yield smaller reconstruction errors than the template on the training and testing data.
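For readers unfamiliar with this evaluation, the comparison above only fixes the solver, not its exact form. A minimal sketch of one common formulation is given below: solve for per-frame blendshape weights by bounded, regularized least squares and report the per-vertex RMSE as the fitting error. The function name and the specific regularization/bounds are assumptions for illustration, not the solver used in the paper.

```python
# Minimal blendshape-fitting sketch: given a neutral mesh, per-shape deltas,
# and a target frame, solve for weights w in [0, 1] that minimize the vertex
# reconstruction error (Tikhonov-regularized, bounded least squares).
import numpy as np
from scipy.optimize import lsq_linear

def fit_frame(neutral, deltas, target, lam=1e-3):
    """neutral: (V, 3), deltas: (K, V, 3), target: (V, 3) -> (weights, rmse)."""
    K = deltas.shape[0]
    A = deltas.reshape(K, -1).T                  # (3V, K) basis matrix
    b = (target - neutral).reshape(-1)           # (3V,) offsets to explain
    # Append sqrt(lam) * I rows to keep the weights small.
    A_reg = np.vstack([A, np.sqrt(lam) * np.eye(K)])
    b_reg = np.concatenate([b, np.zeros(K)])
    w = lsq_linear(A_reg, b_reg, bounds=(0.0, 1.0)).x
    recon = neutral + (w[:, None, None] * deltas).sum(axis=0)
    rmse = np.sqrt(np.mean(np.sum((recon - target) ** 2, axis=1)))
    return w, rmse
```

Running the same fit_frame routine with the template deltas, the deltas optimized as in Li et al. [2010], and the generated personalized deltas would yield the kind of per-set reconstruction errors summarized in Table 2.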
We have demonstrated an end-to-end framework for high-quality personalized face rig and asset generation from a single scan. Our face rig assets include a set of personalized blendshapes, physically-based dynamic textures, and secondary facial components (including teeth, eyeballs, and eyelashes). Compared to previous automatic avatar and facial rig generation approaches, which either require a considerable number of person-specific scans or can only produce a relatively low-fidelity avatar, our framework requires only a single neutral scan as input and can produce plausible identity attributes, including physically-based dynamic textures of the facial skin. This characteristic is key to creating compelling animation-ready avatars at scale.