High-quality and personalized digital humans are relevant to a wide range of applications, such as film and game production (e.g., Unreal Engine, Digital Doug) and virtual reality [Fyffe et al. 2014; Lombardi et al. 2018; Wei et al. 2019]. To produce high-fidelity digital doubles, conventional computer graphics pipelines often require complex capture equipment, and the acquired data typically undergoes intensive manual post-processing by a production team. Newer approaches based on deep learning synthesis are promising, as they show how photorealistic faces can be generated directly from captured data [Lombardi et al. 2018; Wei et al. 2019], allowing one to overcome the notorious uncanny valley. However, beyond their intensive GPU compute requirements and their need for large volumes of training data, these deep learning-based methods remain difficult to integrate seamlessly into virtual CG environments: they lack relighting capabilities and fine rendering controls, which prevents their adoption in games and film production. On the other hand, realistic digital doubles in conventional graphics pipelines require months of production and involve large teams of highly skilled digital artists as well as sophisticated scanning techniques [Ghosh et al. 2011]. Building the facial assets of a virtual character typically requires a number of facial expression models, often based on the Facial Action Coding System (FACS), as well as physically-based texture assets (e.g., albedo, specular, and displacement maps) to ensure realistic facial skin reflectance in a virtual environment.
Our system takes a single scanned neutral geometry with an albedo map as input and generates a set of face rig assets and texture attributes for physically-based, production-level rendering. We developed a cascaded framework in which we first estimate a set of personalized blendshape geometries for the input subject using a Blendshape Generation network, followed by a Texture Generation network that infers a set of dynamic maps, including albedo maps, specular intensity maps, and displacement maps. In the final step, we combine the obtained secondary facial components (i.e., teeth, gums, and eye assets) from a set of template shapes to assemble the final face model.
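To make the data flow of this cascade concrete, the following is a minimal sketch of how such a pipeline might be organized. The class and function names (BlendshapeGenerator, TextureGenerator, assemble_face_rig), the tensor sizes, and the random linear maps standing in for the trained networks are illustrative assumptions, not the implementation described here.

```python
# Illustrative sketch of the cascaded asset-generation pipeline; the "networks"
# below are random linear stand-ins, not trained models.
import numpy as np

N_VERTS = 1024        # vertex count of an assumed fixed mesh topology
N_SHAPES = 26         # assumed number of FACS-style expression blendshapes
TEX_RES = 256         # toy texture resolution

class BlendshapeGenerator:
    """Stand-in for the Blendshape Generation network: neutral geometry in,
    per-expression vertex offsets (deltas) out."""
    def __init__(self, rng, latent=64):
        self.enc = rng.standard_normal((latent, N_VERTS * 3)) * 1e-2
        self.dec = rng.standard_normal((N_SHAPES * N_VERTS * 3, latent)) * 1e-2

    def __call__(self, neutral_verts):
        z = self.enc @ neutral_verts.reshape(-1)                 # latent code
        return (self.dec @ z).reshape(N_SHAPES, N_VERTS, 3)      # per-shape deltas

class TextureGenerator:
    """Stand-in for the Texture Generation network: neutral albedo in,
    per-expression albedo / specular-intensity / displacement maps out."""
    def __call__(self, neutral_albedo):
        rng = np.random.default_rng(0)
        return {
            "albedo":       np.repeat(neutral_albedo[None], N_SHAPES, axis=0),
            "specular":     rng.random((N_SHAPES, TEX_RES, TEX_RES, 1)),
            "displacement": rng.random((N_SHAPES, TEX_RES, TEX_RES, 1)),
        }

def assemble_face_rig(neutral_verts, neutral_albedo, template_parts):
    """Run the two-stage cascade, then attach template secondary components."""
    rng = np.random.default_rng(0)
    deltas = BlendshapeGenerator(rng)(neutral_verts)     # stage 1: blendshapes
    textures = TextureGenerator()(neutral_albedo)        # stage 2: dynamic maps
    return {
        "neutral": neutral_verts,
        "blendshapes": neutral_verts[None] + deltas,     # per-expression meshes
        "textures": textures,
        "secondary": template_parts,                     # teeth, gums, eye assets
    }

rig = assemble_face_rig(
    neutral_verts=np.zeros((N_VERTS, 3)),
    neutral_albedo=np.zeros((TEX_RES, TEX_RES, 3)),
    template_parts={"teeth": None, "gums": None, "eyes": None},
)
```

The point of the sketch is only the structure: the second stage is conditioned on the same neutral input as the first, and the secondary components are not predicted but transferred from templates.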
Expression Reconstruction / Face Tracking. In Fig. 22, we compare our generated personalized blendshapes against other methods when fitting performance capture sequences. As shown in Fig. 18, smaller fitting errors indicate better personalization of the blendshapes. The results show that our generated personalized blendshapes outperform the baseline methods (the template blendshapes and the optimization-based method of Li et al. [2010]) in accuracy on the face tracking task when using the same solver. To provide stronger quantitative evidence, we evaluate face reconstruction on 2,548 expressions in the training set and 626 expressions in the testing set; the results are listed in Table 2. Both the blendshapes optimized with Li et al. [2010] and ours yield smaller reconstruction errors than the template on the training and testing data.
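For readers unfamiliar with this evaluation, the comparison above only fixes the solver, not its exact form. A minimal sketch of one common formulation is given below: solve for per-frame blendshape weights by bounded, regularized least squares and report the per-vertex RMSE as the fitting error. The function name and the specific regularization/bounds are assumptions for illustration, not the solver used in the paper.

```python
# Minimal blendshape-fitting sketch: given a neutral mesh, per-shape deltas,
# and a target frame, solve for weights w in [0, 1] that minimize the vertex
# reconstruction error (Tikhonov-regularized, bounded least squares).
import numpy as np
from scipy.optimize import lsq_linear

def fit_frame(neutral, deltas, target, lam=1e-3):
    """neutral: (V, 3), deltas: (K, V, 3), target: (V, 3) -> (weights, rmse)."""
    K = deltas.shape[0]
    A = deltas.reshape(K, -1).T                  # (3V, K) basis matrix
    b = (target - neutral).reshape(-1)           # (3V,) offsets to explain
    # Append sqrt(lam) * I rows to keep the weights small.
    A_reg = np.vstack([A, np.sqrt(lam) * np.eye(K)])
    b_reg = np.concatenate([b, np.zeros(K)])
    w = lsq_linear(A_reg, b_reg, bounds=(0.0, 1.0)).x
    recon = neutral + (w[:, None, None] * deltas).sum(axis=0)
    rmse = np.sqrt(np.mean(np.sum((recon - target) ** 2, axis=1)))
    return w, rmse
```

Running the same fit_frame routine with the template deltas, the deltas optimized as in Li et al. [2010], and the generated personalized deltas would yield the kind of per-set reconstruction errors summarized in Table 2.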
We have demonstrated an end-to-end framework for high-quality personalized face rig and asset generation from a single scan. Our face rig assets include a set of personalized blendshapes, physically-based dynamic textures, and secondary facial components (including teeth, eyeballs, and eyelashes). Compared to previous automatic avatar and facial rig generation approaches, which either require a considerable number of person-specific scans or can only produce a relatively low-fidelity avatar, our framework requires only a single neutral scan as input and can produce plausible identity attributes, including physically-based dynamic textures of the facial skin. This characteristic is key to creating compelling animation-ready avatars at scale.