Rapid Face Asset Acquisition with Recurrent Feature Alignment
SIGGRAPH Asia 2022
Shichen Liu1,2    Yunxuan Cai2    Haiwei Chen1,2    Yichao Zhou3    Yajie Zhao1   
University of Southern California1    USC Institute for Creative Technologies2    University of California Berkeley3   

Our end-to-end framework infers production-ready face assets from multi-view images at a state-of-the-art speed of 4.5 frames per second. The inferred assets contain both pore-level geometry and skin reflectance property maps (specular and diffuse maps), allowing physically-based rendering under various lighting conditions. Notably, our framework is fully automatic: the results shown are the direct output of our neural network without any manual editing or post-processing.
Introduction

Photo-realistic face avatar capture has become a key element in entertainment media due to the realism and immersion it enables. As digital assets created from photos of human faces surpass their artist-created counterparts in both diversity and naturalness, demand for digitized face avatars is growing across most sectors of the digital industry: movies, video games, teleconferencing, and social media platforms, to name a few. In a studio setting, the term “avatar” encompasses several production standards for a scanned digital face: high-resolution geometry with pore-level details, high-resolution (4K) facial textures with skin reflectance measurements, and a digital format that is consistent in mesh connectivity and ready to be rigged and animated. Together, these standards define what is often referred to as a production-ready face avatar.

In this paper, we consider a common face acquisition setting in which a collection of calibrated cameras captures color images that are processed into a full set of assets for a face avatar. Today’s professional pipelines generally employ a two-step approach to creating these face assets. The first step computes a middle-frequency geometry of the face (capturing noticeable wrinkles and facial muscle movement) using multi-view stereo (MVS) algorithms. A second step then registers the geometry to a template mesh connectivity, commonly of lower resolution with around 10k to 50k vertices. For production use, the registered base mesh is augmented by a set of texture maps, composed of albedo, specular, and displacement maps, that are computed from photogrammetric cues and specially designed devices (e.g., polarizers and gradient light patterns in [Ghosh et al. 2011a; Ma et al. 2008]). The lower-resolution base mesh is combined with a high-resolution displacement map to represent geometry with pore- and freckle-level details. Modern physically-based rendering engines further utilize the albedo and specular maps to render the captured face at photo-realistic quality.
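To make the base-mesh-plus-displacement representation concrete, the following is a minimal sketch (not the paper's implementation) of how a low-resolution base mesh can be augmented by a scalar displacement map sampled at per-vertex UV coordinates. All names (`base_vertices`, `vertex_normals`, `uvs`, `displacement_map`) are assumed placeholders.

```python
# Hedged sketch: offset base-mesh vertices along their normals by a displacement
# map sampled in UV space. Production pipelines would first subdivide the mesh
# and use bilinear/bicubic sampling; this illustrates only the data flow.
import numpy as np

def displace_mesh(base_vertices, vertex_normals, uvs, displacement_map, scale=1.0):
    """
    base_vertices    : (N, 3) base-mesh vertex positions
    vertex_normals   : (N, 3) unit vertex normals
    uvs              : (N, 2) per-vertex UV coordinates in [0, 1]
    displacement_map : (H, W) scalar displacement, in mesh units
    """
    h, w = displacement_map.shape
    px = np.clip((uvs[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
    py = np.clip(((1.0 - uvs[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)
    d = displacement_map[py, px] * scale          # nearest-texel lookup
    return base_vertices + vertex_normals * d[:, None]
```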


An example set of subject data used for training. (a) Selected views of the captured images as input. (b) Processed geometry in the form of a 3D mesh. In addition to the face, head, and neck, our model represents teeth, gums, eyeballs, eye blending, lacrimal fluid, eye occlusion, and eyelashes. The green region denotes the face area that our model aims to reconstruct; the other parts are directly adopted from a template. (c) 4K × 4K physically-based skin property maps, including the albedo (bottom-left), specular (top-left), and displacement (top-right) maps used for texture supervision, and the 512 × 512 position map (bottom-right), converted from the 3D mesh in (b), used for geometry supervision.
Data Collection

3.1 Capture System Setup
Our training data is acquired with a Light Stage scanning system, which captures pore-level accuracy in both geometry and reflectance maps by combining photometric stereo reconstruction [Ghosh et al. 2011b] and polarization promotion [LeGendre et al. 2018]. The camera setup consists of 25 Ximea machine vision cameras: 17 monochrome and 8 color. Compared to their color counterparts, the monochrome cameras support more efficient and higher-resolution capture, which is essential for reconstructing sub-millimeter geometric details, albedo, and specular reflectance. The additional color cameras aid stereo-based mesh reconstruction. The RGB colors in the captured images are obtained by adding successive monochrome images recorded under different illumination colors, as shown in [LeGendre et al. 2018]. We selected a FACS set [Ekman and Friesen 1978] that combines 40 action units into a condensed set of 26 expressions for each subject to perform. A total of 64 subjects, ranging in age from 18 to 67, were scanned.
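As a simplified illustration of the color recovery step (not the exact calibration of [LeGendre et al. 2018]), successive monochrome frames captured under red, green, and blue illumination can be combined into an RGB image; the frame names below are assumed placeholders.

```python
# Hedged sketch: stack three successive monochrome frames, each taken under a
# different illumination colour, into the R, G, and B channels.
import numpy as np

def monochrome_to_rgb(frame_r, frame_g, frame_b):
    """frame_*: (H, W) monochrome images captured under red/green/blue illumination,
    assumed to be normalised to [0, 1]."""
    rgb = np.stack([frame_r, frame_g, frame_b], axis=-1).astype(np.float32)
    return np.clip(rgb, 0.0, 1.0)
```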

3.2 Data Preparation
Starting from the multi-view images, we first reconstruct the geometry of the target subject's neutral-expression scan using a multi-view stereo (MVS) algorithm. The reconstructed scan is then registered using a linear fitting algorithm based on a 3D morphable face model, similar to the method in [Blanz and Vetter 1999]. In particular, we fit the scan by estimating the morphable model coefficients via linear regression to obtain an initial shape in the template topology. A non-rigid Laplacian deformation is then performed to further minimize the surface-to-surface distance: we deform all vertices of the initially fitted mesh, using landmarks matched to their correspondences on the scan surface as data terms and the Laplacian of the mesh as a regularization term.
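The linear morphable-model fit can be sketched as a regularized least-squares problem: the scan (brought into correspondence, e.g., via detected landmarks) is approximated by the model mean plus a linear combination of basis shapes. The names `mean_shape`, `basis`, and `target_points` are assumed placeholders, and the regularization weight is illustrative only.

```python
# Hedged sketch of the linear fit that produces the initial shape in the
# template topology.
import numpy as np

def fit_morphable_model(mean_shape, basis, target_points, reg=1e-3):
    """
    mean_shape    : (3N,) mean face of the morphable model, flattened
    basis         : (3N, K) PCA basis of the morphable model
    target_points : (3N,) corresponding scan positions, flattened
    reg           : Tikhonov regularisation keeping coefficients plausible
    """
    A = basis.T @ basis + reg * np.eye(basis.shape[1])
    b = basis.T @ (target_points - mean_shape)
    coeffs = np.linalg.solve(A, b)
    fitted = mean_shape + basis @ coeffs   # initial shape in template topology
    return fitted.reshape(-1, 3), coeffs
```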

We adopt and implement a variation of [Sorkine et al. 2004] to solve this system. Once the neutral expression of the target person is registered, the remaining expressions are processed based on it. We first use a set of generic blendshapes (vertex differences computed between each expression and the neutral, with 54 predefined orthogonal expressions) together with the fitted neutral base mesh to fit the scanned expressions, and then perform the same non-rigid mesh registration step to further reduce the fitting error. Additionally, to ensure cross-expression consistency for the same identity, optical flow from the neutral to the other expressions is added as a dense consistency constraint in the non-rigid Laplacian deformation step. This 2D optical flow is further used as a projection constraint when solving for the 3D location of each vertex on the target expression mesh during the non-rigid deformation.
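A minimal sketch of the non-rigid Laplacian registration, in the spirit of [Sorkine et al. 2004], is given below: landmark (and optical-flow) correspondences act as soft positional constraints while the mesh Laplacian preserves local surface detail. `L` (a sparse mesh Laplacian), `constraint_ids`, and `constraint_targets` are assumed names, and the single constraint weight is a simplification.

```python
# Hedged sketch: solve an over-determined sparse system that balances detail
# preservation (Laplacian term) against positional constraints (data term).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def laplacian_deform(vertices, L, constraint_ids, constraint_targets, w=10.0):
    """
    vertices           : (N, 3) initial (fitted) vertex positions
    L                  : (N, N) sparse cotangent or uniform mesh Laplacian
    constraint_ids     : (M,) indices of constrained vertices (landmarks / flow matches)
    constraint_targets : (M, 3) their target positions on the scan surface
    w                  : weight of the positional data term
    """
    n = vertices.shape[0]
    delta = L @ vertices                                  # differential coordinates to preserve
    C = sp.csr_matrix((np.full(len(constraint_ids), w),
                       (np.arange(len(constraint_ids)), constraint_ids)),
                      shape=(len(constraint_ids), n))
    A = sp.vstack([sp.csr_matrix(L), C]).tocsr()
    B = np.vstack([delta, w * constraint_targets])
    # Solve independently per coordinate with a sparse least-squares solver.
    return np.column_stack([spla.lsqr(A, B[:, k])[0] for k in range(3)])
```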


Network architecture of ReFA. Our model recurrently optimizes the facial geometry and head pose based on the computation of visual-semantic correlation (VSC) and utilizes the pixel-aligned signals learned thereby for high-resolution texture inference.

All the processed geometries and textures share the same mesh connectivity and thus have dense vertex-level correspondence. The diffuse-specular separation is computed under known spherical illumination [Ma et al. 2007]. The pore-level details of the geometry are computed by employing the albedo and normal maps in the stereo reconstruction [Ghosh et al. 2011b] and are represented as displacement maps over the base mesh. The full asset set consists of a base geometry, a head pose, and texture maps (albedo, specular intensity, and displacement) encoded at 4K resolution. The 3D vertex positions are rasterized to a three-channel HDR bitmap of 256 × 256 resolution to enable joint learning of the correlation between geometry and albedo. By default, 15 camera views are used to infer the face assets with our neural network. Figure 2 shows an example of the captured multi-view images and the full set of processed face assets used for training.
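The conversion from a registered mesh to a position map can be sketched as follows: each vertex writes its 3D position into the texel addressed by its UV coordinate. A full implementation rasterizes triangles in UV space to fill every texel; this nearest-texel scatter only illustrates the data layout, and all names are assumed.

```python
# Hedged sketch: rasterise registered vertex positions into a UV-space HDR
# position map used as geometry supervision.
import numpy as np

def mesh_to_position_map(vertices, uvs, resolution=256):
    """
    vertices : (N, 3) registered vertex positions (shared topology)
    uvs      : (N, 2) per-vertex UV coordinates in [0, 1]
    returns  : (resolution, resolution, 3) float32 position map
    """
    pos_map = np.zeros((resolution, resolution, 3), dtype=np.float32)
    px = np.clip((uvs[:, 0] * (resolution - 1)).round().astype(int), 0, resolution - 1)
    py = np.clip(((1.0 - uvs[:, 1]) * (resolution - 1)).round().astype(int), 0, resolution - 1)
    pos_map[py, px] = vertices
    return pos_map
```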

In addition to the primary assets generated by our proposed network, we can also attach secondary components (e.g., eyeballs, lacrimal fluid, eyelashes, teeth, and gums) to the network-created avatar. Based on a set of handcrafted blendshapes covering all primary and secondary parts, we linearly fit the reconstructed mesh by computing blending weights that drive the secondary components to follow the primary parts, so that, for example, the eyelashes move with the eyelids. Except for the eyeballs, the secondary parts share a set of generic textures across all subjects. For the eyeballs, we adopt an eyeball asset database [Kollar 2019] with 90 different pupil patterns to match the input subjects. Note that all eyes share the same shape, as in [Kollar 2019] and in our database. For visualization purposes, we manually pick the matching eye color. The dataset is split into 45 subjects for training and 19 for evaluation. Each capture set contains a neutral face and 26 expressions, including extreme facial deformations, asymmetrical motions, and subtle expressions.
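The linear blendshape fit that drives the secondary components can be sketched as a least-squares solve over the primary face region; the fitted weights are then applied to the full rig so the secondary parts follow. The non-negativity constraint below is one reasonable choice, not necessarily the paper's, and `neutral`, `blendshapes`, and `target` are assumed placeholders.

```python
# Hedged sketch: solve for blending weights on the primary face region, then
# reuse them to deform the secondary parts (eyelashes, teeth, gums, ...).
import numpy as np
from scipy.optimize import nnls

def fit_blend_weights(neutral, blendshapes, target):
    """
    neutral     : (3N,) flattened neutral-mesh vertices of the primary face region
    blendshapes : (3N, K) flattened per-expression vertex offsets from the neutral
    target      : (3N,) flattened reconstructed face vertices to match
    """
    weights, _ = nnls(blendshapes, target - neutral)   # non-negative least squares
    return weights

# The same weights then drive the full rig, e.g.
#   full_mesh = full_neutral + full_blendshapes @ weights
```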


Fig. 6. Images rendered from our reconstructed face assets. The geometry constructed from the input images and the inferred appearance maps are used in physically-based renderings with Maya Arnold under lighting environments provided by HDRI images. The renderings achieve photo-realistic quality and faithfully recover the appearance and expression captured in the input photos.
Results

Figure 6 shows results rendered using the complete set of assets produced by our system from randomly selected testing data, including the input reference images, the directly inferred texture maps, and renderings under different illuminations. In addition, Figure 7 shows a detailed visualization of the inferred high-resolution texture maps: the diffuse albedo, specular, and displacement maps. All results are rendered from the reconstructed geometries and texture maps in Maya Arnold using a physically-based shader, under environment illumination provided by HDRI images.

In the following sections, we provide comparative evaluations against directly related baseline methods (Section 6.1) as well as an ablation study (Section 6.2). In addition, we demonstrate three meaningful applications enabled by ReFA in Section 6.3.


Fig. 7. Detailed results of the texture map inference. The even rows display zoomed-in crops of the 4096 × 4096 texture maps. Our texture inference network constructs texture maps from the multi-view images with high-frequency details that enable photo-realistic renderings of the face assets.

To quantitatively evaluate the geometry reconstruction, we first convert our inferred position map to a mesh representation as described in Section 4.1. We then compute scan-to-mesh errors following [Li et al. 2021], with the exception that the errors are computed over the full face region, including the ears. We report both the mean and median errors as the main evaluation metrics, since these two statistics capture the overall accuracy of the reconstruction models. To further characterize the error distribution, we additionally show the cumulative distribution function (CDF) curves of the errors, which measure the percentage of point errors that fall below a given error threshold.
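As a hedged sketch of this evaluation (not the exact protocol of [Li et al. 2021]), the point-to-surface error can be approximated by nearest-neighbor distances from the scan points to a dense sampling of the reconstructed mesh, then summarized by its mean, median, and CDF. Variable names are placeholders.

```python
# Hedged sketch: scan-to-mesh error statistics and error CDF.
import numpy as np
from scipy.spatial import cKDTree

def scan_to_mesh_errors(scan_points, mesh_surface_samples):
    """scan_points: (N, 3), mesh_surface_samples: (M, 3), both in millimetres."""
    tree = cKDTree(mesh_surface_samples)
    dists, _ = tree.query(scan_points)   # nearest-neighbour distance per scan point
    return dists

def error_cdf(dists, thresholds):
    """Fraction of scan points whose error falls below each threshold."""
    return np.array([(dists <= t).mean() for t in thresholds])

# Example usage:
#   d = scan_to_mesh_errors(scan, samples)
#   print(d.mean(), np.median(d))
#   cdf = error_cdf(d, np.linspace(0.0, 5.0, 51))
```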



