High-fidelity face digitization solutions often combine multi-view stereo (MVS) techniques for 3D reconstruction with a non-rigid registration step to establish dense correspondence across identities and expressions. A common problem is the need for manual clean-up after the MVS step, as 3D scans are typically affected by noise and outliers and contain hairy surface regions that must be cleaned up by artists. Furthermore, mesh registration tends to fail for extreme facial expressions. Most learning-based methods use an underlying 3D morphable model (3DMM) to ensure robustness, but this limits output accuracy for extreme facial expressions. In addition, the global bottleneck of regression architectures cannot produce meshes that tightly fit the ground-truth surfaces. We propose ToFu, Topologically consistent Face from multi-view, a geometry inference framework that can produce topologically consistent meshes across facial identities and expressions using a volumetric representation instead of an explicit underlying 3DMM. Our novel progressive mesh generation network embeds the topological structure of the face in a feature volume, sampled from geometry-aware local features. A coarse-to-fine architecture facilitates dense and accurate facial mesh predictions in a consistent mesh topology. ToFu further captures displacement maps for pore-level geometric details and facilitates high-quality rendering in the form of albedo and specular reflectance maps. These high-quality assets are readily usable by production studios for avatar creation, animation, and physically-based skin rendering. We demonstrate state-of-the-art geometric and correspondence accuracy, while taking only 0.385 seconds to compute a mesh with 10K vertices, three orders of magnitude faster than traditional techniques. The code and the model are available for research purposes at Tianye Li's github.
Creating high-fidelity digital humans is not only highly
sought after in the film and gaming industry, but is also gaining
interest in consumer applications, ranging from telepresence
in AR/VR to virtual fashion models and virtual assistants.
While fully automated single-view avatar digitization
solutions exist [28, 29, 42, 56, 63], professional studios
still opt for high resolution multi-view images as input, to
ensure the highest possible fidelity and surface coverage in
a controlled setting [8, 23, 25, 40, 41, 46, 50] instead of unconstrained
input data. Typically, high-resolution geometry
(< 1 mm error) is desired along with high-resolution physically-based material properties (at least 4K). Furthermore, to build a fully rigged face model for animation, a large number of facial scans and alignments (often over 30) are performed, typically following conventions based on the Facial Action Coding System (FACS).
A typical approach used in production consists of a multi-view stereo acquisition process to capture detailed 3D scans of each facial expression, followed by a non-rigid registration [8, 36] or inference method that warps a 3D face model to each scan to ensure consistent mesh topology. Between these two steps, manual clean-up is often necessary to remove artifacts and unwanted surface regions, especially those with facial hair (beards, eyebrows) as well as teeth and neck regions. The registration process is often assisted by manual labeling of correspondences and parameter tweaking to ensure accurate fitting. In a production setting, a completed rig of a person can easily take up to a week to finalize.
Several recent techniques have been introduced to automate this process by fitting a 3D model directly to a calibrated set of input images. One multi-view stereo face modeling method is not only particularly slow, but also relies on dynamic sequences and carefully tuned per-subject parameters to ensure consistent parameterization between expressions. In particular, facial expressions that are not captured continuously cannot be guaranteed topological consistency. More recent deep learning approaches [4, 63] use 3D morphable model (3DMM) inference to obtain a coarse initial facial expression, but require an optimization-based refinement step to improve fitting accuracy. These methods are limited in fitting extreme expressions due to the constraints of linear 3DMMs, and in fitting tightly to the ground-truth face surfaces due to the global nature of their regression architectures. The additional photometric refinement also tends to fit unwanted regions like facial hair.
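The limitation of a linear 3DMM can be made concrete with a toy least-squares fit. The sketch below (synthetic random data; all names and sizes are illustrative, not any published model) fits a tiny linear shape model to a target containing off-model detail: the in-span coefficients are recovered, but detail outside the span of the basis is irrecoverable, which is why such models cannot fit extreme expressions tightly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear 3DMM: vertices are a mean shape plus a linear
# combination of K basis vectors (sizes are illustrative).
V, K = 100, 5                       # number of vertices, basis size
mean_shape = rng.normal(size=(V, 3))
basis = rng.normal(size=(K, V, 3))  # identity/expression basis

def reconstruct(coeffs):
    """Shape = mean + sum_k coeffs[k] * basis[k]."""
    return mean_shape + np.tensordot(coeffs, basis, axes=1)

def fit(target):
    """Best least-squares coefficients for a target shape.

    Anything outside span(basis) is unreachable by the model."""
    B = basis.reshape(K, -1)               # (K, 3V)
    r = (target - mean_shape).reshape(-1)  # residual, (3V,)
    coeffs, *_ = np.linalg.lstsq(B.T, r, rcond=None)
    return coeffs

# A target with an off-model perturbation on top of an in-span shape:
target = reconstruct(np.array([1., -2., 0.5, 0., 3.])) \
         + 0.1 * rng.normal(size=(V, 3))
coeffs = fit(target)
residual = np.linalg.norm(reconstruct(coeffs) - target)
print(f"off-model residual: {residual:.3f}")  # nonzero: detail is lost
```

The nonzero residual is exactly the off-model component; a volumetric representation avoids this hard span constraint.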
We introduce a new volumetric approach for consistent 3D face mesh inference using multi-view images. Instead of relying explicitly on a mesh-based face model such as a 3DMM, our volumetric approach is more general, allowing it to capture a wider range of expressions and subtle deformation details on the face. Our method is also three orders of magnitude faster than conventional methods, taking only 0.385 seconds to generate a dense 3D mesh (10K vertices) and to produce additional assets for high-fidelity production use cases, such as albedo, specular, and high-resolution displacement maps.
To this end, we propose a progressive mesh generation network that can infer a topologically consistent mesh directly. Our volumetric architecture predicts vertex locations as probability distributions, along with volumetric features that are extracted using the underlying multi-view geometry. The topological structure of the face is embedded into this architecture using a hierarchical mesh representation and coarse-to-fine network.
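A common, differentiable way to read a vertex location out of a predicted probability volume is a soft-argmax: the expectation of the grid coordinates under the softmax of the volume. The sketch below illustrates this readout on a synthetic peaked volume; it is a generic operator and the grid size is illustrative, not necessarily ToFu's exact implementation.

```python
import numpy as np

D = 16                            # voxels per axis (illustrative)
axes = np.linspace(-1.0, 1.0, D)  # normalized volume coordinates
zz, yy, xx = np.meshgrid(axes, axes, axes, indexing="ij")

def soft_argmax(logits, temperature=1.0):
    """Expected 3D position under softmax(logits / temperature)."""
    p = np.exp((logits - logits.max()) / temperature)
    p /= p.sum()
    # Expectation of each coordinate grid under the distribution p.
    return np.array([(p * g).sum() for g in (xx, yy, zz)])

# A probability volume sharply peaked near (0.25, -0.5, 0.0):
logits = -50.0 * ((xx - 0.25) ** 2 + (yy + 0.5) ** 2 + zz ** 2)
vertex = soft_argmax(logits)
print(vertex)  # close to [0.25, -0.5, 0.0]
```

Because the readout is an expectation rather than a hard argmax, predicted vertices can land between voxel centers and gradients flow through the whole volume during training.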
Our experiments show that ToFu is capable of producing highly accurate geometry in a consistent topology fully automatically, while existing methods either rely on manual clean-up and parameter tuning, or are less accurate, especially for subjects with facial hair. Since we can ensure a consistent parameterization across facial identities and expressions without any human input, our solution is suitable for scaled digitization of high-fidelity facial avatars. We not only reduce the turnaround time for production, but also provide a critical solution for generating large facial datasets, which is otherwise associated with excessive manual labor. Our main contributions are:
Face Capture. Traditionally, face acquisition is separated
into two steps: 3D face reconstruction and registration.
Facial geometry can be captured with laser scanners,
passive Multi-View Stereo (MVS) capture systems, dedicated
active photometric stereo systems [23, 41], or depth
sensors based on structured light or time-of-flight.
Among these, MVS is the most commonly used
[18, 20, 24, 34, 43, 60]. Although these approaches produce
high-quality geometry, they suffer from heavy computation
due to pairwise feature matching across views, and they
tend to fail in case of sparse view inputs due to the lack of
overlapping neighboring views. More recently, deep neural
networks have been used to learn multi-view feature matching for
3D geometry reconstruction [26, 31, 33, 51, 64]. Compared to classical
MVS methods, these learning-based methods represent
a trade-off between accuracy and efficiency. All these MVS
methods output unstructured meshes, while our method produces
meshes in dense vertex correspondence.
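The core operation behind such learned MVS pipelines can be caricatured in a few lines: project a 3D query point into each calibrated view, bilinearly sample per-view feature maps there, and aggregate across views. The sketch below uses synthetic cameras and random feature maps; the names and the mean/variance aggregation are illustrative, not a specific published architecture.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of world point X to pixel coordinates (u, v)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def sample_bilinear(fmap, uv):
    """Bilinear lookup of an (H, W, C) feature map at pixel uv = (u, v)."""
    H, W, _ = fmap.shape
    u = np.clip(uv[0], 0, W - 1.001)
    v = np.clip(uv[1], 0, H - 1.001)
    u0, v0 = int(u), int(v)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * fmap[v0, u0]
            + du * (1 - dv) * fmap[v0, u0 + 1]
            + (1 - du) * dv * fmap[v0 + 1, u0]
            + du * dv * fmap[v0 + 1, u0 + 1])

rng = np.random.default_rng(1)
K = np.array([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]])
views = []
for i in range(4):                       # four cameras looking down +z
    R = np.eye(3)
    t = np.array([0.2 * i - 0.3, 0.0, 2.0])
    fmap = rng.normal(size=(64, 64, 8))  # per-view feature map (H, W, C)
    views.append((R, t, fmap))

X = np.array([0.1, -0.05, 1.0])          # a query point in the volume
feats = np.stack([sample_bilinear(f, project(K, R, t, X))
                  for R, t, f in views])
aggregated = np.concatenate([feats.mean(0), feats.var(0)])
print(aggregated.shape)  # (16,): cross-view mean and variance features
```

The variance channels carry the photometric-consistency signal: points on the true surface see similar features in all views, so their variance is low.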
Most registration methods fit a template mesh to the scan by minimizing the distance between the scan’s surface and the template. For optimization, the template mesh is commonly parameterized with a statistical shape space [3, 9, 11, 38] or a general blendshape basis. Other approaches directly optimize the vertices of the template mesh using non-rigid Iterative Closest Point (ICP), with a statistical model as regularizer, or jointly optimize correspondence across an entire dataset in a groupwise fashion [12, 65]. For a more thorough review of face acquisition and registration, see Egger et al. All these registration methods solve for facial correspondence independently from the data acquisition. Therefore, errors in the raw scan data propagate into the registration.
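As a minimal illustration of registration as energy minimization, the sketch below pulls template vertices toward their nearest scan points (the data term) while a crude smoothness term stands in for a statistical prior or ICP stiffness. The point clouds, the connectivity-free regularizer, and the step sizes are all synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
scan = rng.uniform(-1, 1, size=(500, 3))     # noisy "scan" point cloud
template = rng.uniform(-1, 1, size=(50, 3))  # template vertices
lam, step = 0.05, 0.3                        # regularizer weight, step size

def energy_grad(V):
    # Data term: pull each vertex toward its nearest scan point.
    d = np.linalg.norm(V[:, None] - scan[None], axis=-1)
    nearest = scan[d.argmin(axis=1)]
    g_data = V - nearest
    # Smoothness term: pull vertices toward the template centroid
    # (a crude stand-in for a mesh Laplacian on real connectivity).
    g_smooth = V - V.mean(axis=0)
    return g_data + lam * g_smooth

def fit_error(V):
    d = np.linalg.norm(V[:, None] - scan[None], axis=-1)
    return d.min(axis=1).mean()

before = fit_error(template)
for _ in range(20):                          # simple gradient descent
    template = template - step * energy_grad(template)
after = fit_error(template)
print(f"mean scan distance: {before:.3f} -> {after:.3f}")
```

Note that the data term trusts the scan unconditionally: if the scan contains noise, hair, or outliers, this energy happily registers to them, which is exactly why errors in raw scan data propagate into the registration.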
Only a few methods exist that, like ours, directly output high-quality registered 3D faces from calibrated multi-view input [8, 13, 14, 21]. While sharing a similar goal, our method goes beyond these approaches in several significant ways. Unlike our method, they require calibrated multi-view video input, contain multiple optimization steps (e.g., building a subject-specific template or anchor-frame meshes), and are computationally slow (e.g., 25 minutes per frame for the coarse mesh reconstruction). ToFu instead takes calibrated multi-view images of a single time instant as input and directly outputs a high-quality mesh in dense vertex correspondence in 0.385 seconds. Nevertheless, our method achieves stable reconstruction and registration results for sequence input.
We introduced a 3D face inference approach from multi-view input images that can produce high-fidelity 3D face meshes with consistent topology using a volumetric sampling approach. We have shown that, given multi-view inputs, implicitly learning a shape variation and deformation field can produce superior results compared to methods that use an underlying 3DMM, even if they refine the resulting inference with an optimization step. We have demonstrated sub-millimeter surface reconstruction accuracy and state-of-the-art correspondence performance, while achieving up to three orders of magnitude of speed improvement over conventional techniques. Most importantly, our approach is fully automated and eliminates the need for data clean-up after MVS, or any parameter tweaking for conventional non-rigid registration techniques. Our experiments also show that the volumetric feature sampling can effectively aggregate features across views at various scales and can also provide salient information for predicting accurate alignment without the need for any manual post-processing. Our next step is to extend our approach to regions beyond the skin, including teeth, tongue, and eyes. We believe that our volumetric digitization framework can handle non-parametric facial surfaces, which could potentially eliminate the need for specialized shaders and models in conventional graphics pipelines. Furthermore, we would like to explore video sequences and investigate ways to ensure temporal coherency in fine-scale surface deformations. Our model is also suitable for articulated non-rigid objects such as human bodies, which motivates us to look into more general shapes and objects such as clothing and hair.