We present a deep learning-based technique to infer high-quality facial reflectance and geometry from a single unconstrained image of the subject, which may contain partial occlusions and arbitrary illumination conditions. The reconstructed high-resolution textures, which are generated in only a few seconds, include high-resolution skin surface reflectance maps, representing both the diffuse and specular albedo, and medium- and high-frequency displacement maps, thereby allowing us to render compelling digital avatars under novel lighting conditions. To extract this data, we train our deep neural networks with a high-quality skin reflectance and geometry database created with a state-of-the-art multi-view photometric stereo system using polarized gradient illumination. Given the raw facial texture map extracted from the input image, our neural networks synthesize complete reflectance and displacement maps, filling in regions that are missing due to occlusions. Because our network architecture propagates texture features from the visible region, the completed textures exhibit consistent quality throughout the face, with high-fidelity details matching those in the visible regions. We describe how this highly underconstrained problem is made tractable by dividing the full inference into smaller tasks, each addressed by a dedicated neural network. We demonstrate the effectiveness of our network design with robust texture completion from images of faces that are largely occluded. With the inferred reflectance and geometry data, we demonstrate the rendering of high-fidelity 3D avatars from a variety of subjects captured under different lighting conditions. In addition, we perform evaluations demonstrating that our method can infer plausible facial reflectance and geometric details comparable to those obtained from high-end capture devices, and that it outperforms alternative approaches that require only a single unconstrained input image.
Our system pipeline is illustrated in Fig. 2. Given a single input image captured under unconstrained conditions, we begin by extracting the base mesh of the face and the corresponding texture map, obtained by projecting the face in the input image onto this mesh. This map is passed through two convolutional neural networks (CNNs) that infer the corresponding reflectance and displacement maps (Sec. 5). The first network infers the diffuse albedo map, while the second infers the specular albedo as well as the mid- and high-frequency displacement maps. However, these maps may contain large missing regions due to occlusions in the input image. In the next stage, we perform texture completion and refinement to fill these regions with content that is consistent with the visible regions (Sec. 6). Finally, we perform super-resolution to increase the pixel resolution of the completed textures from 512 × 512 to 2048 × 2048. The resulting textures contain natural, high-fidelity details and can be used with the base mesh to render high-fidelity avatars in novel lighting environments. To obtain high-quality results, we found it essential to divide the inference and completion process into these smaller objectives, making the training process more tractable. Using a single network that performs both texture completion and detail refinement on all of the desired output data (reflectance and geometry maps) produces significantly worse results than our described approach, in which the problem is decomposed into separate stages addressed by networks trained for more specific tasks, and in which the diffuse albedo is generated by a separate network from the one that generates the remaining output data.
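The following is a minimal sketch of how this staged inference could be organized in code. The network modules (diffuse_net, spec_disp_net, completion_net, superres_net) are hypothetical placeholders standing in for the networks described in Sec. 5 and Sec. 6, not the authors' released implementation; only the stage ordering follows the pipeline above.

```python
# Hypothetical sketch of the staged inference pipeline described above (PyTorch).
# All network modules are placeholders; only the stage decomposition is from the text.
import torch


class FacialInferencePipeline:
    def __init__(self, diffuse_net, spec_disp_net, completion_net, superres_net):
        self.diffuse_net = diffuse_net        # infers the diffuse albedo map (Sec. 5)
        self.spec_disp_net = spec_disp_net    # infers specular albedo + mid/high-freq displacement
        self.completion_net = completion_net  # fills occluded regions (Sec. 6)
        self.superres_net = superres_net      # super-resolves 512x512 -> 2048x2048

    @torch.no_grad()
    def __call__(self, partial_texture, visibility_mask):
        # Stage 1: per-map inference from the (possibly incomplete) facial texture.
        diffuse = self.diffuse_net(partial_texture)
        specular, disp_mid, disp_high = self.spec_disp_net(partial_texture)

        # Stage 2: texture completion/refinement, conditioned on the visible region.
        maps = torch.cat([diffuse, specular, disp_mid, disp_high], dim=1)
        completed = self.completion_net(maps, visibility_mask)

        # Stage 3: increase pixel resolution of the completed maps.
        return self.superres_net(completed)
```

Keeping each stage behind a separate module mirrors the design choice argued above: each network is trained for a narrower task than a single monolithic inference-and-completion network.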
All our results are rendered with brute-force path tracing in Solid Angle's Arnold renderer [Solid Angle 2016], using physically based specular reflection and subsurface scattering under high-dynamic-range image-based illumination. The resulting surface and subsurface reflectance, together with the base surface mesh and the displacement maps, are used to produce the final render using a layered skin reflectance model as in [The Digital Human League 2015] (see the supplemental material for more details on the rendering process).
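As a purely illustrative aside (not the authors' Arnold setup), the sketch below shows how a combined mid- and high-frequency displacement map could be applied to the base mesh along vertex normals, which is conceptually what a renderer's displacement stage does before shading; the function and its nearest-neighbour UV sampling are assumptions for brevity.

```python
# Illustrative displacement of a base mesh along vertex normals (not the paper's renderer code).
import numpy as np


def displace_mesh(vertices, normals, uvs, displacement_map, scale=1.0):
    """vertices, normals: (N, 3); uvs: (N, 2) in [0, 1]; displacement_map: (H, W) scalar map."""
    h, w = displacement_map.shape
    # Sample the displacement map at each vertex's UV coordinate (nearest neighbour
    # for brevity; a production renderer would subdivide and filter the map instead).
    px = np.clip((uvs[:, 0] * (w - 1)).astype(int), 0, w - 1)
    py = np.clip(((1.0 - uvs[:, 1]) * (h - 1)).astype(int), 0, h - 1)
    d = displacement_map[py, px]
    # Offset each vertex along its normal by the sampled displacement.
    return vertices + scale * d[:, None] * normals
```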
Evaluation. We quantitatively measure the ability of our system to faithfully recover the reflectance and geometry data from a set of 100 test images for which we have the corresponding ground-truth measurements.
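A minimal sketch of this kind of quantitative comparison is given below, assuming the inferred and ground-truth maps are available as same-sized float arrays; the specific error metric (per-map RMSE) and the helper names are assumptions for illustration, not the paper's stated protocol.

```python
# Hypothetical evaluation sketch: aggregate per-map RMSE over a test set.
import numpy as np


def rmse(predicted, ground_truth):
    """Root-mean-square error between two same-shaped float arrays."""
    return float(np.sqrt(np.mean((predicted - ground_truth) ** 2)))


def evaluate(test_pairs):
    """test_pairs: iterable of (predicted_map, ground_truth_map) array pairs."""
    errors = [rmse(pred, gt) for pred, gt in test_pairs]
    return float(np.mean(errors)), float(np.std(errors))
```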