Photorealistic Facial Texture Inference Using Deep Neural Networks
Shunsuke Saito    Lingyu Wei    Liwen Hu    Koki Nagano    Hao Li
Pinscreen    University of Southern California    USC Institute for Creative Technologies

Figure 1: We present an inference framework based on deep neural networks for synthesizing photorealistic facial texture along with 3D geometry from a single unconstrained image. We can successfully digitize historical figures who are no longer available for scanning and produce high-fidelity facial texture maps with mesoscopic skin details.
Abstract

We present a data-driven inference method that can synthesize a photorealistic texture map of a complete 3D face model given a partial 2D view of a person in the wild. After an initial estimation of shape and low-frequency albedo, we compute a high-frequency partial texture map, without the shading component, of the visible face area. To extract the fine appearance details from this incomplete input, we introduce a multi-scale detail analysis technique based on mid-layer feature correlations extracted from a deep convolutional neural network. We demonstrate that fitting a convex combination of feature correlations from a high-resolution face database can yield a semantically plausible facial detail description of the entire face. A complete and photorealistic texture map can then be synthesized by iteratively optimizing for the reconstructed feature correlations. Using these high-resolution textures and a commercial rendering framework, we can produce high-fidelity 3D renderings that are visually comparable to those obtained with state-of-the-art multi-view face capture systems. We demonstrate successful face reconstructions from a wide range of low-resolution input images, including those of historical figures. In addition to extensive evaluations, we validate the realism of our results using a crowdsourced user study.
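The "feature correlations" above are, in the standard usage this line of work builds on, Gram matrices of CNN activations. The following is a minimal sketch, not the paper's implementation: it assumes PyTorch/torchvision, and the specific choice of five VGG-19 mid-layers (the relu*_1 activations) is our assumption.

import torch
import torchvision.models as models

# Hypothetical layer choice: relu1_1, relu2_1, relu3_1, relu4_1, relu5_1
# in torchvision's VGG-19 feature stack; the paper does not list indices here.
MID_LAYERS = [1, 6, 11, 20, 29]

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # the network is a fixed feature extractor

def gram_matrices(image):
    # image: (1, 3, H, W) float tensor, ImageNet-normalized.
    grams, x = [], image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in MID_LAYERS:
            _, c, h, w = x.shape
            f = x.view(c, h * w)               # flatten spatial dimensions
            grams.append(f @ f.t() / (h * w))  # C x C feature correlations
        if i >= max(MID_LAYERS):
            break                              # deeper layers are unused
    return grams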

Introduction

Until recently, the digitization of photorealistic faces has only been possible in professional studio settings, typically involving sophisticated appearance measurement devices and carefully controlled lighting conditions. While such a complex acquisition process is acceptable for production purposes, the ability to build high-end 3D face models from a single unconstrained image could widely impact new forms of immersive communication, education, and consumer applications. With virtual and augmented reality becoming the next generation platform for social interaction, compelling 3D avatars could be generated with minimal effort and puppeteered through facial performances. Within the context of cultural heritage, iconic and historical personalities could be restored to life in captivating 3D digital forms from archival photographs.

Results

We processed a wide variety of input images with subjects of different races, ages, and genders, including celebrities and people from the publicly available Annotated Faces in-the-Wild (AFW) dataset. We cover challenging examples of scenes with complex illumination as well as non-frontal faces. Our inference technique produces high-resolution texture maps with complex skin tones and mesoscopic-scale details (pores, stubble hair), even from very low-resolution input images. Consequently, we can effortlessly produce high-fidelity digitizations of iconic personalities who have passed away, such as Muhammad Ali, or bring back their younger selves (e.g., young Hillary Clinton) from a single archival photograph. Until recently, such results were only possible with high-end capture devices or intensive effort from digital artists. We also show photorealistic renderings of our reconstructed face models from the widely used AFW database, which reveal high-frequency pore structures, skin moles, and short facial hair. We clearly observe that low-frequency albedo maps obtained from a linear PCA model cannot capture these details. For the renderings, we use a Monte Carlo ray-tracer with generic subsurface scattering, image-based lighting, procedural roughness and specularity, and a bump map derived from the synthesized texture.
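How the bump map is "derived from the synthesized texture" is not specified in this section; a common construction, and the assumption behind this sketch, is a high-pass filter of the texture's luminance (the helper name and parameters are hypothetical).

import numpy as np
from scipy.ndimage import gaussian_filter

def bump_map_from_texture(texture, sigma=4.0):
    # texture: (H, W, 3) RGB array in [0, 1].
    # Assumption: mesoscopic relief correlates with high-frequency
    # luminance variation, so we keep only the high-pass residual.
    luma = texture @ np.array([0.299, 0.587, 0.114])  # Rec. 601 luminance
    high_pass = luma - gaussian_filter(luma, sigma)   # drop low frequencies
    scale = max(np.abs(high_pass).max(), 1e-8)
    return 0.5 + 0.5 * high_pass / scale              # center on mid-gray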



Figure 2: Photorealistic renderings comparing the bare geometry, the texture obtained using PCA model fitting, and our method.


Figure 3: The number of mid-layers used affects the level of detail of our inference.
Evaluation

We evaluate the performance of our texture synthesis with three widely used convolutional neural networks for image recognition (CaffeNet, VGG-16, and VGG-19) [5, 48]. While different models can be used, deeper architectures tend to produce fewer artifacts and higher-quality textures. To validate our use of all five mid-layers of VGG-19 for the multi-scale representation of details, we show that the synthesized textures become blurrier when fewer layers are used (see Figure 3). While the texture synthesis formulation in Equation 3 suggests a blend between the low-frequency albedo and the multi-scale facial details, we aim to maximize the amount of detail and therefore use the low-frequency PCA model estimation only for initialization.
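The iterative optimization described above can be sketched as follows, reusing the gram_matrices helper from the earlier sketch. The use of the PCA albedo only as initialization follows the text; the optimizer choice and the unweighted loss are our assumptions, not the paper's exact Equation 3.

import torch

def synthesize_texture(albedo_lf, target_grams, steps=200):
    # albedo_lf: (1, 3, H, W) low-frequency PCA albedo, used only to
    # initialize the texture. target_grams: reconstructed feature
    # correlations from the convex combination fitting.
    texture = albedo_lf.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([texture])

    def closure():
        opt.zero_grad()
        loss = sum(((g - t) ** 2).sum()
                   for g, t in zip(gram_matrices(texture), target_grams))
        loss.backward()
        return loss

    for _ in range(steps):
        opt.step(closure)
    return texture.detach()

Penalizing only the feature-correlation mismatch while starting from the PCA estimate keeps the low-frequency skin tone intact and lets the optimization introduce mesoscopic detail on top of it.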
