We present a learning-based approach for synthesizing facial geometry at medium and fine scales from diffusely-lit facial texture maps. When applied to an image sequence, the synthesized detail is temporally coherent. Unlike current state-of-the-art methods, which assume "dark is deep", our model is trained with measured facial detail collected using polarized gradient illumination in a Light Stage. This enables us to produce plausible facial detail across the entire face, including in regions where previous approaches may incorrectly interpret dark features as concavities, such as moles, hair stubble, and occluded pores. Instead of directly inferring 3D geometry, we encode fine details in high-resolution displacement maps, which are learned through a hybrid network that combines a state-of-the-art image-to-image translation network with a super-resolution network. To effectively capture geometric detail at both mid and high frequencies, we factorize the learning into two separate sub-networks, enabling the full range of facial detail to be modeled. Results from our learning-based approach compare favorably with a high-quality active facial scanning technique, and our approach requires only a single passive lighting condition without a complex scanning setup.
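As an illustration of how a displacement map encodes detail on top of a base mesh, the following minimal sketch offsets each vertex along its normal by the value sampled from the map. It is not the pipeline described here: the function name `emboss`, the nearest-texel lookup, the UV convention, and the `scale` parameter are all assumptions made for this example.

```python
def emboss(vertices, normals, uvs, displacement, scale=1.0):
    """Move each vertex along its unit normal by the displacement sampled at its UV.

    vertices, normals: lists of (x, y, z) tuples; uvs: list of (u, v) tuples.
    displacement: 2D list of floats (rows of texels). All names are hypothetical.
    """
    height, width = len(displacement), len(displacement[0])
    embossed = []
    for (x, y, z), (nx, ny, nz), (u, v) in zip(vertices, normals, uvs):
        # Nearest-texel lookup. Assumption: UVs lie in [0, 1], with v = 0 on the first row.
        col = min(width - 1, max(0, round(u * (width - 1))))
        row = min(height - 1, max(0, round(v * (height - 1))))
        d = scale * displacement[row][col]
        embossed.append((x + d * nx, y + d * ny, z + d * nz))
    return embossed
```

A production renderer would typically interpolate the map and subdivide the base mesh so that a 4K map is not undersampled. The sketch only shows the basic relation, where the new position equals the old position plus the displacement times the normal.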
We evaluate the effectiveness of our approach on different
input textures with a variety of subjects and expressions.
We show the synthesized geometries embossed
with medium-scale details only, and with the 1K and 4K combined
multi-scale (both medium- and high-frequency) displacement
maps, with the input textures and base mesh
shown in the first and second columns, respectively. As seen
from the results, our method can faithfully capture both
the medium and fine scale geometries. The final geometry
synthesized using the 4K displacement map exhibits mesoscale
geometry on par with active facial scanning. None of
these subjects were used in training the network, which demonstrates
the robustness of our method to a variety of texture qualities,
expressions, genders, and ages.
We validate the effectiveness of geometry detail separation
by comparing with an alternative solution which does not decouple middle and high frequencies.
The displacement map learned from the alternative
method fails to capture almost all the high frequency
details while introducing artifacts in middle frequencies,
which is manifested in the embossed geometry.
Our method, on the other hand, faithfully replicates both
medium and fine scale details in the resulting displacement map.
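The idea of decoupling a displacement map into a medium-frequency band and a high-frequency residual can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how the two bands are defined, so a simple box blur stands in for the low-pass filter, and the names `box_blur`, `split_bands`, and `combine` are hypothetical.

```python
def box_blur(img, radius):
    """Low-pass filter: mean over a (2*radius+1)^2 window, truncated at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rows = range(max(0, r - radius), min(h, r + radius + 1))
            cols = range(max(0, c - radius), min(w, c + radius + 1))
            window = [img[i][j] for i in rows for j in cols]
            out[r][c] = sum(window) / len(window)
    return out


def split_bands(disp, radius=1):
    """Split a displacement map into a smooth (medium) band and a residual (high) band."""
    mid = box_blur(disp, radius)
    high = [[d - m for d, m in zip(d_row, m_row)] for d_row, m_row in zip(disp, mid)]
    return mid, high


def combine(mid, high):
    """Recombine the two bands; by construction this recovers the original map."""
    return [[m + h for m, h in zip(m_row, h_row)] for m_row, h_row in zip(mid, high)]
```

Because the high band is defined as the residual, `combine(*split_bands(disp))` reproduces `disp` up to floating-point error. Each band can then be treated as a separate regression target, which is the motivation for a two-branch design.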
We also assess the effectiveness of the proposed super-resolution
network in our framework. The result reconstructed using the super-resolution
network significantly outperforms its counterpart in
faithfully replicating mesoscopic facial structures.