Recent advances in single-view 3D hair digitization have made the creation of high-quality CG characters scalable and accessible to end-users, enabling new forms of personalized VR and gaming experiences. To handle the complexity and variety of hair structures, most cutting-edge techniques rely on the successful retrieval of a particular hair model from a comprehensive hair database. Not only are the aforementioned data-driven methods storage intensive, but they are also prone to failure for highly unconstrained input images, complicated hairstyles, and failed face detection. Instead of using a large collection of 3D hair models directly, we propose to represent the manifold of 3D hairstyles implicitly through a compact latent space of a volumetric variational autoencoder (VAE). This deep neural network is trained with volumetric orientation field representations of 3D hair models and can synthesize new hairstyles from a compressed code. To enable end-to-end 3D hair inference, we train an additional embedding network to predict the code in the VAE latent space from any input image. Strand-level hairstyles can then be generated from the predicted volumetric representation. Our fully automatic framework does not require any ad-hoc face fitting, intermediate classification and segmentation, or hairstyle database retrieval. Our hair synthesis approach is significantly more robust and can handle a much wider variation of hairstyles than state-of-the-art data-driven hair modeling techniques with challenging inputs, including photos that are low-resolution, overexposed, or contain extreme head poses. The storage requirements are minimal and a 3D hair model can be produced from an image in a second. Our evaluations also show that successful reconstructions are possible from highly stylized cartoon images, non-human subjects, and pictures taken from behind a person. Our approach is particularly well suited for continuous and plausible hair interpolation between very different hairstyles.
The 3D acquisition of human hair has become an active research
area in computer graphics in order to make the creation of digital
humans more efficient, automated, and cost effective. High-end hair
capture techniques based on specialized hardware [Beeler et al. 2012;
Echevarria et al. 2014; Herrera et al. 2012; Jakob et al. 2009; Luo et al.
2013; Paris et al. 2008; Xu et al. 2014] can already produce high-quality
3D hair models, but can only operate in well-controlled
studio environments. More consumer-friendly techniques, such
as those that only require a single input image [Chai et al. 2015,
2016; Hu et al. 2015, 2017], are becoming increasingly popular and
important as they can facilitate the mass adoption of new 3D avatar-driven
applications, including personalized gaming, communication
in VR [Li et al. 2015; Olszewski et al. 2016; Thies et al. 2018], and
social media apps [FaceUnity 2017; itSeez3D: Avatar SDK 2017;
Myidol 2017; Pinscreen 2017]. Existing single-view hair modeling
methods all rely on a large database containing hundreds of 3D
hairstyles, which is used as a shape prior for further refinement and
to handle the complex variations of possible hairstyles.
This paradigm comes with several fundamental limitations: (1)
the large storage footprints of the hair model database prohibit
their deployment on resource-constrained platforms such as mobile
devices; (2) the search steps are usually slow and difficult to scale
as the database grows to cover increasingly diverse hairstyles; (3)
these techniques also rely on well-conditioned input photographs
and are susceptible to the slightest failures during the image preprocessing
and analysis step, such as failed face detection, incorrect
head pose fitting, or poor hair segmentation. Furthermore, these
data-driven algorithms are based on hand-crafted descriptors and
do not generalize well beyond their designed usage scenarios. They
often fail in practical scenarios, such as those with occluded face/hair,
poor resolution, degraded quality, or artistically stylized input.
To address the above challenges, we propose an end-to-end single-view 3D hair synthesis approach that uses a deep generative model to represent the continuous space of hairstyles. By modeling this space implicitly with a compact generative model, plausible hairstyles can be sampled and interpolated effectively, eliminating the need for a comprehensive database. We
also enable end-to-end training and 3D hairstyle inference from
a single input image by learning deep features from a large set of
unconstrained images.
To effectively model the space of hairstyles, we introduce the use
of volumetric occupancy and flow fields to represent 3D hairstyles
for our generative hair modeling framework. We present a volumetric variant of the variational autoencoder (VAE) [Kingma and Welling 2014] that learns a mapping from a compact latent space to the space of hairstyles, represented as volumetric fields computed from a large database of 3D hairstyles [Hu et al. 2015].
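As a concrete illustration of this design, the sketch below shows how such a volumetric VAE could be structured. The channel count (one occupancy channel plus three flow components), the 64^3 grid resolution, the latent dimension, and the layer configuration are illustrative assumptions and not the exact configuration of our network.

```python
# Minimal sketch of a volumetric VAE over occupancy + flow fields (PyTorch).
# Channel counts, grid resolution (64^3), and latent size are assumptions.
import torch
import torch.nn as nn

class VolumetricVAE(nn.Module):
    def __init__(self, in_channels=4, latent_dim=512):
        super().__init__()
        # Encoder: 64^3 input grid -> 4^3 feature grid.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32^3
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),           # 16^3
            nn.Conv3d(64, 128, 4, stride=2, padding=1), nn.ReLU(),          # 8^3
            nn.Conv3d(128, 256, 4, stride=2, padding=1), nn.ReLU(),         # 4^3
        )
        self.fc_mu = nn.Linear(256 * 4 ** 3, latent_dim)
        self.fc_logvar = nn.Linear(256 * 4 ** 3, latent_dim)
        # Decoder mirrors the encoder with transposed convolutions.
        self.fc_dec = nn.Linear(latent_dim, 256 * 4 ** 3)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, in_channels, 4, stride=2, padding=1),
        )

    def encode(self, x):
        h = self.encoder(x).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        h = self.fc_dec(z).view(-1, 256, 4, 4, 4)
        return self.decoder(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        # Reconstruction error on recon plus this KL term forms the VAE objective.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl
```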
To achieve end-to-end 3D hair inference, we train an additional
hair embedding neural network to predict the code in the learned
VAE latent space from input images. Instead of predicting latent codes directly, we perform Principal Component Analysis (PCA) in the latent space and predict coordinates in the resulting embedding subspace, which yields better generalization. In addition, we apply Iterative Error Feedback (IEF) [Carreira et al. 2016] in our embedding network to further improve generalization.
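The following sketch outlines how such a PCA subspace can be built from the latent codes of the training hairstyles and how an IEF loop could refine the predicted coefficients. The number of components and the regressor interface are illustrative assumptions; the regressor here stands in for the embedding network.

```python
# Sketch of the PCA embedding of VAE latent codes and an iterative
# error-feedback (IEF) prediction loop. Dimensions, the number of principal
# components, and the regressor interface are illustrative assumptions.
import numpy as np

def fit_pca(latent_codes, n_components=256):
    """latent_codes: (N, D) matrix of VAE codes for the training hairstyles."""
    mean = latent_codes.mean(axis=0)
    _, _, vt = np.linalg.svd(latent_codes - mean, full_matrices=False)
    basis = vt[:n_components]           # (n_components, D) principal directions
    return mean, basis

def to_subspace(z, mean, basis):
    return (z - mean) @ basis.T          # PCA coefficients of a latent code

def from_subspace(c, mean, basis):
    return c @ basis + mean              # back to the VAE latent space

def predict_ief(image_features, regressor, mean, basis, n_iter=3):
    """Iteratively refine PCA coefficients instead of regressing them in one shot.

    `regressor(features, current_estimate)` is a hypothetical network that
    outputs a correction to the current estimate, following the IEF idea of
    Carreira et al. [2016].
    """
    c = np.zeros(basis.shape[0])         # start from the mean hairstyle
    for _ in range(n_iter):
        c = c + regressor(image_features, c)
    return from_subspace(c, mean, basis)
```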
We include an ablation study of different algorithmic components
to validate our proposed architecture (Section 4). We show that our
method can synthesize faithful 3D hairstyles from a wide range of input
images with various occlusions, degraded image quality, extreme
lighting conditions, uncommon hairstyles, and significant artistic
abstraction (see Figure 1 and Section 5). We also compare our technique
to the latest algorithm for single-view 3D hair modeling [Chai et al.
2016] and show that our approach is significantly more robust on
challenging input photos. Using our learned generative model, we
further demonstrate that plausible hairstyles can be interpolated
effectively between drastically different ones, while the current
state-of-the-art method [Weng et al. 2013] fails.
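For illustration, such an interpolation can be obtained by blending two latent codes and decoding the result with the generative model. The linear blend below is a simplified sketch (reusing the decoder from the VAE sketch above), not necessarily the exact scheme used in our experiments.

```python
# Sketch of hairstyle interpolation in the learned latent space.
# A simple linear blend between two codes is assumed for illustration.
import torch

def interpolate_hairstyles(vae, z_a, z_b, steps=5):
    """Decode a sequence of volumes blending hairstyle A into hairstyle B."""
    volumes = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b                   # interpolated latent code
        with torch.no_grad():
            volumes.append(vae.decode(z.unsqueeze(0)))  # occupancy + flow volume
    return volumes
```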
Our main contributions are:
• The first end-to-end framework for synthesis of 3D hairstyles
from a single input image, without requiring face detection or hair segmentation. Our approach can handle a wider range of hairstyles and is significantly more robust to challenging input images than existing data-driven techniques.
• A variational autoencoder using a volumetric occupancy and
flow field representation. The corresponding latent space is
compact and models the wide range of possible hairstyles
continuously. Plausible hairstyles can be sampled and interpolated
effectively using this VAE-based generative model,
and converted into a strand-based hair representation.
• A hair embedding network with robust generalization performance
using PCA embedding and an iterative error feedback
technique.
In this section, we describe the entire pipeline of our algorithm for single-view 3D hair modeling (Figure 3). We first explain our hair data representation using volumetric occupancy and flow fields (Section 3.1). Using a dataset of more than two thousand different 3D hairstyles, we train a volumetric variational autoencoder to obtain a compact latent space, which encodes the immense space of plausible 3D hairstyles (Section 3.2). To enable end-to-end single-view 3D hairstyle modeling, we train an additional embedding network to help predict the volumetric representation from an input image (Section 3.3). Finally, we synthesize hair strands by growing them from the scalp of a head model based on the predicted volume. If a face can be detected or manually fitted from the input image, we can optionally refine the output strands to better match the single-view input (Section 3.4).
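As a sketch of the final strand synthesis step, a strand can be grown from a scalp root by repeatedly stepping along the local flow direction while the occupancy field indicates hair. The nearest-voxel sampling, step size, and occupancy threshold below are simplifications assumed for illustration rather than the exact procedure of our implementation.

```python
# Sketch of strand growth from the predicted volume: starting at a scalp
# root, step along the local flow direction while inside the hair volume.
# Nearest-voxel sampling, step size, and threshold are illustrative choices.
import numpy as np

def grow_strand(occupancy, flow, root, step=0.5, max_steps=300, threshold=0.5):
    """occupancy: (X, Y, Z) scalar field; flow: (X, Y, Z, 3) orientation field;
    root: 3D position of a scalp vertex in voxel coordinates."""
    strand = [np.asarray(root, dtype=float)]
    p = strand[0].copy()
    for _ in range(max_steps):
        i, j, k = np.clip(np.round(p).astype(int), 0, np.array(occupancy.shape) - 1)
        if occupancy[i, j, k] < threshold:       # left the hair volume
            break
        direction = flow[i, j, k]
        norm = np.linalg.norm(direction)
        if norm < 1e-6:                          # undefined local orientation
            break
        p = p + step * direction / norm          # advance along the local flow
        strand.append(p.copy())
    return np.stack(strand)
```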
Single-View Hair Modeling. We show single-view 3D hairstyle
modeling results from a variety of input images in Figures 1 and 5.
For each image, we show the predicted occupancy field with color-coded
local orientation as well as synthesized strands with manually
specified color. Note that none of these test images are used to train
our hair embedding network. Our method is end-to-end and does
not require any user interactions such as manually fitting a head
model and drawing guiding strokes. Moreover, several input images
in Figure 5 are particularly challenging, because they are either
over-exposed (the third row), have low contrast between the hair
and the background (the fourth row), have low resolution (the fifth
row and the sixth row), or are illustrated in a cartoon style (the last
two rows). Although our training dataset for the hair embedding
network only consists of examples modeled from normal headshot
photographs without any extreme cases (e.g. poorly illuminated
images or pictures of dogs), our method generalizes very well due
to the robustness of deep image features. A typical face detector
will fail to detect a human face from the third, the fifth and the sixth
input images in Figure 5, which prevents existing automatic hair modeling methods [Hu et al. 2017] from generating any meaningful
results. In Figure 5, only the first image can be handled by the
system proposed by Chai et al. [2016], since their algorithm requires both
successful face detection and high-quality hair segmentation.
In Figure 6, we compare our method to a state-of-the-art automatic
single-view hair modeling technique [Chai et al. 2016] on a
variety of input images. Our results are comparable to those by Chai
et al. [2016] on the less challenging inputs with typical hairstyles (Figure 6(a)-(f) and (l)). For the more challenging cases (Figure 6(g)-(k)), we can generate more faithful modeling output, since the
method of Chai et al. [2016] relies on accurate hair segmentation
which can be difficult to achieve with partial occlusions or less
typical hairstyles.