3D Hair Synthesis Using Volumetric Variational Autoencoders
2018
Shunsuke Saito1,2,3    Liwen Hu2,3    Chongyang Ma4    Hikaru Ibayashi2    Linjie Luo4    Hao Li1,2,3   
USC Institute for Creative Technologies1    University of Southern California2    Pinscreen3    Snap Inc.4   
Abstract

Recent advances in single-view 3D hair digitization have made the creation of high-quality CG characters scalable and accessible to end-users, enabling new forms of personalized VR and gaming experiences. To handle the complexity and variety of hair structures, most cutting-edge techniques rely on the successful retrieval of a particular hair model from a comprehensive hair database. Not only are these data-driven methods storage-intensive, but they are also prone to failure for highly unconstrained input images, complicated hairstyles, and failed face detection. Instead of using a large collection of 3D hair models directly, we propose to represent the manifold of 3D hairstyles implicitly through a compact latent space of a volumetric variational autoencoder (VAE). This deep neural network is trained with volumetric orientation field representations of 3D hair models and can synthesize new hairstyles from a compressed code. To enable end-to-end 3D hair inference, we train an additional embedding network to predict the code in the VAE latent space from any input image. Strand-level hairstyles can then be generated from the predicted volumetric representation. Our fully automatic framework does not require any ad-hoc face fitting, intermediate classification and segmentation, or hairstyle database retrieval. Our hair synthesis approach is significantly more robust than state-of-the-art data-driven hair modeling techniques and can handle a much wider variation of hairstyles, including challenging inputs such as photos that are low-resolution, overexposed, or contain extreme head poses. The storage requirements are minimal, and a 3D hair model can be produced from an image in a second. Our evaluations also show that successful reconstructions are possible from highly stylized cartoon images, non-human subjects, and pictures taken from behind a person.
Our approach is particularly well suited for continuous and plausible hair interpolation between very different hairstyles.

Introduction

The 3D acquisition of human hair has become an active research area in computer graphics in order to make the creation of digital humans more efficient, automated, and cost effective. High-end hair capture techniques based on specialized hardware [Beeler et al. 2012; Echevarria et al. 2014; Herrera et al. 2012; Jakob et al. 2009; Luo et al. 2013; Paris et al. 2008; Xu et al. 2014] can already produce high-quality 3D hair models, but can only operate in well-controlled studio environments. More consumer-friendly techniques, such as those that only require a single input image [Chai et al. 2015, 2016; Hu et al. 2015, 2017], are becoming increasingly popular and important as they can facilitate the mass adoption of new 3D avatar-driven applications, including personalized gaming, communication in VR [Li et al. 2015; Olszewski et al. 2016; Thies et al. 2018], and social media apps [FaceUnity 2017; itSeez3D: Avatar SDK 2017; Myidol 2017; Pinscreen 2017]. Existing single-view hair modeling methods all rely on a large database containing hundreds of 3D hairstyles, which is used as a shape prior for further refinement and to handle the complex variations of possible hairstyles. This paradigm comes with several fundamental limitations: (1) the large storage footprint of the hair model database prohibits deployment on resource-constrained platforms such as mobile devices; (2) the search steps are usually slow and difficult to scale as the database grows to cover an increasing variety of hairstyles; (3) these techniques rely on well-conditioned input photographs and are susceptible to the slightest failures during the image preprocessing and analysis step, such as failed face detection, incorrect head pose fitting, or poor hair segmentation. Furthermore, these data-driven algorithms are based on hand-crafted descriptors and do not generalize well beyond their designed usage scenarios.
They often fail in practical scenarios, such as those with occluded face/hair, poor resolution, degraded quality, or artistically stylized input. To address the above challenges, we propose an end-to-end single-view 3D hair synthesis approach based on a deep generative model that represents the continuous space of hairstyles. Because this compact model allows plausible hairstyles to be effectively sampled and interpolated, it eliminates the need for a comprehensive database. We also enable end-to-end training and 3D hairstyle inference from a single input image by learning deep features from a large set of unconstrained images.

To effectively model the space of hairstyles, we introduce the use of volumetric occupancy and flow fields to represent 3D hairstyles in our generative hair modeling framework. We present a variant of the volumetric variational autoencoder (VAE) [Kingma and Welling 2014] to learn the mapping from a compact latent space to the space of hairstyles, represented volumetrically and derived from a large database of hairstyles [Hu et al. 2015]. To achieve end-to-end 3D hair inference, we train an additional hair embedding neural network to predict the code in the learned VAE latent space from input images. Instead of predicting the latent code directly, we perform Principal Component Analysis (PCA) in the latent space to obtain an embedding subspace; predicting coefficients in this subspace achieves better generalization. In addition, we apply Iterative Error Feedback (IEF) [Carreira et al. 2016] to our embedding network to further improve generalization. We include an ablation study of different algorithmic components to validate our proposed architecture (Section 4). We show that our method can synthesize faithful 3D hairstyles from a wide range of input images with various occlusions, degraded image quality, extreme lighting conditions, uncommon hairstyles, and significant artistic abstraction (see Figure 1 and Section 5). We also compare our technique to the latest algorithm for single-view 3D hair modeling [Chai et al. 2016] and show that our approach is significantly more robust on challenging input photos. Using our learned generative model, we further demonstrate that plausible hairstyles can be interpolated effectively between drastically different ones, while the current state-of-the-art method [Weng et al. 2013] fails.
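The PCA embedding and iterative error feedback steps described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the function names are our own, and a generic callable stands in for the deep embedding network that predicts corrections from image features.

```python
import numpy as np

def fit_pca(latent_codes, n_components):
    """Fit a PCA subspace to a matrix of VAE latent codes (one code per row)."""
    mean = latent_codes.mean(axis=0)
    centered = latent_codes - mean
    # Right singular vectors of the centered data are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]            # (d,), (k, d)

def to_subspace(z, mean, basis):
    """Project a latent code onto the PCA coefficients."""
    return (z - mean) @ basis.T

def from_subspace(w, mean, basis):
    """Reconstruct a latent code from PCA coefficients."""
    return w @ basis + mean

def iterative_error_feedback(image_feat, regressor, mean, basis, steps=3):
    """IEF: start from the subspace origin (the mean hairstyle) and repeatedly
    predict a correction conditioned on the image and the current estimate."""
    w = np.zeros(basis.shape[0])
    for _ in range(steps):
        w = w + regressor(image_feat, w)      # regressor returns a delta in the subspace
    return from_subspace(w, mean, basis)
```

In the paper the correction regressor is a deep network operating on learned image features; here any callable with the same signature can be plugged in to exercise the loop.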

Our main contributions are:

• The first end-to-end framework for synthesis of 3D hairstyles from a single input image without requirement of face detection or hair segmentation. Our approach can handle a wider range of hairstyles and is significantly more robust for challenging input images than existing data-driven techniques.

• A variational autoencoder using a volumetric occupancy and flow field representation. The corresponding latent space is compact and models the wide range of possible hairstyles continuously. Plausible hairstyles can be sampled and interpolated effectively using this VAE-based generative model, and converted into a strand-based hair representation.

• A hair embedding network with robust generalization performance using PCA embedding and an iterative error feedback technique.
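The sampling and interpolation behavior claimed in the second contribution rests on two standard VAE ingredients: the reparameterization trick with a KL regularizer during training, which keeps the latent space compact and smooth, and linear blending of latent codes at test time. The snippet below is a schematic numpy sketch of those operations under standard VAE assumptions, not the paper's network.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) via z = mu + sigma * eps, keeping the
    sampling step differentiable with respect to mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)): the regularizer that pulls encoded
    hairstyles toward a shared Gaussian prior, making interpolation plausible."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def interpolate_codes(z_a, z_b, t):
    """Blend two encoded hairstyles; decoding the blend yields an
    intermediate hairstyle because the latent space is continuous."""
    return (1.0 - t) * z_a + t * z_b
```

A decoder applied to `interpolate_codes(z_a, z_b, 0.5)` would produce the halfway hairstyle between two encoded inputs.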

Method

In this section, we describe the entire pipeline of our algorithm for single-view 3D hair modeling (Figure 3). We first explain our hair data representation using volumetric occupancy and flow fields (Section 3.1). Using a dataset of more than two thousand different 3D hairstyles, we train a volumetric variational autoencoder to obtain a compact latent space, which encodes the immense space of plausible 3D hairstyles (Section 3.2). To enable end-to-end single-view 3D hairstyle modeling, we train an additional embedding network to predict the volumetric representation from an input image (Section 3.3). Finally, we synthesize hair strands by growing them from the scalp of a head model based on the predicted volume. If a face can be detected or manually fitted from the input image, we can optionally refine the output strands to better match the single-view input (Section 3.4).
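As a concrete illustration of the volumetric representation in Section 3.1, the sketch below rasterizes polyline strands into a binary occupancy grid plus a per-voxel orientation (flow) field. The grid resolution, the strand format (lists of 3D points in the unit cube), and the simple nearest-voxel rasterization are our assumptions for illustration; the paper's actual discretization may differ.

```python
import numpy as np

def strands_to_volume(strands, res=32):
    """Convert hair strands (each a list of 3D points in [0, 1]^3) into a
    binary occupancy grid and a per-voxel average unit orientation field."""
    occ = np.zeros((res, res, res), dtype=bool)
    flow = np.zeros((res, res, res, 3))
    count = np.zeros((res, res, res))
    for strand in strands:
        pts = np.asarray(strand, dtype=float)
        dirs = np.diff(pts, axis=0)                   # tangent of each segment
        for p, d in zip(pts[:-1], dirs):
            n = np.linalg.norm(d)
            if n == 0.0:
                continue                              # skip degenerate segments
            i, j, k = np.minimum((p * res).astype(int), res - 1)
            occ[i, j, k] = True
            flow[i, j, k] += d / n                    # accumulate unit tangents
            count[i, j, k] += 1
    mask = count > 0
    flow[mask] /= count[mask][:, None]                # average direction per voxel
    return occ, flow
```

Strand synthesis (Section 3.4) runs this mapping in reverse: strands are grown from the scalp by following the flow field through occupied voxels.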

Results

Single-View Hair Modeling. We show single-view 3D hairstyle modeling results from a variety of input images in Figures 1 and 5. For each image, we show the predicted occupancy field with color-coded local orientation as well as synthesized strands with manually specified color. Note that none of these test images are used to train our hair embedding network. Our method is end-to-end and does not require any user interaction such as manually fitting a head model or drawing guiding strokes. Moreover, several input images in Figure 5 are particularly challenging, because they are either over-exposed (the third row), have low contrast between the hair and the background (the fourth row), have low resolution (the fifth and sixth rows), or are illustrated in a cartoon style (the last two rows). Although our training dataset for the hair embedding network only consists of examples modeled from normal headshot photographs without any extreme cases (e.g., poorly illuminated images or pictures of dogs), our method generalizes very well due to the robustness of deep image features. A typical face detector will fail to detect a human face in the third, fifth, and sixth input images in Figure 5, which prevents existing automatic hair modeling methods [Hu et al. 2017] from generating any meaningful results. In Figure 5, only the first image can be handled by the system proposed by Chai et al. [2016], since their algorithm requires both successful face detection and high-quality hair segmentation. In Figure 6, we compare our method to a state-of-the-art automatic single-view hair modeling technique [Chai et al. 2016] on a variety of input images. Our results are comparable to those by Chai et al. [2016] on the less challenging inputs with typical hairstyles (Figure 6(a)-(f) and (l)). For the more challenging cases (Figure 6(g)-(k)), we can generate more faithful modeling output, since the method of Chai et al. [2016] relies on accurate hair segmentation, which can be difficult to achieve with partial occlusions or less typical hairstyles.