SiCloPe: Silhouette-Based Clothed People
arXiv 2019
Ryota Natsume1,3    Shunsuke Saito1,2    Zeng Huang1,2    Weikai Chen1    Chongyang Ma4   
Hao Li1,2,5    Shigeo Morishima3   
USC Institute for Creative Technologies1    University of Southern California2   
Waseda University3    Snap Inc.4    Pinscreen5   

We introduce a new silhouette-based representation for modeling clothed human bodies using deep generative models. Our method can reconstruct a complete and textured 3D model of a person wearing clothes from a single input picture. Inspired by the visual hull algorithm, our implicit representation uses 2D silhouettes and 3D joints of a body pose to describe the immense shape complexity and variations of clothed people. Given a segmented 2D silhouette of a person and its inferred 3D joints from the input picture, we first synthesize consistent silhouettes from novel viewpoints around the subject. The synthesized silhouettes that are most consistent with the input segmentation are fed into a deep visual hull algorithm for robust 3D shape prediction. We then infer the texture of the subject's back view using the frontal image and segmentation mask as input to a conditional generative adversarial network. Our experiments demonstrate that our silhouette-based model is an effective representation and that the appearance of the back view can be predicted reliably using an image-to-image translation network. While classic methods based on parametric models often fail for single-view images of subjects with challenging clothing, our approach can still produce successful results, which are comparable to those obtained from multi-view input.

Our 3D reconstruction results of clothed human bodies using test images from the DeepFashion dataset.
Single-View Reconstruction

To reduce the immense solution space of human body shapes, several 3D body model repositories, e.g., SCAPE and SMPL, have been introduced, which have made the single-view reconstruction of human bodies more tractable. In particular, a 3D parametric model is built from such a database, and its pose and shape parameters are optimized to best match an input image. As the mapping between the body geometry and the parameters of the deformable model is highly non-linear, alternative approaches based on deep learning have become increasingly popular. The seminal work of Dibra et al. introduces deep neural networks to estimate the shape parameters from a single input silhouette. More recent works predict body parameters of the popular SMPL model by minimizing either a silhouette matching error, a joint error based on the silhouette and 2D joints, or an adversarial loss that can distinguish unrealistic reconstruction output. Concurrent to our work, Weng et al. present a method to animate a person in 3D from a single image based on the SMPL model and 2D warping.

Deep Visual Hull Prediction

Although our silhouette synthesis algorithm generates sharp predictions of novel-view silhouettes, the estimated results may not be perfectly consistent across views, as the conditioned 3D joints may fail to fully disambiguate the details in the corresponding silhouettes (e.g., fingers, wrinkles of garments). Therefore, naively applying conventional visual hull algorithms is prone to excessive erosion in the reconstruction, since the visual hull carves away any region that is inconsistent with even a single silhouette. To address this issue, we propose a deep visual hull network that reconstructs a plausible 3D shape of a clothed body without requiring perfectly view-consistent silhouettes, by leveraging the shape prior of clothed human bodies.
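The erosion behavior of classic silhouette carving can be seen in a few lines of numpy. The sketch below is a toy setup, not the paper's pipeline: a sphere is carved from orthographic silhouettes taken at several view angles, and shrinking a single silhouette (an "inconsistent" view) visibly erodes the recovered volume, since a voxel survives only if it projects inside every mask.

```python
# Toy visual-hull carving with orthographic cameras rotated about the z-axis.
# Illustrative only: names, resolutions, and the sphere test shape are made up.
import numpy as np

N = 48  # voxel grid resolution
ax = np.linspace(-1.0, 1.0, N)
X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

res = 64  # silhouette image resolution
angles = np.deg2rad([0, 45, 90, 135])

def silhouette_disk(radius, res):
    # The orthographic silhouette of a centered sphere is a disk from any view.
    u = np.linspace(-1, 1, res)
    U, V = np.meshgrid(u, u, indexing="ij")
    return (U**2 + V**2) <= radius**2

def inside(mask, u, v):
    # Map continuous image coords in [-1, 1] to pixel indices and look up the mask.
    i = np.clip(np.rint((u + 1) * 0.5 * (res - 1)).astype(int), 0, res - 1)
    j = np.clip(np.rint((v + 1) * 0.5 * (res - 1)).astype(int), 0, res - 1)
    return mask[i, j]

def carve(silhouettes):
    keep = np.ones(len(pts), dtype=bool)
    for theta, mask in zip(angles, silhouettes):
        # Image axes of a camera rotated by theta about z: u in the xy-plane, v = z.
        u = np.cos(theta) * pts[:, 0] + np.sin(theta) * pts[:, 1]
        v = pts[:, 2]
        keep &= inside(mask, u, v)  # hard intersection over ALL views
    return int(keep.sum())

good = [silhouette_disk(0.8, res) for _ in angles]
vol_good = carve(good)

bad = list(good)
bad[2] = silhouette_disk(0.6, res)  # one inconsistent (too-small) silhouette
vol_bad = carve(bad)

print(vol_good, vol_bad)  # the single bad view shrinks the entire hull
```

Because the intersection is a hard logical AND, no set of correct views can undo the damage done by one overly tight mask; this is exactly the failure mode the learned deep visual hull is meant to absorb.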

In particular, we use a network structure based on "Deep volumetric video from very sparse multi-view performance capture". At a high level, Huang et al. propose to map 2D images to a 3D volumetric field through a multi-view convolutional neural network. The 3D field encodes the probabilistic distribution of 3D points on the captured surface. By querying the resulting field, one can instantiate the geometry of a clothed human body at an arbitrary resolution. However, unlike their approach, which takes carefully calibrated color images from fixed views as input, our network only consumes the probability maps of novel-view silhouettes, which can be inconsistent across different views. Although an arbitrary number of novel-view silhouettes can be generated, it remains challenging to properly select optimal input views to maximize the network performance. Therefore, we introduce several improvements to increase the reconstruction accuracy.
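The resolution-independent querying described above can be sketched as follows. Here the learned multi-view network is replaced by a hand-written toy occupancy field (a soft sphere), so every name and constant is an assumption for illustration; the point is only the mechanism: the same continuous field can be sampled on grids of any resolution, and the thresholded occupancy converges to the same shape.

```python
# Querying a continuous probabilistic occupancy field at arbitrary resolution.
# The "network" here is a hand-written stand-in, not the paper's model.
import numpy as np

def occupancy_prob(p):
    """Stand-in for the learned field: probability that point p is inside the surface.

    A sigmoid over the signed distance to a sphere of radius 0.7 (toy choice).
    """
    d = np.linalg.norm(p, axis=-1)
    return 1.0 / (1.0 + np.exp(25.0 * (d - 0.7)))

def query_grid(res):
    """Evaluate the field on a res^3 grid spanning [-1, 1]^3."""
    ax = np.linspace(-1.0, 1.0, res)
    X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    return occupancy_prob(pts).reshape(res, res, res)

# Coarse grids suffice for previews; fine grids feed surface extraction
# (e.g., marching cubes). The occupied fraction should approach the true
# volume ratio of the sphere, (4/3)*pi*0.7^3 / 8 ~= 0.18, as res grows.
fractions = {res: float((query_grid(res) > 0.5).mean()) for res in (16, 32, 64)}
print(fractions)
```

In practice one would pass the thresholded (or raw) field at the finest resolution to an isosurface extractor such as marching cubes to obtain the final mesh.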

