We propose a deep learning approach for finding dense correspondences between 3D scans of people. Our method requires only partial geometric information in the form of two depth maps or partial reconstructed surfaces, works for humans in arbitrary poses and wearing any clothing, does not require the two people to be scanned from similar viewpoints, and runs in real time. We use a deep convolutional neural network to train a feature descriptor on depth map pixels, but crucially, rather than training the network to solve the shape correspondence problem directly, we train it to solve a body region classification problem, modified to increase the smoothness of the learned descriptors near region boundaries. This approach ensures that nearby points on the human body are nearby in feature space, and vice versa, rendering the feature descriptor suitable for computing dense correspondences between the scans. We validate our method on real and synthetic data for both clothed and unclothed humans, and show that our correspondences are more robust than is possible with state-of-the-art unsupervised methods, and more accurate than those found using methods that require full watertight 3D geometry.
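To make this concrete, the following PyTorch sketch shows one way such a network could be set up: a fully convolutional network is trained with a per-pixel body-region classification loss, and the activations feeding the classifier are taken as the per-pixel descriptor. The layer sizes, descriptor dimensionality, region count, and the name DepthDescriptorNet are illustrative assumptions, not the exact architecture used here.

```python
import torch
import torch.nn as nn

class DepthDescriptorNet(nn.Module):
    """Illustrative per-pixel classifier; the real architecture is not specified here."""
    def __init__(self, num_regions=500, feat_dim=16):
        super().__init__()
        # Fully convolutional encoder over a single-channel depth map (assumed design).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 1),        # per-pixel feature descriptor
        )
        # Linear classifier mapping descriptors to body-region labels.
        self.classifier = nn.Conv2d(feat_dim, num_regions, 1)

    def forward(self, depth):
        feat = self.encoder(depth)              # B x feat_dim x H x W
        logits = self.classifier(feat)          # B x num_regions x H x W
        return feat, logits

# Training minimizes a per-pixel classification loss; at test time only
# `feat` is used, so descriptors of nearby body points remain close.
net = DepthDescriptorNet()
depth = torch.randn(2, 1, 128, 128)             # synthetic stand-in for depth maps
labels = torch.randint(0, 500, (2, 128, 128))   # per-pixel body-region labels
_, logits = net(depth)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```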
The computation of correspondences between geometric shapes is a fundamental building block
for many important tasks in 3D computer vision, such as reconstruction, tracking, analysis,
and recognition. Temporally-coherent sequences of partial scans of an object can be aligned
by first finding corresponding points in overlapping regions, then recovering the motion by
tracking surface points through a sequence of 3D data; semantics can be extracted by fitting
a 3D template model to an unstructured input scan. With the popularization of commodity 3D
scanners and recent advances in correspondence algorithms for deformable shapes, human bodies
can now be easily digitized and their performances captured using a single RGB-D sensor.
Most techniques are based on robust non-rigid surface registration methods that can handle
complex skin and cloth deformations, as well as large regions of missing data due to occlusions.
Because geometric features can be ambiguous and difficult to identify and match, the success of
these techniques generally relies on the deformation between source and target shapes being
reasonably small, with sufficient overlap. While local shape descriptors can be used to determine
correspondences between surfaces that are far apart, they are typically sparse and prone to
false matches, which require manual clean-up. Dense correspondences between shapes with larger
deformations can be obtained reliably using statistical models of human shapes, but the subject
has to be naked. For clothed bodies, the automatic computation of dense mappings has been
demonstrated on full surfaces with significant shape variations, but is limited to compatible
or zero-genus surface topologies. Consequently, an automated method for estimating accurate
dense correspondences between partial shapes, such as scans from a single RGB-D camera,
under arbitrarily large deformations has not yet been proposed.
We evaluate our method extensively on various real and synthetic datasets, with naked and clothed subjects, as well as full and partial matching for challenging examples as illustrated in Figure 5. The real capture examples (last column) are obtained using a Kinect One (v2) RGB-D sensor and demonstrate the effectiveness of our method in real-life scenarios. Each partial scan is a single depth map frame of 512 × 424 pixels, and the full template model is obtained using a non-rigid 3D reconstruction algorithm. The examples include complex poses (side views and bent postures), challenging garments (dresses and vests), and props (backpacks and hats).
We perform all our experiments on a 6-core Intel Core i7-5930K processor at 3.9 GHz with 16GB RAM. Both offline training and online correspondence computation run on an NVIDIA GeForce TITAN X (12GB GDDR5) GPU. While the complete training of our neural network takes about 250 hours of computation, the extraction of all feature descriptors never exceeds 1 ms per depth map. The subsequent correspondence computation with these feature descriptors takes between 0.5 and 1 s, depending on the resolution of the input data.
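As a rough illustration of the online matching step, the sketch below performs a brute-force nearest-neighbour search in descriptor space; this is an assumed matching strategy for exposition, not necessarily the exact procedure used in our system.

```python
import numpy as np

def dense_correspondences(desc_src, desc_tgt):
    """Match each source pixel descriptor to its nearest neighbour in the target.

    desc_src: (N, D) descriptors of valid pixels in the source depth map.
    desc_tgt: (M, D) descriptors of valid pixels in the target depth map.
    Returns an index array of length N mapping source pixels to target pixels.
    """
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = ((desc_src ** 2).sum(1)[:, None]
          - 2.0 * desc_src @ desc_tgt.T
          + (desc_tgt ** 2).sum(1)[None, :])
    return d2.argmin(axis=1)

# Example with random descriptors standing in for network outputs.
src = np.random.rand(1000, 16).astype(np.float32)
tgt = np.random.rand(1200, 16).astype(np.float32)
matches = dense_correspondences(src, tgt)
```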
We have shown that a deep learning framework can be particularly effective at establishing accurate and dense correspondences between partial scans of clothed subjects in arbitrary poses. The key insight is that a smooth embedding needs to be learned to reduce misclassification artifacts at segmentation boundaries when using traditional classification networks. We have shown that a loss function based on the integration of multiple random segmentations can be used to enforce smoothness. This segmentation scheme also significantly decreases the amount of training data needed, as it eliminates an exhaustive pairwise distance computation between feature descriptors during training, as opposed to methods that operate on pairs or triplets of samples. Compared to existing classification networks, we also present the first framework that unifies the treatment of naked human body shapes and clothed subjects. In addition to its remarkable efficiency, our approach can handle both full models and partial scans, such as depth maps captured from a single view. While not as general as some state-of-the-art shape matching methods, our technique significantly outperforms them for partial input shapes that are human bodies with clothing.
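A schematic of the multiple-random-segmentation loss is sketched below: the same per-pixel descriptor feeds several classification heads, each trained against a different random segmentation of the body, and the summed per-pixel loss encourages descriptors to vary smoothly across any single segmentation's boundaries. The number of segmentations, the number of regions per segmentation, and the use of independent linear heads sharing one descriptor are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_segmentations, regions_per_seg = 16, 10, 100

# One classification head per random segmentation, all sharing the same descriptor (assumed design).
heads = nn.ModuleList(nn.Conv2d(feat_dim, regions_per_seg, 1)
                      for _ in range(num_segmentations))

def multi_segmentation_loss(feat, label_maps):
    """feat: B x feat_dim x H x W descriptors.
    label_maps: list of B x H x W label tensors, one per random segmentation.
    Returns the summed per-pixel classification loss over all segmentations."""
    return sum(F.cross_entropy(head(feat), labels)
               for head, labels in zip(heads, label_maps))

# Toy example with random descriptors and random segmentation labels.
feat = torch.randn(2, feat_dim, 64, 64, requires_grad=True)
label_maps = [torch.randint(0, regions_per_seg, (2, 64, 64))
              for _ in range(num_segmentations)]
loss = multi_segmentation_loss(feat, label_maps)
loss.backward()
```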