We propose a deep learning approach for finding dense correspondences between 3D scans of people. Our method requires only partial geometric information in the form of two depth maps or partial reconstructed surfaces, works for humans in arbitrary poses and wearing any clothing, does not require the two people to be scanned from similar viewpoints, and runs in real time. We use a deep convolutional neural network to train a feature descriptor on depth map pixels, but crucially, rather than training the network to solve the shape correspondence problem directly, we train it to solve a body region classification problem, modified to increase the smoothness of the learned descriptors near region boundaries. This approach ensures that nearby points on the human body are nearby in feature space, and vice versa, rendering the feature descriptor suitable for computing dense correspondences between the scans. We validate our method on real and synthetic data for both clothed and unclothed humans, and show that our correspondences are more robust than is possible with state-of-the-art unsupervised methods, and more accurate than those found using methods that require full watertight 3D geometry.
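To make this concrete, the following PyTorch sketch shows one way such a network could be set up: a fully convolutional network is trained with a per-pixel body-region classification loss, and the activations feeding the classifier are taken as the per-pixel descriptor. The layer sizes, descriptor dimensionality, region count, and the name DepthDescriptorNet are illustrative assumptions, not the exact architecture used here.

```python
import torch
import torch.nn as nn

class DepthDescriptorNet(nn.Module):
    """Illustrative per-pixel classifier; the real architecture is not specified here."""
    def __init__(self, num_regions=500, feat_dim=16):
        super().__init__()
        # Fully convolutional encoder over a single-channel depth map (assumed design).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 1),        # per-pixel feature descriptor
        )
        # Linear classifier mapping descriptors to body-region labels.
        self.classifier = nn.Conv2d(feat_dim, num_regions, 1)

    def forward(self, depth):
        feat = self.encoder(depth)              # B x feat_dim x H x W
        logits = self.classifier(feat)          # B x num_regions x H x W
        return feat, logits

# Training minimizes a per-pixel classification loss; at test time only
# `feat` is used, so descriptors of nearby body points remain close.
net = DepthDescriptorNet()
depth = torch.randn(2, 1, 128, 128)             # synthetic stand-in for depth maps
labels = torch.randint(0, 500, (2, 128, 128))   # per-pixel body-region labels
_, logits = net(depth)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```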
The computation of correspondences between geometric shapes is a fundamental building block
for many important tasks in 3D computer vision, such as reconstruction, tracking, analysis,
and recognition. Temporally-coherent sequences of partial scans of an object can be aligned
by first finding corresponding points in overlapping regions, then recovering the motion by
tracking surface points through a sequence of 3D data; semantics can be extracted by fitting
a 3D template model to an unstructured input scan. With the popularization of commodity 3D
scanners and recent advances in correspondence algorithms for deformable shapes, human bodies
can now be easily digitized and their performances captured using a single RGB-D sensor.
Most techniques are based on robust non-rigid surface registration methods that can handle
complex skin and cloth deformations, as well as large regions of missing data due to occlusions.
Because geometric features can be ambiguous and difficult to identify and match, the success of
these techniques generally relies on the deformation between source and target shapes being
reasonably small, with sufficient overlap. While local shape descriptors can be used to determine
correspondences between surfaces that are far apart, they are typically sparse and prone to
false matches, which require manual clean-up. Dense correspondences between shapes with larger
deformations can be obtained reliably using statistical models of human shapes, but the subject
has to be naked. For clothed bodies, the automatic computation of dense mappings has been
demonstrated on full surfaces with significant shape variations, but is limited to compatible
or zero-genus surface topologies. Consequently, an automated method for estimating accurate
dense correspondences between partial shapes, such as scans from a single RGB-D camera,
under arbitrarily large deformations has not yet been proposed.
We evaluate our method extensively on various real and synthetic datasets, with naked and clothed subjects, as well as full and partial matching for challenging examples as illustrated in Figure 5. The real capture examples (last column) are obtained using a Kinect One (v2) RGB-D sensor and demonstrate the effectiveness of our method in real-life scenarios. Each partial scan is a single depth map frame of 512 × 424 pixels, and the full template model is obtained using a non-rigid 3D reconstruction algorithm. The examples include complex poses (side views and bent postures), challenging garments (dresses and vests), and props (backpacks and hats).
We perform all our experiments on a 6-core Intel Core i7-5930K processor at 3.9 GHz with 16GB RAM. Both offline training and online correspondence computation run on an NVIDIA GeForce TITAN X (12GB GDDR5) GPU. While the complete training of our neural network takes about 250 hours of computation, the extraction of all feature descriptors never exceeds 1 ms per depth map. The subsequent correspondence computation with these feature descriptors takes between 0.5 and 1 s, depending on the resolution of the input data.
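As a rough illustration of the online matching step, the sketch below performs a brute-force nearest-neighbour search in descriptor space; this is an assumed matching strategy for exposition, not necessarily the exact procedure used in our system.

```python
import numpy as np

def dense_correspondences(desc_src, desc_tgt):
    """Match each source pixel descriptor to its nearest neighbour in the target.

    desc_src: (N, D) descriptors of valid pixels in the source depth map.
    desc_tgt: (M, D) descriptors of valid pixels in the target depth map.
    Returns an index array of length N mapping source pixels to target pixels.
    """
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = ((desc_src ** 2).sum(1)[:, None]
          - 2.0 * desc_src @ desc_tgt.T
          + (desc_tgt ** 2).sum(1)[None, :])
    return d2.argmin(axis=1)

# Example with random descriptors standing in for network outputs.
src = np.random.rand(1000, 16).astype(np.float32)
tgt = np.random.rand(1200, 16).astype(np.float32)
matches = dense_correspondences(src, tgt)
```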
We have shown that a deep learning framework can be particularly effective at establishing accurate and dense correspondences between partial scans of clothed subjects in arbitrary poses. The key insight is that a smooth embedding needs to be learned to reduce misclassification artifacts at segmentation boundaries when using traditional classification networks. We have shown that a loss function based on the integration of multiple random segmentations can be used to enforce smoothness. This segmentation scheme also significantly decreases the amount of training data needed, as it eliminates an exhaustive pairwise distance computation between feature descriptors during training, as opposed to methods that operate on pairs or triplets of samples. Compared to existing classification networks, we also present the first framework that unifies the treatment of naked human body shapes and clothed subjects. In addition to its remarkable efficiency, our approach can handle both full models and partial scans, such as depth maps captured from a single view. While not as general as some state-of-the-art shape matching methods, our technique significantly outperforms them for partial input shapes that are human bodies with clothing.
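A schematic of the multiple-random-segmentation loss is sketched below: the same per-pixel descriptor feeds several classification heads, each trained against a different random segmentation of the body, and the summed per-pixel loss encourages descriptors to vary smoothly across any single segmentation's boundaries. The number of segmentations, the number of regions per segmentation, and the use of independent linear heads sharing one descriptor are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_segmentations, regions_per_seg = 16, 10, 100

# One classification head per random segmentation, all sharing the same descriptor (assumed design).
heads = nn.ModuleList(nn.Conv2d(feat_dim, regions_per_seg, 1)
                      for _ in range(num_segmentations))

def multi_segmentation_loss(feat, label_maps):
    """feat: B x feat_dim x H x W descriptors.
    label_maps: list of B x H x W label tensors, one per random segmentation.
    Returns the summed per-pixel classification loss over all segmentations."""
    return sum(F.cross_entropy(head(feat), labels)
               for head, labels in zip(heads, label_maps))

# Toy example with random descriptors and random segmentation labels.
feat = torch.randn(2, feat_dim, 64, 64, requires_grad=True)
label_maps = [torch.randint(0, regions_per_seg, (2, 64, 64))
              for _ in range(num_segmentations)]
loss = multi_segmentation_loss(feat, label_maps)
loss.backward()
```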