DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
CVPR 2018
Tao Yu1,2    Zerong Zheng1    Kaiwen Guo1,3    Jianhui Zhao2    Qionghai Dai1    Hao Li4    Gerard Pons-Moll5    Yebin Liu1,6   
Tsinghua University, Beijing, China1    Beihang University, Beijing, China2    Google Inc3    University of Southern California / USC Institute for Creative Technologies4    Max-Planck-Institute for Informatics, Saarland Informatics Campus5    Beijing National Research Center for Information Science and Technology (BNRist)6   
Abstract

We propose DoubleFusion, a new real-time system that combines volumetric dynamic reconstruction with data-driven template fitting to simultaneously reconstruct detailed geometry, non-rigid motion and the inner human body shape from a single depth camera. One of the key contributions of this method is a double-layer representation consisting of a complete parametric body shape inside and a gradually fused outer surface layer. A pre-defined node graph on the body surface parameterizes the non-rigid deformations near the body, while a free-form, dynamically changing graph parameterizes the outer surface layer far from the body, allowing more general reconstruction. We further propose a joint motion tracking method based on the double-layer representation to enable robust and fast motion tracking. Moreover, the inner body shape is optimized online and constrained to lie inside the outer surface layer. Overall, our method enables increasingly denoised, detailed and complete surface reconstructions, fast motion tracking and plausible inner body shape reconstruction in real time. In particular, experiments show improved fast motion tracking and loop closure performance on challenging scenarios.


Pipeline of the system
Introduction

Human performance capture has been a challenging research topic in computer vision and computer graphics for decades. The goal is to reconstruct a temporally coherent representation of the dynamically deforming surface of human characters from videos. Although array-based methods [21, 12, 5, 6, 41, 22, 27, 11, 16, 30] using multiple video or depth cameras are well studied and have achieved high-quality results, the expensive camera-array setups and controlled studios limit their application to a few technical experts. As depth cameras become increasingly popular in the consumer space (iPhone X, Google Tango, etc.), the recent trend focuses on more practical setups such as a single depth camera [45, 13, 3]. In particular, by combining non-rigid surface tracking and volumetric depth integration, DynamicFusion-like approaches [28, 15, 14, 34] allow real-time dynamic scene reconstruction using a single depth camera without requiring a pre-scanned model template. Such systems are low cost, easy to set up and promising for wide adoption; however, they are still restricted to controlled, slow motions. The challenges include occlusions (single view), limited computational budget (real-time), surface loop closure and the absence of a pre-scanned template model.

Overview

Double-layer Surface Representation

The input to DoubleFusion is a depth stream captured by a single consumer-level depth sensor, and the output is a double-layer surface of the performer. The outer layer consists of observable surface regions, such as clothing and visible body parts (e.g., face, hair), while the inner layer is a parametric human shape and skeleton model based on the skinned multi-person linear model (SMPL) [24]. Similar to previous work [28], the motion of the outer surface is parameterized by a set of nodes; each node deforms according to a rigid transformation. The node graph interconnects the nodes and constrains neighboring nodes to deform similarly. Unlike [28], which uniformly samples nodes on the newly fused surface, we pre-define an on-body node graph on the SMPL model, which provides a semantic, body-aware prior to constrain non-rigid human motions. For example, it prevents erroneous connections between body parts (e.g., connecting the two legs). We uniformly sample on-body nodes and use geodesic distances to construct the pre-defined on-body node graph on the mean shape of the SMPL model, as shown in Fig. 2(a)(top). The on-body nodes are inherently bound to skeleton joints in the SMPL model. Outer-surface regions that are close to the inner body are bound to the on-body node graph. Deformations of regions far from the body cannot be accurately represented by the on-body graph alone. Hence, we additionally sample far-body nodes with a radius of δ = 5cm on the newly fused far-body geometry. A vertex is labeled as far-body when it is located further than 1.4 × δ (i.e., 7cm) from its nearest on-body node; this margin makes the sampling scheme robust against depth noise and tracking failures.
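The far-body labeling and node-sampling rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the point representation (3-tuples in meters) and the greedy sampling loop are assumptions; the constants (δ = 5cm, the 1.4 × δ threshold) come from the text.

```python
import math

DELTA = 0.05  # node sampling radius delta = 5 cm, in meters (from the paper)

def is_far_body(vertex, on_body_nodes, delta=DELTA):
    """Label a vertex as far-body when its nearest on-body node lies
    further than 1.4 * delta away; the margin keeps the labeling robust
    to depth noise and small tracking errors."""
    nearest = min(math.dist(vertex, n) for n in on_body_nodes)
    return nearest > 1.4 * delta

def sample_far_body_nodes(far_vertices, delta=DELTA):
    """Greedy uniform sampling (an assumed strategy): accept a vertex as
    a new far-body node only if it is at least delta away from every
    node sampled so far."""
    nodes = []
    for v in far_vertices:
        if all(math.dist(v, n) >= delta for n in nodes):
            nodes.append(v)
    return nodes
```

For example, with a single on-body node at the origin, a vertex 1cm away stays bound to the on-body graph, while a vertex 20cm away is labeled far-body and becomes a candidate for far-body node sampling.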

Conclusion

In this paper, we have demonstrated the first method for real-time reconstruction of both clothing and inner body shape from a single depth sensor. Based on the proposed double-layer surface representation, our system achieves better non-rigid tracking and surface loop closure performance than state-of-the-art methods. Moreover, the inner body shapes reconstructed in real time are visually plausible. We believe the robustness and accuracy of our approach will enable many applications, especially in AR/VR, gaming, entertainment and even virtual try-on, since we also reconstruct the underlying body shape. For the first time, with DoubleFusion, users can easily digitize themselves.