We present a deep learning based volumetric approach for performance capture using a passive and highly sparse multi-view capture system. State-of-the-art performance capture systems require either pre-scanned actors, a large number of cameras, or active sensors. In this work, we focus on the task of template-free, per-frame 3D surface reconstruction from as few as three RGB sensors, for which conventional visual hull or multi-view stereo methods fail to generate plausible results. We introduce a novel multi-view Convolutional Neural Network (CNN) that maps 2D images to a 3D volumetric field, which we use to encode the probabilistic distribution of surface points of the captured subject. By querying the resulting field, we can instantiate the clothed human body at arbitrary resolutions. Our approach scales to different numbers of input images, yielding higher reconstruction quality as more views are used. Although only trained on synthetic data, our network generalizes to real footage from body performance capture. Our method is suitable for high-quality, low-cost, full-body volumetric capture solutions, which are gaining popularity for VR and AR content creation. Experimental results demonstrate that our method is significantly more robust and accurate than existing techniques when only very sparse views are available.
Given multiple views and their corresponding camera calibration parameters as input, our method aims to predict a dense 3D field that encodes the probabilistic distribution of the reconstructed surface. We formulate the probability prediction as a classification problem. At a high level, our approach follows the spirit of shape-from-silhouette methods: the surface is reconstructed according to the consensus across multi-view images on whether a 3D point lies inside the reconstructed object. However, instead of directly using silhouettes, which contain only limited information, we leverage the deep features learned by a multi-view convolutional neural network. As illustrated in Figure 1, we project each query point in 3D space onto the multi-view image planes using the input camera parameters. We then collect the multi-scale CNN features learned at each projected location and aggregate them through a pooling layer to obtain the final global feature for the query point. The per-point feature is then fed into a classification network to infer the probabilities of the point lying inside and outside the reconstructed object, respectively. As our method outputs a dense probability field, the surface geometry can be faithfully reconstructed from this field using marching cubes. We introduce the multi-view probability inference network and training details in Section 4, and detail the surface reconstruction in Section 5.
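The following is a minimal sketch, not the authors' implementation, of the per-point query pipeline described above, assuming a PyTorch setup. The module and function names (`MultiViewEncoder`, `PointClassifier`, `project`, `query_probability`), the two-scale CNN, the choice of max pooling across views, and the simplified 3x4 projection matrices with normalized pixel coordinates are all illustrative assumptions.

```python
# Hypothetical sketch of multi-view feature sampling and per-point inside/outside
# classification; layer sizes and pooling choice are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewEncoder(nn.Module):
    """Shared 2D CNN producing multi-scale feature maps for each input view."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, images):                   # images: (V, 3, H, W)
        f1 = self.conv1(images)                  # coarser spatial scale
        f2 = self.conv2(f1)                      # deeper, more semantic scale
        return [f1, f2]                          # multi-scale feature maps

class PointClassifier(nn.Module):
    """MLP mapping a pooled per-point feature to inside/outside logits."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2))                   # two classes: inside / outside

    def forward(self, feats):                    # feats: (N, feat_dim)
        return self.mlp(feats)

def project(points, proj_mats, image_size):
    """Project N 3D points into V views; returns coords in [-1, 1] for grid_sample."""
    homog = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)    # (N, 4)
    pix = torch.einsum('vij,nj->vni', proj_mats, homog)                   # (V, N, 3)
    pix = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)                     # perspective divide
    return 2.0 * pix / torch.tensor(image_size, dtype=pix.dtype) - 1.0    # normalize to [-1, 1]

def query_probability(encoder, classifier, images, proj_mats, points):
    """Evaluate the probability field at arbitrary 3D query points."""
    V, _, H, W = images.shape
    feat_maps = encoder(images)
    grid = project(points, proj_mats, (W, H)).unsqueeze(2)                # (V, N, 1, 2)
    samples = []
    for fm in feat_maps:                                                  # sample every scale
        s = F.grid_sample(fm, grid, align_corners=False)                  # (V, C, N, 1)
        samples.append(s.squeeze(-1).permute(0, 2, 1))                    # (V, N, C)
    per_view = torch.cat(samples, dim=-1)                                 # (V, N, 32 + 64)
    pooled = per_view.max(dim=0).values                                   # pool across views
    logits = classifier(pooled)                                           # (N, 2)
    return F.softmax(logits, dim=-1)[:, 0]                                # P(point is inside)
```

In such a setup, the field could be evaluated on a regular 3D grid of query points and thresholded, for example with `skimage.measure.marching_cubes`, to recover the surface mesh discussed in Section 5.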
We would like to thank the authors of SurfaceNet [74], who helped with testing their system. This work was supported in part by the ONR YIP grant N00014-17-S-FO14, the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the Andrew and Erna Viterbi Early Career Chair, the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005, Adobe, and Sony. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.