Deep Volumetric Video From Very Sparse Multi-View Performance Capture
ECCV 2018
Zeng Huang1,2    Tianye Li1,2    Weikai Chen2    Yajie Zhao2    Jun Xing2    Chloe LeGendre2    Linjie Luo3    Chongyang Ma3    Hao Li1,2,4   
University of Southern California1    USC Institute for Creative Technologies2    Snap Inc.3    Pinscreen4   
Abstract

We present a deep-learning-based volumetric approach for performance capture using a passive and highly sparse multi-view capture system. State-of-the-art performance capture systems require pre-scanned actors, a large number of cameras, or active sensors. In this work, we focus on the task of template-free, per-frame 3D surface reconstruction from as few as three RGB sensors, for which conventional visual hull or multi-view stereo methods fail to generate plausible results. We introduce a novel multi-view Convolutional Neural Network (CNN) that maps 2D images to a 3D volumetric field, and we use this field to encode the probabilistic distribution of surface points of the captured subject. By querying the resulting field, we can instantiate the clothed human body at arbitrary resolutions. Our approach scales to different numbers of input images, yielding higher reconstruction quality as more views are used. Although trained only on synthetic data, our network generalizes to real footage from body performance capture. Our method is suitable for high-quality, low-cost full-body volumetric capture solutions, which are gaining popularity for VR and AR content creation. Experimental results demonstrate that our method is significantly more robust and accurate than existing techniques when only very sparse views are available.


Fig. 1: Network architecture
Overview

Given multiple views and their corresponding camera calibration parameters as input, our method aims to predict a dense 3D field that encodes the probabilistic distribution of the reconstructed surface. We formulate the probability prediction as a classification problem. At a high level, our approach resembles the spirit of the shape-from-silhouette method: reconstructing the surface according to the consensus from multi-view images on whether a 3D point lies inside the reconstructed object. However, instead of directly using silhouettes, which contain only limited information, we leverage the deep features learned by a multi-view convolutional neural network. As illustrated in Figure 1, for each query point in 3D space, we project it onto the multi-view image planes using the input camera parameters. We then collect the multi-scale CNN features learned at each projected location and aggregate them through a pooling layer to obtain the final global feature for the query point. This per-point feature is fed into a classification network to infer the probabilities of the point lying inside or outside the reconstructed object, respectively. As our method outputs a dense probability field, the surface geometry can be faithfully reconstructed from the field using the marching cubes algorithm. We introduce the multi-view probability inference network and training details in Section 4 of the paper, and detail the surface reconstruction in Section 5.
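The per-point query pipeline above (project into each view, sample features, pool across views, classify inside/outside) can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it uses nearest-neighbour sampling in place of the learned multi-scale CNN feature extraction, and a stand-in `mlp` callable in place of the trained classification network.

```python
import numpy as np

def project(point, P):
    """Project a 3D point into a view using a 3x4 camera matrix P."""
    x = P @ np.append(point, 1.0)          # homogeneous projection
    return x[:2] / x[2]                    # pixel coordinates (u, v)

def sample_feature(feat_map, uv):
    """Nearest-neighbour lookup in an (H, W, C) feature map.
    A stand-in for sampling the learned multi-scale CNN features."""
    h, w, _ = feat_map.shape
    u = int(np.clip(round(uv[0]), 0, w - 1))
    v = int(np.clip(round(uv[1]), 0, h - 1))
    return feat_map[v, u]

def occupancy_probability(point, cameras, feat_maps, mlp):
    """Pool per-view features for one query point and classify it.
    `mlp` is any callable mapping a pooled feature vector to a
    probability of the point lying inside the object."""
    feats = [sample_feature(f, project(point, P))
             for P, f in zip(cameras, feat_maps)]
    pooled = np.max(feats, axis=0)         # order-invariant view pooling
    return mlp(pooled)
```

Max pooling makes the aggregation invariant to the order and number of input views, which is one way the approach can scale to different camera counts; to form the full probability field, `occupancy_probability` would be evaluated on a dense grid of query points before running marching cubes.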

Acknowledgments

We would like to thank the authors of [74] (SurfaceNet: An End-to-End 3D Neural Network for Multiview Stereopsis) for their help in testing with their system. This work was supported in part by the ONR YIP grant N00014-17-S-FO14, the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the Andrew and Erna Viterbi Early Career Chair, the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005, Adobe, and Sony. The content of this information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.