We present the first approach to volumetric performance capture and novel-view rendering at real-time speed from monocular video, eliminating the need for expensive multi-view systems or cumbersome pre-acquisition of a personalized template model. Our system reconstructs a fully textured 3D human from each frame by leveraging Pixel-Aligned Implicit Function (PIFu). While PIFu achieves high-resolution reconstruction in a memory-efficient manner, its computationally expensive inference prevents us from deploying such a system for real-time applications. To this end, we propose a novel hierarchical surface localization algorithm and a direct rendering method without explicitly extracting surface meshes. By culling unnecessary regions for evaluation in a coarse-to-fine manner, we successfully accelerate the reconstruction by two orders of magnitude from the baseline without compromising the quality. Furthermore, we introduce an Online Hard Example Mining (OHEM) technique that effectively suppresses failure modes due to the rare occurrence of challenging examples. We adaptively update the sampling probability of the training data based on the current reconstruction accuracy, which effectively alleviates reconstruction artifacts. Our experiments and evaluations demonstrate the robustness of our system to various challenging angles, illuminations, poses, and clothing styles. We also show that our approach compares favorably with the state-of-the-art monocular performance capture. Our proposed approach removes the need for multi-view studio settings and enables a consumer-accessible solution for volumetric capture.
In this section, we describe the overall pipeline of our algorithm for real-time
volumetric capture (Fig. 2). Given a live stream of RGB images, our goal is to
obtain the complete 3D geometry of the performing subject in real-time with
the full textured surface, including unseen regions. To achieve an accessible
solution with minimal requirements, we process each frame independently, as
tracking-based solutions are prone to accumulating errors and sensitive to
initialization, causing drift and instability [49, 88]. Although recent approaches
have demonstrated that the use of anchor frames [3,10] can alleviate drift, ad-hoc
engineering is still required to handle common but extremely challenging scenarios
such as changing the subject.
For each frame, we first apply real-time segmentation of the subject from the background. The segmented image is then fed into our enhanced Pixel-Aligned Implicit Function (PIFu)  to predict continuous occupancy fields where the underlining surface is defined as a 0.5-level set. Once the surface is determined, texture inference on the surface geometry is also performed using PIFu, allowing for rendering from any viewpoint for various applications. As this deep learning framework with effective 3D shape representation is the core building block of the proposed system, we review it in Sec. 3.1, describe our enhancements to it, and point out the limitations on its surface inference and rendering speed. At the heart of our system, we develop a novel acceleration framework that enables real-time inference and rendering from novel viewpoints using PIFu (Sec. 3.2). Furthermore, we further improve the robustness of the system by sampling hard examples on the fly to efficiently suppress failure modes in a manner inspired by Online Hard Example Mining  (Sec. 3.3).
To reduce the computation required for real-time performance capture, we
introduce two novel acceleration techniques. First, we present an efficient surface
localization algorithm that retains the accuracy of the brute-force reconstruction
with the same complexity as naive octree-based reconstruction algorithms.
Furthermore, since our final outputs are renderings from novel viewpoints, we
bypass the explicit mesh reconstruction stage by directly generating a novel-view
rendering from PIFu. By combining these two algorithms, we can successfully
render the performance from arbitrary viewpoints in real-time. We describe each
algorithm in detail below.
Octree-based Robust Surface Localization. The major bottleneck of the pipeline is the evaluation of implicit functions represented by an MLP at an excessive number of 3D locations. Thus, substantially reducing the number of points to be evaluated would greatly increase the performance. The octree is a common data representation for efficient shape reconstruction  which hierarchically reduces the number of nodes in which to store data. To apply an octree for an implicit surface parameterized by a neural network, recently  propose an algorithm that subdivides grids only if it is adjacent to the boundary nodes (i.e., the interface between inside node and outside node) after binarizing the predicted occupancy value. We found that this approach often produces inaccurate reconstructions compared to the surface reconstructed by the brute force baseline (see Fig. 3). Since a predicted occupancy value is a continuous value in the range [0, 1], indicating the confidence in and proximity to the surface, another approach is to subdivide grids if the maximum absolute deviation of the neighbor coarse grids is larger than a threshold. While this approach allows for control over the trade-off between reconstruction accuracy and acceleration, we also found that this algorithm either excessively evaluates unnecessary points to perform accurate reconstruction or suffers from impaired reconstruction quality in exchange for higher acceleration. To this end, we introduce a surface localization algorithm that hierarchically and precisely determines the boundary nodes.
We train our networks using NVIDIA GV100s with 512 × 512 images. During
inference, we use a Logitech C920 webcam on a desktop system equipped with 62
GB RAM, a 6-core Intel i7-5930K processor, and 2 GV100s. One GPU performs
geometry and color inference, while the other performs surface reconstruction,
which can be done in parallel in an asynchronized manner when processing
multiple frames. The overall latency of our system is on average 0.25 second.
We evaluate our proposed algorithms on the RenderPeople  and BUFF datasets , and on self-captured performances. In particular, as public datasets of 3D clothed humans in motion are highly limited, we use the BUFF datasets  for quantitative comparison and evaluation and report the average error measured by the Chamfer distance and point-to-surface (P2S) distance from the prediction to the ground truth. We provide implementation details, including the training dataset and real-time segmentation module, in the appendix.
In Fig. 1, we demonstrate our real-time performance capture and rendering from a single RGB camera. Because both the reconstructed geometry and texture inference for unseen regions are plausible, we can obtain novel-view renderings in real-time from a wide range of poses and clothing styles. We provide additional results with various poses, illuminations, viewing angles, and clothing in the appendix and supplemental video.