A General Differentiable Mesh Renderer for Image-based 3D Reasoning
Shichen Liu1    Tianye Li1    Weikai Chen1    Hao Li1   
Abstract

Rendering bridges the gap between 2D vision and 3D scenes by simulating the physical process of image formation. By inverting such a renderer, one can devise a learning approach to infer 3D information from 2D images. However, standard graphics renderers involve a fundamental step called rasterization, which prevents rendering from being differentiable. Unlike the state-of-the-art differentiable renderers [25], [35], which only approximate the rendering gradient in the backpropagation, we propose a naturally differentiable rendering framework that is able to (1) directly render colorized meshes using differentiable functions and (2) back-propagate efficient supervision signals to mesh vertices and their attributes from various forms of image representations. The key to our framework is a novel formulation that views rendering as an aggregation function that fuses the probabilistic contributions of all mesh triangles with respect to the rendered pixels. Such a formulation enables our framework to flow gradients to occluded and distant vertices, which cannot be achieved by previous state-of-the-art methods. We show that by using the proposed renderer, one can achieve significant improvements in 3D unsupervised single-view reconstruction both qualitatively and quantitatively. Experiments also demonstrate that our approach can handle challenging tasks in image-based shape fitting, which remain nontrivial for existing differentiable renderers.


We propose Soft Rasterizer R (upper), a naturally differentiable renderer, which formulates rendering as a differentiable aggregating process A(·) that fuses per-triangle contributions {Di} in a “soft” probabilistic manner. Our approach attacks the core problem of differentiating the standard rasterizer, which cannot propagate gradients from pixels to geometry due to the discrete sampling operation (lower).
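To make the formulation concrete, the following sketch illustrates the silhouette case: each triangle contributes a probability map D_j computed from a sigmoid of its signed squared screen-space distance scaled by σ, and the aggregation A(·) treats a pixel as covered unless every triangle misses it. This is a minimal PyTorch-style sketch; the tensor layout, function name, and default σ are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of soft rasterization for silhouettes (illustrative only).
import torch

def soft_silhouette(signed_dist2, sigma=1e-4):
    """signed_dist2: (T, H, W) squared screen-space distance from every pixel to
    each triangle's boundary, taken positive inside the triangle and negative
    outside; sigma controls the sharpness of the probability maps."""
    # Per-triangle probability map D_j: close to 1 inside the triangle and
    # decaying smoothly with distance outside of it.
    D = torch.sigmoid(signed_dist2 / sigma)                # (T, H, W)
    # Aggregation A(.): a pixel is covered unless it is missed by all triangles.
    return 1.0 - torch.prod(1.0 - D, dim=0)                # (H, W)
```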
Image-based 3D Reasoning

2D images are widely used as the medium for reasoning about 3D properties. In particular, image-based reconstruction has received the most attention. Conventional approaches mainly leverage stereo correspondence based on multi-view geometry [12], [17] but are restricted to the coverage provided by the multiple views. With the availability of large-scale 3D shape datasets [7], learning-based approaches [16], [19], [58] are able to consider single or few images thanks to the shape prior learned from the data. To simplify the learning problem, recent works reconstruct 3D shape via predicting intermediate 2.5D representations, such as depth maps [31], image collections [24], displacement maps [20] or normal maps [49], [59].

Pose estimation is another key task in understanding the visual environment. For 3D rigid pose estimation, while early approaches attempt to cast it as a classification problem [56], recent approaches [26], [61] can directly regress the 6D pose by using deep neural networks. Estimating the pose of non-rigid objects, e.g., the human face or body, is more challenging. By detecting 2D key points, great progress has been made in estimating 2D poses [5], [38], [60]. To obtain 3D poses, shape priors [2], [34] have been incorporated to minimize shape fitting errors in recent approaches [3], [4], [5], [23]. Our proposed differentiable renderer can provide dense rendering supervision to 3D properties, benefiting a variety of image-based 3D reasoning tasks.

Single-view Mesh Reconstruction

Image-based 3D reconstruction plays a key role in a variety of tasks in computer vision and computer graphics, such as scene understanding, VR/AR, and autonomous driving. Reconstructing 3D objects in either mesh [47], [58] or voxel [62] representation from a single RGB image has been actively studied thanks to the advent of deep learning technologies. While most approaches to mesh reconstruction rely on supervised learning, methods working on voxel representations have strived to leverage rendering losses [8], [28], [57] to mitigate the lack of 3D data. However, the reconstruction quality of voxel-based approaches is limited, primarily due to the high computational expense and the discrete nature of voxels. On the other hand, unlike voxels, which can be easily rendered via differentiable projection, rendering a mesh in a differentiable fashion is non-trivial, as discussed above. By introducing a naturally differentiable mesh renderer, SoftRas combines the merits of both worlds: the ability to harness abundant resources of multi-view images and the high reconstruction quality of the mesh representation.

To demonstrate the effectiveness of the soft rasterizer, we fix the extrinsic variables and evaluate its performance on single-view 3D reconstruction by combining it with a mesh generator. The direct gradient from image pixels to the shape and color generators enables us to achieve 3D unsupervised mesh reconstruction. Our framework is illustrated in Figure 6. Given an input image, our shape and color generators produce a triangle mesh M and its corresponding colors C, which are then fed into the soft rasterizer. The SoftRas layer renders both the silhouette Is and the color image Ic and provides rendering-based error signals by comparing them with the ground truths. Inspired by the latest advances in mesh learning [25], [58], we leverage a similar idea of synthesizing a 3D model by deforming a template mesh. To validate the performance of the soft rasterizer, the shape generator employs an encoder-decoder architecture identical to that of [25], [62]. The details of the shape and color generators are described in the supplemental materials.
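A single training step of this pipeline can be sketched as follows. The generator and rasterizer interfaces, as well as the particular loss terms (binary cross-entropy on silhouettes plus an L1 color term weighted by lambda_color), are placeholders chosen for illustration; the actual architectures and loss weighting are those described in the supplemental materials.

```python
# Illustrative training step for 3D unsupervised single-view mesh reconstruction.
# `generator` and `soft_rasterizer` are placeholder modules, not the released code.
import torch.nn.functional as F

def train_step(image, gt_silhouette, gt_color, generator, soft_rasterizer,
               optimizer, lambda_color=1.0):
    vertices, faces, colors = generator(image)             # mesh M and colors C
    # SoftRas renders the silhouette and the color image with differentiable functions.
    pred_sil, pred_color = soft_rasterizer(vertices, faces, colors)
    # Rendering-based supervision: only 2D ground truth is required.
    loss = F.binary_cross_entropy(pred_sil, gt_silhouette) \
           + lambda_color * F.l1_loss(pred_color, gt_color)
    optimizer.zero_grad()
    loss.backward()        # gradients flow through the renderer to both generators
    optimizer.step()
    return loss.item()
```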


3D mesh reconstruction from a single image. From left to right: the input image, the ground truth, and the results of our method (SoftRas), Neural Mesh Renderer [25] and Pixel2mesh [58], all visualized from two different views. Along with the results, we also visualize the mesh-to-scan distances measured from the reconstructed mesh to the ground truth.
Forward Rendering Results

Our proposed SoftRas can directly render a given mesh using differentiable functions, while previous rasterization-based differentiable renderers [25], [35] have to rely on off-the-shelf renderers for forward rendering. In addition, compared to a standard graphics renderer, SoftRas can achieve different rendering effects in a continuous manner thanks to its probabilistic formulation. By increasing σ, the key parameter that controls the sharpness of the screen-space probability distribution, we are able to generate blurrier rendering results. Furthermore, with increased γ, one can assign more weight to the triangles on the far end, naturally achieving more transparency in the rendered image. We demonstrate these rendering effects in the supplemental materials. We will show in Section 5.3 that the blurring and transparency effects are key to reshaping the energy landscape in order to avoid local minima.
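As a rough illustration of how γ controls transparency, the sketch below fuses per-triangle colors with softmax-style weights built from the probability maps and normalized depths; the background term, the depth convention (larger values are closer), and the value of ε are assumptions made for exposition rather than exact implementation details.

```python
# Illustrative color aggregation: small gamma approaches hard z-buffering,
# while larger gamma gives occluded and distant triangles more weight,
# yielding more transparent renderings.
import torch

def aggregate_colors(D, colors, z, gamma=1e-4, eps=1e-3, background=1.0):
    """D: (T, H, W) probability maps; colors: (T, H, W, 3) per-triangle colors;
    z: (T, H, W) normalized depth in [0, 1], larger meaning closer to the camera."""
    # Weights w_j proportional to D_j * exp(z_j / gamma), plus a constant
    # background term exp(eps / gamma); folding D into the logits keeps the
    # softmax numerically stable.
    logits = torch.cat([z / gamma + torch.log(D),
                        torch.full_like(z[:1], eps / gamma)], dim=0)
    w = torch.softmax(logits, dim=0)                             # (T+1, H, W)
    colors_b = torch.cat([colors, torch.full_like(colors[:1], background)], dim=0)
    return (w.unsqueeze(-1) * colors_b).sum(dim=0)               # (H, W, 3)
```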
