We propose a novel approach to performing fine-grained 3D manipulation of image content via a convolutional neural network, which we call the Transformable Bottleneck Network (TBN). It applies given spatial transformations directly to a volumetric bottleneck within our encoder-bottleneck-decoder architecture. Multi-view supervision encourages the network to learn to spatially disentangle the feature space within the bottleneck, and the resulting spatial structure can be manipulated with arbitrary spatial transformations. We demonstrate the efficacy of TBNs for novel view synthesis, achieving state-of-the-art results on a challenging benchmark. We show that the bottlenecks produced by networks trained for this task contain meaningful spatial structure that allows us to intuitively perform a variety of image manipulations in 3D, well beyond the rigid transformations seen during training, including non-uniform scaling, non-rigid warping, and combining content from different images. Finally, we extract explicit 3D structure from the bottleneck, performing high-quality 3D reconstruction from a single input image.
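To make the architecture concrete, the sketch below illustrates the core mechanism in PyTorch: a 2D encoder lifts the image to a volumetric feature grid, the desired rigid transformation is applied by resampling that grid, and a 2D decoder renders the result. The module layout, channel counts, grid resolution, and the TBNSketch name are illustrative assumptions for exposition, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TBNSketch(nn.Module):
    """Encoder -> volumetric bottleneck -> decoder; the viewpoint change is
    applied by resampling the bottleneck (all sizes here are illustrative)."""

    def __init__(self, feat_ch=32, depth=16):
        super().__init__()
        self.feat_ch, self.depth = feat_ch, depth
        # 2D encoder: 256x256 RGB image -> 16x16 feature map with C*D channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.ReLU(),
            nn.Conv2d(64, feat_ch * depth, 4, stride=4), nn.ReLU())
        # 2D decoder: collapsed 16x16 bottleneck -> 256x256 RGB image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch * depth, 64, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=4), nn.Sigmoid())

    def encode(self, img):
        feat = self.encoder(img)                              # (B, C*D, H, W)
        b, _, h, w = feat.shape
        return feat.view(b, self.feat_ch, self.depth, h, w)   # (B, C, D, H, W)

    def transform(self, vol, theta):
        # theta: (B, 3, 4) affine matrix in normalized volume coordinates.
        grid = F.affine_grid(theta, list(vol.shape), align_corners=False)
        return F.grid_sample(vol, grid, align_corners=False)

    def decode(self, vol):
        b, c, d, h, w = vol.shape
        return self.decoder(vol.reshape(b, c * d, h, w))

    def forward(self, img, theta):
        return self.decode(self.transform(self.encode(img), theta))


if __name__ == "__main__":
    net = TBNSketch()
    img = torch.rand(1, 3, 256, 256)
    # A 90-degree rotation about the vertical axis of the normalized volume.
    theta = torch.tensor([[[0., 0., 1., 0.],
                           [0., 1., 0., 0.],
                           [-1., 0., 0., 0.]]])
    out = net(img, theta)   # (1, 3, 256, 256) rendering of the rotated content
```

Resampling the bottleneck with affine_grid/grid_sample keeps the operation differentiable, so gradients from the image reconstruction loss flow back through the transformation into the encoder.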
We train and evaluate our framework on a variety of tasks. We provide quantitative evaluations of our results on novel view synthesis using both single- and multi-view input, and compare against state-of-the-art methods on an established benchmark. We also perform 3D object reconstruction from a single image and quantitatively compare our results to recent work. Finally, we provide qualitative examples of our approach applying creative manipulations via non-rigid deformations.
As reported above, our method performs well on NVS from a single view and improves progressively as more input views are used. We now show that this trend extends to 3D reconstruction. Moreover, since additional views aid reconstruction and our network can itself synthesize such views, a natural question is whether its generative power can be used to improve the reconstruction itself. We ran experiments to investigate.
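The sketch below outlines one way such an experiment can be set up, reusing the hypothetical TBNSketch module above: synthesize regularly spaced novel views from a single input, re-encode each of them, rotate the resulting bottlenecks back into the input's frame, and average them into a consensus volume from which geometry can be extracted. The number of views, the rotation convention, and the simple averaging are illustrative assumptions rather than the exact protocol of our experiments.

```python
import math
import torch

def rotation_about_vertical(angle, batch=1):
    """(batch, 3, 4) affine matrix rotating the normalized volume about its
    vertical axis (an assumed convention for this sketch)."""
    c, s = math.cos(angle), math.sin(angle)
    theta = torch.tensor([[c, 0., s, 0.],
                          [0., 1., 0., 0.],
                          [-s, 0., c, 0.]])
    return theta.unsqueeze(0).repeat(batch, 1, 1)

def reconstruct_with_synthetic_views(net, img, n_views=8):
    """Fuse the input's bottleneck with bottlenecks re-encoded from regularly
    spaced synthetic views, all expressed in the input's frame."""
    b = img.shape[0]
    with torch.no_grad():
        base_vol = net.encode(img)
        volumes = [base_vol]
        for k in range(1, n_views):
            angle = 2.0 * math.pi * k / n_views
            # Synthesize a novel view by rotating the bottleneck and decoding it.
            synth = net.decode(net.transform(base_vol,
                                             rotation_about_vertical(angle, b)))
            # Re-encode that view and rotate its bottleneck back to the input frame.
            volumes.append(net.transform(net.encode(synth),
                                         rotation_about_vertical(-angle, b)))
    return torch.stack(volumes).mean(dim=0)   # fused volume for geometry extraction
```

The fused volume can then be converted to explicit geometry, for instance by thresholding an occupancy-like channel and running marching cubes; which channel plays that role is again an assumption of this sketch.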
This work has presented a novel approach to applying spatial transformations in CNNs: they are applied directly to a volumetric bottleneck within an encoder-bottleneck-decoder network that we call the Transformable Bottleneck Network. Our results indicate that TBNs are a powerful and versatile method for learning and representing the 3D structure within an image. Using this representation, one can intuitively perform meaningful spatial transformations on the extracted bottleneck, enabling a variety of tasks. We demonstrate state-of-the-art results on novel view synthesis of objects, producing high-quality reconstructions simply by applying the rigid transformation corresponding to the desired view to the bottleneck. We also demonstrate that the 3D structure learned by the network when trained on the NVS task can be straightforwardly extracted from the bottleneck, even without 3D supervision, and furthermore, that the powerful generative capabilities of the complete encoder-decoder network can be used to substantially improve the quality of the 3D reconstructions by re-encoding regularly spaced synthetic novel views. Finally, and perhaps most intriguingly, we demonstrate that a network trained on purely rigid transformations can be used to apply arbitrary, non-rigid 3D spatial transformations to image content.
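As an illustration of such non-rigid manipulation, the sketch below resamples the bottleneck of the hypothetical TBNSketch module above with an arbitrary deformation field, here a vertical "twist", before decoding. The particular warp, and the assumption that the volume's H axis is vertical, are purely illustrative; any smooth sampling grid could be substituted.

```python
import math
import torch
import torch.nn.functional as F

def twist_warp(vol, max_angle=math.pi / 4):
    """Non-rigidly resample a (B, C, D, H, W) bottleneck: each horizontal
    slice is rotated in the width-depth plane by an angle that varies with
    height, producing a 'twist' of the encoded content."""
    b, _, d, h, w = vol.shape
    zs = torch.linspace(-1, 1, d)
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")    # each (D, H, W)
    angle = max_angle * y                                   # twist grows with height
    xr = torch.cos(angle) * x - torch.sin(angle) * z
    zr = torch.sin(angle) * x + torch.cos(angle) * z
    grid = torch.stack([xr, y, zr], dim=-1)                 # (D, H, W, 3), (x, y, z) order
    grid = grid.unsqueeze(0).expand(b, -1, -1, -1, -1).to(vol)
    return F.grid_sample(vol, grid, align_corners=False)

# Usage with the sketch network above: warped = net.decode(twist_warp(net.encode(img)))
```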