We present a method for large-mask pluralistic image inpainting based on the
generative framework of discrete latent codes. Our method learns latent priors, discretized as
tokens, by performing computations only at the
visible locations of the image. This is realized by a restrictive partial
encoder that predicts the token label for each visible block,
a bidirectional transformer that infers the missing labels by
looking only at these tokens, and a dedicated synthesis
network that couples the tokens with the partial image priors
to generate coherent and pluralistic completed images even
under extreme mask settings. Experiments on public benchmarks
validate our design choices as the proposed method
outperforms strong baselines in both visual quality and diversity metrics.
Image inpainting is the task of filling the missing pixels of
a masked image with appropriate contents that are coherent with its
visible regions. As a long-studied topic in computer vision, image
inpainting has evolved from a restoration technique solely relying on existing information from
the input image (e.g. [3]) to data-driven generative methods
(e.g. [23, 27, 36, 41, 44, 48]) that hallucinate detailed
contents from not only the observable pixels but also learned,
rich image priors.
Pluralistic inpainting refers to the ability of a model to
generate multiple plausible results that complete a partial
image. It views image inpainting as a generative
method that models a smooth distribution over the
complete images, conditioned on the partial image as prior information [48].
However, modeling such distributions is challenging
in the typical encoder-decoder network structures. In order to
synthesize missing contents that both respect the partial image
and maintain sample diversity, the decoder in this
setting takes as input two types of information: 1) features
propagated from the visible regions and 2) random noise
vectors sampled from a prior distribution. If the training objective
is to reconstruct a ground-truth image from a partial
image, the objective itself may discourage conditioning on
the random variable.
Moreover, as the training dataset contains
numerous examples that only require low-level information
to complete an image (e.g. smoothly interpolating a
wall texture), the model may choose to ignore the latent priors
when the available image cues are strong enough to provide
an answer. This phenomenon has been observed in image
translation networks [16], where adding noise when generating a
conditional image does little to produce pluralistic results.
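To make this setting concrete, the following is a minimal PyTorch sketch of such an encoder-decoder: the decoder receives both visible-region features and a sampled noise vector z, yet is supervised only by a reconstruction loss. This is an illustration of the problem, not any of the networks discussed in this paper; all module names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class ToyInpaintingNet(nn.Module):
    """Illustrative encoder-decoder with two inputs: visible-region features and noise z."""

    def __init__(self, z_dim: int = 64):
        super().__init__()
        # Encoder over the masked image (RGB + mask channel): visible-region features.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Projects the sampled noise vector to a feature map broadcast over space.
        self.z_proj = nn.Linear(z_dim, 128)
        # Decoder consumes both feature types and outputs the completed image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x_masked, mask, z):
        feats = self.encoder(torch.cat([x_masked, mask], dim=1))  # 1) visible-region features
        z_map = self.z_proj(z)[:, :, None, None].expand(-1, -1, *feats.shape[-2:])  # 2) noise
        return self.decoder(torch.cat([feats, z_map], dim=1))

# Training with a pure reconstruction loss: the target is deterministic given the
# partial image, so the loss can be minimized by an output that ignores z entirely.
net = ToyInpaintingNet()
x_gt = torch.rand(2, 3, 64, 64) * 2 - 1
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()  # 1 = visible, 0 = missing
x_masked = x_gt * mask
z = torch.randn(2, 64)
loss = nn.functional.l1_loss(net(x_masked, mask, z), x_gt)
loss.backward()
```

Because the reconstruction target is deterministic given the partial image, the loss can reach its minimum with a decoder that ignores z altogether, which is exactly the behavior described above.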
Our method is divided into three stages to complete an input partial image. The neural network model takes as input a partial image X_M and a mask image M specifying the area to complete. The first stage encodes the partial image into a set of discrete tokens, referred to as latent codes, at a lower resolution and specifies the masked tokens that need to be predicted (Section 3.1); the second stage utilizes a bidirectional transformer to predict the missing tokens iteratively (Section 3.2); and the third stage couples the predicted tokens with features from the partial image and decodes them into a completed image (Section 3.3). Figure 2 provides a visualization of the overall pipeline of our method.
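To illustrate the flow of these three stages, below is a high-level sketch in PyTorch. It is not our implementation: partial_encoder, token_transformer, and decoder are hypothetical stand-ins for the components of Sections 3.1-3.3, and the confidence-based re-masking schedule is one assumed way to realize iterative prediction rather than the exact sampling strategy used by the method.

```python
import torch


def inpaint(x_masked, mask, partial_encoder, token_transformer, decoder,
            vocab_size=1024, num_steps=8):
    """Sketch of the three-stage completion: tokens -> iterative prediction -> decoding."""
    mask_id = vocab_size  # a special [MASK] id outside the codebook vocabulary

    # Stage 1: encode visible blocks into discrete tokens and flag missing positions.
    tokens, unknown = partial_encoder(x_masked, mask)  # (B, N) long, (B, N) bool (True = missing)
    tokens = torch.where(unknown, torch.full_like(tokens, mask_id), tokens)

    # Stage 2: iteratively fill in the missing tokens with a bidirectional transformer.
    # Each step keeps only the most confident predictions and re-masks the rest.
    for step in range(num_steps):
        logits = token_transformer(tokens)                 # (B, N, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~unknown, float("-inf"))   # never overwrite known tokens
        keep_ratio = (step + 1) / num_steps                # fill a growing fraction per step
        num_keep = (unknown.sum(-1).float() * keep_ratio).ceil().long()
        for b in range(tokens.size(0)):
            idx = conf[b].topk(int(num_keep[b])).indices
            tokens[b, idx] = pred[b, idx]
            unknown[b, idx] = False

    # Stage 3: decode the completed token map together with partial-image features.
    return decoder(tokens, x_masked, mask)


# Toy stand-ins, just to exercise the expected shapes (random outputs, no learned weights).
B, N, V = 1, 16 * 16, 1024
enc = lambda x, m: (torch.randint(0, V, (B, N)), torch.rand(B, N) > 0.5)
trf = lambda t: torch.randn(t.size(0), t.size(1), V)
dec = lambda t, x, m: torch.zeros(B, 3, 256, 256)
out = inpaint(torch.zeros(B, 3, 256, 256), torch.zeros(B, 1, 256, 256), enc, trf, dec)
```

In this sketch, committing only the most confident predictions at each step lets later iterations condition on tokens filled in earlier, while already-visible tokens are never overwritten.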
In this paper, we present a pluralistic image inpainting method that first analyzes only the visible and near-visible regions through latent code prediction, and then synthesizes the missing contents with a versatile bidirectional transformer and a reconstruction network that combines the code predictions with partial image priors. We validated our design choices through comparative experiments on public benchmarks and an ablation study; our method achieves state-of-the-art performance in both visual quality and sample diversity.