Don’t Look into the Dark: Latent Codes for Pluralistic Image Inpainting
CVPR 2024
Haiwei Chen    Yajie Zhao   
University of Southern California    USC Institute for Creative Technologies   

Figure 1. Inpainting results on the Places Dataset [50] (first two rows) and the CelebA-HQ Dataset [18] (third row). Our method diversely completes partial images with free-form, large holes at state-of-the-art visual quality.
Abstract

We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes. Our method learns latent priors, discretized as tokens, by performing computations only at the visible locations of the image. This is realized by a restrictive partial encoder that predicts a token label for each visible block, a bidirectional transformer that infers the missing labels by attending only to these tokens, and a dedicated synthesis network that couples the tokens with the partial image priors to generate coherent and pluralistic complete images even under extreme mask settings. Experiments on public benchmarks validate our design choices, as the proposed method outperforms strong baselines in both visual quality and diversity metrics.


Introduction

Image inpainting is the task of filling the missing pixels of a masked image with appropriate contents that are coherent with its visible regions. As a long-studied topic in computer vision, image inpainting has evolved from a restoration technique relying solely on existing information from the input image (e.g. [3]) to data-driven generative methods (e.g. [23, 27, 36, 41, 44, 48]) that hallucinate detailed contents from not only the observable pixels but also learned, rich image priors.

Pluralistic inpainting refers to the ability of a model to generate multiple plausible results that complete a partial image. It views image inpainting as a generative task that models the smooth distribution of complete images conditioned on the partial image as prior information [48]. However, modeling such distributions is challenging within typical encoder-decoder network structures. In order to synthesize missing contents that both respect the partial image and maintain sample diversity, the decoder in this setting takes as input two types of information: 1) features propagated from the visible regions and 2) random noise vectors sampled from a prior distribution. If the training objective is to reconstruct a single ground-truth image from the partial image, the objective itself may discourage the model from conditioning on the random variable.
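To make this conditioning issue concrete, the minimal PyTorch sketch below illustrates the typical encoder-decoder setup described above. All names and design details here (NaivePluralisticInpainter, the additive noise injection, the layer counts) are hypothetical illustrations, not the architecture proposed in this paper:

```python
import torch
import torch.nn as nn

class NaivePluralisticInpainter(nn.Module):
    """Illustrative encoder-decoder inpainter whose decoder receives
    1) features propagated from visible pixels and 2) a noise vector z."""

    def __init__(self, z_dim=64, feat_dim=256):
        super().__init__()
        self.enc = nn.Sequential(  # propagates features from visible regions
            nn.Conv2d(4, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.z_proj = nn.Linear(z_dim, feat_dim)  # injects the sampled noise
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat_dim, 3, 4, stride=2, padding=1),
        )

    def forward(self, x_masked, mask, z):
        f = self.enc(torch.cat([x_masked, mask], dim=1))  # visible features
        f = f + self.z_proj(z)[:, :, None, None]          # noise conditioning
        return self.dec(f)

# Under a pure reconstruction objective, e.g.
#   loss = F.l1_loss(model(x_masked, mask, z), x_ground_truth)
# the loss is minimized regardless of z, so gradients give the model no
# incentive to use it, and the noise path can collapse to being ignored.
```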

Moreover, as the training dataset contains numerous examples that only require low-level information to complete an image (e.g. smoothly interpolating a wall texture), the model may choose to ignore the latent priors whenever the available image cues are strong enough to provide an answer. This phenomenon has also been observed in image translation networks [16], where adding noise when generating a conditional image does little to create pluralistic results.



Figure 2. Overall pipeline of our method. E_rst denotes our proposed restrictive encoder that predicts partial tokens from the source image (see Section 3.1). The grey square space in the figure denotes the missing tokens, which are iteratively predicted by a bidirectional transformer (see Section 3.2). E_prt denotes an encoder with partial convolution layers, which processes the source image into features complementary to the predicted tokens. The coupled features are decoded into a complete image by a generator G (see Section 3.3).


Method

Our method is divided into three stages to complete an input partial image. The neural network model takes as input a partial image X_M and a mask image M specifying the area to complete. The first stage encodes the partial image into a set of discrete tokens, referred to as latent codes, at a lower resolution and specifies the masked tokens that need to be predicted (Section 3.1); the second stage utilizes a bidirectional transformer to predict the missing tokens iteratively (Section 3.2); and the third stage couples the predicted tokens with features from the partial image and decodes them into a completed image (Section 3.3). Figure 2 provides a visualization of the overall pipeline of our method.
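The sketch below gives a PyTorch-style view of inference through the three stages. Everything in it is an assumption made for illustration: the module interfaces (e_rst, transformer, e_prt, generator), the token shapes, and the confidence-based linear commit schedule are stand-ins for the actual components of Sections 3.1-3.3, not the released implementation:

```python
import torch

@torch.no_grad()
def inpaint(x_masked, mask, e_rst, transformer, e_prt, generator, steps=8):
    """Hedged sketch of three-stage inference; interfaces are hypothetical."""
    # Stage 1 (Sec. 3.1): the restrictive encoder predicts a token label for
    # each visible block; `known` marks which of the N tokens are observed.
    tokens, known = e_rst(x_masked, mask)     # (B, N) long, (B, N) bool

    # Stage 2 (Sec. 3.2): the bidirectional transformer fills in missing
    # labels over several iterations, committing only its most confident
    # predictions at each step (one common schedule for such transformers).
    for t in range(steps):
        if bool(known.all()):
            break
        logits = transformer(tokens)                   # (B, N, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        pred = torch.where(known, tokens, pred)        # never alter knowns
        conf = conf.masked_fill(known, float("-inf"))  # rank missing only
        n_missing = int((~known).sum(-1).min())
        k = max(1, n_missing // (steps - t))           # linear commit schedule
        keep = conf.topk(k, dim=-1).indices
        tokens = tokens.scatter(1, keep, pred.gather(1, keep))
        known = known.scatter(1, keep, torch.ones_like(keep, dtype=torch.bool))

    # Stage 3 (Sec. 3.3): couple the predicted tokens with features from the
    # partial-convolution encoder and decode the complete image.
    return generator(tokens, e_prt(x_masked, mask))
```

Because each pass through the loop samples from the transformer's predictive distribution over token labels, re-running the procedure yields distinct completions, which is the source of the method's pluralism.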



Visual examples of inpainting with both random masks (upper half) and the challenging large box mask (lower half), compared with the selected baseline methods.


Conclusion

In this paper, we present a pluralistic image inpainting method that first analyzes only the visible and near-visible regions through latent code prediction, then synthesizes the missing contents with a versatile bidirectional transformer and a reconstruction network that composes the code predictions with partial image priors. We validated our design choices through comparative experiments on public benchmarks and an ablation study, in which our method achieves state-of-the-art performance in both visual quality and sample diversity.




