Deep Exemplar-based Video Colorization
Bo Zhang^1    Mingming He^{1,2}    Jing Liao^3    Pedro V. Sander^1
Lu Yuan^4    Amine Bermak^1    Dong Chen^5
^1 Hong Kong UST    ^2 USC Institute for Creative Technologies    ^3 City University of Hong Kong
^4 Microsoft AI Perception and Mixed Reality    ^5 Microsoft Research
Abstract

This paper presents the first end-to-end network for exemplar-based video colorization. The main challenge is to achieve temporal consistency while remaining faithful to the reference style. To address this issue, we introduce a recurrent framework that unifies the semantic correspondence and color propagation steps. Both steps allow a provided reference image to guide the colorization of every frame, thus reducing accumulated propagation errors. Video frames are colorized in sequence based on the colorization history, and their coherency is further enforced by a temporal consistency loss. All of these components, learned end-to-end, help produce realistic videos with good temporal stability. Experiments show our results are superior to state-of-the-art methods both quantitatively and qualitatively.



Figure 2. The detailed diagram of the proposed network. The correspondence subnet finds the correspondence between the source image x_t^l and the reference image y_lab in the deep feature domain, and aligns the reference color accordingly. Based on this intermediate result of the correspondence subnet along with the last colorized frame, the colorization subnet predicts the color of the current frame.
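To make the recurrent per-frame flow in Figure 2 concrete, below is a minimal PyTorch sketch of the two-subnet pipeline. The module names ExemplarVideoColorizer, correspondence_net, and colorization_net, their interfaces (the correspondence subnet returning warped ab channels plus a confidence map), and the choice of the reference as the initial "previous frame" are all illustrative assumptions for this sketch, not the paper's exact API.

    import torch
    import torch.nn as nn

    class ExemplarVideoColorizer(nn.Module):
        def __init__(self, correspondence_net: nn.Module, colorization_net: nn.Module):
            super().__init__()
            self.correspondence_net = correspondence_net  # aligns reference colors to each frame
            self.colorization_net = colorization_net      # predicts ab channels per frame

        def forward(self, frames_l: torch.Tensor, reference_lab: torch.Tensor) -> torch.Tensor:
            # frames_l: (T, 1, H, W) luminance of the video; reference_lab: (1, 3, H, W)
            outputs = []
            prev_lab = reference_lab  # assumption: the history starts from the reference
            for t in range(frames_l.size(0)):
                x_l = frames_l[t:t + 1]                        # current luminance frame x_t^l
                warped_ab, confidence = self.correspondence_net(x_l, reference_lab)
                ab = self.colorization_net(x_l, warped_ab, confidence, prev_lab)
                prev_lab = torch.cat([x_l, ab], dim=1)         # feed the result to the next step
                outputs.append(prev_lab)
            return torch.stack(outputs, dim=0).squeeze(1)      # (T, 3, H, W) colorized frames

Because every frame sees both the reference (via the correspondence subnet) and the last colorized frame, errors do not have to propagate purely frame-to-frame, which is the mechanism behind the reduced accumulation described in the abstract.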
Implementation

Network Structure. The correspondence network involves 4 residual blocks, each with 2 conv layers. The colorization subnet adopts an auto-encoder structure with skip connections to reuse the low-level features. There are 3 convolutional blocks in the contractive encoder and 3 convolutional blocks in the decoder, which recovers the resolution; each convolutional block contains 2 to 3 conv layers. A tanh layer serves as the last layer to bound the chrominance output within the valid color range. The video discriminator consists of 7 conv layers, where the first six layers halve the input resolution progressively. We also insert a self-attention block after the second conv layer to let the discriminator examine global consistency. We use instance normalization, since colorization should not be affected by other samples in the same batch. To further improve training stability, we apply spectral normalization on both the generator and the discriminator, as suggested in prior work.
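As an illustration of the discriminator design described above, here is a hedged PyTorch sketch: seven conv layers with the first six halving the resolution, a SAGAN-style self-attention block after the second conv, and spectral normalization throughout. The channel widths, kernel sizes, and LeakyReLU slope are assumptions for the sketch only; the text does not specify them.

    import torch
    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    class SelfAttention(nn.Module):
        # Standard SAGAN-style self-attention block (assumed variant).
        def __init__(self, ch: int):
            super().__init__()
            self.query = spectral_norm(nn.Conv2d(ch, ch // 8, 1))
            self.key = spectral_norm(nn.Conv2d(ch, ch // 8, 1))
            self.value = spectral_norm(nn.Conv2d(ch, ch, 1))
            self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

        def forward(self, x):
            b, c, h, w = x.shape
            q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C/8)
            k = self.key(x).flatten(2)                     # (B, C/8, HW)
            attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW) global affinities
            v = self.value(x).flatten(2)                   # (B, C, HW)
            out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
            return self.gamma * out + x                    # lets D check global consistency

    class VideoDiscriminator(nn.Module):
        def __init__(self, in_ch: int = 3, base: int = 64):
            super().__init__()
            chs = [base, base * 2, base * 4, base * 4, base * 8, base * 8]
            layers, prev = [], in_ch
            for i, ch in enumerate(chs):                   # six stride-2 convs halve resolution
                layers += [spectral_norm(nn.Conv2d(prev, ch, 4, stride=2, padding=1)),
                           nn.LeakyReLU(0.2, inplace=True)]
                if i == 1:                                 # self-attention after the 2nd conv
                    layers.append(SelfAttention(ch))
                prev = ch
            layers.append(spectral_norm(nn.Conv2d(prev, 1, 4, padding=1)))  # 7th conv: patch scores
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            return self.net(x)

Note that the sketch applies spectral normalization inside the discriminator; per the paragraph above, the generator would use instance normalization in its conv blocks (with spectral normalization added on top), since batch statistics should not leak between colorization samples.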

Experiments

In this section, we first study the effectiveness of individual components in our method. Then, we compare our method with state-of-the-art approaches.

Conclusion

In this work, we propose the first exemplar-based video colorization method. We unify the semantic correspondence and colorization into a single network, training it end-to-end. Our method produces temporally consistent video colorization with realistic effects. Readers may refer to our supplementary material for more quantitative results.
