Deep Exemplar-based Video Colorization
Bo Zhang1    Mingming He1,2    Jing Liao3    Pedro V. Sander1    Lu Yuan4    Amine Bermak1    Dong Chen5   
Hong Kong UST1    USC Institute for Creative Technologies2    City University of Hong Kong3    Microsoft AI Perception and Mixed Reality4    Microsoft Research5   

This paper presents the first end-to-end network for exemplar-based video colorization. The main challenge is to achieve temporal consistency while remaining faithful to the reference style. To address this issue, we introduce a recurrent framework that unifies the semantic correspondence and color propagation steps. Both steps allow a provided reference image to guide the colorization of every frame, thus reduce accumulated propagation errors. Video frames are colorized in sequence based on the history of colorization, and its coherency is further enforced by the temporal consistency loss. All of these components, learnt end-toend, help produce realistic videos with good temporal stability. Experiments show our result is superior to the stateof-the-art methods both quantitatively and qualitatively.

Figure 2. The detailed diagram of the proposed network. The correspondence subnet finds the correspondence of of source image xtl and reference image ylab in the deep feature domain, and aligns the reference color accordingly. Based on the intermediate result of the correspondence map along with the last colorized frame, the colorization subnet predicts the color for the the current frame.

Network Structure. The correspondence network involves 4 residual blocks each with 2 conv layers. The colorization subnet adopts an auto-encoder structure with skipconnections to reuse the low-level features. There are 3 convolutional blocks in the contractive encoder and 3 convolutional blocks in the decoder which recovers the resolution; each convolutional block contains 2~3 conv layers. The tanh serves as the last layer to bound the chrominance output within the color space. The video discriminator consists of 7 conv layers where the first six layers halve the input resolution progressively. Also, we insert the self-attention block after the second conv layer to let the discriminator examine the global consistency. We use instance normalization since colorization should not be affected by the samples in the same batch. To further improve training stability we apply spectral normalization on both generator and discriminator as suggested in.


In this section, we first study the effectiveness of individual components in our method. Then, we compare our method with state-of-the-art approaches.


In this work, we propose the first exemplar-based video colorization. We unify the semantic correspondence and colorization into a single network, training it end-to-end. Our method produces temporal consistent video colorization with realistic effects. Readers could refer to our supplementary material for more quantitative results.

Footer With Address And Phones