We present a learning-based method for estimating the 4D reflectance field of a person from flat-lit video footage of that subject. For training data, we illuminate the subject with one light at a time and capture reflectance field data in a variety of poses and viewpoints. We estimate the lighting environment of the input video footage and use the subject’s reflectance field to create synthetic images of the subject illuminated by that lighting environment. We then train a deep convolutional neural network to regress the reflectance field from the synthetic images. We also use a differentiable renderer to provide feedback to the network by matching the relit images with the input video frames. This semi-supervised training scheme allows the neural network to handle poses not seen in the dataset and to compensate for lighting estimation error. We evaluate our method on video footage of real Holocaust survivors and show that it outperforms state-of-the-art methods in both realism and speed.
The New Dimensions in Testimony project at the University
of Southern California’s Institute for Creative Technologies
recorded extensive question-and-answer interviews
with twelve survivors of the World War II Holocaust.
Each twenty-hour interview, conducted over five days, produced
over a thousand responses, providing the material
for time-offset conversations through AI-based matching
of novel questions to recorded answers. These interviews
were recorded inside a large Light Stage system
with fifty-four high-definition video cameras. The multiview
data enabled the conversations to be projected three-dimensionally
on an automultiscopic display.
The light stage system is designed for recording relightable
reflectance fields, where the subject is illuminated
from one lighting direction at a time, and these datasets can
be recombined through image-based relighting. If the
subject is recorded with a high-speed video camera, a large
number of lighting conditions can be recorded within the duration
of a normal video frame, allowing a dynamic video
to be relit with new lighting. This enables the subject to be realistically
composited into a new environment (for example,
the place that the subject is speaking about) such that their
lighting is consistent with that of the environment. In 2012,
the project performed a successful early experiment using a
Spherical Harmonic Lighting Basis as in for relighting
a Holocaust survivor interview. However, recording with
an array of high-speed cameras proved too expensive
for the project, both in the cost of the hardware and in the
greatly increased storage cost of numerous uncompressed high-speed
video streams.
One of the most effective ways to perform realistic relighting
is to combine a dense set of basis lighting conditions
(a reflectance field) according to a novel lighting
environment to simulate the appearance in the new lighting.
However, this approach is not ideal for a dynamic performance
since it requires either high-speed cameras or an actor who
sits still for several seconds while the set of
one-light-at-a-time (OLAT) images is captured. The method of [25] overcomes this limitation
by using neural networks to regress 4D reflectance fields
from just two images of a subject lit by gradient illumination.
The authors postulate that flat-lit images can also be used
to achieve similar results, with less high-frequency detail.
Since the method casts relighting as a supervised regression
problem, it requires pairs of tracking images and their
corresponding OLAT images as ground truth for training.
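The recombination step described above can be illustrated with a minimal sketch. It assumes each OLAT image is stored as a flat list of pixel intensities and that the novel lighting environment has already been reduced to one scalar weight per light direction. The function name `relight` and the toy data are hypothetical and are not taken from any implementation used in this work.

```python
def relight(olat_images, light_weights):
    """Image-based relighting as a weighted sum of OLAT images.

    olat_images:   list of N images, each a flat list of pixel intensities
                   (one image per light direction).
    light_weights: list of N weights, one per light direction, e.g. the
                   novel environment sampled at that direction (assumption).
    """
    if len(olat_images) != len(light_weights):
        raise ValueError("need exactly one weight per OLAT image")
    num_pixels = len(olat_images[0])
    relit = [0.0] * num_pixels
    for image, weight in zip(olat_images, light_weights):
        for p in range(num_pixels):
            relit[p] += weight * image[p]
    return relit


# Toy example: three light directions, two pixels.
olat = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
weights = [0.5, 0.25, 1.0]
print(relight(olat, weights))  # [1.0, 1.0]
```

Because the output is linear in the OLAT images, the quality of the relit result depends directly on how densely the basis lighting conditions sample the sphere of incident directions.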
In the New Dimensions in Testimony project, most of
the Holocaust survivors’ interview footage was captured
in front of a green screen so that virtual backgrounds
could be added during post-production. However, this setup
poses difficulties for achieving consistent illumination between
the actors and the backgrounds in the final testimony
videos and does not provide the ground truth needed for supervised
training. In this paper, we use the limited OLAT
data to train a neural network to infer reflectance fields from
synthetically relit images. The synthetically relit images are
improved by matching them with the input interview images
through a differentiable renderer, enabling an end-to-end
training scheme.
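One way to picture this training scheme is as an objective with two terms: a supervised term on the predicted OLAT images where ground truth exists, and a photometric term that relights the prediction under the estimated lighting and compares it with the input frame. The sketch below only evaluates such a loss on toy data. The L1 error, the `photo_weight` parameter, and all function names are assumptions made for illustration, not the formulation used in this paper. Since relighting is a linear operation, the same computation written in an automatic differentiation framework would pass gradients back to the network.

```python
def relight(olat_images, light_weights):
    """Weighted sum of OLAT images (flat lists of pixel intensities)."""
    return [
        sum(w * img[p] for img, w in zip(olat_images, light_weights))
        for p in range(len(olat_images[0]))
    ]


def mean_abs_error(a, b):
    """Mean absolute (L1) error between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def training_loss(pred_olat, input_frame, light_weights,
                  gt_olat=None, photo_weight=1.0):
    """Hypothetical semi-supervised loss.

    Photometric term: relit prediction vs. the input video frame.
    Supervised term:  predicted vs. ground-truth OLAT images, used only
                      when ground truth is available for this sample.
    """
    relit = relight(pred_olat, light_weights)
    loss = photo_weight * mean_abs_error(relit, input_frame)
    if gt_olat is not None:
        per_image = [mean_abs_error(p, g) for p, g in zip(pred_olat, gt_olat)]
        loss += sum(per_image) / len(per_image)
    return loss


# Toy example: two light directions, two pixels.
pred = [[1.0, 0.0], [0.0, 2.0]]
est_light = [1.0, 0.5]
frame = [2.0, 1.0]
print(training_loss(pred, frame, est_light))  # 0.5
```

Samples without OLAT ground truth, such as interview frames with unseen poses, contribute only the photometric term, which is what lets the input footage itself constrain the prediction.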