Anticipated advances in the quality and affordability of Head-Mounted Displays (HMDs) such as the Oculus Rift and Sony's Project Morpheus indicate that virtual training may increasingly be delivered effectively through such devices. As in many simulation applications, including ICT projects, the user in the HMD is represented as an avatar present in the virtual environment, which can be seen by other people experiencing the simulation either as audience members or as other active participants. Ideally, such avatars would naturally mirror the user's facial expressions and appear to mouth the words being spoken by the person wearing the HMD. Sony's SOEmote system uses a web camera and a system similar to the Image Metrics LiveDriver to drive avatars with the user's facial expressions as they participate in traditional multi-user games. However, for users experiencing a simulation with an HMD, the face is largely obscured from observation by a facial performance capture system. In this project, we hypothesize that a multisensory approach can be used to record an HMD user's facial performance intent and map it realistically onto a virtual character. The multisensory approach will include invisible-light close-focus cameras observing the eyes from within the HMD, pressure and motion sensors around the facial interface to the HMD, and a close-focus camera at the bottom of the HMD to observe the mouth. We will train the system using data from an unoccluded facial performance capture system and evaluate its effectiveness at generating facial performance parameters for novel facial expressions.
Background: A great deal of work has been done on facial performance capture
[Williams 1990]. Modern systems successfully use a variety of techniques, including depth
sensors [Weise et al. 2011], passive multi-view stereo [Beeler et al. 2011], and monocular
web cameras [Cao et al. 2013]. Some techniques can leverage data from cameras mounted
directly to the head in the form of "head-cam" devices, used on movies such as Avatar
and research projects such as [Jones et al. 2010], allowing the user to physically walk
within an environment. However, to our knowledge, there is no system available which can
enable facial motion capture for a user wearing a head-mounted display where the face is
largely occluded from view. One facial performance capture technique which could be applied
in this circumstance is to detect muscle activations and eye movements from electrical potentials sensed with electrodes in contact with the skin, as in electromyography (EMG) and electrooculography (EOG). Such a system was used experimentally on Sony Imageworks' movie Beowulf for eye gaze estimation [Robertson 2007], but the results were somewhat unreliable and required careful placement of highly sensitive electrodes on the face. Nonetheless, the problem is
important because sensing the user's facial motion is crucial to effectively representing
them as an expressive avatar in a virtual training simulation.
Motivation and Opportunity: The opportunity we will leverage is to augment an HMD
with four different types of sensors: eye cameras, motion sensors, a mouth camera, and a
microphone, together with a training procedure to map the detected data to facial performance
actions. This will allow HMD-wearing participants to appear in a virtual environment with an avatar that faithfully mirrors their facial expressions and performance, including the ability to make virtual eye contact with co-participants.
Addressing S&T Required Capabilities: This seedling project will enhance the S&T required
capability of (T-4) Enhanced Gaming Capability by allowing multiple training game participants to use believable facial communication while using low-cost head-mounted displays to experience the environment. It will also help provide (T-7) Virtual Human Capabilities of sophisticated personas capable of natural interaction with live participants, since the facial
expressions of the live participants using HMDs will be available to the virtual human AI
algorithms. In addition, the virtual human avatars of the live participants will have enhanced
photorealistic capability (T-7) by being able to mirror the facial performance intent and eye
gaze of their human counterparts.
Our objective is to show how it may be possible to record facial performance in an
effective manner while a user is wearing an HMD. The research will be conducted in two parts. The first is to design hardware capable of recording the facial motion information needed to derive the necessary performance data from the HMD
device. The second is to design a process for interpreting the data so that it can
be used to drive a digital face model of the avatar matching the wearer's facial
performance intent. With these hardware and software systems in place, we will be
able to test the hypothesis that data which can be sensed by an HMD augmented with
additional hardware can be used to record the intended facial performance of the user.
The hardware will be designed to record the performance intent of the face from data that can be sensed from the HMD platform. We say performance intent because we recognize that facial
movement will be somewhat limited by contact with the HMD, namely, the rim of the structure
forming a shaded space between the eyes and the lenses of the HMD. The parts of the facial
motion intent we need to sense are the mouth, the gaze direction, the skin around the eyes,
the forehead, and the cheeks.
The mouth is the easiest to record since it is not covered by the HMD and can be observed
by a camera. The right panel of Fig. 1 shows the view of the mouth recordable from a
small camera sensor attached to the bottom of the HMD fitted with a miniature fisheye
lens. The entire mouth is viewable, and data from such imagery could be used to drive
facial animation using, for example, an Active Appearance Model (AAM) [Cootes et al. 1998].
The image of the mouth will be affected by ambient illumination, which will change as the
HMD user turns their head to look around or physically walks within a space. To counteract
this, active illumination from white LEDs could provide a constant illumination condition
on the mouth. Additionally, the LEDs could be pulsed brightly at the frame rate of the
camera, with the camera's shutter set to a very short exposure time, to ensure that the
active illumination on the mouth will dominate the ambient illumination,
as shown in [Jones et al. 2010].
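To illustrate the intuition behind pulsing the LEDs, the brief Python sketch below works through the arithmetic under assumed (not measured) power and timing values: concentrating the LED's per-frame energy budget into a short exposure window multiplies its advantage over constant ambient light by roughly the inverse of the camera's duty cycle.

    # Back-of-the-envelope sketch of the pulsed-illumination idea. The power and
    # timing values below are illustrative assumptions, not measurements.
    def active_to_ambient_ratio(led_avg_power_w, ambient_power_w,
                                frame_period_s, exposure_s):
        """Ratio of active (LED) to ambient light energy captured during one exposure.

        The LED's per-frame energy budget is assumed to be emitted entirely inside
        the exposure window, while ambient light contributes only in proportion to
        the exposure time.
        """
        active_energy = led_avg_power_w * frame_period_s   # whole per-frame budget in the pulse
        ambient_energy = ambient_power_w * exposure_s      # ambient accumulates only during exposure
        return active_energy / ambient_energy

    # Example: equal average power, 30 fps camera, 1 ms shutter -> roughly a 33x advantage.
    print(active_to_ambient_ratio(led_avg_power_w=1.0, ambient_power_w=1.0,
                                  frame_period_s=1.0 / 30.0, exposure_s=0.001))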
The motion of the skin near the eyes is more challenging to record since it is inside
the shade of the HMD, close to the HMD lenses, in a relatively dark environment, and
visible active illumination would distract from the imagery viewed by the user.
Fortunately, thanks to miniaturization driven by the mobile communications and medical industries, cameras are quite small, and several could be placed within the hood of the HMD looking at the skin around the eyes. To avoid visible distraction, we propose to use infrared LEDs to light the skin and infrared-only cameras to record the reflections. Since skin exhibits a greater degree of translucency in the infrared spectrum due to subsurface scattering, we will investigate whether parallel polarization, which accentuates specular reflections, helps generate a detectable signal strong enough for recording the motion of the skin. This subsystem may resemble the technique employed by infrared laser mice, which project an infrared laser pattern onto the surface below the mouse and use simple image processing algorithms to detect the direction and amount of motion of the surface beneath it.
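As a rough indication of how such mouse-style motion sensing might be implemented, the following Python sketch estimates the 2D translation of a small infrared skin patch between consecutive frames using phase correlation; the algorithm choice and frame format are illustrative assumptions rather than a committed design.

    import numpy as np

    def estimate_skin_shift(prev_patch, curr_patch):
        """Estimate the (dx, dy) translation of a skin patch between two IR frames
        via FFT-based phase correlation, in the spirit of an optical mouse sensor."""
        f0 = np.fft.fft2(prev_patch.astype(np.float64))
        f1 = np.fft.fft2(curr_patch.astype(np.float64))
        cross_power = np.conj(f0) * f1
        cross_power /= np.abs(cross_power) + 1e-9           # keep only the phase difference
        corr = np.fft.ifft2(cross_power).real
        peak_y, peak_x = np.unravel_index(np.argmax(corr), corr.shape)
        # Wrap the peak location into signed shifts of curr_patch relative to prev_patch.
        h, w = prev_patch.shape
        dy = peak_y if peak_y <= h // 2 else peak_y - h
        dx = peak_x if peak_x <= w // 2 else peak_x - w
        return dx, dy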
Detecting the eye gaze of a user wearing an HMD has been previously explored for the
purpose of concentrating rendering computation where the user happens to be looking.
One technique employed in [Miyashita et al. 2008] is electrooculography (EOG), where electrodes placed around the eye sense changes in the eye's standing electrical potential in order to determine how the gaze is directed. However, this was found to be only 69% accurate at determining the gaze among eight positions in a grid, which is insufficient for producing the few degrees of accuracy [Chen 2002] needed for proper eye contact for a virtual avatar. Better results were reported by [Lee and Park 2007], which used a corneal model, collimated infrared LEDs, and a camera to detect gaze direction. We propose to implement such a system using the same cameras and illumination
with which we plan to detect the motion of the skin around the eyes. Also, we will test
to see whether the eye's reflection of the images displayed by the HMD in the visible
spectrum can be used in determining the eye gaze direction.
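One plausible form of such a gaze estimator, sketched below in Python, maps the pupil-center-minus-glint vector to an on-screen gaze point through a small polynomial fit; the second-order feature set and calibration layout are illustrative assumptions, and detecting the pupil center and LED glint in each eye-camera frame is assumed to be handled upstream.

    import numpy as np

    def features(pupil, glint):
        """Pupil-minus-glint vector expanded with quadratic terms."""
        dx, dy = pupil[0] - glint[0], pupil[1] - glint[1]
        return np.array([1.0, dx, dy, dx * dy, dx * dx, dy * dy])

    def calibrate(pupils, glints, screen_targets):
        """Least-squares fit from pupil-glint features to known calibration targets."""
        A = np.stack([features(p, g) for p, g in zip(pupils, glints)])
        coeffs, *_ = np.linalg.lstsq(A, np.asarray(screen_targets), rcond=None)
        return coeffs  # shape (6, 2): maps features -> (gaze_x, gaze_y)

    def estimate_gaze(coeffs, pupil, glint):
        """Predict the gaze point for a new pupil/glint observation."""
        return features(pupil, glint) @ coeffs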
We have now proposed methods to acquire data within an HMD for eye gaze direction,
the motion of the skin around the eyes, and mouth animation. The parts of the face not yet observed include the forehead and cheeks, whose motions are also part of the expressive repertoire. As seen in Fig. 2, significant parts of these areas lie directly beneath the padded interface between the HMD and the face. Thus, to detect the motion of these areas, we propose to physically sense the motion of the skin using pressure sensors. We will test whether multiple dimensions of motion (left-right, up-down, and in-out) can be usefully detected at various points around the rim of the HMD to determine the muscle activations of the frontalis and corrugator muscles of the forehead (producing raised eyebrows and knitted brows), the lateral portions of the orbicularis oculi (providing additional information about the motion of the skin around the eyes), and the nasalis muscles (responsible for scrunched noses).
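A minimal sketch of how one frame of rim-sensor data might be organized is shown below; the eight sensing sites and their names are assumptions for illustration, and the flattened feature vector is intended to feed the learned mapping described later in this section.

    import numpy as np
    from dataclasses import dataclass

    # Assumed layout of sensing sites around the HMD rim (illustrative only).
    RIM_SITES = ("brow_left", "brow_center", "brow_right", "temple_left",
                 "temple_right", "cheek_left", "cheek_right", "nose_bridge")

    @dataclass
    class RimSensorFrame:
        # displacement[site] = (left_right, up_down, in_out), in sensor units
        displacement: dict

        def as_feature_vector(self):
            """Flatten the per-site readings into a fixed-order vector for the learned mapping."""
            return np.concatenate([np.asarray(self.displacement[s], dtype=float)
                                   for s in RIM_SITES])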
The second major area of inquiry of this project is to train a mapping that efficiently converts the multisensor performance data into blendshape facial performance data able to drive a realistic avatar. We note that the HMD will likely, to some extent, restrict the motion of the face due to the contact points. Thus, the mapping we wish to determine is from the detected facial activations to the facial expression the user intends to make.
In this proposal, we will test and evaluate the following approach to determining this
mapping. For a number of sample subjects, we will guide each subject to make a series of facial expressions based on the Facial Action Coding System (FACS) [Ekman and Friesen 1978], both while wearing and while not wearing the HMD. While not wearing the HMD, the subject will be prompted by images of the facial expressions shown on a monitor, and the HMD itself will show the expression images again for the HMD capture. While outside the HMD, we will use an image-based facial performance capture system such as FaceShift, based on [Weise et al. 2011], to determine the facial blendshape parameters for each expression. We will then collect
the corresponding data from the multimodal HMD sensors sensing the facial muscle motion and
the visual appearance of the mouth and the skin around the eyes. (We propose that the eye
gaze will be driven directly from the eye gaze measurement subsystem.) From this training
data, we will determine the facial performance mapping between the data sensed from the HMD
and the blendshape parameters from the traditional facial performance capture system.
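A minimal sketch of this training step appears below, assuming each calibration pose yields a feature vector of concatenated HMD sensor measurements paired with the blendshape weight vector from the unoccluded system; ridge regression is one plausible choice of mapping, not a commitment of the proposal.

    import numpy as np

    def train_mapping(X, Y, ridge=1e-3):
        """Fit W minimizing ||X W - Y||^2 + ridge * ||W||^2, with a bias column.

        X: (n_poses, n_sensor_features) HMD sensor features per calibration pose.
        Y: (n_poses, n_blendshapes) blendshape weights from the unoccluded capture.
        """
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])        # append bias term
        A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
        return np.linalg.solve(A, Xb.T @ Y)                  # (n_features + 1, n_blendshapes)

    def predict_blendshapes(W, x):
        """Map one frame of HMD sensor features to blendshape weights, clamped to [0, 1]."""
        return np.clip(np.append(x, 1.0) @ W, 0.0, 1.0)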
Our metric of success is whether the facial motion parameters detected with the sensor-augmented HMD can provide facial performance capture data comparable to that of an unoccluded facial performance
capture approach. For a number of subjects, after training the system using a subset of the
FACS poses, we will ask the subjects to create novel facial expressions not in the FACS
training set. These poses will also be taken from randomly chosen moments of an extended
facial performance. We will then be able to plot the degree of correspondence between the
facial blendshape poses detected by each of the systems. We will report which facial poses
are detected most consistently by the proposed system, and which facial poses are detected
least successfully, and draw conclusions regarding the success of the system and avenues for
future work on improvements.
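The comparison could be tabulated along the lines of the Python sketch below, which assumes time-aligned blendshape weight matrices (frames by blendshapes) from the two systems; per-blendshape RMSE and Pearson correlation are illustrative correspondence measures rather than the proposal's fixed metric.

    import numpy as np

    def compare_blendshapes(hmd_weights, reference_weights, names):
        """Rank blendshapes from most to least consistently detected (lowest RMSE first)."""
        results = []
        for i, name in enumerate(names):
            a, b = hmd_weights[:, i], reference_weights[:, i]
            rmse = float(np.sqrt(np.mean((a - b) ** 2)))
            corr = (float(np.corrcoef(a, b)[0, 1])
                    if a.std() > 0 and b.std() > 0 else float("nan"))
            results.append((name, rmse, corr))
        return sorted(results, key=lambda r: r[1])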
We will also evaluate the effectiveness of the system qualitatively by asking users to compare the facial motion performance synthesized from data recorded from the HMD wearer to that of similar performances recorded with the unoccluded facial capture system. A panel of users will rate the expressiveness, realism, and engagement of virtual character facial performances recorded from both the HMD-based capture system and the unoccluded capture system. The results will be tabulated and the performance of the two systems compared.
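For the tabulation and comparison, one possible analysis is sketched below, assuming each rater scores the same performance clip under both conditions on a Likert scale; the Wilcoxon signed-rank test is a reasonable paired comparison but only an illustrative choice.

    import numpy as np
    from scipy import stats

    def compare_ratings(hmd_scores, unoccluded_scores):
        """Summarize paired ratings of the two conditions and test for a difference."""
        hmd = np.asarray(hmd_scores, dtype=float)
        ref = np.asarray(unoccluded_scores, dtype=float)
        stat, p = stats.wilcoxon(hmd, ref)                   # paired, non-parametric test
        return {"hmd_mean": float(hmd.mean()),
                "unoccluded_mean": float(ref.mean()),
                "wilcoxon_stat": float(stat),
                "p_value": float(p)}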
The anticipated benefit will be the development and demonstration of a practical technique
for a person to more fully inhabit and communicate as an avatar within a head-mounted virtual
environment system.