Lorem Ipsum Dolor Sit Amet
Maya Moody    Brendan Webster    Karla Mcguire    Lyndsey Holt    Paul Debevec   
USC Institute for Creative Technologies

Figure 1: Real-time puppeteering of a photorealistic character built within minutes using a Kinect v1 sensor and the framework proposed in this paper.
Abstract

Creating and animating realistic 3D human faces is an important element of virtual reality, video games, and other areas that involve interactive 3D graphics. In this paper, we propose a system to generate photorealistic 3D blendshape-based face models automatically using only a single consumer RGB-D sensor. The capture and processing requires no artistic expertise to operate, takes 15 seconds to capture and generate a single facial expression, and approximately 1 minute of processing time per expression to transform it into a blendshape model. Our main contributions are a complete end-to-end pipeline for capturing and generating photorealistic blendshape models automatically, and a registration method that solves for dense correspondences between two face scans using facial landmark detection and optical flow. We demonstrate the effectiveness of the proposed method by capturing different human subjects with a variety of sensors and puppeteering their 3D faces with real-time facial performance retargeting. The rapid nature of our method allows for just-in-time construction of a digital face. To that end, we also integrated our pipeline with a virtual reality facial performance capture system that allows dynamic embodiment of the generated faces despite partial occlusion of the user's real face by the head-mounted display.

Keywords

face modeling, blendshapes, RGB-D, animation

Introduction

3D characters are an important element of many 3D games, simulations, feature films, and other media that use 3D content. Of particular interest is the ability to represent the human face for purposes of expression, speech, and recognition. Highly realistic, emotive digital human faces have been demonstrated in feature films and high-end video games, which utilize a combination of traditional 3D art techniques and high-quality scanning.

Scan-based facial modeling techniques are capable of capturing the appearance and expression of a human subject through a combination of still images, video, or 3D images from a depth sensor or laser scanner. In contrast to traditional 3D art pipelines, scan-based techniques allow for subtle and realistic variations in shape and color between subjects. However, the challenge when using scan-based data lies in manipulating it into a form that can be controlled in a simulation. Scanning techniques are susceptible to problems such as noisy data, inconsistent topologies, and texture discolorations; transforming such complex, unstructured data into a well-formed, controllable set of data is complicated and time-consuming. This complexity has limited the widespread use of high-quality scanned 3D faces in simulations.

Results

The blendshapes generated by the proposed method can be used in many animation and simulation environments that utilize blendshapes. Our approach is data agnostic and can utilize scan input from depth sensors. We demonstrate results using the Kinect v1, Intel RealSense F200, and Occipital Structure Sensor. We expect the quality of our results to scale with the depth and color specifications of the sensor, as demonstrated across low-fidelity (Kinect v1), medium-fidelity (RealSense), and high-fidelity (Structure Sensor) devices. Additionally, the method introduced in this paper is also compatible with 3D data generated from photogrammetry techniques. A single user can quickly capture scans, process the data, and puppeteer the generated face without artist intervention in a matter of minutes.

Our accompanying videos demonstrate the acquisition, processing, and use of the blendshape data with a real-time animation system and real-time facial tracking software. Noting the recent proliferation of consumer virtual reality technology, we have integrated our face scanning and processing pipeline with a recently developed head-mounted display facial performance capture system (see Figure 10). This system uses a head-mounted RGB-D camera to capture lower facial expressions, combined with strain sensors embedded in the foam lining of the display to sense expressions in the occluded upper region. This enables dynamic, real-time embodiment of one's own (or someone else's) face within an immersive virtual reality environment.

Geometry and Texture Warping

A 2D geometry and texture warping algorithm is used to align the source scan to the target scan [42], following three steps. First, a Delaunay triangulation is built over the sets of landmarks of both the source and target scans. The constructed mesh is used to roughly pre-warp the source texture map to the target using affine triangle transformations. Second, a GPU-accelerated optical flow is used to compute a dense warp field from the pre-warped source texture map to the target. Finally, the dense warp field is used to deform both the texture map and the point cloud from the source to the target scan, yielding the source scan warped into the target UV space.
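The following is a minimal sketch of this two-stage warp using OpenCV and NumPy; the paper uses a GPU-accelerated optical flow, for which the sketch substitutes OpenCV's CPU Farneback flow. All function and variable names (affine_prewarp, dense_warp_to_target, src_tex, and so on) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the two-stage warp, assuming OpenCV and NumPy.
# src_tex / tgt_tex are HxWx3 uint8 texture maps; src_lm / tgt_lm are
# Nx2 landmark arrays in pixel coordinates. Farneback flow is a CPU
# stand-in for the paper's GPU-accelerated flow.
import cv2
import numpy as np

def affine_prewarp(src_tex, src_lm, tgt_lm):
    """Steps 1-2: Delaunay-triangulate the target landmarks and roughly
    pre-warp the source texture with per-triangle affine transforms."""
    h, w = src_tex.shape[:2]
    subdiv = cv2.Subdiv2D((0, 0, w, h))
    for x, y in tgt_lm:
        subdiv.insert((float(x), float(y)))
    out = np.zeros_like(src_tex)
    for tri in subdiv.getTriangleList().reshape(-1, 3, 2):
        if (tri < 0).any() or (tri[:, 0] >= w).any() or (tri[:, 1] >= h).any():
            continue  # skip triangles touching Subdiv2D's virtual corners
        # Map triangle vertices back to landmark indices.
        idx = [int(np.argmin(np.linalg.norm(tgt_lm - v, axis=1))) for v in tri]
        M = cv2.getAffineTransform(src_lm[idx].astype(np.float32),
                                   tgt_lm[idx].astype(np.float32))
        warped = cv2.warpAffine(src_tex, M, (w, h))
        mask = np.zeros((h, w), np.uint8)
        cv2.fillConvexPoly(mask, tgt_lm[idx].astype(np.int32), 1)
        out[mask > 0] = warped[mask > 0]
    return out

def dense_warp_to_target(prewarped_tex, tgt_tex, channels):
    """Step 3: refine with dense optical flow, then backward-warp every
    channel (texture map, position map, ...) into the target UV space."""
    tgt_gray = cv2.cvtColor(tgt_tex, cv2.COLOR_BGR2GRAY)
    src_gray = cv2.cvtColor(prewarped_tex, cv2.COLOR_BGR2GRAY)
    # Flow from target to source: for each target pixel it tells us
    # where to sample the source, which is what cv2.remap expects.
    flow = cv2.calcOpticalFlowFarneback(tgt_gray, src_gray, None,
                                        0.5, 4, 31, 3, 5, 1.2, 0)
    h, w = tgt_gray.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (gx + flow[..., 0]).astype(np.float32)
    map_y = (gy + flow[..., 1]).astype(np.float32)
    return [cv2.remap(c, map_x, map_y, cv2.INTER_LINEAR) for c in channels]
```

Computing the flow from the target to the pre-warped source (rather than the reverse) yields a backward warp directly, so both the texture map and the geometry image can be resampled into the target UV space with a single remap each.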


Figure 2: Example texture alignment results. Left and center columns are the target and source texture maps respectively, generated from RGB-D capture (top) and multi-camera capture (bottom). Right column contains the results after applying the proposed approach to warp the source UV space to the target.

Some expressions are more challenging to correspond than others, especially expressions with heavy occlusions, such as warping an open mouth to a closed one. In such cases optical flow may fail to produce a good result, so our pipeline also provides a semi-automatic tool that allows the user to interactively manipulate the set of correspondences. The user can assist the optical flow in two ways: first, by painting black masks around occluded regions in both the source and target diffuse textures; and second, by marking points as "pinned", which are rasterized into small black dots at runtime. Using both techniques in combination usually produces good results even in the most challenging cases; a sketch of both assists follows. Figure 2 presents a visualization of texture warping results achieved with the proposed approach; notice how the UV space of the warped texture aligns with the neutral. The bottom row of Figure 2 analogously shows texture warping results for a character captured in a multi-camera setup with controlled lighting.
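As a concrete illustration, the two assists could be applied to the flow inputs as in the sketch below. The names (assist_flow_inputs, src_polys, pins) and the dot radius are hypothetical choices, not taken from the paper's tooling.

```python
# Minimal sketch of the two optical-flow assists, assuming OpenCV.
# src_polys / tgt_polys are user-painted occlusion polygons and pins
# are user-picked ((sx, sy), (tx, ty)) correspondence pairs; all names
# here are illustrative assumptions.
import cv2
import numpy as np

def assist_flow_inputs(src_tex, tgt_tex, src_polys, tgt_polys, pins):
    src, tgt = src_tex.copy(), tgt_tex.copy()
    # 1) Black out occluded regions in both textures so the flow cannot
    #    latch onto pixels that have no counterpart in the other scan.
    for poly in src_polys:
        cv2.fillPoly(src, [np.int32(poly)], (0, 0, 0))
    for poly in tgt_polys:
        cv2.fillPoly(tgt, [np.int32(poly)], (0, 0, 0))
    # 2) Rasterize pinned correspondences as small black dots; matching
    #    dots give the flow strong, unambiguous anchors to lock onto.
    for (sx, sy), (tx, ty) in pins:
        cv2.circle(src, (int(sx), int(sy)), 2, (0, 0, 0), -1)
        cv2.circle(tgt, (int(tx), int(ty)), 2, (0, 0, 0), -1)
    return src, tgt  # feed these to the flow instead of the originals
```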


Figure 3: Diagram depicting the proposed pipeline for generating photorealistic blendshapes from RGB-D: (a) A set of facial expressions is scanned; (b) Rigid alignment between expressions is obtained by automatic ICP registration; (c) and (f) 3D textured meshes are converted into a 2D representation and stored in EXR floating-point image format; in this particular case (c) is the source scan and (f) the target scan; (d) Automatic facial landmark detection is used to detect common features in the scans; (e) A combination of Delaunay triangulation over the detected landmarks and 2D optical flow is used for dense warping between the source scan (c) and target scan (f); (g) A reference mesh sharing the same UV space as the target scan is used to extract the final blendshape (h).
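For step (b), the rigid alignment between expression scans could be obtained with standard point-to-point ICP. The sketch below uses Open3D as one possible implementation; the paper does not prescribe a specific library, and the 5 mm correspondence threshold is an arbitrary choice.

```python
# Minimal sketch of step (b), rigid ICP alignment between expression
# scans, using Open3D as one possible implementation (an assumption).
import numpy as np
import open3d as o3d

def rigid_align(src_pcd, tgt_pcd, threshold=0.005):
    """Rigidly align an expression scan (src) to the reference (tgt)."""
    result = o3d.pipelines.registration.registration_icp(
        src_pcd, tgt_pcd, threshold, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    src_pcd.transform(result.transformation)  # apply the rigid pose in place
    return result.transformation
```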
Conclusion

Our method can rapidly generate a set of photorealistic, expressive facial poses as blendshapes from a single commodity depth sensor, while requiring no artistic or technical expertise on the part of the capture subject. We demonstrate our approach as part of a complete end-to-end system for scanning, processing, and real-time control. The rapid nature of model acquisition and automatic processing makes it possible to generate a controllable 3D face model in settings where fast construction of a new face model is desirable, for example within a virtual environment while wearing a head-mounted display. Thus, this work advances the state of the art for the rapid creation of photorealistic digital representations of real people, enabling multi-user communication and collaboration in virtual reality.
