We present a real-time method for synthesizing highly complex human motions using a novel training regime we call the auto-conditioned Recurrent Neural Network (acRNN). Recently, researchers have attempted to synthesize new motion by using autoregressive techniques, but existing methods tend to freeze or diverge after a couple of seconds due to an accumulation of errors that are fed back into the network. Furthermore, such methods have only been shown to be reliable for relatively simple human motions, such as walking or running. In contrast, our approach can synthesize arbitrary motions with highly complex styles, including dances or martial arts in addition to locomotion. The acRNN is able to accomplish this by explicitly accommodating autoregressive noise accumulation during training. To our knowledge, our work is the first to demonstrate the generation of over 18,000 continuous frames (300 seconds) of new complex human motion across different styles.
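To make the autoregressive setup concrete, the following is a minimal sketch (not the authors' released code) of how such a network synthesizes motion at test time: each predicted pose is fed back in as the next input, which is precisely where prediction errors can accumulate over long rollouts. The class and function names, pose dimensionality, and hidden size are illustrative assumptions, written here in PyTorch.

```python
# Minimal, hypothetical sketch of autoregressive motion synthesis with an LSTM.
# Each predicted pose becomes the next input, so small errors compound over time.
import torch
import torch.nn as nn

class PoseRNN(nn.Module):
    def __init__(self, pose_dim=171, hidden_dim=1024):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, x, state=None):
        h, state = self.lstm(x, state)
        return self.out(h), state

def synthesize(model, seed_poses, n_frames):
    """Roll the model forward autoregressively from a short seed sequence.

    seed_poses: tensor of shape (batch, seed_len, pose_dim).
    """
    with torch.no_grad():
        _, state = model(seed_poses[:, :-1])       # warm up the hidden state on the seed
        frame = seed_poses[:, -1:]                 # last observed frame
        generated = []
        for _ in range(n_frames):
            frame, state = model(frame, state)     # prediction is fed back as the next input
            generated.append(frame)
    return torch.cat(generated, dim=1)
```

In this standard formulation the network is trained only on ground-truth inputs, so at synthesis time it never sees inputs resembling its own (slightly noisy) output; the auto-conditioning scheme described above is designed to close exactly that gap.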
The synthesis of realistic human motion has recently seen increased interest (Holden et al., 2016; 2017; Fragkiadaki et al., 2015; Jain et al., 2016; Bütepage et al., 2017; Martinez et al., 2017) with applications beyond animation and video games. The simulation of human-looking virtual agents is likely to become mainstream with the dramatic advancement of Artificial Intelligence and the democratization of Virtual Reality. A challenge for human motion synthesis is to automatically generate new variations of motions while preserving a certain style, e.g., generating large numbers of different Bollywood dances for hundreds of characters in an animated scene of an Indian party. Aided by the availability of large human-motion capture databases, many database-driven frameworks have been employed to this end, including motion graphs (Kovar et al., 2002; Safonova & Hodgins, 2007; Min & Chai, 2012), as well as linear (Safonova et al., 2004; Chai & Hodgins, 2005; Tautges et al., 2011) and kernel methods (Mukai, 2011; Park et al., 2002; Levine et al., 2012; Grochow et al., 2004; Moeslund et al., 2006; Wang et al., 2008), which blend key-frame motions from a database. It is hard for these methods, however, to add new variations to existing motions in the database while keeping the style consistent. This is especially true for motions with a complex style such as dancing and martial arts. More recently, with the rapid development of deep learning, researchers have started to use neural networks to accomplish this task (Holden et al., 2017; 2016; 2015). These works have shown promising results, demonstrating the ability to use high-level parameters (such as a walking path) to synthesize locomotion tasks such as jumping, running, walking, and balancing. These networks, however, do not generate new variations of complex motion, being instead limited to specific use cases. In contrast, our paper provides a robust framework that can synthesize highly complex human motion variations of arbitrary styles, such as dancing and martial arts, without querying a database. We achieve this by using a novel deep auto-conditioned RNN (acRNN) network architecture.
Recurrent neural networks are autoregressive deep learning frameworks which seek to predict sequences of data similar to a training distribution. Such a framework is intuitive to apply to human motion, which can be naturally modeled as a time series of skeletal joint positions. We are not the first to leverage RNNs for this task (Fragkiadaki et al., 2015; Jain et al., 2016; Bütepage et al., 2017; Martinez et al., 2017), and these works produce reasonably realistic output on a number of tasks such as sitting, talking, and smoking. However, these existing methods also have a critical drawback: the motion becomes unrealistic within a couple of seconds and is unable to recover.
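The auto-conditioning idea addresses this drawback during training rather than at test time: instead of always conditioning the network on ground-truth frames (teacher forcing), the network is periodically conditioned on its own output for a fixed number of steps, so it learns to recover from its own accumulated errors. The following is a minimal sketch of one such training step, reusing the hypothetical PoseRNN defined above; the schedule lengths, the use of a mean-squared-error loss, and the choice to detach self-conditioned inputs are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal, hypothetical sketch of an auto-conditioned training step.
# Inputs alternate between gt_len ground-truth frames and cond_len frames
# of the network's own previous predictions.
import torch
import torch.nn.functional as F

def ac_training_step(model, optimizer, clip, gt_len=5, cond_len=5):
    """clip: ground-truth motion clip of shape (batch, T, pose_dim)."""
    optimizer.zero_grad()
    state = None
    prev_pred = None
    preds = []
    for t in range(clip.shape[1] - 1):
        # Decide whether this step is conditioned on ground truth or on the
        # network's own output, according to the fixed alternating schedule.
        use_own_output = (t % (gt_len + cond_len)) >= gt_len
        inp = prev_pred if (use_own_output and prev_pred is not None) \
              else clip[:, t:t + 1]
        pred, state = model(inp, state)
        preds.append(pred)
        # Detach the fed-back prediction to keep the sketch simple; whether
        # gradients flow through self-conditioned inputs is a design choice.
        prev_pred = pred.detach()
    # The target at every step is the next ground-truth frame.
    loss = F.mse_loss(torch.cat(preds, dim=1), clip[:, 1:])
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the loss at self-conditioned steps still compares against ground truth, the network is explicitly penalized for drifting away from the data distribution when fed its own noisy predictions, which is the recovery behavior that plain teacher forcing never exercises.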
We have shown the effectiveness of the acLSTM architecture in producing extended sequences of complex human motion. We believe our work demonstrates qualitative state-of-the-art results in motion generation, as all previous work has focused on synthesizing relatively simple human motion for extremely short time periods. Those works demonstrate motion generation for at most a couple of seconds, whereas acLSTM does not fail even after more than 300 seconds. Though we cannot yet prove indefinite stability, it seems empirically that acLSTM can generate arbitrarily long sequences. Current problems include occasionally choppy motion, self-collision of the skeleton, and unrealistic sliding of the feet. Further development of GAN methods, such as that of Lamb et al. (2016), could result in increased realism, though these models are notoriously hard to train as they often suffer from mode collapse. Combining our technique with physically based simulation to ensure realism after synthesis is also a potential next step. Finally, it is important to study the effects of using various condition lengths during training. We begin the exploration of this topic in the appendix, but further analysis is needed.