Learning-Based Human Character Animation Synthesis for Content Production
Type of position: PhD
Location: Rennes (Technicolor & Inria)
Pierre Hellier – Pierre.Hellier@technicolor.com
Francois Le Clerc – Francois.LeClerc@technicolor.com
Ludovic Hoyet - email@example.com
Introduction and context
Content production for film and advertising increasingly relies on computer-generated imagery to lower costs and enhance creative possibilities. In particular, many of today’s movies and advertisements feature synthetic human characters. The animation of the characters’ bodies is driven by the dynamics of an underlying skeleton, built from the main joints of the human body. The skeleton is later fleshed out into a 3D mesh by a process known as skinning, whereby the displacement of each vertex of the mesh is computed from the displacement of the neighbouring skeleton joints it is bound to. Accurately capturing the naturalness of human motion in the dynamics of the skeleton is key to the perceptual plausibility of the rendered animation.
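As a rough illustration of the skinning step described above, the sketch below blends, for one mesh vertex, the positions that each bound joint would assign to it. It is a simplified 2D version of the common linear blend skinning scheme, not a description of any particular production pipeline. The names `rigid_transform` and `skin_vertex`, the joint names and the weights are all hypothetical, and each joint transform is assumed to map the rest pose to the current pose.

```python
import math


def rigid_transform(angle, tx, ty):
    """Build a 2D rigid motion: rotate by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)

    def apply(point):
        x, y = point
        return (c * x - s * y + tx, s * x + c * y + ty)

    return apply


def skin_vertex(rest_position, bindings, joint_transforms):
    """Weighted blend of the vertex as moved by each joint it is bound to.

    bindings: list of (joint_name, weight) pairs; weights are assumed to sum to 1.
    joint_transforms: maps a joint name to its rest-to-current transform.
    """
    x = y = 0.0
    for joint, weight in bindings:
        px, py = joint_transforms[joint](rest_position)
        x += weight * px
        y += weight * py
    return (x, y)


# Hypothetical example: a vertex near the elbow, bound equally to two joints.
joint_transforms = {
    "upper_arm": rigid_transform(0.0, 0.0, 0.0),  # this joint did not move
    "forearm": rigid_transform(0.0, 0.0, 2.0),    # this joint moved up by 2 units
}
elbow_vertex = skin_vertex(
    (1.0, 0.0), [("upper_arm", 0.5), ("forearm", 0.5)], joint_transforms
)
print(elbow_vertex)  # (1.0, 1.0): halfway between the two joint-driven positions
```

Because every vertex depends only on the joints it is bound to, errors in the skeleton motion propagate directly to the rendered surface.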
Creating animations for photorealistic computer-generated movies is a highly demanding and complex part of the film production workflow that requires a very large amount of manual work. Keyframing and motion capture are the two dominant techniques used in the industry today. Keyframing refers to a purely manual editing process wherein artists draw the skeletons at selected temporal frames (“keyframes”), and further define non-linear interpolation paths for joint locations in between the keyframes. Motion capture is performed in a green room with specialized hardware, with marker-based setups that require some involvement on the part of the actors, as well as manual post-processing to incorporate artistic edits into the animations. In both cases, the amount of human intervention, and hence the production cost, is very high. Thus, there is a strong business justification for automating the non-creative parts of the animation process.
Advances in machine learning, and particularly deep learning, in recent years have boosted the research effort towards obtaining skeletal animations from the analysis of videos. The idea is to learn a mapping between the image of a human character and the 2D or even 3D locations of the joints of the character's body. However, due in part to the difficulty of the problem and in part to the lack of 3D annotated training data, the accuracy of joint location estimates is often poor, especially in the depth direction, which is not observable in the image. Besides, the estimated skeletons consist of only a few joints and often fail to cover the hands and the feet.
The generation of animations from videos offers promising prospects for optimizing the animation workflow in the content production industry. Still, a lot of work is needed to improve the resolution and accuracy of the produced animations, and to adapt the technology to make it usable in an interactive way by animation artists. Advancing towards these goals is the main purpose of the proposed PhD.
Existing techniques and limitations
The estimation of animation skeletons, a.k.a. human poses, in images and videos is an active research area, dominated by supervised machine learning approaches that leverage databases of images annotated with human joint locations. The initial target of 2D pose estimation [1] has now been extended to 3D, see for instance [2, 3]. Inferring the depth components of the skeleton joints turns out to be a challenging ill-posed problem. Even though various regularization strategies have been proposed, the estimated joint locations are still quite noisy, especially in the depth direction orthogonal to the plane of the observed image. This is also, to some extent, a consequence of the scarcity of 3D skeleton annotations, which are difficult to generate in “in-the-wild” environments [4]. A further issue with annotations, and as a result with human pose estimates, is that they are limited to a small number of body joints, excluding hands and feet. Overall, the accuracy and resolution of state-of-the-art “video to analysis” techniques are still unsuitable for animating even secondary characters in photorealistic films.
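The depth ambiguity mentioned above can be illustrated with a toy pinhole camera: a single image cannot tell apart two 3D joint positions lying on the same viewing ray. The sketch below is only an illustration under simple assumptions (an ideal pinhole model, camera at the origin looking along the z axis). The function name `project`, the focal length and the joint coordinates are hypothetical.

```python
def project(point, focal_length=1.0):
    """Ideal pinhole projection of a 3D point (x, y, z) with z > 0 onto the image plane."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


# Two hypothetical positions for the same joint, on the same viewing ray:
near_joint = (0.5, 1.0, 2.0)
far_joint = (1.0, 2.0, 4.0)  # twice as far from the camera

print(project(near_joint))  # (0.25, 0.5)
print(project(far_joint))   # (0.25, 0.5), the same image location
```

Since both candidates explain the observed 2D joint equally well, a 3D pose estimator has to rely on learnt priors or regularization to choose a depth.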
In parallel to human pose estimation, some research effort has been devoted to the characterization of human motion kinematics using learning-based approaches. The seminal work of Holden leverages an autoencoder framework to learn a “manifold” of human motion. It further proposes methods for editing animations in this manifold and mapping the editing controls to human-understandable high-level parameters. The learnt parameters of the encoder can be used to characterize the style of the motion and perform style transfer on animations. This technique could be extended to learn a specific motion model for a given character, perhaps based on initially produced animation sequences for this character, and further improve the generation of subsequent animations for this same character based on the learnt model.
Directions for research
Directions of research are flexible within the proposed context, but will explore areas related to improving animation quality for production use.
Requirements for candidacy
- Strong programming skills (C/C++ recommended)
- Strong knowledge of machine learning
- Basic knowledge of computer animation and graphics
We are looking for motivated candidates. Please send a CV, a motivation letter, reference letters, and any relevant material to firstname.lastname@example.org, email@example.com and firstname.lastname@example.org
This PhD will be conducted in the context of a CIFRE collaboration between Technicolor and the MimeTIC team (Inria Rennes). Technicolor is a leading company in the VFX world, combining its R&D expertise in Computer Vision and Computer Graphics with the artistic expertise of its studios, such as The Mill, Moving Picture Company, Mikros Image, etc. Inria is a leading French research institute in computer science, where research activities in MimeTIC focus on simulating virtual humans that behave in a natural manner and act with natural motions. The starting date of the PhD is flexible, and could be as early as 1 February 2019.
[1] A. Newell, K. Yang and J. Deng, “Stacked Hourglass Networks for Human Pose Estimation,” in European Conference on Computer Vision, 2016.
[2] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas and C. Theobalt, “VNect: Real-Time 3D Human Pose Estimation with a Single RGB Camera,” ACM Transactions on Graphics, vol. 36, no. 4, pp. 44:1–44:14, 2017.
[3] B. Tekin, A. Rozantsev, V. Lepetit and P. Fua, “Direct Prediction of 3D Body Poses from Motion Compensated Sequences,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] X. Zhou, Q. Huang, X. Sun, X. Xue and Y. Wei, “Towards 3D Human Pose Estimation in the Wild: a Weakly-supervised Approach,” in IEEE International Conference on Computer Vision (ICCV), 2017.