Computational Studies of Human Motion: Part 1, Tracking and Motion Synthesis
Foundations and Trends® in
Computer Graphics and Vision
Volume 1 Issue 2/3
DOI: 10.1561/0600000005
David A. Forsyth
University of Illinois Urbana Champaign
Okan Arikan
University of Texas at Austin
Leslie Ikemoto
University of California, Berkeley
James O’Brien
University of California, Berkeley
Deva Ramanan
Toyota Technological Institute at Chicago
Abstract
We review methods for kinematic tracking of the human body in video. The review is part of a projected book that is intended
to cross-fertilize ideas about motion representation between the animation and computer vision communities. The review confines
itself to the earlier stages of motion, focusing on tracking and motion synthesis; future material will cover activity representation
and motion generation.
In general, we take the position that tracking does not necessarily involve (as is usually thought) complex multimodal inference
problems. Instead, there are two key problems, both easy to state.
The first is lifting, where one must infer the configuration of the body in three dimensions from image data. Ambiguities
in lifting can result in multimodal inference problems, and we review what little is known about the extent to which a lift
is ambiguous. The second is data association, where one must determine which pixels in an image come from the body. We see
a tracking-by-detection approach as the most productive, and review various human detection methods.
Lifting, and a variety of other problems, can be simplified by observing temporal structure in motion, and we review the literature
on data-driven human animation to expose what is known about this structure. Accurate generative models of human motion would
be extremely useful in both animation and tracking, and we discuss the profound difficulties encountered in building such
models. Discriminative methods, which should be able to tell whether an observed motion is human or not, do not work well
yet, and we discuss why.
There is an extensive discussion of open issues. In particular, we discuss the nature and extent of lifting ambiguities, which
appear to be significant at short timescales and insignificant at longer timescales. This discussion suggests that the best
tracking strategy is to track a 2D representation, and then lift it. We point out some puzzling phenomena associated with
the choice of human motion representation: joint angles vs. joint positions. Finally, we give a quick guide to resources.
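
As an illustration of why lifting can be ambiguous, the following minimal sketch enumerates the 3D configurations of a kinematic chain that are consistent with a single set of 2D joint positions. It is not taken from the review. It assumes a scaled orthographic camera and known segment lengths, and the names `relative_depths`, `enumerate_lifts`, and `scale` are hypothetical. Under these assumptions each segment's depth offset is known only up to sign, so a chain of k foreshortened segments admits up to 2^k lifts.

```python
import math
from itertools import product


def relative_depths(p0, p1, length, scale=1.0):
    """Possible depth offsets (z1 - z0) for one segment of known length.

    Assumes scaled orthographic projection: image coords = scale * (x, y).
    """
    du = (p1[0] - p0[0]) / scale
    dv = (p1[1] - p0[1]) / scale
    sq = length ** 2 - (du ** 2 + dv ** 2)
    if sq < 0:
        raise ValueError("image segment is longer than the limb allows")
    dz = math.sqrt(sq)
    # A segment parallel to the image plane has a single solution.
    return [0.0] if dz == 0 else [dz, -dz]


def enumerate_lifts(points2d, lengths, scale=1.0):
    """Enumerate all 3D lifts of a chain of joints; root depth is fixed at 0."""
    options = [
        relative_depths(points2d[i], points2d[i + 1], lengths[i], scale)
        for i in range(len(lengths))
    ]
    lifts = []
    for choice in product(*options):
        z = 0.0
        chain = [(points2d[0][0] / scale, points2d[0][1] / scale, z)]
        for (u, v), dz in zip(points2d[1:], choice):
            z += dz
            chain.append((u / scale, v / scale, z))
        lifts.append(chain)
    return lifts


# Hypothetical two-segment "arm": both segments have length 5.
lifts = enumerate_lifts([(0, 0), (3, 0), (3, 4)], lengths=[5, 5])
for chain in lifts:
    print(chain)
```

Every chain printed here projects to the same image points, so a single frame cannot distinguish between them under these assumptions. Extra cues, such as the temporal structure of motion, are needed to choose one.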