- Analysis of people and their activities in video/multimedia (live video,
video archives, video-on-the-web)
- "Making sense" of multimedia databases of people in action
Video and multimedia from a variety of sources, such as video archives,
live video, and the web, can be analyzed for their content. A major
component of that content is people and their actions. We are developing
state-of-the-art algorithms for simple and efficient analysis and
interpretation of people and their actions in video.
Model
The human figure in each image of a video sequence is observed as a set of
pixels, z. This representation is, of course, particular to
images and can be reduced to a more representative kinematic figure
model. We adopted the 2-D Scaled Prismatic Model proposed by Morris
and Rehg. The kinematic model lies in the image plane, with each link having
one degree of freedom (DOF) in rotation and another
in length. The appearance of each link in the image is described
by a template of pixels, y. The final state of the figure is then fully
determined by a set of joint angles, x.
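To make the link parameterization concrete, here is a minimal sketch of forward kinematics for one chain of such a model, assuming each link is described by an in-plane rotation relative to its parent and a variable length. The function name `spm_joint_positions`, its arguments, and the two-link example values are hypothetical and are not taken from Morris and Rehg's formulation.

```python
import math

def spm_joint_positions(base, angles, lengths):
    """Return the 2-D image-plane joint positions of one kinematic chain.

    base    -- (x, y) position of the chain's root joint, in pixels
    angles  -- per-link in-plane rotation in radians, relative to the parent link
    lengths -- per-link length in pixels (the second, length DOF of each link)
    """
    if len(angles) != len(lengths):
        raise ValueError("each link needs one angle and one length")
    x, y = base
    heading = 0.0
    points = [(x, y)]
    for theta, length in zip(angles, lengths):
        heading += theta                    # rotations accumulate along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points

# Hypothetical two-link "arm": a 40-pixel link followed by a 30-pixel link.
arm = spm_joint_positions((0.0, 0.0), [math.pi / 2, -math.pi / 2], [40.0, 30.0])
```

Because everything stays in the image plane, shortening a link's length is one way such a chain can absorb the apparent foreshortening of a limb that points toward or away from the camera.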
Dynamics
The human figure exhibits complex and rich dynamic behavior that is both
nonlinear and time-varying. However, most work on tracking and analysis
of figure motion has employed either generic or highly specific hand-tailored
dynamic models superficially coupled with hidden Markov models of motion
regimes. We propose an alternative class of learned dynamic models known
as switching linear dynamic systems (SLDSs).
An SLDS model describes the dynamics of a complex, nonlinear physical
process by switching among a set of linear dynamic models over time.
A Markov model governs this switching process: a probability is associated
with each possible switching event.
Systems view of SLDS. Switching state s_k
at time k selects a linear dynamic system with transition matrix
A_k and noise variance Q_k.
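In equation form, the systems view can be sketched as follows. Here A(s_k) and Q(s_k) stand for the transition matrix and noise variance selected by the switching state. The measurement symbol m_k, the observation matrix C, the measurement noise variance R, and the transition table \Pi are notational assumptions introduced only for this illustration.

```latex
\begin{aligned}
x_k &= A(s_k)\, x_{k-1} + v_k, & v_k &\sim \mathcal{N}\big(0,\; Q(s_k)\big), \\
m_k &= C\, x_k + w_k,          & w_k &\sim \mathcal{N}(0,\; R), \\
\Pr(s_k = j \mid s_{k-1} = i) &= \Pi(i, j).
\end{aligned}
```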
An equivalent and appealing representation of the SLDS utilizes the
language of probabilistic graphical models. In particular, a dynamic
Bayesian network representation depicts dependencies of states of the system
as they evolve over time. The Bayesian network representation is
useful for another reason: it allows one to apply a wealth of Bayesian
network inference and learning techniques to SLDS and, hence, to the modeling
of human figure motion.
Inference in SLDS
The goal of inference in SLDS is to estimate the posterior probability
of the hidden states of the system (switching and linear) from a sequence
of measurements. If there were no switching dynamics, the inference would
be straightforward: we could infer the linear state using standard LDS inference
(Rauch-Tung-Striebel, or RTS, smoothing). However, the presence of
switching dynamics makes exact inference exponentially complex and
intractable for even moderate sequence lengths. It is therefore necessary
to explore approximate inference techniques that will result in a tractable
learning method.
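For reference, the non-switching case mentioned above can be sketched in a few lines. This is a scalar Kalman filter followed by an RTS backward pass, assuming a model of the form x_k = a x_{k-1} + v_k, m_k = c x_k + w_k. The function names, parameter names, and example readings are hypothetical, and a real figure model would use vectors and matrices rather than scalars.

```python
def kalman_filter(measurements, a, q, c, r, mean0, var0):
    """Scalar Kalman filter; returns filtered and one-step-predicted (mean, var)."""
    filtered, predicted = [], []
    mean, var = mean0, var0                  # prior on the first state
    for k, m in enumerate(measurements):
        if k > 0:                            # time update
            mean, var = a * mean, a * a * var + q
        predicted.append((mean, var))
        gain = var * c / (c * c * var + r)   # measurement update
        mean = mean + gain * (m - c * mean)
        var = (1.0 - gain * c) * var
        filtered.append((mean, var))
    return filtered, predicted

def rts_smoother(filtered, predicted, a):
    """Rauch-Tung-Striebel backward pass over the filter output."""
    smoothed = list(filtered)                # the last smoothed estimate equals the filtered one
    for k in range(len(filtered) - 2, -1, -1):
        f_mean, f_var = filtered[k]
        p_mean, p_var = predicted[k + 1]
        s_mean, s_var = smoothed[k + 1]
        gain = f_var * a / p_var
        smoothed[k] = (f_mean + gain * (s_mean - p_mean),
                       f_var + gain * gain * (s_var - p_var))
    return smoothed

# Hypothetical noisy readings of a roughly constant joint angle (radians).
readings = [1.0, 1.2, 0.9, 1.1]
filtered, predicted = kalman_filter(readings, a=1.0, q=0.01, c=1.0, r=0.1,
                                    mean0=0.0, var0=1.0)
smoothed = rts_smoother(filtered, predicted, a=1.0)
```

With switching, each of the S regimes could be active at each of the T frames, so there are S^T candidate switching sequences. Even S = 2 and T = 100 gives 2^100, roughly 1.27 x 10^30, which is why the exact posterior cannot be enumerated.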
We propose two approximate inference techniques for SLDS: Viterbi, or
winner-takes-all, inference and structured variational inference.
Viterbi Inference
The Viterbi approximation finds the single most likely sequence of switching
states, the one that best explains the sequence of measurements. This is a
greedy procedure based on dynamic programming (DP), which makes it computationally
efficient. Unfortunately, it only performs well when measurement sequences have
low levels of noise.
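One way such a winner-takes-all recursion can be organized is sketched below for a scalar SLDS. At every frame and for every regime j, it keeps only the best predecessor regime together with the Kalman-filtered state estimate along that best path. This is an illustrative simplification under assumed notation (`a`, `q`, `c`, `r`, `trans`, `prior` are hypothetical names) and is not presented as the exact procedure used in this project. All transition and prior probabilities are assumed to be strictly positive.

```python
import math

def slds_viterbi(measurements, a, q, c, r, trans, prior, mean0, var0):
    """Approximate Viterbi decoding of the switching states of a scalar SLDS.

    a[j], q[j]  -- dynamics coefficient and process-noise variance of regime j
    trans[i][j] -- P(s_k = j | s_{k-1} = i);  prior[j] -- P(s_0 = j)
    """
    n = len(a)

    def update(mean, var, m):
        # Kalman measurement update plus the innovation's negative log-likelihood.
        s = c * c * var + r
        resid = m - c * mean
        gain = var * c / s
        nll = 0.5 * (math.log(2.0 * math.pi * s) + resid * resid / s)
        return mean + gain * resid, (1.0 - gain * c) * var, nll

    cost, est, back = [], [], []
    for j in range(n):                       # first frame: prior only
        mean, var, nll = update(mean0, var0, measurements[0])
        cost.append(nll - math.log(prior[j]))
        est.append((mean, var))
    for m in measurements[1:]:
        new_cost, new_est, ptr = [], [], []
        for j in range(n):
            best = None
            for i in range(n):               # try every predecessor regime
                mean, var = est[i]
                mean, var, nll = update(a[j] * mean, a[j] ** 2 * var + q[j], m)
                total = cost[i] + nll - math.log(trans[i][j])
                if best is None or total < best[0]:
                    best = (total, (mean, var), i)
            new_cost.append(best[0])
            new_est.append(best[1])
            ptr.append(best[2])
        cost, est = new_cost, new_est
        back.append(ptr)
    state = min(range(n), key=lambda j: cost[j])
    path = [state]
    for ptr in reversed(back):               # backtrack through stored predecessors
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical data: regime 0 halves the state, regime 1 multiplies it by 1.5.
observed = [1.0, 1.5, 2.25, 3.375, 1.6875, 0.84375, 0.421875]
path = slds_viterbi(observed, a=[0.5, 1.5], q=[0.01, 0.01], c=1.0, r=0.01,
                    trans=[[0.9, 0.1], [0.1, 0.9]], prior=[0.5, 0.5],
                    mean0=0.0, var0=10.0)
```

Because only one state estimate survives per regime and frame, a noisy measurement that picks the wrong predecessor early on cannot be corrected later, which is consistent with the sensitivity to noise noted above.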
Structured Variational Inference
We noted that the complexity of inference in SLDS comes from the dependency
between the switching and the linear states. If these states could
somehow be decoupled, the inference process would become easier. Structured
variational inference does exactly that: it decouples the switching
from the linear states by introducing a set of new decoupling variables.
Inference in the two decoupled networks (an HMM and an LDS) can then be
accomplished with standard algorithms.
The simplified structure comes at a cost, though. The inference
becomes an iterative process. At each iteration, inference in the
HMM depends on the solution of inference in the LDS, and vice
versa.
The variational inference solution is globally optimal and does not suffer
from artifacts of local approximations, as is the case in Viterbi inference
and some other local approximation schemes. However, its iterative
nature induces a tradeoff in complexity.
Structured variational inference recursions in SLDS.
The original SLDS is decoupled into an HMM and an LDS. Inference in each
of the two models depends on the values of their variational parameters
(q for HMM and A,O for LDS). Those are in turn dependent on the estimates
of the states of the opposing models.
Another view of variational recursions for SLDS.
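The decoupling can be written compactly using the standard structured mean-field form. The notation below is an assumption made for illustration: S is the switching sequence, X the sequence of linear states, M the measurement sequence, and Q the approximating distribution.

```latex
\begin{aligned}
Q(S, X) &= Q_S(S)\, Q_X(X), \\
(Q_S^\ast, Q_X^\ast) &= \arg\min_{Q_S,\, Q_X}\;
   \mathrm{KL}\big(\, Q_S(S)\, Q_X(X) \;\big\|\; P(S, X \mid M) \,\big), \\
\log Q_S(S) &= \mathbb{E}_{Q_X}\big[ \log P(S, X, M) \big] + \text{const}, \\
\log Q_X(X) &= \mathbb{E}_{Q_S}\big[ \log P(S, X, M) \big] + \text{const}.
\end{aligned}
```

The last two lines are the coupled fixed-point conditions: the HMM factor is updated from expectations under the LDS factor, and the LDS factor from expectations under the HMM factor. Each update cannot increase the KL divergence, so alternating them yields the iteration described above.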
Learning in SLDS
As alluded to before, learning in SLDS is trivial once an efficient inference
algorithm is determined. A rather general EM framework can be used
to estimate MAP parameters of an SLDS. For instance, estimates of
model parameters are

where <·> denotes estimates of the quantities obtained during inference.
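As a sketch of what such updates typically look like, the standard EM maximization step for a switching linear model is shown below. It assumes flat priors, so that MAP and maximum-likelihood estimates coincide, and it uses assumed notation: s_k(i) is 1 when regime i is active at frame k and 0 otherwise, and \Pi(i, j) is the probability of switching from regime i to regime j. These are not necessarily the exact expressions intended here.

```latex
\begin{aligned}
\hat{A}_i &= \Big( \sum_{k=2}^{T} \big\langle s_k(i)\, x_k x_{k-1}^{\top} \big\rangle \Big)
             \Big( \sum_{k=2}^{T} \big\langle s_k(i)\, x_{k-1} x_{k-1}^{\top} \big\rangle \Big)^{-1}, \\
\hat{Q}_i &= \frac{ \sum_{k=2}^{T} \big\langle s_k(i)\, x_k x_k^{\top} \big\rangle
                    - \hat{A}_i \sum_{k=2}^{T} \big\langle s_k(i)\, x_{k-1} x_k^{\top} \big\rangle }
                  { \sum_{k=2}^{T} \big\langle s_k(i) \big\rangle }, \\
\hat{\Pi}(i, j) &= \frac{ \sum_{k=2}^{T} \big\langle s_{k-1}(i)\, s_k(j) \big\rangle }
                        { \sum_{k=2}^{T} \big\langle s_{k-1}(i) \big\rangle }.
\end{aligned}
```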
Examples of Applications
Classification of human actions
Classification of unknown motion sequences is an obvious way to show
the impact of different dynamic models. The goal is to interpret
an unknown motion sequence in terms of the known, learned action models.
For instance, an unknown motion can be recognized as a motion consisting
of alternations of "jog" and "walk."
Classification of human motion. A sequence of joint
angles extracted from a video of a person in motion is classified as jog
(state 1) or walk (state 2).
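Once a switching-state sequence has been inferred, turning it into a readable description of the motion is a small bookkeeping step. The sketch below assumes the caption's numbering (1 = jog, 2 = walk). The function name `label_segments` and the decoded sequence are hypothetical.

```python
from itertools import groupby

def label_segments(states, names):
    """Collapse a per-frame switching-state sequence into labeled runs.

    Returns (label, first_frame, last_frame) tuples, with frames counted from 0.
    """
    segments, start = [], 0
    for state, run in groupby(states):
        length = sum(1 for _ in run)
        segments.append((names[state], start, start + length - 1))
        start += length
    return segments

# Hypothetical decoded sequence: walk for three frames, jog for two, walk for two.
decoded = [2, 2, 2, 1, 1, 2, 2]
segments = label_segments(decoded, {1: "jog", 2: "walk"})
# segments == [("walk", 0, 2), ("jog", 3, 4), ("walk", 5, 6)]
```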
Synthesis of realistic-looking human motion
An SLDS, as a generative model, can be used to synthesize natural-looking
human figure motion. In the simplest setting, sampling from a "walk"
SLDS yields an example of the "walk" motion consistent with the learned
model.
Synthesized "jog" motion obtained by sampling from a "jog"
SLDS.
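Sampling from the model amounts to alternating a Markov switching step with a noisy linear update. A minimal scalar sketch follows. The function name, the parameter values, and the use of a single scalar state in place of a vector of joint angles are all assumptions made for illustration, not learned "jog" or "walk" parameters.

```python
import random

def sample_slds(n_steps, a, q, trans, prior, x0, seed=0):
    """Draw one (switching-state, continuous-state) trajectory from a scalar SLDS.

    a[j], q[j]  -- dynamics coefficient and process-noise variance of regime j
    trans[i][j] -- P(s_k = j | s_{k-1} = i);  prior[j] -- P(s_0 = j)
    """
    rng = random.Random(seed)                 # seeded for reproducible samples
    regimes = list(range(len(a)))
    s = rng.choices(regimes, weights=prior)[0]
    x = x0
    switches, xs = [s], [x]
    for _ in range(n_steps - 1):
        s = rng.choices(regimes, weights=trans[s])[0]   # Markov switching step
        x = a[s] * x + rng.gauss(0.0, q[s] ** 0.5)      # linear dynamics of regime s
        switches.append(s)
        xs.append(x)
    return switches, xs

# Hypothetical two-regime model; the values are illustrative only.
switches, xs = sample_slds(50, a=[0.9, 1.0], q=[0.01, 0.04],
                           trans=[[0.95, 0.05], [0.10, 0.90]],
                           prior=[0.5, 0.5], x0=0.0)
```

In a full figure model, each sampled state vector would then be rendered through the kinematic model to produce the synthesized frames.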