Human Action Modeling Using Switching Linear Dynamic Systems
Project Setting
  • Analysis of people and their activities in video/multimedia (live video, video archives, video-on-the-web)
  • "Making sense" of multimedia databases of people in actions

Video and multimedia from a variety of sources, such as video archives, live video, and the web, can be analyzed for their content. A major component of that content is people and their actions. We are developing state-of-the-art algorithms for simple and efficient analysis and interpretation of people and their actions in video.

Model  

The human figure in each image of a video sequence is observed as a set of pixels, z. This representation is, of course, particular to images and can be reduced to a more representative kinematic figure model. We adopted the 2-D Scaled Prismatic Model proposed by Morris and Rehg. The kinematic model lies in the image plane, with each link having one degree of freedom (DOF) in rotation and another DOF in length. The appearance of each link in the image is described by a template of pixels, y. The final state of the figure is then fully determined by a set of joint angles, x.

Dynamics 

The human figure exhibits complex and rich dynamic behavior that is both nonlinear and time-varying. However, most work on tracking and analysis of figure motion has employed either generic or highly specific hand-tailored dynamic models superficially coupled with hidden Markov models of motion regimes. We propose an alternative class of learned dynamic models known as switching linear dynamic systems (SLDSs). 

An SLDS model describes the dynamics of a complex, nonlinear physical process by switching among a set of linear dynamic models over time. A Markov model governs this switching process: a probability is associated with each possible switching event.


Systems view of SLDS. Switching state s_k at time k selects a linear dynamic system with transition matrix A_k and noise variance Q_k.
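The switching model described above can be written compactly as follows. This is a sketch in one common notation, not necessarily the exact parameterization used in this project: the measurement y_k, the measurement matrix C, the measurement noise variance R, and the switching transition matrix \Pi are assumed symbols introduced here for illustration.

```latex
\begin{align}
  x_{k+1} &= A(s_{k+1})\, x_k + v_{k+1}, & v_{k+1} &\sim \mathcal{N}\big(0,\, Q(s_{k+1})\big), \\
  y_k     &= C\, x_k + w_k,              & w_k     &\sim \mathcal{N}(0,\, R), \\
  \Pr(s_{k+1} = i \mid s_k = j) &= \Pi(i, j).
\end{align}
```

Here A(s) and Q(s) are the transition matrix and noise variance selected by the switching state s, so the caption's A_k and Q_k correspond to A(s_k) and Q(s_k).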

An equivalent and appealing representation of the SLDS utilizes the language of probabilistic graphical models. In particular, a dynamic Bayesian network representation depicts the dependencies among the states of the system as they evolve over time. The Bayesian network representation is useful for another reason: it allows one to apply a wealth of Bayesian network inference and learning techniques to SLDS and, hence, to the modeling of human figure motion.


 
 

Inference in SLDS 

The goal of inference in SLDS is to estimate the posterior probability of the hidden states of the system (switching and linear) from a sequence of measurements. If there were no switching dynamics, the inference would be straightforward: we could infer the linear state using standard LDS inference (Kalman filtering followed by Rauch-Tung-Striebel, or RTS, smoothing). However, the presence of switching dynamics makes exact inference exponentially complex: the posterior over the linear state is a mixture of Gaussians with one component per switching sequence, and the number of such sequences grows exponentially with the sequence length. Exact inference is therefore intractable for even moderate sequence lengths, and it is necessary to explore approximate inference techniques that will result in a tractable learning method.
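For reference, the non-switching case mentioned above can be sketched in a few lines of NumPy: a Kalman filter forward pass followed by an RTS backward pass. This is a minimal sketch, assuming a single linear model with a linear-Gaussian measurement y = C x + noise; the function names and the parameters C, R, x0, P0 are illustrative assumptions and do not come from the project code.

```python
import numpy as np

def kalman_filter(y, A, Q, C, R, x0, P0):
    """Forward pass: filtered (xf, Pf) and one-step predicted (xp, Pp) moments."""
    T, n = len(y), len(x0)
    xf, Pf = np.zeros((T, n)), np.zeros((T, n, n))
    xp, Pp = np.zeros((T, n)), np.zeros((T, n, n))
    for t in range(T):
        if t == 0:
            # The first prediction is simply the prior (assumed given).
            xp[t], Pp[t] = x0, P0
        else:
            xp[t] = A @ xf[t - 1]
            Pp[t] = A @ Pf[t - 1] @ A.T + Q
        # Measurement update with innovation covariance S and Kalman gain K.
        S = C @ Pp[t] @ C.T + R
        K = Pp[t] @ C.T @ np.linalg.inv(S)
        xf[t] = xp[t] + K @ (y[t] - C @ xp[t])
        Pf[t] = (np.eye(n) - K @ C) @ Pp[t]
    return xf, Pf, xp, Pp

def rts_smoother(xf, Pf, xp, Pp, A):
    """Backward pass: smoothed moments given the whole measurement sequence."""
    T = len(xf)
    xs, Ps = xf.copy(), Pf.copy()
    for t in range(T - 2, -1, -1):
        J = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])   # smoother gain
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return xs, Ps
```

Both passes are linear in the sequence length. In an SLDS the same recursions would have to be run once per switching sequence, which is what makes exact inference impractical.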

We propose two approximate inference techniques for SLDS: Viterbi, or winner-takes-all, inference and structured variational inference.

Viterbi Inference 

The Viterbi approximation finds the single most likely sequence of switching states, the one that best explains the sequence of measurements. It is a dynamic-programming (DP) procedure that makes a greedy choice at each step, which keeps it computationally efficient. Unfortunately, it performs well only when the measurement sequences have low levels of noise.
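One way to realize such a procedure is sketched below. It is a simplified illustration under stated assumptions, not the project's implementation: for each switching state at each time step it keeps only the best predecessor, together with the Kalman-filtered estimate of the linear state along that best path. The names slds_viterbi, kalman_step, and gauss_logpdf, the row-stochastic convention trans[j, i] = P(s_t = i | s_{t-1} = j), and the shared measurement model (C, R) are all assumptions made for this example.

```python
import numpy as np

def gauss_logpdf(e, S):
    """Log-density of a zero-mean Gaussian with covariance S, evaluated at e."""
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (len(e) * np.log(2 * np.pi) + logdet + e @ np.linalg.solve(S, e))

def kalman_step(x, P, y, A, Q, C, R):
    """One predict/update step; returns posterior mean, covariance, innovation log-likelihood."""
    xp, Pp = A @ x, A @ P @ A.T + Q
    e, S = y - C @ xp, C @ Pp @ C.T + R
    K = Pp @ C.T @ np.linalg.inv(S)
    return xp + K @ e, Pp - K @ C @ Pp, gauss_logpdf(e, S)

def slds_viterbi(y, As, Qs, C, R, trans, prior, x0, P0):
    """Approximate most likely switching sequence (one surviving path per state)."""
    T, M, n = len(y), len(As), len(x0)
    score = np.full((T, M), -np.inf)      # best log-score of a path ending in state i
    back = np.zeros((T, M), dtype=int)    # best predecessor of state i at time t
    # t = 0: condition the shared prior on the first measurement (no dynamics yet).
    x, P, ll = kalman_step(x0, P0, y[0], np.eye(n), np.zeros((n, n)), C, R)
    xs, Ps = [x] * M, [P] * M
    score[0] = np.log(prior) + ll
    for t in range(1, T):
        new_xs, new_Ps = [None] * M, [None] * M
        for i in range(M):                # candidate current state
            for j in range(M):            # candidate previous state
                x, P, ll = kalman_step(xs[j], Ps[j], y[t], As[i], Qs[i], C, R)
                s = score[t - 1, j] + np.log(trans[j, i]) + ll
                if j == 0 or s > score[t, i]:
                    score[t, i], back[t, i] = s, j
                    new_xs[i], new_Ps[i] = x, P
        xs, Ps = new_xs, new_Ps
    # Backtrack from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

The cost is proportional to T times the square of the number of switching states. Because only one linear-state estimate survives per switching state, an early wrong choice made under heavy measurement noise cannot be undone later, which is consistent with the sensitivity to noise noted above.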

Structured Variational Inference 

We noted that the complexity of inference in SLDS comes from the dependency between the switching and the linear states. If these states could somehow be decoupled, the inference process would become easier. Structured variational inference does exactly that: it decouples the switching states from the linear states by introducing a set of new decoupling variables. Inference in the two decoupled networks (an HMM and an LDS) can then be carried out with standard algorithms.

The simplified structure comes at a cost, though: inference becomes an iterative process. At each iteration, inference in the HMM depends on the solution of inference in the LDS, and vice versa.
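The decoupling can be stated as a factorized approximation of the posterior. The notation below is an assumed, generic formulation of structured variational inference, not taken from the project: S and X denote the full sequences of switching and linear states, Y the measurements, and \eta the set of decoupling (variational) parameters.

```latex
\begin{align}
  P(S, X \mid Y) \;&\approx\; Q(S, X \mid \eta) \;=\; Q_S(S \mid \eta)\, Q_X(X \mid \eta), \\
  \eta^{*} \;&=\; \arg\min_{\eta}\; \mathrm{KL}\big( Q(S, X \mid \eta) \,\big\|\, P(S, X \mid Y) \big).
\end{align}
```

Q_S has the structure of an HMM and Q_X that of an LDS, so each factor admits standard inference. The iteration arises because the parameters of Q_S are functions of expectations under Q_X, and the parameters of Q_X are functions of expectations under Q_S.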

The variational inference solution is globally optimal and does not suffer from the artifacts of local approximations that affect Viterbi inference and some other local approximation schemes. However, its iterative nature introduces a tradeoff in computational complexity.


Structured variational inference recursions in SLDS. The original SLDS is decoupled into an HMM and an LDS. Inference in each of the two models depends on the values of its variational parameters (q for the HMM and A, O for the LDS). Those in turn depend on the estimates of the states of the opposite model.


Another view of variational recursions for SLDS.

Learning in SLDS 

As alluded to before, learning in SLDS is straightforward once an efficient inference algorithm is available. A rather general EM framework can be used to estimate MAP parameters of an SLDS. For instance, estimates of model parameters are

where <.> denotes estimates of quantities obtained during inference.
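As an illustration of the form such updates commonly take, the following sketch gives the standard maximum-likelihood M-step for the dynamics of switching state i. It is an assumption based on the usual EM derivation for switching linear models, with no parameter priors, and is not necessarily the exact set of expressions used in this project. Here s_t(i) is an assumed indicator that equals 1 when the switching state at time t is i, and the angle brackets are the inference-time estimates referred to above.

```latex
\begin{align}
  \hat{A}_i &= \Big( \sum_{t} \big\langle x_t\, x_{t-1}^{\top}\, s_t(i) \big\rangle \Big)
               \Big( \sum_{t} \big\langle x_{t-1}\, x_{t-1}^{\top}\, s_t(i) \big\rangle \Big)^{-1}, \\
  \hat{Q}_i &= \frac{ \sum_{t} \big\langle x_t\, x_t^{\top}\, s_t(i) \big\rangle
               \;-\; \hat{A}_i \sum_{t} \big\langle x_{t-1}\, x_t^{\top}\, s_t(i) \big\rangle }
               { \sum_{t} \big\langle s_t(i) \big\rangle }.
\end{align}
```

Each update is a weighted least-squares fit in which every time step contributes in proportion to the estimated probability that switching state i was active.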

Examples of Applications 

Classification of human actions 

Classification of unknown motion sequences is an obvious way to show the impact of different dynamic models. The goal is to interpret an unknown motion sequence in terms of the known, learned action models. For instance, an unknown motion can be recognized as a motion consisting of alternations of "jog" and "walk".


Classification of human motion.  A sequence of joint angles extracted from a video of a person in motion is classified as jog (state 1) or walk (state 2).

Synthesis of realistic-looking human motion 

SLDS, as a generative model, can be used to synthesize natural-looking human figure motion.  In the simplest setting, sampling from a "walk" SLDS yields an example of the "walk" motion consistent with the learned model. 
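Sampling from an SLDS amounts to drawing a switching state from the Markov chain and then advancing the linear state with the model that state selects. The sketch below shows this for a generic SLDS; the function name sample_slds, the row-stochastic convention trans[j, i] = P(s_t = i | s_{t-1} = j), and all parameter values in the usage note are hypothetical and are not the learned "walk" or "jog" models.

```python
import numpy as np

def sample_slds(As, Qs, trans, prior, x0, T, rng=None):
    """Draw a switching sequence s and a linear-state trajectory x of length T."""
    rng = np.random.default_rng() if rng is None else rng
    M, n = len(As), len(x0)
    s = np.zeros(T, dtype=int)
    x = np.zeros((T, n))
    s[0] = rng.choice(M, p=prior)     # initial switching state
    x[0] = x0                         # initial linear state (assumed given)
    for t in range(1, T):
        # Markov switching: the next state depends only on the previous one.
        s[t] = rng.choice(M, p=trans[s[t - 1]])
        # The selected linear model advances the state and adds its own noise.
        noise = rng.multivariate_normal(np.zeros(n), Qs[s[t]])
        x[t] = As[s[t]] @ x[t - 1] + noise
    return s, x
```

For example, with two hypothetical 2-D models, `sample_slds([A0, A1], [Q0, Q1], np.array([[0.9, 0.1], [0.1, 0.9]]), np.array([0.5, 0.5]), np.array([1.0, 0.0]), 100)` returns 100 switching labels and the corresponding state trajectory. For figure motion, x would hold the joint angles, which can then be rendered with the kinematic model.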


Synthesized "jog" motion obtained by sampling from "jog" SLDS.