My Research


Research Topics:

  • Manifold Models for Human Motion Analysis

  • Gait Analysis, Tracking and Recognition

  • Facial Expression Analysis

  • Tracking

  • Scene Modeling and Background Subtraction

  • Efficient Kernel Density Estimation using FGT



Manifold Models for Human Motion Analysis
  Modeling View and Posture Manifolds
  We consider modeling data lying on multiple continuous manifolds. In particular, we model the shape manifold of a person performing a motion observed from different view points along a view circle at a fixed camera height. We introduce a model that ties together the body configuration (kinematics) manifold and the visual manifold (observations) in a way that facilitates tracking the 3D configuration under continuous relative view variability. The model exploits the low dimensionality of both the body configuration manifold and the view manifold, each of which is represented separately.
  Tracking People on a Torus
  Suppose we want to model the visual patterns of a periodic articulated motion (such as walking) observed from any view point. Such visual patterns lie on a product space (body configurations × views). We showed that a topology-preserving setting is suitable for modeling certain human motions that lie intrinsically on one-dimensional manifolds, whether closed and periodic (such as walking, jogging, running, etc.) or open (such as a golf swing, kicking, a tennis serve, etc.). We showed that we can represent the visual manifold of such motions (in terms of shape), as observed from different view points, by mapping the data to a torus manifold (for the case of a single view circle) or to a family of tori (for the whole view sphere). The approach we introduced is based on learning the visual observation manifold in a supervised manner. Traditional manifold learning approaches are unsupervised: the goal is to find a low-dimensional embedding of the data. However, if the manifold topology is known, manifold learning can be formulated as learning a mapping between the data and a topological structure that is homeomorphic to the manifold of the data.
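  As a minimal sketch of the torus idea, assuming that the body configuration and the view are each parameterized by a single angle, the Python snippet below maps a (view, configuration) pair to a point on a standard torus embedded in 3D. The function name `torus_point` and the radii `R` and `r` are illustrative assumptions, not values taken from this research.

```python
import numpy as np

def torus_point(view_angle, config_angle, R=2.0, r=1.0):
    """Map a (view, body configuration) angle pair to a 3D point on a torus.

    view_angle   : angle along the view circle, in radians (assumed parameterization)
    config_angle : phase of the periodic motion, in radians (assumed parameterization)
    R, r         : major and minor radii (hypothetical values, with R > r > 0)
    """
    view_angle, config_angle = np.broadcast_arrays(
        np.asarray(view_angle, dtype=float), np.asarray(config_angle, dtype=float)
    )
    ring = R + r * np.cos(config_angle)   # distance from the torus axis
    x = ring * np.cos(view_angle)
    y = ring * np.sin(view_angle)
    z = r * np.sin(config_angle)
    return np.stack([x, y, z], axis=-1)

# One motion cycle (40 frames) seen from 12 views on the view circle.
views = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
phases = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
grid = torus_point(views[:, None], phases[None, :])   # shape (12, 40, 3)
```

  Both coordinates wrap around after a full turn, which is what makes the torus a natural topological model for a periodic motion seen along a closed view circle.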
  Multi-factor Models for Style Separation
  We show how to model several "style" factors using multilinear analysis in the space of nonlinear basis functions. This lets us separate different sources of style variation in the same motion. For example, the model can be used to render walking figures of different people from different views, or different faces performing different facial expressions. Conversely, given an input pattern, we introduce an optimization process that solves for the factors that produced that pattern. For example, from a single shape instance we can recover the body configuration, the view point, and the person's shape identity. Similarly, from a single face image, we can recover the facial expression, the face identity, and the motion phase.
  Separating Style and Content on a Nonlinear Manifold
  Bilinear and multilinear models have been successful in decomposing static image ensembles into perceptually orthogonal sources of variation, e.g., separation of style and content. If we consider the appearance of human motion such as gait, facial expression and gesturing, most such activities result in nonlinear manifolds in the image space. The question that we address in this research is how to separate style and content on manifolds representing dynamic objects. We learn a decomposable generative model that explicitly separates the intrinsic body configuration (content), as a function of time, from the appearance (style) of the person performing the action, as a time-invariant parameter. The framework is based on decomposing the style parameters in the space of nonlinear functions that map between a learned unified nonlinear embedding of multiple content manifolds and the visual input space.
  Inferring 3D Body Pose from Silhouettes using Activity Manifold Learning
  We aim to infer 3D body pose directly from human silhouettes. Given a visual input (silhouette), the objective is to recover the intrinsic body configuration, recover the view point, reconstruct the input and detect any spatial or temporal outliers. In order to recover intrinsic body configuration (pose) from the visual input (silhouette), we explicitly learn view-based representations of activity manifolds as well as learn mapping functions between such central representations and both the visual input space and the 3D body pose space. The body pose can be recovered in a closed form in two steps by projecting the visual input to the learned representations of the activity manifold, i.e., finding the point on the learned manifold representation corresponding to the visual input, followed by interpolating 3D pose.
  Nonlinear Generative Models for Dynamic Shape and Dynamic Appearance
  Our objective is to learn representations for the shape and the appearance of moving (dynamic) objects that support tasks such as synthesis, pose recovery, reconstruction and tracking. We introduce a framework that aims to learn landmark-free, correspondence-free global representations of dynamic appearance manifolds. We use nonlinear dimensionality reduction to achieve an embedding of the global deformation manifold that preserves the geometric structure of the manifold. Given such an embedding, a nonlinear mapping is learned from the embedding space into the visual input space. Therefore, any visual input is represented by a linear combination of nonlinear basis functions centered along the manifold in the embedding space. We also show how an approximate solution for the inverse mapping can be obtained in closed form, which facilitates recovery of the intrinsic body configuration. We use the framework to learn the gait manifold as an example of a dynamic shape manifold, as well as to learn the manifolds for some simple gestures and facial expressions as examples of dynamic appearance manifolds.
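  The mapping from the embedding space to the input space can be illustrated with a small Python sketch. It assumes a unit-circle embedding of one motion cycle, Gaussian radial basis functions, and an ordinary least-squares fit; the function names, the kernel width, and the toy observations are all hypothetical and stand in for whatever embedding and basis the actual framework uses.

```python
import numpy as np

def circle_embedding(n_frames):
    """Place the frames of one motion cycle uniformly on the unit circle."""
    t = 2.0 * np.pi * np.arange(n_frames) / n_frames
    return np.column_stack([np.cos(t), np.sin(t)])

def rbf_features(points, centers, width):
    """Gaussian basis functions centered at `centers`, evaluated at `points`."""
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def fit_mapping(embedding, observations, centers, width):
    """Least-squares coefficients so that features @ coeffs approximates observations."""
    features = rbf_features(embedding, centers, width)
    coeffs, *_ = np.linalg.lstsq(features, observations, rcond=None)
    return coeffs

def synthesize(points, centers, width, coeffs):
    """Map embedding points into the (toy) input space."""
    return rbf_features(points, centers, width) @ coeffs

# Toy data: 16 frames, each a 5-dimensional stand-in for a shape descriptor.
n_frames, width = 16, 0.5
embedding = circle_embedding(n_frames)
t = 2.0 * np.pi * np.arange(n_frames) / n_frames
observations = np.column_stack(
    [np.sin(t), np.cos(t), np.sin(2 * t), np.cos(2 * t), np.ones_like(t)]
)
centers = embedding                      # one basis function per frame (assumption)
coeffs = fit_mapping(embedding, observations, centers, width)

# Synthesize an observation at an intermediate point on the manifold.
midpoint = np.array([[np.cos(0.1), np.sin(0.1)]])
new_observation = synthesize(midpoint, centers, width, coeffs)
```

  Because the model is linear in the coefficients, fitting reduces to a linear least-squares problem even though the mapping itself is nonlinear in the embedding coordinates.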
  Bilinear and Multilinear Models for Gait Recognition
  Human identification using gait is a challenging computer vision task due to the dynamic motion of gait and the existence of various sources of variation such as viewpoint, walking surface, clothing, etc. In this research we investigate gait recognition algorithms based on bilinear and multilinear decomposition of gait data into time-invariant gait-style and time-dependent gait-content factors. We developed a generative model by embedding gait sequences on a unit circle and learning a nonlinear mapping that facilitates synthesis of temporally aligned gait sequences. Given such synthesized gait data, a bilinear model is used to separate out the invariant gait style, which is used for recognition. We also show that the recognition can be generalized to new situations by adapting the gait-content factor to the new condition, thereby obtaining corrected gait styles for recognition.
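  To make the style/content factorization concrete, the Python sketch below fits a plain asymmetric bilinear factorization with a truncated SVD on synthetic data. This is a generic construction under stated assumptions, and it may differ from the model used in this research; the function names, the number of factors, and the random toy data are hypothetical.

```python
import numpy as np

def fit_asymmetric_bilinear(stacked, n_factors):
    """Factor stacked observations into style-specific maps and content vectors.

    stacked   : array of shape (n_styles * dim, n_contents); block s holds the
                dim-dimensional observations of style s at every content index
    n_factors : number of content dimensions to keep (hypothetical model order)
    """
    U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    style_maps = U[:, :n_factors] * s[:n_factors]   # (n_styles * dim, n_factors)
    content = Vt[:n_factors]                        # (n_factors, n_contents)
    return style_maps, content

def style_of_new_sequence(new_obs, content):
    """Least-squares style map for a new subject, given the learned content vectors."""
    # Solve new_obs ~ A @ content for A, with new_obs of shape (dim, n_contents).
    A_t, *_ = np.linalg.lstsq(content.T, new_obs.T, rcond=None)
    return A_t.T

# Synthetic data: 4 people (styles), 10 aligned gait phases (contents).
rng = np.random.default_rng(0)
n_styles, dim, n_contents, n_factors = 4, 6, 10, 3
true_maps = rng.normal(size=(n_styles, dim, n_factors))
true_content = rng.normal(size=(n_factors, n_contents))
stacked = (true_maps @ true_content).reshape(n_styles * dim, n_contents)

style_maps, content = fit_asymmetric_bilinear(stacked, n_factors)

# A new person observed over the same aligned phases.
new_map = rng.normal(size=(dim, n_factors))
new_obs = new_map @ true_content
new_style = style_of_new_sequence(new_obs, content)
```

  The sketch relies on the sequences being temporally aligned, so that column k of every style block corresponds to the same gait phase.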
  Exemplar-based Tracking and Gesture Recognition - nonparametric HMMs
  In this research we address the problem of capturing the dynamics for exemplar-based recognition systems. A traditional HMM provides a probabilistic tool to capture system dynamics, and in the exemplar paradigm HMM states are typically coupled with the exemplars. Alternatively, we propose a nonparametric HMM approach that uses a discrete HMM with arbitrary states (decoupled from the exemplars) to capture the dynamics over a large exemplar space, where a nonparametric estimation approach is used to model the exemplar distribution. This reduces the need for lengthy and non-optimal training of the HMM observation model. We used the proposed approach for view-based recognition of gestures. The approach represents each gesture as a sequence of learned body poses (exemplars). The gestures are recognized through a probabilistic framework for matching these body poses and for imposing temporal constraints between different poses using the proposed nonparametric HMM.
  Learning to Track
  Tracking is typically posed as a search problem in a geometric transformation parameter space as well as in the object's configuration parameter space. Generally, tracking is based on learning an invariant representation of the tracked object, then searching the parameter space for the best fit.
The goal of this research is to achieve trackers that can directly "infer" such parameters from the object appearance through learned models of the visual manifolds of such parameter spaces. We show how to learn a representation of the appearance manifold of an object, given a class of geometric transformations. We learn a generative model for object appearance in which the appearance of the object at each new frame is an invertible function that maps from a representation of the geometric transformation space into the visual space. By learning such a generative model we can infer the geometric transformation (track) directly from the tracked object's appearance. As a result, tracking can be achieved in closed form and can therefore be done very efficiently. The novelty of this work is that it shows how learning the appearance manifold of an object can play a role in achieving efficient tracking.
  Appearance-Based Generalized Kernel Tracking
  We exploit the feature-spatial distribution of a region representing an object as a probabilistic constraint to track that region over time. Tracking is achieved by maximizing a similarity-based objective function over the transformation space, given a nonparametric representation of the joint feature-spatial distribution. Such a representation imposes a probabilistic constraint on the region feature distribution coupled with the region structure, which yields an appearance tracker that is robust to small local deformations and partial occlusion. We present the approach for the general form of joint feature-spatial distributions and apply it to tracking with different types of image features, including raw intensity, color and image gradient.
  Tracking Multiple People
  In this research we address the problem of segmenting foreground regions corresponding to a group of people, given models of their appearance that were initialized before occlusion. We present a general framework that uses maximum likelihood estimation to estimate the best arrangement of the people, in terms of 2D translation, that yields a segmentation of the foreground region. Given the segmentation result, we conduct occlusion reasoning to recover relative depth information, and we show how to utilize this depth information in the same segmentation framework. We also present a more practical, online solution to the segmentation problem that avoids searching an exponential space of hypotheses. The person model is based on segmenting the body into regions in order to spatially localize the color features corresponding to the way people are dressed. Modeling these regions involves modeling their appearance (color distributions) as well as their spatial distribution with respect to the body. We use a nonparametric approach based on kernel density estimation to represent the color distribution of each region, and therefore we do not restrict the clothing to be of uniform color. Instead, it can be any mixture of colors and/or patterns. We also present a method to automatically initialize these models and learn them before the occlusion.
Scene Modeling and Background Subtraction
  Feature Selection for Background Subtraction - Boosted Background Model

Various statistical approaches have been proposed for modeling a given scene background. However, there is no theoretical framework for choosing which features to use to model different regions of the scene background. In this research we introduce a novel framework for feature selection for background modeling and subtraction. A boosting algorithm, namely RealBoost, is used to choose the best combination of features at each pixel. Given the probability estimates from a pool of features, calculated by kernel density estimation (KDE) over a certain time period, the algorithm selects the features most useful for discriminating foreground objects from the scene background. The results show that the proposed framework successfully selects appropriate features for different parts of the image.



  Nonparametric Model for Background Subtraction
  In video surveillance systems, stationary cameras are typically used to monitor activities at outdoor or indoor sites. Since the cameras are stationary, the detection of moving objects can be achieved by comparing each new frame with a representation of the scene background. This process is called background subtraction and the scene representation is called the background model. Typically, background subtraction forms the first stage in automated visual surveillance systems. Results from background subtraction are used for further processing, such as tracking targets and understanding events.

We introduced a novel background model and a background subtraction technique based on statistical nonparametric modeling of the pixel process. The model keeps a sample of intensity values for each pixel in the image and uses this sample to estimate the probability density function of the pixel intensity. The density function is estimated using kernel density estimation. Since this approach is quite general, the model can approximate any distribution of the pixel intensity without assumptions about the shape of the underlying distribution. The model can handle situations where the background of the scene is cluttered and not completely static but contains small motions, such as those due to moving tree branches and bushes. The model is updated continuously and therefore adapts to changes in the scene background. The approach runs in real time.

Code is available upon request.




  Efficient Kernel Density Estimation using Fast Gauss Transform
  Many vision algorithms depend on the estimation of a probability density function from observations. Kernel density estimation techniques are quite general and powerful methods for this problem, but have a significant disadvantage in that they are computationally intensive. In this research we explore the use of kernel density estimation with the fast Gauss transform (FGT) for problems in vision. The FGT allows the summation of a mixture of M Gaussians at N evaluation points in O(M+N) time as opposed to O(MN) time for a naive evaluation, and can be used to considerably speed up kernel density estimation. We present applications of the technique to problems from image segmentation and tracking, and show that the algorithm allows application of advanced statistical techniques to solve practical vision problems in real time with today's computers.
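  The source of the speed-up can be illustrated with a one-dimensional Python sketch. It is not the full FGT, which partitions space into boxes and combines several kinds of expansions; it shows only a single truncated Hermite expansion about one center, assuming all sources lie within roughly sqrt(delta) of that center. The function names and the number of terms are hypothetical, and `delta` equals 2*sigma^2 for a Gaussian kernel with standard deviation sigma.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval

def gauss_sum_naive(sources, weights, targets, delta):
    """Direct O(M*N) evaluation of G(t) = sum_j w_j * exp(-(t - s_j)^2 / delta)."""
    diff = targets[:, None] - sources[None, :]
    return (weights[None, :] * np.exp(-diff ** 2 / delta)).sum(axis=1)

def gauss_sum_hermite(sources, weights, targets, delta, n_terms=20):
    """Same sum via one truncated Hermite expansion, O(p*(M + N)) for p terms."""
    center = sources.mean()
    scale = math.sqrt(delta)
    y = (sources - center) / scale        # scaled source offsets
    x = (targets - center) / scale        # scaled target offsets
    orders = np.arange(n_terms)
    factorials = np.array([math.factorial(n) for n in range(n_terms)], dtype=float)
    # A_n = (1/n!) * sum_j w_j * y_j^n, computed once for all targets: O(p*M)
    coeffs = (weights[:, None] * y[:, None] ** orders).sum(axis=0) / factorials
    # G(t) ~ exp(-x^2) * sum_n A_n * H_n(x), with physicists' Hermite H_n: O(p*N)
    return np.exp(-x ** 2) * hermval(x, coeffs)

# Toy kernel density estimate: 200 clustered samples, 500 evaluation points.
rng = np.random.default_rng(0)
sources = rng.uniform(-0.3, 0.3, size=200)
weights = np.full(sources.size, 1.0 / sources.size)
targets = np.linspace(-3.0, 3.0, 500)
fast = gauss_sum_hermite(sources, weights, targets, delta=1.0)
slow = gauss_sum_naive(sources, weights, targets, delta=1.0)
max_abs_error = np.abs(fast - slow).max()
```

  The expansion follows from the Hermite generating function exp(2xy - y^2) = sum_n H_n(x) y^n / n!, which separates the source-dependent and target-dependent parts of each Gaussian so that sources and targets never have to be paired explicitly.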