It's happened to all of us. You've wound up in a hole-in-the-wall restaurant you already regret going into, and you're worried about creepy-crawlies. Then there! On the floor! A really big scary bug... Oh wait, it's just a piece of fuzz.
Model-based computer vision can use this story for computational inspiration. (No, we're not going to see bugs everywhere!) A computer system can use what it expects to see to help determine what it does see. In particular, using a model that describes objects in a scene makes scene understanding easier. Of course, using a model in the wrong situation can produce the wrong interpretation. (OK, so we might see a few bugs.)
Now consider the problem of tracking the motion of an observed human face. We could use a general-purpose object tracker for this application (one which simply sees the face as a deforming surface), but this would ignore everything we know about faces, in particular their highly constrained appearance and motion.
Face models and model-based estimation
Instead, a model-based approach to this problem uses a model which describes the appearance, shape, and motion of faces to aid in estimation. This model has a number of parameters (basically, "knobs" of control), some of which describe the shape of the resulting face, and some of which describe its motion.
In the picture below, the default model (top, center) can be made to look like specific individuals by changing shape parameters (the 4 faces on the right). The model can also display facial motions (the 4 faces on the left showing eyebrow frowns, raises, a smile, and an open mouth) by changing motion parameters. And of course, we can simultaneously change both shape and motion parameters (bottom, center).
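The shape/motion parameterization described above can be sketched as a simple linear deformable model. This is only a toy illustration with made-up dimensions and random basis vectors, not the actual face model used in this work:

```python
import numpy as np

# Hypothetical sizes for illustration only; the real model and its bases
# are not specified here beyond "about 100" parameters in total.
n_vertices = 500
n_shape, n_motion = 60, 40

rng = np.random.default_rng(0)
base = rng.standard_normal((n_vertices, 3))              # neutral face mesh
shape_basis = rng.standard_normal((n_shape, n_vertices, 3))
motion_basis = rng.standard_normal((n_motion, n_vertices, 3))

def face_vertices(shape_params, motion_params):
    """Deform the neutral mesh by shape (identity) and motion (expression)
    parameters -- the "knobs" described in the text."""
    v = base.copy()
    v += np.tensordot(shape_params, shape_basis, axes=1)   # who the face is
    v += np.tensordot(motion_params, motion_basis, axes=1) # what it is doing
    return v

# All parameters at zero reproduces the default (neutral) face.
neutral = face_vertices(np.zeros(n_shape), np.zeros(n_motion))
```

Turning only the shape knobs gives a different individual with a neutral expression; turning only the motion knobs animates the default face; turning both does both at once, as in the figure.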
Now you might be asking yourself, how can this model be detailed enough to represent any person making any expression? The answer is: it can't. But, it will be able to represent any of these faces to an acceptable degree of accuracy. The benefit of this simplifying assumption is that we can have a fairly small set of parameters (about 100) which describe a face. This results in a more efficient, and more robust system.
Estimating parameters from images using a model opens up new processing methods (which are still related to their model-free counterparts). For example, computing the motion of an object using optical flow without a model produces a field of arrows; when a model is used, a small set of "parameter velocities" is extracted instead.
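The idea of extracting parameter velocities rather than a flow field can be sketched as a least-squares problem. The Jacobian below is random stand-in data, not the actual model's Jacobian; the point is only that image velocities at many points collapse onto a handful of parameter rates:

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, n_params = 200, 10   # hypothetical sizes

# J maps parameter velocities to 2-D image velocities at each tracked point
# (the Jacobian of image positions with respect to the model parameters).
J = rng.standard_normal((2 * n_points, n_params))
qdot_true = rng.standard_normal(n_params)

# Without a model: the per-point flow itself ("a field of arrows").
flow = J @ qdot_true

# With a model: solve for the few parameter velocities that best explain
# that field, instead of keeping 400 numbers.
qdot, *_ = np.linalg.lstsq(J, flow, rcond=None)
```

In this noiseless toy case the least-squares solve recovers the generating parameter velocities exactly; with real image data it gives the best fit in the least-squares sense.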
Below are some of our face tracking results (this work is joint with Dimitris Metaxas). In each experiment, the tracked subject moves their head and makes some facial expressions (currently, the system is not real-time). The initial positioning of the face model in each sequence is performed by hand (by identifying a small set of feature points).
In each case, the estimated model (the black wireframe) is superimposed on the captured images. In each of the following sequences, the subjects make complex facial and head motions, which are successfully tracked by our framework.
[444K QuickTime movie]
[505K QuickTime movie]
[556K QuickTime movie]
(I would like to thank Yaser Yacoob and the Center for Automation Research at the University of Maryland at College Park for providing the second and third image sequences).
We have also performed validation studies where we mark a subject with dots (top), extract the positions of these markers (middle: marker locations are shown as white dots), and compare this with the model-predicted locations of these markers.
Our results show that our method can maintain accurate track to about one-half centimeter accuracy (in the image plane) for long sequences (many hundreds of frames) without drifting or losing track.
Results from validation study
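The accuracy measure used in the validation can be sketched as follows. The marker coordinates here are invented placeholder values (in centimeters, in the image plane), not data from the actual study:

```python
import numpy as np

# Hypothetical extracted vs. model-predicted marker locations (cm).
extracted = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
predicted = np.array([[1.1, 2.0], [3.0, 4.2], [4.9, 6.1]])

# Per-marker Euclidean error in the image plane.
errors = np.linalg.norm(extracted - predicted, axis=1)

# The claim in the text is that this stays within about half a centimeter
# over many hundreds of frames.
within_bound = np.max(errors) <= 0.5
```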
Why track a face?
Well, being able to track a face from images contributes toward the ability to monitor a user's attention and reactions automatically and without intrusion, and has obvious benefits in human-machine interaction. This is the subject of interaction research underway at THE VILLAGE.
There are a number of available on-line publications describing this work (in chronological order):
This paper describes a model-based approach (using deformable models) to face tracking where a 3D face model is used to estimate the shape and motion of a face in a sequence of images using a combination of optical flow information (as a constraint), and edge information.
(A revised version of this paper appears in IJCV, combined with the CVPR '99 paper, and is listed below).
This paper presents a method for updating the shape of a model, based on analysis of the least-squares residuals from the model-based optical flow computation. This analysis allows the shape of a tracked face to be computed with much less data (fewer frames) than the above method, which uses only edge information.
(A longer version of this paper has been submitted to PAMI; ask me for a copy if you're interested.)
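The residual analysis can be sketched in a toy linear setting (a simplification with random stand-in Jacobians, not the paper's actual formulation): after fitting the motion parameters, whatever part of the observed flow the motion cannot explain is attributed to shape error and used to correct the model:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_motion, n_shape = 400, 8, 5   # hypothetical sizes

J_m = rng.standard_normal((n_obs, n_motion))   # motion Jacobian
J_s = rng.standard_normal((n_obs, n_shape))    # shape Jacobian

# Observed flow explained partly by motion, partly by an imperfect shape.
flow = J_m @ rng.standard_normal(n_motion) + J_s @ rng.standard_normal(n_shape)

# Step 1: fit motion parameters; what they cannot explain is the residual.
qdot, *_ = np.linalg.lstsq(J_m, flow, rcond=None)
residual = flow - J_m @ qdot

# Step 2: attribute the residual to shape error and correct the model.
dshape, *_ = np.linalg.lstsq(J_s, residual, rcond=None)
```

After the shape correction, the remaining unexplained flow is strictly smaller, which is why shape can be recovered from far fewer frames than an edge-only method.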
The reason the framework described in the CVPR '96 paper worked as well as it did was that the edge-to-model alignment optimization problem was constrained by the model-based optical flow solution. This constraint projected away parts of the search space, making the edge-model alignment easier (faster, and more likely to converge to the true answer). Experiments are presented which suggest that it was indeed the constraint that improved performance.
This paper provides a more detailed look at the framework from the CVPR '96 paper, and also describes how a Kalman filter is incorporated and used to combine the optical flow and edge information efficiently (as described in the CVPR '99 paper). It also contains a number of validation experiments.
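The way a Kalman filter fuses two information sources can be illustrated with a scalar toy example (this is a standard textbook update with invented numbers, not the paper's actual filter): each measurement pulls the estimate toward itself, weighted by how uncertain it is.

```python
# Toy 1-D illustration: fuse a flow-based and an edge-based measurement
# of one model parameter, weighting each by its measurement variance.

def kalman_update(x, p, z, r):
    """Standard scalar Kalman measurement update.

    x, p : current estimate and its variance
    z, r : measurement and its variance
    """
    k = p / (p + r)                 # Kalman gain
    return x + k * (z - x), (1 - k) * p

x, p = 0.0, 1.0                     # predicted parameter and its variance
x, p = kalman_update(x, p, z=0.8, r=0.5)   # optical-flow measurement (noisier)
x, p = kalman_update(x, p, z=1.2, r=0.2)   # edge measurement (sharper)
```

The less noisy edge measurement gets the larger weight, and the variance shrinks with each update, so the fused estimate is more certain than either source alone.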