Cue Integration

Visual cues

The content of images is revealed through visual cues, which convey information about the form of the viewed scene. While the information from each cue is ambiguous and incomplete, agreement across cues provides a vital constraint on which algorithms for visual perception can be built. Within a single image, information is delivered in many forms: contour, shading, texture, occlusion, shadowing, and perspective, to name a few. Additional images taken either simultaneously (to achieve stereo) or over time (to observe motion) provide even more cues. For instance, the shape of the hand in the image below is in part revealed by the visible contours in the image.

An image of a hand, and the contours in this image.


Robust vision systems cannot rely solely on one particular visual cue. With each cue come limits on accuracy, limits on applicability, and variations in how much computation is required to use it. Each individual cue reveals a separate aspect of the scene, but alone it will never suffice.

Vision systems will be successful only if they exploit the many sources of information in the image; it is crucial that we integrate the results from the various visual cues. This is, of course, well known. The human vision community has studied the problem of cue integration extensively, aiming to determine the strategy observers use to combine cues. Often, stimuli with conflicting cues are created (below) in the hope of elucidating the nature of this process. These experiments suggest a mixture of statistical combination and cue selection, which can depend on viewing conditions or context.

Example stimulus used to determine how stereo, contour, and texture cues interact.
The top stereo pair has texture cues which agree with the shape (and seems "deeper"), while the bottom pair does not.
[from Johnston, Cumming and Parker, "Integration of Depth Modules: Stereopsis and Texture", Vision Research 33 (5/6), 1993]

The sophistication of cue integration strategies in human observers casts doubt on whether computational vision systems can function without them.
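
As a rough illustration of what "statistical combination" can mean, the sketch below averages cue estimates weighted by their reliability (inverse variance). It assumes each cue yields an independent depth estimate with Gaussian noise of known variance; the function name and the numbers are hypothetical, and this is not a description of the strategy found in the experiments cited above. It also does not model cue selection, although a cue with a very large variance is effectively ignored.

```python
from typing import Sequence, Tuple


def combine_cues(estimates: Sequence[float],
                 variances: Sequence[float]) -> Tuple[float, float]:
    """Combine independent cue estimates by inverse-variance weighting.

    Assumes (hypothetically) that each cue reports the same quantity,
    for example depth, corrupted by independent Gaussian noise.
    Returns the combined estimate and its variance.
    """
    if len(estimates) == 0 or len(estimates) != len(variances):
        raise ValueError("need one variance per estimate, and at least one cue")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # A more reliable cue (smaller variance) receives a larger weight.
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total

    # The combined variance is never larger than that of the best single cue.
    return combined, 1.0 / total


# Hypothetical conflicting cues: stereo says 10.0 (variance 1.0),
# texture says 14.0 (variance 4.0).
depth, var = combine_cues([10.0, 14.0], [1.0, 4.0])
```

Here the combined estimate lies closer to the stereo value because stereo was assumed to be the more reliable cue, and the combined variance is smaller than either input variance.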

Current research

Robustness is the only alternative to controlling the visual environment, and is almost certainly required for the development of a wide range of visual applications, including perceptual user interfaces and multi-modal interactive systems. These applications also bring real-time performance requirements. For there to be time to process and integrate multiple cues, we will need a principled approach that accounts for the information that each cue provides, while considering how much time is required to extract the information.

I am working towards such an approach. Its design is inspired by AI research on meta-reasoning: the problem of applying computational resources dynamically to the problems that arise in a developing, time-critical situation. In this case, the system must choose which of several visual cue computations to proceed with. As a first step, I have developed a probabilistic model of partial results, which predicts the error that remains in an iterative cue computation (DeCarlo, ECCV 2002). This model allows ordinary probabilistic combination methods to be used with partial cue computations, which has several benefits: