General object and activity recognition is a fundamental problem in
computer vision that has been the subject of much research. Traditional
approaches include model-based and appearance-template-based methods.
Recently, inspired by methods from the text retrieval literature, local
visual feature-based models have proved successful at recognizing
objects or activities with large within-class geometric variability.
This approach raises two main challenges: feature selection and target
modeling using the selected features. This thesis proposes a
local-global visual feature-based framework for general object and
activity recognition, with novel methods for these problems:
1) Combinatorial and statistical methods for selecting informative
parts to build statistical models for part-based object recognition.
First, a combinatorial optimization formulation is used for clustering
on a weighted multipartite graph. Second, a statistical method that
selects discriminative parts from positive images is used to localize
objects.
2) An entropy-based vocabulary selection method for the "bag-of-words"
model in activity recognition.
3) Integration of spatial and temporal information with appearance
features for human activity recognition. This method models human
motion by the distribution of local motion features and their
spatio-temporal arrangement.
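
As an illustration of the entropy-based vocabulary selection in item 2,
the following is a minimal sketch and not the method developed in the
thesis. It assumes that each visual word comes with a count of its
occurrences per activity class, and that words whose class distribution
has low entropy are the more discriminative ones. The function names,
the word labels, and the counts are hypothetical.

```python
import math


def class_entropy(counts):
    """Shannon entropy (in bits) of a word's distribution over classes."""
    total = sum(counts)
    if total == 0:
        # Assumption: a word that never occurs carries no evidence,
        # so it is ranked last.
        return float("inf")
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def select_vocabulary(word_class_counts, keep):
    """Return the `keep` words with the lowest class entropy."""
    ranked = sorted(
        word_class_counts,
        key=lambda word: class_entropy(word_class_counts[word]),
    )
    return ranked[:keep]


# Hypothetical counts: occurrences of each visual word in three classes.
counts = {
    "w0": [30, 1, 0],    # concentrated in class 0: low entropy
    "w1": [10, 10, 10],  # spread evenly: maximal entropy
    "w2": [0, 25, 2],    # concentrated in class 1: low entropy
    "w3": [5, 6, 20],    # mildly concentrated in class 2
}
selected = select_vocabulary(counts, keep=2)
print(selected)
```

Under this criterion, a word spread evenly over all classes is dropped
first, since observing it says little about which activity is present.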
The effectiveness of the proposed methods is demonstrated on several
object recognition and activity recognition data sets, including
human facial expressions and hand gestures.
This thesis also presents a framework that applies the Discrete Fourier
Transform to detect salient regions in images and video sequences. The
framework generalizes previous saliency detection methods.
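
To make the idea of DFT-based saliency concrete, the sketch below shows
one simple and well-known variant, reconstruction from the phase
spectrum alone. It is given only as an illustration under that
assumption; the framework of the thesis is not specified here and may
differ. The function name and the test image are hypothetical, and
NumPy is assumed to be available.

```python
import numpy as np


def phase_saliency(image):
    """Saliency map from the phase spectrum of a 2-D grayscale image.

    The magnitude of every DFT coefficient is normalized to one, so only
    the phase is kept, and the squared modulus of the inverse transform
    is returned as the saliency map.
    """
    spectrum = np.fft.fft2(image)
    # The small constant guards against division by zero.
    phase_only = spectrum / (np.abs(spectrum) + 1e-12)
    reconstruction = np.fft.ifft2(phase_only)
    return np.abs(reconstruction) ** 2


# Hypothetical input: a uniform background with one brighter pixel.
image = np.full((16, 16), 0.5)
image[5, 9] = 1.5
saliency = phase_saliency(image)
peak = np.unravel_index(np.argmax(saliency), saliency.shape)
print(peak)
```

Discarding the magnitude suppresses the uniform background, so the
saliency map of this image peaks at the brighter pixel.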