Two approaches have been proposed in statistical and machine
learning communities in order to address the problem of uncovering
clusters with complex structure. One approach relies on the development
of clustering criteria that are able to accommodate increasingly complex
characteristics of the data. The other approach is based on
simplification of structure of data by mapping it to a different feature
space via a non-linear function and then clustering in the new space.
This dissertation covers three related studies: development of a novel
multi-dimensional clustering method, development of non-linear mapping
functions that leverage higher-order co-occurrences between features in
boolean data, and applications of these mapping functions for improving
the performance of clustering methods. In particular, we treat
clustering as a combinatorial optimization problem of finding a
partition of the data so as to minimize a certain criterion. We develop
a novel multi-dimensional clustering method based on a
statistically-motivated criterion proposed by J. Neyman for stratified
sampling from one-dimensional data. We show that this criterion is more
reflective of the underlying data structure than the seemingly similar
K-means criterion when second order variability is not homogeneous
between constituent subgroups. Furthermore, experimental results
demonstrate that generalization of the Neyman's criterion to
multi-dimensional spaces and development of the associated clustering
algorithm allow for statistically efficient estimation of the grand mean
vector of a population.
In the framework of the mapping-based approach to discovering complex
cluster structures, we introduced a novel adaptive non-linear data
transformation termed Unsupervised Second Order Transformation (USOT).
The novelties behind USOT are (a) that it leverages in a unsupervised
manner, higher-order co-occurrences between features in boolean data,
and (b) that it considers each feature in the context of probabilistic
relationships with other features. In addition, USOT has two desirable
properties. USOT adaptively selects features that would influence the
mapping of a given feature, and preserves the interpretability of
dimensions of the transformed space. Experimental results on text
corpora and financial time series demonstrate that by leveraging
higher-order co-occurrences between features, clustering methods
achieved statistically significant improvements in USOT space over the
original boolean space.