Sept 29:
Algorithmic embedding for comparing large text streams
Graham Cormode
.
Texts are ubiquitous in daily life, varying in size from small
(SMS and
email) to potentially immense (automatically generated
reports, biological
sequences). When scanning for similarities between
new data and
previously stored examples, we need a model that takes
account of how
texts are changed: pieces are inserted or deleted,
and sections are moved
around. With a large number of large texts, trying
to compare all pairs
is not feasible, and for the largest texts we may not
even be able to hold
the whole of any text in memory at once. I will describe
an approach to
approximately comparing large texts quickly and in
sublinear space. It
relies on finding combinatorial structures present
in any string of
characters, and generating a vector representation.
This allows rapid
comparison of sequences based on a succint representation,
and the
application of clustering, nearest neighbor searching,
and other data
mining techniques.