Sept 29:     Algorithmic embedding for comparing large text streams
                                      Graham Cormode .




Texts are ubiquitous in daily life, varying in size from small (SMS and
email) to potentially immense (automatically generated reports, biological
sequences).  When scanning for similarities between new data and 
previously stored examples, we need a model that takes account of how 
texts are changed: pieces are inserted or deleted, and sections are moved
around.  With a large number of large texts, trying to compare all pairs
is not feasible, and for the largest texts we may not even be able to hold
the whole of any text in memory at once. I will describe an approach to
approximately comparing large texts quickly and in sublinear space.  It
relies on finding combinatorial structures present in any string of
characters, and generating a vector representation.  This allows rapid
comparison of sequences based on a succint representation, and the
application of clustering, nearest neighbor searching, and other data
mining techniques.