Feb 2: Comparing
Biological Sequences with Segment Rearrangements
S. Muthukrishnan.
Abstract:
Computational genomics involves comparing sequences based on ``similarity''
for detecting evolutionary and functional
relationships. Until very recently, available portions of the human genome
sequence (and that of other species) were
fairly short and sparse. Most sequencing effort was focused on genes and
other short units; similarity between such sequences
was measured based on character level differences. However with the advent
of whole genome sequencing technology
there is emerging consensus that the measure of similarity between
long genome sequences must capture the rearrangements
of large segments found in abundance in the human genome.
We abstract the general problem of computing sequence similarity in
the presence of segment rearrangements.
This problem is closely related to computing the smallest grammar for a string
or the block edit distance between two strings.
Our problem, like these other problems, is NP hard. The main result
I will present is a simple $O(1)$ factor approximation
algorithm for this problem. In contrast, best known approximations for the
related problems are factor $\Omega(\log n)$
off from the optimal. Our algorithm works in linear time, and in one pass.
In proving our result, we relate sequence similarity
measures based on different segment rearrangements, to each other, tight
up to constant factors.
This is joint work with Funda Ergun (CWRU) and Cenk Sahinalp (SFU).
--------------------------------------------------------------------------