High-throughput sequencing (HTS), a technology for determining genomic
sequences at large scale, is pervasive in clinical and biological
applications ranging from basic science to cancer research, and is
expected to gain enormous momentum in future personalized medicine.
To address this deluge of data, we developed new methods that
operate directly on reduced representations of the data and
enable the use of advanced statistics even on very large data
sets. For identifying copy number variants (CNVs), our approach
accelerates full Bayesian inference to the point where its running
time matches that of maximum-likelihood methods.
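
The abstract leaves the algorithmic details open; the following is a minimal
sketch, in Python, of the general idea of CNV inference on a reduced
representation: a per-position read-depth signal is collapsed into blocks of
near-constant coverage, and a per-block posterior over copy-number states is
computed from one sufficient statistic per block instead of one value per
position. The blocking heuristic, the Gaussian emission model, and the assumed
15x coverage per copy are illustrative assumptions, not the method described
above.

    import numpy as np

    def compress_depth(depth, threshold=12.0):
        """Greedy blocking: extend the current block while new positions stay
        within `threshold` of its running mean, otherwise start a new block.
        Returns block lengths and block means -- the sufficient statistics
        that downstream inference uses instead of the raw per-base signal.
        (Illustrative heuristic, not the method of the abstract.)"""
        lengths, means = [], []
        block_sum, block_len = depth[0], 1
        for x in depth[1:]:
            if abs(x - block_sum / block_len) > threshold:
                lengths.append(block_len)
                means.append(block_sum / block_len)
                block_sum, block_len = x, 1
            else:
                block_sum += x
                block_len += 1
        lengths.append(block_len)
        means.append(block_sum / block_len)
        return np.array(lengths), np.array(means)

    def copy_number_posterior(lengths, means, per_copy=15.0,
                              copies=(1, 2, 3, 4), sigma=6.0):
        """Toy per-block posterior over copy-number states under a Gaussian
        emission model and a flat prior. One likelihood evaluation per block
        replaces one per genomic position, which is where operating on the
        reduced representation saves time."""
        expected = per_copy * np.array(copies, dtype=float)
        log_lik = -lengths[:, None] * (means[:, None] - expected[None, :]) ** 2 \
                  / (2 * sigma ** 2)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        post = np.exp(log_lik)
        return post / post.sum(axis=1, keepdims=True)

    # Toy signal: diploid baseline around 30x coverage with a duplicated
    # segment around 45x (assumed numbers, for illustration only).
    rng = np.random.default_rng(0)
    depth = np.concatenate([rng.poisson(30, 500),
                            rng.poisson(45, 200),
                            rng.poisson(30, 300)]).astype(float)
    lengths, means = compress_depth(depth)
    print(len(depth), "positions reduced to", len(lengths), "blocks")
    # most likely copy number per block
    print(copy_number_posterior(lengths, means).argmax(axis=1) + 1)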
Typical data sets consist of two billion sequencing reads or more,
and large studies may comprise hundreds of such data sets. Core steps
of the analysis include correcting read errors, mapping reads to a
reference genome, and identifying genetic variants. We arrive at a
reduced representation of HTS data sets through a clustering method
that scales to billions of reads. Adaptations of downstream algorithms
operate directly on the clustered representations, enabling compressive
genomics and thereby increasing the fidelity of the analysis at constant
or reduced cost.
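
How the clustering and the downstream reuse fit together is easiest to see in
a small sketch. The following assumes a minimizer-style k-mer signature as the
clustering key and a stand-in aligner; both are illustrative choices, not the
specific clustering method referred to above. What it demonstrates is the
compressive step: the expensive operation (here, alignment) runs once per
cluster, and its result is reused for every read in that cluster.

    from collections import defaultdict

    def minimizer(read, k=15):
        """Lexicographically smallest k-mer of a read -- a cheap signature
        that tends to be shared by reads from the same locus (an illustrative
        clustering key, assumed for this sketch)."""
        return min(read[i:i + k] for i in range(len(read) - k + 1))

    def cluster_reads(reads, k=15):
        """Group reads by their minimizer; returns signature -> read indices."""
        clusters = defaultdict(list)
        for idx, read in enumerate(reads):
            clusters[minimizer(read, k)].append(idx)
        return clusters

    def map_compressively(reads, clusters, align):
        """Align only one representative per cluster and reuse its placement
        for all members: downstream work scales with the number of clusters,
        not the number of reads."""
        placements = {}
        for members in clusters.values():
            rep = reads[members[0]]
            hit = align(rep)            # expensive call happens once per cluster
            for idx in members:
                placements[idx] = hit
        return placements

    # Toy usage with a stand-in aligner (assumed interface: read -> position).
    reads = ["ACGTACGTGGA", "ACGTACGTGGT", "TTGCACGTAAA"]
    clusters = cluster_reads(reads, k=8)
    print(map_compressively(reads, clusters, align=lambda r: hash(r[:8]) % 1000))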