Improving genome assembly by identifying reliable sequencing data

Rajat Shuvro Roy, Rutgers

Defense Committee: Prof. Alexander Schliep(Chair), Prof. Kevin Chen, Prof. Martin Farach-Colton, Prof. Andrey Grigoriev(External Member), Prof. Debashish Bhattacharya(5th Member)

Abstract

De novo Genome assembly is one of the classical problems of Bioinformatics. It has two major subproblems: contig construction and scaffolding. A contig is a continuous sub-sequence of the genome and is usually a few orders of magnitude longer than the sequenced reads (short sub-strings of length 25-500 of the genome that are produced by genome sequencing machines). Scaffolding attempts to construct a linear sequence of contigs (with possible gaps in between) using paired reads (two reads whose approximate distance on the genome follow a known distribution). In this talk I will present our work that improves both scaffolding and contig construction.

Many popular assemblers use the de Bruijn graph approach for contig construction. The complexity (both space and time) of contig construction can be greatly reduced by constructing the de Bruijn graph using reliable k-mers only. Our experiments show that this also improves the quality of contigs. For multi-cell read libraries with uniform coverage, reliable k-mers can be extracted by applying a simple fixed frequency cutoff on the k-mers. We present an efficient algorithm capable of computing k-mer frequency from large read libraries which were beyond the scope of any earlier tools. Identifying reliable k-mers from Whole Genome Amplified (WGA) data is even more challenging due to the coverage variation introduced by the amplification step (MDA, MALBEC, etc.), which implies that applying a simple k-mer frequency cutoff is unreasonable. We observed that with sufficient coverage, using partial reads (read prefix of a certain length) of length slightly larger than the k-mer length recovers a large proportion of genomic k-mers while keeping the proportion of erroneous k-mers low. Our experiments also show that using partial reads for assembly and gene prediction recovers a significant proportion of genes. We propose to use this approach for rapid pathogen detection using single cell genomics and 'on the fly' genome assembly.

Contig scaffolding is known to be NP-hard and many approximate solution have been proposed. The complexity and quality of the approximate solutions may be improved if unreliable paired reads are filtered out. We present a set of simple linear equations derived from the geometry of contigs on the line that can be used to identify reliable paired reads and also predict the relative positions and orientations of contigs. We show that these predictions are highly accurate and helps to greatly simplify the scaffolding problem.

Go Back to Colloquiua Listing