PhD Defense
5/21/2014 12:00 pm
BioMaPS Conference Room (Room 260, Hill Center)

Reduced Representations for Efficient Analysis of Genomic Data; From Microarray to High Throughput Sequencing

Md Pavel Mahmud, Rutgers University

Defense Committee: Alexander Schliep (advisor), Kevin Chen, Martin Farach-Colton and Jan Freudenberg (The Feinstein Institute for Medical Research, NY)

Abstract

Since biological datasets are usually large and often contain repetitive, sometimes redundant, elements, it is natural to identify groups of similar objects and perform computations directly on those groups. The relevant type of similarity usually depends on the type of data and the application at hand. In this talk, I will discuss two fundamentally different kinds of datasets generated from biological experiments and show how downstream applications can exploit alternate, reduced representations of these datasets to save computational time.

Hidden Markov Models (HMMs) are widely used statistical models for analyzing biological data; for example, copy number data from microarray experiments can be segmented using Bayesian HMMs. In the first part of the talk, I will show how temporal dependency and locality among consecutive copy numbers can be exploited to efficiently approximate Markov Chain Monte Carlo (MCMC) sampling of Bayesian HMMs.

Although high-resolution microarray experiments are very cost-efficient and the appropriate statistical analysis is well established, the future of biological data analysis belongs to high-throughput sequencing (HTS). To overcome sequencing errors and increase the statistical confidence of the analysis, HTS datasets often have high coverage, which means more redundant reads. Some applications have exploited this redundancy in a somewhat ad hoc manner. In the second part of the talk, I will discuss a clustering-based approach for producing reduced representations of HTS data that other applications can use directly. As an example, I will show its applicability to, and benefits for, the read mapping problem.