SEMINAR ON DATA MINING IN GENOMICS

Page under construction - in progress!

  SPRING 2002 SEMESTER 
16:198:673
INDEX:64546  SECTION: 01 
Instructor: Tomasz Imielinski 
Meeting Time: Tuesdays 10:30 A.M. - 1:00 P.M.
 

ABSTRACT


With the proliferation of genetic data sets ranging from the pedigree
data for families affected by various diseases to SNP data and gene
expression microarrays, there is a rapidly growing need for data
analysis and data mining of genetic data sets. Such techniques will
help geneticists discover causes of complex diseases (caused possibly
by multiple genes) and analyze gene expression data to learn behaviours
of different genes.  

The field of research which studies associations between genes and
diseases (phenotypes) is called functional genomics. While there has
been a lot of success in functional genomics for single gene diseases,
there has been very little, if any, for complex diseases, which are
caused by multiple genes. The objective of this seminar is to introduce
the research challenges of building software to support functional
genomics for complex diseases.

We will discuss currently used techniques for discovery of gene
location for single gene diseases. We will also review the data mining
methods such as association rule mining, decision trees, and cluster
analysis, which can be useful in tackling the much more difficult problem
of gene discovery for complex diseases.
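
To make the idea of association rule mining concrete, here is a minimal sketch that computes support and confidence for rules of the form "genotype pattern => affected". The records, item names (snp1=A etc.) and thresholds are invented for illustration and are not course data; the exhaustive search over small itemsets is a simplification, not the Apriori algorithm.

```python
from itertools import combinations

# Hypothetical toy data: one set of observed items per individual.
RECORDS = [
    {"snp1=A", "snp2=G", "affected"},
    {"snp1=A", "snp2=G", "affected"},
    {"snp1=A", "snp2=T"},
    {"snp1=C", "snp2=G"},
    {"snp1=A", "snp2=G", "affected"},
    {"snp1=C", "snp2=T"},
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= record for record in records) / len(records)


def confidence(antecedent, consequent, records):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(antecedent, records)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, records) / base


def rules_for(target, records, min_support, min_confidence, max_size=2):
    """Enumerate rules 'items => target' over all itemsets of up to max_size items."""
    items = sorted(set().union(*records) - {target})
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            antecedent = frozenset(combo)
            supp = support(antecedent | {target}, records)
            if supp < min_support:
                continue
            conf = confidence(antecedent, frozenset({target}), records)
            if conf >= min_confidence:
                found.append((combo, supp, conf))
    return found


if __name__ == "__main__":
    for combo, supp, conf in rules_for("affected", RECORDS, 0.4, 0.9):
        print(" & ".join(combo), "=> affected", f"support={supp:.2f}", f"confidence={conf:.2f}")
```

In this toy data neither SNP alone predicts the phenotype with high confidence, but the pair does, which is the kind of multi-gene pattern that makes complex diseases harder than single gene diseases.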

At the end of the semester we hope to achieve two objectives:

  1. Provide the whole group (students and instructor) with "hands on" familiarity with major existing software tools along with their algorithmic foundations. In other words, to have a complete review of the current state of software in functional genomics.
  2. Discuss original new approaches to functional genomics for complex diseases.
 The class will be divided into mini research groups, each of
which will be responsible for learning specific software and its
theoretical/algorithmic foundations and for making a presentation in
class. Each presentation will involve demonstrating the results
obtained from running the given software on specific synthetically
generated pedigrees.

The following software will be discussed and assigned to groups:

  1. LINKAGE
  2. GENEHUNTER
  3. VITESSE
  4. MENDEL
  5. SAGE
  6. ALLEGRO
  7. more...
      
      
      

      Syllabus

      1. Basic genetics explained to computer scientists: Jan 29th, Feb 5th
      2. Foundations of data mining methods (association rules, decision trees and clustering methods): Feb 12th, Feb 19th
      3. Linkage analysis algorithm and implementation: exercise in recursion on pedigrees: Feb 26th, March 5th
      4. Group presentations from March 12th onward until the end of the semester, possibly with one or two invited speakers to be announced
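
As a warm-up for the exercise in recursion on pedigrees, here is a minimal sketch of one classic pedigree recursion: the kinship coefficient. It is not the linkage likelihood algorithm itself, only a smaller problem with the same recursive shape. The pedigree, the individual names and the helper functions are hypothetical.

```python
from functools import lru_cache

# Hypothetical toy pedigree: individual -> (father, mother).
# Founders have (None, None).
PEDIGREE = {
    "gf": (None, None),
    "gm": (None, None),
    "a": ("gf", "gm"),
    "b": ("gf", "gm"),
    "sp": (None, None),
    "c": ("a", "sp"),
}


def generation(person):
    """Depth of a person in the pedigree; founders are generation 0."""
    father, mother = PEDIGREE[person]
    if father is None and mother is None:
        return 0
    return 1 + max(generation(p) for p in (father, mother) if p is not None)


@lru_cache(maxsize=None)
def kinship(i, j):
    """Probability that one allele drawn from i and one from j are identical by descent."""
    if i is None or j is None:
        return 0.0  # unknown parent of a founder: assumed unrelated
    if i == j:
        father, mother = PEDIGREE[i]
        return 0.5 * (1.0 + kinship(father, mother))
    # Always expand the individual from the later generation, so that we
    # never expand someone who is an ancestor of the other individual.
    if generation(i) < generation(j):
        i, j = j, i
    father, mother = PEDIGREE[i]
    return 0.5 * (kinship(father, j) + kinship(mother, j))


if __name__ == "__main__":
    print("full sibs a, b:", kinship("a", "b"))
    print("nephew c, uncle b:", kinship("c", "b"))
```

The memoisation with lru_cache matters on larger pedigrees, where the same pair of ancestors is reached through many paths.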

      Software Study Groups

      GENEHUNTER: Xiaojie Xi (CS), Li Deng (Genetics) and Xiaolei Huang (CS)
      LINKAGE: Rong Zhang (CS), Li Liang (Genetics) and Rong Xu (Statistics)
      VITESSE: Jianhuan Ye (CS) and Bin Xu (Genetics)
      SIMWALK: Dmitry Fradkin (CS), Andrei Anghelescu (CS) and Zhiyang Pan (Neuroscience)
      SAGE: Xiaofeng Liu (CS, molecular biology) and Yangzhe Xiao (CS)
      
      
      
      
      

      Schedule of presentations

      March 12th: Linkage Group
      March 26th: Vitesse Group
      April 2nd: Simwalk Group
      April 9th: Genehunter Group
      April 16th: SAGE Group
      April 23rd: External speaker (SNPs)
      April 30th: TBA
      
      
      Each presentation will involve:
      
      
      1. Description of the software: its objectives, functionality and algorithmic foundations (from their recommended reference list)
      2. Examples showing the results for the specific synthetic pedigrees
      3. Benchmark analysis - scaling of execution time and memory as a function of pedigree size, number of markers and allele distribution (homozygosity etc.)
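
One possible shape for the benchmark analysis is sketched below: run the same workload at several problem sizes and record wall-clock time and peak memory. The workload toy_analysis is a hypothetical stand-in, not any of the linkage packages, and tracemalloc only sees Python allocations; to benchmark an external program, one would time a subprocess call instead.

```python
import time
import tracemalloc


def toy_analysis(n_individuals, n_markers=5):
    """Hypothetical stand-in workload: build one genotype row per individual."""
    return [[(i + m) % 2 for m in range(n_markers)] for i in range(n_individuals)]


def benchmark(func, sizes):
    """Return a (size, seconds, peak_bytes) tuple for each problem size."""
    rows = []
    for size in sizes:
        tracemalloc.start()
        start = time.perf_counter()
        func(size)
        seconds = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        rows.append((size, seconds, peak_bytes))
    return rows


if __name__ == "__main__":
    for size, seconds, peak in benchmark(toy_analysis, [100, 1000, 10000]):
        print(f"{size:>6} individuals  {seconds:.4f} s  {peak / 1024:.1f} KiB peak")
```

Repeating the same loop over the number of markers, with pedigree size held fixed, gives the second scaling curve asked for above.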

          Software

          
          LINKAGE
          
          General Info
          
          User Documentation
          http://linkage.rockefeller.edu/soft/linkage/
          
          Platforms: PC (DOS/OS2/..), UNIX, VMS
          
          FTP Site
          ftp://linkage.rockefeller.edu/software/linkage/
          
          
          
          MENDEL
          
          General Info
          
          http://www.biomath.medsch.ucla.edu/faculty/klange/software.html
          MENDEL is a Fortran 77 program for the genetic analysis of human
          pedigree data under models involving a small number of loci. MENDEL is
          useful for segregation analysis, linkage calculations, genetic
          counseling, allele frequency estimation, and related kinds of problems. 
          FISHER is a Fortran 77 program for the genetic analysis of classical
          biometric traits like blood pressure or height that are caused by a
          combination of polygenic inheritance and complex environmental forces. 
          SEARCH is a Fortran 77 subprogram for function minimization. SEARCH
          permits bounds and linear constraints to be imposed on parameters and,
          in statistical applications, computes asymptotic standard errors and
          correlations of parameter estimates. 
          dGENE is a simple dBASE III program for the management of pedigree and
          locus data. It permits easy extraction of genetic data for use with
          MENDEL and FISHER. 
          LINKMEND is a Pascal program written by Dan Weeks that converts LINKAGE
          input files to MENDEL input files. 
          SPERM is a Fortran 77 program written by Ken Lange and Laura Lazzeroni
          for the analysis of sperm typing data. 
          
          User Documentation - none online, coming soon
          
          Platforms - not listed
          
          FTP Site
          Registration with Prof. Ken Lange at UCLA is required, at
          http://www.biomath.medsch.ucla.edu/faculty/klange/register.html
          
          Software at this website:
          MENDEL - Human Pedigree Analysis 
          FISHER - Genetic Analysis of Biometric Traits 
          
          SEARCH - Function Minimization 
          dGENE - Pedigree and Locus Data Management
          LINKMEND - Converts LINKAGE input files to MENDEL input files
          SPERM - Analysis of Sperm Typing Data
          
          
          
          VITESSE 
          
          General Info
          VITESSE is a software package that computes likelihoods with the
          functionality of the LINKMAP and MLINK programs from LINKAGE. VITESSE
          uses the novel algorithms of set-recoding and fuzzy inheritance to
          reduce the number of genotypes needed for exact computation of the
          likelihood, which accelerates the calculation. It also represents
          multilocus genotypes locus-by-locus to reduce the memory requirements.
          
          User Documentation
          http://watson.hgen.pitt.edu/docs/vitesse.html
          
          Platforms:
          Sun SPARCStation 10, Solaris 2.3
          Sun SPARCStation 2, SunOS 4.1.3
          DEC Alpha, OSF/1.3
          HP workstation, HP/UX 9.01
          IBM SP2 (parallel RS/6000 CPUs), AIX 3.21
          Silicon Graphics Challenge, IRIX 5.2
          Silicon Graphics Indigo, IRIX 4.0.5
          IBM-PC DOS, BorlandC Compiler (being tested)
          
          
          FTP Site
          ftp://watson.hgen.pitt.edu
            (login as 'anonymous' with your e-mail address as password)
          cd pub/vitesse
          get vitesse.tar.Z
          On your machine:
          uncompress vitesse.tar.Z
          tar xvf vitesse.tar
          All files will appear in the directory ./vitesse.
          
          You may need to register at
          http://watson.hgen.pitt.edu/register/new_main.html
          
          
          
          SIMWALK
          
          General Info
          SimWalk2 is a statistical genetics computer application for haplotype,
          parametric linkage, non-parametric linkage (NPL), identity by descent
          (IBD) and mistyping analyses on any size of pedigree. SimWalk2 uses
          Markov chain Monte Carlo (MCMC) and simulated annealing algorithms to
          perform these multipoint analyses.
          
          User Documentation
          http://watson.hgen.pitt.edu/docs/simwalk2.html
          http://www.well.ox.ac.uk/docs/simwalk2-SIMWALK.html
          
          Platforms - UNIX (SunOS/Solaris), MAC
          
          FTP Site
          ftp://watson.hgen.pitt.edu
          You may need to register at http://watson.hgen.pitt.edu/register
          
          
          
          SAGE
          
          General Info
          S.A.G.E. 4.0 (Statistical Analysis for Genetic Epidemiology version 4.0)
          is a software package containing new programs for use in the genetic
          analysis of family and pedigree data.
          http://darwin.cwru.edu/pub/sage.html - home page
          
          User Documentation
          http://darwin.cwru.edu/sage40/sage40.pdf
          
          Platforms: Digital Unix 4.0, Sun Solaris 2.5 (32 bit), Linux 2.x,
          Windows NT 4.0/2000/98/Me/XP
          
          FTP Site - registration on the website is required. A free trial
          version is available until March 1, 2002; after that, the purchase
          price is $6000:
          
          Site License: Unlimited CPU's, multiple users at a time, three sets of
          manuals, three support contracts*, all at one institution. We are
          willing to help set up the distribution at your site for the cost of one
          person's travel expenses. Also, we would consider giving a short course,
          at cost, for a minimum of 15 participants at your institution. Please
          contact us so that we can work something out for you.
          
          
          
          GENEHUNTER
          
          General Info
          GENEHUNTER is a software package that allows the very rapid extraction
          of complete multipoint inheritance information from pedigrees of
          moderate size.  This information is then used in exact computation of
          multipoint LOD scores and in a powerful new method of non-parametric
          linkage analysis.  Quick calculations involving dozens of markers, even
          in pedigrees with inbreeding and marriage loops, are possible with
          GENEHUNTER.  In addition, the multipoint inheritance information allows
          the reconstruction of maximum-likelihood haplotypes for all individuals
          in the pedigree and information content mapping which measures the
          fraction of the total inheritance information extracted from the marker
          data.  All of these calculations are performed in a user-friendly
          environment familiar to MAPMAKER, SIBS, or HOMOZ users.
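
          To make the notion of a LOD score concrete, here is a minimal sketch of the simplest case: the two-point LOD score for phase-known, fully informative meioses, LOD(theta) = log10[ theta^k (1 - theta)^(n-k) / 0.5^n ] for k recombinants among n meioses. This is not GENEHUNTER's multipoint computation; the function names and the example counts are our own assumptions.

```python
import math


def lod_score(recombinants, meioses, theta):
    """Two-point LOD score for phase-known meioses at recombination fraction theta."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    # log10 likelihood under linkage at theta
    log_linked = recombinants * math.log10(theta) + non_recombinants * math.log10(1.0 - theta)
    # log10 likelihood under no linkage (theta = 0.5)
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


def best_theta(recombinants, meioses, steps=50):
    """Grid search for the theta in (0, 0.5] that maximises the LOD score."""
    grid = [i / (2 * steps) for i in range(1, steps + 1)]
    return max(grid, key=lambda theta: lod_score(recombinants, meioses, theta))


if __name__ == "__main__":
    # Hypothetical example: 1 recombinant observed in 10 informative meioses.
    theta = best_theta(1, 10)
    print(f"theta = {theta:.2f}, LOD = {lod_score(1, 10, theta):.3f}")
```

          By convention a LOD score of 3 or more is taken as evidence for linkage, so the hypothetical family above would not be sufficient on its own.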
          
          http://www.cs.washington.edu/homes/pmork/final_project/GeneHunter.html
          
          User Documentation
          (install instructions)
          http://www-genome.wi.mit.edu/ftp/distribution/software/genehunter/INSTALL
          http://linkage.rockefeller.edu/soft/gh/
          http://linkage.rockefeller.edu/soft/gh/gh.pdf (PDF version)
          
          Release 1.1 Commands - http://bimas.dcrt.nih.gov/linkage/ghcmds.html
          
          Platforms - UNIX
          
          FTP Site
          http://www-genome.wi.mit.edu/ftp/distribution/software/genehunter/
          
          http://www.fhcrc.org/labs/kruglyak/Downloads/index.html (version 2.1); 
          http://waldo.wi.mit.edu/ftp/distribution/software/genehunter/gh2/
          
          
          
          
          ALLEGRO
          
          General Info
          For more info: allegro@decode.is
          
          User Documentation
          http://depts.washington.edu/statgen/Computing/Doc/Allegro/manual.pdf
          
          Platforms
          
          FTP Site
          
          
          
          

          Slides


          PowerPoint of Linkage Presentation - February 26, 2000

          Books and articles

           
          
          These are not textbooks for the course. There is really none - we will
          try to provide you as much relevant material as possible in class. The
          books below may be useful though:
          
           Michael S. Waterman
          
          
          "Computational Molecular Biology" by Pavel Pevzner
          
          
          "Handbook of Human Genetic Linkage" by Joseph Douglas Terwilliger
          
          
          
          

          Useful Links - major research groups, data sources, consortia and companies

          
          http://snp.cshl.org/ - SNP consortium
            
          http://bioinfo.weizmann.ac.il/cards-bin/carddisp?NRG1&search=erbb2&suff=txt
             http://linkage.rockefeller.edu/bib/algorithms - basic reference about software for linkage
             http://www-genome.wi.mit.edu/SNP/human/
             http://www.applera.com/press/prccorp091300.html
             http://www.washington.edu/students/crscat/genet.html
             http://www.globaltechnoscan.com/25thApr-2ndMay01/microarray.htm
             http://www.sas.com/products/miner/
             http://www.orchid.com/
             http://www.celera.com/
             http://www.incyte.com/
             http://www.nextwavestocks.com/biowave799.html
             http://munin.mskcc.org/
             http://ihome.cuhk.edu.hk/~b400559/complexdismap.html
             http://www.upenn.edu/almanac/v47/n21/Genomics.html
             http://www.cs.jhu.edu/~salzberg/cs439.html
            
          http://directory.google.com/Top/Science/Biology/Bioinformatics/Companies/
            
          http://www.advancetechmonitor.com/Products/Reports/Industry_Reports/PG/PG_TOC/PG_TOC.html
             http://www.silicoinsights.com/html/news_events.html
             http://www.labbook.com/
             http://www.santafe.edu/~shalizi/reviews/vapnik-nature/
            
          http://dmoz.org/Science/Biology/Genetics/Eukaryotic/Animal/Mammal/Human/
             http://dmoz.org/Science/Biology/Bioinformatics/Companies/
             http://www.doubletwist.com/corporate/faqs.jhtml
             http://www.bioinformatics.ucla.edu/genemine/
             http://www.megametrics.com/bioinformatics.htm
             http://www.blsystems.nl/index.html
             http://pubs.acs.org/hotartcl/ac/99/oct/focus.html
             http://bioinformatics.weizmann.ac.il/repository/mapping_software.html
             http://www.well.ox.ac.uk/docs/simwalk2.html
             http://cmag.cit.nih.gov/lserver/SIMWALK2.html
             http://linkage.rockefeller.edu/ott/
             http://linkage.rockefeller.edu/
             http://www.well.ox.ac.uk/
             http://www.sanger.ac.uk/
             http://www.nhgri.nih.gov/Data/
             http://www.lionbioscience.com/eng/index_c_1.htm
             http://www.research.ibm.com/journal/sj/402/goble.html
             http://www.gdb.org/
             http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html#Least
             http://www.dkfz-heidelberg.de/GeneCards/background.html#how
             http://www.cs.helsinki.fi/research/fdk/medical_genetics/
             http://www.ncbi.nlm.nih.gov/