SEMINAR ON DATA MINING IN GENOMICS
SPRING 2002 SEMESTER
16:198:673
INDEX:64546 SECTION: 01
Instructor: Tomasz Imielinski
Meeting Time: Tuesdays 10:30 A.M. - 1:00 P.M.
ABSTRACT
With the proliferation of genetic data sets, ranging from pedigree
data for families affected by various diseases to SNP data and gene
expression microarray data, there is a rapidly growing need for data
analysis and data mining of genetic data sets. Such techniques will
help geneticists discover the causes of complex diseases (possibly caused
by multiple genes) and analyze gene expression data to learn the behaviours
of different genes.
The field of research which studies associations between genes and
diseases (phenotypes) is called functional genomics. While there has
been a lot of success in functional genomics for single gene diseases,
there has been very little, if any, for complex
diseases, which are caused by multiple genes. The objective of this
seminar is to introduce the research challenges of building software
to support functional genomics for complex diseases.
We will discuss the techniques currently used to discover gene
locations for single gene diseases. We will also review data mining
methods such as association rule mining, decision trees and cluster
analysis, which can be useful in tackling the much more difficult problem
of gene discovery for complex diseases.
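As a small illustration of one of the data mining methods named above, the sketch below computes the support and confidence of a single association rule. The records, the marker names (snp1, snp2) and the "affected" label are made up for illustration and do not come from any real data set.

```python
# Toy association-rule measures on hypothetical genotype/phenotype records.
# Each record is the set of items observed for one individual.
RECORDS = [
    {"snp1=A", "snp2=G", "affected"},
    {"snp1=A", "snp2=T", "affected"},
    {"snp1=A", "snp2=G"},
    {"snp1=C", "snp2=G"},
]


def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for record in records if itemset <= record)
    return hits / len(records)


def confidence(records, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the records."""
    both = support(records, set(antecedent) | set(consequent))
    base = support(records, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the records")
    return both / base


# Rule: snp1=A -> affected
rule_support = support(RECORDS, {"snp1=A", "affected"})
rule_confidence = confidence(RECORDS, {"snp1=A"}, {"affected"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule mining algorithm would enumerate many candidate rules and keep those above chosen support and confidence thresholds; this sketch only scores one rule.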
At the end of the semester we hope to achieve two objectives:
- To provide the whole group (students and instructor) with "hands
on" familiarity with the major existing software tools along with their
algorithmic foundations. In other words, to have a complete review of
the current state of software in functional genomics.
- To discuss original new approaches to functional genomics for
complex diseases.
The class will be divided into mini research groups, each of
which will be responsible for learning specific software and its
theoretical/algorithmic foundations and for making a presentation in
class. Each presentation will involve demonstrating the results
obtained from running the given software on specific synthetically
generated pedigrees.
The following software will be discussed and assigned to groups:
- LINKAGE
- GENEHUNTER
- VITESSE
- MENDEL
- SAGE
- ALLEGRO
- more...
Syllabus
- Basic genetics explained to computer scientists: Classes: Jan 29th, Feb 5th
- Foundations of data mining methods:
association rules, decision trees and clustering methods: Feb 12th, Feb 19th
- Linkage analysis algorithm and implementation:
exercise in recursion on pedigrees: Feb 26th, March 5th
- Group presentations from March 12th onward until the end of the semester,
possibly with one or two invited speakers to be announced
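To give a flavour of what recursion on pedigrees looks like, here is a minimal sketch that computes kinship coefficients recursively. It is not the linkage analysis algorithm itself, only a simpler classical recursion over the same kind of data structure. The pedigree and all names in it are hypothetical, and founders are assumed to be unrelated and non-inbred.

```python
from functools import lru_cache

# Hypothetical toy pedigree: person -> (father, mother); founders map to None.
# Individuals are listed so that parents always appear before their children.
PEDIGREE = {
    "gf": None,
    "gm": None,
    "a": ("gf", "gm"),
    "b": ("gf", "gm"),
    "x": None,
    "c": ("a", "x"),
}
ORDER = {name: index for index, name in enumerate(PEDIGREE)}


@lru_cache(maxsize=None)
def kinship(i, j):
    """Probability that random alleles drawn from i and j are identical by descent."""
    # Always recurse on the individual listed later: that person cannot be
    # an ancestor of the other, so stepping up to the parents is valid.
    if ORDER[i] < ORDER[j]:
        i, j = j, i
    parents = PEDIGREE[i]
    if i == j:
        if parents is None:
            return 0.5  # non-inbred founder compared with itself
        return 0.5 * (1.0 + kinship(*parents))
    if parents is None:
        return 0.0  # a founder is unrelated to anyone listed earlier
    father, mother = parents
    return 0.5 * (kinship(father, j) + kinship(mother, j))


print("siblings a, b:", kinship("a", "b"))
print("uncle b, niece c:", kinship("c", "b"))
```

The memoization via lru_cache matters on larger pedigrees, where the same pair of ancestors is reached along many paths.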
Software Study Groups
GENEHUNTER: Xiaojie Xi (CS) and Li Deng (Genetics), Xiaolei Huang (CS)
LINKAGE: Rong Zhang (CS) and Li Liang (Genetics), Rong Xu (Statistics)
VITESSE: Jianhuan Ye (CS), Bin Xu (Genetics)
SIMWALK: Dmitry Fradkin (CS), Andrei Anghelescu (CS), Zhiyang Pan
(Neuroscience)
SAGE: Xiaofeng Liu (CS, molecular biology) and Yangzhe Xiao (CS)
Schedule of presentations
March 12th: LINKAGE Group
March 26th: VITESSE Group
April 2nd: SIMWALK Group
April 9th: GENEHUNTER Group
April 16th: SAGE Group
April 23rd: External speaker (SNPs)
April 30th: TBA
Each presentation will involve:
- A description of the software, its objectives and functionality, and its
algorithmic foundations (from its recommended reference list)
- Examples showing the results for the specific synthetic pedigrees
- Benchmark analysis: scaling of execution time and memory as a function of the pedigree size, number of markers and allele distribution
(homozygosity, etc.)
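One possible shape for the benchmark analysis is sketched below: time a workload and record its peak memory for increasing pedigree sizes. The function toy_workload is a hypothetical stand-in, not one of the programs listed on this page; a group would replace it with a call that runs its assigned software on a generated pedigree. Note that tracemalloc only sees Python allocations, so memory used by an external program would need to be measured differently.

```python
import time
import tracemalloc


def toy_workload(n_individuals, n_markers):
    """Hypothetical stand-in for one run of a linkage program."""
    table = [[(i * j) % 3 for j in range(n_markers)] for i in range(n_individuals)]
    return sum(map(sum, table))


def benchmark(workload, sizes, n_markers=10):
    """Return (size, seconds, peak_bytes) for each pedigree size."""
    rows = []
    for n in sizes:
        tracemalloc.start()
        started = time.perf_counter()
        workload(n, n_markers)
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        rows.append((n, elapsed, peak))
    return rows


for n, seconds, peak_bytes in benchmark(toy_workload, [10, 100, 1000]):
    print(f"{n:>6} individuals  {seconds:.6f} s  {peak_bytes} bytes peak")
```

The same loop can be repeated over the number of markers or over different allele distributions to cover the other scaling dimensions.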
Software
LINKAGE
General Info
User Documentation
http://linkage.rockefeller.edu/soft/linkage/
Platforms: PC (DOS/OS2/..), UNIX, VMS
FTP Site
ftp://linkage.rockefeller.edu/software/linkage/
MENDEL
General Info
http://www.biomath.medsch.ucla.edu/faculty/klange/software.html
MENDEL is a Fortran 77 program for the genetic analysis of human
pedigree data under models involving a small number of loci. MENDEL is
useful for segregation analysis, linkage calculations, genetic
counseling, allele frequency estimation, and related kinds of problems.
FISHER is a Fortran 77 program for the genetic analysis of classical
biometric traits like blood pressure or height that are caused by a
combination of polygenic inheritance and complex environmental forces.
SEARCH is a Fortran 77 subprogram for function minimization. SEARCH
permits bounds and linear constraints to be imposed on parameters and,
in statistical applications, computes asymptotic standard errors and
correlations of parameter estimates.
dGENE is a simple dBASE III program for the management of pedigree and
locus data. It permits easy extraction of genetic data for use with
MENDEL and FISHER.
LINKMEND is a Pascal program written by Dan Weeks that converts LINKAGE
input files to MENDEL input files.
SPERM is a Fortran 77 program written by Ken Lange and Laura Lazzeroni
for the analysis of sperm typing data.
User Documentation - none online, coming soon
Platforms - not listed
FTP Site
You need to register with Prof. Ken Lange at UCLA at
http://www.biomath.medsch.ucla.edu/faculty/klange/register.html
Software at this website:
MENDEL - Human Pedigree Analysis
FISHER - Genetic Analysis of Biometric Traits
SEARCH - Function Minimization
dGENE - Pedigree and Locus Data Management
LINKMEND - Converts LINKAGE input files to MENDEL input files
SPERM - Analysis of Sperm Typing Data
VITESSE
General Info
VITESSE is a software package that computes likelihoods with the
functionality of the LINKMAP and MLINK programs from LINKAGE. VITESSE
uses the novel algorithms of set-recoding and fuzzy inheritance to
reduce the number of genotypes needed for exact computation of the
likelihood, which accelerates the calculation. It also represents
multilocus genotypes locus-by-locus to reduce the memory requirements.
User Documentation
http://watson.hgen.pitt.edu/docs/vitesse.html
Platforms:
Sun SPARCStation 10, Solaris 2.3
Sun SPARCStation 2, SunOS 4.1.3
DEC Alpha, OSF/1.3
HP workstation, HP/UX 9.01
IBM SP2 (parallel RS/6000 CPUs), AIX 3.21
Silicon Graphics Challenge, IRIX 5.2
Silicon Graphics Indigo, IRIX 4.0.5
IBM-PC DOS, BorlandC Compiler (being tested)
FTP Site
ftp://watson.hgen.pitt.edu
(log in as 'anonymous' with your e-mail address as the password)
cd pub/vitesse
get vitesse.tar.Z
On your machine:
uncompress vitesse.tar.Z
tar xvf vitesse.tar
All files will appear in the directory ./vitesse.
You may need to register at
http://watson.hgen.pitt.edu/register/new_main.html
SIMWALK
General Info
SimWalk2 is a statistical genetics computer application for haplotype,
parametric linkage, non-parametric linkage (NPL), identity by descent
(IBD) and mistyping analyses on any size of pedigree. SimWalk2 uses
Markov chain Monte Carlo (MCMC) and simulated annealing algorithms to
perform these multipoint analyses.
User Documentation
http://watson.hgen.pitt.edu/docs/simwalk2.html
http://www.well.ox.ac.uk/docs/simwalk2-SIMWALK.html
Platforms - UNIX (SunOS/Solaris), MAC
FTP Site
ftp://watson.hgen.pitt.edu
You may need to register at http://watson.hgen.pitt.edu/register
SAGE
General Info
S.A.G.E. 4.0 (Statistical Analysis for Genetic Epidemiology version 4.0)
is a software package containing new programs for use in the genetic
analysis of family and pedigree data.
http://darwin.cwru.edu/pub/sage.html - home page
User Documentation
http://darwin.cwru.edu/sage40/sage40.pdf
Platforms: Digital Unix 4.0, Sun Solaris 2.5 (32 bit), Linux 2.x,
Windows NT 4.0/2000/98/Me/XP
FTP Site - registration on the website is required; a free trial version is available until March
1, 2002.
After that, a license costs $6000 -
Site License: Unlimited CPUs, multiple users at a time, three sets of
manuals, three support contracts*, all at one institution. We are
willing to help set up the distribution at your site for the cost of one
person's travel expenses. Also, we would consider giving a short course,
at cost, for a minimum of 15 participants at your institution. Please
contact us so that we can work something out for you.
GENEHUNTER
General Info
GENEHUNTER is a software package that allows the very rapid extraction
of complete multipoint inheritance information from pedigrees of
moderate size. This information is then used in exact computation of
multipoint LOD scores and in a powerful new method of non-parametric
linkage analysis. Quick calculations involving dozens of markers, even
in pedigrees with inbreeding and marriage loops, are possible with
GENEHUNTER. In addition, the multipoint inheritance information allows
the reconstruction of maximum-likelihood haplotypes for all individuals
in the pedigree and information content mapping which measures the
fraction of the total inheritance information extracted from the marker
data. All of these calculations are performed in a user-friendly
environment familiar to MAPMAKER, SIBS, or HOMOZ users.
http://www.cs.washington.edu/homes/pmork/final_project/GeneHunter.html
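For readers new to LOD scores, the sketch below shows the simplest case: a two-point LOD score for phase-known meioses, where k recombinants are observed among n informative meioses and the likelihood is theta^k (1 - theta)^(n - k), compared against free recombination at theta = 0.5. This is only the textbook two-point definition, not the multipoint computation that GENEHUNTER performs, and the function names and counts are illustrative assumptions.

```python
import math


def lod_score(recombinants, meioses, theta):
    """Two-point LOD score: log10 of L(theta) / L(0.5) for phase-known meioses."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    k, n = recombinants, meioses
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null


def max_lod(recombinants, meioses, grid=50):
    """Scan theta over an even grid in (0, 0.5] and return (best LOD, best theta)."""
    thetas = [0.5 * (step + 1) / grid for step in range(grid)]
    return max((lod_score(recombinants, meioses, t), t) for t in thetas)


best_lod, best_theta = max_lod(recombinants=2, meioses=10)
print(f"max LOD {best_lod:.3f} at theta = {best_theta:.2f}")
```

By convention a LOD score of 3 or more is taken as evidence for linkage, which is why the small toy counts used here would not be considered significant.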
User Documentation
(install instructions)
http://www-genome.wi.mit.edu/ftp/distribution/software/genehunter/INSTALL
http://linkage.rockefeller.edu/soft/gh/
http://linkage.rockefeller.edu/soft/gh/gh.pdf (PDF version)
Release 1.1 Commands - http://bimas.dcrt.nih.gov/linkage/ghcmds.html
Platforms - UNIX
FTP Site
http://www-genome.wi.mit.edu/ftp/distribution/software/genehunter/
http://www.fhcrc.org/labs/kruglyak/Downloads/index.html (version 2.1);
http://waldo.wi.mit.edu/ftp/distribution/software/genehunter/gh2/
ALLEGRO
General Info
For more info: allegro@decode.is
User Documentation
http://depts.washington.edu/statgen/Computing/Doc/Allegro/manual.pdf
Platforms
FTP Site
Slides
PDF format
PowerPoint format
Postscript format
PowerPoint of Linkage Presentation - February 26, 2000
Books and articles
These are not textbooks for the course. There really are none; we will
try to provide you with as much relevant material as possible in class. The
books below may be useful, though:
Michael S. Waterman
"Computational Molecular Biology" by Pavel Pevzner
"Handbook of Human Genetic Linkage" by Joseph Douglas Terwilliger
Useful Links - major research groups, data sources, consortia and companies
http://snp.cshl.org/ - SNP consortium
http://bioinfo.weizmann.ac.il/cards-bin/carddisp?NRG1&search=erbb2&suff=txt
http://linkage.rockefeller.edu/bib/algorithms - basic reference about software for linkage
http://www-genome.wi.mit.edu/SNP/human/
http://www.applera.com/press/prccorp091300.html
http://www.washington.edu/students/crscat/genet.html
http://www.globaltechnoscan.com/25thApr-2ndMay01/microarray.htm
http://www.sas.com/products/miner/
http://www.orchid.com/
http://www.celera.com/
http://www.incyte.com/
http://www.nextwavestocks.com/biowave799.html
http://munin.mskcc.org/
http://ihome.cuhk.edu.hk/~b400559/complexdismap.html
http://www.upenn.edu/almanac/v47/n21/Genomics.html
http://www.cs.jhu.edu/~salzberg/cs439.html
http://directory.google.com/Top/Science/Biology/Bioinformatics/Companies/
http://www.advancetechmonitor.com/Products/Reports/Industry_Reports/PG/PG_TOC/PG_TOC.html
http://www.silicoinsights.com/html/news_events.html
http://www.labbook.com/
http://www.santafe.edu/~shalizi/reviews/vapnik-nature/
http://dmoz.org/Science/Biology/Genetics/Eukaryotic/Animal/Mammal/Human/
http://dmoz.org/Science/Biology/Bioinformatics/Companies/
http://www.doubletwist.com/corporate/faqs.jhtml
http://www.bioinformatics.ucla.edu/genemine/
http://www.megametrics.com/bioinformatics.htm
http://www.blsystems.nl/index.html
http://pubs.acs.org/hotartcl/ac/99/oct/focus.html
http://bioinformatics.weizmann.ac.il/repository/mapping_software.html
http://www.well.ox.ac.uk/docs/simwalk2.html
http://cmag.cit.nih.gov/lserver/SIMWALK2.html
http://linkage.rockefeller.edu/ott/
http://linkage.rockefeller.edu/
http://www.well.ox.ac.uk/
http://www.sanger.ac.uk/
http://www.nhgri.nih.gov/Data/
http://www.lionbioscience.com/eng/index_c_1.htm
http://www.research.ibm.com/journal/sj/402/goble.html
http://www.gdb.org/
http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html#Least
http://www.dkfz-heidelberg.de/GeneCards/background.html#how
http://www.cs.helsinki.fi/research/fdk/medical_genetics/
http://www.ncbi.nlm.nih.gov/