BMI/CS 776 Course Project
Incomplete List of Project Suggestions
- Design and evaluate a motif-finding method that takes into account
dependencies between non-adjacent positions.
- Implement and empirically compare motif-finding methods that vary
in the types of dependencies they can represent (e.g. zeroth- and
first-order Markov chains, Bayes nets, MDD).
- Design, implement and evaluate an algorithm for identifying
cis-regulatory modules (CRMs), i.e. arrangements of binding site motifs
that regulate a set of genes under certain conditions.
- Extend the Noto & Craven method for CRM finding by adding other
search operators. Evaluate the effect of these operators.
- Compare the time to convergence and the resulting accuracy when
EM and Gibbs sampling are used in a model with hidden state. The
model could be a MEME-style motif model, another type of hidden Markov
model, a stochastic context-free grammar, a mixture model, etc.
- Implement a method for gene finding that employs multiple
genomes. Investigate how the accuracy of the predictions is affected
by how closely related the informant genome is to the target genome
(e.g. you might use mouse, zebrafish, and fruit fly as informant genomes).
- Implement and compare generative
and discriminative probabilistic methods for a given task,
such as gene finding.
- Devise a randomized version of a traditional filter for finding
highly similar local regions in sequences. Use an algorithm
based on this randomization to efficiently find all high-scoring local
alignments in a set of sequences.
- Implement and experiment with an SCFG-based approach for
identifying RNA genes via cross-genome comparisons.
- Extend the method of Bockhorst and Craven for refining the
structure of a context-free grammar.
Devise a new operator and an appropriate heuristic for applying it.
Evaluate the method using a terminator data set.
- Design and evaluate an algorithm for aligning protein
networks.
- Implement and experiment with the module network approach of Segal et al.
- Design, implement and evaluate a method that clusters genes using
multiple sources of evidence, such as gene-expression data
and text associated with the genes.
- Implement and empirically compare an EM-based and an LDA-based
approach for discovering topics in a set of scientific
articles.
- Devise a grammar for some type of biological named entity
(e.g. gene/protein names). Implement and evaluate a model, based on
the grammar, for recognizing entities of this type in scientific
articles.
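
As a possible starting point for the suggestion above on comparing
zeroth- and first-order Markov chain motif models, the following is a
minimal sketch, not a reference implementation. It assumes the motif
occurrences are already aligned and of equal width, uses add-one
pseudocounts, and ignores the background model and the search for motif
locations. All function names and the toy sites are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def fit_zeroth_order(sites, pseudocount=1.0):
    """Estimate independent per-position base probabilities (a weight matrix)."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    model = []
    for i in range(width):
        counts = Counter(s[i] for s in sites)
        model.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return model

def fit_first_order(sites, pseudocount=1.0):
    """Estimate P(x_1) and position-specific P(x_i | x_{i-1}) for i > 1."""
    width = len(sites[0])
    start = fit_zeroth_order(sites, pseudocount)[0]
    transitions = []
    for i in range(1, width):
        pair_counts = Counter((s[i - 1], s[i]) for s in sites)
        table = {}
        for prev in ALPHABET:
            total = (sum(pair_counts[(prev, b)] for b in ALPHABET)
                     + pseudocount * len(ALPHABET))
            table[prev] = {b: (pair_counts[(prev, b)] + pseudocount) / total
                           for b in ALPHABET}
        transitions.append(table)
    return start, transitions

def log_prob_zeroth(model, seq):
    """Log-probability of seq when positions are treated as independent."""
    return sum(math.log(col[b]) for col, b in zip(model, seq))

def log_prob_first(model, seq):
    """Log-probability of seq when each position depends on the previous one."""
    start, transitions = model
    lp = math.log(start[seq[0]])
    for i, table in enumerate(transitions, start=1):
        lp += math.log(table[seq[i - 1]][seq[i]])
    return lp

# Toy aligned sites (made up): the last two positions always agree.
sites = ["AACC", "AAGG", "AACC", "AAGG"]
pwm = fit_zeroth_order(sites)
chain = fit_first_order(sites)

for candidate in ("AACC", "AACG"):
    print(candidate,
          round(log_prob_zeroth(pwm, candidate), 3),
          round(log_prob_first(chain, candidate), 3))
```

On these toy sites the zeroth-order model cannot distinguish AACC from
AACG, because each position is scored independently, while the
first-order chain prefers AACC. An empirical comparison would replace
the toy sites with real binding sites and evaluate on held-out data.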