Computer Science Department
University of Illinois at Urbana-Champaign

Cis-regulatory modules (CRMs) that perform similar functions should share binding sites for the same transcription factors. Often, these binding sites are unknown, and the relevant transcription factor motifs may also be unknown. Nevertheless, the shared binding sites should affect the statistical properties of the functionally related sequences. We seek to exploit this effect in developing alignment-free measures of similarity between regulatory sequences. Alignment-free sequence comparison can serve two purposes: (i) given the cis-regulatory modules in one species, to discover their orthologs in a highly diverged species (e.g., from fruit fly to mosquito), and (ii) given the cis-regulatory modules belonging to a pathway, to find other CRMs of this pathway in the same species.

The D2Z score: In the following paper, we developed the statistics for one such alignment-free measure of similarity between regulatory sequences. The basic idea is to count the number of short words (k-mers) shared by two given sequences, and to "normalize" this count so that it measures statistical significance.

  • "A statistical method for alignment-free comparison of regulatory sequences." - R. Kantorovitz, G. E. Robinson and S. Sinha.
    Bioinformatics, 2007. 23(13). (Special Issue on ISMB 2007).
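
The counting-and-normalizing idea can be illustrated with a minimal sketch. It computes the raw shared-word count between two sequences and converts it to a z-score. Note the assumption: the mean and standard deviation here are estimated empirically by shuffling both sequences, which is a simple stand-in and not the statistics derived in the paper. The function names and the parameters `n_shuffles` and `seed` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a, b, k):
    """Raw shared-word count: sum over words w of count_a(w) * count_b(w)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[w] for w, n in ca.items())


def d2_zscore_by_shuffling(a, b, k=3, n_shuffles=200, seed=0):
    """Z-score of d2(a, b) against a shuffled-sequence null.

    Assumption: shuffling preserves only base composition; it replaces
    the analytic mean and variance used for the published score.
    """
    rng = random.Random(seed)
    observed = d2(a, b, k)
    null = []
    for _ in range(n_shuffles):
        sa, sb = list(a), list(b)
        rng.shuffle(sa)
        rng.shuffle(sb)
        null.append(d2("".join(sa), "".join(sb), k))
    sigma = pstdev(null)
    if sigma == 0:
        return 0.0  # degenerate null (e.g. single-letter sequences)
    return (observed - mean(null)) / sigma
```

A large positive z-score suggests that the two sequences share more k-mers than expected from their base composition alone. A shuffling null is slow for long sequences, which is one reason an analytic normalization is attractive.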

Ab initio module discovery: We have used the D2Z score and a second alignment-free similarity score, in conjunction with simulated annealing search strategies, to discover CRMs in the control regions of co-expressed genes.

  • "Computational discovery of cis-regulatory modules in Drosophila, without prior knowledge of motifs" - A. Ivan, M. S. Halfon, S. Sinha.
    In review.
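
One way such a search could be set up is sketched below, under stated assumptions: the state is one fixed-width window per control region, the objective is the sum of pairwise shared k-mer counts between the chosen windows, and a move relocates a single window. The objective, the move set, the cooling schedule, and all names (`anneal_windows`, `t_start`, `t_end`) are illustrative and are not taken from the paper.

```python
import math
import random
from collections import Counter
from itertools import combinations


def shared_kmers(a, b, k):
    """Raw count of k-mer matches between a and b (unnormalized)."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum(n * cb[w] for w, n in ca.items())


def total_score(seqs, starts, width, k):
    """Sum of pairwise similarities between the selected windows."""
    windows = [s[p:p + width] for s, p in zip(seqs, starts)]
    return sum(shared_kmers(x, y, k) for x, y in combinations(windows, 2))


def anneal_windows(seqs, width, k=3, steps=2000,
                   t_start=2.0, t_end=0.05, seed=0):
    """Simulated annealing over window positions; every sequence must be
    at least `width` long. Returns (best_starts, best_score)."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    score = total_score(seqs, starts, width, k)
    best_starts, best_score = list(starts), score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        i = rng.randrange(len(seqs))
        proposal = list(starts)
        proposal[i] = rng.randrange(len(seqs[i]) - width + 1)
        new_score = total_score(seqs, proposal, width, k)
        delta = new_score - score
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            starts, score = proposal, new_score
            if score > best_score:
                best_starts, best_score = list(starts), score
    return best_starts, best_score
```

Accepting some score-decreasing moves at high temperature lets the search escape windows that are only locally similar. A normalized score would likely replace the raw count in practice, since raw counts favor repetitive or compositionally biased windows.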

Supervised CRM prediction: Given the known CRMs active in a specific tissue or stage of development, we can search near other co-expressed genes or genome-wide for functionally related CRMs, using our alignment-free measures of similarity. This is work in progress.
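
As a rough illustration of the supervised search, the sketch below slides a fixed-width window along a candidate region and ranks windows by their average similarity to a set of known CRMs. It is only a sketch: the raw shared k-mer count stands in for a normalized similarity measure, and the function names and the `width`, `step`, and `k` parameters are hypothetical.

```python
from collections import Counter


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def profile_similarity(pa, pb):
    """Raw shared-word count between two k-mer profiles (unnormalized)."""
    return sum(n * pb[w] for w, n in pa.items())


def scan_region(region, known_crms, width, step=50, k=3):
    """Return (score, start) pairs for each window, best first.

    The score is the mean similarity of the window to the known CRMs.
    """
    profiles = [kmer_profile(crm, k) for crm in known_crms]
    hits = []
    for start in range(0, len(region) - width + 1, step):
        window = kmer_profile(region[start:start + width], k)
        score = sum(profile_similarity(window, p) for p in profiles) / len(profiles)
        hits.append((score, start))
    hits.sort(reverse=True)
    return hits
```

For example, `scan_region(region, known_crms, width=500)` would return the candidate windows of a region in decreasing order of score, and the top-ranked windows could then be examined as putative functionally related CRMs.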

CRM discovery across large evolutionary divergence: A large body of experimentally validated CRMs for Drosophila development is catalogued in the REDfly database (Gallo et al. 2006). We are interested in finding the orthologs of these CRMs, to the extent that they are conserved, in highly diverged insect species such as the mosquito, beetle, wasp and honeybee. Traditional methods for finding orthologs, which rely on genome-wide alignments, break down for this application because the non-coding genomes of these insect species do not align well. We are therefore searching for these missing orthologs in the control regions of orthologous genes, using our alignment-free measures. This is work in progress.