Cherbas Lab
Research:
Regulatory Elements

Identifying conserved transcription factor binding sites (TFBSs) is a critical problem for understanding patterns of gene regulation. Recently the genomes of 10 Drosophila species have been sequenced. These sequences present a wealth of opportunities for identifying regulatory elements by sequence comparison, and new approaches are urgently needed. We have begun work on this problem by concentrating on sequences present in conserved positions in the vicinity of transcription start sites (TSSs).

For practical reasons, sequences like these are usually identified by statistical methods that find overrepresented "consensus" motifs. We have been experimenting with a contrasting approach based on the comprehensive lexical analysis (CLA) of specific, overrepresented DNA sequences. By analysis of published sources we assembled a database (Orthomine) of 3393 experimentally well-characterized Drosophila melanogaster TSSs. We analyzed all positionally specific words of length 5-8 nucleotides contained within the region [-250, +100] relative to the TSS. Using Monte Carlo methods, we identified significantly overrepresented words position by position. Using a measure based on silhouette width, similar words overrepresented at single positions were clustered by sequence similarity, then merged with their overrepresented cognates in adjacent positions.

For most of the positionally defined promoter elements, CLA gives results that are similar to those already known from other methods. However, CLA excels at detecting classes of sites that do not differ enough to be distinguished by "averaging" methods. In particular, we have found that one important promoter element is composed not of a single "average" motif but of 5 distinct "flavors". These flavors are likely to be biologically distinct, since they are associated with different TFBSs and with genes in distinct functional classes (GO terms).
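As a rough illustration of the position-by-position Monte Carlo step, the sketch below counts every word of length k at every offset in a set of TSS-aligned sequences and compares each count with counts from shuffled sequences. It is not the CLA implementation. The function names, the null model (shuffling each sequence, which keeps its base composition but destroys positional information), the trial count, and the threshold are all assumptions made for the example. No multiple-testing correction is applied.

```python
import random
from collections import Counter


def positional_word_counts(seqs, k):
    """Count each k-mer at each start offset across aligned sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[(i, s[i:i + k])] += 1
    return counts


def shuffled(seq, rng):
    """Return a random permutation of seq (assumed null model)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def overrepresented_words(seqs, k, n_trials=200, alpha=0.01, seed=0):
    """Return {(offset, word): (count, empirical p-value)} for words whose
    observed positional count is rarely matched in shuffled data."""
    rng = random.Random(seed)
    observed = positional_word_counts(seqs, k)
    exceed = Counter()  # trials in which the null count reached the observed count
    for _ in range(n_trials):
        null = positional_word_counts([shuffled(s, rng) for s in seqs], k)
        for key, obs in observed.items():
            if null[key] >= obs:
                exceed[key] += 1
    hits = {}
    for key, obs in observed.items():
        # Add-one estimate so the empirical p-value is never exactly zero.
        p = (exceed[key] + 1) / (n_trials + 1)
        if obs > 1 and p <= alpha:
            hits[key] = (obs, p)
    return hits
```

For sequences covering [-250, +100], an offset i in the result would correspond to position i - 250 relative to the TSS. Running the function once for each k from 5 to 8 would cover the word lengths described above.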