Code - Sequence Retrieval

Sequences were retrieved from GenBank (mm release 06/21/2003, hs release 10/22/2001), Ensembl (dm release 01-07-2003), and SGD (sc release 1/21/2003). The transcription start sites (TSS) and the boundaries of the first exon and first intron were taken from the corresponding genome annotations (NCBI RefSeq for Mus musculus and Homo sapiens, Ensembl for Drosophila melanogaster, and SGD for Saccharomyces cerevisiae). In many cases, investigators study sequences retrieved relative to the translation start site, which is much easier to characterize, instead of the TSS. The average annotated distance from the TSS to the ATG was 662 bp for the human muscle set and 1225 bp for the fly gene set. A comprehensive study of the distance between the TSS and the translation initiation ATG in the latest release of the Drosophila genome can be found in Ohler et al., 2003. When a gene had multiple alternative start sites, we used the one farthest upstream. Mouse homologs of the genes in the human muscle set were retrieved using the NCBI HomoloGene list. We analyzed the 5000 base pairs upstream of the TSS (1000 base pairs for yeast). In addition, we analyzed the first exon and the first intron. If the exon or intron was longer than 5000 base pairs, we used only the 5000 base pairs closest to the TSS. In cases of alternative splicing at the first exon/intron junction, we used the junction closest to the TSS.
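
The selection rules above can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names, the 0-based half-open coordinate convention, and the clipping of windows at chromosome ends are all assumptions made for illustration. Only the window sizes and the tie-breaking rules (TSS farthest upstream, 5000 base pairs closest to the TSS, junction closest to the TSS) come from the text.

```python
from typing import Iterable, Tuple

UPSTREAM_BP = 5000     # from the text; 1000 would be used for yeast
MAX_FEATURE_BP = 5000  # cap on first exon / first intron length, from the text


def most_upstream_tss(tss_positions: Iterable[int], strand: str) -> int:
    """Pick the alternative TSS farthest upstream.

    Upstream means the smallest coordinate on the '+' strand and the
    largest coordinate on the '-' strand.
    """
    positions = list(tss_positions)
    if strand == "+":
        return min(positions)
    if strand == "-":
        return max(positions)
    raise ValueError("strand must be '+' or '-'")


def upstream_window(tss: int, strand: str, chrom_length: int,
                    upstream_bp: int = UPSTREAM_BP) -> Tuple[int, int]:
    """Return the (start, end) interval upstream of the TSS.

    Coordinates are 0-based and half-open (an assumption), and `tss` is
    the index of the first transcribed base. The window is clipped at
    the chromosome ends (also an assumption).
    """
    if strand == "+":
        return max(0, tss - upstream_bp), tss
    if strand == "-":
        return tss + 1, min(chrom_length, tss + 1 + upstream_bp)
    raise ValueError("strand must be '+' or '-'")


def truncate_feature(start: int, end: int, strand: str,
                     max_bp: int = MAX_FEATURE_BP) -> Tuple[int, int]:
    """Keep at most `max_bp` of an exon or intron, on the side nearest the TSS.

    On '+' the TSS-proximal side is the low-coordinate end; on '-' it is
    the high-coordinate end.
    """
    if strand == "+":
        return start, min(end, start + max_bp)
    if strand == "-":
        return max(start, end - max_bp), end
    raise ValueError("strand must be '+' or '-'")


def closest_junction(junctions: Iterable[int], tss: int) -> int:
    """Among alternative first exon/intron junctions, pick the one nearest the TSS."""
    return min(junctions, key=lambda pos: abs(pos - tss))
```

For example, under these assumptions a plus-strand gene with alternative start sites at positions 1200, 1000, and 1500 would be anchored at position 1000, and a first intron spanning positions 100 to 8100 would be truncated to positions 100 to 5100.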