Kreiman Lab

Code - Cisregul - Single Motifs

To search for overrepresented motifs, we used three available motif-finding programs: (i) alignACE (Hughes et al., 2000), (ii) MEME (Bailey et al., 1995) and (iii) MotifSampler (Thijs et al., 2002). Parameters not listed here were left at their default values. For alignACE, we set -numcols 10 and the GC frequency of the corresponding genome. For MEME, we used -minw 6, -maxw 20, -dna, -mod tcm, -nmotifs 100, -evt 1, -minsites 3, -maxsites 500, -revcomp and a 6th order background model. For MotifSampler, we used -n 10 and a 6th order background model. The background models and GC frequency were computed from the upstream sequences of all genes in the corresponding genome (5000 bp per gene for mouse, human and flies; 1000 bp per gene for yeast). Because the output of the motif search algorithms depends on the initial conditions, we ran 10 iterations of each motif search on the same sequences.

For the human muscle set, we ran the motif-finding algorithms both on the raw sequences and on the sequences after masking those segments not conserved in the mouse orthologs. Sequence conservation was determined using BLAST (default parameters were -p blastn -d 0 -G 1 -E 2 -X 50 -W 11 -q -2 -r 1 -F T -e 10 -S 3). Many other approaches for assessing sequence conservation have been proposed (see for example Wasserman et al., 2000; Loots et al., 2002; Nobrega et al., 2003), and many of them provide much more accurate models of evolutionary conservation between two sequences across species. This was not the focus of our current approach, but it is a step that could be improved in future versions.
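As an illustration of the background statistics mentioned above, the following minimal Python sketch computes a GC frequency and a k-th order Markov background model from a set of upstream sequences. The function names, the pseudocount and the dictionary-based model layout are assumptions made for this example; they are not the file formats expected by alignACE, MEME or MotifSampler, and this is not the code used in the original analysis.

```python
from collections import defaultdict

BASES = "ACGT"


def gc_frequency(sequences):
    """Fraction of G and C among all unambiguous bases in the sequences."""
    gc = total = 0
    for seq in sequences:
        for base in seq.upper():
            if base in BASES:
                total += 1
                gc += base in "GC"
    return gc / total if total else 0.0


def markov_background(sequences, order=6, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from the sequences.

    Returns a dict mapping each observed context (a string of length
    `order`) to a dict of base probabilities. The pseudocount is an
    assumption of this sketch, added so unseen transitions are not zero.
    """
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous characters such as N.
            if set(context + base) <= set(BASES):
                counts[context][base] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values()) + 4 * pseudocount
        model[context] = {b: (c[b] + pseudocount) / total for b in BASES}
    return model
```

In this sketch, the input would be the list of upstream regions (for example 5000 bp per gene), and `order=6` corresponds to the 6th order background model mentioned in the text.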

In addition, we used the available weight matrices from the TRANSFAC database of transcription factors and binding sites. We used all the entries that had the corresponding species in the species field and where a weight matrix was available from the TRANSFAC database, public release 6.0 (Wingender et al., 1997). For the muscle set, we used both the human and mouse weight matrices.

To remove redundancies (between PWMs obtained from different iterations of a motif-finding program, between two different motif-finding programs, or between the motif-finding programs and TRANSFAC), we used the Spearman correlation coefficient, which is more robust to outliers than the Pearson correlation coefficient. To compare two motifs, the corresponding weight matrices were converted to vectors by concatenating the rows (we refer to these vectors as "linearized weight matrices" in the text). We then computed the Spearman correlation coefficient of the two vectors and defined the similarity between the two PWMs as the correlation coefficient of the best alignment. Two motifs were considered redundant if the Spearman correlation coefficient was above 0.70 (similar results were obtained with a threshold of 0.80, not shown).

Furthermore, we only considered motifs with an information content (Stormo and Fields, 1998) larger than 0.2331 (this value corresponded to the lowest 5th percentile from the TRANSFAC database), a minimum length of 6 nucleotides and a minimum of 5 sequences used to define the weight matrix.
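The three filters above can be sketched as a single predicate. The text does not state how the information content was normalized, so this Python example makes an explicit assumption: it uses the relative entropy of each column against a background distribution, in bits, averaged over the columns of a frequency matrix. The uniform background, the per-column averaging and the function names are illustrative choices, so the 0.2331 threshold may not be directly comparable under this definition.

```python
import math


def information_content(pwm, background=(0.25, 0.25, 0.25, 0.25)):
    """Mean per-column relative entropy (bits) of a frequency matrix.

    `pwm` holds four rows (A, C, G, T) of base frequencies; each column
    is assumed to sum to 1. Zero frequencies contribute nothing.
    """
    width = len(pwm[0])
    total = 0.0
    for j in range(width):
        for b in range(4):
            f = pwm[b][j]
            if f > 0:
                total += f * math.log2(f / background[b])
    return total / width


def passes_filters(pwm, n_sites, min_ic=0.2331, min_length=6, min_sites=5):
    """Keep a motif only if it meets all three criteria from the text."""
    return (
        len(pwm[0]) >= min_length
        and n_sites >= min_sites
        and information_content(pwm) > min_ic
    )
```

Under this definition, a column with equal frequencies for all four bases scores 0 bits against a uniform background, and a column containing a single base scores 2 bits.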

The resulting list L containing nL non-redundant position specific weight matrices is used to search for modules (see Preliminary search for modules below).
