Code - Cisregul - Module enrichment

We searched for occurrences of each of the modules selected in the previous step in the regulatory sequences (5 kb upstream + first exon + first intron) of all the genes in RefSeq for human and mouse, and of all annotated genes for fly and yeast (here labeled 'all genes search'). This set comprised 16979 genes in mm, 17689 genes in hs, 13639 genes in dm and 6327 genes in sc. Let Pg represent the frequency of occurrence of the module in all genes. As a null hypothesis, we assumed that the module occurs with frequency Pg in S, and we computed the probability of obtaining the observed number of occurrences in S, here denoted x, assuming a binomial probability distribution (for large samples, where Pg can be computed rather accurately, the hypergeometric distribution converges to the binomial distribution).
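The binomial calculation described above can be sketched as follows. This is a minimal illustration, not the original implementation. It assumes that the "probability of obtaining the observed number of occurrences" is the upper tail P(X >= x), which is the usual choice for an enrichment test; the text does not state this explicitly. The function name and the parameter names (n_s for the number of genes in S, count_all and n_all for the 'all genes search') are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_upper_tail(x, n_s, p_g):
    """Return P(X >= x) for X ~ Binomial(n_s, p_g).

    x   : observed number of genes in S containing the module
    n_s : number of genes in S (hypothetical name)
    p_g : frequency of the module in the 'all genes search' (Pg in the text)
    """
    if x <= 0:
        return 1.0
    if x > n_s or p_g <= 0.0:
        return 0.0
    if p_g >= 1.0:
        return 1.0
    total = 0.0
    for k in range(x, n_s + 1):
        # Work in log space so that large binomial coefficients do not overflow.
        log_term = (
            lgamma(n_s + 1) - lgamma(k + 1) - lgamma(n_s - k + 1)
            + k * log(p_g)
            + (n_s - k) * log1p(-p_g)
        )
        total += exp(log_term)
    return min(total, 1.0)


def background_frequency(count_all, n_all):
    """Pg: fraction of all genes whose regulatory sequence contains the module."""
    return count_all / n_all


# Made-up counts, for illustration only (not results from the study).
p_g = background_frequency(count_all=500, n_all=10000)
p_value = binomial_upper_tail(x=12, n_s=50, p_g=p_g)
```

With small gene sets, the hypergeometric distribution would be the exact alternative; the binomial form is used here because it is the one named in the text.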

We reported all modules with a probability of enrichment below 0.01 after Bonferroni correction for the number of putative modules. It should be noted that there may be a strong overlap between different modules in the resulting list. As an example, in the file listing the results for the fly pattern formation set with a maximum distance of 100 bp and a maximum of 3 motifs, motif number 215 appears in 3 modules from the top 10 predictions, motif number 270 appears 4 times, and even several combinations of motifs appear in several of the modules (e.g. the homotypic interaction of motif 270). Furthermore, even if the same motif does not appear in multiple modules, it is possible that different motifs have overlapping binding site predictions (as in the same file, when comparing the modules formed by motifs {214; 215} and the motifs {64; 215}).
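The Bonferroni threshold mentioned at the start of this paragraph can be sketched as below. This is an illustrative sketch under stated assumptions, not the original code: the function name, the dictionary layout (module label mapped to uncorrected p-value) and the example labels are hypothetical. By default it assumes that the number of tests equals the number of putative modules supplied.

```python
def bonferroni_filter(p_values, alpha=0.01, n_tests=None):
    """Keep modules whose Bonferroni-corrected p-value is below alpha.

    p_values : dict mapping a module label to its uncorrected p-value
    alpha    : significance threshold (0.01 in the text)
    n_tests  : number of putative modules tested; defaults to len(p_values)
    """
    if n_tests is None:
        n_tests = len(p_values)
    kept = []
    for module, p in p_values.items():
        # Bonferroni: multiply by the number of tests, capped at 1.
        corrected = min(1.0, p * n_tests)
        if corrected < alpha:
            kept.append((module, corrected))
    # Most significant modules first.
    kept.sort(key=lambda item: item[1])
    return kept


# Made-up p-values, for illustration only (not results from the study).
example = {"module_a": 0.001, "module_b": 0.004, "module_c": 0.5}
significant = bonferroni_filter(example, alpha=0.01)
```

In this made-up example only module_a passes, because its corrected value (0.001 x 3) stays below 0.01 while that of module_b (0.004 x 3) does not.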

When comparing multiple species, we searched for the occurrences of the module (defined by the same PSWMs) in the two species and computed an enrichment factor separately for each species.