Hypergeometric distribution versus binomial distribution
Given a set of m objects, k of which belong to class 1 and m-k to class 2, we draw n objects at random and without replacement. We ask: what is the probability of obtaining exactly x elements from class 1? This probability follows the hypergeometric distribution. For small n/m, it is well approximated by the binomial distribution with n trials and success probability p=k/m, evaluated at x.
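As a minimal sketch of the two distributions described above, the following Python code writes out both probability mass functions using only the standard library. The function names hypergeom_pmf and binom_pmf and the example values (m=6000, k=1500, n=40, x=10) are illustrative assumptions, not taken from the analysis itself.

```python
from math import comb

def hypergeom_pmf(x, n, k, m):
    """P(exactly x class-1 objects among n drawn without replacement
    from m objects, k of which belong to class 1)."""
    if x < 0 or x > n or x > k or n - x > m - k:
        return 0.0
    return comb(k, x) * comb(m - k, n - x) / comb(m, n)

def binom_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success probability p)."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Illustrative values only (assumed): n/m is small, so the two should be close.
m, k, n, x = 6000, 1500, 40, 10
exact = hypergeom_pmf(x, n, k, m)
approx = binom_pmf(x, n, k / m)
print(f"hypergeometric: {exact:.6f}  binomial: {approx:.6f}")
```

The binomial version treats each draw as independent with probability k/m, which is what sampling with replacement would give; the approximation is good when removing n objects barely changes the composition of the remaining pool.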
In the context of our search for enrichment, we have a set of n genes and a module that is present in k of the m genes in the genome. We ask: what is the probability of obtaining x genes that contain the module in the set of n, assuming that the module is equally frequent in the set and in the whole genome? In the text, we approximate this with a binomial distribution after computing the frequency in the whole genome as k/m. Our enrichment probability is therefore given simply by the cumulative binomial distribution evaluated at x, with n trials and success probability k/m.
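The calculation can be sketched as follows. This sketch assumes that "cumulative" refers to the upper tail, i.e. the probability of observing x or more genes with the module, which is a common convention for enrichment tests; the note does not state which tail is used. The function names and example numbers are hypothetical.

```python
from math import comb

def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p). Upper tail is an assumption here."""
    start = max(x, 0)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(start, n + 1))

def enrichment_pvalue(x, n, k, m):
    """Binomial enrichment probability using the genome-wide frequency k/m."""
    return binom_upper_tail(x, n, k / m)

# Hypothetical example: 20 of 40 genes in the set carry a module that is
# present in 1500 of 6000 genes genome-wide (background frequency 0.25).
print(enrichment_pvalue(20, 40, 1500, 6000))
```

If the lower tail or a two-sided probability is intended instead, only the summation range in binom_upper_tail needs to change.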
Here we show the hypergeometric probability distribution (blue) and the binomial probability distribution (red) for a relevant range of the parameters x, n, k and m. The genome size m varies from m=50 to m=5000 (left to right in the plot). The number of genes k with the module varies from 0.5m to 0.125m (top to bottom). The number of genes in the set is fixed at n=40. For m>1000 the two distributions are very hard to distinguish. Given that the smallest genome that we used (yeast) contains >6000 genes, the binomial distribution is a very good approximation in our case.
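The comparison can also be summarized numerically. The sketch below computes, for each parameter combination, the largest absolute difference between the two probability mass functions over all x. Only the endpoints of the ranges (m from 50 to 5000, k from 0.5m to 0.125m) and n=40 come from the description above; the intermediate grid values, the rounding of k and the helper names are assumptions made for illustration.

```python
from math import comb

def hypergeom_pmf(x, n, k, m):
    """Hypergeometric probability of exactly x class-1 objects in n draws."""
    if x < 0 or x > n or x > k or n - x > m - k:
        return 0.0
    return comb(k, x) * comb(m - k, n - x) / comb(m, n)

def binom_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)

def max_abs_difference(n, k, m):
    """Largest pointwise gap between the two distributions over x = 0..n."""
    p = k / m
    return max(abs(hypergeom_pmf(x, n, k, m) - binom_pmf(x, n, p))
               for x in range(n + 1))

n = 40                           # set size, as in the figure
genome_sizes = [50, 500, 5000]   # assumed grid; only the endpoints are given
fractions = [0.5, 0.25, 0.125]   # assumed grid; only the endpoints are given

for frac in fractions:
    for m in genome_sizes:
        k = round(frac * m)      # assumption: k rounded to the nearest integer
        diff = max_abs_difference(n, k, m)
        print(f"k/m={frac:<6} m={m:<5} max |hypergeometric - binomial| = {diff:.5f}")
```

The gap shrinks as m grows at fixed n, in line with the visual impression that the two curves become hard to tell apart for large genomes.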