In biostatistics and bioinformatics, the hypergeometric distribution is often used to quantify how surprising the overlap between a set of results and an annotation category is. For example, the expression levels of 100 genes are changed by a drug treatment, and 50 of those genes are annotated as relating to the immune system. How surprising such an overlap is depends on the total number of genes examined in the analysis and the number of genes annotated as relating to the immune system.
However, a hypergeometric test is not perfect for this application, as it assumes the margins are fixed (“margins” meaning the sums along the sides of the contingency table, i.e. the number of changed genes and the number of immune system genes). While the annotation side might be considered fixed, the number of genes observed as changed is better considered a random variable, as it depends on the dataset.
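To make the fixed-margin calculation concrete, here is a minimal sketch of the usual hypergeometric test for the drug example. The totals are assumptions made up for illustration (10,000 genes examined, 1,000 of them annotated as immune system); only the 100 changed genes and the overlap of 50 come from the example above.

```r
# Hypothetical totals (not from the example): chosen only for illustration
n_total   <- 10000  # genes examined in the analysis (assumed)
n_immune  <- 1000   # genes annotated as immune system (assumed)
n_changed <- 100    # genes changed by the drug treatment
n_overlap <- 50     # changed genes that are also immune annotated

# P(overlap >= 50) when 100 genes are drawn at random, both margins fixed
p_overlap <- phyper(n_overlap - 1, n_immune, n_total - n_immune, n_changed,
                    lower.tail = FALSE)

# the same upper tail, written as a sum over the hypergeometric density
p_check <- sum(dhyper(n_overlap:n_changed, n_immune, n_total - n_immune,
                      n_changed))
print(c(phyper = p_overlap, dhyper_sum = p_check))
```

Both margins (`n_changed` and `n_immune`) enter this calculation as known constants, which is exactly the assumption in question.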
What happens if one of the margins is a random variable? Here is a simple example showing how the null distribution of the number of genes in the intersection changes when one of the margins is allowed to vary by different amounts.
- consider 100 genes, 20 of which are annotated for a given category
- the black (theoretical) and blue (simulated) densities come from randomly taking exactly 20 genes as changed
- the red density comes from flipping a coin with probability 0.2 of heads 100 times and taking the number of heads as the number of changed genes
- the green densities are variations on a censored negative binomial (capped at 100 genes), which has more variance than the binomial in red

# black: theoretical hypergeometric density with both margins fixed at 20
plot(0:20, dhyper(0:20, 20, 80, 20), type="l",
     xlab="# in intersection", ylab="density",
     main="how does a random margin change null distr.?")
n <- 1e4

# genes 1..20 are the annotated ones; count how many sampled genes fall among them
# blue: simulation with exactly 20 changed genes
dens <- table(factor(replicate(n, sum(sample(100, 20, replace=FALSE) <= 20)), levels=0:20))/n
lines(0:20, dens, col="blue")

# red: number of changed genes drawn from Binomial(100, 0.2)
dens <- table(factor(replicate(n, sum(sample(100, rbinom(1, prob=.2, size=100), replace=FALSE) <= 20)), levels=0:20))/n
lines(0:20, dens, col="red")

# green: number of changed genes drawn from a negative binomial with mean 20,
# censored at 100; smaller size means more variance
dens <- table(factor(replicate(n, sum(sample(100, min(100, rnbinom(1, mu=20, size=10)), replace=FALSE) <= 20)), levels=0:20))/n
lines(0:20, dens, col="green")
dens <- table(factor(replicate(n, sum(sample(100, min(100, rnbinom(1, mu=20, size=5)), replace=FALSE) <= 20)), levels=0:20))/n
lines(0:20, dens, col="green")
dens <- table(factor(replicate(n, sum(sample(100, min(100, rnbinom(1, mu=20, size=1)), replace=FALSE) <= 20)), levels=0:20))/n
lines(0:20, dens, col="green")

legend("topright", c("fixed theor.", "fixed simul.", "binom. simul.", "neg. binom. simul."),
       lwd=1, col=c("black", "blue", "red", "green"))
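The simulated curves can also be cross-checked without simulation. The sketch below computes the same null distributions exactly, by averaging hypergeometric densities over the distribution of the number of changed genes. The helper `mix_null`, the cutoff of 8 genes, and the choice of `size = 5` for the negative binomial are illustrative assumptions, not part of the code above.

```r
# Exact null distribution of the intersection when the number of changed
# genes K is random: a mixture of hypergeometric densities over K.
N <- 100   # genes examined
m <- 20    # genes annotated for the category
x <- 0:m   # possible intersection sizes

# mix_null is a hypothetical helper; k_prob[k + 1] must hold P(K = k), k = 0..N
mix_null <- function(k_prob) {
  sapply(x, function(xi) sum(k_prob * dhyper(xi, m, N - m, 0:N)))
}

# fixed margin: exactly 20 changed genes
fixed <- dhyper(x, m, N - m, 20)

# binomial margin: K ~ Binomial(100, 0.2)
binom <- mix_null(dbinom(0:N, size = N, prob = 0.2))

# censored negative binomial margin: all mass above N is moved onto K = N
nb_prob <- dnbinom(0:N, mu = 20, size = 5)
nb_prob[N + 1] <- pnbinom(N - 1, mu = 20, size = 5, lower.tail = FALSE)
negbin <- mix_null(nb_prob)

# upper-tail probability of 8 or more genes in the intersection (arbitrary cutoff)
tail8 <- c(fixed  = sum(fixed[x >= 8]),
           binom  = sum(binom[x >= 8]),
           negbin = sum(negbin[x >= 8]))
print(tail8)
```

In the binomial case each gene is effectively changed independently with probability 0.2, so the intersection follows a Binomial(20, 0.2) distribution, which has a larger variance than the fixed-margin hypergeometric. The tail probabilities in `tail8` show how much a fixed-margin p-value would understate the chance of a large overlap when the margin is in fact random.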
