For certain sequencing experiments (e.g. methylation data), one might end up, at each location, with the ratio of reads satisfying a given property (e.g. ‘is methylated’) to the total reads covering that location, and want to test whether this ratio is significantly associated with a given variable, x.
One way to proceed would be a linear regression of ratio ~ x. However, a nucleotide covered by 100 reads is not statistically equivalent to one covered by 2 reads. With n covering reads and true proportion p, the number of reads with the property is binomial with mean p*n, and the observed ratio has variance p*(1-p)/n, so its distribution becomes increasingly concentrated around p as n increases. The case with 100 covering reads therefore gives us more information than the case with 2.
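To put a number on that difference, here is a minimal sketch of the standard error of the observed ratio at the two coverages mentioned above. The value p = 0.3 is an arbitrary choice for illustration, not something from a real data set.

```r
# assumed true proportion, chosen only for illustration
p <- 0.3

# the two coverages discussed above
coverage <- c(2, 100)

# standard error of the observed ratio y/n when y ~ Binomial(n, p)
se <- sqrt(p * (1 - p) / coverage)
names(se) <- paste(coverage, "reads")
se

# the ratio of the two standard errors depends only on the coverages
se[1] / se[2]
```

The ratio at 2 reads is about seven times noisier (sqrt(100/2)) than the ratio at 100 reads, whatever p is, and an unweighted regression of ratio ~ x ignores this.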
Here is a bit of code for using the glm() function in R with the binomial distribution with weights representing the covering reads.
n <- 100

# random poisson number of observations (reads)
reads <- rpois(n,lambda=5)
reads

# make a N(0,2) predictor variable x
x <- rnorm(n,0,2)

# x will be negatively correlated with the target variable y
beta <- -1

# through a sigmoid curve mapping x*beta to probabilities in [0,1]
p <- exp(x*beta)/(1 + exp(x*beta))

# binomial distribution from the number of observations (reads)
y <- rbinom(n,prob=p,size=reads)

# plot the successes (y) over the total number of trials (reads)
# and order the x-axis by the predictor variable x
par(mfrow=c(2,1),mar=c(2,4.5,1,1))
o <- order(x)
plot(reads[o],type="h",lwd=2,ylab="reads",xlab="",xaxt="n")
points(y[o],col="red",type="h",lwd=2)
points(reads[o],type="p",pch=20)
points(y[o],col="red",type="p",pch=20)
par(mar=c(4.5,4.5,0,1))

# more clear to see the relationship
# plot just the ratio
plot((y/reads)[o],type="h",col="red",ylab="ratio",xlab="rank of x")
points((y/reads)[o],type="p",col="red",pch=20)

# from help for glm():
# "For a binomial GLM prior weights are used to give
# the number of trials when the response is the
# proportion of successes"
fit <- glm(y/reads ~ x, weights=reads, family=binomial)
summary(fit)
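As a cross-check on the weighted fit, glm() also accepts a binomial response given as a two-column matrix of successes and failures, and that form should give the same coefficients. The sketch below regenerates data in the same way as above. The fixed seed, the names n_sites, keep and d, and the use of plogis() for the sigmoid are my own choices for the illustration. Sites with zero reads are dropped explicitly, since y/reads is NaN there.

```r
set.seed(1)                      # assumption: fixed seed so the run is reproducible

n_sites <- 100
reads <- rpois(n_sites, lambda = 5)
x <- rnorm(n_sites, 0, 2)
p <- plogis(-1 * x)              # same sigmoid as exp(x*beta)/(1 + exp(x*beta)) with beta = -1
y <- rbinom(n_sites, size = reads, prob = p)

# keep only sites with at least one covering read
keep <- reads > 0
d <- data.frame(x = x[keep], y = y[keep], reads = reads[keep])

# proportion response, with the number of trials as prior weights
fit_prop <- glm(y / reads ~ x, weights = reads, family = binomial, data = d)

# two-column response: successes and failures
fit_cbind <- glm(cbind(y, reads - y) ~ x, family = binomial, data = d)

# compare the estimated coefficients of the two fits
all.equal(coef(fit_prop), coef(fit_cbind))
```

Either form gives each site influence in proportion to its coverage, so which one to use is mostly a matter of which reads more clearly in the model formula.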
