German nouns and gender
I’m working on a presentation about classification of strings, and using 240,000 German nouns as an example dataset.
PCA on training and test data
In the past months, I heard some talks where dimension reduction (e.g. taking the top k principal components) was used on the full data set before splitting the data into training and test sets. My...
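The excerpt is cut off, so the following is not the author's own conclusion. It is only a minimal sketch of one common way to keep test data out of the dimension reduction: fit the PCA on the training rows and project the test rows with that fit. The simulated matrix `x`, the 70/30 split, and `k = 5` are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical example data: 100 samples, 20 features
x <- matrix(rnorm(100 * 20), nrow = 100)

# Split first, before any dimension reduction
train.idx <- sample(nrow(x), 70)
x.train <- x[train.idx, ]
x.test <- x[-train.idx, ]

# Fit PCA using the training rows only
k <- 5
pca <- prcomp(x.train, center = TRUE, scale. = TRUE)
pc.train <- pca$x[, 1:k]

# Project the test rows using the training centering, scaling and rotation
pc.test <- predict(pca, newdata = x.test)[, 1:k]
```

Here `predict` on a `prcomp` object reuses the centering and scaling estimated from the training rows, so no information from the test rows enters the fit.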
Block bootstrap
In looking at sequential data (e.g. time-series or genomic data), any inference comparing different sequences needs to take into account local correlations within a sequence. For example, you might...
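As a rough illustration of the idea, here is a sketch of a moving block bootstrap for the mean of a single autocorrelated series. It may differ from what the full post does. The AR(1) series, the block length of 20, and the helper name `one.resample` are all assumptions for this example.

```r
set.seed(1)

# Hypothetical autocorrelated series (AR(1) with coefficient 0.7)
n <- 200
x <- as.numeric(arima.sim(list(ar = 0.7), n = n))

block.len <- 20                      # assumed block length
n.blocks <- ceiling(n / block.len)   # blocks needed to rebuild a series of length n

# Draw block start positions with replacement and concatenate the blocks,
# so correlation within each block is preserved
one.resample <- function(x, block.len, n.blocks) {
  starts <- sample(length(x) - block.len + 1, n.blocks, replace = TRUE)
  idx <- as.vector(outer(seq_len(block.len) - 1, starts, "+"))
  x[idx[seq_along(x)]]
}

# Bootstrap distribution of the mean, and its standard error
boot.means <- replicate(1000, mean(one.resample(x, block.len, n.blocks)))
se.block <- sd(boot.means)
```

Resampling whole blocks instead of single observations keeps the local correlation structure within each block, which is the point of the method. The block length is a tuning choice.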
Plotting hclust
After many years I’ve finally worked out the x and y coordinates of the points in plot.hclust. hang <- 0.07 hc <- hclust(dist) plot(hc) pt.heights <- c(hc$height[hc$merge[,1] <...
Points and line ranges
Two ways of plotting a grid of points and line ranges. I’m coming around to ggplot2. I recommend skimming the first few chapters of the book to understand what is going on – but it only takes about 30...
Pipe to Rscript
With this, I can switch from doing simple statistics on the command line using awk to using R, which is more familiar to me: blah blah blah | Rscript -e 'summary(scan(file("stdin")))'
Splitting data
The caret package has a nice function for splitting up balanced subsets of data. Though I don’t see why I don’t get 3 rows out of 10 in this example. The p argument is defined as “the percentage of...
Poisson regression
In trying to explain generalized linear models, I often say something like: GLMs are very similar to linear models but with different domains for the target y, e.g. positive numbers, outcomes in {0,1},...
How wrong is hypergeometric test with one random margin?
In biostats and bioinformatics, the hypergeometric distribution is often used to assign probability of surprise to the amount of overlap between results and annotation, e.g.: 100 gene levels are...
Binomial GLM for ratios of read counts
For certain sequencing experiments (e.g. methylation data), one might end up with a ratio of read counts at a certain location satisfying a given property (e.g. ‘is methylated’) and want to test if...
Jacob and Monod
The original gene regulation diagram? J Mol Biol. 1961 Jun;3:318-56. Genetic regulatory mechanisms in the synthesis of proteins. JACOB F, MONOD J.
Plot hclust with colored labels
Again I find myself trying to plot a cluster dendrogram with colored labels. With some insight from this post, I came up with the following function: library(RColorBrewer) # matrix contains...
More hclust madness
Here is a bit of code for making a heatmap, which orders the rows of a matrix such that the first column (as ordered in the dendrogram) has all 0s then all 1s, then the 2nd column is similarly...
Empirical Bayes and the James-Stein rule
Suppose we observe 300 individual estimates y_i which are distributed N(mean_i, sigma.y^2), with sigma.y known. Now if we assume mean_i ~ N(0, sigma.mean^2), the James-Stein rule gives an estimator for...
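To make the setup concrete, here is a small simulation sketch using the standard James-Stein form that shrinks toward zero, (1 - (n - 2) sigma.y^2 / sum(y^2)) * y. The excerpt is truncated, so the exact estimator in the full post may be written differently. The values `sigma.y = 1` and `sigma.mean = 2` and the variable names other than `sigma.y` and `sigma.mean` are assumptions.

```r
set.seed(1)

n <- 300
sigma.y <- 1      # known observation standard deviation (assumed value)
sigma.mean <- 2   # standard deviation of the true means (assumed value, simulation only)

# Simulate true means and the individual estimates
mean.i <- rnorm(n, 0, sigma.mean)
y <- rnorm(n, mean.i, sigma.y)

# James-Stein shrinkage factor toward 0, using only y and the known sigma.y
shrink <- 1 - (n - 2) * sigma.y^2 / sum(y^2)
js <- shrink * y

# Mean squared error of the raw and shrunken estimates against the truth
mse.raw <- mean((y - mean.i)^2)
mse.js <- mean((js - mean.i)^2)
```

Comparing `mse.js` with `mse.raw` shows how much the shrinkage helps in a given simulated data set. The shrinkage factor does not need `sigma.mean`, which is what makes the rule usable in practice.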
R gotchas
I put together a short list of potential R gotchas: unexpected results which might trip up new R users. For example, if we have a matrix m, m[1:2,] returns a matrix, while m[1,] returns a vector,...
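The matrix example can be checked directly. A short sketch with a hypothetical 3 x 2 matrix, plus `drop = FALSE`, which is the usual way to keep the matrix shape when selecting a single row:

```r
# Hypothetical 3 x 2 matrix
m <- matrix(1:6, nrow = 3)

two.rows <- m[1:2, ]               # still a matrix
one.row <- m[1, ]                  # dimensions are dropped: a plain vector
one.row.kept <- m[1, , drop = FALSE]  # 1 x 2 matrix, dimensions preserved

is.matrix(two.rows)
is.matrix(one.row)
is.matrix(one.row.kept)
```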
How to check your simple definition of p-value
I just read Andrew Gelman’s post about an article with his name on it starting with an inaccurate definition of p-value. I sympathize with all parties. Journalists and editors are just trying to reduce...
How to use latex math in Rmd to display properly on GitHub Pages
Working on our PH525x online course material, Rafa and I wanted to base all lecture material in Rmd files, as these are easy for students to load into RStudio to walk through the code. Additionally,...
Be precise
I’ve seen a lot of brash negativity lately on Twitter. Here are 3 reasons why you shouldn’t say “x sucks” or “y FAIL” on Twitter: 1. You are being sarcastic. Sarcasm doesn’t work on Twitter and some...
RNA-seq fragment sequence bias
Our paper was just published describing a new method for modeling and correcting fragment sequence bias for estimation of transcript abundances from RNA-seq: “Modeling of RNA-seq fragment sequence bias...