RNA-seq fragment sequence bias

Our paper was just published describing a new method for modeling and correcting fragment sequence bias for estimation of transcript abundances from RNA-seq:

“Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation”

cartoon_biases

Q: What’s the point of the paper?

GC content bias – affecting the amplification of fragments – is widespread in sequencing datasets and well-known, yet there were no existing transcript abundance estimation methods that properly corrected for this sample-specific bias. We identified hundreds of cases where top methods mis-identified the dominant isoform due to fragment-level GC bias being left uncorrected. While there are existing methods for post-hoc correction of gene abundance estimates using gene-level GC content, the transcript-level bias correction task is much more difficult. Sample-specific technical variation in coverage on small regions of transcripts leads to dramatic shifts in abundance estimates.

Q: What do you mean by “systematic errors”?

You can split any errors into systematic and stochastic components, with the former a fixed quantity and the latter varying but with zero mean. With the measurement of transcript expression, the stochastic part can be minimized by, for example, increasing sequencing depth and increasing the sample size: the number of biological units measured to infer a population mean expression value. We used the term “systematic” in the title to underscore that higher sequencing depth and more samples will not help to remove the bias we describe.

Q: Is there a video where you explain the gist of this?

Yes. Below is a 5 minute video where I cover the basic ideas presented in the paper.

Q: When should I correct for fragment GC bias? Are all my results wrong?

There are two situations where it’s critical to correct for GC bias for transcript-level analysis:

(1) The most problematic situation is when one is comparing abundances across groups of samples, and the groups have variable GC content dependence. This often happens when samples are processed in different labs or at different times (ideally experiments should not be designed this way, but it is nevertheless common in genomics research). We demonstrate that this can lead to thousands of false positives results for differential expression of transcripts, and these differences will often rise to the very top of a ranked list. The solutions are to use methods which produce GC-bias-corrected transcript abundance estimates, to use block experimental design, to include experimental batch as a covariate in statistical analysis, and to examine GC content dependence across samples using software like FASTQC and MultiQC.

(2) The second point of warning is when any of the samples in the dataset contain strong GC dependence, that is, the library preparation was not able to adequately amplify fragments with GC content < 35% or > 65%*. Such samples were present in a subset of the batches of all the datasets we examined in the paper. Even if all of the samples come from a single batch, or the experiment has a block design with multiple batches, simple descriptive analysis of transcript abundance (e.g. how many isoforms are expressed per gene, which isoforms are expressed) will be inaccurate for hundreds-to-thousands of genes when ignoring fragment GC bias. Furthermore, differential expression, if present, could be attributed to the wrong isoform or isoforms within genes.

* These guidelines come from the Discussion section of an excellent paper on technical reproducibility of RNA-seq by the GEUVADIS project: ‘t Hoen et al (2013).

Q: Where are the details of the alpine statistical method?

The statistical method is detailed in the Supplementary Note. This allowed us to have more fine grained control of the presentation of LaTeX equations. The methods in the Supplementary Note were originally in the main text and were refined during peer review.

Q: Is alpine the only method implementing fragment GC bias correction?

No, the latest version of Salmon (v0.7, with methods described in the bioRxiv preprint) also implements a fragment GC bias correction similar to alpine, which runs at a fraction of the time. Salmon with the gcBias flag similarly reduces the mis-estimation of isoforms caused by fragment GC bias (see examples in Supp. Figure 5 of the latest Salmon preprint).

Q: Is the Salmon implementation of GC bias correction better than alpine?

The GC bias correction model is probably about equal, but other aspects of the Salmon method are superior to alpine. The alpine method, when estimating transcript abundances, focuses on one gene or locus at a time, and does not account for multi-mapping fragments across genes. The Salmon implementation is a full, transcriptome-wide estimation method which can simultaneously estimate and correct for sample-specific fragment GC bias (and other biases as well).

Our focus in writing alpine was to make a super-extensible method for modeling RNA-seq biases, in order to make rigorous comparisons of various methods for bias correction, and then to show that fragment GC bias correction in particular leads to more accurate estimates of transcript abundance. By super-extensible, I mean that one can easily modify details of the bias correction specification: any combination of bias effects (fragment length distribution, positional, random hexamer sequence bias, GC bias), interactions between these, or modifications to the spline parameters used to fit positional and fragment GC bias. It is also possible to take empirical fragment GC bias curves from alpine and directly incorporate these into the Polyester RNA-seq simulator. This results in simulated RNA-seq data with variable coverage that looks more similar to real RNA-seq coverage.

Links to papers mentioned in video:

Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias (2011)
https://www.ncbi.nlm.nih.gov/pubmed/21410973
Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing (2012)
https://www.ncbi.nlm.nih.gov/pubmed/22323520

RNA-seq fragment sequence bias

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112