RNA-seq is perhaps the most widely used method for quantifying transcripts by read counting: it measures and compares the number of copies of each transcript across cell types or conditions. However, traditional RNA-seq relies on a PCR amplification step, which generates enough DNA molecules for sequencing but also introduces bias. Because some fragments amplify more efficiently than others, PCR bias can cause certain transcripts to be overrepresented in the final sequencing library. The problem of PCR duplicates becomes more acute as the number of PCR cycles increases, as in single-cell RNA-seq. Unique Molecular Identifiers (UMIs) are an effective way to minimize PCR bias, leading to more accurate quantitative estimates of gene expression. UMI RNA-seq, also known as digital RNA-seq, has been widely used in academic and clinical research applications.
Figure 1. The use of UMIs in NGS libraries (Roloff et al. 2017).
What Is UMI?
UMIs are random oligonucleotide barcodes that are increasingly used to confidently identify PCR duplicates in next-generation sequencing (NGS) experiments, especially RNA sequencing (RNA-seq). They are highly diverse index sequences incorporated into each fragment during library preparation, before PCR amplification, so that every read can be traced back to its molecule of origin. True PCR duplicates can then be identified accurately, because they share both an identical UMI sequence and identical alignment coordinates. UMIs can be applied to a wide range of sequencing methods in which accurate quantification or detection of rare mutations is required, or in which the input is low, such as RNA-seq, single-cell RNA-seq, and immune repertoire sequencing.
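The duplicate rule described above (same UMI and same alignment coordinates) can be illustrated with a minimal Python sketch. The AlignedRead record, the group_by_molecule helper, and the example reads are hypothetical and exist only for illustration; real tools work on BAM files and often also tolerate sequencing errors within the UMI, which this sketch does not attempt.

```python
from collections import defaultdict
from typing import NamedTuple


class AlignedRead(NamedTuple):
    """Hypothetical, simplified stand-in for one aligned read."""
    name: str
    chrom: str
    pos: int      # alignment start coordinate
    strand: str
    umi: str      # UMI sequence attached before PCR


def group_by_molecule(reads):
    """Group reads into families that share UMI and alignment coordinates.

    Each family is assumed to originate from one molecule, so every read
    in a family beyond the first is treated as a PCR duplicate.
    """
    families = defaultdict(list)
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi)
        families[key].append(read)
    return dict(families)


# Made-up example reads, not real data.
reads = [
    AlignedRead("r1", "chr1", 100, "+", "ACGTA"),
    AlignedRead("r2", "chr1", 100, "+", "ACGTA"),  # same UMI and position as r1
    AlignedRead("r3", "chr1", 100, "+", "TTGCA"),  # same position, different UMI
    AlignedRead("r4", "chr2", 500, "-", "ACGTA"),  # same UMI, different position
]

families = group_by_molecule(reads)
unique_molecules = len(families)
duplicates = len(reads) - unique_molecules
print(f"{unique_molecules} unique molecules, {duplicates} PCR duplicate(s)")
```

Without the UMI, r1, r2 and r3 would all collapse into a single molecule because they align to the same position; with the UMI, r3 is kept as a distinct molecule and only r2 is flagged as a duplicate.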
Figure 2. Alignment of read families sorted by UMIs allows for the discrimination of rare variants from protocol artifacts introduced during PCR or sequencing procedures (Roloff et al. 2017). ECS denotes error-corrected sequencing.
Workflow of UMI RNA-seq
The workflow of UMI RNA-seq consists of RNA isolation, rRNA removal, cDNA library construction with UMI barcodes (Figure 3), library quality assessment, deep sequencing, and data analysis. In the library construction step, the rRNA-depleted RNA is fragmented, reverse transcribed into cDNA, and ligated to UMI adapters, followed by library amplification and library QC.
Figure 3. UMI incorporation and library amplification in UMI RNA-seq experiments (Dixit 2016).
After deep sequencing, raw data are preprocessed to remove adapter sequences and low-quality reads. UMIs in RNA-seq data can be identified using umitools reformat_fastq, and PCR duplicates can be marked using umitools mark_duplicates. Aligning read families sorted by UMI allows novel and rare transcript variants to be discovered and reads to be counted; read abundances can then be compared across samples to identify differentially expressed transcripts.
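To make the UMI identification step more concrete, the following Python sketch shows one common way a UMI can be carried through alignment: the UMI bases are trimmed from the start of the read and appended to the read name. This is not the umitools implementation. The 8-base UMI length, the position of the UMI at the 5' end of the read, the underscore separator, and the move_umi_to_header helper are all assumptions made for illustration; the actual read layout depends on the library kit.

```python
UMI_LENGTH = 8  # assumed UMI length; kit-specific in practice


def move_umi_to_header(header, seq, qual, umi_len=UMI_LENGTH):
    """Trim the UMI from the 5' end of a read and append it to the read ID.

    header, seq and qual are the first, second and fourth lines of one
    FASTQ record. Returns the rewritten (header, seq, qual) triple.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) <= umi_len:
        raise ValueError("read is too short to contain a UMI and an insert")

    umi = seq[:umi_len]
    # Keep any comment after the first space (e.g. Illumina "1:N:0:...").
    read_id, _, comment = header.partition(" ")
    new_header = f"{read_id}_{umi}"
    if comment:
        new_header += f" {comment}"
    return new_header, seq[umi_len:], qual[umi_len:]


# Made-up FASTQ record, not real data.
new_header, new_seq, new_qual = move_umi_to_header(
    "@read1 1:N:0",
    "ACGTACGTGGGGCCCC",
    "IIIIIIIIFFFFFFFF",
)
print(new_header)  # @read1_ACGTACGT 1:N:0
print(new_seq)     # GGGGCCCC
```

Storing the UMI in the read name means it survives alignment unchanged, so a later duplicate-marking step can read it back alongside the alignment coordinates.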