Technical advances in RNA-seq
Sanger sequencing and microarrays. Sanger sequencing was the first technology used for transcriptomics, and it enabled methods such as SAGE (serial analysis of gene expression). SAGE was one of the first attempts to quantify gene expression on a global basis. Almost immediately afterwards, microarrays based on complementary probe hybridization emerged and came to dominate transcriptome profiling for the next decade.
NGS. The advent of next-generation sequencing (NGS) technologies has enabled the sequencing approach to surpass the microarray approach. In 2006, the first RNA-seq paper was published, using 454/Roche technology. The era of RNA-seq dominance began in 2008 with the maturation of Illumina/Solexa technology. The most popular platforms for RNA-seq have been the Illumina Genome Analyzer and HiSeq. Illumina/Solexa technology can generate gigabases of data per run (initially 1 Gb per run for the Genome Analyzer in 2006 and 600 Gb per run for the HiSeq in 2012). Roche/454 technology generates reads long enough for RNA-seq but is hampered by relatively low throughput and high cost.
Third generation sequencing. Despite the popularity of NGS technologies, third-generation sequencing is beginning to be applied to RNA-seq. For example, Heliscope sequencing and single-molecule real-time (SMRT) sequencing have already been used in some RNA-seq studies. PacBio SMRT long-read sequencing can cover a complete transcript from the 5' end to the 3' poly(A) tail without fragmentation, yielding full-length cDNA sequences. This is useful for identifying new transcripts and new introns, and thereby for accurately identifying isoforms, alternative splicing sites, fusion gene expression, and allelic expression.
Table 1. The advantages of RNA-seq compared with other transcriptomics approaches (Wang et al. 2009).
|Technology||Tiling microarray||cDNA or EST sequencing||RNA-seq|
|Principle||Hybridization||Sanger sequencing||High-throughput sequencing|
|Resolution||From several to 100 bp||Single base||Single base|
|Reliance on genomic sequence||Yes||No||In some cases|
|Simultaneously map transcribed regions and gene expression||Yes||Limited for gene expression||Yes|
|Dynamic range to quantify gene expression level||Up to a few-hundredfold||Not practical||>8,000-fold|
|Ability to distinguish different isoforms||Limited||Yes||Yes|
|Ability to distinguish allelic expression||Limited||Yes||Yes|
|Required amount of RNA||High||High||Low|
|Cost for mapping transcriptomes of large genomes||High||High||Relatively low|
Challenges of RNA-seq
Workflow of RNA-seq based on NGS
The workflow of RNA-seq using high-throughput sequencing technology is illustrated in Figure 1. Briefly, long RNAs are first converted into a library of cDNA fragments through RNA or DNA fragmentation. Sequencing adaptors are then attached to each cDNA fragment, and sequence data are generated in a high-throughput manner from both ends (paired-end sequencing). The resulting sequence reads are subsequently aligned to the reference genome or transcriptome and are classified into three types: exonic reads, junction reads, and poly(A) end reads. A base-resolution expression profile can be generated from these three types of sequence reads.
Figure 1. A typical workflow of RNA-seq (Wang et al. 2009).
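The three read categories can be illustrated with a minimal Python sketch. It is not the classification procedure of any particular aligner. The function name `classify_read`, the block-based alignment representation, and the `min_poly_a` threshold are all assumptions made for illustration, and the poly(A) test is deliberately simplistic (a read ending in a run of A's).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) reference coordinates


def classify_read(blocks: List[Interval], read_seq: str,
                  exons: List[Interval], min_poly_a: int = 8) -> str:
    """Assign an aligned read to one of the three categories (hypothetical helper).

    blocks     : aligned segments of the read; more than one block means the
                 alignment is split across an intron.
    read_seq   : read sequence in transcript orientation.
    exons      : annotated exon intervals of the gene.
    min_poly_a : assumed minimum length of a trailing A run for a poly(A) end read.
    """
    # Simplification: any read ending in a long run of A's counts as a poly(A) end read.
    if read_seq.endswith("A" * min_poly_a):
        return "poly(A) end"

    # Every aligned block must lie entirely inside some annotated exon.
    def inside_exon(block: Interval) -> bool:
        start, end = block
        return any(start >= e_start and end <= e_end for e_start, e_end in exons)

    if not all(inside_exon(b) for b in blocks):
        return "other"

    # A split alignment spans an exon-exon junction; a single block is exonic.
    return "junction" if len(blocks) > 1 else "exonic"


exons = [(100, 200), (300, 400)]
print(classify_read([(120, 170)], "ACGT" * 12, exons))
print(classify_read([(180, 200), (300, 330)], "ACGT" * 12, exons))
print(classify_read([(370, 400)], "ACGTACGTACGTACGTACGT" + "A" * 10, exons))
```

Real pipelines take the block structure from the aligner's output (for example, the CIGAR string of a spliced alignment) and use more careful criteria for poly(A) tails.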
• Library construction
Figure 2. A typical library construction pipeline of RNA-seq.
Following sample collection, total RNA is usually isolated via organic extraction and/or silica-membrane spin columns. The total RNA sample is subsequently processed either by direct selection of poly(A) RNA or by selective removal of rRNA, because the abundant rRNA is usually not the research focus and greatly reduces the coverage of informative transcripts. Oligo(dT)-based mRNA purification is widely used in eukaryotes. However, RNA transcripts that lack poly(A) tails are missed. The ribo-depletion approach is often preferred over poly(A) RNA selection because it retains all non-ribosomal RNA species, including tRNA, ncRNAs, non-poly(A) mRNA, and preprocessed RNA. The two most popular rRNA depletion methods are: (i) hybridization of rRNA with biotin-labeled anti-rRNA probes, followed by removal with streptavidin-coated magnetic beads; and (ii) selective degradation of rRNA by a 5'-3' exonuclease that specifically recognizes rRNA with a 5' phosphate.
Fragmentation is subsequently conducted to reach the length required by the chosen NGS technology. Some small RNAs, such as microRNAs, piwi-interacting RNAs, and short interfering RNAs, can be sequenced directly without fragmentation. Larger RNA molecules need to be fragmented into smaller pieces (200-500 bp) to be compatible with deep-sequencing technologies. Common methods include cDNA fragmentation (DNase I treatment or sonication) and RNA fragmentation (hydrolysis or nebulization). However, each of these methods can introduce a different bias in the outcome. For example, cDNA fragmentation is usually strongly biased towards sequences from the 3' ends of transcripts, while RNA fragmentation shows little bias across the transcript body but is depleted for transcript ends. Therefore, cDNA fragmentation provides valuable information about the precise identity of these ends, and RNA fragmentation provides more uniform coverage of the transcript body.
In the classic NGS protocols, adapters are ligated onto sheared double-stranded DNA fragments. However, a major drawback of this approach is the loss of information on transcriptional direction. Pretreating the RNA samples with sodium bisulfite converts cytidine into uridine. The resulting widespread C-to-T transitions mark the coding strand of each transcript. Other methods that maintain strand specificity have also been proposed, such as direct ligation of RNA adaptors to the RNA sample before reverse transcription.
RNA-seq is currently dominated by three platforms: Illumina (Genome Analyzer and HiSeq), Applied Biosystems SOLiD, and Roche 454 Life Sciences systems. Read lengths range from 30 to 100 bp for Illumina and SOLiD, and from 200 to 500 bp for the 454 pyrosequencing system. 454-based RNA-seq is particularly attractive for non-model organisms without reference genomes or transcriptomes. Longer reads or paired-end short reads can reveal connectivity between multiple exons. RNA-seq is a powerful method for studying complex transcriptomes and revealing sequence variations in transcribed regions.
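To make the bisulfite idea concrete, the following Python sketch shows how C-to-T transitions could, in principle, reveal the strand of origin. It assumes complete conversion of every cytidine and a gap-free alignment of equal length, neither of which holds exactly in practice. The function names `bisulfite_read` and `infer_strand` are hypothetical and do not come from any published tool.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def bisulfite_read(transcript_dna: str) -> str:
    """Read expected from a transcript after full C->U conversion of the RNA.

    The transcript is given as DNA letters; U is read out as T after
    reverse transcription and sequencing. Assumes 100% conversion.
    """
    return transcript_dna.replace("C", "T")


def infer_strand(reference: str, read: str) -> str:
    """Guess the transcribed strand from mismatches against the forward reference.

    Both strings must be the same length and expressed on the forward strand.
    Reads from forward-strand transcripts show C->T changes; reads from
    reverse-strand transcripts appear as G->A changes on the forward strand.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have equal length")
    c_to_t = sum(1 for r, q in zip(reference, read) if r == "C" and q == "T")
    g_to_a = sum(1 for r, q in zip(reference, read) if r == "G" and q == "A")
    if c_to_t > g_to_a:
        return "+"
    if g_to_a > c_to_t:
        return "-"
    return "undetermined"


ref = "ACGTCCGATG"
forward_read = bisulfite_read(ref)
# A reverse-strand transcript is converted, then mapped back to the forward strand.
reverse_read = reverse_complement(bisulfite_read(reverse_complement(ref)))
print(infer_strand(ref, forward_read), infer_strand(ref, reverse_read))
```

A read that carries neither kind of transition is reported as undetermined, which is why the text describes the transitions as needing to be widespread to mark the strand reliably.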
Figure 3. A typical analysis pipeline of RNA-seq data.
Quality assessment is the first step in the bioinformatics analysis of RNA-seq data; it helps ensure a coherent final result by removing low-quality sequences, over-represented sequences, and adapter sequences. Once all reads have been filtered and mapped or assembled, gene expression levels can be inferred, yielding a genome-scale transcriptome map that is both qualitative and quantitative. RNA-seq also allows detection of differential expression (DE) across treatments or conditions. Normalization is required to adjust for differences between samples, such as library size, and for gene-specific features. Furthermore, RNA-seq makes it possible to identify SNPs and fusion genes and to study post-transcriptional gene regulation, such as RNA editing, degradation, and translation.
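As an illustration of the normalization step, the sketch below computes two widely used measures: counts per million (CPM), which adjusts for library size, and transcripts per million (TPM), which also adjusts for gene length as one example of a gene-specific feature. The text above does not prescribe a particular method, so the choice of CPM and TPM is an assumption, and the gene names, counts, and lengths are invented example values.

```python
from typing import Dict


def cpm(counts: Dict[str, int]) -> Dict[str, float]:
    """Counts per million: scale each gene's count by the library size."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no mapped reads")
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def tpm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Transcripts per million: correct for gene length, then for library size."""
    # Reads per base for each gene removes the advantage of longer genes.
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    scale = sum(rates.values())
    if scale == 0:
        raise ValueError("library has no mapped reads")
    return {gene: r * 1e6 / scale for gene, r in rates.items()}


# Invented example values for one sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

print(cpm(counts))
print(tpm(counts, lengths_bp))
```

In this example geneB has three times the raw count of geneA but is also three times as long, so the two genes receive the same TPM value. Dedicated differential expression tools use more elaborate normalization schemes than these simple ratios.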