8 Steps to Select the Right Platform for RNA Sequencing

Quick Overview

01 Precision: What Level of Precision Is Required for Sequencing?
02 Read Count: How Many Do We Need?
03 Read Length: How Long Should Reads Be?
04 SR or PE: Single-End or Paired-End Sequencing?
05 RNA or DNA: Which to Measure?
06 Samples: How Much Material Do I Need to Prepare?
07 Budgeting: How Much Will I Need to Allocate?
08 Timing: How Long Does Measurement Take?

Each sequencing platform has attributes that distinguish it from the others. In fortunate cases, researchers have access to multiple platforms and can leverage the strengths of each. Indeed, some studies combine platforms, selecting each for a specific requirement. For instance, Illumina reads excel in coverage, SOLiD prioritizes accuracy, and Roche 454 and Pacific Biosciences are known for their long read lengths.

Choosing the most suitable platform entails considering numerous factors, spanning various dimensions. However, with the information provided here, researchers can navigate these complexities and make informed decisions regarding platform selection for RNA-seq experiments and their respective modalities.

For an introduction to the different types of RNA sequencing technologies, see the article How to Plan Your Next RNA Sequencing Experiment.

Precision: What Level of Precision Is Required for Sequencing?

When the aim is to identify SNPs or single-nucleotide editing events in RNA species, selecting a platform with a minimal error rate becomes paramount, because genuine SNPs must be distinguished from sequencing errors. Given a human SNP frequency of approximately 1 in 800, an accuracy of at least 99.9% is needed: at that level, errors occur about once per 1000 bases and so remain rarer than true SNPs. Only the SOLiD platform claims accuracy surpassing this threshold, while some platforms fall significantly short. Lower accuracy can nevertheless be compensated for by increasing the number of reads. For instance, covering the same RNA with 10 reads of 99.9% accuracy each effectively raises the accuracy of the consensus to at least 99.99%.
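The comparison between error rate and SNP frequency can be made concrete with a small calculation. This is a minimal sketch, assuming errors are independent and uniformly distributed along a read; the function name `expected_errors` is illustrative and not part of any sequencing toolkit.

```python
def expected_errors(accuracy: float, n_bases: int) -> float:
    """Expected number of miscalled bases among n_bases at a given per-base accuracy."""
    return (1.0 - accuracy) * n_bases


# Roughly one SNP per 800 nt, the figure quoted in the text.
SNP_SPACING = 800

for accuracy in (0.99, 0.999, 0.9999):
    errors = expected_errors(accuracy, SNP_SPACING)
    print(f"accuracy {accuracy:.2%}: ~{errors:.2f} errors per {SNP_SPACING} nt "
          f"(compared with ~1 true SNP)")
```

At 99% accuracy, errors outnumber true SNPs several-fold over the same stretch of sequence, which is why SNP calling on such data relies on multiple reads covering each position.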

Conversely, when the objective is to identify known protein-coding genes, improve gene structure model annotations, quantify transcripts, and potentially discover novel genes, the demand for precision diminishes. Programs that map reads to established gene models often permit one or even two mismatches. For instance, with 50-nucleotide reads and one mismatch allowed, the required identity between read and reference is 98% (49 of 50 bases). At this level, most widely used platforms, such as SOLiD, Illumina, 454, and IonTorrent, are viable options.
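The 98% figure follows directly from the read length and the number of mismatches tolerated. A short sketch of that arithmetic, using a hypothetical helper `min_identity` (not an option of any particular aligner):

```python
def min_identity(read_length: int, max_mismatches: int) -> float:
    """Fraction of bases that must match when max_mismatches are tolerated."""
    if max_mismatches > read_length:
        raise ValueError("cannot allow more mismatches than bases in the read")
    return (read_length - max_mismatches) / read_length


for mismatches in (1, 2):
    identity = min_identity(50, mismatches)
    print(f"50 nt read, {mismatches} mismatch(es) allowed: {identity:.0%} identity required")
```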

CD Genomics high-throughput RNA sequencing and library construction services enable in-depth analysis of transcriptomes.

Nanopore Direct RNA Sequencing

Read Count: How Many Do We Need?

It's common practice to assess coverage statistics in RNA-seq projects. As a rule of thumb, the human genome spans about 3000 million nucleotides (Mnt), of which roughly 1/30 is allocated to protein-coding genes. This implies that the RNA to be sequenced amounts to approximately 100 Mnt. With single-end reads of 100 nucleotides (nt) each (or paired-end reads of 50 nt per end), 1 million reads yield 100 Mnt of sequence data, equivalent to 1x coverage. A typical output for a standard platform is 30 million reads, providing 30x coverage. With 30 million reads, we can anticipate coverage of most expressed genes, though some less abundant ones might be missed.
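The rule-of-thumb coverage estimate above can be reproduced in a few lines. This is a sketch using only the round numbers from the text; the function name `coverage` and the `ends` parameter are illustrative choices.

```python
def coverage(n_reads: int, read_length: int, target_size: int, ends: int = 1) -> float:
    """Average fold coverage: total sequenced bases divided by target size.

    ends=1 for single-end reads, ends=2 for paired-end (read_length is per end).
    """
    return n_reads * read_length * ends / target_size


GENOME_NT = 3_000_000_000           # ~3000 Mnt human genome
TRANSCRIPTOME_NT = GENOME_NT // 30  # ~1/30 protein-coding, i.e. ~100 Mnt

print(coverage(1_000_000, 100, TRANSCRIPTOME_NT))           # single-end, 100 nt
print(coverage(1_000_000, 50, TRANSCRIPTOME_NT, ends=2))    # paired-end, 2 x 50 nt
print(coverage(30_000_000, 100, TRANSCRIPTOME_NT))          # typical 30 million reads
```

Note that this is an average: real transcript abundances span several orders of magnitude, so highly expressed genes receive far more than the average coverage and rare ones far less.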

To estimate how many reads map to a specific gene, we can assume an average gene size of 4000 nt (100 Mnt divided by 25,000 genes). With 30 million reads of 100 nt (or 50 nt per end for paired-end sequencing), giving 30x coverage, the average gene is expected to receive approximately 1200 reads (30 million reads divided by 25,000 genes). Consequently, if a gene is expressed at 1/1200th the level of an average gene, only about one read is expected to map to it, so whether it is detected at all is close to a coin toss. In practice, 30 million reads suffice for capturing most, though not all, expressed genes in a sample. Since many platforms can generate up to 30 million reads, this is typically not a limiting factor. Platforms capable of producing higher read counts are preferred for enhanced coverage, especially for analyzing alternative exon usage, rare events, or fine-grained gene modeling.
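The per-gene estimate can be sketched as follows. The Poisson model for read counts is an assumption added here for illustration, not something the article specifies; under it, a gene with one expected read is seen at least once about 63% of the time, which is in the same range as the rough even odds quoted above.

```python
import math


def reads_per_average_gene(n_reads: int, n_genes: int) -> float:
    """Expected reads for a gene of average length and average expression."""
    return n_reads / n_genes


def detection_probability(expected_reads: float) -> float:
    """P(at least one read), assuming read counts follow a Poisson distribution."""
    return 1.0 - math.exp(-expected_reads)


average = reads_per_average_gene(30_000_000, 25_000)  # 1200 reads
rare_gene = average / 1200                            # gene at 1/1200th of average

print(f"average gene: {average:.0f} reads expected")
print(f"rare gene: {rare_gene:.1f} read expected, "
      f"P(detected) = {detection_probability(rare_gene):.2f}")
```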

A newer technique known as 'capture sequencing' involves enriching RNA at specific loci in the human genome. This method has been successful in capturing RNA from around 50 loci, including protein-coding genes and long non-coding RNAs. By employing capture sequencing, researchers achieved over 4600-fold coverage of these loci, enabling the discovery of unannotated exons, novel splicing patterns, and in-depth investigations of well-studied genes. This underscores the challenge of attaining exhaustive coverage for every transcript within a gene locus.

Separately, the minimum number of reads needed to confirm the presence of a transcript remains a subject of debate. The literature offers conflicting examples, with some studies suggesting that a single read suffices, while others argue that fewer than 10 reads are inadequate. The appropriate threshold depends on various factors, including the study context, journal or database standards, and the overarching research objectives.

Read Length: How Long Should Reads Be?

For basic mapping to known genes within an organism, reads as short as 14 nucleotides (nt) can suffice. However, because some short reads map to multiple sites, longer reads become essential. At 50 nt, only a very small fraction of reads (<0.01%) still maps to multiple sites. Consequently, in practical terms, longer read lengths enable more robust differential expression studies and finer delineation of gene models.
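One rough way to see why a length of about 14 nt is the lower bound is to ask when the number of possible sequences of length k exceeds the size of the target. The sketch below assumes a random-sequence model and the ~100 million nt transcriptome estimate from the read-count section; both are simplifications introduced here, and real transcriptomes contain repeats and paralogs, which is why multi-mapping persists at short lengths.

```python
def min_unique_length(target_size: int) -> int:
    """Smallest k for which 4**k (number of possible k-mers) exceeds target_size."""
    k = 1
    while 4 ** k <= target_size:
        k += 1
    return k


def expected_chance_hits(k: int, target_size: int) -> float:
    """Expected number of chance matches of one k-mer in a random target."""
    return target_size / 4 ** k


TRANSCRIPTOME_NT = 100_000_000  # rule-of-thumb size used in the text

k = min_unique_length(TRANSCRIPTOME_NT)
print(f"shortest read length with more possible sequences than target bases: {k} nt")
print(f"expected chance hits at {k} nt: {expected_chance_hits(k, TRANSCRIPTOME_NT):.2f}")
print(f"expected chance hits at 50 nt: {expected_chance_hits(50, TRANSCRIPTOME_NT):.1e}")
```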

Nevertheless, many scenarios demand even longer reads, particularly when annotating new genes in species that lack extensive sequence resources such as genomes, expressed sequence tags (ESTs), or full-length cDNA. Longer sequences offer a distinct advantage over attempting to infer gene models solely from mapped, discontinuous 50 nt reads. Platforms like Roche 454 have proven effective in such applications thanks to their longer reads. In addition, the latest generation of Pacific Biosciences instruments and kits can generate reads of 10,000 nt or more, further expanding the scope of genomic exploration.

CD Genomics long-read RNA sequencing and library construction services enable in-depth analysis of transcriptomes.

Full-Length Transcripts Sequencing (Iso-Seq)

Nanopore Direct RNA Sequencing

SR or PE: Single-End or Paired-End Sequencing?

In an ideal scenario where every step of library preparation, from RNA fragmentation to cDNA synthesis, generates completely unbiased fragments representing the RNA sample, single-end (single-read, SR) and paired-end (PE) sequencing would yield comparable results. However, bias inevitably creeps in during these preparation stages. Sequencing both ends of each library fragment mitigates this by increasing the randomness of the sequenced positions, thereby improving data quality.

Paired-end sequencing offers a twofold advantage: not only does it increase the randomness of sequenced fragments, but it also allows for overlapping of sequences from short fragments, offering additional sequence confirmation. Most modern data analysis programs accommodate both SR and PE data seamlessly, eliminating any hindrance in downstream analysis.
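The overlap between the two mates of a pair depends only on fragment length and read length. A minimal sketch, ignoring adapter read-through and quality trimming (the function name `mate_overlap` is illustrative):

```python
def mate_overlap(fragment_length: int, read_length: int) -> int:
    """Number of bases covered by both mates of a read pair (0 if they do not meet)."""
    # A read cannot cover more of the insert than the fragment itself contains.
    effective = min(read_length, fragment_length)
    return max(0, 2 * effective - fragment_length)


for fragment in (80, 150, 300):
    print(f"{fragment} nt fragment, 2 x 100 nt reads: "
          f"{mate_overlap(fragment, 100)} nt read twice")
```

Bases in the overlapping region are read independently from both strands, which is the source of the additional sequence confirmation mentioned above; for fragments longer than twice the read length, the mates do not overlap and instead bracket an unsequenced middle section.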

Not all sequencing platforms support paired-end sequencing. Where it is available, opting for paired-end sequencing is advisable to maximize data quality and analytical insights.

Recommended reading: Single-read vs. Paired-end Sequencing.

RNA or DNA: Which to Measure?

Most sequencing platforms do not read RNA itself. Instead, they sequence double-stranded cDNA that has been reverse-transcribed from the RNA sample and PCR-amplified. However, certain research projects focus on RNA structural modifications, such as mRNA capping, and in such cases sequencing the RNA directly is preferable. This approach is exemplified by recent advances such as Nanopore sequencing, which sequences RNA directly instead of cDNA.

Samples: How Much Material Do I Need to Prepare?

With the advent of sequencing total RNA from individual cells, the question arises: is there a minimum requirement for sample material? Platforms utilizing amplified double-stranded cDNA effectively have no lower limit, but this doesn't imply that minimal material suffices. Increasing the sample material not only ensures an adequate supply for sequencing but also enhances the diversity of RNA species detected.

Most modern sequencing platforms offer specialized kits for library preparation from nanograms of total RNA, accommodating varying sample sizes. Single-molecule platforms in principle require just one molecule for sequencing, so the amount of input material is rarely a practical limitation on any current platform.

You can refer to our SAMPLE SUBMISSION GUIDELINES for more details about samples and preparation.

Budgeting: How Much Will I Need to Allocate?

While the cost of sequencing has significantly decreased over the past decade, it's important to acknowledge that cost remains a factor, especially considering the rising requirements and quality standards for publication. Although the ideal scenario would disregard cost, practical considerations necessitate budgeting.

Submitting RNA-seq libraries to commercial, national, or local core NGS facilities is an effective strategy for reducing cost without compromising quality.

Timing: How Long Does Measurement Take?

In the dynamic realm of genomics, swift progress is essential. Ideally, samples are prepared quickly, libraries are constructed carefully, and sequencing proceeds without delay. In reality, however, runs on platforms like Illumina, SOLiD, and 454 are often queued not because the machines are busy, but because there are not yet enough libraries to fill the flow cell for a single run.

Thus, the bottleneck in the workflow typically arises during library construction, where the accumulation of a requisite number of libraries precedes initiating instrument runs. Consequently, the work queue originates not from instrument availability but from preparatory library work.
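A back-of-the-envelope estimate shows how this queue arises. The run capacity and library arrival rate below are hypothetical placeholders, not specifications of any instrument; only the 30 million reads per library comes from the read-count discussion.

```python
import math


def libraries_per_run(run_capacity_reads: int, reads_per_library: int) -> int:
    """How many libraries fit in one run at the target depth per library."""
    return run_capacity_reads // reads_per_library


def weeks_until_full(needed: int, ready: int, new_per_week: int) -> int:
    """Whole weeks to wait until enough libraries have accumulated to start a run."""
    missing = max(0, needed - ready)
    return math.ceil(missing / new_per_week)


RUN_CAPACITY = 400_000_000   # hypothetical reads per run
READS_PER_LIBRARY = 30_000_000

needed = libraries_per_run(RUN_CAPACITY, READS_PER_LIBRARY)
print(f"libraries needed to fill one run: {needed}")
print(f"weeks to wait with 4 ready and 3 arriving per week: "
      f"{weeks_until_full(needed, ready=4, new_per_week=3)}")
```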

Upon completion of sequencing, the journey is far from over. Data analysis emerges as the next phase, and its duration can span from days to months, or even years, particularly in large-scale projects. Consequently, despite the brevity of sequencing instrument runs, the data analysis phase looms as a potentially protracted endeavor.

For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
