Bioinformatics Workflow of RNA-Seq
RNA sequencing (RNA-seq) has become a standard tool for gene expression analysis, transcript discovery, and splicing quantification. The bioinformatics pipeline that processes raw sequencing data into interpretable biological results is as important as the sequencing itself—an inappropriate pipeline choice can produce misleading differential expression calls, miss important splicing events, or generate results that cannot be reproduced.
This guide is written for researchers who have basic familiarity with RNA-seq and need practical guidance on selecting and executing the right analysis pipeline. It covers the three major pipeline approaches available in 2025-2026, the key decisions at each step, the computational requirements to plan for, and the common mistakes that compromise data quality. The focus is on decision-oriented content: which pipeline to choose for a given project type, what QC metrics to track, and how to interpret the outputs at each stage. By the end of this guide, you should be able to design a complete RNA-seq analysis strategy that matches your research objectives, sample types, and available computational resources. For researchers planning large-scale transcriptome studies, RNA-seq services include both sequencing and bioinformatics analysis as integrated deliverables.
The RNA-Seq Bioinformatics Pipeline — Three Approaches, One Goal
Every RNA-seq data analysis pipeline follows the same five-stage structure: raw data quality control, read alignment or quantification, expression quantification and normalization, statistical analysis for differential expression, and functional interpretation. The differences between pipelines lie in how reads are assigned to genes or transcripts, which determines the sensitivity, accuracy, and computational requirements of the overall analysis. Understanding these differences before starting the analysis prevents costly re-analysis when the chosen pipeline turns out to be incompatible with the project data type or research objectives.
Three pipeline approaches dominate current practice. The choice between them depends on the research question, the availability of a high-quality reference genome and annotation, and the computational resources available.
Path 1 — Spliced alignment + gene counting: Reads are aligned to a reference genome using a splice-aware aligner (STAR or HISAT2), then assigned to genes using featureCounts or HTSeq. This is the most comprehensive approach: it detects novel splice junctions, identifies fusion transcripts, and provides the most accurate quantification at the gene level. The trade-off is computational cost—STAR alignment of a typical RNA-seq sample (30 million reads) takes 1-3 hours and requires 32 GB of RAM.
Path 2 — Lightweight transcript quantification: Tools like Salmon, Kallisto, and RSEM use a reference transcriptome (rather than the genome) to quantify transcript abundance directly from reads, bypassing the alignment step. Salmon can process 30 million reads in 5-15 minutes using 8 GB of RAM. The trade-off is that lightweight methods do not detect novel transcripts or splice junctions and are best suited for projects with a well-annotated reference transcriptome.
Path 3 — Transcriptome assembly: For species without a reference genome or when novel transcript discovery is the primary goal, de novo assembly tools (Trinity, rnaSPAdes) or genome-guided assemblers (StringTie, Scallop) reconstruct transcripts from aligned reads. This approach is the most computationally intensive and produces results that require additional validation, but it is the only option for characterizing transcriptomes in non-model organisms where no reference genome is available. StringTie, for example, can reconstruct full-length transcripts from aligned RNA-seq reads with high sensitivity, producing transcript models that can be used for downstream quantification and functional annotation.
| Project Type | Recommended Path | Key QC Metric | Storage per Sample |
|---|---|---|---|
| Model organism with good annotation | STAR + featureCounts + DESeq2 | Uniquely mapped reads >80% | ~15 GB (FASTQ + BAM + counts) |
| Non-model or incomplete annotation | Salmon + tximport + DESeq2 | Mapping rate >60% | ~10 GB (FASTQ + quant files) |
| Novel transcript discovery | STAR + StringTie + GFFcompare | Transcripts with coding potential | ~25 GB (FASTQ + BAM + assembly) |
| Long-read RNA-seq (Iso-Seq/Nanopore) | minimap2 + TALON/Bambu | Full-length read % | ~20-50 GB |
For research groups that prefer to outsource the bioinformatics analysis, RNA-seq bioinformatics services provide established pipelines for all three approaches, with customizable parameters to match project-specific requirements.
Workflow management for reproducibility: Regardless of which pipeline is chosen, the analysis should be executed within a workflow management system such as Nextflow (nf-core/rnaseq) or Snakemake. These frameworks ensure that the same steps are applied consistently across all samples, software versions are documented, and results can be reproduced by other researchers. The nf-core/rnaseq pipeline, for example, provides a community-validated RNA-seq workflow with STAR alignment, Salmon quantification, and comprehensive QC reporting.
Figure 1. RNA-seq bioinformatics pipeline — three analysis paths and their five shared stages
Caption: Overview of the three RNA-seq bioinformatics pipeline paths—spliced alignment + gene counting, lightweight transcript quantification, and transcriptome assembly—showing their five shared stages from raw data QC to functional interpretation.
Step 1 — Raw Data Quality Control
Quality control is the first and most critical step in any RNA-seq analysis. The QC assessment determines whether the data is suitable for the intended analysis and identifies problems that must be addressed before proceeding.
Key QC metrics:
- Per-base quality (Q-scores): >80% of bases should be above Q30. Lower scores indicate potential run quality issues or library degradation.
- GC content distribution: Should match expected GC content of the target species. Bimodal distribution suggests contamination or library bias.
- Adapter content: Should be <1% after trimming. Higher levels indicate incomplete adapter removal.
- Sequence duplication levels: For RNA-seq, rates above 60% suggest low library complexity or over-amplification.
- rRNA contamination: Should be <5% for poly(A)-selected libraries. Higher levels indicate incomplete rRNA depletion.
Tools and workflow: FastQC provides individual sample reports, and MultiQC aggregates them into a single report across all samples in the project. Trimmomatic and fastp perform adapter trimming and quality filtering — fastp is generally recommended for new projects as it is faster and provides built-in quality reporting. After trimming, a second round of QC should confirm that the trimming was effective without over-trimming into biologically meaningful sequence. If QC reveals significant issues in one or more samples (e.g., rRNA contamination >20%, median Q-score <30), the affected samples should be flagged before proceeding to alignment.
Per-sample vs. multi-sample QC strategy: A common recommendation is to run FastQC on each sample individually, then use MultiQC to aggregate results across the entire project. This approach reveals both sample-specific issues and systematic issues. The per-sample heatmap in MultiQC is particularly useful for identifying outlier samples that should be examined more closely before inclusion in DE analysis.
Figure 2. FastQC report — key metrics for RNA-seq data quality assessment
Caption: FastQC quality report for RNA-seq data showing per-base quality scores, GC content distribution, adapter content, sequence duplication levels, and rRNA contamination metrics with recommended thresholds.
Step 2 — Choosing Between Alignment, Lightweight Quantification, and Assembly
The core decision in an RNA-seq pipeline is how to assign reads to transcripts or genes. Three fundamentally different approaches are available, each with distinct trade-offs in accuracy, computational cost, and discovery capability.
STAR (spliced alignment): STAR uses a sequential maximum mappable seed search algorithm that is significantly faster than earlier aligners like TopHat or Bowtie2. It detects canonical and non-canonical splice junctions, chimeric reads for fusion detection, and provides alignments that can be used for both quantification and transcript assembly. For projects with specific alignment requirements, NGS services can be configured to deliver data in formats compatible with STAR, Salmon, or other analysis pipelines.
HISAT2 (hierarchical indexing): HISAT2 is an alternative splice-aware aligner using an FM-index based hierarchical index, requiring less memory (4-8 GB per sample) than STAR. It is viable when memory-constrained environments prevent STAR usage, though it detects fewer novel splice junctions on average.
Salmon (lightweight quantification): Salmon uses a quasi-mapping approach that determines the most likely transcript origin for each read without performing full alignment. It corrects for GC bias and sequence-specific bias, producing transcript-level TPM estimates within minutes.
StringTie (transcript assembly): StringTie assembles transcripts from aligned reads, producing a transcript model for each locus. It can be used with or without a reference annotation. For projects requiring de novo transcript discovery, StringTie followed by merging across samples produces a comprehensive transcriptome capturing novel isoforms.
Decision framework: For high-quality reference projects, STAR + featureCounts provides the most comprehensive gene-level quantification. For large projects prioritizing efficiency, Salmon offers comparable accuracy at a fraction of the compute time. For transcript discovery or non-model species, genome-guided assembly with StringTie is appropriate. For fusion detection, RNA-seq analysis services can incorporate STAR-Fusion or Arriba.
Practical consideration — running multiple pipelines: For high-priority projects, running both STAR and Salmon and comparing results provides internal validation. If both pipelines rank the same genes as top candidates, confidence is substantially higher than from a single pipeline.
Figure 3. Three RNA-seq analysis paths — alignment, lightweight quantification, and transcript assembly
Caption: Comparative diagram of three RNA-seq analysis paths showing data flow through STAR alignment + featureCounts, Salmon lightweight quantification, and StringTie transcript assembly, with key trade-offs in accuracy, computational cost, and discovery capability.
Step 3 — Expression Quantification and Normalization
Quantification converts aligned reads or mapped fragments into expression measures at the gene or transcript level. The choice of normalization method has a direct impact on differential expression results and inter-sample comparability.
Gene-level quantification with featureCounts: featureCounts counts reads overlapping each gene exons, producing a raw count matrix. This is the standard input for DESeq2 and edgeR. The key parameter is the strand-specificity setting: for stranded libraries, specifying the correct strandedness doubles the usable signal.
HTSeq-count: An alternative to featureCounts using Python-based intersection models. It is slower but provides more control over read assignment logic for complex gene models with overlapping annotations.
Transcript-level quantification with Salmon: Salmon produces estimated counts and TPM values per transcript. For gene-level analysis, tximport aggregates transcript estimates to gene level, accounting for transcript length differences. Salmon built-in bias correction reduces systematic errors.
Normalization methods: TPM normalizes for both sequencing depth and transcript length, appropriate for within-sample comparisons. DESeq2 median-of-ratios normalizes for depth only and is standard for between-sample DE analysis. FPKM and RPKM are older methods largely superseded by TPM and DESeq2 normalization.
Which normalization to use when: For within-sample comparisons, TPM is appropriate. For between-sample DE analysis, DESeq2 or TMM normalization should be used. For meta-analyses combining multiple studies, TPM is preferred as it is less sensitive to library size differences.
Figure 4. Normalization methods — TPM, DESeq2 median-of-ratios, and FPKM compared
Caption: Comparison of TPM, DESeq2 median-of-ratios, and FPKM normalization showing formulas, use cases (within-sample vs between-sample), and biases from incorrect normalization choice.
Step 4 — Differential Expression Analysis
Differential expression (DE) analysis identifies genes whose expression changes significantly between conditions. The choice of statistical method affects power to detect true positives, false positive control, and ability to handle RNA-seq data characteristics.
Three mainstream methods:
- DESeq2: Negative binomial model with shrinkage dispersion estimator. Performs well with small sample sizes (3-4 replicates) and provides robust FDR control. Built-in count outlier detection reduces false positives.
- edgeR: Negative binomial model with empirical Bayes dispersion estimation. Slightly more sensitive for datasets with larger replicates (>6 per group). Quasi-likelihood framework provides more conservative FDR control.
- limma-voom: Linear modeling framework with precision weights. Computationally efficient, results comparable to DESeq2 for most datasets. Well suited for complex experimental designs with multiple factors.
With only two replicates, DE analysis can be performed but FDR control is unreliable. Standard output is genes ranked by adjusted p-value and log2 fold change, typically filtered at FDR < 0.05 and |log2FC| > 1. For standard analysis, genomic data analysis services provide validated DESeq2 and edgeR pipelines.
Figure 5. Differential expression methods — DESeq2, edgeR, and limma-voom compared
Caption: Comparison of DESeq2, edgeR, and limma-voom showing statistical models, dispersion estimation, optimal sample sizes, and best-fit use cases for each method.
Functional Interpretation — From Gene Lists to Biological Mechanisms
Once differentially expressed genes (DEGs) are identified, three complementary approaches interpret their biological significance:
- Over-representation analysis (ORA): Tests DEG enrichment in GO categories or KEGG pathways relative to background. Requires significance threshold and is sensitive to the chosen cutoff.
- Gene set enrichment analysis (GSEA): Uses full ranked gene list to test enrichment at top or bottom of ranking. Does not require a threshold and detects coordinated changes even when individual genes are not significant.
- Pathway and network analysis: Tools like IPA, Metascape, and clusterProfiler map DEGs onto biological pathways. Output includes pathway activation and upstream regulator predictions.
Database version control is essential for reproducibility — GO, KEGG, and Reactome databases are updated regularly. A common mistake is using a database version incompatible with the annotation version used during alignment. Customized bioinformatics services can incorporate pathway enrichment and network analysis for projects requiring additional depth.
Single-Cell RNA-Seq Bioinformatics — A Different Paradigm
Single-cell RNA-seq (scRNA-seq) bioinformatics differs fundamentally from bulk analysis. The data is much sparser and technical noise from dropout events is the dominant challenge. The standard scRNA-seq workflow includes: raw data processing (CellRanger), quality filtering, normalization (SCTransform, scran), dimensionality reduction (PCA, UMAP), clustering (Louvain, Leiden), marker gene identification, and cell type annotation.
Computational requirements are substantially higher: a 10,000-cell dataset requires 16-32 GB RAM. A 100,000-cell dataset may require 64+ GB. Cloud or HPC clusters are recommended for datasets exceeding 50,000 cells. Bioinformatics services provide standardized pipelines for both bulk and scRNA-seq analysis.
Computational Requirements for RNA-Seq Data Analysis
Planning computational resources before sequencing data arrives prevents analysis bottlenecks. For a 48-sample bulk RNA-seq project (30M reads each, 2x150 bp):
- Storage: ~430 GB raw FASTQ + 250-400 GB BAM + 10-50 GB intermediate files = 700-900 GB total. Compressed FASTQ reduces requirements by 40-60%.
- Compute time: STAR alignment 48-144 hr single-core (parallelizable). Salmon quantification 4-12 hr. DESeq2 30 min-2 hr. StringTie assembly 24-72 hr.
- Memory: STAR 32 GB per sample. DESeq2/edgeR 8-16 GB. Salmon 4-8 GB. scRNA-seq (10K cells) 16-32 GB.
- Infrastructure: 64+ GB RAM, 16+ CPU cores, SSD storage recommended. Cloud analysis more cost-effective for 100+ sample projects. For labs without HPC, WGS and RNA-seq service providers typically include computational analysis.
Strandedness consideration: Most modern RNA-seq projects use stranded library preparation. Pipeline settings (e.g., featureCounts strandedness flag) must match the library type to avoid losing half the signal from incorrectly assigned reads.
Figure 6. RNA-seq common mistakes — problems, causes, and solutions
Caption: Common RNA-seq analysis mistakes including excessive DEGs, batch effects, low mapping rates, insufficient replicates, and pipeline irreproducibility.
Common RNA-Seq Analysis Mistakes and How to Avoid Them
| Problem Observed | Root Cause | Prevention |
|---|---|---|
| Excessive DEGs (>30% of genes) | Inappropriate normalization or FDR threshold | Pre-register analysis plan; check dispersion estimates |
| Batch effects dominate variation | Samples processed in non-randomized order | Randomize processing; include batch in design matrix |
| Low mapping rate (<60%) | Reference mismatch, rRNA contamination, degradation | Verify reference; check rRNA content in QC |
| No significant DEGs despite visible differences | Insufficient replicates, high within-group variability | Minimum 3 replicates; consider increasing sample size |
| Irreproducible results between pipelines | Different normalization or statistical assumptions | Use at least two methods and compare concordance |
FAQ
What is the best alignment tool for RNA-seq?
STAR is the most widely used splice-aware aligner for RNA-seq. It is significantly faster than TopHat2 and HISAT2 for most datasets and detects more novel splice junctions.
Should I use STAR or Salmon for RNA-seq quantification?
Choose STAR for novel splice junctions, fusion detection, or transcript assembly. Choose Salmon for rapid quantification of known transcripts in large projects. Many projects benefit from using both.
How many biological replicates do I need for RNA-seq?
Minimum three biological replicates per condition for reliable DE analysis. With fewer, FDR control is unreliable. For high biological variability, 4-6 replicates may be needed.
What is the difference between FPKM, RPKM, and TPM?
All normalize for depth and transcript length, but TPM is preferred. FPKM and RPKM introduce bias in between-sample comparisons. TPM normalizes length before library size for direct cross-sample comparability.
How do I handle batch effects in RNA-seq data?
Address at design stage by randomizing samples across batches. Include batch as covariate in DESeq2 or limma design matrix. Computational correction (RUVseq, sva, ComBat-seq) can reduce effects but cannot fix complete confounding.
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
References:
- From bench to bytes: a practical guide for RNA-seq data analysis. Frontiers in Genetics. 2025;16:1697922.
- A comprehensive workflow for optimizing RNA-seq data analysis. BMC Genomics. 2024;25:678.
- Performance assessment of DNA sequencing platforms in the ABRF study. Nature Biotechnology. 2021;39:1348-1365.
- Faster and more accurate differential transcript analysis with edgeR v4 and Salmon. NAR Genomics and Bioinformatics. 2024;6(4):lqae151.
- DESeq2 differential expression analysis. Bioconductor (current release).