Generating a comprehensive expression profile is critical when studying normal biology and disease processes. The transcriptome is the complete set of transcripts in a cell or a population of cells, and transcriptome analysis reveals the identity and quantity of all RNA molecules. Comparing transcriptomes across developmental stages, between disease states and normal cells, or between specific experimental stimuli and physiologic conditions is an essential application of RNA-seq. This sort of analysis requires identifying genes and their isoforms, as well as precisely estimating their abundance, when comparing two or more samples. It is crucial for deciphering the genome's functional elements and determining the molecular makeup of cells and tissues, which can lead to new insights into the biological mechanisms of development and disease. Cuffdiff, DESeq, DESeq2, edgeR, PoissonSeq, limma-voom, and MISO are among the most widely used tools for differential gene and isoform expression analysis.
After the RNA-seq reads have been preprocessed, differential gene expression (DGE) analysis is used to determine how transcript levels differ between samples. Since the microarray era, numerous statistical techniques have been established, and for RNA-seq they use read coverage to estimate transcript abundance. The RPKM (reads per kilobase per million mapped reads) measure is widely used to express abundance: it normalizes read counts by the total number of mapped reads and by gene length. However, in addition to read coverage, other factors such as sequencing depth, gene length, and isoform abundance influence the estimated transcript abundance. RPKM has been criticized because it treats all RNA-seq reads almost equally, for example without regard for isoforms. RNA-Seq by Expectation-Maximization (RSEM) is a software tool that estimates gene and isoform expression levels and can also be used for species without a reference genome assembly.
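As a minimal sketch of the RPKM normalization described above, the following Python snippet divides each gene's read count by its length in kilobases and by the number of mapped reads in millions. The function name `rpkm`, the toy counts, and the gene lengths are hypothetical and chosen only for illustration; the total number of mapped reads is assumed here to be the sum of the counts.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase per million mapped reads for one sample.

    Assumption: the total number of mapped reads equals the sum of `counts`.
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    total_mapped = counts.sum()
    # counts / (length / 1e3) / (total / 1e6) == counts * 1e9 / (length * total)
    return counts * 1e9 / (lengths * total_mapped)

# Hypothetical toy data: three genes in one sample.
counts = [500, 1000, 8500]
lengths = [1000, 2000, 4000]  # base pairs
values = rpkm(counts, lengths)
print(values)
```

The first two genes receive the same RPKM even though the second has twice as many reads, because it is also twice as long. The calculation works at the gene level only, which is why it cannot distinguish between isoforms.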
Figure 1. RNA-seq analysis workflow for gene expression. (Corchete, 2020)
Many early differential gene expression analysis algorithms used simple count-based probability distributions followed by Fisher's exact test, without taking into account biological variability between samples. While RNA-seq data has very low technical variability compared with microarray data, biological variability remains and can be estimated more reliably by evaluating multiple replicates, for example with permutation-derived methods. To assess biological variability, approaches developed for serial analysis of gene expression (SAGE) have been applied, in which larger-scale datasets are used to estimate an additional dispersion parameter based on an extended (overdispersed) Poisson distribution, enabling more extensive molecular characterization.
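The role of the additional dispersion parameter can be illustrated with a small Python sketch. Under a pure Poisson model the variance of a gene's counts equals its mean; an overdispersed model adds a term, here assumed to take the common form variance = mean + dispersion × mean². The snippet uses a simple method-of-moments estimate, which is an illustrative simplification and not the estimator used by any particular package. The function name and the replicate counts are hypothetical, and equal library sizes across replicates are assumed.

```python
import numpy as np

def moment_dispersion(replicate_counts):
    """Method-of-moments dispersion for one gene across replicates.

    Assumes variance = mean + dispersion * mean**2 and equal library sizes.
    Returns 0.0 when the counts are no more variable than Poisson.
    """
    x = np.asarray(replicate_counts, dtype=float)
    mu = x.mean()
    var = x.var(ddof=1)  # unbiased sample variance
    if mu == 0:
        return 0.0
    return max((var - mu) / mu**2, 0.0)

# Hypothetical counts for one gene in four replicates.
technical_like = [98, 102, 100, 100]   # variance below the mean
biological_like = [60, 140, 90, 110]   # variance well above the mean

print(moment_dispersion(technical_like))
print(moment_dispersion(biological_like))
```

A dispersion of zero corresponds to the Poisson case, while a positive value indicates variability between replicates beyond what counting noise alone would produce.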
However, a large number of replicates may be too expensive for most applications, so many established techniques have overcome the problem by modeling biological variability and measuring significance with a small number of samples, using pairwise or multiple-group comparisons. Several programs provide well-designed solutions for this purpose, and they have been used in numerous biomedical and clinical studies. Cuffdiff from the Cufflinks package, DESeq, DESeq2, and edgeR are examples of these programs. Because RNA-seq read counts are highly skewed integers ranging from zero to millions, a variety of transformation algorithms have been used to fit the counts to statistical distribution models for differential expression detection. Approaches developed for microarray data analysis, which are based on continuous distributions, have also been adapted to RNA-seq counts. The voom function in the limma package is a good example: it transforms count data to the log scale and models the mean-variance relationship so that statistical significance can be tested with methods that assume approximately Gaussian data. A comprehensive comparison of the performance of several DGE packages was recently published. However, to our knowledge, there is no one-size-fits-all strategy.
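To make the idea of transforming skewed counts more concrete, the following Python sketch computes log2 counts per million with small offsets, in the spirit of the first step of voom. It is only an illustration: it omits the mean-variance modeling and precision weights that voom also computes, and it is not a substitute for the limma package. The function name `log2_cpm` and the toy matrix are hypothetical.

```python
import numpy as np

def log2_cpm(count_matrix):
    """Log2 counts per million for a genes x samples count matrix.

    The offsets (0.5 on counts, 1 on library size) avoid taking log of zero.
    """
    counts = np.asarray(count_matrix, dtype=float)
    lib_sizes = counts.sum(axis=0)  # total reads per sample (column)
    return np.log2((counts + 0.5) / (lib_sizes + 1.0) * 1e6)

# Hypothetical toy matrix: rows are genes, columns are samples.
toy_counts = np.array([
    [0, 5],
    [999999, 1999994],
])
transformed = log2_cpm(toy_counts)
print(transformed)
```

After the transformation, zero counts map to finite values and the range of the data is compressed, which brings the values closer to what methods built for continuous data expect.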