A differential expression (DEG) table is a useful starting point—but it's rarely the end of the story. Most people run into the same next question: what do these gene-level changes mean at the level of pathways and biological functions? That's where functional enrichment analysis comes in.
This resource-style guide shows a full, repeatable workflow in R: QC → normalization → differential expression → enrichment (ORA and GSEA) → visualization, using clusterProfiler (a Bioconductor R package) for enrichment and enrichment-friendly plots. It's written so you can copy the structure into your own analysis notebook or adapt it into a small script. You'll also find a compact R script skeleton near the end.
Along the way, we'll focus on the practical details that tend to matter most in real projects: input formats, gene IDs, the background gene universe, how to choose between ORA vs GSEA, and how to make the three plots people actually use (volcano + bar + dot).
If you're still deciding how to define thresholds, interpret log2FC and FDR, or structure your DEG table for downstream analysis, you may find this differential expression overview helpful: Differential Gene Expression Analysis.
Workflow overview of the clusterProfiler-based DEG-to-enrichment pipeline.
By the time you finish this workflow, you should have a small set of outputs that are easy to reuse and share:
If you're writing a methods-style blog post or internal SOP, these outputs map nicely to a simple folder structure:
Option A: Start from counts (recommended)
You have:
This is the cleanest route because you control the full chain.
Option B: Start from a DEG table (fastest)
You already have results from DESeq2/edgeR/limma:
This is fine for enrichment as long as the DEG table is well-formed and you can reconstruct a ranking for GSEA if you want it.
In enrichment workflows, the single most common "why is this empty?" moment is an ID mismatch.
Typical IDs you'll see:
clusterProfiler can work with multiple types, but many GO/KEGG workflows are simplest if you convert to ENTREZID for enrichment. The practical approach is:
That mapping table becomes your "glue" when you need to interpret hits or build gene lists for follow-up analysis.
A stable, common stack looks like this:
Note: KEGG availability and behavior can vary by organism and by how IDs are handled. If you're working with a less common species, GO may be more straightforward than KEGG unless you have strong annotation support.
QC doesn't need to be complicated, but it should be intentional. The goal is simple: make sure your samples behave like your study design expects.
Three checks cover a lot of ground:
PCA is quick, visual, and usually informative.
PCA-based sample QC visualization.
Small differences are normal. Big differences can be a sign of technical problems.
A practical note about outliers:
If you remove a sample, write down why in one sentence and keep a record. Even for internal work, your future self will be grateful.
Normalization is one of those words that means different things in different contexts.
For differential expression, most established methods expect raw counts and perform normalization as part of the model:
A frequent detour is using TPM/FPKM for DE. TPM is useful for certain kinds of comparisons, but for count-based DE it usually creates more questions than it answers. If your goal is reliable differential testing, stay in the count-based lane.
Some filtering is helpful:
Filtering improves power and reduces noise, especially in small studies.
Your DEG table is the hinge point between gene-level statistics and enrichment. A "good" DEG table for downstream enrichment includes:
A common starting point is:
But many analyses benefit from being a bit more flexible:
If you know your signal is subtle (or your sample size is modest), consider leaning more on ranking-based GSEA rather than trying to "force" a big DEG list.
For ORA, it's often informative to run enrichment separately for:
That keeps directionality clear. Otherwise you can end up with a mixed gene set where a term is "enriched" but the genes in it move in both directions.
This is the core of the guide. clusterProfiler supports multiple enrichment approaches, but two are especially common for RNA-seq interpretation:
Inputs you need
About the universe (background gene set)
If you don't set a universe, your enrichment is often compared against a broad default background from the annotation database. In many cases, a more practical and defensible universe is:
This aligns the enrichment background with what your experiment could realistically detect.
GO enrichment (recommended starting point)
KEGG enrichment (optional)
KEGG can be very useful, but it's also where ID and organism issues show up more often. If KEGG returns empty results:
GSEA shifts the question. Instead of selecting a "significant gene subset," you rank all genes by a statistic and ask whether genes from a pathway tend to appear near the top or bottom.
What you need for GSEA
Ranking choices that are easy to explain
If you choose the signed log p-value approach, guard against zeros and NAs (e.g., clamp p-values to a minimum like 1e-300).
This workflow intentionally focuses on three plots because they cover most needs without turning your write-up into a figure museum.
A volcano plot shows:
Good habits:
Volcano plot summarizing differential expression results.
Use barplots to show a short list of top enriched terms. Keep it tight:
Dotplots are a great "one panel" summary:
If you publish dotplots, keep the legend readable and avoid showing too many terms.
Dot plot summarizing enriched functional terms with size and color encoding.
This section is meant to be a quick "before you export results" checklist.
Tip: export a mapping table and keep it with your results. It prevents confusion later.
If ORA results look odd (too broad, too empty, too generic), revisit the universe.
A practical default:
If you're not happy with ORA results, try GSEA before rewriting your thresholds five times.
GO is hierarchical and redundant by nature. If your top results look like repeats:
Common reasons:
A plot that's technically correct can still be hard to read.
Below is a compact script skeleton that you can adapt. It assumes:
It's intentionally minimal so you can drop it into a blog resource download and let readers modify a few lines.
suppressPackageStartupMessages({
library(DESeq2)
library(clusterProfiler)
library(enrichplot)
library(AnnotationDbi)
library(org.Hs.eg.db) # swap for your organism
library(ggplot2)
library(dplyr)
})
# ---- Inputs (edit) ----
counts_file <- "counts.csv" # genes x samples, raw counts
meta_file <- "meta.csv" # sample, condition (and optional batch)
id_type_in <- "SYMBOL" # SYMBOL or ENSEMBL
cond_a <- "treated"
cond_b <- "control"
dir.create("tables", showWarnings = FALSE)
dir.create("plots", showWarnings = FALSE)
# ---- Load ----
counts <- read.csv(counts_file, row.names = 1, check.names = FALSE)
meta <- read.csv(meta_file, stringsAsFactors = FALSE)
counts <- counts[, meta$sample]
# ---- DE ----
dds <- DESeqDataSetFromMatrix(
countData = round(as.matrix(counts)),
colData = data.frame(meta, row.names = meta$sample),
design = ~ condition
)
# Simple filtering
dds <- dds[rowSums(counts(dds) >= 10) >= 2, ]
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", cond_a, cond_b))
res <- as.data.frame(res) %>%
mutate(gene = rownames(.)) %>%
filter(!is.na(padj))
write.csv(res, "tables/DEG_results.csv", row.names = FALSE)
# ---- Volcano ----
vol <- res %>% mutate(sig = padj < 0.05 & abs(log2FoldChange) >= 1)
p_vol <- ggplot(vol, aes(log2FoldChange, -log10(padj))) +
geom_point(aes(alpha = sig), size = 1) +
scale_alpha_manual(values = c(`TRUE` = 0.8, `FALSE` = 0.2), guide = "none") +
theme_bw() +
labs(x = "log2 Fold Change", y = "-log10(FDR)", title = "Volcano plot")
ggsave("plots/volcano.pdf", p_vol, width = 6.5, height = 5)
# ---- ID mapping ----
mapped <- bitr(res$gene,
fromType = id_type_in,
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)
res_entrez <- res %>%
inner_join(mapped, by = c("gene" = id_type_in)) %>%
distinct(ENTREZID, .keep_all = TRUE)
write.csv(res_entrez, "tables/DEG_with_ENTREZID.csv", row.names = FALSE)
write.csv(mapped, "tables/ID_mapping.csv", row.names = FALSE)
# ---- ORA (GO BP) ----
deg_up <- res_entrez %>% filter(padj < 0.05, log2FoldChange >= 1) %>% pull(ENTREZID)
deg_dn <- res_entrez %>% filter(padj < 0.05, log2FoldChange <= -1) %>% pull(ENTREZID)
universe <- res_entrez$ENTREZID
ego_up <- enrichGO(gene = deg_up, universe = universe,
OrgDb = org.Hs.eg.db, keyType = "ENTREZID",
ont = "BP", pAdjustMethod = "BH", qvalueCutoff = 0.05)
ego_dn <- enrichGO(gene = deg_dn, universe = universe,
OrgDb = org.Hs.eg.db, keyType = "ENTREZID",
ont = "BP", pAdjustMethod = "BH", qvalueCutoff = 0.05)
write.csv(as.data.frame(ego_up), "tables/GO_ORA_up.csv", row.names = FALSE)
write.csv(as.data.frame(ego_dn), "tables/GO_ORA_down.csv", row.names = FALSE)
# ---- Plots: bar + dot ----
ggsave("plots/GO_ORA_up_bar.pdf", barplot(ego_up, showCategory = 15), width = 7, height = 5)
ggsave("plots/GO_ORA_up_dot.pdf", dotplot(ego_up, showCategory = 15), width = 7, height = 5)
# ---- Optional: GSEA ----
rank_vec <- res_entrez$stat
if (all(is.na(rank_vec))) {
p <- pmax(res_entrez$pvalue, 1e-300)
rank_vec <- sign(res_entrez$log2FoldChange) * -log10(p)
}
names(rank_vec) <- res_entrez$ENTREZID
rank_vec <- sort(rank_vec, decreasing = TRUE)
gse_bp <- gseGO(geneList = rank_vec,
OrgDb = org.Hs.eg.db, keyType = "ENTREZID",
ont = "BP", pAdjustMethod = "BH", verbose = FALSE)
write.csv(as.data.frame(gse_bp), "tables/GO_GSEA_BP.csv", row.names = FALSE)
ggsave("plots/GO_GSEA_dot.pdf", dotplot(gse_bp, showCategory = 15), width = 7, height = 5)
How to turn this into a downloadable resource
Do I have to run DESeq2 to use clusterProfiler?
No. You can use clusterProfiler with results from any DE method, as long as you have a gene list for ORA or a ranked gene vector for GSEA.
Should I use ORA or GSEA for pathway enrichment?
Use ORA when you want enrichment on a selected DEG set. Use GSEA when you'd rather test enrichment across a ranked list without relying on a strict cutoff.
Why does enrichment return no results?
Most often it's an ID/organism mismatch (especially for KEGG) or the input list is too small. Checking ID conversion and switching to GSEA are common fixes.
What should I use as the background gene set (universe) for ORA?
A practical default is the set of genes tested in your DE analysis (after filtering), not the entire genome.
How many terms should I plot?
Usually 10–20 terms is enough for a clear figure; more tends to get repetitive and hard to read.
References: