DEG to Functional Insights: A Practical clusterProfiler Workflow (with R Script Skeleton)

Q: Do I have to run DESeq2 to use clusterProfiler?

No. You can use clusterProfiler with results from any DE method, as long as you have a gene list for ORA or a ranked gene vector for GSEA.

Q: Should I use ORA or GSEA for pathway enrichment?

Use ORA when you want enrichment on a selected DEG set. Use GSEA when you’d rather test enrichment across a ranked list without relying on a strict cutoff.

Q: Why does enrichment return no results?

Most often it’s an ID/organism mismatch (especially for KEGG) or the input list is too small. Checking ID conversion and switching to GSEA are common fixes.

Q: What should I use as the background gene set (universe) for ORA?

A practical default is the set of genes tested in your DE analysis (after filtering), not the entire genome.

Q: How many terms should I plot?

Usually 10–20 terms is enough for a clear figure; more tends to get repetitive and hard to read.

Quick Overview

01 What You'll Produce (Outputs) 02 Inputs & Setup (Data, IDs, Packages) 03 Step 1 — QC (Before You Trust DEG) 04 Step 2 — Normalization (For Differential Expression) 05 Step 3 — Differential Expression (Generate a DEG Table) 06 Step 4 — Enrichment with clusterProfiler (ORA & GSEA) 07 Step 5 — Visualization (Volcano / Bar / Dot) 08 Key Choices & Common Pitfalls (Checklist) 09 Download: R Script Skeleton 10 FAQs

A differential expression (DEG) table is a useful starting point—but it's rarely the end of the story. Most people run into the same next question: what do these gene-level changes mean at the level of pathways and biological functions? That's where functional enrichment analysis comes in.

This resource-style guide shows a full, repeatable workflow in R: QC → normalization → differential expression → enrichment (ORA and GSEA) → visualization, using clusterProfiler (a Bioconductor R package) for enrichment and enrichment-friendly plots. It's written so you can copy the structure into your own analysis notebook or adapt it into a small script. You'll also find a compact R script skeleton near the end.

Along the way, we'll focus on the practical details that tend to matter most in real projects: input formats, gene IDs, the background gene universe, how to choose between ORA vs GSEA, and how to make the three plots people actually use (volcano + bar + dot).

If you're still deciding how to define thresholds, interpret log2FC and FDR, or structure your DEG table for downstream analysis, you may find this differential expression overview helpful: Differential Gene Expression Analysis.

End-to-end steps from QC and normalization through differential expression, enrichment (ORA/GSEA), and visualization. Workflow overview of the clusterProfiler-based DEG-to-enrichment pipeline.

What You'll Produce (Outputs)

By the time you finish this workflow, you should have a small set of outputs that are easy to reuse and share:

A DEG table with at least:
- gene ID (your original ID and a mapped ID)
- log2FoldChange
- pvalue
- padj (FDR-adjusted p-value)
- optionally stat (useful for ranking in GSEA)
Enrichment result tables (CSV-friendly) for:
- GO enrichment (e.g., Biological Process)
- KEGG enrichment (optional, depending on organism and IDs)
- GSEA results (optional but recommended when you prefer ranking-based analysis)
Three core figures:
1. Volcano plot (DE overview)
2. Enrichment barplot (Top terms, quick scan)
3. Enrichment dot/bubble plot (compact summary of terms)
A reusable "script skeleton" that:
- reads inputs
- runs DE (or loads DEG results)
- runs enrichment
- exports tables and plots into predictable folders

If you're writing a methods-style blog post or internal SOP, these outputs map nicely to a simple folder structure:

tables/ (DEG + enrichment)
plots/ (volcano + bar + dot)
logs/ (optional session info, run notes)

Inputs & Setup (Data, IDs, Packages)

Two common starting points

Option A: Start from counts (recommended)

You have:

a raw count matrix (genes × samples, integer counts)
a sample metadata table (sample IDs + condition + optional covariates)

This is the cleanest route because you control the full chain.

Option B: Start from a DEG table (fastest)

You already have results from DESeq2/edgeR/limma:

gene ID
log2FC
p-value and/or FDR

This is fine for enrichment as long as the DEG table is well-formed and you can reconstruct a ranking for GSEA if you want it.

Gene IDs: decide early, save yourself later

In enrichment workflows, the single most common "why is this empty?" moment is an ID mismatch.

Typical IDs you'll see:

SYMBOL (e.g., TP53)
ENSEMBL (e.g., ENSG00000141510)
ENTREZID (e.g., 7157)

clusterProfiler can work with multiple types, but many GO/KEGG workflows are simplest if you convert to ENTREZID for enrichment. The practical approach is:

Keep your original gene identifier in the DEG table (for traceability).
Create a mapping table (original → ENTREZID).
Run enrichment using ENTREZID.
Export both the enrichment results and the mapping table.

That mapping table becomes your "glue" when you need to interpret hits or build gene lists for follow-up analysis.

Minimal packages you'll likely need

A stable, common stack looks like this:

Differential expression: DESeq2 (or edgeR/limma-voom)
Enrichment: clusterProfiler
Annotation: org.Hs.eg.db / org.Mm.eg.db / etc + AnnotationDbi
Visualization: ggplot2
Enrichment plots: enrichplot (works smoothly with clusterProfiler results)

Note: KEGG availability and behavior can vary by organism and by how IDs are handled. If you're working with a less common species, GO may be more straightforward than KEGG unless you have strong annotation support.

Step 1 — QC (Before You Trust DEG)

QC doesn't need to be complicated, but it should be intentional. The goal is simple: make sure your samples behave like your study design expects.

Three checks cover a lot of ground:

1) Sample clustering / correlation

Compute sample-to-sample correlation on a stabilized expression matrix (e.g., VST from DESeq2).
Look for:
- replicates clustering together
- unexpected pairs
- one sample that "refuses to join the group"

2) PCA (or MDS)

PCA is quick, visual, and usually informative.

Color points by condition, batch, donor, library prep date—whatever you have.
If an unwanted factor dominates the first components, don't ignore it. Often the fix is adding a covariate to the DE design formula rather than dropping samples.

Sample clustering pattern showing separation by condition and overall similarity among replicates. PCA-based sample QC visualization.

3) Library size / distribution sanity checks

Total counts per sample
Distribution of counts / number of detected genes

Small differences are normal. Big differences can be a sign of technical problems.

A practical note about outliers:

If you remove a sample, write down why in one sentence and keep a record. Even for internal work, your future self will be grateful.

Step 2 — Normalization (For Differential Expression)

Normalization is one of those words that means different things in different contexts.

For differential expression, most established methods expect raw counts and perform normalization as part of the model:

DESeq2 uses size factors
edgeR uses TMM
limma-voom uses precision weights (after transforming counts)

A frequent detour is using TPM/FPKM for DE. TPM is useful for certain kinds of comparisons, but for count-based DE it usually creates more questions than it answers. If your goal is reliable differential testing, stay in the count-based lane.

Filtering low-count genes

Some filtering is helpful:

remove genes that are essentially absent in all samples
use a rule independent of condition labels (e.g., "at least X counts in at least Y samples")

Filtering improves power and reduces noise, especially in small studies.

Step 3 — Differential Expression (Generate a DEG Table)

Your DEG table is the hinge point between gene-level statistics and enrichment. A "good" DEG table for downstream enrichment includes:

a stable gene ID column (gene, SYMBOL, or ENSEMBL)
log2FoldChange
pvalue
padj (FDR)
optional: a test statistic (stat) for ranking

Thresholds: keep them usable, not dogmatic

A common starting point is:

padj < 0.05
optionally |log2FC| >= 1

But many analyses benefit from being a bit more flexible:

For ORA, thresholds matter because you're selecting a subset of genes.
For GSEA, thresholds matter less because you use a ranked list of all genes.

If you know your signal is subtle (or your sample size is modest), consider leaning more on ranking-based GSEA rather than trying to "force" a big DEG list.

Split up and down?

For ORA, it's often informative to run enrichment separately for:

upregulated genes
downregulated genes

That keeps directionality clear. Otherwise you can end up with a mixed gene set where a term is "enriched" but the genes in it move in both directions.

Step 4 — Enrichment with clusterProfiler (ORA & GSEA)

This is the core of the guide. clusterProfiler supports multiple enrichment approaches, but two are especially common for RNA-seq interpretation:

ORA (Over-Representation Analysis): What's overrepresented among significant genes?
GSEA (Gene Set Enrichment Analysis): What pathways shift across the whole ranked list?

ORA: enrichGO / enrichKEGG

Inputs you need

A gene list (usually significant DEGs, optionally split into up/down)
A background universe (strongly recommended)

About the universe (background gene set)

If you don't set a universe, your enrichment is often compared against a broad default background from the annotation database. In many cases, a more practical and defensible universe is:

Universe = all genes that were tested in differential expression
(i.e., after filtering and model setup)

This aligns the enrichment background with what your experiment could realistically detect.

GO enrichment (recommended starting point)

GO has three common ontologies: BP, MF, CC
Biological Process (BP) is often the most intuitive for initial interpretation

KEGG enrichment (optional)

KEGG can be very useful, but it's also where ID and organism issues show up more often. If KEGG returns empty results:

confirm your IDs (ENTREZID is usually expected)
confirm organism support
try GO first to validate that the pipeline is working

GSEA: gseGO / gseKEGG

GSEA shifts the question. Instead of selecting a "significant gene subset," you rank all genes by a statistic and ask whether genes from a pathway tend to appear near the top or bottom.

What you need for GSEA

a named numeric vector:
- names are gene IDs (often ENTREZID)
- values are ranking statistics (higher = more associated with condition A, for example)
sorted in decreasing order

Ranking choices that are easy to explain

DESeq2 Wald statistic (stat) if available
signed -log10(pvalue) scaled by direction:
- sign(log2FC) * -log10(pvalue)

If you choose the signed log p-value approach, guard against zeros and NAs (e.g., clamp p-values to a minimum like 1e-300).

Step 5 — Visualization (Volcano / Bar / Dot)

This workflow intentionally focuses on three plots because they cover most needs without turning your write-up into a figure museum.

Volcano plot (DE overview)

A volcano plot shows:

magnitude (log2FC)
significance (often -log10(padj))

Good habits:

label only a handful of genes (top hits or known markers)
keep thresholds visible but not overwhelming

Volcano plot summarizing differential expression results.

Enrichment barplot (Top terms)

Use barplots to show a short list of top enriched terms. Keep it tight:

10–20 terms is usually enough
sort by p.adjust unless you have a clear reason not to

Enrichment dot/bubble plot (compact summary)

Dotplots are a great "one panel" summary:

color: significance (p.adjust)
size: Count or GeneRatio

If you publish dotplots, keep the legend readable and avoid showing too many terms.

Functional categories compared by enrichment magnitude and statistical significance using dot size and color. Dot plot summarizing enriched functional terms with size and color encoding.

Key Choices & Common Pitfalls (Checklist)

This section is meant to be a quick "before you export results" checklist.

1) ID mapping quality

How many genes mapped successfully?
Are there duplicates (many-to-one mapping)?
Did you drop a large fraction unintentionally?

Tip: export a mapping table and keep it with your results. It prevents confusion later.

2) Background universe in ORA

If ORA results look odd (too broad, too empty, too generic), revisit the universe.

A practical default:

universe = genes tested in DE

3) ORA vs GSEA choice

ORA is sensitive to thresholds.
GSEA is sensitive to ranking definition.

If you're not happy with ORA results, try GSEA before rewriting your thresholds five times.

4) Too many similar GO terms

GO is hierarchical and redundant by nature. If your top results look like repeats:

reduce the number of displayed categories
consider redundancy reduction (e.g., semantic similarity-based simplification)
group results by theme rather than listing 40 near-duplicates

5) "Empty KEGG results"

Common reasons:

wrong ID type
organism support issues
too few genes (ORA)
Try:
confirming ENTREZID mapping
running GO enrichment first
using GSEA with a ranked list

6) Plot readability

A plot that's technically correct can still be hard to read.

limit terms
use consistent sorting
keep legends and labels short
write captions that explain what size/color represent

Download: R Script Skeleton

Below is a compact script skeleton that you can adapt. It assumes:

counts matrix + metadata
DESeq2 for DE
clusterProfiler for GO enrichment (ORA) and optional GSEA
exports tables + plots

It's intentionally minimal so you can drop it into a blog resource download and let readers modify a few lines.

suppressPackageStartupMessages({

library(DESeq2)

library(clusterProfiler)

library(enrichplot)

library(AnnotationDbi)

library(org.Hs.eg.db) # swap for your organism

library(ggplot2)

library(dplyr)

})

# ---- Inputs (edit) ----

counts_file <- "counts.csv" # genes x samples, raw counts

meta_file <- "meta.csv" # sample, condition (and optional batch)

id_type_in <- "SYMBOL" # SYMBOL or ENSEMBL

cond_a <- "treated"

cond_b <- "control"

dir.create("tables", showWarnings = FALSE)

dir.create("plots", showWarnings = FALSE)

# ---- Load ----

counts <- read.csv(counts_file, row.names = 1, check.names = FALSE)

meta <- read.csv(meta_file, stringsAsFactors = FALSE)

counts <- counts[, meta$sample]

# ---- DE ----

dds <- DESeqDataSetFromMatrix(

countData = round(as.matrix(counts)),

colData = data.frame(meta, row.names = meta$sample),

design = ~ condition

)

# Simple filtering

dds <- dds[rowSums(counts(dds) >= 10) >= 2, ]

dds <- DESeq(dds)

res <- results(dds, contrast = c("condition", cond_a, cond_b))

res <- as.data.frame(res) %>%

mutate(gene = rownames(.)) %>%

filter(!is.na(padj))

write.csv(res, "tables/DEG_results.csv", row.names = FALSE)

# ---- Volcano ----

vol <- res %>% mutate(sig = padj < 0.05 & abs(log2FoldChange) >= 1)

p_vol <- ggplot(vol, aes(log2FoldChange, -log10(padj))) +

geom_point(aes(alpha = sig), size = 1) +

scale_alpha_manual(values = c(`TRUE` = 0.8, `FALSE` = 0.2), guide = "none") +

theme_bw() +

labs(x = "log2 Fold Change", y = "-log10(FDR)", title = "Volcano plot")

ggsave("plots/volcano.pdf", p_vol, width = 6.5, height = 5)

# ---- ID mapping ----

mapped <- bitr(res$gene,

fromType = id_type_in,

toType = "ENTREZID",

OrgDb = org.Hs.eg.db)

res_entrez <- res %>%

inner_join(mapped, by = c("gene" = id_type_in)) %>%

distinct(ENTREZID, .keep_all = TRUE)

write.csv(res_entrez, "tables/DEG_with_ENTREZID.csv", row.names = FALSE)

write.csv(mapped, "tables/ID_mapping.csv", row.names = FALSE)

# ---- ORA (GO BP) ----

deg_up <- res_entrez %>% filter(padj < 0.05, log2FoldChange >= 1) %>% pull(ENTREZID)

deg_dn <- res_entrez %>% filter(padj < 0.05, log2FoldChange <= -1) %>% pull(ENTREZID)

universe <- res_entrez$ENTREZID

ego_up <- enrichGO(gene = deg_up, universe = universe,

OrgDb = org.Hs.eg.db, keyType = "ENTREZID",

ont = "BP", pAdjustMethod = "BH", qvalueCutoff = 0.05)

ego_dn <- enrichGO(gene = deg_dn, universe = universe,

OrgDb = org.Hs.eg.db, keyType = "ENTREZID",

ont = "BP", pAdjustMethod = "BH", qvalueCutoff = 0.05)

write.csv(as.data.frame(ego_up), "tables/GO_ORA_up.csv", row.names = FALSE)

write.csv(as.data.frame(ego_dn), "tables/GO_ORA_down.csv", row.names = FALSE)

# ---- Plots: bar + dot ----

ggsave("plots/GO_ORA_up_bar.pdf", barplot(ego_up, showCategory = 15), width = 7, height = 5)

ggsave("plots/GO_ORA_up_dot.pdf", dotplot(ego_up, showCategory = 15), width = 7, height = 5)

# ---- Optional: GSEA ----

rank_vec <- res_entrez$stat

if (all(is.na(rank_vec))) {

p <- pmax(res_entrez$pvalue, 1e-300)

rank_vec <- sign(res_entrez$log2FoldChange) * -log10(p)

}

names(rank_vec) <- res_entrez$ENTREZID

rank_vec <- sort(rank_vec, decreasing = TRUE)

gse_bp <- gseGO(geneList = rank_vec,

OrgDb = org.Hs.eg.db, keyType = "ENTREZID",

ont = "BP", pAdjustMethod = "BH", verbose = FALSE)

write.csv(as.data.frame(gse_bp), "tables/GO_GSEA_BP.csv", row.names = FALSE)

ggsave("plots/GO_GSEA_dot.pdf", dotplot(gse_bp, showCategory = 15), width = 7, height = 5)

How to turn this into a downloadable resource

Put all editable inputs at the top (paths, organism DB, ID type, conditions).
Always export:
- DEG table
- ID mapping table
- enrichment tables
- plots
Keep folder names consistent so the output is predictable.

FAQs

Do I have to run DESeq2 to use clusterProfiler?

No. You can use clusterProfiler with results from any DE method, as long as you have a gene list for ORA or a ranked gene vector for GSEA.

Should I use ORA or GSEA for pathway enrichment?

Use ORA when you want enrichment on a selected DEG set. Use GSEA when you'd rather test enrichment across a ranked list without relying on a strict cutoff.

Why does enrichment return no results?

Most often it's an ID/organism mismatch (especially for KEGG) or the input list is too small. Checking ID conversion and switching to GSEA are common fixes.

What should I use as the background gene set (universe) for ORA?

A practical default is the set of genes tested in your DE analysis (after filtering), not the entire genome.

How many terms should I plot?

Usually 10–20 terms is enough for a clear figure; more tends to get repetitive and hard to read.

References:

Yu, Guangchuang, et al. "clusterProfiler: An R Package for Comparing Biological Themes among Gene Clusters." OMICS: A Journal of Integrative Biology, vol. 16, no. 5, 2012, pp. 284–287.
Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2." Genome Biology, vol. 15, 2014, p. 550.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data" Bioinformatics, vol. 26, no. 1, 2010, pp. 139–140.
Law, Charity W., et al. "voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts." Genome Biology, vol. 15, 2014, p. R29.
The Gene Ontology Consortium. "The Gene Ontology Resource: Enriching a GOld Mine." Nucleic Acids Research, vol. 49, no. D1, 2021, pp. D325–D334.
Kanehisa, Minoru, and Susumu Goto. "KEGG: Kyoto Encyclopedia of Genes and Genomes." Nucleic Acids Research, vol. 28, no. 1, 2000, pp. 27–30.
Subramanian, Aravind, et al. "Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles." Proceedings of the National Academy of Sciences, 2005.
Benjamini, Yoav, and Yosef Hochberg. "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, 1995, pp. 289–300.

For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.

Related Services