Pan-genome Pipeline Deep Dive: From Annotation Harmonization to Orthology

A reliable pan-genome pipeline begins with consistent annotation harmonization and ends with defensible orthology clustering. The goal is simple: turn mixed assemblies into a clean presence–absence matrix without inflating accessory calls or losing true core genes. This guide walks through practical choices for bacteria and eukaryotes, explains why parameters matter, and shows how to automate QC. We reference common toolchains (RAST/Prokka for gene finding, BLAST or MMseqs2 for similarity search, and OrthoMCL or OrthoFinder for orthogrouping) while keeping the framework vendor-neutral and research-use focused.

PPanGGOLiN workflow on a four-genome toy dataset, showing how annotations are integrated into a partitioned pangenome graph. (Gautreau G. et al. (2020) PLOS Computational Biology)

Scope and Data Model

Before touching code, define the entities your pipeline will track. You need a crisp mapping between genomes, genes, proteins, and orthogroups. That mapping drives every downstream statistic.

Inputs

Assemblies in FASTA (with clear version tags).
Annotations in GFF3 with corresponding CDS and protein FASTA (FAA).
Sample metadata, including species/strain, source, and sequencing method.

Outputs

Orthogroups with member lists and basic alignment statistics.
A pan-matrix that captures presence/absence and copy number per sample.
Core, soft-core, shell, and cloud partitions with thresholds visible in a report.
QC summaries for assemblies, annotations, and clustering decisions.

Keep identifiers stable. A single change in a sample ID can break all joins. Use a manifest that ties each file to a checksum and build date.
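
As a minimal sketch of such a manifest, the snippet below records a SHA-256 checksum and a build date for each file of each sample. The role names ("assembly", "gff3", "faa") and the record layout are illustrative assumptions, not a fixed schema.

```python
import hashlib
from datetime import date


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(sample_files, build_date=None):
    """Tie every file to its sample ID, checksum, and build date.

    sample_files: {sample_id: {role: path}}, where the role names
    are hypothetical labels chosen for this example.
    """
    build_date = build_date or date.today().isoformat()
    manifest = []
    for sample_id in sorted(sample_files):
        for role, path in sorted(sample_files[sample_id].items()):
            manifest.append({
                "sample_id": sample_id,
                "role": role,
                "path": str(path),
                "sha256": sha256sum(path),
                "build_date": build_date,
            })
    return manifest
```

Writing the returned list to a TSV or JSON file next to the inputs gives later steps something to verify against before any join is attempted.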

Pre-processing and Baseline QC

Garbage in, garbage out applies strongly to pangenomes. Start by gatekeeping assemblies and normalising names.

Assembly and contamination checks

Compute N50/L50 for contiguity and monitor total assembly length.
Use BUSCO for eukaryotic completeness or CheckM for bacteria.
Screen for contamination and misjoins using coverage profiles and taxonomy-aware tools.
Flag duplicated scaffolds and near-identical contigs that may inflate gene counts.
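
For reference, the contiguity metrics in the first check can be computed from contig lengths alone. The function below is a small sketch using the usual definitions: N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half the assembly size, and L50 is the number of contigs needed to get there.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            # 'length' is the shortest contig in the covering set.
            return length, count
```

For contigs of 80, 70, 50, 40, 30, 20, and 10 kb (300 kb in total), the two longest contigs already reach 150 kb, so N50 is 70 kb and L50 is 2.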

Naming and versioning

Enforce sample IDs that never change across runs.
Version every assembly with a build string, e.g., YYYYMMDD.platform.caller.
Store mapping tables for old→new IDs so historical results remain interpretable.
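
A short sketch of how these rules might be enforced in code follows. The regular expression assumes the YYYYMMDD.platform.caller layout given above, with platform and caller restricted to letters, digits, underscores, and hyphens; that character set and both function names are assumptions made for the example.

```python
import re
from datetime import datetime

# Assumed layout: YYYYMMDD.platform.caller
BUILD_RE = re.compile(r"^(\d{8})\.([A-Za-z0-9_-]+)\.([A-Za-z0-9_-]+)$")


def parse_build(build):
    """Split a build string into its parts, rejecting malformed values."""
    match = BUILD_RE.match(build)
    if not match:
        raise ValueError(f"build string does not match YYYYMMDD.platform.caller: {build!r}")
    stamp, platform, caller = match.groups()
    datetime.strptime(stamp, "%Y%m%d")  # raises ValueError on impossible dates
    return {"date": stamp, "platform": platform, "caller": caller}


def resolve_id(sample_id, id_map):
    """Follow old -> new links in id_map until a current ID is reached."""
    seen = set()
    while sample_id in id_map:
        if sample_id in seen:
            raise ValueError(f"cycle in ID mapping at {sample_id!r}")
        seen.add(sample_id)
        sample_id = id_map[sample_id]
    return sample_id
```

Resolving IDs through the mapping table at load time means older result files can still be joined against the current manifest.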

These steps feel administrative. They save weeks later when results do not reconcile across cohorts.

Annotation Harmonization

The same gene can look different if each lab uses a different model. Annotation harmonization removes that variability.

Structural calls

Re-annotate with a single engine where feasible. Prokka and RAST are common for microbes; pick one and freeze parameters.
Watch gene splits and merges around low-complexity or repeat regions.
Convert partial ORFs into clear states (e.g., partial=5', partial=3') rather than discarding them.
For eukaryotes, standardise splice-aware prediction and supply the same evidence tracks across samples.

Functional layers

Map functions to stable namespaces: GO, KEGG, COG or eggNOG. Avoid mixing unversioned local terms with public vocabularies.
Maintain a cross-walk that reconciles synonyms. If "gyrase subunit A" appears as "GyrA," normalise to one label in reports.
Keep evidence codes. A function inferred by HMM profile deserves different confidence than one copied from a neighbour.

File hygiene

Validate that GFF3 features align with CDS coordinates. Check that every CDS has a corresponding protein in the FAA file.
Ensure locus tags are unique per sample and stable across re-runs.
Write small integrity tests that fail the pipeline when offsets drift.
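
One possible shape for such an integrity test is sketched below. It assumes prokaryote-style annotations in which each CDS is a single GFF3 line and the protein FASTA header starts with the same identifier as the CDS `ID` attribute; multi-exon eukaryotic CDS features, which legitimately share an ID across lines, would need a different uniqueness rule.

```python
from collections import Counter


def parse_attributes(field):
    """Parse a GFF3 column-9 string (key=value;key=value) into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return dict(pairs)


def check_annotation(gff3_lines, faa_lines):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    cds_ids = []
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more feature lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            problems.append(f"malformed feature line: {line.strip()!r}")
            continue
        if cols[2] != "CDS":
            continue
        cds_id = parse_attributes(cols[8]).get("ID")
        if cds_id is None:
            problems.append(f"CDS without ID at {cols[0]}:{cols[3]}-{cols[4]}")
            continue
        if int(cols[3]) > int(cols[4]):
            problems.append(f"{cds_id}: start is greater than end")
        cds_ids.append(cds_id)

    protein_ids = set()
    for line in faa_lines:
        if line.startswith(">") and line[1:].strip():
            protein_ids.add(line[1:].split()[0])

    for cds_id, count in Counter(cds_ids).items():
        if count > 1:
            problems.append(f"{cds_id}: ID used by {count} CDS features")
    for cds_id in sorted(set(cds_ids) - protein_ids):
        problems.append(f"{cds_id}: no protein in FAA")
    for protein_id in sorted(protein_ids - set(cds_ids)):
        problems.append(f"{protein_id}: protein without CDS in GFF3")
    return problems
```

A pipeline step can call this per sample and exit non-zero when the returned list is not empty, which turns silent drift into an explicit failure.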

When harmonisation is done well, clustering reflects biology instead of tool behaviour—this is essential to a robust pan-genome.

Panaroo constructs a comprehensive pangenome graph with nodes as orthologous gene clusters and edges representing genomic adjacency; this framework corrects annotation errors (e.g., split genes, contamination) and refines ortholog/paralog assignments. (Tonkin-Hill G. et al. (2020) Genome Biology)

Similarity Search and Orthology Clustering

Orthology inference is the backbone of a pan-genome pipeline. Choose the similarity engine and clustering method to match dataset size and divergence.

Overview of the OrthoFinder pipeline, from orthogroup inference through gene and species tree estimation to duplication-aware orthology calls. (Emms D.M. & Kelly S. (2019) Genome Biology)

Similarity engines

BLAST is sensitive and mature. It is slower, but a dependable choice for small to mid-size sets or deeply diverged sequences.
MMseqs2 offers excellent speed and competitive sensitivity. It is ideal for thousands of proteomes.
Use both coverage and identity filters. A common starting point is reciprocal coverage ≥70% with identity tuned to clade divergence.
Keep an eye on composition bias. Low-complexity masking can reduce false hits without killing signal.
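
The combined coverage and identity filter can be sketched as follows. The field names mirror columns that tabular search output can provide (identity, alignment start and end, and sequence length for query and subject); the 40% identity default is only a placeholder assumption, since the text leaves identity to be tuned per clade.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One protein-protein hit; coordinates are 1-based and inclusive."""
    query: str
    subject: str
    pident: float  # percent identity, 0-100
    qstart: int
    qend: int
    qlen: int
    sstart: int
    send: int
    slen: int


def passes_filters(hit, min_cov=0.70, min_ident=40.0):
    """Keep a hit only if BOTH sequences are covered and identity is high enough."""
    qcov = (hit.qend - hit.qstart + 1) / hit.qlen
    scov = (hit.send - hit.sstart + 1) / hit.slen
    return qcov >= min_cov and scov >= min_cov and hit.pident >= min_ident
```

Requiring coverage on both sides is what stops a short fragment or a single shared domain from linking two otherwise unrelated proteins.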

Orthogrouping

OrthoMCL applies Markov clustering (MCL) to a similarity graph. The inflation parameter controls cluster granularity, with higher values giving smaller, tighter clusters; start between 1.5 and 3.0.
OrthoFinder integrates gene trees and is resilient to uneven rates and recent duplications. It scales well and produces useful diagnostics.
Avoid pure single-linkage. It tends to chain distant homologs into bloated families.
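
To make the chaining problem concrete, the toy function below implements plain single-linkage clustering as connected components of the hit graph, using a small union-find. It is an illustration of the failure mode only, not a stand-in for OrthoMCL or OrthoFinder.

```python
def single_linkage(edges):
    """Cluster genes by connected components of a pairwise-hit graph."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# geneA and geneD share no direct hit, yet they end up together
# because each link in the chain A-B, B-C, C-D passes the filter.
chained = single_linkage([("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneD")])
```

Here `chained` contains a single family of four genes. Methods that weigh the whole neighbourhood of each gene, such as Markov clustering, are much less prone to this kind of merge.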

Post-clustering refinement

Split suspect clusters where tree splits match taxonomic boundaries and sequence identity drops.
Merge obvious fragments caused by annotation splits or domain boundary errors.
For recent paralogs, keep them in the same orthogroup but track copy number per sample. That preserves biologically meaningful expansions.

Document every parameter in a manifest. Orthology clustering should be reproducible and explainable to reviewers and collaborators.

Pan-matrix Construction and Core/Accessory Calling

Once orthogroups exist, translate them into matrices that analysis teams can use.

Definitions that don't backfire

Core: Present in 100% of samples.
Soft core: Present in most samples; choose a transparent threshold (e.g., ≥95%).
Shell: Intermediate frequency genes.
Cloud: Rare genes, often mobile or sample-specific.
Make presence calls copy-number aware. A gene with three copies counts as present, but keep the integer count so CNV signals survive filtering.
Acknowledge sample-size effects. As you add more genomes, the apparent core shrinks. Include rarefaction curves to contextualise thresholds.
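
The definitions above translate into a few lines of code. In this sketch, core means present in every sample and soft core uses the ≥95% example from the text; the 15% boundary between shell and cloud is an assumption made for illustration, because the text does not fix one.

```python
def partition_orthogroups(copy_numbers, samples, soft_core=0.95, cloud=0.15):
    """Assign each orthogroup to core, soft_core, shell, or cloud.

    copy_numbers: {orthogroup: {sample: integer copy number}}
    Presence is copy-aware: any count >= 1 is present, and the
    integer counts themselves are left untouched for later use.
    """
    n_samples = len(samples)
    if n_samples == 0:
        raise ValueError("at least one sample is required")
    labels = {}
    for orthogroup, counts in copy_numbers.items():
        present = sum(1 for sample in samples if counts.get(sample, 0) >= 1)
        frequency = present / n_samples
        if present == n_samples:
            labels[orthogroup] = "core"
        elif frequency >= soft_core:
            labels[orthogroup] = "soft_core"
        elif frequency < cloud:  # assumed cutoff, not from the text
            labels[orthogroup] = "cloud"
        else:
            labels[orthogroup] = "shell"
    return labels
```

Keeping the thresholds as named arguments makes it straightforward to print them in the report alongside the resulting partition sizes.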

GenAPI pipeline from annotated genomes to BLAST-based presence/absence detection and a gene P/A matrix, with thresholds designed for fragmented assemblies. (Gabrielaite M. & Marvig R.L. (2020) BMC Bioinformatics)

Formats for analysis

Produce both wide and long tables.
- Wide TSV: orthogroups as rows, samples as columns, values as copy numbers.
- Long format (Parquet or Feather): faster joins with GWAS, SV tracks, and metadata.
Provide a dictionary for orthogroup → representative sequence → functional label.
If structural variants feed into the pangenome, keep links to SV IDs and coordinates. These links enable SV-GWAS or graph-based association later.
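
As an illustration of the wide and long layouts, the helper below reshapes a wide TSV (orthogroups as rows, samples as columns) into long records using only the standard library. Writing the result to Parquet or Feather would need an additional library and is left out; dropping zero-copy rows by default is a design assumption for this sketch.

```python
import csv
import io


def wide_to_long(wide_tsv_text, keep_absent=False):
    """Turn a wide copy-number table into (orthogroup, sample, copies) rows."""
    reader = csv.reader(io.StringIO(wide_tsv_text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        orthogroup, counts = record[0], record[1:]
        if len(counts) != len(samples):
            raise ValueError(f"{orthogroup}: expected {len(samples)} values, got {len(counts)}")
        for sample, count in zip(samples, counts):
            copies = int(count)
            if copies > 0 or keep_absent:
                rows.append((orthogroup, sample, copies))
    return rows
```

The long form omits absent genes unless asked otherwise, which keeps sparse accessory genomes compact while the wide table remains the complete record.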

Make the matrix boring. Boring files are easy to load, merge, and audit across pipelines.

Automated QC, Pitfalls, and Reproducibility

Quality control should be an automated layer, not a manual afterthought. Build checks that fail fast and explain why.

Common failure modes

Fragmentation inflates the accessory genome: highly fragmented assemblies yield split genes that masquerade as unique. Mitigation: require minimum CDS length and merge adjacent fragments with consistent coordinates.
Paralogs labelled as core: if clustering collapses recent duplicates, you may declare a duplicated family "core." Mitigation: tree-aware clustering and copy-aware presence rules.
Plasmid genes mis-mapped to chromosomes: assembly or binning artefacts can shift mobile genes. Mitigation: plasmid detection and mobile element annotation; report mobility flags in the matrix.
Batch effects in annotation: minor parameter changes can alter gene models. Mitigation: pin versions and parameters; track them in the run record.
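
The fragmentation mitigation could start from a flagging pass like the one sketched here. It reports CDS features below a minimum length and neighbouring same-strand genes that fall in the same orthogroup within a small gap, leaving the actual merge decision to a later, reviewed step. The gene record layout, the 120 bp minimum, and the 50 bp gap are all assumptions for the example.

```python
def flag_fragments(genes, min_cds_len=120, max_gap=50):
    """Return (short_ids, candidate_split_pairs) for one sample.

    genes: list of dicts with keys id, contig, start, end, strand,
    orthogroup (a hypothetical layout; coordinates are 1-based inclusive).
    """
    short_ids = [g["id"] for g in genes if g["end"] - g["start"] + 1 < min_cds_len]

    ordered = sorted(genes, key=lambda g: (g["contig"], g["start"]))
    candidate_pairs = []
    for left, right in zip(ordered, ordered[1:]):
        same_place = left["contig"] == right["contig"] and left["strand"] == right["strand"]
        close = right["start"] - left["end"] <= max_gap
        same_family = left["orthogroup"] is not None and left["orthogroup"] == right["orthogroup"]
        if same_place and close and same_family:
            candidate_pairs.append((left["id"], right["id"]))
    return short_ids, candidate_pairs
```

Genuine tandem duplicates can also satisfy these rules, so the flagged pairs are candidates for inspection rather than automatic merges.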

Version-controlled runs

Freeze environments with Conda or containers. Save an env.yaml or image hash for each release.
Save a parameters file per run. Include thresholds, inflation settings, and masking options.
Record checksums for every input and output. Store them alongside the results so any downstream team can verify integrity.

Handover pack

Deliver a concise README: data model, file inventory, and a diagram of the pipeline.
Include cluster size distributions, singleton rates, and rarefaction curves.
Export a small "edge-case" report listing clusters that were split or merged post hoc, with rationale.
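
Two of the handover statistics, the cluster size distribution and the singleton rate, are simple to derive from the orthogroup membership table. The sketch below assumes orthogroups are held as a mapping from orthogroup ID to a list of member gene IDs.

```python
from collections import Counter


def cluster_summary(orthogroups):
    """Return ({cluster size: number of clusters}, singleton rate)."""
    sizes = Counter(len(members) for members in orthogroups.values())
    total = sum(sizes.values())
    singleton_rate = sizes.get(1, 0) / total if total else 0.0
    return dict(sorted(sizes.items())), singleton_rate
```

A singleton rate that jumps between releases is often the first visible sign of an annotation or fragmentation problem upstream.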

This layer turns a technical build into a product others can trust and reuse.

Practical Tooling Patterns (RAST/Prokka, BLAST/MMseqs2, OrthoMCL/OrthoFinder)

Choosing tools is easier with a few guardrails. The following patterns are stable starting points for computational cores:

Prokka for bacteria with pinned databases; RAST when functional breadth is key. For eukaryotes, standardise gene prediction and evidence tracks across samples.
MMseqs2 for large cohorts or survey projects. BLAST for deep or tricky homology where sensitivity is critical.
OrthoFinder when you need robust orthology amid duplications and rate heterogeneity. OrthoMCL when you prefer graph-centric control and understand inflation tuning.
Keep "dry run" scripts that validate file presence, schema conformance, and parameter sanity before compute-heavy steps begin.
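
A dry-run check along these lines might look like the sketch below. The manifest columns (`assembly`, `gff3`, `faa`) and parameter names (`min_coverage`, `mcl_inflation`) are hypothetical, as is the accepted inflation range; the point is that every problem is collected and reported before any expensive step starts.

```python
from pathlib import Path

REQUIRED_ROLES = ("assembly", "gff3", "faa")  # hypothetical manifest columns


def preflight(manifest_rows, params):
    """Collect every problem instead of stopping at the first one."""
    errors = []
    for row in manifest_rows:
        sample = row.get("sample_id", "<missing sample_id>")
        for role in REQUIRED_ROLES:
            path = row.get(role)
            if not path:
                errors.append(f"{sample}: no {role} path in manifest")
            elif not Path(path).is_file():
                errors.append(f"{sample}: {role} file not found: {path}")
            elif Path(path).stat().st_size == 0:
                errors.append(f"{sample}: {role} file is empty: {path}")

    # Parameter sanity: names and ranges here are assumptions.
    coverage = params.get("min_coverage")
    if not isinstance(coverage, (int, float)) or not 0 < coverage <= 1:
        errors.append("min_coverage must be a fraction in (0, 1]")
    inflation = params.get("mcl_inflation")
    if inflation is not None and not 1 < inflation <= 10:
        errors.append("mcl_inflation is outside the assumed range (1, 10]")
    return errors
```

Returning the full list lets the launcher print all issues at once, so a cohort with several broken inputs needs one fix cycle rather than several.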

These choices align with the central aims of a pangenome workflow: scalability, interpretability, and reproducibility.

Reporting and Visualization for Stakeholders

Different teams need different views. Bake lightweight reporting into the final stage.

Operations view: run time, CPU/GPU hours, memory peaks, and the top five bottlenecks. This helps capacity planning.
Methods view: parameter table, tool versions, and confidence ranges for each threshold.
Science view: core vs accessory breakdown, function class enrichment in shell/cloud, and a short list of gene families that drive population structure.

Graphs that matter:

Rarefaction curves for core size vs number of genomes.
Copy-number heatmaps for families with known phenotypic relevance.
Network plots for orthogroups prone to split/merge decisions, annotated with similarity scores.

Deliver visuals as static PNGs and the underlying CSVs so others can rebuild figures.
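
The core-size rarefaction curve listed above can be estimated by adding genomes in random order and averaging over permutations. The function below is a minimal sketch that uses strict core (present in every genome sampled so far); the input layout, the default of 100 permutations, and the fixed seed are assumptions for the example.

```python
import random


def core_rarefaction(presence, samples, n_permutations=100, seed=0):
    """Mean strict-core size after 1..N genomes, over random orderings.

    presence: {orthogroup: set of samples in which it is present}
    Returns a list whose k-th entry (0-based) is the mean core size
    for k + 1 genomes.
    """
    rng = random.Random(seed)  # fixed seed keeps the curve reproducible
    per_sample = {
        sample: {og for og, members in presence.items() if sample in members}
        for sample in samples
    }
    totals = [0] * len(samples)
    for _ in range(n_permutations):
        order = list(samples)
        rng.shuffle(order)
        core = None
        for k, sample in enumerate(order):
            core = set(per_sample[sample]) if core is None else core & per_sample[sample]
            totals[k] += len(core)
    return [total / n_permutations for total in totals]
```

The returned values can be written to CSV next to the PNG, so anyone rebuilding the figure starts from the same numbers.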

Bringing It All Together

A dependable pan-genome pipeline balances sensitivity with discipline. Harmonised annotations reduce noise. Thoughtful orthology clustering captures biology instead of artefacts. Copy-aware pan-matrices, rarefaction context, and automated QC prevent mislabelled core/accessory calls. Most importantly, every decision—from BLAST thresholds to OrthoFinder settings—should be documented and reproducible. When that happens, downstream analyses like SV-GWAS, comparative genomics, or trait association on graph references become straightforward rather than fragile.

How we can help

CD Genomics supports research-use pangenome projects end-to-end: harmonised annotation, similarity search at scale, orthology clustering, and QC-ready pan-matrices, with reproducible environments and audit-friendly reports. We collaborate with universities, institutes, and R&D groups to deliver cohort-level results you can plug into your analysis. (Services are for non-clinical research only.)

References

Gautreau, G., Bazin, A., Gachet, M. et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16, e1007732 (2020).
Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biology 21, 180 (2020).
Emms, D.M., Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology 20, 238 (2019).
Gabrielaite, M., Marvig, R.L. GenAPI: a tool for gene absence–presence identification in fragmented bacterial genome sequences. BMC Bioinformatics 21, 362 (2020).
Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 (2017).
Manni, M., Berkeley, M.R., Seppey, M., Zdobnov, E.M. BUSCO Update: Novel and streamlined workflows with compatible scores across data sets. Molecular Biology and Evolution 38, 4647–4654 (2021).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
