Bioinformatics for eccDNA: Detection Algorithms, Filtering Artifacts, and Reporting Standards

Introduction
Circular DNA creates linear headaches. Short-read aligners such as BWA-MEM were designed for linear genomes, so when a read spans the head-to-tail junction of an extrachromosomal circular DNA (eccDNA), it often maps as a split alignment with soft clips or as a pair with an unexpected orientation. If you treat those mappings like ordinary variants, you'll miss real circles or inflate artifacts. This practical guide helps bioinformatics teams move from raw FASTQ files to a defensible, reproducible eccDNA call set.
If you need wet‑lab context for the data that feeds these pipelines, see the companion practical guide, Experimental Workflow for eccDNA Sequencing: Enrichment, Library Prep, and Common Pitfalls, which details how enrichment and library strategies shape downstream evidence.
In one sentence, here is why linear aligners struggle with circularity: the FM‑index seeding and local extension in tools like BWA‑MEM assume a continuous linear reference, so reads that traverse the circle junction appear as split segments or discordant pairs, and without explicit post‑processing the "seam" of the circle is invisible. Reviews and method papers document the need for specialized post‑alignment handling or junction‑aware realignment in eccDNA and ecDNA analysis, including the interplay of split reads, discordant pairs, and copy‑number signals in amplified regions; see the 2022 eLife overview and the AA/ecDNA papers from 2019–2024 [Zhao 2022, eLife: eccDNA detection tools and limitations; Deshpande 2019: AmpliconArchitect reconstruction of focal amplifications].
Detection strategies
Short‑read evidence hinges on junction‑spanning reads, while oncogene‑bearing ecDNAs in WGS add copy‑number and structural‑graph cues. Long reads can directly traverse junctions and clarify complex rearrangements. Below are the core strategies and how to implement them in practice.
Split reads: pinpointing junctions
Split reads are your highest‑confidence short‑read evidence. A single read partly maps to one side of the putative circle and partly to the other, producing a head‑to‑tail orientation across the junction. In BAMs, you'll see soft‑clips (S) in the CIGAR and supplementary alignments (SA tags). Specialized callers realign soft‑clipped segments against a junction graph to boost sensitivity.
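To make the split-read signature concrete, here is a minimal Python sketch that works on plain SAM fields rather than a BAM library. It assumes the SA tag layout from the SAM specification (rname,pos,strand,CIGAR,mapQ,NM). The helper names (clip_lengths, parse_sa_tag, is_head_to_tail) are hypothetical, and the head-to-tail test is a simplification: it only asks whether the read's left segment maps downstream of its right segment on the same chromosome and strand. It does not reproduce how any particular caller works.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def clip_lengths(cigar):
    """Return (leading, trailing) clip lengths; S and H both count as clips."""
    ops = CIGAR_RE.findall(cigar)
    lead = int(ops[0][0]) if ops and ops[0][1] in "SH" else 0
    trail = int(ops[-1][0]) if len(ops) > 1 and ops[-1][1] in "SH" else 0
    return lead, trail

def parse_sa_tag(sa_value):
    """Parse an SA tag value: 'rname,pos,strand,CIGAR,mapQ,NM;' repeated."""
    hits = []
    for entry in sa_value.strip(";").split(";"):
        if not entry:
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        hits.append({"rname": rname, "pos": int(pos), "strand": strand,
                     "cigar": cigar, "mapq": int(mapq), "nm": int(nm)})
    return hits

def is_head_to_tail(chrom, pos, strand, cigar, hit):
    """True if the primary alignment and one SA hit look like a circle junction.

    Simplified rule: same chromosome and strand, and the segment clipped on
    its right (read-left part) starts downstream of the segment clipped on
    its left (read-right part).
    """
    if chrom != hit["rname"] or strand != hit["strand"]:
        return False
    lead_p, trail_p = clip_lengths(cigar)
    lead_s, trail_s = clip_lengths(hit["cigar"])
    if trail_p > lead_p and lead_s > trail_s:
        left_pos, right_pos = pos, hit["pos"]
    elif lead_p > trail_p and trail_s > lead_s:
        left_pos, right_pos = hit["pos"], pos
    else:
        return False
    return left_pos > right_pos

# Example: the first 90 bases map near the circle end, the last 60 near its start
hit = parse_sa_tag("chr1,10001,+,90S60M,60,0;")[0]
print(is_head_to_tail("chr1", 10911, "+", "90M60S", hit))
```

A deletion-like split, where the left segment maps upstream of the right one, returns False under this rule, which is why orientation is checked in addition to the presence of a soft clip.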
Recommended starting points (Illumina PE150, hg38/mm39): align with BWA‑MEM (0.7.x), retain supplementary alignments, mark duplicates (Picard or samblaster), extract split reads, and run a junction‑aware caller such as Circle‑Map in "realign" mode. For reporting, start with ≥3 split reads, and tune based on library enrichment and repeats.
Example snippets:
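# Mapping and duplicate marking (short-read); no -M, so split hits stay flagged as supplementary
bwa mem -t 16 -R '@RG\tID:sample\tSM:sample' hg38.fa R1.fq.gz R2.fq.gz | \
  samblaster | samtools view -b -o sample.bam -
# Circle-Map uses both a query-name-sorted and a coordinate-sorted BAM
samtools sort -n -@ 8 -o sample.qname.bam sample.bam
samtools sort -@ 8 -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam
# Circle-Map: extract candidate reads, then realign them
Circle-Map ReadExtractor -i sample.qname.bam -o candidates.sam
samtools sort -@ 8 -o candidates.sorted.bam candidates.sam
samtools index candidates.sorted.bam
Circle-Map Realign -i candidates.sorted.bam -qbam sample.qname.bam -sbam sample.sorted.bam -fasta hg38.fa -o circlemap_realign.bed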
Circle‑Map's probabilistic realignment improves recovery of true junctions over naïve soft‑clip parsing, as reported in methodological surveys and tool notes in Briefings in Bioinformatics (2024) and related papers [Fang 2024: eccDNA‑pipe overview and tool effectiveness]. Thresholds are often tuned to the data type; enriched libraries typically permit lower support than unenriched WGS.
Discordant pairs: supportive evidence and clustering
Discordant pairs present with outward‑facing orientation or anomalously short/long insert sizes around the junction. On their own they're rarely definitive, but paired with split reads they increase confidence and help cluster breakpoints. Compute sample‑specific insert‑size stats, flag pairs beyond expected ranges near junctions, and cluster them within 300–600 bp windows flanking split‑read breakpoints. As a rule of thumb, require ≥2 discordant pairs in addition to split‑read support when near repeats.
Coverage and copy‑number signals (WGS ecDNA)
Large ecDNAs in cancer often present extreme copy‑number gain and complex junction graphs. Seed reconstruction from CNV calls and refine structure with breakpoint evidence:
- Call CNVs on WGS with CNVkit or Control‑FREEC; seed amplicons with CN ≥4.5–5 and length ≥10 kb.
- Run AmpliconArchitect (AA) to reconstruct amplicon graphs.
- Classify structures with AmpliconClassifier (AC) into ecDNA, BFB, linear, or complex.
Authoritative method descriptions and exemplars are available in the original AA paper and AmpliconSuite docs [Deshpande 2019: AA reconstructs focal amplifications; AmpliconSuite guide: AA/AC documentation].
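As a rough illustration of the seeding step, the sketch below filters CNV segments by copy number, merges adjacent amplified segments, and then applies the length cutoff. The tuple layout (chrom, start, end, cn), the function name, and the max_gap parameter are assumptions for this example. They are not the input format or the seeding logic of AmpliconArchitect itself.

```python
def select_seed_intervals(segments, min_cn=4.5, min_len=10_000, max_gap=0):
    """Pick candidate seed intervals from CNV segments.

    segments: iterable of (chrom, start, end, copy_number).
    Adjacent amplified segments on the same chromosome are merged when the
    gap between them is <= max_gap, then intervals shorter than min_len
    are dropped.
    """
    amplified = sorted((s for s in segments if s[3] >= min_cn),
                       key=lambda s: (s[0], s[1]))
    merged = []
    for chrom, start, end, _cn in amplified:
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged if m[2] - m[1] >= min_len]

segments = [
    ("chr7", 55_000_000, 55_006_000, 12.0),
    ("chr7", 55_006_000, 55_020_000, 9.5),
    ("chr7", 60_000_000, 60_005_000, 8.0),   # amplified but too short
    ("chr8", 1_000_000, 2_000_000, 2.1),     # not amplified
]
print(select_seed_intervals(segments))
```

Merging before the length filter matters: a focal amplification split into several short CNV segments would otherwise be discarded piece by piece.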
Common tools and where they fit
- Circle‑Map (short reads, junction calling): Sensitive to junction‑spanning split reads via probabilistic realignment. Best for enriched short‑read libraries and WGS junction detection [Circle‑Map GitHub: repository and docs].
- AmpliconArchitect + AmpliconClassifier (WGS amplicon structure): Reconstructs and classifies focal amplifications; indispensable for ecDNA interpretation in oncology WGS [Deshpande 2019 and AmpliconSuite guide linked above].
- ECCsplorer (short reads, multiple modes): Combines mapping and clustering; widely used in non‑model organisms and plants where references vary in quality [Mann 2022: ECCsplorer applied in plants/non‑models].
- nf‑core/circdna (pipeline): A reproducible Nextflow pipeline unifying several branches (Circle‑Map, Circle_finder, CIRCexplorer2, AA, Unicycler+minimap2) with standardized QC and outputs [nf‑core: circdna pipeline documentation].
Figure 1. Detection signals for eccDNA: split reads span the head‑to‑tail junction, while discordant paired‑end reads map with abnormal orientation or insert size on either side of the breakpoint.
eccDNA artifact filtering
Artifact control is where eccDNA bioinformatics either builds trust or breaks it. Use the following prioritized strategy and adapt thresholds to your library type and species.
Start with baseline QC and mapping: perform adapter/quality trimming (Trim Galore! or fastp), map with BWA‑MEM for short reads and minimap2 for long reads, mark duplicates (Picard or samblaster), and retain supplementary alignments. For junction‑supporting reads, set a reporting floor like MAPQ median ≥20–30.
Minimum evidence thresholds for short reads: report a call when it has (≥3 split reads, or ≥2 split reads plus ≥2 discordant pairs) and a local depth fold‑change ≥3 over ±5–10 kb flanks. Raise thresholds near low‑complexity and simple‑repeat regions. These ranges align with usage patterns in recent studies and method notes for junction‑centric callers [dos Santos 2023; Wang 2024: threshold exemplars in recent literature].
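One way to compute the local depth fold-change is sketched below, assuming you already have mean depths for the candidate interval and for each flank (for example from samtools depth or bedtools coverage). The function name and the pseudocount, which guards against division by zero in uncovered flanks, are choices made for this example.

```python
def local_depth_fold_change(inside_mean, left_flank_mean, right_flank_mean,
                            pseudocount=1.0):
    """Ratio of mean depth inside the candidate circle to mean flank depth."""
    flank_mean = (left_flank_mean + right_flank_mean) / 2
    return (inside_mean + pseudocount) / (flank_mean + pseudocount)

fc = local_depth_fold_change(52.0, 9.0, 11.0)
print(round(fc, 2), fc >= 3)
```

With enriched libraries the flanks are often close to zero depth, so the pseudocount has a large effect there. Report the value used alongside the threshold.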
Repeats and microsatellites: compute overlap with RepeatMasker annotations and flag calls with simple‑repeat overlap >50%. Retain a high‑overlap call only if junction split‑reads are unique (non‑multimapping) and abundant and if discordant pairs cluster symmetrically around the junction. Reviews highlight repeats as a major confounder and recommend cautious interpretation [Gadgil 2024; Wang 2024 review: repeat-aware eccDNA interpretation, recent review of eccDNA methods].
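The repeat-overlap percentage can be computed without double-counting overlapping annotations, as in this sketch. It assumes half-open coordinates and repeat intervals already restricted to the call's chromosome. The function names and the inputs to the flagging rule are hypothetical.

```python
def repeat_overlap_pct(start, end, repeats):
    """Percent of [start, end) covered by the union of repeat intervals."""
    covered = 0
    cursor = start  # right edge of the region already counted
    for r_start, r_end in sorted(repeats):
        r_start = max(r_start, cursor)
        r_end = min(r_end, end)
        if r_end > r_start:
            covered += r_end - r_start
            cursor = r_end
    return 100.0 * covered / (end - start)

def repeat_flag(overlap_pct, junction_unique, threshold=50.0):
    """Return 'repeat_high' for high-overlap calls lacking unique junction reads."""
    if overlap_pct > threshold and not junction_unique:
        return "repeat_high"
    return None

pct = repeat_overlap_pct(100, 200, [(90, 120), (110, 130), (180, 250)])
print(pct, repeat_flag(pct, junction_unique=False))
```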
Mitochondrial DNA (chrM) and NUMTs: by default, exclude chrM circles unless your study explicitly targets mitochondrial eccDNA. If reporting mt‑eccDNA, require higher evidence (e.g., ≥5 split reads, independent library confirmation) and label calls as mitochondrial in the output. Intersect calls with a curated NUMT track (build‑matched) and flag overlaps; consider exclusion unless strong junction evidence indicates nuclear‑derived circles. Document the NUMT source/version in metadata. For depletion at the wet‑lab level and context, see enzyme‑based depletion approaches described in open protocols [Lin 2024: mitochondrial depletion in Circle‑seq]. For more on interpretation in stressed or apoptotic contexts, see Are eccDNAs Apoptotic Products? Innate Immunostimulatory Activity and Experimental Interpretation.
Library chimeras and duplicates: verify uniform coverage inside the putative circle—ligation artifacts often lack internal coverage and fail to reproduce across independent library preps. Remove PCR duplicates and, when UMI‑tagged, require support from ≥2 unique molecules.
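Both checks in this paragraph reduce to small calculations. The sketch below assumes junction-supporting reads are available as (UMI, start, strand) tuples and that per-base depths inside the candidate circle have been extracted. The names and the min_depth default are illustrative only.

```python
def unique_molecules(reads):
    """Count distinct molecules among reads given as (umi, start, strand)."""
    return len({(umi, start, strand) for umi, start, strand in reads})

def covered_fraction(depths, min_depth=1):
    """Fraction of positions inside the circle with depth >= min_depth."""
    return sum(d >= min_depth for d in depths) / len(depths)

reads = [("AAT", 100, "+"), ("AAT", 100, "+"), ("CGT", 100, "+")]
print(unique_molecules(reads) >= 2)       # UMI rule: at least 2 unique molecules
print(covered_fraction([0, 3, 4, 0]))     # low values suggest a ligation artifact
```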
Codify decision rules to keep your pipeline reproducible:
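def apply_decision_rules(call):
    """Return (keep, flags) for one candidate call given as a dict."""
    flags = []
    if call["chrom"] == "chrM":
        # Mitochondrial circles need stronger evidence plus replicate confirmation
        keep = call["support_split"] >= 5 and call.get("replicate_confirmation", False)
        flags.append("mitochondrial")
    else:
        keep = call["support_split"] >= 3 or (
            call["support_split"] >= 2 and call["support_discordant"] >= 2)
    if call["repeat_overlap_pct"] > 50 and not call["junction_unique"]:
        flags.append("repeat_high")  # consider excluding unless long-read validated
    if call["mapq_median"] < 20:
        flags.append("low_mapq")
        keep = False  # low-MAPQ junctions are excluded outright
    if call["size"] < 3000 and call["sample_state"] == "stressed/apoptotic":
        flags.append("apoptosis_risk")  # require orthogonal validation before reporting
    return keep, flags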
Reporting standards
There's no single community standard for eccDNA outputs yet, but teams can still achieve reproducible, machine‑readable deliverables. The schema below works across Circle‑Map/ECCsplorer junction calls and AA/AC amplicon structures, and it dovetails with reproducible pipelines such as nf‑core/circdna [nf‑core: circdna documentation and outputs].
Recommended call table: BED with extended columns
#chrom start end name strand support_split support_discordant circle_score local_depth_fc mapq_median repeat_overlap_pct numt_overlap tool consensus_tools flags notes
chr7 55012000 55018543 eccDNA_0001 + 6 4 42.1 5.3 48 12.5 False Circle-Map Circle-Map;ECCsplorer . junction validated in IGV
chr12 34500123 34504555 eccDNA_0002 - 3 2 28.7 3.1 35 57.2 False Circle-Map Circle-Map repeat_high near microsatellite; keep pending long-read
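A small loader helps keep the call table machine-readable in practice. The sketch below assumes the table is stored tab-delimited (so the free-text notes column can contain spaces), that "." marks an empty flags field, and that multiple flags or tools are separated by semicolons. These conventions and the function name are assumptions for the example rather than an established standard.

```python
import csv

COLUMNS = ["chrom", "start", "end", "name", "strand", "support_split",
           "support_discordant", "circle_score", "local_depth_fc",
           "mapq_median", "repeat_overlap_pct", "numt_overlap", "tool",
           "consensus_tools", "flags", "notes"]
INT_FIELDS = ("start", "end", "support_split", "support_discordant")
FLOAT_FIELDS = ("circle_score", "local_depth_fc", "mapq_median",
                "repeat_overlap_pct")

def read_call_table(lines):
    """Parse tab-delimited extended-BED lines into typed dicts."""
    calls = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
        call = dict(zip(COLUMNS, row))
        for key in INT_FIELDS:
            call[key] = int(call[key])
        for key in FLOAT_FIELDS:
            call[key] = float(call[key])
        if call["start"] >= call["end"]:
            raise ValueError(f"{call['name']}: start must be less than end")
        call["numt_overlap"] = call["numt_overlap"] == "True"
        call["consensus_tools"] = call["consensus_tools"].split(";")
        call["flags"] = [] if call["flags"] == "." else call["flags"].split(";")
        calls.append(call)
    return calls

row = "\t".join(["chr12", "34500123", "34504555", "eccDNA_0002", "-", "3", "2",
                 "28.7", "3.1", "35", "57.2", "False", "Circle-Map",
                 "Circle-Map", "repeat_high",
                 "near microsatellite; keep pending long-read"])
for call in read_call_table(["#" + "\t".join(COLUMNS), row]):
    print(call["name"], call["end"] - call["start"], call["flags"])
```

Failing loudly on a wrong column count catches the most common export problem, which is notes text split across columns by a whitespace-delimited writer.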
Minimal metadata (JSON/YAML)
sample_id: PDX123_T1
species: human
reference_build: GRCh38
library_type: Circle-seq
read_length: PE150
aligned_depth: 85e6_pairs
aligner: bwa-mem/0.7.17
caller: circle-map/1.1.4
pipeline: nf-core/circdna/1.0.4 (docker sha256:...)
deduplication: "samblaster (UMI: false)"
filters:
  min_split: 3
  min_discordant: 2
  mapq_median: 20
  repeat_overlap_pct: 50
  mito_policy: exclude
visualizations:
  igv_snapshots: [igv/PDX123_T1_eccDNA_0001.png]
  circos_config: plots/PDX123_T1_circos.conf
notes: thresholds adjusted upward for simple repeats
QC summary table (per sample)
sample_id,raw_reads,aligned_reads,dedup_rate,insert_size_median,mean_depth,calls_pre_filter,calls_post_filter
PDX123_T1,160000000,142300000,0.19,385,32.8,1248,346
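Two derived rates make the QC table easier to compare across samples: the fraction of reads aligned and the fraction of calls surviving filtering. The sketch below reads the CSV layout shown above. The function name and output keys are hypothetical.

```python
import csv

def qc_rates(lines):
    """Compute per-sample alignment rate and filter retention from QC CSV lines."""
    rates = {}
    for rec in csv.DictReader(lines):
        rates[rec["sample_id"]] = {
            "alignment_rate": int(rec["aligned_reads"]) / int(rec["raw_reads"]),
            "filter_retention": (int(rec["calls_post_filter"])
                                 / int(rec["calls_pre_filter"])),
        }
    return rates

qc_lines = [
    "sample_id,raw_reads,aligned_reads,dedup_rate,insert_size_median,"
    "mean_depth,calls_pre_filter,calls_post_filter",
    "PDX123_T1,160000000,142300000,0.19,385,32.8,1248,346",
]
for sample, r in qc_rates(qc_lines).items():
    print(sample, round(r["alignment_rate"], 3), round(r["filter_retention"], 3))
```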
Visualization guidance and internal references: use IGV to verify junctions and internal coverage for a subset of calls per sample. For WGS ecDNA, the AmpliconArchitect Cycle view helps interpret structural context and supports classification with AmpliconClassifier [Deshpande 2019: AA Cycle view in ecDNA reconstruction]. Chromosome‑scale density plots (Circos) quickly summarize hotspot distributions and sample‑to‑sample differences. For oncology‑oriented visualization examples, see eccDNA in Cancer: Gene Amplification, Oncogene Regulation, and Research Applications. For numeric QC thresholds and vendor comparisons, see Quality Metrics for eccDNA Sequencing: Enrichment Efficiency, Background, and Reproducibility.

Figure 2. Neutral "cycle" schematic illustrating rearranged segments and orientations (self‑drawn). For real AA outputs and schema, see the AA paper and AmpliconSuite docs [Deshpande 2019: AA Cycle view concept; AmpliconSuite guide: documentation].
Figure 3. Circos‑style density plot summarizing eccDNA hotspots (self‑generated).
From FASTQ to eccDNA bioinformatics deliverables
Disclosure: the author is affiliated with CD Genomics. The following neutral example shows how a typical research deliverable maps to the templates above so teams can standardize reports internally without changing analytical conclusions.
A typical deliverable includes raw FASTQs, a mapping BAM/CRAM with an index, a junction call table (TSV/BED), a methods PDF, and figures. To conform to the schema here, import the call table into the extended BED format and add per‑call fields for support counts, coverage fold‑change, MAPQ summary, repeat/NUMT overlaps, and flags. Sample‑level metadata captures the reference build, library type, read length, depth, aligner/caller versions, and filtering thresholds. For instance, the Circle‑Map output BED is augmented with support_split/support_discordant counts and a local_depth_fc column computed with bedtools coverage over ±10 kb flanking windows. If the project targets WGS ecDNA, AmpliconArchitect's graph files are kept as artifacts and the classifier's labels (ecDNA vs. linear) are added to the notes or flags column. This yields a single, machine‑readable call table per sample plus a lightweight YAML/JSON metadata file, enabling straightforward comparisons and reproducibility checks across cohorts and vendors.
From raw FASTQ to an actionable eccDNA detection list
Here's a compact, end‑to‑end path you can adapt to your datasets.
Short‑read enrichment (Circle‑seq/related): perform pre‑QC and mapping (FastQC → Trim Galore!/fastp → BWA‑MEM; mark duplicates; index BAM). Discover junctions with Circle‑Map (Realign) and optionally run ECCsplorer as an orthogonal branch. Build consensus, apply thresholds (split ≥3 or split ≥2 + discordant ≥2; MAPQ ≥20–30), use a repeat‑aware policy, and exclude chrM unless targeted. Annotate calls with local coverage fold‑change, RepeatMasker and NUMT overlaps, and flags. Validate a subset in IGV, generate a Circos density plot, and export the extended BED + metadata JSON/YAML + QC summary.
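The consensus step between two callers can be sketched with a reciprocal-overlap match. This is only one possible rule. The 0.9 threshold, the tuple layout (chrom, start, end), and the function names are assumptions for illustration and are not taken from Circle-Map, ECCsplorer, or nf-core/circdna.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))

def consensus_calls(calls_a, calls_b, min_overlap=0.9):
    """Pairs of calls from two tools that agree at >= min_overlap."""
    return [(a, b) for a in calls_a for b in calls_b
            if reciprocal_overlap(a, b) >= min_overlap]

tool_a = [("chr7", 55012000, 55018543), ("chr12", 34500123, 34504555)]
tool_b = [("chr7", 55012010, 55018540), ("chr12", 34503000, 34509000)]
print(consensus_calls(tool_a, tool_b))
```

The all-against-all loop is fine for a few thousand calls per sample. For larger sets, an interval index or bedtools intersect with a reciprocal fraction would be the more practical route.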
WGS ecDNA (oncology research): call CNVs on WGS with CNVkit or Control‑FREEC; seed amplified regions (CN ≥4.5–5; ≥10 kb), reconstruct with AmpliconArchitect, and classify with AmpliconClassifier. Corroborate breakpoints with split/discordant evidence; consider a Circle‑Map pass to refine junctions. Apply repeat‑aware rules, flag/annotate mtDNA/NUMTs, and raise thresholds for simple repeats. Include AA Cycle graphs, IGV snapshots, and a Circos density track in the report and export extended BED + AA/AC outputs + metadata.
Long‑read validation or discovery (ONT/PacBio): map with minimap2 (map‑ont or map‑hifi presets), assemble junction‑spanning contigs when possible, and call circle junctions with a long‑read‑aware approach (e.g., CReSIL, CoRAL). Recent work indicates improved structural resolution over short‑read‑only approaches in simulated and empirical settings [CoRAL 2024: graph reconstruction accuracy in long reads]. Use long reads to confirm ambiguous short‑read junctions, resolve repeats, and refine boundaries.
Reproducibility notes: prefer containerized workflows; nf‑core/circdna provides standardized branches and outputs with MultiQC summaries [nf‑core: circdna pipeline]. Record exact versions and container digests in the metadata file; save IGV/Circos configuration alongside outputs.
Method choice impacts analysis stringency and interpretability. If you're deciding between enrichment strategies or weighing thresholds against project goals, see the enrichment discussion in the companion guide, Choosing eccDNA Enrichment Methods: Exonuclease Digestion, RCA, Capture, and Controls, and consult the QC recommendations in Quality Metrics for eccDNA Sequencing: Enrichment Efficiency, Background, and Reproducibility.
Author
Yang H. — Senior Scientist, CD Genomics; University of Florida.
Yang is a genomics researcher with over 10 years of research experience in genetics, molecular and cellular biology, sequencing workflows, and bioinformatic analysis. Skilled in both laboratory techniques and data interpretation, Yang supports RUO study design and NGS-based projects.
References:
- AmpliconSuite. AmpliconArchitect/AmpliconClassifier documentation (GUIDE). GitHub. https://github.com/AmpliconSuite/AmpliconSuite-pipeline/blob/master/documentation/GUIDE.md.
- Circos tutorial. Galaxy Project training: Circos visualization tutorial. https://training.galaxyproject.org/training-material/topics/visualisation/tutorials/circos/tutorial.html.
- Deshpande V, et al. Exploring the landscape of focal amplifications in cancer using AmpliconArchitect. Nat Commun. 2019;10:392. doi:10.1038/s41467-018-08200-y. (PMCID: PMC6344493)
- dos Santos M, et al. Practical thresholding and Circle-Map usage exemplars in eccDNA detection. 2023. (PMCID: PMC10495552)
- Fang M, et al. eccDNA-pipe: an integrated pipeline for identification, analysis and visualization of extrachromosomal circular DNA. Brief Bioinform. 2024;25(2):bbae034. doi:10.1093/bib/bbae034.
- Lin X, et al. Mitochondrial depletion strategies for Circle‑seq and related eccDNA enrichment protocols. 2024. (PMCID: PMC11606223)
- Mann M, et al. ECCsplorer: a pipeline to detect extrachromosomal circular DNA from next-generation sequencing data. BMC Bioinformatics. 2022;23:40. doi:10.1186/s12859-021-04545-2. (PMCID: PMC8760651)
- Petito E, et al. eccDNA generation in apoptosis and innate immune contexts: implications for experimental interpretation. 2024. (PMCID: PMC11049804)
- Wanchai C, et al. CReSIL: accurate identification of extrachromosomal circular DNA from long reads. Brief Bioinform. 2022;23(6):bbac422. doi:10.1093/bib/bbac422.
- Wang X, et al. Methodological review and threshold recommendations for eccDNA callers. 2024. (PMCID: PMC10876971)
- Yi M, et al. Extrachromosomal DNA in cancer: mechanisms and implications. Nat Rev Genet. 2022. (PMCID: PMC9671848)
- Zhang H, et al. ecc_finder: detecting extrachromosomal circular DNA from short- and long-read data. GigaScience. 2021;10:giab045. doi:10.1093/gigascience/giab045.
- Zhao Y, et al. Extrachromosomal circular DNA: Current status and future prospects. eLife. 2022;11:e81412. doi:10.7554/eLife.81412. (PMCID: PMC9578701)
- Zhu K, Jones MG, Luebeck J, Bu X, Yi H, Hung KL, Wong ITL, Zhang S, Mischel PS, Chang HY, Bafna V, et al. CoRAL: Complete Reconstruction of Amplifications with Long Reads. bioRxiv preprint, 2024. DOI: 10.1101/2024.02.15.580594. (PMCID: PMC10888815)
- nf-core/circdna. nf-core circDNA pipeline documentation and outputs. https://nf-co.re/circdna/1.0.4/.