Reporting & Interpretation: Match Scores, Thresholds, and Limitations
This article explains how to produce audit-ready DNA barcoding reports for species identification that go beyond a single top hit. Percent identity alone is never enough. Pair identity with aligned length and E-value, show voucher evidence, and use BINs when names are unsettled. That combination supports defensible decisions for stakeholders.
Why reporting quality determines trust
A perfect chromatogram with a weak report still fails audits. The most common issue is relying on one number—"99% identity"—without alignment span, search-space context, or provenance. E-values fall as scores rise but also scale with database size; identity inflates on short alignments. A defensible report interprets multiple signals together and anchors them to curated references.
Taxonomy also changes. For animal COI, Barcode Index Numbers (BINs) cluster sequences into operational units that often align with species boundaries, giving you a stable label while names catch up. When names conflict inside a BIN, you can still report a clear conclusion by citing the BIN plus evidence.
Typical outcomes of species↔BIN relationships—one-to-one, merged, or split clusters—illustrate how BINs stabilize labels while taxonomy updates. (Ratnasingham S. & Hebert P.D.N. (2013) PLOS ONE).
Helpful internal links for deeper guidance:
- Reference Libraries: BOLD & GenBank Best Practices — how to build voucher-anchored references.
- How Does DNA Barcoding Work? — lab-side QC and acceptance criteria.
- DNA Barcoding Service — scoped projects with reporting templates and review.
What belongs in a defensible barcoding report (definition list)
Use a concise definition list so reviewers can scan terms and see exactly how you derived the call.
- Percent identity (% identity). Nucleotide identity across the aligned region. Always report it with aligned length; short spans inflate identity.
- Aligned length / coverage. Number and fraction of bases aligned. Prefer longer, clean alignments; note gaps and trimming.
- E-value (BLAST). Expected number of hits with a score at least as good that would arise by chance in a search space of that size; it is an expectation, not a probability. Lower is better; E-value depends on both score and search space.
- BIN (BOLD). Barcode Index Number cluster for animal COI; an operational unit you can cite when species names are uncertain or discordant.
- Voucher link. Museum/herbarium accession or image voucher; enables re-examination and reduces mislabel risk.
- Geographic plausibility. Known range and habitat should fit the match.
- Source modifiers / metadata. collection-date, geo_loc_name, and other INSDC qualifiers improve reproducibility and search; many archives expect these fields.
This list aligns your write-up with how BLAST statistics behave and how BOLD/GenBank organize evidence, making your logic easy to audit.
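As a minimal sketch of how the identity, aligned-length, and E-value terms above can be captured together, the snippet below parses one line of BLAST+ tabular output (`-outfmt 6` with its default twelve columns) into a single record. The `Hit` class, the `parse_blast6_line` helper, and the example line are hypothetical illustrations; the query length has to be supplied separately because the default tabular format does not include it.

```python
from dataclasses import dataclass

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


@dataclass
class Hit:
    """Hypothetical record that keeps identity, span, and E-value together."""
    subject_id: str
    percent_identity: float
    aligned_length: int
    evalue: float
    bit_score: float
    query_coverage: float  # fraction of the query covered by the alignment


def parse_blast6_line(line: str, query_length: int) -> Hit:
    """Parse one default-format outfmt 6 line into a Hit."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST6_FIELDS):
        raise ValueError(
            f"expected {len(BLAST6_FIELDS)} columns, got {len(values)}"
        )
    row = dict(zip(BLAST6_FIELDS, values))
    covered = abs(int(row["qend"]) - int(row["qstart"])) + 1
    return Hit(
        subject_id=row["sseqid"],
        percent_identity=float(row["pident"]),
        aligned_length=int(row["length"]),
        evalue=float(row["evalue"]),
        bit_score=float(row["bitscore"]),
        query_coverage=covered / query_length,
    )


# Made-up example line; identifiers and scores are placeholders.
example = "sample01\tREF0001\t97.87\t658\t14\t0\t1\t658\t1\t658\t0.0\t1100"
hit = parse_blast6_line(example, query_length=658)
print(
    f"{hit.percent_identity}% over {hit.aligned_length} bp, "
    f"coverage {hit.query_coverage:.0%}, E={hit.evalue:g}"
)
```

Keeping the fields in one record makes it harder to quote identity in a report without the span and E-value that qualify it.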
Interpreting match scores the right way
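Identity without span is a shortcut. A 99.5% hit over 120 bp is weaker than 97.8% over 658 bp, especially in clades with slow mitochondrial evolution, because a short alignment samples few variable sites: at 120 bp, a single mismatch moves identity by about 0.8 percentage points. Always pair identity with aligned length.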
Use E-value as a sanity check, not the only gate. E-values drop exponentially with higher scores but rise as databases grow. Note database version/date in your report so future reviews understand any shifts.
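The dependence on score and database size can be made concrete with the bit-score form of the Karlin-Altschul relation, E = m · n · 2^(−S'), where S' is the bit score, m the query length, and n the total database length. The sketch below is a simplification: BLAST itself applies effective (edge-corrected) lengths, so its reported E-values will differ somewhat, and the function name and the example numbers are assumptions chosen for illustration.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths; BLAST uses effective lengths, so treat the
    result as an illustration of scaling, not a reproduction.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative numbers only: same hit, two database sizes, two scores.
small_db = expected_hits(bit_score=60.0, query_length=658, database_length=10**9)
large_db = expected_hits(bit_score=60.0, query_length=658, database_length=2 * 10**9)
better_score = expected_hits(bit_score=70.0, query_length=658, database_length=10**9)

print(f"E at 1 Gb database:  {small_db:.2e}")
print(f"E at 2 Gb database:  {large_db:.2e}")   # twice as large
print(f"E with +10 bits:     {better_score:.2e}")  # 2**10 times smaller
```

Doubling the database doubles E for an unchanged alignment, while every extra bit of score halves it, which is why the same hit can report a different E-value after a database update.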
Prefer voucher-linked references. Voucher accessions and image vouchers provide a trail for re-examination and reduce the risk of misidentified records.
Add BIN context for animal COI. Report the BIN, plus concordance (one species ↔ one BIN) or discordance (split or merged cases). If names conflict inside a BIN, treat the BIN as your stable label and present the competing hypotheses.
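One way to make the concordance check repeatable is to tabulate species names against BIN IDs for the reference records you retrieved. The sketch below is an assumption-laden illustration: the `classify_bins` function, the status labels ("concordant", "merged", "split", "mixed"), and the BIN IDs and species names are all hypothetical placeholders, not BOLD output.

```python
from collections import defaultdict


def classify_bins(records):
    """Label each BIN from (species_name, bin_id) pairs.

    concordant: one species <-> one BIN
    merged:     several species names share the BIN
    split:      the BIN's single species also occurs in other BINs
    mixed:      both merged and split
    """
    species_to_bins = defaultdict(set)
    bin_to_species = defaultdict(set)
    for species, bin_id in records:
        species_to_bins[species].add(bin_id)
        bin_to_species[bin_id].add(species)

    status = {}
    for bin_id, species_set in bin_to_species.items():
        merged = len(species_set) > 1
        split = any(len(species_to_bins[s]) > 1 for s in species_set)
        if merged and split:
            status[bin_id] = "mixed"
        elif merged:
            status[bin_id] = "merged"
        elif split:
            status[bin_id] = "split"
        else:
            status[bin_id] = "concordant"
    return status


# Placeholder names and BIN IDs, for illustration only.
records = [
    ("Species a", "BOLD:AAA0001"),
    ("Species b", "BOLD:AAA0002"),
    ("Species c", "BOLD:AAA0002"),
    ("Species d", "BOLD:AAA0003"),
    ("Species d", "BOLD:AAA0004"),
]
bin_status = classify_bins(records)
for bin_id, label in sorted(bin_status.items()):
    print(bin_id, label)
```

The label only summarizes the records you passed in; a BIN that looks concordant in a small download can still be discordant in the full database, so cite the BIN page and retrieval date alongside it.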
Document geography. A strong sequence match from an impossible region is a red flag. Range checks catch surprising false positives.
Maps and clustering of lineages illustrate how geographic context supports or contradicts sequence-based identifications. (Pentinsaari M. et al. (2020) PLOS ONE).
Thresholds, cutoffs, and where they fail
There is no universal percent identity cutoff for DNA barcoding. Divergence rates vary among taxa and markers; some groups show overlap between intra- and inter-specific distances (a narrow or absent barcode gap), while others separate cleanly. A single identity threshold invites false matches in one clade and false splits in another.
Distributions of K2P distances within vs. between beetle species illustrate why a single percent-identity threshold is unreliable across taxa. (Pentinsaari M. et al. (2014) PLOS ONE).
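For readers who want to see what a K2P (Kimura two-parameter) distance involves, the following is a minimal sketch for two pre-aligned sequences, using d = −½ · ln((1 − 2P − Q) · √(1 − 2Q)) with P the proportion of transitions and Q the proportion of transversions. The function name is hypothetical, and skipping gaps and ambiguity codes is a simplifying assumption; dedicated phylogenetics packages handle these cases more carefully.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in BASES or b not in BASES:
            continue  # assumption: skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1   # A<->G or C<->T
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


# Toy 20-bp example with one transition (A->G at the first site).
d = k2p_distance("ACGTACGTACGTACGTACGT", "GCGTACGTACGTACGTACGT")
print(f"K2P distance: {d:.4f}")
```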
Four reasons "one number" fails:
- Taxon effects. Some lineages show weak or inconsistent barcode gaps; a threshold tuned in one group misfires in another.
- Marker effects. COI, rbcL/matK, and ITS/ITS2 evolve at different rates; thresholds do not translate across loci.
- Reference density. Sparse or uneven reference libraries inflate error; thin coverage makes distant "best hits" look better than they are.
- Edge biology. Hybridization, introgression, incomplete lineage sorting, and NUMTs (nuclear mitochondrial pseudogenes) can blur mitochondrial signals; even clean lab work can yield ambiguous results.
Reviews routinely caution against treating the barcode gap as a universal rule. When variation overlaps, use multiple signals (a second locus, morphology, or geography) and write your conclusion accordingly.
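A simple way to check whether variation overlaps in your own reference set is to compare the largest within-species distance with the smallest between-species distance. The sketch below assumes you already have pairwise distances labeled with species names; the `barcode_gap` function and the distances are hypothetical, chosen only to show a clean gap and an overlap.

```python
def barcode_gap(pairwise):
    """Return min(interspecific) - max(intraspecific) distance.

    `pairwise` is an iterable of (species_a, species_b, distance).
    A positive result means a gap; zero or negative means overlap.
    """
    intra = [d for a, b, d in pairwise if a == b]
    inter = [d for a, b, d in pairwise if a != b]
    if not intra or not inter:
        raise ValueError("need both within- and between-species comparisons")
    return min(inter) - max(intra)


# Made-up distances for illustration.
clean = [("sp1", "sp1", 0.004), ("sp2", "sp2", 0.010), ("sp1", "sp2", 0.080)]
overlapping = [("sp1", "sp1", 0.030), ("sp2", "sp2", 0.005), ("sp1", "sp2", 0.020)]

clean_gap = barcode_gap(clean)
overlap_gap = barcode_gap(overlapping)
print(f"clean set gap:       {clean_gap:+.3f}")
print(f"overlapping set gap: {overlap_gap:+.3f}")
```

A non-positive value is the signal to bring in a second locus, morphology, or geography rather than to tighten the identity threshold.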
Limits and edge cases you must disclose
Hybridization & introgression. Mitochondrial markers can track maternal history more than current species boundaries. Flag this risk when your clade includes hybrid zones or recent contact.
NUMTs (nuclear mitochondrial pseudogenes). Co-amplified NUMTs may show stop codons, frameshifts, or odd composition. Inspect chromatograms and translations; if suspicious, resequence or add a second locus.
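The translation screen can be approximated with a stop-codon scan across the three forward reading frames. The sketch below is a first-pass check under stated assumptions: it covers only NCBI genetic code tables 2 (vertebrate mitochondrial) and 5 (invertebrate mitochondrial), ignores the reverse strand, and uses hypothetical function names and toy sequences. A protein-coding COI fragment should have at least one frame free of internal stops; a sequence with stops in every frame deserves a closer look.

```python
# Stop codons for two mitochondrial genetic codes (NCBI table numbers).
STOP_CODONS = {
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}


def stop_codons_by_frame(sequence: str, table: int = 5) -> dict:
    """Map each forward frame (0, 1, 2) to positions of in-frame stop codons."""
    stops = STOP_CODONS[table]
    seq = sequence.upper().replace("-", "")  # assumption: drop alignment gaps
    return {
        frame: [
            i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops
        ]
        for frame in range(3)
    }


def has_open_frame(sequence: str, table: int = 5) -> bool:
    """True if at least one forward frame contains no stop codon."""
    return any(not hits for hits in stop_codons_by_frame(sequence, table).values())


# Toy sequences for illustration only.
plausible = "ATGGCATAAGCA"      # stop in frame 0 only
suspicious = "TAAATAAATAAA"     # stops in all three frames

print(stop_codons_by_frame(plausible))
print("open frame:", has_open_frame(plausible))
print("open frame:", has_open_frame(suspicious))
```

Passing this scan does not rule out a NUMT, since some pseudogene copies retain an intact frame, so it complements chromatogram review rather than replacing it.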
Recent radiations & incomplete lineage sorting. Expect shallow divergences and shared haplotypes; prefer cautious wording or a BIN-level label over a forced species name.
Short or degraded templates. Mini-barcodes rescue archival or processed samples but reduce discriminative power. Report the shorter alignment and the extra caution you applied.
Database drift. As archives grow, E-values change and new near-matches appear. Record the database date/version in your report.
A schematic of short COI targets shows how mini-barcodes enable identification from processed or degraded material. (Shokralla S. et al. (2015) Scientific Reports).
Copy-paste report template (drop-in to your SOP)
Use or adapt this one-page skeleton to standardize audit-ready species ID reporting. It foregrounds identity with span, E-value, BIN context, vouchers, and geographic plausibility.
1) Specimen & matrix
- Sample ID; matrix (tissue, fin clip, leaf, powder); collection method.
2) Marker(s) & amplicon
- Locus (e.g., COI / rbcL / matK / ITS/ITS2); expected amplicon length; primers used; platform.
3) Sequencing & QC
- Trimming rules; chromatogram review; amino-acid translation screen for COI; notes on stop codons/indels; negative/positive control outcomes.
4) Database searches (date + version)
- BOLD curated set — top three hits with % identity + aligned length; voucher status; BIN page link (if COI).
- GenBank (nt) — top three hits with % identity + aligned length + E-value; voucher status.
5) BIN context (COI only)
- BIN ID; concordant/discordant; short note on conflicts and how they were handled.
6) Voucher & geography
- Voucher accession or image voucher URL; known range; any discrepancies noted.
7) Interpretation statement (one paragraph)
- Best-supported identification under research-use-only (RUO) conditions; uncertainty notes (hybridization, recent radiation, reference gaps).
8) Compliance metadata
- collection-date, geo_loc_name, collector, permits if applicable; project DOI or accession URL.
9) Reviewer & sign-off
- Analyst, reviewer, date; SOP/report template version.
This structure makes your reasoning transparent and reusable across audits and sectors.
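If your SOP stores reports electronically, the template sections can be mirrored in a small data structure and exported as JSON. The sketch below is one possible layout, not a standard schema: every class name, field name, and value is a hypothetical placeholder that you would adapt to your own template.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DatabaseHit:
    database: str                    # e.g. "BOLD" or "GenBank nt"
    database_version: str            # release label or download date
    accession: str
    percent_identity: float
    aligned_length: int
    evalue: Optional[float] = None   # the template lists E-value for GenBank hits
    voucher: Optional[str] = None


@dataclass
class BarcodeReport:
    sample_id: str
    matrix: str
    locus: str
    qc_notes: str
    search_date: str
    interpretation: str
    analyst: str
    reviewer: str
    template_version: str
    hits: List[DatabaseHit] = field(default_factory=list)
    bin_id: Optional[str] = None     # COI only
    bin_status: Optional[str] = None
    known_range: Optional[str] = None
    collection_date: Optional[str] = None
    geo_loc_name: Optional[str] = None


# Placeholder values throughout; nothing here is real data.
report = BarcodeReport(
    sample_id="S-0001",
    matrix="fin clip",
    locus="COI",
    qc_notes="clean translation; controls passed",
    search_date="2024-05-17",
    interpretation="Best-supported identification; no conflicts noted.",
    analyst="A. Analyst",
    reviewer="R. Reviewer",
    template_version="v1",
    hits=[
        DatabaseHit("GenBank nt", "2024-05-01", "REF0001", 97.87, 658,
                    evalue=0.0, voucher="MUSEUM:12345"),
    ],
    bin_id="BOLD:AAA0001",
    bin_status="concordant",
    known_range="placeholder range",
    collection_date="2024-04-30",
    geo_loc_name="Canada: Ontario",
)

report_json = json.dumps(asdict(report), indent=2)
round_trip = json.loads(report_json)
print(report_json)
```

Making the database version and search date required fields means a report cannot be serialized without the provenance a later reviewer will need.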
Applying thresholds responsibly (illustrative scenarios)
High identity, short span. 99.3% over 160 bp (mini-barcode) with a very small E-value and a voucher-linked top hit. Report a provisional ID, explain the short span, and state what additional evidence (longer locus, second marker) would upgrade confidence.
The universal mini-barcode region within COI illustrates why short amplicons aid recovery yet require cautious interpretation. (Meusnier I. et al. (2008) BMC Genomics).
Moderate identity, full span. 97.8% over 658 bp with clean translation and several voucher-linked hits within the known range. Report as best-supported and cite the BIN with a note on concordance; list plausible alternatives if they exist.
Conflicting top hits. Two best hits at 98.5% over 600+ bp from adjacent ranges that share a BIN. Report at BIN level and state the evidence needed to resolve (e.g., nuclear locus, expert specimen exam).
Label any numerical threshold you show as "working range for this project", not a universal cutoff. Justify it using clade divergence, marker behavior, and reference density.
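The three scenarios above can be expressed as a small decision helper, which is useful mainly for keeping wording consistent across analysts. This is a sketch under explicit assumptions: the `Evidence` fields, the `report_level` function, and especially the two numeric constants are hypothetical working ranges for a single imagined project, not recommended cutoffs.

```python
from dataclasses import dataclass

# Hypothetical working ranges for ONE project; justify your own per clade and locus.
MIN_FULL_SPAN_BP = 500
MIN_IDENTITY_PERCENT = 97.0


@dataclass
class Evidence:
    percent_identity: float
    aligned_length: int
    voucher_linked: bool
    in_known_range: bool
    species_among_top_hits: int  # distinct names among near-tied top hits in one BIN


def report_level(e: Evidence) -> str:
    """Map combined evidence to a reporting level (illustrative logic only)."""
    if not e.in_known_range:
        return "review: geographic mismatch"
    if e.species_among_top_hits > 1:
        return "BIN-level"
    if e.percent_identity < MIN_IDENTITY_PERCENT:
        return "no confident match"
    if e.aligned_length < MIN_FULL_SPAN_BP or not e.voucher_linked:
        return "provisional"
    return "best-supported"


short_span = Evidence(99.3, 160, True, True, 1)
full_span = Evidence(97.8, 658, True, True, 1)
conflict = Evidence(98.5, 600, True, True, 2)

for name, ev in [("short span", short_span), ("full span", full_span),
                 ("conflicting hits", conflict)]:
    print(f"{name}: {report_level(ev)}")
```

Note that the 99.3% short-span hit ranks below the 97.8% full-span hit, which is the point of pairing identity with aligned length.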
FAQ
There is no universal cutoff. Combine % identity with aligned length and E-value, then add voucher evidence and BIN context (for animal COI). If you adopt a working threshold, justify it for your clade and locus.
Report both when possible: the Latin name if the BIN is concordant, and the BIN as a stable operational label when names conflict or are unsettled.
Because alignment span, database size, and reference quality differ. E-value scales with search space, and short alignments inflate identity. Voucher linkage and geography can also change the interpretation.
NUMTs can mimic mitochondrial hits but often show frameshifts or stop codons; hybridization/introgression can blur species boundaries. Screen translations, evaluate the barcode gap and BIN context, and add a second locus when needed.
Provide collection-date and geo_loc_name in current INSDC formats, plus voucher and permit details where relevant. These fields improve reproducibility, searchability, and long-term reuse.
Action — turn scores into decisions people accept
- Adopt the report template above and paste it into your SOP so every project presents identity with span, E-value, BIN context, vouchers, and geography.
- Harden your references using Reference Libraries: BOLD & GenBank Best Practices to improve voucher coverage and metadata quality.
- If your samples are mixed (water, soil, feces), classic barcoding will struggle; see DNA Barcoding vs Metabarcoding: Which Fits Your Study? to select the right pipeline.
- Need an audit-ready path now? Share target taxa, matrices, and reporting needs via our DNA Barcoding Service. We'll recommend marker sets, acceptance criteria, and a reporting format aligned to your stakeholders.
RUO reminder: All services and deliverables are for research use only.
References
- Ratnasingham, S., Hebert, P.D.N. A DNA-based registry for all animal species: The Barcode Index Number (BIN) system. PLOS ONE 8, e66213 (2013).
- Pentinsaari, M., Hebert, P.D.N., Mutanen, M. Barcoding Beetles: A Regional Survey of 1872 Species Reveals High Identification Success and Unusually Deep Interspecific Divergences. PLOS ONE 9, e108651 (2014).
- Shokralla, S., Hellberg, R.S., Handy, S.M., King, I., Hajibabaei, M. A DNA mini-barcoding system for authentication of processed fish products. Scientific Reports 5, 15894 (2015).
- Huemer, P., Mutanen, M., Sefc, K.M., Hebert, P.D.N. Testing DNA Barcode Performance in 1000 Species of European Lepidoptera: Large Geographic Distances Have Small Genetic Impacts. PLOS ONE 9, e115774 (2014).
- Meusnier, I., Singer, G.A.C., Landry, J.-F., Hickey, D.A., Hebert, P.D.N., Hajibabaei, M. A universal DNA mini-barcode for biodiversity analysis. BMC Genomics 9, 214 (2008).
- Collins, R.A., Cruickshank, R.H. The seven deadly sins of DNA barcoding. Molecular Ecology Resources 13, 969–975 (2013).
- Clark, K., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W. GenBank. Nucleic Acids Research 44, D67–D72 (2016).