Building and Using Reference Libraries: BOLD & GenBank Best Practices
This guide helps bioinformatics leads and QA/QC managers build audit-ready DNA barcoding reference libraries for reliable species identification. We cover the DNA barcoding process from curation in BOLD Systems to submission in GenBank, including voucher anchoring, BINs vs. species names, and metadata that improves reproducibility and searchability.
Why reference libraries decide your ID
If your reference library is thin, mislabeled, or poorly annotated, even perfect chromatograms can produce shaky calls. In day-to-day practice, reference quality is the strongest lever you control. Curate what you search against, then show how you reached each identification.
Use both platforms for complementary strengths
- BOLD Systems (curation-first). Barcode-centric tools such as Barcode Gap/Distance summaries and the Alignment Browser make it easier to spot contamination, frameshifts, or weak separation. For animal COI, BINs (Barcode Index Numbers) add cluster context and expose taxonomy concordance/discordance before anything goes public.
BOLD's RESL algorithm clusters COI records into BINs in a standardized, five-stage workflow that supports rapid, automated assignment. (Ratnasingham S. & Hebert P.D.N. (2013) PLOS ONE)
- GenBank (archive-first). The global archive used by journals and pipelines. It enforces structured source modifiers and expects key collection metadata (for example, collection_date and geo_loc_name). Submitting to GenBank maximizes discoverability and downstream reuse.
Why this dual-platform plan works
- Curate with barcode-aware diagnostics in BOLD → publish/update in an indexed archive (GenBank).
- Keep reporting consistent by tying sequences to voucher specimens, documenting aligned length + % identity, and recording decisions when BINs and names disagree.
Related reading: Define acceptance criteria in How Does DNA Barcoding Work? and choose loci with the Marker Selection Cheat Sheet. If your samples are mixed (water, soil, feces), compare approaches in DNA Barcoding vs Metabarcoding.
Build & annotate your reference set (vouchers, metadata, BARCODE/INSDC)
Definition. A voucher-anchored reference library is a set of barcode sequences linked to curated specimens and compliant metadata so results can be audited and reused.
Before you run any search, make sure each record meets basic BARCODE/INSDC expectations. Doing this early prevents mislabels from spreading through future projects and speeds up submissions.
What "good" looks like
- Voucher anchoring. Provide a resolvable specimen_voucher (institution:collection:catalog). If tissue is fully consumed, add an image voucher and a stable specimen page URL.
- Where and when. Record geo_loc_name with a controlled place name and collection_date in a valid format (for example, ISO-style YYYY-MM-DD). These fields improve search precision and are commonly required in modern submissions.
K2P distance histograms illustrate a clear barcode gap for COI compared to other mitochondrial loci in parapatric bird species. (Aliabadian M. et al. (2009) PLOS ONE).
- Locus & length. Name the barcode region (COI, rbcL, matK, ITS/ITS2) and confirm the expected amplicon length for the clade you study.
- Sequence quality. Include chromatogram trace files. As a general rule, ensure a high proportion of contiguous, high-quality bases in the approved region; shorter sequences may be acceptable when clearly justified.
- Provenance. Add the collector, the institution housing the specimen, permits (if applicable), and a project DOI or accession URL.
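The checks in the list above can be automated before export. The following is a minimal sketch, not an official validator: the record keys, the `validate_record` helper, and the tiny `ALLOWED_COUNTRIES` set are all hypothetical placeholders, and the accepted date layouts are an assumption that should be confirmed against current submission guidance.

```python
from datetime import datetime

# Placeholder subset; a real pipeline would load the full controlled country list.
ALLOWED_COUNTRIES = {"Canada", "Germany", "Iran"}

# Date layouts assumed acceptable here (day-month-year, month-year, year, ISO forms).
# Note: %b month abbreviations depend on the active locale (English assumed).
DATE_FORMATS = ("%d-%b-%Y", "%b-%Y", "%Y", "%Y-%m-%d", "%Y-%m")


def check_voucher(value):
    """Accept institution:catalog or institution:collection:catalog."""
    parts = value.split(":")
    return len(parts) in (2, 3) and all(p.strip() for p in parts)


def check_collection_date(value):
    """True if the value parses under any of the assumed layouts."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def check_geo_loc_name(value):
    """The text before the first colon must be a controlled country name."""
    country = value.split(":", 1)[0].strip()
    return country in ALLOWED_COUNTRIES


def validate_record(rec):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if not check_voucher(rec.get("specimen_voucher", "")):
        problems.append("specimen_voucher is not structured")
    if not check_collection_date(rec.get("collection_date", "")):
        problems.append("collection_date is not in a recognized format")
    if not check_geo_loc_name(rec.get("geo_loc_name", "")):
        problems.append("geo_loc_name does not start with a controlled country")
    return problems


good = {
    "specimen_voucher": "ZSM:Lep:12345",
    "collection_date": "2021-05-03",
    "geo_loc_name": "Germany: Bavaria, Munich",
}
bad = {
    "specimen_voucher": "12345",
    "collection_date": "May 3, 2021",
    "geo_loc_name": "Bavaria",
}
print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems rather than a single pass/fail flag makes it easier to fix every issue in one pass before running diagnostics.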
Practical build steps in BOLD
- Create or import Specimen Data: sample ID, voucher code, taxonomy, institution, collection site, and country/location.
- Attach sequences and trace files, then run BOLD diagnostics (Alignment, Composition, Gap/Distance).
- Iterate on metadata and sequence issues before public release or export to GenBank.
End-to-end barcoding pipeline—from sample processing to BIN-based identification—mirrors the practical diagnostics run in BOLD. (Morinière J. et al. (2016) PLOS ONE).
Curate & validate in BOLD (gap tools, alignment, BINs; how to report names)
Diagnose problems before they ship
- Barcode Gap/Distance. Confirm that inter-specific divergence exceeds intra-specific divergence for your chosen locus. Weak gaps often mean you need more vouchers, a second locus, or both.
- Alignment Browser & Sequence Composition. Inspect frameshifts, premature stop codons in COI, and indel patterns that hint at contamination, pseudogenes (NUMTs), or editing errors.
- BIN Discordance. At the project level, run a discordance check to surface conflicting species names within the same COI cluster. Resolve labels, annotate uncertainty, or flag the clade for expert review.
Pairs sharing a single BIN reveal where discordant names occur and how geographic proximity relates to barcode clustering. (Hausmann A. et al. (2013) PLOS ONE).
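Outside the BOLD interface, the same discordance idea can be reproduced on an exported record table. This is a minimal sketch assuming each record is a dictionary with hypothetical `bin_uri` and `species` keys; the BIN identifiers and names below are made up for illustration and it is not BOLD's own implementation.

```python
from collections import defaultdict


def bin_discordance(records):
    """Return {bin_uri: sorted species names} for BINs holding more than one name."""
    names_by_bin = defaultdict(set)
    for rec in records:
        # Skip records that lack a BIN assignment or a species-level name.
        if rec.get("bin_uri") and rec.get("species"):
            names_by_bin[rec["bin_uri"]].add(rec["species"])
    return {b: sorted(n) for b, n in names_by_bin.items() if len(n) > 1}


records = [
    {"sample_id": "S1", "bin_uri": "BOLD:AAA0001", "species": "Genus alpha"},
    {"sample_id": "S2", "bin_uri": "BOLD:AAA0001", "species": "Genus beta"},
    {"sample_id": "S3", "bin_uri": "BOLD:AAA0002", "species": "Genus gamma"},
    {"sample_id": "S4", "bin_uri": "BOLD:AAA0002", "species": "Genus gamma"},
    {"sample_id": "S5", "bin_uri": None, "species": "Genus delta"},
]

for bin_uri, names in bin_discordance(records).items():
    print(bin_uri, "->", ", ".join(names))
```

Each flagged BIN is a candidate for relabeling, an uncertainty annotation, or expert review, as described above.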
Reporting when taxonomy is unsettled
Species names change. Clades split and merge. For animal COI, BINs provide operational units that frequently align with species boundaries. When a BIN contains conflicting names—or when names are unsettled—report both the BIN and your evidence:
- Match metrics: report % identity together with aligned length.
- Record quality: prefer voucher-linked references and museum/herbarium accessions.
- Geographic plausibility: note whether top matches fall within the expected range.
- Decision note: add one sentence on why you favored a particular name, or why you retained BIN-level reporting.
A documentation pattern reviewers accept
- Which databases you searched (BOLD reference sets, GenBank) and when.
- Aligned length + % identity per locus (not just a single number).
- Voucher accession or image voucher URL for the top match.
- BIN and name decision with a short rationale.
- A change log in the BOLD project before export and submission.
If your barcode gap remains weak after curation, revisit locus choice using the Marker Selection Cheat Sheet.
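To make "weak gap" concrete, the sketch below computes Kimura 2-parameter (K2P) distances on pre-aligned sequences and compares the largest within-species distance with the smallest between-species distance. It is an illustration under stated assumptions, not a reproduction of BOLD's Barcode Gap tool: sequences are assumed to be aligned to equal length, sites with gaps or ambiguity codes are skipped, and the toy sequences and species labels are invented.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term = (1 - 2 * p - q) * math.sqrt(max(1 - 2 * q, 0.0))
    if term <= 0:
        return math.inf  # sequences too divergent for the model
    return -0.5 * math.log(term)


def barcode_gap(records):
    """records: list of (species, aligned_sequence). Returns (max_intra, min_inter)."""
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(k2p(s1, s2))
    return max(intra, default=0.0), min(inter, default=math.inf)


toy = [
    ("Genus alpha", "ACGTACGTAC"),
    ("Genus alpha", "ACGTACGTAT"),
    ("Genus beta", "TCGTTCGTTT"),
]
max_intra, min_inter = barcode_gap(toy)
print(f"max intraspecific: {max_intra:.4f}")
print(f"min interspecific: {min_inter:.4f}")
print("gap present" if min_inter > max_intra else "no gap: add vouchers or a second locus")
```

A gap computed from a handful of records says little; the comparison becomes meaningful only once each species is represented by several vouchered sequences.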
Interpret matches & avoid failure modes
Interpretation beats automation
A single percent identity can mislead. Always combine identity with aligned length and coverage. Review several top hits rather than trusting the first record. Favor voucher-linked entries and cross-check geography. This approach reduces false positives in routine DNA barcoding projects.
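The paragraph above can be turned into a small, explicit decision rule. The following is a sketch only: the `Hit` fields, the `assess` function, and the 98% identity and 90% coverage thresholds are hypothetical and should be replaced by your own acceptance criteria.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    species: str
    identity_pct: float
    aligned_length: int
    vouchered: bool
    in_expected_range: bool


def assess(hits, query_length, min_identity=98.0, min_coverage=0.9, top_n=5):
    """Review the top hits and return (call, candidate names)."""
    top = sorted(hits, key=lambda h: h.identity_pct, reverse=True)[:top_n]
    # Identity alone is not enough: the alignment must also cover most of the query.
    supported = [
        h for h in top
        if h.identity_pct >= min_identity
        and h.aligned_length / query_length >= min_coverage
    ]
    if not supported:
        return "no confident match", []
    # Prefer voucher-linked, geographically plausible references when available.
    preferred = [h for h in supported if h.vouchered and h.in_expected_range] or supported
    names = sorted({h.species for h in preferred})
    return ("species-level" if len(names) == 1 else "ambiguous"), names


hits = [
    Hit("REF001", "Genus alpha", 99.8, 120, False, False),  # high identity, short alignment
    Hit("REF002", "Genus beta", 99.1, 640, True, True),
    Hit("REF003", "Genus beta", 98.9, 645, True, True),
]
print(assess(hits, query_length=650))
```

In this toy case the highest-identity record is discarded because it aligns over only a small part of the query, which is exactly the situation a single percent figure hides. Preferring vouchered hits over conflicting unvouchered ones is a design choice; a stricter rule would report "ambiguous" whenever any supported hit disagrees.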
Four failure modes to plan for
- Contamination or chimeras. Mixed peaks or stop codons in COI suggest a problem. Rerun extraction, confirm instrument hygiene, and check negative controls.
- Reference incompleteness. If your clade is undersampled, add local vouchers and publish to BOLD/GenBank. A few targeted accessions can unblock dozens of field IDs.
- Misannotated names. Use BIN discordance to triage conflicts, annotate uncertainty, and—when appropriate—reach out to data owners with your evidence.
- Metadata drift. Legacy fields or inconsistent formats (for example, older "country" strings or ill-formed dates) create submission errors and poor search behavior. Standardize to controlled geo_loc_name values and valid date formats in your export templates.
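For the contamination item in the list above, a quick stop-codon scan on unaligned COI reads can catch likely pseudogenes or editing errors before upload. This is a rough sketch, assuming the standard mitochondrial stop codons for invertebrates (NCBI translation table 5) and vertebrates (table 2); the function names are hypothetical, and a sequence is flagged only when every forward reading frame contains a stop.

```python
MITO_STOPS = {
    "invertebrate": {"TAA", "TAG"},                # NCBI translation table 5
    "vertebrate": {"TAA", "TAG", "AGA", "AGG"},    # NCBI translation table 2
}


def stops_in_frame(seq, frame, stops):
    """Return start positions of stop codons in one forward reading frame."""
    seq = seq.upper().replace("-", "")
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops]


def flag_sequence(seq, code="invertebrate"):
    """Flag a sequence when all three forward frames contain a stop codon."""
    stops = MITO_STOPS[code]
    per_frame = {f: stops_in_frame(seq, f, stops) for f in range(3)}
    return all(per_frame.values()), per_frame


# TGA is read as tryptophan in these mitochondrial codes, so it is not a stop here.
clean = "GGAGCTTGAGCAGGA"
suspect = "TAAATAGGTAAC"

print(flag_sequence(clean))
print(flag_sequence(suspect))
```

A flagged read is a prompt to inspect the trace and alignment, not proof of a NUMT; reads on the reverse strand should be reverse-complemented before scanning.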
Controlled vocabularies help
Build to controlled vocabularies for geo_loc_name, specimen_voucher, and other qualifiers from the start. Harmonized, machine-readable fields accelerate submission and make your records more useful to others.
If your samples contain DNA from multiple organisms, classic barcoding will struggle. For environmental DNA and community profiling, see DNA Barcoding vs Metabarcoding and use pipelines designed for mixed templates.
FAQ
Q1. When should I cite a BIN instead of a species name?
Use a BIN when names within the cluster disagree or taxonomy is unsettled. Report the BIN plus your evidence (identity, aligned length, vouchers, range), and revisit the name as curation improves.
Q2. Which voucher fields matter most for BOLD and GenBank?
Provide a resolvable specimen_voucher (institution:collection:catalog). Add geo_loc_name with a controlled place name and collection_date in a valid format. If tissue is consumed, include an image voucher.
Q3. Are there sequence quality expectations for "BARCODE" records?
Yes. The standard expects a high proportion of contiguous, high-quality bases in the approved region, the locus name, and trace files. Shorter sequences can be acceptable when clearly justified by guidance and use case.
Q4. Did GenBank change location fields, and are these mandatory?
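Yes, the field changed: geo_loc_name replaces the older country qualifier for place names, written as a controlled country name optionally followed by a colon and finer locality (for example, Country: region, locality). Modern submissions are generally expected to include it together with collection_date, so treat both as required unless current GenBank guidance grants an exemption.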
Q5. How should we interpret percent identity across BOLD and GenBank?
Treat % identity as one signal. Combine it with aligned length, record quality (voucher vs. unvouchered), and geography. This produces clearer, more defensible conclusions.
Copy-this QA block (paste into your SOP—no downloads needed)
- Voucher present (specimen_voucher structured; image voucher if tissue consumed)
- Where & when captured (geo_loc_name standardized; collection_date in valid format)
- Locus named & expected length (COI, rbcL, matK, ITS/ITS2; clade-appropriate amplicon)
- Quality met (contiguous, high-quality bases; trace files attached)
- BIN check complete (no unresolved discordance—or rationale documented)
- Top hits reviewed (identity + aligned length + voucher status + geography)
- Export validated (BOLD project metadata consistent; GenBank source table passes checks)
Next steps
- Paste the QA block into your SOP and update export templates to standardized place names and valid date formats.
- Triage your current dataset in BOLD: run Barcode Gap/Distance, inspect alignments, and resolve BIN discordance before public release or GenBank submission.
- When you want a guided path, outline your target taxa, matrices, and reporting needs via our DNA Barcoding Service. We'll propose a curation plan, locus and primer expectations, and a reporting format aligned to your stakeholders.
RUO reminder: All services and deliverables are for research use only and are not intended for clinical diagnosis, treatment, or individual health assessment.
References
- Ratnasingham, S., Hebert, P.D.N. A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. PLOS ONE 8(7), e66213 (2013).
- Hausmann, A., Godfray, H.C.J., Huemer, P. et al. Genetic Patterns in European Geometrid Moths Revealed by the Barcode Index Number (BIN) System. PLOS ONE 8(12), e84518 (2013).
- Morinière, J., Cancian de Araujo, B., Lam, A.W. et al. Species Identification in Malaise Trap Samples by DNA Barcoding Based on NGS Technologies and a Scoring Matrix. PLOS ONE 11(5), e0155497 (2016).
- Aliabadian, M., Kaboli, M., Nijman, V., Vences, M. Molecular Identification of Birds: Performance of Distance-Based DNA Barcoding in Three Genes to Delimit Parapatric Species. PLOS ONE 4(1), e4119 (2009).
- Clark, K., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W. GenBank. Nucleic Acids Research 44(D1), D67–D72 (2016).
- Schoch, C.L., Seifert, K.A., Huhndorf, S. et al. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proceedings of the National Academy of Sciences 109(16), 6241–6246 (2012).