DAP-Seq in Multi-Genome Plant Projects: Mapping TF Binding Without Overcomplicating the Design

Summary

Designing a DAP-seq study of plant TF binding across multiple genomes introduces challenges absent from single-genome applications: read-mapping ambiguity across subgenomes or paralogous sequences, homeologous TF cloning decisions, reference genome selection for polyploid species, and downstream interpretation of subgenome-specific binding differences. This guide provides a practical, scenario-based framework for plant molecular biology teams planning DAP-seq in multi-genome contexts, covering mapping strategy, DNA library decisions, homeolog cloning logic, sample preparation options, and QC benchmarks.

Key Takeaways

A wheat DAP-seq study profiling 189 TFs from 30 families produced a regulatory network covering 3,714,431 regulatory elements; TF family success rates varied substantially — AP2 high-confidence, B3 low-confidence — providing a practical benchmark for project planning (Zhang et al., Nature Communications, 2022)
Subgenome-divergent TF binding sites in wheat are driven predominantly by differential transposable element expansions in diploid progenitors, not by divergence in TF DNA-binding domains — meaning homoeologous TF proteins often bind similar motifs, and subgenome-specific binding patterns are largely a property of the genomic context, not the protein
The choice between amplified and non-amplified gDNA library determines whether endogenous DNA methylation is preserved; this decision must be made before library construction and directly affects whether methylation-sensitive TF binding is detectable in the data
Multi-DAP-seq (Baumgart et al., Nature Methods, 2021) eliminates the ORF cloning bottleneck through biotinylated lysine incorporation and multiplexed barcoding — enabling simultaneous profiling of multiple TFs and species at 40× higher throughput than standard DAP-seq
For cross-species binding comparison, the same recombinant TF protein can be incubated with gDNA libraries from two different species, using each species' reference genome for independent peak calling
Pairing DAP-seq with ATAC-seq or RNA-seq from the same tissue provides the in vivo context required to prioritize biologically active binding sites from the in vitro binding map

Why Multi-Genome Contexts Require a Different DAP-seq Design Approach

The standard DAP-seq protocol (in vitro TF expression, gDNA library construction, affinity purification, and sequencing) was originally designed and validated in Arabidopsis thaliana, a diploid with a compact, well-annotated genome of approximately 135 Mb (Bartlett et al., Nature Protocols, 2017). The move to multi-genome plant species introduces three structural challenges that do not exist in that context. Getting any one of them wrong does not simply reduce data quality; it can make the resulting dataset uninterpretable.

For a technical introduction to how DAP-seq works, including the affinity purification mechanism and library preparation logic, see our overview resource. For detailed DAP-seq workflow and analysis steps, covering alignment, peak calling, and motif enrichment, see our protocol resource.

Figure 1 (diagram of three key design challenges in multi-genome DAP-seq projects). Three design decisions determine whether a multi-genome DAP-seq project produces interpretable data or irresolvable mapping ambiguity.

The Mapping Problem: When Reads Don't Know Which Subgenome They Came From

In hexaploid wheat (Triticum aestivum), the A, B, and D subgenomes each contain homeologous copies of most genes. A sequencing read derived from a DAP-seq library cannot, by default, be assigned to a specific subgenome if the corresponding sequence exists in two or three nearly identical copies. Standard alignment tools applied without subgenome-aware settings will either multi-map these reads across all copies — inflating apparent signal at all homeologous loci — or discard them as ambiguous, causing systematic underrepresentation of homeologous binding sites.

The solution is not a single tool but a strategy: use a subgenome-resolved reference genome (such as IWGSC RefSeq v1.0 for wheat, which assigns chromosomes to specific subgenomes) combined with strict unique-mapping filtering. This preserves reads that are unambiguously assignable to one subgenome while discarding ambiguous reads. The proportion of uniquely mapped reads is itself an informative QC metric — unusually low unique-mapping rates signal either a reference genome assembly problem or excessive sequence conservation between subgenomes at the profiled loci.
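The unique-mapping filter described above can be sketched in a few lines of Python that read SAM-format text. This is a minimal illustration, not a production pipeline. It assumes chromosome names end in the subgenome letter (for example `chr1A`), and it uses a MAPQ cutoff of 30 as a stand-in for "uniquely mapped". Both are assumptions: MAPQ semantics differ between aligners, so the cutoff should be checked against the aligner actually used. The function names are hypothetical.

```python
from collections import Counter

MIN_MAPQ = 30  # assumed cutoff; the meaning of MAPQ differs between aligners


def subgenome_of(chrom):
    """Return 'A', 'B', or 'D' for names like 'chr1A'; None for anything else."""
    last = chrom[-1:].upper()
    return last if last in ("A", "B", "D") else None


def count_unique_by_subgenome(sam_lines, min_mapq=MIN_MAPQ):
    """Tally primary alignments by subgenome after a MAPQ filter."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, mapq = int(fields[1]), fields[2], int(fields[4])
        if flag & 0x4:  # read is unmapped
            counts["unmapped"] += 1
        elif flag & 0x900:  # secondary or supplementary record: skip
            continue
        elif mapq < min_mapq:  # treated as ambiguous between copies
            counts["ambiguous"] += 1
        else:
            counts[subgenome_of(chrom) or "unassigned"] += 1
    return counts


# Toy SAM records (hypothetical reads) to show the behaviour.
example = [
    "@SQ\tSN:chr1A\tLN:1000",
    "r1\t0\tchr1A\t100\t42\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1B\t100\t0\t50M\t*\t0\t0\t*\t*",
    "r3\t16\tchr1D\t200\t37\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r5\t256\tchr1B\t100\t0\t50M\t*\t0\t0\t*\t*",
]
print(count_unique_by_subgenome(example))
```

Keeping the "ambiguous" tally rather than silently dropping those reads makes the unique-mapping proportion available as a QC metric later in the analysis.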

The Cloning Problem: One TF Family, Three Homeologs

Most TF genes in hexaploid wheat are present as three homeologous copies, one per subgenome. A researcher profiling, for example, a WRKY TF in wheat must decide whether to clone and express all three homeologs separately, clone one as a representative, or use a clone-free approach. The right answer depends on what the research question actually requires, and on what the existing evidence says about homeolog binding conservation in the target TF family.

Zhang et al. (Nature Communications, 2022) demonstrated that binding specificities of homoeologous TFs are generally conserved across subgenomes in wheat — the subgenome divergence in TF binding patterns they observed was predominantly a property of the regulatory element sequences (shaped by differential TE expansions), not of the TF proteins themselves. This finding is directly actionable: for most TF families where binding domain conservation is high, profiling one well-characterized homeolog provides a valid first approximation of the binding landscape, with selective follow-up of the remaining homeologs where divergence is suspected.

Scenario 1 — One TF, Two Genomes: Cross-Species Binding Comparison

The most common multi-genome DAP-seq scenario in plant molecular biology is not polyploidy but cross-species comparison: the same TF family profiled in two related species to identify conserved versus species-specific regulatory elements. This design has been applied to compare Arabidopsis and maize ARF family binding (Galli et al., Nature Communications, 2018), and more recently to study SPL/SBP family binding divergence across multiple cereal species. For a survey of published DAP-seq applications in plant species including cross-species comparisons, see our applications resource.

Using the Same ORF Expression Construct Across Two Reference Genomes

The key experimental efficiency of cross-species DAP-seq is that the same recombinant TF protein batch can be used for parallel incubation with two separate gDNA libraries — one from each species. Because DAP-seq is an in vitro method that does not require living cells or species-specific antibodies, the same tagged TF protein produced in a wheat germ or rabbit reticulocyte cell-free expression system can bind to gDNA prepared from rice, maize, or any other plant species for which a library can be constructed.

The practical implication is significant: the bottleneck in cross-species DAP-seq is not protein preparation (one construct, two binding reactions) but gDNA library quality and reference genome annotation quality for the second species. Both gDNA libraries must meet the same fragment size distribution and input quantity requirements. Peak calling proceeds independently against each species' reference genome, and the resulting peak sets are then compared using ortholog-based alignment or conserved motif analysis.
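One simple form of the ortholog-based comparison mentioned above is to translate the target genes from species 1 into species 2 identifiers and compare the two sets. The sketch below assumes peaks have already been assigned to target genes and that a one-to-one ortholog table is available. The function name and the gene IDs are hypothetical placeholders, and genes without an ortholog in the table are excluded from the comparison rather than counted as species-specific.

```python
def compare_targets(targets_sp1, targets_sp2, orthologs):
    """Compare TF target genes between two species.

    orthologs maps a species-1 gene ID to its species-2 ortholog
    (one-to-one mapping assumed for simplicity).
    """
    # Species-1 targets expressed as species-2 gene IDs.
    mapped = {orthologs[g] for g in targets_sp1 if g in orthologs}
    # Only species-2 genes that have an ortholog can be compared fairly.
    comparable = set(orthologs.values())
    sp2 = set(targets_sp2) & comparable
    shared = mapped & sp2
    union = mapped | sp2
    return {
        "conserved": shared,
        "species1_only": mapped - sp2,
        "species2_only": sp2 - mapped,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical gene IDs, for illustration only.
result = compare_targets(
    targets_sp1={"Os1", "Os2", "Os3"},
    targets_sp2={"Zm1", "Zm4", "Zm9"},
    orthologs={"Os1": "Zm1", "Os2": "Zm2", "Os4": "Zm4"},
)
print(result["conserved"], round(result["jaccard"], 3))
```

Real ortholog relationships in plants are often one-to-many because of lineage-specific duplications, so a production analysis would need a policy for those cases.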

This design is particularly powerful for studying TF binding evolution. It answers questions such as: does a rice WRKY TF that regulates blast disease resistance bind the same genomic contexts in a related cereal species? Does the binding motif diverge between monocots and dicots for the same TF family? The experimental setup is straightforward; the analytical complexity scales with the quality of cross-species genomic alignments and ortholog annotations available for the species pair.

Reference Genome Quality for the Second Species

The limiting factor in cross-species DAP-seq is often not the experiment but the reference genome of the less-studied species. For rice (Oryza sativa Nipponbare, MSU7 or RAP-DB annotation) and maize (Zea mays B73 RefGen v5), high-quality chromosome-level assemblies with comprehensive gene models are available and well-suited to DAP-seq peak calling. For less-established crop species or specific cultivar genomes, annotation quality directly caps the interpretability of peak-to-gene assignments.

Before committing to a cross-species project, the reference genome of the second species should be evaluated for: chromosome-level assembly completeness, TSS annotation quality for DAP-seq peak-to-promoter assignment, and availability of repeat annotation (required for filtering TE-derived peaks from genuine regulatory signals). For species where only draft assemblies exist, DAP-seq peak calling remains technically feasible but functional interpretation is limited to well-annotated gene bodies and proximal promoter regions.

Scenario 2 — One TF, Three Subgenomes: Mapping in Allopolyploid Wheat or Cotton

The hexaploid wheat DAP-seq dataset assembled by Zhang et al. (Nature Communications, 2022) remains the most comprehensive published benchmark for allopolyploid multi-genome DAP-seq design. Their profiling of 189 TFs from 30 families — covering 107 highly expressed TFs and 82 functionally annotated or co-expression hub TFs — produced a regulatory network of 3,714,431 regulatory elements and generated 45 high-confidence (HC), 47 median-confidence (MC), and 97 low-confidence (LC) TF datasets. Understanding what determined these confidence tiers is directly applicable to planning new projects.

Figure 2 (decision matrix for homeologous TF cloning strategy in allopolyploid DAP-seq). Homeolog cloning strategy depends on whether binding conservation or subgenome divergence is the primary research question.

Subgenome-Resolved Reference Genomes: Which Assembly to Use

For hexaploid wheat, the IWGSC RefSeq assemblies of Chinese Spring (v1.0 and the later v2.1 release) are the standard subgenome-resolved references. Each chromosome is explicitly assigned to the A, B, or D subgenome, enabling subgenome-specific peak calling after strict unique-mapping filtering. The v2.1 release refined the assembly and improved gene model completeness and is preferable when available; whichever release is chosen, the assembly and the gene annotation should come from the same release so that peak and gene coordinates match.

For tetraploid cotton (Gossypium hirsutum TM-1, AD1 genome), a high-quality subgenome-resolved assembly is available and has been used for DAP-seq peak calling in comparative TF binding studies. The same unique-mapping filtering strategy applies: reads are assigned to A or D subgenome chromosomes, and multi-mapping reads between homeologous loci are discarded or handled separately.

An alternative strategy, particularly useful when the polyploid assembly quality is uncertain, is to map DAP-seq reads to the diploid progenitor genomes separately. For wheat, this means mapping independently to assemblies of Aegilops tauschii (D subgenome progenitor), Triticum urartu (A subgenome progenitor), and Aegilops speltoides (a close relative of the B subgenome donor). This approach sacrifices some context about polyploid-specific regulatory interactions but produces cleaner unique-mapping statistics and is often more appropriate for projects focused on progenitor-conserved binding patterns.

When Homoeologous Binding Is Similar Enough to Simplify Your Design

Zhang et al. 2022 made a key finding that simplifies multi-genome project planning: binding specificities of homoeologous TFs are generally conserved across the A, B, and D subgenomes. The subgenome divergence they observed in TF binding sites was predominantly driven by differential expansions of transposable elements in the diploid progenitors before polyploidization — not by divergence in the TF proteins' DNA-binding domains. AP2 family TFs showed the highest functional constraint in binding specificity; their binding profiles clustered tightly regardless of subgenome of origin.

Practically: if your research question is "where does this WRKY TF bind in the wheat genome" and you are not specifically investigating subgenome-specific regulatory divergence, profiling one high-quality homeolog first is the justified starting point. Confirm that the canonical binding motif is recovered at high confidence. Then, if the results suggest subgenome-specific differences are relevant — or if the TF family has known low success rates (B3 domain proteins, some MADS-box TFs that require cofactors) — expand to the remaining homeologs selectively.

When Subgenome Divergence Is the Research Question

When the explicit goal is to characterize subgenome-specific regulatory evolution — for example, asking whether stress-responsive TF binding has diverged between the A and D subgenomes after polyploidization — all three homeologs must be profiled independently. This is the design used in the Zhang et al. 2022 wheat study, and it is justified by the biological question, not by default.

In this scenario, ORF cloning must confirm the absence of chimeric sequences from other homeologs — a specific risk in polyploid genomes where PCR amplification of one homeolog's CDS can inadvertently incorporate short segments from the other two. Zhang et al. explicitly verified each clone by full-length cDNA sequencing before proceeding to DAP-seq. This verification step is not optional when subgenome specificity is the research question.

The analytical output is a three-way comparison of peak sets: peaks unique to the A homeolog, unique to B, unique to D, and shared across all three. Integration with TE annotation — specifically, identifying which subgenome-specific peaks overlap with lineage-specific TE insertions — is the recommended path to interpreting the biological basis of binding divergence.
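The three-way comparison can be illustrated with a small interval-merging sketch. It assumes each homeolog's peaks have been called against the same reference and are given as half-open `(chrom, start, end)` intervals tagged with the homeolog that produced them. The function names are hypothetical, and the merge is single-linkage: peaks are joined into one region whenever they chain together by overlap, which is one of several reasonable definitions of a shared site.

```python
from collections import Counter


def merge_labeled_peaks(peaks):
    """Merge overlapping peaks and record which homeologs contributed.

    peaks: iterable of (chrom, start, end, label), half-open coordinates.
    Returns a list of (chrom, start, end, frozenset_of_labels).
    """
    merged = []
    for chrom, start, end, label in sorted(peaks):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            # Overlaps the current region: extend it and add the label.
            c, s, e, labels = merged[-1]
            merged[-1] = (c, s, max(e, end), labels | {label})
        else:
            merged.append((chrom, start, end, frozenset([label])))
    return merged


def count_by_category(merged):
    """Count regions per label combination, e.g. 'A', 'BD', 'ABD'."""
    return Counter("".join(sorted(labels)) for _, _, _, labels in merged)


# Toy peaks; labels name the homeolog protein that produced each peak.
peaks = [
    ("chr1A", 100, 300, "A"),
    ("chr1A", 250, 400, "B"),
    ("chr1A", 380, 500, "D"),
    ("chr1A", 1000, 1200, "A"),
    ("chr2B", 50, 150, "B"),
    ("chr2B", 100, 200, "D"),
]
print(count_by_category(merge_labeled_peaks(peaks)))
```

The homeolog-specific regions (categories `A`, `B`, or `D` alone) are the ones to intersect with TE annotation when interpreting binding divergence.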

Scenario 3 — Self-Prepared vs Provider-Prepared Samples: What You Need to Decide Upfront

Before any multi-genome DAP-seq project begins, two preparation decisions determine the project structure: who constructs the gDNA library, and whether that library is amplified or non-amplified. These are not interchangeable convenience choices — each decision affects what biological questions the resulting data can address.

Figure 3 (flowchart comparing self-prepared and provider-prepared sample options). Both preparation paths converge on the same QC gate: confirmed TF protein expression before committing to full sequencing.

Self-Prepared gDNA Library: When It Makes Sense and What QC Is Required

Preparing the gDNA library in-house is appropriate when the project involves a species or cultivar where high-quality gDNA extraction requires specialized protocols, when the researcher needs to apply a tissue-specific extraction method to preserve relevant epigenomic features, or when the non-amplified library format is required to preserve endogenous DNA methylation.

The amplified versus non-amplified library decision deserves explicit attention. In the standard Bartlett et al. (2017) protocol, gDNA is fragmented to approximately 200 bp, end-repaired, and ligated to sequencing adapters. This library can be used directly (non-amplified) or subjected to PCR amplification before use. The non-amplified library retains the endogenous methylation pattern of the original genomic DNA, so the experiment can detect methylation-sensitive TF binding: cases where the TF binds preferentially to methylated or unmethylated cytosines (in plants, in CG, CHG, and CHH sequence contexts). The amplified library loses this information because PCR does not copy methylation marks, but it produces higher molecular yield, which may be necessary for polyploid species with large genomes requiring deeper input.

For multi-genome projects in species like wheat (genome ~16 Gb) or cotton (~2.5 Gb for tetraploid Gossypium hirsutum), the large genome size increases the minimum DNA input required for adequate library complexity. Non-amplified libraries from these species may not provide sufficient unique molecule counts without very high starting DNA quantities. This is a practical argument for amplified libraries in large genome contexts, unless methylation-sensitive binding is specifically the research question.
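A rough calculation shows why genome size matters for input. Using the standard conversion of about 978 Mb of double-stranded DNA per picogram, the same mass of gDNA contains far fewer genome copies for wheat than for Arabidopsis. The sketch below uses the genome sizes quoted in this guide (about 135 Mb and about 16 Gb) and a 5 µg input as an illustrative figure. The function name is hypothetical, and the result is an upper bound on copies per locus, since fragmentation and ligation losses are ignored.

```python
MB_PER_PG = 978.0  # about 978 Mb of double-stranded DNA per picogram


def genome_copies(input_ug, genome_size_mb):
    """Approximate number of genome copies in a given mass of gDNA."""
    pg_per_genome = genome_size_mb / MB_PER_PG
    input_pg = input_ug * 1e6  # 1 microgram = 1e6 picograms
    return input_pg / pg_per_genome


arabidopsis = genome_copies(5, 135)    # ~135 Mb genome
wheat = genome_copies(5, 16_000)       # ~16 Gb hexaploid genome
print(f"Arabidopsis: {arabidopsis:,.0f} copies")
print(f"Wheat:       {wheat:,.0f} copies")
print(f"Fold difference: {arabidopsis / wheat:.0f}x")
```

Under these assumptions, 5 µg corresponds to roughly 36 million Arabidopsis genome copies but only about 300,000 wheat genome copies, which is the quantitative basis for preferring higher input or amplified libraries in large genomes.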

Self-prepared libraries require pre-library QC: Bioanalyzer or TapeStation fragment size assessment to confirm the ~200 bp peak, Qubit fluorometric quantification (not spectrophotometric, which overestimates in low-purity preparations), and purity checks (A260/A280 ≥ 1.8; A260/A230 ≥ 1.8). Contamination with phenol or CTAB — common in plant gDNA extractions — inhibits downstream enzymatic steps and must be confirmed absent before library construction.
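The pre-library checks above can be written down as a simple gate so that they are applied the same way to every sample. This is a sketch only: the purity thresholds come from the text, while the ±50 bp window around the ~200 bp fragment peak and the function name are illustrative assumptions rather than protocol values.

```python
def gdna_qc_failures(fragment_peak_bp, a260_280, a260_230,
                     target_bp=200, tolerance_bp=50):
    """Return a list of failed checks; an empty list means the sample passes.

    tolerance_bp is an assumed window, not a published threshold.
    """
    failures = []
    if abs(fragment_peak_bp - target_bp) > tolerance_bp:
        failures.append("fragment size peak outside target window")
    if a260_280 < 1.8:
        failures.append("A260/A280 below 1.8")
    if a260_230 < 1.8:
        failures.append("A260/A230 below 1.8")
    return failures


print(gdna_qc_failures(205, 1.85, 2.0))  # passes all checks
print(gdna_qc_failures(320, 1.90, 1.2))  # fails size and A260/A230
```

A low A260/A230 ratio is the check most likely to flag the phenol or CTAB carryover mentioned above, so it is worth recording the value rather than only the pass or fail outcome.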

Provider-Prepared End-to-End: What You Still Need to Provide

End-to-end service for multi-genome DAP-seq covers gDNA library construction, TF in vitro expression, affinity purification, sequencing, and primary analysis. What the researcher must provide depends on whether the ORF expression construct already exists.

If the TF has been cloned previously in a Gateway-compatible vector with an appropriate affinity tag (HaloTag or HaloTag-6×His for standard DAP-seq; biotin acceptor peptide tag for biotin-DAP-seq), the expression construct can be submitted directly. If not, the ORF sequence must be provided for cloning. For polyploid species, it is critical to specify which homeolog's CDS is intended, because PCR amplification from hexaploid genomic DNA or cDNA without homeolog-specific primers can yield chimeric clones or a mixture of homeologs.

The minimum input requirement for standard DAP-seq is 5 µg of high-quality gDNA and 1 µg of ORF expression plasmid (Bartlett et al., 2017). For large polyploid genomes, the gDNA input recommendation may be higher to ensure sufficient library complexity across all subgenomes. For projects using the multi-DAP approach (Baumgart et al., Nature Methods, 2021), which uses biotinylated TF proteins expressed clone-free from PCR-amplified CDS templates, input requirements differ and should be confirmed with the service team before sample submission.

Contact our DAP-seq service team to confirm input requirements, homeolog cloning strategy, and library preparation approach for your specific species and genome configuration before committing samples.

QC Benchmarks and Failure Points in Multi-Genome DAP-seq

Quality assessment in multi-genome DAP-seq involves both the standard metrics applicable to any DAP-seq experiment and additional checks specific to multi-genome complexity. The Zhang et al. 2022 wheat study established a practical confidence tier framework — HC, MC, LC — that is directly applicable as a QC evaluation template.

TF Expression and Binding Verification Before Committing to Full Sequencing

The most common cause of DAP-seq failure, across all plant species, is inadequate TF protein expression or failure of the expressed protein to bind DNA in the in vitro system. In the large-scale Arabidopsis cistrome study, only a minority of assayed TFs (529 of 1,812, roughly 30%) produced datasets passing quality thresholds, with most failures attributed to TF-specific properties rather than technical processing errors (Bartlett et al., 2017). For multi-genome projects where multiple homeologs are being profiled simultaneously, this failure rate must be factored into experimental planning.

Before committing the full gDNA library to sequencing, TF protein expression should be verified by Western blot against the affinity tag. This is not optional for multi-genome projects — expression levels can vary substantially between homeologs cloned from different subgenomes, particularly when codon composition differs between the A, B, and D copies. A homeolog that expresses at low levels will produce a library dominated by non-specific background peaks regardless of how deeply it is sequenced.

For TF families with known low success rates in DAP-seq — B3 domain proteins, many MADS-box TFs, some bHLH family members — a pilot binding test using a subset of the gDNA library is recommended before proceeding to full sequencing. The pilot confirms whether the canonical binding motif is recoverable, which is the earliest checkpoint at which an experiment can be classified as HC, MC, or likely LC. Committing full sequencing cost to an LC dataset before this check is a recoverable mistake in a single-genome project; in a multi-genome project where multiple libraries are running simultaneously, it multiplies the cost of the failure.

Multi-Genome-Specific QC: Unique Mapping Rate and Motif Recovery as Confidence Tiers

For multi-genome DAP-seq, two metrics take on elevated importance beyond their standard role in single-genome QC:

Unique mapping rate is the fraction of total mapped reads that align unambiguously to a single genomic locus in the subgenome-resolved reference. In diploid species, this rate typically exceeds 85–90% for high-quality libraries. In hexaploid wheat, the same metric will be lower — how much lower depends on the sequence divergence between subgenomes at the profiled TF's target loci. An unusually low unique-mapping rate (below ~60–65%) may indicate that the reference assembly has regions of insufficient subgenome resolution, or that the peak set is dominated by highly conserved binding sites that cannot be subgenome-assigned. This metric should be reported and interpreted, not silently filtered.

Canonical motif recovery rate (whether the TF's known or predicted binding motif is identified de novo or significantly enriched in the peak set) is the primary confidence-tier classifier used by Zhang et al. 2022. A dataset where the canonical motif is identified de novo (HC) is more reliable than one where it is only enriched (MC), and substantially more reliable than one where it is not recovered (LC). For cross-species comparison projects, the motif should be independently assessed in both species' peak sets; motif conservation between species is itself an informative result. For a comparison of DAP-seq and ChIP-seq for plant TF mapping, including confidence tier expectations, see our method comparison resource. For a broader comparison of protein-DNA interaction methods covering ChIP-seq, CUT&Tag, CUT&RUN, and DAP-seq, see our multi-method guide.
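The two metrics above can be combined into a small reporting helper. The sketch follows the definitions in this section: unique-mapping rate is unique reads divided by mapped reads, and the tier follows the de novo / enriched / not recovered logic. The review thresholds (0.85 for diploids, 0.60 for polyploids) are taken from the approximate ranges quoted here and should be treated as adjustable assumptions. All function names are hypothetical.

```python
def unique_mapping_rate(unique_reads, mapped_reads):
    """Fraction of mapped reads that align to a single locus."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return unique_reads / mapped_reads


def needs_review(rate, polyploid, polyploid_floor=0.60, diploid_floor=0.85):
    """Flag rates below the assumed floor for the genome type."""
    floor = polyploid_floor if polyploid else diploid_floor
    return rate < floor


def confidence_tier(motif_de_novo, motif_enriched):
    """HC if the canonical motif is found de novo, MC if only enriched, else LC."""
    if motif_de_novo:
        return "HC"
    if motif_enriched:
        return "MC"
    return "LC"


# Toy read counts for a hexaploid wheat library.
rate = unique_mapping_rate(unique_reads=5_800_000, mapped_reads=10_000_000)
print(f"unique-mapping rate: {rate:.2f}, review: {needs_review(rate, polyploid=True)}")
print("tier:", confidence_tier(motif_de_novo=False, motif_enriched=True))
```

For a cross-species project, `confidence_tier` would be called once per species so that the two tiers can be reported side by side.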

Frequently Asked Questions

1) Can DAP-seq be used with polyploid plant genomes such as hexaploid wheat?

Yes. The most comprehensive published example is the Zhang et al. (Nature Communications, 2022) study profiling 189 TFs from 30 families against the hexaploid wheat genome, producing a regulatory network of over 3.7 million regulatory elements. The key technical requirements for polyploid DAP-seq are a subgenome-resolved reference genome, strict unique-mapping filtering, and homeolog-specific ORF cloning to avoid chimeric TF constructs. DAP-seq has also been applied to tetraploid cotton and is technically applicable to other allopolyploid crop species with high-quality chromosome-level assemblies.

2) How do I decide which homeologous TF to clone for DAP-seq in allopolyploid plants?

Start with the evidence on binding conservation. Zhang et al. 2022 demonstrated that homoeologous TF binding specificities are generally conserved across wheat subgenomes — the divergence in binding patterns between subgenomes is predominantly driven by differential TE content at the binding sites, not by divergence in the TF proteins. For most TF families, profiling one well-expressed, well-characterized homeolog provides a valid first approximation. The main exception is when the research question specifically addresses subgenome-specific binding divergence, or when the TF family is known to require cofactors for DNA binding (B3 domain, some MADS-box), in which case all homeologs should be profiled independently.

3) What reference genome should I use for DAP-seq peak calling in hexaploid wheat?

An IWGSC RefSeq assembly of Chinese Spring (v1.0 or the later v2.1), paired with the gene annotation from the same release, is the standard choice. These assemblies assign each chromosome to a specific subgenome (A, B, or D), enabling subgenome-specific peak calling after unique-mapping filtering. For projects focused on diploid progenitor biology, mapping to Aegilops tauschii (D genome) or Triticum urartu (A genome) assemblies separately is a valid alternative. Avoid using collapsed or haplotype-merged assemblies for subgenome-specific analysis; they defeat the purpose of subgenome-resolved mapping.

4) What is the difference between amplified and non-amplified gDNA libraries in DAP-seq?

Non-amplified gDNA libraries preserve the endogenous DNA methylation pattern of the original tissue. This allows the DAP-seq experiment to detect methylation-sensitive TF binding, that is, whether specific TFs preferentially bind methylated or unmethylated cytosines. Amplified libraries undergo PCR before use, which removes DNA methylation information but increases molecular yield, an important practical advantage for large polyploid genomes that require higher DNA input. The choice between amplified and non-amplified must be made before library construction and should be based on whether methylation-sensitive binding is part of the research question.

5) How many TFs can be profiled in a single multi-genome DAP-seq project?

Standard DAP-seq scales to tens of TFs per project with individual ORF cloning. Multi-DAP-seq (Baumgart et al., Nature Methods, 2021), a clone-free approach using biotinylated lysine incorporation and multiplexed barcoding, can profile TFs at 40× higher throughput and 10× lower cost than standard DAP-seq, enabling simultaneous profiling of hundreds of TFs and multiple species in a single experiment. The practical upper limit depends on the availability of ORF sequences, the genome size of the target species, and the sequencing depth available for multiplexed libraries. For projects requiring large-scale TF profiling across a polyploid genome, multi-DAP-seq is the current method of choice.

6) What QC steps are critical for multi-genome DAP-seq to produce high-confidence data?

Four QC steps are essential: (1) TF protein expression verification by Western blot against the affinity tag before library incubation; (2) canonical motif recovery assessment after a pilot peak-calling run — if the known binding motif is not recovered, the dataset is likely LC and should not proceed to full analysis; (3) unique-mapping rate assessment in the subgenome-resolved alignment — unusually low rates signal mapping strategy problems; and (4) fragment size distribution check of the input gDNA library to confirm the ~200 bp peak before TF incubation. For polyploid projects, homeolog clone verification by full-length cDNA sequencing adds a fifth critical checkpoint.

7) How should DAP-seq data from multi-genome plants be integrated with RNA-seq or ATAC-seq?

DAP-seq is an in vitro method: it maps where a TF can bind, not where it does bind in a specific tissue or developmental stage. Integration with in vivo data is essential for interpreting biological relevance. The recommended integration approach is to overlap DAP-seq peak sets with ATAC-seq open chromatin regions from the same species and tissue; peaks that fall within accessible chromatin are more likely to represent active in vivo binding sites. RNA-seq from the same tissue provides expression context: DAP-seq target genes (genes with TF binding sites in their promoters) that are also differentially expressed in conditions where the TF is active are high-priority candidates for follow-up. For integration workflows combining DAP-seq, ATAC-seq, and RNA-seq in plant stress research, see our plant epigenomics resource.
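The integration logic in this answer can be sketched as a simple scoring pass over DAP-seq peaks. The example assumes peaks and ATAC-seq regions are half-open intervals on the same reference, that each peak has already been assigned to a target gene, and that a list of differentially expressed genes is available. The function name, scoring scheme (one point for accessibility, one for differential expression), and all identifiers are hypothetical and meant only to illustrate the idea.

```python
def prioritize_peaks(peaks, open_regions, peak_to_gene, de_genes):
    """Rank peaks by in vivo support.

    peaks: list of (peak_id, chrom, start, end)
    open_regions: list of (chrom, start, end) from ATAC-seq
    peak_to_gene: dict mapping peak_id to its assigned target gene
    de_genes: set of differentially expressed gene IDs
    Returns a list of (peak_id, score), highest score first.
    """
    scored = []
    for peak_id, chrom, start, end in peaks:
        # Half-open interval overlap test against every open region.
        accessible = any(
            c == chrom and start < r_end and r_start < end
            for c, r_start, r_end in open_regions
        )
        expressed = peak_to_gene.get(peak_id) in de_genes
        scored.append((peak_id, int(accessible) + int(expressed)))
    # sorted() is stable, so ties keep their input order.
    return sorted(scored, key=lambda item: -item[1])


# Toy inputs with hypothetical IDs.
ranked = prioritize_peaks(
    peaks=[("p1", "chr1A", 100, 300), ("p2", "chr1A", 1000, 1200), ("p3", "chr2B", 50, 150)],
    open_regions=[("chr1A", 250, 500), ("chr2B", 0, 100)],
    peak_to_gene={"p1": "geneX", "p2": "geneY", "p3": "geneZ"},
    de_genes={"geneX", "geneY"},
)
print(ranked)
```

The linear scan over open regions is fine for illustration; genome-scale data would normally use an interval index or a dedicated tool for the overlap step.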

References

Zhang, Y. et al. Transposable elements orchestrate subgenome-convergent and -divergent transcription in common wheat. Nature Communications, 13, 6940 (2022). https://doi.org/10.1038/s41467-022-34290-w
Bartlett, A. et al. Mapping genome-wide transcription-factor binding sites using DAP-seq. Nature Protocols, 12, 1659–1672 (2017). https://doi.org/10.1038/nprot.2017.055
Baumgart, L.A. et al. Persistence and plasticity in bacterial gene regulation. Nature Methods, 18, 1499–1505 (2021). https://doi.org/10.1038/s41592-021-01312-2
Li, M. et al. Double DAP-seq uncovered synergistic DNA binding of interacting bZIP transcription factors. Nature Communications, 14, 2600 (2023). https://doi.org/10.1038/s41467-023-38096-2
Hutin, S. et al. Identification of plant transcription factor DNA-binding sites using seq-DAP-seq. Methods in Molecular Biology, 2698, 119–145 (2023). https://doi.org/10.1007/978-1-0716-3354-0_9
Galli, M. et al. The DNA binding landscape of the maize AUXIN RESPONSE FACTOR family. Nature Communications, 9, 4526 (2018). https://doi.org/10.1038/s41467-018-06977-6
