Data Analysis Workflow for Cut&Tag Sequencing: From Raw Reads to Biological Insights
Cut & Tag sequencing has become a powerful technique in molecular biology, enabling researchers to study protein-DNA interactions with high specificity and sensitivity. This method can identify binding sites for transcription factors, histones, and other proteins at the genomic level. However, to fully realize the potential of Cut & Tag data, a comprehensive data analysis workflow is essential. This article outlines the key steps in Cut & Tag sequencing data analysis, from raw sequencing data to biological insights.
Data processing flow and different types of enriched epigenetic signals (Cheng S et al., 2024)
I. Data Quality Control and Preprocessing
1.1 Raw Data Quality Assessment
FastQC was used to perform multi-dimensional quality checks on the raw sequencing data in FASTQ format, focusing on the following core indicators:
- Base Quality Distribution: Sequencing accuracy at each position was assessed using a per-base quality score plot, requiring all bases to have a Q value ≥ 20 (corresponding to an error rate ≤ 1%).
- Adapter Contamination Analysis: Per-sequence GC content distribution was detected to identify abnormal peaks (such as GC content fluctuations specific to Illumina adapters).
- Sequence Length Distribution: Short fragments caused by sequencing truncation were excluded (normal range: 50-150 bp).
MultiQC was used to integrate various quality control results and generate a visualization report (including a quality score distribution histogram, a repetitive sequence proportion heatmap, etc.).
1.2 Data Cleaning Strategies
- Adapter Trimming: For paired-end sequencing data, the first 19 bp adapter sequence was removed using Cutadapt (parameter -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC).
- Low-Quality Filtering: Dynamic trimming was performed using Trimmomatic, with a sliding window (4 bp window, average quality ≥15), retaining valid reads ≥36 bp in length.
- Repetitive Sequence Removal: Picard MarkDuplicates was used to identify duplicate reads generated by PCR amplification (retaining uniquely aligned reads).
II. Sequence Alignment and Peak Calling
2.1 Reference Genome Alignment
- Key Parameters: Use Bowtie2's `--very-sensitive-local` mode (adapting to Tn5 cleavage characteristics) and `-N 1` (allowing single-base mismatches) to ensure effective alignment of low-quality fragments.
- Genome Version: Humans are recommended to use hg38, mice to use mm10, and the version must be consistent with the annotation file.
2.2 Peak Calling
- Histone Modifications: Enable broad peak mode (`--broad`) and a relaxed threshold (`--broad-cutoff 0.1`) to capture continuous open chromatin regions.
- Transcription Factors: Use narrow peak mode (`--narrow`), combined with `--shift 100` and `--extsize 200` to compensate for Tn5 cleavage bias.
GoPeaks and MACS2 perform better than SEACR at identifying a range of H3K4me3 peak sizes (Yashar WM et al., 2022)
2.3 Quality Control Standards
- FRiP Value: Histone modifications ≥5%, transcription factors ≥15%, reflecting signal specificity.
- Peak length: Histone modification peak width > 1 kb, transcription factor peak width < 500 bp, distinguishing differences in biological characteristics.
2.4 Data Validation
- Repetitive sequence filtering: PCR amplification bias was removed using Picard MarkDuplicates.
- Control settings: Input or IgG controls must be included to improve the signal-to-noise ratio.
III. Functional Annotation and Biological Interpretation
3.1 Peak Annotation Analysis
- Tools and Parameters: Use ChIPseeker (R package), set the promoter region range (TSS±3kb), and associate it with genome annotations (e.g., TxDb for hg38).
- Key Analysis:
- Gene Region Distribution: Differentiate promoter regions (TSS±1kb), enhancer regions (H3K27ac enriched), etc.
- Functional Annotation: GO (molecular function/biological process), KEGG pathway enrichment (p<0.01, FDR<5%).
- Quality Control: Promoter region percentage ≥30%, FRiP value ≥5% (histone) or ≥15% (transcription factor).
3.2 Dynamic Regulatory Network Construction
- Tools: GREAT, input peak file and genome annotations, define regulatory regions (e.g., 500bp upstream to 1kb downstream).
- Output: List of direct target genes and regulatory network diagram.
- Case Study: H3K4me3 was significantly enriched in the promoter regions of WNT pathway genes (p=2.3e-8), suggesting pathway activation.
3.3 Biological Significance
- Mechanism Analysis: Linking epigenetic modifications (e.g., H3K27ac) with gene expression reveals enhancer/promoter activity.
- Application: Validating regulatory networks using RNA-seq data to guide disease target screening (e.g., abnormally modified pathways in cancer).
IV. Data Visualization and Reporting
4.1 Signal Distribution Visualization
- IGV Trajectory Plot: Compares signal intensity between treatment and control groups, locating differentially expressed regions (e.g., high expression of H3K27ac in enhancer regions).
- Heatmap: Displays signal enrichment patterns in specific regions (promoters, enhancers) with high resolution (e.g., --binSize 10).
- Chromosomal Distribution Trajectory: Provides a panoramic view of genome-wide signal distribution, identifying chromosome-specific enrichments (e.g., enhanced signal on chromosomes containing oncogenes).
4.2 Differential Analysis Strategy
- Tools: DiffBind (R package), integrating BAM and Peak files to analyze inter-group differences.
- Key Parameters:
- FDR ≤ 0.05: Controls false positive rate.
- Minimum overlap ≥ 50%: Ensures reproducibility.
- Fold change ≥ 2-fold: Demonstrates biological significance.
- Output Results:
- Differential Peak List: Includes location, length, fold change, and associated genes.
- Functional enrichment: GO (biological process/molecular function) and KEGG pathway analysis revealed regulatory mechanisms.
V. Workflow Optimization and Precautions
5.1 Key Quality Control Checkpoints
| Step | Detection Indicator | Qualification Standard | Abnormal Handling Plan |
| Quality Control | FastQC Q Value | All bases ≥ 20 | Re-sequence or trim |
| Alignment | Overall Alignment Rate | ≥ 80% | Check genome version/contamination |
| Peak Calling | FRiP Value | Histone ≥ 5%, TF ≥ 15% | Adjust peak calling parameters |
| Annotation | Promoter Region Percentage | ≥ 30% | Verify antibody specificity |
5.2 Common Issue Solutions
- Low FRiP Value: Check antibody specificity (using IgG control), optimize chromatin fragmentation conditions.
- Non-specific Binding: Increase washing steps (e.g., using high-salt buffers), shorten transposition reaction time.
- Data Visualization Anomalies: Verify genome version consistency (e.g., chromosome naming differences between hg38 and hg19).
VI. Key Quality Control Checkpoints
| Step | Detection Indicator | Qualification Standard | Tool/Parameters | Objective | Abnormal Handling Plan |
| Quality Control | FastQC Q Value | All bases ≥ 20 | FastQC/Q-score ≥ 20 | Exclude low-quality data | Re-sequence or trim |
| Alignment | Overall Alignment Rate | ≥ 80% | Bowtie2 -N 1 | Precisely locate DNA fragment origins | Check genome version/contamination |
| Peak Calling | FRiP Value | Histone ≥ 5%, TF ≥ 15% | MACS2 --broad | Distinguish broad histone modification peaks | Adjust peak calling parameters |
| Annotation | Promoter Region Percentage | ≥ 30% | ChIPseeker TSS ± 3kb | Associate functional gene regions | Verify antibody specificity |
References and Toolchain
- Core Tool Versions:
- MACS2 2.2.6 (Supports multi-threaded acceleration)
- ChIPseeker 1.28.0 (Integrates the latest genome annotation)
- deepTools 3.5.1 (Supports GPU-accelerated computation)
- Data Storage Specifications:
- Raw Data: FASTQ.gz (Retains original index information)
- Intermediate Files: BAM (sorting + index), BED (zero compression)
- Final Output: BigWig (normalized signal), PDF (vector image)
Performance comparison of bioinformatics tools in the peak calling analysis of narrow-type CUT&Tag data (Cheng S et al., 2024)
VII. In-depth Analysis of Application Scenarios
7.1 Epigenetic Heterogeneity Research
- Li C et al., through CUT & Tag data analysis and sequencing data processing (Bowtie2 alignment, MACS3 peak recall), identified 2067 binding sites of NICD1 in the genome (44.84% located near TSS, including known target genes HES1/HES4). HOMER motif analysis was used to identify regulatory elements, and IGV visualization was used to annotate peak positions. Integrating RNA-seq (NOTCH1 knockout expression), 31 target genes (such as USP5, whose peak signal is second only to HES1) were screened. Immunofluorescence was used to verify the positive correlation between NICD1 and USP5 expression, ultimately revealing the mechanism by which Notch signaling directly regulates target gene transcription and promotes angiogenesis through NICD1. The core of this study is the analysis of NICD1's genomic binding characteristics and downstream regulatory network.
- Tao X et al. constructed two biological replicas of H3K4me3 CUT&Tag (with IgG as a control) and conducted parallel ChIP experiments. After verifying fragment quality (~350 bp) using qubits and mapping NGS reads to the reference genome, they found that the CUT&Tag experimental group had extremely low correlation with the IgG control (r=0.01, low background), and the signal intensity, after normalization, was significantly higher than that of ChIP-seq (ChIP correlation with the simulated control r=0.89, poor signal-to-noise ratio). Peak distribution showed that 60-70% of the H3K4me3 signal was enriched in the 1-kb promoter and first exon/intron (consistent with ChIP), and the peak correlation verification results near the gene were reliable (CUT&Tag two replicas r=0.94, compared with ChIP r=0.71). This demonstrates that CUT&Tag requires less starting material and can generate high-resolution signals with low background noise, making it suitable for a wide range of plant epigenetic studies.
7.2 Developmental Regulatory Network Analysis
Akdogan-Ozdilek B et al. used zebrafish embryos at the full barrier stage as material in CUT&Tag. Through a modified mammalian protocol (combined with CUT&RUN), they generated high-resolution enrichment maps of H3K4me3, H3K27me3, H3K9me3, RNA polymerase II, and H2A.Z. Sequencing data underwent adapter removal via CutAdapt, alignment of the zebrafish genome with Bowtie2 (GRCz.11), filtering of unmapped reads with samtools, and removal of PCR repeats with picard. The data was then peaked using macs2 and processed by deepTools to generate genome trajectories and heatmaps/outlines, providing a core analysis of the zebrafish embryo chromatin landscape. Identifying subsets of genes that may be bivalently regulated during gastric development in zebrafish and mice provides evidence for H2A.Z evolution. A robust signal of H2A.Z was detected in embryos at the full barrier stage, with enrichment at gene promoters (consistent with previous studies). Approximately 74% of H2A.Z marker genes were expressed during the shielding stage (TPM>0.5).
CUT&Tag detects H2A.Z in shield stage zebrafish embryos (Akdogan-Ozdilek B et al., 2021)
Summary
Through this workflow, researchers can systematically analyze CUT & Tag data, forming a complete chain of evidence from technical validation to mechanism interpretation. It is recommended to regularly update genome annotation files (e.g., using Ensembl release 109) and establish a laboratory-specific quality control threshold database.
References:
- Cheng S, Miao B, Li T, Zhao G, Zhang B. Review and Evaluate the Bioinformatics Analysis Strategies of ATAC-seq and CUT&Tag Data. Genomics Proteomics Bioinformatics. 2024 Sep 13;22(3):qzae054.
- Li C, Wu P, Xie X, Chen X, Chen L, Zhu L, Xuan Z, Liu T, Tan W, Zhang S, Lin D, Wu C. Aberrant Notch-signaling promotes tumor angiogenesis in esophageal squamous-cell carcinoma. Signal Transduct Target Ther. 2025 Jul 22;10(1):233. doi: 10.1038/s41392-025-02309-5. Erratum in: Signal Transduct Target Ther. 2025 Aug 31;10(1):288.
- Yashar WM, Kong G, VanCampen J, Curtiss BM, Coleman DJ, Carbone L, Yardimci GG, Maxson JE, Braun TP. GoPeaks: histone modification peak calling for CUT&Tag. Genome Biol. 2022 Jul 4;23(1):144.
- Tao X, Feng S, Zhao T, Guan X. Efficient chromatin profiling of H3K4me3 modification in cotton using CUT&Tag. Plant Methods. 2020 Aug 31;16:120.
- Akdogan-Ozdilek B, Duval KL, Meng FW, Murphy PJ, Goll MG. Identification of chromatin states during zebrafish gastrulation using CUT&RUN and CUT&Tag. Dev Dyn. 2022 Apr;251(4):729-742.