Pipeline and Tools for ChIP-Seq Analysis

The challenges of ChIP-seq

ChIP-seq is a powerful method to identify genome-wide DNA binding sites for a protein of interest. Mapping the chromosomal locations of transcription factors (TFs), nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, and polymerases is one of the key tasks of modern biology, and ChIP-seq is the standard methodology for it (Bailey et al., 2013). The challenges of ChIP-seq lie not only in sample preparation and sequencing but also in computational analysis.

Compared with other types of massively parallel sequencing data, ChIP-seq data have several distinctive characteristics:

  • Histone modifications cover broader regions of DNA than TFs.
  • Reads are trimmed to within a smaller number of bases.
  • Fragments are quite large relative to binding sites of TFs.
  • Measurements of histone modification often undulate following well-positioned nucleosomes.

To extract meaningful data from the raw sequence reads, the ChIP-seq data analysis should:

  • Identify genomic regions - ‘peaks’ - where a TF binds or histones are modified.
  • Quantify and compare levels of binding or histone modification between samples.
  • Characterize the relationships among chromatin state and gene expression or splicing.

Bioinformatics analysis workflow for ChIP-seq data

The bioinformatics analysis workflow for ChIP-seq data and the considerations for each step are illustrated in Figure 1 (Nakato and Shirahige, 2017). Sample preparation, sequencing and mapping (Figure 1A) are common to experiments with one or a few samples (Figure 1B) and experiments with many samples (Figure 1C). First, the quality of the sequencing reads is assessed. After quality assessment, reads are mapped to the reference genome. Genomic regions that are significantly enriched for ChIP reads compared with input reads are detected as peaks; other genomic regions are regarded as non-specific background. Read densities can be visualized along the genome. In sample-scale analysis (Figure 1B), the peak-calling strategy and parameters can be adjusted to each sample’s properties. Such one-by-one adjustment is difficult in large-scale analysis (Figure 1C), where objective quality metrics for multilateral quantitative assessment are necessary to filter out poor-quality data automatically. The called peaks represent candidate sites of histone modification or of binding by the targeted protein, and can be used to identify associated functional annotations, such as binding motifs.
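
The enrichment step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular peak caller: the genome is assumed to be pre-binned into fixed windows with one ChIP count and one input count per window, the input is scaled to the ChIP library size, and a simple Poisson tail probability is used as the enrichment test. The function names, the `alpha` cutoff and the `min_background` floor are hypothetical choices made for this example.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by summing the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(chip_counts, input_counts, alpha=1e-5, min_background=1.0):
    """Return indices of windows whose ChIP count exceeds the scaled input background.

    alpha and min_background are illustrative defaults, not recommended values.
    """
    total_input = sum(input_counts)
    if total_input == 0:
        raise ValueError("input sample has no reads")
    scale = sum(chip_counts) / total_input  # normalize input to ChIP library size
    enriched = []
    for index, (chip, inp) in enumerate(zip(chip_counts, input_counts)):
        expected = max(scale * inp, min_background)  # background rate for this window
        if poisson_sf(chip, expected) < alpha:
            enriched.append(index)
    return enriched


if __name__ == "__main__":
    chip = [5, 6, 60, 4, 5]
    control = [5, 5, 5, 5, 5]
    print(call_enriched_windows(chip, control))
```

Windows that pass the cutoff are peak candidates; every other window is treated as non-specific background. Real peak callers add steps this sketch omits, such as local background estimation and multiple-testing correction.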


Figure 1. ChIP-seq analysis workflow. Adapted from (Nakato and Shirahige, 2017)

A comprehensive comparison of tools for differential ChIP-seq data analysis

There has been a large effort to improve the analytical tools used for ChIP-seq data, and each analysis step has led to the development of specialized software. A subset of the software tools available for mapping and peak calling is listed in Table 1 (Furey, 2012).

Table 1. A subset of software tools available for mapping and peak calling in the analysis of ChIP-seq data.

| Tool | Notes | Web address |
| --- | --- | --- |
| Short-read aligners | | |
| BWA (Burrows-Wheeler Aligner) | Fast and efficient; based on the Burrows-Wheeler transform | http://bio-bwa.sourceforge.net |
| Bowtie | Similar to BWA; part of a suite of tools that includes TopHat and Cufflinks for RNA-seq processing | http://bowtie-bio.sourceforge.net |
| GSNAP (Genomic Short-read Nucleotide Alignment Program) | Considers a set of variant allele inputs to better align to heterozygous sites | http://research-pub.gene.com/gmap |
| Wikipedia list of aligners | A comprehensive list of available short-read aligners, with descriptions and links to download the software | http://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment |
| Peak callers | | |
| MACS (Model-based Analysis for ChIP-seq) | Fits data to a dynamic Poisson distribution; works with and without control data | http://liulab.dfci.harvard.edu/MACS |
| PeakSeq | Takes into account differences in mappability of genomic regions; enrichment based on FDR (false-discovery rate) calculation | http://info.gersteinlab.org/PeakSeq |
| ZINBA (Zero-Inflated Negative Binomial Algorithm) | Can incorporate multiple genomic factors, such as mappability and GC content; can work with point-source and broad-source peak data | http://code.google.com/p/zinba |

Besides detecting enriched or bound regions, an important question in ChIP-seq data analysis is how binding differs between conditions. Because ChIP-seq data are noisy and variable, this question is particularly challenging. Many computational tools for differential ChIP-seq analysis have been developed and published in recent years. These tools differ substantially in their algorithmic setups, in the number and size of the differential regions (DR) they detect, and in their range of applicability. Fourteen tools for differential ChIP-seq data analysis are described in Table 2 (Steinhauser et al., 2016).

Table 2. Description of different tools for differential ChIP-seq data analysis.

| Tool | Language | Peak Calling | Web address |
| --- | --- | --- | --- |
| SICER | Bash/Python | Window-based approach, merging of eligible clusters in proximity closer than the defined gap size | https://home.gwu.edu/~wpeng/Software.htm |
| MACS2 | Python | Not required | https://github.com/taoliu/MACS/ |
| ODIN | Python | Not required | http://costalab.org/wp/odin |
| RSEG | C++ | Not required | http://smithlabresearch.org/software/rseg/ |
| MAnorm | R | Requires peak calling, e.g. with MACS | http://bcb.dfci.harvard.edu/~gcyuan/MAnorm/MAnorm.htm |
| HOMER | Perl & C++ | Window-based approach; peak calling done by HOMER | http://homer.salk.edu/homer/index.html |
| QChIPat | R, Perl & C++ | Peak calling possible with BELT, MACS, SISSRs or FindPeaks | http://motif.bmi.ohio-state.edu/QChIPat/ |
| diffReps | Perl | Sliding window approach | https://github.com/shenlab-sinai/diffreps |
| DBChIP | R | Requires peak calling, e.g. with MACS | http://pages.cs.wisc.edu/~kliang/DBChIP/ |
| ChIPComp | R | Requires peak calling, e.g. with MACS | http://web1.sph.emory.edu/users/hwu30/software/ChIPComp.html |
| MultiGPS | Java | Expectation maximization learning | http://mahonylab.org/software/multigps/ |
| MMDiff | R | Requires peak calling, e.g. with MACS | https://bioconductor.riken.jp/packages/3.1/bioc/html/MMDiff.html |
| DiffBind | R | Requires peak calling, e.g. with MACS | http://bioconductor.org/packages/release/bioc/html/DiffBind.html |
| PePr | Python | Window-based approach | https://github.com/shawnzhangyx/PePr |

A decision tree for choosing an appropriate tool is shown in Figure 2. The choice depends on several factors: the shape of the signal (sharp peaks or broad ChIP enrichments), the presence of replicates, and the presence of an external set of regions of interest. The tools shown in black give good results with default settings, whereas the tools in gray require more extensive fine-tuning of parameters to achieve optimal results.


Figure 2. Decision tree indicating the proper choice of tool. Adapted from (Steinhauser et al., 2016).

Technical guidelines for the comprehensive analysis of ChIP-seq data

Recent advances in sequencing technologies and analyses make it possible to handle hundreds of ChIP samples simultaneously. However, some issues remain in the analysis of ChIP-seq data, such as false positive peaks, multiply mapped reads, and poor overlap between the results of different peak-finding algorithms. To obtain high-quality results from the computational analysis of ChIP-seq data, the technical aspects listed below should be considered (Bailey et al., 2013):

1) Sequencing Depth

  • Effective analysis of ChIP-seq data requires enough coverage by sequence reads (sequencing depth). The required sequencing depth mainly depends on the size of the genome and the number and size of the binding sites of the protein.
  • 20 million reads may be adequate for mammalian TFs and chromatin modifications which are typically localized at specific, narrow sites, such as enhancer-associated histone marks (Landt et al., 2012).
  • Broader factors, including most histone marks, and proteins with more binding sites, such as RNA Pol II, require more reads: up to 60 million for mammalian ChIP-seq (Chen et al., 2012).
  • Control samples should be sequenced significantly deeper than the ChIP ones.
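
The depth guidance above can be turned into a simple pre-analysis check. The sketch below is an assumption-laden illustration: it treats the 20 million and 60 million read figures quoted above as targets for narrow and broad signals in mammalian samples, and it only checks that the control is deeper than the ChIP sample without quantifying "significantly". The names `RECOMMENDED_READS` and `depth_report` are hypothetical.

```python
# Read targets taken from the figures quoted in the text (mammalian genomes).
# Using them as hard thresholds is an assumption made for this sketch.
RECOMMENDED_READS = {
    "narrow": 20_000_000,  # TFs and narrowly localized marks
    "broad": 60_000_000,   # most histone marks, RNA Pol II
}


def depth_report(chip_reads, control_reads, signal_type):
    """Summarize whether a ChIP/control pair meets the depth guidance."""
    target = RECOMMENDED_READS[signal_type]
    return {
        "chip_depth_ok": chip_reads >= target,
        "control_deeper_than_chip": control_reads > chip_reads,
        "reads_short_of_target": max(0, target - chip_reads),
    }


if __name__ == "__main__":
    print(depth_report(25_000_000, 30_000_000, "narrow"))
    print(depth_report(25_000_000, 20_000_000, "broad"))
```

For other genome sizes or numbers of binding sites the targets would need to be adjusted, since the required depth depends on both.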

2) Read Mapping and Quality Metrics

  • Before mapping to the reference genome, the reads should be filtered by applying a quality cutoff.
  • It is important to consider the percentage of uniquely mapped reads reported by the mapping tools.
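
A minimal sketch of both checks follows. It assumes reads are already parsed into `(name, sequence, quality_string)` tuples with Phred+33 encoded qualities, and it uses a mapping-quality (MAPQ) cutoff as a stand-in for "uniquely mapped". That proxy is an assumption: how MAPQ relates to uniqueness depends on the aligner, and the thresholds of 20 and 30 are illustrative only.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 encoded qualities."""
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def filter_reads(reads, min_mean_quality=20):
    """Keep reads whose mean base quality reaches the cutoff.

    reads: iterable of (name, sequence, quality_string) tuples.
    """
    return [
        read for read in reads
        if read[2] and mean_phred(read[2]) >= min_mean_quality
    ]


def percent_uniquely_mapped(mapq_values, min_mapq=30):
    """Percentage of reads at or above a MAPQ cutoff.

    mapq_values: one entry per read, None for unmapped reads.
    """
    total = len(mapq_values)
    if total == 0:
        return 0.0
    unique = sum(1 for q in mapq_values if q is not None and q >= min_mapq)
    return 100.0 * unique / total


if __name__ == "__main__":
    reads = [("r1", "ACGT", "IIII"), ("r2", "ACGT", "!!!!")]
    print(filter_reads(reads))
    print(percent_uniquely_mapped([42, 0, 30, None]))
```

In practice the percentage is usually read from the aligner's own summary; the function only shows what the number represents.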

3) Peak Calling

  • Peak calling predicts the regions of the genome where the ChIPed protein is bound by finding regions with peaks of mapped reads.
  • A fine balance between sensitivity and specificity depends on choosing an appropriate peak-calling algorithm and normalization method based on the type of protein ChIPed.

4) Assessment of Reproducibility

  • To ensure that the experimental results are reproducible, at least two biological replicates of each ChIP-seq experiment are recommended.
  • The reproducibility of both reads and identified peaks should be examined.
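
The two levels of reproducibility can be illustrated with a small sketch: the fraction of peaks from one replicate that overlap a peak in the other, and the Pearson correlation of per-window read counts between replicates. The peak representation `(chrom, start, end)` with half-open coordinates and the function names are assumptions made for this example. It is not the procedure of any specific published guideline.

```python
import math


def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in A that overlap at least one peak in B.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    """
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in peaks_a:
        if any(s < end and start < e for s, e in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(peaks_a) if peaks_a else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length lists of read counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
    rep2 = [("chr1", 150, 250), ("chr2", 300, 400)]
    print(fraction_overlapping(rep1, rep2))
    print(pearson([1, 2, 3], [2, 4, 6]))
```

The overlap fraction is asymmetric, so it is worth computing in both directions.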

5) Differential Binding Analysis

  • Comparative ChIP-seq analysis of an increasing number of protein-bound regions across conditions or tissues is expected with the steady rise of NGS (next-generation sequencing) projects.
  • The direct calculation of differentially bound regions between treatment samples without controls is not recommended.
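
To show where the control enters a comparison, the sketch below computes a depth-normalized, control-subtracted signal for one region in each condition and then a log2 fold change between conditions. This is a deliberately naive illustration with hypothetical function names and an arbitrary pseudocount. It performs no statistical test and does not model the variance between replicates, which is the job of the differential ChIP-seq tools in Table 2.

```python
import math


def normalized_signal(chip_count, control_count, chip_depth, control_depth):
    """Control-subtracted signal for one region, in reads per million (floored at 0)."""
    chip_rpm = chip_count / chip_depth * 1e6
    control_rpm = control_count / control_depth * 1e6
    return max(chip_rpm - control_rpm, 0.0)


def log2_fold_change(signal_a, signal_b, pseudocount=1.0):
    """log2 ratio of condition B over condition A; the pseudocount avoids division by zero."""
    return math.log2((signal_b + pseudocount) / (signal_a + pseudocount))


if __name__ == "__main__":
    # One region, two conditions, each with its own control sample.
    cond_a = normalized_signal(300, 100, 10_000_000, 10_000_000)
    cond_b = normalized_signal(900, 100, 10_000_000, 10_000_000)
    print(cond_a, cond_b, log2_fold_change(cond_a, cond_b))
```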

6) Peak Annotation

The aim of the annotation is to associate the ChIP-seq peaks with functionally relevant genomic regions, such as gene promoters, transcription start sites, intergenic regions, etc.
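
As a hedged sketch of this step, the code below assigns each peak to its nearest transcription start site (TSS) and labels it "promoter" when the peak midpoint lies within a window around that TSS. The input layout (a dictionary of sorted `(position, gene)` lists per chromosome), the 1,000 bp window, the use of the peak midpoint, and the labels are all assumptions made for illustration. Strand is ignored for simplicity.

```python
import bisect


def annotate_peaks(peaks, tss_by_chrom, promoter_window=1000):
    """Annotate peaks with the nearest TSS.

    peaks: list of (chrom, start, end).
    tss_by_chrom: {chrom: list of (position, gene) sorted by position}.
    """
    annotations = []
    for chrom, start, end in peaks:
        midpoint = (start + end) // 2
        tss_list = tss_by_chrom.get(chrom, [])
        if not tss_list:
            annotations.append({"peak": (chrom, start, end), "gene": None,
                                "distance": None, "label": "unannotated"})
            continue
        positions = [position for position, _ in tss_list]
        i = bisect.bisect_left(positions, midpoint)
        # Only the TSSs immediately left and right of the midpoint can be nearest.
        candidates = tss_list[max(i - 1, 0):i + 1]
        position, gene = min(candidates, key=lambda tss: abs(tss[0] - midpoint))
        distance = midpoint - position
        label = "promoter" if abs(distance) <= promoter_window else "distal"
        annotations.append({"peak": (chrom, start, end), "gene": gene,
                            "distance": distance, "label": label})
    return annotations


if __name__ == "__main__":
    tss = {"chr1": [(1000, "GENE_A"), (50000, "GENE_B")]}  # made-up coordinates
    peaks = [("chr1", 900, 1300), ("chr1", 20000, 20200), ("chr2", 10, 20)]
    for row in annotate_peaks(peaks, tss):
        print(row)
```

A fuller annotation would also use gene bodies and strand to separate intragenic from intergenic peaks.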

7) Motif Analysis

  • Motif analysis is useful for much more than just identifying the causal DNA-binding motif in TF ChIP-seq peaks.
  • When the motif of the ChIPed protein is already known, motif analysis provides validation of the success of the experiment.
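
The validation use case can be sketched as the fraction of peak sequences that contain the known motif on either strand. The example assumes the motif is given as an IUPAC consensus string and converts it to a regular expression. Dedicated motif tools work with position weight matrices and background models instead, so this is only a rough check. The function names and the example sequences are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a DNA or IUPAC consensus string (uppercase)."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_regex(consensus):
    """Compile an IUPAC consensus into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))


def fraction_with_motif(sequences, consensus):
    """Fraction of sequences containing the motif on either strand."""
    forward = motif_regex(consensus)
    reverse = motif_regex(reverse_complement(consensus.upper()))
    hits = sum(
        1 for seq in sequences
        if forward.search(seq.upper()) or reverse.search(seq.upper())
    )
    return hits / len(sequences) if sequences else 0.0


if __name__ == "__main__":
    peak_sequences = ["CCAGATAAGG", "CCTTATCTGG", "CCCCCCCC"]  # toy sequences
    print(fraction_with_motif(peak_sequences, "WGATAR"))
```

A high fraction in peaks relative to control regions supports a successful experiment. The comparison against control regions is left out of this sketch.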

Additional reading:

The Advantages and Workflow of ChIP-Seq

References:

  1. Bailey, T., Krajewski, P., Ladunga, I., Lefebvre, C., Li, Q., Liu, T., Madrigal, P., Taslim, C., and Zhang, J. (2013). Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS computational biology 9, e1003326.
  2. Chen, Y., Negre, N., Li, Q., Mieczkowska, J.O., Slattery, M., Liu, T., Zhang, Y., Kim, T.K., He, H.H., Zieba, J., et al. (2012). Systematic evaluation of factors influencing ChIP-seq fidelity. Nature methods 9, 609-614.
  3. Furey, T.S. (2012). ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nature reviews Genetics 13, 840-852.
  4. Landt, S.G., Marinov, G.K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B.E., Bickel, P., Brown, J.B., Cayting, P., et al. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome research 22, 1813-1831.
  5. Machanick, P., and Bailey, T.L. (2011). MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27, 1696-1697.
  6. McLean, C.Y., Bristor, D., Hiller, M., Clarke, S.L., Schaar, B.T., Lowe, C.B., Wenger, A.M., and Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology 28, 495-501.
  7. Nakato, R., and Shirahige, K. (2017). Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation. Briefings in bioinformatics 18, 279-290.
  8. Steinhauser, S., Kurzawa, N., Eils, R., and Herrmann, C. (2016). A comprehensive comparison of tools for differential ChIP-seq analysis. Briefings in bioinformatics 17, 953-966.
  9. Thomas-Chollier, M., Herrmann, C., Defrance, M., Sand, O., Thieffry, D., and van Helden, J. (2012). RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic acids research 40, e31.
For Research Use Only. Not for use in diagnostic procedures.