The challenges of ChIP-seq
ChIP-seq is a powerful method for identifying the genome-wide DNA binding sites of a protein of interest. Mapping the chromosomal locations of transcription factors (TFs), nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, and polymerases is one of the key tasks of modern biology, and ChIP-seq is the standard methodology for this purpose (Bailey et al., 2013). ChIP-seq presents challenges not only in sample preparation and sequencing but also in computational analysis.
Unlike other types of massively parallel sequencing data, the ChIP-seq data have several characteristics:
To extract meaningful data from the raw sequence reads, the ChIP-seq data analysis should:
Bioinformatics analysis workflow for ChIP-seq data
The bioinformatics analysis workflow for ChIP-seq data and the considerations for each step are illustrated in Figure 1 (Nakato and Shirahige, 2017). Sample preparation, sequencing, and mapping (Figure 1A) are common to experiments with one or a few samples (Figure 1B) and experiments with many samples (Figure 1C). First, the sequencing reads are assessed for quality. After quality assessment, reads are mapped to the reference genome. Genomic regions that are significantly enriched for ChIP reads relative to input reads are detected as peaks; all other regions are regarded as non-specific background. Read densities can be visualized along the genome. In sample-scale analysis (Figure 1B), the peak-calling strategy and parameters can be adjusted to the properties of each sample. Such one-by-one adjustment is difficult in large-scale analysis (Figure 1C), where objective quality metrics for multilateral quantitative assessment are necessary to filter out poor-quality data automatically. The called peaks represent candidate sites of histone modification or of binding by the targeted protein, which can be used to identify associated functional annotations, such as binding motifs.
Figure 1. ChIP-seq analysis workflow. Adapted from (Nakato and Shirahige, 2017)
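The enrichment step of the workflow, in which ChIP read counts are compared with input read counts, can be illustrated with a minimal sketch. It assumes that reads have already been counted in fixed-size genomic windows and tests each window against a Poisson background scaled from the input sample. The function names, the floor on the expected count, and the p-value cutoff are illustrative assumptions, not the procedure of any particular peak caller.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - CDF(k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(chip_counts, input_counts, chip_total, input_total,
                          alpha=1e-5):
    """Return (window_index, p_value) for windows enriched in ChIP over input.

    chip_counts / input_counts: read counts per window (same window order).
    chip_total / input_total: total mapped reads, used to scale the input.
    alpha: hypothetical p-value cutoff (no multiple-testing correction here).
    """
    scale = chip_total / input_total
    enriched = []
    for idx, (chip, ctrl) in enumerate(zip(chip_counts, input_counts)):
        # Expected ChIP count under background; the floor of 1.0 is an
        # assumption that avoids a zero expectation in empty input windows.
        lam = max(ctrl * scale, 1.0)
        p_value = poisson_sf(chip, lam)
        if p_value < alpha:
            enriched.append((idx, p_value))
    return enriched


# Toy example: four windows, equal sequencing depth in ChIP and input.
windows = call_enriched_windows([3, 40, 2, 5], [4, 5, 3, 5], 1000, 1000)
print([idx for idx, _ in windows])
```

Real peak callers additionally estimate local background, merge adjacent windows, and control the false-discovery rate, so this sketch only conveys the core comparison.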
A comprehensive comparison of tools for differential ChIP-seq data analysis
Considerable effort has gone into improving the analytical tools used for ChIP-seq data, and specialized software has been developed for each step of the analysis. A subset of the software tools available for mapping and peak calling is briefly listed in Table 1 (Furey, 2012).
Table 1. A subset of software tools available for mapping and peak calling in the analysis of ChIP-seq data.
|Tool||Description||Web address|
|BWA (Burrows-Wheeler Aligner)||Fast and efficient; based on the Burrows-Wheeler transform||http://bio-bwa.sourceforge.net|
|Bowtie||Similar to BWA, part of suite of tools that includes TopHat and CuffLinks for RNA-seq processing||http://bowtie-bio.sourceforge.net|
|GSNAP (Genomic Short-read Nucleotide Alignment Program)||Considers a set of variant allele inputs to better align to heterozygous sites||http://research-pub.gene.com/gmap|
|Wikipedia list of aligners||A comprehensive list of available short-read aligners, with descriptions and links to download the software||http://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment|
|MACS (Model-based Analysis for ChIP-seq)||Fits data to a dynamic Poisson distribution; works with and without control data||http://liulab.dfci.harvard.edu/MACS|
|PeakSeq||Takes into account differences in mappability of genomic regions; enrichment based on FDR (false-discovery rate) calculation||http://info.gersteinlab.org/PeakSeq|
|ZINBA (Zero-Inflated Negative Binomial Algorithm)||Can incorporate multiple genomic factors, such as mappability and GC content; can work with point-source and broad-source peak data||http://code.google.com/p/zinba|
Beyond detecting enriched or bound regions, an important question in ChIP-seq data analysis is how binding or enrichment differs between conditions. This question is particularly challenging because ChIP-seq data are noisy and variable. Many computational tools for differential ChIP-seq analysis have been developed and published in recent years. These tools differ substantially in their algorithmic setups, in the number and size of the differential regions (DR) they detect, and in their range of applicability. Descriptions of 14 different tools for differential ChIP-seq data analysis are listed in Table 2 (Steinhauser et al., 2016).
Table 2. Description of different tools for differential ChIP-seq data analysis.
|Tool||Language||Peak Calling||Web address|
|SICER||Bash/Python||Window-based approach, merging of eligible clusters in proximity closer than the defined gap size||https://home.gwu.edu/~wpeng/Software.htm|
|ODIN||Python||Not required||http://costalab.org/wp/odin|
|RSEG||C++||Not required||http://smithlabresearch.org/software/rseg/|
|MAnorm||R||Requires peak calling, e.g. with MACS||http://bcb.dfci.harvard.edu/~gcyuan/MAnorm/MAnorm.htm|
|HOMER||Perl & C++||Window-based approach; peak calling done by HOMER||http://homer.salk.edu/homer/index.html|
|QChIPat||R, Perl & C++||Peak calling possible with BELT, MACS, SISSRs or FindPeaks||http://motif.bmi.ohio-state.edu/QChIPat/|
|diffReps||Perl||Sliding-window approach||https://github.com/shenlab-sinai/diffreps|
|DBChIP||R||Requires peak calling, e.g. with MACS||http://pages.cs.wisc.edu/~kliang/DBChIP/|
|ChIPComp||R||Requires peak calling, e.g. with MACS||http://web1.sph.emory.edu/users/hwu30/software/ChIPComp.html|
|MultiGPS||Java||Expectation maximization learning||http://mahonylab.org/software/multigps/|
|MMDiff||R||Requires peak calling, e.g. with MACS||https://bioconductor.riken.jp/packages/3.1/bioc/html/MMDiff.html|
|DiffBind||R||Requires peak calling, e.g. with MACS||http://bioconductor.org/packages/release/bioc/html/DiffBind.html|
|PePr||Python||Window-based approach||https://github.com/shawnzhangyx/PePr|
A decision tree for choosing an appropriate tool is shown in Figure 2. The choice depends on several factors: the shape of the signal (sharp peaks or broad ChIP enrichments), the presence of replicates, and the presence of an external set of regions of interest. The tools shown in black give good results with default settings, whereas the tools shown in gray require more extensive fine-tuning of parameters to achieve optimal results.
Figure 2. Decision tree indicating the proper choice of tool. Adapted from (Steinhauser et al., 2016).
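To make the notion of a differential region concrete, the following minimal sketch compares read counts for the same regions in two conditions after scaling for sequencing depth. It is a deliberately naive illustration and not the algorithm of any tool in Table 2: the function names, the pseudocount, and the fold-change threshold are assumptions, and the sketch ignores replicates and variance modelling, which the dedicated tools handle.

```python
import math


def log2_fold_changes(counts_a, counts_b, total_a, total_b, pseudocount=1.0):
    """Depth-normalised log2(B / A) for each region.

    counts_a / counts_b: read counts per region in conditions A and B.
    total_a / total_b: total mapped reads in each condition.
    pseudocount: assumed small constant that avoids division by zero.
    """
    changes = []
    for a, b in zip(counts_a, counts_b):
        cpm_a = (a + pseudocount) / total_a * 1e6  # counts per million
        cpm_b = (b + pseudocount) / total_b * 1e6
        changes.append(math.log2(cpm_b / cpm_a))
    return changes


def candidate_differential_regions(counts_a, counts_b, total_a, total_b,
                                   min_abs_lfc=1.0):
    """Return indices of regions whose |log2 fold change| meets the cutoff."""
    lfcs = log2_fold_changes(counts_a, counts_b, total_a, total_b)
    return [i for i, lfc in enumerate(lfcs) if abs(lfc) >= min_abs_lfc]


# Toy example: two regions, equal sequencing depth in both conditions.
print(candidate_differential_regions([9, 19], [39, 19], 1_000_000, 1_000_000))
```

A fold-change cutoff alone does not account for noise, which is one reason why the presence of replicates is a branching point in the decision tree.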
Technical guidelines for the comprehensive analysis of ChIP-seq data
Recent advances in sequencing technologies and analyses make it possible to handle hundreds of ChIP samples simultaneously. However, some issues remain in the analysis of ChIP-seq data, such as false-positive peaks, multi-mapped reads, and poor overlap between the results of different peak-finding algorithms. To obtain high-quality results from the computational analysis of ChIP-seq data, the technical aspects listed below should be considered (Bailey et al., 2013):
1) Sequencing Depth
2) Read Mapping and Quality Metrics
3) Peak Calling
4) Assessment of Reproducibility
5) Differential Binding Analysis
6) Peak Annotation
The aim of the annotation is to associate the ChIP-seq peaks with functionally relevant genomic regions, such as gene promoters, transcription start sites, intergenic regions, etc.
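As a minimal sketch of this idea, the following example assigns each peak to its nearest transcription start site (TSS) and labels it as TSS-proximal or distal. The input layout, the function name, the use of the peak midpoint as a stand-in for the summit, and the 1,000 bp window are all illustrative assumptions; strand is ignored for simplicity.

```python
from bisect import bisect_left


def annotate_peaks(peaks, tss_by_chrom, proximal_window=1000):
    """Annotate peaks with the nearest TSS.

    peaks: list of (chrom, start, end) tuples.
    tss_by_chrom: {chrom: list of (position, gene_name) sorted by position}.
    proximal_window: assumed distance cutoff (bp) for "TSS-proximal".
    Returns a list of (chrom, start, end, gene, distance, label) tuples.
    """
    # Pre-extract the sorted positions once per chromosome for bisection.
    positions_by_chrom = {
        chrom: [pos for pos, _ in sites] for chrom, sites in tss_by_chrom.items()
    }
    annotated = []
    for chrom, start, end in peaks:
        sites = tss_by_chrom.get(chrom, [])
        if not sites:
            annotated.append((chrom, start, end, None, None, "unannotated"))
            continue
        midpoint = (start + end) // 2
        i = bisect_left(positions_by_chrom[chrom], midpoint)
        # Only the TSSs immediately left and right of the midpoint can be nearest.
        neighbours = sites[max(i - 1, 0):i + 1]
        pos, gene = min(neighbours, key=lambda site: abs(site[0] - midpoint))
        distance = midpoint - pos
        label = "TSS-proximal" if abs(distance) <= proximal_window else "distal"
        annotated.append((chrom, start, end, gene, distance, label))
    return annotated


# Toy example with hypothetical gene names and coordinates.
tss = {"chr1": [(1000, "geneA"), (50000, "geneB")]}
for row in annotate_peaks([("chr1", 900, 1300), ("chr1", 20000, 20200)], tss):
    print(row)
```

In practice, annotation would also take strand, gene bodies, and other feature classes into account, typically through a dedicated annotation package.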
7) Motif Analysis