Overview of RRBS Data Analysis: Pipeline, Alignment Tools, Databases, and Challenges

The rapid development of high-throughput sequencing technologies makes it possible to interrogate DNA methylation at single-base resolution, with high coverage, on a massive scale. Bisulfite sequencing is the gold standard for measuring methylation across genomes of interest (Wreczycka et al., 2017). Reduced representation bisulfite sequencing (RRBS) has been widely used to study genome-wide DNA methylation because it substantially reduces sequencing cost while providing high sequencing coverage and sensitivity (Gu et al., 2010; Meissner et al., 2005).

Bioinformatics Analysis Pipeline for RRBS Data

Analysis of DNA methylation patterns on a genome-wide scale is essential to understanding the underlying mechanisms of DNA methylation. The computational pipeline for analysis of RRBS data is shown in Figure 1.

Figure 1. Pipeline for analysis of RRBS data. CpG: CG sequences, where C is cytosine and G is guanine. CHG and CHH: H is A (adenine), C, or T (thymine).
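
The three sequence contexts named in the caption can be told apart from the one or two bases that follow a cytosine. The snippet below is a minimal sketch of that rule for the forward strand only; the function name `cytosine_context` is hypothetical and is not taken from any of the tools discussed here.

```python
from typing import Optional


def cytosine_context(seq: str, pos: int) -> Optional[str]:
    """Classify the cytosine at 0-based index `pos` of `seq` (forward strand).

    Returns "CpG", "CHG", or "CHH", or None when the base is not a C or
    when the downstream bases needed for the decision are missing or N.
    """
    seq = seq.upper()
    if seq[pos] != "C":
        return None
    downstream = seq[pos + 1 : pos + 3]  # up to two bases after the C
    if downstream[:1] == "G":
        return "CpG"  # C followed directly by G
    if len(downstream) < 2 or "N" in downstream:
        return None  # not enough information to separate CHG from CHH
    # Here the first downstream base is H (A, C, or T).
    return "CHG" if downstream[1] == "G" else "CHH"


if __name__ == "__main__":
    example = "ACGTCAGTCATC"
    for i, base in enumerate(example):
        if base == "C":
            print(i, cytosine_context(example, i))
```

Cytosines on the reverse strand would be handled the same way after reverse-complementing the sequence, which this sketch leaves out.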

Alignment tools used for RRBS data analysis

Bisulfite sequencing alignments are complex: the aligned sequences do not exactly match the reference genome, and the complexity of the libraries is reduced. Standard sequence alignment software therefore cannot be used, and the unique properties of RRBS call for special tools for alignment and analysis. Five mapping algorithms commonly used in benchmarking analyses of RRBS data are Bismark, BS-Seeker2, BSMAP, GSNAP, and bwa-meth, which are described in Table 1 (Sun et al., 2018).

Table 1. Brief description of different alignment tools for RRBS data analysis.

| Feature | Bismark | BS-Seeker2 | bwa-meth | BSMAP | GSNAP |
|---|---|---|---|---|---|
| Mapping strategy | Three-letter | Three-letter | Three-letter | Wildcard | Wildcard |
| Aligner | Bowtie, Bowtie2 | Bowtie, Bowtie2, SOAP | BWA | SOAP | GSNAP |
| Supported library types | WGBS/RRBS | WGBS/RRBS | WGBS/RRBS | WGBS/RRBS | WGBS/RRBS |
| Adapter trimming | No | Yes | No | Yes | Yes |
| Multi-core | Yes | Yes | Yes | Yes | Yes |
| Directional/non-directional | Yes/Yes | Yes/Yes | Yes/No | Yes/Yes | Yes/Yes |
| Single-end/paired-end | Yes/Yes | Yes/Yes | No/Yes | Yes/Yes | Yes/Yes |
| Programming language | Perl | Python | Python | C++ | C and Perl |
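
The two mapping strategies in Table 1 can be illustrated on a toy sequence. The sketch below is not how any of the listed tools is implemented. It only shows the idea, assuming a read from the original forward strand, exact matching, and no indels. In the three-letter approach, every C in both read and reference is converted to T before searching. In the wildcard approach, a reference C is allowed to match either C or T in the read. The function names are hypothetical.

```python
def c_to_t(seq: str) -> str:
    """In-silico conversion used by the three-letter strategy (forward strand)."""
    return seq.upper().replace("C", "T")


def three_letter_map(read: str, reference: str) -> int:
    """Return the 0-based position of the converted read in the converted
    reference, or -1 if there is no exact match."""
    return c_to_t(reference).find(c_to_t(read))


def wildcard_match(read: str, ref_window: str) -> bool:
    """Wildcard strategy: a reference C may pair with a read C or T."""
    if len(read) != len(ref_window):
        return False
    return all(
        r == g or (g == "C" and r == "T")
        for r, g in zip(read.upper(), ref_window.upper())
    )


if __name__ == "__main__":
    reference = "ACGTTCGACCGGA"
    # Toy bisulfite read covering reference[2:12]; the C at reference
    # position 5 is methylated (stays C), the Cs at 8 and 9 read as T.
    read = "GTTCGATTGG"

    print(reference.find(read))               # direct search fails
    pos = three_letter_map(read, reference)   # converted search succeeds
    print(pos, wildcard_match(read, reference[pos : pos + len(read)]))
```

Once a position is found, the methylation state is read off by comparing the unconverted read with the unconverted reference at that position.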

DNA Methylation databases

A large amount of data has been generated by NGS-based DNA methylation detection technologies in recent years. Several methylation databases have been developed to store these data and are available to researchers (Table 2) (Su et al., 2012). As research on DNA methylation advances, more databases are likely to be established, making more methylation information available.

Table 2. DNA Methylation databases.

| Tool | Purpose | Web page |
|---|---|---|
| MethDB | Database for DNA methylation data | http://www.methdb.de |
| MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn/ |
| PubMeth | Database of DNA methylation literature | http://www.pubmeth.org/ |
| NGSmethDB | Database for DNA methylation data at single-base resolution | http://bioinfo2.ugr.es/NGSmethDB/gbrowse/ |
| DBCAT | Database of CpG islands and analytical tools for identifying comprehensive methylation profiles in cancer cells | http://dbcat.cgm.ntu.edu.tw/ |
| MethylomeDB | Database of DNA methylation profiles of the brain | http://epigenomics.columbia.edu/methylomedb/index.html |
| DiseaseMeth | Human disease methylation database | http://bioinfo.hrbmu.edu.cn/diseasemeth |
| CpG IE | Identification of CpG islands | http://bioinfo.hku.hk/cpgieintro.html |
| CpG IS | Identification of CpG islands | http://cpgislands.usc.edu/ |
| CG clusters | Identification of CpG islands | http://greallylab.aecom.yu.edu/cgClusters/ |
| CpGcluster | Identification of CpG islands | http://bioinfo2.ugr.es/CpGcluster |
| CpGIF | Identification of CpG islands | http://www.usd.edu/~sye/cpgisland/CpGIF.htm |
| CpG_MI | Identification of CpG islands | http://bioinfo.hrbmu.edu.cn/cpgmi |
| CpGProD | Identification of CpG islands | http://pbil.univ-lyon1.fr/software/cpgprod.html |
| EpiGRAPH | Genome-scale statistical analysis | http://epigraph.mpi-inf.mpg.de/WebGRAPH |
| Galaxy | General-purpose analysis | http://main.g2.bx.psu.edu/ |
| QDMR | Identification of differentially methylated regions | http://bioinfo.hrbmu.edu.cn/qdmr |
| Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman |
| CisGenome Browser | A flexible tool for genomic data visualization | http://biogibbs.stanford.edu/~jiangh/browser/ |
| MethVisual | Visualization and exploratory statistical analysis of DNA methylation profiles from bisulfite sequencing | http://methvisual.molgen.mpg.de/ |
| MethTools | A toolbox to visualize and analyze DNA methylation data | http://genome.imb-jena.de/methtools/ |

Challenges of methylation calling in RRBS Data Analysis

Two key factors affect the accuracy of methylation calls when determining the methylation state of bisulfite sequencing reads. First, the sequencing reads must be correct and derive entirely from bisulfite-converted sequences. Second, the reads must be correctly mapped to the reference genome. Failure on either count produces incorrect methylation calls, and in extreme cases the noise from these miscalls can adversely affect the experiment's conclusions (Krueger et al., 2012). Each step of RRBS (restriction enzyme digestion, most commonly with the endonuclease MspI; bisulfite conversion; and sequencing) can affect these two factors.

  • MspI digestion

MspI digestion produces DNA fragments across a wide range of sizes (Figure 2), and fragments between 40 and 220 bp are usually size-selected for the RRBS library. The digestion also generates many fragments shorter than 40 bp. If size selection is less precise in practice than in theory, a sizeable number of these sub-40 bp fragments can end up in the RRBS library.
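
The digestion and size-selection step can be mimicked in silico. The following is a minimal sketch, assuming the standard MspI recognition site CCGG with the cut placed after the first C (C^CGG), a single-stranded toy sequence, and an idealized size filter. The function names and the toy sequence are illustrative only.

```python
import re
from typing import List


def mspi_fragments(seq: str, site: str = "CCGG", cut_offset: int = 1) -> List[str]:
    """Cut `seq` at every occurrence of `site`, `cut_offset` bases into the site.

    The first and last pieces end at the sequence boundary rather than at a
    restriction site, so they are not true MspI-MspI fragments.
    """
    seq = seq.upper()
    cuts = [m.start() + cut_offset for m in re.finditer(re.escape(site), seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments: List[str], min_len: int = 40, max_len: int = 220) -> List[str]:
    """Idealized size selection: keep fragments with min_len <= length <= max_len."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    # Toy sequence with three MspI sites separated by 50 and 20 filler bases.
    toy = "A" * 10 + "CCGG" + "T" * 50 + "CCGG" + "A" * 20 + "CCGG" + "T" * 300
    frags = mspi_fragments(toy)
    print([len(f) for f in frags])
    print([len(f) for f in size_select(frags)])
```

In this toy case only the 54 bp fragment passes the 40 to 220 bp window. The 24 bp fragment is the kind of short product that can slip into a real library when size selection is imperfect.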

Figure 2. The relative frequencies of MspI digestion product sizes in the human reference genome (Suzuki et al., 2010).

Shorter fragments are more likely to be sequenced than larger (≥300 bp) fragments, but short reads in bisulfite sequencing data can result in low mapping efficiency during data analysis (Figure 3).

Figure 3. Performance of methylation-aware mapping (biased) and unbiased mapping for methylation sequencing data (Krueger et al., 2012).

  • Bisulfite conversion

Bisulfite treatment of DNA deaminates unmethylated cytosine to uracil, whereas methylated cytosine is protected from conversion. After PCR amplification, the converted residues are read as thymine in the subsequent sequencing analysis (Figure 4).

Figure 4. The principle of bisulfite sequencing.

Bisulfite sequencing relies on the conversion of every single unmethylated cytosine residue to uracil. Incomplete conversion will cause false positive results due to incorrect interpretation of the unconverted unmethylated cytosines as methylated cytosines (Figure 5).

Figure 5. Incomplete bisulfite conversion.
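
The effect of incomplete conversion on methylation calls can be reproduced with a toy simulation. The sketch below assumes a forward-strand read aligned to the reference without gaps. The function names, the toy reference, and the chosen positions are illustrative and do not come from any published pipeline.

```python
from typing import Dict, FrozenSet, Set


def bisulfite_read(seq: str, methylated: Set[int],
                   unconverted: FrozenSet[int] = frozenset()) -> str:
    """Simulate the sequenced read for `seq` after bisulfite treatment and PCR.

    Cytosines at `methylated` positions stay C. Unmethylated cytosines are
    read as T, except those listed in `unconverted`, which model conversion
    failures and therefore also stay C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated and i not in unconverted:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference: str, read: str) -> Dict[int, bool]:
    """Call each reference C as methylated (read C) or unmethylated (read T)."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base == "C" and read_base in ("C", "T"):
            calls[i] = read_base == "C"
    return calls


if __name__ == "__main__":
    reference = "ACGTCCGA"        # cytosines at positions 1, 4, and 5
    truly_methylated = {1}

    complete = bisulfite_read(reference, truly_methylated)
    incomplete = bisulfite_read(reference, truly_methylated, frozenset({5}))

    print(complete, call_methylation(reference, complete))
    print(incomplete, call_methylation(reference, incomplete))
```

With complete conversion only position 1 is called methylated. When the unmethylated cytosine at position 5 escapes conversion, it is also called methylated, which is the false positive described above.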

  • Sequencing

Because the size-selected fragments in an RRBS library are short, several factors in the sequencing process can affect RRBS data analysis (Krueger et al., 2012):

  • Base-calling qualities: The quality of base calls tends to fall as read length increases. Poor base qualities can lead to incorrect methylation calls and/or mis-mapping.
  • Base call errors: Sequencing errors in reads can result in low mapping efficiency (reads not being aligned at all), incorrect methylation calls, or mis-alignments, which will most likely also lead to incorrect methylation calls.
  • Adapter contamination: In many libraries, a proportion of reads will run through the insert and begin to sequence the adapter on the 3′ end. Such ‘adapter contamination’ may lead to low mapping efficiency if the read fails to align or, if the read is mapped, to false alignments that result in incorrect methylation calls.
  • End repair: Positions filled in during end repair reflect the methylation state of the cytosine used for the fill-in reaction, not that of the true genomic cytosine.
  • Paired-end sequencing: Paired-end RRBS sequencing (especially with long read lengths) yields redundant methylation information if the read pairs overlap.
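
Several of the issues listed above are usually addressed by trimming reads before alignment. The sketch below illustrates the idea only and is not the algorithm of any specific trimming tool. It assumes exact adapter matching, a simple "strip low-quality bases from the 3′ end" rule, and an adapter prefix of AGATCGGAAGAGC chosen for illustration. The optional `extra_trim` parameter is a hypothetical way to discard a few insert bases next to the adapter, such as positions filled in during end repair.

```python
from typing import List, Tuple


def trim_low_quality_3prime(seq: str, quals: List[int],
                            min_q: int = 20) -> Tuple[str, List[int]]:
    """Remove trailing bases whose Phred quality is below `min_q`."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def trim_adapter(seq: str, adapter: str = "AGATCGGAAGAGC",
                 min_overlap: int = 3, extra_trim: int = 0) -> str:
    """Remove a full adapter anywhere in the read, or a partial adapter
    (at least `min_overlap` bases) at the 3' end. Exact matches only.

    When an adapter is found, `extra_trim` further bases are removed from
    the 3' end of the remaining insert.
    """
    idx = seq.find(adapter)
    if idx != -1:
        return seq[: max(0, idx - extra_trim)]
    # Look for the longest adapter prefix that ends the read.
    for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[: max(0, len(seq) - k - extra_trim)]
    return seq  # no adapter detected, read left unchanged


if __name__ == "__main__":
    insert = "CGGTTTAGGATTTCGA"
    print(trim_adapter(insert + "AGATCGGAAGAGC" + "AAAA"))  # full adapter
    print(trim_adapter(insert + "AGATCGG"))                 # partial adapter
    print(trim_adapter(insert + "AGATCGG", extra_trim=2))   # plus 2 insert bases
    print(trim_low_quality_3prime("ACGTACGT", [30, 30, 30, 30, 30, 10, 25, 5]))
```

Real trimming tools tolerate mismatches in the adapter and use more careful quality-trimming rules, so this sketch should be read as an illustration of the steps rather than a replacement for them.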

At CD Genomics, we are dedicated to providing reliable epigenomics sequencing services, including EpiTYPER DNA methylation analysis, targeted bisulfite sequencing, reduced representation bisulfite sequencing (RRBS), whole genome bisulfite sequencing, MeDIP sequencing, ChIP-seq, and MethylRAD-seq.

References:

  1. Gu, H., et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature Methods, 2010, 7:133-136.
  2. Krueger, F., et al. DNA methylome analysis using short bisulfite sequencing data. Nature Methods, 2012, 9:145-151.
  3. Meissner, A., et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Research, 2005, 33:5868-5877.
  4. Su, J., et al. Advances in Bioinformatics Tools for High-Throughput Sequencing Data of DNA Methylation. Hereditary Genet, 2012, 1:107.
  5. Sun, X., et al. A comprehensive evaluation of alignment software for reduced representation bisulfite sequencing data. Bioinformatics, 2018, 34:2715-2723.
  6. Suzuki, M., et al. Optimized design and data analysis of tag-based cytosine methylation assays. Genome Biology, 2010, 11:R36.
  7. Wreczycka, K., et al. Strategies for analyzing bisulfite sequencing data. Journal of Biotechnology, 2017, 261:105-115.