Bioinformatics Workflow for Whole Genome Sequencing

Whole genome sequencing (WGS) uses high-throughput sequencing technologies to read an organism's genome as completely as possible, and it can greatly expand our genomic knowledge. WGS can be used for variant calling, genome annotation, phylogenetic analysis, reference genome construction, and more. Although WGS aims to cover the whole genome, in practice it covers about 95% of it, because regions such as centromeres and telomeres are technically difficult to sequence. Another challenge for WGS is data management. As larger datasets become more accessible and affordable, computational analysis, rather than sequencing technology, will be the rate-limiting factor. Here we discuss the bioinformatics workflow for detecting genetic variation in WGS data.

The bioinformatics workflow for WGS is similar to that for whole exome sequencing, which is described in our article Bioinformatics Workflow for Whole Exome Sequencing. The workflow for WGS consists of the following steps: (1) raw read quality control; (2) data preprocessing; (3) alignment; (4) variant calling; (5) genome assembly; (6) genome annotation; (7) other advanced analyses that depend on your research interest, such as phylogenetic analysis.

Figure 1. Bioinformatics workflow of whole genome sequencing.

Raw read QC and preprocessing

Poor-quality reads and technical sequences, such as adapter sequences, need to be removed from the raw files (FASTQ). This step is important for accurate and reliable variant detection. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) is a powerful tool for raw read QC that reports basic statistics, per-base sequence quality, per-sequence quality scores, sequence content, GC content, sequence length distribution, overrepresented sequences, sequence duplication levels, adapter content, and k-mer content. Tools such as Fastx_trimmer and cutadapt can be used for read trimming.
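
The idea behind quality trimming can be illustrated with a minimal Python sketch. It assumes Phred+33 quality encoding, and the function names, the threshold of 20, and the minimum length are illustrative choices only. It is not how Fastx_trimmer or cutadapt are implemented, and it does not handle adapter removal.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Remove bases from the 3' end until one reaches min_quality."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def filter_reads(records, min_quality=20, min_length=4):
    """Trim each (name, sequence, quality) record and drop reads that end up too short."""
    kept = []
    for name, sequence, quality in records:
        seq, qual = trim_3prime(sequence, quality, min_quality)
        if len(seq) >= min_length:
            kept.append((name, seq, qual))
    return kept


# Hypothetical example reads: 'I' encodes Phred 40, '#' encodes Phred 2.
reads = [
    ("read1", "ACGTACGT", "IIIIII##"),  # two low-quality bases at the 3' end
    ("read2", "ACGT", "####"),          # low quality throughout
]
print(filter_reads(reads))  # [('read1', 'ACGTAC', 'IIIIII')]
```

In this sketch read1 loses its two low-quality 3' bases and read2 is discarded entirely, which mirrors what trimming and filtering are meant to achieve before alignment.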

Alignment

A reference genome must first be chosen. Mash can compare the sequencing reads against a reference set of NCBI RefSeq genomes (https://www.ncbi.nlm.nih.gov/refseq) to estimate genetic distance and relatedness. The next step is to map the quality-controlled reads to the reference genome. Burrows-Wheeler Aligner (BWA) and Bowtie2 are two popular short-read aligners. Both write their output in the standard Sequence Alignment/Map (SAM) format, which is accepted by the tools used in the following steps. Alternatively, BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) is widely used for local alignment.

Table 1. The common computational programs for read alignment.

Program | Source type | Website
Bowtie2 | Open source | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
SEAL | Open source | http://compbio.case.edu/seal/
SOAP3 | Open source | http://www.cs.hku.hk/2bwt-tools/soap3/; http://soap.genomics.org.cn/soap3.html
BWA, BWA-SW | Open source | http://bio-bwa.sourceforge.net/
Novoalign | Commercially available | http://www.novocraft.com/
SHRiMP/SHRiMP2 | Open source | http://compbio.cs.toronto.edu/shrimp/
MAQ | Open source | http://maq.sourceforge.net/
Stampy | Open source | http://www.well.ox.ac.uk/project-stampy/
ELAND | Commercially available | http://www.illumina.com/
SARUMAN | Open source | http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html
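
To make the SAM output of these aligners more concrete, the following Python sketch reads the mandatory tab-separated fields of an alignment record and decodes two bits of the FLAG field (0x4 for an unmapped read, 0x10 for a reverse-strand alignment). The function names and the example records are hypothetical, and real pipelines would normally rely on a dedicated library or tool rather than hand-written parsing.

```python
def parse_sam_line(line):
    """Split one SAM alignment line into a few of its mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read_name": fields[0],
        "flag": flag,
        "reference": fields[2],
        "position": int(fields[3]),       # 1-based leftmost mapping position
        "mapq": int(fields[4]),           # mapping quality
        "cigar": fields[5],
        "is_unmapped": bool(flag & 0x4),  # FLAG bit 0x4: read is unmapped
        "is_reverse": bool(flag & 0x10),  # FLAG bit 0x10: reverse strand
    }


def mapped_fraction(lines):
    """Fraction of alignment records (header lines skipped) that are mapped."""
    records = [parse_sam_line(l) for l in lines if not l.startswith("@")]
    if not records:
        return 0.0
    mapped = sum(1 for r in records if not r["is_unmapped"])
    return mapped / len(records)


# Hypothetical SAM content: one header line and three alignment records.
lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t16\tchr1\t200\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
print(parse_sam_line(lines[2])["is_reverse"])  # True
print(mapped_fraction(lines))                  # two of three reads are mapped
```

The fraction of mapped reads is a simple sanity check: a low value can indicate a poorly chosen reference genome or contaminated input.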

Variant calling

Once reads are aligned to the reference genome, variants can be identified by comparing the sample genome to the reference genome. Detected variants may be associated with disease or may have no functional effect. Variant Call Format (VCF) is the standard format for storing sequence variations, including SNPs (single nucleotide polymorphisms), indels, and structural variants, together with their annotations. Variant calling can be complicated by high rates of false-positive and false-negative calls for SNPs and indels. The software packages in Table 2 are useful for improving variant calling.

Table 2. The software packages for variant calling.

Software package | Description | Website
GATK | Multiple-sequence realignment; quality score recalibration; SNP genotyping; indel discovery and genotyping | http://software.broadinstitute.org/gatk/
SOAPsnp | Consensus calling and SNP detection; calculation of the likelihood of each genotype | http://soap.genomics.org.cn/
VarScan/VarScan2 | Detects variants at 1% frequency; normalizes sequence depth at each position | http://genome.wustl.edu/tools/cancer-genomics
Atlas 2 | Variant calling of aligned data from diverse NGS platforms | http://www.genboree.org/
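
As a small illustration of how VCF records can be inspected after variant calling, the Python sketch below reads the first five tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT) and labels each alternate allele by comparing allele lengths. The function names and example records are hypothetical, and the sketch assumes explicit base alleles; symbolic alleles such as those used for structural variants are not handled.

```python
def classify_variant(ref, alt):
    """Label a variant by comparing reference and alternate allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNV"  # same length, more than one base substituted


def parse_vcf_line(line):
    """Return one (chrom, pos, ref, alt, type) tuple per alternate allele."""
    chrom, pos, _var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return [
        (chrom, int(pos), ref, allele, classify_variant(ref, allele))
        for allele in alt.split(",")  # ALT may list several alleles
    ]


# Hypothetical VCF data lines (header lines starting with '#' are skipped).
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr1\t300\t.\tC\tCGG,T\t50\tPASS\t.",
]
variants = []
for line in vcf_lines:
    if not line.startswith("#"):
        variants.extend(parse_vcf_line(line))

print([v[4] for v in variants])  # ['SNV', 'deletion', 'insertion', 'SNV']
```

Counting variants by type in this way is a quick check on a call set before more careful filtering with the packages listed in Table 2.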

Genome assembly

De novo assembly is the process of joining overlapping reads into longer contigs (contiguous sequences) and ordering the contigs into scaffolds (a framework of the sequenced genome). If a reference genome from a related species is available, a common method is to first generate contigs de novo and then align them to the reference genome for scaffold assembly. An alternative approach is the “Align-Layout-Consensus” algorithm, which first aligns reads to a closely related reference genome and then builds contigs and scaffolds de novo.
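
The notion of joining overlapping reads into contigs can be sketched with a toy greedy merge in Python. This is only an illustration under strong assumptions (error-free reads, a single strand, no repeats), with hypothetical function names and a hypothetical minimum overlap of 3 bases; the assemblers listed in Table 3 use far more sophisticated graph-based methods.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps: these are the contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical error-free reads sampled from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GCGTGC", "GTGCA"]
print(greedy_assemble(reads))  # ['ATGGCGTGCA']
```

With real data, sequencing errors and repeated regions make the choice of which overlaps to trust much harder, which is why sequences often remain split into several contigs.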

Table 3. The common assemblers for diverse sequencing platforms.

Sequencing platform | Tools for genome assembly
Illumina | Velvet (https://www.ebi.ac.uk/~zerbino/velvet/), SPAdes (http://bioinf.spbau.ru/spades)
Ion Torrent | MIRA (http://www.chevreux.org/projects_mira.html)
Roche 454 | Newbler (http://454.com/contact-us/software-request.asp)
PacBio SMRT | SPAdes, HGAP, and the Celera-MHAP assembler

Users can assess the quality of a draft genome assembly, or compare assemblies generated by different methods, using a variety of metrics. Only a contiguous, near-complete (approximately 90%) assembly interrupted by nothing more than small gaps is likely to yield a successful genome annotation.

  • Genome size. Both C-value and k-mer frequency-based approaches can infer genome size.
  • Assembly contiguity. The N50 statistic can be used to evaluate assembly contiguity. It is a length-weighted median of the assembled sequence lengths: sequences of length N50 or longer contain at least half of the total assembled bases.
  • Accuracy. Transcriptome data present an important resource for validating sequence accuracy and correcting scaffolds. Comparative genomic approaches can also provide guidance in detecting mis-assemblies and chimeric contigs.
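
The N50 statistic from the list above can be computed in a few lines of Python. This is a minimal sketch with a hypothetical function name and hypothetical contig lengths: the lengths are sorted from longest to shortest and summed until the running total reaches half of the assembly size.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length  # contigs this long or longer hold >= 50% of bases
    return 0  # empty input


# Hypothetical contig lengths: total 290, half 145; 80 + 70 = 150 >= 145.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # 70
```

A higher N50 for the same total assembly size indicates that the genome is represented by fewer, longer sequences, although N50 alone says nothing about whether those sequences are correct.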

Genome annotation

To fully understand the genome sequence, it needs to be annotated with biologically relevant information such as gene ontology (GO) terms, KEGG pathways, and epigenetic modifications. The annotation involves two phases:

(1) Computational phase. This phase includes repeat masking, prediction of coding sequences (CDS), and prediction of gene models.

  • Repeat masking. Since repeats are poorly conserved across species, it is recommended to create a species-specific repeat library with tools such as RepeatModeler or RepeatExplorer.
  • Prediction of CDS. Predict CDS using ab initio algorithms.
  • Prediction of gene models. Protein alignment, syntenic protein lift-overs from other species, EST, and RNA-seq data can provide a valuable resource for predicting gene models.
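
As a rough illustration of the CDS prediction step, the Python sketch below scans the forward strand in all three reading frames for open reading frames that start with ATG and end at the first in-frame stop codon. It is a deliberate simplification with hypothetical names and thresholds: real ab initio predictors use statistical models of gene structure, handle both strands, and account for introns, none of which is attempted here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; min_codons counts the start
    and stop codons as part of the ORF.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence containing one ORF: ATG AAA TTT TAG.
dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(dna):
    print(start, end, frame, dna[start:end])  # 2 14 2 ATGAAATTTTAG
```

Candidate ORFs found this way would still need supporting evidence, such as the protein, EST, and RNA-seq alignments mentioned above, before being accepted as gene models.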

(2) Annotation phase. All the evidence mentioned above (ab initio prediction, as well as protein-, EST-, and RNA-alignments) is then synthesized into a gene annotation. Additionally, automated annotation tools such as MAKER and PASA are available to integrate and weigh the evidence. WebApollo can be used to edit the annotation through the visual interface if anything is wrong with the gene annotations.

Once the genome annotation has been assessed by visual inspection, you can publish the draft genome sequence and annotation. To allow others to improve the genome assembly and annotation, all raw data should be uploaded as well. Databases available for uploading genomes include ENSEMBL and NCBI.

