Whole genome sequencing (WGS) uses advanced sequencing technologies to read nearly all of an organism's DNA, and it has the capacity to greatly expand genomic knowledge. WGS can be used for variant calling, genome annotation, phylogenetic analysis, reference genome construction, and more. Although WGS aims to cover the whole genome, in practice it covers about 95% of it, because regions such as centromeres and telomeres are technically difficult to sequence. Another challenge for WGS is data management. As larger datasets become more accessible and affordable, computational analysis, rather than sequencing technology, will be the rate-limiting factor. Here we discuss the bioinformatics workflow for detecting genetic variation in WGS data.
The bioinformatics workflow for WGS is similar to that for whole exome sequencing; see our article Bioinformatics Workflow for Whole Exome Sequencing. The WGS workflow falls into the following steps: (1) raw read quality control; (2) data preprocessing; (3) alignment; (4) variant calling; (5) genome assembly; (6) genome annotation; (7) other advanced analyses, such as phylogenetic analysis, depending on your research interest.
Figure 1. Bioinformatics workflow of whole genome sequencing.
Raw read QC and preprocessing
Poor-quality reads and technical sequences, such as adapter sequences, need to be removed from the raw files (FASTQ). This step is important for accurate and reliable variant detection. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) is a powerful tool for raw read QC. It reports basic statistics, per-base sequence quality, per-sequence quality scores, sequence content, GC content, sequence length distribution, overrepresented sequences, sequence duplication levels, adapter content, and k-mer content. Tools like Fastx_trimmer and cutadapt can be used for read trimming.
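To make the idea of quality trimming concrete, the following is a minimal sketch in plain Python. It assumes Phred+33 encoded FASTQ with four lines per record, and it uses a simple "trim low-quality bases from the 3' end" rule. This is not the algorithm used by Fastx_trimmer or cutadapt; the function names and the thresholds (`min_q`, `min_len`) are hypothetical and chosen only for illustration.

```python
from statistics import mean


def parse_fastq(lines):
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to Phred scores (assumes Phred+33)."""
    return [ord(c) - offset for c in qual]


def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose quality is below min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_q=20, min_len=5):
    """Trim each read, then keep it only if it is long and good enough."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len and mean(phred_scores(qual)) >= min_q:
            yield name, seq, qual


# Two toy reads: read1 has a low-quality tail, read2 is low quality throughout.
EXAMPLE = """@read1
ACGTACGTAC
+
IIIIIIII##
@read2
ACGTACGTAC
+
##########
""".splitlines()

kept = list(filter_reads(parse_fastq(EXAMPLE)))
for name, seq, qual in kept:
    print(name, seq, qual)
```

In this toy input, read1 loses its two low-quality trailing bases and read2 is discarded entirely. Real data should still go through dedicated tools, which also handle adapters and paired-end reads.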
Next, a reference genome needs to be chosen. Mash can compare the sequencing reads against a reference set of NCBI RefSeq genomes (https://www.ncbi.nlm.nih.gov/refseq) to estimate genetic distance and relatedness. The quality-controlled reads are then mapped to the reference genome. Burrows-Wheeler Aligner (BWA) and Bowtie2 are two popular short read aligners. Both output the standard Sequence Alignment/Map (SAM) format, which the downstream steps consume. Alternatively, BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) is widely used for local alignment.
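As an illustration of what the SAM output looks like to downstream code, here is a minimal sketch that summarizes mapping results from SAM text. It relies only on the tab-separated column layout and the FLAG bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name `summarize_sam`, the MAPQ cutoff of 30, and the example records are assumptions made for this example.

```python
from collections import Counter

# FLAG bits from the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarize_sam(lines, min_mapq=30):
    """Count primary alignments by mapping status."""
    stats = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: mapping quality
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        stats["total"] += 1
        if flag & UNMAPPED:
            stats["unmapped"] += 1
        else:
            stats["mapped"] += 1
            if mapq >= min_mapq:
                stats["high_mapq"] += 1
    return stats


def sam_line(name, flag, ref, pos, mapq, cigar):
    """Build a toy SAM record with the 11 mandatory columns."""
    cols = [name, flag, ref, pos, mapq, cigar, "*", 0, 0, "ACGTACGTAC", "IIIIIIIIII"]
    return "\t".join(str(c) for c in cols)


EXAMPLE = [
    "@SQ\tSN:chr1\tLN:1000",
    sam_line("r1", 0, "chr1", 100, 60, "10M"),   # mapped, forward strand
    sam_line("r2", 16, "chr1", 200, 12, "10M"),  # mapped, reverse strand, low MAPQ
    sam_line("r3", 4, "*", 0, 0, "*"),           # unmapped
    sam_line("r1", 256, "chr1", 500, 0, "10M"),  # secondary alignment, ignored
]

stats = summarize_sam(EXAMPLE)
print(dict(stats))
```

For real projects, SAM and BAM files are normally read with established libraries or command-line tools rather than hand-written parsers; the sketch is only meant to show how the FLAG and MAPQ columns are interpreted.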
Table 1. The common computational programs for read alignment.
|Program||License||Website|
|SOAP3||Open source||http://www.cs.hku.hk/2bwt-tools/soap3/; http://soap.genomics.org.cn/soap3.html|
|BWA, BWA-SW||Open source||http://bio-bwa.sourceforge.net/|
Once reads are aligned to the reference genome, variants can be identified by comparing the sample genome to the reference genome. Detected variants may be associated with disease, or may simply be non-functional genomic noise. Variant Call Format (VCF) is the standard format for storing sequence variation, including single nucleotide variants (SNVs, often reported as single nucleotide polymorphisms, SNPs), indels, structural variants, and annotations. Variant calling can be complicated by high rates of false positive and false negative calls for SNVs and indels. The software packages in Table 2 are useful for improving variant calling.
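To show how VCF records can be inspected after calling, here is a minimal sketch that reads the fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER) and classifies each alternate allele by comparing REF and ALT lengths. The function names, the QUAL cutoff of 30, and the example records are assumptions for illustration; how to filter calls in practice depends on the variant caller and the study design.

```python
def classify_variant(ref, alt):
    """Classify one REF/ALT pair by allele length."""
    if alt.startswith("<") or alt == "*":
        return "symbolic"  # e.g. <DEL>, or the spanning-deletion allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"  # multi-nucleotide substitution
    return "insertion" if len(alt) > len(ref) else "deletion"


def parse_vcf(lines):
    """Yield one dict per alternate allele from VCF text lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        chrom, pos, _id, ref, alts, qual, flt = line.rstrip("\n").split("\t")[:7]
        for alt in alts.split(","):  # ALT may list several alleles
            yield {
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt,
                "qual": None if qual == "." else float(qual),
                "filter": flt,
                "type": classify_variant(ref, alt),
            }


def passing(variants, min_qual=30.0):
    """Keep calls marked PASS whose QUAL is at least min_qual."""
    return [
        v for v in variants
        if v["filter"] == "PASS" and v["qual"] is not None and v["qual"] >= min_qual
    ]


EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t12.5\tq10\t.",
    "chr1\t300\t.\tC\tCTT,G\t99\tPASS\t.",
]

variants = list(parse_vcf(EXAMPLE))
kept = passing(variants)
for v in kept:
    print(v["chrom"], v["pos"], v["ref"], v["alt"], v["type"])
```

The low-quality deletion at position 200 is dropped by the filter, while the multi-allelic site at position 300 yields one insertion and one SNV.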
Table 2. The software packages for variant calling.
De novo assembly is the process of joining overlapping reads into longer contigs (contiguous sequences) and then ordering the contigs into scaffolds (a framework of the sequenced genome). If a reference genome from a related species is available, a common method is to first generate contigs de novo and then align them to the reference genome for scaffold assembly. An alternative approach is the “Align-Layout-Consensus” algorithm, which first aligns reads to a closely related reference genome and then builds contigs and scaffolds de novo.
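The idea of merging overlapping reads into contigs can be sketched with a toy greedy assembler. This is an illustration only, under strong assumptions: reads are error-free, all come from the same strand, and overlaps are exact suffix-prefix matches. It is not how the assemblers listed in Table 3 work internally; the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]


# Three overlapping reads sampled from the toy sequence ATGGCGTACCTGA
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Reads that share no sufficiently long overlap are left as separate contigs, which is the toy equivalent of gaps in a draft assembly.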
Table 3. The common assemblers for diverse sequencing platforms.
|Sequencing platform||Tools for genome assembly|
|Illumina||Velvet (https://www.ebi.ac.uk/~zerbino/velvet/); SPAdes (http://bioinf.spbau.ru/spades)|
|Ion Torrent||MIRA (http://www.chevreux.org/projects_mira.html)|
|Roche 454||Newbler (http://454.com/contact-us/software-request.asp)|
|PacBio SMRT||SPAdes, HGAP, and the Celera-MHAP assembler|
Users can assess the quality of a draft genome assembly or compare assemblies generated by different methods. A variety of metrics reflect assembly quality, such as the number of contigs, the total assembled length, and the N50 contig length. Successful genome annotation requires a contiguous, near-complete assembly (approximately 90% of the genome) that is interrupted only by small gaps.
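One widely used contiguity metric is N50: the contig length such that contigs of that length or longer contain at least half of the assembled bases. The sketch below computes it together with a rough completeness estimate. The `expected_genome_size` parameter and the example lengths are assumptions for illustration; completeness here is simply assembled length divided by an externally supplied genome size estimate.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def assembly_stats(lengths, expected_genome_size=None):
    """Summarize an assembly from its contig lengths."""
    total = sum(lengths)
    stats = {
        "contigs": len(lengths),
        "total_length": total,
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
    if expected_genome_size:
        # Fraction of the expected genome covered by assembled sequence
        stats["completeness"] = total / expected_genome_size
    return stats


# Toy contig lengths (in bases) and a hypothetical expected genome size
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
stats = assembly_stats(contig_lengths, expected_genome_size=320)
print(stats)
```

For the toy lengths above, the two longest contigs (80 and 70) already sum to half of the 300 assembled bases, so the N50 is 70.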
For a genome sequence to be fully understood, it needs to be annotated with biologically relevant information such as gene ontology (GO) terms, KEGG pathways, and epigenetic modifications. Annotation involves two phases:
(1) Computational phase. This phase includes repeat masking, prediction of coding sequences (CDS), and prediction of gene models.
(2) Annotation phase. The evidence gathered in the computational phase, such as ab initio gene predictions and protein, EST, and RNA alignments, is synthesized into a gene annotation. Automated annotation tools such as MAKER and PASA are available to integrate and weigh the evidence. If anything is wrong with the gene annotations, WebApollo can be used to edit them through a visual interface.
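As a very simplified illustration of coding sequence prediction in the computational phase, the sketch below scans the forward strand for open reading frames (an ATG start codon followed in frame by a stop codon). Real gene predictors use far richer models (splice sites, codon usage, alignment evidence) and also scan the reverse strand; this is not how MAKER or PASA work. The function name and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 0-based and end-exclusive, and the ORF length
    in codons counts both the start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence containing one ORF in the third reading frame
sequence = "CCATGAAATTTTAGGG"
orfs = find_orfs(sequence)
for start, end, frame in orfs:
    print(frame, start, end, sequence[start:end])
```

An ORF found this way is only a candidate; in a real annotation it would be weighed against the other evidence types before being accepted as a gene model.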
Once the genome annotation has been assessed by visual inspection, you can publish the draft genome sequence and annotation. To allow others to improve the genome assembly and annotation, all raw data should be uploaded as well. Databases that accept genome submissions include ENSEMBL and NCBI.