CD Genomics-the genomics service company

Introduction to Shotgun Metagenomics, from Sampling to Data Analysis

Metagenomics applies high-throughput sequencing technologies and bioinformatics tools to obtain the genetic content of a microbial community directly, without the need to isolate and culture individual microbial species. It enables researchers not only to study the functional gene composition of microbial communities but also to conduct evolutionary research. Metagenomics has been used to identify novel biocatalysts or enzymes and to generate novel hypotheses of microbial function, which makes it a powerful and practical tool. Compared to 16S/18S/ITS amplicon sequencing, metagenomics can provide more information about the functional potential of microbial communities, as well as whole-genome sequences. The rapid development and substantial cost decrease of high-throughput sequencing have dramatically promoted the development of shotgun metagenomic sequencing.

This article gives an overview of metagenomics, from sampling to data analysis. A typical metagenomics project involves sample preparation, sequencing, and data analysis (including assembly, binning, annotation, statistical analysis, and data submission).

Figure 1. Flow diagram of a typical metagenome project.

Sample preparation

Sample preparation generally involves two steps, sample collection and DNA extraction, both of which can affect the quality and accuracy of metagenomic experiments. Commercial kits are available for sample collection and DNA isolation. The key objectives are to collect enough microbial biomass for sequencing and to minimize contamination. When working with low-biomass samples, ultraclean reagents and “blank” sequencing controls should be used so that contaminant signals can be distinguished from “real” ones.

Library preparation and sequencing

Common high-throughput sequencing platforms include Illumina systems, Roche 454, Ion Torrent instruments, and PacBio SMRT systems.

  • Next generation sequencing

Frey et al. (2014) assessed the ability of three next generation sequencing (NGS) platforms (Illumina MiSeq, Roche 454 Titanium, and Ion Torrent PGM) to identify a low-titer pathogen (viral or bacterial) in a clinically relevant blood sample. They found that Ion Torrent PGM and Illumina platforms perform better in identification of scarce microbial species, and for bacterial samples, only the MiSeq platform could provide reads that were unambiguously classified as originating from Bacillus anthracis.

The Illumina platform has become dominant for shotgun metagenomic sequencing due to its very high output (up to 1.5 Tb per run), high accuracy (error rate between 0.1% and 1%), and wide availability. Ion Torrent and PacBio SMRT instruments are becoming tough competitors in the field. The Illumina platforms mainly differ in total output and maximum read length. The Illumina HiSeq 2500 (2×250 nt with 180 Gb output, or 2×125 nt with 1 Tb output) is a classical choice for metagenomics. The newer HiSeq 3000 and 4000 systems increase the throughput of a run but are limited to a read length of 150 nt. The MiSeq instruments only generate up to 15 Gb in 2×300 mode but are still useful for single marker gene microbiome studies or for a limited number of samples.
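As a rough sense of scale, the output figures quoted above can be converted into approximate read-pair counts by dividing total bases by the bases per pair. The sketch below is back-of-the-envelope arithmetic only: the function name `read_pairs` is hypothetical, and it assumes every read reaches its full nominal length, which real runs do not.

```python
def read_pairs(output_bases: float, read_length_nt: int) -> float:
    """Approximate number of read pairs in a paired-end run.

    Assumes every read reaches the full nominal length, so the result
    is an optimistic estimate rather than a measured yield.
    """
    bases_per_pair = 2 * read_length_nt
    return output_bases / bases_per_pair


# Figures quoted in the text (Gb = 1e9 bases, Tb = 1e12 bases).
runs = {
    "HiSeq 2500, 2x250 nt, 180 Gb": (180e9, 250),
    "HiSeq 2500, 2x125 nt, 1 Tb": (1e12, 125),
    "MiSeq, 2x300 nt, 15 Gb": (15e9, 300),
}

for name, (bases, length) in runs.items():
    pairs_millions = read_pairs(bases, length) / 1e6
    print(f"{name}: ~{pairs_millions:.0f} million read pairs")
```

Under these assumptions the 2×250 mode corresponds to roughly 360 million read pairs and the MiSeq 2×300 mode to roughly 25 million, which illustrates why the MiSeq suits only a limited number of samples.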

Pacific Biosciences (PacBio) instruments, based on single-molecule, real-time (SMRT) detection in zero-mode waveguide wells, provide much greater read lengths (average read lengths up to 30 kb) than NGS instruments. Short-read sequencing (i.e. NGS) has limited ability to assemble complex or low-coverage regions, while long-read metagenomic sequencing by PacBio SMRT sequencing is able to reconstruct a high-quality and closed genome of a previously uncharacterized microbial species from metagenomic samples.

Data analysis

  • Assembly

If the research aims at obtaining full-length CDS or recovering microbial genomes, then assembly needs to be performed to generate longer genomic contigs. Assembly can be divided into two strategies: reference-based assembly and de novo assembly. Reference-based assembly is fast and accurate when closely related reference genomes are available for the sequences in the metagenomic dataset. It can be performed with software packages such as Newbler, AMOS, and MIRA. De novo assembly requires larger computational resources. The de Bruijn graph approach is the most popular method for de novo metagenome assembly.
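To make the de Bruijn graph idea concrete, here is a minimal sketch on toy reads. It is not how any production assembler is implemented: the function names and reads are invented for illustration, sequencing errors and reverse complements are ignored, and the path walk only checks outgoing branches.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop instead of looping around a cycle
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads drawn from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)  # ATGGCGTGCAA
```

In a real metagenome, repeats and closely related strains create branching nodes, which is one reason de novo assembly needs far more computation than this toy case suggests.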

If the research aims at taxonomic profiling, there is no need for assembly and binning. Assembly-free metagenomic profiling can mitigate assembly problems, and make it possible to identify low-abundance species that cannot be assembled de novo. The approach is limited because previously uncharacterized microorganisms are difficult to profile, but the number of reference genomes is increasing rapidly.
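The idea of assembly-free profiling can be sketched by assigning each read directly to the reference it shares the most k-mers with. This is a toy illustration under stated assumptions, not the method of any particular profiler: the function names, species labels, and sequences are all hypothetical, and reverse complements and sequencing errors are ignored.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def profile_reads(reads, references, k=5):
    """Assign each read to the reference sharing the most k-mers with it.

    Reads with no shared k-mers, or with a tie for the best reference,
    are counted as 'unclassified'.
    """
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        hits = {name: len(read_kmers & ks) for name, ks in ref_kmers.items()}
        best = max(hits.values(), default=0)
        winners = [name for name, n in hits.items() if n == best]
        if best == 0 or len(winners) != 1:
            counts["unclassified"] += 1
        else:
            counts[winners[0]] += 1
    return counts


# Hypothetical reference genomes and reads, far shorter than real data.
references = {
    "species_A": "ATGGCGTACGTTAGC",
    "species_B": "TTGACCGGATCCAAT",
}
reads = ["GGCGTACG", "CCGGATCC", "GCGTACGT", "AAAAAAAA"]

profile = profile_reads(reads, references)
print(profile)
```

The last read matches neither reference and ends up unclassified, which mirrors the limitation noted above: organisms without a reference genome are difficult to profile.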

  • Binning

Metagenome assemblies consist only of fragmented contigs. We do not know which genome each contig derives from, or even how many species are present. Binning is the process of grouping contigs by species. There are two strategies for binning: composition-based and similarity-based methods. Examples of composition-based binning algorithms include S-GSOM, PhyloPythia, PCAHIER, and TACOA. Similarity-based algorithms include IMG/M, MG-RAST, MEGAN, CARMA, SOrt-ITEMS, MetaWatt, SCIMM, and MetaPhyler. There are also algorithms that consider both composition and similarity, such as PhymmBL and MetaCluster.
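The composition-based strategy can be illustrated with a small sketch that groups contigs by their k-mer frequency profiles. It does not reproduce any of the tools named above: the function names, contigs, and distance threshold are invented for illustration, and k=2 is used only because the toy contigs are short (longer k-mers such as tetranucleotides need much longer contigs).

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Return normalized k-mer frequencies in a fixed (alphabetical) order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def distance(p, q):
    """Euclidean distance between two frequency profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def bin_contigs(contigs, threshold, k=2):
    """Greedy binning: join the first bin whose founding contig is close
    enough in composition, otherwise start a new bin."""
    bins = []  # list of (founder_profile, member_names)
    for name, seq in contigs.items():
        profile = kmer_profile(seq, k)
        for founder, members in bins:
            if distance(profile, founder) < threshold:
                members.append(name)
                break
        else:
            bins.append((profile, [name]))
    return [members for _, members in bins]


# Hypothetical contigs: two AT-rich and two GC-rich.
contigs = {
    "c1": "ATATTAATATTTAAAT",
    "c2": "GCGGCCGCGGGCCGCC",
    "c3": "TTAATATAATTATAAT",
    "c4": "CCGCGGCGCCGGGCGC",
}

bins = bin_contigs(contigs, threshold=0.5)
print(bins)  # [['c1', 'c3'], ['c2', 'c4']]
```

Similarity-based methods would instead compare each contig against reference sequences, so they depend on database coverage in a way that this composition-only sketch does not.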

  • Annotation

Annotation has two steps: gene identification and functional annotation. Databases that combine manually annotated and computationally predicted protein families can be used to annotate genes and metabolic pathways from metagenomes. Common databases and tools are summarized in the following table.

Table 1. Common databases and tools for annotation of metagenomic data.

Databases/Tools    Details
KEGG               KEGG is a database resource used to understand the functions and utilities of biological systems.
UniProt            UniProt is a comprehensive resource of protein sequences and functional annotation.
TIGRFAM            TIGRFAMs is a database of protein family definitions.
eggNOG             eggNOG is used for identification of orthologous gene groups and functional annotation. Other databases of orthologous gene groups include KEGG, COG, M5NR, and MetaCyc.
SILVA              SILVA is an online resource for quality-checked and aligned ribosomal RNA sequence data.
Greengenes         Greengenes is a combination of a chimera-checked 16S rRNA gene database and tools.
RDP                The Ribosomal Database Project (RDP) includes aligned and annotated rRNA gene sequence data and tools.
HUMAnN pipeline    HUMAnN is a pipeline for accurately determining the presence/absence and abundance of microbial pathways from metagenomic data.
CAZy               The CAZy (Carbohydrate-Active enZYmes) database can be used for prediction of genes coding for carbohydrate-active enzymes and for correlation analysis.
CARD               The Comprehensive Antibiotic Resistance Database (CARD) can be used for prediction of resistance genes and for correlation analysis.
MG-RAST            MG-RAST is an open-source web application server for phylogenetic and functional analysis of metagenomes.


Since shotgun sequencing of environmental samples was first performed by teams led by Banfield and Venter in 2004, shotgun metagenomics has become an indispensable tool for the study of microbial communities. The decreased cost of sequencing and the development of computational methods have promoted the widespread adoption of metagenomics.


References

  1. Faust K, Lahti L, Gonze D, et al. Metagenomics meets time series analysis: unraveling microbial community dynamics. Current Opinion in Microbiology, 2015, 25: 56-66.
  2. Frey K G, Herrera-Galeano J E, Redden C L, et al. Comparison of three next-generation sequencing platforms for metagenomic sequencing and identification of pathogens in blood. BMC Genomics, 2014, 15(1): 96.
  3. Quince C, Walker A W, Simpson J T, et al. Shotgun metagenomics, from sampling to analysis. Nature Biotechnology, 2017, 35(9): 833.
  4. Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation, 2012, 2(1): 3.
