Introduction to Shotgun Metagenomics, From Sampling to Data Analysis

Metagenomics applies high-throughput sequencing technologies and bioinformatics tools to obtain the genetic content of a microbial community directly, without the need to isolate and culture individual microbial species. It enables researchers not only to study the functional gene composition of microbial communities but also to conduct evolutionary research. It has been used to identify novel biocatalysts and enzymes and to generate new hypotheses about microbial function, which makes it a powerful and practical tool. Compared to 16S/18S/ITS amplicon sequencing, shotgun metagenomics provides more information about the functional potential of microbial communities and can recover whole-genome sequences. Rapid technical progress and a substantial decrease in the cost of high-throughput sequencing have greatly accelerated the adoption of shotgun metagenomic sequencing.

This article gives an overview of metagenomics, from sampling to data analysis. A typical metagenomics project involves sample preparation, sequencing, and data analysis (including assembly, binning, annotation, statistical analysis, and data submission).

Sample preparation

Sample preparation generally involves two steps, sample collection and DNA extraction, both of which can affect the quality and accuracy of metagenomic experiments. Commercial kits are available for both steps. The key objectives are to collect enough microbial biomass for sequencing and to minimize contamination. When working with low-biomass samples, ultraclean reagents and “blank” sequencing controls should be used so that contaminant signals can be told apart from “real” ones.

Library preparation and sequencing

Common high-throughput sequencing platforms include Illumina systems, Roche 454, Ion Torrent instruments, and PacBio SMRT systems.

Next generation sequencing

Frey et al. (2014) assessed the ability of three next generation sequencing (NGS) platforms (Illumina MiSeq, Roche 454 Titanium, and Ion Torrent PGM) to identify a low-titer pathogen (viral or bacterial) in a clinically relevant blood sample. They found that the Ion Torrent PGM and Illumina platforms performed better at identifying scarce microbial species, and that for bacterial samples only the MiSeq platform provided reads that were unambiguously classified as originating from Bacillus anthracis.

The Illumina platform has become dominant for shotgun metagenomic sequencing because of its very high output (up to 1.5 Tb per run), high accuracy (error rate between 0.1% and 1%), and wide availability. Ion Torrent and PacBio SMRT instruments are becoming strong competitors in the field. The Illumina platforms differ mainly in total output and maximum read length. The Illumina HiSeq 2500 (2×250 nt with 180 Gb output, or 2×125 nt with 1 Tb output) is a classic choice for metagenomics. The newer HiSeq 3000 and 4000 systems increase the throughput of a run but are limited to a read length of 150 nt. The MiSeq instruments generate only up to 15 Gb in 2×300 mode but are still useful for single marker gene microbiome studies or for a limited number of samples.
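
To make the trade-off between run output and number of samples concrete, the following minimal sketch estimates the mean depth of coverage that one genome in a community would receive. It is a back-of-the-envelope calculation, not part of any platform's software. The function name, the sample count, the 1% relative abundance, and the 5 Mb genome size are all hypothetical, and the sketch assumes that every sequenced base is usable and that relative abundance is expressed as a fraction of sequenced DNA.

```python
def expected_coverage(run_output_bp, n_samples, rel_abundance, genome_size_bp):
    """Rough mean depth of coverage for one genome in one sample.

    Assumes the run output is split evenly across samples and that
    rel_abundance is the fraction of sequenced bases from the target genome.
    """
    bases_per_sample = run_output_bp / n_samples
    return bases_per_sample * rel_abundance / genome_size_bp


# Hypothetical scenario: a 15 Gb MiSeq run shared by 10 samples, with a
# target organism at 1% relative abundance and a 5 Mb genome.
cov = expected_coverage(15e9, 10, 0.01, 5e6)
print(f"MiSeq, 10 samples: about {cov:.1f}x")

# The same scenario on a 1 Tb HiSeq 2500 run.
cov_hiseq = expected_coverage(1e12, 10, 0.01, 5e6)
print(f"HiSeq 2500, 10 samples: about {cov_hiseq:.0f}x")
```

Under these assumptions the low-abundance genome gets roughly 3x coverage on the MiSeq run and roughly 200x on the 1 Tb run, which illustrates why higher-output instruments are preferred when rare community members matter.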

PacBio SMRT sequencing

Pacific Biosciences (PacBio) instruments, based on single-molecule, real-time (SMRT) detection in zero-mode waveguide wells, provide much greater read lengths (average read lengths up to 30 kb) than NGS instruments. Short-read sequencing (i.e. NGS) has limited ability to assemble complex or low-coverage regions, whereas long-read metagenomic sequencing with PacBio SMRT can reconstruct a high-quality, closed genome of a previously uncharacterized microbial species from metagenomic samples.

Data analysis

Assembly

If the research aims to obtain full-length CDS or to recover microbial genomes, assembly is needed to generate longer genomic contigs. There are two assembly strategies: reference-based assembly and de novo assembly. Reference-based assembly is fast and accurate when closely related reference genomes are available for the sequences in the metagenomic dataset. It can be performed with software packages such as Newbler, AMOS, and MIRA. De novo assembly requires larger computational resources. The de Bruijn graph approach is the most popular method for de novo metagenome assembly.
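
The idea behind the de Bruijn graph approach can be illustrated with a toy sketch. Reads are broken into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and contigs are read off as unambiguous paths. The code below is a simplified illustration only: the function names and the three reads are made up, and it ignores sequencing errors, reverse complements, and coverage, all of which real assemblers have to handle.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor.

    Stops at a dead end, at a branch, or when a node would be revisited.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three hypothetical overlapping reads from the same region.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)  # ATGGCGTGCAAT
```

In a real metagenome, branches in the graph arise from repeats, strain variation, and sequencing errors, which is one reason why de novo assembly is computationally demanding.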

If the research aims at taxonomic profiling, there is no need for assembly and binning. Assembly-free metagenomic profiling can mitigate assembly problems, and make it possible to identify low-abundance species that cannot be assembled de novo. The approach is limited because previously uncharacterized microorganisms are difficult to profile, but the number of reference genomes is increasing rapidly.

Binning

A metagenome assembly consists only of fragmented contigs. We do not know which contig derives from which genome, or even how many species are present. Binning is the process of grouping contigs by the species they come from. There are two strategies for binning: composition-based and similarity-based methods. Examples of composition-based binning algorithms include S-GSOM, PhyloPythia, PCAHIER, and TACOA. Similarity-based algorithms include IMG/M, MG-RAST, MEGAN, CARMA, SOrt-ITEMS, MetaWatt, SCIMM, and MetaPhyler. Some algorithms, such as PhymmBL and MetaCluster, consider both composition and similarity.
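
Composition-based binning rests on the observation that contigs from the same genome tend to have similar oligonucleotide usage. The sketch below computes tetranucleotide frequency vectors and compares them by Euclidean distance. It is an illustration of the general idea, not the method of any of the tools listed above; the function names and the toy contigs are hypothetical, and practical binners work on much longer contigs and typically add refinements such as merging reverse-complement k-mers.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq):
    """Return the 256-dimensional tetranucleotide frequency vector of a sequence."""
    counts = dict.fromkeys(TETRAMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips k-mers containing N or other ambiguity codes
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def euclidean(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy contigs: two AT-rich fragments and one GC-rich fragment.
contig_a = "ATATATATATATATAT"
contig_b = "TATATATATATATATA"
contig_c = "GCGCGGCCGCGGCGCC"

fa, fb, fc = tetra_freq(contig_a), tetra_freq(contig_b), tetra_freq(contig_c)
print(euclidean(fa, fb) < euclidean(fa, fc))  # True: a and b look alike
```

A clustering step over such vectors would then group contigs into bins; the choice of clustering method is where the individual tools differ.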

Annotation

Annotation has two steps: gene identification and functional annotation. Databases that combine manually annotated and computationally predicted protein families can be used to annotate genes and metabolic pathways in metagenomes.
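
As a rough illustration of the gene identification step, the sketch below scans the three forward reading frames of a contig for open reading frames (ORFs) that start with ATG and end at a stop codon. This is a deliberately naive stand-in, assuming the standard stop codons and ignoring the reverse strand, alternative start codons, and partial genes at contig ends; dedicated gene predictors rely on statistical models rather than a plain scan. The function name, the `min_codons` threshold, and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based, `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical contig fragment containing one short ORF in frame 2.
seq = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(seq)
print(orfs)  # [(2, 17, 2)]
print(seq[2:17])  # ATGAAATTTGGGTAA
```

The predicted ORFs would then be translated and searched against protein family databases for functional annotation.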

Kiko