Combined with bioinformatics, next-generation sequencing (NGS) technologies have become an effective tool for the discovery, identification, and study of human pathogens. NGS approaches can be roughly divided into shotgun (non-targeted) sequencing and enrichment-based (targeted) sequencing. The enrichment method adds hybridization and clean-up steps, which raise the likelihood of capturing the target but require more sample input and increase cost and hands-on time. Routine, efficient processing and storage of the gigabases of sequence data that even a benchtop sequencer can generate requires investment in software and in costly hardware for networking, storage, and data analysis.
Various software packages and workflows have been created to improve metagenomic analysis and, specifically, pathogen identification. These range from easy-to-use web-based or commercial applications, where the accuracy of the outcome depends on default parameters selected by the software developer, to command-line tools that allow customization and potentially more tailored results, provided bioinformatics expertise is available. Although attempts have been made to optimize workflows and offer best-practice guidelines, there is no consensus on the design and application of any particular bioinformatics workflow. Workflows are therefore often produced in-house and tailored to the needs of individual laboratories, which makes further standardization more difficult.
In general, regardless of the choice of software or pipeline for detecting and identifying pathogens in a sample, there are standard steps that should be followed for a good analysis. The raw data from a sequencing platform are normally cleaned, trimmed, and filtered to remove low-quality and duplicate reads. Reads from the host genome/transcriptome are then removed to reduce background noise and improve the accuracy of pathogen detection. This step also shortens downstream analysis. To exclude contaminating reads, such as those associated with reagents or sample storage media, further background is removed by mapping sample reads to reads from the negative control. The remaining reads are normally assembled de novo to create longer stretches of sequence. This step also improves the reliability of results and the precision of downstream pathogen detection, especially for sequencing platforms that yield short reads.
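The pre-processing steps described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not a production pipeline: reads are assumed to be (sequence, Phred+33 quality string) pairs, the function names and thresholds (`min_q`, `min_len`, `k`, `max_shared`) are hypothetical, and shared k-mers are used here as a simple stand-in for the read mapping that real workflows perform with a dedicated aligner.

```python
from typing import Iterable, List, Set, Tuple

Read = Tuple[str, str]  # (bases, Phred+33 quality string)


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> Read:
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clean_reads(reads: Iterable[Read], min_q: int = 20, min_len: int = 30) -> List[Read]:
    """Trim each read, then drop reads that are too short or exact duplicates."""
    seen: Set[str] = set()
    kept: List[Read] = []
    for seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) < min_len or seq in seen:
            continue
        seen.add(seq)
        kept.append((seq, qual))
    return kept


def kmers(seq: str, k: int) -> Set[str]:
    """All substrings of length k (empty set if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_background(reads: Iterable[Read], background: Iterable[str],
                        k: int = 21, max_shared: float = 0.5) -> List[Read]:
    """Drop reads that share more than max_shared of their k-mers with
    background sequences (host or negative-control reads).
    Assumption: k-mer sharing approximates mapping to the background."""
    bg: Set[str] = set()
    for s in background:
        bg |= kmers(s, k)
    kept: List[Read] = []
    for seq, qual in reads:
        read_kmers = kmers(seq, k)
        shared = len(read_kmers & bg) / len(read_kmers) if read_kmers else 0.0
        if shared <= max_shared:
            kept.append((seq, qual))
    return kept
```

The same `subtract_background` call can be applied twice, once with host sequences and once with negative-control reads, which mirrors the order of the steps in the text.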
Sequencing read quality and read depth of coverage are among the major factors that contribute to the accuracy and completeness of a genome assembly. These factors vary between short-read and long-read platforms and between sequencing methods. If read quality and depth criteria are not met, the consensus sequence should be disregarded. Two primary genome assembly methods are usually used: reference-based (mapping-based) assembly and de novo assembly.
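As a rough illustration of a depth-of-coverage check, the sketch below computes per-base depth from alignment intervals and tests it against acceptance criteria. The interval convention (0-based, end-exclusive), the function names, and the default thresholds (`min_depth=10`, `min_breadth=0.9`) are assumptions made for this example; appropriate cut-offs depend on the platform and sequencing method.

```python
from typing import Iterable, List, Tuple


def per_base_depth(ref_len: int, alignments: Iterable[Tuple[int, int]]) -> List[int]:
    """Depth at each reference position, given (start, end) alignment
    intervals that are 0-based and end-exclusive."""
    diff = [0] * (ref_len + 1)
    for start, end in alignments:
        start = max(0, min(start, ref_len))   # clamp to the reference
        end = max(start, min(end, ref_len))
        diff[start] += 1
        diff[end] -= 1
    depth: List[int] = []
    running = 0
    for i in range(ref_len):
        running += diff[i]
        depth.append(running)
    return depth


def mean_depth(depth: List[int]) -> float:
    """Average depth across all reference positions."""
    return sum(depth) / len(depth) if depth else 0.0


def breadth(depth: List[int], min_depth: int) -> float:
    """Fraction of positions covered by at least min_depth reads."""
    if not depth:
        return 0.0
    return sum(1 for d in depth if d >= min_depth) / len(depth)


def meets_depth_criteria(depth: List[int], min_depth: int = 10,
                         min_breadth: float = 0.9) -> bool:
    """Hypothetical acceptance rule: enough of the genome is deep enough."""
    return breadth(depth, min_depth) >= min_breadth
```

For example, `per_base_depth(10, [(0, 5), (3, 10)])` gives a depth of 2 only at positions 3 and 4, so a consensus built from those two reads would fail any criterion that requires a depth of 2 across most of the reference.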
Reference-based assembly is a useful and accurate method for assembling known genomes, and it can be especially beneficial for laboratories with limited computational capacity or high sequencing throughput, and/or when time is of the essence. In a reference-based genome assembly, sample reads are mapped to a reference genome and placed according to their best match and alignment to the reference. The accuracy of reference mapping depends on the reference and the mapping parameters used. For most of the reads to map accurately, the reference genome must be closely related to the sequenced pathogen.
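The idea of placing reads by best match and then calling a consensus can be sketched in a few lines of Python. This is a deliberately naive illustration: it uses an exhaustive, ungapped mismatch count rather than the indexed, gap-aware alignment that real mappers perform, and the names `best_position`, `consensus`, `max_mismatches`, and `min_depth` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional


def best_position(read: str, ref: str, max_mismatches: int = 2) -> Optional[int]:
    """Return the leftmost reference offset with the fewest mismatches,
    or None if no placement has max_mismatches or fewer."""
    best_pos: Optional[int] = None
    best_mm = max_mismatches + 1
    for pos in range(len(ref) - len(read) + 1):
        window = ref[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos


def consensus(ref: str, reads: Iterable[str], max_mismatches: int = 2,
              min_depth: int = 1) -> str:
    """Majority-vote consensus over mapped reads; positions with fewer
    than min_depth reads are reported as 'N'."""
    piles: List[Counter] = [Counter() for _ in ref]
    for read in reads:
        pos = best_position(read, ref, max_mismatches)
        if pos is None:
            continue  # read too divergent from the reference to place
        for offset, base in enumerate(read):
            piles[pos + offset][base] += 1
    called = []
    for pile in piles:
        if sum(pile.values()) < min_depth:
            called.append("N")
        else:
            called.append(pile.most_common(1)[0][0])
    return "".join(called)
```

The `max_mismatches` parameter shows why the reference must be closely related to the sample: reads that diverge beyond the threshold are not placed at all, and the corresponding positions are left uncalled.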
De novo genome assembly may be compared to putting together a jigsaw puzzle without looking at the picture on the box: it relies on linking the sample reads to each other through overlapping sequence matches to produce longer sequences known as contigs. This approach can therefore be less accurate than reference mapping, and it is typically slower and more computationally intensive. De novo assembly is nevertheless helpful where the pathogen is poorly characterized or no good reference exists, or when insertions, deletions, or repeats are suspected. In addition, de novo assembly is widely used to identify and assemble novel pathogens, pathogens carrying horizontally transferred genes, and extrachromosomal elements such as bacterial plasmids.
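The overlap-based joining of reads into contigs can be illustrated with a small greedy merger in Python. This is a teaching sketch only: it assumes error-free reads from a single strand, it is not how production assemblers are implemented, and the names `overlap`, `greedy_assemble`, and `min_overlap` are hypothetical.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if that length is below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap
    until no pair overlaps by at least min_overlap bases."""
    unique = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read; they add no new sequence.
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(a, b, min_overlap)
                if n > best_n:
                    best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

Every round compares all pairs of sequences, which hints at why de novo assembly is slower and more computationally demanding than mapping to a reference. Repeats longer than the overlap can also lead a greedy merger to join the wrong pieces, which is one reason the result may be less accurate.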