Fig. 5 | Journal of Translational Medicine

Fig. 5

From: Unraveling metagenomics through long-read sequencing: a comprehensive review

Overview of taxonomic annotation algorithms. A The Krona chart represents hierarchical data as a multi-layered pie chart, which is useful for displaying several levels of taxonomy and their corresponding abundances simultaneously. The chart presents all NCBI taxonomy levels, from superkingdom to family, using a blend of radial and spatial display, along with parametric colors and zoom options. B A heatmap with hierarchical clustering is one of the more common visualizations of differences in species abundance. Hierarchical clustering on selected parameters is applied to both rows and columns. Blocks with similar clustering are positioned together, and a color scheme corresponding to the parameters is then applied. C Nucleotide or translated alignment searches the database with nucleotide sequences or their amino acid translations. The resulting similarity or dissimilarity can be used to draw conclusions about the relationship between species. Similarities can indicate a common ancestor, while mismatches may signify mutations in the form of indels or point mutations. For taxonomic annotation, the lowest common ancestor (LCA) algorithm and its variations are commonly employed to determine the taxonomic identity of query sequences based on their similarity to known sequences in the database. The figure depicts eight species, D to K, divided into two genera, B and C, which belong to family A. The read is aligned to protein sequences from the database, represented by species D to K. The alignment percentage ranges from 90 to 20%. Nodes A and B have read coverage of 100%, while node C has read coverage of 90%. The read is placed on the lowest taxonomic node with ≥ 80% read coverage, which is D. If node D or any other lower taxonomic node has read coverage of 80% or higher, then node B will be chosen. D CDKAM uses discriminative k-mers and an approximate matching algorithm to perform taxonomic annotation. The left image depicts a simplified view of the k-mer (5-mer) search.
The right image depicts approximate matching, where the key sequence does not have to be identical and mutations or variations are allowed. Despite 3 nucleotide mismatches, the algorithm identifies the sequence as a match. The threshold for approximate matching can be adjusted. E MetaMaps employs minimizer-based approximate mapping and the expectation-maximization (EM) algorithm for taxonomic annotation. First, minimizer-based approximate mapping swiftly generates potential mapping locations for each long read. Next, all mapping locations are scored using a probability model, and the EM algorithm estimates the overall sample composition. The EM algorithm comprises two steps: the E-step, or expectation step, and the M-step, or maximization step. The E-step computes the expected values of the missing or latent variables, and the M-step optimizes the parameters to best fit the data. The graph starts with the initial parameter θ(t). The E-step constructs the function g_t, which defines a lower bound of the function log P(x; θ). The maximizer of g_t is θ(t+1), which is computed during the M-step. The next E-step defines the new lower bound as the function g_(t+1), and the new M-step computes the new maximizer θ(t+2). The EM steps terminate when the parameter estimates converge or the maximum number of iterations is reached.
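
The E-step and M-step loop described for panel E can be illustrated with a minimal Python sketch. It is a generic EM for mixture proportions, not the MetaMaps implementation: it assumes each read already has a likelihood score for every taxon it could map to, and the function name `em_abundances`, its parameters, and the toy likelihood values are hypothetical choices made for illustration only.

```python
def em_abundances(read_likelihoods, n_taxa, max_iter=100, tol=1e-8):
    """Estimate relative taxon abundances with a simple EM loop.

    read_likelihoods: one dict per read, mapping a taxon index to an
    assumed likelihood P(read | taxon) for each candidate mapping.
    """
    # Initial parameter theta(t): start from uniform abundances.
    theta = [1.0 / n_taxa] * n_taxa
    for _ in range(max_iter):
        # E-step: split each read across its candidate taxa in
        # proportion to (current abundance * mapping likelihood).
        counts = [0.0] * n_taxa
        for lik in read_likelihoods:
            weights = {t: theta[t] * p for t, p in lik.items()}
            total = sum(weights.values())
            if total == 0.0:
                continue  # read cannot be explained by any taxon
            for t, w in weights.items():
                counts[t] += w / total
        assigned = sum(counts)
        if assigned == 0.0:
            return theta  # nothing to update from
        # M-step: new abundances are the normalized expected counts.
        new_theta = [c / assigned for c in counts]
        # Stop once the estimates have converged.
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Toy data: two reads map only to taxon 0, one only to taxon 1,
# and one is ambiguous between the two.
reads = [{0: 0.9}, {0: 0.9}, {0: 0.5, 1: 0.5}, {1: 0.8}]
abundances = em_abundances(reads, n_taxa=2)
```

In this toy input the ambiguous read is shared between the two taxa according to the current abundance estimates, so the estimates settle where taxon 0 holds about two thirds of the reads and taxon 1 about one third.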
