Glossary#

Profiling & Data Structures#

mOTUs profiler

The software tool used to taxonomically profile shotgun metagenomic or metatranscriptomic data. It maps reads to the mOTUs marker gene database and uses a consensus of the 10 marker genes to identify and quantify microbial taxa.

mOTUs db (Genome Database)

A comprehensive collection of approximately 4 million genomes (including isolates, SAGs, and MAGs). This is a “genome-resolved” resource that provides the full genomic context (genomes, genes, annotations) for the taxa represented in mOTUs. It is used to link species-level clusters back to genomes for deeper functional and taxonomic analysis.

mOTUs marker gene db

The underlying database containing marker gene sequences extracted from millions of genomes. As of recent versions (v3/v4), it integrates sequences from isolate genomes, MAGs, and SAGs to provide a comprehensive reference for both cultivated and uncultivated life.

GTDB (Genome Taxonomy Database)

A standardized, phylogenetically consistent taxonomy for bacteria and archaea based on protein sequences from whole genomes. Each genome in mOTUs db and each mOTU is mapped to a clade in GTDB.

Genome Sources#

Isolate Genome

A genome derived from a pure culture of a single microbial strain.

MAG (Metagenome-Assembled Genome)

A genome reconstructed by binning scaffolds from a metagenomic assembly.

SAG (Single-cell Amplified Genome)

A genome obtained by isolating and sequencing a single microbial cell. Like MAGs, they provide data on uncultivated taxa but avoid the potential chimera issues of metagenomic binning.

Taxonomic Units & Markers#

MG (Marker Gene)

A universal, protein-coding, single-copy phylogenetic marker gene. The mOTUs framework uses 10 specific genes (e.g., COG0012, COG0016) because they are present in almost all bacteria and archaea and typically occur only once per genome. This allows accurate quantification, for example by avoiding genome-size bias.

MGC (Marker Gene Cluster)

A cluster of marker gene sequences representing one of the 10 universal gene families for a specific mOTU. The mOTUs profiler calculates the abundance of these clusters to ultimately estimate the abundance of the mOTU itself.
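
The step from MGC abundances to a single mOTU abundance can be sketched as follows. This is a minimal illustration, assuming the mOTU estimate is the median of its MGC insert counts and that a mOTU is only reported when enough of its MGCs are detected. Both the median rule and the `min_detected` threshold are assumptions made for the example, not a description of the profiler's internals, and the keys `MG_c` and `MG_d` are placeholders.

```python
from statistics import median


def motu_abundance(mgc_counts, min_detected=3):
    """Summarize per-MGC insert counts into one mOTU-level estimate.

    mgc_counts: dict mapping a marker gene family to the insert count of
    this mOTU's MGC for that family (hypothetical input layout).
    min_detected: assumed minimum number of MGCs with non-zero counts.
    """
    detected = [count for count in mgc_counts.values() if count > 0]
    if len(detected) < min_detected:
        # Too few marker genes support this mOTU: treat it as absent.
        return 0.0
    # The median is robust to a single over- or under-counted marker gene.
    return float(median(mgc_counts.values()))


counts = {"COG0012": 10, "COG0016": 12, "MG_c": 11, "MG_d": 0}
estimate = motu_abundance(counts)
```

Using a robust summary such as the median means one marker gene with an unusual count does not dominate the estimate.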

mOTU (Marker gene-based Operational Taxonomic Unit)

A species-level cluster defined by sequence similarity of marker genes. Unlike traditional 16S rRNA OTUs, mOTUs provide higher resolution (species-level) and include both “known” species (represented in isolate genomes) and “unknown” species (extracted directly from MAGs).

unassigned

A specialized taxonomic bin that represents the collective abundance of all microbial species present in a sample that are not represented in the mOTUs database.

  • The Mechanism: It is calculated by identifying reads that map to the 10 universal marker genes (MGs) but do not share enough sequence similarity to match any mOTU in the database.

  • The Purpose (Normalization): This is the “dark matter” of your specific sample. By quantifying the unassigned fraction, the mOTUs profiler can calculate relative abundances that are comparable across different samples. Without it, you would only be measuring the “relative abundance among knowns”.
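
The effect of the unassigned fraction on normalization can be shown with a small worked example. The counts and the names `mOTU_A` and `mOTU_B` are made up for illustration; only the arithmetic is the point.

```python
def relative_abundance(counts):
    """Convert raw counts to fractions that sum to 1 (or all zeros)."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


# Hypothetical sample: 60 of 100 marker gene inserts match no mOTU.
sample = {"mOTU_A": 30, "mOTU_B": 10, "unassigned": 60}

with_unassigned = relative_abundance(sample)
known_only = relative_abundance(
    {name: count for name, count in sample.items() if name != "unassigned"}
)
```

With the unassigned bin included, `mOTU_A` makes up 30% of the community. Dropping the bin inflates it to 75%, which is the “relative abundance among knowns” and is not comparable between samples with different unassigned fractions.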

Bioinformatic Mapping Terms#

Insert

In paired-end sequencing, the “insert” is the actual fragment of DNA being sequenced, located between the two adapters. The insert size (the length of the DNA fragment) is a critical parameter for accurate mapping and downstream quantification. The mOTUs profiler quantifies inserts rather than reads, for example to reduce the number of multi-mappers and to provide an accurate unit of quantification.
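
Counting inserts instead of reads can be sketched as below. The example assumes that the two mates of a pair share a read name and differ only in a trailing `/1` or `/2` suffix, and that alignments arrive as `(read_id, mgc)` tuples. Both conventions are assumptions for illustration, not the profiler's input format.

```python
from collections import defaultdict


def count_inserts(alignments):
    """Count distinct inserts per MGC from per-read alignments.

    alignments: iterable of (read_id, mgc) tuples, where read_id looks
    like "frag7/1" or "frag7/2" (assumed naming convention).
    """
    inserts_per_mgc = defaultdict(set)
    for read_id, mgc in alignments:
        # Strip the mate suffix so both mates collapse to one insert ID.
        insert_id = read_id.rsplit("/", 1)[0]
        inserts_per_mgc[mgc].add(insert_id)
    return {mgc: len(ids) for mgc, ids in inserts_per_mgc.items()}


alignments = [
    ("frag1/1", "MGC_x"),
    ("frag1/2", "MGC_x"),  # same insert as the line above
    ("frag2/1", "MGC_x"),  # only one mate aligned
    ("frag3/2", "MGC_y"),
]
insert_counts = count_inserts(alignments)
```

Here four aligned reads yield three inserts, so a fragment is counted once whether one or both of its mates align.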

Unique mapper

A sequencing read (or insert) that aligns to only one specific MGC in the database. These are the most informative for high-precision taxonomic assignment.

Multi-mapper

A read (or insert) that aligns with high similarity to multiple MGCs in the database (e.g., due to highly conserved gene regions). The mOTUs profiler typically uses specific algorithms to distribute these inserts to avoid over- or under-estimating the abundance of closely related species.
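
One common way to distribute multi-mappers is to split each one across its candidate MGCs in proportion to their unique-mapper counts. The sketch below illustrates that general strategy only. Whether the mOTUs profiler uses this exact rule is not stated here, and the function and variable names are hypothetical.

```python
def distribute_multimappers(unique_counts, multi_inserts):
    """Add fractional multi-mapper counts to unique-mapper counts.

    unique_counts: dict of MGC -> number of uniquely mapped inserts.
    multi_inserts: list of tuples, each holding the MGCs one insert hits.
    """
    totals = {mgc: float(count) for mgc, count in unique_counts.items()}
    for candidates in multi_inserts:
        weights = [unique_counts.get(mgc, 0) for mgc in candidates]
        weight_sum = sum(weights)
        for mgc, weight in zip(candidates, weights):
            if weight_sum:
                share = weight / weight_sum
            else:
                # No unique evidence for any candidate: split evenly.
                share = 1.0 / len(candidates)
            totals[mgc] = totals.get(mgc, 0.0) + share
    return totals


unique = {"MGC_a": 30, "MGC_b": 10}
multi = [("MGC_a", "MGC_b")] * 4
totals = distribute_multimappers(unique, multi)
```

Each multi-mapped insert contributes exactly one count in total, so the overall number of inserts is preserved while closely related MGCs receive shares that follow the unique evidence.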

Functional Annotation Databases#

Complete genes called on mOTUs genomes are annotated functionally using the KEGG, eggNOG and Pfam databases. This information is then accessible using the motus genomes command.

KEGG (Kyoto Encyclopedia of Genes and Genomes)

A database resource for understanding high-level functions and utilities of the biological system. In metagenomics, it is primarily used via KO (KEGG Orthology) identifiers to map genes to biochemical pathways.

eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups)

A database of orthologous groups (OGs) that provides high-level functional annotations (via COG categories) and fine-grained orthology. It is frequently used in mOTUs and other pipelines to provide functional context to marker genes and MAGs across diverse taxonomic ranks.

Pfam (Protein families)

A comprehensive collection of protein domains and families represented by Hidden Markov Models (HMMs). Unlike KEGG, which focuses on metabolic pathways, Pfam focuses on the structurally and evolutionarily conserved domains within a protein, allowing for the functional characterization of even highly divergent sequences.

Gene Catalogs & Clustering Levels#

The Gene Catalogs are a comprehensive set of gene sequences (nucleotide or protein) predicted from mOTUs genomes and clustered at different levels of similarity.

Nucleotide Space (DNA Level)#

Redundant Catalog

The raw collection of all predicted genes from all mOTUs genomes. This contains many identical or near-identical sequences from the same species across different genomes. The associated mOTUs gene catalog is named mOTUsv4.1_NT_G_R and contains 9.5 billion gene sequences.

Non-redundant (NR) Catalog

A collapsed version where identical sequences are merged. The associated mOTUs gene catalog is named mOTUsv4.1_NT_G_NR100 and contains 2.3 billion gene sequences.
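
Collapsing identical sequences can be sketched as grouping gene IDs by their exact sequence. The example treats two genes as identical only when their sequences match character for character, ignoring case. How the actual catalog handles reverse complements or partial genes is not covered here, and the gene IDs are invented.

```python
def dereplicate(genes):
    """Group gene IDs that share an identical sequence.

    genes: iterable of (gene_id, sequence) tuples.
    Returns a dict mapping each distinct sequence to its gene IDs.
    """
    clusters = {}
    for gene_id, sequence in genes:
        # Normalize case so "acgt" and "ACGT" count as the same sequence.
        clusters.setdefault(sequence.upper(), []).append(gene_id)
    return clusters


genes = [
    ("genome1_gene1", "ATGAAATAA"),
    ("genome2_gene7", "atgaaataa"),  # identical apart from case
    ("genome3_gene2", "ATGCCCTAA"),
]
nr100 = dereplicate(genes)
```

Three redundant entries collapse to two non-redundant sequences, and the ID lists keep the link back to every source genome.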

95% Clustered (Species-level)

Genes clustered at 95% nucleotide identity (over 90% coverage). The associated mOTUs gene catalog is named mOTUsv4.1_NT_G_NR95 and contains 480 million gene sequences.
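
The two thresholds can be read as a joint test on a pairwise alignment. The sketch below assumes identity is the fraction of matching positions in the aligned region and coverage is the aligned length divided by the length of the shorter sequence. These definitions vary between clustering tools, so they are assumptions for illustration.

```python
def same_cluster(matches, aligned_len, shorter_len,
                 min_identity=0.95, min_coverage=0.90):
    """Return True if an alignment passes both clustering thresholds.

    matches: number of identical positions in the alignment.
    aligned_len: number of aligned positions.
    shorter_len: length of the shorter of the two sequences.
    """
    if aligned_len == 0 or shorter_len == 0:
        return False
    identity = matches / aligned_len
    coverage = aligned_len / shorter_len
    return identity >= min_identity and coverage >= min_coverage


passes = same_cluster(matches=960, aligned_len=1000, shorter_len=1050)
low_coverage = same_cluster(matches=960, aligned_len=1000, shorter_len=1200)
low_identity = same_cluster(matches=900, aligned_len=1000, shorter_len=1000)
```

Requiring coverage as well as identity keeps a short, highly similar fragment from pulling two otherwise different genes into one cluster.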

Protein Space (Amino Acid Level)#

Redundant/Non-redundant

Similar to nucleotide space, but based on amino acid sequences. Protein NR catalogs are more sensitive for detecting distant evolutionary relationships because amino acid sequences are more conserved than the underlying DNA. The associated redundant mOTUs gene catalog is named mOTUsv4.1_AA_G_R and contains 9.5 billion gene sequences. The associated non-redundant mOTUs gene catalog is named mOTUsv4.1_AA_G_NR100 and contains 1.6 billion gene sequences.

50% Clustered

Sequences clustered at 50% amino acid identity. This level typically groups proteins that share the same general fold and biochemical function, often used to define broad functional protein families. The associated mOTUs gene catalog is named mOTUsv4.1_AA_G_NR50 and contains 110 million gene sequences.

30% Clustered

Sequences clustered at 30% amino acid identity. This is often considered the “twilight zone” of protein bioinformatics; sequences at this level may have very different primary structures but still share similar three-dimensional architectures and ancient evolutionary origins. Clustering at this depth is used to collapse massive datasets into a representative set of “protein universes.” The associated mOTUs gene catalog is named mOTUsv4.1_AA_G_NR30 and contains 70 million gene sequences.
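
To compare the clustering levels, the catalog sizes quoted in this section can be put side by side and expressed as a fold reduction relative to the redundant catalog. The sizes are the rounded figures given above; the helper `fold_reduction` and its use of the name prefix to select catalogs are conveniences for this example, not part of any mOTUs tool.

```python
# Approximate catalog sizes (number of gene sequences) as quoted above.
CATALOG_SIZES = {
    "mOTUsv4.1_NT_G_R": 9.5e9,
    "mOTUsv4.1_NT_G_NR100": 2.3e9,
    "mOTUsv4.1_NT_G_NR95": 480e6,
    "mOTUsv4.1_AA_G_R": 9.5e9,
    "mOTUsv4.1_AA_G_NR100": 1.6e9,
    "mOTUsv4.1_AA_G_NR50": 110e6,
    "mOTUsv4.1_AA_G_NR30": 70e6,
}


def fold_reduction(space):
    """Fold reduction of each clustered catalog vs. the redundant one.

    space: "NT" for nucleotide catalogs or "AA" for protein catalogs.
    """
    redundant = CATALOG_SIZES[f"mOTUsv4.1_{space}_G_R"]
    prefix = f"mOTUsv4.1_{space}_G_NR"
    return {
        name: redundant / size
        for name, size in CATALOG_SIZES.items()
        if name.startswith(prefix)
    }


for space in ("NT", "AA"):
    for name, fold in fold_reduction(space).items():
        print(f"{name}: about {fold:.1f}x smaller than the redundant catalog")
```

Because the input sizes are rounded, the resulting ratios are only indicative.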



mOTUs is part of SIB's portfolio of open tools and databases.

mOTUs is part of the ELIXIR-CH Service Delivery Plan.