Naming conventions
A key advantage of mOTUs is the integration of diverse data types through unique, human-readable identifiers. Where available, direct links to public NCBI records are maintained.
For example, the mOTU identifier mOTUv4.0_000001 represents a species-level cluster comprising ~10k bacterial genomes. A member genome might be named ANDE20-1_SAMEA4688927_MAG_00000078. This identifier breaks down as follows:
MAG number 78
From metagenomic sample ANDE20-1_SAMEA4688927_METAG
Belonging to study ANDE20-1
Linked to the public BioSample SAMEA4688927
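The breakdown above can be sketched in code. This is a minimal illustration, not an official mOTUs tool: the function name `parse_genome_name` and the splitting rule (underscores as field separators, with the genome type and number as the last two fields) are assumptions inferred from the single example shown.

```python
from typing import NamedTuple


class GenomeName(NamedTuple):
    study: str        # e.g. "ANDE20-1"
    sample_id: str    # e.g. the BioSample "SAMEA4688927"
    genome_type: str  # e.g. "MAG"
    number: int       # e.g. 78


def parse_genome_name(name: str) -> GenomeName:
    """Split a genome name into its parts (hypothetical helper).

    Assumes the layout <study>_<sample id>_<type>_<number>, as in the
    example ANDE20-1_SAMEA4688927_MAG_00000078.
    """
    parts = name.rsplit("_", 2)  # peel off type and number from the right
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"unexpected genome name: {name!r}")
    prefix, genome_type, number = parts
    study, sep, sample_id = prefix.partition("_")
    if not sep:
        raise ValueError(f"missing sample id in genome name: {name!r}")
    return GenomeName(study, sample_id, genome_type, int(number))


genome = parse_genome_name("ANDE20-1_SAMEA4688927_MAG_00000078")
print(genome)
```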
This naming convention extends to all genomic features. Scaffold names include the genome name (e.g., ANDE20-1_...-scaffold_1), as do gene names (e.g., ANDE20-1_...-scaffold_1_1). Genes in turn map to protein clusters; for instance, the gene above belongs to the cluster mOTUsv4.1_AA_G_NR30_000024213789 (clustered at 30% protein identity).
This consistent structure allows users to seamlessly navigate from a taxonomic profile to a specific mOTU, then to an individual genome, and finally to its constituent gene clusters and functional annotations.
Study
A Study consists of all samples associated with a single publication. The identifier uses a 4-letter string, the year of publication, and an integer.
Example: ANDE20-1 is associated with Andersen et al. and contains samples from BioProject PRJEB26961.
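A Study ID can be validated with a small regular expression. The pattern below is an assumption inferred from the examples on this page (ANDE20-1, RSGB23-1): four uppercase letters, a two-digit year, a hyphen, and an integer. The helper name `parse_study_id` is hypothetical.

```python
import re

# Assumed layout: 4 uppercase letters, 2-digit year, "-", integer.
STUDY_RE = re.compile(r"^(?P<letters>[A-Z]{4})(?P<year>\d{2})-(?P<index>\d+)$")


def parse_study_id(study_id: str) -> dict:
    """Return the components of a Study ID, or raise ValueError."""
    match = STUDY_RE.match(study_id)
    if match is None:
        raise ValueError(f"unexpected Study ID: {study_id!r}")
    return {
        "letters": match["letters"],
        "year": match["year"],        # kept as written, e.g. "20"
        "index": int(match["index"]),
    }


print(parse_study_id("ANDE20-1"))
```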
Sample
Sample names begin with the Study ID, followed by a BioSample ID (if available). They end with a suffix indicating the data type: _METAG for metagenomes or _GENO for isolates/SAGs.
Metagenomic sample: ANDE20-1_SAMEA4688927_METAG
Isolate/SAG sample: RSGB23-1_GCF-024329905-V1_GENO (derived from RefSeq GCF_024329905.1)
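As a rough sketch, the data type of a sample can be read from its suffix. The helper `parse_sample_name` and the returned field names are hypothetical, and the handling of names without a middle ID is a guess, since this page does not show such a name.

```python
# Suffix meanings as described above.
SUFFIX_TO_TYPE = {"METAG": "metagenome", "GENO": "isolate/SAG"}


def parse_sample_name(name: str) -> dict:
    """Split a sample name into study, middle ID, and data type."""
    prefix, sep, suffix = name.rpartition("_")
    if not sep or suffix not in SUFFIX_TO_TYPE:
        raise ValueError(f"unexpected sample name: {name!r}")
    # The Study ID is everything before the first underscore.
    study, _, sample_id = prefix.partition("_")
    return {
        "study": study,
        "sample_id": sample_id or None,  # assumption: None when absent
        "data_type": SUFFIX_TO_TYPE[suffix],
    }


for sample in ("ANDE20-1_SAMEA4688927_METAG", "RSGB23-1_GCF-024329905-V1_GENO"):
    print(parse_sample_name(sample))
```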
Genome
Every genome carries its parent study/sample prefix. For example, ANDE20-1_SAMEA4688927_MAG_00000078 is explicitly linked to the sample ANDE20-1_SAMEA4688927_METAG.
Scaffolds, Genes
Scaffolds and genes inherit the full genome name as a prefix. The gene ...-scaffold_1_1 is the first gene on the first scaffold (...-scaffold_1) of the genome ANDE20-1_SAMEA4688927_MAG_00000078.
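Because the genome name is a prefix, a scaffold or gene name can be traced back to its genome by string handling alone. The sketch below assumes that the elided names expand to the genome name followed by `-scaffold_<n>` (scaffold) or `-scaffold_<n>_<m>` (gene); that expansion and the helper `parse_feature_name` are illustrative assumptions, not a documented API.

```python
import re

# Assumed layout: <genome>-scaffold_<scaffold index>[_<gene index>]
FEATURE_RE = re.compile(r"^(?P<genome>.+)-scaffold_(?P<scaffold>\d+)(?:_(?P<gene>\d+))?$")


def parse_feature_name(name: str) -> dict:
    """Return genome, scaffold index, and gene index (None for scaffolds)."""
    match = FEATURE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected scaffold/gene name: {name!r}")
    gene = match["gene"]
    return {
        "genome": match["genome"],
        "scaffold": int(match["scaffold"]),
        "gene": int(gene) if gene is not None else None,
    }


GENOME = "ANDE20-1_SAMEA4688927_MAG_00000078"
# Names assembled under the assumption stated above.
print(parse_feature_name(GENOME + "-scaffold_1"))
print(parse_feature_name(GENOME + "-scaffold_1_1"))
```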
mOTUs
mOTUs are species-level clusters. Every genome in the database belongs to exactly one mOTU. mOTUs are numbered sequentially starting from mOTUv4.0_000000. Genomes that cannot be placed (e.g., due to insufficient marker genes) are assigned to the group no_mOTU.
Note
While the current database version is 4.1, the primary mOTU clustering was established in version 4.0. New genomes added in 4.1 were assigned to existing clusters rather than re-clustering the entire database.
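The numbering scheme can be illustrated with two small helpers. Both function names are hypothetical, and the six-digit zero padding is inferred from the identifiers shown on this page (mOTUv4.0_000000, mOTUv4.0_000001) rather than from a formal specification.

```python
import re
from typing import Optional

# Assumed layout: "mOTUv" + version + "_" + six-digit index.
MOTU_RE = re.compile(r"^mOTUv(?P<version>\d+\.\d+)_(?P<index>\d{6})$")


def format_motu_id(index: int, version: str = "4.0") -> str:
    """Build an mOTU identifier; the version defaults to the clustering release."""
    return f"mOTUv{version}_{index:06d}"


def motu_index(identifier: str) -> Optional[int]:
    """Return the numeric index, or None for the unplaced group no_mOTU."""
    if identifier == "no_mOTU":
        return None
    match = MOTU_RE.match(identifier)
    if match is None:
        raise ValueError(f"unexpected mOTU identifier: {identifier!r}")
    return int(match["index"])


print(format_motu_id(1))
print(motu_index("mOTUv4.0_000001"), motu_index("no_mOTU"))
```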
Gene Clusters
Genes are clustered at various identity cutoffs in both protein (AA) and nucleotide (NT) space. The cluster identifiers reflect these parameters.
Example: mOTUsv4.1_AA_G_NR30_000024213789 is a protein cluster (AA) at 30% identity (NR30), built from all genomes (G) in mOTUs 4.1.
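The parameters encoded in a cluster identifier can be pulled apart as in the sketch below. The field layout is inferred from the single example above; the helper `parse_cluster_id`, the field names, and the assumption that the third field is a short uppercase code are all illustrative.

```python
import re

# Assumed layout: mOTUsv<version>_<AA|NT>_<scope>_NR<identity>_<number>
CLUSTER_RE = re.compile(
    r"^mOTUsv(?P<version>\d+\.\d+)"
    r"_(?P<space>AA|NT)"
    r"_(?P<scope>[A-Z]+)"
    r"_NR(?P<identity>\d+)"
    r"_(?P<number>\d+)$"
)


def parse_cluster_id(cluster_id: str) -> dict:
    """Return the parameters encoded in a gene cluster identifier."""
    match = CLUSTER_RE.match(cluster_id)
    if match is None:
        raise ValueError(f"unexpected cluster identifier: {cluster_id!r}")
    return {
        "version": match["version"],          # database version, e.g. "4.1"
        "space": match["space"],              # "AA" (protein) or "NT" (nucleotide)
        "scope": match["scope"],              # e.g. "G" in the example
        "identity": int(match["identity"]),   # identity cutoff in percent
        "number": int(match["number"]),
    }


print(parse_cluster_id("mOTUsv4.1_AA_G_NR30_000024213789"))
```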
Each element is linked via its mCGE identifier to its source sample.