Naming conventions#

The mOTUs naming system integrates diverse data types through unique, human-readable identifiers. Entries associated with a public NCBI record preserve the corresponding record identifier.

For example, the mOTU identifier mOTUv4.0_000001 represents a species-level cluster comprising roughly 10,000 bacterial genomes. The representative genome for this cluster is RSGB23-1_GCF-024329905-V1_GENO_10000001.

Another genome within the cluster is ANDE20-1_SAMEA4688927_MAG_00000078, which is a metagenome-assembled genome (MAG).

This naming convention extends to all genomic features. The genome name is carried in the names of all its scaffolds (e.g., ANDE20-1_SAMEA4688927_MAG_00000078-scaffold_1) and genes (e.g., ANDE20-1_SAMEA4688927_MAG_00000078-scaffold_1_1).

The two genome identifiers break down as follows:

`RSGB23-1_GCF-024329905-V1_GENO_10000001`#
`RSGB23-1`	A reference genome downloaded from the RefSeq/GenBank database.
`GCF-024329905-V1`	Associated with NCBI assembly GCF_024329905.1
`GENO_10000001`	Standard suffix for all reference genomes within mOTUs-db.

`ANDE20-1_SAMEA4688927_MAG_00000078`#
`ANDE20-1`	Belonging to study `ANDE20-1` (see study naming)
`SAMEA4688927`	Reconstructed from NCBI BioSample SAMEA4688927
`MAG_00000078`	The 78th MAG reconstructed from this sample

Study#

A study consists of all samples associated with a single publication or project. The identifier combines a four-letter string, the two-digit year of publication, and, after a hyphen, an integer. For example, ANDE20-1 is associated with the study from Andersen, VD, et al. (2020) and contains samples from BioProject PRJEB26961.

The study identifiers RSGB23-1, RSGB23-2, and JGIG23-1 cover reference isolate genomes or SAGs downloaded from NCBI RefSeq and GenBank databases, or the JGI Genome Portal.
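
As an illustration of the layout described above, the following minimal sketch splits a study identifier into its parts. The regular expression is an assumption inferred from the examples on this page (ANDE20-1, RSGB23-1, JGIG23-1), and `parse_study_id` is a hypothetical helper, not part of any mOTUs tool.

```python
import re

# Assumed layout, inferred from ANDE20-1, RSGB23-1 and JGIG23-1:
# four uppercase letters, a two-digit year, a hyphen, and an integer.
STUDY_ID_PATTERN = re.compile(r"(?P<tag>[A-Z]{4})(?P<year>\d{2})-(?P<index>\d+)")


def parse_study_id(study_id: str) -> dict:
    """Split a study identifier into its tag, two-digit year and index."""
    match = STUDY_ID_PATTERN.fullmatch(study_id)
    if match is None:
        raise ValueError(f"not a recognised study identifier: {study_id!r}")
    return {
        "tag": match["tag"],
        "year": match["year"],  # kept as text; the century is not encoded
        "index": int(match["index"]),
    }


print(parse_study_id("ANDE20-1"))
# {'tag': 'ANDE', 'year': '20', 'index': 1}
```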

Sample#

Sample names begin with the Study ID, followed by a BioSample ID (if available). They end with a suffix indicating the data type: _GENO for isolates/SAGs or _METAG for metagenomes.

Isolate/SAG sample: RSGB23-1_GCF-024329905-V1_GENO (derived from GCF_024329905.1)
Metagenomic sample: ANDE20-1_SAMEA4688927_METAG
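
The two sample names above can be taken apart with a short sketch like the one below. Both helpers are hypothetical. The sketch assumes that an accession is always present between the study ID and the suffix, and that assembly accessions are rewritten as in the example (GCF_024329905.1 becomes GCF-024329905-V1); accepting the GCA prefix alongside GCF is a further assumption.

```python
import re

DATA_TYPES = {"GENO", "METAG"}  # suffixes named in the text above


def split_sample_name(sample: str) -> dict:
    """Split '<study>_<accession>_<suffix>' into its three parts."""
    study, rest = sample.split("_", 1)
    accession, data_type = rest.rsplit("_", 1)
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type suffix: {data_type!r}")
    return {"study": study, "accession": accession, "data_type": data_type}


def ncbi_assembly_accession(token: str) -> str:
    """Turn 'GCF-024329905-V1' back into 'GCF_024329905.1' (assumed mapping)."""
    match = re.fullmatch(r"(GC[AF])-(\d+)-V(\d+)", token)
    if match is None:
        raise ValueError(f"not an assembly token: {token!r}")
    prefix, digits, version = match.groups()
    return f"{prefix}_{digits}.{version}"


isolate = split_sample_name("RSGB23-1_GCF-024329905-V1_GENO")
print(isolate["study"], ncbi_assembly_accession(isolate["accession"]))
# RSGB23-1 GCF_024329905.1
print(split_sample_name("ANDE20-1_SAMEA4688927_METAG"))
# {'study': 'ANDE20-1', 'accession': 'SAMEA4688927', 'data_type': 'METAG'}
```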

Genome#

The parent study and sample prefixes are included in every genome identifier. MAGs recovered from the same metagenomic sample are numbered.

Isolate/SAG: RSGB23-1_GCF-024329905-V1_GENO_10000001 is linked to sample entry RSGB23-1_GCF-024329905-V1_GENO
MAG: ANDE20-1_SAMEA4688927_MAG_00000078 is the 78th genome that has been reconstructed from the sample ANDE20-1_SAMEA4688927_METAG
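
The link between a genome and its sample can be sketched as follows. `sample_for_genome` is a hypothetical helper; it assumes that every genome identifier ends in `_GENO_<number>` or `_MAG_<number>` and that these map to the `_GENO` and `_METAG` sample suffixes respectively, as in the two examples above.

```python
# Assumed mapping from the genome type token to the sample suffix.
SAMPLE_SUFFIX = {"GENO": "GENO", "MAG": "METAG"}


def sample_for_genome(genome_id: str) -> tuple:
    """Return (sample name, genome number) for a genome identifier."""
    prefix, genome_type, number = genome_id.rsplit("_", 2)
    if genome_type not in SAMPLE_SUFFIX:
        raise ValueError(f"unknown genome type: {genome_type!r}")
    return f"{prefix}_{SAMPLE_SUFFIX[genome_type]}", int(number)


print(sample_for_genome("RSGB23-1_GCF-024329905-V1_GENO_10000001"))
# ('RSGB23-1_GCF-024329905-V1_GENO', 10000001)
print(sample_for_genome("ANDE20-1_SAMEA4688927_MAG_00000078"))
# ('ANDE20-1_SAMEA4688927_METAG', 78)
```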

Scaffold or gene#

Scaffolds and genes inherit the genome identifier as a prefix, followed by a hyphen, the numbered scaffold (e.g., scaffold_1) and, for genes, the number of the gene on that scaffold.

ANDE20-1_SAMEA4688927_MAG_00000078-scaffold_1_1 - the first gene on the first scaffold of the genome ANDE20-1_SAMEA4688927_MAG_00000078
ANDE20-1_SAMEA4688927_MAG_00000078-scaffold_5_2 - the second gene on the fifth scaffold of the genome ANDE20-1_SAMEA4688927_MAG_00000078
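
A minimal sketch of how such names could be decomposed is shown below. The pattern is an assumption based on the examples on this page, and `parse_feature_id` is a hypothetical helper that accepts both scaffold and gene identifiers.

```python
import re

# Assumed layout: '<genome>-scaffold_<n>' for scaffolds and
# '<genome>-scaffold_<n>_<m>' for genes.
FEATURE_PATTERN = re.compile(
    r"(?P<genome>.+)-scaffold_(?P<scaffold>\d+)(?:_(?P<gene>\d+))?"
)


def parse_feature_id(feature_id: str) -> dict:
    """Return the genome, scaffold number and gene number (None for scaffolds)."""
    match = FEATURE_PATTERN.fullmatch(feature_id)
    if match is None:
        raise ValueError(f"not a scaffold or gene identifier: {feature_id!r}")
    gene = match["gene"]
    return {
        "genome": match["genome"],
        "scaffold": int(match["scaffold"]),
        "gene": int(gene) if gene is not None else None,
    }


print(parse_feature_id("ANDE20-1_SAMEA4688927_MAG_00000078-scaffold_5_2"))
# {'genome': 'ANDE20-1_SAMEA4688927_MAG_00000078', 'scaffold': 5, 'gene': 2}
```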

mOTU#

mOTUs are species-level clusters defined based on marker genes (see Concept page). Every genome in the database belongs to exactly one mOTU. mOTUs are numbered sequentially starting from mOTUv4.0_000000. Genomes that cannot be placed (e.g., due to an insufficient number of marker genes) are assigned to the group no_mOTU.

Note

While the current database version is 4.1, the primary mOTU clustering was established in version 4.0. New genomes added in 4.1 were assigned to existing clusters rather than re-clustering the entire database.
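
For scripts that handle mOTU assignments, a sketch along the following lines may be useful. The six-digit zero padding is an assumption taken from the examples on this page, and both helpers are hypothetical. Note that the identifier carries the clustering version (4.0), not the database version.

```python
import re
from typing import Optional

MOTU_ID_PATTERN = re.compile(r"mOTUv(?P<version>\d+\.\d+)_(?P<number>\d{6})")
UNASSIGNED = "no_mOTU"  # group for genomes that cannot be placed


def parse_motu_id(motu_id: str) -> Optional[dict]:
    """Return the clustering version and number, or None for no_mOTU."""
    if motu_id == UNASSIGNED:
        return None
    match = MOTU_ID_PATTERN.fullmatch(motu_id)
    if match is None:
        raise ValueError(f"not a mOTU identifier: {motu_id!r}")
    return {"clustering_version": match["version"], "number": int(match["number"])}


def format_motu_id(number: int, version: str = "4.0") -> str:
    """Build an identifier using the assumed six-digit zero padding."""
    return f"mOTUv{version}_{number:06d}"


print(parse_motu_id("mOTUv4.0_000001"))
# {'clustering_version': '4.0', 'number': 1}
print(format_motu_id(0))
# mOTUv4.0_000000
```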

Gene cluster#

Genes are clustered in nucleotide (NT) space at identity thresholds of 95% and 100%. The cluster identifiers reflect these parameters.

mOTUsv4.1_NT_G_NR100_000060812858 - a 100% nucleotide identity cluster (NR100, NT) built from mOTUs 4.1 genomes (4.1, G). It contains 658 genes, among them ANDE20-1_SAMEA4688927_MAG_00000078-scaffold_1_1.
mOTUsv4.1_NT_G_NR95_000118583542 - a 95% nucleotide identity cluster (NR95, NT) built from mOTUs 4.1 genomes (4.1, G). It contains 11571 genes, among them ANDE20-1_SAMEA4688927_MAG_00000078-scaffold_1_1.
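
The fields encoded in a gene cluster identifier can be read back with a sketch like the one below. `parse_gene_cluster_id` is a hypothetical helper, and the five-field, underscore-separated layout is an assumption based on the two examples above.

```python
def parse_gene_cluster_id(cluster_id: str) -> dict:
    """Split a gene cluster identifier into its encoded parameters."""
    # Unpacking raises ValueError if the identifier does not have five fields.
    release, space, source, threshold, number = cluster_id.split("_")
    if not release.startswith("mOTUsv") or not threshold.startswith("NR"):
        raise ValueError(f"not a gene cluster identifier: {cluster_id!r}")
    return {
        "release": release[len("mOTUsv"):],     # database version, e.g. '4.1'
        "sequence_space": space,                # 'NT' for nucleotide
        "source": source,                       # 'G' for genomes
        "identity_percent": int(threshold[2:]), # 95 or 100
        "number": int(number),
    }


print(parse_gene_cluster_id("mOTUsv4.1_NT_G_NR95_000118583542"))
# {'release': '4.1', 'sequence_space': 'NT', 'source': 'G', 'identity_percent': 95, 'number': 118583542}
```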

mOTUs is part of SIB's portfolio of open tools and databases.

mOTUs is part of the ELIXIR-CH Service Delivery Plan.