motus classify

The motus classify command maps genome sequences of unknown taxonomy (in FASTA format) to known taxonomic units (mOTUs) using the mOTUs marker gene database. This is particularly useful for linking your own genomes to their abundances in public or private taxonomic profiles.

Note: This command requires the mOTUs marker gene database. If you have not downloaded it yet, run motus downloadMGDB first.

1. Preparing Your Input

The tool does not take FASTA files directly on the command line. Instead, it requires a manifest file: a simple text file containing the paths to your genome files, with one path per line.

Requirements:

  • Files must have a .fa or .fasta extension, or be gzip-compressed (.gz).

  • Crucial: Every file name must be unique, even if the files are in different folders.
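The two requirements above can be checked while building the manifest. The following is a minimal sketch, not part of the mOTUs tool: the function names make_manifest and find_duplicate_names are hypothetical, and the sketch assumes that "unique file name" refers to the base name of each path.

```shell
# Hypothetical helpers (not part of motus): build a manifest and check base names.

# Write one path per line for every .fa, .fasta or .gz file below a directory.
make_manifest() {
    dir="$1"
    out="$2"
    find "$dir" -type f \( -name '*.fa' -o -name '*.fasta' -o -name '*.gz' \) | sort > "$out"
}

# Print every base name that occurs more than once in the manifest.
# Assumption: uniqueness is judged on the file name without its folder.
find_duplicate_names() {
    sed 's|.*/||' "$1" | sort | uniq -d
}

# Example usage (directory and file names are placeholders):
#   make_manifest genomes genomes.txt
#   find_duplicate_names genomes.txt
```

If find_duplicate_names prints nothing, no two paths in the manifest share a file name.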

Download tutorial genomes (4 genome files and 1 manifest file, ~2MB)

$ wget https://zenodo.org/records/19071244/files/genomes.tar.gz
$ tar -xzvf genomes.tar.gz
$ ls genomes/

ACIN21-1_SAMN05421555_MAG_00000004.fa
ARTA20-1_SAMN17006390_MAG_00000056.fa
BAEC15-1_SAMEA2580230_MAG_00000003.fa
ROHW24-1_SAMN18246391_MAG_00000126.fa
genomes.txt

$ cat genomes/genomes.txt

genomes/ACIN21-1_SAMN05421555_MAG_00000004.fa
genomes/ARTA20-1_SAMN17006390_MAG_00000056.fa
genomes/BAEC15-1_SAMEA2580230_MAG_00000003.fa
genomes/ROHW24-1_SAMN18246391_MAG_00000126.fa
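Because the manifest in this tutorial uses relative paths, it can help to confirm that every listed file is reachable from the directory where the command will be run. The sketch below is an illustration only: check_manifest is a hypothetical helper, and it assumes paths are resolved relative to the current working directory, as in the tutorial layout.

```shell
# Hypothetical helper (not part of motus): verify that each path in a
# manifest points to an existing regular file.
check_manifest() {
    manifest="$1"
    status=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r path || [ -n "$path" ]; do
        # Skip empty lines.
        [ -z "$path" ] && continue
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done < "$manifest"
    return "$status"
}

# Example usage (run from the directory that contains genomes/):
#   check_manifest genomes/genomes.txt && echo "manifest OK"
```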

2. Running motus classify

To run the classification, pass the manifest file with -i and the desired output file name with -o.

$ motus classify -i genomes/genomes.txt -o classify_example.tsv -t 32

INFO: mOTU tool starting - mOTUs4:4.0.4
INFO: Starting mOTUs classify:
INFO: Loading database ...
INFO: Loading database finished. Version 4.0 (version date: 2024-08-09) contains 124300 mOTUs, 2075157 markergeneclusters and 3436253 markergenes.
INFO:         Input = 4 genomes.
INFO:         Output will be written to classify_example.tsv
INFO:         Temporary files will be written to classify_example.tsv_classify_tmp
INFO:         Running fetchMGs on genomes.
INFO: Starting gene calling from 4 genome files.            100%|██████████ 4/4 [00:03<00:00,  1.25genomes/s]
INFO: Finished gene calling.
INFO: Starting marker gene extraction from 4 protein files. 100%|██████████ 4/4 [00:01<00:00,  3.63protein files/s]
INFO: Finished marker gene extraction.
INFO:         Finished running fetchMGs on genomes.
INFO:         Collecting fetchMGs results
INFO:         Finished collecting fetchMGs results. Genomes = 4, Genomes with enough MGs = 3
INFO:         Matching genomes against the mOTUs database
INFO:         Aligning genome marker genes against the mOTUs marker gene database using vsearch
INFO:         Finished alignment
INFO:         Combining individual marker gene distances into genome to genome distances
INFO:         Finished combining

Understanding the Options:

  • -i: Path to your text file listing the genomes.

  • -o: The name of the resulting classification file.

  • -t: Number of threads.

Note: Only some parts of the classify routine are parallelized; the others run on a single thread. In general, it is recommended to use up to 32 threads.
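In a script, one way to follow this guidance is to cap the thread count at 32 on larger machines. This is a sketch under that assumption: cap_threads is a hypothetical helper, and getconf _NPROCESSORS_ONLN is used only as one common way to read the number of available cores.

```shell
# Hypothetical helper: return the smaller of the available core count
# and a maximum (default 32, following the recommendation above).
cap_threads() {
    available="$1"
    max="${2:-32}"
    if [ "$available" -gt "$max" ]; then
        echo "$max"
    else
        echo "$available"
    fi
}

# Example usage (not executed here):
#   threads=$(cap_threads "$(getconf _NPROCESSORS_ONLN)")
#   motus classify -i genomes/genomes.txt -o classify_example.tsv -t "$threads"
```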

3. Understanding the Output

The tabular output file contains one line per submitted genome. The MOTU column reports one of three outcomes: the assigned mOTU (e.g. mOTUv4.0_000062 for ARTA20-1_SAMN17006390_MAG_00000056.fa); <6MGs-no_mOTU (e.g. ACIN21-1_SAMN05421555_MAG_00000004.fa) if fewer than 6 of the 10 marker genes were found in the genome; or Novel-no_mOTU (e.g. ROHW24-1_SAMN18246391_MAG_00000126.fa) if the genome had at least 6 marker genes but could not be assigned to any mOTU. In this example, both unassigned genomes are reported with a SIMILARITY of -1.0.

$ cat classify_example.tsv

GENOME                                            MOTU  SIMILARITY    NUM_MGS
ACIN21-1_SAMN05421555_MAG_00000004.fa    <6MGs-no_mOTU        -1.0          1
ARTA20-1_SAMN17006390_MAG_00000056.fa  mOTUv4.0_000062        99.6          8
BAEC15-1_SAMEA2580230_MAG_00000003.fa  mOTUv4.0_007086        99.7          6
ROHW24-1_SAMN18246391_MAG_00000126.fa    Novel-no_mOTU        -1.0          8
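For larger manifests, the three outcomes can be tallied with standard tools. The sketch below is illustrative only: summarize_classification and list_assigned are hypothetical helpers, and they assume the layout shown above (a header row, whitespace- or tab-separated columns, and no spaces in genome file names).

```shell
# Hypothetical helpers (not part of motus) for a classify output table.

# Count genomes per outcome, based on the MOTU column (second field).
summarize_classification() {
    awk 'NR > 1 {
        if ($2 == "<6MGs-no_mOTU") few++
        else if ($2 == "Novel-no_mOTU") novel++
        else assigned++
    }
    END {
        printf "assigned\t%d\ntoo_few_MGs\t%d\nnovel\t%d\n", assigned, few, novel
    }' "$1"
}

# Print genome and mOTU for every genome that received an assignment.
list_assigned() {
    awk 'NR > 1 && $2 !~ /no_mOTU$/ { print $1 "\t" $2 }' "$1"
}

# Example usage:
#   summarize_classification classify_example.tsv
#   list_assigned classify_example.tsv
```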


mOTUs is part of SIB's portfolio of open tools and databases.

mOTUs is part of the ELIXIR-CH Service Delivery Plan.