motus genomes + motus download

Note: This tutorial has been designed to be run on Unix-based systems (macOS or Linux) and requires the mOTUs profiler to be correctly installed as described on the quickstart page.

Note: First-time execution of motus genomes downloads the mOTUs annotation database, which requires 17.7G of storage.

The motus genomes command queries the mOTUs genome database to find genomes matching indicated mOTUs identifiers, genome identifiers, taxonomic clades, or functional annotation groups.

By default, motus genomes requires two parameters (see option manual):

  • -i | --input-queries: List of search queries separated by a space. Alternatively, a text file listing search queries, with one query per line.

  • -o | --output-file: Path to the output table containing an overview of genomes, together with annotations requested by the user.

The genome database, comprising 3.7M genomes, can be browsed interactively on mOTUs-db, which can also be used to retrieve a list of genome identifiers that is compatible with both motus genomes and motus download.

Querying by mOTUs identifier

Suppose we would like to study all genomes assigned to Gilliamella apis, which corresponds to mOTUv4.0_001734. To do this, run the following command:

motus genomes -i mOTUv4.0_001734 -o genomes_gilliamella_apis.tsv

The genomes_gilliamella_apis.tsv file should contain a list of 43 genomes. The beginning of the file should look like this:

GENOME                                QUERY
ELLE19-1_SAMN09288280_MAG_00000004    mOTUv4.0_001734
ELLE19-1_SAMN09288282_MAG_00000006    mOTUv4.0_001734
ELLE19-1_SAMN09288284_MAG_00000011    mOTUv4.0_001734
ELLE19-1_SAMN09288286_MAG_00000003    mOTUv4.0_001734
ELLE19-1_SAMN09288291_MAG_00000012    mOTUv4.0_001734
ELLE19-1_SAMN09288293_MAG_00000013    mOTUv4.0_001734
ELLE19-1_SAMN09288295_MAG_00000020    mOTUv4.0_001734
ELLE19-1_SAMN09288297_MAG_00000010    mOTUv4.0_001734
ELLE19-1_SAMN09288314_MAG_00000008    mOTUv4.0_001734
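To confirm the expected number of genomes, you can count the data rows of the output file. The sketch below assumes the table is tab-separated with a single header line, as the .tsv extension and the excerpt above suggest. The helper name count_genomes and the small stand-in file are illustrative only and are not part of mOTUs.

```shell
#!/bin/sh
# Count the genomes (non-empty data rows) in a motus genomes output table.
# Assumption: the first line is the header and each later line is one genome.
count_genomes() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Stand-in file with three rows copied from the excerpt above
# (a real run would produce the full table).
count_demo=$(mktemp)
printf 'GENOME\tQUERY\n' > "$count_demo"
printf 'ELLE19-1_SAMN09288280_MAG_00000004\tmOTUv4.0_001734\n' >> "$count_demo"
printf 'ELLE19-1_SAMN09288282_MAG_00000006\tmOTUv4.0_001734\n' >> "$count_demo"
printf 'ELLE19-1_SAMN09288284_MAG_00000011\tmOTUv4.0_001734\n' >> "$count_demo"

count_genomes "$count_demo"   # prints 3
```

Run against the real genomes_gilliamella_apis.tsv, count_genomes should report 43 if the query behaved as described.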

By default, the output file (genomes_gilliamella_apis.tsv) contains two columns: GENOME, listing mOTUs-db genome identifiers, and QUERY, indicating the database query supplied by the user. To retrieve taxonomic or functional annotations for the genomes, pass TAXONOMY, KEGG, EGGNOG, or PFAM to the -d option when running motus genomes:

motus genomes -i mOTUv4.0_001734 -o genomes_gilliamella_apis.tsv -d KEGG

Here, KEGG adds a column to the output file containing a semicolon-separated list of all KEGG orthologous groups detected within the corresponding genome.
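Because the annotation column is a semicolon-separated list, it can be split with standard tools. The following sketch counts the KEGG orthologous groups per genome. It assumes that the table is tab-separated and that the added column header is literally KEGG, which the tutorial does not state explicitly. The function name count_kos and the demo rows (genome_A, genome_B and their annotation lists) are made up for illustration.

```shell
#!/bin/sh
# Print each genome with the number of entries in its KEGG column.
# Assumptions: tab-separated table, header row contains a column named "KEGG".
count_kos() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "KEGG") col = i; next }
        col && NF { n = ($col == "") ? 0 : split($col, kos, ";"); print $1 "\t" n }
    ' "$1"
}

# Made-up demo table (not real motus output).
kegg_demo=$(mktemp)
printf 'GENOME\tQUERY\tKEGG\n' > "$kegg_demo"
printf 'genome_A\tmOTUv4.0_001734\tK21722;K21723;K21724\n' >> "$kegg_demo"
printf 'genome_B\tmOTUv4.0_001734\tK21722\n' >> "$kegg_demo"

count_kos "$kegg_demo"
```

If the column is not found, the function prints nothing, which makes a wrong header assumption easy to spot.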

Querying by taxonomy

We can also search for genomes whose taxonomic annotation includes the scientific name Gilliamella apis. Note that “Gilliamella apis” must be enclosed in quotes for an exact search; otherwise, Gilliamella and apis will be interpreted as two separate search terms.

motus genomes -i "Gilliamella apis" -o genomes_gilliamella_apis.tsv

The beginning of the genomes_gilliamella_apis.tsv file should look like this:

GENOME                                QUERY
ELLE19-1_SAMN09288280_MAG_00000004    Gilliamella apis
ELLE19-1_SAMN09288282_MAG_00000006    Gilliamella apis
ELLE19-1_SAMN09288283_MAG_00000001    Gilliamella apis
ELLE19-1_SAMN09288284_MAG_00000011    Gilliamella apis
ELLE19-1_SAMN09288286_MAG_00000003    Gilliamella apis
ELLE19-1_SAMN09288291_MAG_00000012    Gilliamella apis
ELLE19-1_SAMN09288293_MAG_00000013    Gilliamella apis
ELLE19-1_SAMN09288295_MAG_00000020    Gilliamella apis
ELLE19-1_SAMN09288297_MAG_00000010    Gilliamella apis
ELLE19-1_SAMN09288314_MAG_00000008    Gilliamella apis

In contrast to the search performed with mOTUv4.0_001734, this output file should contain 45 genomes instead of 43. The two additional genomes did not have enough marker genes detected to be clustered into a mOTU. Functional annotations can be added using the -d option:

motus genomes -i "Gilliamella apis" -o genomes_gilliamella_apis.tsv -d KEGG PFAM
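To see which genomes account for the difference between the mOTU-based and the taxonomy-based search, the GENOME columns of the two result tables can be compared. The sketch below assumes the two searches were written to separate files (the tutorial commands reuse one file name) and that both tables are tab-separated with a header line. The helper names and the demo files are hypothetical.

```shell
#!/bin/sh
# Print the sorted, unique genome identifiers of a motus genomes table.
genome_ids() {
    awk -F '\t' 'NR > 1 && NF { print $1 }' "$1" | sort -u
}

# Print genomes found in the second table but not in the first.
extra_genomes() {
    ids_a=$(mktemp); ids_b=$(mktemp)
    genome_ids "$1" > "$ids_a"
    genome_ids "$2" > "$ids_b"
    comm -13 "$ids_a" "$ids_b"   # keep only lines unique to the second file
    rm -f "$ids_a" "$ids_b"
}

# Demo tables built from a few rows of the two excerpts above.
by_motu=$(mktemp); by_name=$(mktemp)
printf 'GENOME\tQUERY\n' > "$by_motu"
printf 'ELLE19-1_SAMN09288280_MAG_00000004\tmOTUv4.0_001734\n' >> "$by_motu"
printf 'ELLE19-1_SAMN09288282_MAG_00000006\tmOTUv4.0_001734\n' >> "$by_motu"
printf 'GENOME\tQUERY\n' > "$by_name"
printf 'ELLE19-1_SAMN09288280_MAG_00000004\tGilliamella apis\n' >> "$by_name"
printf 'ELLE19-1_SAMN09288282_MAG_00000006\tGilliamella apis\n' >> "$by_name"
printf 'ELLE19-1_SAMN09288283_MAG_00000001\tGilliamella apis\n' >> "$by_name"

extra_genomes "$by_motu" "$by_name"
```

In the excerpts shown in this tutorial, ELLE19-1_SAMN09288283_MAG_00000001 is listed only in the taxonomic search, so it is the identifier the demo prints.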

Taxonomic searches are not restricted to species level and can be performed for any clade defined in the Genome Taxonomy Database (GTDB). Genomes were annotated using the R220 version of the database, so it may be necessary to consult the Taxon history page on the GTDB website when mapping names to clades.

To obtain an overview of all genomes from the Gilliamella genus, run:

motus genomes -i Gilliamella -o genomes_gilliamella.tsv

Querying by functional annotations

All genomes within mOTUs-db have been annotated using functional annotation terms from eggNOG (v5), KEGG (v2022), and Pfam (v37.1). You can use the corresponding database websites to retrieve identifiers for your functional group of interest.

For this example, let’s consider the caffeine degradation pathway in KEGG (see M00915).

There are three enzymes within the pathway: K21722, K21723, and K21724. You can retrieve a list of genomes containing any one of these enzymes using the following command:

motus genomes -i K21722 K21723 K21724 -o genomes_caffeine_degradation.tsv

Alternatively, to query the database for a list of terms or identifiers, you can create a file containing one query per row. In this example, the file caffeine_enzymes.tsv contains the following lines:

K21722
K21723
K21724

The command to retrieve a list of genomes using caffeine_enzymes.tsv as an input file is:

motus genomes -i caffeine_enzymes.tsv -o genomes_caffeine_degradation.tsv

The resulting genomes_caffeine_degradation.tsv file should contain a list of 947 genomes. The beginning of the file should look like this:

GENOME                                QUERY
BUSI22-1_SAMN19433686_MAG_00000242    K21722
BUSI22-1_SAMN19433687_MAG_00000201    K21722
BUSI22-1_SAMN19433687_MAG_00000587    K21722
BUSI22-1_SAMN19433691_MAG_00000008    K21722
BUSI22-1_SAMN19433691_MAG_00000409    K21722
BUSI22-1_SAMN19433691_MAG_00000432    K21722
BUSI22-1_SAMN19433692_MAG_00000096    K21722
BUSI22-1_SAMN19433693_MAG_00000324    K21722
BUSI22-1_SAMN19433693_MAG_00000475    K21722
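When several identifiers are queried at once, the QUERY column makes it possible to tally how many rows each identifier returned. The sketch below assumes a tab-separated table with a header line. The tutorial does not say whether a genome carrying several of the enzymes appears once per matching query, so the per-query counts may not add up to the number of distinct genomes. The function name count_per_query and the demo rows are illustrative.

```shell
#!/bin/sh
# Tally the rows of a motus genomes table by the value of the QUERY column.
# Assumption: QUERY is the second tab-separated column, as in the excerpt.
count_per_query() {
    awk -F '\t' '
        NR > 1 && NF { n[$2]++ }
        END { for (q in n) print q "\t" n[q] }
    ' "$1" | sort
}

# Made-up demo table with placeholder genome names.
query_demo=$(mktemp)
printf 'GENOME\tQUERY\n' > "$query_demo"
printf 'genome_A\tK21722\n' >> "$query_demo"
printf 'genome_B\tK21722\n' >> "$query_demo"
printf 'genome_C\tK21723\n' >> "$query_demo"

count_per_query "$query_demo"
```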

To retrieve the taxonomic annotations of these genomes, pass TAXONOMY to the -d option:

motus genomes -i caffeine_enzymes.tsv -o genomes_caffeine_degradation.tsv -d TAXONOMY

The new genomes_caffeine_degradation.tsv should contain two additional columns: TAXONOMY, which contains the taxonomic assignment of the genome according to GTDB R220, and mOTU, which contains the identifier of the species-level cluster that the genome belongs to.
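The mOTU column can be used to summarize which species-level clusters carry the queried enzymes. The sketch below looks the column up by its header name, so it does not depend on column order, and assumes a tab-separated table. How genomes without a mOTU assignment are represented is not described here, so empty values are simply skipped. The function name list_motus, the demo rows, and the identifier mOTUv4.0_000002 are made up for illustration.

```shell
#!/bin/sh
# List the distinct values of the mOTU column in a motus genomes table.
# Assumption: tab-separated table whose header contains a column named "mOTU".
list_motus() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "mOTU") col = i; next }
        col && $col != "" { print $col }
    ' "$1" | sort -u
}

# Made-up demo table; taxonomy strings are placeholders.
motu_demo=$(mktemp)
printf 'GENOME\tQUERY\tTAXONOMY\tmOTU\n' > "$motu_demo"
printf 'genome_A\tK21722\ttaxonomy_placeholder\tmOTUv4.0_001734\n' >> "$motu_demo"
printf 'genome_B\tK21723\ttaxonomy_placeholder\tmOTUv4.0_001734\n' >> "$motu_demo"
printf 'genome_C\tK21722\ttaxonomy_placeholder\tmOTUv4.0_000002\n' >> "$motu_demo"

list_motus "$motu_demo"
```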

Querying by selected studies or samples

At the moment, motus genomes does not support listing genomes by study or sample identifier. However, a list of genome identifiers can be retrieved for download through the mOTUs-db website interface.

For this example, let’s consider genomes generated from the study on microbial populations inhabiting hydrothermal vents. The study identifier is ANDE17-1. Navigate to the genome table on mOTUs-db and type ANDE17-1 into the Study column. After the table is subset to 278 genomes, click Select All then Export Metadata Selection CSV.

You can then extract the genome column from the downloaded file and save it as a text file with one identifier per line, here named exported_genomes.tsv. The beginning of the resulting file should look like this:

ANDE17-1_SAMEA4470835_MAG_00000001
ANDE17-1_SAMEA4470835_MAG_00000004
ANDE17-1_SAMEA4470835_MAG_00000005
ANDE17-1_SAMEA4470835_MAG_00000006
ANDE17-1_SAMEA4470835_MAG_00000007
ANDE17-1_SAMEA4470835_MAG_00000009
ANDE17-1_SAMEA4470835_MAG_00000011
ANDE17-1_SAMEA4470835_MAG_00000012
ANDE17-1_SAMEA4470835_MAG_00000013
ANDE17-1_SAMEA4470835_MAG_00000014
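The extraction step can be sketched with awk. The layout of the exported CSV is not documented in this tutorial, so the column header Genome, the second column Study, and the demo content are assumptions to be checked against the actual export. The sketch also assumes that no field contains a quoted comma. The function name extract_column is hypothetical.

```shell
#!/bin/sh
# Print one column of a simple CSV file, selected by its header name.
# usage: extract_column FILE.csv COLUMN_NAME
# Assumptions: comma-separated, no quoted fields containing commas.
extract_column() {
    awk -F ',' -v name="$2" '
        { sub(/\r$/, "") }   # tolerate Windows line endings
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col && NF { print $col }
    ' "$1"
}

# Made-up stand-in for the exported metadata file.
csv_demo=$(mktemp)
printf 'Genome,Study\n' > "$csv_demo"
printf 'ANDE17-1_SAMEA4470835_MAG_00000001,ANDE17-1\n' >> "$csv_demo"
printf 'ANDE17-1_SAMEA4470835_MAG_00000004,ANDE17-1\n' >> "$csv_demo"

extract_column "$csv_demo" Genome
```

Redirecting the output of extract_column to exported_genomes.tsv would produce the one-identifier-per-line file used in the following commands.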

You can now retrieve the annotations for these genomes by providing this file as input:

motus genomes -i exported_genomes.tsv -o genomes_ANDE17-1.tsv -d TAXONOMY KEGG

Downloading genome sequences

The output of motus genomes is directly compatible with the motus download function, which requires two parameters (see option manual):

  • -i | --input-genomes: a text file listing genome identifiers, with one identifier per line, or a list of genome identifiers separated by spaces.

  • -o | --output-folder: a path to the output folder in which all downloaded genome sequences will be stored as FASTA files.

For the first example, let’s use the list of genomes from ANDE17-1 that we generated in Querying by selected studies or samples.

motus download -i exported_genomes.tsv -o genome_sequences/

It is also possible to download representative genomes only, i.e., the highest-quality genome per mOTU, by adding the -r flag. To download one representative genome for each species within the Gilliamella genus, run:

motus download -i genomes_gilliamella.tsv -o genome_sequences/ -r


mOTUs is part of SIB's portfolio of open tools and databases.

mOTUs is part of the ELIXIR-CH Service Delivery Plan.