`motus genomes` + `motus download`#

Note: This tutorial has been designed to be run on Unix-based systems (macOS or Linux) and requires the mOTUs profiler to be correctly installed as described on the quickstart page.

Note: The first time motus genomes is run, it downloads the mOTUs annotation database, which requires 18 GB of storage.

The motus genomes command queries the mOTUs genome database to find genomes matching indicated mOTUs identifiers, genome identifiers, taxonomic clades, or functional annotation groups.

By default, motus genomes requires two parameters (see option manual):

-i | --input-queries: List of search queries separated by a space. Alternatively, a text file listing search queries, with one query per line. Required unless -l is used.
-o | --output-file: Path to the output table containing an overview of genomes, together with annotations requested by the user.

Since v4.1, -i can be replaced by -l to list all searchable entries of a category instead of searching for specific queries:

-l | --list: List all searchable entries of a category and write them to -o. Choose from GENOME, TAXONOMY, PFAM, KEGG, or EGGNOG. When used, -i is not required (see List entries).

The genome database is available for interactive browsing of 3.9M genomes on mOTUs-db, which can also be used to retrieve a list of genome identifiers that is compatible with both motus genomes and motus download.

Querying by mOTUs identifier#

Suppose we would like to study all genomes belonging to Gilliamella apis, which corresponds to mOTUv4.0_001734. To do this, run the following command:

motus genomes -i mOTUv4.0_001734 -o genomes_gilliamella_apis.tsv

The genomes_gilliamella_apis.tsv file should contain a list of 43 genomes. The beginning of the file should look like this:

GENOME                                QUERY
ELLE19-1_SAMN09288280_MAG_00000004    mOTUv4.0_001734
ELLE19-1_SAMN09288282_MAG_00000006    mOTUv4.0_001734
ELLE19-1_SAMN09288284_MAG_00000011    mOTUv4.0_001734
ELLE19-1_SAMN09288286_MAG_00000003    mOTUv4.0_001734
ELLE19-1_SAMN09288291_MAG_00000012    mOTUv4.0_001734
ELLE19-1_SAMN09288293_MAG_00000013    mOTUv4.0_001734
ELLE19-1_SAMN09288295_MAG_00000020    mOTUv4.0_001734
ELLE19-1_SAMN09288297_MAG_00000010    mOTUv4.0_001734
ELLE19-1_SAMN09288314_MAG_00000008    mOTUv4.0_001734
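To check the number of genomes in the output, you can count the data rows of the table. The following is a minimal sketch, assuming the output is tab-separated with a single header row, as the listing above suggests. The toy file only holds the first three rows of that listing so that the snippet runs without a mOTUs installation; on the real file the count should be 43.

```shell
# Toy stand-in for the real output of motus genomes (first three rows of the
# listing above). Assumption: tab-separated table with one header line.
workdir=$(mktemp -d)
table="$workdir/genomes_gilliamella_apis.tsv"
printf 'GENOME\tQUERY\n' > "$table"
printf '%s\tmOTUv4.0_001734\n' \
    ELLE19-1_SAMN09288280_MAG_00000004 \
    ELLE19-1_SAMN09288282_MAG_00000006 \
    ELLE19-1_SAMN09288284_MAG_00000011 >> "$table"

# Skip the header, keep the GENOME column, and count unique identifiers.
n_genomes=$(tail -n +2 "$table" | cut -f1 | sort -u | wc -l | tr -d ' ')
echo "genomes: $n_genomes"
```

To use this on your own results, point table at your output file and drop the lines that create the toy data.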

By default, the output file (genomes_gilliamella_apis.tsv) contains two columns: GENOME, listing mOTUs-db genome identifiers, and QUERY, indicating the database query supplied by the user. To retrieve taxonomic or functional annotations for the genomes, pass TAXONOMY, KEGG, EGGNOG, or PFAM to the -d option when running motus genomes:

motus genomes -i mOTUv4.0_001734 -o genomes_gilliamella_apis.tsv -d KEGG

Here, KEGG adds a column to the output file containing a semicolon-separated list of all KEGG orthologous groups detected within the corresponding genome.
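A semicolon-separated column is easier to filter or join once it has been unpacked into one genome/KO pair per line. The sketch below shows one way to do this with awk. It assumes the annotation column is headed KEGG and that the table is tab-separated; the genome names and KO lists in the toy file are placeholders, not real annotations.

```shell
# Toy table with placeholder values. Assumptions: tab-separated output and a
# column header named KEGG.
workdir=$(mktemp -d)
table="$workdir/genomes_kegg_toy.tsv"
printf 'GENOME\tQUERY\tKEGG\n' > "$table"
printf 'toy_genome_A\tmOTUv4.0_001734\tK00001;K00002;K00003\n' >> "$table"
printf 'toy_genome_B\tmOTUv4.0_001734\tK00002\n' >> "$table"

# Find the KEGG column from the header row, then print one pair per line.
long_format=$(awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "KEGG") col = i; next }
    col     { n = split($col, kos, ";")
              for (j = 1; j <= n; j++) if (kos[j] != "") print $1 "\t" kos[j] }
' "$table")
printf '%s\n' "$long_format"
n_pairs=$(printf '%s\n' "$long_format" | wc -l | tr -d ' ')
```

Looking the column up by its header rather than by position keeps the snippet working if further annotation columns are requested with -d.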

Querying by taxonomy#

We can also search for genomes whose taxonomic annotation includes the scientific name Gilliamella apis. Note that "Gilliamella apis" must be enclosed in quotes for an exact search; otherwise, Gilliamella and apis are interpreted as two separate search terms.

motus genomes -i "Gilliamella apis" -o genomes_gilliamella_apis.tsv

The beginning of the genomes_gilliamella_apis.tsv file should look like this:

GENOME                                QUERY
ELLE19-1_SAMN09288280_MAG_00000004    Gilliamella apis
ELLE19-1_SAMN09288282_MAG_00000006    Gilliamella apis
ELLE19-1_SAMN09288283_MAG_00000001    Gilliamella apis
ELLE19-1_SAMN09288284_MAG_00000011    Gilliamella apis
ELLE19-1_SAMN09288286_MAG_00000003    Gilliamella apis
ELLE19-1_SAMN09288291_MAG_00000012    Gilliamella apis
ELLE19-1_SAMN09288293_MAG_00000013    Gilliamella apis
ELLE19-1_SAMN09288295_MAG_00000020    Gilliamella apis
ELLE19-1_SAMN09288297_MAG_00000010    Gilliamella apis
ELLE19-1_SAMN09288314_MAG_00000008    Gilliamella apis

In contrast to the search performed with mOTUv4.0_001734, this output file should contain 45 genomes instead of 43. Two of the genomes did not have enough marker genes detected to be clustered into a mOTU. Functional annotations can be added using the -d option:

motus genomes -i "Gilliamella apis" -o genomes_gilliamella_apis.tsv -d KEGG PFAM
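To find out which genomes appear in the taxonomy search but not in the mOTU search, the two identifier lists can be compared with comm. This is a sketch only: both commands in this tutorial write to the same output file, so it assumes you saved the two results under different names (by_motu.tsv and by_taxonomy.tsv are hypothetical). The toy files contain a few rows taken from the two listings above.

```shell
# Toy stand-ins built from rows of the two listings above.
# by_motu.tsv and by_taxonomy.tsv are hypothetical file names.
workdir=$(mktemp -d)
printf 'GENOME\tQUERY\n' > "$workdir/by_motu.tsv"
printf '%s\tmOTUv4.0_001734\n' \
    ELLE19-1_SAMN09288280_MAG_00000004 \
    ELLE19-1_SAMN09288282_MAG_00000006 \
    ELLE19-1_SAMN09288284_MAG_00000011 >> "$workdir/by_motu.tsv"
printf 'GENOME\tQUERY\n' > "$workdir/by_taxonomy.tsv"
printf '%s\tGilliamella apis\n' \
    ELLE19-1_SAMN09288280_MAG_00000004 \
    ELLE19-1_SAMN09288282_MAG_00000006 \
    ELLE19-1_SAMN09288283_MAG_00000001 \
    ELLE19-1_SAMN09288284_MAG_00000011 >> "$workdir/by_taxonomy.tsv"

# comm expects sorted input, so extract and sort the GENOME column first.
tail -n +2 "$workdir/by_motu.tsv" | cut -f1 | sort > "$workdir/motu_ids.txt"
tail -n +2 "$workdir/by_taxonomy.tsv" | cut -f1 | sort > "$workdir/taxonomy_ids.txt"

# -13 hides lines unique to the first file and lines common to both,
# leaving only identifiers found by the taxonomy search alone.
only_taxonomy=$(comm -13 "$workdir/motu_ids.txt" "$workdir/taxonomy_ids.txt")
echo "$only_taxonomy"
```

On the full results, this should list the two genomes that were not clustered into a mOTU.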

Taxonomic searches are not restricted to species level and can be performed for any clade defined in the Genome Taxonomy Database (GTDB). Genomes were annotated using the R226 version of the database, so it may be necessary to consult the Taxon history page on the GTDB website when mapping names to clades.

To obtain an overview of all genomes from the Gilliamella genus, run:

motus genomes -i Gilliamella -o genomes_gilliamella.tsv

Querying by functional annotations#

All genomes within mOTUs-db have been annotated using functional annotation terms from eggNOG (v5), KEGG (v2022), and Pfam (v37.1). You can use the corresponding database websites to retrieve identifiers for your functional group of interest.

For this example, let’s consider the caffeine degradation pathway in KEGG (see M00915).

There are three enzymes within the pathway: K21722, K21723, and K21724. You can retrieve a list of genomes containing any one of these enzymes using the following command:

motus genomes -i K21722 K21723 K21724 -o genomes_caffeine_degradation.tsv

Alternatively, to query the database for a list of terms or identifiers, you can create a file containing one query per row. In this example, the file caffeine_enzymes.tsv contains the following lines:

K21722
K21723
K21724
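If you prefer not to open an editor, a query file like this can be written from the command line. A minimal sketch:

```shell
# Write one query per line to the input file used in this example.
printf '%s\n' K21722 K21723 K21724 > caffeine_enzymes.tsv

# Show the result to confirm there is exactly one identifier per line.
cat caffeine_enzymes.tsv
```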

The command to retrieve a list of genomes using caffeine_enzymes.tsv as an input file is:

motus genomes -i caffeine_enzymes.tsv -o genomes_caffeine_degradation.tsv

The resulting genomes_caffeine_degradation.tsv file should contain a list of 962 genomes. The beginning of the file should look like this:

GENOME                                QUERY
BUSI22-1_SAMN19433686_MAG_00000242    K21722
BUSI22-1_SAMN19433687_MAG_00000201    K21722
BUSI22-1_SAMN19433687_MAG_00000587    K21722
BUSI22-1_SAMN19433691_MAG_00000008    K21722
BUSI22-1_SAMN19433691_MAG_00000409    K21722
BUSI22-1_SAMN19433691_MAG_00000432    K21722
BUSI22-1_SAMN19433692_MAG_00000096    K21722
BUSI22-1_SAMN19433693_MAG_00000324    K21722
BUSI22-1_SAMN19433693_MAG_00000475    K21722
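When several queries are combined, it can be useful to see how many genomes each query contributed. The sketch below tallies rows per QUERY value and also counts unique genomes overall. It assumes a tab-separated table with one header row, and it allows for a genome being listed once per matching query, which is an assumption about the output rather than documented behaviour. The genome names in the toy file are placeholders.

```shell
# Toy table with placeholder genome names (not real results).
workdir=$(mktemp -d)
table="$workdir/genomes_caffeine_toy.tsv"
printf 'GENOME\tQUERY\n' > "$table"
printf 'toy_genome_A\tK21722\n' >> "$table"
printf 'toy_genome_B\tK21722\n' >> "$table"
printf 'toy_genome_A\tK21723\n' >> "$table"

# Rows per query term; awk's array order is undefined, so sort the result.
per_query=$(awk -F'\t' 'NR > 1 { count[$2]++ }
    END { for (q in count) print q "\t" count[q] }' "$table" | sort)
printf '%s\n' "$per_query"

# Unique genomes across all queries.
n_unique=$(tail -n +2 "$table" | cut -f1 | sort -u | wc -l | tr -d ' ')
echo "unique genomes: $n_unique"
```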

To retrieve the taxonomic annotations of these genomes, pass TAXONOMY to the -d option:

motus genomes -i caffeine_enzymes.tsv -o genomes_caffeine_degradation.tsv -d TAXONOMY

The new genomes_caffeine_degradation.tsv should contain two additional columns: TAXONOMY, which contains the taxonomic assignment of the genome according to GTDB R226, and mOTU, containing the identifier of the species-level cluster that this genome belongs to.

Listing all searchable entries#

If you are unsure which identifiers or names can be queried, motus genomes can write every searchable entry of a category to a file using the -l | --list option. When -l is used, -i is not required. Accepted categories are GENOME, TAXONOMY, PFAM, KEGG, and EGGNOG.

For example, to list all KEGG orthologous groups that can be queried in the mOTUs-db, run:

motus genomes -l KEGG -o all_kegg.txt

This writes all 11,665 KEGG entries to all_kegg.txt, one entry per line and without a header.

Any of these entries can be used directly as a query for -i, either on the command line or through an input file.
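Before running a larger search, you may want to confirm that your identifiers are present in the list. The sketch below does this with grep, assuming each line of all_kegg.txt holds a bare identifier such as K21722 (the exact entry format is not shown here, so treat this as an assumption). The toy list and the identifier K99999 are made up for illustration.

```shell
# Toy subset standing in for the real all_kegg.txt.
# Assumption: one bare identifier per line.
workdir=$(mktemp -d)
printf '%s\n' K21722 K21723 K21724 > "$workdir/all_kegg.txt"

# -F: fixed string, -x: match the whole line, -q: quiet (exit status only).
checked=$(for ko in K21722 K99999; do
    if grep -Fxq -- "$ko" "$workdir/all_kegg.txt"; then
        echo "$ko searchable"
    else
        echo "$ko not found"
    fi
done)
printf '%s\n' "$checked"
```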

Querying by selected studies or samples#

At the moment, motus genomes does not allow listing genomes based on study or sample identifiers. However, using the mOTUs-db website interface, a list of genome identifiers can be retrieved for download.

For this example, let’s consider genomes generated from the study on microbial populations inhabiting hydrothermal vents. The study identifier is ANDE17-1. Navigate to the genome table on mOTUs-db and type ANDE17-1 into the Study column. After the table is subset to 278 genomes, click Select All then Export Metadata → Selection → CSV.

You can then extract the genome column from the downloaded file and save it as a plain text file with one identifier per line (here, exported_genomes.tsv). The beginning of the resulting file should look like this:

ANDE17-1_SAMEA4470835_MAG_00000001
ANDE17-1_SAMEA4470835_MAG_00000004
ANDE17-1_SAMEA4470835_MAG_00000005
ANDE17-1_SAMEA4470835_MAG_00000006
ANDE17-1_SAMEA4470835_MAG_00000007
ANDE17-1_SAMEA4470835_MAG_00000009
ANDE17-1_SAMEA4470835_MAG_00000011
ANDE17-1_SAMEA4470835_MAG_00000012
ANDE17-1_SAMEA4470835_MAG_00000013
ANDE17-1_SAMEA4470835_MAG_00000014
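One possible way to pull the genome column out of the exported CSV is sketched below. The column header (Genome), the second column (Study), and the file name export.csv are hypothetical, so check the header of your own download and adjust the name variable. The sketch also assumes that no field contains a comma inside quotes. The result is written to exported_genomes.tsv, the file name used in the following command.

```shell
# Toy export; the header names and export.csv are hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/export.csv" <<'EOF'
Genome,Study
ANDE17-1_SAMEA4470835_MAG_00000001,ANDE17-1
ANDE17-1_SAMEA4470835_MAG_00000004,ANDE17-1
EOF

# Locate the column by header name, then print it without the header.
# Quotes around fields and Windows line endings are stripped if present.
awk -F',' -v name="Genome" '
    { sub(/\r$/, "") }
    NR == 1 { for (i = 1; i <= NF; i++) { h = $i; gsub(/"/, "", h); if (h == name) col = i }
              next }
    col     { id = $col; gsub(/"/, "", id); print id }
' "$workdir/export.csv" > "$workdir/exported_genomes.tsv"

cat "$workdir/exported_genomes.tsv"
```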

You can now retrieve the annotations for these genomes by providing this file as input:

motus genomes -i exported_genomes.tsv -o genomes_ANDE17-1.tsv -d TAXONOMY KEGG

Downloading genomes and annotations#

The output of motus genomes is directly compatible with the motus download function, which requires two parameters (see option manual):

-i | --input-genomes: a text file listing genome identifiers, with one identifier per line OR a list of genome identifiers separated by a space.
-o | --output-folder: a path to the output folder in which the downloaded files will be stored, one file per genome.

For the first example, let’s use the list of genomes from ANDE17-1 that we generated in Querying by selected studies or samples.

motus download -i exported_genomes.tsv -o genome_sequences/
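After a large download, it may be worth checking that every requested genome has a file in the output folder. The sketch below assumes that each downloaded file name starts with the genome identifier; the naming scheme is not described here, so the .fa names in the toy folder are hypothetical.

```shell
# Toy folder and identifier list; the .fa file names are hypothetical.
workdir=$(mktemp -d)
mkdir "$workdir/genome_sequences"
printf '%s\n' \
    ANDE17-1_SAMEA4470835_MAG_00000001 \
    ANDE17-1_SAMEA4470835_MAG_00000004 \
    ANDE17-1_SAMEA4470835_MAG_00000005 > "$workdir/exported_genomes.tsv"
touch "$workdir/genome_sequences/ANDE17-1_SAMEA4470835_MAG_00000001.fa" \
      "$workdir/genome_sequences/ANDE17-1_SAMEA4470835_MAG_00000004.fa"

# For each identifier, test whether any file in the folder starts with it.
missing=""
while IFS= read -r id; do
    [ -n "$id" ] || continue
    # If the glob matches nothing, $1 stays a literal pattern and -e fails.
    set -- "$workdir/genome_sequences/$id"*
    [ -e "$1" ] || missing="${missing:+$missing }$id"
done < "$workdir/exported_genomes.tsv"
echo "missing: ${missing:-none}"
```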

It is also possible to download the sequences of representative genomes only, i.e. the highest-quality genome per mOTU, by adding the -r flag. To download one representative genome sequence for each species within the Gilliamella genus, run:

motus download -i genomes_gilliamella.tsv -o genome_sequences/ -r

Downloading other file types#

By default, motus download retrieves the genome sequence of each genome. Since v4.1, the -t | --file-type option selects which file to download instead (see File type):

-t | --file-type: one of genome (default), gene_fna, gene_faa, gene_gff, kegg, pfam, eggnog, trna, rrna, or antismash.

This makes it possible to retrieve predicted genes and proteins, gene coordinates, or functional and non-coding RNA annotations for the same set of genomes. For example, to download the KEGG annotations instead of the genome sequences, run:

motus download -i exported_genomes.tsv -o genome_annotations/ -t kegg
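To fetch several file types for the same set of genomes, the command can be wrapped in a loop. The sketch below is a dry run by default: it only prints the commands it would execute, and it calls motus only when RUN=1 is set. Writing each file type to its own folder (genome_annotations_kegg/ and so on) is a choice made for this example, not something motus download requires.

```shell
# Dry run unless RUN=1 is set; RUN and run() are helpers for this sketch.
RUN=${RUN:-0}
run() {
    if [ "$RUN" = "1" ]; then
        "$@"            # execute the command (needs mOTUs v4.1 or later)
    else
        echo "$@"       # only show what would be executed
    fi
}

# One call per file type, each writing to a separate output folder.
planned=$(for file_type in kegg pfam eggnog; do
    run motus download -i exported_genomes.tsv \
        -o "genome_annotations_${file_type}/" -t "$file_type"
done)
printf '%s\n' "$planned"
```

Run it once as is to review the commands, then again with RUN=1 to start the downloads.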

These files can also be retrieved programmatically through the mOTUs API.

