motus genomes + motus download#
Note: This tutorial has been designed to be run on Unix-based systems (macOS or Linux) and requires the mOTUs profiler to be correctly installed as described on the quickstart page.
Note: First-time execution of motus genomes downloads the mOTUs annotation database, which
requires 17.7G of storage.
The motus genomes command queries the mOTUs genome database to find genomes matching indicated mOTUs identifiers,
genome identifiers, taxonomic clades, or functional annotation groups.
By default, motus genomes requires two parameters (see option manual):
-i|--input-queries: List of search queries separated by a space. Alternatively, a text file listing search queries, with one query per line.
-o|--output-file: Path to the output table containing an overview of genomes, together with annotations requested by the user.
The genome database is available for interactive browsing of 3.7M genomes on mOTUs-db, which can
also be used to retrieve a list of genome identifiers that is compatible with both motus genomes and motus download.
Querying by mOTUs identifier#
Suppose we want to study all genomes belonging to Gilliamella apis, which corresponds to mOTUv4.0_001734.
To do this, run the following command:
motus genomes -i mOTUv4.0_001734 -o genomes_gilliamella_apis.tsv
The genomes_gilliamella_apis.tsv file should contain a list of 43 genomes. The beginning of the file should look like this:
GENOME QUERY
ELLE19-1_SAMN09288280_MAG_00000004 mOTUv4.0_001734
ELLE19-1_SAMN09288282_MAG_00000006 mOTUv4.0_001734
ELLE19-1_SAMN09288284_MAG_00000011 mOTUv4.0_001734
ELLE19-1_SAMN09288286_MAG_00000003 mOTUv4.0_001734
ELLE19-1_SAMN09288291_MAG_00000012 mOTUv4.0_001734
ELLE19-1_SAMN09288293_MAG_00000013 mOTUv4.0_001734
ELLE19-1_SAMN09288295_MAG_00000020 mOTUv4.0_001734
ELLE19-1_SAMN09288297_MAG_00000010 mOTUv4.0_001734
ELLE19-1_SAMN09288314_MAG_00000008 mOTUv4.0_001734
By default, the output file (genomes_gilliamella_apis.tsv) contains two columns: GENOME, listing mOTUs-db genome identifiers,
and QUERY, indicating the database query supplied by the user. To retrieve taxonomic or functional annotations for the genomes,
pass TAXONOMY, KEGG, EGGNOG, or PFAM to the -d option when running motus genomes:
motus genomes -i mOTUv4.0_001734 -o genomes_gilliamella_apis.tsv -d KEGG
Here, KEGG adds a column to the output file containing a semicolon-separated list of all KEGG orthologous groups
detected within the corresponding genome.
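Because the KEGG annotations arrive as one semicolon-separated string per genome, a small helper can make the table easier to scan. The following is a minimal sketch, not part of mOTUs: the function name count_kos is hypothetical, and it assumes that the output is tab-separated, that the first column is GENOME, and that the annotation column header is literally KEGG.

```shell
# Print each genome identifier with the number of KEGG orthologous groups listed for it.
# Assumptions (not confirmed by the mOTUs documentation): tab-separated table,
# one header row, GENOME in the first column, annotation column named "KEGG".
count_kos() {
    awk -F'\t' '
        NR == 1 {
            # Locate the KEGG column by its header name.
            for (i = 1; i <= NF; i++) if ($i == "KEGG") col = i
            if (!col) { print "no KEGG column found" > "/dev/stderr"; exit 1 }
            next
        }
        {
            # split() returns the number of semicolon-separated entries.
            n = ($col == "") ? 0 : split($col, kos, ";")
            printf "%s\t%d\n", $1, n
        }
    ' "$1"
}

# Usage: count_kos genomes_gilliamella_apis.tsv
```

The column is looked up by name rather than by position, so the helper keeps working if several annotation types are requested at once.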
Querying by taxonomy#
We can also search for genomes whose taxonomic annotation includes the scientific name Gilliamella apis.
Note that “Gilliamella apis” has to be written in quotes for an exact search; otherwise, Gilliamella and
apis will be interpreted as two separate search terms.
motus genomes -i "Gilliamella apis" -o genomes_gilliamella_apis.tsv
The beginning of the genomes_gilliamella_apis.tsv file should look like this:
GENOME QUERY
ELLE19-1_SAMN09288280_MAG_00000004 Gilliamella apis
ELLE19-1_SAMN09288282_MAG_00000006 Gilliamella apis
ELLE19-1_SAMN09288283_MAG_00000001 Gilliamella apis
ELLE19-1_SAMN09288284_MAG_00000011 Gilliamella apis
ELLE19-1_SAMN09288286_MAG_00000003 Gilliamella apis
ELLE19-1_SAMN09288291_MAG_00000012 Gilliamella apis
ELLE19-1_SAMN09288293_MAG_00000013 Gilliamella apis
ELLE19-1_SAMN09288295_MAG_00000020 Gilliamella apis
ELLE19-1_SAMN09288297_MAG_00000010 Gilliamella apis
ELLE19-1_SAMN09288314_MAG_00000008 Gilliamella apis
In contrast to the search performed with mOTUv4.0_001734, this output file should contain 45 genomes instead of 43.
The two additional genomes did not have enough marker genes detected to be clustered into a mOTU. Functional annotations can
be added using the -d option:
motus genomes -i "Gilliamella apis" -o genomes_gilliamella_apis.tsv -d KEGG PFAM
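To see which genomes are returned by the taxonomic query but not by the mOTUs identifier query, the two result tables can be compared. The sketch below assumes the two searches were saved under different file names (by_taxonomy.tsv and by_motu.tsv are hypothetical; the commands above both write to genomes_gilliamella_apis.tsv), and that both tables are tab-separated with a header row and GENOME as the first column. The function name is also hypothetical.

```shell
# List genome identifiers present in the first table but absent from the second.
# Assumptions: tab-separated files, one header row each, GENOME in column 1.
genomes_only_in_first() {
    awk -F'\t' '
        FNR == 1 { next }                   # skip the header row of each file
        NR == FNR { seen[$1] = 1; next }    # first file read: remember its genomes
        !($1 in seen) { print $1 }          # second file read: report unseen genomes
    ' "$2" "$1"
}

# Usage: genomes_only_in_first by_taxonomy.tsv by_motu.tsv
```

In the excerpts shown above, ELLE19-1_SAMN09288283_MAG_00000001 appears in the taxonomy listing but not in the mOTUv4.0_001734 listing, so it would be one of the identifiers reported.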
Taxonomic searches are not restricted to species level and can be performed for any clade defined in the
Genome Taxonomy Database (GTDB). Genomes were annotated using the R220 version of the database, so it may be necessary to consult
the Taxon history page on the GTDB website when mapping names to clades.
To obtain an overview of all genomes from the Gilliamella genus, run:
motus genomes -i Gilliamella -o genomes_gilliamella.tsv
Querying by functional annotations#
All genomes within mOTUs-db have been annotated using functional annotation terms from eggNOG (v5), KEGG (v2022), and Pfam (v37.1). You can use the corresponding database websites to retrieve identifiers for your functional group of interest.
For this example, let’s consider the caffeine degradation pathway in KEGG (see M00915).
There are three enzymes within the pathway: K21722, K21723, and K21724. You can retrieve a list of genomes
containing any one of these enzymes using the following command:
motus genomes -i K21722 K21723 K21724 -o genomes_caffeine_degradation.tsv
Alternatively, to query the database for a list of terms or identifiers, you can create a file containing one query per row.
In this example, the file caffeine_enzymes.tsv contains the following lines:
K21722
K21723
K21724
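One way to create such a file from the shell is sketched below; any text editor works equally well. The only requirement stated in this tutorial is one query per line.

```shell
# Write one KEGG identifier per line into the query file used in this example.
printf '%s\n' K21722 K21723 K21724 > caffeine_enzymes.tsv
```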
The command to retrieve a list of genomes using caffeine_enzymes.tsv as an input file is:
motus genomes -i caffeine_enzymes.tsv -o genomes_caffeine_degradation.tsv
The resulting genomes_caffeine_degradation.tsv file should contain a list of 947 genomes.
The beginning of the file should look like this:
GENOME QUERY
BUSI22-1_SAMN19433686_MAG_00000242 K21722
BUSI22-1_SAMN19433687_MAG_00000201 K21722
BUSI22-1_SAMN19433687_MAG_00000587 K21722
BUSI22-1_SAMN19433691_MAG_00000008 K21722
BUSI22-1_SAMN19433691_MAG_00000409 K21722
BUSI22-1_SAMN19433691_MAG_00000432 K21722
BUSI22-1_SAMN19433692_MAG_00000096 K21722
BUSI22-1_SAMN19433693_MAG_00000324 K21722
BUSI22-1_SAMN19433693_MAG_00000475 K21722
To retrieve the taxonomic annotations of these genomes, pass TAXONOMY to the -d option:
motus genomes -i caffeine_enzymes.tsv -o genomes_caffeine_degradation.tsv -d TAXONOMY
The new genomes_caffeine_degradation.tsv should contain two additional columns: TAXONOMY, which contains the taxonomic assignment
of the genome according to GTDB R220, and mOTU, containing the identifier of the species-level cluster that this genome belongs to.
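With the mOTU column available, the hits can be summarised per species-level cluster. The following is a minimal sketch under stated assumptions: the helper name count_by_column is hypothetical, the table is assumed to be tab-separated with GENOME as the first column, and the column is located by the header name given on the command line. Since this tutorial does not say whether a genome matching several queries is listed once or several times, each genome is counted at most once per value.

```shell
# Count distinct genomes per value of a named column (for example mOTU).
# Assumptions: tab-separated table, one header row, GENOME in column 1.
count_by_column() {
    # $1 = table, $2 = header name of the column to tally
    awk -F'\t' -v name="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next
        }
        # Count a genome only the first time it is seen with a given value.
        !seen[$1 SUBSEP $col]++ { count[$col]++ }
        END { for (k in count) printf "%d\t%s\n", count[k], k }
    ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage: count_by_column genomes_caffeine_degradation.tsv mOTU
```

The output lists the largest groups first, with the count in the first column and the column value in the second.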
Querying by selected studies or samples#
At the moment, motus genomes does not support listing genomes by study or sample identifier. However, the mOTUs-db
website interface can be used to retrieve a list of genome identifiers for download.
For this example, let’s consider genomes from a study of microbial populations inhabiting hydrothermal vents, whose study identifier
is ANDE17-1. Navigate to the genome table on mOTUs-db and type ANDE17-1 into the Study column. Once the table has been filtered down to 278
genomes, click Select All, then Export Metadata → Selection → CSV.
You can then extract the genome column from the downloaded file. The beginning of the resulting file should look like this:
ANDE17-1_SAMEA4470835_MAG_00000001
ANDE17-1_SAMEA4470835_MAG_00000004
ANDE17-1_SAMEA4470835_MAG_00000005
ANDE17-1_SAMEA4470835_MAG_00000006
ANDE17-1_SAMEA4470835_MAG_00000007
ANDE17-1_SAMEA4470835_MAG_00000009
ANDE17-1_SAMEA4470835_MAG_00000011
ANDE17-1_SAMEA4470835_MAG_00000012
ANDE17-1_SAMEA4470835_MAG_00000013
ANDE17-1_SAMEA4470835_MAG_00000014
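The extraction step can be done with standard command-line tools. The sketch below is an illustration only: the layout of the exported CSV is not documented in this tutorial, so the header name of the genome column (Genome in the usage line) and the input file name export.csv are assumptions that should be checked against the actual download. It also assumes that no quoted field containing a comma precedes the genome column.

```shell
# Print the values of one named column from a simple CSV file, without the header.
# Assumptions: comma-separated, one header row, no embedded commas before the
# requested column; surrounding double quotes and Windows line endings are removed.
extract_csv_column() {
    # $1 = exported CSV, $2 = header name of the genome column
    awk -F',' -v name="$2" '
        { sub(/\r$/, "") }                  # tolerate Windows line endings
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                h = $i; gsub(/"/, "", h)
                if (h == name) col = i
            }
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next
        }
        { v = $col; gsub(/"/, "", v); if (v != "") print v }
    ' "$1"
}

# Usage: extract_csv_column export.csv Genome > exported_genomes.tsv
```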
You can now retrieve the annotations for these genomes by providing this file (saved here as exported_genomes.tsv) as input:
motus genomes -i exported_genomes.tsv -o genomes_ANDE17-1.tsv -d TAXONOMY KEGG
Downloading genome sequences#
The output of motus genomes is directly compatible with the motus download function, which requires two parameters (see option manual):
-i|--input-genomes: a text file listing genome identifiers, with one identifier per line, OR a list of genome identifiers separated by a space.
-o|--output-folder: a path to the output folder in which all downloaded genome sequences will be stored as FASTA files.
For the first example, let’s use the list of genomes from ANDE17-1 that we generated in Querying by selected studies or samples.
motus download -i exported_genomes.tsv -o genome_sequences/
It is also possible to download representative genomes only, i.e. the highest quality genome per mOTU. To download only one representative genome for each species within the Gilliamella genus, run:
motus download -i genomes_gilliamella.tsv -o genome_sequences/ -r