Option manual#

Here, we provide a full overview of mOTUs commands and their options, as well as recommendations on parameter selection. On the command line, you can always type motus <command> to obtain a short description of its options and usage.

To execute the tool you need to call motus <command> [options], where <command> can be:

profile, perform taxonomic profiling on a sample. The command has the following subroutines:
- map_tax, map reads to the marker gene database, output a SAM/BAM file,
- calc_mgc, aggregate reads from the same marker gene cluster (MGC) and output the MGC abundance table. It uses the SAM/BAM file produced by map_tax,
- calc_motu, from an MGC abundance table (created by calc_mgc), produce the mOTUs abundance table.
downloadMGDB, download the mOTUs marker gene database,
merge, merge multiple sample profiles into a single table,
classify, annotate for your genomes which mOTUs they belong to,
prep_long, convert long read data into an input format suitable for motus profile,
genomes, search the mOTUs genome database for genomes matching specified taxonomic and functional annotations,
download, download sequence data for any genome within the mOTUs database.

`motus profile`#

Produces a taxonomic profile from short read metagenomic sequencing data by executing motus map_tax, motus calc_mgc, and motus calc_motu in succession. Can be used to profile long read metagenomic sequencing data after running motus prep_long first.

Required arguments#

Input -f, -r, -s: one or multiple FastQ/A files, which can be gzipped. Ensure the order of input files is the same for -f and -r if using paired-end data.

Output -o: path to the output file. This also serves as a prefix for intermediate files.

Option overview#

Option	Description
`-f`, `--forward`	Input path - Paired Forward: One or more FastQ/A files (optionally gzipped) containing forward reads. The input files must have the same order for both forward and reverse.
`-r`, `--reverse`	Input path - Paired Reverse: One or more FastQ/A files (optionally gzipped) containing reverse reads. The input files must have the same order for both forward and reverse.
`-s`, `--single`	Input path - Single: One or more FastQ/A files (optionally gzipped). The order of the input files does not matter for single-end files.
`-o`, `--output-file`	Output prefix: Path to the output files. This prefix is also used for intermediate files.
`-n`, `--sample-name`	Sample name: Name of the sample. Required when merging samples. The default value is `unnamed sample`.
`-g`, `--marker-genes`	Sensitivity: The number of marker genes with abundance required to call a mOTU present. Default value is `3`, with a minimum of `1` and a maximum of `10`.
`-l`, `--alignment-length`	Length: Filter alignments if their length is below this value. Default value is `75`. Note: Choose a value less than or equal to the length of the reads.
`-t`, `--threads`	Threads: Number of threads to use for the alignment step. Default is `1`.
`-y`, `--counting-mode`	Counting method: mOTUs can count in different modes. The default mode is `INSERT_SCALED`. The other modes are `INSERT_RAW`, `INSERT_NORM`, `BASE_RAW`, and `BASE_NORM`. For more details, see algorithm.
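
As an illustration of how the options in this table fit together, the following sketch assembles an argument list for motus profile from Python. The helper name `build_profile_command` and the example output file name are assumptions made for this example; only the flags listed in the table above are used, and the range check on `-g` mirrors the documented minimum and maximum.

```python
from typing import List, Optional, Sequence


def build_profile_command(
    output: str,
    forward: Sequence[str] = (),
    reverse: Sequence[str] = (),
    single: Sequence[str] = (),
    sample_name: Optional[str] = None,
    marker_genes: int = 3,
    alignment_length: int = 75,
    threads: int = 1,
    counting_mode: str = "INSERT_SCALED",
) -> List[str]:
    """Assemble an argument list for `motus profile` (hypothetical helper)."""
    if not (forward or single):
        raise ValueError("provide paired (-f/-r) and/or single (-s) input files")
    if len(forward) != len(reverse):
        raise ValueError("-f and -r need the same number of files")
    if not 1 <= marker_genes <= 10:
        raise ValueError("-g must be between 1 and 10")

    cmd = ["motus", "profile"]
    if forward:
        # Forward and reverse files must be listed in the same order.
        cmd += ["-f", *forward, "-r", *reverse]
    if single:
        cmd += ["-s", *single]
    cmd += ["-o", output]
    if sample_name is not None:
        cmd += ["-n", sample_name]
    cmd += [
        "-g", str(marker_genes),
        "-l", str(alignment_length),
        "-t", str(threads),
        "-y", counting_mode,
    ]
    return cmd


cmd = build_profile_command(
    output="SAMN06172452.mOTUs4",  # example output name, chosen for illustration
    forward=["ERR1913344.1.fq.gz"],
    reverse=["ERR1913344.2.fq.gz"],
    sample_name="SAMN06172452",
    threads=8,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where mOTUs is installed.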

Option description#

Input files#

Option	Description
`-f`, `--forward`	Input path - Paired Forward: One or more FastQ/A files (optionally gzipped) containing forward reads. The input files must have the same order for both forward and reverse.
`-r`, `--reverse`	Input path - Paired Reverse: One or more FastQ/A files (optionally gzipped) containing reverse reads. The input files must have the same order for both forward and reverse.
`-s`, `--single`	Input path - Single: One or more FastQ/A files (optionally gzipped). The order of the input files does not matter for single-end files.

The mOTUs profile and map_tax routines accept short read sequence files in FastA or FastQ format. Gzipped input is supported as well.

It’s possible to submit single and paired-end files belonging to multiple runs from the same sample if the file order between forward and reverse reads is maintained. Files submitted together should be separated by a space.

Examples

We use the sequencing files from the biosample SAMN06172452 as input. This sample has been sequenced multiple times, including runs ERR1913344, ERR1913349, and so on, for a total of 25 runs.

Correct usage

`-f ERR1913344.1.fq.gz -r ERR1913344.2.fq.gz`: Profile one paired-end run.
`-f ERR1913344.1.fq.gz ERR1913349.1.fq.gz -r ERR1913344.2.fq.gz ERR1913349.2.fq.gz`: Profile two paired-end runs as one sample.

Incorrect usage (mOTUs will abort with an error)

`-f ERR1913349.1.fq.gz -r ERR1913349.1.fq.gz`: The same FastQ file is submitted twice. If the run was single-end, use -s instead of -f.
`-f ERR1913349.1.fq.gz -r ERR1913344.2.fq.gz ERR1913349.2.fq.gz`: Unequal number of forward and reverse FastQ input files.
`-f ERR1913349.1.fq.gz ERR1913344.1.fq.gz -r ERR1913344.2.fq.gz ERR1913349.2.fq.gz`: Runs submitted in the wrong order. mOTUs checks read names and will abort.

Incorrect usage (mOTUs will not abort with an error)

`-s ERR1913344.1.fq.gz`: mOTUs will profile a paired-end run as single-end data if only the forward file is submitted.
`-f ERR1913344.1.fq.gz ERR1913368.1.fq.gz -r ERR1913344.2.fq.gz ERR1913368.2.fq.gz`: Run ERR1913368 is from another biosample (SAMN06172417), but mOTUs will profile these two runs as one sample.
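
The file-level mistakes listed above can be caught before starting a run. The sketch below is a hypothetical pre-flight check, not part of mOTUs: it assumes the naming convention used in these examples, where `<run>.1.fq.gz` pairs with `<run>.2.fq.gz`, and compares file names only. mOTUs itself checks read names, which this sketch does not do, and it cannot tell whether two runs belong to the same biosample.

```python
import os
from typing import Sequence


def run_id(path: str) -> str:
    """Return the run accession, assuming files are named <run>.<mate>.fq.gz."""
    return os.path.basename(path).split(".")[0]


def check_paired_inputs(forward: Sequence[str], reverse: Sequence[str]) -> None:
    """Raise ValueError for unequal counts, duplicated files, or mismatched order."""
    if len(forward) != len(reverse):
        raise ValueError("unequal number of forward and reverse files")
    for fwd, rev in zip(forward, reverse):
        if fwd == rev:
            raise ValueError(f"same file given as forward and reverse: {fwd}")
        if run_id(fwd) != run_id(rev):
            raise ValueError(f"run order differs between -f and -r: {fwd} vs {rev}")


# Two paired-end runs in matching order pass silently.
check_paired_inputs(
    ["ERR1913344.1.fq.gz", "ERR1913349.1.fq.gz"],
    ["ERR1913344.2.fq.gz", "ERR1913349.2.fq.gz"],
)
```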

In general, when a sample has been sequenced multiple times, we recommend profiling all available runs together, rather than processing each run separately. The main benefits are:

- Increased sensitivity: pooling runs raises the chance of detecting enough marker genes to pass the threshold defined by -g, so taxa present at low abundance are more likely to be detected.
- More accurate handling of multi-mapper reads, as their weights are influenced by uniquely mapped reads from the same sample (see algorithm for more details).

Sample name#

Option	Description
`-n`, `--sample-name`	Sample name: Name of the sample. Required when merging samples. The default value is `unnamed sample`.

The -n option can be used within the profile and calc_motu routines to add the name of the sample to the mOTUs profile file.

This is particularly important when merging multiple mOTUs profiles using the motus merge command, which requires distinct sample names. By default, mOTUs assigns each sample the name unnamed sample.

Recommendation: Although this parameter is optional, we recommend using it and adding meaningful sample names that match the user’s metadata. For public data, it’s best to use stable identifiers. For example, when profiling all runs from the biosample SAMN06172452, use -n SAMN06172452. If profiling only one run from the same biosample (e.g., ERR1913349), use -n ERR1913349.

Number of marker genes#

Option	Description
`-g`, `--marker-genes`	Sensitivity: The number of marker genes with abundance required to call a mOTU present. Default value is `3`, with a minimum of `1` and a maximum of `10`.

The profile and calc_motu routines estimate taxonomic (per-mOTU) abundance based on the median abundance of marker genes for mOTUs in which at least [-g] marker genes have been detected. By default, at least three marker genes need to display non-zero abundance for calculating the abundance of the corresponding mOTU. Varying the -g parameter allows users to increase or decrease the required number of non-zero abundance marker genes, prioritising precision or recall respectively.

Examples

`-g 3`: default value, balanced trade-off between precision and recall.
`-g 1`: lower precision, higher recall.
`-g 6`: higher precision, lower recall.
`-g 10`: highest precision, lowest recall.

Note: The criterion for including a genome in marker-gene-based clustering is the presence of at least six marker genes (see conceptual implementation). As a result, when -g is set higher than 6, mOTUs with fewer marker genes than the chosen value can never pass the threshold and are automatically omitted from the taxonomic profile.
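
The effect of -g can be illustrated with a small sketch. It assumes that the median is taken over the marker genes with non-zero abundance, which is one reading of the description above; the exact calculation is documented on the algorithm page. The function name `motu_abundance` is made up for this example.

```python
from statistics import median
from typing import Optional, Sequence


def motu_abundance(
    mg_abundances: Sequence[float], min_marker_genes: int = 3
) -> Optional[float]:
    """Return the median abundance of detected marker genes, or None if too few."""
    # Assumption: only marker genes with non-zero abundance enter the median.
    detected = [a for a in mg_abundances if a > 0]
    if len(detected) < min_marker_genes:
        return None  # the mOTU is not reported as present
    return median(detected)


# Ten marker genes, only two of them detected.
low_coverage = [4, 0, 6, 0, 0, 0, 0, 0, 0, 0]
print(motu_abundance(low_coverage, min_marker_genes=3))  # below the threshold
print(motu_abundance(low_coverage, min_marker_genes=1))  # passes with -g 1
```

With the default of three marker genes the mOTU is not called, whereas lowering the threshold to one reports it, which is the recall gain (and the precision cost) described above.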

Alignment length#

Option	Description
`-l`, `--alignment-length`	Length: Filter alignments if their length is below this value. Default value is `75`. Note: Choose a value less than or equal to the length of the reads.

Used within the profile, map_tax, and calc_mgc routines to filter alignments from the BAM file.

mOTUs uses the bwa aligner to map reads against the mOTUs marker gene database. By default, alignments of less than 97% identity or 75 bases in length (adjusted with -l) are filtered out.

The 75-base cutoff has proven to be a robust default: it avoids the spurious alignments that become more frequent as alignment length decreases, yet it is low enough to profile sequencing runs that yield 100 bp reads.

Note

If taxonomic profiles from different samples are merged, these profiles should be generated using the same alignment length parameter.
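
The two filtering criteria can be summarised in a few lines. This is a simplified sketch on pre-computed values: the `Alignment` record and its fields are hypothetical, and how mOTUs derives alignment length and identity from the BAM file is not covered here.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    read_name: str
    length: int      # aligned length in bases
    identity: float  # fraction of matching bases, between 0.0 and 1.0


def filter_alignments(
    alignments: Iterable[Alignment],
    min_length: int = 75,        # corresponds to -l
    min_identity: float = 0.97,  # fixed identity cutoff described above
) -> List[Alignment]:
    """Keep alignments that reach both the length and the identity cutoff."""
    return [
        a for a in alignments
        if a.length >= min_length and a.identity >= min_identity
    ]


example = [
    Alignment("read1", 100, 0.99),  # kept
    Alignment("read2", 60, 1.00),   # too short
    Alignment("read3", 100, 0.95),  # identity too low
]
kept = filter_alignments(example)
print([a.read_name for a in kept])
```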

Number of threads#

Option	Description
`-t`, `--threads`	Threads: Number of threads to use for the alignment step. Default is `1`.

As part of the profile and map_tax routines, mOTUs uses the bwa aligner to map reads against its marker gene database.

By default, one thread is used, but you can increase the number of threads with -t to improve runtime. In our tests, runtime scaled almost linearly up to 16 threads (-t 16).

Counting method#

Option	Description
`-y`, `--counting-mode`	Counting method: mOTUs can count in different modes. The default mode is `INSERT_SCALED`. The other modes are `INSERT_RAW`, `INSERT_NORM`, `BASE_RAW`, and `BASE_NORM`. For more details, see algorithm.

The -y option in the profile and calc_motu routines is used to set the counting method. Five counting methods are available, each with its own use cases. By default, INSERT_SCALED is applied because it’s most suitable for standard taxonomic profiling followed by alpha- and/or beta-diversity analysis. Detailed formulas and a visual overview of all counting methods are provided on the algorithm page.

Short description with use cases:

INSERT_SCALED: Number of inserts mapped to each mOTU after normalising coverage of each MG by gene length, scaled so that abundance values are above 1 and suitable for use in methods requiring count data. This mode is the recommended default and can be used for most downstream analyses, including alpha-diversity, beta-diversity, and differential abundance analysis.

INSERT_RAW: Number of inserts mapped to each mOTU before any length normalisation. In general, this mode is not recommended for downstream analyses: the MGs vary in length and taking the median coverage across them without accounting for length variation is not meaningful for species quantification. However, the INSERT_RAW column in the MGC output file can be useful for diagnosing technical issues, such as identifying MGs with unusually high or low coverage, or comparing consistency across runs from the same biological sample.

INSERT_NORM: Number of inserts mapped to each mOTU after normalising coverage of each MG by gene length. This number represents the median coverage of universal single-copy marker genes in the sample and can therefore be used to normalise the coverage of gene catalogues. Dividing the coverage of other genes by this value provides the average gene copy number per cell, enabling cross-sample comparison of functional abundances.

BASE_RAW: Base coverage of each mOTU before any length normalisation. As with INSERT_RAW, this mode is not recommended for species quantification and downstream analyses, but can be inspected in the MGC output file to assess technical issues.

BASE_NORM: Base coverage of each mOTU after normalising coverage of each MG by gene length. This mode can be used to normalise gene catalogues, similar to INSERT_NORM. The difference between these two modes depends on how the gene catalogue itself was quantified: in number of inserts or in average base coverage.
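
The gene catalogue normalisation mentioned for INSERT_NORM and BASE_NORM amounts to a division. The sketch below assumes that a single marker gene coverage value for the sample has already been obtained from the mOTUs output; how that value is aggregated is not specified here, and the function and gene names are made up for this example. The gene catalogue must be quantified in the same unit (inserts or base coverage) as the chosen mode.

```python
from typing import Dict


def copies_per_cell(
    gene_coverage: Dict[str, float], marker_gene_coverage: float
) -> Dict[str, float]:
    """Divide each gene's coverage by the single-copy marker gene coverage."""
    if marker_gene_coverage <= 0:
        raise ValueError("marker gene coverage must be positive")
    return {
        gene: coverage / marker_gene_coverage
        for gene, coverage in gene_coverage.items()
    }


# Hypothetical length-normalised coverages from a gene catalogue.
catalogue = {"geneA": 30.0, "geneB": 7.5}
print(copies_per_cell(catalogue, marker_gene_coverage=15.0))
```

A value of 2.0 would indicate roughly two copies of the gene per cell on average, and 0.5 a gene carried by about half of the cells.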

`motus map_tax`#

Maps short reads against the mOTUs marker gene database.

Required arguments#

Input -f, -r, -s: one or multiple FastQ/A files, which can be gzipped. Ensure the order of input files is the same for -f and -r if using paired-end data.

Output -o: path to the output file.

Option overview#

Option	Description
`-f`, `--forward`	Input path - Paired Forward: One or more FastQ/A files (optionally gzipped) containing forward reads. The input files must have the same order for both forward and reverse.
`-r`, `--reverse`	Input path - Paired Reverse: One or more FastQ/A files (optionally gzipped) containing reverse reads. The input files must have the same order for both forward and reverse.
`-s`, `--single`	Input path - Single: One or more FastQ/A files (optionally gzipped). The order of the input files does not matter for single-end files.
`-o`, `--output-file`	Output prefix: Path to the output files. This prefix is also used for intermediate files.
`-l`, `--alignment-length`	Length: Filter alignments if their length is below this value. Default value is `75`. Note: Choose a value less than or equal to the length of the reads.
`-t`, `--threads`	Threads: Number of threads to use for the alignment step. Default is `1`.

`motus calc_mgc`#

Calculates number of inserts mapping to each marker gene cluster within the mOTUs marker gene database.

Required arguments#

Input -i: a SAM or BAM file generated after running motus map_tax.

Output -o: path to the output file.

Option overview#

Option	Description
`-i`, `--input-file`	Input path: Path to the SAM/BAM file generated after running `motus map_tax`.
`-o`, `--output-file`	Output path: Path to the output file.
`-l`, `--alignment-length`	Length: Filter alignments if their length is below this value. Default value is `75`. Note: Choose a value less than or equal to the length of the reads.

`motus calc_motu`#

Calculates the taxonomic profile based on number of inserts mapped to the corresponding marker gene clusters.

Required arguments#

Input -i: the MGC file generated after running motus calc_mgc.

Output -o: path to the taxonomic profiles containing abundances as counts and as relative abundances.

Option overview#

Option	Description
`-i`, `--input-file`	Input path: Path to MGC abundance table generated after running `motus calc_mgc`.
`-o`, `--output-file`	Output path: Path to the output files.
`-n`, `--sample-name`	Sample name: Name of the sample. Required when merging samples. The default value is `unnamed sample`.
`-g`, `--marker-genes`	Sensitivity: The number of marker genes with abundance required to call a mOTU present. Default value is `3`, with a minimum of `1` and a maximum of `10`.
`-y`, `--counting-mode`	Counting method: mOTUs can count in different modes. The default mode is `INSERT_SCALED`. The other modes are `INSERT_RAW`, `INSERT_NORM`, `BASE_RAW`, and `BASE_NORM`. For more details, see algorithm.

`motus downloadMGDB`#

Downloads the marker gene reference database required for profiling.

Option overview#

Option	Description
`-f`, `--force`	Force: Download the database even if it’s already present.

`motus merge`#

Merges taxonomic profiles from multiple samples into one (tab-separated) table.

Note: Requires that each profile is named (using -n in motus profile or motus calc_motu).

Required arguments#

Input -i: mOTUs profile files to merge. The input can be provided as a text file with one line per profile or as a space-separated list containing multiple mOTUs profiles. At least two profiles are required. Note: These files must be generated using the same mOTUs version and the same parameters.

Output -o: path to the merged profile file.

Option overview#

Option	Description
`-i`, `--input-files`	Input files: A space-separated list of profiles produced after running `motus profile` or `motus calc_motu`. Alternatively, a text file containing paths to generated profiles, one profile per line.
`-o`, `--output-file`	Output path: Path to the output file containing merged profiles.

Option description#

Input files#

Option	Description
`-i`, `--input-files`	Input files: A space-separated list of profiles produced after running `motus profile` or `motus calc_motu`. Alternatively, a text file containing paths to generated profiles, one profile per line.

We will use samples from project PRJEB52368 as an example. After running motus profile on selected samples, we obtain the following output files: SAMEA5998847.mOTUs4, SAMEA6009611.mOTUs4, SAMEA6009792.mOTUs4, SAMEA6009843.mOTUs4, and SAMEA6009888.mOTUs4.

IMPORTANT: When running motus profile, sample names should be indicated using the -n parameter (see the Sample name section of motus profile). Unnamed samples are unsuitable for merging.

Correct usage

`motus merge -i SAMEA5998847.mOTUs4 SAMEA6009611.mOTUs4 SAMEA6009792.mOTUs4 SAMEA6009843.mOTUs4 SAMEA6009888.mOTUs4 -o PRJEB52368.tsv`: lists the profiles directly on the command line.

`motus merge -i profiles_to_merge.txt -o PRJEB52368.tsv`: points to a file listing the profiles, where the content of profiles_to_merge.txt is:

/path/to/SAMEA5998847.mOTUs4
/path/to/SAMEA6009611.mOTUs4
/path/to/SAMEA6009792.mOTUs4
/path/to/SAMEA6009843.mOTUs4
/path/to/SAMEA6009888.mOTUs4

Incorrect usage (mOTUs will abort with an error)

`motus merge -i SAMEA5998847.mOTUs4 -o PRJEB52368.tsv`: merge requires at least two profiles.

`motus merge -i SAMEA5998847.mOTUs4 SAMEA6009611.mOTUs4 SAMEA6009611.mOTUs4 -o PRJEB52368.tsv`: the same sample is indicated twice.
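
When many profiles are merged, the list file can be generated by a script. The following sketch writes such a file and rejects the two mistakes shown above beforehand. Both function names are hypothetical, and the duplicate check compares file paths only, whereas mOTUs works with the sample names stored in the profiles.

```python
from typing import Sequence


def check_merge_inputs(profiles: Sequence[str]) -> None:
    """Raise ValueError if fewer than two profiles or a repeated path is given."""
    if len(profiles) < 2:
        raise ValueError("motus merge requires at least two profiles")
    seen = set()
    for path in profiles:
        if path in seen:
            raise ValueError(f"profile listed twice: {path}")
        seen.add(path)


def write_profile_list(profiles: Sequence[str], list_path: str) -> None:
    """Write one profile path per line, as expected by `motus merge -i`."""
    check_merge_inputs(profiles)
    with open(list_path, "w") as handle:
        handle.write("\n".join(profiles) + "\n")


profiles = [
    "/path/to/SAMEA5998847.mOTUs4",
    "/path/to/SAMEA6009611.mOTUs4",
    "/path/to/SAMEA6009792.mOTUs4",
]
check_merge_inputs(profiles)  # passes silently
```

Calling `write_profile_list(profiles, "profiles_to_merge.txt")` then produces a file that can be passed to motus merge with -i.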

`motus classify`#

Assigns the provided genomes to a mOTU if the corresponding taxon is present within the database.

Required arguments#

Input -i: a text file containing a list of paths to genome files in FastA(.gz) format.

Output -o: path to output file, which contains one line per genome with its associated mOTU.

Option overview#

Option	Description
`-i`, `--input-file`	Input file - Genome list: A text file listing genome files that will be associated with existing mOTUs.
`-o`, `--output-file`	Output path - Classification: The output file, containing one line per genome with its associated mOTU.
`-t`, `--threads`	Threads: Number of threads to use for aligning against the mOTUs MG database. Default is `1`.

Option description#

Input file#

Option	Description
`-i`, `--input-file`	Input file - Genome list: A text file listing genome files that will be associated with existing mOTUs.

The input of motus classify is a text file listing genome sequence files in FastA format, one file per line. The genome sequence files can be gzipped, and the filenames of all genomes must be unique.

Correct usage

The content of genomes.txt is:

$ cat genomes.txt
/a/b/c.fa
/a/c/d.fa.gz

Incorrect usage (mOTUs will abort with an error)

$ cat genomes.txt
/a/b/c.fa /a/c/d.fa

Two genome files are listed on the same line.

$ cat genomes.txt
/a/b/c.fa
/a/c/c.fa

The filename c.fa is not unique.
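
A genome list can be checked against these two rules before it is passed to motus classify. The sketch below is a hypothetical helper, not part of mOTUs. It assumes that paths contain no whitespace and that uniqueness is judged on the file name without its directory.

```python
import os
from typing import List


def read_genome_list(list_path: str) -> List[str]:
    """Return the listed genome paths, or raise ValueError on a malformed list."""
    paths: List[str] = []
    seen_names = set()
    with open(list_path) as handle:
        for line_no, line in enumerate(handle, start=1):
            entry = line.strip()
            if not entry:
                continue  # ignore empty lines
            if len(entry.split()) > 1:
                raise ValueError(f"line {line_no}: more than one path on a line")
            name = os.path.basename(entry)
            if name in seen_names:
                raise ValueError(f"line {line_no}: filename {name} is not unique")
            seen_names.add(name)
            paths.append(entry)
    return paths
```

For the correct example above, `read_genome_list("genomes.txt")` returns both paths, while either of the incorrect files raises a `ValueError`.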

Output file#

Option	Description
`-o`, `--output-file`	Output path - Classification: The output file, containing one line per genome with its associated mOTU.

The tabular output file contains one line per submitted genome, indicating either the assigned mOTU, `<6MGs-no_mOTU` if the genome contains fewer than 6 of the 10 marker genes, or `Novel-no_mOTU` if the genome contains at least 6 marker genes but could not be assigned to any mOTU.
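
The three possible outcomes can be written as a small decision function. This sketch only mirrors the description above: the function name and its inputs are made up for illustration, and the threshold of six marker genes is read from the `<6MGs-no_mOTU` label.

```python
from typing import Optional


def classification_label(n_marker_genes: int, assigned_motu: Optional[str]) -> str:
    """Return the value reported for one genome, following the rules above."""
    if n_marker_genes < 6:
        return "<6MGs-no_mOTU"   # too few marker genes to attempt an assignment
    if assigned_motu is None:
        return "Novel-no_mOTU"   # enough marker genes, but no matching mOTU
    return assigned_motu


print(classification_label(5, None))
print(classification_label(8, None))
print(classification_label(8, "mOTUv4.0_001734"))
```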

Threads#

Option	Description
`-t`, `--threads`	Threads: Number of threads to use for aligning against the mOTUs MG database. Default is `1`.

motus classify is only partly multi-threaded; using up to 32 threads usually gives a considerable speed-up when classifying larger genome sets.

`motus prep_long`#

Prepares long reads for profiling by mOTUs. Has to be run before the motus profile command.

Required arguments#

Input -i: input file containing long reads in FastQ/A(.gz) format.

Output -o: output FastA file containing the converted reads, suitable as input for motus profile.

Option overview#

Option	Description
`-i`, `--input-file`	Input path: The input file containing long reads in FastQ/A(.gz) format.
`-o`, `--output-file`	Output path: The output file containing converted short reads in FastA format.
`-sl`, `--splitting-length`	Splitting length: Target length of the converted reads. Default value is `300`.
`-ml`, `--minimum-length`	Minimum length: Converted reads shorter than the indicated length will not be written to the output. The default value is `50`.

Option description#

Splitting length and minimum length#

The motus prep_long function splits every long read in the dataset into non-overlapping fragments of 300 base pairs (or the value of -sl) in length:

|--SL--|                                                |--- >ML?
|======|======|======|======|======|======|======|======|===

The final fragment is only written to the output file if its length is at least 50 base pairs (or the value of -ml). Fragments do not overlap, because overlapping fragments would distort base coverage quantification.
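
The splitting scheme described above can be sketched in a few lines. This is an illustration of the principle on a single sequence, not the mOTUs implementation: the function name is made up, and reading and writing of FastQ/A files is left out.

```python
from typing import List


def split_long_read(
    sequence: str,
    splitting_length: int = 300,  # corresponds to -sl
    minimum_length: int = 50,     # corresponds to -ml
) -> List[str]:
    """Cut a read into consecutive, non-overlapping fragments."""
    fragments = []
    for start in range(0, len(sequence), splitting_length):
        fragment = sequence[start:start + splitting_length]
        # Only the final fragment can be shorter than the splitting length;
        # it is kept only if it reaches the minimum length.
        if len(fragment) == splitting_length or len(fragment) >= minimum_length:
            fragments.append(fragment)
    return fragments


read = "A" * 650
print([len(f) for f in split_long_read(read)])
```

A 650-base read yields two fragments of 300 bases and a final fragment of 50 bases. For a 640-base read, the remaining 40 bases fall below the minimum length and are discarded.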

`motus genomes`#

Queries the mOTUs genome database to find genomes matching indicated mOTUs identifiers, taxonomic clades, or functional annotations.

Note

First-time execution of this command downloads the mOTUs annotation database, which requires 17.7G of storage.

Required arguments#

Input -i: a list of search queries separated by space. Alternatively, a text file listing search queries, with one query per line.

Output -o: output table containing the genome identifiers matching the search queries and the annotations requested by the user. This file can be used as input for motus download.

Option overview#

Option	Description
`-i`, `--input-queries`	Input - List of queries: A list of terms to query within the mOTUs annotation database.
`-o`, `--output-file`	Output file: The output table contains an overview of genomes matching indicated queries, accompanied by annotations specified with the `-d` parameter.
`-d`, `--details`	Details: Annotations to report. Options include `KEGG`, `PFAM`, `EGGNOG`, and `TAXONOMY`.

Option description#

List of queries#

Option	Description
`-i`, `--input-queries`	Input - List of queries: A list of terms to query within the mOTUs annotation database.

A list of search queries separated by space. Alternatively, a text file listing search queries, with one query per line.

Output file#

Option	Description
`-o`, `--output-file`	Output file: The output table contains an overview of genomes matching indicated queries, accompanied by annotations specified with the `-d` parameter.

By default, only the names of the genomes and the search query are reported. Using the -d option adds columns with functional and taxonomic annotations.

Details#

Option	Description
`-d`, `--details`	Details: Annotations to report. Options include `KEGG`, `PFAM`, `EGGNOG`, and `TAXONOMY`.

The -d option allows users to specify which annotation to report when using the motus genomes command. Reporting multiple annotation types is possible, e.g. by -d KEGG PFAM.

`motus download`#

Downloads sequences for indicated genomes from the mOTUs genome database.

Required arguments#

Input -i: a list of genome identifiers separated by space. Alternatively, a text file listing genome identifiers, with one genome per line. The file generated by motus genomes can be used as input.

Output -o: path to folder that will store downloaded sequences.

Option overview#

Option	Description
`-i`, `--input-genomes`	Input - List of genomes: A list of genome identifiers specifying which sequences to download.
`-o`, `--output-folder`	Output path: Output directory in which the downloaded sequences will be saved.
`-r`, `--representatives`	Representatives only: Only download sequences from representative genomes.

Option description#

List of genome identifiers#

Option	Description
`-i`, `--input-genomes`	Input - List of genomes: A list of genome identifiers specifying which sequences to download.

The download command supports two different types of input files. You can provide a simple list of genome names or a detailed table from a previous analysis.

Option 1: Simple Genome List

This is a basic text file where you provide one genome name per line. Use this format if you already have a specific list of genomes you want to download.

ELLE19-1_SAMN09288280_MAG_00000004
ELLE19-1_SAMN09288282_MAG_00000006
ELLE19-1_SAMN09288284_MAG_00000011

Option 2: mOTUs Genome Table

Alternatively, you can provide the output file generated by the motus genomes command. This format is a tab-separated table that includes a header line followed by the genome names and their associated query data.

GENOME                                QUERY
ELLE19-1_SAMN09288280_MAG_00000004    mOTUv4.0_001734
ELLE19-1_SAMN09288282_MAG_00000006    mOTUv4.0_001734
ELLE19-1_SAMN09288284_MAG_00000011    mOTUv4.0_001734
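
Both input formats carry the genome identifier in the first column, so a script that prepares or inspects such files can treat them alike. The sketch below shows one way to read either format; it is a hypothetical helper, not the parser used by mOTUs. It assumes that the header line starts with `GENOME`, as in the example above, and that genome names contain no whitespace.

```python
from typing import List


def read_genome_ids(path: str) -> List[str]:
    """Return genome identifiers from a simple list or a motus genomes table."""
    ids: List[str] = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip empty lines
            if not ids and fields[0] == "GENOME":
                continue  # header line of the genome table
            ids.append(fields[0])
    return ids
```

Applied to either of the two example files above, the function returns the same three genome identifiers.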

mOTUs is part of SIB's portfolio of open tools and databases.

mOTUs is part of the ELIXIR-CH Service Delivery Plan.
