Algorithm#

motus profile#

The motus profile routine is comprised of three subroutines: map_tax, calc_mgc and calc_motu. Refer below for a detailed explanation on the algorithm behind each routine.

motus map_tax#

Summary#

Aligns metagenomic Illumina short reads against the mOTUs marker gene database using bwa (visit website), filters alignments by alignment length and percent identity.

Input#

  • One or more fastQ/A files, compressed format (.gz) accepted as well.

    • Reads can be both paired (-f, -r) or singletons (-s).

    • When submitting multiple runs from the same sample, files must be separated by a space, for example, -s a.fq.gz b.fq.gz.

    • When submitting multiple paired-end runs, file order must be consistent between -f and -r, for example, -f a.1.fq.gz b.1.fq.gz -r a.2.fq.gz b.2.fq.gz.

  • Minimal alignment length (-l). Default value = 75

Output#

  • A sorted BAM file containing filtered alignments, paired if possible.

Algorithm#

  1. Run bwa mem -a to align the input files against the mOTUs marker gene database.

  2. Filter the input, keeping only alignments that are longer than 75 (or the value of -l) nucleotides and display ≥97% sequence identity. Pair information is retained by adding /R1 and /R2 to each aligned read for paired alignments and /S for unpaired alignments.

  3. Sort the BAM file by read name using pysam sort -n.

motus calc_mgc#

Summary#

Reads the BAM file, assigns inserts to marker genes (MGs) and aggregates counts at marker gene cluster (MGC) level. The most complex algorithm of the mOTUs profiler tool.

Input#

  • The BAM file created using the motus map_tax routine.

Output#

  • Insertfate: A TSV file containing the assigned MG/MGC for each counted insert.

  • MGC: A TSV file containing the abundance of each MGC in different spaces: RAW/NORM/SCALED, INSERTS/BASES.

Algorithm#


Per Insert


  1. Stream the BAM file, always reading alignments one insert at a time.

  2. Find the best alignment per insert:

    • For each insert, choose the MG with the highest alignment score. For paired-end data, sum up the alignment score from the forward and the reverse read alignment.

      • What happens if the forward read and the reverse read align to different MGs? The MG with the highest alignment score is chosen. If alignment scores are identical, the insert is treated as a multi-mapper.

    • Report multiple MGs if multiple MGs share the highest alignment score.

    • If multiple MGs with the highest alignment score come from the same MGC, the insert is randomly assigned to one of these MGs. These cases are not considered to be multi-mappers.

  3. Split alignments into unique and multi-mappers:

    • a unique mapper is an insert which has one single MG as the best alignment.

    • a multi-mapper is an insert which has multiple MGs with the best alignments.

  4. Count unique mappers:

    • Each unique mapper is assigned to exactly one MG with a weight of 1.

    • Perform edge correction/inverse padding (see details).

  5. Count multi-mappers. Multi-mappers are inserts assigned to at least two different MGs from different MGCs. Each insert has a total weight of 1, which is distributed fractionally among the corresponding MGs/MGCs, with fractions proportional to the abundances estimated from unique mappers. For example:

    • Assume insert_z maps to both \(MG_1\) and \(MG_{10}\). \(MG_1\) belongs to \(MGC_X\), \(MG_{10}\) belongs to \(MGC_Y\).

    • If \(Unique_{MGC_X} = 30\) and \(Unique_{MGC_Y} = 5\):

      • \(Weight_{MG_1} = \frac{Unique_{MGC_X}}{Unique_{MGC_X} + Unique_{MGC_Y}} = \frac{30}{30 + 5} = 0.86\)

      • \(Weight_{MG_{10}} = \frac{Unique_{MGC_Y}}{Unique_{MGC_X} + Unique_{MGC_Y}} = \frac{5}{30 + 5} = 0.14\)

    • The abundance of \(MG_1\) increases by 0.86. The abundance of \(MG_{10}\) increases by 0.14.

    If a multi-mapper maps to MGCs with zero unique mappers, this insert is ignored during counting.

  6. Write each insert, its assigned MGs and weights to the Insertfate output file.


Per MG


Insert counts are summed up in RAW, NORM, and SCALED spaces:

  • RAW: Sum of edge-corrected abundances per MG (\(Insert\_Raw\) or \(Base\_Raw\)).

  • NORM: Normalise \(Insert\_Raw\) or \(Base\_Raw\) MG counts by dividing by MG length, for example:

    \(Insert\_Norm_{MG_1} = \frac{Insert\_Raw_{MG_1}}{Length_{MG_1}}\)
    \(Insert\_Norm_{MG_2} = \frac{Insert\_Raw_{MG_2}}{Length_{MG_2}}\)
    \(Insert\_Norm_{MG_N} = \frac{Insert\_Raw_{MG_M}}{Length_{MG_N}}\)

    The same calculation applies to \(Base\_Norm\).

  • SCALED: Rescale normalised MG counts by total inserts mapping to MGs. Note: This results in values above 1, which are easier to work with and don’t break standard methods that require counts instead of relative abundances.

    \(Insert\_Total = sum(Insert\_Raw_{MG_1},\ Insert\_Raw_{MG_2},\ ...\ Insert\_Raw_{MG_N})\)
    \(Denominator = sum(Insert\_Norm_{MG_1},\ Insert\_Norm_{MG_2},\ ...\ Insert\_Norm_{MG_N})\)
    \(Insert\_Scaled_{MG_1} = Insert\_Total \times \frac{Insert\_Norm_{MG_1}}{Denominator}\)
    \(Insert\_Scaled_{MG_2} = Insert\_Total \times \frac{Insert\_Norm_{MG_2}}{Denominator}\)
    \(Insert\_Scaled_{MG_N} = Insert\_Total \times \frac{Insert\_Norm_{MG_M}}{Denominator}\)

Per MGC


  • Sum up counts calculated above for each MGC. Each MG belongs to exactly one MGC.

  • For each MGC, report abundances according to INSERT_RAW, INSERT_NORM, INSERT_SCALED, BASE_RAW, BASE_NORM in the MGC output file.


Edge Correction


Reads are less likely to align with the same likelihood, length, and quality to gene edges when compared to the middle of the gene, a bias that edge correction compensates for. During edge correction, reads are first aligned to the full gene sequence. Then, the alignment is trimmed by 30 bases (default for bwa alignments) at the left and right edges, and the gene’s abundance is extrapolated from the middle part of the alignment:

            ABUNDANCE
FALSE        CORRECT        FALSE
=====|=====================|=====

      =====================
     Use TRUNC abundance and
    extrapolate to FULL length

Formula for edge correction: \(Abundance_{Full} = Length_{Full} \times \frac{Abundance_{Trunc}} {Length_{Trunc}}\)

motus calc_motu#

Summary#

Summarizes the MGC output file created using the calc_mgc routine on mOTU level.

Input#

Output#

  • Two taxonomic profile files ref-output_taxprofile are produced:

    • Taxa abundances as counts according to the selected counting method.

    • Taxa relative abundances.

Algorithm#

  1. For each mOTU, determine whether the number of MGCs with non-zero abundance is greater than or equal to the indicated minimum number of detected MGCs.

  2. For each mOTU with sufficient MGCs detected, calculate the median abundance across detected MGCs.

  3. Write abundances according to selected counting method to the output file.

  4. Divide each count by total counts to obtain relative abundances and write them to the output file.

motus classify#

Summary#

Associates provided genomes with existing mOTUs based on combined MG marker gene distances.

Input#

  • File listing paths to genome sequence files in fastQ/A(.gz) format.

Output#

  • Classify output file containing associations between each genome and existing mOTUs, or the reason why a genome could not be associated with an existing mOTU.

Algorithm#

Associating genomes to mOTUs consists of three independent tasks:

1. Extracting marker genes.

The fetchMGs tool (v2.1.2, visit GitHub) is used to extract MGs from all genomes. The tool first runs pyrodigal to identify protein-coding genes, and then annotates these genes using pyhmmer with calibrated cutoffs against the 10 MGs. Genes from genomes that contain at least 6 out of the 10 MGs are aggregated and used in the subsequent step. All other genomes are annotated as <6MGs-no_mOTU.

2. Aligning against mOTUs marker gene database.

The genes of the remaining genomes are aligned against the mOTUs marker gene database. The results are filtered to remove alignments against MGs belonging to mOTUv4.0_unassigned and alignments with \(<80%\) coverage.

3. Combining distances and assignment.

Distances between provided genomes and existing mOTUs are calculated by combining individual marker gene distances using length-weighted percent identities. A genome is assigned to the mOTU with the highest similarity if this similarity is \(\geq 96.5%\). Otherwise, it is annotated as Novel-no_mOTU.



ico1 mOTUs is part of SIB's portfolio of open tools and databases.

ico2 mOTUs is part of the ELIXIR-CH Service Delivery Plan.