MetaSanity v1.1 - Adding additional analyses

9 minute read

Published:

MetaSanity v1.1 releases 10/18/2019 with some new features! See the changelog for a complete list of changes in version 1.1.

We also incorporated a more standardized analysis for determining genome quality (Bowers et al). This blog will provide a walk-through for how we implemented this protocol.

Why do we want to explain this? BioMetaDB is designed to be updated, and the underlying SQL engine handles these updates easily.

Once a BioMetaDB project is created from a MetaSanity run, we can access this project from our own code.

MetaSanity v1.1 - Adding genome quality labels

Goals

Build an analysis that incorporates a BioMetaDB project.

The script that we will generate in this blog is in your MetaSanity installation in the Accessories directory should you wish to complete the analysis without this mini-lesson in Python scripting.

In this blog, we will be:

  • Adding a column to our BioMetaDB database table that describes a genome’s quality as high, medium, low, or incomplete.
  • Update existing data for incomplete genomes.

Preparing

Prior to running this script, ensure that BioMetaDB is on your python path. You can add it to your path by appending the following line to your .bashrc file and restarting your bash session:

export PYTHONPATH=/path/to/MetaSanity/BioMetaDB:$PYTHONPATH

Backing up your project

Since this particular update is released by our lab, a backup is not needed, assuming that you have not radically changed your project’s architecture since running MetaSanity.

However, this blog provides you with the tools to begin drastically changing your BioMetaDB project on your own. Users are STRONGLY advised to create a copy of their BioMetaDB project prior to running any update that is not directly released from this lab.

Part 1 - RecordList and UpdateData

Without getting too deep into OOP concepts, we can think of a RecordList as:

  • A list (e.g. [record-1, record-2] etc.) AND a dictionary (e.g. {record1: metadata} etc.) of the metadata in our project
  • A way to query our BioMetaDB project for specific gene calls, annotations, evaluation scores, etc.
  • A way to get FASTA records or metadata
  • A method to update the underlying database structure

We can think of UpdateData as an empty data frame - we put into it specific genome ids that we wish to update, and we add data categories and values as needed. This class generates a .tsv file, which BioMetaDB uses to update its database table schema.

Moreover, UpdateData is used to add new columns to our BioMetaDB project. If we are seeking to add a data column to our table (as we are here, the column “quality”) that does not already exist, then we must use UpdateData. Otherwise, a RecordList returned by the function get_table will suffice for making updates to columns that already exist or for genome records that already exist.

Let’s begin writing a small script. I will name this script bowers_et_al_2017.py.

Begin by adding these lines:

#!/usr/bin/env python3
import sys
from BioMetaDB import get_table, UpdateData

assert len(sys.argv) == 2, "usage: python3 bowers_et_al_2017.py biometadb-project"

evaluation_data = get_table(sys.argv[1], "evaluation")
evaluation_data.query()

dt = UpdateData()

The first two lines are pretty standard - we have our bash shebang for if we make this script executable, and we import the sys package for accessing command-line arguments.

The following line imports the function get_table, which generates a RecordList for viewing our data. The data structure UpdateData is also imported. The next line is a check for users who run this script. Note that we have not checked that the path actually exists (we’ll leave that to you).

Next, we use the function get_table to look at the BioMetaDB project the user passed for the table named “evaluation”, and query this table for all records. Note: this requires that PhyloSanity ran successfully.

Finally, we create an UpdateData object named dt to store our data prior to updating the BioMetaDB project.

Part 2 - Gathering data

Now that all of our variables are initialized, we can search our project for valid data.

We will assign high, medium, low, and incomplete quality labels to our genomes based on the provisions established in the Bowers paper. In that paper, these qualities are defined as:

  • High - Genome codes for 23S and 16S rRNA and at least 18 tRNAs. Genome is >90% complete and <5% contaminated based on CheckM output (presence of single copy genes).
  • Medium - Genome completion ≥50% complete and <10% contaminated.
  • Low - Genome completion <50% complete and <10% contaminated.

We will define incomplete to be anything not meeting these criteria.

With that in mind, let’s look at the next section of the script:

for genome in evaluation_data.keys():
    genome_id = genome.rstrip(".fna")
    genome_rl = get_table(sys.argv[1], table_name=genome_id)
    genome_rl.query("prokka LIKE 'tRNA%'")
    num_tRNAs = len(genome_rl)
    genome_rl.query("prokka == '23S ribosomal RNA'")
    has_23 = len(genome_rl) > 0
    genome_rl.query("prokka == '16S ribosomal RNA'")https://raw.githubusercontent.com/cjneely10/MetaSanity/v0.0.5/install.py
    has_16 = len(genome_rl) > 0
    has_23_16_rRNA = has_23 and has_16
    completion = evaluation_data[genome].completion
    contamination = evaluation_data[genome].contamination
    # Bowers et al determinations for MAG/SAG assembly quality
    if (num_tRNAs >= 18 and has_23_16_rRNA) and completion > 90 and contamination < 5:
        dt[genome].setattr("quality", "high")
    elif completion >= 50 and contamination < 10:
        dt[genome].setattr("quality", "medium")
    elif completion < 50 and contamination < 10:
        dt[genome].setattr("quality", "low")
    else:
        dt[genome].setattr("quality", "incomplete")
        evaluation_data[genome].is_complete = False

We are using a for loop to check each genome for our search criteria. By default, every key returned by keys is the filename of a FASTA file. FuncSanity generated a database table for every genome it analyzed, which stores out annotations. By calling get_table with this table name (less the extension), we get the associated annotations.

From here. we can query the RecordList stored in genome_rl just as we did the one stored in evaluation_data. We first check for tRNA annotations and store the number of records that match out query using the len function. We then query for 23S rRNA and check if any were returned. We do the same for 16S rRNA. The next line combines these two values for a single check - a bit superfluous, but it is better to be clear!

Next, we check our genome’s completion value. Since evaluation_data is a list/dict, we can access it and manipulate the underlying stored values. In this case, we are getting the stored completion value. We do the same for contamination in the next line.

Finally, we assign qualities to our data. Each portion of the if statement refers to one of the Bowers quality assignments.

As I mentioned at the beginning of this blog, our goal is to update the underlying database with an additional column (the quality labels), and we wish to adjust any incomplete genomes to change their PhyloSanity-determined is_complete value to reflect their newly determined status.

The RecordList class handles changes to the existing architecture; but, if we want to add additional column data, we must incorporate an UpdateData object. In each portion of the if statement, we directly assign a genome’s corresponding quality score by using UpdateData. In the else portion of the if block, we access the RecordList object and change its existing is_complete value.

Part 3 - Saving data and wrapping up

The final portion of our code consists of two lines:

evaluation_data.save()
evaluation_data.update(data=dt)

The first line updates all values that were changed in the existing database architecture. The second line stores the new database info to the BioMetaDB project. Optionally, a folder of FASTA records could be provided to add to a database table by passing directory_name="/path/to/genome-folder".

The update function will destroy the existing project architecture; so, if we want to use our new project structure, we must then call get_table again to get the new project data as a RecordList.

And that is it! We have just incorporated a genome quality analysis for all of our genomes in less than 40 lines of code.

Here is the complete script:

#!/usr/bin/env python3
import sys
from BioMetaDB import get_table, UpdateData

assert len(sys.argv) == 2, "usage: python3 bowers_et_al_2017.py biometadb-project"

evaluation_data = get_table(sys.argv[1], "evaluation")
evaluation_data.query()

dt = UpdateData()
for genome in evaluation_data.keys():
    genome_id = genome.rstrip(".fna")
    genome_rl = get_table(sys.argv[1], table_name=genome_id)
    genome_rl.query("prokka LIKE 'tRNA%'")
    num_tRNAs = len(genome_rl)
    genome_rl.query("prokka == '23S ribosomal RNA'")
    has_23 = len(genome_rl) > 0
    genome_rl.query("prokka == '16S ribosomal RNA'")
    has_16 = len(genome_rl) > 0
    has_23_16_rRNA = has_23 and has_16
    completion = evaluation_data[genome].completion
    contamination = evaluation_data[genome].contamination
    # Bowers et al determinations for MAG/SAG assembly quality
    if (num_tRNAs >= 18 and has_23_16_rRNA) and completion > 90 and contamination < 5:
        dt[genome].setattr("quality", "high")
    elif completion >= 50 and contamination < 10:
        dt[genome].setattr("quality", "medium")
    elif completion < 50 and contamination < 10:
        dt[genome].setattr("quality", "low")
    else:
        dt[genome].setattr("quality", "incomplete")
        evaluation_data[genome].is_complete = False

evaluation_data.save()
evaluation_data.update(data=dt)

There is much more that can be incorporated through the use of UpdateData and RecordList! See the BioMetaDB page for more information!

Citations

Bowers, Robert M et al. “Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea.” Nature biotechnology vol. 35,8 (2017): 725-731. doi:10.1038/nbt.3893. https://www.nature.com/articles/nbt.3893