MAG Quality Control
Overview
This module is designed to function as both a standalone MAG MAG QC pipeline as well as a component of the larger CAMP/CAP2 metagenome analysis pipeline. As such, it is both self-contained (ex. instructions included for the setup of a versioned environment, etc.), and seamlessly compatible with other CAMP modules (ex. ingests and spawns standardized input/output config files, etc.).
The CAMP MAG quality-checking pipeline wraps several tools to assess the overall quality of the MAG binning process. The tools incorporated (CheckM, cmseq, GTDB-Tk, dnadiff) were inspired by Saheb Kashaf et al.’s MAG Snakemake workflow. More quality metrics will be added as they are validated on simulated and real gold-standard datasets. For example, adding QUAST’s ability to identify contiguity and misassemblies would be useful.
Installation
<<<<<<< HEAD 1. Clone repo from Github. ======= 1. Clone repo from Github. >>>>>>> main
Set up the conda environment (contains, Snakemake) using
configs/conda/mag_qc.yaml.
3. Make sure the installed pipeline works correctly. pytest only generates temporary outputs so no files should be created.
cd camp_mag_qc
conda env create -f configs/conda/mag_qc.yaml
conda activate mag_qc
pytest .tests/unit/
Using the Module
- Input:
/path/to/samples.csvprovided by the user. Note: The workflow expects MAG FastAs named with the format:
{bin_num}.fa, where the prefixbin_numis a unique identifier and the extension isfa(i.e.: standard MetaWRAP output). If this is not how your FastAs are formatted, either update theSnakefileor your MAG files.
Output: 1) An output config file summarizing 2) the module’s outputs.
/path/to/work/dir/mag_qc/final_reports/{sample}.summary.csvthat provides summary statistics of each MAG in the{sample}dataset:Completeness contamination, as measured by CheckM’s lineage-specific marker gene heuristic
Strain heterogeneity, as measured by
cmseq’s equation \(dN * 100 / dS\)Taxonomic classification, as estimated by GTDB-Tk’s gene-based tree placement method
N50, MAG size (number of base-pairs), and GC content as adapted from a MetaWRAP script
MAG coverage by the closest reference genome, reference genome coverage by that MAG, average nucleotide identity between MAG and reference genome, as calculated by
dnadiff
/path/to/work/dir/mag_qc/final_reports/samples.csvsummarizing the locations of each of the summary reports
Structure:
└── workflow
├── Snakefile
├── mag_qc.py
├── utils.py
└── __init__.py
workflow/mag_qc.py: Click-based CLI that wraps thesnakemakeand unit test generation commands for clean management of parameters, resources, and environment variables.workflow/Snakefile: Thesnakemakepipeline.workflow/utils.py: Sample ingestion and work directory setup functions, and other utility functions used in the pipeline and the CLI.
- Make your own
samples.csvbased on the template inconfigs/samples.csv. Sample test data can be found intest_data/. ingest_samplesinworkflow/utils.pyexpects Illumina reads in FastQ (may be gzipped) form and de novo assembled contigs in FastA formsamples.csvrequires either absolute paths or paths relative to the directory that the module is being run in
- Make your own
Update the relevant parameters in
configs/parameters.yaml.Update the computational resources available to the pipeline in
resources.yaml.- Download the database dependencies for GTDB-Tk and CheckM.
Note:
GTDBTK_DATA_PATHhas to beexport-ed every time you open a new instance of the terminal
export GTDBTK_DATA_PATH='/path/to/databases/metagenomics/GTDBTk_R202'
wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/auxillary_files/gtdbtk_r202_data.tar.gz -P ${GTDBTK_DATA_PATH}
tar xvzf ${GTDBTK_DATA_PATH}/gtdbtk_r202_data.tar.gz -C ${GTDBTK_DATA_PATH} --strip 1
export CHECK_DATA_PATH='/path/to/databases/metagenomics/CheckM_2015_01_16'
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz -P ${CHECK_DATA_PATH}
tar xvzf ${CHECK_DATA_PATH}/checkm_data_2015_01_16.tar.gz -C ${CHECK_DATA_PATH} --strip 1
# Tell CheckM where to find this data before running anything:
checkm data setRoot ${CHECK_DATA_PATH}
- To run CAMP on the command line, use the following, where
/path/to/work/diris replaced with the absolute path of your chosen working directory, and/path/to/samples.csvis replaced with your copy ofsamples.csv. The default number of cores available to Snakemake is 1 which is enough for test data, but should probably be adjusted to 10+ for a real dataset.
Relative or absolute paths to the Snakefile and/or the working directory (if you’re running elsewhere) are accepted!
- To run CAMP on the command line, use the following, where
python /path/to/camp_mag_qc/workflow/mag_qc.py \
(-c max_number_of_local_cpu_cores) \
-d /path/to/work/dir \
-s /path/to/samples.csv
Note: This setup allows the main Snakefile to live outside of the work directory.
- To run CAMP on a job submission cluster (for now, only Slurm is supported), use the following.
--slurmis an optional flag that submits all rules in the Snakemake pipeline assbatchjobs.In Slurm mode, the
-cflag refers to the maximum number ofsbatchjobs submitted in parallel, not the pool of cores available to run the jobs. Each job will request the number of cores specified by threads inconfigs/resources/slurm.yaml.
sbatch -J jobname -o jobname.log << "EOF"
#!/bin/bash
python /path/to/camp_mag_qc/workflow/mag_qc.py --slurm \
(-c max_number_of_parallel_jobs_submitted) \
-d /path/to/work/dir \
-s /path/to/samples.csv
EOF
6. After checking over final_reports/ and making sure you have everything you need, you can delete all intermediate files to save space.
python /path/to/camp_mag_qc/workflow/mag_qc.py cleanup \
-d /path/to/work/dir \
-s /path/to/samples.csv
7. If for some reason the module keeps failing, CAMP can print a script containing all of the remaining commands that can be run manually.
python /path/to/camp_mag_qc/workflow/mag_qc.py --dry_run \
-d /path/to/work/dir \
-s /path/to/samples.csv > cmds.txt
python /path/to/camp_mag_qc/workflow/mag_qc.py commands cmds.txt
Extending the Module Extending the Module ——————–
We love to see it! This module was partially envisioned as a dependable, prepackaged sandbox for developers to test their shiny new tools in.
These instructions are meant for developers who have made a tool and want to integrate or demo its functionality as part of the standard {{ cookiecutter.module_name }} workflow, or developers who want to integrate an existing tool.
- Write a module rule that wraps your tool and integrates its input and output into the pipeline.
This is a great Snakemake tutorial for writing basic Snakemake rules.
If you’re adding new tools from an existing YAML, use
conda env update --file configs/conda/existing.yaml --prune.If you’re using external scripts and resource files that i) cannot easily be integrated into either utils.py or parameters.yaml, and ii) are not as large as databases that would justify an externally stored download, add them to
workflow/ext/orworkflow/ext/scripts/and userule external_ruleas a template to wrap them.
- Update the
make_configinworkflow/Snakefilerule to check for your tool’s output files. Updatesamples.csvto document its output if downstream modules/tools are meant to ingest it. If you plan to integrate multiple tools into the module that serve the same purpose but with different input or output requirements (ex. for alignment, Minimap2 for Nanopore reads vs. Bowtie2 for Illumina reads), you can toggle between these different ‘streams’ by setting the final files expected by
make_configusing the example functionworkflow_mode.Update the description of the
samples.csvinput fields in the CLI scriptworkflow/{{ cookiecutter.module_slug }}.py.
- Update the
- If applicable, update the default conda config using
conda env export > config/conda/{{ cookiecutter.module_slug }}.yamlwith your tool and its dependencies. If there are dependency conflicts, make a new conda YAML under
configs/condaand specify its usage in specific rules using thecondaoption (seefirst_rulefor an example).
- If applicable, update the default conda config using
Add your tool’s installation and running instructions to the module documentation and (if applicable) add the repo to your Read the Docs account + turn on the Read the Docs service hook.
- Run the pipeline once through to make sure everything works using the test data in
test_data/if appropriate, or your own appropriately-sized test data. Then, generate unit tests to ensure that others can sanity-check their installations. Note: Python functions imported from
utils.pyintoSnakefileshould be debugged on the command-line first before being added to a rule because Snakemake doesn’t port standard output/error well when usingrun:.
- Run the pipeline once through to make sure everything works using the test data in
python /path/to/camp_mag_qc/workflow/mag_qc.py --unit_test \
-d /path/to/work/dir \
-s /path/to/samples.csv
6. Increment the version number of the modular pipeline.
bump2version --allow-dirty --commit --tag major workflow/__init__.py \
--current-version A.C.E --new-version B.D.F
- If you want your tool integrated into the main CAP2/CAMP pipeline, send a pull request and we’ll have a look at it ASAP!
Please make it clear what your tool intends to do by including a summary in the commit/pull request (ex. “Release X.Y.Z: Integration of tool A, which does B to C and outputs D”).
Credits
This package was created with Cookiecutter as a simplified version of the project template.
This module is heavily inspired by four Snakefiles from MAG Snakemake workflow (Saheb Kashaf et al. 2021).
The MAG N50, size, and GC calculation rule was adapted from a script in MetaWRAP.
Free software: MIT