Workflow Deployment Options
Recommended Deployment Strategy
Layered Environments with Singularity/Apptainer + Conda
The most robust and reproducible way to run this workflow is by combining containers and package management in a layered approach, as described in the official Snakemake documentation. This is the method we've implemented and recommend.
At first, it might sound redundant: "Why use a package manager inside a container?" The answer lies in achieving the ultimate balance of reproducibility, flexibility, and efficiency.
Let's continue with our kitchen and shopping list analogy:
- Singularity/Apptainer is the Portable Kitchen: It provides the entire standardized kitchen environment: the operating system, the plumbing, the wiring, and the core `conda` appliance. When you run the workflow with `--use-singularity` or `--use-apptainer`, you are ensuring every step of the analysis happens inside this identical, portable kitchen, whether you're on your laptop or an HPC cluster.
- `--use-conda` is the Automated Shopping Trip: For every single step in the main recipe (a Snakemake rule), `--use-conda` performs a specialized, just-in-time shopping trip. It reads a very specific shopping list (the rule's `environment.yaml` file) that details the exact "brand" and "version" of every ingredient (software like `fastqc=0.11.9`) needed for only that step. It then uses the `conda` tool inside the kitchen to instantly procure these items and place them in a clean, isolated pantry for that specific task.
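To make the "shopping list" concrete: each rule's environment file is a small Conda specification. The file name and pin below are illustrative only; the actual environment files for this workflow live alongside its rules (commonly in an `envs/` directory).

```yaml
# Illustrative per-rule environment file, e.g. envs/fastqc.yaml (hypothetical path)
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc=0.11.9   # the exact "brand" and "version" needed by this one step
```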
This two-layer system is the gold standard for several reasons:
- Absolute Reproducibility: You control the outer environment (the kitchen's OS via Singularity/Apptainer) and the inner environment (the specific ingredients for each step via Conda). This eliminates almost every variable that could cause results to differ between machines.
- Unmatched Flexibility: Need one brand of flour (samtools v1.9) for the cake and a different brand (samtools v1.15) for the bread? No problem. Each recipe step gets its own shopping trip and its own isolated pantry, so ingredients never get mixed up.
- Efficiency: The base kitchen (Apptainer container) stays small and simple; it only needs the basic `conda` appliance. You don't need to stock the kitchen with every possible ingredient from the start. The workflow performs these small, fast shopping trips on-the-fly, only when an ingredient is needed.
How It Works in Practice
When you execute the command `snakemake all --use-singularity --use-conda` to run a configured workflow, a precise sequence of events unfolds:
- DAG Calculation: First, Snakemake reads the entire `Snakefile` (the recipe). It compares the outputs you've asked for with the files that already exist, and it builds a full dependency graph of all jobs that need to run. This is the Directed Acyclic Graph (DAG), which serves as the master plan for the entire analysis.
- Environment Creation: Next, before a single computational job is run, Snakemake identifies all the unique software environments (`environment.yaml` files) required for the jobs in the DAG. It then creates all of these Conda environments. This is like doing all the grocery shopping for every course of a multi-course meal before you even start preheating the oven.
- Job Execution: With the plan (DAG) made and all the pantries stocked (Conda environments created), Snakemake begins executing jobs.
  - It starts with the jobs at the beginning of the DAG, which have no unfinished dependencies.
  - For each job, it uses Singularity/Apptainer to enter the standardized "kitchen."
  - Inside, it activates the specific, pre-built Conda environment ("pantry") for that job.
  - It runs the job's script in this doubly-isolated environment.
  - As each job finishes, Snakemake checks the DAG to see which downstream jobs are now unlocked and ready to run, continuing until the final output is generated.
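Rather than typing these flags every run, they can be recorded in a Snakemake profile and picked up with `--profile`. The sketch below is a minimal, hypothetical example that assumes a Snakemake version accepting the `--use-singularity`/`--use-conda` flags shown above; the profiles shipped with this repository may set additional options.

```yaml
# Hypothetical profile, e.g. profiles/default/config.yaml
# Invoke with: snakemake all --profile profiles/default
use-singularity: true   # outer layer: run every job inside the container "kitchen"
use-conda: true         # inner layer: build each rule's isolated Conda "pantry"
```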
 
Helpful Resources (Docs / Guides)
Installing Apptainer/Singularity requires administrator (root) privileges, which can make it difficult to set up on a personal machine.
However, because it is a standard tool in scientific computing, most HPC clusters already have it installed. We recommend checking your HPC's documentation or contacting its support team to confirm before you start.
Choosing an Execution Platform
In addition to the software deployment strategy (using Conda, Singularity, or both), you also need to decide on the hardware or platform where the workflow will run. Snakemake's plugin-based architecture allows it to adapt to a wide variety of execution and storage backends.
Execution Backends
High-Performance Computing (HPC) Clusters
A High-Performance Computing (HPC) cluster is a network of powerful servers designed for complex data analysis. Many researchers use HPC clusters to run bioinformatics workflows because they can process large datasets much faster than a personal computer. The workflow can be adapted to different HPC schedulers by creating custom profiles and using the appropriate executor plugin:
- Slurm: `snakemake-executor-plugin-slurm` (used in this workflow's pre-configured profiles)
- LSF: `snakemake-executor-plugin-lsf`
- PBS/Torque: `snakemake-executor-plugin-pbs`
- SGE: `snakemake-executor-plugin-sge`
- And many others, which you can explore in the Snakemake Executor Plugin Catalog.
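For orientation, an executor is usually selected inside such a profile together with scheduler defaults. The snippet below is only a sketch with placeholder values; refer to this workflow's pre-configured Slurm profile and your cluster's documentation for the real settings.

```yaml
# Hypothetical cluster profile, e.g. profiles/slurm/config.yaml
# Requires the snakemake-executor-plugin-slurm package to be installed.
executor: slurm
jobs: 100                    # cap on jobs submitted to the scheduler at once
use-singularity: true
use-conda: true
default-resources:
  slurm_partition: "batch"   # placeholder partition name
  mem_mb: 8000               # placeholder default memory per job
  runtime: 120               # placeholder default walltime in minutes
```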
Local Execution
For small datasets or testing, you can run the workflow on a local machine (e.g., a personal laptop or workstation). In this case, Snakemake does not require a special executor plugin and will run jobs directly on your computer.
Local execution is suitable for small genomes and limited datasets. For production analyses, an HPC cluster or cloud environment is recommended.
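For a quick local test, the only required setting is a core limit; no `executor:` entry is needed because the default executor runs jobs directly on the current machine. A minimal, hypothetical example:

```yaml
# Hypothetical local profile, e.g. profiles/local/config.yaml
cores: 4          # maximum number of CPU cores Snakemake may use on this machine
use-conda: true   # per-rule environments work the same way without a cluster
```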
Cloud and Container Orchestration
The workflow also supports various cloud and container platforms through dedicated executor plugins:
- Kubernetes: `snakemake-executor-plugin-kubernetes`
- Google Cloud Batch: `snakemake-executor-plugin-googlebatch`
- AWS Batch: `snakemake-executor-plugin-aws-batch`
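Switching to one of these platforms follows the same pattern: install the plugin and name it in the `executor:` field of a profile. Because cloud nodes generally do not share a local filesystem, such a setup is usually paired with one of the remote storage plugins described below. A hypothetical fragment:

```yaml
# Hypothetical cloud profile fragment, e.g. profiles/k8s/config.yaml
# Requires the snakemake-executor-plugin-kubernetes package to be installed.
executor: kubernetes
jobs: 50   # placeholder cap on concurrently submitted jobs
```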
Continuous Integration (CI) with GitHub Actions
This repository comes with a pre-configured GitHub Actions workflow (.github/workflows/main.yml)
that automatically tests the workflow's integrity on every push and pull
request. This ensures that the code is always in a working state.
The included GitHub Actions workflow runs on free, public runners, which have limited computational resources and disk space. It is designed for testing with small datasets only and cannot be used to run a real-world analysis.
For large-scale production runs, you can repurpose the
.github/workflows/main.yml file by configuring it to use self-hosted
runners with sufficient resources. This typically involves changing the
runs-on key in the workflow file to target your self-hosted runner group.
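For instance, the relevant part of the workflow file could be changed roughly as follows; the job name and runner labels are placeholders and must match the labels assigned when you register your self-hosted runner:

```yaml
# Excerpt from .github/workflows/main.yml (job name and labels are hypothetical)
jobs:
  test-workflow:
    # runs-on: ubuntu-latest            # default: free public GitHub-hosted runner
    runs-on: [self-hosted, linux, x64]  # target your own runner group instead
```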
Storage Backends
For workflows that run in the cloud or across different physical locations,
Snakemake can use storage plugins to seamlessly access data from various
remote storage backends. This means your Snakefile can reference remote files
(e.g., in an S3 bucket) as if they were on your local filesystem.
Some of the available storage plugins include:
- Amazon S3: `snakemake-storage-plugin-s3`
- Google Cloud Storage: `snakemake-storage-plugin-gcs`
- Microsoft Azure Blob Storage: `snakemake-storage-plugin-azure`
You can explore all available storage backends in the Snakemake Storage Plugin Catalog.
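As a rough sketch, the simplest way to use such a backend is to install the plugin and point the run at a default remote prefix, either on the command line or in a profile. The bucket and prefix below are placeholders, and credentials are typically supplied through the provider's usual environment variables or plugin settings:

```yaml
# Hypothetical profile fragment for S3-backed storage
# Requires the snakemake-storage-plugin-s3 package to be installed.
default-storage-provider: s3
default-storage-prefix: "s3://my-bucket/my-project"   # placeholder bucket/prefix
```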
Working on a High-Performance Computing (HPC) Cluster
You typically connect to an HPC cluster using SSH (Secure Shell), which provides a secure command-line interface to the remote servers.
HPC Learning Materials
If you are not familiar with high-performance computing, these resources are a great place to start.
- HPC Carpentry Lessons
- Tufts HPC Home Page
- Tufts HPC User Guide
Recommended Tool: Visual Studio Code (VSCode)
We highly recommend using an integrated development environment (IDE) like VSCode to interact with this workflow, especially when working on an HPC cluster.
VSCode is a free, powerful code editor that can connect to remote servers via SSH. This allows you to edit files, run commands, and manage your workflow on the HPC cluster directly from a user-friendly interface on your local machine. It offers several key advantages for scientific workflows:
- Remote Development: VSCode can seamlessly connect to remote servers, allowing you to edit files, run code, and manage projects directly on the HPC without leaving your local environment.
- Integrated Terminal: Access the HPC terminal within VSCode, enabling efficient command-line operations alongside your coding activities.
- Extensions and Customization: Enhance your development experience with extensions tailored for specific programming languages, debugging tools, and workflow optimizations.
Learning Resources for VSCode
Download and Install VSCode
Visit the official Visual Studio Code website to download the installer for your operating system.
Installation Instructions:
Windows
- Run the downloaded `.exe` installer.
- Follow the installation prompts, accepting the license agreement and selecting desired installation options (e.g., adding VSCode to your PATH).
macOS
- Open the downloaded `.dmg` file.
- Drag and drop the VSCode application into the Applications folder.