Workflow Deployment Options
Recommended Deployment Strategy
Layered Environments with Singularity/Apptainer + Conda
The most robust and reproducible way to run this workflow is by combining containers and package management in a layered approach, as described in the official Snakemake documentation. This is the method we've implemented and recommend.
At first, it might sound redundant: "Why use a package manager inside a container?" The answer lies in achieving the ultimate balance of reproducibility, flexibility, and efficiency.
Let's continue with our kitchen and shopping list analogy:
- Singularity/Apptainer is the Portable Kitchen: It provides the entire standardized kitchen environment: the operating system, the plumbing, the wiring, and the core `conda` appliance. When you run the workflow with `--use-singularity` or `--use-apptainer`, you are ensuring every step of the analysis happens inside this identical, portable kitchen, whether you're on your laptop or an HPC cluster.
- `--use-conda` is the Automated Shopping Trip: For every single step in the main recipe (a Snakemake rule), `--use-conda` performs a specialized, just-in-time shopping trip. It reads a very specific shopping list (the rule's `environment.yaml` file) that details the exact "brand" and "version" of every ingredient (software like `fastqc=0.11.9`) needed for only that step. It then uses the `conda` tool inside the kitchen to instantly procure these items and place them in a clean, isolated pantry for that specific task.
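To make the "shopping list" concrete: each rule's environment file is a small Conda specification. The file name and pin below are illustrative only; the actual environment files for this workflow live alongside its rules (commonly in an `envs/` directory).

```yaml
# Illustrative per-rule environment file, e.g. envs/fastqc.yaml (hypothetical path)
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc=0.11.9   # the exact "brand" and "version" needed by this one step
```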
This two-layer system is the gold standard for several reasons:
- Absolute Reproducibility: You control the outer environment (the kitchen's OS via Singularity/Apptainer) and the inner environment (the specific ingredients for each step via Conda). This eliminates almost every variable that could cause results to differ between machines.
- Unmatched Flexibility: Need one brand of flour (samtools v1.9) for the cake and a different brand (samtools v1.15) for the bread? No problem. Each recipe step gets its own shopping trip and its own isolated pantry, so ingredients never get mixed up.
- Efficiency: The base kitchen (Apptainer container) stays small and simple; it only needs the basic `conda` appliance. You don't need to stock the kitchen with every possible ingredient from the start. The workflow performs these small, fast shopping trips on-the-fly, only when an ingredient is needed.
How It Works in Practice
When you execute the command `snakemake all --use-singularity --use-conda` to run a configured workflow, a precise sequence of events unfolds:
- DAG Calculation: First, Snakemake reads the entire `Snakefile` (the recipe). It compares the outputs you've asked for with the files that already exist, and it builds a full dependency graph of all jobs that need to run. This is the Directed Acyclic Graph (DAG), which serves as the master plan for the entire analysis.
- Environment Creation: Next, before a single computational job is run, Snakemake identifies all the unique software environments (`environment.yaml` files) required for the jobs in the DAG. It then creates all of these Conda environments. This is like doing all the grocery shopping for every course of a multi-course meal before you even start preheating the oven.
- Job Execution: With the plan (DAG) made and all the pantries stocked (Conda environments created), Snakemake begins executing jobs.
  - It starts with the jobs at the beginning of the DAG, which have no unfinished dependencies.
  - For each job, it uses Singularity/Apptainer to enter the standardized "kitchen."
  - Inside, it activates the specific, pre-built Conda environment ("pantry") for that job.
  - It runs the job's script in this doubly-isolated environment.
  - As each job finishes, Snakemake checks the DAG to see which downstream jobs are now unlocked and ready to run, continuing until the final output is generated.
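Rather than typing these flags every run, they can be recorded in a Snakemake profile and picked up with `--profile`. The sketch below is a minimal, hypothetical example that assumes a Snakemake version accepting the `--use-singularity`/`--use-conda` flags shown above; the profiles shipped with this repository may set additional options.

```yaml
# Hypothetical profile, e.g. profiles/default/config.yaml
# Invoke with: snakemake all --profile profiles/default
use-singularity: true   # outer layer: run every job inside the container "kitchen"
use-conda: true         # inner layer: build each rule's isolated Conda "pantry"
```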
 
Helpful Resources (Docs / Guides)
Installing Apptainer/Singularity requires administrator (root) privileges, which can make it difficult to set up on a personal machine.
However, because it is a standard tool in scientific computing, most HPC clusters already have it installed. We recommend checking your HPC's documentation or contacting its support team to confirm before you start.
Choosing an Execution Platform
In addition to the software deployment strategy (using Conda, Singularity, or both), you also need to decide on the hardware or platform where the workflow will run. Snakemake's plugin-based architecture allows it to adapt to a wide variety of execution and storage backends.
Execution Backends
High-Performance Computing (HPC) Clusters
A High-Performance Computing (HPC) cluster is a network of powerful servers designed for complex data analysis. Many researchers use HPC clusters to run bioinformatics workflows because they can process large datasets much faster than a personal computer. The workflow can be adapted to different HPC schedulers by creating custom profiles and using the appropriate executor plugin:
- Slurm: `snakemake-executor-plugin-slurm` (used in this workflow's pre-configured profiles)
- LSF: `snakemake-executor-plugin-lsf`
- PBS/Torque: `snakemake-executor-plugin-pbs`
- SGE: `snakemake-executor-plugin-sge`
- And many others, which you can explore in the Snakemake Executor Plugin Catalog.
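For orientation, an executor is usually selected inside such a profile together with scheduler defaults. The snippet below is only a sketch with placeholder values; refer to this workflow's pre-configured Slurm profile and your cluster's documentation for the real settings.

```yaml
# Hypothetical cluster profile, e.g. profiles/slurm/config.yaml
# Requires the snakemake-executor-plugin-slurm package to be installed.
executor: slurm
jobs: 100                    # cap on jobs submitted to the scheduler at once
use-singularity: true
use-conda: true
default-resources:
  slurm_partition: "batch"   # placeholder partition name
  mem_mb: 8000               # placeholder default memory per job
  runtime: 120               # placeholder default walltime in minutes
```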
Local Execution
For small datasets or testing, you can run the workflow on a local machine (e.g., a personal laptop or workstation). In this case, Snakemake does not require a special executor plugin and will run jobs directly on your computer.
Local execution is suitable for small genomes and limited datasets. For production analyses, an HPC cluster or cloud environment is recommended.
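For a quick local test, the only required setting is a core limit; no `executor:` entry is needed because the default executor runs jobs directly on the current machine. A minimal, hypothetical example:

```yaml
# Hypothetical local profile, e.g. profiles/local/config.yaml
cores: 4          # maximum number of CPU cores Snakemake may use on this machine
use-conda: true   # per-rule environments work the same way without a cluster
```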
Cloud and Container Orchestration
The workflow also supports various cloud and container platforms through dedicated executor plugins:
- Kubernetes: `snakemake-executor-plugin-kubernetes`
- Google Cloud Batch: `snakemake-executor-plugin-googlebatch`
- AWS Batch: `snakemake-executor-plugin-aws-batch`
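Switching to one of these platforms follows the same pattern: install the plugin and name it in the `executor:` field of a profile. Because cloud nodes generally do not share a local filesystem, such a setup is usually paired with one of the remote storage plugins described below. A hypothetical fragment:

```yaml
# Hypothetical cloud profile fragment, e.g. profiles/k8s/config.yaml
# Requires the snakemake-executor-plugin-kubernetes package to be installed.
executor: kubernetes
jobs: 50   # placeholder cap on concurrently submitted jobs
```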
Continuous Integration (CI) with GitHub Actions
This repository comes with a pre-configured GitHub Actions workflow (.github/workflows/main.yml)
that automatically tests the workflow's integrity on every push and pull
request. This ensures that the code is always in a working state.
The included GitHub Actions workflow runs on free, public runners, which have limited computational resources and disk space. It is designed for testing with small datasets only and cannot be used to run a real-world analysis.
For large-scale production runs, you can repurpose the
.github/workflows/main.yml file by configuring it to use self-hosted
runners with sufficient resources. This typically involves changing the
runs-on key in the workflow file to target your self-hosted runner group.
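For instance, the relevant part of the workflow file could be changed roughly as follows; the job name and runner labels are placeholders and must match the labels assigned when you register your self-hosted runner:

```yaml
# Excerpt from .github/workflows/main.yml (job name and labels are hypothetical)
jobs:
  test-workflow:
    # runs-on: ubuntu-latest            # default: free public GitHub-hosted runner
    runs-on: [self-hosted, linux, x64]  # target your own runner group instead
```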
Storage Backends
For workflows that run in the cloud or across different physical locations,
Snakemake can use storage plugins to seamlessly access data from various
remote storage backends. This means your Snakefile can reference remote files
(e.g., in an S3 bucket) as if they were on your local filesystem.
Some of the available storage plugins include:
- Amazon S3: `snakemake-storage-plugin-s3`
- Google Cloud Storage: `snakemake-storage-plugin-gcs`
- Microsoft Azure Blob Storage: `snakemake-storage-plugin-azure`
You can explore all available storage backends in the Snakemake Storage Plugin Catalog.
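As a rough sketch, the simplest way to use such a backend is to install the plugin and point the run at a default remote prefix, either on the command line or in a profile. The bucket and prefix below are placeholders, and credentials are typically supplied through the provider's usual environment variables or plugin settings:

```yaml
# Hypothetical profile fragment for S3-backed storage
# Requires the snakemake-storage-plugin-s3 package to be installed.
default-storage-provider: s3
default-storage-prefix: "s3://my-bucket/my-project"   # placeholder bucket/prefix
```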
Working on a High-Performance Computing (HPC) Cluster
You typically connect to an HPC cluster using SSH (Secure Shell), which provides a secure command-line interface to the remote servers.
HPC Learning Materials
If you are not familiar with high-performance computing, these resources are a great place to start.
- HPC Carpentry Lessons
- Tufts HPC Home Page
- Tufts HPC User Guide
Recommended Tool: Visual Studio Code (VSCode)
We highly recommend using an integrated development environment (IDE) like VSCode to interact with this workflow, especially when working on an HPC cluster.
VSCode is a free, powerful code editor that can connect to remote servers via SSH. This allows you to edit files, run commands, and manage your workflow on the HPC cluster directly from a user-friendly interface on your local machine. It offers several key advantages for scientific workflows:
- Remote Development: VSCode can seamlessly connect to remote servers, allowing you to edit files, run code, and manage projects directly on the HPC without leaving your local environment.
- Integrated Terminal: Access the HPC terminal within VSCode, enabling efficient command-line operations alongside your coding activities.
- Extensions and Customization: Enhance your development experience with extensions tailored for specific programming languages, debugging tools, and workflow optimizations.
Learning Resources for VSCode
Download and Install VSCode
Visit the official Visual Studio Code website to download the installer for your operating system.
Installation Instructions:
Windows
- Run the downloaded `.exe` installer.
- Follow the installation prompts, accepting the license agreement and selecting desired installation options (e.g., adding VSCode to your PATH).
macOS
- Open the downloaded `.dmg` file.
- Drag and drop the VSCode application into the Applications folder.