Running the tucca-rna-seq Workflow
This guide covers how to execute the tucca-rna-seq workflow using different
execution environments and profiles.
Quick Start
Once you’ve configured your analysis, follow these steps to run the workflow:
# 1. Validate your configuration
snakemake --lint --workflow-profile profiles/slurm
# 2. Create all required Conda environments
snakemake all --conda-create-envs-only --workflow-profile profiles/slurm
# 3. Test the workflow (dry-run)
snakemake all -np --workflow-profile profiles/slurm
# 4. Execute the workflow
snakemake all --workflow-profile profiles/slurm
Using Execution Profiles
Snakemake profiles are a powerful feature that allows you to pre-configure the command-line options for a specific computing environment, such as a local workstation or an HPC cluster. This saves you from typing long, complex commands for every analysis.
To use a profile, you simply activate it with the --workflow-profile flag:
# Use the Slurm profile
snakemake all --workflow-profile profiles/slurm
# Use the development profile
snakemake all --workflow-profile profiles/slurm-dev
For a comprehensive guide on what profiles are, which ones are available in this workflow, and how to customize them for your specific needs, please see the Configuration Guide.
Workflow Execution Steps
1. Configuration Validation
Before running, validate your configuration:
# Check for configuration errors
snakemake --lint --workflow-profile profiles/slurm
This step:
- Validates your config.yaml against the JSON schema
- Checks consistency between samples.tsv and units.tsv
- Identifies potential issues before execution
2. Create Software Environments
Before performing a dry-run, it’s essential to create the necessary software environments. This step prevents errors during the dry-run, which needs to inspect tools inside containers that may not have been downloaded yet.
# Create all Conda environments and pull container images
snakemake all --conda-create-envs-only --workflow-profile profiles/slurm
This command will:
- Download the main Singularity/Apptainer container image.
- Create all the isolated Conda environments required by the workflow’s rules.
- Not run any computational jobs.
3. Dry Run
Test the workflow without executing jobs:
# Generate execution plan
snakemake all -np --workflow-profile profiles/slurm
This step:
- Shows which jobs will be executed
- Displays the dependency graph
- Estimates resource requirements
- Identifies any missing inputs or configuration issues
4. Execution
Run the complete workflow:
# Execute all jobs
snakemake all --workflow-profile profiles/slurm
The workflow will:
- Use the pre-built software environments for each tool
- Download reference genomes and annotations
- Process your sequencing data through the pipeline
- Generate comprehensive analysis results
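The four steps above can be chained so that a failure in any one of them stops the run. The following is a minimal sketch rather than part of the workflow itself: the function name run_all_steps and the SNAKEMAKE and PROFILE variables are hypothetical, introduced only so the profile can be swapped without editing each command.

```shell
# Hypothetical helper: SNAKEMAKE, PROFILE and run_all_steps are illustrative
# names, not something the workflow ships.
SNAKEMAKE="${SNAKEMAKE:-snakemake}"
PROFILE="${PROFILE:-profiles/slurm}"

run_all_steps() {
    # Each step runs only if the previous one exited with status 0.
    "$SNAKEMAKE" --lint --workflow-profile "$PROFILE" &&
    "$SNAKEMAKE" all --conda-create-envs-only --workflow-profile "$PROFILE" &&
    "$SNAKEMAKE" all -np --workflow-profile "$PROFILE" &&
    "$SNAKEMAKE" all --workflow-profile "$PROFILE"
}
```

Calling run_all_steps from the project root issues the same four commands in order. Setting PROFILE=profiles/slurm-dev beforehand would target the development profile instead.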
Monitoring and Debugging
Job Monitoring
When running Snakemake interactively, the main log output will stream directly to your terminal.
For workflows running on an HPC cluster, you can use the scheduler’s native commands in a separate terminal to check the status of submitted jobs. The command below is an example for a SLURM-managed cluster.
# Check job status (example for a SLURM cluster)
squeue -u $USER
Understanding Log Files
When troubleshooting, it’s important to know where to look for information. The workflow generates logs in three primary locations, each serving a different purpose:
- Main Snakemake Log (.snakemake/log/): This directory contains the main log file from the Snakemake process itself. It’s useful for debugging high-level workflow errors related to DAG construction, configuration, or job submission.
- Cluster Executor Logs (e.g., .snakemake/slurm_logs/): When running on a cluster, the executor plugin (like the one for SLURM) generates its own logs for each submitted job. These files capture the raw stdout and stderr from the cluster’s perspective and are invaluable for debugging job submission issues or resource-related failures.
- Rule-Specific Logs (logs/): The workflow is designed to capture the stdout and stderr from each specific rule into this directory. These are the most important logs for debugging tool-specific errors, such as a problem with a bioinformatics tool’s parameters or input files.
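As a starting point for inspecting these locations, the helpers below print the most recently modified file in a log directory and list rule logs that mention an error. This is a sketch under assumptions: the function names latest_log and logs_with_errors are hypothetical, the default paths are the ones listed above, and the case-insensitive search for "error" is only a heuristic, since tools word their failures differently.

```shell
# Hypothetical helpers for the log locations described above.

# Print the name of the most recently modified file in a log directory
# (defaults to the main Snakemake log directory).
latest_log() {
    ls -t "${1:-.snakemake/log}" 2>/dev/null | head -n 1
}

# List files under a directory (default: logs/) that contain "error",
# matched case-insensitively. A heuristic, not a complete failure detector.
logs_with_errors() {
    grep -rli "error" "${1:-logs}" 2>/dev/null
}
```

For cluster runs, the same latest_log helper can be pointed at the executor log directory, for example latest_log .snakemake/slurm_logs.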
Failed Jobs
When jobs fail, Snakemake provides commands to help you investigate and recover. While these can be run manually, it is often more convenient to set them as defaults in your execution profile.
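As an illustration of such commands, the sketch below uses three standard Snakemake command-line options: --unlock (clear a stale lock left by an interrupted run), --rerun-incomplete (redo jobs whose outputs were left incomplete), and --keep-going (continue with independent jobs after one fails). The function name recover_run and the SNAKEMAKE and PROFILE variables are hypothetical, and whether these are the options this guide originally had in mind is an assumption.

```shell
# Hypothetical recovery helper; names are illustrative only.
SNAKEMAKE="${SNAKEMAKE:-snakemake}"
PROFILE="${PROFILE:-profiles/slurm}"

recover_run() {
    # Remove a stale lock left behind by an interrupted run.
    "$SNAKEMAKE" all --unlock --workflow-profile "$PROFILE" || return 1
    # Resume: redo incomplete jobs and keep going past individual failures.
    "$SNAKEMAKE" all --rerun-incomplete --keep-going --workflow-profile "$PROFILE"
}
```

Because profile keys mirror the long option names, the equivalent profile defaults would be rerun-incomplete: true and keep-going: true.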
Common Issues
| Issue | Solution |
|-------|----------|
| Missing dependencies | Check conda environment creation |
| Resource limits | Adjust memory/CPU allocation in profile |
| File permissions | Ensure write access to output directories |
| Network issues | Check internet connectivity for downloads |
Resource Management
Memory and CPU Allocation
Configure resource allocation in your profile:
default-resources:
  mem_mb: 32000       # 32GB RAM per job
  cpus_per_task: 12   # 12 CPU cores per job
  runtime: 4320       # 72 hours (3 days); runtime is given in minutes
# Override for specific rules
set-resources:
  star_index:
    mem_mb: 64000     # 64GB RAM for genome indexing
Software Environment Caching
Creating the workflow’s software environments with Conda and downloading the
Singularity/Apptainer container image can be time-consuming, but this initial
setup only needs to be done once. By default, Snakemake caches these
environments in a hidden .snakemake directory within each project folder.
To avoid rebuilding these environments for every new analysis, you can create a centralized cache that all your projects can share. This has two major benefits:
- Saves Time: Subsequent workflow runs will start much faster by reusing the pre-existing environments.
- Saves Space: It prevents duplicating many gigabytes of software, which is especially important on HPC clusters with home/lab directory storage quotas.
To set up a central cache, use the --conda-prefix and --singularity-prefix
flags to point to a shared location, such as a project or scratch directory.
# Example of redirecting caches to a shared location
snakemake all \
  --workflow-profile profiles/slurm \
  --conda-prefix /path/to/shared/conda_envs \
  --singularity-prefix /path/to/shared/singularity_images
Storage Requirements
Runtime Estimation
Advanced Execution Options
Parallel Execution
Control job parallelism:
# Limit concurrent jobs
snakemake all --workflow-profile profiles/slurm --jobs 50
# Use all available cores locally
snakemake all --use-conda --cores all
Selective Execution
Run specific parts of the workflow:
# Run only quality control
snakemake fastqc --workflow-profile profiles/slurm
# Run only differential expression
snakemake deseq2_wald_per_analysis --workflow-profile profiles/slurm
# Run only enrichment analysis
snakemake all_enrichment_analyses --workflow-profile profiles/slurm
Resume and Restart
Handle interrupted workflows:
# Resume from where you left off
snakemake all --workflow-profile profiles/slurm
# Force re-execution of every rule, even when outputs already exist
snakemake all --workflow-profile profiles/slurm --forceall
# Remove all files generated by the workflow (write-protected files are not removed)
snakemake all --delete-all-output --workflow-profile profiles/slurm
Output and Results
Main Outputs
The workflow generates comprehensive results:
- Quality Control: FastQC reports, Qualimap analysis
- Alignment: STAR BAM files, alignment statistics
- Quantification: Salmon transcript counts
- Differential Expression: DESeq2 results, normalized counts
- Enrichment Analysis: GO, KEGG, MSigDB, SPIA results
- Reports: MultiQC summary, FastQC, Qualimap HTML reports
Output Organization
Results are organized in a logical structure:
results/
├── fastqc/       # Quality control reports
├── qualimap/     # RNA-seq quality metrics
├── salmon/       # Transcript quantification
└── multiqc/      # Summary reports
resources/
├── star/         # STAR alignment results (BAM files)
├── deseq2/       # Differential expression results
├── enrichment/   # Functional enrichment analysis
└── tximeta/      # Transcript quantification metadata
Post-Analysis Interactive Tools
After workflow completion, you can use the provided RMarkdown notebooks for additional analysis:
- analysis/pcaExplorer_playground.Rmd: Principal component analysis
- analysis/GeneTonic_playground.Rmd: Enrichment analysis visualization
Note: These are separate analysis tools, not part of the Snakemake workflow execution.
Troubleshooting
Profile Issues
If profiles don’t work as expected:
- Check Snakemake version: Ensure you’re using v8.27.1 or later
- Verify profile syntax: Check YAML formatting and indentation
- Install executor plugins: Ensure required plugins are installed
- Check permissions: Verify file and directory permissions
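The version check in the list above can be scripted. The sketch below is one way to do it, under assumptions: version_at_least is a hypothetical helper, it relies on sort -V (version sort, as provided by GNU coreutils), and the minimum version 8.27.1 is the one stated in the list.

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# Relies on `sort -V`, which orders version strings component by component.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_at_least "$(snakemake --version)" 8.27.1 || echo "Snakemake is older than 8.27.1" prints a warning only when the installed version is too old.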
Execution Problems
Common execution issues and solutions:
| Problem | Cause | Solution |
|---------|-------|----------|
| Jobs stuck in queue | Resource limits too high | Reduce memory/CPU requirements |
| “Lost” SLURM jobs | sacct accounting issues. You may see messages like: "status_of_jobs after sacct is: {}", "active_jobs_ids_with_current_sacct_status are: {}", "active_jobs_seen_by_sacct are: {}", "missing_sacct_status are: set()" | Stop and restart the workflow; Snakemake will re-evaluate job status and resume. |
| Environment creation fails | Conda/Mamba issues | Check conda installation and channels |
| Container errors | Singularity/Apptainer issues | Verify container runtime installation |
| File system errors | Storage or permission issues | Check disk space and file permissions |
Getting Help
If you encounter issues:
- Check the logs: Review Snakemake and job-specific logs
- Validate configuration: Run snakemake --lint
- Review documentation: Check this guide and workflow documentation
- Open an issue: Report problems on the GitHub repository
Next Steps
After successfully running your workflow:
- Review Results: Examine quality control reports and analysis outputs
- Post-Analysis Tools: Use the provided RMarkdown notebooks
- Custom Analysis: Extend the workflow with your own scripts
- Share Results: Export data files and create custom visualizations
For more advanced usage and customization, see the Advanced Configuration guide.
Linked external resources are independent of TUCCA and Tufts University and remain under their own licenses.