
Configuring tucca-rna-seq for your Analysis


The tucca-rna-seq workflow uses several configuration files to define your analysis:

  • config/config.yaml - Main configuration file with all analysis parameters
  • config/samples.tsv - Sample metadata and experimental design
  • config/units.tsv - Sequencing unit information (lanes, technical replicates)
  • Execution Profiles (profiles/*/): YAML files that preset execution options for different compute environments.

These files define your experimental design and sequencing data organization.

config/units.tsv - Sequencing Unit Information


This file tracks technical replicates (e.g., sequencer lanes) and their associated data files.

Required Columns:

  • sample_name: Links to samples.tsv
  • unit_name: Technical replicate identifier (e.g., “lane1”, “lane2”)
  • sra: SRA (Sequence Read Archive) accession number (if using SRA data)
  • fq1: Path to forward read FASTQ file
  • fq2: Path to reverse read FASTQ file
config/units.tsv
sample_name unit_name sra fq1 fq2
etoh60_1 lane1 .test/data/snakemake_data/yeast/reads/etoh60_1_1.fq .test/data/snakemake_data/yeast/reads/etoh60_1_2.fq
etoh60_2 lane1 .test/data/snakemake_data/yeast/reads/etoh60_2_1.fq .test/data/snakemake_data/yeast/reads/etoh60_2_2.fq
etoh60_3 lane1 .test/data/snakemake_data/yeast/reads/etoh60_3_1.fq .test/data/snakemake_data/yeast/reads/etoh60_3_2.fq
ref1 lane1 .test/data/snakemake_data/yeast/reads/ref1_1.fq .test/data/snakemake_data/yeast/reads/ref1_2.fq
ref2 lane1 .test/data/snakemake_data/yeast/reads/ref2_1.fq .test/data/snakemake_data/yeast/reads/ref2_2.fq
ref3 lane1 .test/data/snakemake_data/yeast/reads/ref3_1.fq .test/data/snakemake_data/yeast/reads/ref3_2.fq
temp33_1 lane1 .test/data/snakemake_data/yeast/reads/temp33_1_1.fq .test/data/snakemake_data/yeast/reads/temp33_1_2.fq
temp33_2 lane1 .test/data/snakemake_data/yeast/reads/temp33_2_1.fq .test/data/snakemake_data/yeast/reads/temp33_2_2.fq
temp33_3 lane1 .test/data/snakemake_data/yeast/reads/temp33_3_1.fq .test/data/snakemake_data/yeast/reads/temp33_3_2.fq

Or, for SRA data, fill in the sra column and leave fq1 and fq2 empty:

config/units.tsv
sample_name unit_name sra fq1 fq2
M1 lane1 SRR21631081
M2 lane1 SRR21631080
M3 lane1 SRR21631089
M4 lane1 SRR21631088
F1 lane1 SRR21631085
F2 lane1 SRR21631084
F3 lane1 SRR21631083
F4 lane1 SRR21631082
C1 lane1 SRR21631091
C2 lane1 SRR21631090
C3 lane1 SRR21631087
C4 lane1 SRR21631086
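The two layouts above can be checked mechanically. The following is a minimal sketch, not part of the workflow: it assumes the file is tab-separated and that each row supplies either an SRA accession or local FASTQ paths, never both. The helper `classify_units` and the in-memory sample data are hypothetical, and mixing both row types in one file here is only for illustration.

```python
import csv
import io

# Hypothetical in-memory stand-in for config/units.tsv (tab-separated).
UNITS_TSV = (
    "sample_name\tunit_name\tsra\tfq1\tfq2\n"
    "ref1\tlane1\t\t.test/data/snakemake_data/yeast/reads/ref1_1.fq"
    "\t.test/data/snakemake_data/yeast/reads/ref1_2.fq\n"
    "M1\tlane1\tSRR21631081\t\t\n"
)


def classify_units(handle):
    """Map (sample_name, unit_name) to "sra" or "fastq"; reject ambiguous rows."""
    sources = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["sample_name"], row["unit_name"])
        has_sra = bool((row.get("sra") or "").strip())
        has_fastq = bool((row.get("fq1") or "").strip())
        # Assumption: exactly one data source per unit.
        if has_sra == has_fastq:
            raise ValueError(f"{key}: provide either an SRA accession or FASTQ paths")
        sources[key] = "sra" if has_sra else "fastq"
    return sources


unit_sources = classify_units(io.StringIO(UNITS_TSV))
```

To run it against a real file, pass `open("config/units.tsv", newline="")` instead of the `io.StringIO` object.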

The config/samples.tsv file contains biological sample information and the experimental design.

Required Columns:

  • sample_name: Unique identifier (must match units.tsv)
  • Additional columns define your experimental factors
config/samples.tsv
sample_name treatment time replicate_num sequencing_batch
M1 microstructured 10 1 1
M2 microstructured 10 2 1
M3 microstructured 10 3 1
M4 microstructured 10 4 1
F1 flat 10 1 1
F2 flat 10 2 1
F3 flat 10 3 1
F4 flat 10 4 1
C1 control 10 1 1
C2 control 10 2 1
C3 control 10 3 1
C4 control 10 4 1
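Because sample_name must be unique and must match units.tsv, a quick cross-check can catch mismatches before a run. This is a sketch under assumptions: both files are tab-separated, and the helper names (`sample_names`, `cross_check`) and the shortened sample data are hypothetical rather than taken from the workflow.

```python
import csv
import io

# Hypothetical, shortened stand-ins for the two TSV files.
SAMPLES_TSV = "sample_name\ttreatment\nM1\tmicrostructured\nF1\tflat\nC1\tcontrol\n"
UNITS_TSV = (
    "sample_name\tunit_name\tsra\tfq1\tfq2\n"
    "M1\tlane1\tSRR21631081\t\t\n"
    "F1\tlane1\tSRR21631085\t\t\n"
)


def sample_names(handle):
    """Return the sample_name column of a tab-separated file, in order."""
    return [row["sample_name"] for row in csv.DictReader(handle, delimiter="\t")]


def cross_check(samples_handle, units_handle):
    """Report duplicate sample names and names present in only one file."""
    samples = sample_names(samples_handle)
    units = set(sample_names(units_handle))
    return {
        "duplicates": sorted({s for s in samples if samples.count(s) > 1}),
        "samples_without_units": sorted(set(samples) - units),
        "units_without_samples": sorted(units - set(samples)),
    }


report = cross_check(io.StringIO(SAMPLES_TSV), io.StringIO(UNITS_TSV))
```

In this toy data, C1 is listed in samples.tsv but has no sequencing unit, so it appears under `samples_without_units`.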

The main configuration file is organized into logical sections that control different aspects of your analysis. Each section is validated against a schema to ensure proper configuration.

ref_assembly - Reference Genome Configuration


This section defines the reference genome and annotation files for your analysis.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| source | string | One of “RefSeq”, “Ensembl”, or “GENCODE” | Yes |
| accession | string | Genome assembly accession number | Yes |
| name | string | Assembly name/build identifier | Yes |
| release | string | Release version (Ensembl/GENCODE only) | Conditional |
| species | string | Scientific name in “Genus_species” format | Yes |

  • RefSeq assemblies: ^GCF_[0-9]+.[0-9]+$ (e.g., GCF_000001405.39)
  • Ensembl/GENCODE assemblies: ^GCA_[0-9]+.[0-9]+$ (e.g., GCA_000001405.28)
ref_assembly:
  source: "RefSeq"
  accession: "GCF_016699485.2"
  name: "bGalGal1.mat.broiler.GRCg7b"
  release: "" # Not required for RefSeq
  species: "Gallus_gallus"
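The rules in the table and the accession patterns can be sketched as a small check. This is an illustration, not the workflow's schema validation. It assumes the dot in the accession patterns is meant literally (so it is escaped here), and the function name `check_ref_assembly` is hypothetical.

```python
import re

# Assumption: the version separator is a literal dot, so it is escaped.
ACCESSION_PATTERNS = {
    "RefSeq": re.compile(r"^GCF_[0-9]+\.[0-9]+$"),
    "Ensembl": re.compile(r"^GCA_[0-9]+\.[0-9]+$"),
    "GENCODE": re.compile(r"^GCA_[0-9]+\.[0-9]+$"),
}


def check_ref_assembly(ref):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    source = ref.get("source")
    pattern = ACCESSION_PATTERNS.get(source)
    accession = ref.get("accession") or ""
    if pattern is None:
        problems.append(f"unknown source: {source!r}")
    elif not pattern.match(accession):
        problems.append(f"accession {accession!r} does not match the {source} pattern")
    # release is only required for Ensembl and GENCODE assemblies.
    if source in ("Ensembl", "GENCODE") and not ref.get("release"):
        problems.append(f"release is required for {source}")
    return problems


example = {
    "source": "RefSeq",
    "accession": "GCF_016699485.2",
    "name": "bGalGal1.mat.broiler.GRCg7b",
    "release": "",
    "species": "Gallus_gallus",
}
problems = check_ref_assembly(example)
```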

Configure API keys for external database access to avoid rate limiting.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| ncbi | string | NCBI API key for genome and SRA downloads | No (recommended) |

diffexp - Differential Expression Analysis


This section configures the differential gene expression analysis using DESeq2.

tximeta - Transcript Quantification Import


Configure how transcript quantification data is imported and processed.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| factors | array | Experimental factors for analysis | Yes |
| extra | string | Additional parameters for tximeta | No |

Each factor should have:

  • name: Factor name (e.g., “treatment”, “time”)
  • reference_level: Baseline level for comparisons

Configure multiple DESeq2 analyses with different experimental designs.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| analyses | array | List of analysis configurations | Yes |
| transform | object | Data transformation settings | Yes |

Each analysis includes:

  • name: Unique identifier for the analysis
  • deseqdataset: DESeqDataSet creation parameters
  • wald: Wald test configuration
  • contrasts: Statistical comparisons to perform
diffexp:
  tximeta:
    factors:
      - name: "treatment"
        reference_level: "control"
      - name: "time"
        reference_level: "0h"
    extra: ""
  deseq2:
    analyses:
      - name: "treatment_analysis"
        deseqdataset:
          formula: "~ treatment + time"
          min_counts: 10
          extra: ""
          threads: 4
        wald:
          deseq_extra: ""
          shrink_extra: "type = 'apeglm'"
          results_extra: "alpha = 0.05"
          threads: 4
        contrasts:
          - name: "treatment_vs_control"
            elements: ["treatment", "treated", "control"]
          - name: "time_6h_vs_0h"
            elements: ["time", "6h", "0h"]
    transform:
      method: "rlog"
      extra: ""
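A contrast only makes sense if its factor appears in the design formula. The sketch below checks that relationship on a plain Python dictionary mirroring one analysis entry. It assumes each contrast's elements follow the DESeq2 convention of factor, numerator level, denominator level; the helper names are hypothetical, and the simple regular expression only handles plain variable names in the formula.

```python
import re

# One analysis entry, written as a Python dict for illustration.
analysis = {
    "name": "treatment_analysis",
    "deseqdataset": {"formula": "~ treatment + time"},
    "contrasts": [
        {"name": "treatment_vs_control", "elements": ["treatment", "treated", "control"]},
        {"name": "time_6h_vs_0h", "elements": ["time", "6h", "0h"]},
    ],
}


def formula_terms(formula):
    """Extract variable names from a design formula such as "~ a + b + a:b"."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", formula))


def check_contrasts(entry):
    """Return a list of problems found in the contrasts of one analysis."""
    terms = formula_terms(entry["deseqdataset"]["formula"])
    problems = []
    for contrast in entry["contrasts"]:
        elements = contrast["elements"]
        if len(elements) != 3:
            problems.append(f"{contrast['name']}: expected 3 elements, got {len(elements)}")
            continue
        factor, numerator, denominator = elements
        if factor not in terms:
            problems.append(f"{contrast['name']}: factor {factor!r} is not in the formula")
        if numerator == denominator:
            problems.append(f"{contrast['name']}: numerator and denominator are identical")
    return problems


contrast_problems = check_contrasts(analysis)
```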

enrichment - Functional Enrichment Analysis


Configure comprehensive functional enrichment analysis including GO, KEGG, MSigDB, SPIA, and Harmonizome databases.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| padj_cutoff | number | Adjusted p-value cutoff for ORA | Yes |
| targets | array | Target genes for pathway analysis | No |

clusterprofiler - Standard Enrichment Analysis


Configure GO and KEGG enrichment analysis.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| gsea | object | Gene Set Enrichment Analysis settings | Yes |
| ora | object | Over-Representation Analysis settings | Yes |
| wikipathways | object | WikiPathways analysis settings | Yes |
| kegg_module | object | KEGG module analysis settings | Yes |

Configure MSigDB gene set analysis with customizable collections.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| enabled | boolean | Enable MSigDB analysis | Yes |
| collections | array | MSigDB collections to analyze | Yes |
| custom_gmt_files | array | Custom GMT files for analysis | No |

Available MSigDB Collections:

  • H: Hallmark gene sets (50 gene sets)
  • C1: Positional gene sets (299 gene sets)
  • C2: Curated gene sets (5,922 gene sets)
  • C3: Regulatory target gene sets (3,738 gene sets)
  • C4: Computational gene sets (858 gene sets)
  • C5: Ontology gene sets (10,922 gene sets)
  • C6: Oncogenic signatures (189 gene sets)
  • C7: Immunologic signatures (4,872 gene sets)
  • C8: Cell type signature gene sets (746 gene sets)

Configure topology-based pathway analysis using KEGG pathway information.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| enabled | boolean | Enable SPIA analysis | Yes |
| extra | string | Additional SPIA parameters | No |

Configure tissue-specific gene set analysis using the Harmonizome database.

| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| enabled | boolean | Enable Harmonizome analysis | Yes |
| datasets | array | Harmonizome datasets to analyze | Yes |

Available Datasets:

  • GTEx Tissue Gene Expression Profiles
  • BioGPS Human Cell Type and Tissue Gene Expression Profiles
  • Human Protein Atlas
  • And many more at Harmonizome
enrichment:
  padj_cutoff: 0.05
  targets: ["ACTB", "GAPDH", "PAX3", "PAX7", "MYOD1"]
  clusterprofiler:
    gsea:
      gseGO:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
      gseKEGG:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    ora:
      enrichGO:
        extra: "pvalueCutoff = 0.05"
      enrichKEGG:
        extra: "pvalueCutoff = 0.05"
    wikipathways:
      enabled: true
      enrichWP:
        extra: "pvalueCutoff = 0.05"
      gseWP:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    kegg_module:
      enabled: true
      enrichMKEGG:
        extra: "pvalueCutoff = 0.05"
      gseMKEGG:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  msigdb:
    enabled: true
    collections: ["H", "C2"] # Hallmark and Curated gene sets
    custom_gmt_files: [] # Add custom GMT files here
    ora:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    gsea:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  spia:
    enabled: true
    extra: "beta = NULL, verbose = TRUE, plots = FALSE"
  harmonizome:
    enabled: false
    datasets:
      - name: "GTEx Tissue Gene Expression Profiles"
        gene_sets: ["Muscle", "Adipose"]
      - name: "BioGPS Human Cell Type and Tissue Gene Expression Profiles"
        gene_sets: ["Adipocyte", "Myocyte"]
    ora:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    gsea:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  annotationforge:
    version: "0.1.0"
    author: "tucca-rna-seq <benjamin.bromberg@tufts.edu>"
    extra: "useSynonyms = TRUE"

Configure parameters for individual analysis tools in the workflow.

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| memory | integer | Memory allocation in MB | 1024 |
| extra | string | Additional FastQC parameters | "" |

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| sjdbOverhang | integer | Splice junction database overhang | 149 |
| extra | string | Additional STAR index parameters | "" |

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| outSAMtype | string | Output SAM format | "BAM SortedByCoordinate" |
| extra | string | Additional STAR parameters | "" |

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| enabled | boolean | Enable Qualimap analysis | true |
| counting_alg | string | Counting algorithm | "proportional" |
| sequencing_protocol | string | Sequencing protocol type | "non-strand-specific" |
| extra | string | Additional Qualimap parameters | "" |

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| extra | string | Additional Salmon index parameters | "-k 31" |

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| libtype | string | Library type detection | "A" (auto-detect) |
| extra | string | Additional Salmon parameters | "" |

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| extra | string | Additional MultiQC parameters | "--verbose --force" |

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| vdb_config_ra_path | string | SRA tools configuration path | "/repository/user/main/remote_access=true" |
| subsample | object | SRA subsampling configuration | See below |

SRA Subsampling Configuration:

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| enabled | boolean | Enable SRA subsampling | false |
| min_spot_id | integer | Minimum spot ID for subsampling | 1 |
| max_spot_id | integer | Maximum spot ID for subsampling | 100000 |

params:
  fastqc:
    memory: 1024
    extra: ""
  star_index:
    sjdbOverhang: 149
    extra: "--genomeSAindexNbases 10" # For small genomes
  star:
    extra: >-
      --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
      --outSAMattributes Standard --outFilterMultimapNmax 1
      --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0
      --alignIntronMin 1 --alignIntronMax 2500
  qualimap_rnaseq:
    enabled: true
    counting_alg: "proportional"
    sequencing_protocol: "non-strand-specific"
    extra: "--paired --java-mem-size=8G"
  salmon_index:
    extra: "-k 31"
  salmon_quant:
    libtype: "A"
    extra: "--seqBias --posBias --writeUnmappedNames"
  multiqc:
    extra: "--verbose --force"
  sra_tools:
    vdb_config_ra_path: "/repository/user/main/remote_access=true"
    subsample:
      enabled: false
      min_spot_id: 1
      max_spot_id: 100000
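Two of the STAR index values above depend on your data rather than on taste. The STAR manual recommends setting sjdbOverhang to the maximum read length minus one, and scaling --genomeSAindexNbases down to min(14, log2(GenomeLength)/2 - 1) for small genomes. The sketch below computes both. The function names are hypothetical, rounding down to an integer is a choice made here, and the genome sizes in the usage lines are approximate figures used only for illustration.

```python
import math


def sjdb_overhang(max_read_length):
    """STAR manual guidance: maximum read length minus one."""
    return max_read_length - 1


def genome_sa_index_nbases(genome_length):
    """STAR manual guidance for small genomes: min(14, log2(GenomeLength)/2 - 1).

    The result is rounded down because the option takes an integer.
    """
    return min(14, int(math.log2(genome_length) / 2 - 1))


# 150 bp reads give the 149 used as the default in this section.
overhang = sjdb_overhang(150)

# Approximate genome sizes, for illustration only.
yeast_nbases = genome_sa_index_nbases(12_100_000)
human_nbases = genome_sa_index_nbases(3_100_000_000)
```

With a genome of roughly 12 Mb the formula yields 10, which is consistent with the `--genomeSAindexNbases 10` value shown in the example configuration.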

Profiles save you from typing long, complex commands for every analysis: they preset values for Snakemake CLI flags and for plugins from the Snakemake Plugin Catalog.

A profile is a directory containing a YAML configuration file, typically named config.v8+.yaml. Inside this file, you can set default values for any of Snakemake’s command-line arguments. The syntax is straightforward: a command-line option like --some-option becomes a key in the YAML file as some-option:.

This allows you to configure everything from the execution backend (e.g., a SLURM cluster) to software deployment (e.g., Conda and Singularity) and resource allocation for specific rules.
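To make the key-to-flag correspondence concrete, the sketch below renders a small profile dictionary as the command-line arguments it stands in for. It is only an illustration of the naming rule, not how Snakemake parses profiles internally. The function `profile_to_cli_args` is hypothetical, and nested settings such as set-resources are deliberately left out.

```python
def profile_to_cli_args(profile):
    """Render flat profile keys as the CLI flags they replace (illustration only)."""
    args = []
    for key, value in profile.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            # Boolean options are bare flags; false means the flag is omitted.
            if value:
                args.append(flag)
        elif isinstance(value, dict):
            # One-level mappings such as default-resources become name=value pairs.
            args.append(flag)
            args.extend(f"{name}={setting}" for name, setting in value.items())
        else:
            args.extend([flag, str(value)])
    return args


cli_args = profile_to_cli_args(
    {
        "executor": "slurm",
        "use-conda": True,
        "jobs": 100,
        "default-resources": {"mem_mb": 32000, "runtime": 4320},
    }
)
```

Joining `cli_args` with spaces gives the long command that the profile lets you avoid typing.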

Since profiles work by setting default values for command-line arguments, it is helpful to know all the available options. The official documentation for the snakemake executable, which is the primary way to run, debug, and visualize workflows, provides a comprehensive list of all possible flags and options that can be set in a profile.

View All Snakemake CLI Options
Official Snakemake Docs for Profiles

The Snakemake Plugin Catalog is a centralized resource that collects information and documentation for all official Snakemake plugins. These plugins extend Snakemake’s core functionality, allowing it to interface with different execution backends (like HPC schedulers), storage systems (like cloud buckets), and more. This plugin-based architecture allows these components to be developed modularly and independently.

Explore the Snakemake Plugin Catalog

This workflow includes several pre-configured profiles that serve as excellent examples of how to set up different execution environments.

The slurm profile is for running the workflow on a SLURM-managed HPC cluster, such as the Tufts HPC. It configures the snakemake-executor-plugin-slurm, enables Singularity and Conda for reproducibility, and sets default resources for jobs.

profiles/slurm/config.v8+.yaml
# profiles/slurm/config.v8+.yaml
# Configured for exclusive use with snakemake-executor-plugin-slurm
# Profile Settings
# To learn more see:
# https://snakemake.readthedocs.io/en/stable/executing/cli.html
__use_yte__: true
executor: slurm
use-singularity: true
use-conda: true
conda-cleanup-pkgs: tarballs
verbose: true
show-failed-logs: true
retries: 3
rerun-incomplete: true
jobs: 100 # Slurm jobscript size limit
latency-wait: 120
default-resources:
  slurm_partition: "batch"
  slurm_account: "default"
  runtime: 4320
  mem_mb: 32000
  cpus_per_task: 12
  # Change the following email address to your own email address if you would
  # like this workflow to notify you via email of started/failed/completed
  # slurm jobs
  slurm_extra: '"--mail-type=ALL --mail-user=firstname.lastname@tufts.edu"'
set-resources:
  star_index:
    mem_mb: 64000

Snakemake profiles support the YTE templating engine (__use_yte__: true), which allows you to dynamically set profile values based on environment variables. This is useful for creating flexible profiles that can adapt to different users or systems.


The workflow automatically validates your configuration using JSON schemas:

  1. Schema Validation: Configuration files are checked against schemas in workflow/schemas/
  2. Cross-Reference Validation: Ensures consistency between files
  3. Runtime Validation: Catches configuration errors before analysis begins
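As an illustration of what a cross-reference check between files can look like, the sketch below verifies that every variable in a design formula exists as a column in samples.tsv. It is not the workflow's own validation, which is schema-based. The function name `missing_design_columns` is hypothetical, the file is assumed to be tab-separated, and the regular expression only handles plain variable names (including interactions such as `a:b`), not function calls inside the formula.

```python
import csv
import io
import re

# Hypothetical one-row stand-in for config/samples.tsv.
SAMPLES_TSV = (
    "sample_name\ttreatment\ttime\treplicate_num\tsequencing_batch\n"
    "M1\tmicrostructured\t10\t1\t1\n"
)


def missing_design_columns(formula, samples_handle):
    """Return formula variables that are not column names in the samples table."""
    header = next(csv.reader(samples_handle, delimiter="\t"))
    terms = re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", formula)
    return [term for term in terms if term not in header]


ok = missing_design_columns("~ treatment + time", io.StringIO(SAMPLES_TSV))
missing = missing_design_columns("~ treatment + genotype", io.StringIO(SAMPLES_TSV))
```

Here the second formula refers to a genotype column that the sample sheet does not define, so it is reported as missing.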

Now that you have prepared your configuration files, you are ready to execute the workflow. For detailed instructions on how to perform a dry-run, validate your setup, and launch the analysis on your specific computing environment, please proceed to the next guide.

Linked external resources are independent of TUCCA and Tufts University and remain under their own licenses.