Configuring tucca-rna-seq for your Analysis
Overview of Configuration Files
The tucca-rna-seq workflow uses several configuration files to define your
analysis:
- config/config.yaml - Main configuration file with all analysis parameters
- config/samples.tsv - Sample metadata and experimental design
- config/units.tsv - Sequencing unit information (lanes, technical replicates)
- Execution Profiles (profiles/*/): YAML files that preset execution options for different compute environments.
Understanding samples.tsv and units.tsv
These files define your experimental design and sequencing data organization.
config/units.tsv - Sequencing Unit Information
This file tracks technical replicates (e.g., sequencer lanes) and their associated data files.
Required Columns:
- sample_name: Links to samples.tsv
- unit_name: Technical replicate identifier (e.g., “lane1”, “lane2”)
- sra: SRA (Sequence Read Archive) accession number (if using SRA data)
- fq1: Path to forward read FASTQ file
- fq2: Path to reverse read FASTQ file
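As a rough illustration of how these columns fit together, the sketch below reads a units table with Python's standard csv module and reports rows that provide neither an SRA accession nor a forward-read path. The rule that every unit needs one or the other is an assumption inferred from the two examples in this guide, not a statement of the workflow's actual schema. The function name check_units and the shortened paths are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_name", "unit_name", "sra", "fq1", "fq2"]


def check_units(tsv_text):
    """Return a list of problems found in a units.tsv-style table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for row in reader:
        unit = f"{row['sample_name']}/{row['unit_name']}"
        # Short rows yield None for absent fields, so normalise to "".
        has_sra = bool((row["sra"] or "").strip())
        has_fastq = bool((row["fq1"] or "").strip())
        # Assumption: a unit is described by an SRA accession or by FASTQ paths.
        if not has_sra and not has_fastq:
            problems.append(f"{unit}: neither sra nor fq1 is set")
    return problems


# Hypothetical table: one SRA unit, one local unit, one empty unit.
example = (
    "sample_name\tunit_name\tsra\tfq1\tfq2\n"
    "M1\tlane1\tSRR21631081\t\t\n"
    "ref1\tlane1\t\treads/ref1_1.fq\treads/ref1_2.fq\n"
    "bad\tlane1\t\t\t\n"
)
print(check_units(example))
```

A check like this could run before launching the workflow to catch empty rows early; the workflow's own schema validation remains the authority on what is accepted.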
Example units.tsv:
| sample_name | unit_name | sra | fq1 | fq2 |
|-------------|-----------|-----|-----|-----|
| etoh60_1 | lane1 | | .test/data/snakemake_data/yeast/reads/etoh60_1_1.fq | .test/data/snakemake_data/yeast/reads/etoh60_1_2.fq |
| etoh60_2 | lane1 | | .test/data/snakemake_data/yeast/reads/etoh60_2_1.fq | .test/data/snakemake_data/yeast/reads/etoh60_2_2.fq |
| etoh60_3 | lane1 | | .test/data/snakemake_data/yeast/reads/etoh60_3_1.fq | .test/data/snakemake_data/yeast/reads/etoh60_3_2.fq |
| ref1 | lane1 | | .test/data/snakemake_data/yeast/reads/ref1_1.fq | .test/data/snakemake_data/yeast/reads/ref1_2.fq |
| ref2 | lane1 | | .test/data/snakemake_data/yeast/reads/ref2_1.fq | .test/data/snakemake_data/yeast/reads/ref2_2.fq |
| ref3 | lane1 | | .test/data/snakemake_data/yeast/reads/ref3_1.fq | .test/data/snakemake_data/yeast/reads/ref3_2.fq |
| temp33_1 | lane1 | | .test/data/snakemake_data/yeast/reads/temp33_1_1.fq | .test/data/snakemake_data/yeast/reads/temp33_1_2.fq |
| temp33_2 | lane1 | | .test/data/snakemake_data/yeast/reads/temp33_2_1.fq | .test/data/snakemake_data/yeast/reads/temp33_2_2.fq |
| temp33_3 | lane1 | | .test/data/snakemake_data/yeast/reads/temp33_3_1.fq | .test/data/snakemake_data/yeast/reads/temp33_3_2.fq |

OR for SRA files (fq1 and fq2 are left empty):

| sample_name | unit_name | sra | fq1 | fq2 |
|-------------|-----------|-----|-----|-----|
| M1 | lane1 | SRR21631081 | | |
| M2 | lane1 | SRR21631080 | | |
| M3 | lane1 | SRR21631089 | | |
| M4 | lane1 | SRR21631088 | | |
| F1 | lane1 | SRR21631085 | | |
| F2 | lane1 | SRR21631084 | | |
| F3 | lane1 | SRR21631083 | | |
| F4 | lane1 | SRR21631082 | | |
| C1 | lane1 | SRR21631091 | | |
| C2 | lane1 | SRR21631090 | | |
| C3 | lane1 | SRR21631087 | | |
| C4 | lane1 | SRR21631086 | | |

Configuring config/samples.tsv
This file contains biological sample information and experimental design.
Required Columns:
- sample_name: Unique identifier (must match units.tsv)
- Additional columns define your experimental factors
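Because sample_name has to agree between the two files, a small cross-check can help before a run. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the workflow's own cross-reference validation may apply different or stricter rules.

```python
import csv
import io


def sample_names(tsv_text):
    """Collect the sample_name column of a tab-separated table as a set."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["sample_name"] for row in reader}


def cross_check(samples_text, units_text):
    """Return (samples without units, units without a samples entry), sorted."""
    samples = sample_names(samples_text)
    units = sample_names(units_text)
    return sorted(samples - units), sorted(units - samples)


# Deliberately inconsistent toy tables: F1 has no unit, C1 has no sample row.
samples_tsv = "sample_name\ttreatment\nM1\tmicrostructured\nF1\tflat\n"
units_tsv = (
    "sample_name\tunit_name\tsra\tfq1\tfq2\n"
    "M1\tlane1\tSRR21631081\t\t\n"
    "C1\tlane1\tSRR21631091\t\t\n"
)
print(cross_check(samples_tsv, units_tsv))
```

Both returned lists are empty when the two files agree.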
Example samples.tsv:
| sample_name | treatment | time | replicate_num | sequencing_batch |
|-------------|-----------|------|---------------|------------------|
| M1 | microstructured | 10 | 1 | 1 |
| M2 | microstructured | 10 | 2 | 1 |
| M3 | microstructured | 10 | 3 | 1 |
| M4 | microstructured | 10 | 4 | 1 |
| F1 | flat | 10 | 1 | 1 |
| F2 | flat | 10 | 2 | 1 |
| F3 | flat | 10 | 3 | 1 |
| F4 | flat | 10 | 4 | 1 |
| C1 | control | 10 | 1 | 1 |
| C2 | control | 10 | 2 | 1 |
| C3 | control | 10 | 3 | 1 |
| C4 | control | 10 | 4 | 1 |

Configuring config/config.yaml
The main configuration file is organized into logical sections that control different aspects of your analysis. Each section is validated against a schema to ensure proper configuration.
ref_assembly - Reference Genome Configuration
This section defines the reference genome and annotation files for your analysis.
Parameters:
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| source | string | One of “RefSeq”, “Ensembl”, or “GENCODE” | Yes |
| accession | string | Genome assembly accession number | Yes |
| name | string | Assembly name/build identifier | Yes |
| release | string | Release version (Ensembl/GENCODE only) | Conditional |
| species | string | Scientific name in “Genus_species” format | Yes |
Accession Number Patterns:
- RefSeq assemblies: ^GCF_[0-9]+.[0-9]+$ (e.g., GCF_000001405.39)
- Ensembl/GENCODE assemblies: ^GCA_[0-9]+.[0-9]+$ (e.g., GCA_000001405.28)
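To make the rules in this section concrete, here is a minimal sketch that checks a ref_assembly mapping against the table and the accession patterns. The patterns are copied verbatim from the list above; the function name is hypothetical, and the species regular expression is an assumption about what “Genus_species” permits. The workflow's schema is the real reference.

```python
import re

# Accession patterns copied verbatim from this guide.
ACCESSION_PATTERNS = {
    "RefSeq": r"^GCF_[0-9]+.[0-9]+$",
    "Ensembl": r"^GCA_[0-9]+.[0-9]+$",
    "GENCODE": r"^GCA_[0-9]+.[0-9]+$",
}


def check_ref_assembly(cfg):
    """Return a list of problems in a ref_assembly mapping (empty if none)."""
    source = cfg.get("source")
    if source not in ACCESSION_PATTERNS:
        return [f"unknown source: {source!r}"]
    problems = []
    if not re.match(ACCESSION_PATTERNS[source], cfg.get("accession", "")):
        problems.append(f"accession does not match the {source} pattern")
    # The table marks release as conditional: Ensembl/GENCODE only.
    if source != "RefSeq" and not cfg.get("release"):
        problems.append(f"release is required for {source}")
    # Assumption: "Genus_species" means two words joined by one underscore.
    if not re.match(r"^[A-Z][a-z]+_[a-z]+$", cfg.get("species", "")):
        problems.append("species is not in Genus_species format")
    return problems


refseq = {
    "source": "RefSeq",
    "accession": "GCF_016699485.2",
    "name": "bGalGal1.mat.broiler.GRCg7b",
    "release": "",
    "species": "Gallus_gallus",
}
print(check_ref_assembly(refseq))
```

For a RefSeq entry an empty release passes, whereas the same empty value with source set to Ensembl or GENCODE is reported.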
Example Configuration:
```yaml
ref_assembly:
  source: "RefSeq"
  accession: "GCF_016699485.2"
  name: "bGalGal1.mat.broiler.GRCg7b"
  release: "" # Not required for RefSeq
  species: "Gallus_gallus"
```

api_keys - External API Access
Configure API keys for external database access to avoid rate limiting.
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| ncbi | string | NCBI API key for genome and SRA downloads | No (recommended) |
diffexp - Differential Expression Analysis
This section configures the differential gene expression analysis using DESeq2.
tximeta - Transcript Quantification Import
Configure how transcript quantification data is imported and processed.
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| factors | array | Experimental factors for analysis | Yes |
| extra | string | Additional parameters for tximeta | No |
Each factor should have:
- name: Factor name (e.g., “treatment”, “time”)
- reference_level: Baseline level for comparisons
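One way to picture how factors relate to samples.tsv: each factor name is expected to be a column, and its reference_level one of the values in that column. The sketch below encodes that reading. It is an assumption about how the two pieces fit together rather than the workflow's documented validation, and check_factors is a hypothetical helper.

```python
def check_factors(factors, sample_rows):
    """Report factors whose column or reference level is absent from the samples."""
    problems = []
    for factor in factors:
        name, ref = factor["name"], factor["reference_level"]
        levels = {row[name] for row in sample_rows if name in row}
        if not levels:
            problems.append(f"{name}: no such column in samples.tsv")
        elif ref not in levels:
            problems.append(
                f"{name}: reference level {ref!r} not among {sorted(levels)}"
            )
    return problems


# Rows shaped like the samples.tsv example in this guide (values are strings).
rows = [
    {"sample_name": "M1", "treatment": "microstructured", "time": "10"},
    {"sample_name": "F1", "treatment": "flat", "time": "10"},
    {"sample_name": "C1", "treatment": "control", "time": "10"},
]
factors = [
    {"name": "treatment", "reference_level": "control"},
    {"name": "time", "reference_level": "0h"},
]
print(check_factors(factors, rows))
```

Here the treatment factor passes, while the time factor is flagged because “0h” never occurs in the time column.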
deseq2 - DESeq2 Analysis Configuration
Configure multiple DESeq2 analyses with different experimental designs.
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| analyses | array | List of analysis configurations | Yes |
| transform | object | Data transformation settings | Yes |
Each analysis includes:
- name: Unique identifier for the analysis
- deseqdataset: DESeqDataSet creation parameters
- wald: Wald test configuration
- contrasts: Statistical comparisons to perform
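A contrast is easiest to read as a triple. The sketch below assumes that a contrast's elements list follows DESeq2's character-vector convention of (factor, numerator level, denominator level), and that the factor should appear in the analysis formula. Both points are assumptions about how this workflow consumes the setting, and the helper names are hypothetical.

```python
import re


def formula_terms(formula):
    """Split a design formula such as '~ treatment + time' into variable names."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", formula))


def describe_contrast(contrast, formula):
    """Validate one contrast entry and return a human-readable summary."""
    elements = contrast["elements"]
    if len(elements) != 3:
        raise ValueError(f"{contrast['name']}: expected 3 elements, got {len(elements)}")
    factor, numerator, denominator = elements
    if factor not in formula_terms(formula):
        raise ValueError(f"{contrast['name']}: factor {factor!r} is not in the formula")
    return f"{contrast['name']}: {numerator} vs {denominator} (factor {factor})"


contrast = {
    "name": "treatment_vs_control",
    "elements": ["treatment", "treated", "control"],
}
print(describe_contrast(contrast, "~ treatment + time"))
```

Under this reading, positive log2 fold changes mean higher expression in the numerator level (“treated”) than in the denominator level (“control”).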
Example DESeq2 Configuration:
```yaml
diffexp:
  tximeta:
    factors:
      - name: "treatment"
        reference_level: "control"
      - name: "time"
        reference_level: "0h"
    extra: ""
  deseq2:
    analyses:
      - name: "treatment_analysis"
        deseqdataset:
          formula: "~ treatment + time"
          min_counts: 10
          extra: ""
          threads: 4
        wald:
          deseq_extra: ""
          shrink_extra: "type = 'apeglm'"
          results_extra: "alpha = 0.05"
          threads: 4
        contrasts:
          - name: "treatment_vs_control"
            elements: ["treatment", "treated", "control"]
          - name: "time_6h_vs_0h"
            elements: ["time", "6h", "0h"]
    transform:
      method: "rlog"
      extra: ""
```

enrichment - Functional Enrichment Analysis
Configure comprehensive functional enrichment analysis including GO, KEGG, MSigDB, SPIA, and Harmonizome databases.
Core Parameters:
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| padj_cutoff | number | Adjusted p-value cutoff for ORA | Yes |
| targets | array | Target genes for pathway analysis | No |
clusterprofiler - Standard Enrichment Analysis
Configure GO and KEGG enrichment analysis.
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| gsea | object | Gene Set Enrichment Analysis settings | Yes |
| ora | object | Over-Representation Analysis settings | Yes |
| wikipathways | object | WikiPathways analysis settings | Yes |
| kegg_module | object | KEGG module analysis settings | Yes |
msigdb - Molecular Signatures Database
Configure MSigDB gene set analysis with customizable collections.
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| enabled | boolean | Enable MSigDB analysis | Yes |
| collections | array | MSigDB collections to analyze | Yes |
| custom_gmt_files | array | Custom GMT files for analysis | No |
Available MSigDB Collections:
- H: Hallmark gene sets (50 gene sets)
- C1: Positional gene sets (299 gene sets)
- C2: Curated gene sets (5,922 gene sets)
- C3: Regulatory target gene sets (3,738 gene sets)
- C4: Computational gene sets (858 gene sets)
- C5: Ontology gene sets (10,922 gene sets)
- C6: Oncogenic signatures (189 gene sets)
- C7: Immunologic signatures (4,872 gene sets)
- C8: Cell type signature gene sets (746 gene sets)
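For a quick sense of scale when choosing collections, the sketch below looks up the gene-set counts listed above and rejects unknown collection codes. The counts are the ones quoted in this guide and may differ between MSigDB releases; summarize_collections is a hypothetical helper, not part of the workflow.

```python
# Gene-set counts as listed in this guide (may vary by MSigDB release).
MSIGDB_COLLECTIONS = {
    "H": 50,
    "C1": 299,
    "C2": 5922,
    "C3": 3738,
    "C4": 858,
    "C5": 10922,
    "C6": 189,
    "C7": 4872,
    "C8": 746,
}


def summarize_collections(collections):
    """Return the total number of gene sets in the selected collections."""
    unknown = [c for c in collections if c not in MSIGDB_COLLECTIONS]
    if unknown:
        raise ValueError(f"unknown MSigDB collections: {unknown}")
    return sum(MSIGDB_COLLECTIONS[c] for c in collections)


# Hallmark plus Curated gene sets.
print(summarize_collections(["H", "C2"]))
```

Larger selections such as C5 or C7 multiply the number of tests performed, which is worth keeping in mind when interpreting adjusted p-values.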
spia - Signaling Pathway Impact Analysis
Configure topology-based pathway analysis using KEGG pathway information.
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| enabled | boolean | Enable SPIA analysis | Yes |
| extra | string | Additional SPIA parameters | No |
harmonizome - Tissue-Specific Analysis
Configure tissue-specific gene set analysis using the Harmonizome database.
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| enabled | boolean | Enable Harmonizome analysis | Yes |
| datasets | array | Harmonizome datasets to analyze | Yes |
Available Datasets:
- GTEx Tissue Gene Expression Profiles
- BioGPS Human Cell Type and Tissue Gene Expression Profiles
- Human Protein Atlas
- And many more at Harmonizome
Example Enrichment Configuration:
```yaml
enrichment:
  padj_cutoff: 0.05
  targets: ["ACTB", "GAPDH", "PAX3", "PAX7", "MYOD1"]
  clusterprofiler:
    gsea:
      gseGO:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
      gseKEGG:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    ora:
      enrichGO:
        extra: "pvalueCutoff = 0.05"
      enrichKEGG:
        extra: "pvalueCutoff = 0.05"
    wikipathways:
      enabled: true
      enrichWP:
        extra: "pvalueCutoff = 0.05"
      gseWP:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    kegg_module:
      enabled: true
      enrichMKEGG:
        extra: "pvalueCutoff = 0.05"
      gseMKEGG:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  msigdb:
    enabled: true
    collections: ["H", "C2"] # Hallmark and Curated gene sets
    custom_gmt_files: [] # Add custom GMT files here
    ora:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    gsea:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  spia:
    enabled: true
    extra: "beta = NULL, verbose = TRUE, plots = FALSE"
  harmonizome:
    enabled: false
    datasets:
      - name: "GTEx Tissue Gene Expression Profiles"
        gene_sets: ["Muscle", "Adipose"]
      - name: "BioGPS Human Cell Type and Tissue Gene Expression Profiles"
        gene_sets: ["Adipocyte", "Myocyte"]
    ora:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    gsea:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  annotationforge:
    version: "0.1.0"
    author: "tucca-rna-seq <benjamin.bromberg@tufts.edu>"
    extra: "useSynonyms = TRUE"
```

params - Tool-Specific Parameters
Configure parameters for individual analysis tools in the workflow.
fastqc - Quality Control
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| memory | integer | Memory allocation in MB | 1024 |
| extra | string | Additional FastQC parameters | "" |
star_index - Genome Indexing
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| sjdbOverhang | integer | Splice junction database overhang | 149 |
| extra | string | Additional STAR index parameters | "" |
star - Read Alignment
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| outSAMtype | string | Output SAM format | “BAM SortedByCoordinate” |
| extra | string | Additional STAR parameters | "" |
qualimap_rnaseq - RNA-Seq Quality Control
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| enabled | boolean | Enable Qualimap analysis | true |
| counting_alg | string | Counting algorithm | “proportional” |
| sequencing_protocol | string | Sequencing protocol type | “non-strand-specific” |
| extra | string | Additional Qualimap parameters | "" |
salmon_index - Transcriptome Indexing
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| extra | string | Additional Salmon index parameters | “-k 31” |
salmon_quant - Transcript Quantification
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| libtype | string | Library type detection | “A” (auto-detect) |
| extra | string | Additional Salmon parameters | "" |
multiqc - Quality Report Aggregation
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| extra | string | Additional MultiQC parameters | “--verbose --force” |
sra_tools - SRA Data Download
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| vdb_config_ra_path | string | SRA tools configuration path | “/repository/user/main/remote_access=true” |
| subsample | object | SRA subsampling configuration | See below |
SRA Subsampling Configuration:
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| enabled | boolean | Enable SRA subsampling | false |
| min_spot_id | integer | Minimum spot ID for subsampling | 1 |
| max_spot_id | integer | Maximum spot ID for subsampling | 100000 |
Example Parameters Configuration:
```yaml
params:
  fastqc:
    memory: 1024
    extra: ""
  star_index:
    sjdbOverhang: 149
    extra: "--genomeSAindexNbases 10" # For small genomes
  star:
    extra: >-
      --outSAMtype BAM SortedByCoordinate
      --outSAMunmapped Within
      --outSAMattributes Standard
      --outFilterMultimapNmax 1
      --outFilterScoreMinOverLread 0
      --outFilterMatchNminOverLread 0
      --alignIntronMin 1
      --alignIntronMax 2500
  qualimap_rnaseq:
    enabled: true
    counting_alg: "proportional"
    sequencing_protocol: "non-strand-specific"
    extra: "--paired --java-mem-size=8G"
  salmon_index:
    extra: "-k 31"
  salmon_quant:
    libtype: "A"
    extra: "--seqBias --posBias --writeUnmappedNames"
  multiqc:
    extra: "--verbose --force"
  sra_tools:
    vdb_config_ra_path: "/repository/user/main/remote_access=true"
    subsample:
      enabled: false
      min_spot_id: 1
      max_spot_id: 100000
```

Profile Configuration
Profiles spare you from typing long, complex commands for every analysis: they preset values for Snakemake CLI flags and for plugins from the Snakemake Plugin Catalog.
How Profiles Work
A profile is a directory containing a YAML configuration file, typically named
config.v8+.yaml. Inside this file, you can set default values for any of
Snakemake’s command-line arguments. The syntax is straightforward: a
command-line option like --some-option becomes a key in the YAML file as
some-option:.
This allows you to configure everything from the execution backend (e.g., a SLURM cluster) to software deployment (e.g., Conda and Singularity) and resource allocation for specific rules.
Since profiles work by setting default values for command-line arguments, it is
helpful to know all the available options. The official documentation for the
snakemake executable, which is the primary way to run, debug, and visualize
workflows, provides a comprehensive list of all possible flags and options that
can be set in a profile.
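The option-to-key mapping described above can be sketched in a few lines of Python. This is only an illustration of the naming rule for flat keys, not how Snakemake itself parses profiles. The helper profile_to_cli is hypothetical and deliberately ignores nested settings such as default-resources.

```python
def profile_to_cli(profile):
    """Translate flat profile keys into the command-line arguments they stand for."""
    args = []
    for key, value in profile.items():
        flag = f"--{key}"
        if value is True:
            args.append(flag)  # boolean switches take no value
        elif value is False or value is None:
            continue  # disabled switches are simply omitted
        else:
            args.extend([flag, str(value)])
    return args


# Keys taken from the profile examples in this guide.
profile = {"executor": "slurm", "use-conda": True, "jobs": 100, "latency-wait": 120}
print(" ".join(["snakemake"] + profile_to_cli(profile)))
```

Read in reverse, the same rule tells you how to move a flag from a long command line into a profile.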
Snakemake Plugin Catalog
The catalog is a centralized resource that collects information and documentation for all official Snakemake plugins. These plugins are essential for extending Snakemake’s core functionality, allowing it to interface with different execution backends (like HPC schedulers), storage systems (like cloud buckets), and more. This plugin-based architecture allows for modular and independent development of these components.
Profile Examples from tucca-rna-seq
This workflow includes several pre-configured profiles that serve as excellent examples of how to set up different execution environments.
This profile is for running the workflow on a SLURM-managed HPC cluster,
like the Tufts HPC. It configures the
snakemake-executor-plugin-slurm, enables Singularity and Conda
for reproducibility, and sets default resources for jobs.
```yaml
# profiles/slurm/config.v8+.yaml
# Configured for exclusive use with snakemake-executor-plugin-slurm

# Profile Settings
# To learn more see:
# https://snakemake.readthedocs.io/en/stable/executing/cli.html
__use_yte__: true
executor: slurm
use-singularity: true
use-conda: true
conda-cleanup-pkgs: tarballs
verbose: true
show-failed-logs: true
retries: 3
rerun-incomplete: true
jobs: 100 # Slurm jobscript size limit
latency-wait: 120
default-resources:
  slurm_partition: "batch"
  slurm_account: "default"
  runtime: 4320
  mem_mb: 32000
  cpus_per_task: 12
  # Change the following email address to your own email address if you would
  # like this workflow to notify you via email of started/failed/completed
  # slurm jobs
  slurm_extra: '"--mail-type=ALL --mail-user=firstname.lastname@tufts.edu"'

set-resources:
  star_index:
    mem_mb: 64000
```

This profile is for Continuous Integration (CI) testing on GitHub Actions.
It runs jobs locally (cores: all) using only Conda for software
management. It does not specify an executor, so Snakemake defaults to local
execution.
```yaml
# profiles/github-actions/config.v8+.yaml
# Configured for exclusive use with GitHub Actions testing workflow
# Defaulting to Conda-only execution.
# Singularity runs will add --use-singularity explicitly.
# See .github/workflows/main.yml

# Profile Settings
# To learn more see:
# https://snakemake.readthedocs.io/en/stable/executing/cli.html
use-conda: true
conda-cleanup-pkgs: tarballs
verbose: true
# all-temp: true # Commented out to allow artifact uploads for Rmd testing
# Note: If disk space becomes an issue, consider creating a separate profile
# for Singularity runs that includes all-temp: true
show-failed-logs: true
retries: 3
rerun-incomplete: true
cores: all
latency-wait: 120
default-resources:
  mem_mb: 16000

set-resources:
  star_index:
    mem_mb: 16000
```

Coming Soon…
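Both profiles above pair default-resources with a per-rule set-resources entry. As a simplified sketch of how such an override resolves, the snippet below merges the two mappings using the values from the SLURM profile. Snakemake's real resolution also takes resources declared inside the rules into account, and both resources_for and the rule name salmon_quant are used here purely for illustration.

```python
def resources_for(rule, default_resources, set_resources):
    """Start from the defaults, then apply any override defined for this rule."""
    resolved = dict(default_resources)
    resolved.update(set_resources.get(rule, {}))
    return resolved


# Values taken from the SLURM profile shown above.
default_resources = {"runtime": 4320, "mem_mb": 32000, "cpus_per_task": 12}
set_resources = {"star_index": {"mem_mb": 64000}}

for rule in ("star_index", "salmon_quant"):
    print(rule, resources_for(rule, default_resources, set_resources))
```

Only star_index receives the larger memory request; every other rule keeps the defaults, which is why the override block can stay short.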
Advanced Profile Templating with YTE
Snakemake profiles support the YTE templating engine (__use_yte__: true),
which allows you to dynamically set profile values based on environment variables.
This is useful for creating flexible profiles that can adapt to different users
or systems.
Configuration Validation
The workflow automatically validates your configuration using JSON schemas:
- Schema Validation: Configuration files are checked against schemas in workflow/schemas/
- Cross-Reference Validation: Ensures consistency between files
- Runtime Validation: Catches configuration errors before analysis begins
Next Steps
Now that you have prepared your configuration files, you are ready to execute the workflow. For detailed instructions on how to perform a dry-run, validate your setup, and launch the analysis on your specific computing environment, please proceed to the next guide.
Linked external resources are independent of TUCCA and Tufts University and remain under their own licenses.