
Functional Enrichment Analysis

The tucca-rna-seq workflow provides comprehensive functional enrichment analysis to help interpret the biological meaning of differentially expressed genes. This guide covers all enrichment analysis capabilities and how to configure and interpret results.


Functional enrichment analysis identifies biological processes, pathways, and gene sets that are over-represented in your differentially expressed genes compared to a background set. This helps you understand:

  • Biological significance of your gene expression changes
  • Pathway involvement in your experimental conditions
  • Gene set associations with specific biological functions
  • Tissue-specific patterns in gene expression
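To make "over-represented compared to a background set" concrete, here is a minimal base-R sketch of the hypergeometric test that underlies over-representation analysis. All counts are made-up illustration values, not output from the workflow.

```r
# Toy over-representation test; every number here is invented for illustration.
N <- 10000  # background genes (for example, all expressed genes)
K <- 200    # background genes annotated to the gene set
n <- 500    # differentially expressed (DE) genes
k <- 25     # DE genes that fall in the gene set

# Overlap expected by chance, and how far the observed overlap exceeds it
expected <- n * K / N
fold_enrichment <- k / expected

# P(X >= k) when drawing n genes from the background without replacement
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

cat(sprintf("expected = %.1f, observed = %d, fold = %.1f, p = %.3g\n",
            expected, k, fold_enrichment, p_value))
```

The same p-value comes out of a one-sided Fisher's exact test on the corresponding 2x2 table, which is why the choice of background set matters so much.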

1. Standard Enrichment Analysis (clusterProfiler)


The workflow uses clusterProfiler for comprehensive enrichment analysis:

Gene Ontology (GO):

  • Biological Process (BP): Cellular processes and biological functions
  • Molecular Function (MF): Biochemical activities and molecular interactions
  • Cellular Component (CC): Subcellular locations and cellular structures

KEGG Pathways:

  • Metabolic pathways: Energy metabolism, biosynthesis
  • Signaling pathways: Cell communication, signal transduction
  • Disease pathways: Disease-associated gene sets

WikiPathways:

  • Community-curated pathways: Expert-validated pathway databases
  • Species-specific pathways: Tailored to your organism of interest
  • Dynamic pathway updates: Regular community updates

KEGG Modules:

  • Functional modules: Subsets of KEGG pathways
  • Metabolic networks: Interconnected metabolic reactions
  • Regulatory modules: Gene regulatory networks

2. MSigDB (Molecular Signatures Database)

Access to 9 major gene set collections (H and C1 through C8) with over 25,000 gene sets:

H: Hallmark Gene Sets

50 well-defined gene sets representing specific biological states and processes.

  • Apoptosis
  • DNA Repair
  • Inflammatory Response
  • Oxidative Phosphorylation

C2: Curated Gene Sets

5,922 gene sets from various sources including pathway databases, literature, and expert knowledge.

  • KEGG pathways
  • Reactome pathways
  • Biocarta pathways
  • Literature-derived sets

C5: Ontology Gene Sets

10,922 gene sets derived from Gene Ontology terms.

  • Biological processes
  • Molecular functions
  • Cellular components
  • Hierarchical organization

Additional Collections:

  • C1: Positional gene sets (299 sets)
  • C3: Regulatory target gene sets (3,738 sets)
  • C4: Computational gene sets (858 sets)
  • C6: Oncogenic signatures (189 sets)
  • C7: Immunologic signatures (4,872 sets)
  • C8: Cell type signature gene sets (746 sets)

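Add your own gene sets in GMT format. Each line is one gene set, with tab-separated fields: the set name, a description, then the member genes:

SET_NAME	DESCRIPTION	GENE1	GENE2	GENE3
MyPathway	Custom pathway	ACTB	GAPDH	PAX3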
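Before adding a custom GMT file to the configuration, it can help to check that it parses as expected. The following is a minimal base-R sketch; `read_gmt` is a hypothetical helper written for this example (clusterProfiler ships its own `read.gmt()` reader, which returns a term/gene data frame instead of a list).

```r
# Hypothetical helper: parse a GMT file into a named list of gene vectors.
read_gmt <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]          # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)  # GMT fields are tab-separated
  gene_sets <- lapply(fields, function(f) {
    genes <- f[-(1:2)]                           # fields 1-2 are name and description
    unique(genes[nzchar(genes)])
  })
  names(gene_sets) <- vapply(fields, `[`, character(1), 1)
  gene_sets
}

# Write a small example file (contents are illustrative) and read it back
gmt_file <- tempfile(fileext = ".gmt")
writeLines(c("MyPathway\tCustom pathway\tACTB\tGAPDH\tPAX3",
             "OtherSet\tAnother custom set\tMYOD1\tMYF5"), gmt_file)
gene_sets <- read_gmt(gmt_file)
str(gene_sets)
```

A set that parses to fewer genes than the configured `minGSSize` would be filtered out of the analysis, so this check is also a quick way to spot sets that are too small.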

3. SPIA (Signaling Pathway Impact Analysis)


Topology-based pathway analysis that considers:

  • Gene interactions within pathways
  • Pathway structure and connectivity
  • Impact scores based on topology
  • Statistical significance of pathway involvement

4. Harmonizome

Tissue-specific gene sets from various expression datasets:

  • GTEx Tissue Gene Expression Profiles: Human tissue-specific expression
  • BioGPS Human Cell Type Profiles: Cell type-specific gene sets
  • Human Protein Atlas: Protein expression across tissues
  • Expression Atlas: Multi-species expression data

Configure enrichment analysis in your config.yaml:

enrichment:
  padj_cutoff: 0.05 # Adjusted p-value cutoff for significant genes
  clusterprofiler:
    gsea:
      gseGO:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
      gseKEGG:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    ora:
      enrichGO:
        extra: "pvalueCutoff = 0.05"
      enrichKEGG:
        extra: "pvalueCutoff = 0.05"
    wikipathways:
      enabled: true
      enrichWP:
        extra: "pvalueCutoff = 0.05"
      gseWP:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    kegg_module:
      enabled: true
      enrichMKEGG:
        extra: "pvalueCutoff = 0.05"
      gseMKEGG:
        extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  msigdb:
    enabled: true
    collections: ["H", "C2", "C5"] # Hallmark, Curated, Ontology
    custom_gmt_files: [] # Add custom GMT files here
    ora:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    gsea:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  spia:
    enabled: true
    extra: "beta = NULL, verbose = TRUE, plots = FALSE"
    # When enabled, the workflow automatically:
    # 1. Downloads KEGG pathway data for your organism
    # 2. Generates SPIA data files using makeSPIAdata()
    # 3. Performs SPIA analysis using the generated data
  harmonizome:
    enabled: false
    datasets:
      - name: "GTEx Tissue Gene Expression Profiles"
        gene_sets: ["Muscle", "Adipose"]
      - name: "BioGPS Human Cell Type and Tissue Gene Expression Profiles"
        gene_sets: ["Adipocyte", "Myocyte"]
    ora:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
    gsea:
      extra: "pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500"
  annotationforge:
    version: "0.1.0"
    author: "tucca-rna-seq <benjamin.bromberg@tufts.edu>"
    extra: "useSynonyms = TRUE"
    # This section configures local OrgDb package building for organisms
    # not available in Bioconductor. The workflow automatically detects
    # if your species needs a local build and creates the package using
    # AnnotationForge::makeOrgPackage().

The workflow automatically handles organism-specific annotation packages:

  • Bioconductor packages: Automatically installed for supported organisms (e.g., org.Hs.eg.db for human)
  • Local builds: For unsupported organisms, automatically builds OrgDb packages using AnnotationForge
  • Species detection: Automatically determines the correct package based on your config.yaml species setting
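
Longer parameter lists can be written as folded YAML strings in the extra fields (shown here as fragments of the corresponding configuration sections):

gsea:
  gseGO:
    extra: >-
      pvalueCutoff = 0.05,
      minGSSize = 10,
      maxGSSize = 500,
      eps = 1e-10,
      seed = 42
ora:
  enrichGO:
    extra: >-
      pvalueCutoff = 0.05,
      minGSSize = 10,
      maxGSSize = 500,
      qvalueCutoff = 0.2
spia:
  extra: >-
    beta = NULL,
    verbose = TRUE,
    plots = FALSE,
    nB = 2000,
    nPerm = 1000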
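The `extra` values are written as R-style argument lists. As a rough illustration of how such a string can become function arguments, the sketch below parses it into a named list and forwards it with `do.call()`. This is an assumption made for illustration only: `parse_extra` and `toy_enrich` are hypothetical names, and the workflow's own handling of `extra` may differ.

```r
# Hypothetical helper: turn "a = 1, b = TRUE" into list(a = 1, b = TRUE).
# Evaluated in the base environment so only base R names are visible.
parse_extra <- function(extra) {
  eval(parse(text = paste0("list(", extra, ")")), envir = baseenv())
}

args <- parse_extra("pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500")
spia_args <- parse_extra("beta = NULL, verbose = TRUE, plots = FALSE")

# Stand-in for an enrichment function, used only to show argument forwarding
toy_enrich <- function(genes, pvalueCutoff = 0.05, minGSSize = 10, maxGSSize = 500) {
  sprintf("%d genes, cutoff %.2f, sizes %d-%d",
          length(genes), pvalueCutoff, minGSSize, maxGSSize)
}

result <- do.call(toy_enrich, c(list(genes = c("ACTB", "GAPDH")), args))
print(result)
```

Because the string is evaluated as R code, a typo such as a missing comma surfaces as a parse error, so it is worth checking the R log if a rule fails right after a configuration change.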

The enrichment steps generate RDS files only; no plots, visualizations, or interactive tools are created automatically:

resources/enrichment/
├── {analysis_name}/
│   └── {contrast_name}/
│       ├── gsea_results.RDS            # Gene Set Enrichment Analysis results
│       ├── ora_results.RDS             # Over-Representation Analysis results
│       ├── spia_results.RDS            # SPIA results (raw)
│       └── spia_results_readable.RDS   # SPIA results (with gene symbols)
├── spia_data/                          # SPIA data directory (if enabled)
├── local_orgdb_build/                  # Local OrgDb packages (if built)
└── enrichment_params.RDS               # Enrichment parameters

  • RDS files: R objects containing complete analysis results
  • No automatic plots: The workflow generates data objects, not visualizations
  • No PDF/PNG files: These must be created manually using the R objects
  • No HTML reports for enrichment: Enrichment results are stored as R data structures
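Since every result lives at a predictable path, a small helper can build those paths and load whichever files exist. This is a sketch based on the directory layout shown above; `enrichment_file`, `load_if_present`, and the analysis and contrast names are hypothetical.

```r
# Hypothetical helper: build the path to one enrichment result file.
enrichment_file <- function(analysis, contrast,
                            result = c("gsea_results", "ora_results",
                                       "spia_results", "spia_results_readable"),
                            root = "resources/enrichment") {
  result <- match.arg(result)  # rejects names that are not in the list above
  file.path(root, analysis, contrast, paste0(result, ".RDS"))
}

# Hypothetical helper: return NULL instead of failing when a file is absent,
# for example when SPIA is disabled in the configuration.
load_if_present <- function(path) {
  if (file.exists(path)) readRDS(path) else NULL
}

# "my_analysis" and "treated_vs_control" are placeholder names
ora_path <- enrichment_file("my_analysis", "treated_vs_control", "ora_results")
print(ora_path)
ora_results <- load_if_present(ora_path)
```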

The workflow generates RDS files containing:

  • GSEA results: gseaResult objects from clusterProfiler
  • ORA results: enrichResult objects from clusterProfiler
  • SPIA results: Data frames with pathway impact analysis
  • MSigDB results: Gene set enrichment for specified collections
  • Harmonizome results: Tissue-specific gene set analysis

The workflow includes RMarkdown playgrounds that integrate directly with the outputs:

  • GeneTonic Playground: Interactive enrichment analysis and visualization
  • PCA Explorer Playground: Interactive principal component analysis
  • Custom Analysis Playground: Framework for extending analysis

These tools automatically load workflow outputs and provide interactive interfaces through Shiny applications, enabling you to:

  • Create interactive plots that can be customized and downloaded
  • Generate publication-ready figures directly from the apps
  • Export visualizations in multiple formats (PNG, PDF, SVG)
  • Explore data dynamically before creating final figures

The workflow does not automatically create:

  • ❌ Static plots or bar charts
  • ❌ PDF figures or publication-ready images
  • ❌ Batch-generated visualizations

The complete analysis process:

  1. Workflow execution: Generates enrichment data objects (RDS) and quality-control HTML reports (FastQC, Qualimap, MultiQC)
  2. Interactive analysis: Use integrated RMarkdown playgrounds
  3. Custom visualization: Create publication-ready figures manually
  4. Export results: Share data files and custom visualizations

The workflow automatically handles organism-specific annotation packages:

  • Detection: Identifies whether your species has a Bioconductor package
  • Installation: Installs Bioconductor packages automatically
  • Local building: Creates custom OrgDb packages for unsupported organisms
  • Integration: Seamlessly provides the correct annotation data for enrichment analysis

GSEA Results

Key Metrics:

  • Enrichment Score (ES): Measure of gene set enrichment
  • Normalized Enrichment Score (NES): ES normalized so that gene sets of different sizes can be compared
  • P-value: Statistical significance
  • FDR q-value: False discovery rate adjusted p-value

Interpretation:

  • Positive ES: Gene set members are concentrated among the upregulated genes (top of the ranked list)
  • Negative ES: Gene set members are concentrated among the downregulated genes (bottom of the ranked list)
  • |NES| > 1.5: Strong enrichment
  • FDR < 0.25: Commonly used GSEA significance threshold
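To show where the sign of the enrichment score comes from, the following base-R sketch computes the classic running-sum ES for a toy ranked gene list. It is a simplified illustration with invented gene names and statistics, not the implementation clusterProfiler uses, and it omits the permutation step needed for NES and p-values.

```r
# Running-sum enrichment score for one gene set (weighted by |statistic|^p).
running_es <- function(ranked_stats, gene_set, p = 1) {
  ranked_stats <- sort(ranked_stats, decreasing = TRUE)
  in_set <- names(ranked_stats) %in% gene_set
  hit_weight <- abs(ranked_stats)^p * in_set
  step_up <- hit_weight / sum(hit_weight)   # increments at gene set members
  step_down <- (!in_set) / sum(!in_set)     # decrements at all other genes
  running <- cumsum(step_up - step_down)
  unname(running[which.max(abs(running))])  # maximum deviation from zero
}

# Toy ranking statistics (for example, log2 fold changes); values are invented
stats <- c(g1 = 3, g2 = 2.5, g3 = 2, g4 = 1, g5 = -0.5, g6 = -1, g7 = -2, g8 = -3)

es_up <- running_es(stats, c("g1", "g2", "g4"))    # members near the top
es_down <- running_es(stats, c("g6", "g7", "g8"))  # members near the bottom

cat(sprintf("ES (up set) = %.3f, ES (down set) = %.3f\n", es_up, es_down))
```

A set whose members cluster at the top of the ranking drives the running sum up early, giving a positive ES; a set clustered at the bottom gives a negative ES.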

ORA Results

Key Metrics:

  • Gene Ratio: Proportion of your input genes that belong to the gene set
  • P-value: Statistical significance
  • Adjusted P-value: P-value corrected for multiple testing
  • Count: Number of input genes that overlap the gene set

Interpretation:

  • Gene Ratio > 0.1: A high proportion of the input genes fall in the gene set
  • Adjusted P-value < 0.05: Statistically significant
  • Count > 5: Sufficient genes for reliable results
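clusterProfiler reports GeneRatio as a text fraction such as "30/200", so it has to be converted before it can be compared against a numeric threshold. The sketch below applies the rules of thumb above to a small hand-made table; the rows are invented, and with real results the table would come from `as.data.frame()` on an enrichResult object.

```r
# Invented ORA-style table with the column names clusterProfiler uses
ora_table <- data.frame(
  Description = c("muscle contraction", "lipid metabolic process", "cell cycle"),
  GeneRatio = c("30/200", "8/200", "4/200"),
  p.adjust = c(0.001, 0.03, 0.20),
  Count = c(30, 8, 4),
  stringsAsFactors = FALSE
)

# Convert "k/n" strings to numeric proportions
ratio_to_numeric <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}
ora_table$gene_ratio <- ratio_to_numeric(ora_table$GeneRatio)

# Keep terms that are significant and supported by enough genes
filtered <- ora_table[ora_table$p.adjust < 0.05 & ora_table$Count > 5, ]
print(filtered[, c("Description", "gene_ratio", "p.adjust", "Count")])
```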

SPIA Results

Key Metrics:

  • pPERT: P-value of the pathway perturbation (probability of observing a total accumulation more extreme than tA by chance)
  • pGFdr: False discovery rate adjusted global p-value
  • Status: Pathway activation/inhibition
  • tA: Total perturbation accumulation in the pathway

Interpretation:

  • Status = “Activated”: Pathway is upregulated
  • Status = “Inhibited”: Pathway is downregulated
  • pGFdr < 0.05: Statistically significant pathway
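SPIA combines two lines of evidence per pathway: the over-representation p-value (pNDE) and the perturbation p-value (pPERT). With the default Fisher-style combination, the global p-value is pG = c - c·ln(c), where c = pNDE × pPERT. The sketch below reproduces that arithmetic in base R with invented inputs; the function names are hypothetical, and the status rule based on the sign of tA is stated here as an assumption about how the Status column is derived.

```r
# Fisher-style combination of the two SPIA p-values (inputs are invented)
combine_fisher <- function(p_nde, p_pert) {
  k <- p_nde * p_pert
  k - k * log(k)
}

p_g <- combine_fisher(p_nde = 0.01, p_pert = 0.20)

# Assumed rule: positive total accumulation means activation
pathway_status <- function(tA) ifelse(tA > 0, "Activated", "Inhibited")

cat(sprintf("pG = %.4f\n", p_g))
print(pathway_status(c(4.2, -3.1)))
```

The combined value equals the upper tail of a chi-squared distribution with 4 degrees of freedom at -2·ln(c), which is the usual form of Fisher's method for two p-values.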

# Load enrichment results
library(tidyverse)
# GSEA results
gsea_results <- readRDS("resources/enrichment/analysis_name/contrast_name/gsea_results.RDS")
# ORA results
ora_results <- readRDS("resources/enrichment/analysis_name/contrast_name/ora_results.RDS")
# SPIA results
spia_results <- readRDS("resources/enrichment/analysis_name/contrast_name/spia_results_readable.RDS")

Interactive Analysis with Integrated Tools


The workflow includes RMarkdown playgrounds that integrate directly with the generated outputs:

The analysis/GeneTonic_playground.Rmd provides interactive enrichment analysis:

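# Load workflow outputs
dds_wald <- readRDS("resources/deseq2/analysis_name/dds.RDS")
res_de <- readRDS("resources/deseq2/analysis_name/contrast_name/wald.RDS")
res_enrich <- readRDS("resources/enrichment/analysis_name/contrast_name/ora_results.RDS")
# Gene annotation table with gene_id and gene_name columns. This placeholder
# reuses the gene IDs as names; replace it with an ID-to-symbol mapping for your organism
annotation_obj <- data.frame(
  gene_id = rownames(dds_wald),
  gene_name = rownames(dds_wald),
  row.names = rownames(dds_wald),
  stringsAsFactors = FALSE
)
# Convert the GO enrichResult into GeneTonic's table format and add aggregated scores
res_enrich_shaken <- GeneTonic::shake_enrichResult(res_enrich$GO)
res_enrich_shaken <- GeneTonic::get_aggrscores(res_enrich_shaken, res_de, annotation_obj)
# Launch interactive GeneTonic app
gtl <- GeneTonic::GeneTonicList(
  dds = dds_wald,
  res_de = res_de,
  res_enrich = res_enrich_shaken,
  annotation_obj = annotation_obj
)
GeneTonic::GeneTonic(gtl = gtl, project_id = "My Project")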

The analysis/pcaExplorer_playground.Rmd provides interactive PCA analysis:

# Load workflow outputs
dds <- readRDS("resources/deseq2/analysis_name/dds.RDS")
dst <- readRDS("resources/deseq2/analysis_name/dst.RDS")
# Launch interactive pcaExplorer app. The count matrix and sample metadata are
# taken from dds; a gene annotation data frame can optionally be passed via `annotation`
pcaExplorer::pcaExplorer(
  dds = dds,
  dst = dst
)

The workflow outputs are also compatible with the ideal package for interactive differential expression analysis:

# Load workflow outputs
dds <- readRDS("resources/deseq2/analysis_name/dds.RDS")
res <- readRDS("resources/deseq2/analysis_name/contrast_name/wald.RDS")
# Launch interactive ideal app
ideal::ideal(dds_obj = dds, res_obj = res)

You can also create custom visualizations manually using the data objects:

# Load required libraries
library(clusterProfiler)
library(enrichplot)
# Create dot plot for GO results
if (!is.null(ora_results$GO)) {
  dotplot(ora_results$GO, showCategory = 20)
}
# Create enrichment map for GSEA results
if (!is.null(gsea_results$GO)) {
  emapplot(pairwise_termsim(gsea_results$GO))
}
# Create network plot
if (!is.null(ora_results$KEGG)) {
  cnetplot(ora_results$KEGG, showCategory = 10)
}

  • P-value cutoff: 0.05 for discovery, 0.01 for validation
  • Gene set size: 10-500 genes for reliable results
  • Multiple testing: Use FDR correction for multiple comparisons
  • Background genes: Use all expressed genes as background
  • Gene set overlap: Avoid highly overlapping gene sets
  • Statistical power: Ensure sufficient sample size
  • Biological relevance: Focus on biologically meaningful results
  • Pathway consistency: Look for consistent patterns across analyses
  • Validation: Confirm key findings with independent methods
  • Interactive exploration: Start with GeneTonic and pcaExplorer for data exploration
  • Download from apps: Export publication-ready figures directly from the Shiny interfaces
  • Manual creation: Use clusterProfiler functions for custom plots when needed
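The multiple-testing recommendation above can be illustrated with base R's `p.adjust()`. The p-values below are invented; the point is only to show how Benjamini-Hochberg (FDR) adjustment changes which gene sets pass a 0.05 cutoff.

```r
# Invented raw p-values for six gene sets
p_values <- c(0.0004, 0.003, 0.012, 0.04, 0.2, 0.6)

# Benjamini-Hochberg adjustment controls the false discovery rate
p_adj <- p.adjust(p_values, method = "BH")

n_raw <- sum(p_values < 0.05)  # sets passing without correction
n_fdr <- sum(p_adj < 0.05)     # sets passing after FDR correction

print(data.frame(p_values, p_adj))
cat(sprintf("significant: %d raw, %d after BH adjustment\n", n_raw, n_fdr))
```

Here the fourth gene set passes the raw cutoff but not the adjusted one, which is the kind of borderline result that tends not to replicate.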

| Problem | Cause | Solution |
|---------|-------|----------|
| No significant results | P-value too strict | Increase p-value cutoff |
| Too many results | P-value too lenient | Decrease p-value cutoff |
| Empty gene sets | Background too small | Use all expressed genes |
| SPIA errors | KEGG API issues | Check internet connection |

If you encounter issues:

  1. Check configuration: Verify YAML syntax and parameters
  2. Review logs: Check Snakemake and R logs
  3. Validate input: Ensure differential expression results exist
  4. Open issue: Report problems on GitHub

After completing enrichment analysis:

  1. Load Results: Use readRDS() to load the generated data objects
  2. Interactive Exploration: Use GeneTonic and pcaExplorer for data exploration
  3. Download Visualizations: Export publication-ready figures from the Shiny apps
  4. Custom Analysis: Perform additional statistical analyses or create custom plots
  5. Biological Interpretation: Relate findings to your research question

For detailed workflow execution, see the Running Guide.


What the workflow provides:

  • Comprehensive data objects for all enrichment analyses
  • Statistical results ready for interpretation
  • Gene ID mappings and annotations
  • Flexible configuration for different analysis types
  • Integrated RMarkdown playgrounds for interactive analysis
  • HTML reports (FastQC, Qualimap, MultiQC)

What the workflow does not provide:

  • Automatic static plots or visualizations
  • Batch-generated figures
  • Publication-ready images

The workflow provides both data generation AND integrated interactive analysis tools.


Functional enrichment analysis provides biological context for your gene expression changes. The workflow generates comprehensive data objects and integrates with interactive analysis tools like GeneTonic, pcaExplorer, and ideal, allowing you to explore your results interactively and create custom visualizations for publication.

Linked external resources are independent of TUCCA and Tufts University and remain under their own licenses.