
Reproducibility in Scientific Computing

Reproducibility is the cornerstone of scientific research. This guide covers essential practices for ensuring your computational analyses can be reproduced by others, including yourself in the future.


Why Reproducibility Matters

  • Scientific integrity: Results can be verified and validated
  • Collaboration: Others can build on your work
  • Career advancement: Reproducible research is highly valued
  • Self-preservation: You can reproduce your own results months later

Good Enough Practices in Scientific Computing

We highly recommend reading Good enough practices in scientific computing by Wilson et al. [2]. The first author, Greg Wilson, was a co-founder of Software Carpentry, an educational resource for researchers that "develops and teaches workshops on the fundamental programming skills needed to conduct research. [Software Carpentry's] mission is to provide researchers high-quality, domain-specific training covering all aspects of research software engineering." [1]

Alternatively, a self-led lesson inspired by and based on the paper is available on The Carpentries Lab (a collection of peer-reviewed lessons from The Carpentries community). The lesson needs no software beyond a web browser: it is mostly discussion-based, with examples of data organization that anchor the discussion.


Core Principles

1. Document Everything

📝 Code Documentation

Document your code with clear comments and README files.

  • Function descriptions
  • Parameter explanations
  • Usage examples
  • Dependencies list
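One possible sketch of this in practice is a documented script header that records the description, parameters, usage, and dependencies in the file itself (the script name and options below are hypothetical):

#!/usr/bin/env bash
# filter_counts.sh - remove low-count genes from a CSV count matrix (hypothetical example)
#
# Usage:
#   ./filter_counts.sh <input.csv> <output.csv> [min_count]
#
# Parameters:
#   input.csv    raw count matrix, one gene per row, samples in columns
#   output.csv   filtered matrix written by this script
#   min_count    minimum total count required to keep a gene (default: 10)
#
# Dependencies: bash >= 4, awk
set -euo pipefail

input=$1
output=$2
min_count=${3:-10}

# Keep the header row, then keep only genes whose summed counts reach the threshold
awk -F',' -v min="$min_count" '
  NR == 1 { print; next }
  { total = 0; for (i = 2; i <= NF; i++) total += $i
    if (total >= min) print }
' "$input" > "$output"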

🔬 Experimental Design

Document your experimental setup and methodology.

  • Sample information
  • Data collection methods
  • Analysis parameters
  • Quality control steps

2. Version Control Everything

📚 Code Versioning

Use Git to track changes in your code and analysis scripts.

  • Track all changes
  • Create meaningful commits
  • Use branches for experiments
  • Tag important versions

📊 Data Versioning

Version your data and results alongside your code.

  • Data processing pipelines
  • Intermediate results
  • Final outputs
  • Configuration files
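Because large raw files usually should not live in Git itself, a lightweight sketch (using the directory layout shown later in this guide) is to commit checksums and documentation so the exact inputs can always be verified:

# Record checksums of the raw data so the exact inputs can be verified later
md5sum data/raw/*.csv > data/raw/checksums.md5

# Commit the checksum file, the data documentation, and the processing scripts;
# the large raw files themselves can be excluded from Git via .gitignore
git add data/raw/checksums.md5 data/README.md scripts/
git commit -m "Record raw data checksums and document processing steps"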

3. Environment Management

🐍 Python Environments

Use conda or virtual environments to isolate dependencies.

  • Create environment.yml
  • Pin package versions
  • Export environment
  • Use containers
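For example, a minimal environment.yml might look like the following (channels and package versions are illustrative; written here with a shell heredoc so it can be pasted directly into a terminal):

# Write a minimal environment.yml (package versions are illustrative)
cat > environment.yml <<'EOF'
name: myproject
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - numpy=1.26
  - pandas=2.2
EOF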

📦 R Environments

Use renv to manage R package versions.

  • Initialize renv
  • Snapshot dependencies
  • Restore environments
  • Lock file versions

Introduction to Git/GitHub

Git is a distributed version control system that tracks changes to files and directories through a series of commits, letting developers branch and merge code safely while maintaining a full history. GitHub is a cloud-based hosting service for Git repositories that adds a user-friendly web interface, issue tracking, pull requests for code review, and integrations with CI/CD and project management tools. Together, they let teams collaborate on the same codebase concurrently, manage versions, share work publicly or privately, and streamline development and deployment workflows.
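A minimal sketch of the typical GitHub collaboration loop looks like this (the repository URL and branch name are placeholders):

# Clone an existing repository from GitHub (placeholder URL)
git clone https://github.com/your-username/project_name.git
cd project_name

# Create a branch, commit your work, and push it to GitHub
git checkout -b feature/update-docs
git add README.md
git commit -m "Update README with setup instructions"
git push -u origin feature/update-docs

# Then open a pull request on GitHub so the change can be reviewed and merged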

Video: What is Git? Explained in 2 Minutes! (despite the title, the video runs about 3 minutes)

Tutorial: Software Carpentry's Version Control with Git


Practical Implementation

1. Project Structure

project_name/
├── README.md              # Project description and setup
├── requirements.txt       # Python dependencies
├── renv.lock              # R package versions
├── data/                  # Raw and processed data
│   ├── raw/               # Original data files
│   ├── processed/         # Cleaned/processed data
│   └── README.md          # Data documentation
├── scripts/               # Analysis scripts
│   ├── 01_data_cleaning.R
│   ├── 02_analysis.R
│   └── 03_visualization.R
├── results/               # Output files
│   ├── figures/           # Plots and graphs
│   ├── tables/            # Data tables
│   └── README.md          # Results documentation
├── docs/                  # Documentation
│   ├── methods.md         # Methodology
│   └── results.md         # Results summary
└── .gitignore             # Files to exclude from version control
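A structure like this can be scaffolded in a couple of shell commands (names follow the layout above):

# Create the directory skeleton shown above
mkdir -p project_name/{data/{raw,processed},scripts,results/{figures,tables},docs}

# Start the top-level documentation files
touch project_name/README.md project_name/data/README.md project_name/results/README.md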

2. Git Workflow

# Initialize repository
git init
git add .
git commit -m "Initial commit: project setup"

# Make changes and commit
git add scripts/analysis.R
git commit -m "Add differential expression analysis"

# Create branch for experiment
git checkout -b experiment/new_method
# ... make changes ...
git add .
git commit -m "Implement new analysis method"
git checkout main
git merge experiment/new_method

# Tag important versions
git tag -a v1.0.0 -m "First stable release"

3. Environment Management

Python (conda)

# Create environment
conda env create -f environment.yml

# Activate environment
conda activate myproject

# Export environment
conda env export > environment.yml
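One caveat: a plain conda env export records every transitive dependency with platform-specific build strings. If you want a leaner, more portable file, conda can export only the packages you explicitly requested:

# Export only the explicitly requested packages, keeping environment.yml portable
conda env export --from-history > environment.yml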

R (renv)

# Initialize renv
renv::init()

# Install packages (Bioconductor packages such as DESeq2 use the bioc:: prefix)
renv::install(c("tidyverse", "bioc::DESeq2"))

# Snapshot current state
renv::snapshot()

# Restore environment
renv::restore()

Best Practices Checklist

Before Starting Analysis

  • Project setup: Create organized directory structure
  • Version control: Initialize Git repository
  • Environment: Set up isolated computing environment
  • Documentation: Create README and project description

During Analysis

  • Code comments: Add clear comments explaining logic
  • Regular commits: Commit changes frequently with descriptive messages
  • Data tracking: Document data sources and processing steps
  • Parameter logging: Record all analysis parameters

After Analysis

  • Results documentation: Document all outputs and interpretations
  • Environment export: Export environment specifications
  • Code review: Review and clean up code
  • Archive data: Store raw data and results securely

Tools and Resources

Apptainer / Singularity

The container runtime that allows you to run the workflow in a self-contained environment. We use it to run a Docker container image.

Conda

The package manager that operates inside the container to manage the specific software packages required by the Snakemake rules.

Snakemake

The workflow management system that orchestrates the analysis, using Conda to create reproducible environments for each step.

The official Snakemake documentation is the most important page to visit when learning how to execute, debug, and configure your workflow. While the official tutorial is a great start, we recommend the Snakemake lesson from The Carpentries for its detailed, bioinformatics-focused approach. For more advanced needs, or when information isn't in the main documentation, the Snakemake API reference is useful.
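As a sketch of how the three tools fit together on the command line (exact flag names can vary between Snakemake versions, and this assumes a Snakefile whose rules declare conda environments and a container image):

# Run the workflow with 4 cores, letting Snakemake build each rule's conda
# environment and execute the rules inside the declared Apptainer/Singularity container
snakemake --cores 4 --use-conda --use-singularity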



Common Pitfalls

1. Hard-coded Paths

Bad:

data <- read.csv("C:/Users/me/Documents/project/data.csv")

Good:

data <- read.csv(file.path("data", "raw", "data.csv"))

2. Missing Dependencies

Bad:

# Install packages manually without recording versions
install.packages("tidyverse")

Good:

# Use renv to manage package versions
renv::install("tidyverse")
renv::snapshot()

3. Incomplete Documentation

Bad:

# No explanation of what this analysis does or which parameters were used
res <- results(DESeq(dds))

Good:

# Run DESeq2 differential expression analysis (design = ~condition was set
# when the DESeqDataSet was created) and report results at an FDR cutoff of 0.05
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05)

Next Steps

After implementing these practices:

  1. Start small: Begin with one project and expand gradually
  2. Use templates: Create templates for common project types
  3. Automate: Use scripts to automate repetitive tasks (see the sketch after this list)
  4. Share: Share your reproducible workflows with others
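A minimal sketch of such a driver script, assuming the numbered R scripts from the project structure above (the file name run_all.sh is arbitrary):

#!/usr/bin/env bash
# run_all.sh - rerun the full analysis from raw data to figures
set -euo pipefail

# Run each stage in order; any failure stops the pipeline immediately
Rscript scripts/01_data_cleaning.R
Rscript scripts/02_analysis.R
Rscript scripts/03_visualization.R

echo "Analysis complete: see results/ for outputs"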

For workflow-specific reproducibility, see the Configuration Guide.


References

  1. Software Carpentry. About us. https://software-carpentry.org/about-us/
  2. Wilson, G., et al. (2017). Good enough practices in scientific computing. PLoS Comput Biol, 13(6), e1005510.
  3. Sandve, G. K., et al. (2013). Ten simple rules for reproducible computational research. PLoS Comput Biol, 9(10), e1003505.