Version: 0.9.1

Reproducibility in Scientific Computing

Reproducibility is the cornerstone of scientific research. This guide covers essential practices for ensuring your computational analyses can be reproduced by others, including yourself in the future.

Why Reproducibility Matters

Scientific integrity: Results can be verified and validated
Collaboration: Others can build on your work
Career advancement: Reproducible research is highly valued
Self-preservation: You can reproduce your own results months later

Good Enough Practices in Scientific Computing

We highly recommend reading "Good enough practices in scientific computing" by Wilson et al. The first author, Greg Wilson, was a co-founder of Software Carpentry, an educational resource for researchers that "develops and teaches workshops on the fundamental programming skills needed to conduct research. [Software Carpentry's] mission is to provide researchers high-quality, domain-specific training covering all aspects of research software engineering." [1]

Alternatively, The Carpentries Lab (a collection of peer-reviewed lessons from the Carpentries community) hosts a self-led lesson based on the paper. The lesson needs no software beyond a web browser. It is mostly discussion-based and uses examples of data organization as discussion material.

Core Principles

1. Document Everything

Code documentation: Document your code with clear comments and README files.

Function descriptions
Parameter explanations
Usage examples
Dependencies list
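As a minimal sketch of what the four items above can look like for a single function, the Python example below puts the description, parameter explanations, a usage example, and the dependencies into one docstring. The function counts_per_million is a hypothetical illustration and is not part of this guide's workflow.

```python
from typing import List, Sequence


def counts_per_million(counts: Sequence[float]) -> List[float]:
    """Scale raw read counts to counts per million (CPM).

    Hypothetical example function, used only to show docstring layout.

    Parameters
    ----------
    counts
        Raw read counts for one sample, one value per gene.

    Returns
    -------
    List[float]
        CPM values in the same order as ``counts``.

    Raises
    ------
    ValueError
        If the counts sum to zero, because CPM is then undefined.

    Examples
    --------
    >>> counts_per_million([1, 1, 2])
    [250000.0, 250000.0, 500000.0]

    Dependencies: Python standard library only.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts sum to zero; CPM is undefined")
    return [c / total * 1_000_000 for c in counts]
```

Keeping the usage example in doctest form means it can be checked automatically with python -m doctest, so the documentation is less likely to drift from the code.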

Experimental design: Document your experimental setup and methodology.

Sample information
Data collection methods
Analysis parameters
Quality control steps

2. Version Control Everything

Code versioning: Use Git to track changes in your code and analysis scripts.

Track all changes
Create meaningful commits
Use branches for experiments
Tag important versions

Data versioning: Version your data and results alongside your code.

Data processing pipelines
Intermediate results
Final outputs
Configuration files
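Large data files are often impractical to commit directly. One possible complement, sketched below under that assumption, is to commit a small manifest of SHA-256 checksums so that any change to a data file becomes visible in version control. The function names and the manifest format are illustrative choices, not something this guide prescribes.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest_path: Path) -> Dict[str, str]:
    """Record a checksum for every file under data_dir in a JSON manifest."""
    checksums = {
        path.relative_to(data_dir).as_posix(): sha256_of(path)
        for path in sorted(data_dir.rglob("*"))
        # Skip directories and the manifest itself if it lives in data_dir.
        if path.is_file() and path.resolve() != manifest_path.resolve()
    }
    manifest_path.write_text(
        json.dumps(checksums, indent=2, sort_keys=True) + "\n"
    )
    return checksums
```

For example, write_manifest(Path("data"), Path("data/MANIFEST.json")) would cover both the raw and processed files, and re-running it later shows in git diff which files changed.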

3. Environment Management

Python environments: Use conda or virtual environments to isolate dependencies.

Create environment.yml
Pin package versions
Export environment
Use containers

R environments: Use renv to manage R package versions.

Initialize renv
Snapshot dependencies
Restore environments
Lock file versions

Introduction to Git/GitHub

Git is a distributed version control system that tracks changes to files and directories through a series of commits, letting developers branch and merge code safely while maintaining a full history. GitHub is a cloud-based hosting service for Git repositories that adds a user-friendly web interface, issue tracking, pull requests for code review, and integrations with CI/CD and project management tools. Together, they let teams collaborate on the same codebase concurrently, manage versions, share work publicly or privately, and streamline development and deployment workflows.

What is Git? Explained in 2 Minutes!

Note: The thumbnail for this video is wrong. The video is only 3 minutes long.

Software Carpentry's Version Control with Git Tutorial

Practical Implementation

1. Project Structure

project_name/
├── README.md              # Project description and setup
├── requirements.txt       # Python dependencies
├── renv.lock             # R package versions
├── data/                 # Raw and processed data
│   ├── raw/             # Original data files
│   ├── processed/       # Cleaned/processed data
│   └── README.md        # Data documentation
├── scripts/              # Analysis scripts
│   ├── 01_data_cleaning.R
│   ├── 02_analysis.R
│   └── 03_visualization.R
├── results/              # Output files
│   ├── figures/         # Plots and graphs
│   ├── tables/          # Data tables
│   └── README.md        # Results documentation
├── docs/                 # Documentation
│   ├── methods.md       # Methodology
│   └── results.md       # Results summary
└── .gitignore           # Files to exclude from version control
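The layout above can be created by hand, but a short script makes it repeatable across projects. The following is a minimal sketch: the directory and file names are copied from the tree above, while the function name scaffold is a hypothetical choice. The dependency files (requirements.txt, renv.lock) and the analysis scripts are left out because they are produced by their own tools or written by hand.

```python
from pathlib import Path

# Directories taken from the layout above.
DIRECTORIES = [
    "data/raw",
    "data/processed",
    "scripts",
    "results/figures",
    "results/tables",
    "docs",
]

# Placeholder files from the layout above; created empty if missing.
FILES = [
    "README.md",
    ".gitignore",
    "data/README.md",
    "results/README.md",
    "docs/methods.md",
    "docs/results.md",
]


def scaffold(root: Path) -> None:
    """Create the project skeleton under root, keeping existing contents."""
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates the file if needed and never truncates it.
        (root / name).touch(exist_ok=True)
```

Calling scaffold(Path("project_name")) is safe to repeat, since existing directories and file contents are left in place.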

2. Git Workflow

# Initialize repository with "main" as the initial branch (requires Git 2.28+)
git init -b main
git add .
git commit -m "Initial commit: project setup"

# Make changes and commit
git add scripts/02_analysis.R
git commit -m "Add differential expression analysis"

# Create branch for experiment
git checkout -b experiment/new_method
# ... make changes ...
git add .
git commit -m "Implement new analysis method"
git checkout main
git merge experiment/new_method

# Tag important versions
git tag -a v1.0.0 -m "First stable release"
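Commits and tags are most useful for reproducibility when each result can be traced back to the code version that produced it. As a hedged sketch of one way to do that from a Python analysis script, the helper below asks Git for the current commit hash; the function name current_commit is an assumption of this example, and the only Git command used is git rev-parse HEAD.

```python
import subprocess
from typing import Optional


def current_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the full hash of HEAD, or None if it cannot be determined."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, repo_dir does not exist, or repo_dir is
        # not inside a repository with at least one commit.
        return None
    return completed.stdout.strip()
```

The returned hash can be written next to each output file, for example in a results README, so that a figure can later be matched to the exact commit that generated it.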

3. Environment Management

Python (conda)

# Create environment
conda env create -f environment.yml

# Activate environment
conda activate myproject

# Export environment
conda env export > environment.yml
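For Python projects that do not use conda, pip freeze is the usual way to record installed versions in requirements.txt. The sketch below shows the same idea from inside Python using the standard library's importlib.metadata; the function names are hypothetical, and the output is a simplified list of name==version pins rather than an exact copy of what pip freeze prints.

```python
from importlib.metadata import distributions
from pathlib import Path
from typing import List


def pinned_requirements() -> List[str]:
    """Return sorted 'name==version' lines for installed distributions."""
    pins = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip installs with incomplete metadata
            pins[name.lower()] = f"{name}=={version}"
    return [pins[key] for key in sorted(pins)]


def write_requirements(path: Path) -> None:
    """Write the pinned requirements to path, one package per line."""
    path.write_text("\n".join(pinned_requirements()) + "\n")
```

Running write_requirements(Path("requirements.txt")) inside the activated environment captures the versions that were actually used, which is the information someone else needs to rebuild it.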

R (renv)

# Initialize renv
renv::init()

# Install packages (the bioc:: prefix installs DESeq2 from Bioconductor, not CRAN)
renv::install(c("tidyverse", "bioc::DESeq2"))

# Snapshot current state
renv::snapshot()

# Restore environment
renv::restore()

Best Practices Checklist

Before Starting Analysis

Project setup: Create organized directory structure
Version control: Initialize Git repository
Environment: Set up isolated computing environment
Documentation: Create README and project description

During Analysis

Code comments: Add clear comments explaining logic
Regular commits: Commit changes frequently with descriptive messages
Data tracking: Document data sources and processing steps
Parameter logging: Record all analysis parameters
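Parameter logging in particular is easy to automate. The following is a minimal sketch, assuming a Python analysis script and a JSON log file: it writes the parameters together with a timestamp and basic platform information. The function name log_parameters and the field names in the record are illustrative assumptions.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def log_parameters(params: dict, log_path: Path) -> dict:
    """Write analysis parameters plus basic run context to a JSON file."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": params,
    }
    # Create the output directory if this is the first run.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    log_path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return record
```

A call such as log_parameters({"design": "~condition", "alpha": 0.05}, Path("results/run_parameters.json")) leaves a record beside the outputs, so the settings behind a result do not have to be reconstructed from memory.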

After Analysis

Results documentation: Document all outputs and interpretations
Environment export: Export environment specifications
Code review: Review and clean up code
Archive data: Store raw data and results securely

Tools and Resources

Apptainer / Singularity: The container runtime that allows you to run the workflow in a self-contained environment. We use it to run a Docker container image.

Conda: The package manager that operates inside the container to manage the specific software packages required by the Snakemake rules.

Snakemake: The workflow management system that orchestrates the analysis, using Conda to create reproducible environments for each step.

This is the most important page to visit when learning how to execute, debug, and configure your workflow.

While the official tutorial is a great start, we recommend the one from The Carpentries for its detailed, bioinformatics-focused approach.

For more advanced needs, or when information isn't in the main docs, the API reference is useful.

Version Control

Git: git-scm.com
GitHub: github.com
GitLab: gitlab.com

Environment Management

Conda: docs.conda.io
Docker: docker.com
Singularity: sylabs.io

Documentation

R Markdown: rmarkdown.rstudio.com
Jupyter Notebooks: jupyter.org
Quarto: quarto.org

Learning Resources

Software Carpentry: software-carpentry.org
Data Carpentry: datacarpentry.org
The Carpentries: carpentries.org

Common Pitfalls

1. Hard-coded Paths

❌ Bad:

data <- read.csv("C:/Users/me/Documents/project/data.csv")

✅ Good:

data <- read.csv(file.path("data", "raw", "data.csv"))
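The same pitfall applies to Python scripts in a project. As a small sketch of the equivalent fix, the example below builds paths from relative components with pathlib; the helper name project_path is hypothetical, and the default root assumes the script is run from the project's top-level directory.

```python
from pathlib import Path
from typing import Optional


def project_path(*parts: str, root: Optional[Path] = None) -> Path:
    """Build a path inside the project from relative components.

    root defaults to the current working directory, which this sketch
    assumes is the project root.
    """
    base = Path.cwd() if root is None else root
    return base.joinpath(*parts)


# Same file as in the R example, with no user-specific prefix.
raw_data = project_path("data", "raw", "data.csv")
```

Because the path is assembled from components, the script behaves the same on any machine and operating system, provided it is started from the project root.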

2. Missing Dependencies

❌ Bad:

# Install packages manually without recording versions
install.packages("tidyverse")

✅ Good:

# Use renv to manage package versions
renv::install("tidyverse")
renv::snapshot()

3. Incomplete Documentation

❌ Bad:

# No explanation of what this does
res <- results(DESeq(dds))

✅ Good:

# Run DESeq2 differential expression analysis on dds,
# a DESeqDataSet created with design = ~condition
dds <- DESeq(dds)
# Extract results at a significance threshold of alpha = 0.05
res <- results(dds, alpha = 0.05)

Next Steps

After implementing these practices:

Start small: Begin with one project and expand gradually
Use templates: Create templates for common project types
Automate: Use scripts to automate repetitive tasks
Share: Share your reproducible workflows with others

For workflow-specific reproducibility, see the Configuration Guide.

References

[1] About us. Software Carpentry. https://software-carpentry.org/about-us/
Wilson, G., et al. (2017). Good enough practices in scientific computing. PLoS Comput Biol, 13(6), e1005510.
Sandve, G. K., et al. (2013). Ten simple rules for reproducible computational research. PLoS Comput Biol, 9(10), e1003285.
