Reproducibility in Scientific Computing
Reproducibility is the cornerstone of scientific research. This guide covers essential practices for ensuring your computational analyses can be reproduced by others, including yourself in the future.
Why Reproducibility Matters
- Scientific integrity: Results can be verified and validated
- Collaboration: Others can build on your work
- Career advancement: Reproducible research is highly valued
- Self-preservation: You can reproduce your own results months later
Good Enough Practices in Scientific Computing
We highly recommend reading Good enough practices in scientific computing by Wilson et al. (2017). The first author, Greg Wilson, co-founded Software Carpentry, an educational resource for researchers that "develops and teaches workshops on the fundamental programming skills needed to conduct research. [Software Carpentry's] mission is to provide researchers high-quality, domain-specific training covering all aspects of research software engineering." (Software Carpentry, "About us")
Alternatively, a self-led lesson inspired by and based on the paper is available on The Carpentries Lab (a collection of peer-reviewed lessons from the Carpentries community). The lesson requires no software beyond a web browser. It is mostly discussion-based and uses examples of data organization as starting points for the discussion.
Core Principles
1. Document Everything
📝 Code Documentation
Document your code with clear comments and README files.
- Function descriptions
- Parameter explanations
- Usage examples
- Dependencies list
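As a minimal sketch of these elements in Python, the function below carries its description, parameter explanations, and a usage example in the docstring, and the module docstring lists dependencies. The function name `normalize_counts` and its behavior are hypothetical, chosen only for illustration.

```python
"""Example analysis helpers. Dependencies: Python standard library only."""


def normalize_counts(counts, total=1_000_000):
    """Scale raw counts so that they sum to `total`.

    Hypothetical example function; the name and behavior are illustrative.

    Parameters
    ----------
    counts : list of float
        Raw, non-negative counts for one sample.
    total : float, optional
        Target sum after scaling (default: one million).

    Returns
    -------
    list of float
        Scaled counts in the same order as the input.

    Example
    -------
    >>> normalize_counts([1, 3], total=8)
    [2.0, 6.0]
    """
    count_sum = sum(counts)
    if count_sum == 0:
        raise ValueError("counts must not sum to zero")
    return [c * total / count_sum for c in counts]
```

Keeping the usage example inside the docstring means it travels with the code and can be checked automatically with the standard `doctest` module.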
🔬 Experimental Design
Document your experimental setup and methodology.
- Sample information
- Data collection methods
- Analysis parameters
- Quality control steps
2. Version Control Everything
📚 Code Versioning
Use Git to track changes in your code and analysis scripts.
- Track all changes
- Create meaningful commits
- Use branches for experiments
- Tag important versions
📊 Data Versioning
Version your data and results alongside your code.
- Data processing pipelines
- Intermediate results
- Final outputs
- Configuration files
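Large data files are often impractical to store in Git directly. One possible complement, sketched below, is to commit a small checksum manifest so that any change to the data is detectable. The function names `sha256_of_file` and `build_manifest` are assumptions made for this example, not part of any tool named in this guide.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file under data_dir (relative POSIX path) to its SHA-256 digest."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

The resulting dictionary can be written to a JSON file and committed alongside the code; rebuilding the manifest later and comparing it with the committed copy shows whether any data file has changed.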
3. Environment Management
🐍 Python Environments
Use conda or virtual environments to isolate dependencies.
- Create environment.yml
- Pin package versions
- Export environment
- Use containers
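To illustrate what "pin package versions" means in practice, the sketch below flags lines in a pip-style requirements file that lack an exact `==` pin. It is deliberately simplistic (it does not handle URLs containing `#`, for instance), and the function name `unpinned_requirements` and the example package versions are assumptions for illustration only.

```python
def unpinned_requirements(text):
    """Return requirement lines that lack an exact `==` version pin."""
    unpinned = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Illustrative file content; the packages and versions are examples only.
example = """\
numpy==1.26.4
pandas>=2.0   # a lower bound, not a pin
scipy
"""
print(unpinned_requirements(example))  # ['pandas>=2.0', 'scipy']
```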
📦 R Environments
Use renv to manage R package versions.
- Initialize renv
- Snapshot dependencies
- Restore environments
- Lock file versions
Introduction to Git/GitHub
Git is a distributed version control system that tracks changes to files and directories through a series of commits, letting developers branch and merge code safely while maintaining a full history. GitHub is a cloud-based hosting service for Git repositories that adds a user-friendly web interface, issue tracking, pull requests for code review, and integrations with CI/CD and project management tools. Together, they let teams collaborate on the same codebase concurrently, manage versions, share work publicly or privately, and streamline development and deployment workflows.
What is Git? Explained in 2 Minutes!
Note: the video's thumbnail is inaccurate; the video itself is only 3 minutes long.
Software Carpentry's Version Control with Git Tutorial
Practical Implementation
1. Project Structure
project_name/
├── README.md              # Project description and setup
├── requirements.txt       # Python dependencies
├── renv.lock              # R package versions
├── data/                  # Raw and processed data
│   ├── raw/               # Original data files
│   ├── processed/         # Cleaned/processed data
│   └── README.md          # Data documentation
├── scripts/               # Analysis scripts
│   ├── 01_data_cleaning.R
│   ├── 02_analysis.R
│   └── 03_visualization.R
├── results/               # Output files
│   ├── figures/           # Plots and graphs
│   ├── tables/            # Data tables
│   └── README.md          # Results documentation
├── docs/                  # Documentation
│   ├── methods.md         # Methodology
│   └── results.md         # Results summary
└── .gitignore             # Files to exclude from version control
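The layout above can be scaffolded with a short script so that every new project starts from the same skeleton. This is a minimal sketch: the function name `create_project` is an assumption, and it creates only the directories and README placeholders, leaving scripts, lock files, and `.gitignore` to be added by hand.

```python
from pathlib import Path

# Directories taken from the layout above.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "scripts",
    "results/figures",
    "results/tables",
    "docs",
]
# Documentation placeholders taken from the layout above.
README_FILES = ["README.md", "data/README.md", "results/README.md"]


def create_project(root):
    """Create the directory skeleton under root without overwriting files."""
    root = Path(root)
    for directory in PROJECT_DIRS:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for readme in README_FILES:
        (root / readme).touch(exist_ok=True)
    return root
```

Calling `create_project("project_name")` is safe to repeat: existing directories and files are left untouched.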
2. Git Workflow
# Initialize repository (-b sets the initial branch name; requires Git 2.28+)
git init -b main
git add .
git commit -m "Initial commit: project setup"
# Make changes and commit
git add scripts/02_analysis.R
git commit -m "Add differential expression analysis"
# Create branch for experiment
git checkout -b experiment/new_method
# ... make changes ...
git add .
git commit -m "Implement new analysis method"
git checkout main
git merge experiment/new_method
# Tag important versions
git tag -a v1.0.0 -m "First stable release"
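Commits and tags are most useful for reproducibility when each result can be traced back to the exact commit that produced it. One way to do this, sketched here under the assumption that the analysis runs inside the Git repository, is to record the current commit hash from within the analysis script. The function name `current_commit` is hypothetical.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the hash of the checked-out commit, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or repo_dir is not a repository with commits.
        return None
    return result.stdout.strip()
```

The returned hash can be written into a log file or output header next to the results, so that a figure or table can later be matched to the code version that generated it.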
3. Environment Management
Python (conda)
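# Create environment from environment.yml
conda env create -f environment.yml
# Activate environment (the name is set by the "name:" field in environment.yml; "myproject" here)
conda activate myproject
# Export the active environment with exact versions (overwrites environment.yml)
conda env export > environment.yml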
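In addition to exporting the environment with conda, an analysis script can record the versions it actually ran with. The sketch below uses only the Python standard library; the function name `environment_snapshot` and the structure of the returned dictionary are assumptions made for this example.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect the Python version, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be redirected to a file.
    print(json.dumps(environment_snapshot(), indent=2))
```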
R (renv)
# Initialize renv
renv::init()
# Install packages (the bioc:: prefix installs DESeq2 from Bioconductor rather than CRAN)
renv::install(c("tidyverse", "bioc::DESeq2"))
# Snapshot current state
renv::snapshot()
# Restore environment
renv::restore()
Best Practices Checklist
Before Starting Analysis
- Project setup: Create organized directory structure
- Version control: Initialize Git repository
- Environment: Set up isolated computing environment
- Documentation: Create README and project description
During Analysis
- Code comments: Add clear comments explaining logic
- Regular commits: Commit changes frequently with descriptive messages
- Data tracking: Document data sources and processing steps
- Parameter logging: Record all analysis parameters
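Parameter logging can be as simple as writing the parameters to a file next to the results. The following is a minimal sketch; the function name `log_parameters`, the JSON layout, and the example parameter names are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_parameters(params, out_path):
    """Write analysis parameters plus a UTC timestamp to a JSON file."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "parameters": params,
    }
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record
```

For example, `log_parameters({"alpha": 0.05, "design": "~condition"}, "results/params.json")` stores the settings of a run in the `results/` directory, where they stay together with the outputs they produced.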
After Analysis
- Results documentation: Document all outputs and interpretations
- Environment export: Export environment specifications
- Code review: Review and clean up code
- Archive data: Store raw data and results securely
Tools and Resources
Snakemake
Snakemake is the workflow management system that orchestrates the analysis; it uses Conda to create a reproducible environment for each step.
The Snakemake documentation is the most important resource to consult when learning how to execute, debug, and configure your workflow.
While the official tutorial is a great start, we recommend the one from The Carpentries for its detailed, bioinformatics-focused approach.
For more advanced needs, or when information isn't in the main docs, the API reference is useful.
Version Control
- Git: git-scm.com
- GitHub: github.com
- GitLab: gitlab.com
Environment Management
- Conda: docs.conda.io
- Docker: docker.com
- Singularity: sylabs.io
Documentation
- R Markdown: rmarkdown.rstudio.com
- Jupyter Notebooks: jupyter.org
- Quarto: quarto.org
Learning Resources
- Software Carpentry: software-carpentry.org
- Data Carpentry: datacarpentry.org
- The Carpentries: carpentries.org
Common Pitfalls
1. Hard-coded Paths
❌ Bad:
data <- read.csv("C:/Users/me/Documents/project/data.csv")
✅ Good:
data <- read.csv(file.path("data", "raw", "data.csv"))
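The same principle applies in Python, where `pathlib` builds paths from components instead of hard-coding an absolute, machine-specific string. In this sketch the helper name `data_path` is hypothetical, and the script is assumed to run from the project root.

```python
from pathlib import Path


def data_path(*parts, project_root="."):
    """Build a path below project_root from individual components."""
    return Path(project_root).joinpath(*parts)


raw_file = data_path("data", "raw", "data.csv")
print(raw_file.as_posix())  # data/raw/data.csv
```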
2. Missing Dependencies
❌ Bad:
# Install packages manually without recording versions
install.packages("tidyverse")
✅ Good:
# Use renv to manage package versions
renv::install("tidyverse")
renv::snapshot()
3. Incomplete Documentation
❌ Bad:
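# No explanation of what this does
res <- results(DESeq(dds))
✅ Good:
# Run DESeq2 differential expression analysis
# Parameters: design = ~condition, alpha = 0.05
# (counts and coldata are placeholder names for the count matrix and sample table)
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~condition)
res <- results(DESeq(dds), alpha = 0.05)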
Next Steps
After implementing these practices:
- Start small: Begin with one project and expand gradually
- Use templates: Create templates for common project types
- Automate: Use scripts to automate repetitive tasks
- Share: Share your reproducible workflows with others
For workflow-specific reproducibility, see the Configuration Guide.
References
- Software Carpentry. About us. https://software-carpentry.org/about-us/
- Wilson, G., et al. (2017). Good enough practices in scientific computing. PLoS Comput Biol, 13(6), e1005510.
- Sandve, G. K., et al. (2013). Ten simple rules for reproducible computational research. PLoS Comput Biol, 9(10), e1003285.