AI Evaluation & Benchmarking

This page describes the AI Evaluation & Benchmarking column of the Papers.md matrix: peer-reviewed and preprint research whose primary contribution is an evaluation dataset, benchmark suite, leaderboard, or evaluation methodology for AI systems applied to biology and the broader natural sciences. The column is a sibling to AI Tooling / Methodology and separates two questions that had been getting tangled: what AI tools and agents have been built (AI Tooling) versus how well they actually work and how that is measured (this column).

CAAIL’s curatorial stance is that evaluation is now the limiting reagent for AI-for-biology research. Foundation models, agent frameworks, and tool-augmented LLMs are released faster than the cell-ag community can assess them; without standardized benchmarks it is impossible to tell which model genuinely advances the state of a task and which one merely looks good in a vendor demo. The papers in this column are the substrate any cell-ag team should consult before deciding to deploy a new model in media optimization, cellular engineering, bioprocess control, scaffolding, sensory prediction, or any of the other applied columns.

The boundary between this column and AI Tooling is pragmatic: a paper that introduces a benchmark and uses it to score a new model goes here (the benchmark is the durable contribution); a paper that introduces a model and uses an existing benchmark to evaluate it goes in AI Tooling or the appropriate applied column. Evaluation methodology papers — work on verifier reliability, hypothesis-testing protocols, and meta-evaluation — also live here.

Benchmarks & Evaluation Frameworks

Comparison

Benchmark	Year	Domain	Task style	Scope	Companion
#146 LAB-Bench (Laurent et al. 2024, FutureHouse)	2024	Biology research	Practical lab tasks (literature, figure reading, protocols, DNA/protein sequence ops, cloning, database access)	2,400+ multiple-choice questions across 8 task types	Datasets/Benchmarks.md
#108 BixBench (Mitchener et al. 2025, FutureHouse)	2025	Computational biology	Multi-step bioinformatics workflows derived from peer-reviewed papers and their code/data	50+ real-world scenarios	Datasets/Benchmarks.md
#147 BLADE (Gu et al. 2025)	2025	Data-driven science	Open-ended data-analysis tasks requiring decision-making across analytical paths	14 datasets, expert decision sequences	Datasets/Benchmarks.md
#149 Duan et al. 2025	2025	Systems biology	LLM-driven dry-lab experimentation in mechanistic ODE models — design, prediction, parameter inference	Multiple systems-bio model suites	—
#109 MassSpecGym (Bushuiev et al. 2024)	2024	Mass spectrometry	MS/MS-spectra → molecular structure (de novo + database retrieval + classification)	231 k records, NeurIPS 2024 D&B Spotlight	Datasets/Benchmarks.md
#110 PubChem LLM Retrieval Eval (Sze & Hassoun 2024)	2024	Chemistry retrieval	Search-enabled LLMs retrieving structured records from PubChem	Multi-prompt eval over PubChem	—
#148 ProteinGym (Notin et al. 2023, Marks lab)	2023	Protein design	Variant-effect prediction (DMS substitutions + indels) and clinical-variant classification	200+ DMS assays, ~2.5 M mutated sequences	Datasets/Benchmarks.md · Databases.md leaderboard
#127 CausalBench (Chevalley et al. 2025)	2025	Single-cell perturbation	Network inference from single-cell perturbation data (framework; data sourced externally)	Large-scale CRISPR-perturbation benchmark	Software.md
#89 AssayBench (Brouwer et al. 2026)	2026	Virtual cell	Assay-level virtual-cell predictions evaluated against held-out assays	Cross-assay benchmark for LLMs and agents	—
#113 / #114 single-cell FM critique (Boiarsky et al. 2024 + Yang et al. 2024 reply)	2024	Single-cell foundation models	Deeper evaluation of scFM predictions vs. simple baselines	Critical eval + reply	—
#55 E-valuator (Sadhuka et al. 2025)	2025	Verifier reliability	Sequential hypothesis testing for reliable agent verifiers (eval methodology, not a benchmark)	Methodology paper	—
BioMysteryBench (Anthropic)	2026	Bioinformatics research	Vendor-released eval dataset for frontier-LLM bioinformatics capability	HuggingFace dataset, no companion paper	Datasets/Benchmarks.md
#150 CompBioBench (Nair et al. 2026, Genentech)	2026	Computational biology	Well-scoped, verifiable agentic tasks across genomics, transcriptomics, epigenomics, single-cell, human genetics, and ML workflows	100 expert-curated questions with synthetic/scrambled ground truth	Datasets/Benchmarks.md · Databases.md leaderboard
#225 BioML-bench (Miller et al. 2025)	2025	Biomedical ML	End-to-end ML solutions auto-graded against human leaderboards (Docker-runnable task capsules, built on MLE-bench)	4 domains — protein engineering, single-cell omics, biomedical imaging, drug discovery	Datasets/Benchmarks.md

Multi-task agent benchmarks

Benchmarks that evaluate agents on broad, multi-step scientific tasks rather than a single predictive sub-task. These are the closest existing eval suites for measuring whether an LLM agent can plan and execute the kind of workflow a cell-ag bioinformatician would delegate to it.

#146 LAB-Bench (Laurent et al. 2024, FutureHouse) — 2,400+ MCQs across literature reading, figure interpretation, protocol design, DNA/protein sequence manipulation, cloning, and database access. The single broadest practical-biology eval at time of curation, and the FutureHouse counterpart to #108 BixBench. Data + bundled scoring code at Datasets/Benchmarks.md / LAB-Bench.
#108 BixBench (Mitchener et al. 2025, FutureHouse) — 205 questions derived from 60 real-world published Jupyter notebooks; complements LAB-Bench’s atomized question format with full multi-step task execution. Data + bundled scoring code at Datasets/Benchmarks.md / BixBench.
#147 BLADE (Gu et al. 2025, EMNLP Findings) — evaluates agents on open-ended data analysis where the “correct” answer is a defensible analytical path, not a single value; built around 14 datasets and expert-annotated ground-truth decision sequences. Data + bundled scoring code at Datasets/Benchmarks.md / BLADE.
#149 Duan et al. 2025 (Maddison group) — uses mechanistic ODE-based systems-biology models as a dry-lab substrate: the LLM proposes experiments, observes simulated outcomes, and updates its understanding, putting iterative scientific reasoning rather than single-shot QA under test.
BioMysteryBench (Anthropic) — vendor-released bioinformatics eval dataset, distributed via Datasets/Benchmarks.md / BioMysteryBench and published alongside Anthropic’s “Evaluating Claude’s bioinformatics research capabilities” study. No companion peer-reviewed paper at time of curation, but useful as a reference point when comparing frontier LLMs on the multi-step bioinformatics tasks cell-ag agents increasingly handle.
#150 CompBioBench (Nair et al. 2026, Genentech) — 100 expert-curated questions across genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and ML workflows, with ground truth pinned via synthetic/augmented data or metadata-scrambling of real datasets to give each task a single verifiable answer. Agents are evaluated end-to-end from a bare environment, fetching data and tools as needed. Data + bundled bioinformatics artifacts at Datasets/Benchmarks.md / CompBioBench v1; live results at Databases.md / CompBioBench v1 Leaderboard.
#225 BioML-bench (Miller et al. 2025) — the first end-to-end agent benchmark for biomedical machine learning itself: built on MLE-bench, it packages tasks across protein engineering, single-cell omics, biomedical imaging, and drug discovery as Docker-runnable capsules auto-graded against expert human leaderboards. Evaluating four open-source agents (the specialists STELLA and Biomni, the generalists AIDE and MLAgentBench), it found that all underperform human baselines on average, that biomedical specialization gives no consistent advantage, and that agents using more diverse ML strategies (feature engineering, model stacking) score highest — evidence that agent scaffolding matters more than domain tuning. Data + harness at Datasets/Benchmarks.md / BioML-bench.
#53 ARIEL (Liu et al. 2026) — an open evaluation framework for AI research assistants, pairing expert-annotated biomedical datasets (a 2,571-manuscript summarization set and a 163-question figure-understanding set) with a dual protocol: seven computational metrics (including BLEU, ROUGE, BERTScore, F1-RadGraph, and MEDCON) plus blinded PhD-level human scoring across five dimensions. It benchmarks open and closed models (GPT-4o, Gemini-1.5, Claude-3.5, Qwen-VL, LLaVA, and others) rather than shipping a single model, making it an evaluation-first contribution that measures exactly the literature- and figure-comprehension skills a cell-ag research assistant needs.
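Overlap metrics such as ROUGE, listed among ARIEL's computational metrics above, compare a generated summary with a reference at the token level. The following is a minimal sketch of a ROUGE-1-style unigram F1, assuming plain whitespace tokenization and lowercasing; the function name `unigram_f1` is hypothetical, and this is not ARIEL's scoring code, which may tokenize and aggregate differently.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall.

    Assumption: whitespace tokenization and lowercasing; real ROUGE
    implementations typically add stemming and other normalization.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Multiset intersection clips each token's count to the smaller side.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "the cells grew fast"
    reference = "the cells grew slowly"
    print(unigram_f1(generated, reference))
```

Three of the four tokens overlap in the example, so precision and recall are both 0.75. A lexical score like this rewards shared wording rather than factual agreement, which is one reason the framework pairs automated metrics with blinded human scoring.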

General-purpose frontier benchmarks (capability context)

None of the benchmarks here is biology-specific, but they are the reference points the field uses to gauge a model’s raw capability, and a cell-ag team evaluating a general-purpose model will meet all of them.

#155 SWE-bench (Jimenez et al. 2024) — scores whether an agent can resolve real GitHub issues by editing a codebase until the project’s own tests pass (2,294 tasks across 12 Python repositories). It is the standard measure of whether a coding agent can maintain the bioinformatics pipelines a cell-ag lab delegates to it. Data and harness at Datasets/Benchmarks.md / SWE-bench; live SWE-bench Leaderboard.
#156 GPQA (Rein et al. 2023) — 448 “Google-proof” graduate-level questions in biology, physics, and chemistry, on which PhD-level experts reach ~65% and skilled non-experts only ~34% even with web access. Data at Datasets/Benchmarks.md / GPQA.
#157 MMLU-Pro (Wang et al. 2024) — a harder, ten-option successor to MMLU across 14 disciplines. Data at Datasets/Benchmarks.md / MMLU-Pro; live MMLU-Pro Leaderboard.
#158 Humanity’s Last Exam (Phan et al. 2026, Nature) — a 2,500-question multi-modal frontier-knowledge benchmark on which even the best models scored in the single-to-low-double digits at release, with calibration error above 70%. The key lesson for cell-ag is that a model can be confidently wrong, so an autonomous loop must not treat the model’s confidence as a reliability signal. Data at Datasets/Benchmarks.md / Humanity’s Last Exam; live Leaderboard.
#159 FrontierScience (Wang et al. 2026, OpenAI) — pairs an olympiad track with a PhD-level open-ended Research track scored by a rubric that evaluates the solving process rather than a single answer, a methodology directly relevant to grading the open-ended research tasks a cell-ag agent would face. Data at Datasets/Benchmarks.md / FrontierScience.
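The calibration point made for Humanity's Last Exam can be made concrete. Below is a minimal sketch of a binned expected calibration error (ECE), assuming the common equal-width-bin definition; the benchmark's own calibration metric may be defined differently, and the function name and toy inputs are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: stated probabilities in [0, 1], one per answer.
    correct:     booleans, True where the answer was right.
    Assumption: equal-width bins over [0, 1]; other binning schemes exist.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy case: the model claims 90% confidence but is right once in ten.
    stated = [0.9] * 10
    outcomes = [True] + [False] * 9
    print(expected_calibration_error(stated, outcomes))
```

In the toy case the gap between stated confidence (0.9) and observed accuracy (0.1) gives an ECE of about 0.8, the kind of mismatch that makes raw model confidence unsafe as a gate in an autonomous loop.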

Domain-specific predictive benchmarks

Benchmarks targeting a single predictive task within a domain — what a model must output is well-defined, so model performance is directly comparable across releases.

#148 ProteinGym (Notin et al. 2023, Marks lab) — the dominant variant-effect benchmark; 200+ deep mutational scanning assays plus clinical variant tasks across substitutions and indels. Directly relevant to any cell-ag protein-engineering work (growth factors, scaffolds, recombinant ECM proteins) that uses a protein language model. Data + bundled scoring code at Datasets/Benchmarks.md / ProteinGym; live substitution + indel leaderboards at Databases.md / ProteinGym Leaderboard.
#109 MassSpecGym (Bushuiev et al. 2024, Pluskal lab; NeurIPS 2024 Datasets & Benchmarks Spotlight) — MS/MS → structure benchmark with de novo, database-retrieval, and chemical-class prediction sub-tasks. The cleanest existing substrate for ML-based flavor-compound identification in cell-ag sensomics workflows. Data + bundled scoring code at Datasets/Benchmarks.md / MassSpecGym.
#110 PubChem LLM Retrieval Eval (Sze & Hassoun 2024, Tufts) — evaluates how reliably search-enabled LLMs pull structured chemical metadata from PubChem; relevant whenever an agent must verify a compound identifier before passing it downstream into a media-formulation, scaffold-chemistry, or flavor-prediction workflow.
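For variant-effect benchmarks such as ProteinGym, a model's predicted scores are typically compared with measured assay values by rank correlation. The sketch below assumes Spearman's rank correlation as the metric and uses only the standard library; it is not ProteinGym's bundled scoring code, and the function names and toy values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(predicted), average_ranks(measured))


if __name__ == "__main__":
    # Hypothetical model scores and assay fitness values for four variants.
    model_scores = [-2.1, -0.4, 0.3, 1.7]
    assay_fitness = [0.05, 0.40, 0.35, 0.90]
    print(spearman(model_scores, assay_fitness))
```

Because only the ordering matters, a model can score well without its outputs being on the same scale as the assay, which suits comparisons across assays with different readouts.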

Cell-state and virtual-cell benchmarks

Benchmarks specific to the foundation-model and virtual-cell research programs that increasingly dominate AI-for-biology. Their cell-ag relevance is direct: livestock cell biology is starting to ride the same single-cell foundation-model wave, and the benchmark debates here will set the bar for what counts as a useful livestock-cell model.

#113 Boiarsky et al. 2024 — “Deeper evaluation of a single-cell foundation model” (Nature Machine Intelligence) — shows that several headline scFM benchmark wins do not survive comparison against simpler baselines.
#114 Yang et al. 2024 — reply from the original scFM authors. The Boiarsky/Yang pair is required reading for any team deciding whether to invest in scFMs vs. classical methods for livestock-cell tasks. The underlying scFM (scBERT) is catalogued at Software.md / scBERT.
#127 CausalBench (Chevalley et al. 2025) — large-scale benchmark framework for network inference from single-cell perturbation data; the closest cell-ag-relevant eval suite for any model that promises to infer regulatory networks driving lineage commitment or response to media perturbations. Framework code at Software.md / CausalBench; the benchmark sources data externally (Replogle et al. genome-wide Perturb-seq) rather than bundling its own dataset, so no Datasets/ entry.
#89 AssayBench (Brouwer et al. 2026) — assay-level virtual-cell benchmark for LLMs and agents; complements the cell-state benchmarks above by testing whether models can predict assay outcomes rather than gene-expression vectors.
#164 SciHorizon-GENE (Huang et al. 2026) — a gene-centric benchmark of more than 540,000 questions spanning over 190,000 human genes, testing an LLM’s gene-to-function reasoning (cell-type annotation, functional interpretation, mechanism) along four axes: research-attention sensitivity, hallucination tendency, knowledge completeness, and literature influence. It is human-gene-focused, but probes exactly the gene-knowledge inference a livestock cell-state model would need once cross-species fine-tuning of biomedical LLMs matures; data + scoring at Datasets/Benchmarks.md / SciHorizon-GENE.
#57 State / Cell-Eval (Adduri et al. 2025, Arc Institute) — primarily a perturbation-prediction model (described under Cellular Engineering), cross-listed here because it ships Cell-Eval, a separately released suite of evaluation metrics for predicted single-cell perturbation responses (a perturbation-discrimination score, log-fold-change correlation, and DE-gene precision/recall), applied to large Perturb-seq datasets. It is a reusable evaluation framework for the perturbation-prediction models that virtual-cell and cell-engineering work increasingly depends on — a metric suite and protocol rather than an expert-annotated held-out benchmark.
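Two of the metric families named for Cell-Eval, log-fold-change correlation and DE-gene precision/recall, can be illustrated in a few lines. This is a sketch under stated assumptions: Pearson correlation over shared genes, DE genes supplied as precomputed sets, and hypothetical function and gene names. It does not reproduce the Cell-Eval package's API or its exact definitions.

```python
import math


def de_precision_recall(predicted_de, true_de):
    """Precision and recall of predicted differentially expressed genes."""
    predicted, truth = set(predicted_de), set(true_de)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


def logfc_correlation(predicted_lfc, observed_lfc):
    """Pearson correlation of log fold changes over genes present in both.

    Both arguments map gene name -> log fold change versus control.
    """
    genes = sorted(set(predicted_lfc) & set(observed_lfc))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    x = [predicted_lfc[g] for g in genes]
    y = [observed_lfc[g] for g in genes]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical gene names and values for one perturbation.
    predicted = {"GENE_A": 1.2, "GENE_B": -0.8, "GENE_C": 0.1}
    observed = {"GENE_A": 1.0, "GENE_B": -1.1, "GENE_C": 0.3}
    print(logfc_correlation(predicted, observed))
    print(de_precision_recall({"GENE_A", "GENE_B", "GENE_C"}, {"GENE_A", "GENE_B"}))
```

Reporting both views is useful because a model can track the direction of expression changes well while still over-calling or missing the genes that pass a differential-expression threshold.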

Evaluation methodology & reliability

The cluster’s “meta” papers — work on how evaluation itself is done, rather than a benchmark for a specific task.

#55 E-valuator (Sadhuka et al. 2025) — sequential hypothesis testing for reliable agent verifiers; the closest existing answer to “how confident should I be that this verifier’s pass/fail signal is real?” for any cell-ag team building an autonomous loop on top of an LLM agent.
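To give a feel for what sequential hypothesis testing means for a verifier, the sketch below implements Wald's classic sequential probability ratio test (SPRT) on a stream of agreement checks between a verifier's verdict and a trusted label. This is a generic textbook procedure chosen for illustration, not the E-valuator method itself; the function name, the hypothesized agreement rates `p0` and `p1`, and the error rates are all assumptions.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for a Bernoulli rate: H0 (rate = p0) vs H1 (rate = p1).

    observations: iterable of booleans, True when the verifier's verdict
                  agreed with a trusted reference label.
    Returns (decision, n_used), where decision is "accept_h1", "accept_h0",
    or "continue" if the evidence so far crosses neither threshold.
    """
    if not 0 < p0 < p1 < 1:
        raise ValueError("require 0 < p0 < p1 < 1")
    upper = math.log((1 - beta) / alpha)   # crossing it supports H1
    lower = math.log(beta / (1 - alpha))   # crossing it supports H0
    gain_agree = math.log(p1 / p0)
    gain_disagree = math.log((1 - p1) / (1 - p0))

    log_likelihood_ratio = 0.0
    n_used = 0
    for n_used, agreed in enumerate(observations, start=1):
        log_likelihood_ratio += gain_agree if agreed else gain_disagree
        if log_likelihood_ratio >= upper:
            return "accept_h1", n_used
        if log_likelihood_ratio <= lower:
            return "accept_h0", n_used
    return "continue", n_used


if __name__ == "__main__":
    # Hypothetical audit: is the verifier right ~90% of the time (H1)
    # or only ~70% of the time (H0)?
    audit = [True] * 20
    print(sprt_bernoulli(audit, p0=0.7, p1=0.9))
```

With these settings, twelve consecutive agreements are enough to accept H1, while three consecutive disagreements are enough to accept H0. The appeal of a sequential test is that auditing stops as soon as the evidence is decisive, which matters when each trusted label requires expert time or a wet-lab check.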

Linked external resources are independent of TUCCA and Tufts University and remain under their own licenses.