Benchmark & Evaluation Datasets

Curated eval datasets, released to benchmark AI models rather than to train them. They are collected here, separate from the per-species data pages, because their primary use is downstream model evaluation and because they pair with the AI Evaluation & Benchmarking deep-dive page and column in Papers.md. Each entry is the canonical home of the data (the questions, scenarios, spectra, or sequences that models are evaluated against), with the bundled scoring code noted inline where present. Live leaderboards and results trackers are catalogued separately in Databases.md.

For fixed train-on data artifacts organized by species, see the per-species pages indexed in README.md.

BioML-bench

The official BioML-bench release at science-machine/biomlbench on GitHub: a benchmark suite that evaluates LLM agents on end-to-end biomedical machine-learning tasks spanning protein engineering, drug discovery, single-cell omics, medical imaging, and clinical biomarkers. Built on top of MLE-bench, it bundles the task datasets, the preparation pipeline, and the scoring/grading harness in one repository; agents read each task description, analyse the biomedical data, and implement a complete ML solution that is scored against built-in human baselines. Companion to Papers.md ref #225 (Miller et al. 2025). The closest existing eval substrate for testing whether an autonomous agent can carry out the kind of end-to-end biomedical-ML work (single-cell omics, protein engineering) that cell-ag pipelines increasingly delegate.

BioMysteryBench

Anthropic/BioMysteryBench-full is Anthropic’s open-source benchmark dataset for evaluating LLM capabilities on bioinformatics research tasks, released as the substrate for the study “Evaluating Claude’s bioinformatics research capabilities”. It gives teams a vendor-released, cell-ag-adjacent eval dataset for comparing frontier LLMs on the kind of multi-step bioinformatics tasks (data exploration, hypothesis generation, code-based analysis) increasingly delegated to agents in cell-ag bioinformatics workflows. No separate companion paper at time of curation; see the AI Evaluation & Benchmarking deep-dive for context.

BixBench

The Hugging Face Datasets release of FutureHouse’s 205-question benchmark derived from 60 real-world, published Jupyter notebooks — the bundled-data side of BixBench. The companion scoring framework lives at Future-House/BixBench on GitHub and authenticates with Hugging Face to pull this dataset at evaluation time; this Hugging Face record is the canonical source. Companion to Papers.md ref #108 (Mitchener et al. 2025). The closest existing eval substrate for measuring whether agents can plan and execute multi-step bioinformatics workflows of the kind cell-ag teams increasingly delegate to LLM agents.

BLADE

The official BLADE benchmark release at behavioral-data/BLADE on GitHub: 14 datasets (the 12 that the paper refers to as the canonical “BLADE datasets”, plus held-out additions), expert-annotated ground-truth decision sequences in blade_bench/datasets/, and an evaluation harness (run_get_eval.py, load_ground_truth()), all bundled into one repository. Tasks evaluate language-model agents on open-ended data analysis where the “correct” answer is a defensible analytical path, not a single value (Gu et al. 2025, EMNLP Findings). Companion to Papers.md ref #147; project showcase at blade-bench.github.io. Pairs with the same authors’ CHI 2024 studies (Papers.md refs #151, #152) on how human analysts actually use AI assistants for analysis.

CompBioBench v1

Genentech’s 100-task benchmark for evaluating agentic systems on verifiable computational-biology problems — questions span genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and machine-learning workflows. Each question’s ground truth is pinned by synthetic/augmented data or by metadata scrambling/scrubbing of real datasets, sidestepping the noisy-and-open-to-interpretation problem that has historically blocked single-answer biology benchmarks. The canonical Hugging Face Dataset (Genentech/compbiobench-data-v1, CC-BY-4.0) is the data distribution; rows carry question_id, domain, question_style, skills_tested, question, internet_required, gpu_preferred, and file_paths, with the associated bioinformatics artifacts (BAM, FASTQ, H5AD, MTX, TSV/CSV, tar.gz) downloaded alongside. Mirrored on Zenodo; the live results tracker is the CompBioBench v1 Leaderboard catalogued in Databases.md. Companion to Papers.md ref #150 (Nair et al. 2026). One of the cleanest existing eval substrates for any cell-ag team auditing whether an LLM agent can actually deliver on the multi-step bioinformatics work — single-cell, transcriptomics, epigenomics — that increasingly underpins media-response analysis and livestock-cell lineage characterization.
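
The row schema above lends itself to simple pre-filtering before an evaluation run. The following is a minimal sketch, assuming each row is available as a plain Python dict with the listed fields and that internet_required and gpu_preferred are booleans. The sample rows, their values, and the helper name select_offline_cpu_tasks are hypothetical and not taken from the dataset.

```python
from typing import Any, Dict, List

# Hypothetical rows that mimic the documented columns; every value is made up.
SAMPLE_ROWS: List[Dict[str, Any]] = [
    {
        "question_id": "q001",
        "domain": "single-cell analysis",
        "question_style": "numeric",
        "skills_tested": ["clustering"],
        "question": "How many clusters are present in the provided H5AD file?",
        "internet_required": False,
        "gpu_preferred": False,
        "file_paths": ["q001/counts.h5ad"],
    },
    {
        "question_id": "q002",
        "domain": "genomics",
        "question_style": "categorical",
        "skills_tested": ["alignment"],
        "question": "Which reference assembly were these reads aligned to?",
        "internet_required": True,
        "gpu_preferred": False,
        "file_paths": ["q002/reads.bam"],
    },
]


def select_offline_cpu_tasks(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep tasks that need neither internet access nor a GPU."""
    return [
        row
        for row in rows
        if not row["internet_required"] and not row["gpu_preferred"]
    ]


if __name__ == "__main__":
    for row in select_offline_cpu_tasks(SAMPLE_ROWS):
        print(row["question_id"], row["domain"], row["file_paths"])
```

A filter like this is useful when the agent under test runs in a sandbox without network access, since it makes explicit which tasks were excluded from the reported score.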

FLIP (Fitness Landscape Inference for Proteins)

The official FLIP benchmark release at J-SNACKKB/FLIP on GitHub — a suite of protein function-prediction tasks with curated train/test splits over three experimental landscapes: adeno-associated virus (AAV) capsid stability, protein G domain B1 (GB1) stability and immunoglobulin binding, and thermostability across multiple protein families. The splits are designed to probe how well protein representation and language models generalize in the low-resource and extrapolative regimes that matter for protein engineering (Dallago et al. 2021, bioRxiv, 10.1101/2021.11.09.467890). The repository bundles the split datasets with the baselines and scoring code. For cellular agriculture it is a directly relevant eval substrate for protein-engineering work — growth factors, scaffold and ECM proteins, enzymes — where a model must predict the fitness effect of sequence variants from limited assay data.
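
One way to build an extrapolative split of the kind described above is to train on variants close to the wild type and test on more distant ones. The sketch below illustrates that idea under stated assumptions and is not FLIP's own code: the wild-type string, the variants, the fitness values, and both function names are hypothetical.

```python
from typing import List, Tuple

Variant = Tuple[str, float]  # (sequence, measured fitness)


def count_mutations(wild_type: str, variant: str) -> int:
    """Number of positions where the variant differs from the wild type."""
    if len(wild_type) != len(variant):
        raise ValueError("this sketch only handles substitutions, not indels")
    return sum(a != b for a, b in zip(wild_type, variant))


def split_by_mutation_count(
    wild_type: str, variants: List[Variant], max_train_mutations: int
) -> Tuple[List[Variant], List[Variant]]:
    """Train on variants with few mutations, test on the more distant ones."""
    train: List[Variant] = []
    test: List[Variant] = []
    for sequence, fitness in variants:
        n_mut = count_mutations(wild_type, sequence)
        (train if n_mut <= max_train_mutations else test).append((sequence, fitness))
    return train, test


if __name__ == "__main__":
    wild_type = "MKTA"  # hypothetical toy sequence
    variants = [("MKTA", 1.0), ("MKTV", 0.8), ("MRTV", 0.3), ("ARSV", 0.1)]
    train, test = split_by_mutation_count(wild_type, variants, max_train_mutations=1)
    print("train:", train)
    print("test:", test)
```

Holding out the higher-order mutants forces a model to extrapolate beyond the mutational distance it saw in training, which is the regime a protein-engineering campaign usually cares about.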

FrontierScience

OpenAI’s benchmark for expert-level scientific reasoning, released as the Hugging Face dataset openai/frontierscience. It has two tracks: an Olympiad track of IPhO/IChO/IBO-level problems, and a Research track of PhD-level, open-ended problems representative of sub-tasks in scientific research — built to resist the saturation that has flattened earlier multiple-choice science benchmarks. Companion to Papers.md ref #159 (Wang et al. 2026). For cell-ag teams, a frontier check on whether a general-purpose model has the scientific-reasoning depth to be trusted on novel experimental design and analysis rather than recall.

GeneBench-Pro

A benchmark from OpenAI researchers (Li & Ho) for AI agents performing realistic multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine — 129 evaluations spanning 10 primary domains and 21 terminal subdomains with a genomics-centered core, each problem framed as an end-to-end analysis (QC and exploratory data analysis → modeling and estimation → diagnostics → a downstream go/no-go scientific decision) rather than a single-answer question. An expanded, harder successor to GeneBench. The public release — the Hugging Face dataset ajh-oai/genebench-pro-public-package — contains ten open-source problems bundling solver-facing prompts, staged data files, grading/configuration files, and per-problem report PDFs documenting each problem’s construction, validation evidence, and grading contract; a further 50 problems are held out with Artificial Analysis for independent third-party benchmarking and the remainder are retained as an internal holdout (so the eval as a whole is not fully open). Companion to Papers.md ref #241 (Li & Ho 2026). For cell-ag teams, a frontier probe of whether an LLM agent can be trusted with the multi-stage quantitative-biology reasoning — QC, modeling, diagnostics, decision — that increasingly underpins omics-driven media-response and lineage analysis.

GPQA

The “Google-proof” graduate-level Q&A benchmark: 448 hard multiple-choice questions written and validated by domain experts in biology, physics, and chemistry, where PhD-level experts reach ~65% accuracy and skilled non-experts only ~34% even with unrestricted web access. Distributed as a gated Hugging Face dataset (Idavidrein/gpqa); the bundled scoring code and the diamond/main/extended splits live in the GitHub repository idavidrein/gpqa. Companion to Papers.md ref #156 (Rein et al. 2023). The biology subset is a fast expert-level competence probe for any LLM a cell-ag team intends to lean on for literature reasoning.

Humanity’s Last Exam

A multi-modal benchmark of 2,500 questions at the frontier of human knowledge across dozens of subjects, assembled by the Center for AI Safety and Scale AI as a closed-ended academic benchmark hard enough to outlast MMLU-style saturation. The data lives at the Hugging Face dataset cais/hle, the scoring code at centerforaisafety/hle, and the live results tracker is the Humanity’s Last Exam Leaderboard catalogued in Databases.md. Companion to Papers.md ref #158 (Phan et al. 2026). A ceiling-level check on frontier-model reasoning before a cell-ag team trusts one with open-ended scientific work.

iUmami-SCM

The data and scoring-card model behind iUmami-SCM, a sequence-based umami-peptide predictor built on a scoring card method with dipeptide propensity scores (Charoenkwan et al. 2020, Journal of Chemical Information and Modeling, 10.1021/acs.jcim.0c00707). The Shoombuatong/Dataset-Code/iUmami folder bundles the benchmark sequence sets, the UMP-TR training set and the UMP-IND independent test set of umami versus non-umami peptides, alongside the scoring-card implementation (SCM.zip) and comparison ML-classifier code, so the eval data and its bundled scoring code ship together. A compact, labeled eval substrate for taste-peptide prediction, directly relevant to sensory-prediction and flavor-optimization models for cultivated meat, where umami peptides are a primary driver of savory taste. Companion to Papers.md ref #269 (Charoenkwan et al. 2020).
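
The scoring-card idea can be sketched in a few lines: a peptide's score is the composition-weighted sum of per-dipeptide propensity scores, and the peptide is called umami when that score clears a threshold. This is a simplified illustration, not the code shipped in SCM.zip. The propensity values, the threshold, and the function names are hypothetical; the real method derives its propensities from the training set.

```python
from collections import Counter
from typing import Dict

# Hypothetical propensity scores; real values are learned from training data.
HYPOTHETICAL_PROPENSITY: Dict[str, float] = {"EE": 800.0, "ED": 600.0, "LL": 200.0}


def dipeptide_composition(peptide: str) -> Dict[str, float]:
    """Fraction of each overlapping dipeptide in the sequence."""
    pairs = [peptide[i : i + 2] for i in range(len(peptide) - 1)]
    if not pairs:
        raise ValueError("peptide must contain at least two residues")
    counts = Counter(pairs)
    return {dipeptide: n / len(pairs) for dipeptide, n in counts.items()}


def scm_score(peptide: str, propensity: Dict[str, float], default: float = 0.0) -> float:
    """Composition-weighted sum of dipeptide propensity scores."""
    composition = dipeptide_composition(peptide)
    return sum(
        fraction * propensity.get(dipeptide, default)
        for dipeptide, fraction in composition.items()
    )


def predict_umami(peptide: str, propensity: Dict[str, float], threshold: float) -> bool:
    """Classify as umami when the score reaches the (assumed) threshold."""
    return scm_score(peptide, propensity) >= threshold


if __name__ == "__main__":
    for peptide in ("EED", "LLL"):
        score = scm_score(peptide, HYPOTHETICAL_PROPENSITY)
        print(peptide, score, predict_umami(peptide, HYPOTHETICAL_PROPENSITY, 500.0))
```

Because the model is a lookup table plus a threshold, each prediction can be traced back to the dipeptides that drove it, which is the main appeal of a scoring card over a black-box classifier.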

LAB-Bench

The Hugging Face Datasets release of the LAB-Bench eval suite from FutureHouse — multiple-choice questions across eight task categories spanning literature reading, figure interpretation, protocol design, DNA/protein sequence manipulation, cloning, and database access. The bundled scoring code and prompts live at Future-House/lab-bench; this Hugging Face dataset is the canonical data distribution. Companion to Papers.md ref #146 (Laurent et al. 2024). The single broadest practical-biology eval dataset at time of curation, useful both directly for benchmarking and as a template for cell-ag-specific eval datasets the field will need as livestock-cell-focused agents emerge.

MassSpecGym

The canonical Hugging Face distribution of MassSpecGym from the Pluskal lab (Bushuiev et al. 2024, NeurIPS Datasets & Benchmarks Spotlight) — 231 k MS/MS records with full schema (precursor m/z, intensities, SMILES, InChIKey, formula, adduct, instrument type, collision energy, fold splits, simulation-challenge flag). Distributed alongside the canonical scoring code at pluskal-lab/MassSpecGym, with v1.5 adding RDKit-canonical SMILES, MGF export, retrieval candidate pools (mass- and formula-filtered), and pretraining-molecule companion datasets (2.5 M and 50 M SMILES filtered to < 0.7 Tanimoto similarity vs. the test set). Companion to Papers.md ref #109. The cleanest existing substrate for ML-based flavor-compound identification in cell-ag sensomics workflows; pairs with GNPS and the spectral-library entries in Databases.md.
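
The similarity cut-off on the pretraining molecules can be illustrated with a small leakage filter. The sketch below assumes each molecule is already represented as a set of "on" fingerprint bits (in practice these would come from a cheminformatics toolkit), and it keeps only candidates whose Tanimoto similarity to every test molecule stays below 0.7. The fingerprints, molecule names, and function names are hypothetical, and this is not the MassSpecGym pipeline itself.

```python
from typing import Dict, FrozenSet, Iterable

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared bits divided by bits set in either."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def filter_pretraining_pool(
    candidates: Dict[str, Fingerprint],
    test_fingerprints: Iterable[Fingerprint],
    max_similarity: float = 0.7,
) -> Dict[str, Fingerprint]:
    """Drop candidates that are too similar to any test-set molecule."""
    test_list = list(test_fingerprints)
    return {
        name: fp
        for name, fp in candidates.items()
        if all(tanimoto(fp, test_fp) < max_similarity for test_fp in test_list)
    }


if __name__ == "__main__":
    test_set = [frozenset({1, 2, 3, 4})]
    pool = {
        "near_duplicate": frozenset({1, 2, 3, 4, 5}),  # similarity 0.8, removed
        "unrelated": frozenset({1, 9, 10}),  # similarity 1/6, kept
    }
    print(sorted(filter_pretraining_pool(pool, test_set)))
```

Filtering against the test set rather than the training set is the point of the exercise: it keeps pretraining from quietly exposing the model to close analogues of the molecules it will later be scored on.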

MMLU-Pro

A reasoning-focused successor to MMLU — ten-option (rather than four-option) questions across 14 disciplines, curated to drop noisy items and reward multi-step reasoning over knowledge recall so that frontier models no longer plateau near the ceiling. The data is the Hugging Face dataset TIGER-Lab/MMLU-Pro, with bundled evaluation code at TIGER-AI-Lab/MMLU-Pro and a live MMLU-Pro Leaderboard catalogued in Databases.md. Companion to Papers.md ref #157 (Wang et al. 2024). A general reasoning-capability baseline for sizing up models before cell-ag-specific evaluation.
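
Part of what the wider answer set buys is a lower guessing floor, which a short calculation makes concrete. The sketch below assumes every question has the same number of options and that a guesser picks uniformly at random; the function names and the sample predictions are hypothetical.

```python
from typing import Sequence


def chance_accuracy(n_options: int) -> float:
    """Expected accuracy of uniform random guessing over n_options choices."""
    if n_options < 1:
        raise ValueError("n_options must be positive")
    return 1.0 / n_options


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that match the answer key exactly."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    print("four-option floor:", chance_accuracy(4))
    print("ten-option floor:", chance_accuracy(10))
    # Hypothetical model outputs against a hypothetical answer key.
    print("accuracy:", accuracy(["A", "C", "J"], ["A", "B", "J"]))
```

Under these assumptions the floor drops from 25% to 10%, so the same raw score reflects more than guessing on the ten-option format.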

ProteinGym

The ProteinGym benchmark suite (Notin et al. 2023, NeurIPS Datasets & Benchmarks Track) — 200+ deep mutational scanning (DMS) assays plus clinical-variant classification tasks covering substitutions and indels, ~2.5 M mutated sequences total. The canonical project home at proteingym.org distributes versioned data archives (v1.3 current at time of curation) and the live substitution and indel leaderboards (catalogued in Databases.md). The bundled scoring code, baselines, and model-specific implementations live at OATML-Markslab/ProteinGym; mirror Hugging Face dataset at OATML-Markslab/ProteinGym_v1; archival releases on Zenodo. Companion to Papers.md ref #148. The dominant variant-effect benchmark in the field — directly relevant to any cell-ag protein-engineering work (growth factors, scaffolds, recombinant ECM proteins) that uses a protein language model.

SciHorizon-GENE

The Hugging Face Datasets release of SciHorizon-GENE from Huang et al. 2026 (CNIC-DSL group) — an LLM benchmark of 540K+ gene-centric questions constructed from authoritative biological databases, covering 190K+ human genes, designed to test “understanding-to-reasoning” inference of biological function from gene-level knowledge. MIT-licensed, English. The companion evaluation framework with bundled scoring code lives at CNIC-DSL/SciHorizonGene — the GitHub URL listed on the HF card itself (XiaohanHwang/SciHorizonGene) currently 404s, so use the CNIC-DSL repo instead. The project page is at horizon.scidb.cn. Companion to Papers.md ref #164 (Huang et al. 2026, arXiv; accepted SIGKDD 2026). For cell-ag teams the dataset is human-gene-focused, but functions as a proximate substrate for cross-species transfer — the same gene-knowledge-to-function reasoning that becomes the livestock task once cross-species fine-tuning of biomedical LLMs matures.

SWE-bench

An evaluation dataset of 2,294 software-engineering tasks drawn from real resolved GitHub issues and their pull requests across 12 popular Python repositories — each task asks a model to produce a patch that makes the project’s own tests pass. The data is the Hugging Face dataset princeton-nlp/SWE-bench, the harness is at SWE-bench/SWE-bench, and the live SWE-bench Leaderboard is catalogued in Databases.md. Companion to Papers.md ref #155 (Jimenez et al. 2024). The standard measure of whether a coding agent can be trusted with the bioinformatics-pipeline and analysis-code maintenance that cell-ag teams increasingly delegate.
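
The pass/fail decision for a single task can be sketched as a check over recorded test outcomes. The sketch assumes each task lists the tests that must flip from failing to passing and the tests that must keep passing after the patch is applied. The function name, the status strings, and the sample data are hypothetical, and the real harness also has to apply the patch and run the project's tests, which this sketch leaves out.

```python
from typing import Dict, Iterable


def is_resolved(
    test_status: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """True when every required test is recorded as passed after the patch."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results counts as not passed.
    return all(test_status.get(test) == "passed" for test in required)


if __name__ == "__main__":
    # Hypothetical post-patch results for one task.
    status = {
        "tests/test_io.py::test_reads_gzip": "passed",
        "tests/test_io.py::test_reads_plain": "passed",
        "tests/test_cli.py::test_help": "failed",
    }
    print(
        is_resolved(
            status,
            fail_to_pass=["tests/test_io.py::test_reads_gzip"],
            pass_to_pass=["tests/test_io.py::test_reads_plain"],
        )
    )
```

Requiring the previously passing tests to stay green is what separates a real fix from a patch that satisfies the target test by breaking something else.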

Linked external resources are independent of TUCCA and Tufts University and remain under their own licenses.