CAAIL

Benchmark & Evaluation Datasets

Curated evaluation datasets, released to benchmark AI models rather than to train them. They are collected here, separate from the per-species data pages, because their primary use is downstream model evaluation and because they pair with the AI Evaluation & Benchmarking deep-dive page and column in Papers.md. Each entry points to the canonical home of the data (the questions, scenarios, spectra, or sequences that models are evaluated against), with any bundled scoring code noted inline. Live leaderboards and results trackers are catalogued separately in Databases.md.

For fixed training-data artifacts organized by species, see the per-species pages indexed in README.md.

BioMysteryBench

Anthropic/BioMysteryBench-full is Anthropic’s open-source benchmark dataset for evaluating LLM capabilities on bioinformatics research tasks, released as the substrate for the Evaluating Claude’s bioinformatics research capabilities study. It gives any team comparing frontier LLMs a vendor-released, cell-ag-adjacent eval dataset covering the kind of multi-step bioinformatics tasks (data exploration, hypothesis generation, code-based analysis) increasingly delegated to agents in cell-ag bioinformatics workflows. No separate companion paper existed at time of curation; see the AI Evaluation & Benchmarking deep-dive for context.

BixBench

The Hugging Face Datasets release of FutureHouse’s 205-question benchmark derived from 60 real-world, published Jupyter notebooks — the bundled-data side of BixBench. The companion scoring framework lives at Future-House/BixBench on GitHub and authenticates with Hugging Face to pull this dataset at evaluation time; this Hugging Face record is the canonical source. Companion to Papers.md ref #108 (Mitchener et al. 2025). The closest existing eval substrate for measuring whether agents can plan and execute multi-step bioinformatics workflows of the kind cell-ag teams increasingly delegate to LLM agents.

BLADE

The official BLADE benchmark release at behavioral-data/BLADE on GitHub bundles three things into one repository: 14 datasets (the 12 canonical “BLADE datasets” referenced in the paper plus held-out additions), expert-annotated ground-truth decision sequences in blade_bench/datasets/, and an evaluation harness (run_get_eval.py, load_ground_truth()). Tasks evaluate language-model agents on open-ended data analysis where the “correct” answer is a defensible analytical path, not a single value (Gu et al. 2025, EMNLP Findings). Companion to Papers.md ref #147; project showcase at blade-bench.github.io. Pairs with the same authors’ CHI 2024 studies (Papers.md refs #151, #152) on how human analysts actually use AI assistants for analysis.

CompBioBench v1

Genentech’s 100-task benchmark for evaluating agentic systems on verifiable computational-biology problems — questions span genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and machine-learning workflows. Each question’s ground truth is pinned by synthetic/augmented data or by metadata scrambling/scrubbing of real datasets, sidestepping the noisy-and-open-to-interpretation problem that has historically blocked single-answer biology benchmarks. The canonical Hugging Face Dataset (Genentech/compbiobench-data-v1, CC-BY-4.0) is the data distribution; rows carry question_id, domain, question_style, skills_tested, question, internet_required, gpu_preferred, and file_paths, with the associated bioinformatics artifacts (BAM, FASTQ, H5AD, MTX, TSV/CSV, tar.gz) downloaded alongside. Mirrored on Zenodo; the live results tracker is the CompBioBench v1 Leaderboard catalogued in Databases.md. Companion to Papers.md ref #150 (Nair et al. 2026). One of the cleanest existing eval substrates for any cell-ag team auditing whether an LLM agent can actually deliver on the multi-step bioinformatics work — single-cell, transcriptomics, epigenomics — that increasingly underpins media-response analysis and livestock-cell lineage characterization.
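As a minimal sketch of how the row schema above might be used to pick a runnable subset, the following filters for tasks that need neither internet access nor a GPU. The two example rows and all their values are invented for illustration, and treating internet_required and gpu_preferred as booleans is an assumption; check the dataset card for the actual types.

```python
from collections import Counter

# Hypothetical rows that reuse the column names listed above.
# Every value here is invented for illustration only.
rows = [
    {
        "question_id": "demo-001",
        "domain": "genomics",
        "question_style": "numeric",
        "skills_tested": ["alignment statistics"],
        "question": "placeholder question text",
        "internet_required": False,
        "gpu_preferred": False,
        "file_paths": ["demo-001/reads.bam"],
    },
    {
        "question_id": "demo-002",
        "domain": "single-cell",
        "question_style": "categorical",
        "skills_tested": ["clustering"],
        "question": "placeholder question text",
        "internet_required": True,
        "gpu_preferred": True,
        "file_paths": ["demo-002/counts.h5ad"],
    },
]


def offline_cpu_tasks(rows):
    """Keep tasks that need neither internet access nor a GPU."""
    return [
        r for r in rows
        if not r["internet_required"] and not r["gpu_preferred"]
    ]


subset = offline_cpu_tasks(rows)
per_domain = Counter(r["domain"] for r in subset)
print([r["question_id"] for r in subset], dict(per_domain))
```

A filter like this is mainly useful for sandboxed agent runs, where network access and accelerators are often unavailable.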

FrontierScience

OpenAI’s benchmark for expert-level scientific reasoning, released as the Hugging Face dataset openai/frontierscience. It has two tracks: an Olympiad track of IPhO/IChO/IBO-level problems, and a Research track of PhD-level, open-ended problems representative of sub-tasks in scientific research — built to resist the saturation that has flattened earlier multiple-choice science benchmarks. Companion to Papers.md ref #159 (Wang et al. 2026). For cell-ag teams, a frontier check on whether a general-purpose model has the scientific-reasoning depth to be trusted on novel experimental design and analysis rather than recall.

GPQA

The “Google-proof” graduate-level Q&A benchmark — 448 hard multiple-choice questions written and validated by domain experts in biology, physics, and chemistry, where PhD-level experts reach ~65% accuracy and skilled non-experts only ~34% even with unrestricted web access. Distributed as a gated Hugging Face dataset (Idavidrein/gpqa); the bundled scoring code and the diamond/main/extended splits live at idavidrein/gpqa. Companion to Papers.md ref #156 (Rein et al. 2023). The biology subset is a fast expert-level competence probe for any LLM a cell-ag team intends to lean on for literature reasoning.

Humanity’s Last Exam

A multi-modal benchmark of 2,500 questions at the frontier of human knowledge across dozens of subjects, assembled by the Center for AI Safety and Scale AI as a closed-ended academic benchmark hard enough to outlast MMLU-style saturation. The data lives at the Hugging Face dataset cais/hle, the scoring code at centerforaisafety/hle, and the live results tracker is the Humanity’s Last Exam Leaderboard catalogued in Databases.md. Companion to Papers.md ref #158 (Phan et al. 2026). A ceiling-level check on frontier-model reasoning before a cell-ag team trusts one with open-ended scientific work.

LAB-Bench

The Hugging Face Datasets release of the LAB-Bench eval suite from FutureHouse — multiple-choice questions across eight task categories spanning literature reading, figure interpretation, protocol design, DNA/protein sequence manipulation, cloning, and database access. The bundled scoring code and prompts live at Future-House/lab-bench; this Hugging Face dataset is the canonical data distribution. Companion to Papers.md ref #146 (Laurent et al. 2024). The single broadest practical-biology eval dataset at time of curation, useful both directly for benchmarking and as a template for cell-ag-specific eval datasets the field will need as livestock-cell-focused agents emerge.

MassSpecGym

The canonical Hugging Face distribution of MassSpecGym from the Pluskal lab (Bushuiev et al. 2024, NeurIPS Datasets & Benchmarks Spotlight) — 231 k MS/MS records with full schema (precursor m/z, intensities, SMILES, InChIKey, formula, adduct, instrument type, collision energy, fold splits, simulation-challenge flag). Distributed alongside the canonical scoring code at pluskal-lab/MassSpecGym, with v1.5 adding RDKit-canonical SMILES, MGF export, retrieval candidate pools (mass- and formula-filtered), and pretraining-molecule companion datasets (2.5 M and 50 M SMILES filtered to < 0.7 Tanimoto similarity vs. the test set). Companion to Papers.md ref #109. The cleanest existing substrate for ML-based flavor-compound identification in cell-ag sensomics workflows; pairs with GNPS and the spectral-library entries in Databases.md.
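The < 0.7 Tanimoto filter mentioned above can be illustrated with a small pure-Python sketch. Fingerprints are represented as sets of "on" bit indices. The toy fingerprints, the function names, and the choice to compare each candidate against its most similar test molecule are assumptions for illustration, not a description of the MassSpecGym pipeline itself.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def filter_pretraining(candidates: dict, test_fps: list, threshold: float = 0.7) -> list:
    """Keep candidates whose highest similarity to any test fingerprint is below threshold."""
    kept = []
    for name, fp in candidates.items():
        max_sim = max((tanimoto(fp, t) for t in test_fps), default=0.0)
        if max_sim < threshold:
            kept.append(name)
    return kept


# Toy fingerprints (invented): one test molecule, three pretraining candidates.
test_fps = [{1, 2, 3, 4}]
candidates = {
    "mol_a": {1, 2, 3, 4, 5},  # shares 4 of 5 bits with the test molecule
    "mol_b": {1, 2, 9, 10},    # shares 2 of 6 bits
    "mol_c": {7, 8},           # shares no bits
}
kept = filter_pretraining(candidates, test_fps)
print(kept)
```

In practice the fingerprints would come from a cheminformatics toolkit such as RDKit; the set-based version only shows the similarity formula and the thresholding step.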

MeatScan

A curated image dataset for deep-learning-based binary classification of cow meat as fresh or spoiled (Gyening et al. 2025, Data in Brief): 11,000 high-resolution RGB images (5,627 fresh, 5,373 spoiled) captured in real-world Ghanaian settings (open-air markets, butcher shops, cold-storage facilities), archived at Zenodo 10.5281/zenodo.16764338. Companion to Papers.md ref #196. A substrate for image-based quality and sensory assessment relevant to cultivated and conventional meat.
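Because the two classes are nearly balanced, a classifier that always predicts the majority class sets a low floor for any model evaluated on this data. The arithmetic below uses only the counts quoted above; treating plain accuracy as the metric is an assumption, since no metric is named here.

```python
# Class counts as quoted in the entry above.
fresh, spoiled = 5627, 5373
total = fresh + spoiled

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(fresh, spoiled) / total

print(f"total images: {total}")
print(f"majority-class accuracy: {majority_baseline:.4f}")
```

Any reported accuracy is best read against this floor of roughly 51%, not against 0%.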

MMLU-Pro

A reasoning-focused successor to MMLU — ten-option (rather than four-option) questions across 14 disciplines, curated to drop noisy items and reward multi-step reasoning over knowledge recall so that frontier models no longer plateau near the ceiling. The data is the Hugging Face dataset TIGER-Lab/MMLU-Pro, with bundled evaluation code at TIGER-AI-Lab/MMLU-Pro and a live MMLU-Pro Leaderboard catalogued in Databases.md. Companion to Papers.md ref #157 (Wang et al. 2024). A general reasoning-capability baseline for sizing up models before cell-ag-specific evaluation.
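One way to see why moving from four to ten options matters is the guessing floor. The sketch below assumes a deliberately simple model of a test-taker who answers a fixed fraction of questions correctly and guesses uniformly on the rest; that model and the function name are illustrative assumptions, not part of MMLU-Pro.

```python
def expected_accuracy(known_fraction: float, n_options: int) -> float:
    """Expected accuracy when unknown questions are answered by uniform guessing."""
    if not 0.0 <= known_fraction <= 1.0:
        raise ValueError("known_fraction must be between 0 and 1")
    if n_options < 1:
        raise ValueError("n_options must be at least 1")
    return known_fraction + (1.0 - known_fraction) / n_options


for n_options in (4, 10):
    floor = expected_accuracy(0.0, n_options)   # pure guessing
    half = expected_accuracy(0.5, n_options)    # knows half, guesses the rest
    print(f"{n_options} options: floor={floor:.3f}, half-known={half:.3f}")
```

Under this model, the pure-guessing floor drops from 25% to 10%, so less of a reported score can be attributed to luck.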

ProteinGym

The ProteinGym benchmark suite (Notin et al. 2023, NeurIPS Datasets & Benchmarks Track) — 200+ deep mutational scanning (DMS) assays plus clinical-variant classification tasks covering substitutions and indels, ~2.5 M mutated sequences total. The canonical project home at proteingym.org distributes versioned data archives (v1.3 current at time of curation) and the live substitution and indel leaderboards (catalogued in Databases.md). The bundled scoring code, baselines, and model-specific implementations live at OATML-Markslab/ProteinGym; mirror Hugging Face dataset at OATML-Markslab/ProteinGym_v1; archival releases on Zenodo. Companion to Papers.md ref #148. The dominant variant-effect benchmark in the field — directly relevant to any cell-ag protein-engineering work (growth factors, scaffolds, recombinant ECM proteins) that uses a protein language model.

SciHorizon-GENE

The Hugging Face Datasets release of SciHorizon-GENE from Huang et al. 2026 (CNIC-DSL group) — an LLM benchmark of 540K+ gene-centric questions constructed from authoritative biological databases, covering 190K+ human genes, designed to test “understanding-to-reasoning” inference of biological function from gene-level knowledge. MIT-licensed, English. The companion evaluation framework with bundled scoring code lives at CNIC-DSL/SciHorizonGene — the GitHub URL listed on the HF card itself (XiaohanHwang/SciHorizonGene) currently 404s, so use the CNIC-DSL repo instead. The project page is at horizon.scidb.cn. Companion to Papers.md ref #164 (Huang et al. 2026, arXiv; accepted SIGKDD 2026). For cell-ag teams the dataset is human-gene-focused, but functions as a proximate substrate for cross-species transfer — the same gene-knowledge-to-function reasoning that becomes the livestock task once cross-species fine-tuning of biomedical LLMs matures.

SWE-bench

An evaluation dataset of 2,294 software-engineering tasks drawn from real resolved GitHub issues and their pull requests across 12 popular Python repositories — each task asks a model to produce a patch that makes the project’s own tests pass. The data is the Hugging Face dataset princeton-nlp/SWE-bench, the harness is at SWE-bench/SWE-bench, and the live SWE-bench Leaderboard is catalogued in Databases.md. Companion to Papers.md ref #155 (Jimenez et al. 2024). The standard measure of whether a coding agent can be trusted with the bioinformatics-pipeline and analysis-code maintenance that cell-ag teams increasingly delegate.
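The "patch that makes the project's own tests pass" criterion can be sketched as a small check. This is a simplified illustration, not the SWE-bench harness, which builds environments and runs the tests itself. The FAIL_TO_PASS and PASS_TO_PASS keys are assumed to mirror the dataset's field names (they are not named in the text above) and are treated here as plain lists. The instance IDs, test names, and "PASSED"/"FAILED" status strings are hypothetical.

```python
def is_resolved(fail_to_pass, pass_to_pass, test_results):
    """True if every listed test passed after the candidate patch was applied.

    fail_to_pass: tests that failed before the fix and must pass now.
    pass_to_pass: tests that passed before and must not regress.
    test_results: mapping of test name -> status string (assumed convention).
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name) == "PASSED" for name in required)


def resolved_rate(instances, results_by_id):
    """Fraction of instances whose candidate patch satisfies is_resolved."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(
            inst["FAIL_TO_PASS"],
            inst["PASS_TO_PASS"],
            results_by_id.get(inst["instance_id"], {}),
        )
        for inst in instances
    )
    return resolved / len(instances)


# Hypothetical instances and test outcomes, invented for illustration.
instances = [
    {"instance_id": "demo__repo-1", "FAIL_TO_PASS": ["test_bug"], "PASS_TO_PASS": ["test_old"]},
    {"instance_id": "demo__repo-2", "FAIL_TO_PASS": ["test_fix"], "PASS_TO_PASS": ["test_keep"]},
]
results_by_id = {
    "demo__repo-1": {"test_bug": "PASSED", "test_old": "PASSED"},
    "demo__repo-2": {"test_fix": "PASSED", "test_keep": "FAILED"},  # regression
}
rate = resolved_rate(instances, results_by_id)
print(rate)
```

The second instance counts as unresolved even though its target test passes, because a previously passing test regressed. This is the property that matters for a team delegating pipeline maintenance to an agent.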