Human Reference (Homo sapiens)

Cellular agriculture engineers livestock cells, but the largest and best-annotated single-cell corpora, and the most mature genome-scale metabolic reconstructions, are human. This page collects those human artifacts because they are the practical substrate that cell-ag modeling builds on. Human single-cell foundation-model corpora are the pretraining base that cross-species transfer methods adapt to livestock, and the human genome-scale metabolic models (GEMs) are the homology templates and biomass-function donors from which species-specific livestock reconstructions inherit their reaction networks. Human is not a cultivated-meat target species; it is the reference substrate.

Single-cell & perturbation corpora

Genecorpus-30M

Genecorpus-30M is the pretraining corpus for Geneformer: a Hugging Face Datasets collection of ~30 million human single-cell transcriptomes assembled from publicly available scRNA-seq studies covering a broad range of human tissues and cell states. Each cell is encoded as a rank-ordered gene-expression “sentence” (genes listed from highest to lowest normalized expression) suitable for masked-language-model pretraining; the corpus is distributed under standard HF Datasets tooling with versioned snapshots. For cellular agriculture, Genecorpus-30M is the de facto starting substrate for human-cell foundation-model training runs, and the template that cross-species fine-tuning approaches like SATURN (Papers.md ref #118) and UCE (ref #119) build on to transfer to livestock species where annotated single-cell data is sparse. Companion to Papers.md ref #111 (Theodoris et al. 2023, Nature).

Cited by 1.1K
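
The rank-ordered encoding described for Genecorpus-30M can be illustrated with a minimal sketch. This is not the Geneformer tokenizer: the function name `rank_value_encode`, the `max_len` default, the toy counts, and the per-gene corpus medians are all hypothetical values chosen for illustration. The sketch assumes each gene's depth-normalized count is divided by a corpus-wide median for that gene before ranking, so that ubiquitously high genes do not dominate the front of the sentence.

```python
from typing import Dict, List


def rank_value_encode(
    counts: Dict[str, float],
    gene_medians: Dict[str, float],
    max_len: int = 2048,  # illustrative truncation length (assumption)
) -> List[str]:
    """Return gene IDs ordered by median-normalized expression, highest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    scored = []
    for gene, count in counts.items():
        median = gene_medians.get(gene)
        if count <= 0 or not median:
            continue  # skip undetected genes and genes without a corpus median
        # Depth-normalize within the cell, then scale by the gene's corpus median.
        scored.append((count / total / median, gene))
    # Descending score; ties broken by gene ID so the output is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in scored[:max_len]]


# Toy cell and toy medians (hypothetical numbers, not real corpus statistics).
cell = {"ACTB": 500.0, "MYOD1": 12.0, "PAX7": 8.0, "GAPDH": 300.0, "MYF5": 0.0}
medians = {"ACTB": 0.40, "GAPDH": 0.30, "MYOD1": 0.005, "PAX7": 0.002}
sentence = rank_value_encode(cell, medians)
print(sentence)
```

In this toy cell the lowly expressed but cell-state-specific genes rank ahead of the housekeeping genes, which is the intended effect of dividing by a per-gene median.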

Genecorpus-104M

Genecorpus-104M is the ~104-million-cell Hugging Face Datasets corpus that the Theodoris lab assembled for pretraining Geneformer-V2. It is the larger counterpart to Genecorpus-30M above: it spans a broad range of human tissues, is drawn from publicly available data, uses the same rank-value gene-expression encoding, and is released under Apache-2.0. For cellular agriculture it is the current-generation human-cell pretraining substrate that cross-species transfer methods like SATURN (Papers.md ref #118) and UCE (ref #119) adapt to livestock species where annotated single-cell data is sparse.

Mouse-Genecorpus-20M

Mouse-Genecorpus-20M is the murine counterpart to the Genecorpus family: a Hugging Face Datasets corpus of ~21 million mouse single-cell transcriptomes from a broad range of tissues, released under Apache-2.0 as the pretraining base for Mouse-Geneformer (Ito et al. 2024, bioRxiv). For cellular agriculture it is a non-human mammalian pretraining substrate and a cross-species-transfer template: Mouse-Geneformer demonstrates the mouse-to-other-species utility that cell-ag modeling needs in order to reach livestock cells, where labelled single-cell data is scarce.

Cited by 6

Perturb-Sapiens

Perturb-Sapiens is Arc Institute’s Hugging Face Datasets release of large-scale perturbational single-cell measurements aggregated from public Perturb-seq, CROP-seq, and ECCITE-seq experiments, plus internal Arc data, covering 70+ human cell lines. The dataset is the training substrate for Arc’s State virtual cell model (Papers.md ref #57) and is a companion data product within the Arc Virtual Cell Atlas alongside Tahoe-100M and scBaseCount. For cellular agriculture, Perturb-Sapiens is among the most comprehensive public sources of cellular-perturbation-response data. It serves as a methodology template for any future livestock-cell perturbation atlas and as a useful benchmark for evaluating cross-species transfer of perturbation-response models.

scPerturb

scPerturb is a harmonized resource of single-cell perturbation data (Peidli et al. 2024, Nature Methods, 10.1038/s41592-023-02144-y) — 44 public single-cell perturbation datasets reprocessed to a uniform format, spanning CRISPR (Cas9 knockout, CRISPRa, CRISPRi, and Cas13) and chemical perturbations, with transcriptomic readouts and, in some datasets, proteomic and epigenomic measurements, plus harmonized E-statistic perturbation-strength metrics. The data home is scperturb.org with deposits on Zenodo; processing code at sanderlab/scPerturb. For cellular agriculture it is a benchmark and methodology template for designing a livestock-cell perturbation atlas and for evaluating cross-species transfer of perturbation-response models.

Cited by 157
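
The E-statistic perturbation-strength metric mentioned for scPerturb is based on the energy distance between perturbed and control cell populations. The sketch below shows one common form of that distance, assuming squared Euclidean distance between cells and 1/N² normalization of the within-group terms. The embedding space, normalization, and any significance testing used by scPerturb itself may differ, and the function names and toy coordinates here are hypothetical.

```python
from itertools import product
from typing import Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two cells' embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_pairwise(xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    """Mean distance over all (x, y) pairs, including x == y when xs is ys."""
    return sum(sq_dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def e_distance(perturbed: Sequence[Vector], control: Sequence[Vector]) -> float:
    """Energy distance: 2 * between-group mean minus both within-group means."""
    between = mean_pairwise(perturbed, control)
    within_p = mean_pairwise(perturbed, perturbed)
    within_c = mean_pairwise(control, control)
    return 2.0 * between - within_p - within_c


# Toy 2-D embeddings (hypothetical; real use would take e.g. PCA coordinates).
control_cells = [(0.0, 0.0), (2.0, 0.0)]
perturbed_cells = [(3.0, 4.0), (5.0, 4.0)]
strength = e_distance(perturbed_cells, control_cells)
print(strength)
```

A value near zero indicates that perturbed cells are indistinguishable from controls in the chosen embedding; larger values indicate a stronger shift. With squared Euclidean distance this particular form reduces to twice the squared distance between the two group means.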

Arc Virtual Cell Atlas

The Arc Virtual Cell Atlas is Arc Institute’s open-data initiative providing the substrate datasets for virtual-cell-modeling research, hosted on GitHub with documentation and uniformly processed releases, plus mirroring on Google Cloud’s BigQuery Public Data marketplace for cloud-native analytics. The Atlas aggregates several large-scale single-cell datasets under a single curation umbrella:

Tahoe-100M

A 100-million-cell drug-perturbation dataset profiling responses to >1,100 small molecules across 50+ cancer cell lines by large-scale single-cell RNA-seq of drug-treated cells. It was the largest publicly available drug-perturbation single-cell dataset at time of release, and its methodology is directly transferable to any future cell-ag work involving small-molecule modulators of differentiation, proliferation, or media response.

scBaseCount

An AI-agent-curated, uniformly processed, and autonomously updated single-cell data repository aggregating thousands of public scRNA-seq studies into a single harmonized reference — the data-engineering substrate that an autonomously updating virtual-cell atlas requires. Companion to Papers.md ref #126 (Youngblut et al. 2025).

Parse Biosciences 10M PBMC Atlas

A publicly released ~10-million peripheral blood mononuclear cell (PBMC) single-cell RNA-seq experiment from Parse Biosciences, demonstrating the throughput of their Evercode WT Mega platform. It includes harmonized cell-type annotations and was the largest single-experiment PBMC atlas available at time of release. For cellular agriculture, it is useful both as a benchmark dataset for evaluating how single-cell foundation models handle batch effects and scale at extreme throughput, and as immune-cell reference data for cultivated-meat applications involving immune-cell co-cultures or contamination screening.

Human Skeletal Muscle Single-Cell Atlas (De Micheli et al. 2020)

A reference single-cell transcriptomic atlas of human skeletal muscle tissue (De Micheli et al. 2020, Skeletal Muscle, 10.1186/s13395-020-00236-3), profiling donor muscle by scRNA-seq and resolving its constituent cell populations, including bifurcated muscle stem (satellite) cell states. Raw data are deposited at GEO GSE143704. For cellular agriculture it is a human reference for the satellite-cell and myogenic-lineage compartments that cultivated-meat muscle work targets: a well-annotated cross-species template and transfer-learning reference for livestock muscle single-cell data, for which comparably deep, well-annotated atlases remain scarce.

Cited by 228

Human and Mouse White Adipose Tissue Single-Cell Atlas (Emont et al. 2022)

A single-cell and single-nucleus atlas of human and mouse white adipose tissue (Emont et al. 2022, Nature, 10.1038/s41586-022-04518-2), spanning multiple depots and resolving adipocyte subtypes together with the stromal-vascular compartment, including adipose progenitor / preadipocyte populations. Interactive data are hosted on the Broad Single Cell Portal, study SCP1376. For cellular agriculture it is the reference human and mouse adipose atlas for the fat side of cultivated meat: a template for characterizing adipogenic progenitors and mature adipocyte states, and a cross-species reference against which livestock (cow, pig) intramuscular- and subcutaneous-fat single-cell data can be aligned and annotated.

Cited by 827

Genome-scale metabolic models

GEMs are SBML-formatted reconstructions of an organism’s metabolic network — every reaction, every metabolite, every gene-protein-reaction mapping — and are the input data structure for the constraint-based modeling tools listed in Software.md / Metabolic Modeling & Strain Design. The human reconstructions below are the upstream reference from which the cell-ag livestock GEMs (catalogued on the per-species pages) inherit network structure and curation conventions.
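
To make the data structure concrete, the sketch below represents a toy network as a stoichiometric matrix and checks the steady-state mass balance that constraint-based tools assume. It is an illustration only: the three reactions, the metabolite names, and the flux values are hypothetical, and real GEMs are loaded from SBML with dedicated tooling rather than hand-written dictionaries.

```python
from typing import Dict, List, Tuple

# Hypothetical toy network: uptake -> conversion -> secretion.
# Negative coefficients are consumed, positive coefficients are produced.
reactions: Dict[str, Dict[str, float]] = {
    "EX_glc": {"glc": 1.0},               # -> glc
    "CONV": {"glc": -1.0, "pyr": 2.0},    # glc -> 2 pyr
    "EX_pyr": {"pyr": -1.0},              # pyr ->
}


def stoichiometric_matrix(
    rxns: Dict[str, Dict[str, float]],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Build S with one row per metabolite and one column per reaction."""
    metabolites = sorted({m for stoich in rxns.values() for m in stoich})
    rxn_ids = list(rxns)
    matrix = [[rxns[r].get(m, 0.0) for r in rxn_ids] for m in metabolites]
    return metabolites, rxn_ids, matrix


def net_production(matrix: List[List[float]], fluxes: List[float]) -> List[float]:
    """Compute S @ v: the net production rate of each metabolite."""
    return [sum(c * v for c, v in zip(row, fluxes)) for row in matrix]


metabolites, rxn_ids, S = stoichiometric_matrix(reactions)
balanced = dict(zip(metabolites, net_production(S, [1.0, 1.0, 2.0])))
unbalanced = dict(zip(metabolites, net_production(S, [1.0, 1.0, 1.0])))
print(balanced)    # every metabolite nets to zero: a valid steady state
print(unbalanced)  # pyruvate accumulates: not a valid steady state
```

Flux balance analysis then searches the space of flux vectors satisfying this balance (plus bounds) for one that maximizes an objective such as the biomass reaction.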

Recon3D / Human1 / HMR — Homo sapiens (template / upstream reference)

The human genome-scale metabolic reconstructions — Recon3D (Brunk et al. 2018, Nature Biotechnology), Human-GEM / Human1 (Robinson et al. 2020, Science Signaling), and the underlying HMR2 — are the foundational human GEMs from which most mammalian-cell models (including the cell-ag GEMs in the per-species pages of this directory) inherit reaction networks, biomass equations, and curation conventions. Direct use in cell-ag is rare; they’re more often used as homology templates or biomass-function donors for species-specific reconstructions. Recon3D source is at github.com/SBRG/Recon3D with the structural-systems-biology companion library at github.com/SBRG/ssbio.

References: Papers.md #86 (Brunk et al. 2018, Nature Biotechnology) for Recon3D; Papers.md #87 (Robinson et al. 2020, Science Signaling) for Human-GEM.
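
The homology-template use of the human reconstructions can be sketched as a gene-protein-reaction (GPR) translation step: each template reaction is kept in the draft only if its human genes map to orthologs in the target species. This is a simplified illustration under stated assumptions, not the procedure of any specific reconstruction pipeline. The reaction IDs, gene symbols, the `bt_` ortholog identifiers, and the rule of dropping a reaction when any gene is unmapped are all hypothetical; real pipelines typically treat "or" rules more leniently and follow up with gap-filling and manual curation.

```python
import re
from typing import Dict, List, Optional, Tuple

GENE_TOKEN = re.compile(r"[A-Za-z0-9_.\-]+")
LOGIC_WORDS = {"and", "or"}


def translate_gpr(rule: str, orthologs: Dict[str, str]) -> Optional[str]:
    """Swap each template gene for its ortholog; None if any gene is unmapped."""
    genes = [t for t in GENE_TOKEN.findall(rule) if t.lower() not in LOGIC_WORDS]
    if any(g not in orthologs for g in genes):
        return None

    def swap(match: "re.Match[str]") -> str:
        token = match.group(0)
        return token if token.lower() in LOGIC_WORDS else orthologs[token]

    return GENE_TOKEN.sub(swap, rule)


def draft_from_template(
    template_gprs: Dict[str, str], orthologs: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split template reactions into those carried over and those dropped."""
    kept: Dict[str, str] = {}
    dropped: List[str] = []
    for rxn_id, rule in template_gprs.items():
        translated = translate_gpr(rule, orthologs)
        if translated is None:
            dropped.append(rxn_id)
        else:
            kept[rxn_id] = translated
    return kept, dropped


# Hypothetical template rules and ortholog table (illustrative identifiers).
human_gprs = {"HEX1": "HK1 or HK2", "PFK": "PFKM and PFKL", "RXN_X": "GENEX", "EX_glc": ""}
human_to_bovine = {"HK1": "bt_HK1", "HK2": "bt_HK2", "PFKM": "bt_PFKM", "PFKL": "bt_PFKL"}
draft, unmapped = draft_from_template(human_gprs, human_to_bovine)
print(draft)
print(unmapped)
```

Reactions with an empty rule (exchange or spontaneous reactions) carry over unchanged, while the unmapped list is the natural starting point for manual curation.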

Curation source: These entries are long-standing CAAIL curation, migrated from the prior flat Datasets.md. They are cross-species reference substrate rather than cultivated-meat-species data, so — unlike the per-species pages — they are not drawn from the Todhunter et al. 2024 supplemental.
