Software

AlphaFold

Predicts the 3D structure of proteins from their amino acid sequence using deep learning. Can be used for the design and engineering of cost-effective recombinant growth factors and signaling molecules. Structure predictions help guide optimization of stability, activity, and binding affinity, which can reduce the high cost of media components.

cited by 45.8K

Protein Language Models

A collection of links to protein language models: models trained on protein sequences and used to predict structure and function. Can be used for the design and engineering of cost-effective recombinant growth factors and signaling molecules.

foldseek

Searches large protein structure databases using monomer or multimer structures as queries. Can be used to find known structures to infer function or guide engineering efforts for a new or target protein.

cited by 2.4K

amii-cell-ag-tools

Open-source applied AI research code from the Alberta Machine Intelligence Institute (Amii) targeting cellular agriculture, hosted under the Amii-Applied-AI GitHub organization. The repository collects two Python subprojects: protein-thermostability-data-tools (code used in the development of a public protein thermostability dataset) and active-learning-for-cell-media (active-learning analysis applied to cell media optimization). MIT-licensed.

ESMFold

An end-to-end single-sequence protein structure predictor built on the ESM-2 protein language model — it predicts structure directly from one sequence with no multiple-sequence alignment, avoiding the alignment search that MSA-dependent predictors require. For cellular agriculture, useful for quickly validating the fold of engineered growth-factor analogs and other recombinant media proteins. (The facebookresearch/esm repository also hosts the ESM-2 / ESM-1b protein language models.)

cited by 5.1K

ColabFold

A fast, accessible implementation of AlphaFold2 (and related models) with GPU-accelerated MSA generation via MMseqs2, runnable in Google Colab or locally. Lowers the compute barrier for cell-ag teams designing growth factors, binders, and other recombinant media components without a local AlphaFold deployment (Mirdita et al. 2022, Nature Methods).

cited by 9.8K

OmegaFold

A high-resolution de-novo structure predictor that folds proteins directly from primary sequence using a protein language model, without multiple-sequence alignments — aimed at proteins that lack deep evolutionary alignments. Practical for high-throughput structure validation of engineered cell-ag proteins.

cited by 408

RFdiffusion

A diffusion-model framework for de novo protein design — unconditional generation, motif scaffolding, symmetric oligomers, and binder design — from the Baker lab. Applicable to engineering novel growth-factor / receptor-binding domains and protein scaffolds for cultivated-meat work (Watson et al. 2023, Nature).

cited by 2K

ProteinMPNN

A deep-learning inverse-folding model that designs amino-acid sequences for a given protein backbone — the standard sequence-design step in generative pipelines, typically paired with RFdiffusion and a structure predictor. Useful for optimizing the sequence of recombinant media components (growth factors, binders) for expressibility and stability (Dauparas et al. 2022, Science).

cited by 1.9K

EvoDiff

A sequence-space diffusion framework (Microsoft Research) that generates novel, diverse proteins directly from evolutionary-scale sequence data, without requiring structure — complementing structure-based design with sequence-first generation of functional proteins such as growth factors.

cited by 155

Boltz

An open, commercially-usable (MIT-licensed) family of diffusion-based biomolecular structure prediction models — Boltz-1 was the first fully open-source model to approach AlphaFold3 accuracy, with later versions (Boltz-2) extending the family — predicting structures of proteins and complexes. Useful for assessing the fold and binding interfaces of engineered growth-factor domains.

cited by 422

Chai-1

A multimodal foundation model for biomolecular structure prediction (Chai Discovery) that folds proteins, nucleic acids, and complexes and can optionally incorporate experimental restraints. Valuable for cell-ag protein engineering where binding geometry and complex structure matter as much as the monomer fold.

cited by 374

Non-commercial

IgFold

A fast, antibody-specific structure predictor (Gray lab, Johns Hopkins) that models antibody Fv structures and CDR loops directly from sequence. Useful for engineering recombinant antibody fragments used as affinity reagents or bio-scaffolds. Note: distributed under a JHU non-commercial license — check terms before commercial cell-ag use (Ruffolo et al. 2023, Nature Communications).

cited by 317

AbLang

An antibody-specific language model (Oxford Protein Informatics Group) for antibody sequence representation, restoration of missing residues, and design. Enables rapid engineering of antibody-based scaffolds and binding domains for cell-ag affinity-reagent work.

cited by 196

BoTorch

A PyTorch library for Bayesian optimization, providing Gaussian-process surrogates, Monte-Carlo acquisition functions, and constrained and multi-objective optimization. In cellular agriculture it is the engine under iterative media-formulation and bioprocess design-of-experiment loops, where each wet-lab round is costly and the optimizer picks the next formulation to test — the pattern in the media-optimization campaigns catalogued in Papers.md. MIT; docs at https://botorch.org/.
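A minimal sketch of the acquisition step in such a loop, assuming a surrogate model has already produced a posterior mean and standard deviation for each candidate formulation. It uses the closed-form expected-improvement formula in plain Python rather than BoTorch's Monte-Carlo acquisition functions; the function name, formulation names, and numbers are hypothetical.

```python
import math


def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Closed-form expected improvement for a maximization problem.

    mean, std   : surrogate posterior mean and standard deviation at a candidate
    best_so_far : best objective value measured so far
    xi          : optional margin that encourages more exploration
    """
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + std * pdf


# Hypothetical surrogate predictions (mean, std) of final cell density
# for three untested media formulations.
candidates = {
    "formulation_A": (1.10, 0.05),  # slightly better than best, low uncertainty
    "formulation_B": (1.00, 0.30),  # slightly worse than best, high uncertainty
    "formulation_C": (0.80, 0.10),  # clearly worse, low uncertainty
}
best_observed = 1.05

scores = {
    name: expected_improvement(mean, std, best_observed)
    for name, (mean, std) in candidates.items()
}
next_to_test = max(scores, key=scores.get)
print(next_to_test, round(scores[next_to_test], 4))
```

With these numbers the high-uncertainty candidate scores higher than the one with the best predicted mean, which is the exploration behavior that makes this kind of loop attractive when each wet-lab round is costly.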

Media optimization

cited by 204

Ax

An adaptive-experimentation platform that wraps BoTorch with managed experiment orchestration, sequential design, and multi-objective optimization behind a higher-level API. For cell-ag it runs closed-loop media and process optimization campaigns — proposing formulations, ingesting assay results, and adapting the search — without hand-coding the underlying Bayesian optimization. MIT; docs at https://ax.dev/.

Media optimization

CellCultureBayesianOptimization

The reference implementation of the Bayesian-optimization iterative media-development workflow of Narayanan et al. 2025 — the code that proposes successive media formulations and folds in measured outcomes to accelerate serum-free and recombinant-protein media design. Companion to Papers.md ref #58 (Narayanan et al. 2025). The repository carries no explicit license (all rights reserved by default); the paper's figshare data deposit is MIT.

Media optimization

cited by 21

Bioprocess Modeling & Scaling

OpenFOAM

An open-source suite for computational fluid dynamics (CFD). Used to model a variety of environmental factors within bioreactors, to help ensure viable conditions at scale.

cited by 5.3K

CompuCell3D

A modeling environment for simulating cell-cell and cell-environment interactions. Enables the modeling of complex phenomena like cell proliferation, nutrient gradients within thick tissues, and the mechanics of cells interacting with scaffolds.

cited by 565

PhysiCell

A physics-based framework specialized for multi-cell simulations. Can be used to model complex cell behaviors, including differentiation and nutrient limitations in tissue cultures.

cited by 0

CC-BY-4.0

Morpheus

A modeling and simulation environment for multicellular systems that couples cell behavior (motility, division, differentiation) with reaction-diffusion PDEs for the surrounding chemical fields, configured through a GUI rather than hand-written code. Applicable to modeling nutrient and signaling gradients across cultivated tissue and cell-scaffold constructs (Starruß et al. 2014, Bioinformatics).

Project page: https://morpheus.gitlab.io/.

cited by 362

pyFOOMB

A Python framework for object-oriented modeling of bioprocesses aimed at users with limited programming experience, wrapping ODE model definition, event handling, and parameter estimation with uncertainty quantification for fed-batch and other bioreactor models. Useful for building and calibrating growth/substrate/product kinetic models of cultivated-cell bioprocesses (Hemmerich et al. 2021, Engineering in Life Sciences).

Docs: https://github.com/MicroPhen/pyFOOMB/tree/main/examples.
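To illustrate the kind of growth/substrate kinetic model described above, here is a minimal sketch of Monod batch growth integrated with explicit Euler steps in plain Python. It does not use pyFOOMB's API; the function name, parameter names, and values are illustrative assumptions.

```python
def simulate_batch(x0, s0, mu_max, k_s, yield_xs, t_end, dt=0.01):
    """Integrate Monod growth in a batch reactor with explicit Euler steps.

    x0, s0   : initial biomass and substrate concentrations (g/L)
    mu_max   : maximum specific growth rate (1/h)
    k_s      : half-saturation constant (g/L)
    yield_xs : biomass formed per substrate consumed (g/g)
    Returns a list of (time, biomass, substrate) tuples.
    """
    x, s, t = x0, s0, 0.0
    trajectory = [(t, x, s)]
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        mu = mu_max * s / (k_s + s)           # Monod specific growth rate
        growth = mu * x * dt                  # biomass formed in this step
        growth = min(growth, s * yield_xs)    # cannot consume more substrate than remains
        x += growth
        s -= growth / yield_xs                # substrate consumed for that growth
        t += dt
        trajectory.append((t, x, s))
    return trajectory


# Hypothetical parameter values for a 24 h batch culture.
trajectory = simulate_batch(
    x0=0.1, s0=10.0, mu_max=0.5, k_s=0.5, yield_xs=0.5, t_end=24.0
)
t_final, x_final, s_final = trajectory[-1]
print(f"t={t_final:.1f} h  biomass={x_final:.3f} g/L  substrate={s_final:.3f} g/L")
```

A dedicated framework adds what this sketch leaves out: adaptive ODE solvers, feed events for fed-batch operation, and parameter estimation against measured data with uncertainty estimates.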

cited by 29

MxlPy

A Python package for mechanistic learning that combines mechanistic kinetic/metabolic modeling (ODE construction, simulation, parameter fitting, metabolic control analysis) with machine-learning surrogates, aimed at interpretable, data-informed models. Applicable to cell-ag bioprocess and metabolic modeling where mechanistic and data-driven components are combined (van Aalst et al. 2025, bioRxiv). From the Computational Biology group at RWTH Aachen.

Docs: https://computational-biology-aachen.github.io/MxlPy/.

PC-Gym

A library of Gymnasium-style reinforcement-learning environments for process control, providing simulated reactor and separation systems (CSTR, multistage extraction column, and others) with configurable setpoints, disturbances, and constraints for benchmarking RL controllers. Gives cell-ag process-control work a standardized testbed for training and comparing RL policies for bioreactor regulation (Bloor et al. 2024, arXiv).

Docs: https://maximilianb2.github.io/pc-gym/.

cited by 2

Metabolic Modeling & Strain Design

COBRApy

The de-facto Python package for constraint-based modeling of metabolic networks, hosted by the openCOBRA consortium. Provides FBA, FVA, gene and reaction knockouts, flux sampling (OptGP, ACHR), gap-filling, production envelopes, dynamic FBA, and SBML I/O on a unified Model object that integrates cleanly with NumPy / pandas / SymPy. The reference implementation for FBA in the bioinformatics community (Ebrahim et al. 2013, BMC Systems Biology).

Docs: http://opencobra.github.io/cobrapy/.

cited by 1.5K

COBRA Toolbox

The MATLAB counterpart to COBRApy, also under the openCOBRA umbrella. Older but more feature-rich for advanced methods (thermodynamics-aware FBA via multiTFA, ME-models via COBRAme, elementary flux modes), and still the production tool of choice in many academic labs. For new cell-ag work, COBRApy is usually the better starting point unless a specific COBRA Toolbox feature is required. The current protocol is Heirendt et al. 2019 (Nature Protocols).

Docs: https://opencobra.github.io/cobratoolbox.

cited by 1.4K

Memote

The genome-scale metabolic model test suite, providing automated QC and benchmarking for SBML GEMs. Computes >100 metrics covering annotation completeness, stoichiometric consistency, biomass formulation, growth simulation, and reproducibility, producing a versioned HTML report. Used by BiGG Models and the openCOBRA community as the de-facto curation standard; any cell-ag-specific GEM should be benchmarked with Memote before downstream use.

Docs: https://memote.readthedocs.io/.

cited by 558

Escher

A web-based tool for building, sharing, and embedding visualizations of metabolic pathway maps, with bi-directional integration to COBRApy via JSON. Supports interactive overlay of flux distributions, reaction knockouts, and metabolite concentrations on hand-drawn or auto-generated pathway maps (King et al. 2015, PLOS Computational Biology).

Project page: https://escher.github.io.

cited by 443

Artistic-2.0

COPASI

A software application for simulation and analysis of biochemical networks and their dynamics, supporting deterministic ODE integration, stochastic Gillespie simulation, steady-state analysis, parameter estimation, sensitivity analysis, and metabolic control analysis. Cross-platform GUI plus Python bindings via basico; used as the simulation backend by Talk2Biomodels.

Project page: https://copasi.org.

cited by 2.7K

Tellurium

A Python environment for modeling and simulating biological systems built around libRoadRunner (fast SBML simulator), Antimony (human-readable model language), and notebook-friendly workflows. Bridges between SBML, ODE simulation, parameter scans, and Python-native scripting. Maintained by the Sauro lab at University of Washington.

Project page: http://tellurium.analogmachine.org/.

cited by 188

RAVEN Toolbox

A MATLAB toolbox for genome-scale metabolic model reconstruction, curation, and analysis, maintained by the Nielsen lab at Chalmers. Provides automated reconstruction from KEGG, MetaCyc, or template models, gap-filling, model curation tools, and integration with COBRA Toolbox; widely used to build species-specific GEMs in food, fermentation, and biopharma settings.

Docs: http://sysbiochalmers.github.io/RAVEN/.

cited by 389

StrainDesign

A Python package built on COBRApy that consolidates the strain-design MILP family — OptKnock, OptCouple, RobustKnock, Minimal Cut Sets (MCS / advanced MCS) — into a single API for in-silico engineering of metabolic networks. Useful beyond microbial work for identifying knockouts in cultivated mammalian cell GEMs that redirect flux toward biomass while suppressing lactate accumulation.

cited by 30

CNApy

An integrated visual environment for metabolic modeling that wraps COBRApy with a GUI for FBA, FVA, Elementary Flux Modes, thermodynamic methods, Minimal Cut Sets, OptKnock, RobustKnock, OptCouple, and other strain-design techniques. Strong fit for teaching and for users who prefer interactive exploration over notebook-based work.

Docs: https://cnapy-org.github.io/CNApy-guide/.

cited by 27

moped

A Python package (QTB lab, HHU Düsseldorf) serving as an integrative hub for reproducible, scriptable construction, modification, curation, and analysis of metabolic models — importing existing SBML models, supporting metabolic network expansion, and converting directly to COBRApy objects for constraint-based analysis (Saadat et al. 2022, Metabolites). A lighter-weight on-ramp for building and editing cell-ag GEMs alongside the openCOBRA stack above.

cited by 8

AGPL-3.0

mergem

A Python package and command-line tool (Lobo lab, UMBC) for merging, comparing, and translating genome-scale metabolic models via a universal metabolite / reaction identifier mapping, producing curated consensus models; paired with the Fluxer web application for flux visualization. Useful when reconciling draft cell-ag GEMs from different reconstruction pipelines into a single curated model (NAR Genomics and Bioinformatics, 10.1093/nargab/lqae010).

Agent integration. Code-execution agents (Cursor, Claude Code, Biomni) can invoke any of these tools as Python; for COBRApy specifically, the cobrapy skill from K-Dense-AI's scientific-agent-skills (see the K-Dense-AI entry below) provides curated recipes that make this reliable in agent loops.

cited by 9

CarveMe

A command-line tool for automated, top-down reconstruction of genome-scale metabolic models directly from genome or protein sequences, using a manually curated universal model as the template. Builds draft models for single species or microbial communities in seconds, giving cell-ag teams a fast starting GEM for a feedstock organism or bioreactor consortium before manual curation (Machado et al. 2018, Nucleic Acids Research).

Docs: https://carveme.readthedocs.io/.

cited by 791

gapseq

A tool that predicts metabolic pathways and transporters from genome sequences and reconstructs gap-filled genome-scale models informed by that pathway evidence. The pathway-prediction step is useful for building well-annotated GEMs of the microbial and feedstock organisms surrounding cultivated-meat and fermentation workflows (Zimmermann et al. 2021, Genome Biology).

Docs: https://gapseq.readthedocs.io/.

cited by 316

cameo

A high-level Python library for computer-aided metabolic engineering built on COBRApy, providing strain-design algorithms (OptKnock, OptGene, differential flux variability, heuristic search) and simulation methods for predicting knockouts and interventions that increase a target product. Applies to redirecting flux in cultivated-cell or feedstock GEMs toward biomass or a desired metabolite. From the Biosustain group at DTU.

Docs: https://cameo.readthedocs.io/.

cited by 77

GECKO

A toolbox that enhances genome-scale models with enzyme constraints (GEM with Enzymatic Constraints using Kinetic and Omics data), adding enzyme-usage bounds derived from kcat values and proteomics so a model captures the proteome cost of flux. Enzyme-constrained models predict overflow metabolism and growth limits more accurately, which matters for cultivated-cell media and bioprocess modeling (Domenzain et al. 2022, Nature Communications).

Docs: https://github.com/SysBioChalmers/GECKO/wiki.
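The core idea of an enzyme constraint can be sketched in a few lines: a reaction's flux cannot exceed the turnover number times the amount of enzyme available. The plain-Python example below is a simplified per-reaction cap under that assumption, not GECKO's API or its full formulation; reaction names, kcat values, and enzyme abundances are hypothetical.

```python
SECONDS_PER_HOUR = 3600.0


def enzyme_flux_bound(kcat_per_s, enzyme_mmol_per_gdw):
    """Maximum flux (mmol/gDW/h) supported by an enzyme: v_max = kcat * [E]."""
    return kcat_per_s * SECONDS_PER_HOUR * enzyme_mmol_per_gdw


def apply_enzyme_constraints(upper_bounds, kcats, enzyme_levels):
    """Tighten reaction upper bounds wherever kcat and abundance data exist.

    Reactions without enzyme data keep their original bound.
    """
    constrained = {}
    for reaction, bound in upper_bounds.items():
        if reaction in kcats and reaction in enzyme_levels:
            cap = enzyme_flux_bound(kcats[reaction], enzyme_levels[reaction])
            constrained[reaction] = min(bound, cap)
        else:
            constrained[reaction] = bound
    return constrained


# Hypothetical model bounds (mmol/gDW/h), turnover numbers (1/s),
# and measured enzyme abundances (mmol/gDW).
upper_bounds = {"HEX1": 1000.0, "PFK": 1000.0, "EX_lac": 1000.0}
kcats = {"HEX1": 100.0, "PFK": 50.0}
enzyme_levels = {"HEX1": 1e-5, "PFK": 2e-6}

constrained = apply_enzyme_constraints(upper_bounds, kcats, enzyme_levels)
for reaction, bound in constrained.items():
    print(f"{reaction}: upper bound {bound:g} mmol/gDW/h")
```

Replacing the arbitrary default bound with a kcat-derived cap is what lets an enzyme-constrained model express the proteome cost of carrying flux.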

cited by 144

MEWpy

An integrated metabolic-engineering workbench for strain-design optimization that runs on top of constraint-based models, supporting multiple simulation methods (FBA, pFBA, MOMA, ROOM), evolutionary-computation search over gene/reaction/enzyme modifications, and regulatory and enzyme-constrained formulations. A flexible option for in-silico engineering of cultivated-cell or microbial GEMs when the design space extends beyond simple knockouts.

Docs: https://mewpy.readthedocs.io/.

cited by 32

pyTFA

A Python implementation of thermodynamics-based flux analysis, adding Gibbs-free-energy and metabolite-concentration constraints to constraint-based models so predicted flux distributions are thermodynamically feasible. Tightens the realism of flux predictions for cultivated-cell and feedstock GEMs used in media and bioprocess design (Salvy et al. 2019, Bioinformatics). From the Hatzimanikatis lab at EPFL.

Docs: https://pytfa.readthedocs.io/.
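The thermodynamic constraint behind this approach can be illustrated with a single reaction: the transformed Gibbs energy depends on metabolite concentrations through dG' = dG'0 + RT ln(Q), and forward flux is only feasible when dG' is negative. The sketch below is plain Python, not pyTFA's API; the reaction, its standard Gibbs energy, and the concentrations are hypothetical.

```python
import math

GAS_CONSTANT_KJ = 8.314462618e-3  # kJ / (mol K)


def reaction_gibbs_energy(dg0_prime, stoichiometry, concentrations, temperature_k=310.15):
    """Return dG' = dG'0 + R T ln(Q) in kJ/mol.

    stoichiometry  : metabolite -> coefficient (negative for substrates)
    concentrations : metabolite -> molar concentration
    """
    ln_q = sum(
        coeff * math.log(concentrations[met])
        for met, coeff in stoichiometry.items()
    )
    return dg0_prime + GAS_CONSTANT_KJ * temperature_k * ln_q


def forward_is_feasible(dg_prime):
    """Forward flux is thermodynamically allowed only if dG' < 0."""
    return dg_prime < 0.0


# Hypothetical reaction A -> B with an unfavorable standard Gibbs energy.
stoich = {"A": -1, "B": 1}
dg0 = 5.0  # kJ/mol

dg_low_product = reaction_gibbs_energy(dg0, stoich, {"A": 1e-2, "B": 1e-4})
dg_high_product = reaction_gibbs_energy(dg0, stoich, {"A": 1e-4, "B": 1e-2})

print(f"low product:  dG' = {dg_low_product:.2f} kJ/mol, forward feasible: "
      f"{forward_is_feasible(dg_low_product)}")
print(f"high product: dG' = {dg_high_product:.2f} kJ/mol, forward feasible: "
      f"{forward_is_feasible(dg_high_product)}")
```

The same reaction is feasible or blocked depending on the concentration ratio, which is why adding concentration bounds narrows the set of flux distributions a model can return.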

cited by 116

MASSpy

A Python package for building, simulating, and visualizing dynamic mass-action kinetic models of metabolism, extending constraint-based network structures with kinetic rate laws and ODE integration. Bridges steady-state GEM analysis and time-resolved kinetic simulation, useful for modeling the dynamics of cultivated-cell metabolism and spent-media turnover (Haiman et al. 2021, PLOS Computational Biology). From the Systems Biology Research Group (SBRG) at UC San Diego.

Docs: https://masspy.readthedocs.io/.

cited by 45

RIPTiDe

A Python tool that tailors a genome-scale metabolic model to transcriptomic (or proteomic) data by parsimonious flux analysis, yielding a context-specific model that predicts metabolism better in complex environments (Jenior et al. 2020, PLOS Computational Biology). It is the transcriptome-to-context-specific-model step that the classic extraction methods GIMME, iMAT, and tINIT pioneered — those live in the already-listed COBRA Toolbox / cobrapy (GIMME, iMAT) and RAVEN Toolbox (tINIT). For cell-ag it conditions a species GEM on the actual expression state of a proliferating or differentiating muscle or fat cell for media and process design. MIT.

Metabolic modeling

cited by 99

Quantitative Genetics & Multi-Omics Analysis

OmiGA

An ultra-efficient toolkit for molecular quantitative trait loci (molQTL) mapping across multi-omics data, from the Zhang group at China Agricultural University. The performance backbone of the FarmGTEx project family — eQTL, sQTL, mQTL, and other molQTL discovery at livestock-atlas scale, optimized for the throughput needed to handle the FarmGTEx tissue / sample matrices. Companion to Papers.md ref #143 (Teng et al. 2026, Nature Communications). Reproducibility deposits: ENA BioProject PRJEB58031, Zenodo 10.5281/zenodo.10072081 and 10.5281/zenodo.18280923.

AI Methods & Tooling

cited by 3

OmicVerse

A Python framework integrating bulk, single-cell, and spatial RNA-seq analysis — including trajectory and bulk-to-single-cell interpolation (BulkTrajBlend) and multi-omics integration — in one toolkit. Useful for analysing cultivated cell-line transcriptomic data across bulk and single-cell modalities (Zeng et al. 2024, Nature Communications).

AI Methods & Tooling

cited by 81

CellRank

A Python framework (scverse / Theis lab) for single-cell fate mapping that combines trajectory inference with signals such as RNA velocity to model cell-state transitions and terminal states. Informs understanding of cultivated-cell differentiation dynamics under perturbation or media change.

AI Methods & Tooling

cited by 792

CellChat

An R toolkit for inferring and analysing intercellular communication networks from single-cell and spatial transcriptomics, using a curated ligand–receptor interaction database. Applicable to cell-culture and scaffold / co-culture analysis where cell–cell signalling shapes differentiation (Jin et al. 2021, Nature Communications).

AI Methods & Tooling

cited by 8.4K

Giotto Suite

An R package suite (Dries lab) for end-to-end spatial-transcriptomics and spatial multi-omics analysis at multiple scales and resolutions, including 2D/3D spatial analysis and cell–cell interaction analysis. Supports analysis of cultivated-tissue structure and microarchitecture.

AI Methods & Tooling

cited by 26

Open Problems

A community benchmarking framework for single-cell analysis that pairs formalized tasks (batch integration, label projection, denoising, and more) with bundled datasets, standardized metrics, and baseline methods, so new methods are evaluated on common ground (Luecken et al. 2025, Nature Biotechnology). Because it ships its own data and evaluation harness, cell-ag teams can benchmark single-cell methods on cultivated-cell transcriptomics against a maintained, reproducible standard.

Project page: https://openproblems.bio/.

AI Methods & Tooling

cited by 31

PanCanStem / mRNAsi

The code and one-class logistic regression (OCLR) signature behind the mRNA stemness index (mRNAsi), a transcriptomic score of undifferentiated / stem-like cell state derived across cancer types (Malta et al. 2018, Cell). For cell-ag it offers a transferable, quantitative readout of stemness versus differentiation that can track how far cultivated cells have progressed toward mature muscle or fat identity.

Docs: https://github.com/ArtemSokolov/PanCanStem#readme.

AI Methods & Tooling

cited by 2.4K

Matrisome AnalyzeR

An R package that annotates and quantifies extracellular-matrix (matrisome) molecules in large omics datasets across organisms, classifying proteins and transcripts into matrisome categories for downstream analysis (Petrov et al. 2023, Journal of Cell Science). Directly applicable to cultivated-meat scaffolding and tissue-maturation work, where ECM composition is a key structural and sensory determinant.

Project page: https://matrisome.org/.

AI Methods & Tooling

cited by 60

Mass Spectrometry & Chemometrics

OpenMS / pyOpenMS

An open-source C++ framework with Python bindings (pyOpenMS) for mass-spectrometry data analysis, from raw spectra processing through quantitative proteomics and metabolomics. Maintained by an international consortium led by the Kohlbacher lab (Tübingen); OpenMS 3 (Pfeuffer et al. 2024, Nat Methods) provides a modular toolkit (TOPP), Python bindings, KNIME nodes, Galaxy wrappers, and integrations with nf-core pipelines (e.g. nf-core/quantms). Widely used for flavor metabolomics, off-flavor characterization, and as a building block in larger workflow-manager pipelines.

Project page: https://openms.de/. pyOpenMS docs: https://pyopenms.readthedocs.io/.

cited by 65

MZmine 3

A modular, open-source platform for LC-MS / GC-MS / IMS-MS data processing, maintained by the Pluskal lab (IOCB Prague) and the international MZmine consortium. Schmid et al. 2023 (Nat Biotech) describes the v3 release with multimodal MS support (LC-MS, IMS-MS, MS-imaging). Provides feature detection, alignment, gap-filling, MS/MS networking integrations (GNPS / FBMN, SIRIUS), and a CLI for batch processing. Standard preprocessor for flavor and natural-products metabolomics workflows.

Source: https://github.com/mzmine/mzmine. News and releases from mzio, the team behind MZmine: https://mzio.io/mzmine-news/.

cited by 1.2K

CC-BY-NC-4.0

MS-DIAL

A standalone Windows tool for DIA / DDA MS/MS spectral deconvolution and metabolite / lipid annotation, developed by Tsugawa et al. at RIKEN (Tsugawa et al. 2015, Nat Methods; the 2020 Nat Biotech lipidomics atlas paper extended MS-DIAL 4 to lipid identification). The de-facto standard for GC-MS deconvolution in flavor and food chemistry labs; ships with extensive built-in spectral libraries.

Source: https://github.com/systemsomicslab/MsdialWorkbench.

cited by 3.4K

CC-BY-4.0

MRMPROBS

A C# tool for widely targeted metabolomics that processes multiple reaction monitoring (MRM) / selected reaction monitoring (SRM) data — plus SCAN and DIA MS/MS — developed by Tsugawa et al. (2013, Analytical Chemistry), same first author as the MS-DIAL entry above. Evaluates metabolite peaks by posterior probability and provides large-scale visualisation, data curation, and statistical analysis of widely-targeted metabolomics datasets — the targeted-quantitation complement to MS-DIAL's discovery-focused deconvolution.

Distributed via Zenodo: https://zenodo.org/records/11219831/latest.

cited by 127

Academic (non-commercial)

XCMS

The most-cited R / Bioconductor package for LC-MS / GC-MS metabolomics preprocessing, originally Smith et al. 2006 (Anal Chem) and continuously maintained since. Provides nonlinear retention-time alignment, peak picking, grouping, and gap-filling — the analytical workhorse of many academic metabolomics pipelines including the Galaxy-based Workflow4Metabolomics platform.

Bioconductor: https://bioconductor.org/packages/xcms/.

cited by 5.3K

ProteoWizard / msconvert

A cross-platform C++ library and command-line toolkit for mass-spectrometry data conversion and analysis, maintained by the Mallick lab at Stanford and an international community (Chambers et al. 2012, Nat Biotech). The msconvert utility is the universal first step in essentially every open MS pipeline — converting vendor-locked binary formats (.RAW, .D, .lcd, .wiff) to open standards (mzML, mzXML, MGF) so downstream tools can ingest the data.

Source: https://github.com/ProteoWizard/pwiz.

cited by 4.3K

SIRIUS + CSI:FingerID

Java application for in-silico molecular formula determination and structure annotation from MS/MS spectra, maintained by the Böcker lab (University of Jena). SIRIUS 4 (Dührkop et al. 2019, Nat Methods) integrates fragmentation-tree-based formula prediction, CSI:FingerID for structure database search, CANOPUS for compound-class prediction, and COSMIC for confidence scoring. Standard tool for de-novo annotation in untargeted metabolomics and natural-products work.

Source: https://github.com/boecker-lab/sirius.

cited by 2.1K

LGPL-2.1

MetFrag

In-silico fragmenter for MS/MS-based compound identification, originally Wolf et al. 2010 (BMC Bioinformatics) and substantially revised in MetFrag Relaunched (Ruttkies et al. 2016, J Cheminform). Scores candidate structures from compound databases (PubChem, ChemSpider, KEGG) against measured fragmentation spectra. Integrated into many metabolomics workflows including nf-core/metaboigniter and Workflow4Metabolomics.

Source: https://github.com/c-ruttkies/MetFragRelaunched.

MS2Query

A machine-learning-based mass-spectral analogue search tool from the iomega consortium (de Jonge et al. 2023, Nat Comms). Uses Spec2Vec and MS2DeepScore embeddings to find spectral analogues in reference libraries, including for compounds without exact matches — directly addressing the long-tail "unknown unknowns" problem in flavor and natural-products metabolomics.

cited by 104

CC-BY-4.0

MS-FINDER

A standalone tool for in-silico compound identification from MS/MS spectra, developed by Tsugawa et al. at RIKEN as a companion to MS-DIAL. Combines isotope pattern matching, formula prediction, in-silico fragmentation, and database search against curated reference libraries (HMDB, FooDB, ChEBI, PubChem, KEGG) to score candidate structures. Widely used in untargeted flavor metabolomics for annotating unknown compounds from GC-MS / LC-MS spectra.

cited by 686

GNPS

The Global Natural Products Social Molecular Networking platform — a web-based MS/MS analysis platform from the Dorrestein lab at UCSD (Wang et al. 2016, Nat Biotech). Provides community-curated reference spectral libraries, Feature-Based Molecular Networking (FBMN), Ion Identity Molecular Networking (IIMN), and analog search via spectral similarity. Standard tool for compound annotation, dereplication, and pattern discovery in flavor and natural-products metabolomics workflows.

Also listed as a spectral-library reference in Databases.md / Mass Spectrometry Spectral Databases — dual-listed because the community-curated reference libraries are themselves a queryable database.

cited by 4.6K

ropls

An R package on Bioconductor implementing PCA, PLS, OPLS, and OPLS-DA for chemometric analysis of metabolomics and other -omics data (Thévenot et al. 2015, J Proteome Res). Provides the multivariate engine in the Workflow4Metabolomics Galaxy platform and is widely used in flavor / sensory metabolomics for sensory-instrumental correlation, biomarker discovery, and quality-control modeling. See also ropls in the K-Dense-AI scientific-agent-skills collection for an agent-callable wrapper.

cited by 1.7K

MetaboAnalyst

A comprehensive web-based and R-based platform for statistical, functional, and visual analysis of metabolomics data, maintained by the Xia lab at McGill in collaboration with the Wishart lab (current v6, Pang et al. 2024, Nucleic Acids Research). Provides modules for univariate / multivariate statistics, pathway enrichment, network analysis, biomarker discovery, time-series and dose-response analysis, plus a companion R package MetaboAnalystR for scripted workflows. The most-cited tool in food metabolomics applications; widely used for sensory-instrumental data analysis in flavor and off-flavor work.

cited by 2.1K

mixOmics

An R / Bioconductor package for the integration and exploration of single- and multi-omics datasets, maintained by the Le Cao group (University of Melbourne) (Rohart et al. 2017, PLOS Computational Biology). Implements PCA, PLS, sparse PLS-DA, and the DIABLO method for multi-block supervised classification across heterogeneous data blocks. The methodological standard for fusing sensory panel scores, instrumental volatilome / non-volatilome data, and microbiome / -omics layers — directly applicable to multi-omics flavor and cell-ag quality-prediction work.

cited by 4K

SensoMineR

A food-specific R package for the analysis of sensory data, maintained by the Le and Husson group at Agrocampus Ouest (Lê & Husson 2008, J Sensory Studies). Implements QDA, napping, sorted napping, projective mapping, preference mapping, and panel performance diagnostics; built on top of FactoMineR. The de-facto open-source tool for descriptive-analysis sensory panel data in academic food science, including alt-protein flavor benchmarking.

cited by 185

matchms

A Python package for importing, cleaning, and comparing tandem mass spectrometry (MS/MS) data, with metadata harmonization, peak filtering, and a library of spectral similarity scores (cosine, modified cosine, and embedding-based measures). A building block for flavor and spent-media metabolomics pipelines that match measured spectra against reference libraries (Huber et al. 2020, Journal of Open Source Software).

Docs: https://matchms.readthedocs.io/.
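As a rough illustration of what a cosine spectral similarity score computes, the sketch below pairs peaks from two spectra within an m/z tolerance, assigns them greedily by intensity product, and normalizes by the spectra's intensity norms. It is written in plain Python in the spirit of a greedy cosine score and is not matchms code; the function name and the peak lists are hypothetical.

```python
import math


def cosine_greedy(spectrum_a, spectrum_b, tolerance=0.1):
    """Greedy cosine similarity between two peak lists.

    Each spectrum is a list of (mz, intensity) tuples. Peaks are paired when
    their m/z values differ by at most `tolerance`; each peak is used once,
    with the largest intensity products assigned first.
    Returns (score, number_of_matched_peaks).
    """
    candidate_pairs = []
    for i, (mz_a, int_a) in enumerate(spectrum_a):
        for j, (mz_b, int_b) in enumerate(spectrum_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidate_pairs.append((int_a * int_b, i, j))
    candidate_pairs.sort(reverse=True)

    used_a, used_b = set(), set()
    matched_sum, n_matched = 0.0, 0
    for product, i, j in candidate_pairs:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matched_sum += product
        n_matched += 1

    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in spectrum_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in spectrum_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0, 0
    return matched_sum / (norm_a * norm_b), n_matched


# Hypothetical measured spectrum and library reference spectrum.
measured = [(100.00, 0.7), (150.00, 0.2), (200.05, 1.0)]
reference = [(100.05, 0.6), (200.00, 1.0), (250.00, 0.3)]

score, n_matched = cosine_greedy(measured, reference, tolerance=0.1)
print(f"cosine score = {score:.3f} with {n_matched} matched peaks")
```

Reporting the number of matched peaks alongside the score is useful in practice, since a high score built on one or two peaks is weak evidence for a library match.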

cited by 129

Skyline

A widely used open-source Windows application for building and analyzing targeted mass-spectrometry methods (SRM/MRM, PRM, DIA) across proteomics and metabolomics, from the MacCoss lab at the University of Washington. Standard tool for targeted quantification of specific proteins or metabolites, applicable to tracking defined growth factors, amino acids, or off-flavor compounds in cultivated-meat media and tissue.

Docs: https://skyline.ms/project/home/software/Skyline/begin.view.

cited by 5K

asari

A Python tool for trackable, reproducible feature extraction from LC-MS metabolomics data, using a mass-grid and elution-track model to build a consistent feature table with explicit provenance for each peak. Its reproducibility focus suits cultivated-meat spent-media and flavor metabolomics studies where features must be tracked across large sample batches (Li et al. 2023, Nature Communications).

Docs: https://asari.readthedocs.io/.

cited by65

chemotools

A Python library that brings chemometric spectral preprocessing (baseline correction, scatter correction, smoothing, derivatives, scaling, calibration transfer) into the scikit-learn API, so spectral pipelines compose with standard machine-learning estimators and cross-validation. Directly useful for building models over Raman, NIR, or IR spectra collected as bioreactor process-analytical measurements (Cabaneros Lopez 2024, Journal of Open Source Software).
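As an illustration of one preprocessing step named above, the sketch below applies standard normal variate (SNV) scatter correction to individual spectra in plain Python. It does not call chemotools; the function name `snv`, the choice of the population standard deviation, and the toy spectra are assumptions for this example.

```python
import statistics

def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own
    mean and standard deviation, which removes additive offsets and
    multiplicative gain differences between spectra."""
    mean = statistics.fmean(spectrum)
    std = statistics.pstdev(spectrum)  # population std; an assumption here
    if std == 0:
        # A flat spectrum carries no shape information; return zeros.
        return [0.0 for _ in spectrum]
    return [(value - mean) / std for value in spectrum]


# Two toy spectra with the same shape but different offset and gain.
spectrum_a = [1.0, 2.0, 4.0, 2.0, 1.0]
spectrum_b = [12.0, 14.0, 18.0, 14.0, 12.0]  # equals 2 * spectrum_a + 10

corrected_a = snv(spectrum_a)
corrected_b = snv(spectrum_b)
```

After correction the two toy spectra coincide, which is the property that makes SNV useful ahead of a regression or classification model on Raman or NIR data.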

Docs: https://chemotools.org/.

cited by2

RamanSPy

An open-source Python package for integrative Raman spectroscopy analysis, providing standardized data loading from commercial instruments, preprocessing (cosmic-ray removal, denoising, baseline correction), analysis methods, and machine-learning integration. Relevant to cultivated-meat bioprocessing, where Raman is a common process-analytical-technology (PAT) probe for real-time monitoring of nutrients and metabolites in bioreactors (Georgiev et al. 2024, Analytical Chemistry).

Docs: https://ramanspy.readthedocs.io/.

cited by67

LipidSig 2.0

A web server for automated lipidomics data analysis that maps features onto LIPID MAPS lipid characteristics and provides differential-expression, machine-learning-based classification and feature selection, network, and correlation analyses. Applicable to characterizing the lipid composition of cultivated-meat cells and tissue, a determinant of nutritional profile and flavor (Liu et al. 2024, Nucleic Acids Research).

Project page: https://lipidsig.bioinfomics.org/.

cited by27

Imaging & Segmentation

Cellpose

A generalist deep-learning model for cell and nucleus segmentation that works across microscopy modalities without per-dataset retraining, from the Stringer and Pachitariu groups (Stringer et al. 2021, Nature Methods). For cell-ag it gives a ready segmentation backbone for proliferation counts, confluency tracking, and morphology readouts on cultivated muscle and fat cells.

Docs: https://cellpose.readthedocs.io/.

AI Methods & Tooling

cited by4.4K

CellProfiler

A modular, no-code image-analysis platform for building high-content screening pipelines that measure cell count, shape, intensity, and texture across large image sets (Stirling et al. 2021, BMC Bioinformatics). Useful for cultivated-cell assay development where the same measurement pipeline is applied reproducibly across plates and conditions.

Project page: https://cellprofiler.org/.

AI Methods & Tooling

cited by2.2K

StarDist

A deep-learning model that segments nuclei and cells as star-convex polygons, giving accurate instance boundaries in crowded fields where region-growing methods merge neighbors (Schmidt et al. 2018, MICCAI). Applicable to dense cultivated-cell and organoid imaging where separating touching cells matters for accurate counts.
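The star-convex representation can be sketched in a few lines: an object is described by a center point plus radial distances along evenly spaced rays, and the polygon vertices follow directly from those values. This is a geometric illustration only, not StarDist's code; the function names and the example values are assumptions.

```python
import math

def star_convex_polygon(center, radii):
    """Convert a center and radial distances into polygon vertices.

    `radii[k]` is the distance from `center` to the object boundary along
    the ray at angle 2*pi*k/len(radii).
    """
    cx, cy = center
    n = len(radii)
    vertices = []
    for k, r in enumerate(radii):
        angle = 2.0 * math.pi * k / n
        vertices.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    return vertices


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    area = 0.0
    n = len(vertices)
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


# A roughly circular nucleus: 32 rays, all of length 5 pixels.
nucleus = star_convex_polygon(center=(20.0, 30.0), radii=[5.0] * 32)
area = polygon_area(nucleus)
```

Because every object is a closed polygon around its own center, two touching nuclei stay separate instances, which is the behavior the entry highlights for crowded fields.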

Docs: https://github.com/stardist/stardist#readme.

AI Methods & Tooling

cited by1.6K

AdipoQ

A pair of open-source ImageJ / Fiji plugins that quantify adipocyte morphology and lipid-droplet content in tissue sections and in vitro cultures (Sieckmann et al. 2022, Molecular Biology of the Cell). Directly relevant to cultivated-fat characterization, where lipid accumulation and droplet size are primary readouts of adipogenic differentiation.

Docs: https://github.com/hansenjn/AdipoQ#readme.

AI Methods & Tooling

cited by29

Olfaction & Sensory

OpenPOM

An open-source (MIT) implementation of the Principal Odor Map, an ensemble message-passing graph neural network that predicts odor descriptors directly from a molecule's structure (SMILES), built on DeepChem. It reproduces the modeling approach of Lee et al. 2023 (Science), which mapped structure to human olfactory perception. For cell-ag it offers a structure-to-odor predictor for reasoning about aroma-active compounds in cultivated-product flavor work.

Docs: https://github.com/BioMachineLearning/openpom#readme.

Pyrfume

A Python package that bundles curated olfaction datasets (molecules paired with odor-perception data) in a standardized, analysis-ready form, so odor-prediction models can be trained and benchmarked on consistent inputs. Provides the data layer beneath structure-to-odor modeling for flavor and aroma work in cultivated and alt-protein products.

Project page: https://pyrfume.org/.

cited by15

AGPL-3.0

BitterSweet

Machine-learning models that predict the bitter or sweet taste of small molecules from freely available molecular descriptors, from the Bagler group (Center for Computational Biology, IIIT-Delhi) (Tuwani et al. 2019, Scientific Reports). Applicable to taste screening of media components, metabolites, and flavor molecules relevant to cultivated-product sensory design.

Project page: https://cosylab.iiitd.edu.in/bittersweet/.

cited by76

TastepepAI

An AI framework for customized taste-peptide de novo design and safety assessment, predicting and generating peptides across multiple taste modalities (Yue et al. 2025, PLOS Computational Biology). Relevant to cell-ag flavor engineering, where taste-active peptides shape the sensory profile of cultivated and fermentation-derived products. See also the related TastePeptides-Meta platform at http://www.tastepeptides-meta.com/.

Docs: https://github.com/leleshidawang/TastepepAI#readme.

cited by9

bitter-peptide-design

A de novo bitter-peptide design workflow from the Di Pizio lab that combines three components: ZymCTRL, a conditional protein language model, for peptide generation; BitterPep-GCN, a graph convolutional network, for bitterness prediction; and a KNIME workflow for sequence preparation and physicochemical analysis. For cellular agriculture it supports rational screening and mitigation of bitter off-tastes in protein hydrolysates and next-generation food products by generating and classifying candidate peptides in silico. Companion to Papers.md ref #285 (Steuer et al. 2026).

cited by0

Workflow-Manager Pipelines

nf-core/metaboigniter

A community-curated Nextflow pipeline for untargeted LC-MS metabolomics preprocessing, identification, and analysis within the nf-core consortium. Provides a containerized, parameterizable chain from raw MS data through peak picking (XCMS, CAMERA), alignment, annotation (MetFrag, CSI:FingerID), and statistical analysis. The most directly applicable existing reproducible Nextflow pipeline for flavor and off-flavor metabolomics work in cell-ag contexts.

AI Methods & Tooling

Workflow4Metabolomics (W4M)

A Galaxy-based collaborative research infrastructure for metabolomics, providing a large library of workflow modules for LC-MS, GC-MS, FIA-MS, and NMR preprocessing, identification, and statistical analysis (Giacomoni et al. 2015, Bioinformatics). Integrates XCMS, ropls, MetFrag, CAMERA, and many other established tools into a single web-based interface backed by French institutional compute (IFB), accessible without local installation. The closest "one-stop" reproducible food / flavor metabolomics platform; widely used in academic flavor metabolomics work.

cited by468

UmetaFlow

A Snakemake workflow for untargeted LC-MS/MS metabolomics, maintained by the Biosustain group at DTU (Kontou et al. 2023, Journal of Cheminformatics). Wraps OpenMS, SIRIUS, GNPS / FBMN, and Ion Identity Molecular Networking (IIMN) into a containerized, parameterizable pipeline that produces feature tables, annotations, and molecular networks from raw MS data. Closest existing Snakemake pipeline that flavor-metabolomics work can build on.

AI Methods & Tooling

cited by32

nf-core/ampliseq

A community-curated Nextflow pipeline for amplicon sequencing analysis (16S, 18S, ITS) within the nf-core consortium (Straub et al. 2020, Frontiers in Microbiology). Provides a full preprocessing + ASV-calling + taxonomic-assignment + diversity-analysis chain (Cutadapt → DADA2 → QIIME 2 → PICRUSt2) suitable for microbiome work in fermented foods, fermentation-based alt-protein, and the microbial communities relevant to cultivated-meat bioreactors. Routinely used for the microbiome arm of multi-omics flavor studies.

AI Methods & Tooling

cited by389

nf-core/mag

A community-curated Nextflow pipeline for shotgun metagenomic assembly and binning (Krakau et al. 2022, NAR Genomics and Bioinformatics). Produces metagenome-assembled genomes (MAGs) with quality control, taxonomic classification, and functional annotation — the analytical complement to ampliseq for higher-resolution microbial-community profiling. Relevant to cultivated-meat work involving complex bioreactor microbiomes, scaffold biofilms, or fermentation co-culture analysis.

AI Methods & Tooling

cited by108

AI Agents & Foundation Models

ToolUniverse

An ecosystem for building AI scientists from any language or reasoning model, providing a unified interface across open- and closed-weight models and a shared tool/data/analysis environment. Provides infrastructure that bespoke biomedical AI agent projects have historically had to reinvent. Companion to the Gao et al. 2025 paper introducing ToolUniverse (Papers.md ref #41).

Project page: https://aiscientist.tools/. Docs: https://zitniklab.hms.harvard.edu/ToolUniverse/.

cited by6

Biomni

A general-purpose biomedical AI agent from Stanford's SNAP lab (Leskovec group) that performs autonomous multi-step research workflows across drug discovery, genomics, clinical analysis, and adjacent biomedical domains. Companion to the Huang et al. 2025 paper (Papers.md ref #49).

Project page: https://biomni.stanford.edu. AWS tutorial (Biomni + Bedrock AgentCore): https://aws.amazon.com/blogs/machine-learning/build-a-biomedical-research-agent-with-biomni-tools-and-amazon-bedrock-agentcore-gateway/.

cited by99

AIAgents4Pharma

An open-source AI-agent platform from VirtualPatientEngine targeting drug discovery and pharmaceutical R&D. Hosts a family of LLM-based agents — Talk2Biomodels (kinetic biological models), Talk2KnowledgeGraphs, Talk2Cells, and Talk2Scholars — that share infrastructure for tool use and reasoning over biomedical resources. Companion to the Wehling et al. 2025 Talk2Biomodels paper (Papers.md ref #50).

Docs: https://virtualpatientengine.github.io/AIAgents4Pharma/.

BRAD

A bioinformatics-focused LLM agent (Bioinformatics Retrieval Augmented Digital Assistant) combining retrieval-augmented generation with bioinformatics tool orchestration. Targets automatic biomarker discovery, enrichment analysis, and general bioinformatics automation — patterns that map cleanly onto cell-ag tasks like cell-type marker discovery and pathway-enrichment analysis on cultivated-tissue scRNA-seq. Companion to Papers.md ref #94 (Pickard et al. 2025, Bioinformatics).

cited by10

CellForge

An agentic system for the design of virtual cell models from the Gerstein lab at Yale. Coordinates multiple LLM agents to plan, configure, and execute cell-modeling workflows — directly relevant to cell-ag for cell-type-specific virtual-cell pipelines that media-optimization, perturbation-response prediction, and bioprocess workflows can build on. Companion to Papers.md ref #93 (Tang et al. 2026).

cited by2

AI Scientist

Sakana AI's framework for fully automated open-ended scientific discovery via large language models, performing end-to-end idea generation, experimentation, and paper drafting. Version 2 (SakanaAI/AI-Scientist-v2) extends the system to workshop-level automated discovery via agentic tree search; the AI-Scientist-ICLR2025-Workshop-Experiment repository archives the run whose AI-generated paper passed peer review at an ICLR 2025 workshop. Companion to Lu et al. 2024 (Papers.md ref #45) and Yamada et al. 2025 (Papers.md ref #47).

Project home: https://sakana.ai/.

cited by107

PaperQA

An open-source retrieval-augmented generative agent for answering questions over scientific literature with verified citations, from the FutureHouse research lab. Released as PaperQA in 2023 and substantially extended in PaperQA2 (2024), which its authors report matches or exceeds subject-matter experts on literature retrieval and synthesis tasks. Companion to Papers.md ref #44 (PaperQA, Lála et al. 2023) and ref #46 (PaperQA2, Skarlinski et al. 2024).

FutureHouse cookbook (docs): https://futurehouse.gitbook.io/futurehouse-cookbook. Commercial spinout — Edison Scientific: https://edisonscientific.com/ (docs).

cited by52

Aviary

A gymnasium of language-agent environments for challenging scientific tasks, from the FutureHouse lab. Agents are framed as Language Decision Processes (LDP) and trained and evaluated against tasks spanning molecular cloning, scientific-literature QA, and protein stability — providing the reusable training-and-evaluation substrate beneath FutureHouse's task-specific agents. Companion to Papers.md ref #160 (Narayanan et al. 2024). Apache-2.0; built on the LDP framework, with Aviary docs at https://docs.edisonscientific.com/aviary.

cited by9

Finch

An Aviary-based data-science agent that operates inside Jupyter notebooks, also from FutureHouse. It plans and executes notebook-based analyses, pairing the Aviary environment with the kind of exploratory bioinformatics work — single-cell exploration, media-response analysis — that cell-ag teams increasingly delegate to notebook agents. Apache-2.0; no companion paper at time of curation.

K-Dense-AI

K-Dense-AI is an AI co-scientist ecosystem combining a commercial agent platform (K-Dense Web, which autonomously executes complex science / engineering / healthcare / finance tasks end-to-end) with a substantial open-source stack of agent infrastructure:

scientific-agent-skills — A multi-domain collection of 120+ "Agent Skills" wrapping scientific Python libraries and platforms in Claude Code / Cursor / Antigravity-compatible format. Each skill provides curated recipes, code examples, and discovery prompts for one library. Cell-ag-relevant skills include cobrapy (FBA / metabolic modeling — see Metabolic Modeling & Strain Design), pyopenms (mass spectrometry), scanpy / scvi-tools / anndata (single-cell analysis), rdkit / datamol / medchem (cheminformatics), biopython / bioservices / gget (bioinformatics utilities), cellxgene-census, pyzotero, and many others.
k-dense-byok — A bring-your-own-key desktop client that runs the Scientific Agent Skills locally with your own LLM API keys.
mimeo — A tool for "mimeographing" an expert's knowledge into a SKILL.md / AGENTS.md file consumable by Claude Code or similar agents.
mimeographs — A collection of persona-based agent skills (founders, philosophers, scientists) generated with mimeo.
claude-scientific-writer — A Claude-Code-compatible general-purpose scientific writing agent.
science-superpowers — A composable computational-science methodology for research agents: 15 auto-triggering skills — 13 covering the research lifecycle (framing falsifiable questions, surveying prior work, designing and pre-registering the analysis, reproducible execution, anomaly root-causing, results verification, red-team review, and reporting/archiving) plus two meta skills for authoring and onboarding new skills. A science-domain reimplementation of obra/superpowers whose central discipline is pre-registration rather than test-driven development; runs with only the agent harness and a POSIX shell (zero third-party dependencies). MIT-licensed.

Not an MCP server but a documentation-and-prompt-context layer that pairs well with code-execution agents like Biomni, Cursor, and Claude Code.

Project page: https://k-dense.ai.

Superpowers

An agentic skills framework and software-development methodology authored by Jesse "obra" Vincent (obra/superpowers) — one of the most-starred Claude Code skill collections. Provides domain-agnostic skills for planning, debugging, code review, and execution that compose cleanly with cell-ag-specific skill packs (e.g. K-Dense-AI's scientific-agent-skills) when assembling an agent stack for a cell-ag lab. Shell-based, MIT-licensed. K-Dense-AI's science-superpowers is a science-domain reimplementation of this methodology for data analysis, swapping test-driven development for pre-registration.

Skill Seekers

A meta-tool that converts documentation websites, GitHub repositories, and PDFs into Claude AI skills with automatic conflict detection. The closest existing automation for the pattern an AI-augmented cell-ag lab needs as it scales: turn a new wet-lab protocol PDF, a new bioinformatics package's docs, or a new GitHub library's README into a Claude Code / Cursor skill that the lab's agents can call directly — without hand-curating each integration. Python, MIT-licensed.

AI Research Skills Library

An open-source library of AI research and engineering skills covering vLLM, Megatron, GRPO, HuggingFace, and the broader LLM training and serving stack — maintained by Orchestra Research. Designed to package skills into Claude Code, Codex, or Gemini agents so they operate as fully-equipped AI research agents. Cell-ag teams building or fine-tuning their own biology foundation models (cf. TranscriptFormer) or running large-scale agentic workflows can pull from this library for the ML-training and serving infrastructure layer rather than reinventing it. MIT-licensed.

Proprietary

Seqera AI / Co-Scientist

Seqera Cloud's AI assistant for bioinformatics workflows, providing an interactive co-scientist that helps users author and debug Nextflow pipelines, query workflow run data, and interpret results. Accessible through the Seqera CLI and Seqera Platform; no companion paper at time of curation.

Dotmatics Luma

A commercial lab-orchestration platform from Dotmatics that connects laboratory instruments, data systems, and AI assistance into a unified "connected digital lab" workflow — covering instrument integration, automated data flow, electronic-lab-notebook integration, and agent-assisted experimental planning. Marketed primarily to biopharma and biotech R&D groups; representative of the commercial-tooling layer that cell-ag startups increasingly evaluate as they scale beyond bench-scale workflows. See also the Dotmatics Luma webinar in Talks.md for an overview of the platform.

scDataset

A PyTorch IterableDataset for efficient deep-learning training on single-cell omics too large to fit in memory, streaming directly from on-disk AnnData (and other formats) with no prior conversion step. It combines block sampling with batched fetching to approximate random sampling, recovering the minibatch diversity that training needs while avoiding the throughput collapse of true random disk access; on the 100-million-cell Tahoe-100M it reports a speedup of more than two orders of magnitude over true random sampling while matching its downstream accuracy (D'Ascenzo & Cultrera di Montesano 2025, arXiv:2506.01883). For cell-ag, it is the data-loading layer beneath single-cell foundation-model training (Geneformer, scGPT, scFoundation and the like) on atlas-scale corpora, including Tahoe-100M, that increasingly inform cellular-engineering and perturbation models.
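The combination of block sampling and batched fetching can be sketched without PyTorch. The generator below is a simplified illustration under stated assumptions, not scDataset's implementation: the function names, the parameter names `block_size` and `fetch_factor`, and the in-memory index lists standing in for disk reads are all choices made for this example.

```python
import random

def block_sampled_batches(n_cells, batch_size, block_size, fetch_factor, seed=0):
    """Yield minibatches of cell indices using block sampling plus
    batched fetching.

    1. Split the index range into contiguous blocks of `block_size`.
    2. Shuffle the order of the blocks.
    3. Gather blocks until `batch_size * fetch_factor` indices are buffered,
       sort them so a disk-backed store could be read mostly sequentially,
       then shuffle in memory and cut the buffer into minibatches.
    """
    rng = random.Random(seed)
    blocks = [
        list(range(start, min(start + block_size, n_cells)))
        for start in range(0, n_cells, block_size)
    ]
    rng.shuffle(blocks)

    fetch_size = batch_size * fetch_factor
    buffer = []
    for block in blocks:
        buffer.extend(block)
        if len(buffer) >= fetch_size:
            yield from _drain(buffer, batch_size, rng)
            buffer = []
    if buffer:
        yield from _drain(buffer, batch_size, rng)


def _drain(buffer, batch_size, rng):
    """Simulate one batched fetch, then emit shuffled minibatches."""
    fetched = sorted(buffer)   # stands in for one sorted read request
    rng.shuffle(fetched)       # in-memory shuffle restores batch diversity
    for start in range(0, len(fetched), batch_size):
        yield fetched[start:start + batch_size]


batches = list(block_sampled_batches(n_cells=1000, batch_size=32,
                                     block_size=8, fetch_factor=4))
```

Small blocks and a large fetch factor push the result closer to true random sampling, while large blocks favor sequential reads; the trade-off between the two is the tuning question the entry describes.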

cited by0

TranscriptFormer

A family of generative foundation models for single-cell transcriptomics from the Chan Zuckerberg Initiative, trained on up to 112 million cells spanning 1.53 billion years of evolution across 12 species (Pearce et al. 2026, Science; see Papers.md ref #92). Provides state-of-the-art performance on cell-type classification and supports cross-species reasoning over transcriptomic data — directly relevant to cell-ag for translating biological knowledge between bovine, porcine, chicken, salmonid, and other livestock cells where annotated reference data is sparse (see the per-species pages in Datasets/ for the cell-ag-relevant data substrate). Distributed via CZI's Virtual Cells Platform with versioned releases.

Quickstart docs: https://virtualcellmodels.cziscience.com/quickstart/transcriptformer-quickstart. Announcement: https://chanzuckerberg.com/blog/transcriptformer-model-overview/.

Geneformer

A transformer-based foundation model for transfer learning in network biology from the Theodoris lab (Broad Institute / Gladstone), pretrained on ~30 million human single-cell transcriptomes via rank-encoded masked language modeling. Distributed exclusively through Hugging Face with tokenizer, pretrained weights, and example fine-tuning recipes for cell-type classification, gene-network inference, and in silico perturbation prediction; widely used as a single-cell-FM baseline. Companion to Papers.md ref #111 (Theodoris et al. 2023, Nature); pretraining corpus: Genecorpus-30M in Datasets/HumanReference.md.

scGPT

A generative pretrained transformer for single-cell multi-omics from the Wang lab at the University Health Network (Toronto), trained on >33M cells spanning scRNA-seq, scATAC-seq, and CITE-seq. Provides downstream fine-tuning recipes for cell-type annotation, multi-batch integration, gene-regulatory-network inference, and perturbation prediction; one of the most-used baselines for newer single-cell foundation models. Companion to Papers.md ref #117 (Cui et al. 2024, Nature Methods). Documentation: https://scgpt.readthedocs.io/.

scBERT

An early single-cell BERT-style foundation model for cell-type annotation from Tencent AI Lab Healthcare, treating individual genes as tokens with binned expression values. Released alongside Papers.md ref #112 (Yang et al. 2022, Nature Machine Intelligence); the independent re-evaluation in ref #113 (Boiarsky et al. 2024) and author reply in ref #114 (Yang et al. 2024) are core methodological reading for anyone benchmarking new single-cell FMs against existing baselines.

cited by604

scFoundation

A large-scale foundation model on single-cell transcriptomics from BioMap Research, pretrained on ~50M cells with read-depth-aware encoding that explicitly handles the variable sequencing depths characteristic of public scRNA-seq corpora. Provides downstream applications spanning cell-type annotation, drug-response prediction, and perturbation-effect modeling. Companion to Papers.md ref #116 (Hao et al. 2024, Nature Methods).

cited by486

UCE

Universal Cell Embeddings from Stanford's SNAP lab (Leskovec group) — a single-cell foundation model that represents each cell as an unordered set of expressed genes and each gene by its protein-language-model embedding, enabling zero-shot generalization to species and tissues never seen at training time. Releases include pretrained weights and zero-shot inference scripts for novel cell-type discovery across species. Companion to Papers.md ref #119 (Rosen et al. 2026, bioRxiv; Nature, in press at time of curation). The same lab's earlier SATURN method (ref #118, Rosen et al. 2024, Nature Methods) introduced the protein-LM-gene-embedding pattern that UCE generalizes — directly relevant to cell-ag where annotated livestock-species single-cell data is sparse and cross-species transfer is essential.

cited by160

tGPT

A generative pretraining model for single-cell deciphering, applying GPT-style autoregressive next-token prediction over gene-expression vocabularies. Smaller and earlier than scGPT or Geneformer, but methodologically important as one of the first demonstrations that next-token-prediction objectives (vs. masked-language-modeling) work for single-cell biology — the lineage that now includes Arc's State and Cell2Sentence. Companion to Papers.md ref #115 (Shen et al. 2023, iScience).

cited by60

Cell2Sentence (C2S-Scale)

A framework for treating single-cell expression profiles as natural-language sentences — ordered lists of expressed gene symbols ranked by expression — enabling direct reuse of pretrained LLM architectures (and, in C2S-Scale, billion-parameter scaling) for single-cell biology. From the van Dijk lab at Yale. Companion to Papers.md ref #120 (Rizvi et al. 2026, bioRxiv). C2S-Scale project page: https://www.vandijklab.org/c2s-scale.
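The cell-sentence idea described above is simple enough to show directly. The sketch below turns one expression profile into a space-separated list of gene symbols ordered by decreasing expression; it is an illustration rather than the Cell2Sentence code, and the function name, the `top_k` option, the alphabetical tie-break, and the toy values are assumptions.

```python
def cell_to_sentence(expression, top_k=None):
    """Turn one cell's expression profile into a 'cell sentence'.

    `expression` maps gene symbol -> expression value. Genes with zero
    expression are dropped; the rest are ordered from highest to lowest
    expression, with ties broken alphabetically for determinism.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in expressed]
    if top_k is not None:
        genes = genes[:top_k]
    return " ".join(genes)


# Toy myoblast-like profile (values are made up for illustration).
cell = {"MYOD1": 12.0, "ACTB": 85.0, "PAX7": 0.0, "DES": 30.0, "MYF5": 12.0}
sentence = cell_to_sentence(cell)
# sentence == "ACTB DES MYF5 MYOD1"
```

Once cells are plain text, they can be tokenized and fed to an existing language-model architecture, which is what makes reuse of pretrained LLMs possible in this framing.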

cited by32

GEARS

Graph-Enhanced Gene Activation and Repression Simulator: a graph neural network for predicting transcriptional outcomes of novel multi-gene CRISPR perturbations, including combinations never observed during training. From Stanford's SNAP lab (Leskovec group). Generalizes single-gene perturbation training data to combinatorial perturbation prediction via gene co-expression and gene-ontology graph priors. Companion to Papers.md ref #121 (Roohani et al. 2024, Nature Biotechnology).

cited by312

CC-BY-NC-SA-4.0

State + Cell-Eval

Arc Institute's first-generation virtual cell model and companion evaluation framework, designed to predict stem-cell, cancer-cell, and immune-cell responses to drugs, cytokines, and genetic perturbations. Trained on ~170M observational and ~100M perturbational single-cell measurements across 70+ cell lines, it uses a bidirectional transformer architecture with self-attention over cell sets and is reportedly the first model to consistently beat simple linear baselines on perturbation-response prediction. Released alongside cell-eval, the standardized evaluation framework for virtual-cell models. Companion to Papers.md ref #57 (Adduri et al. 2025, bioRxiv); see also the Arc Institute news article on State on the AI Agents & Foundation Models page. The follow-on Stack model, companion to Papers.md ref #124 (Dong et al. 2026), extends State with in-context learning, simulating cellular conditions via prompt engineering without further fine-tuning.

cited by64

BioDiscoveryAgent

An LLM-based AI agent from Stanford's SNAP lab for designing genetic-perturbation experiments — including CRISPR-Cas9 single-gene and combinatorial knockouts — by reasoning over gene-function literature, prior screens, and experimental constraints. Demonstrates that an LLM agent with tool use can match or exceed specialized active-learning methods on hit-rate-driven experimental-design tasks. Companion to Papers.md ref #125 (Roohani et al. 2025, arXiv). Directly applicable to cell-ag as an off-the-shelf experimental-design layer for cell-line-engineering campaigns (selecting which TFs to overexpress for myogenic vs. adipogenic differentiation, or which media-pathway genes to knock down to test rate-limiting steps).

cited by16

CausalBench

A large-scale benchmark for evaluating network-inference methods from single-cell perturbation data — including Perturb-seq, CROP-seq, and ECCITE-seq. Built around interventional ground truth from genome-scale CRISPR screens, providing standardized metrics, baselines, and dataset splits for ML methods that infer gene-regulatory networks. Companion to Papers.md ref #127 (Chevalley et al. 2025, Communications Biology).

cited by14

BioContextAI

A community hub for agentic biomedical systems — a registry of biomedical Model Context Protocol (MCP) servers plus a knowledgebase MCP server that exposes curated biomedical resources to LLM agents. Lets cell-ag teams plug standardized biomedical tools and data sources into agent stacks (Claude Code, Cursor, Biomni) without bespoke per-resource integration. Companion to Papers.md ref #133 (Kuehl et al. 2025, Nature Biotechnology). GitHub org: https://github.com/biocontext-ai.

BioMCP

A one-binary MCP server from GenomOncology unifying many biomedical knowledge sources — PubTator3, Europe PMC, ClinicalTrials.gov, MyVariant.info, cBioPortal, Reactome, Open Targets, MyDisease.info, MONDO, Monarch, DisGeNET — behind a single Model Context Protocol surface for LLM agents. MIT-licensed; the leanest existing MCP-native bridge between general biomedical literature, clinical-trial, and variant data and an agent stack. Sister project to BioContextAI, which catalogues biomedical MCP servers including BioMCP.

Context7

An open-source MCP server (and hosted service) from Upstash that injects up-to-date, version-specific library documentation and code examples into LLM prompts, so AI coding agents work from current API docs instead of stale training data. Unlike the biomedical MCP servers above, Context7 is general developer infrastructure — not cell-ag-specific — but it is directly relevant to CAAIL's AI-agent audience: the coding agents that build and maintain cell-ag pipelines, parsers, and analysis tooling depend on accurate, current documentation for the bioinformatics and ML libraries they call. MIT-licensed; hosted at https://context7.com.

Apache-2.0 + Commons Clause

Virtual Lab

An LLM multi-agent framework that runs a simulated "research lab" (a principal-investigator agent coordinating specialist agents plus a critic) to tackle open-ended scientific problems, from the Zou group at Stanford (Swanson et al. 2025, Nature), where it designed SARS-CoV-2 nanobodies with experimental validation. For cell-ag it is a reusable pattern for multi-agent experiment design and hypothesis generation across media, cell-engineering, and bioprocess problems.

Docs: https://github.com/zou-group/virtual-lab#readme.

cited by110

Coscientist

An autonomous LLM-driven agent that plans, codes, and executes chemistry experiments end to end, combining web search, documentation retrieval, and control of automated lab hardware (Boiko et al. 2023, Nature). A reference architecture for closed-loop autonomous experimentation that maps onto cell-ag tasks such as automated media-formulation or assay optimization. Companion to Papers.md ref #70.

Docs: https://github.com/gomesgroup/coscientist#readme.

cited by850

Data Standards & Interchange Formats

SBML (Systems Biology Markup Language)

The de-facto XML-based standard for representing computational models of biological processes — metabolic networks, signaling pathways, gene-regulatory networks, and kinetic models. SBML is the interchange format for every genome-scale metabolic model catalogued in the per-species pages of the Datasets/ directory and the lingua franca of the constraint-based and kinetic modeling tools in Metabolic Modeling & Strain Design. Maintained by the SBML community with libSBML bindings across all major languages.
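Because SBML is plain XML, the basic shape of a model can be inspected with the Python standard library alone. The sketch below parses a hand-written toy document and lists its species and reactions. It is a reading aid, not a substitute for libSBML: the toy model, its identifiers, and the helper name `summarize_sbml` are assumptions, and the document is not guaranteed to pass full SBML validation.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 3 Version 1 core.
SBML_NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}

# Hand-written toy model with one reaction (glucose -> glucose 6-phosphate).
TOY_MODEL = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy_hexokinase_step">
    <listOfCompartments>
      <compartment id="c" constant="true"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glc" compartment="c" hasOnlySubstanceUnits="false"
               boundaryCondition="false" constant="false"/>
      <species id="g6p" compartment="c" hasOnlySubstanceUnits="false"
               boundaryCondition="false" constant="false"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="HEX1" reversible="false">
        <listOfReactants>
          <speciesReference species="glc" stoichiometry="1" constant="true"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="g6p" stoichiometry="1" constant="true"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def summarize_sbml(xml_text):
    """Return (model id, species ids, {reaction id: (reactants, products)})."""
    root = ET.fromstring(xml_text)
    model = root.find("sbml:model", SBML_NS)
    species = [s.get("id")
               for s in model.findall("sbml:listOfSpecies/sbml:species", SBML_NS)]
    reactions = {}
    for rxn in model.findall("sbml:listOfReactions/sbml:reaction", SBML_NS):
        reactants = [r.get("species") for r in rxn.findall(
            "sbml:listOfReactants/sbml:speciesReference", SBML_NS)]
        products = [p.get("species") for p in rxn.findall(
            "sbml:listOfProducts/sbml:speciesReference", SBML_NS)]
        reactions[rxn.get("id")] = (reactants, products)
    return model.get("id"), species, reactions


model_id, species_ids, reaction_map = summarize_sbml(TOY_MODEL)
```

For genome-scale models, a dedicated reader handles units, flux bounds, and package extensions that this sketch ignores.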

cited by3.2K

LinkML (Linked data Modeling Language)

A schema language for authoring, validating, and transforming structured data models, with first-class support for ontology terms, code generation across languages, and export to JSON-Schema / SHACL / OWL. Increasingly used to define machine-readable metadata schemas for biological datasets and knowledge graphs — the structured backbone that agentic AI systems need in order to reason reliably over cell-ag data resources.

AI Methods & Tooling

Project PISCES (Standard Flowsheet Format)

Project PISCES (Process Integration & Synthesis using Chemical Engineering Standards) standardizes process flowsheet data into a machine-readable Standard Flowsheet Format (SFF) for AI-powered knowledge extraction and analysis. For cellular agriculture, a standardized flowsheet format is the missing substrate for AI-assisted bioprocess design, scale-up modeling, and techno-economic analysis — letting agents reason over cultivated-meat process designs the way they reason over SBML metabolic models. SFF documentation: https://projectpisces.org/?page=sff-docs.

Process-flowsheet background (what SFF standardizes) — for readers approaching this from the AI / biology side, the LibreTexts Foundations of Chemical and Biological Engineering chapter on chemical processes and process diagrams and the ScienceDirect "Flowsheet" topic overview introduce the flowsheet concept and its notation.

Cell-ag application context — The Unjournal's cultivated-meat cost-modeling work, namely the unjournal/cm_pq_modeling repository and its techno-economic comparison of cultured-chicken cost models, is exactly the kind of bioprocess techno-economic analysis that a standardized flowsheet format like SFF is designed to make reproducible and machine-comparable.

AI Methods & Tooling

Biomedical Ontology & Identifier Infrastructure

Biopragmatics Stack

An interlocking stack of MIT-licensed Python tools and registries supporting biomedical semantics and pragmatics. Each component is independently usable, and together they cover the full lifecycle of biomedical-entity identification, normalization, and cross-linking. Directly relevant to cell-ag agentic workflows that need to reason consistently across the livestock-genomics, metabolic-modeling, sensomics, and chemistry resources elsewhere in CAAIL.

GitHub org: https://github.com/biopragmatics. Core components:

Bioregistry — A registry of biomedical identifier registries, with prefix normalization, identifier resolution, and a REST API. The meta-resource the rest of the stack builds on.
pyobo — Python library for using ontologies, terminologies, and biomedical nomenclatures.
bioontologies — Unified access across biomedical ontologies.
biolookup — Service for retrieving metadata and ontological information for biomedical entities.
Biolexica — Generates and applies coherent biomedical lexical indices for named-entity recognition (NER) and normalization (NEN).
Biosynonyms — Decentralized database of synonyms for biomedical concepts.
Biomappings — Community-curated and predicted equivalences and related mappings between named biological entities not available from primary sources.
SemRA — Semantic Mapping Reasoning Assembler, for assembly and reasoning over semantic mappings at scale (Hoyt et al. 2025, Bioinformatics).
bioversions — Tracks the latest version of each biomedical database — useful as a freshness check across the resources curated in Databases.md.
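
As a rough illustration of what prefix normalization and identifier resolution mean in practice, the sketch below canonicalizes compact identifiers (CURIEs) against a tiny in-memory table. It does not call the Bioregistry package or its REST API. The synonym table, the function names, and the assumption that a normalized CURIE can be appended to https://bioregistry.io/ to form a resolver link are all illustrative choices made for this example.

```python
# Hypothetical mini-registry: canonical prefix -> alternative spellings.
# These entries are illustrative only, not copied from Bioregistry.
PREFIX_SYNONYMS = {
    "chebi": {"chebiid"},
    "uniprot": {"uniprotkb"},
    "ensembl": {"ensemblgene"},
}


def normalize_prefix(prefix):
    """Return the canonical lowercase prefix, or None if it is unknown."""
    key = prefix.strip().lower()
    for canonical, synonyms in PREFIX_SYNONYMS.items():
        if key == canonical or key in synonyms:
            return canonical
    return None


def normalize_curie(curie):
    """Rewrite 'PREFIX:local_id' with a canonical prefix, or return None."""
    prefix, sep, local_id = curie.partition(":")
    local_id = local_id.strip()
    canonical = normalize_prefix(prefix)
    if not sep or not local_id or canonical is None:
        return None
    return f"{canonical}:{local_id}"


def resolver_url(curie, base="https://bioregistry.io/"):
    """Build a resolver-style link (the URL pattern is an assumption)."""
    normalized = normalize_curie(curie)
    return None if normalized is None else base + normalized


print(normalize_curie("CHEBI:24867"))
print(resolver_url("UniProtKB:P01308"))
```

Keeping one canonical prefix per resource is what lets records from different databases be joined on identifier strings without per-source special cases.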

AI Methods & Tooling

BridgeDb

A framework plus companion mapping databases that translate identifiers across gene, protein, and metabolite databases through one unified API (a Java library, the R/Bioconductor BridgeDbR client, a JavaScript client, a web service, and a Cytoscape app). For cellular agriculture it resolves the identifier-scheme mismatches — Ensembl vs UniProt vs KEGG/HMDB/ChEBI — that otherwise break multi-omics integration and genome-scale metabolic-model curation for the non-model livestock and microbial-host species the field works with (van Iersel et al. 2010, BMC Bioinformatics). Apache-2.0.

Metabolic modeling

cited by235

Food Safety & Allergenicity

AllerCatPro 2.0

Screens a query protein for allergenic potential by combining sequence homology with predicted three-dimensional structure similarity against a reference set of 4,979 known allergens drawn from the COMPARE, AllergenOnline/FARRP, and WHO/IUIS databases. The method is similarity- and rule-based rather than machine-learned. For cellular agriculture it is a practical first-pass in-silico screen, run before wet-lab IgE testing, for the recombinant growth factors, media proteins, and scaffold or matrix proteins introduced into cultured products.

Allergenicity

cited by366

AllerTOP v2

Alignment-free allergenicity classifier that converts a protein's per-residue physicochemical descriptors into a fixed-length vector by auto-cross-covariance and predicts allergen versus non-allergen with k-nearest-neighbours. Because it needs no alignment, it can flag potential allergenicity for engineered or non-natural sequences that have no close homolog, which suits the de-novo-designed proteins used in cell-ag media and products.
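
The auto-cross-covariance idea can be sketched in a few lines. This is a minimal illustration of one common form of the transform, not AllerTOP's code: the function names are hypothetical, the toy descriptor values are made up, and the actual descriptor set, lag range, and neighbour count used by the tool are not reproduced here. For each lag it averages the product of descriptor j at position i and descriptor k at position i + lag, so every protein yields the same number of features regardless of length.

```python
import math


def acc_transform(descriptors, max_lag):
    """Auto-cross-covariance over per-residue descriptor vectors.

    descriptors: list of n vectors, each holding d descriptor values.
    Returns d * d * max_lag features, independent of n.
    """
    n = len(descriptors)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    d = len(descriptors[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                total = sum(
                    descriptors[i][j] * descriptors[i + lag][k]
                    for i in range(n - lag)
                )
                features.append(total / (n - lag))
    return features


def nearest_label(query_vector, training):
    """1-nearest-neighbour vote; training is a list of (vector, label)."""
    return min(training, key=lambda item: math.dist(query_vector, item[0]))[1]


# Toy input: 3 residues, 2 made-up descriptors per residue.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(acc_transform(toy, max_lag=1))
```

Because the output length depends only on the number of descriptors and the maximum lag, vectors from proteins of different lengths can be compared directly with a distance-based classifier.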

Allergenicity

cited by2K

AllergenFP

Encodes each protein as a binary fingerprint of physicochemical properties and scores allergenicity by Tanimoto similarity to known allergens. A companion alignment-free method to AllerTOP, useful as an independent second-opinion screen on candidate cellular-agriculture proteins.
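
A minimal sketch of Tanimoto scoring over binary fingerprints follows. The fingerprints, reference names, and helper functions are hypothetical, and how AllergenFP actually derives its bits from physicochemical properties is not shown. The coefficient is the number of bits set in both fingerprints divided by the number set in either.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def most_similar(query, references):
    """Return (name, score) of the reference fingerprint closest to query.

    references: dict mapping a protein name to its fingerprint.
    """
    name = max(references, key=lambda key: tanimoto(query, references[key]))
    return name, tanimoto(query, references[name])


# Made-up 4-bit fingerprints for illustration only.
references = {"ref_A": [1, 1, 0, 0], "ref_B": [0, 0, 1, 1]}
print(most_similar([1, 0, 0, 0], references))
```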

Allergenicity

cited by762

AlgPred 2.0

Machine-learning allergenicity predictor that classifies a protein as allergen or non-allergen using a random-forest ensemble over compositional, BLAST, and motif features, and additionally maps IgE-binding epitopes so the specific allergenic regions of an engineered protein can be localized rather than only flagged. The method paper is Papers.md ref #290 and its labeled training corpus is catalogued in Datasets/FoodSafety.md.

Allergenicity

Allermatch

Implements the FAO/WHO Codex Alimentarius allergenicity criteria against curated allergen databases. A query is flagged if any sliding 80-amino-acid window reaches 35% identity or more with a known allergen, or if it shares an exact match of six contiguous amino acids with one. It is the regulatory-standard homology screen that any novel protein entering the food chain, including cultivated-meat media and product proteins, is expected to pass.
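
The two criteria can be sketched as follows. This is a simplified illustration and not Allermatch's implementation: it compares windows position by position without gaps, whereas a real screen aligns each window against the allergen database, and the function names and the inclusive threshold handling are assumptions made for this example.

```python
def shares_exact_kmer(query, allergen, k=6):
    """True if the two sequences share any identical stretch of k residues."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(
        query[i:i + k] in allergen_kmers for i in range(len(query) - k + 1)
    )


def best_window_identity(query, allergen, window=80):
    """Highest ungapped identity of any query window at any allergen offset.

    Simplification: no gaps are allowed, and a query shorter than the
    window is treated as a single window of its own length.
    """
    best = 0.0
    for q_start in range(max(1, len(query) - window + 1)):
        q_win = query[q_start:q_start + window]
        for a_start in range(len(allergen) - len(q_win) + 1):
            a_win = allergen[a_start:a_start + len(q_win)]
            matches = sum(1 for x, y in zip(q_win, a_win) if x == y)
            best = max(best, matches / len(q_win))
    return best


def flags_allergen(query, allergen, window=80, min_identity=0.35, k=6):
    """Either criterion alone is enough to flag the query."""
    return (
        shares_exact_kmer(query, allergen, k)
        or best_window_identity(query, allergen, window) >= min_identity
    )


print(flags_allergen("ACDEFGHIKL", "MMACDEFGMM"))
```

In this toy call the two sequences share the six-residue stretch ACDEFG, so the query is flagged by the exact-match rule without the window comparison being needed.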

Allergenicity

cited by180

ALLERDET

Deep-learning allergenicity web application that learns sequence features with a restricted Boltzmann machine and classifies with a decision tree, trained on curated allergen and non-allergen sequences. A recent alternative classifier for screening candidate cell-ag proteins.

Allergenicity

cited by28

AllergenAI

Deep-learning model that quantifies a protein's allergenic potential from sequence alone using a convolutional network, trained on allergen sets from SDAP 2.0, COMPARE, and AlgPred 2.0. Unlike homology- or physicochemical-feature tools it learns sequence features directly, and has been applied to flag candidate novel allergens in the plant cupin and vicilin protein families relevant to alternative-protein feedstocks.

Allergenicity

Techno-Economic & Life-Cycle Assessment

NCSA

BioSTEAM

Python platform for the design, simulation, techno-economic analysis, and life-cycle assessment of biorefinery and fermentation processes under uncertainty, with economics validated against SuperPro Designer and Aspen Plus. It is the open-source engine most cultivated-meat cost models build on for fermentation-bioproduct economics.
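
To make "techno-economic analysis under uncertainty" concrete, here is a toy Monte Carlo sketch in plain Python. It does not use BioSTEAM, and every number, range, and name in it (including the flat annualization factor) is a hypothetical placeholder rather than a cultivated-meat estimate. Each sample draws capital cost, operating cost, and annual output from uniform ranges and computes a unit production cost.

```python
import random
import statistics


def unit_cost(capex, opex_per_year, output_kg_per_year, annualization=0.1):
    """Cost per kg: annualized capital plus operating cost over output."""
    return (capex * annualization + opex_per_year) / output_kg_per_year


def monte_carlo_unit_cost(n_samples, seed=0):
    """Sample unit costs from hypothetical uniform input ranges."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    costs = []
    for _ in range(n_samples):
        capex = rng.uniform(50e6, 150e6)   # placeholder range, USD
        opex = rng.uniform(10e6, 30e6)     # placeholder range, USD per year
        output = rng.uniform(1e6, 3e6)     # placeholder range, kg per year
        costs.append(unit_cost(capex, opex, output))
    return costs


costs = monte_carlo_unit_cost(1000)
print(f"median unit cost: {statistics.median(costs):.2f} per kg")
```

Reporting the spread of the sampled costs, rather than a single point estimate, is the practical difference between a deterministic cost model and an uncertainty-aware one.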

Techno-Economic & LCA

cited by167

NCSA

QSDsan

Open-source Python platform built on BioSTEAM that integrates process modeling, simulation, techno-economic analysis, and life-cycle assessment in one toolkit. It is the tightest single-package TEA-plus-LCA combination for evaluating bioprocess cost and sustainability trade-offs.

Techno-Economic & LCA

cited by30

Brightway

The leading Python-native open-source LCA framework, spanning life-cycle data, Monte Carlo uncertainty, input-output analysis, and the Activity Browser GUI. It is the scriptable, reproducible environmental-footprint engine used for cradle-to-gate cultivated-meat LCA (documentation at https://docs.brightway.dev/).

Techno-Economic & LCA

cited by558

MPL-2.0

openLCA

The most widely used free and open professional LCA desktop application, from GreenDelta, with a large database ecosystem (Nexus) and a Python IPC API. It is the GUI complement to Brightway and is heavily used in food and agricultural LCA, including cultivated-meat studies (home at https://www.openlca.org/).

Techno-Economic & LCA