“AI for bio” is getting hot again. Given the excitement in the current moment, I thought I’d share a bit about what actually makes biology uniquely hard as an application domain for machine learning. The reason is not simply that biology is complicated, though it obviously is. ML is good at many things that are complicated. The deeper reason is that drug discovery does not have the kind of clean feedback loops and clean interfaces that made modern ML so powerful elsewhere.

In software, we are used to clean APIs. One team can build a backend service, expose an endpoint, and another team can build on top of it. The interface is typed. The object either satisfies the contract or it does not. If something breaks, you can usually trace the failure to a bug, fix the code, rerun the test, and ship again. This is so much the case that billion-dollar companies are regularly built satisfying exactly one interface (e.g. Supabase for databases, Exa for search, NVIDIA for GPU compute).

It is tempting to imagine drug discovery the same way:

target = target_discovery(disease)
drug = drug_design(target)
medicine = clinical_trial(drug)

Target discovery gives you a target. Drug design gives you a molecule. Clinical trials tell you whether it works. One company satisfies each interface.

Unfortunately biology does not expose clean APIs. The output of target discovery is not really a target. It is a probabilistic hypothesis that modulating some biological process, in some direction, in some tissue, in some patient subset, at some disease stage, will produce a useful clinical effect without unacceptable toxicity. The output of drug design is not really a drug. It is a candidate intervention whose value depends on whether the target hypothesis was right, whether the modality is appropriate, whether the molecule reaches the right tissue, whether it has enough selectivity, whether the safety margin is acceptable, whether it can be manufactured, whether it has a defensible IP position, whether the unknown competitive landscape materializes, and whether it fits a viable clinical strategy. The output of a clinical trial is not simply a “cure”. It is an outcome filtered through patient selection, endpoint choice, dosing regimen, site execution, statistical power, standard of care, and regulatory interpretation.
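
To make the contrast concrete, here is a sketch of what target_discovery actually returns, in the spirit of the pseudocode above. Every field name here is illustrative, not any real schema:

from dataclasses import dataclass

@dataclass
class TargetHypothesis:
    biology: str           # the process to modulate, e.g. a pathway
    direction: str         # "inhibit" or "activate"
    tissue: str            # where modulation must occur
    patient_subset: str    # who is expected to benefit
    disease_stage: str     # when in the disease course
    p_causal: float        # confidence the biology is actually causal
    toxicity_risk: float   # estimated safety liability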

So the API is fuzzy. A target can look validated until the molecule hits it in the wrong tissue. A molecule can look great until it fails because the disease biology was wrong. An animal model can look convincing until the human disease turns out to be meaningfully different. A trial can look negative even though the drug might have worked in a narrower patient population. Each stage encodes specific assumptions about the stages upstream and downstream of it. To me, that is the core problem: AI for bio has a fuzzy API problem. In software, good APIs hide complexity. In biology, the hidden complexity quietly kills programs. This essay is about where that fuzziness shows up across target discovery, drug design, and clinical development, and where it creates both challenges and opportunities for machine learning to transform the field.

The discovery process

To first order, drug discovery is about designing an intervention that stops some deleterious process occurring inside the body. In practice, modern therapeutic development usually looks something like this:

  1. Determine the causal biology driving a disease in a particular patient population.
  2. Design a chemical or biological intervention that modulates that biology.
  3. Test that intervention in model systems (cells, animals, etc.) to build evidence that it is safe and plausibly effective in humans.
  4. Run clinical trials to determine whether it works in humans.

That clean list hides a lot. The intervention itself can take many forms, called modalities: small molecules that block a particular protein, antibodies that deliver toxins to cells displaying certain molecules, pieces of RNA that block protein production, in-vivo gene edits that permanently disable disease-driving genes, methods for reprogramming the immune system to recognize diseased or foreign cells, and several others. Clinical development then generally proceeds through three phases, testing various aspects of safety and efficacy in humans.

For the purposes of this post, I’ll simplify the whole process into three stages:

  1. Target discovery: What biology should we modulate?
  2. Preclinical design and translation: What intervention can modulate it? Does the intervention look safe and effective enough in model systems to justify trying it in humans?
  3. Clinical development: Does it actually help the intended human population?

Machine learning can matter in all three stages. But the type of ML that matters, and the difficulty of the problem, varies a lot by stage.

Target Discovery

When I talk about target discovery, in essence what we’re saying is “do biology research to find a hypothesis of something worth targeting”. Historically, a lot of target discovery research happened in academia. Large-scale efforts like the Broad Institute’s DepMap helped identify which perturbations in cancer cells affected their growth. Researchers used that data to find genes driving cancer cell growth, and hypothesized that blocking those genes would enable anti-cancer drugs. Note that these were not experiments on people. They were usually on cancer cell lines grown ex vivo, or occasionally on patient-derived lines.

Other efforts like genome-wide association studies (GWAS) attempt to use real human data collected from population-scale sequencing efforts like the UK Biobank. The idea is pretty simple — with enough people, you can identify genetic variants associated with disease risk. A GWAS hit might tell you that a region of the genome is associated with a disease. It does not automatically tell you which gene matters, in which cell type, through which pathway, in which patient subset, or how you should intervene therapeutically. This is the fuzzy API problem again. The “target” returned by human genetics is often not a clean therapeutic object. It is a clue that needs further validation by physical experiments. So historically, there has not been much deep learning at the core of target discovery. There have been statistical genetics methods, network methods, and deep learning tools for things like gene regulation. But not a unified model that could answer the question we actually care about:

“if I intervene on patient x in cell type y with intervention z, what will happen?”.

That problem framing is making the field newly excited about so-called “virtual cell” or “virtual human” models. The idea is roughly: collect lots of data about cell/person states, perturb them in various ways, and measure many readouts — transcriptomics, proteomics, imaging, cell growth, functional phenotypes, clinical data — and train models that learn the relationships among them. Scaling laws from LLMs give a framework for how to think about allocating resources, and some early efforts suggest there are similar scaling laws in these data too. Companies like Tahoe, Noetik, and Recursion are all pursuing some version of this. This is a good direction.
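
As a minimal sketch of that framing, assume a model that maps a (cell state, perturbation) pair to predicted readouts. Everything here is hypothetical; real efforts differ enormously in data, architecture, and scale:

import torch
import torch.nn as nn

class VirtualCell(nn.Module):
    # Predict post-perturbation readouts (e.g. transcriptomics) from a
    # baseline cell state and an encoding of the intervention.
    def __init__(self, state_dim, perturb_dim, readout_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + perturb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, readout_dim),
        )

    def forward(self, cell_state, perturbation):
        return self.net(torch.cat([cell_state, perturbation], dim=-1))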

This also turns out to be extremely hard. First, the existing data is sparse and low quality. If you have millions of genetic loci, many possible cell types, many disease states, many interventions, and only hundreds of thousands or even millions of patients, the curse of dimensionality still leaves you badly undersampled. Biobank data is powerful because it captures naturally occurring variation in humans and correlates it with disease outcomes. This is great for finding drugs whose effect mirrors naturally occurring variation. On the flipside, this data cannot capture the outcome of arbitrary drug perturbations. There are not many ethically acceptable ways to perturb a human, wait, and see what happens.
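
On the dimensionality point, a back-of-envelope count shows how undersampled we are. All numbers below are round figures chosen purely for illustration:

loci = 1_000_000        # genetic loci
cell_types = 400        # rough count of human cell types
diseases = 1_000        # disease states of interest
interventions = 10_000  # plausible perturbations

space = loci * cell_types * diseases * interventions
patients = 1_000_000    # an optimistic biobank size

print(f"{space:.0e} combinations vs {patients:.0e} patients")
# 4e+15 combinations vs 1e+06 patients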

So we fall back on model systems: cells, organoids, mice, rats, dogs, non-human primates, and others. But a cell line does not have an immune system, a vascular system, a liver, a microbiome, or the full context of a human body. An animal has a body, but not a human body. An engineered disease model can be useful, but it may not capture the human disease process we actually care about. We know this is true because drugs that work in model species preclinically regularly fail in human clinical trials.

This creates a sim-to-real gap between the model system and the human. Worse, since we are training machine learning models on this sim data, we are building models of the simulation and trying to get them to transfer to reality. In other words, we have a sim^2-to-real gap and no great physics engine.

That is why target discovery is so hard. It is not just that we need better models. We need better data, better perturbation systems, better causal inference, better human-relevant assays, and better ways to close the loop between prediction and experimental validation. Don’t get me wrong. I’m excited to see that companies are making serious attempts to solve these problems and develop new ML models for target discovery. Also, historically it wasn’t clear that discovering targets could be effectively monetized other than by making drugs yourself against those targets, but companies like Cartography Bio are proving otherwise.

Still, the hard part remains: the “target” object is fuzzy. A model can nominate a gene, pathway, cell state, or biomarker. But the clinically relevant object is a causal, directional, context-specific intervention hypothesis.

Drug Discovery

This is the stage of the process that is most widely discussed. There is obviously enormous promise here. Models like Boltz, Chai, and Nabla can be used to predict structures, generate binders, and reason about protein interactions. Dyno is using machine learning to design AAV capsids, and other companies are designing payloads for genetic medicines with generative models. Some of these use natural-language LLMs, while others take the lessons from what worked in computer vision and LLMs and design new model architectures to support bio-specific use cases (Nabla, Chai, and Dyno all fall in this camp). That’s great.

Now here is the problem: the common ML view of drug discovery treats this stage as a clean API whose implementation looks something like this:

def drug_design_process(target, known_inhibitors):
    # Keep designing until we produce a molecule that is
    # sufficiently differentiated from known chemical matter.
    while True:
        new_drug = design_and_optimize(target)
        if not too_similar(new_drug, known_inhibitors):
            return new_drug

The naive view is that as long as you can run the while loop well, you win. I certainly thought that’s how it worked when we started Reverie. It also matches the conventional startup wisdom to focus on your core competency. If you’re a machine learning engineer interested in structural biology, just build the best drug design engine. Let someone else find the target. Let someone else run the trial.

Unfortunately, this is not how biology works in practice. For any well-known target — meaning a target with a good paper, credible human genetics, or obvious clinical rationale — there will often be many competitors. Some you know about, some you do not. If you are not first, you need to be meaningfully better. But you usually do not have access to your competitors’ full efficacy data, safety data, formulation details, or even their structures until much later. On top of that, if you’re an American company, there is a high probability you have a competitor in China moving quickly and effectively, who can offer up their drug to a pharma partner for a fraction of your development cost. They can also generate animal data faster and cheaper than you can, and run a Phase 1 for a fraction of the US price.

So maybe you avoid crowded targets and go after novel biology. That can be the right move. Perhaps you use a target discovery platform company or a proprietary data platform to identify new targets that others have missed. But then the error bars on your target hypothesis explode. No one has validated that modulating the target in a human works. Your disease model, patient subset, biomarker, or modality might be wrong. You can design a beautiful molecule and still create no value because the upstream biological hypothesis was false. In other words, you can do your job perfectly in the algorithm above, and produce something that has zero value.

For companies building in this layer, there is also the challenge of commoditization. I don’t doubt that there are companies with superior AI drug design technologies. But the relevant question is whether that superiority is large enough and connected enough to the rest of the drug-development process to get durable returns. Open-source models will improve, and all models are constrained by data. This does not mean AI drug design is overhyped. It means the value is probably highest when the design loop is tightly coupled to proprietary data, fast experimental feedback, differentiated targets, modality-specific expertise, or a clinical strategy that makes the molecule matter. In other words, you may have to abandon the tight API and accept that your process will depend on the other factors as well.

Clinical Trials

And so we arrive at clinical trials. This is the part of the process I would argue has been most neglected by serious machine learning people. It’s also the area with the most obvious LLM tailwinds. Essentially all of the problems associated with clinical trials are well suited either to LLMs out of the box or to custom models developed for the domain. Here are some examples.

Patient stratification: Can you predict which subset of patients will respond to a particular drug? This is a hard machine learning problem for similar reasons as target discovery: it’s hard to get useful human data. But it might be a better fit for an LLM-driven approach. You could imagine post-training a general LLM on large amounts of human EHR data, or giving an intelligent agent access to health records to look for patterns between a wide variety of unstructured notes and indicators of drug effectiveness. The goal would be to reason across messy unstructured records rather than just query a database to train a supervised ML model. Moreover, for many trials, we often know the type of patient we want; the problem is finding them. An example: a trial might need patients with mutation X and blood pressure above Y who have tried and failed first- and second-line treatments, are over 35 years old, and have never had an acute side effect from an mRNA vaccine. Unfortunately, that’s not SQL-queryable in any database, especially in America’s decentralized system. Before LLMs, the way to do this was to have a human manually read EHR entries and identify candidates. We can now use LLMs to do this, as sketched below. Then, once we identify them, we can have an AI agent pester them to come enroll in the trial. This is not just an ML problem. The system that does this has to exist inside a health system, with clinician trust and incentive alignment.
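
Here is a sketch of that screening step, where llm stands in for any chat-model call and the criteria are the hypothetical ones above:

import json

CRITERIA = """\
- carries mutation X
- blood pressure above Y
- tried and failed first- and second-line treatments
- over 35 years old
- no acute side effect from an mRNA vaccine"""

def screen_patient(notes: str, llm) -> dict:
    # Ask the model to reason over unstructured notes and return a
    # structured verdict with supporting evidence.
    prompt = (
        "Does this patient meet ALL of the criteria below? Reply as "
        'JSON: {"eligible": true/false, "evidence": [...]}.\n'
        f"Criteria:\n{CRITERIA}\n\nNotes:\n{notes}"
    )
    return json.loads(llm(prompt))

# eligible = [p for p in records if screen_patient(p.notes, llm)["eligible"]]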

Trial operations: Huge swaths of clinical trials are slow and inefficient. It is slow to recruit patients (see above), slow to collect and analyze samples, and slow to QC and verify the resulting outputs. A huge amount of this work already happens through large outsourcing organizations and software vendors: CROs, EDC systems, clinical data-management vendors, site networks, and consulting firms. These companies make billions of dollars doing work that often involves basic data management, review, reconciliation, and coordination. These workflows present enormous opportunities to use off-the-shelf models to significantly accelerate and improve the process.

Scaled-up literature mining: Enormous amounts of useful biological and clinical information are scattered across papers, abstracts, trial registries, regulatory filings, patents, and conference posters. Some of the most interesting drug hypotheses come from connecting evidence across domains that no individual team had time to synthesize. Perhaps there was a drug that failed in one population, and you have good reason to believe it could be repurposed for a different population due to a biomarker you found in an obscure paper. Previously, you’d have needed an entire team of people scouring academia to find information like this. Today, you could use intelligent research agents to find it. This is often the inspiration behind companies purporting to build an “AI-native Roivant”: a company that uses AI agents to mine literature, clinical evidence, competitive landscapes, and external assets, then forms asset-specific companies around the best opportunities.

Monetization: Clinical development is also easier to monetize than many upstream AI-for-bio ideas because the value of time is much more legible. A drug’s commercial life is finite, driven by patent limitations and exclusivity rules, among other factors. A lot of that window is consumed by clinical development — usually drug patents are filed right before a clinical trial begins. This means the patent clock is ticking while the trial runs, and every day saved in clinical development is a day longer the drug earns revenue on the market after approval.

To first order, you can calculate the per-day value of time savings as peak daily sales × P(approval). For most big drugs, this is worth millions of dollars per day. For blockbuster drugs it can be tens of millions of dollars per day. So, if you can prove that you can save time on a trial, pharma companies should be willing to pay very large amounts of money for your service.
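
With illustrative numbers (both assumptions, not data from any real drug):

peak_annual_sales = 5_000_000_000  # $5B/year, a blockbuster
p_approval = 0.5                   # odds the drug reaches market

value_per_day = (peak_annual_sales / 365) * p_approval
print(f"${value_per_day:,.0f} per day saved")  # ~$6,849,315 per day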

So why don’t more machine learning people enter this space? My guess is that it feels boring. If you grew up loving physics and chemistry and math/CS like I did, you’d naturally gravitate towards things that look like AlphaFold. It feels really good to encode the triangle inequality in a weight update. Those problems are deeply satisfying as a scientist. Most of the clinical trial work is schlep work. But as the lesson goes, schlep work is often the most lucrative to focus on.

Upshot

So where does that put us? Am I saying that everyone should abandon target discovery and drug design and focus entirely on clinical trials? No, obviously not. All of the efforts to work on better target discovery and drug design are great for humanity, and I’m glad they are being funded and pushed forward. I’m sure great companies will be built doing those things. But I do think AI-for-bio founders should be realistic about where their edge lives. The uncomfortable implication is that a lot of AI-for-bio value may accrue less to the team with the highest-scoring generative model and more to the team with the best closed-loop system for selecting, testing, financing, and clinically advancing assets.

One version of that company looks like an AI-native Roivant:

  1. Use agents to mine literature, clinical evidence, trial data, patents, and competitive landscapes.
  2. Identify underappreciated therapeutic hypotheses.
  3. Take a trip to Shanghai to license assets that match those hypotheses.
  4. Create asset-specific companies to push them through clinical inflection points.
  5. Repeat until one works.

This is roughly what some biotech investors already do, minus the AI-native research layer.

Another version looks like a vertically integrated target-discovery company:

  1. Build a scaled data-collection platform that yields unique, novel target hypotheses (hard, but there are many undertapped methods out there).
  2. Use an off-the-shelf model (e.g. Boltz) to design a drug against each target.
  3. Validate quickly through tight experimental feedback loops.
  4. Create a subsidiary with the asset, raise capital to take that drug to the clinic, or sell it off to pharma at Phase 1.
  5. Repeat.

Both of these company structures make sense to me, but I see them much less often than drug discovery companies focused on neatly solving their clean API. These structures do demand more from founders: they have to be opinionated and correct about more aspects of the therapeutic stack. But that seems better than being blind to the complexity.

I’d love to hear feedback on this. Reach out anytime if you’d like to talk.