What we learned running Claude, Gemini and ChatGPT in production for clinical AI

The $1.5B Anthropic JV and the $10B OpenAI Deployment Company landed on the same day. Most commentary has focused on what it means for McKinsey, BCG, and Accenture. Almost none has focused on where these deployments will fail.

We know something about that.


The build

FOS Medical is a Sydney-based radiopharmaceutical AI company targeting precision diagnostics for glioblastoma, multiple myeloma, and lung adenocarcinoma. The team includes Prof Dale Bailey (Principal Medical Physics Specialist at Royal North Shore, past ANZSNM President, h-index 64, 300-plus peer-reviewed publications, one of the architects of total body PET); Dr James Drummond, RNSH neuroradiologist and Brain Imaging Lab Director; and Arash Atashnama, our CTO, an AI-driven drug discovery entrepreneur and UNSW research commercialisation coach who designed the FOS Engine architecture.

In late 2025, we presented the RINO Project at SNO — the 7th Quadrennial World Federation of Neuro-Oncology Societies meeting. The work: systematic target selection for GBM radiopharmaceutical candidates using the FOS Engine.

The architecture runs Claude 3.7 Sonnet, Gemini 2.5 Flash, and GPT-4.1 in parallel. These were the frontier versions available when we designed and ran the process; the models have moved on since, but the architecture and findings haven’t. Each model evaluates candidates independently. We then apply adversarial uncertainty induction: each model is prompted to name its own uncertainty before three-model consensus voting. Across 30 GBM antigen candidates and 1,350 target-metric combinations, inter-model agreement was Kendall’s W = 0.78 (p < 0.001).
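
For readers less familiar with the agreement statistic quoted above, here is a minimal sketch of how Kendall’s W can be computed from rank data. It assumes every model has ranked the same items and that there are no tied ranks. The function name `kendalls_w` and the toy rankings are illustrative and are not taken from the FOS Engine.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance, without tie correction.

    rankings: one list per rater (here, per model), each holding the
    ranks 1..n that the rater assigned to the same n items.
    """
    m = len(rankings)        # number of raters
    n = len(rankings[0])     # number of items ranked
    if m < 2 or n < 2:
        raise ValueError("need at least two raters and two items")
    if any(sorted(r) != list(range(1, n + 1)) for r in rankings):
        raise ValueError("each rater must supply the ranks 1..n exactly once")

    # Sum of the ranks each item received across all raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2

    # Squared deviations of the rank sums from their mean.
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)

    # W = 12 S / (m^2 (n^3 - n)); 1.0 is full agreement, 0.0 is none.
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy data: three hypothetical raters ranking four items.
toy = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [2, 1, 3, 4],
]
print(round(kendalls_w(toy), 3))  # prints 0.911
```

In the toy data, two raters agree fully and the third swaps the top two items, so W stays high but falls short of 1.0.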

Here’s what that process showed us.


Single-model trust doesn’t hold in regulated domains

In a regulated clinical context, “the model got it right” is insufficient. You need to know why it got it right, and you need confidence it will hold on the specific cases where being wrong is dangerous.

Single-model outputs have a structural problem: you can’t distinguish genuine confidence from confident hallucination without a reference. When three frontier models evaluate the same task independently and agree, that agreement is evidence. When they disagree, that’s signal: it shows you where uncertainty lives.

Most multi-agent setups run models in ensemble and average the outputs. We found that prompting each model to name what it was least certain about before consensus voting materially changed which candidates surfaced. Models that looked aligned would diverge sharply when forced to state their uncertainty. That divergence was clinically useful — it flagged candidates where the literature was thin, mechanisms were contested, or the target-metric relationship was model-specific rather than domain-grounded.
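
One way to picture this step is to score each candidate twice, once before and once after the model is asked to name its weakest assumption, and to flag candidates where the models drift apart. The sketch below is an illustration under stated assumptions, not the FOS Engine implementation: the `Evaluation` fields, the 0 to 1 score scale, the median vote, and the `max_spread` threshold are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Evaluation:
    model: str            # which model produced this evaluation
    candidate: str        # placeholder candidate identifier
    initial_score: float  # hypothetical 0-1 score before the uncertainty prompt
    revised_score: float  # hypothetical 0-1 score after naming its uncertainty
    uncertainty: str      # the model's own statement of what it is least sure of


def spread(scores):
    """Range of scores across models; a crude disagreement measure."""
    return max(scores) - min(scores)


def vote(evaluations, max_spread=0.2):
    """Group evaluations by candidate, take a median vote on the revised
    scores, and flag candidates whose models diverge after the prompt."""
    by_candidate = {}
    for ev in evaluations:
        by_candidate.setdefault(ev.candidate, []).append(ev)

    results = {}
    for candidate, group in by_candidate.items():
        before = spread([ev.initial_score for ev in group])
        after = spread([ev.revised_score for ev in group])
        results[candidate] = {
            "consensus": median(ev.revised_score for ev in group),
            "spread_before": before,
            "spread_after": after,
            # Looked aligned at first, diverged once uncertainty was stated.
            "flag_for_expert": before <= max_spread < after,
            "stated_uncertainties": {ev.model: ev.uncertainty for ev in group},
        }
    return results


# Invented example values, for illustration only.
evals = [
    Evaluation("model_a", "candidate_1", 0.8, 0.8, "sparse expression data"),
    Evaluation("model_b", "candidate_1", 0.8, 0.5, "mechanism is contested"),
    Evaluation("model_c", "candidate_1", 0.8, 0.9, "few clinical studies"),
    Evaluation("model_a", "candidate_2", 0.7, 0.7, "minor assay variance"),
    Evaluation("model_b", "candidate_2", 0.7, 0.7, "minor assay variance"),
    Evaluation("model_c", "candidate_2", 0.75, 0.75, "limited follow-up"),
]
for name, result in vote(evals).items():
    print(name, result["flag_for_expert"], result["consensus"])
```

In this toy run, `candidate_1` is flagged because the models agreed at first and then diverged, while `candidate_2` is not. Keeping the stated uncertainties alongside the flag gives a reviewer something concrete to check.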

Kendall’s W of 0.78 across 1,350 combinations in a domain this specialised is strong agreement. The cases where that number would have dropped — without the adversarial layer — are exactly where a single-model approach would have failed silently.


Data and verification, not model intelligence

All three models we ran are frontier-class. On general benchmarks, the differences between them matter. In our domain, they mostly didn’t.

What mattered was the quality of the expert-verified data used to frame the evaluation, and the verification architecture sitting above the models. Prof Bailey’s command of the domain shaped how we defined the target-metric combinations, structured the evaluation criteria, and interpreted disagreement between models. That expertise isn’t in any frontier model’s training data in the specific, operational form we needed.

Model intelligence sets the floor. In regulated domains, the ceiling is set by niche expert-verified data and a rigorous verification layer. A weaker model with good domain data and a real verification architecture will outperform a stronger model running on general knowledge with no verification.


The same gap appears across high-volume clinical workflows

What we hit in radiopharmaceutical target selection is structurally the same as what forward-deployed engineers will hit in prior authorisation, medical coding, specialty clinical documentation, claims adjudication, and clinical decision support.

In each workflow: the domain is specific enough that general model knowledge produces confident but unreliable outputs. The stakes mean silent failure isn’t acceptable. The regulatory environment means “the model said so” isn’t a defensible audit trail.

The FDE model is sound. The engineering is solvable. But the niche expert-verified data and the eval infrastructure can’t be built from first principles in a 12-week deployment. That layer needs to exist before the engineers arrive. Right now, mostly, it doesn’t.


Where this leads

The interesting builds in healthcare AI right now aren’t at the foundation model layer. They’re eval suites built on real clinical workflows, expert-verified task libraries, and verification architectures that produce auditable confidence scores.
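
As a rough illustration of what an auditable confidence score could look like in practice, the sketch below builds a tamper-evident record for a single evaluation task. It is an example built on assumptions, not a description of the FOS Engine: the field names, the majority-share confidence measure, the labels, and the `audit_record` helper are all hypothetical.

```python
import hashlib
import json
from collections import Counter


def audit_record(task_id, prompt, model_labels, reviewer):
    """Build a record that ties a confidence score to its inputs.

    model_labels: mapping of model name -> the label that model returned.
    """
    label, votes = Counter(model_labels.values()).most_common(1)[0]
    record = {
        "task_id": task_id,
        # Hash of the prompt so the exact input can be verified later.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "model_labels": model_labels,
        "majority_label": label,
        # Share of models that agreed with the majority label.
        "confidence": votes / len(model_labels),
        "reviewer": reviewer,
    }
    # Hash the canonical JSON so later edits to the record are detectable.
    canonical = json.dumps(record, sort_keys=True)
    record["record_sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


# Invented example values, for illustration only.
rec = audit_record(
    "task-001",
    "Example prompt text",
    {"model_a": "approve", "model_b": "approve", "model_c": "refer"},
    "reviewer-01",
)
print(rec["majority_label"], round(rec["confidence"], 2))  # prints approve 0.67
```

The point of the design is that the score never travels alone: it is stored with the inputs, the per-model outputs, and the named reviewer, which is what an audit would ask for.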

We built a version of that for radiopharmaceuticals. The architecture generalises. The question is where it gets built — inside the JV structures, by specialist partners, or by health systems themselves.

If you’re working on healthcare AI deployments and thinking about the eval and verification layer, I’d be glad to compare notes. The SNO poster is available on request. Reach me at jb@fosmedical.com or via LinkedIn.


Jakob Boije is co-founder and CBO of FOS Medical and founder of FarGrit. Previously Expert Associate Partner at McKinsey, leading design and customer experience in the Life Sciences practice.