Why o1 is not suitable for Medical Diagnosis

A follow-up on o1’s Medical Capabilities + a major concern about its utility in Medical Diagnosis

Devansh
7 min read · Sep 25, 2024

This article was originally published in my newsletter- AI Made Simple- over here.

Very quick summary of the email-

  1. I am retracting our earlier statement that OpenAI deliberately cherry-picked the medical diagnostic example to make o1 seem better than it is. After further research- it seems that this is not a case of malice but a mistake caused by extremely lazy evals and carelessness.
  2. I also noticed that GPT seems to over-estimate the probability of (and thus over-diagnose) a very rare condition, which is a major red flag and must be studied further. This (and its weird probability distributions for diseases) leads me to caution people against using o1 in Medical Diagnosis (unless o1 main is a complete departure from the preview).
Floating Harbor Syndrome has been recorded in fewer than 50 people ever. It shouldn’t be ranked this highly.

In the future, I think any group making claims of great performance on Medical Diagnosis must release their testing outputs on this domain, so that we can evaluate and verify their results more transparently. Lives are literally at stake if we screw up evals. Share more of your testing results, and let the results speak for themselves.

Background

Our most recent post- “Fact-checking the new o1 ‘Strawberry’ for healthcare”- has ballooned beyond anything I was expecting-

Thank y’all for sharing our little cult with outsiders ❤.

As a reminder, the Fact-Checking post pointed out these key ideas-

  1. Running the prompt- making a diagnosis based on a given phenotype profile (a set of present and excluded phenotypes- the observable characteristics of an organism, resulting from the interaction between its genes and the environment)- on ChatGPT o1 leads to inconsistent diagnoses (it does not pick the same diagnosis every time).
  2. This inconsistency is not made very explicit.
  3. Each time o1 picked any diagnosis (whether correct or incorrect)- it would rationalize its call very well.
  4. The above three combined are problematic.
  5. To make themselves look better, OAI had specifically cherry-picked a run where they got the desired answer and never mentioned the other times. This is very problematic.
  6. For diagnostic purposes, it is better to provide models that output probability distributions + deep insights so that doctors can make their own call.

This has led to some very interesting conversations, including one very brief one with OpenAI employee B, who was involved in creating the original question.

Why we weren’t able to recreate the OAI claims

B shared the following: Sergei (our guest author) was not able to reproduce the experiment because he used the o1 preview, while the experiment was created and tested on the main o1 model (which isn’t publicly available). Therefore, this wasn’t a case of cherry-picking but rather an apples-to-oranges comparison (which oranges totally win btw). B did not comment on any of the other points.

Unless I’m given a strong reason not to, I prefer to take people’s word at face value, so I decided to confirm this for myself and then print a retraction. I moseyed over to the OAI announcement blog, where I saw something interesting. The blog post mentions the preview model specifically-

Learning to Reason with LLMs

This loops back into our original problem- OAI promotes the performance of its model without acknowledging massive limitations. If a doctor were to see this blog post without a prior understanding of GPT’s limitations, they might get swept up by the hype and use the preview, leading to a misdiagnosis. While I believe that end users are responsible for using a product safely, I think technical solution providers have a duty to make users clearly aware of any limitations upfront. When they don’t, the provider should be held responsible for any damages arising from misuse.

But I didn’t want to immediately start pointing fingers and screaming outrage, so I went back to B to get their input on what was going on. Here is what I believe happened-

  1. B mentioned that they had tested o1 (main) on the prompt a bunch of times. o1 always gave the same output (KBG).
  2. This was communicated to the author(s) of the blog post, who didn’t bother to rigorously verify it. They likely ran the prompt once or twice, got KBG, and assumed that the performance of o1 main translated to the preview (this part is my speculation; B made no such comments).

I wanted to see how easy this would have been to catch, so I decided to run the experiments myself. I fed the same prompt to 10 different instances of o1-preview (all completely fresh to avoid any context contamination from previous chats).

My Experiments with o1 Preview and one concerning result

The results were interesting-

  1. 6 KBG.
  2. 4 Floating Harbor Syndrome.
  3. 1 Kabuki.
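For anyone who wants to reproduce this kind of check, here is a minimal sketch of the repeated-sampling setup, assuming access to o1-preview through the official OpenAI Python SDK. The phenotype prompt is not reproduced here, and collapsing the answer to its first non-empty line is a deliberate simplification- a real eval would map the free text onto a controlled vocabulary of diagnoses-

```python
from collections import Counter
from openai import OpenAI  # official OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "..."  # the phenotype-profile prompt (not reproduced here)
N_RUNS = 10

def top_diagnosis(answer: str) -> str:
    # Crude normalization: take the first non-empty line of the model's answer.
    return next(line for line in answer.splitlines() if line.strip()).strip()

counts = Counter()
for _ in range(N_RUNS):
    # Each call is a fresh, single-message conversation, so no context
    # contamination can leak in from previous chats.
    resp = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": PROMPT}],
    )
    counts[top_diagnosis(resp.choices[0].message.content)] += 1

print(counts.most_common())  # anything other than one dominant answer is a red flag
```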

I wanted to see if this pattern would hold up even if I asked o1 for the three most likely diagnoses instead of just one. Interestingly, for the 5 times I ran it, it gave me 5 different distributions with different diseases (KBG being the only overlap between the 5). This would line up with our base hypothesis that the writer of the blog post might have been careless in their evals and overestimated the stability of the model’s predictions. Taking together B’s statements, my own experiences with AI evals being rushed/skipped, and the results of these experiments, I retract my claim that OAI was deliberately misleading and switch to the claim that OAI was negligent in its communications. While it is unlikely, I can see how people acting in good faith can get to this situation. Corners are often cut to please the gods of Rapid Deployment.

I didn’t think I would have to teach OAI folk the importance of cross-validation. Perhaps this article will be useful to them (I do this with my clients, in case you want a good set of eyes)-

While I was trying to evaluate the outcomes and see if I could learn anything else that was interesting, I spotted something that made me question GPT’s language comprehension. Firstly, while the diseases differed, 4 out of the 5 generations assigned the same 70–20–10 probability split across the top three diagnoses. This could be a complete accident, but if it’s not, I would be very, very hesitant to use o1 for any kind of medical diagnosis.
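To make both observations concrete- the single shared diagnosis and the recurring split- here is a small sketch of the bookkeeping I’d use, given one parsed {diagnosis: probability} mapping per run. The disease names in the example are placeholders, not actual model outputs-

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

def overlap_and_splits(runs: List[Dict[str, float]]) -> Tuple[Set[str], Counter]:
    """Report (a) diagnoses that appear in every run and (b) how often each
    rounded probability split (sorted high-to-low) recurs across runs."""
    common = set(runs[0])
    for run in runs[1:]:
        common &= set(run)
    splits = Counter(
        tuple(sorted((round(p, 2) for p in run.values()), reverse=True))
        for run in runs
    )
    return common, splits

# Toy usage with placeholder names (not real outputs):
runs = [
    {"Disease A": 0.7, "Disease B": 0.2, "Disease C": 0.1},
    {"Disease A": 0.7, "Disease D": 0.2, "Disease E": 0.1},
]
common, splits = overlap_and_splits(runs)
print(common)  # {'Disease A'}  -> analogous to KBG being the only shared diagnosis
print(splits)  # Counter({(0.7, 0.2, 0.1): 2})  -> the same 70-20-10 split recurring
```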

Looking at these probabilities, it’s interesting that Floating Harbor Syndrome, which was diagnosed frequently in the single-diagnosis runs, doesn’t show up as frequently over here. So I googled the disease and came across a very interesting piece of information-

Floating–Harbor syndrome, also known as Pelletier–Leisti syndrome, is a rare disease with fewer than 50 cases described in the literature -Wikipedia

Fewer than 50 cases. Compared to that, o1 is radically overestimating the chance of this diagnosis. Given how rare this disease is (i.e., how low its prior probability is), I am naturally suspicious of any system that weighs it this highly (think of the classic “calculate the probability of cancer given a positive test” intro to Bayes’ Theorem that I’m sure all of us remember)-

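To see why that prior matters, here’s a quick back-of-the-envelope version of that calculation. The sensitivity and false-positive numbers below are purely illustrative (not clinical estimates), and the prior simply spreads “fewer than 50 recorded cases” over a world-scale population-

```latex
P(\mathrm{FHS}\mid \mathrm{profile})
  = \frac{P(\mathrm{profile}\mid \mathrm{FHS})\,P(\mathrm{FHS})}
         {P(\mathrm{profile}\mid \mathrm{FHS})\,P(\mathrm{FHS}) + P(\mathrm{profile}\mid \neg\mathrm{FHS})\,P(\neg\mathrm{FHS})}
  \approx \frac{0.95 \cdot 5\times 10^{-8}}{0.95 \cdot 5\times 10^{-8} + 0.001 \cdot (1 - 5\times 10^{-8})}
  \approx 5\times 10^{-5}
```

That is roughly 0.005%- orders of magnitude below any slot in a 70–20–10 split- unless the phenotype profile is essentially unique to FHS, which is exactly the caveat below.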

There’s a chance that this phenotype profile only shows up in FHS (a likelihood so skewed that it overwhelms the tiny prior), but that would invalidate the other outcomes. Either way, these results make me extremely skeptical about the utility of o1 as a diagnostic tool.

Given the variable nature of the problem (and my lack of expertise in Medicine), there is always going to be more nuance that can be added. This is why I think we should do the following-

  1. OpenAI (and future providers) should be very clear, explicit, and upfront in sharing the limitations of their systems when it comes to zero-fault fields like Medicine.
  2. Anyone claiming to have a powerful foundation model for these tasks should be sharing their evals (or at least a lot more of them) so that users can spot gaps, understand capabilities, and make more informed decisions. Too many groups use technical complexity as a shield against consumer protection, and that can’t be allowed in these more sensitive fields. If you’re as good as you claim to be, let the results speak for themselves.

To OpenAI: If you think my critique contains mistakes, misrepresented results, or anything else, the floor is yours. I will happily share a rebuttal to my post as long as I think it’s good-faith, high-quality, and meaningfully addresses my concerns.

If you’d like to support our mission to conduct and share independent AI Research, please consider a premium subscription to this newsletter below-

And if you want to talk about these ideas, send me a message below-

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819
