Medicine Doesn't Have a Ground Truth. The AI Race Is Pretending It Does.
Medicine has many Northwest Passages
The map above is the Northwest Passage as the best European cartographers drew it in the seventeenth century. The passage does not exist. Explorers died chasing it. The cartographers were not villains. They were reading from the best map of their era.
So were the doctors in every story below.
By the time the lobotomy became an embarrassment, close to twenty thousand Americans had received the operation. The man who invented it had won the Nobel Prize. A generation of babies died face-down on the mattress because the most trusted childcare book on every new mother’s shelf told her to put them there. The food pyramid was wrong at the base and the tip. Margarine was the dangerous fat. Peptic ulcers are caused by a bacterium, and the acid blocker that treated them was the best-selling medicine on earth. Twenty years went by before the field stopped laughing at the Australians who proved it, one of them by swallowing the bacterium himself.
This is the list a viral post on X presented. The post has three hundred thousand views and it is mostly correct. The doctors who delivered each of those consensuses, the post says, were not villains. They were sincere, and all were certain about what they recommended. They were reading from a map that somebody else had drawn and handed them and the map was drawn in indelible ink.
The post is also, in the middle of its own list, an example of the thing it is warning about. The line about menopause cites the 2002 Women’s Health Initiative as the moment the medical establishment finally caught up to the danger of hormone replacement therapy. The trial was halted early. The headlines ran. The harm reported was real. What goes unsaid is that the WHI alarm itself caused population-level harm. Decades of women denied effective treatment for symptoms that genuinely degrade life, because a study that lumped women starting HRT well past menopause with women starting it near menopause was reported as if the two were the same. Population-level estimates suggest the avoidance itself shortened lifespans. Menopause specialists have been making this case for years; Marty Makary brought it to a wider audience in his 2024 book Blind Spots, and the chapter is harder to read than the X post. The post was written to teach you skepticism of medical certainty. The irony is that the example it picked was itself the cautionary tale.
That is the thing worth contemplating. Medical consensus does not just reverse. The reversal becomes the next consensus, and the next consensus is delivered with the same steady confidence as the one it replaced.
The map is what we should be talking about
Every doctor in that thread was reading from a map. The map is medicine’s ground truth. It is what every guideline, every screening recommendation, every prescription, every standard of care, ultimately rests on. The map is built from studies, which are built from data, which are built from how clinicians actually capture what happens during care. And every layer in that stack is shakier than we admit.
A study selects its inclusion criteria. The criteria sort patients by codes the clinician applied at the time, which are themselves a translation of a clinical impression formed under time pressure with incomplete data. The codes get aggregated. The aggregate becomes a finding. The finding becomes a guideline. The guideline becomes the new consensus. And then somebody asks, in 2026, what is the ground truth for healthcare AI, and the answer is: it is whatever the last guideline said. The map.
I delivered the map myself
I have handed patients the map with a confidence I would later regret.
I trained in the 1990s. The campaign to make pain the fifth vital sign was building through that decade. The American Pain Society laid the groundwork in 1995 and 1996; the Joint Commission wrote it into accreditation standards by 2001. The reasoning was humane. The evidence was treated as settled. Pain was undertreated. Opioids were positioned as less dangerous than the alternatives. NSAIDs were the riskier class. There comes a point when the consensus is solid enough that you stop questioning it and start delivering it. I delivered it. The line was “You don’t get any extra points for being in pain.”
I said it because I believed it. I believed it because every authority I trusted said the same thing. The Joint Commission said it. The pain societies said it. The guidelines said it. My attendings said it. And on the strength of that consensus, I wrote prescriptions that helped seed an opioid epidemic implicated in close to a million American overdose deaths.
I was not a villain. I was current. Like the doctor in 1966 recommending the small daily pill, like the doctor in 1979 telling you to drink milk for your ulcer, I was reading from a map that somebody else had drawn and handed me. The patients in front of me did not know that. They believed they were getting the best of what medicine knew. They were. The best of what medicine knew was wrong.
The system is not built to reward the doctor who says “I am not sure.” It is built to reward the doctor who is reading from the current map, with confidence, on time. The doctor who refused to deliver the consensus would not have been a hero. He would have been an outlier. That is what makes the viral post hard to read instead of easy. The certainty is the problem.
Now we are handing the map to the machine
Nature Medicine published a paper on June 12 that compared three frontier large language models, GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6, against two specialized clinical AI tools, OpenEvidence and UpToDate Expert AI. The frontier models won across all three of the paper’s evaluation stages. The general-purpose models are better than the domain-specific ones. Procurement implications. Vendor implications. Buy this, do not buy that.
The harder question is what the models were measured against.
Stage one was MedQA, a set of board-style multiple-choice questions. The “ground truth” is the answer key. These are exam questions, written for testing knowledge recall, not clinical judgment, and the paper itself flags the risk of data leakage. The models may have been trained on the questions they were tested on.
Stage two was HealthBench, a benchmark developed by OpenAI, scored by a panel of large language models acting as judges. The paper concedes, in its own limitations section, that “industry-created benchmarks may systematically favor the systems developed by their creators” and that “frontier models served as both evaluated systems and judges.” The benchmark is OpenAI’s, the highest-scoring model is OpenAI’s, and the judges grading it include OpenAI’s. The authors flag this. The press release does not.
Stage three is the closest thing to clinical reality in the paper, and it is the most revealing. Twelve clinicians scored a hundred real clinical queries on a four-point scale across four dimensions. Three of the twelve scored each item. The paper reports the inter-rater reliability with admirable honesty. On individual scores, the agreement landed at a Krippendorff’s alpha of 0.10 to 0.20. By every standard convention for inter-rater reliability, that is poor. Only when the four-point scores were collapsed into a binary, acceptable versus unacceptable, did the agreement become respectable.
This is the ground truth medicine is offering to AI in 2026. An exam answer key the model may have seen during training. A rubric developed by a vendor who is also one of the evaluated parties. And three clinicians per question whose individual judgments often disagreed, averaged into a stable signal we now train and rank against.
The paper is rigorous. The limitations are stated in the limitations section. That is more than most of the work in this field offers. What the field is doing in the public conversation, including the way procurement committees will read the results, is treating this stack as if it were the bedrock on which we now adjudicate clinical AI. It is not bedrock. It is what we have. The two things are different.
What if the data were good enough that the question of which AI won became uninteresting?
I keep coming back to a question that does not appear in the paper.
What if the underlying clinical data were good enough that a clinician, or a patient, could query the actual evidence and judge the conclusion themselves, without an AI mediating between them and a map drawn by somebody else? What if the discrete signal of what patients actually said, what their symptoms actually were, what the clinician actually noticed, were captured well enough that the next consensus could be tested against the territory instead of against the last consensus?
That is the work no one wants to do. The data underneath every guideline is fragmentary, unstructured, captured under time pressure, and discarded for billing. The signal that would actually let us update consensus faster than once a generation is mostly not recorded. It lives in the unstructured note, the conversation that never gets coded, the symptom the patient mentioned and the clinician did not write down. Ambient capture might one day surface some of that. Discrete data capture for what patients actually say, beyond ICD and SNOMED and CPT, is the foundation problem that, if solved, would make most of the AI debate look different.
That work is boring. It does not raise at a 12 billion dollar valuation, which is what OpenEvidence is reportedly worth right now. It does not get celebrated in Nature Medicine. It does not give a procurement committee a clean recommendation. It just makes the next consensus less likely to be the next reversal.
The map and the territory
The post on X is right about the doctors. They were sincere, they were kind, they were certain, and the certainty is the part worth considering.
What the post does not quite say is that the AI being trained right now is being handed the same certainty. Not because the engineers are villains. Because the map is what we have, and the field has decided to ship before the map is honest.
The doctor in 1966 was reading from the best map of 1966. The model in 2026 is being trained on the best map of 2026. We already know what that means in the long arc.
The work I would credit, the work I am not seeing anyone fund, is the work that gets the map closer to the territory. Until that happens, every story about which AI won is a story about which tool more confidently delivered a consensus we will eventually walk back.
John Lee is an emergency physician and Epic consultant who helps health systems bridge the gap between Epic’s capabilities and operational reality. He specializes in data architecture, registry optimization, and making Epic’s tools actually deliver results.
If you need help configuring your Epic environment to support these capabilities, connect with him on LinkedIn or via his website.



You forgot to mention Manny Rivers and the "data driven" Surviving Sepsis with the subsequent "lactate police" and the the hundreds of thousand of unnecessary central lines we were forced to foist on our patients when at the end of the day, the multi-million $$$ initiative could be summarized with this: "Hey, if you have sick elderly people who look bad and have a low blood pressure in the waiting room, maybe you should bring them to the back sooner!" Great work, John
Very thoughtful piece!! Insightful as always John!