- Study: Published in Radiology (RSNA journal), 2026. Peer-reviewed.
- Lead researcher: Dr. Mickael Tordjman, Icahn School of Medicine at Mount Sinai, New York
- Scope: 17 radiologists, 12 institutions, 6 countries (US, France, Germany, Turkey, UK, UAE). 264 images, 50% real, 50% AI-generated.
- Unaware detection rate: 41% when radiologists didn't know fakes were present
- Aware detection rate: 58–92% individual range; ~75% average when told to look for fakes
- AI model detection: GPT-4o, GPT-5, Gemini 2.5 Pro, Llama 4 Maverick; accuracy ranged from 57% to 85%
- Image sources: ChatGPT (various body regions) and RoentGen, a Stanford Medicine open-source diffusion model
AI-generated medical images have been a theoretical concern for a few years. A study published this month in Radiology — the journal of the Radiological Society of North America — makes the problem a documented one. Seventeen radiologists across twelve institutions and six countries examined 264 X-ray images, half of which were AI-generated. When they weren’t told fakes were present, 59% of the synthetic images went undetected. When they were told to look, the best performers caught about nine in ten. The worst caught fewer than six in ten.
That spread matters. The framing around AI-generated medical images tends to land in one of two places — either “humans are obsolete” or “experts can always tell.” This study punctures both. The realistic picture is messier: detection is possible, inconsistent, and not meaningfully correlated with experience. A forty-year radiology veteran performs roughly the same as a resident when neither knows they’re looking for fakes.
What the Images Look Like
The study used two image sets. The first mixed real X-rays across multiple body regions with images generated by ChatGPT. The second focused on chest X-rays, comparing real images against outputs from RoentGen, an open-source diffusion model released by Stanford Medicine.
Both sets fooled radiologists. The fakes aren’t perfect — experts examining them carefully identified characteristic tells. Bones are rendered too smooth. Spines align with an anatomical tidiness that real spines don’t achieve. Lung symmetry is exaggerated in ways that reflect training data idealization rather than actual anatomy. Blood vessel patterns are uniform where they should be branching and irregular. Fractures, when present, are implausibly clean.
These tells are learnable. They’re also subtle enough that 59% of synthetic images cleared an expert radiologist who wasn’t primed to find them. And they’ll get harder to spot. The current generation of image models produces artifacts because training data is finite and biased toward textbook presentations. The next generation won’t have that constraint to the same degree.
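To make the "too smooth" tell concrete, it can be reduced to a toy measurement. The sketch below is my own illustration, not anything from the study: it scores an image by its mean local gradient magnitude, a naive proxy for texture, so an over-smooth render scores lower than real, noisy bone texture. Real forensic detectors are far more sophisticated than this.

```python
import numpy as np

def smoothness_score(img: np.ndarray) -> float:
    """Mean absolute local gradient; lower values mean smoother texture.

    Hypothetical, illustrative metric only -- not a detector from the study.
    """
    gy, gx = np.gradient(img.astype(float))
    return float(np.mean(np.abs(gx)) + np.mean(np.abs(gy)))

rng = np.random.default_rng(0)
textured = rng.normal(0.5, 0.1, (64, 64))  # stand-in for real, noisy bone texture
smooth = np.full((64, 64), 0.5)            # stand-in for an over-smooth AI render
assert smoothness_score(textured) > smoothness_score(smooth)
```

A heuristic like this is exactly what "learnable tells" means in practice, and exactly why the advantage is temporary: a generator trained against such a score would eliminate it.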
The Detection Gap
When radiologists knew they were looking for fakes, individual accuracy ranged from 58% to 92% for ChatGPT-generated images and 62% to 78% for RoentGen chest X-rays. The wide range suggests detection is a skill that varies significantly between individuals — and crucially, the study found no correlation between years of experience and accuracy. Subspecialty mattered more: musculoskeletal radiologists outperformed other subspecialties, possibly because their training involves close attention to the precise bone characteristics that AI renders implausibly.
Four multimodal AI models were also tested as detectors: GPT-4o, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick. Their accuracy ranged from 57% to 85%. GPT-4o, which generated some of the fake images being tested, didn’t reliably identify its own outputs — catching some but missing others. The AI-versus-AI detection problem, where a model is used to authenticate content it or its siblings could have generated, is not solved by capability improvements alone.
The Fraud Applications
Dr. Tordjman’s framing of the fraud risk is specific. Fabricated images accurate enough to fool radiologists have direct applications in insurance fraud — a fake fracture that reads as real through standard clinical review can support an injury claim that doesn’t exist. In litigation, fabricated imaging evidence admitted through an expert who doesn’t know to look for fakes could influence outcomes. The study also flags Munchausen syndrome, where patients fabricate illness — AI-generated images lower the technical barrier to producing convincing supporting evidence.
The most acute concern may be the hospital network attack scenario. If a malicious actor gains access to a hospital’s imaging infrastructure and injects synthetic images at the point of storage or transmission, radiologists reviewing cases wouldn’t have reason to suspect the images are fake. The standard clinical workflow doesn’t include authentication of image provenance — the assumption is that a scan in the system is a scan the system’s equipment produced.
What the Proposed Fixes Are
The study recommends a combination of technical and educational interventions. Invisible watermarks embedded at image capture time would allow downstream verification of provenance. Cryptographic signatures linked to the specific imaging equipment and the technologist operating it at capture create a chain of custody that’s significantly harder to fabricate. These solutions exist in prototype — they are not currently standard anywhere.
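As a rough illustration of the capture-time chain-of-custody idea, here is a minimal Python sketch. Every name in it (`sign_capture`, `DEVICE_KEY`, the record fields) is hypothetical; a real deployment would use an asymmetric signature held in secure hardware on the imaging equipment rather than a shared HMAC key, which is used here only to keep the example standard-library-only.

```python
import hashlib
import hmac
import json

# Hypothetical per-device secret, provisioned when the scanner is installed.
# Real systems would use hardware-backed asymmetric keys (e.g., Ed25519).
DEVICE_KEY = b"device-secret-provisioned-at-install"

def sign_capture(image_bytes: bytes, device_id: str, operator_id: str) -> dict:
    """Bind image bytes to the device and technologist at capture time."""
    payload = {
        "device_id": device_id,
        "operator_id": operator_id,
        "captured_at": "2026-01-01T00:00:00Z",  # would be the real timestamp
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
    }
    msg = json.dumps(payload, sort_keys=True).encode()
    payload["signature"] = hmac.new(DEVICE_KEY, msg, hashlib.sha256).hexdigest()
    return payload

def verify_capture(image_bytes: bytes, record: dict) -> bool:
    """Downstream check: do these image bytes match the signed capture record?"""
    claimed = dict(record)
    sig = claimed.pop("signature")
    if hashlib.sha256(image_bytes).hexdigest() != claimed["image_sha256"]:
        return False  # image bytes were altered or replaced after capture
    msg = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(DEVICE_KEY, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

record = sign_capture(b"...pixel data...", "XRAY-ROOM-3", "tech-417")
assert verify_capture(b"...pixel data...", record)
assert not verify_capture(b"injected synthetic image", record)
```

The second assertion is the point: a synthetic image injected into storage or transit, as in the hospital network scenario above, fails verification because it was never signed by the capture device.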
Training radiologists to recognize synthetic image artifacts is the shorter-term intervention. The detectability characteristics the study identified — the unnaturally smooth bones, the over-symmetrical anatomy — are teachable pattern recognition. The problem is that these characteristics will shift as generation models improve, requiring ongoing curriculum updates rather than a fixed detection skill.
This is a well-constructed study with a straightforward finding: AI-generated X-rays are currently detectable by experts who are looking for them, and not reliably detectable by experts who aren't. That's a meaningful gap, and one that will narrow as generation quality improves.
The fraud risk is real and specific — insurance claims, litigation, hospital network integrity — and the current clinical infrastructure has no systematic answer to it. Watermarking and cryptographic provenance are the technically sound solutions. Neither is deployed at scale. The window between "this is possible" and "this is commonly exploited" is likely shorter than the time required to standardize the fixes.