When Seeing Is No Longer Believing: AI-Generated X-Rays and the Coming Crisis of Medical Trust

A ChatGPT logo is seen on a smartphone in West Chester, Pa., Wednesday, Dec. 6, 2023. ChatGPT catalyzed a year of AI fanfare, giving the world a glimpse of recent advances in computer science even if not everyone figured out quite how it works or what to do with it. (AP Photo/Matt Rourke)

A new study is raising an uncomfortable question about medical imaging: if a doctor cannot tell whether an X-ray is real, can medicine still trust what it sees?

A study published in March in Radiology, the journal of the Radiological Society of North America, found that neither radiologists nor multimodal large language models can reliably distinguish AI-generated "deepfake" X-ray images from authentic ones. The findings do not simply belong in the category of "AI fools experts," but point toward a structural problem that medicine has not yet developed the institutional vocabulary to describe, let alone solve.

Researchers recruited 17 radiologists from 12 centers across six countries: France, Germany, the United Arab Emirates, the US, the United Kingdom, and Turkey. They analyzed a total of 264 X-ray images, split evenly between real and AI-generated scans. The radiologists reviewed two separate image sets: the first mixed genuine radiographs with images generated by GPT-4o, while the second focused on chest X-rays, divided equally between authentic images and synthetic ones produced by RoentGen, an open-source AI diffusion model developed by Stanford Medicine researchers.

When radiologists were unaware of the study's true purpose, only 41 percent spontaneously identified AI-generated images. After being informed that the dataset contained synthetic images, the radiologists' mean accuracy in distinguishing real from synthetic X-rays was 75 percent. The gap between those two numbers is itself telling, as even with a direct warning, one in four fakes still slipped through.

The AI systems did not perform much better. Four multimodal LLMs (GPT-4o, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick) achieved accuracy rates ranging from 52 percent to 89 percent across both image sets tested, and from 57 percent to 85 percent on GPT-4o-generated images specifically. Even GPT-4o, the model used to create the deepfakes, failed to detect all of them, though it caught considerably more than the other models.

Dr. Brad Black with the Center for Asbestos Related Disease health clinic is shown looking at X-rays, Feb. 18, 2010, in Libby, Mont. Attorneys for the clinic denied allegations that it submitted false medical claims that earned Libby residents lifetime government medical benefits. (AP Photo/Rick Bowmer, File)

Intriguingly, accuracy did not vary with the radiologists' experience, which ranged from zero to 40 years. While seniority offered no protection, subspecialty did: musculoskeletal radiologists demonstrated significantly higher accuracy than other radiology subspecialists.

The study's lead author, Dr. Mickael Tordjman, a post-doctoral fellow at Icahn School of Medicine at Mount Sinai in New York, described what makes the synthetic images both convincing and subtly off. "Deepfake medical images often look too perfect," he said. "Bones are overly smooth, spines unnaturally straight, lungs overly symmetrical, blood vessel patterns excessively uniform, and fractures appear unusually clean and consistent, often limited to one side of the bone." This improbable cleanliness may eventually serve as a detection cue, but for now it remains easy to miss.

The stakes the researchers identify are not abstract. Tordjman warned that the findings create "a high-stakes vulnerability for fraudulent litigation if, for example, a fabricated fracture could be indistinguishable from a real one." He also raised cybersecurity concerns about hackers gaining access to hospital networks and injecting synthetic images "to manipulate patient diagnoses or cause widespread clinical chaos by undermining the fundamental reliability of the digital medical record."

Insurance fraud, fabricated workplace injuries, disputed surgical outcomes, and manufactured evidence in personal injury lawsuits now have a plausible AI-assisted pathway. The X-ray, long treated as an objective artifact of the body rather than a representation of it, may be entering the same contested territory as the photograph.

That shift carries implications beyond the clinic. Fields that have already grappled with synthetic media, such as journalism, cybersecurity, and digital forensics, have spent years developing frameworks for image provenance, chains of custody, and authentication. Medicine, which has historically treated clinical images as inherently trustworthy, is just beginning to confront the same challenge.

The researchers propose technical countermeasures, including invisible watermarks embedded directly into images and cryptographic signatures tied to the technologist at the time of image capture to verify authenticity. The study's authors also point toward blockchain-based registration systems, which would create decentralized audit trails for medical images, allowing anyone in the chain, whether a radiologist, an insurer, or a court, to verify that an image has not been altered since it was taken.
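
What such a capture-time signature could look like is straightforward to sketch. The snippet below, written with the open-source Python cryptography package, hashes an image's raw bytes and signs the digest with a key held at the capture workstation. Every name in it is an illustrative assumption rather than anything the study specifies, and a real deployment would also need key management and DICOM integration.

```python
# Minimal sketch: sign an X-ray's raw bytes at capture time and verify
# them later. Assumes the third-party "cryptography" package; key
# management and DICOM integration are deliberately omitted.
import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# At capture: the technologist's workstation holds a private key.
technologist_key = Ed25519PrivateKey.generate()

def sign_image(image_bytes: bytes) -> bytes:
    """Hash the raw image bytes and sign the digest."""
    digest = hashlib.sha256(image_bytes).digest()
    return technologist_key.sign(digest)

def verify_image(image_bytes: bytes, signature: bytes) -> bool:
    """Anyone holding the public key can check the image is unaltered."""
    digest = hashlib.sha256(image_bytes).digest()
    try:
        technologist_key.public_key().verify(signature, digest)
        return True
    except InvalidSignature:
        return False

image = b"...raw pixel data from the modality..."
sig = sign_image(image)
assert verify_image(image, sig)             # authentic and untouched
assert not verify_image(image + b"x", sig)  # any edit breaks the signature
```

The design point that matters is that the signature covers the pixel data itself, so any post-capture edit, however subtle, invalidates it.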

These are not new ideas. Digital watermarking, cryptographic hashing, and blockchain-based registration systems have all been proposed as tools for medical image security to ensure authenticity, integrity, and secure access. For years, those tools addressed a mostly theoretical threat; the new study makes clear that is no longer the case.
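
Watermarking, the oldest of these ideas, is equally easy to illustrate. The toy least-significant-bit scheme below, written with numpy, simply hides a bit pattern in each pixel's lowest bit; it is far weaker than the robust, imperceptible schemes proposed for medical imaging and shows only the mechanics, not anything drawn from the study.

```python
# Illustrative least-significant-bit watermark: embed a bit pattern in
# the lowest bit of each pixel. Production medical-image watermarking is
# far more robust; this is a toy version of the idea. Assumes numpy.
import numpy as np

def embed_watermark(pixels: np.ndarray, mark: np.ndarray) -> np.ndarray:
    """Overwrite each pixel's least significant bit with a watermark bit."""
    return (pixels & ~np.uint8(1)) | (mark & np.uint8(1))

def extract_watermark(pixels: np.ndarray) -> np.ndarray:
    """Read the hidden bit pattern back out of the image."""
    return pixels & np.uint8(1)

rng = np.random.default_rng(0)
xray = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)  # stand-in image
mark = rng.integers(0, 2, size=(8, 8), dtype=np.uint8)    # hidden pattern

marked = embed_watermark(xray, mark)
assert np.array_equal(extract_watermark(marked), mark)
```

In principle, a genuine radiograph would carry the expected pattern from the acquisition pipeline, while an image synthesized outside that pipeline would not.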

The study warns that increasingly realistic AI-generated X-rays could be misused in medicine, research, insurance, and litigation, strengthening the case for watermarking, clinician training, and dedicated detection tools. Researchers have already released a curated deepfake dataset with interactive quizzes to help train radiologists in detection, but this is only a small step toward building the kind of adversarial literacy that other image-dependent fields have had to develop.

The broader question is whether medicine can adapt its institutional assumptions quickly enough. Courtrooms, hospitals, and insurance claims departments have long operated on the premise that a medical image is a record, not a construction. That premise is now in question, and unlike a bad diagnosis, it cannot simply be corrected with a second opinion.
