Doctors are getting a new writing partner. That sounds harmless until you remember the writing in question is the medical record: the document other clinicians trust when the room is busy, the shift just changed, and nobody has time for a philosophical debate about who actually typed the sentence.
In "Tracing the Pen," Arash Nargesi and colleagues make a simple argument with big consequences: generative AI is already moving into electronic health records, and we need a way to trace which words came from the model and which came from the human clinician [1]. Not because doctors are lazy. Because the paperwork war is real, the inbox is endless, and the EHR has been eating physician attention like a raccoon in a campground.
Large language models are the new logistics officers in this campaign. They can draft notes, suggest replies to patient messages, summarize encounters, and help prepare diagnostic reports. Recent studies suggest that ambient AI scribes can trim documentation time and reduce after-hours EHR work, which is the medical equivalent of getting reinforcements before midnight [2,3]. Reviews of clinical trials also show that documentation and data handling are among the busiest early deployment zones for medical LLMs [4].
That is the good news.
The awkward news is that once AI text slips into the chart, it can start blending into the wallpaper. And in medicine, wallpaper can kill.
Fog of war, but make it billing-compliant
An EHR is not a casual notebook. It is a legal record, a care coordination tool, a billing artifact, and sometimes the only map another clinician has when making a fast decision. If an LLM drafts a sentence that is subtly wrong, overly confident, or scrubbed clean of uncertainty, that error does not just sit there looking embarrassing. It can get copied forward, echoed into referrals, and treated like ground truth by the next person in line. Congratulations: the typo now has a passport and a chain of command.
That is the heart of this paper. Nargesi and colleagues are not arguing that AI should stay out of the EHR. They are arguing that mixed human-AI authorship without traceability is a bad battlefield habit [1]. Their concern is not sci-fi robot mutiny. It is something much more believable and therefore more annoying: mundane confusion. Who wrote this? Who reviewed it? Which parts were edited? Was the patient told AI helped draft the message? Did the clinician actually agree with the wording, or just miss one landmine in a 5:47 p.m. charting sprint?
The authors point to a few defenses. One is disclosure inside notes or patient communications. Another is vendor-level tagging of AI-generated content, similar to how copied text can be flagged in some EHRs. They also discuss watermarking and AI-text detection, while noting the obvious problem: detectors are not magic bloodhounds. Other research shows detection can fail, especially after paraphrasing or editing [1,5,6]. In other words, if your strategy is "we'll figure it out later with a detector," you are planning a retreat, not a defense.
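To make the tagging idea concrete, here is a minimal Python sketch of what span-level provenance could look like. It is an illustration only, not the paper's proposal and not any vendor's actual schema: the `Origin` labels, the `Segment` fields, and the helper names are all hypothetical. The point is simply that each chunk of a note carries its origin and its reviewer, so "who wrote this?" and "who reviewed it?" become lookups instead of guesswork.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    """Hypothetical provenance labels; real EHR systems may use other categories."""
    HUMAN = "human"          # typed by the clinician
    AI_DRAFT = "ai_draft"    # model text accepted unchanged
    AI_EDITED = "ai_edited"  # model text the clinician modified


@dataclass
class Segment:
    """One span of note text plus its provenance (assumed structure)."""
    text: str
    origin: Origin
    reviewed_by: Optional[str] = None  # ID of the clinician who signed off, if any


def ai_share(segments: List[Segment]) -> float:
    """Fraction of characters in the note that started life as model output."""
    total = sum(len(s.text) for s in segments)
    if total == 0:
        return 0.0
    ai_chars = sum(len(s.text) for s in segments if s.origin is not Origin.HUMAN)
    return ai_chars / total


def unreviewed_ai(segments: List[Segment]) -> List[Segment]:
    """AI-originated segments that nobody has signed off on yet."""
    return [
        s for s in segments
        if s.origin is not Origin.HUMAN and s.reviewed_by is None
    ]


def disclosure_line(segments: List[Segment]) -> str:
    """A plain-language disclosure, emitted only when AI text is present."""
    if any(s.origin is not Origin.HUMAN for s in segments):
        return "Portions of this note were drafted with AI assistance."
    return ""


if __name__ == "__main__":
    note = [
        Segment("Patient reports two days of cough. ", Origin.HUMAN),
        Segment("Lungs clear to auscultation. ", Origin.AI_DRAFT),
        Segment("Plan: supportive care, return if worse.", Origin.AI_EDITED,
                reviewed_by="clinician-001"),
    ]
    print(f"AI share of note: {ai_share(note):.0%}")
    print(f"Unreviewed AI segments: {len(unreviewed_ai(note))}")
    print(disclosure_line(note))
```

Note what this does not do: it never tries to guess authorship after the fact. The label is attached when the text is created, which is exactly why tagging holds up where detectors struggle with edited or paraphrased text.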
The tools are improving faster than the rules
This is not a theoretical skirmish. Health systems are already fielding ambient scribes and EHR-integrated drafting tools at scale, and major EHR vendors are pushing deeper into AI charting workflows. Studies on patient messaging show LLM-assisted replies can improve quality in some settings, while other implementations increased reading time or changed communication patterns in ways that still need scrutiny [1,7,8]. Meanwhile, note-quality evaluations of ambient scribes look promising, but they are still early, context-specific, and very much not a license to stop paying attention [9].
The broader pattern is easy to see: the models are getting deployed first, and governance is sprinting behind them carrying a clipboard.
That is why this paper matters. It is a perspective piece, not a victory-lap trial. It does not prove one tagging system wins the war. It says the war already started, and we should stop pretending provenance will sort itself out by vibes and good intentions. Fair point. Audit trails exist for a reason. If a transformer is going to help write part of your chart, the record should say so as plainly as a supply manifest.
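For readers who want to picture the supply manifest, here is one generic way to make an authorship log tamper-evident: chain each entry to the previous one with a hash. To be clear, this is a swapped-in illustration, not something the paper proposes, and the event fields (`actor`, `action`, `segment`) are invented for the example. It uses only Python's standard `hashlib` and `json` modules.

```python
import hashlib
import json
from typing import Any, Dict, List


def entry_hash(prev_hash: str, event: Dict[str, Any]) -> str:
    """Hash an event together with the previous entry's hash."""
    payload = json.dumps(event, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Append an event, linking it to the entry before it."""
    prev = log[-1]["hash"] if log else ""
    log.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})


def verify(log: List[Dict[str, Any]]) -> bool:
    """Return True only if every entry still matches its recorded hash chain."""
    prev = ""
    for record in log:
        if record["prev"] != prev:
            return False
        if record["hash"] != entry_hash(prev, record["event"]):
            return False
        prev = record["hash"]
    return True


if __name__ == "__main__":
    log: List[Dict[str, Any]] = []
    # Hypothetical events: a model drafts a segment, a clinician edits it.
    append_event(log, {"actor": "model", "action": "draft", "segment": 2})
    append_event(log, {"actor": "clinician-001", "action": "edit", "segment": 2})
    print("Log intact:", verify(log))

    # Quietly rewriting history breaks the chain.
    log[0]["event"]["actor"] = "clinician-001"
    print("Log intact after tampering:", verify(log))
```

The design choice is modest: the log does not stop anyone from editing a note, it just makes it obvious when the record of who did what has been changed after the fact.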
What happens next
The most plausible future is not AI replacing clinicians. It is clinicians supervising an expanding stack of AI-written drafts, summaries, and suggestions. Think less "robot doctor" and more "junior staffer who types fast, never sleeps, and occasionally hallucinates with the confidence of a man explaining crypto at a wedding."
If that is the future, traceability is not bureaucracy for its own sake. It is how medicine keeps accountability attached to the words that move through care. When the chart becomes a hybrid artifact, the signature matters as much as the sentence.
And that, buried under the paper’s calm title, is the real dispatch from the front: in the race to automate documentation, medicine is not just teaching machines to write. It is deciding whether anyone will still be able to tell whose pen drew the line.
References
1. Nargesi AA, You JG, Bitterman DS, et al. Tracing the Pen: Electronic Health Records Amid the Rise of Generative AI. npj Digital Medicine. Published April 21, 2026. doi:10.1038/s41746-026-02508-6. https://doi.org/10.1038/s41746-026-02508-6
2. Ma SP, et al. Ambient artificial intelligence scribes: utilization and impact on documentation time. Journal of the American Medical Informatics Association. 2025;32(2):381-385. PubMed PMID: 39688515. https://pubmed.ncbi.nlm.nih.gov/39688515/
3. Rotenstein LS, et al. Ambient Documentation Technology in Clinician Experience of Documentation Burden and Burnout. JAMA Network Open. 2025;8(8):e2528056. doi:10.1001/jamanetworkopen.2025.28056. https://pubmed.ncbi.nlm.nih.gov/40839265/
4. Omar M, Nadkarni GN, Klang E, Glicksberg BS. Large language models in medicine: A review of current clinical trials across healthcare applications. PLOS Digital Health. 2024;3(11):e0000662. doi:10.1371/journal.pdig.0000662. PMCID: PMC11575759. https://doi.org/10.1371/journal.pdig.0000662
5. Dathathri S, See A, Ghaisas S, et al. Scalable watermarking for identifying large language model outputs. Nature. 2024;634:818-823. doi:10.1038/s41586-024-08025-4. PMCID: PMC11499265. https://doi.org/10.1038/s41586-024-08025-4
6. Sadasivan VS, Kumar A, Balasubramanian S, Wang W, Feizi S. Can AI-Generated Text be Reliably Detected? arXiv:2303.11156, 2023. https://arxiv.org/abs/2303.11156
7. Chen S, et al. The effect of using a large language model to respond to patient messages. The Lancet Digital Health. 2024;6(7):e441-e443. doi:10.1016/S2589-7500(24)00111-0. https://pubmed.ncbi.nlm.nih.gov/38664108/
8. Tai-Seale M, et al. AI-Generated Draft Replies Integrated Into Health Records and Physicians’ Electronic Communication. JAMA Network Open. 2024;7(4):e246565. doi:10.1001/jamanetworkopen.2024.6565. PMCID: PMC11019394. https://pmc.ncbi.nlm.nih.gov/articles/PMC11019394/
9. Palm E, Manikantan A, Mahal H, Belwadi SS, Pepin ME. Assessing the quality of AI-generated clinical notes: validated evaluation of a large language model ambient scribe. Frontiers in Artificial Intelligence. 2025;8:1691499. doi:10.3389/frai.2025.1691499. PMCID: PMC12586549. https://doi.org/10.3389/frai.2025.1691499
Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.