There are two types of people: those who already know large language models will confidently invent nonsense when cornered, and those about to find out that the usual way we grade them may be encouraging that nonsense, like a sleep-deprived parent accidentally rewarding a supermarket tantrum with fruit snacks.
A new Nature paper by Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang makes a sharp, slightly uncomfortable point: if you score language models mostly on plain accuracy, you nudge them toward guessing instead of admitting uncertainty [1]. And honestly, that tracks. If every report card says, "Gold star for blurting something out," do not act shocked when the machine starts freewheeling like your uncle explaining crypto after two beers.
The Model Is Not Lying. It Is Panic-Answering.
Most modern LLMs are built on transformers, which use attention to weigh which earlier words matter when predicting the next one [2][3]. Under the hood, the whole setup starts with next-token prediction. The model sees a pile of text roughly the size of several libraries that nobody alphabetized, then learns to continue patterns.
That works absurdly well for grammar, style, and common facts. But the Nature paper argues there is a built-in problem: rare facts, especially one-off details, do not get reinforced the same way regular patterns do [1]. If a fact appears once, the model has much less statistical support than it has for, say, "the cat sat on the..." or "Paris is the capital of..." This means some errors are not just sloppy post-training behavior. They are baked into the training incentives from day one.
In tired-parent terms, the model knows the bedtime routine. It does not remember where you left the one green dinosaur sock three Tuesdays ago.
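To put a rough number on the green-sock problem, here is a minimal Python sketch that counts how many distinct facts in a toy corpus show up exactly once. Everything in it is an assumption made for illustration: the `singleton_fraction` helper, the idea of treating each string as one "fact," and the made-up corpus. It is not the paper's formal analysis, just the intuition in runnable form.

```python
from collections import Counter


def singleton_fraction(facts):
    """Return the share of distinct facts that occur exactly once."""
    counts = Counter(facts)
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Toy "training corpus" (hypothetical): each string stands in for one stated fact.
corpus = [
    "paris is the capital of france",
    "paris is the capital of france",
    "paris is the capital of france",
    "cats sit on mats",
    "cats sit on mats",
    "the green dinosaur sock is under the couch",
    "the spare key is in the blue mug",
]

print(singleton_fraction(corpus))  # 0.5
```

In this toy corpus, half of the distinct facts are one-offs, and those are the ones a pattern-learner has the least to go on.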
Accuracy Scores Have Main Character Energy
Here is the paper's real plot twist. Even if later tuning tries to make the model safer, our favorite benchmark habit still pushes in the wrong direction. Standard accuracy metrics reward a correct guess, but often do not meaningfully punish a wrong one or reward a careful "I don't know" [1].
That sounds minor until you realize what it teaches. If abstaining gets you nothing and guessing might get you points, the rational strategy is to keep talking. Which is also how toddlers answer "Who drew on the wall?" with "Maybe... the wall did it?"
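The arithmetic behind that is short enough to sketch. Assume, hypothetically, a grader that gives 1 point for a correct answer and 0 points for both a wrong answer and an "I don't know." The function name and the probabilities below are made up for illustration.

```python
def expected_accuracy_score(p_correct, answer):
    """Expected score under plain accuracy grading.

    Assumed rule (hypothetical): correct = 1 point, wrong = 0, abstain = 0.
    p_correct is the model's chance of being right if it answers.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    # Answering earns 1 point with probability p_correct; abstaining earns 0.
    return p_correct if answer else 0.0


for p in (0.9, 0.5, 0.1, 0.01):
    guess = expected_accuracy_score(p, answer=True)
    abstain = expected_accuracy_score(p, answer=False)
    print(f"p={p:.2f}  guess={guess:.2f}  abstain={abstain:.2f}")
```

For any nonzero chance of being right, the guess column beats the abstain column, so a score-maximizing model never has a reason to stay quiet.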
The authors propose "open-rubric" evaluations that explicitly state how much errors are penalized and whether abstention is acceptable [1]. In plain English: tell the model what kind of mistake budget exists, then see if it behaves differently when the stakes are high. That is a lot saner than pretending every question should be answered with the same confidence, whether it is "What color are bananas?" or "What chemotherapy protocol fits this patient?"
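One way such a rubric could work, sketched under assumptions that are mine rather than the paper's: a correct answer earns 1 point, an abstention earns 0, and a wrong answer costs a stated `wrong_penalty`. The function names are hypothetical. Under that rule, answering only pays off when the model's chance of being right clears `wrong_penalty / (1 + wrong_penalty)`.

```python
def expected_score(p_correct, wrong_penalty):
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def break_even_confidence(wrong_penalty):
    """Confidence at which answering and abstaining (score 0) tie."""
    return wrong_penalty / (1.0 + wrong_penalty)


def should_answer(p_correct, wrong_penalty):
    """Answer only when the expected score beats the 0 earned by abstaining."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Hypothetical stakes: no penalty, a mild penalty, a steep penalty.
for penalty in (0.0, 1.0, 9.0):
    threshold = break_even_confidence(penalty)
    decision = should_answer(0.8, penalty)
    print(f"penalty={penalty:.0f}  break-even={threshold:.2f}  answer at 80%? {decision}")
```

With no penalty, the break-even confidence is 0 and guessing always pays, which is the plain-accuracy case. With a penalty of 9, the model needs to be 90% sure before answering beats abstaining, so an 80%-confident model should sit that one out.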
The Babysitters Have Been Busy
This paper lands in the middle of a crowded, slightly frazzled research neighborhood. Surveys and benchmark papers from 2023 to 2025 show hallucination has become one of the central reliability problems in LLMs, with researchers splitting hairs in useful ways: factual errors, source-grounding failures, contradictions, and cases where the model drifts away from the prompt like a child wandering off because a butterfly looked interesting [4][5][6].
Benchmarks are getting more serious too. HalluLens, from ACL 2025, tries to separate different types of hallucination and measure them more cleanly [5]. MiniCheck, from EMNLP 2024, focuses on efficient fact-checking against grounding documents and shows smaller specialized systems can do this verification work far more cheaply than repeatedly asking a giant model to inspect its own homework [6].
Then there is grounding. WikiChat showed that tying a chatbot more tightly to Wikipedia can slash hallucinations in conversation [7]. It belongs to the same family of ideas as retrieval-augmented generation: do not make the model rely on vibes alone when you could hand it the actual receipt.
That is why this new paper matters. It does not just say, "Use retrieval" or "Add a verifier." It says the report card itself may be teaching bad habits. That is a deeper problem.
Why You Should Care Even If You Never Say "Loss Function" Out Loud
If LLMs are going into search, medicine, law, customer support, science tools, and internal company assistants, then "usually sounds right" is not good enough. A confidently wrong answer is worse than a shrug. A shrug is annoying. A polished fake citation is how meetings get scheduled, policies get misread, and someone ends up debugging a problem that never existed.
The practical lesson is refreshingly unglamorous: reward calibrated uncertainty. Punish unwarranted guessing. Measure whether a model knows when to sit down and be quiet.
Which, frankly, is good advice for both AI systems and children who have not napped.
References
[1] Kalai AT, Nachum O, Vempala SS, Zhang E. Evaluating large language models for accuracy incentivizes hallucinations. Nature. Published April 22, 2026. DOI: 10.1038/s41586-026-10549-w. PubMed: PMID 42020757.
[2] Wikipedia contributors. Large language model. Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
[3] Wikipedia contributors. Transformer (deep learning). Wikipedia. https://en.wikipedia.org/wiki/Transformer_(deep_learning)
[4] Huang L, Yu W, Ma W, et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv. 2023. arXiv:2311.05232.
[5] Bang Y, Ji Z, Schelten A, et al. HalluLens: LLM Hallucination Benchmark. In: ACL 2025. DOI: 10.18653/v1/2025.acl-long.1176.
[6] Tang L, Goyal T, Durmus E. MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. arXiv. 2024. arXiv:2404.10774.
[7] Semnani SJ, Yao V, Zhang H, Lam MS. WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia. Findings of EMNLP 2023. arXiv:2305.14292.
[8] Lin Z, Guan S, Zhang W, et al. Towards trustworthy LLMs: a review on debiasing and dehallucinating in large language models. Artificial Intelligence Review. 2024;57:243. DOI: 10.1007/s10462-024-10896-y.
Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.