aimedicineupdate.com July 20, 2025
Even after two years of rapid model upgrades, generative systems can still invent dates, mis-attribute quotations, or cite sources that do not exist. Microsoft researchers call these slip-ups ungrounded content: answers that look fluent but cannot be traced to any reliable evidence (source). The word “hallucination” itself, contested by some but widely used, was coined in AI by computer-vision scientists Simon Baker and Takeo Kanade, who titled their 1999 face-super-resolution paper “Hallucinating Faces” (source). It was later popularized for language models by Andrej Karpathy’s 2015 blog post showing an RNN that “hallucinated” non-existent URLs (source). Because LLMs now sit inside everyday assistants (Microsoft Copilot), developer tools (Google Gemini), social-media chatbots (xAI Grok), and enterprise platforms (Anthropic Claude 3), preventing hallucinations has become a baseline requirement for safe deployment.
1 | How we measure hallucinations today
There is no single “hallucination score,” so practitioners lean on complementary tests. Vectara’s HHEM 2.1 leaderboard checks whether a model stays faithful when it summarizes a document; the July 2025 table shows rates as low as 0.7 % for Gemini-2 Flash and 1.5 % for GPT-4o, while Grok-2 and GPT-3.5-Turbo cluster around 1.9 % (source). For open-domain questions, TruthfulQA probes whether models repeat common myths; GPT-4 tops the chart with 0.59 accuracy, nearly doubling GPT-3’s performance but still far from perfect truthfulness (source). Finally, the new HalluLens battery separates intrinsic logical errors from extrinsic fabricated facts and regenerates its test set on every run to block memorized answers, giving teams a moving target for stress tests (source). Taken together, these benchmarks let developers monitor month-to-month regressions, compare vendors on an equal footing, and decide how much additional guard-railing a particular workflow needs.
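Reproducing this kind of number on your own data does not require exotic tooling. The sketch below is a toy harness in the spirit of an HHEM-style evaluation; the `summarize` and `faithfulness_score` callables are placeholders for whatever generation model and judge (for example, an entailment classifier) you actually use.

```python
"""Toy harness for estimating a hallucination rate on a summarization set."""
from typing import Callable, Sequence


def hallucination_rate(
    documents: Sequence[str],
    summarize: Callable[[str], str],
    faithfulness_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of generated summaries judged unfaithful to their source."""
    flagged = 0
    for doc in documents:
        summary = summarize(doc)
        # The judge returns a support score in [0, 1]; below threshold counts as a hallucination.
        if faithfulness_score(doc, summary) < threshold:
            flagged += 1
    return flagged / max(len(documents), 1)


if __name__ == "__main__":
    # Stub model and judge so the harness runs end to end without any API keys.
    docs = ["The trial enrolled 120 patients and ended in March 2024."]
    fake_summarize = lambda d: d                      # identity "model": perfectly faithful
    fake_judge = lambda d, s: 1.0 if s in d else 0.0  # crude containment check
    print(hallucination_rate(docs, fake_summarize, fake_judge))  # -> 0.0
```

Running the same harness on a fixed document set every month is the cheapest way to catch a regression before users do.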
Ranking of the top 25 LLMs by hallucination rate as of July 16, 2025.

2 | Making hallucinations rare in practice
The most reliable defense starts before the first token is generated: craft a system prompt that fences the model in, for example, “Only use the sources provided; if you’re unsure, say ‘I don’t know.’” Controlled experiments at Microsoft show that clear grounding rules alone can trim hallucinations by roughly a third (source). Next, wire the model into a retrieval-augmented generation (RAG) loop, the architecture now standard in Copilot and Gemini; by injecting live web or enterprise snippets, RAG anchors answers to text the user can inspect (source).
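To make those two steps concrete, here is a minimal sketch of a grounded-generation loop. It assumes the `openai` Python client and a hypothetical `retrieve_snippets` helper standing in for your actual vector store, web search, or enterprise index.

```python
"""Minimal grounded-generation loop: a restrictive system prompt plus retrieval."""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GROUNDING_RULES = (
    "Only use the sources provided between <sources> tags. "
    "Cite the source id for every claim. "
    "If the sources do not contain the answer, say 'I don't know.'"
)


def retrieve_snippets(query: str, k: int = 4) -> list[str]:
    """Placeholder retriever: replace with your vector store or enterprise search."""
    raise NotImplementedError


def grounded_answer(question: str) -> str:
    snippets = retrieve_snippets(question)
    sources = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, 1))
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # any chat-completion model works here
        temperature=0,         # lower temperature reduces free invention
        messages=[
            {"role": "system", "content": GROUNDING_RULES},
            {"role": "user", "content": f"<sources>\n{sources}\n</sources>\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The key design choice is that the model never sees the question without the retrieved sources attached, so every answer can be checked against text the user can inspect.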
A second, quieter safeguard is self-critique: after drafting a reply, the model (or a smaller critic model) rereads its own text and flags claims that lack support. Research on self-evaluation and “chain-of-verification” techniques shows consistent drops in factual errors across tasks and model sizes (source).
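A minimal version of that second pass might look like the sketch below; the prompts, model choice, and the “NO ISSUES” convention are illustrative, not taken from any specific paper.

```python
"""Self-critique pass: a second call re-reads the draft against the same sources."""
from openai import OpenAI

client = OpenAI()


def self_check(draft: str, sources: str, model: str = "gpt-4o-mini") -> str:
    """Return the draft unchanged if it passes review, otherwise a revised version."""
    critique = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a strict fact-checker."},
            {"role": "user", "content": (
                "List every claim in the draft that is NOT supported by the sources. "
                "Reply with exactly 'NO ISSUES' if everything is supported.\n\n"
                f"<sources>\n{sources}\n</sources>\n\n<draft>\n{draft}\n</draft>"
            )},
        ],
    ).choices[0].message.content

    if critique.strip().upper().startswith("NO ISSUES"):
        return draft

    # Verification step: regenerate, telling the model exactly what to fix.
    return client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "Rewrite the draft using only the sources. Drop any claim they do not support."},
            {"role": "user", "content": (
                f"<sources>\n{sources}\n</sources>\n\n<draft>\n{draft}\n</draft>\n\n"
                f"Unsupported claims found by review:\n{critique}"
            )},
        ],
    ).choices[0].message.content
```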
Even so, high-stakes deployments keep humans in the loop. Azure’s groundedness-detection and correction service can analyze Copilot output in real time, identify ungrounded fragments relative to trusted sources, and, when configured, rewrite or remove unsupported content before users see it. In cases flagged as ambiguous or high-risk, implementing organizations can escalate output to expert review queues for human moderation (source). In practice, organizations blend these layers (prompt discipline, RAG, automated self-checks, and human oversight) to push hallucination rates below the thresholds their risk teams will accept.
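The escalation logic itself can be mundane. The sketch below shows one illustrative routing policy; the groundedness score is assumed to come from whatever detector you run (it is not the Azure API), and the thresholds are placeholders your risk team would set.

```python
"""Illustrative escalation policy; the score source and thresholds are placeholders."""
from dataclasses import dataclass


@dataclass
class Decision:
    action: str   # "serve", "rewrite", or "human_review"
    reason: str


def route(groundedness: float, high_stakes: bool) -> Decision:
    """Decide what to do with a drafted answer before a user sees it."""
    if groundedness >= 0.9 and not high_stakes:
        return Decision("serve", "well grounded, low risk")
    if groundedness >= 0.9 and high_stakes:
        return Decision("human_review", "grounded but high stakes: SME sign-off required")
    if groundedness >= 0.6:
        return Decision("rewrite", "partially grounded: regenerate with stricter prompt")
    return Decision("human_review", "ungrounded: block and escalate")
```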
3 | Why humans (and their expertise) still matter
Automation can highlight suspicious lines, but a true expert will spot a made-up claim in seconds. MedHallu, a 10,000-question benchmark for medical hallucinations, shows the gap: top LLMs caught only about 63 % of the trickiest errors, while doctors did far better (source).
Imagine an LLM summarizing the 2025 cardiology guidelines and slipping in a non-existent anticoagulant called cardioxalin. A cardiologist would notice instantly that there is no such drug, whereas a general health blogger might repeat the name without question. That difference is why high-stakes deployments need to keep a subject-matter expert (SME) in the loop, and why SME feedback should flow back into the prompts and retrieval sources so that mistakes like this cannot recur.
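One lightweight way to operationalize that feedback loop, sketched below with an illustrative file format and helper names, is to keep an SME-curated list of corrections and append it to the grounding prompt (or index it for retrieval) so the same fabrication cannot slip through twice.

```python
"""Sketch of folding SME corrections back into the prompt and retrieval layer."""
import json
from pathlib import Path

# Example contents: {"cardioxalin": "No such drug exists; never mention it."}
CORRECTIONS_FILE = Path("sme_corrections.json")


def load_corrections() -> dict[str, str]:
    return json.loads(CORRECTIONS_FILE.read_text()) if CORRECTIONS_FILE.exists() else {}


def augment_system_prompt(base_rules: str) -> str:
    """Append reviewer-confirmed corrections so known mistakes cannot recur."""
    notes = load_corrections()
    if not notes:
        return base_rules
    bullet_list = "\n".join(f"- {term}: {note}" for term, note in notes.items())
    return base_rules + "\n\nKnown corrections from expert review:\n" + bullet_list
```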
When the cost of a missed hallucination is measured in health outcomes, legal exposure, or dollars, teaming up with a domain expert is still the safest bet.
Quick “no-hallucination” checklist
- Begin every prompt with a grounding rule.
- Inject authoritative context via RAG.
- Run a self-critique or ensemble vote before serving answers.
- Track at least one public benchmark (e.g., HHEM) each month.
- Route high-stakes content through SME review.
Blend these habits into your workflow and the odds of an embarrassing or costly hallucination fall from “likely” to “rare and catchable.”
Disclaimer
This article was drafted by a human author. All sources listed above were manually checked for accuracy. OpenAI ChatGPT assisted with literature discovery, organization of key points, and wording improvements, but final content selection, fact-verification, and editing were performed by the author.