Anthropic Models Take Top 3 Spots With Lowest Hallucination Rates In New Omniscience Benchmark

Hallucination has been a persistent problem in the large-scale deployment of LLMs, and a new benchmark looks to quantify the issue and determine which models hallucinate the least.

Artificial Analysis has released AA-Omniscience, a new benchmark that tests AI models on hallucination rates. It measures both factual recall and knowledge calibration across 6,000 questions covering 42 economically relevant topics within six domains: Business, Humanities & Social Sciences, Health, Law, Software Engineering, and Science, Engineering & Mathematics.

Unlike traditional benchmarks that reward guessing, AA-Omniscience introduces a novel metric called the Omniscience Index that ranges from -100 to 100. The index penalizes incorrect answers and rewards models for abstaining when uncertain, with a score of 0 representing a model that answers correctly as often as it answers incorrectly.
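The article describes the index only qualitatively, so the exact formula is not given here. As a rough sketch, a score built this way could treat correct answers as +1, incorrect answers as -1, and abstentions as 0, scaled to the -100 to 100 range; the helper below is an illustration of that assumed scheme, not Artificial Analysis's published definition.

```python
# Illustrative sketch only: the exact Omniscience Index formula is not given
# in the article, so this scoring scheme is an assumption.

def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """Score from -100 to 100: right answers add, wrong answers subtract,
    abstentions are neutral. A score of 0 means the model answered
    correctly exactly as often as it answered incorrectly."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total

# A model that guesses on everything it doesn't know is penalized more
# heavily than one that abstains when uncertain.
print(omniscience_index(correct=390, incorrect=350, abstained=260))  # 4.0
print(omniscience_index(correct=390, incorrect=610, abstained=0))    # -22.0
```

Under a scheme like this, guessing on every unknown question drags the score well below zero, while abstaining keeps the penalty contained.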

Questions are derived from authoritative academic and industry sources, filtered to ensure frontier-level difficulty, and designed to test a model’s embedded knowledge without access to external tools or context.

The Hallucination Challenge

The results paint a sobering picture of current AI capabilities. Only three models managed to achieve an Omniscience Index score above zero, with Claude 4.1 Opus leading at 4.8. This means the vast majority of evaluated models—including many considered “frontier” systems—produce incorrect answers more often than correct ones when accounting for hallucinations. GPT-5.1 and Grok 4 followed in the next two spots.

High hallucination rates proved to be the dominant factor driving low scores. For instance, while Grok 4 and GPT-5 (high) recorded the highest accuracy at 39%, their hallucination rates of 64% and 81%, respectively, resulted in substantial penalties on the Omniscience Index.

In stark contrast, Claude 4.1 Opus achieved 36% accuracy alongside one of the lowest hallucination rates, yielding the highest overall score due to its stronger calibration. Claude 4.5 Haiku demonstrated similar restraint, achieving only 16% accuracy but maintaining a notably low 26% hallucination rate.
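For intuition, here is a rough back-of-the-envelope calculation using the reported figures. It assumes the hallucination rate is the share of non-correct questions answered incorrectly rather than abstained, and that the index is roughly correct-minus-incorrect; the article does not spell out either definition, so treat the numbers as illustrative only.

```python
# Rough, assumption-laden arithmetic: treats the hallucination rate as the
# fraction of questions answered incorrectly out of those not answered
# correctly. The article does not define the metrics this precisely.

def rough_index(accuracy: float, hallucination_rate: float) -> float:
    incorrect = (1.0 - accuracy) * hallucination_rate   # share answered wrong
    return 100.0 * (accuracy - incorrect)               # correct minus wrong

print(rough_index(0.39, 0.81))  # GPT-5 (high): roughly -10
print(rough_index(0.16, 0.26))  # Claude 4.5 Haiku: roughly -6
```

The point of the exercise is that a lower-accuracy but better-calibrated model can still end up with the less negative score, which matches the contrast the benchmark reports.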

Leaders Across Key Metrics

Accuracy (Raw Knowledge): Grok 4 and GPT-5 (high) tied for the highest accuracy at 39%, with Claude 4.1 Opus close behind at 36%.

Hallucination Rate (Reliability): Anthropic’s models dominated this critical metric, with Claude 4.1 Opus and Claude 4.5 Haiku both demonstrating exceptional calibration—knowing when to abstain rather than guess incorrectly.

Domain-Specific Performance: No single model consistently dominated across all six domains. Claude 4.1 Opus led in Law, Software Engineering, and Humanities & Social Sciences; GPT-5.1 achieved the highest reliability on Business questions; and Grok 4 performed best in Health and in Science, Engineering & Mathematics.

Intelligence Doesn’t Equal Reliability

Perhaps most surprisingly, the research found that overall intelligence does not reliably predict strong embedded knowledge or low hallucination rates. When compared against Artificial Analysis’s Intelligence Index, which measures general capabilities across tasks like coding and reasoning, several high-performing models showed poor factual reliability.

Models such as Minimax M2 and gpt-oss-120b (high) achieved strong Intelligence Index scores, yet their elevated hallucination rates resulted in poor performance on the Omniscience Index, making them unsuitable for applications that depend on factual accuracy.

The Cost of Reliability

The benchmark also revealed a clear positive association between model performance and cost, indicating that achieving higher levels of factual reliability often requires greater expenditure. However, some models proved more cost-efficient than others.

Claude 4.5 Haiku attained a higher Omniscience Index than several substantially more expensive models, including GPT-5 (high) and Kimi K2 Thinking, suggesting certain models offer more favorable cost efficiency for knowledge-intensive tasks.

Implications for Enterprise AI

The results carry significant implications for organizations deploying AI systems in knowledge-intensive domains. The benchmark focuses on economically important fields, with the six evaluated domains collectively accounting for 44% of U.S. wages in 2024.

The evaluation addresses a critical gap: even when models have retrieval or tool-use capabilities, embedded knowledge remains both competitive with and a prerequisite for effective tool use, since models with weak embedded knowledge struggle to understand context and to search efficiently.

The benchmark’s methodology—using an automated question generation agent that derives questions from authoritative sources—allows it to scale across domains and continuously update with recent information, ensuring its continued relevance as models evolve.

For organizations selecting AI models for deployment, the research suggests that general capability benchmarks don’t tell the whole story. Models that appear suboptimal in overall rankings may offer competitive or superior reliability within targeted domains, and models with strong general knowledge do not necessarily demonstrate high reliability within every specific domain.
