AI Search: Why LLMs Get Things Wrong (Part 2 of 3)

Kathryn Hillis

Director of Organic Search Strategy

Most of us who use Google AI Overviews, ChatGPT, or other AI search technologies have come across everything from partially correct AI-generated responses to perplexingly incorrect AI summaries. These AI tools are powered by large language models (LLMs), and understanding where LLMs fail can help explain why AI search sometimes gets things wrong.

This article organizes LLM failure points into three categories: Data Limitations, Retrieval and Context Issues, and Generation Behavior. Errors often involve multiple failure points, and specific behaviors vary by model. Still, the same broad patterns show up across major platforms, and recognizing them is a good starting point for assessing reliability risk and what you may be able to influence.

It’s worth being familiar with the causes of these errors because reliability remains a weakness for LLMs, even as other capabilities improve. The recently published Phare LLM benchmark V2 (December 2025) highlights this.

This article builds on the previous article in this series about how LLMs work. There, we discussed how LLMs are pattern-learning prediction machines. They learn statistical patterns from large amounts of data, then generate outputs based on those patterns. That process can produce errors at different stages, discussed below.
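To make that concrete, here is a toy next-word predictor built from word-pair counts. It’s a deliberate oversimplification (production LLMs use neural networks over subword tokens and vastly more data), and the training snippet is invented for illustration, but it shows the core mechanic: the output is the statistically most common continuation, not a verified fact.

```python
from collections import Counter, defaultdict

# Toy illustration of next-word prediction. Real LLMs use neural
# networks over subword tokens, but the core idea is the same:
# predict the most statistically likely continuation.
corpus = (
    "the park is open today . "
    "the park is open daily . "
    "the park is closed for maintenance ."
).split()

# Count which word follows each word in the training text.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict(prev_word: str) -> str:
    """Return the continuation seen most often in training."""
    return counts[prev_word].most_common(1)[0][0]

# "open" followed "is" twice and "closed" once, so the model
# predicts "open" -- a pattern, not a verified fact.
print(predict("is"))  # -> open
```

Scale this idea up by many orders of magnitude and you get the behavior the rest of this article examines: fluent, pattern-driven answers that are only as good as the patterns behind them.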

Data Limitations

LLMs, and the AI search experiences they power, can become unreliable when the underlying information is limited, contradictory, or wrong. In these cases, AI systems may fill gaps with a plausible-sounding, incorrect response.

Limited Or Inconsistent Information

LLMs are most reliable when information appears across sources in repeatable, consistent patterns. When data is sparse or contradictory, outputs become less reliable.

Search queries are becoming longer and more conversational. Google, for example, stated during its Q4 earnings call that queries in AI Mode are three times longer than traditional search queries. The shift toward conversational queries can surface specialized, long-tail intent groups that may not be well covered yet.

When coverage is thin, AI systems will generally still produce an answer (more on this in the last category in this article). They may fill gaps by pulling in nearby information that seems related but does not actually apply, or by generating plausible-sounding details not grounded in a source.

This scenario presents both opportunity and liability. For more specific topics, you may be able to earn visibility relatively quickly by publishing clear, specific, extractable answers. On the flip side, if an authoritative source does not exist for a topic, your audience may encounter less reliable information in AI experiences that sound definitive and true.

I recently came across an example of how limited information can contribute to a specific but incorrect answer in Google AI Overviews. I was searching for scheduled maintenance for a rollercoaster at Universal in Orlando. The AI Overview returned potential closure dates.

Upon closer look, the closure dates provided in the overview were actually for a Universal water park, not the ride I searched for. The overview seems to have pulled the dates from a credible, third-party page. To a human reader, that page presented clear, obvious closure dates for the water park. The ride I searched for was mentioned on the page, but in separate sections with no mention of closures.

I could not find any page that directly addresses scheduled maintenance for rides. The closest official source is the Universal Park Hours page, which lists temporary closures without much explanation. Third-party sites tend to do the same. They list what’s closed, but they don’t directly state whether major attractions have scheduled seasonal closures or are generally expected to run year-round aside from unexpected downtime.

When an information environment is explicit about some facts (what is closed) but leaves others implied (“everything else is operating normally now and for the foreseeable future”), the direct-answer gap can result in AI systems retrieving information that seems relevant but isn’t. We’ll discuss retrieval more in the next section.

Whether it’s worth adding these types of explicit claims to your content will vary case by case. But the underlying pattern that played out in this example is worth being aware of.

Outdated Information

Training data has a cutoff point. Most major AI tools supplement training with web retrieval to help fill gaps, a process covered in Part 1 of this series. But not all queries trigger a live search, and retrieval doesn’t guarantee the newest source is selected.
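Below is a minimal sketch of that decision flow. Every function, date, and URL in it is invented for illustration; real systems use learned classifiers and relevance scoring rather than keyword checks, but the two failure paths are the same.

```python
from datetime import date

# Hypothetical sketch: why retrieval doesn't guarantee freshness.
TRAINING_CUTOFF = date(2024, 6, 1)  # assumed cutoff, for illustration only

def looks_time_sensitive(query: str) -> bool:
    # Stand-in for a learned classifier that decides whether to search.
    return any(w in query.lower() for w in ("today", "hours", "latest"))

def retrieve(query: str) -> list[dict]:
    # Placeholder for a web search returning documents with publish dates.
    return [
        {"url": "example.com/old-hours", "published": date(2023, 3, 1)},
        {"url": "example.com/new-hours", "published": date(2026, 1, 5)},
    ]

def answer(query: str) -> str:
    if not looks_time_sensitive(query):
        # Failure path 1: no live search is triggered, so the answer
        # can only be as fresh as the training cutoff.
        return f"answered from training data (cutoff {TRAINING_CUTOFF})"
    docs = retrieve(query)
    # Failure path 2: selection is driven by relevance scoring, not
    # recency. Taking the first result mimics an older source winning.
    chosen = docs[0]
    return f"answered from {chosen['url']} (published {chosen['published']})"

print(answer("when did the park open"))  # training-data path
print(answer("park hours today"))        # retrieval path, stale source chosen
```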

Incorrect Sources

“Incorrect sources” refers to cases where the source information is wrong, whether by accident or deliberate manipulation. This can overlap with conflicting sources, discussed in the next section, because accurate and inaccurate sources can coexist and compete.

A joint study led by Anthropic found that as few as 250 documents can poison LLMs of any size. The study focused narrowly on hidden triggers that cause abnormal behavior, not on corrupting general knowledge, but it still confirms that LLMs can be influenced by surprisingly small volumes of data.

The retrieval systems in some AI search experiences may be especially vulnerable. The Times reported in December 2025 on scammers seeding fake call center numbers across the web. In the reported cases, Google’s AI Overviews and Perplexity returned incorrect numbers. ChatGPT and Claude returned correct answers, but still drew from some of the poisoned sources.

In a similar vein, Ahrefs ran a misinformation experiment in December 2025. They created a fake luxury brand with an official FAQ, then seeded contradictory, incorrect information across third-party websites like Reddit and Medium. In their testing, specific misinformation often won out over vague truth. Their findings won’t apply perfectly to established brands because their fake brand was newly created, but the apparent preference for specificity is still worth keeping in mind.

Retrieval and Context Issues

Retrieval and context issues arise when accurate information exists but models don’t find it or use it correctly. The previous article covers how retrieval works, including how pages are broken into smaller segments (chunks) that can lose surrounding context.
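The sketch below shows how naive fixed-width chunking can detach a fact from the heading that identifies what it’s about, much like the theme-park example above. The page text, attraction names, and chunk size are all made up; production systems use smarter segmentation, but the same loss of context can still occur.

```python
# Sketch of how fixed-size chunking can strip the context that says
# what a passage is about. Page text and chunk size are invented.

page = (
    "Example Water Park\n"
    "Closed for annual maintenance January 6-17.\n\n"
    "Example Roller Coaster\n"
    "Single rider line available. Minimum height 51 inches.\n"
)

def chunk(text: str, size: int = 50) -> list[str]:
    """Naive fixed-width chunking that ignores headings and sections."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for i, c in enumerate(chunk(page)):
    print(f"--- chunk {i} ---\n{c}")

# With size=50, the maintenance dates land in a chunk alongside the
# coaster's name, while the water-park heading stays behind in chunk 0,
# so a retriever can attach the dates to the wrong attraction.
```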

The examples in this section focus on selection and interpretation problems, where the model’s choice or handling of sources makes the answer effectively wrong. While this section focuses on cases where information gets misused or passed over, the same retrieval and selection issues can surface misinformation like the examples touched on above.

Conflicting Sources

AI search tools and traditional search engines both have to navigate multiple plausible sources competing with each other. An important difference is that traditional search shows you the competing links, while AI search may synthesize a single answer for you.

This Seer Interactive analysis is a good example of how source conflicts can create friction for users, even when the outcome is not outright harmful. They searched for branded phone numbers across multiple AI models and found that third-party sources were cited more often than brand-owned sources. The phone number provided by AI search did not match the brand’s customer service page 36% of the time. Fortunately, in this testing, AI systems generally provided a viable company phone number, just not the right one.

Context Confusion

Missing or misinterpreted context can affect AI search outputs in unexpected ways. For example, LLMs can miss signals humans pick up naturally. Sarcasm, speculative forum posts, and April Fools jokes can be misinterpreted as fact.

Generation Behavior

The generation behavior category covers the way LLMs produce answers. Even with accurate source material, models may overstate confidence, generalize, or go along with confident-sounding prompts. These factors are the hardest to directly influence from a marketing perspective, but they’re worth being aware of.

Guessing Rather Than Expressing Uncertainty

Current training “rewards guessing over acknowledging uncertainty,” as OpenAI reports in its research paper Why Language Models Hallucinate. This is an important issue, and steps are being taken to improve it.

For now, though, large language models tend to guess confidently rather than acknowledge uncertainty. The practical risk for brands is that a model will likely generate a confident answer no matter what, even when the underlying information is missing or unclear.
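Here’s a small sketch of the difference between always guessing and abstaining under uncertainty. The candidate answers and probabilities are invented; the point is that picking the top-scoring option, however weak, always produces a confident-sounding answer.

```python
# Invented example: standard decoding picks the top option no matter
# how weak it is, while an uncertainty-aware system could decline.

def pick_answer(candidates: dict[str, float], abstain_below: float = 0.0) -> str:
    best, prob = max(candidates.items(), key=lambda kv: kv[1])
    if prob < abstain_below:
        return "I'm not sure."
    return best

# Thin, ambiguous evidence: no candidate is well supported.
weak_evidence = {
    "closed Jan 6-17": 0.31,
    "open year-round": 0.28,
    "closed in March": 0.22,
}

print(pick_answer(weak_evidence))                     # confident guess: closed Jan 6-17
print(pick_answer(weak_evidence, abstain_below=0.5))  # I'm not sure.
```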

Agreeing with Confident Users

LLMs can also be swayed by how confidently users phrase things, a behavior known as sycophancy. For example, results from Phare v1, published in May 2025, revealed that information presented tentatively (“I heard that…”) was more likely to be corrected by LLMs than information delivered confidently (“My teacher told me…”). The degree of sycophancy varied by model, which the linked article describes in detail.

Synthesis Errors

Synthesis errors are mistakes introduced during summarization. The underlying sources can be correct, but the AI output can still shift the meaning of key claims. This can happen, for example, when a model misses a distinction that matters or paraphrases in a way that doesn’t preserve what the source actually says.

One way synthesis can fall short is overgeneralization. Common language patterns can be statistically stronger than nuanced claims or niche details, nudging models towards more generic phrasing.
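As a toy demonstration, imagine a summarizer that scores candidate phrasings by how often similar wording appears in its training data. The phrases and counts below are invented, but they show how the generic version wins and the qualifier quietly disappears.

```python
# Invented frequencies: generic phrasing is far more common than the
# accurate, qualified version, so a frequency-driven scorer drops the
# distinction that matters.
phrase_frequency = {
    "the treatment reduced symptoms": 900,                    # generic, common
    "the treatment reduced symptoms in adults over 65": 12,   # accurate, rare
}

summary = max(phrase_frequency, key=phrase_frequency.get)
print(summary)  # -> "the treatment reduced symptoms" (qualifier lost)
```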

This tendency has been measured in scientific summarization. Signs of it were also present when the BBC asked four generative AI tools 100 news questions and found that over half of the answers had significant issues.

The risk of overgeneralization and other synthesis issues may be higher in information spaces where details are non-standard, terminology is evolving, or small distinctions matter a lot to accuracy.

Conclusion

LLMs can produce unexpected results for various reasons. AI search strategy is part opportunity and part risk mitigation, and the risk side includes information related to your brand and broader topic space being wrong. In the final article, we’ll talk about how to prioritize AI-focused updates on your website, the part of the information ecosystem most directly in your control.