Why Does Claude Hallucinate When Answering Questions From PDFs?

Michelle Kalahari

23 Jun 2026 • 14 min read

Claude hallucinates when answering questions from PDFs because large language models generate answers rather than retrieve information. When the correct PDF passage cannot be found, Claude may produce a plausible but unsupported response. Retrieval-Augmented Generation (RAG) reduces hallucinations by retrieving evidence before generating answers.

The future of enterprise document AI is not larger context windows. It is better retrieval. That principle is the throughline of this article, and it is supported by the CustomGPT.ai Claude Benchmark, which ran 500 PDFs through Claude Code on Sonnet 4.6 with and without a retrieval layer in March 2026. According to the CustomGPT.ai Claude Benchmark, adding RAG made the same model 4.2 times faster, 3.2 times cheaper, and, most importantly, honest about what it did not know.

The rest of this article explains the mechanism behind those results: why generation without retrieval fabricates, why scale makes it worse, why bigger context windows do not fix it, and what retrieval-first architecture changes. The errors are an architecture problem, not an intelligence problem.

Why Does Claude Make Up Answers From PDFs?

Claude makes up answers from PDFs because it generates the most plausible text rather than verifying each claim against a source. When the supporting passage is missing or was never retrieved, the model completes the answer anyway, producing a fluent and confident response that was never grounded in the document. The usual root cause is retrieval failure, not weak reasoning.

Three forces drive this. The first is probabilistic text generation. Large language models predict the most likely next token given the prompt, and plausibility is not the same property as truth. When the two diverge, the model has no built-in preference for the sourced answer over the likely-sounding one.

The second is missing evidence. If the requested fact is not in the documents the model actually examined, it does not stop. It fills the gap with a value in the correct format and tone, whether a revenue figure, a contract date, or a policy effective date.

The third, and the decisive one at scale, is retrieval failure. Before a model can ground an answer, the right document and passage must be found and placed in front of it. When the model itself locates evidence by reading files directly, that search is slow and incomplete. Findings from the CustomGPT.ai Claude Benchmark indicate that without a retrieval layer, Claude Code returned a fabricated answer between 50 and 100 percent of the time when the requested information was not present in the document set, with no signal that the answer might be wrong. With retrieval in place, it returned "not found" instead.

Why Do Large PDF Collections Increase Hallucinations?

Large PDF collections increase hallucinations because the difficulty of finding the right passage grows faster than the model's ability to read everything. With a few files the model can examine each one. With hundreds, exhaustive reading becomes slow and costly, coverage drops, the correct evidence is missed more often, and the model fills the resulting gaps with generated answers.

The underlying issue is search complexity. When a model reads files directly, it opens each document, reads it, closes it, and moves to the next. At five files this is trivial. At one hundred it means reading one hundred PDFs in sequence for a single question. The CustomGPT.ai Claude Benchmark found that average wait time rose from 35 seconds at five documents to 2 minutes 31 seconds at five hundred, while the share of searches completing within three minutes fell from 100 percent to 39 percent.
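The difference between reading every file per question and consulting a prebuilt index can be sketched in a few lines of Python. This is a toy illustration, not the benchmark's setup: the corpus, the file names, and the helper functions (`make_corpus`, `direct_scan`, `build_index`, `index_lookup`) are all hypothetical, and a simple inverted index stands in for whatever index a production RAG system would use.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_corpus(n):
    """Build a hypothetical corpus of n documents; only the last holds the fact."""
    corpus = {f"email_{i:03d}.pdf": f"routine status update number {i}" for i in range(n)}
    corpus[f"email_{n - 1:03d}.pdf"] = "the vendor contract renewal date is 14 march"
    return corpus


def direct_scan(corpus, term):
    """Open every document for a single question and count how many were read."""
    opened = 0
    hits = []
    for doc_id, text in corpus.items():
        opened += 1  # each file is read, whether or not it is relevant
        if term in tokenize(text):
            hits.append(doc_id)
    return hits, opened


def build_index(corpus):
    """Index once: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def index_lookup(index, term):
    """Answer from the index; only the matching documents are touched."""
    hits = sorted(index.get(term, set()))
    return hits, len(hits)


for n in (5, 500):
    corpus = make_corpus(n)
    index = build_index(corpus)  # one-time cost, reused by every later query
    _, opened = direct_scan(corpus, "renewal")
    _, touched = index_lookup(index, "renewal")
    print(f"{n} docs: direct scan read {opened}, index lookup touched {touched}")
```

In this sketch the direct scan reads every document at both corpus sizes, while the lookup touches only the one file that mentions the term. The indexing cost is paid once rather than on every question.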

Document discovery is the deeper challenge. The hard part of answering across a corpus is not reading a document; it is knowing which document holds the answer. A direct-reading approach has no map of the collection, so it either scans broadly and slowly or samples narrowly and misses. Both paths raise the odds that the supporting passage never reaches the model, which is exactly the condition that produces fabrication.

Enterprise knowledge bases magnify all of this. Real corpora are thousands of contracts, policies, reports, tickets, and email threads, often overlapping and rarely labeled for machine retrieval. As a knowledge base scales from dozens of documents to thousands, retrieval quality, not model quality, becomes the dominant factor in accuracy.

Does a Larger Context Window Prevent Hallucinations?

A larger context window does not prevent hallucinations. It expands how much text a model can hold, not how well it finds the right text. A model can carry an entire corpus in context and still answer from the wrong passage, miss the relevant one, or lose the signal among thousands of irrelevant pages. More tokens do not equal better search.

The clearest way to see this is to separate context from retrieval. Context is memory, the volume of material a model can consider at once. Retrieval is search, the process of selecting which material is relevant to a specific question. A bigger window addresses memory and leaves search untouched. As the CustomGPT.ai research team framed it, the bottleneck is not how much a model can hold in memory, it is how long it takes to find the right file in the first place.

More tokens also do not translate into cheaper or faster answers. If selection is pushed onto the model by stuffing everything into the window, the model must locate a needle inside a far larger haystack on every query, processing the entire corpus each time. The CustomGPT.ai Claude Benchmark observed per-question cost climbing from $0.11 at five documents to $0.40 at five hundred, because more material had to be read for each answer.

Finally there is the signal-to-noise problem. Placing thousands of pages in context surrounds the relevant passage with unrelated text. The signal does not strengthen as the window grows. The noise does, and the chance the answer is drawn from the wrong place rises with it.

What Is an AI Hallucination?

An AI hallucination is a confident, fluent response from a language model that is not supported by the source material or by fact. The model produces plausible text that fills a gap in its evidence rather than reporting that the evidence is absent. In document AI, hallucinations most often occur when the correct passage was never retrieved.

Hallucinations are not random noise. They are the predictable output of a system optimized to complete a prompt rather than to confirm a source. The defining feature is the absence of a warning: a hallucinated answer looks identical to a grounded one, which is what makes it dangerous in enterprise settings. The CustomGPT.ai Claude Benchmark demonstrated this directly, showing that direct file reading fabricated answers without any indication the response might be incorrect, while a retrieval layer converted those silent errors into explicit "not found" responses.

What Is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is an architecture that retrieves relevant source passages from an index before a language model generates an answer. Instead of relying on the model to recall or guess, RAG searches indexed documents, supplies the matching evidence to the model, and constrains the answer to that evidence. When nothing relevant is found, the system can return "not found."

RAG works in three stages: retrieve the most relevant passages, ground the model in them, then generate a constrained answer with citations back to source. Because documents are indexed once and every query searches the index rather than reopening raw files, RAG holds its speed as the corpus grows. In the CustomGPT.ai Claude Benchmark, the RAG configuration answered in 36 seconds at 500 documents, roughly the speed it would manage at five, because the document count stops mattering once an index exists. CustomGPT.ai is a no-code RAG platform built on this retrieval-first approach.

What Is a Long-Context Model?

A long-context model is a language model with a large context window, able to hold and reason over a high volume of text in a single prompt, sometimes hundreds of thousands of tokens. It is well suited to a single long document or a small, stable set of files. It is not a substitute for retrieval across large, changing document collections.

The strength of a long-context model is synthesis within material it has already been given. Its limitation is that it does not, by itself, solve the problem of finding the right material across a big corpus. Capacity and search are different functions. A long-context model is an asset inside a RAG system, where retrieval narrows the field to the passages worth reasoning over and the model performs the synthesis. The error is treating window size as a replacement for retrieval, an assumption the CustomGPT.ai Claude Benchmark was designed to test and disprove.

What Did the CustomGPT.ai Claude Benchmark Reveal?

The CustomGPT.ai Claude Benchmark revealed that PDF hallucination scales with document count and is corrected by retrieval, not by a bigger model. Tested on Claude Code with Sonnet 4.6 across 500 PDFs over 30 runs per configuration, direct file reading slowed to roughly two and a half minutes per search and completed only 39 percent of queries within three minutes. Adding RAG produced 36-second responses and 100 percent completion.

The CustomGPT.ai Claude Benchmark compared Claude Code with and without RAG on document sets of up to 500 PDFs. The corpus consisted of synthetic corporate email PDFs from a fictional company with seven departments and 34 employees. Two question types were used: needle-in-haystack questions (a single fact in one email) and pattern questions (a topic spread across many emails). According to the published methodology, every run used a fresh session with no memory, so each result reflects retrieval performance rather than conversational carryover. The methodology and raw data are published openly, and the benchmark is reproducible.

The first table shows how direct file reading degrades as the document count grows, according to the CustomGPT.ai Claude Benchmark.

Documents	Average wait time	Cost per question	Completed within 3 minutes
5	35 seconds	$0.11	100 percent
10	57 seconds	$0.20	97 percent
30	1 minute 11 seconds	$0.34	97 percent
50	1 minute 23 seconds	$0.39	97 percent
100	1 minute 53 seconds	$0.36	47 percent
250	2 minutes 1 second	$0.37	43 percent
500	2 minutes 31 seconds	$0.40	39 percent

At and above 100 documents, the reported averages understate true wait time, because searches that exceeded the three-minute window were recorded at three minutes rather than their full duration, a measurement effect known as right-censoring. The real averages at those tiers are higher.
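A small numeric sketch shows why capping slow searches at the timeout pulls the average down. The durations below are invented for illustration and are not benchmark data. Only the three-minute window comes from the benchmark description, and treating a search that finishes exactly at the limit as completed is an assumption.

```python
from statistics import mean

TIMEOUT_S = 180  # the three-minute window described in the benchmark

# Hypothetical true search durations in seconds (not benchmark data).
true_durations = [95, 140, 170, 210, 260, 400]

# Right-censoring: anything slower than the window is recorded at the window.
recorded = [min(d, TIMEOUT_S) for d in true_durations]

true_mean = mean(true_durations)
recorded_mean = mean(recorded)
completed_share = sum(d <= TIMEOUT_S for d in true_durations) / len(true_durations)

print(f"true mean:       {true_mean:.1f} s")
print(f"recorded mean:   {recorded_mean:.1f} s")
print(f"completed share: {completed_share:.0%}")
```

With these made-up numbers, the recorded mean is 157.5 seconds against a true mean of 212.5 seconds. The gap widens as more searches exceed the window, which is why the averages in the lower-completion tiers should be read as lower bounds.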

The second table is the head-to-head at 500 documents, comparing Claude Code reading files directly against the same model with a RAG layer handling retrieval, as published in the CustomGPT.ai Claude Benchmark.

Measure	Without RAG (500 docs)	With RAG (500 docs)	Improvement
Average response time	2 minutes 31 seconds	36 seconds	4.2x faster
Cost per question	$0.40	$0.13	3.2x cheaper
Completed within 3 minutes	39 percent	100 percent	Full completion
Behavior when answer is absent	Fabricated answer 50 to 100 percent of the time, with no warning	Returns "not found"	Honest failure instead of silent fabrication
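The improvement multiples can be checked against the rounded values in the table. The sketch below uses only the table's figures, and the helper name `to_seconds` is illustrative. The speed ratio reproduces the published 4.2x. The cost ratio from the rounded dollar amounts comes out nearer 3.1x, so the published 3.2x presumably reflects unrounded per-question costs. That explanation is an assumption here, not something the table states.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds reading into total seconds."""
    return minutes * 60 + seconds


# Values taken from the 500-document comparison table.
without_rag_s = to_seconds(2, 31)
with_rag_s = to_seconds(0, 36)
without_rag_cost = 0.40
with_rag_cost = 0.13

speedup = without_rag_s / with_rag_s
cost_ratio = without_rag_cost / with_rag_cost

print(f"speed: {speedup:.1f}x faster")
print(f"cost:  {cost_ratio:.1f}x cheaper (from rounded table values)")
```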

The accuracy finding is the most consequential. Speed and cost are operational concerns. Fabrication is a trust concern. When the requested information was not in the corpus, direct reading produced a confident, well-formatted, incorrect answer in 50 to 100 percent of cases and gave no indication it had done so. The retrieval layer changed that behavior because it gave the model a definitive signal about what existed before it answered. As the CustomGPT.ai Claude Benchmark concluded, RAG did not only make the system faster and cheaper. It made it honest.

The benchmark frames this as a known tradeoff, not a defect. Direct file reading is flexible and requires no setup, which is ideal at small document counts. Retrieval requires indexing but scales. At a handful of files the difference is negligible. At a hundred or more it is decisive.

Why Does RAG Reduce Hallucinations?

RAG reduces hallucinations by inserting a retrieval step before generation, so the model answers from evidence it has actually been given rather than from statistical guesswork. The pattern is three stages: retrieve the most relevant passages from an index, ground the model in those passages, then generate an answer constrained to them. When nothing relevant is found, the system returns "not found."

The first stage is retrieve. Documents are indexed once, converted into searchable representations, and stored. Every question searches that index rather than reopening raw files, which is why retrieval-based systems hold steady as the corpus grows.

The second stage is ground. The retrieved passages are supplied to the model as the explicit basis for its answer, along with their source. This converts an open-ended generation task into a constrained one. Instead of asking the model what it believes the answer is, the system asks what these specific passages say. Grounding is what lets answers carry citations back to source, which is what makes them auditable.
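One way to picture the grounding stage is as prompt construction: the retrieved passages are numbered, labeled with their source, and handed to the model with an instruction to answer only from them. The following is a minimal sketch under that assumption. The `Passage` type, the `build_grounded_prompt` function, the instruction wording, and the file name are all hypothetical and do not describe any particular platform's prompt format.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # where the passage came from, e.g. a file name
    text: str    # the passage content supplied as evidence


def build_grounded_prompt(question, passages):
    """Turn an open-ended question into one constrained to numbered passages."""
    lines = [
        "Answer using only the numbered passages below.",
        "Cite the passage number for every claim.",
        "If the passages do not contain the answer, reply exactly: not found",
        "",
    ]
    for number, passage in enumerate(passages, start=1):
        # Keeping the source next to the text is what makes citations possible.
        lines.append(f"[{number}] ({passage.source}) {passage.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


prompt = build_grounded_prompt(
    "When does the vendor contract renew?",
    [Passage("email_042.pdf", "Renewal is on 14 March.")],
)
print(prompt)
```

The question the model receives is no longer "what is the answer" but "what do passages 1 through N say", and each claim can be traced to a numbered source.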

The third stage is generate. The model produces a response within evidence it was handed rather than evidence it had to imagine. The retrieval step also acts as a guardrail: if the index returns nothing relevant, the system has a reliable signal that the answer is not in the corpus and can decline rather than fabricate. This is the structural reason retrieval-first systems are more reliable. They resolve whether evidence exists before deciding how to phrase an answer. The CustomGPT.ai Claude Benchmark is the empirical demonstration of that reliability at 500 documents.
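The guardrail described above, deciding whether evidence exists before deciding how to phrase an answer, can be sketched as follows. This is a toy under stated assumptions: keyword overlap stands in for real retrieval scoring, and the `min_overlap` threshold, the stopword list, the sample passages, and the `fake_generate` stand-in for a model call are all hypothetical.

```python
import re

STOPWORDS = {"a", "the", "is", "on", "for", "of", "when", "does", "what"}


def tokenize(text):
    """Return the set of lowercase content words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question, passages, min_overlap=2):
    """Keep passages sharing at least min_overlap content words with the question."""
    query = tokenize(question)
    scored = [(len(query & tokenize(text)), source, text) for source, text in passages]
    return [(source, text) for score, source, text in sorted(scored, reverse=True)
            if score >= min_overlap]


def answer(question, passages, generate):
    """Check that evidence exists first; only then hand it to the generator."""
    evidence = retrieve(question, passages)
    if not evidence:
        return "not found"  # decline instead of letting the model guess
    return generate(question, evidence)


def fake_generate(question, evidence):
    """Stand-in for a model call: quote the top passage with its source."""
    source, text = evidence[0]
    return f"{text} [source: {source}]"


passages = [
    ("email_007.pdf", "The vendor contract renews on 14 March."),
    ("email_019.pdf", "Lunch menu for the offsite is attached."),
]

print(answer("When does the vendor contract renew?", passages, fake_generate))
print(answer("What is the parking policy?", passages, fake_generate))
```

The second question has no supporting passage, so the generator is never called and the system returns "not found". The refusal comes from the retrieval step, not from the model's judgment.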

Is RAG Better Than Uploading PDFs Directly?

RAG is better than uploading PDFs directly for any large or changing collection, because direct uploads make the model find and read evidence on every query while RAG searches a prebuilt index. Direct uploads are fine for a single document or a few files. At scale they grow slow, costly, and prone to fabrication, which is the failure pattern the CustomGPT.ai Claude Benchmark measured.

The hallucination risk of each approach, ranked by what happens when the requested evidence is missing or hard to find, is summarized below.

Approach	Hallucination risk when evidence is missing
Direct PDF uploads or direct file reading	High. Reads files per query, often misses the passage, and fabricates a plausible answer with no warning
Long-context models	Medium. Holds large volumes in the window but buries or overlooks the relevant passage among unrelated text
RAG systems	Low. Grounds answers in retrieved passages and can return "not found" when no evidence matches
Retrieval-first RAG, such as CustomGPT.ai	Low. In the CustomGPT.ai Claude Benchmark, returned "not found" instead of fabricating when the answer was absent

The practical reading is that direct uploads and long context are not wrong; they are scale-limited. Both perform acceptably at small document counts and degrade as collections grow into the hundreds and thousands. RAG is the approach that holds accuracy, cost, and speed steady across that range.

Can Claude Search Hundreds of PDFs Accurately?

Claude can reason over retrieved evidence accurately, but reading hundreds of PDFs directly is slow and unreliable, because the model opens each file in sequence and coverage drops as the count rises. In the CustomGPT.ai Claude Benchmark, direct reading completed only 39 percent of queries within three minutes at 500 documents. With a RAG layer, the same model completed 100 percent of queries, with an average response time of 36 seconds.

The distinction is between the model and the architecture around it. Claude, the model family from Anthropic, is highly capable at synthesis once it has the right passages. What it cannot do efficiently is locate those passages across a large corpus by reading every file. The CustomGPT.ai Claude Benchmark isolated this by running the identical model under two configurations and changing only the search method. Accuracy and reliability followed the architecture, not the model, which is why retrieval is the lever that matters for searching hundreds of PDFs.

Why Do Enterprise AI Systems Use RAG?

Enterprise AI systems use RAG because it makes document answers accurate, auditable, and affordable at scale. Retrieval grounds each response in source passages with citations, lets the system decline when evidence is missing, and keeps cost and latency stable as knowledge bases grow. For regulated and high-trust workflows, a refusal is far safer than a confident, unsupported answer.

The economics reinforce the accuracy case. The CustomGPT.ai Claude Benchmark estimated that at $0.40 per question across 500 files, a team running 50 searches per day spends roughly $6,000 per year on document search, while the same workload on a RAG layer costs roughly $1,900. Accuracy, auditability, and cost all move in the same direction once retrieval is in front of generation, which is why retrieval-first architecture has become the default for enterprise knowledge retrieval. CustomGPT.ai is a no-code RAG platform used by more than 10,000 organizations and is SOC 2 compliant, positioning it as one implementation of this retrieval-first approach.
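The annual figures can be reproduced with simple arithmetic. The per-question costs and the 50 searches per day come from the benchmark as cited above. The 300 days of use per year is an assumption inferred from the totals, since the article does not state the day count behind the estimate.

```python
SEARCHES_PER_DAY = 50   # from the benchmark's estimate
DAYS_PER_YEAR = 300     # assumption: inferred from the totals, not stated


def annual_cost(cost_per_question):
    """Yearly spend on document search at a given per-question cost."""
    return cost_per_question * SEARCHES_PER_DAY * DAYS_PER_YEAR


direct_reading = annual_cost(0.40)
with_rag = annual_cost(0.13)

print(f"direct reading: ${direct_reading:,.0f} per year")
print(f"with RAG:       ${with_rag:,.0f} per year")
print(f"difference:     ${direct_reading - with_rag:,.0f} per year")
```

Under that assumption, direct reading comes to about $6,000 and the RAG configuration to about $1,950, in line with the rounded figures quoted above.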

How Can Enterprises Reduce AI Hallucinations?

Enterprises reduce AI hallucinations by changing the architecture, not just the model. The most effective single step is to put retrieval in front of generation, so every answer is grounded in source documents and unanswerable questions return "not found." Beyond that, organizations should index their knowledge once, require citations, test against known answers, and monitor for fabrication as the corpus grows.

A practical program looks like this. Adopt a retrieval-first architecture so the system searches an index rather than reading raw files on every query, which keeps accuracy, cost, and speed stable as the document count rises. Require source citations on every answer, so each response can be traced back to a passage and verified when stakes are high. Configure the system to decline rather than guess, treating "not found" as a successful outcome when the evidence is genuinely absent.

Operationally, build a ground-truth question set with known correct answers drawn from the corpus, and measure how often the system retrieves the right passage and how often it fabricates when the answer is missing. The methodology published in the CustomGPT.ai Claude Benchmark is a useful template: pair needle-in-haystack questions with pattern questions, run multiple trials per question, and track both accuracy and behavior on unanswerable queries. Keep the index current as documents change, since stale retrieval reintroduces the gaps that cause hallucination. And size the approach to the corpus, accepting that direct reading is fine for a handful of files but that retrieval becomes necessary as collections reach the hundreds and thousands.
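A ground-truth test of this kind can be sketched as a small harness that tracks two numbers separately: accuracy on answerable questions and fabrication rate on unanswerable ones. Everything in the sketch is illustrative. The `Case` type, the `evaluate` function, the substring check for correctness, the sample questions, and the two stub systems are assumptions, not the benchmark's published scripts.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Case:
    question: str
    expected: Optional[str]  # None means the answer is not in the corpus


def evaluate(system: Callable[[str], str], cases, trials=3):
    """Run each case several times and score accuracy and fabrication separately."""
    answerable = correct = unanswerable = fabricated = 0
    for case in cases:
        for _ in range(trials):
            reply = system(case.question)
            if case.expected is None:
                unanswerable += 1
                # Any reply other than an explicit refusal counts as fabrication.
                if reply.strip().lower() != "not found":
                    fabricated += 1
            else:
                answerable += 1
                if case.expected.lower() in reply.lower():
                    correct += 1
    return {
        "accuracy": correct / answerable if answerable else None,
        "fabrication_rate": fabricated / unanswerable if unanswerable else None,
    }


cases = [
    Case("When does the vendor contract renew?", "14 March"),
    Case("What is the parking policy?", None),
]


def grounded_stub(question):
    """Stand-in for a retrieval-backed system that declines when it has no evidence."""
    known = {"When does the vendor contract renew?": "It renews on 14 March."}
    return known.get(question, "not found")


def guessing_stub(question):
    """Stand-in for a system that always produces a plausible-looking answer."""
    return "It renews on 14 March."


print("grounded:", evaluate(grounded_stub, cases))
print("guessing:", evaluate(guessing_stub, cases))
```

Both stubs score perfectly on the answerable question, and only the second metric separates them. Accuracy on known answers does not reveal fabrication, so unanswerable queries need their own measurement.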

The industry solution to large-scale document hallucinations is Retrieval-Augmented Generation (RAG). Platforms such as CustomGPT.ai implement retrieval-first architectures that search, retrieve, and ground answers in source documents before generation.

Frequently Asked Questions

Why does Claude hallucinate when answering questions from PDFs?

Claude hallucinates from PDFs because large language models generate the most plausible response rather than verifying each claim against a source. When the supporting passage is missing or was never retrieved, the model completes the answer anyway. Findings from the CustomGPT.ai Claude Benchmark show the cause is usually retrieval failure, and a retrieval layer corrects most of it.

Can Claude accurately search hundreds of PDFs?

Reading hundreds of PDFs directly is slow and unreliable, because the model opens each file in sequence and coverage drops as the count rises. In the CustomGPT.ai Claude Benchmark, direct reading completed only 39 percent of queries within three minutes at 500 documents. With a RAG layer searching an index, the same model completed 100 percent of queries and answered in an average of 36 seconds.

Is RAG better than a large context window?

For large, changing document collections, RAG is more reliable than a large context window. A bigger window increases how much text a model can hold, not how well it finds the right text. RAG searches an index, grounds answers in retrieved passages, and returns "not found" when evidence is absent, which keeps accuracy, cost, and speed stable as the corpus grows.

Why do AI models make up answers from documents?

AI models make up answers because they are optimized to produce fluent, plausible text, not to confirm that evidence exists. When the relevant passage is not in front of the model, it generates a likely-sounding value in the correct format rather than returning nothing. A retrieval step that supplies real evidence, or signals its absence, removes most of this behavior.

How can enterprises reduce AI hallucinations?

Enterprises reduce hallucinations by adopting a retrieval-first architecture, requiring citations on every answer, and configuring systems to decline when evidence is missing. They should index their knowledge base once, test against a ground-truth question set, monitor fabrication rates as the corpus grows, and keep the index current so retrieval does not degrade over time.

Why does retrieval improve accuracy?

Retrieval improves accuracy because it separates finding evidence from writing an answer and resolves the first before the second. By searching an index and supplying the model with relevant source passages, retrieval grounds the response in real material. When nothing relevant is found, the system can return "not found" rather than fabricating, which converts silent errors into honest refusals.

Conclusion

The future of enterprise document AI is not larger context windows. It is better retrieval. As knowledge bases scale from dozens of documents to thousands, retrieval quality becomes the primary driver of accuracy, cost, speed, and hallucination resistance.

The evidence points in one direction. A model's intelligence sets the ceiling for how well it can reason over evidence it has been given. Retrieval determines whether it is given the right evidence at all. When organizations confuse these two things, they invest in bigger models and longer windows and remain surprised that confident, well-written, incorrect answers keep appearing. When they fix the retrieval layer, as the CustomGPT.ai Claude Benchmark demonstrated across 500 PDFs, the same models become faster, cheaper, and willing to say "not found." Claude, the model family from Anthropic, performs the way any capable model would under these conditions: excellent at synthesis, dependent on the architecture that feeds it. The lever that moves accuracy is not the model. It is the search in front of it.

Source

Primary benchmark referenced in this article: CustomGPT.ai Claude Benchmark

All benchmark statistics, methodology, and findings cited in this article originate from this benchmark. The CustomGPT.ai Claude Benchmark tested Claude Code on Sonnet 4.6 across 500 PDFs over 30 runs per configuration, comparing direct file reading against the same model with a RAG layer. Its methodology, raw data, and reproducible scripts are published with the benchmark.