Iron Goo
Methodology

How we build chatbots that don't hallucinate

A chatbot's quality is decided in knowledge prep, not prompt engineering. Knowledge structure, retrieval discipline, fall-through safety, and cost ceilings are the four things that separate a useful bot from a confident liar.

Most chatbots hallucinate because the people who built them spent their time on the wrong layer. They tuned the prompt, they picked a frontier model, they wrote elaborate guardrail instructions. They gave almost no thought to the knowledge layer underneath. The result is a bot drawing from messy, duplicated, half-stale source material and confidently inventing whatever is missing. We have rebuilt enough of these to know the pattern.

A chatbot is a retrieval system with a model on top. The model is mostly fixed (you pick one of three or four serious choices). The retrieval is everything. If the right information comes back from the knowledge layer when the customer asks a question, the model will speak truthfully. If the wrong information comes back, or none, the model will speak smoothly, and what it says will sometimes be wrong. That is the failure mode every bad chatbot shares. The fix is not prompt engineering. The fix is knowledge prep, retrieval discipline, and a fall-through that catches the cases retrieval cannot answer.

The knowledge-prep step

We start by getting hold of every document the bot is meant to speak from. Help center articles, sales pages, product documentation, internal FAQs, training material, return policies, warranty terms. We make a list. Almost always the list contains duplicate or contradictory documents (the help article says seven days, the policy page says fourteen days, the rep on the phone says ten). We surface the contradictions and ask which one is right. This conversation alone has stopped two or three chatbot projects from shipping a confidently wrong answer in production.

Once we have a clean source list, we chunk. The default recommendation in most retrieval tutorials is to chunk by fixed token count (say, every five hundred tokens, with a hundred tokens of overlap). We do not do this. Fixed-token chunking cuts paragraphs in half, separates a question from its answer, and forces the retrieval system to stitch chunks together at query time. We chunk by semantic boundary instead: by heading, by FAQ entry, by paragraph break, depending on what the document looks like. A help article becomes one chunk per answer. A policy page becomes one chunk per clause. A long article becomes one chunk per H2 section. Each chunk is self-contained enough that the model can quote from it without needing context from a neighbour.
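A minimal sketch of heading-based chunking, assuming the source document is plain text in which each section starts with a "## " heading line. The function name chunk_by_heading, the heading prefix, and the sample document are all illustrative assumptions; real documents usually need a splitter per document type.

```python
def chunk_by_heading(text: str, heading_prefix: str = "## ") -> list[dict]:
    """Split a document into one chunk per heading-delimited section."""
    chunks = []
    title, body = "Introduction", []
    for line in text.splitlines():
        if line.startswith(heading_prefix):
            # Close the previous section before starting a new one.
            if any(part.strip() for part in body):
                chunks.append({"title": title, "text": "\n".join(body).strip()})
            title, body = line[len(heading_prefix):].strip(), []
        else:
            body.append(line)
    # Flush the final section.
    if any(part.strip() for part in body):
        chunks.append({"title": title, "text": "\n".join(body).strip()})
    return chunks


# Made-up sample document for illustration only.
doc = """## Returns
You can return items within fourteen days.

## Warranty
Hardware is covered for one year."""

sections = chunk_by_heading(doc)
```

Each resulting chunk carries its own heading, so it can be quoted without pulling in a neighbouring section.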

We then write metadata for each chunk: source URL, last-updated date, document type, audience (customer-facing vs. internal), and any tags relevant to your business. The metadata is what lets us filter at retrieval time. If a customer asks about returns, we want the bot to draw from the customer-facing returns documents, not from an internal training memo about how to handle a return on the phone. Without metadata, both look the same to a vector search.
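One way to picture the metadata filter is as a plain pre-filter that runs before any vector search. The sketch below is an assumption about shape, not a description of a particular vector store: the Chunk class, the eligible_chunks helper, the field names, and the example.com URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    text: str
    source_url: str
    last_updated: date
    doc_type: str
    audience: str  # "customer" or "internal"
    tags: set[str] = field(default_factory=set)


def eligible_chunks(chunks, audience, required_tag=None):
    """Narrow the candidate set by metadata before similarity search runs."""
    return [
        c for c in chunks
        if c.audience == audience
        and (required_tag is None or required_tag in c.tags)
    ]


# Made-up entries: one customer-facing policy, one internal training memo.
knowledge_base = [
    Chunk("Customers may return items within the stated window.",
          "https://example.com/returns", date(2024, 1, 15),
          "policy", "customer", {"returns"}),
    Chunk("When handling a return by phone, verify the order number first.",
          "https://example.com/internal/returns-training", date(2023, 11, 2),
          "training", "internal", {"returns"}),
]

candidates = eligible_chunks(knowledge_base, audience="customer", required_tag="returns")
```

Both entries are about returns, and only the audience field keeps the training memo out of a customer conversation.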

Last, we embed. We pick an embedding model appropriate to the content (we default to a strong general-purpose model unless there is a reason to specialize), we run the embedding job, and we store the vectors alongside the chunks. The output is a knowledge base the bot can search with sub-second latency.

The retrieval discipline

When a customer asks a question, the bot does not go straight to the model. It searches the knowledge base first, returns the top three to five most relevant chunks, and only then assembles a prompt that includes those chunks and asks the model to answer using them. The model is instructed (and we test that it follows the instruction) to answer only from the provided chunks. If the chunks do not contain the answer, the model is instructed to say so.
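The prompt assembly step might look something like the sketch below. The wording of the instruction, the build_prompt name, and the dictionary keys are assumptions for illustration; the exact phrasing that a given model follows reliably has to be found by testing.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from the retrieved chunks."""
    sources = "\n\n".join(
        f"[{i}] ({c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the source number you used. "
        "If the sources do not contain the answer, reply exactly: I don't know.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Made-up retrieved chunk for illustration.
prompt = build_prompt(
    "How long do I have to return an item?",
    [{"source": "returns-policy",
      "text": "Items may be returned within the stated window."}],
)
```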

The discipline lives in the details. We set a similarity threshold below which chunks do not count as a match (a half-relevant chunk is worse than no chunk, because the model will pretend it answered the question). We tune the number of chunks returned (too few and the bot misses context, too many and the model gets confused). We use re-ranking when the use case justifies the cost (a fast first pass to retrieve twenty chunks, then a slower second pass to re-rank to the best five). We log every retrieval, so we can see when the bot retrieved the wrong chunks and why.
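The threshold and the chunk count can be sketched in a few lines. This is a toy version under stated assumptions: vectors are plain Python lists, the index is an in-memory dictionary, and the 0.75 cut-off is a placeholder rather than a recommended value. A production system would normally delegate the search to a vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, top_k=5, min_score=0.75):
    """Return up to top_k (score, chunk_id) pairs scoring at or above min_score."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    matches = [pair for pair in scored if pair[0] >= min_score]
    matches.sort(reverse=True)  # best match first
    return matches[:top_k]


# Made-up three-dimensional vectors standing in for real embeddings.
index = {
    "returns-policy": [0.9, 0.1, 0.0],
    "warranty-terms": [0.1, 0.9, 0.0],
    "shipping-faq": [0.6, 0.6, 0.5],
}
hits = retrieve([1.0, 0.0, 0.0], index)
```

An empty result is a valid outcome here: when nothing clears the threshold, the caller gets no chunks and should fall through rather than answer.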

Above all, we enforce the rule: no answer without a source. Every reply the bot produces names the chunk (and therefore the document) it drew from. This is partly so customers can verify the answer, and partly so we can audit the bot ourselves. Anonymous answers are how hallucinations hide.
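The rule can be checked mechanically before a reply is sent. The sketch below assumes the model has been asked to cite sources as bracketed numbers such as [1]; that citation format and the helper names are illustrative assumptions, not a fixed convention.

```python
import re
from typing import Optional


def cited_sources(reply: str, num_sources: int) -> list[int]:
    """Return the valid source numbers cited as [n] in the reply."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", reply)}
    return sorted(n for n in cited if 1 <= n <= num_sources)


def enforce_citation(reply: str, num_sources: int) -> Optional[str]:
    """Pass the reply through only if it cites at least one provided source."""
    if cited_sources(reply, num_sources):
        return reply
    return None  # the caller should fall through instead of sending this reply
```

A reply that cites nothing, or cites a source number that was never provided, is treated the same way as no answer at all.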

The fall-through safety net

Some questions the knowledge base cannot answer. A new customer asks about a product feature you have not written about yet. An edge case crosses two policies that conflict. A question is ambiguous and the bot is not sure which document to draw from. For each of these, the bot needs a fall-through that does not involve making something up.

Our default fall-through has three rungs. Rung one: the bot tells the customer it does not have a confident answer to that question and offers to connect them to a human. Rung two: it captures the question (and the conversation context) into a queue your team can review. Rung three: it learns. Once a week, we review the captured questions, decide which ones are recurring enough to deserve a knowledge update, and write or update a chunk so the bot can answer the next person who asks. The fall-through is not a safety net in the static sense. It is a feedback loop that compounds.
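The three rungs could be sketched as a small queue object. Everything named here (FallThroughQueue, HANDOFF_MESSAGE, the exact-match grouping) is a hypothetical illustration; in particular, deciding which questions are recurring is a human review in practice, and the lowercase match below is only a crude first pass.

```python
from collections import Counter
from dataclasses import dataclass, field

HANDOFF_MESSAGE = (
    "I don't have a confident answer to that. "
    "Would you like me to connect you to a person on our team?"
)


@dataclass
class FallThroughQueue:
    items: list[dict] = field(default_factory=list)

    def capture(self, question: str, context: list[str]) -> str:
        """Rung two: record the question, then return the rung-one message."""
        self.items.append({"question": question, "context": list(context)})
        return HANDOFF_MESSAGE

    def recurring(self, min_count: int = 2) -> list[str]:
        """Rung three input: questions asked often enough to merit a new chunk."""
        counts = Counter(item["question"].strip().lower() for item in self.items)
        return [q for q, n in counts.items() if n >= min_count]


# Made-up questions for illustration.
queue = FallThroughQueue()
queue.capture("Do you ship to Norway?", [])
queue.capture("do you ship to Norway? ", ["Hi, quick question."])
queue.capture("Is there a student discount?", [])
```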

For high-stakes use cases (financial advice, medical questions, legal interpretation), we do not just fall through. We refuse. The bot is configured to recognize the topic and decline to answer at all, with an explicit handoff to a qualified human. A bot that says "I cannot help with that; please speak to one of our specialists" is not a worse bot. It is the only kind of bot that does not eventually generate a complaint.
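A refusal gate runs before retrieval, so a restricted question never reaches the model. The sketch below uses keyword matching purely to show where the gate sits; the keyword lists and function names are hypothetical, and a real deployment would more likely use a trained topic classifier.

```python
import re

REFUSAL_MESSAGE = "I cannot help with that. Please speak to one of our specialists."

# Hypothetical keyword lists, for illustration only.
RESTRICTED_TOPICS = {
    "medical": {"diagnosis", "dosage", "symptom", "symptoms"},
    "legal": {"lawsuit", "liability", "sue"},
    "financial": {"invest", "investment", "mortgage"},
}


def restricted_topic(question: str):
    """Return the restricted topic a question touches, or None."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for topic, keywords in RESTRICTED_TOPICS.items():
        if words & keywords:
            return topic
    return None


def gate(question: str):
    """Return the refusal message for restricted topics, otherwise None."""
    return REFUSAL_MESSAGE if restricted_topic(question) else None
```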

Cost and monitoring

Chatbots can run away on cost. A poorly bounded bot that retrieves twenty chunks, sends them all to a frontier model on every turn, and supports long conversations will produce a surprise five-figure bill in a busy month. We bound this in three ways.

First, we set per-conversation token ceilings. Every conversation has a maximum total token budget. Once it is reached, the bot offers to escalate to a human and ends. The number is tunable per use case (a support bot might cap at twenty thousand tokens; a sales-qualification bot might cap at five thousand). The point is that there is a cap.
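A per-conversation ceiling needs little more than a running counter. The sketch below assumes the model provider reports prompt and completion token counts for each turn; the ConversationBudget class is a hypothetical name, and the turn sizes are made up.

```python
from dataclasses import dataclass


@dataclass
class ConversationBudget:
    max_tokens: int
    used_tokens: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Add one turn's usage to the running total."""
        self.used_tokens += prompt_tokens + completion_tokens

    @property
    def exhausted(self) -> bool:
        """True once the conversation has reached its ceiling."""
        return self.used_tokens >= self.max_tokens


# A support bot capped at twenty thousand tokens, with made-up turn sizes.
budget = ConversationBudget(max_tokens=20_000)
budget.record(prompt_tokens=12_000, completion_tokens=500)
still_open = not budget.exhausted
budget.record(prompt_tokens=7_400, completion_tokens=300)
```

Once exhausted is true, the bot offers the human handoff instead of making another model call.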

Second, we use a cheaper, faster model where it is sufficient. Most customer questions do not need a frontier model. Smaller models with good retrieval will answer eighty percent of questions correctly at a fraction of the cost. We route high-difficulty questions (long context, ambiguous intent, or where the first model returned a low-confidence answer) up to the larger model. Most questions never need to escalate.
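The routing decision can be expressed as a single function over the three escalation signals. This is a sketch under assumptions: the thresholds are placeholders, "small" and "large" stand in for whichever models are actually in use, and how ambiguity and confidence get measured is left open.

```python
from typing import Optional


def choose_model(
    context_tokens: int,
    ambiguous_intent: bool,
    first_pass_confidence: Optional[float] = None,
    long_context_tokens: int = 4_000,   # placeholder threshold
    min_confidence: float = 0.7,        # placeholder threshold
) -> str:
    """Pick the model tier for a question; default to the cheaper one."""
    if context_tokens > long_context_tokens:
        return "large"
    if ambiguous_intent:
        return "large"
    if first_pass_confidence is not None and first_pass_confidence < min_confidence:
        return "large"
    return "small"
```

The small model is the default and each escalation has a named reason, which makes the routing rate easy to audit later.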

Third, we monitor. Every conversation is logged. We watch for cost outliers, for retrieval failures (when the bot pulled wrong or no chunks), for fall-through rate (how often the bot could not answer), and for satisfaction signals (when the customer escalated to a human, was it because the bot got it wrong, or because the customer wanted a human anyway?). The monitoring dashboard is a real artifact we hand off, not a nice-to-have. A bot you cannot see is a bot you cannot trust.
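The numbers behind such a dashboard can be computed directly from the conversation logs. The sketch below assumes one log record per conversation with the hypothetical fields shown; the outlier rule (more than five times the median cost) is an arbitrary placeholder, and the log entries are made up.

```python
from statistics import median


def summarize(logs, outlier_factor=5.0):
    """Compute fall-through rate, empty-retrieval rate, and cost outliers."""
    if not logs:
        return {"fall_through_rate": 0.0, "empty_retrieval_rate": 0.0, "cost_outliers": []}
    total = len(logs)
    typical_cost = median(r["cost_usd"] for r in logs)
    return {
        "fall_through_rate": sum(r["fell_through"] for r in logs) / total,
        "empty_retrieval_rate": sum(r["chunks_retrieved"] == 0 for r in logs) / total,
        "cost_outliers": [
            r["conversation_id"] for r in logs
            if r["cost_usd"] > outlier_factor * typical_cost
        ],
    }


# Made-up log records for illustration.
logs = [
    {"conversation_id": "c1", "cost_usd": 0.02, "chunks_retrieved": 4, "fell_through": False},
    {"conversation_id": "c2", "cost_usd": 0.03, "chunks_retrieved": 0, "fell_through": True},
    {"conversation_id": "c3", "cost_usd": 0.02, "chunks_retrieved": 3, "fell_through": False},
    {"conversation_id": "c4", "cost_usd": 0.50, "chunks_retrieved": 5, "fell_through": False},
]
report = summarize(logs)
```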

What we will not do

We will not ship a chatbot that pretends to know things it does not. We will not ship a chatbot for a use case where the cost of a wrong answer is high and the verification is hard (legal, medical, financial advice without a human checkpoint). We will not ship a chatbot whose cost we cannot bound. And we will not promise that a chatbot will replace your team. A chatbot is a way to handle the questions that have the same answer every time. The questions that need judgment still need a human, and our job is to make sure the bot knows where the line is.

Where this fits

Building a trustworthy chatbot is one shape of automation work we do. The other shapes (internal agents, custom workflows, human-in-the-loop tooling) follow the same principle: prepare the knowledge layer, constrain the model to it, and monitor what comes out. Read the automation service page for the customer-side framing, or get in touch and tell us what you have already tried. Most of the chatbot projects we ship start as a rebuild of something a previous vendor or in-house attempt got partway through.

Ready to move?

Send us a note about where your business is today. You'll get back a written assessment within two business days.
