The UX of Adding Your Own AI Feature

Atamyrat Hangeldiyev

Systems Architect

March 14, 2026

Three days after a regional appliance retailer put a shiny new support chatbot on its site, a customer asked it whether her four-year-old French-door refrigerator was still covered for a failed compressor, and the bot answered in one clean, certain sentence: yes, the sealed system on that model carries a ten-year warranty, so the repair would be covered at no charge. It did not say "I think." It did not say "let me get someone." It named the part, named the term, and sounded exactly like a person at the company who knew. She scheduled the repair, declined the third-party quote she had been weighing, and waited. The compressor warranty on her unit had been five years, not ten, and it had lapsed eleven months earlier. She found out when the technician handed her an invoice for six hundred and forty dollars, and the owner found out the next morning, near the top of a refund and complaint queue, looking at a one-star review that quoted the bot's sentence back to him word for word.

For a small business adding AI to its own product or site, the UX of your own AI feature is the set of design decisions that govern how a chatbot, assistant, or AI search you ship behaves toward the person using it, so that it resolves their task honestly instead of spending the trust your business spent years earning. It is not the model. It is not the vendor's demo. It is every choice about what the thing says before it speaks, how it signals what it does not know, where its answers come from, and what happens at the moment it is wrong, and those choices are yours whether you make them on purpose or let a vendor make them for you by default.

This guide is about an AI surface you ship and are responsible for. It is not about the separate case where someone else's AI agent uses your ordinary site to act for a customer; that is its own problem with its own answer, and the boundary between the two is drawn carefully later, because conflating them is the most expensive mistake an owner can make here.

What the UX of your own AI feature actually is

An AI feature is a surface like any other surface you ship. A customer reading your chatbot is in the same trust relationship with you as a customer reading your pricing page or talking to someone on your phone line. The difference is that the chatbot generates its words on the spot, in a confident register, with no human deciding whether each sentence is true before the customer reads it. That is the whole problem in one sentence, and every design decision in this guide exists to manage it.

Your customer reads the bot as the business speaking, not as a tool

When the appliance retailer's customer read "yes, that is covered," she did not parse it as a probabilistic output from a language model. She read it as the company telling her she was covered. That is not naivety. It is the correct reading. You put the thing on your site, in your colors, with your name on it. Whatever it says, you said. A customer does not owe you the sophistication to discount your own chatbot's confidence, and the ones who would never think to are exactly the ones who get hurt and tell people.

This is why the UX of an AI feature is a trust problem before it is a technology problem. The technology question is "can the model produce a fluent answer," and the answer to that has been yes for a while. The trust question is "when it produces a fluent answer that is wrong, what does that cost the relationship," and that question is answered entirely by design decisions you control, not by the model you picked. A more capable model lowers how often it is wrong. It does nothing about what happens when it is, and it is the what-happens-when-it-is that burns trust.

It also means the feature is yours to get right, not the vendor's. A vendor will tell you their bot is accurate, and they will mean it, and it will still be wrong sometimes because every generative system is wrong sometimes. The behaviors that decide whether being wrong is a survivable event or a refund-queue event are UX behaviors, and a vendor who only talks about accuracy and never about what the bot does when it does not know is selling you the demo, not the production system.

The same question, the bot that overpromised and the bot built to be honest

Here is the appliance retailer's exact situation, run twice, same customer, same question, two different design philosophies behind the same friendly chat box.

The bot that overpromised

Customer: "Is the compressor on my model still under warranty?" Bot: "Yes, the sealed system on that model is covered by a ten-year warranty, so a compressor repair would be at no charge to you." No source. No date checked against her purchase. No "let me confirm with someone." It guessed a plausible warranty term, stated it as fact, and offered no way to verify and no human. She acted on it. The cost landed the next morning: a refund, a furious review, and a customer who will tell everyone she knows that the company's own assistant lied to her.

The bot built to be honest

Customer: "Is the compressor on my model still under warranty?" Bot: "Warranty length depends on your model and purchase date, and I should not guess at that. Here is the warranty policy page for sealed-system parts, and I can connect you to our service team, who can check your specific coverage by serial number. Want me to do that?" It set the expectation that it would not invent the answer, pointed at a real source she could read, and offered a human who could actually resolve it. Worst case, she waits an hour for a definitive answer. Nobody pays six hundred and forty dollars for a sentence the bot made up.

Both bots used a capable model. Both produced fluent English. The first one cost the business a customer and a public accusation; the second cost it nothing and probably earned trust, because a customer who sees a system refuse to guess about her money believes it more the next time it does answer. The difference is not intelligence. It is design.

A confident wrong answer costs more than an honest "I do not know"

The instinct when adding an AI feature is to make it impressive: answer everything, answer instantly, never stall, never punt. That instinct is exactly backward, and it is worth understanding precisely why, because the reasoning is what lets you push back on a vendor who is optimizing for the demo.

The failure that does the real damage: specific, certain, wrong, no fallback

Not all wrong answers cost the same. A bot that says "I am not sure, let me get someone" and is occasionally too cautious costs you a little friction and some response time. A bot that says "your order ships Tuesday" with total confidence when it ships in three weeks costs you a customer who rearranged their week around your sentence. The damage scales with four things stacked together: how specific the answer is, how certain it sounds, whether it is actually wrong, and whether there was any fallback to a human. Stack all four and you get the worst outcome an AI feature can produce.

The appliance bot hit all four. It was specific (ten years, sealed system, no charge). It was certain (no hedge, no qualifier). It was wrong (the term was five years and lapsed). And there was no fallback (no source to check, no human offered). Remove any one of those and the damage shrinks. Make it vague and the customer does not act on it. Make it uncertain and the customer verifies. Make it correct and there is no harm. Offer a human and the customer gets a real answer before the technician arrives. The confident-specific-wrong-no-fallback combination is the one design has to make structurally impossible, and a feature that cannot do that is not ready for paying customers no matter how good its average answer is.

The reason this is the failure that matters: a customer cannot tell a confident correct answer from a confident wrong one. They look identical. The bot's certainty is not calibrated to its correctness, so the customer has no signal. They trust the tone, because the tone is the only thing they have, and the tone was wrong by default, because nobody decided what the bot should do when it does not know.

Why "I do not know, here is a human" is a feature, not a weakness

Owners hear "the bot will sometimes say it does not know" as a failure of the bot. It is the opposite. A bot that can say "I do not know, but here is the page that does, and here is a person who can confirm" is more valuable than one that always produces an answer, because the always-answers bot is producing wrong answers at an unknown rate and you have no way to find out until the refund queue tells you.

A clean "I do not know" is not the absence of an answer. It is a different answer with a different cost profile. It costs the customer a short wait or a click to a real source. It costs you nothing in trust, and it often builds trust, because the customer just watched the system decline to bluff about something that mattered. People extend more credibility to a source that admits its limits than to one that never does, and they are right to, because a source that never admits a limit is one whose wrong answers you simply have not caught yet. Designing the "I do not know" path well, where it goes, what it offers, how fast the human is, returns more trust per unit of effort than almost anything else you can do here, and it is the part vendors selling demo bots skip entirely because it does not demo well.

Who pays for the wrong answer, in order

The customer pays first. She paid six hundred and forty dollars and a wasted afternoon. That is the immediate cost and it is real money out of a real person's pocket because your surface told her something false.

The owner pays second. He paid the refund, the staff time on the complaint, and the cost of a one-star review that will sit in his search results quoting his own bot indefinitely. That review does not age out. It is now part of what a prospective customer reads when they look him up, and it says, in his bot's voice, that the business's assistant lies.

The trust pays last and longest. The customer who got burned does not just stop trusting the bot. She stops trusting the company, because, correctly, she does not separate them. She tells people the specific story, the warranty, the invoice, the review, and the story is sticky because it is concrete and it has a villain. That is the actual price, and no model upgrade buys it back.

The design decisions that keep a small business's AI feature honest

Here are the six decisions that separate a production AI feature from a demo. None of them is about the model. All of them are about behavior, and an owner can require every one of them in plain language without knowing what a model is.

1. Set the expectation before it speaks

Before a customer types anything, the feature should tell them what it is, what it can do, and what it cannot. Not a legal disclaimer nobody reads. One honest line in the opening state. "I am the support assistant. I can help with order status, returns, and product questions, and I will point you to a person for warranty claims and anything I am not sure about." That sentence does real work. It tells the customer this is an assistant, not a person. It sets a scope, so a question outside the scope gets a clean handoff instead of a confident guess. And it pre-commits the bot, in front of the customer, to punting on the things that hurt most when wrong.

Expectation-setting is the cheapest decision on this list and the one most often skipped, because a vendor demo opens with "Hi! How can I help you today?", which sets no expectation at all and quietly promises the bot can do anything. The appliance bot's customer had no idea she was talking to a system that would guess. One honest opening line would have changed how she read every word after it.
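One way an engineer might make that opening line hard to skip is to generate it from a declared scope, so the disclosure and the handoff topics are always present. This is a minimal sketch under assumptions: `AssistantScope`, `join_list`, and `opening_message` are hypothetical names for illustration, not part of any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssistantScope:
    """Hypothetical config: what the assistant handles and what it hands off."""
    handles: tuple[str, ...]
    hands_off: tuple[str, ...]


def join_list(items: tuple[str, ...]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def opening_message(scope: AssistantScope) -> str:
    """Build the first message: what this is, what it does, what it will not guess at."""
    if not scope.handles or not scope.hands_off:
        raise ValueError("scope must declare both handled and handed-off topics")
    return (
        "I am the support assistant, not a person. "
        f"I can help with {join_list(scope.handles)}, "
        f"and I will point you to a person for {join_list(scope.hands_off)} "
        "and anything I am not sure about."
    )


scope = AssistantScope(
    handles=("order status", "returns", "product questions"),
    hands_off=("warranty claims",),
)
print(opening_message(scope))
```

Raising an error on an empty scope is a deliberate choice in this sketch: a feature with no declared limits should fail at build time rather than greet customers with an open-ended promise.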

2. Show uncertainty instead of manufacturing confidence

A language model does not natively show doubt. Left alone it answers a question it half-knows in exactly the same confident register as one it fully knows, because fluent confidence is its default output. The design job is to make the feature express uncertainty when it is uncertain, and to make uncertainty visible to the customer rather than buried.

This is concrete, not philosophical. When the answer is not solidly grounded in a source the system has, the feature should change its behavior visibly: hedge honestly ("I am not certain about this one"), point to the source instead of asserting, and offer the human path. The customer should be able to tell a grounded answer from a guess by reading it, because the entire failure mode in this guide is a customer who could not tell. The technical work of detecting low confidence is the engineer's. The requirement, that uncertainty must be visible to the person and must change what the bot does, is yours, and you should state it as non-negotiable.
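The requirement that uncertainty must change what the bot does can be sketched as a small gate between the drafted answer and the customer. The sketch assumes the engineer's pipeline supplies a `grounding_score` between 0 and 1 and a source URL; both fields, the `Draft` type, and the 0.8 threshold are assumptions for illustration, and how the score is computed is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    """A candidate answer plus hypothetical signals supplied by the pipeline."""
    text: str
    source_url: Optional[str]   # page the answer was drawn from, if any
    grounding_score: float      # 0.0 to 1.0, how well the source supports the text


# Assumed cutoff for illustration; a real value would be tuned on logged conversations.
GROUNDED_THRESHOLD = 0.8


def present(draft: Draft) -> dict:
    """Decide what the customer sees: a cited answer, or a visible hedge plus handoff."""
    grounded = (draft.source_url is not None
                and draft.grounding_score >= GROUNDED_THRESHOLD)
    if grounded:
        return {
            "mode": "answer",
            "message": f"{draft.text} (Source: {draft.source_url})",
            "offer_human": True,   # the human path stays available either way
        }
    # Not well grounded: never assert the draft. Hedge, point at a source, offer a person.
    pointer = (f" Here is the closest page I found: {draft.source_url}."
               if draft.source_url else "")
    return {
        "mode": "hedge",
        "message": "I am not certain about this one, so I will not guess." + pointer
                   + " I can connect you with a person who can confirm.",
        "offer_human": True,
    }


guess = Draft("The sealed system carries a ten-year warranty.", None, 0.9)
print(present(guess)["message"])
```

Note that the ungrounded draft text never reaches the customer in this sketch, even when the score is high, because a confident answer with no source is exactly the case the section warns about.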

3. Ground every answer in a real source the user can see

The single most powerful decision on this list: the feature should answer from your actual content, your help center, your policy pages, your product docs, your real warranty terms, and it should show the customer which source it used. This is what "retrieval-grounded" means in plain English, and you do not need the term. You need the behavior: the bot retrieves the relevant passage from your real material, answers from it, and links it.

The appliance bot's failure was a grounding failure. It did not retrieve the actual warranty terms for that model and answer from them. It generated a plausible-sounding warranty from nothing. A grounded version would have pulled the real sealed-system policy, seen that coverage depends on model and purchase date, and answered with the policy in hand and a link to it, which would have led it straight to "I cannot confirm your specific coverage, here is the policy and here is a person." Grounding does two things at once: it makes answers far more likely to be right, and it gives the customer a source to check, so even a wrong retrieval is catchable by the person reading it. An ungrounded bot is a confident stranger making things up in your name. A grounded one is your documentation, read aloud, with a citation.

If you need to name an engine for this, one reference point is the Claude API, used in a retrieval setup so the assistant answers from your supplied source material and is instructed to refuse when the answer is not in it rather than fill the gap with a guess. Claude models suit the grounded, cite-the-source, refuse-cleanly behavior this whole section is about, and an engineer building and operating the feature can do that work in an agentic client like Claude Code. The model is one part: it helps make the refuse-cleanly behavior reliable. The other five decisions on this list are the rest, and a model alone, however good, does not give you them.
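The retrieve, answer from the passage, and link it behavior can be illustrated without any model at all. The following is a toy sketch, not a production design: it uses plain word overlap where a real system would use a search index or embeddings, and the `Passage` type, the `min_overlap` cutoff, and the example URLs and policy text are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Passage:
    url: str
    text: str


def tokens(s: str) -> set[str]:
    """Lowercased word set, ignoring words of three letters or fewer."""
    return {w for w in re.findall(r"[a-z0-9]+", s.lower()) if len(w) > 3}


def retrieve(question: str, corpus: list[Passage], min_overlap: int = 2) -> Optional[Passage]:
    """Return the passage sharing the most words with the question, or None if too few match."""
    q = tokens(question)
    best, best_score = None, 0
    for p in corpus:
        score = len(q & tokens(p.text))
        if score > best_score:
            best, best_score = p, score
    return best if best_score >= min_overlap else None


def grounded_reply(question: str, corpus: list[Passage]) -> dict:
    """Answer only by quoting a retrieved passage with its link; otherwise refuse and hand off."""
    passage = retrieve(question, corpus)
    if passage is None:
        return {"answered": False,
                "message": "I do not have a source for that, so I will not guess. "
                           "I can connect you with a person."}
    return {"answered": True,
            "message": f'According to our policy page: "{passage.text}"',
            "source": passage.url}


# Hypothetical content for illustration only.
corpus = [
    Passage("https://example.com/warranty/sealed-system",
            "Sealed-system warranty coverage depends on your model and purchase date."),
    Passage("https://example.com/returns",
            "Unopened items can be returned with the original receipt."),
]
print(grounded_reply("Is the compressor on my model still under warranty?", corpus))
print(grounded_reply("Do you price match competitors?", corpus))
```

The point of the sketch is the shape of the return value: an answer always carries a `source`, and the absence of a passage produces a refusal rather than generated text.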

4. Fail cleanly: hand off to a human with the conversation context preserved

Every AI feature will reach the edge of what it can answer. The decision that matters is what happens there. The wrong design dead-ends ("I cannot help with that") or, worse, guesses. The right design hands the customer to a human, and it hands the human the conversation, what the customer asked, what the bot tried, what it was unsure about, so the customer does not have to start over and the person picks up mid-stream.

The context-preservation part is the part that gets cut and the part that matters most. A handoff that drops the customer into a generic queue to re-explain everything is barely better than a dead end; the customer has been told "a human will help" and then made to do the work again, which reads as the company wasting their time twice. A handoff that arrives with "this customer asked about compressor warranty on this model, the bot was not sure and did not answer, here is the thread" lets a person resolve it in one message. Same human, completely different experience, and the only difference is whether the design preserved the context across the seam. Require this explicitly: not "it can escalate," but "it escalates with the full conversation attached so the customer never repeats themselves."
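What "escalates with the full conversation attached" might look like as data is sketched below. The `Turn` and `Handoff` types and their field names are assumptions for illustration; a real help desk will have its own ticket format, and the point is only that the transcript and the reason travel with the handoff.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Turn:
    speaker: str   # "customer" or "assistant"
    text: str


@dataclass
class Handoff:
    """Hypothetical ticket payload: what the human needs to pick up mid-stream."""
    customer_question: str
    unresolved_reason: str
    sources_shown: list[str] = field(default_factory=list)
    transcript: list[Turn] = field(default_factory=list)

    def summary(self) -> str:
        """One-line note for the top of the ticket."""
        return (f"Customer asked: {self.customer_question!r}. "
                f"Assistant did not answer because: {self.unresolved_reason}. "
                f"{len(self.transcript)} turns attached.")


def build_handoff(transcript: list[Turn], reason: str, sources: list[str]) -> Handoff:
    """Package the full conversation; refuse to hand off one with no customer message."""
    customer_turns = [t for t in transcript if t.speaker == "customer"]
    if not customer_turns:
        raise ValueError("nothing to hand off: no customer message in transcript")
    return Handoff(customer_turns[-1].text, reason, list(sources), list(transcript))


transcript = [
    Turn("customer", "Is the compressor on my model still under warranty?"),
    Turn("assistant", "Warranty length depends on your model and purchase date, "
                      "and I should not guess at that."),
]
ticket = build_handoff(
    transcript,
    reason="coverage depends on model and purchase date",
    sources=["https://example.com/warranty/sealed-system"],  # hypothetical URL
)
print(ticket.summary())
print(json.dumps(asdict(ticket), indent=2))
```

A payload like this is also something an owner can ask to see during a demo: if the vendor cannot show what the human receives, the context is probably not being preserved.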

5. Never let it pretend to be a person

The feature must never claim, imply, or perform being human. No invented agent name presented as a real employee, no "let me check with my manager" theater, no dodging "are you a bot?" The first time a customer realizes the "person" they have been confiding in is a system that misled them, you have not lost an interaction, you have lost the relationship, because now every prior friendly exchange reads in hindsight as a deception you designed.

This is cheap to get right and expensive to get wrong. Honesty about being a bot costs nothing; customers deal with assistants all day and do not mind one that is straight about what it is. The pretense costs everything the moment it breaks, and it always breaks. A customer who knows it is a bot from the first line and gets a clean handoff feels respected. A customer who thought they were talking to "Sarah from support" for ten minutes and then discovers Sarah is a script feels played, and they are not wrong. Make the feature own what it is, plainly, every time it is asked and ideally before it is asked.
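One way to keep the answer to "are you a bot?" from ever being improvised is to handle it with a fixed rule before any generated reply is considered. This is a sketch under assumptions: the patterns are illustrative and far from exhaustive, and `identity_reply` is a hypothetical helper, not a feature of any particular platform.

```python
import re
from typing import Optional

# Illustrative patterns only; real traffic will need a broader, tested set.
IDENTITY_QUESTION = re.compile(
    r"\b(are|r)\s+(you|u)\s+(a\s+)?(bot|robot|human|person|real|ai)\b"
    r"|\bam\s+i\s+(talking|speaking|chatting)\s+(to|with)\s+a\s+"
    r"(bot|robot|human|person|real)",
    re.IGNORECASE,
)

DISCLOSURE = ("I am an automated assistant, not a person. "
              "I can connect you with a member of our team at any time.")


def identity_reply(message: str) -> Optional[str]:
    """Return the fixed disclosure if the customer asks what they are talking to, else None."""
    return DISCLOSURE if IDENTITY_QUESTION.search(message) else None


for msg in ("Are you a bot?", "Am I talking to a real person?", "Where is my order?"):
    print(repr(msg), "->", identity_reply(msg))
```

Because the disclosure is a constant rather than generated text, it reads the same every time it is asked, which is the behavior the section asks for.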

6. Always give the user a way out

At every point in the conversation, the customer must have a visible, one-action path to a human or to the thing they actually came for. Not buried, not after three "are you sure you don't want to keep chatting?" prompts, not gated behind the bot deciding it has failed. A persistent, obvious exit.

The reason is trust, not just usability. A feature that traps a customer in a loop, refusing to surface the human path, reading their frustration as something to talk them out of, communicates that the business values deflection over resolution, and customers feel that precisely. The exit being visible at all times is also what makes the rest of the design safe to ship: even if every other behavior fails on a given conversation, a customer who can leave for a human in one click does not get to the refund-queue outcome. The exit is the backstop under the entire system. Design it as a permanent element of the surface, not a fallback the bot grants when it gives up.
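Treating the exit as a permanent element rather than something the bot grants can be sketched by routing every reply through one function that always attaches it. The `Reply` type, the `render` helper, and the `"handoff"` action name are hypothetical; the idea is only that no code path can produce a reply without the human exit.

```python
from dataclasses import dataclass

# Hypothetical action descriptor for the always-visible exit.
HUMAN_EXIT = {"label": "Talk to a person", "action": "handoff"}


@dataclass
class Reply:
    message: str
    actions: list


def render(message: str, extra_actions=()) -> Reply:
    """Build every reply through one function so the human exit cannot be omitted."""
    # Drop any caller-supplied handoff action so the exit appears exactly once, last.
    actions = [a for a in extra_actions if a.get("action") != HUMAN_EXIT["action"]]
    return Reply(message, actions + [HUMAN_EXIT])


reply = render("Here is our returns page.", [{"label": "Open returns page", "action": "link"}])
print([a["label"] for a in reply.actions])
```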

Key idea

The non-negotiable behaviors to require in writing before anyone builds this for you: it states what it is and what it cannot do before the first message; it shows uncertainty visibly instead of guessing confidently; it answers from your real content and shows the source; it hands off to a human with the full conversation attached so the customer never repeats themselves; it never pretends to be a person; and a one-action path to a human is visible at every point. A vendor who cannot commit to all six in plain language is selling you a demo bot.

What an owner can decide before commissioning anything

You do not need an engineer to make the decisions that matter most here. You need to know what to require and what to refuse, and you can do that from the business side before a line of code exists.

The non-negotiable behaviors to require in writing

Put the six behaviors above into the brief, in plain language, as acceptance criteria, not aspirations. "Answers from our actual help center and policy pages and shows which one" is a requirement you can verify by asking the vendor to demonstrate it on a question whose answer lives in your real content. "Escalates to a human with the full conversation attached" is verifiable by watching one handoff end to end. "Tells the customer it is an assistant and what it cannot do before the first message" you can verify by opening the chat yourself. Written, specific, demonstrable. If a behavior is not in writing, assume the build will optimize it away, because the impressive-demo version of every one of these is the version that drops it.

Add one more requirement that is pure UX and pure trust: a logged record of conversations, so when something goes wrong you can see exactly what the bot said. The appliance owner had to reconstruct what happened from a furious review. A feature that keeps an honest audit trail of what it told people is one you can actually manage; one that does not is one you are flying blind on until a customer tells you it lied.
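A logged record can be as simple as an append-only file with one line per turn. The sketch below uses the JSON Lines convention and only the Python standard library; the field names, the `log_turn` and `read_conversation` helpers, and the example URL are assumptions for illustration, and a production system would also need retention and privacy rules that are out of scope here.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def log_turn(log_path: Path, conversation_id: str, speaker: str, text: str,
             source_url=None) -> dict:
    """Append one conversation turn to a JSON Lines file and return the record."""
    record = {
        "id": str(uuid.uuid4()),
        "conversation_id": conversation_id,
        "timestamp": time.time(),
        "speaker": speaker,         # "customer" or "assistant"
        "text": text,               # exactly what was shown, not a paraphrase
        "source_url": source_url,   # which page the answer cited, if any
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_conversation(log_path: Path, conversation_id: str) -> list:
    """Return every logged turn for one conversation, in the order it was written."""
    with log_path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r for r in records if r["conversation_id"] == conversation_id]


# Usage: write to a temporary directory so the sketch leaves nothing behind.
log_path = Path(tempfile.mkdtemp()) / "audit.jsonl"
cid = "conv-001"  # hypothetical conversation id
log_turn(log_path, cid, "customer", "Is the compressor on my model still under warranty?")
log_turn(log_path, cid, "assistant", "I should not guess at that. Here is the policy page.",
         "https://example.com/warranty/sealed-system")
turns = read_conversation(log_path, cid)
for t in turns:
    print(t["speaker"], "|", t["text"], "|", t["source_url"])
```

With a record like this, the question "what exactly did the bot tell her?" is answered by reading the log rather than by reconstructing it from a review.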

The tells that a vendor is selling you a demo bot

Some signals tell you, before you buy, that you are being shown the dangerous version.

The demo answers everything instantly and never says "I am not sure." A production bot punts sometimes; a demo that never does is one whose wrong answers you have not seen yet. The pitch is about how smart and human it sounds and never about what it does when it does not know. There is no source shown under answers, just confident prose. There is no real human handoff, or the handoff dumps the customer in a queue to start over. The bot has a human name and a personality and dodges "are you a bot?". The vendor cannot show you the conversation logs or talk about an audit trail. Accuracy is quoted as a number and the question "what happens on the wrong ones" gets a vague answer. Each of these is the same underlying tell: the product is optimized to impress the person evaluating it, not to be safe for the customer using it, and those are different products that happen to look identical in a fifteen-minute demo.

When the honest answer is "not an AI feature yet"

Sometimes the right call is not to ship an AI feature at all, and a vendor selling one will never tell you that. If most of your customer questions are a small, stable set, "where is my order," "what is your return window," "do you service my area," a good search box, a clear help center, and a fast path to a human resolve them with zero hallucination risk and a fraction of the cost. An AI feature earns its place when the question space is genuinely large and varied and a static page or search cannot cover it, and your real content is good enough to ground answers in. If your help content is thin or wrong, an AI feature does not fix that; it reads your thin, wrong content aloud with more confidence. Fix the content and the human path first. The honest sequence is sometimes a better contact form and a faster human before an AI feature, and an owner who can say that out loud is harder to oversell.

The core trade: a wrong certainty costs more.

Grounding: show the source.

Failure path: hand off with the context.

Honesty: never a fake person.

Your own AI feature versus the things it gets confused with

Three things sit close to this and get confused with it. The first confusion is the expensive one and is drawn in full here.

Your AI feature you ship versus your ordinary surface an external agent uses

This is the boundary an owner most needs to get right, and it is a clean either/or once you see it.

Guide 11, this guide, is about an AI surface you ship. You put a chatbot, an assistant, or an AI search on your product. It is your feature, in your name, and you are responsible for how it behaves toward the person using it: whether it sets expectations, shows uncertainty, grounds its answers, fails cleanly to a human, and never fakes being a person. The user is your customer. The surface is something you built and own. Everything in this guide is about that.

The other case is the reverse. Your ordinary surface, your normal site, the one you would have had if AI never existed, gets used not by a person directly but by someone else's AI agent acting on that person's behalf. A customer tells their assistant "book me an appointment with this dental group" and the agent drives your normal booking page. You did not ship that agent. You do not control it. Your responsibility there is not how a bot you built behaves; it is whether your ordinary surface is legible and actionable enough that an external agent can complete the task on it. That is a different problem with different decisions, and it has its own guide, on designing for the AI agent that uses your ordinary site on a customer's behalf. That guide covers the case where your normal, non-AI surface is being driven by an agent you did not build.

The clean test is one example: a booking widget you built and put on your site that talks to patients is this guide; your existing booking form, unchanged, being filled out by a customer's personal assistant agent is Guide 10. They feel adjacent because both involve AI and a website, but they are opposite responsibilities: one is the behavior of a system you built, the other is the legibility of your site to a system you did not. An owner who runs them together either over-engineers their ordinary forms as if they were AI features or, worse, ships an AI feature with none of the six behaviors because they were thinking about the other problem.

A responsibly designed AI feature versus a generic "add a chatbot" pitch

"Add a chatbot" covers two products that share a chat box and nothing else. One is a scripted decision tree: rigid, predictable, dumb, it cannot answer anything off-script but it also cannot invent a warranty, because it cannot say anything it was not scripted to say. The other is an ungrounded model wrapper: a capable model in a chat box with no grounding, no source-showing, no real handoff, sold as "AI-powered." The wrapper is the appliance bot. It is the dangerous one precisely because it is fluent and confident and has nothing under it. A responsibly designed AI feature is neither: it has the model's fluency and the six behaviors that make fluency safe. When a vendor says "we'll add a chatbot," the only question that matters is which of these three they mean, and the generic pitch is almost always the middle one, the wrapper, because it demos best and costs them least to build.

An AI feature versus ordinary contact and form UX

A contact form, a help center search, a live-chat-to-a-human widget: these are real, useful surfaces, and none of them is an AI feature. They do not generate answers, so they cannot generate wrong ones. They are sometimes the honest answer instead of an AI feature, as the pre-commission section says. The confusion runs both ways: a vendor reframes a glorified contact form as "AI" to charge more, or an owner believes they need an AI feature when a better-routed contact form and a faster human would resolve more with no hallucination risk. The distinction is simple: an AI feature generates language and therefore needs the six behaviors; a form collects input and routes it and needs ordinary, good form UX. Do not buy the first when the second is what the problem actually wants, and do not let the first be sold to you wearing the second's risk profile.

What shipping this feature changes around it

An AI feature does not sit in isolation. Shipping one changes three things around it, and an owner should see those before committing.

How one AI feature spends or protects the trust you built

Every answer the feature gives either spends or protects the trust your business spent years earning, and it does so at a scale and speed nothing else on your site does, because it is talking to many customers at once, generating, and sounding authoritative. A pricing page is wrong the same way for everyone until you fix it. A bot is wrong differently for every customer, in a confident voice, in real time, and you may not find out until the queue does. That cuts both ways: a feature built to the standard in this guide protects trust on every conversation by refusing to bluff; a feature built as the demo wrapper spends it on every conversation it gets wrong, and you do not get to choose which conversations those are.

This is exactly the work most SMBs do not staff. Building an AI feature to the responsible standard described here is real design and engineering, not a plugin you switch on. It means a feature that is retrieval-grounded so it answers from your real content and cites it, that is instructed to refuse cleanly when the answer is not in the source instead of guessing, that escalates to a human with the conversation context preserved, and that has hallucination guardrails and a logged audit trail on every conversation. Companies without that capability in-house are the ones the Operations engagement exists for. It builds a retrieval-grounded, customer-facing assistant with these guardrails and ships the feature with the behaviors in this guide built in, because the behaviors are the deliverable, not an add-on. The discipline is the work, whoever does it.

How a responsible handoff changes what reaches your support team

A well-designed AI feature does not just deflect tickets. It changes what reaches a human and in what condition. The bot resolves the high-volume, well-documented questions from your real content, so what reaches your team is the harder, judgment-requiring, or unusual cases, and each one arrives with the context already attached. Your team stops answering "where is my order" for the hundredth time and starts each real conversation already oriented instead of cold. That is a different support operation, and it only happens if the handoff was designed to preserve context. A feature that deflects without that just moves the same questions to a slower queue and makes your team start every one from zero, which is worse than no bot, because now the customer waited through a bot first.

How this changes what you require of whoever builds it

Once you understand that the behaviors are the product, your vendor conversation changes permanently. You stop asking "how accurate is it" as the headline question and start asking "show me what it does when it does not know, show me a handoff end to end, show me the source under an answer, show me the logs." You require the six behaviors in writing as acceptance criteria. You treat a vendor who cannot discuss the wrong-answer path as having answered the most important question by avoiding it. This guide does not make you an engineer. It makes you an owner who cannot be sold the demo, which is most of what protects you here, because the demo is convincing and the difference between it and a production system is exactly the part that does not show in a demo.

The behavior to demand first

A surface you ship now serves a person and, increasingly, the agent acting for that person, and an AI feature you put your name on is judged by one thing: whether it resolves the task honestly for the human in front of it or spends, in a confident wrong sentence, the trust you spent years earning. The design decisions are the rest, and they are yours.

The reciprocal case, your ordinary surface being driven by an agent you did not build, is covered in this pillar's guide on designing for the AI agent that acts for a person on your normal site, and the UX pillar overview frames where both sit and which guide to read next. If you take one thing into the room before you commission anything, take this: the first behavior to demand, in writing, is that the feature says what it is and what it cannot do before its first message and never guesses confidently when it does not know. That single behavior is the difference between the bot that overpromised and the one that protected the business, and a vendor who will not commit to it has already told you which one they are selling you.