Every SaaS product is adding a chatbot. Most of them are bad. They hallucinate, give wrong answers, and frustrate users who expected something useful. The gap between a demo-quality chatbot and a production-quality one is enormous — roughly 3× the effort.
This guide covers how to build an AI chatbot that actually helps your users, with real architecture decisions, cost breakdowns, and the mistakes that waste the most time and money.
Architecture: How a SaaS Chatbot Works
A production chatbot has five components:
1. Document Ingestion Pipeline
Your knowledge base (help docs, product guides, FAQs, release notes) gets processed into chunks and stored as vector embeddings.
Chunking strategy matters. Bad chunking is the primary cause of wrong answers. Rules (a minimal chunker sketch follows this list):
- Chunk size: 200–500 tokens works best for most content. Larger chunks provide more context but reduce retrieval precision.
- Overlap: 50–100 token overlap between chunks prevents cutting information in half.
- Respect document structure: Do not split a paragraph across chunks. Use headers as natural boundaries.
- Preserve metadata: Each chunk should carry its source URL, document title, and section header. This enables attribution.
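A minimal chunker sketch under those rules, assuming markdown-style source docs, approximating tokens as roughly 4 characters each, and treating paragraphs as atomic. The metadata fields mirror the list above; everything else is illustrative, not a fixed schema:

```python
import re

CHUNK_TOKENS = 400     # target size, inside the 200-500 range above
OVERLAP_TOKENS = 75    # carried between consecutive chunks

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

def chunk_document(doc_text: str, source_url: str, title: str) -> list[dict]:
    chunks, current, header = [], [], ""

    def flush(overlap: bool) -> None:
        nonlocal current
        if not current:
            return
        chunks.append({"text": "\n\n".join(current), "source_url": source_url,
                       "title": title, "section": header})
        if not overlap:
            current = []
            return
        # Seed the next chunk with trailing paragraphs as overlap.
        tail, kept = [], 0
        for p in reversed(current):
            if kept >= OVERLAP_TOKENS:
                break
            tail.insert(0, p)
            kept += approx_tokens(p)
        current = tail

    for para in doc_text.split("\n\n"):        # paragraphs stay whole
        if re.match(r"^#{1,6} ", para):        # header = natural boundary
            flush(overlap=False)
            header = para.lstrip("# ").strip()
            continue
        if sum(approx_tokens(p) for p in current) + approx_tokens(para) > CHUNK_TOKENS:
            flush(overlap=True)                # size break keeps overlap
        current.append(para)
    flush(overlap=False)
    return chunks
```

Note that a single paragraph larger than the chunk budget passes through unsplit here; a production chunker would fall back to sentence-level splitting for those.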
2. Vector Store
Embeddings are stored in a vector database that enables fast similarity search.
Options:
- pgvector (PostgreSQL extension): Good if you already use Postgres. Lower operational complexity. Works well up to 1 million vectors.
- Pinecone: Managed service, no ops burden, good performance. $70/month for starter tier.
- Weaviate: Open source, self-hosted or cloud. Good filtering capabilities.
- Qdrant: Open source, performant, good for high-volume use cases.
For most SaaS products under 100,000 documents, pgvector is the pragmatic choice. You already have Postgres. Adding an extension is simpler than managing another service.
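A minimal pgvector sketch in Python, assuming psycopg 3 with the `pgvector` helper package and 1,536-dimension embeddings (e.g. OpenAI's `text-embedding-3-small`); the table and connection string are illustrative:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app", autocommit=True)  # assumed connection string
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teach psycopg the vector column type

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id         bigserial PRIMARY KEY,
        text       text NOT NULL,
        source_url text,
        title      text,
        section    text,
        embedding  vector(1536)   -- must match your embedding model's size
    )
""")

def top_k(query_embedding: np.ndarray, k: int = 5):
    # <=> is pgvector's cosine-distance operator: smaller means more similar.
    return conn.execute(
        """SELECT id, text, source_url, title,
                  1 - (embedding <=> %s) AS similarity
           FROM chunks
           ORDER BY embedding <=> %s
           LIMIT %s""",
        (query_embedding, query_embedding, k),
    ).fetchall()
```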
3. Retrieval Logic
When a user asks a question:
- The question is converted to a vector embedding
- The vector store finds the K most similar document chunks (typically K=5–10)
- A relevance filter removes chunks below a similarity threshold
- Remaining chunks are ranked and passed to the LLM
This is where most chatbots fail. If the retrieval step returns irrelevant chunks, the LLM generates confident-sounding wrong answers. Your chatbot is only as good as its retrieval.
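A minimal sketch of that four-step flow, assuming the OpenAI embeddings API and the `top_k` helper from the pgvector example above; the 0.7 threshold is a starting point to tune, not a universal constant:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()         # reads OPENAI_API_KEY from the environment
SIMILARITY_FLOOR = 0.7    # relevance filter; tune against an eval set

def retrieve(question: str, k: int = 8) -> list[dict]:
    # 1. Convert the question to a vector embedding.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    # 2. Find the K most similar chunks.
    rows = top_k(np.array(emb), k=k)
    # 3. Drop chunks below the similarity threshold; rows are already ranked.
    return [
        {"id": cid, "text": text, "source_url": url,
         "title": title, "similarity": sim}
        for (cid, text, url, title, sim) in rows
        if sim >= SIMILARITY_FLOOR
    ]
```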
Strategies that improve retrieval:
- Hybrid search: Combine vector similarity with keyword search. Some queries are better served by exact keyword matches (error codes, product names). See the fusion sketch after this list.
- Query expansion: Rewrite the user's question into multiple search queries. "How do I change my password?" also searches for "reset credentials" and "update login."
- Reranking: After initial retrieval, use a cross-encoder model to rerank results by actual relevance (not just vector similarity).
- Metadata filtering: If the user is on a specific product plan, filter out documentation for other plans before searching.
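Hybrid search is commonly wired up with reciprocal rank fusion (RRF), which merges two ranked lists without needing their scores to be comparable. A minimal sketch, assuming the two ID lists come from your vector store and a keyword engine such as Postgres full-text search:

```python
def rrf_merge(vector_ids: list[int], keyword_ids: list[int], k: int = 60) -> list[int]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per chunk."""
    scores: dict[int, float] = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Chunks found by both searches accumulate score from both lists.
    return sorted(scores, key=scores.get, reverse=True)

# merged = rrf_merge(ids_from_pgvector, ids_from_keyword_search)
```

k=60 is the conventional RRF constant; it damps the gap between adjacent ranks so neither list dominates.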
4. Generation (LLM)
The retrieved chunks plus the user's question go to an LLM (GPT-4o, Claude, Gemini) with a system prompt that defines behavior.
System prompt essentials (assembled in the sketch after this list):
- Role: "You are a support assistant for [Product]. Answer questions using only the provided context."
- Constraints: "If the context does not contain enough information to answer, say so. Do not make up answers."
- Format: "Cite your sources using [Source: document title]. Keep answers concise and actionable."
- Boundaries: "Do not answer questions unrelated to [Product]. Redirect to human support for account-specific issues."
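Putting the four parts together, a sketch of the generation call, assuming the OpenAI chat API and the `retrieve` helper above; "Acme" is a placeholder product name and the prompt wording is illustrative:

```python
SYSTEM_PROMPT = """You are a support assistant for Acme. Answer questions using
only the provided context. If the context does not contain enough information
to answer, say so. Do not make up answers. Cite your sources using
[Source: document title]. Keep answers concise and actionable. Do not answer
questions unrelated to Acme. Redirect account-specific issues to human support."""

def answer(question: str) -> str:
    chunks = retrieve(question)
    # Label each chunk with its title so the model can cite sources.
    context = "\n\n".join(f"[{c['title']}] {c['text']}" for c in chunks)
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,   # low temperature discourages improvisation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```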
5. Conversation Management
A production chatbot needs:
- Conversation history: Remember previous messages in the conversation (typically 5–10 turns of context; a sliding-window sketch follows this list)
- Session management: Separate conversations per user, with the ability to start new threads
- Escalation paths: When the chatbot cannot help, seamlessly hand off to human support with full conversation context
- Feedback collection: Thumbs up/down on answers to identify quality issues
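A minimal sketch of the history and session pieces with a sliding window (in-memory for illustration, retrieval omitted for brevity; production code would persist sessions per user):

```python
from collections import defaultdict, deque

MAX_TURNS = 8   # keep the last 8 exchanges (two messages per turn)

# session_id -> bounded message history; old messages fall off automatically
sessions: dict[str, deque] = defaultdict(lambda: deque(maxlen=MAX_TURNS * 2))

def chat(session_id: str, question: str) -> str:
    history = sessions[session_id]
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += list(history)                        # prior turns, oldest first
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    reply = resp.choices[0].message.content
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": reply})
    return reply
```

Starting a new thread is just a new session_id, and escalation can hand `list(history)` to the support tool so the human agent sees the full conversation.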
Cost Breakdown
Development costs
| Component | Cost | Timeline |
|---|---|---|
| Document ingestion pipeline | $2,000–$4,000 | 1–2 weeks |
| Vector store setup | $1,000–$2,000 | 3–5 days |
| Retrieval logic + quality tuning | $3,000–$8,000 | 2–3 weeks |
| Chat UI + conversation management | $2,000–$5,000 | 1–2 weeks |
| Testing + edge case handling | $2,000–$4,000 | 1 week |
| Total | $10,000–$23,000 | 5–8 weeks |
Ongoing monthly costs (per 1,000 active users)
| Component | Cost/Month |
|---|---|
| LLM API calls (GPT-4o) | $100–$500 |
| Vector database hosting | $50–$200 |
| Embedding generation (new documents) | $10–$30 |
| Compute (API server) | $50–$100 |
| Total | $210–$830 |
Cost per query breakdown
Assuming GPT-4o with 2,000 input tokens (retrieved context + question) and 500 output tokens:
- Input: 2,000 tokens × $2.50/1M = $0.005
- Output: 500 tokens × $10.00/1M = $0.005
- Embedding (query): $0.0001
- Total per query: ~$0.01
At 10,000 queries/month: $100 in LLM costs. At 50,000 queries/month: $500 in LLM costs.
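The same arithmetic as a quick estimator, using the GPT-4o rates quoted above (swap in your model's prices):

```python
def cost_per_query(input_tokens: int = 2_000, output_tokens: int = 500,
                   in_price: float = 2.50, out_price: float = 10.00) -> float:
    """Dollar cost per query; prices are $ per 1M tokens."""
    llm = (input_tokens * in_price + output_tokens * out_price) / 1e6
    return llm + 0.0001                  # plus the query embedding

print(round(cost_per_query(), 4))        # ~0.0101
print(round(cost_per_query() * 10_000))  # ~101 -> roughly $100/month at 10k queries
```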
Common Mistakes (and How to Avoid Them)
Mistake 1: Skipping retrieval quality testing
The most expensive mistake. Teams build the pipeline, test a few queries, see reasonable answers, and ship. Then users ask questions the team did not test, retrieval returns wrong chunks, and the chatbot confidently provides wrong information.
Fix: Build an evaluation dataset of 100+ questions with expected answers. Measure retrieval accuracy (are the right chunks being retrieved?) separately from generation quality. Target 90%+ retrieval accuracy before shipping.
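A minimal harness sketch for the retrieval half, assuming a JSONL eval file of questions paired with the IDs of the chunks that should come back, plus the `retrieve` helper from earlier (the file format and field names are illustrative):

```python
import json

def retrieval_accuracy(eval_path: str, k: int = 8) -> float:
    """Fraction of questions where at least one expected chunk lands in the top K."""
    with open(eval_path) as f:
        cases = [json.loads(line) for line in f]   # {"question": ..., "expected_ids": [...]}
    hits = 0
    for case in cases:
        retrieved = {c["id"] for c in retrieve(case["question"], k=k)}
        if retrieved & set(case["expected_ids"]):
            hits += 1
    return hits / len(cases)

# Gate the release on it:
# assert retrieval_accuracy("eval.jsonl") >= 0.90
```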
Mistake 2: Chunks too large or too small
Large chunks (1,000+ tokens) provide context but reduce precision — the model might get the right document but the wrong section. Small chunks (50–100 tokens) improve precision but lose context — the model gets a fragment that does not make sense alone.
Fix: 200–500 tokens with 50–100 token overlap. Test with your actual content and adjust based on retrieval quality metrics.
Mistake 3: No fallback for unknowns
When the chatbot does not know something, it should say so. Most implementations fail at this — the LLM generates a plausible-sounding answer from its training data instead of the retrieved context.
Fix: Strong system prompt constraints ("only answer from provided context"), low temperature (0.1–0.3), and a confidence scoring mechanism that detects when retrieved chunks have low relevance.
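A minimal confidence-gate sketch layered on the earlier `retrieve` and `answer` helpers (the 0.75 gate is a starting point to tune against your eval set):

```python
FALLBACK = ("I don't have enough information in the documentation to answer "
            "that. Would you like me to connect you with human support?")

def answer_with_fallback(question: str) -> str:
    chunks = retrieve(question)
    # If nothing relevant survived the filter, or even the best match is weak,
    # refuse rather than let the model improvise from its training data.
    if not chunks or max(c["similarity"] for c in chunks) < 0.75:
        return FALLBACK
    return answer(question)
```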
Mistake 4: Ignoring conversation context
A chatbot that forgets the previous message in a conversation frustrates users. "What is your pricing?" followed by "What about for enterprise?" needs context from the first message.
Fix: Include the last 5–10 conversation turns in the prompt. Use a sliding window to manage context length.
Mistake 5: No cost controls
A viral moment or a power user can generate thousands of queries in a day. Without rate limiting, your API bill explodes.
Fix: Per-user rate limits (e.g., 50 queries/day on free tier), per-tenant monthly caps for B2B, and alerting when costs exceed thresholds.
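A minimal per-user daily cap sketch using Redis counters (the key naming and limit are illustrative):

```python
import datetime
import redis

r = redis.Redis()
FREE_TIER_DAILY_LIMIT = 50

def allow_query(user_id: str) -> bool:
    """True if this user may run another query today."""
    key = f"chatbot:queries:{user_id}:{datetime.date.today().isoformat()}"
    count = r.incr(key)           # atomic; creates the key at 1 if missing
    if count == 1:
        r.expire(key, 86_400)     # first query of the day starts a 24h TTL
    return count <= FREE_TIER_DAILY_LIMIT
```

The same counters feed cost alerting: aggregate them per tenant and page someone when the monthly total crosses a threshold.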
Mistake 6: Building when you should buy
If your chatbot needs are simple (answer questions from help docs, no custom logic), consider off-the-shelf solutions before building custom:
- Intercom Fin: $0.99/resolution, no engineering time
- Zendesk AI: Integrated with existing support workflow
- ChatBot.com: Template-based, good for simple use cases
Build custom when: you need deep product integration, custom logic, multi-tenant isolation, or specific behavior that off-the-shelf tools cannot provide.
Build vs Buy Decision
Build custom when:
- Your chatbot needs to access user-specific data in real-time
- Multi-tenant isolation is required (each customer's data is separate)
- You need custom business logic in responses (calculations, lookups, actions)
- Off-the-shelf pricing does not make sense at your scale
- The chatbot is a core product feature, not just support deflection
Buy off-the-shelf when:
- You only need to answer questions from static documentation
- Support deflection is the primary goal
- You need something working this week, not in 6 weeks
- Your query volume is under 5,000/month (where per-resolution pricing is reasonable)
Realistic Timeline
- Week 1: Requirements, knowledge base audit, chunking strategy design
- Week 2: Document ingestion pipeline, vector store setup, initial embedding
- Week 3–4: Retrieval logic, quality tuning, hybrid search implementation
- Week 5: Chat UI, conversation management, streaming responses
- Week 6: Evaluation testing, edge case handling, fallback logic
- Week 7: Beta rollout to internal team or subset of users
- Week 8: Iterate on feedback, cost monitoring, production launch
After launch, expect 2–4 weeks of tuning based on real user questions. The questions users actually ask are always different from what you predicted.
Our Experience
We have built AI chatbots for 6 SaaS products across different domains. The patterns are remarkably consistent:
- Retrieval quality determines chatbot quality. Period.
- Chunking strategy needs to match your content structure (technical docs chunk differently than marketing content)
- Users ask the same 50 questions 80% of the time — optimize for those first
- The long tail of unusual questions is where hallucination risk lives — invest in good fallback behavior
- Cost per query at production scale is $0.01–$0.03 — manageable for any SaaS with reasonable pricing
If your SaaS product handles customer questions, support tickets, or knowledge management, a well-built chatbot reduces support load by 30–50%. At $3–$5 per human support interaction, the ROI is clear within the first month.