A support bot with vector-search retrieval returned the top 50 chunks per query (`k=50`) and stuffed all of them into the prompt. By turn 3 of any conversation the input exceeded 130K tokens and the model started ignoring the user's actual question.
What happened
The support bot was built on a popular RAG framework. The retrieval config as deployed:
    retriever = vector_store.as_retriever(search_kwargs={"k": 50})
Every user message triggered retrieval. All 50 chunks (averaging 800 tokens each, so about 40K tokens per retrieval) were concatenated into the system prompt, and the chunks from earlier turns were re-sent on every later turn.
By turn 3:
- Turn 1 chunks: 40K tokens
- Turn 2 chunks: 40K tokens
- Turn 3 chunks: 40K tokens
- System prompt + few-shot: 8K tokens
- Conversation history: 4K tokens
- Total input: 132K tokens per turn, just for context
By turn 5 the input was at 195K tokens. At that point a request either hit the 200K context limit or cost more per message than the entire support contract was worth.
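The turn-3 breakdown can be reproduced with a short sketch. The function name and parameters are illustrative and not part of the bot's code; the defaults are the figures quoted in this write-up, and the model assumes every turn's retrieved chunks stay in the prompt.

```python
def input_tokens(turn, k=50, avg_chunk_tokens=800,
                 fixed_prompt_tokens=8_000, history_tokens=4_000):
    """Estimate input size when every turn's retrieved chunks stay in the prompt."""
    # Each turn adds another k chunks; nothing from earlier turns is dropped.
    retrieved = turn * k * avg_chunk_tokens
    return retrieved + fixed_prompt_tokens + history_tokens


# Turn 3 with the deployed settings: 3 * 40K + 8K + 4K
print(input_tokens(3))  # 132000
```

The retrieved-chunk term grows linearly with the turn count, so it dominates the fixed prompt and the conversation history almost immediately.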
Diagnosis
Three failures stacked:
1. k=50 is way too high. The default in most RAG examples is 4. 50 was set by a junior engineer who confused "chunks" with "documents."
2. No deduplication across turns. Every turn re-retrieved and re-stuffed, even when the same chunks came back.
3. No relevance threshold. The retriever returned the top 50 by cosine similarity even when the 30th result was 0.41 (effectively unrelated).
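Failures 2 and 3 are easy to measure once retrieval results are logged. Below is a minimal audit sketch, assuming each turn's results are available as `(chunk_id, score)` pairs. That data shape, the function name, and the sample scores are made up for illustration; the 0.65 cutoff is borrowed from the fix described later.

```python
def audit_retrieval(turns, score_threshold=0.65):
    """Count low-relevance and repeated chunks for each turn.

    `turns` is a list of retrieval results, one per turn; each result is a
    list of (chunk_id, score) pairs. This shape is assumed for the sketch.
    """
    seen = set()
    report = []
    for results in turns:
        ids = [chunk_id for chunk_id, _ in results]
        # Chunks that a relevance threshold would have dropped.
        below = sum(1 for _, score in results if score < score_threshold)
        # Chunks already retrieved on an earlier turn.
        repeated = sum(1 for chunk_id in ids if chunk_id in seen)
        report.append({"retrieved": len(results),
                       "below_threshold": below,
                       "repeated": repeated})
        seen.update(ids)
    return report


turns = [
    [("a", 0.82), ("b", 0.70), ("c", 0.41)],
    [("a", 0.80), ("d", 0.66), ("c", 0.39)],
]
for row in audit_retrieval(turns):
    print(row)
# {'retrieved': 3, 'below_threshold': 1, 'repeated': 0}
# {'retrieved': 3, 'below_threshold': 1, 'repeated': 2}
```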
The fix
    - retriever = vector_store.as_retriever(search_kwargs={"k": 50})
    + retriever = vector_store.as_retriever(
    +     search_type="similarity_score_threshold",
    +     search_kwargs={"k": 6, "score_threshold": 0.65},
    + )
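Conceptually, the new config combines a cap with a cutoff. The sketch below shows that selection logic in plain Python. It is not the framework's implementation, `select_chunks` is a hypothetical name, and treating the threshold as inclusive is an assumption.

```python
def select_chunks(scored_chunks, k=6, score_threshold=0.65):
    """Keep at most k chunks whose score clears the threshold, best first."""
    relevant = [(cid, score) for cid, score in scored_chunks
                if score >= score_threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return relevant[:k]


scored = [("a", 0.91), ("b", 0.41), ("c", 0.72), ("d", 0.65), ("e", 0.30)]
print(select_chunks(scored))  # [('a', 0.91), ('c', 0.72), ('d', 0.65)]
```

One consequence of the cutoff is that a weak query can return fewer than `k` chunks, or none at all, so the prompt-building code has to handle an empty result.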
Plus a small helper that tracks which chunks have already been shown in the conversation:
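    # Chunk IDs already shown, keyed by conversation (a plain in-memory dict).
    SEEN_CHUNKS = {}

    def get_relevant_context(query, conversation_id):
        # The retriever already caps results at k=6 and applies the score threshold.
        chunks = retriever.get_relevant_documents(query)
        seen = SEEN_CHUNKS.get(conversation_id, set())
        # Keep only chunks this conversation has not been shown yet.
        new = [c for c in chunks if c.metadata["chunk_id"] not in seen][:6]
        # Record exactly the chunks being returned.
        SEEN_CHUNKS[conversation_id] = seen | {c.metadata["chunk_id"] for c in new}
        return new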
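To see the deduplication in action without a vector store, the helper can be run against a stub. `FakeDoc` and `FakeRetriever` are hypothetical stand-ins invented for this sketch, as are the queries and chunk IDs. The helper is repeated so the block runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only `metadata` is used here."""
    metadata: dict = field(default_factory=dict)


class FakeRetriever:
    """Hypothetical stub that returns canned chunk IDs per query."""

    def __init__(self, canned):
        self.canned = canned

    def get_relevant_documents(self, query):
        return [FakeDoc({"chunk_id": cid}) for cid in self.canned[query]]


retriever = FakeRetriever({
    "reset password": ["c1", "c2", "c3"],
    "reset password email": ["c2", "c3", "c4"],
})
SEEN_CHUNKS = {}


def get_relevant_context(query, conversation_id):
    chunks = retriever.get_relevant_documents(query)
    seen = SEEN_CHUNKS.get(conversation_id, set())
    new = [c for c in chunks if c.metadata["chunk_id"] not in seen]
    SEEN_CHUNKS[conversation_id] = seen | {c.metadata["chunk_id"] for c in chunks[:6]}
    return new[:6]


first = get_relevant_context("reset password", "conv-1")
second = get_relevant_context("reset password email", "conv-1")
print([d.metadata["chunk_id"] for d in first])   # ['c1', 'c2', 'c3']
print([d.metadata["chunk_id"] for d in second])  # ['c4']
```

The second turn returns only `c4`, because `c2` and `c3` were already shown in the same conversation. A different `conversation_id` would start with an empty seen-set.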
After the fix, average context dropped from 132K to 14K tokens and cost per conversation dropped 9×. Answer quality went up because the model was no longer being asked to find the needle in 40K tokens of haystack.
Takeaway
`k` is the most consequential RAG hyperparameter and the one most often left at a copy-pasted default. Audit it. And remember: more context isn't always better. Past some scale, models get "lost in the middle" and ignore relevant chunks anyway.