Internal token counter undercounts by 18%, agent silently exceeds the model's context window

A custom 'tokens ≈ chars / 4' approximation undercounted code-heavy prompts by 18%. The agent's pre-flight check concluded it had budget to spare and sent the request; the response was then truncated server-side without any error.

What happened

To avoid loading a 1 MB tokenizer in a serverless function, a team used Math.ceil(text.length / 4) as a token estimate. It worked fine for English prose. For code-heavy prompts (lots of punctuation, long identifiers like getUserPreferencesByDeviceId), the estimate came in 18% below the real token count.

The agent's pre-flight:

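// Cheap estimate: assumes ~4 characters per token
const estimated = Math.ceil(prompt.length / 4);
// Keep 4096 tokens of headroom below the context window
if (estimated > MODEL_CONTEXT - 4096) {
  return summarize(prompt);  // over budget: shrink the prompt instead of sending it
}
return callModel(prompt);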

For a code-review prompt at ~150K real tokens, the estimate was 123K. Pre-flight passed. The server accepted the request, but with ~150K real input tokens only about 50K of the 200K-token window remained for output, and generation was cut off mid-output with finish_reason: "length". The response looked complete to the wrapper, which didn't check finish_reason.

Diagnosis

chars / 4 is fine for short English prose. It drifts for other content:

Code (lots of single-char tokens like {, (, ;): closer to chars/3.2
Non-Latin scripts (CJK, Arabic): closer to chars/2.5
Mixed (markdown with code blocks): varies wildly

The 18% undercount is reproducible and bites when prompts approach context limits.
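
The arithmetic behind those ratios can be sketched as follows. This is an illustration only: undercountFraction is a hypothetical helper, and the ratios are the approximate figures quoted above, not measurements. If text really averages r characters per token, then chars / 4 reports r / 4 of the true count, so it undercounts by 1 - r / 4. An 18% undercount corresponds to roughly 3.28 characters per token, in line with the chars/3.2 figure for code.

```javascript
// Baseline from the original heuristic: tokens ≈ chars / 4.
const ASSUMED_CHARS_PER_TOKEN = 4;

// Fraction by which chars/4 undercounts when the text really averages
// `actualCharsPerToken` characters per token.
function undercountFraction(actualCharsPerToken) {
  return 1 - actualCharsPerToken / ASSUMED_CHARS_PER_TOKEN;
}

// Approximate ratios from the diagnosis above (content-dependent).
const ratios = { code: 3.2, nonLatin: 2.5 };

for (const [kind, ratio] of Object.entries(ratios)) {
  const pct = (undercountFraction(ratio) * 100).toFixed(1);
  console.log(`${kind}: chars/4 undercounts by about ${pct}%`);
}
```

The non-Latin case is the more dangerous one: at 2.5 characters per token the heuristic misses more than a third of the real tokens.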

The fix

- const estimated = Math.ceil(prompt.length / 4);
+ // tiktoken WASM, lazy-loaded once
+ const enc = await getEncoder("cl100k_base");
+ const estimated = enc.encode(prompt).length;
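
One way the "lazy-loaded once" part might look is sketched below. lazyOnce, makeTokenCounter, and loadEncoder are hypothetical names, not part of tiktoken or the team's code; loadEncoder stands in for whatever initializes the real tokenizer and is assumed to return (or resolve to) an object with an encode(text) method. Unlike the getEncoder call in the diff, this sketch takes no encoding name.

```javascript
// Wrap a loader so it runs at most once; later calls reuse the first result.
// If the loader returns a promise, the promise itself is cached, so
// concurrent callers share a single in-flight load.
function lazyOnce(load) {
  let cached;
  let loaded = false;
  return function get() {
    if (!loaded) {
      cached = load();
      loaded = true;
    }
    return cached;
  };
}

// Hypothetical wiring: `loadEncoder` initializes the tokenizer
// (for example, a WASM build) the first time a count is requested.
function makeTokenCounter(loadEncoder) {
  const getEncoder = lazyOnce(loadEncoder);
  return async function countTokens(text) {
    const enc = await getEncoder();
    return enc.encode(text).length;
  };
}
```

Note that this sketch also caches a rejected promise, so a failed load stays failed until the process restarts; a production version would probably clear the cache on error.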

For models without a published local tokenizer (Claude, Gemini), use the provider's count-tokens endpoint when accuracy matters:

// Anthropic TypeScript SDK: methods are camelCase (count_tokens is the Python SDK name)
const { input_tokens } = await anthropic.messages.countTokens({
  model: "claude-sonnet-4-5",
  messages: [...]
});
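
A count-tokens call costs a network round trip, so one possible way to decide "when accuracy matters" is a deliberately pessimistic local estimate that triggers the exact count only when it cannot rule out an overflow. The sketch below is an assumption, not the team's implementation: pessimisticEstimate and needsExactCount are hypothetical helpers, and the 2.5 divisor is simply the densest ratio from the diagnosis above, which real prompts may occasionally exceed.

```javascript
// Densest ratio quoted in the diagnosis (non-Latin text, ~2.5 chars/token).
// Assumption: real prompts rarely tokenize more densely than this.
const WORST_CASE_CHARS_PER_TOKEN = 2.5;

// Deliberately overcounts ordinary prose so that "under budget" is trustworthy.
function pessimisticEstimate(text) {
  return Math.ceil(text.length / WORST_CASE_CHARS_PER_TOKEN);
}

// True when the pessimistic estimate crosses the budget, i.e. the cheap
// check cannot rule out an overflow and an exact count is worth the call.
function needsExactCount(text, inputBudget) {
  return pessimisticEstimate(text) > inputBudget;
}
```

Prompts that pass the pessimistic check can be sent directly; everything else goes through the exact count before any routing decision is made.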

Plus: always check finish_reason after the call (stop_reason on Anthropic's API, where the truncation value is "max_tokens").
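
A minimal sketch of that post-call check is shown below. The function names are hypothetical, and the response shapes are assumptions: an OpenAI-style object with choices[0].finish_reason, or an Anthropic-style object with a top-level stop_reason. Adjust the lookup to whatever your wrapper actually receives.

```javascript
// Stop markers that mean "ran out of token budget". "length" is the
// finish_reason from this incident; "max_tokens" is the stop_reason value
// Anthropic's Messages API uses for the same condition.
const TRUNCATION_REASONS = new Set(["length", "max_tokens"]);

// Assumed shapes: { choices: [{ finish_reason }] } or { stop_reason }.
function stopReasonOf(response) {
  return response?.choices?.[0]?.finish_reason ?? response?.stop_reason ?? null;
}

// Throws instead of letting a truncated response pass as complete.
function assertNotTruncated(response) {
  const reason = stopReasonOf(response);
  if (TRUNCATION_REASONS.has(reason)) {
    throw new Error(`Model output truncated (stop reason: ${reason})`);
  }
  return response;
}
```

Wrapping the call as assertNotTruncated(await callModel(prompt)) would have turned the silent truncation in this incident into a visible error.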

Takeaway

chars/4 is a rule of thumb for sizing dropdowns, not a budget enforcement primitive. If you're making routing decisions based on token count, use the real tokenizer. And always check whether your output got truncated server-side.