Cancelled SSE stream keeps generating tokens — backend bills accumulate after user closes the tab

A user closes a long-running chat tab and the frontend SSE connection drops, but the backend keeps reading from the LLM stream until the model finishes, which can take 30+ seconds. Those tokens are still billed. With 10K daily users abandoning slow streams, this added $1,200/month in pure waste.

What happened

Chat app with streaming responses. User asks a question, model streams ~5K tokens of output. Average user reads ~30 tokens before deciding the response isn't useful and closes the tab.

Naive backend:

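from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()
# client: an async LLM SDK client created elsewhere

@app.post("/chat")
async def chat(req: Request):
    stream = await client.messages.create(stream=True, ...)
    async def emit():
        # Nothing in this loop reacts to the client going away
        async for chunk in stream:
            yield f"data: {chunk.json()}\n\n"
    return StreamingResponse(emit(), media_type="text/event-stream")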

When the client disconnects, FastAPI raises ClientDisconnect eventually, but the async for chunk in stream loop keeps pulling from the model. The model continues generating. Tokens are billed.

Token accounting per cancelled response:

Tokens emitted to user: ~30
Tokens generated server-side after disconnect: ~4,970
Share of generated tokens wasted: 4,970 / 5,000 = 99.4%
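
The arithmetic above can be sketched in a few lines. This is only an illustration: the function names are made up for this example, and the volume and price passed to monthly_waste_usd are placeholders you would replace with your own numbers, not the inputs behind the $1,200 estimate.

```python
def wasted_fraction(tokens_seen: int, tokens_generated: int) -> float:
    """Fraction of generated output tokens the user never saw."""
    return (tokens_generated - tokens_seen) / tokens_generated


def monthly_waste_usd(
    cancelled_per_day: int,
    wasted_tokens_each: int,
    usd_per_million_tokens: float,  # assumption: look this up for your provider
    days: int = 30,
) -> float:
    """Rough monthly cost of tokens generated after the user left."""
    total_wasted_tokens = cancelled_per_day * days * wasted_tokens_each
    return total_wasted_tokens * usd_per_million_tokens / 1_000_000


# Numbers from the post: ~30 tokens seen out of ~5,000 generated.
fraction = wasted_fraction(tokens_seen=30, tokens_generated=5_000)
print(f"{fraction:.1%}")  # 99.4%
```

The useful part is the shape of the formula: waste scales linearly with abandoned responses per day and with the tokens generated after the disconnect, so cutting the second factor is where the fix pays off.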

Diagnosis

The httpx/anthropic/openai SDK streams keep the upstream HTTP connection open and keep reading bytes. Closing the client connection downstream doesn't propagate upstream automatically. You have to explicitly cancel.
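
To make the asymmetry concrete, here is a small asyncio simulation. It does not use any real SDK: FakeUpstream and simulate are hypothetical names, and the producer task stands in for a provider that generates on its own schedule regardless of whether anyone is still reading.

```python
import asyncio


class FakeUpstream:
    """Stand-in for a model provider that generates tokens independently."""

    def __init__(self, total_tokens):
        self.total_tokens = total_tokens
        self.generated = 0
        self.queue = asyncio.Queue()
        self._task = None

    def start(self):
        self._task = asyncio.create_task(self._produce())

    async def _produce(self):
        for i in range(self.total_tokens):
            await asyncio.sleep(0)  # yield control, like waiting on the network
            self.generated += 1
            self.queue.put_nowait(f"tok{i}")

    async def cancel(self):
        """Explicitly tell the upstream to stop."""
        self._task.cancel()
        try:
            await self._task
        except asyncio.CancelledError:
            pass

    async def wait_finished(self):
        await self._task


async def simulate(cancel_on_disconnect, read_tokens=3, total_tokens=100):
    upstream = FakeUpstream(total_tokens)
    upstream.start()
    for _ in range(read_tokens):
        await upstream.queue.get()  # the "user" reads a few tokens
    # The client disconnects here.
    if cancel_on_disconnect:
        await upstream.cancel()
    else:
        await upstream.wait_finished()  # nobody told upstream to stop
    return upstream.generated


if __name__ == "__main__":
    print("no cancel:  ", asyncio.run(simulate(cancel_on_disconnect=False)))
    print("with cancel:", asyncio.run(simulate(cancel_on_disconnect=True)))
```

Without the explicit cancel, the producer runs to its full 100 tokens even though only 3 were read. With it, generation stops close to where the reader left off.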

The fix

Watch for client disconnect and abort the upstream stream:

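from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()
# client: an async LLM SDK client created elsewhere

@app.post("/chat")
async def chat(req: Request):
    stream = await client.messages.create(stream=True, ...)

    async def emit():
        try:
            async for chunk in stream:
                # Checked once per chunk: stop as soon as the client is gone
                if await req.is_disconnected():
                    break  # bail out
                yield f"data: {chunk.json()}\n\n"
        finally:
            # Explicit abort, important. Runs on completion, break, and cancellation.
            # The method name varies by SDK (some expose close() instead of aclose()).
            await stream.aclose()

    return StreamingResponse(emit(), media_type="text/event-stream")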

stream.aclose() (or stream.cancel() in some SDKs; check which name yours exposes) closes the upstream HTTP connection to the model provider, which is the signal for it to stop generating. Most providers stop billing within a token or two.

After fix: average wasted tokens per cancelled response dropped from 4,970 to ~12.
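
One way to check the pattern without a real provider or a real browser is a unit-style test with fakes. This is a sketch under assumptions: FakeStream and FakeRequest are hypothetical stand-ins that only mimic the two methods the handler relies on (aclose() and is_disconnected()), and emit here takes its dependencies as arguments so it can run outside FastAPI.

```python
import asyncio


class FakeStream:
    """Mimics an SDK stream: async-iterable, with an aclose() method."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self.pulled = 0
        self.closed = False

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.closed or self.pulled >= len(self._chunks):
            raise StopAsyncIteration
        chunk = self._chunks[self.pulled]
        self.pulled += 1
        return chunk

    async def aclose(self):
        self.closed = True


class FakeRequest:
    """Reports a disconnect after a fixed number of checks."""

    def __init__(self, disconnect_after):
        self._checks = 0
        self._disconnect_after = disconnect_after

    async def is_disconnected(self):
        self._checks += 1
        return self._checks > self._disconnect_after


async def emit(req, stream):
    # Same shape as the handler's inner generator.
    try:
        async for chunk in stream:
            if await req.is_disconnected():
                break
            yield f"data: {chunk}\n\n"
    finally:
        await stream.aclose()


async def demo():
    stream = FakeStream(f"tok{i}" for i in range(5000))
    req = FakeRequest(disconnect_after=30)
    sent = [event async for event in emit(req, stream)]
    return len(sent), stream.pulled, stream.closed


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (30, 31, True)
```

The fake user leaves after 30 events. The generator pulls one more chunk, sees the disconnect, and closes the stream, so 31 of the 5,000 available chunks are consumed. A real provider will differ in how quickly it stops, but the test pins down the part you control: the close call happens.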

Takeaway

Streaming responses are asymmetric — closing the downstream connection does NOT cancel the upstream model call. Wire the disconnect signal explicitly. And if you're using a framework that wraps SSE, verify it handles this — many don't.