An orchestrator spawned 12 parallel sub-agents that each made 4-5 LLM calls. They all hit the provider's per-minute token limit simultaneously. The default retry-on-429 logic re-sent every failed call after a 1-second backoff, multiplying load 12× and stretching a 30-second job to 14 minutes.
What happened
A document-processing pipeline orchestrated 12 parallel sub-agents (one per document chunk). Each sub-agent made ~5 LLM calls. The provider rate limit was 4M tokens/minute on the tier in use.
Cold-start at t=0:
- 12 agents × ~150K tokens each first call = 1.8M tokens in flight at t=0.5s
- Burst is fine for the first 90 seconds (within the 4M/min budget)
- At t=92s, the rolling window catches up: provider returns 429 to the next 8 simultaneous calls
The retry logic:
from tenacity import retry, stop_after_attempt, wait_fixed

@retry(stop=stop_after_attempt(5), wait=wait_fixed(1))  # every caller waits the same 1s
def call_llm(...):
    return client.messages.create(...)
All 8 retries fired exactly 1 second later and hit 429 again. They waited another second, retried together, and failed together. In effect this was a tight loop hammering the provider: the uniform delay kept all 8 calls in lockstep, which guaranteed they kept colliding.
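The lockstep can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes every attempt fails instantly (zero request latency), and the function name `fixed_wait_schedule` is made up for this example. The constants come from the timeline and decorator above.

```python
from collections import Counter

FIRST_FAILURE_T = 92.0   # seconds; when the 429s started, per the timeline above
FAILED_CALLS = 8         # calls rejected together
MAX_ATTEMPTS = 5         # stop_after_attempt(5)
FIXED_WAIT = 1.0         # wait_fixed(1)


def fixed_wait_schedule(start: float, attempts: int, wait: float) -> list[float]:
    """Times at which one call retries if every attempt fails (hypothetical helper)."""
    # Attempt 1 is the original call; attempts 2..N are the retries.
    return [start + wait * n for n in range(1, attempts)]


# Every failed call follows the identical schedule, so count arrivals per instant.
arrivals = Counter(
    t
    for _ in range(FAILED_CALLS)
    for t in fixed_wait_schedule(FIRST_FAILURE_T, MAX_ATTEMPTS, FIXED_WAIT)
)

for t, n in sorted(arrivals.items()):
    print(f"t={t:.0f}s: {n} simultaneous retries")
```

Under these assumptions each retry instant receives all 8 calls at once, so the burst that triggered the 429 is replayed on every attempt.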
Diagnosis
Three issues:
1. No jitter on retries. All 8 failed calls wait the same 1s and retry simultaneously, so every retry recreates the burst that caused the 429.
2. No global token budget. No sub-agent knew how many tokens its siblings already had in flight.
3. wait_fixed(1) is the wrong shape. Rate limits typically reset on a sliding window of 30-60s. A 1s backoff guarantees you'll be retrying inside the same penalty window.
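To make the third point concrete, here is a minimal sliding-window model. It is a sketch under stated assumptions, not the provider's actual algorithm: the 60-second window, the `SlidingWindow` class, and the traffic pattern (100K tokens admitted every 1.5s) are all invented for illustration. Only the 4M limit comes from the incident.

```python
from collections import deque

WINDOW_S = 60.0      # assumed sliding-window length
LIMIT = 4_000_000    # tokens per window, from the tier described above


class SlidingWindow:
    """Tracks tokens admitted during the last WINDOW_S seconds (hypothetical model)."""

    def __init__(self) -> None:
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)

    def record(self, now: float, tokens: int) -> None:
        self.events.append((now, tokens))

    def used(self, now: float) -> int:
        # Drop events that have aged out of the window, then sum the rest.
        while self.events and self.events[0][0] <= now - WINDOW_S:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)


window = SlidingWindow()
# Made-up traffic: 40 admissions of 100K tokens, one every 1.5s starting at t=32s.
for i in range(40):
    window.record(32.0 + 1.5 * i, 100_000)

# Compare headroom right after the 429 with headroom half a window or more later.
for retry_at in (93.0, 97.0, 122.0, 152.0):
    headroom = LIMIT - window.used(retry_at)
    print(f"retry at t={retry_at:.0f}s: {headroom:,} tokens of headroom")
```

In this model the window frees capacity only as old requests age out, so a retry one second after the 429 sees almost the same usage as the call that failed, while a retry tens of seconds later sees substantially more room.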
The fix
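from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

@retry(
    stop=stop_after_attempt(5),
-   wait=wait_fixed(1),
+   wait=wait_random_exponential(multiplier=2, max=60),  # random wait in [0, cap]; cap doubles 2s, 4s, 8s, 16s (ceiling 60s)
    retry=retry_if_exception_type(RateLimitError),  # RateLimitError: the provider SDK's 429 exception
)
def call_llm(...):
    return client.messages.create(...)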
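The effect of the new wait strategy can be sketched in plain Python. The code below does not call tenacity; it mirrors what I understand `wait_random_exponential` to do (a uniform draw between 0 and an exponentially growing cap), and the helper name `full_jitter_wait` is an assumption made for this example. The zero-latency, always-failing setup is also an assumption.

```python
import random

MULTIPLIER = 2.0   # from the decorator above
MAX_WAIT = 60.0    # from the decorator above
MAX_ATTEMPTS = 5   # stop_after_attempt(5) leaves at most 4 waits


def full_jitter_wait(attempt: int, rng: random.Random) -> float:
    """Wait after the given failed attempt (1-based), drawn from [0, cap]."""
    cap = min(MAX_WAIT, MULTIPLIER * 2 ** (attempt - 1))
    return rng.uniform(0.0, cap)


rng = random.Random(0)  # seeded only so the sketch is repeatable
schedules: list[list[float]] = []
for call_id in range(8):
    t = 92.0  # all 8 calls fail together, as in the incident
    times = []
    for attempt in range(1, MAX_ATTEMPTS):
        t += full_jitter_wait(attempt, rng)
        times.append(t)
    schedules.append(times)
    print(f"call {call_id}: retries at " + ", ".join(f"{x:.1f}s" for x in times))
```

Each call now draws its own schedule, so the 8 retries spread out instead of landing on the same instant. One thing the caps make visible: with five attempts the waits sum to at most 2 + 4 + 8 + 16 = 30 seconds, so backoff alone may not outlast a full rate-limit window.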
Plus a token bucket shared across the orchestrator:
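# 4M tokens/min / 60s ≈ 67K tokens/sec sustainable; 60K leaves ~10% headroom
# TokenBucket is a project-level async helper, not a standard-library class
TOKEN_BUDGET = TokenBucket(rate_per_sec=60_000, burst=200_000)

async def call_llm_metered(model, prompt, max_tokens):
    # Reserve worst-case usage up front: prompt tokens plus the full completion cap
    estimated = count_tokens(prompt) + max_tokens
    await TOKEN_BUDGET.consume(estimated)  # waits until the shared budget has room
    return await client.messages.create(...)  # client must be the SDK's async client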
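The snippet above does not show `TokenBucket` itself. The following is one possible minimal implementation, offered as a sketch rather than the code used in the incident: the constructor arguments and `consume` method match the call sites above, and everything else (lazy refill, a single lock, the `ValueError` for oversized requests) is an assumption. It targets Python 3.10+, where an `asyncio.Lock` can safely be created before the event loop starts.

```python
import asyncio
import time


class TokenBucket:
    """Async token bucket: refills at rate_per_sec and holds at most burst tokens."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst              # start full so the first burst is admitted
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        # Credit tokens for the time elapsed since the last update, up to capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    async def consume(self, amount: float) -> None:
        if amount > self.capacity:
            raise ValueError("request exceeds bucket capacity and could never be admitted")
        # Holding the lock while sleeping makes waiters proceed one at a time, in order.
        async with self._lock:
            self._refill()
            while self.tokens < amount:
                await asyncio.sleep((amount - self.tokens) / self.rate)
                self._refill()
            self.tokens -= amount


async def main() -> None:
    # Small numbers so the demo finishes quickly: the first call uses the burst,
    # the next two each wait for a refill.
    bucket = TokenBucket(rate_per_sec=1_000, burst=100)
    start = time.monotonic()
    await asyncio.gather(*(bucket.consume(100) for _ in range(3)))
    print(f"3 x 100 tokens admitted in {time.monotonic() - start:.2f}s")


asyncio.run(main())
```

Because callers queue on the lock, a large fan-out is admitted at the refill rate instead of all at once, which is the shaping the retry logic alone could not provide. A request larger than `burst` is rejected outright here; a real system might prefer to split or cap such requests.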
After the fix, the same workload completes in 38 seconds (the target was 30) with zero retries. Tail latency dropped from 14 minutes to 41 seconds.
Takeaway
Rate-limit retries without jitter become DDoS attacks on yourself. Always use exponential backoff with jitter. And when fanning out parallel work, model the global rate budget — don't assume the provider's rate limiter will gracefully shape your traffic.