Parallel sub-agents trigger a rate-limit cascade that takes down a 12-agent pipeline

An orchestrator spawned 12 parallel sub-agents that each made 4-5 LLM calls. They all hit the provider's per-minute token limit simultaneously. The default retry-on-429 logic re-sent every failed call after a 1-second backoff, multiplying load 12× and stretching a 30-second job to 14 minutes.

What happened

A document-processing pipeline orchestrated 12 parallel sub-agents (one per document chunk). Each sub-agent made ~5 LLM calls. The provider rate limit was 4M tokens/minute on the tier in use.

Cold-start at t=0:

12 agents × ~150K tokens each first call = 1.8M tokens in flight at t=0.5s
Burst is fine for the first 90 seconds (within the 4M/min budget)
At t=92s, the rolling window catches up: provider returns 429 to the next 8 simultaneous calls
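
The rolling-window behaviour can be pictured with a small sliding-window counter. This is only a sketch under assumptions: the provider's real limiter is not described in this write-up, `SlidingWindowLimiter` and `try_admit` are hypothetical names, and the second wave of calls is made-up load chosen to show where a rejection would occur. Only the first wave (12 calls of ~150K tokens against a 4M/min limit) comes from the incident.

```python
from collections import deque

class SlidingWindowLimiter:
    """Admit a call only if tokens used in the trailing window stay within the limit."""

    def __init__(self, limit_tokens, window_s=60.0):
        self.limit = limit_tokens
        self.window_s = window_s
        self._events = deque()   # (timestamp, tokens) of admitted calls, oldest first
        self._used = 0

    def try_admit(self, now, tokens):
        # Drop admitted calls that have aged out of the trailing window.
        while self._events and self._events[0][0] <= now - self.window_s:
            _, old = self._events.popleft()
            self._used -= old
        if self._used + tokens > self.limit:
            return False         # a real provider would answer 429 here
        self._events.append((now, tokens))
        self._used += tokens
        return True

limiter = SlidingWindowLimiter(limit_tokens=4_000_000)

# First wave from the incident: 12 agents x ~150K tokens at t=0.5s.
first_wave = [limiter.try_admit(now=0.5, tokens=150_000) for _ in range(12)]
print(all(first_wave))        # True: 1.8M tokens fit inside the 4M window

# Hypothetical second wave: 20 more calls arrive while the first is still counted.
later = [limiter.try_admit(now=10.0, tokens=150_000) for _ in range(20)]
print(later.count(True))      # 14: the 15th would push usage past 4M
```

The point of the sketch is that earlier calls keep counting against the budget until they age out, so a burst that was admitted can still cause rejections tens of seconds later.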

The retry logic:

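from tenacity import retry, stop_after_attempt, wait_fixed

# Retries on any exception, up to 5 attempts, always waiting exactly 1 second.
@retry(stop=stop_after_attempt(5), wait=wait_fixed(1))
def call_llm(**kwargs):
    return client.messages.create(**kwargs)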

All 8 retries fired exactly 1 second later, and all hit 429 again. They waited another second and retried again, still in lockstep. The result was a tight loop hammering the provider: because every call used the same delay, the 8 calls were guaranteed to keep colliding.

Diagnosis

Three issues:

1. No jitter on retries. All 8 failed calls wait the same 1s and retry simultaneously, so every retry round recreates the burst that triggered the 429s in the first place.

2. No global token budget. No sub-agent knew how many tokens its siblings already had in flight.

3. wait_fixed(1) is the wrong shape. Rate limits typically reset on a sliding window of 30-60s. A 1s backoff guarantees you'll be retrying inside the same penalty window.
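
The first and third points can be illustrated with a small standard-library simulation. It is a sketch, not the pipeline's code: the function names are hypothetical, and the jittered schedule assumes "full jitter" (each wait drawn uniformly between 0 and an exponentially growing bound), which is one common way to add jitter.

```python
import random

def fixed_schedule(n_callers, attempts, wait_s=1.0):
    """Retry instants per caller when every caller waits the same fixed delay."""
    return [[wait_s * k for k in range(1, attempts)] for _ in range(n_callers)]

def full_jitter_schedule(n_callers, attempts, multiplier=2.0, cap=60.0, rng=None):
    """Retry instants per caller with full jitter: wait k is uniform in
    [0, min(cap, multiplier * 2**k)] for k = 0, 1, 2, ..."""
    rng = rng or random.Random()
    schedules = []
    for _ in range(n_callers):
        t, instants = 0.0, []
        for k in range(attempts - 1):
            t += rng.uniform(0.0, min(cap, multiplier * 2 ** k))
            instants.append(t)
        schedules.append(instants)
    return schedules

def distinct_first_retries(schedules, resolution_s=0.01):
    """Number of distinct first-retry instants, bucketed to `resolution_s`."""
    return len({round(s[0] / resolution_s) for s in schedules})

fixed = fixed_schedule(n_callers=8, attempts=5)
jittered = full_jitter_schedule(n_callers=8, attempts=5, rng=random.Random(0))

print(distinct_first_retries(fixed))     # 1: all 8 callers retry at the same instant
print(distinct_first_retries(jittered))  # more than 1: callers spread over the window
```

With a fixed delay the 8 callers form a single spike on every round. With jitter the same 8 retries arrive at different times, and the growing bound pushes later attempts toward the timescale on which the limit actually resets.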

The fix

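from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

# RateLimitError is the provider SDK's exception for HTTP 429.
@retry(
    stop=stop_after_attempt(5),
    # Replaces wait_fixed(1). Full jitter: each wait is drawn uniformly from
    # [0, 2s], then [0, 4s], [0, 8s], [0, 16s], capped at 60s.
    wait=wait_random_exponential(multiplier=2, max=60),
    retry=retry_if_exception_type(RateLimitError),
)
def call_llm(**kwargs):
    return client.messages.create(**kwargs)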

Plus a token bucket shared across the orchestrator:

# 4M tokens/min / 60s ≈ 67K tokens/sec sustainable; 60K/sec leaves ~10% headroom.
TOKEN_BUDGET = TokenBucket(rate_per_sec=60_000, burst=200_000)

async def call_llm_metered(model, prompt, max_tokens):
    # Reserve the worst case up front: prompt tokens plus the full output allowance.
    estimated = count_tokens(prompt) + max_tokens
    await TOKEN_BUDGET.consume(estimated)  # waits until the shared budget has room
    return await client.messages.create(...)
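
The `TokenBucket` class used above is not defined in this write-up. A minimal asyncio sketch of what such a class could look like follows. Everything beyond the constructor arguments `rate_per_sec` and `burst` and the `consume` method name is an assumption (the injected `clock`, the lazy lock, the rejection of oversized requests); it is not the pipeline's actual implementation.

```python
import asyncio
import time

class TokenBucket:
    """Async token bucket: refills at `rate_per_sec`, holds at most `burst` tokens."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self._clock = clock
        self._tokens = float(burst)   # start full so the first burst passes
        self._last = clock()
        self._lock = None             # created lazily, inside the running event loop

    def _refill(self):
        # Credit tokens for the time elapsed since the last refill, up to capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    async def consume(self, amount):
        """Wait until `amount` tokens are available, then deduct them."""
        if amount > self.capacity:
            raise ValueError("request exceeds bucket capacity and could never be served")
        if self._lock is None:
            self._lock = asyncio.Lock()
        # Holding the lock while waiting serializes callers, so a large
        # request is not starved by a stream of smaller ones.
        async with self._lock:
            while True:
                self._refill()
                if self._tokens >= amount:
                    self._tokens -= amount
                    return
                # Sleep roughly as long as the shortfall takes to refill.
                await asyncio.sleep((amount - self._tokens) / self.rate)
```

With `rate_per_sec=60_000` and `burst=200_000`, a first 150K-token call would pass immediately, while a second 150K-token call right behind it would wait about 1.7 seconds for the 100K shortfall to refill. Note that a single estimate larger than `burst` is rejected outright in this sketch, so `burst` has to be at least as large as the biggest expected call.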

After fix: same workload completes in 38 seconds (target was 30) with zero retries. Tail latency dropped from 14 minutes to 41 seconds.

Takeaway

Rate-limit retries without jitter become DDoS attacks on yourself. Always use exponential backoff with jitter. And when fanning out parallel work, model the global rate budget — don't assume the provider's rate limiter will gracefully shape your traffic.