Free LLM tiers are generous enough to ship real things on — until your app gets popular for an afternoon and every other request comes back 429 Too Many Requests. Free tiers have tight limits precisely because they're free: a Flash-class model might give you ~15 requests/minute, Groq's free tier sits around 30 RPM and ~6,000 tokens/minute, and most providers cap both requests and tokens independently.
You can't raise those ceilings without paying. What you can do is stop wasting headroom and degrade gracefully when you hit it. Here's how, in the order you should implement it.
First: understand what you're actually limited on
The single most common mistake is assuming you're limited on requests when you're actually limited on tokens. Most free tiers enforce both, in parallel:
- RPM / RPD — requests per minute / per day.
- TPM / TPD — tokens per minute / per day.
A handful of long-context calls can blow your TPM while you're nowhere near your RPM. So the first move is to know your provider's specific numbers. The FreeAIRouter directory tracks current per-provider free limits across stations — check the exact RPM/TPM/RPD for the provider you're on, like Google AI Studio or Cerebras, before you tune anything. (Limits change often; verify against the provider's own docs too.)
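To see which limit actually binds for your workload, a back-of-the-envelope calculation is usually enough. The sketch below uses the rough Groq-style figures quoted earlier (about 30 RPM and 6,000 TPM); `binding_limit` is a made-up helper name, and the average-tokens-per-request figure is something you would measure for your own traffic.

```python
def binding_limit(rpm: int, tpm: int, avg_tokens_per_request: int) -> tuple[str, float]:
    """Return which limit is hit first and the sustainable requests per minute."""
    # Requests per minute that the token budget alone would allow.
    by_tokens = tpm / avg_tokens_per_request
    if by_tokens < rpm:
        return "TPM", by_tokens
    return "RPM", float(rpm)


# Illustrative figures only; check your provider's current limits.
print(binding_limit(30, 6000, 150))   # short prompts: the request limit binds
print(binding_limit(30, 6000, 1500))  # long prompts: the token limit binds
```

With 1,500-token calls, the token budget allows only 4 requests a minute, far below the 30 the request limit would permit.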
Respect Retry-After — most backoff code gets this wrong
When you do get a 429, the provider usually tells you exactly how long to wait. Anthropic and OpenAI both return a Retry-After header on 429 responses with the precise wait time. Most "exponential backoff" implementations ignore it and guess — which means they retry too early (another 429) or too late (wasted latency).
The correct order of operations on a 429:
- Read Retry-After if present and wait exactly that long.
- If absent, fall back to exponential backoff with jitter: start at 1–2s, double each attempt (1s, 2s, 4s, 8s), and add random jitter so concurrent clients don't retry in lockstep.
- Cap retries at 3–5 attempts, then surface a real error.
- Only retry transient statuses: 429, 500, 502, 503, 504, and 529 (Anthropic's "overloaded" status). Don't retry a 400; you'll just burn quota on a request that will never succeed.
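The steps above can be sketched roughly as follows. This is a minimal illustration, not a drop-in client: `send` is a hypothetical callable that returns `(status, headers, body)`, and `call_with_retries`, `parse_retry_after`, and `backoff_delay` are invented names. It only handles the numeric-seconds form of Retry-After (the header can also be an HTTP date), and the "up to 50% extra" jitter is one arbitrary choice among several reasonable ones.

```python
import random
import time
from typing import Callable, Mapping, Optional

RETRYABLE = {429, 500, 502, 503, 504, 529}


class RetriesExhausted(Exception):
    """Raised when every attempt returned a retryable error."""


def parse_retry_after(headers: Mapping[str, str]) -> Optional[float]:
    """Return Retry-After in seconds, or None if missing or not numeric."""
    value = headers.get("Retry-After") or headers.get("retry-after")
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        return None  # HTTP-date form is not handled in this sketch


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential delay (1s, 2s, 4s, ...) plus up to 50% random jitter."""
    return min(cap, base * 2 ** attempt) * (1 + random.uniform(0, 0.5))


def call_with_retries(
    send: Callable[[], tuple[int, Mapping[str, str], str]],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status < 400:
            return body
        if status not in RETRYABLE:
            # e.g. a 400: retrying would only burn quota.
            raise RuntimeError(f"non-retryable status {status}")
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        delay = parse_retry_after(headers)
        if delay is None:
            delay = backoff_delay(attempt)
        sleep(delay)
    raise RetriesExhausted(f"gave up after {max_attempts} attempts")
```

Passing `sleep` in as a parameter is only there to make the function easy to test without real waiting.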
Why jitter matters: if 50 clients all hit a limit at once and all back off on the identical 1s/2s/4s schedule, they retry simultaneously and re-trigger the limit. Randomized jitter spreads them out, and in practice backoff-with-jitter roughly doubles success rate over constant retries.
One warning: some providers (OpenAI among them) detect aggressive retry loops and extend your backoff or temporarily ban the client. Hammering a 429 makes things worse, not better. Back off politely.
Estimate tokens before you send
Reactive backoff handles limits you hit. Proactive estimation stops you hitting them. Count tokens client-side before each call — tiktoken for OpenAI-style tokenizers, the provider's token-counting endpoint for Claude — so you can:
- Reject or trim a request that would obviously blow your TPM in one shot.
- Pace your sending so you stay under the per-minute token budget.
- Pick the right model — route a small prompt to a cheaper/faster model and reserve the big tier for calls that need it.
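As a rough illustration of the pre-flight check, the sketch below decides whether to send, wait, or reject a request before it leaves your process. The names `estimate_tokens` and `preflight` are made up for this example. The 4-characters-per-token estimate is only a crude rule of thumb for English text; in real code you would swap in tiktoken or your provider's token-counting endpoint. Counting the requested output tokens against the budget is also an assumption, since providers differ on how they meter.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (assumption, English text)."""
    return max(1, len(text) // 4)


def preflight(prompt: str, max_output_tokens: int,
              used_this_minute: int, tpm_limit: int) -> str:
    """Return 'send', 'wait', or 'reject' for a request about to go out."""
    needed = estimate_tokens(prompt) + max_output_tokens
    if needed > tpm_limit:
        # Could never fit in one minute's budget: trim it or refuse it.
        return "reject"
    if used_this_minute + needed > tpm_limit:
        # Fits in principle, but not in what is left of this minute.
        return "wait"
    return "send"


print(preflight("x" * 400, 100, used_this_minute=0, tpm_limit=6000))
print(preflight("x" * 400, 100, used_this_minute=5900, tpm_limit=6000))
print(preflight("x" * 40000, 100, used_this_minute=0, tpm_limit=6000))
```

The same estimate can drive model routing: a prompt below some threshold goes to the smaller model, and anything above it goes to the bigger tier.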
Use a dual token bucket, not a textbook one
The classic single token-bucket rate limiter isn't enough here, because you're constrained on two axes. Run two buckets in parallel: one metering requests (RPM) and one metering tokens (TPM). A call only proceeds when both buckets have capacity. This is what keeps you from passing the request check while silently overrunning the token limit.
For per-day limits (RPD/TPD), track a rolling daily counter alongside the per-minute buckets and stop issuing new work — or shed it to a fallback — when the daily ceiling is near.
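A minimal sketch of the two per-minute buckets might look like this. `Bucket` and `DualBucket` are invented names, and the continuous-refill model is an assumption: some providers use fixed windows instead, so treat the numbers as conservative guidance rather than an exact mirror of the server. The daily counter is left out to keep the example short.

```python
import time
from typing import Callable


class Bucket:
    """Token bucket that refills continuously at `capacity` units per 60 seconds."""

    def __init__(self, capacity: float, now: float) -> None:
        self.capacity = capacity
        self.level = capacity
        self.updated = now

    def refill(self, now: float) -> None:
        elapsed = now - self.updated
        self.level = min(self.capacity, self.level + elapsed * self.capacity / 60.0)
        self.updated = now


class DualBucket:
    """Admit a call only when both the request and the token bucket have room."""

    def __init__(self, rpm: int, tpm: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        now = clock()
        self.requests = Bucket(rpm, now)
        self.tokens = Bucket(tpm, now)

    def try_acquire(self, tokens: int) -> bool:
        now = self.clock()
        self.requests.refill(now)
        self.tokens.refill(now)
        if self.requests.level >= 1 and self.tokens.level >= tokens:
            # Deduct from both buckets together, or from neither.
            self.requests.level -= 1
            self.tokens.level -= tokens
            return True
        return False
```

A caller would check `limiter.try_acquire(estimated_tokens)` before each upstream call and either wait or shed the request to a fallback when it returns `False`.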
Coalesce and cache duplicate work
A surprising fraction of free-tier 429s come from doing the same work twice:
- Request coalescing — if N callers ask for the same completion at the same time (same prompt, same params), issue one upstream call and fan the result out to all of them.
- Caching — cache identical or near-identical prompts. For Claude specifically, provider-side prompt caching also cuts the token cost of repeated system prompts, which directly relieves TPM pressure.
Every request you don't send is a request that can't be rate-limited.
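Here is one way coalescing and a simple exact-match cache could fit together, sketched with asyncio. `Coalescer` and its `fetch` argument are hypothetical: `fetch` stands in for whatever async function actually calls your provider. The cache here is unbounded and never expires, which is fine for a sketch but not for production, and caching only makes sense when you are happy to reuse an answer (for example at temperature 0).

```python
import asyncio
import hashlib
import json
from typing import Awaitable, Callable


class Coalescer:
    """Share one upstream call among concurrent identical requests, then cache it."""

    def __init__(self, fetch: Callable[[str, dict], Awaitable[str]]) -> None:
        self._fetch = fetch
        self._in_flight: dict[str, asyncio.Future] = {}
        self._cache: dict[str, str] = {}  # unbounded; add eviction in real code

    @staticmethod
    def _key(prompt: str, params: dict) -> str:
        raw = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    async def complete(self, prompt: str, params: dict) -> str:
        key = self._key(prompt, params)
        if key in self._cache:
            return self._cache[key]
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the single upstream request.
            task = asyncio.ensure_future(self._fetch(prompt, params))
            self._in_flight[key] = task
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        # shield() keeps one cancelled caller from cancelling the shared call.
        result = await asyncio.shield(task)
        self._cache[key] = result
        return result
```

If five handlers call `complete("hi", {"temperature": 0})` at the same moment, only one request goes upstream and all five receive its result; a sixth call a minute later is served from the cache.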
Build a fallback ladder
When you've exhausted one provider's free quota, the most robust pattern is to fail over to another provider rather than erroring out. A fallback ladder might look like:
- Primary free tier (e.g. Gemini Flash).
- Secondary free tier on a different provider (e.g. a Groq-hosted Llama model).
- A paid call as the last resort for critical requests.
Because providers meter independently, spreading load across two or three free tiers multiplies your effective free throughput. The FreeAIRouter directory is useful here for picking secondary stations with compatible models, and aggregators like OpenRouter can implement a fallback ladder behind a single endpoint.
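A ladder like this can be expressed in a few lines. In the sketch below, `Rung`, `QuotaExhausted`, and `complete_with_fallback` are all invented names, and each rung's `call` is a placeholder for your own wrapper around a provider, which is assumed to raise `QuotaExhausted` once its own retries and quota are spent.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class QuotaExhausted(Exception):
    """Hypothetical signal that a provider's free quota is used up."""


@dataclass
class Rung:
    name: str
    call: Callable[[str], str]  # your wrapper around one provider
    paid: bool = False


def complete_with_fallback(prompt: str, ladder: Sequence[Rung],
                           critical: bool = False) -> tuple[str, str]:
    """Try each rung in order; paid rungs are used only for critical requests."""
    errors = []
    for rung in ladder:
        if rung.paid and not critical:
            continue
        try:
            return rung.name, rung.call(prompt)
        except QuotaExhausted as exc:
            errors.append(f"{rung.name}: {exc}")
    raise RuntimeError("all providers exhausted: " + "; ".join(errors))
```

Returning the rung name alongside the completion makes it easy to log how often you are falling through to the second or third tier.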
The full stack, in order
A production-grade free-tier client layers all of this:
pre-flight token estimation → dual token bucket → request coalescing → fallback ladder → backoff with jitter (respecting Retry-After)
Implement them in that order. Estimation and bucketing prevent most 429s; coalescing kills duplicate load; the fallback ladder absorbs spikes; and correct backoff cleans up whatever still slips through.
For the bigger picture of which free tiers to combine, see which LLMs have a free tier in 2026.
