April 8, 2026 · 8 min read · Technical Deep-Dive · API · Python
API Rate Limits Will Wreck Your AI Project (Here's How to Handle Them)
Your automation is running smoothly. Then at 2 PM on a Tuesday, everything stops: the Claude API is rejecting requests with 429 Too Many Requests.
Your automation was working fine yesterday at the same volume. What changed? Maybe nothing. Maybe Claude had brief capacity constraints. Maybe you spiked unexpectedly.
The point: rate limits will happen. And if you're not prepared, they'll break your system when you can't afford downtime.
I'm going to walk through the architecture that survives rate limits—and the code patterns that make it work.
Why Rate Limits Exist
Capacity protection. The Claude API has finite compute. Anthropic limits per-user request volume to ensure nobody monopolizes capacity.
Cost control. Buggy code can spend $50K in 30 seconds if you're not careful. Rate limits give you time to notice and kill runaway jobs.
Fair use. Without limits, one customer could starve everyone else.
For Claude, the limits are typically:
Requests per minute (RPM): Varies by tier. Free tier might be 100 RPM. Paid tiers range from 1,000 to 100,000+ RPM.
Tokens per minute (TPM): Your actual token consumption. Free tier: 50K TPM. Paid: scales with plan.
Both matter. You could be under the RPM limit but over the TPM limit.
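To see how the two limits interact, here's some quick arithmetic with illustrative numbers (the limits and token size below are hypothetical, not actual Anthropic tier values):

```python
# Hypothetical plan for illustration: 1,000 RPM and 50,000 TPM.
RPM_LIMIT = 1_000
TPM_LIMIT = 50_000

avg_tokens_per_request = 1_200  # assumed prompt + response size

max_by_rpm = RPM_LIMIT                            # 1,000 requests/minute
max_by_tpm = TPM_LIMIT // avg_tokens_per_request  # 41 requests/minute

# The effective ceiling is whichever limit binds first:
effective_limit = min(max_by_rpm, max_by_tpm)
print(effective_limit)  # 41
```

At this request size you'd hit the token limit at roughly 4% of your nominal request allowance, which is why TPM is usually the limit that surprises people.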
The Naive Approach (That Fails)
Most people do this:
for invoice in invoices:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[{"role": "user", "content": f"Extract data: {invoice}"}]
    )
    save_to_db(response)
If you hit 500 invoices per minute and you're over the RPM limit, this crashes with a 429 error. The entire batch fails. The user sees "something went wrong." You're paged at 2 AM.
The Production Approach: Exponential Backoff with Jitter
Here's what actually works:
import time
import random
from anthropic import Anthropic, RateLimitError
client = Anthropic()
def call_claude_with_retry(prompt, max_retries=5):
    """Call Claude with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=100,
                messages=[{"role": "user", "content": prompt}]
            )
            return response
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Give up on the final attempt
            # Exponential backoff: 2^attempt seconds, plus random jitter
            backoff = 2 ** attempt
            jitter = random.uniform(0, 1)
            wait_time = backoff + jitter
            print(f"Rate limited. Retrying in {wait_time:.2f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait_time)
    return None
What's happening here:
Try the request. If it succeeds, great.
Catch rate limits. If you get a 429, don't fail immediately.
Back off exponentially. The base wait doubles on each attempt: 1 second, then 2, then 4, then 8. This gives the API time to recover.
Add jitter. Without jitter, 100 clients all retry at the same time and hammer the API again. Jitter spreads retries randomly, preventing thundering herd.
With this pattern, temporary rate limit spikes are absorbed transparently. The user never sees an error.
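The schedule the code above produces can be sketched in isolation (a minimal sketch, reusing the same 2^attempt-plus-jitter formula; the exact waits are random):

```python
import random

def backoff_schedule(max_retries=5, rng=random):
    """Wait time before each retry: 2^attempt seconds plus up to 1s of jitter."""
    return [2 ** attempt + rng.uniform(0, 1) for attempt in range(max_retries - 1)]

# Windows double each attempt (1-2s, 2-3s, 4-5s, 8-9s), while the
# jitter scatters clients within each window instead of letting them
# all retry at the same instant.
for attempt, wait in enumerate(backoff_schedule(), start=1):
    print(f"retry {attempt}: wait {wait:.2f}s")
```

Notice that with 100 clients, each one lands at a different random point inside its window, which is exactly what breaks up the thundering herd.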
Token Budget: The Real Limit
Requests per minute is easy to track. Tokens per minute is sneakier.
A "token" is roughly 4 characters. A 500-word document might be 800 tokens. If you're processing invoices that average 1,000 tokens, and you have a 50K TPM limit, you can do 50 extractions per minute maximum. That's not about request volume; it's about token volume.
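That rule of thumb is enough for budgeting. Here's a rough estimator (a hypothetical helper using the 4-chars-per-token approximation; real tokenizers count differently, so treat the output as a ceiling-planning number, not an exact count):

```python
def estimate_tokens(text, response_budget=100):
    """Very rough estimate: ~4 characters per token, plus expected response tokens."""
    return len(text) // 4 + response_budget

invoice = "x" * 4_000                # a ~4,000-character invoice
per_call = estimate_tokens(invoice)  # 1,100 tokens
print(per_call, 50_000 // per_call)  # 1100 tokens -> 45 calls/minute at 50K TPM
```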
import time

class TokenBucket:
    """Rate limit by tokens, not requests."""

    def __init__(self, capacity, refill_rate_per_second):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate_per_second
        self.last_refill = time.time()

    def consume(self, tokens):
        """Try to consume tokens. Return False if not enough capacity."""
        now = time.time()
        elapsed = now - self.last_refill
        refill = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def wait_until_available(self, tokens):
        """Block until tokens are available."""
        while not self.consume(tokens):
            time.sleep(0.1)

# Usage
bucket = TokenBucket(capacity=50_000, refill_rate_per_second=50_000 / 60)

for invoice in invoices:
    prompt = f"Extract: {invoice}"
    # Estimate tokens (rough: 4 chars per token)
    estimated_tokens = len(prompt) // 4 + 100  # prompt + response
    bucket.wait_until_available(estimated_tokens)
    response = extract_invoice(invoice, client)
This ensures you never exceed your TPM limit. You're self-throttling at 100% utilization, not crashing at 101%.
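The refill math also tells you exactly how long a blocked request waits. A quick back-of-envelope check (same 50K TPM numbers as the bucket above):

```python
capacity = 50_000
refill_rate = 50_000 / 60  # tokens per second, matching a 50K TPM limit

# If the bucket is empty, a 1,000-token request waits for the refill:
tokens_needed = 1_000
wait_seconds = tokens_needed / refill_rate
print(f"{wait_seconds:.1f}s")  # 1.2s
```

So even at full saturation, individual requests see sub-second-to-seconds delays rather than hard failures.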
Queue-Based Architecture (For High Scale)
If you're processing thousands of documents per hour, a queue pattern is better:
import json
import time

import boto3
from concurrent.futures import ThreadPoolExecutor

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/invoices"

def producer(invoices):
    """Put invoices on the queue."""
    for invoice in invoices:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"invoice_text": invoice})
        )

def consumer(token_bucket):
    """Pull from the queue and process with rate limiting."""
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10
        )
        if "Messages" not in response:
            time.sleep(1)
            continue
        for message in response["Messages"]:
            body = json.loads(message["Body"])
            invoice_text = body["invoice_text"]

            # Wait for token budget (note: guard TokenBucket with a
            # threading.Lock if workers share one instance)
            estimated_tokens = len(invoice_text) // 4 + 100
            token_bucket.wait_until_available(estimated_tokens)

            # Process
            extracted = extract_invoice(invoice_text, client)
            save_to_db(extracted)

            # Delete from queue
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=message["ReceiptHandle"]
            )

# Run the consumer with multiple workers
token_bucket = TokenBucket(capacity=50_000, refill_rate_per_second=50_000 / 60)
with ThreadPoolExecutor(max_workers=3) as executor:
    for _ in range(3):
        executor.submit(consumer, token_bucket)
The queue decouples producers (your app submitting invoices) from consumers (workers processing them). If rate limits hit, the queue just backs up. Invoices don't fail; they wait. The user keeps using the app.
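One practical consequence: since the token bucket caps throughput, you can estimate how long a backlog takes to clear. A sketch, using the same illustrative 50K TPM numbers (adding more workers doesn't help once the TPM limit is saturated):

```python
def drain_time_minutes(backlog_messages, avg_tokens_per_message, tpm_limit):
    """Estimate how long a backed-up queue takes to drain when the
    shared token bucket caps throughput."""
    messages_per_minute = tpm_limit / avg_tokens_per_message
    return backlog_messages / messages_per_minute

# 5,000 queued invoices at ~1,100 tokens each under a 50K TPM limit:
print(f"{drain_time_minutes(5_000, 1_100, 50_000):.0f} minutes")  # 110 minutes
```

If that drain time is too long for your SLA, the fix is a higher rate-limit tier, not more consumers.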