The Ultimate Guide to Handling AI API Rate Limits (OpenAI & Claude)
November 27, 2025
You're building with AI APIs. Everything works perfectly in development. Then you deploy to production and get hit with Error 429: Too Many Requests.
Rate limits are the invisible wall between your app and AI providers. Hit them wrong, and your users get errors. Handle them right, and your app scales smoothly.
This guide covers five battle-tested strategies for handling rate limits in OpenAI and Claude APIs, complete with working code you can copy and paste.
Understanding AI API Rate Limits
What Are Rate Limits?
Rate limits control how many requests you can make to an API within a specific time window. AI providers measure limits in three ways:
- RPM (Requests Per Minute): Total number of API calls
- TPM (Tokens Per Minute): Total tokens processed (input + output)
- RPD (Requests Per Day): Daily request cap (some providers)
You'll hit a rate limit if you exceed any of these thresholds. A single large request can consume your entire TPM allowance, even if you're nowhere near your RPM limit.
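To get a feel for how quickly TPM disappears, here is a small sketch. The 4-characters-per-token heuristic and the `TPM_LIMIT` value are assumptions for illustration (real counts come from the provider's tokenizer, and your actual limit depends on your tier):

```javascript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic, not the provider's real tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical Tier 1 budget of 500,000 tokens per minute
const TPM_LIMIT = 500000;

function tpmShare(prompt, expectedOutputTokens = 1000) {
  const inputTokens = estimateTokens(prompt);
  const total = inputTokens + expectedOutputTokens; // TPM counts input + output
  return (total / TPM_LIMIT) * 100; // percent of the per-minute budget
}

const bigPrompt = 'x'.repeat(400000); // ~100K tokens of input
console.log(`One request uses ~${tpmShare(bigPrompt).toFixed(1)}% of TPM`);
```

A handful of requests like that, and the TPM window is gone while your RPM counter has barely moved.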
Important: The rate limit values below are approximations based on third-party sources and recent reports. API providers change limits frequently based on model, tier, region, and usage patterns. Always verify your specific limits in your provider's dashboard: OpenAI or Anthropic.
OpenAI Rate Limits by Tier
OpenAI automatically graduates you to higher tiers based on spending. These are approximate limits for GPT-5 (November 2025):
| Tier | RPM | TPM | Qualification |
|---|---|---|---|
| Free | 3 | Limited | $0 |
| Tier 1 | 500 | 500K | $5+ spent |
| Tier 2 | 500 | 500K | $50+ spent |
| Tier 3 | 5,000 | 2M | $100+ spent |
| Tier 4 | 10,000 | 5M | $250+ spent |
| Tier 5 | 15,000 | 40M | $1,000+ spent |
Note: Limits vary significantly by model. GPT-5-mini has higher TPM limits than GPT-5. The o1 reasoning model has different limits entirely. Always verify your specific limits in the OpenAI dashboard.
Claude (Anthropic) Rate Limits by Tier
Claude uses a tier system similar to OpenAI, but with a key difference: separate limits for input and output tokens. These are approximate values for Sonnet 4.x (November 2025):
| Tier | RPM | ITPM | OTPM | Qualification |
|---|---|---|---|---|
| Tier 1 | 50 | 30K | 8K | $5+ spent |
| Tier 2 | 1,000 | 450K | 90K | $40+ spent |
| Tier 3 | 2,000 | 800K | 160K | $200+ spent |
| Tier 4 | 4,000 | 2M | 400K | $400+ spent |
Note: ITPM = Input Tokens Per Minute, OTPM = Output Tokens Per Minute. Claude separates these to give you more fine-grained control over rate limits.
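Because input and output are metered separately, a request can be blocked by either budget. A minimal sketch of the idea, using the approximate Tier 1 values from the table above (verify your real limits in the Anthropic console):

```javascript
// Claude-style budgets: a request must fit under BOTH per-minute limits.
// The limit values mirror the approximate Tier 1 row above.
const limits = { itpm: 30000, otpm: 8000 };

function fitsBudget(inputTokens, maxOutputTokens, used = { input: 0, output: 0 }) {
  return (
    used.input + inputTokens <= limits.itpm &&
    used.output + maxOutputTokens <= limits.otpm
  );
}

console.log(fitsBudget(25000, 1024)); // fits: both budgets are untouched
console.log(fitsBudget(25000, 1024, { input: 0, output: 7500 })); // blocked by OTPM alone
```

Note that a prompt-heavy workload can exhaust ITPM while OTPM sits idle, and a generation-heavy workload can do the reverse.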
Reading Rate Limit Headers
Both providers return rate limit information in response headers. You can use these to avoid hitting limits in the first place.
OpenAI Headers
```
// Available in every OpenAI API response
x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 499
x-ratelimit-reset-requests: 120ms
x-ratelimit-limit-tokens: 500000
x-ratelimit-remaining-tokens: 495000
x-ratelimit-reset-tokens: 8ms
```
Claude Headers
```
// Available in every Claude API response
anthropic-ratelimit-requests-limit: 50
anthropic-ratelimit-requests-remaining: 49
anthropic-ratelimit-requests-reset: 2025-11-27T10:00:00Z
anthropic-ratelimit-input-tokens-limit: 30000
anthropic-ratelimit-input-tokens-remaining: 28500
anthropic-ratelimit-output-tokens-limit: 8000
anthropic-ratelimit-output-tokens-remaining: 7800
retry-after: 60
```
Strategy 1: Exponential Backoff with Retry
When to use it: You occasionally hit rate limits and need automatic retry logic.
Exponential backoff is the simplest strategy. When you hit a 429 error, wait a bit, then try again. Each retry doubles the wait time.
How It Works
- Try the API request
- If you get a 429 error, wait 1 second
- Try again. If still 429, wait 2 seconds
- Try again. If still 429, wait 4 seconds
- Continue doubling until success or max retries
Generic Implementation
```javascript
async function retryWithExponentialBackoff(fn, maxRetries = 5, baseDelay = 1000) {
  let lastError;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const result = await fn();
      return result;
    } catch (error) {
      lastError = error;

      // Check if it's a rate limit error
      const isRateLimitError =
        error.status === 429 ||
        error.response?.status === 429 ||
        error.message?.includes('rate limit');

      if (!isRateLimitError) {
        // If it's not a rate limit error, throw immediately
        throw error;
      }

      // Don't wait after the last attempt
      if (attempt === maxRetries - 1) {
        break;
      }

      // Calculate delay: baseDelay * 2^attempt
      // Attempt 0: 1s, Attempt 1: 2s, Attempt 2: 4s, Attempt 3: 8s, Attempt 4: 16s
      const delay = baseDelay * Math.pow(2, attempt);

      console.log(`Rate limit hit. Retrying in ${delay}ms (attempt ${attempt + 1}/${maxRetries})`);

      await sleep(delay);
    }
  }

  // If we've exhausted all retries, throw the last error
  throw new Error(`Failed after ${maxRetries} retries: ${lastError.message}`);
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```
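One refinement worth knowing about: when many clients hit a rate limit at the same moment, pure doubling makes them all retry in lockstep. A common fix is "full jitter", waiting a random duration up to the exponential cap. This sketch is an optional variant, not part of the implementation above:

```javascript
// "Full jitter" backoff: wait a random duration between 0 and the
// exponential cap, so concurrent clients don't retry simultaneously.
function jitteredDelay(attempt, baseDelay = 1000, maxDelay = 30000) {
  const cap = Math.min(maxDelay, baseDelay * Math.pow(2, attempt));
  return Math.floor(Math.random() * cap);
}

// Drop-in replacement for the delay calculation above:
//   const delay = jitteredDelay(attempt, baseDelay);
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait up to ${Math.min(30000, 1000 * 2 ** attempt)}ms`);
}
```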
OpenAI with Retry
```javascript
async function callOpenAI(prompt, options = {}) {
  const {
    apiKey = process.env.OPENAI_API_KEY,
    model = 'gpt-4',
    maxRetries = 5,
    baseDelay = 1000,
  } = options;

  return retryWithExponentialBackoff(
    async () => {
      const response = await fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
          model,
          messages: [{ role: 'user', content: prompt }],
        }),
      });

      // Read rate limit headers
      const rateLimitInfo = {
        requestsLimit: response.headers.get('x-ratelimit-limit-requests'),
        requestsRemaining: response.headers.get('x-ratelimit-remaining-requests'),
        tokensLimit: response.headers.get('x-ratelimit-limit-tokens'),
        tokensRemaining: response.headers.get('x-ratelimit-remaining-tokens'),
        resetTime: response.headers.get('x-ratelimit-reset-requests'),
      };

      console.log('Rate limit info:', rateLimitInfo);

      if (!response.ok) {
        const errorData = await response.json().catch(() => ({}));
        const error = new Error(errorData.error?.message || 'OpenAI API request failed');
        error.status = response.status;
        throw error;
      }

      const data = await response.json();
      return data.choices[0].message.content;
    },
    maxRetries,
    baseDelay
  );
}

// Usage
const response = await callOpenAI('Explain rate limiting in one sentence.');
console.log(response);
```
Claude with Retry
```javascript
async function callClaude(prompt, options = {}) {
  const {
    apiKey = process.env.ANTHROPIC_API_KEY,
    model = 'claude-sonnet-4-20250514',
    maxTokens = 1024,
    maxRetries = 5,
    baseDelay = 1000,
  } = options;

  return retryWithExponentialBackoff(
    async () => {
      const response = await fetch('https://api.anthropic.com/v1/messages', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'x-api-key': apiKey,
          'anthropic-version': '2023-06-01',
        },
        body: JSON.stringify({
          model,
          max_tokens: maxTokens,
          messages: [{ role: 'user', content: prompt }],
        }),
      });

      // Read rate limit headers
      // Note: Claude separates input and output token limits
      const rateLimitInfo = {
        requestsLimit: response.headers.get('anthropic-ratelimit-requests-limit'),
        requestsRemaining: response.headers.get('anthropic-ratelimit-requests-remaining'),
        inputTokensLimit: response.headers.get('anthropic-ratelimit-input-tokens-limit'),
        inputTokensRemaining: response.headers.get('anthropic-ratelimit-input-tokens-remaining'),
        outputTokensLimit: response.headers.get('anthropic-ratelimit-output-tokens-limit'),
        outputTokensRemaining: response.headers.get('anthropic-ratelimit-output-tokens-remaining'),
        retryAfter: response.headers.get('retry-after'),
      };

      console.log('Rate limit info:', rateLimitInfo);

      if (!response.ok) {
        const errorData = await response.json().catch(() => ({}));
        const error = new Error(errorData.error?.message || 'Claude API request failed');
        error.status = response.status;
        throw error;
      }

      const data = await response.json();
      return data.content[0].text;
    },
    maxRetries,
    baseDelay
  );
}

// Usage
const response = await callClaude('Explain rate limiting in one sentence.');
console.log(response);
```
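When a 429 response includes a `retry-after` header (in seconds, as in the Claude header table above), the server's own hint is usually a better wait time than a computed backoff. A small sketch of that preference; `parseRetryAfter` is a hypothetical helper, not part of any SDK:

```javascript
// Prefer the server's retry-after hint over a computed backoff delay.
// The header value is in seconds; our delays are in milliseconds.
function parseRetryAfter(headerValue, fallbackMs) {
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds) && seconds > 0) {
    return seconds * 1000;
  }
  return fallbackMs; // header missing or unparseable: use the backoff delay
}

console.log(parseRetryAfter('60', 2000)); // header wins: wait 60000ms
console.log(parseRetryAfter(null, 2000)); // no header: wait the backoff 2000ms
```

Inside a retry loop, you would read `response.headers.get('retry-after')` on the failed attempt and pass it as the first argument.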
Strategy 2: Request Queue
When to use it: You know your rate limits and want to stay under them proactively.
Instead of reacting to 429 errors, a request queue prevents them by controlling how many requests you send. Think of it as a bouncer for your API calls.
Queue Implementation
```javascript
// Note: This implementation is single-threaded (fine for Node.js/browser)
// If porting to multi-threaded environments, add proper locking mechanisms
class RequestQueue {
  constructor(requestsPerMinute) {
    this.requestsPerMinute = requestsPerMinute;
    this.queue = [];
    this.processing = false;
    this.requestCount = 0;
    this.windowStart = Date.now();
  }

  async add(fn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject });
      this.processQueue();
    });
  }

  async processQueue() {
    if (this.processing || this.queue.length === 0) {
      return;
    }

    this.processing = true;

    while (this.queue.length > 0) {
      // Reset the window if a minute has passed
      const now = Date.now();
      if (now - this.windowStart >= 60000) {
        this.requestCount = 0;
        this.windowStart = now;
      }

      // If we've hit the rate limit, wait until the window resets
      if (this.requestCount >= this.requestsPerMinute) {
        const timeToWait = 60000 - (now - this.windowStart);
        console.log(`Rate limit reached. Waiting ${timeToWait}ms until window resets...`);
        await this.sleep(timeToWait);
        this.requestCount = 0;
        this.windowStart = Date.now();
      }

      // Process the next request
      const { fn, resolve, reject } = this.queue.shift();

      try {
        const result = await fn();
        this.requestCount++;
        resolve(result);
      } catch (error) {
        reject(error);
      }

      // Add a small delay between requests to spread them out
      const delayBetweenRequests = Math.floor(60000 / this.requestsPerMinute);
      await this.sleep(delayBetweenRequests);
    }

    this.processing = false;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  getStats() {
    return {
      queueLength: this.queue.length,
      requestsInWindow: this.requestCount,
      timeUntilReset: 60000 - (Date.now() - this.windowStart),
    };
  }
}
```
Using the Queue
```javascript
// For OpenAI Tier 1 (500 RPM)
const queue = new RequestQueue(500);

const prompts = [
  'What is JavaScript?',
  'What is Python?',
  'What is Go?',
  'What is Rust?',
  'What is TypeScript?',
];

// Queue all requests
const promises = prompts.map(prompt =>
  queue.add(async () => {
    console.log(`Processing: ${prompt}`);
    return await callOpenAI(prompt);
  })
);

// Wait for all to complete
const results = await Promise.all(promises);
console.log('All requests completed:', results);
```
Strategy 3: Token Bucket Algorithm
When to use it: You need sophisticated rate limiting with burst support.
Token bucket is the most flexible strategy. It allows burst traffic while maintaining an average rate. Anthropic officially uses this algorithm for Claude's rate limiting, where "your capacity is continuously replenished up to your maximum limit, rather than being reset at fixed intervals."
Token Bucket Implementation
```javascript
class TokenBucket {
  constructor(capacity, refillRate, refillInterval = 1000) {
    this.capacity = capacity;             // Maximum tokens
    this.tokens = capacity;               // Current tokens
    this.refillRate = refillRate;         // Tokens added per interval
    this.refillInterval = refillInterval; // Milliseconds between refills
    this.lastRefill = Date.now();

    // Start the refill process
    this.startRefilling();
  }

  startRefilling() {
    this.refillTimer = setInterval(() => {
      this.refill();
    }, this.refillInterval);
  }

  refill() {
    const now = Date.now();
    const timePassed = now - this.lastRefill;
    const tokensToAdd = (timePassed / this.refillInterval) * this.refillRate;

    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }

  async consume(tokens = 1) {
    // Refill before checking
    this.refill();

    // If we don't have enough tokens, wait until we do
    while (this.tokens < tokens) {
      const tokensNeeded = tokens - this.tokens;
      const timeToWait = (tokensNeeded / this.refillRate) * this.refillInterval;

      console.log(`Not enough tokens. Waiting ${Math.ceil(timeToWait)}ms...`);
      await this.sleep(timeToWait);
      this.refill();
    }

    this.tokens -= tokens;
    return true;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  getStatus() {
    this.refill();
    return {
      tokens: this.tokens,
      capacity: this.capacity,
      percentage: (this.tokens / this.capacity) * 100,
    };
  }

  stop() {
    if (this.refillTimer) {
      clearInterval(this.refillTimer);
    }
  }
}

// Usage: OpenAI Tier 1 (500 RPM)
const bucket = new TokenBucket(500, 500, 60000);

async function makeRequest(prompt) {
  await bucket.consume(1);
  return await callOpenAI(prompt);
}

const result = await makeRequest('What is JavaScript?');
console.log(result);

// IMPORTANT: Call stop() to prevent memory leaks in long-running processes
bucket.stop();
```
Strategy 4: Smart Caching
When to use it: Users might ask similar questions.
The best way to avoid rate limits is to not make the request at all. Caching identical prompts can save you money and improve response times.
Cache Implementation
```javascript
const crypto = require('crypto');

class PromptCache {
  constructor(ttl = 3600000) {
    // Default TTL: 1 hour
    this.cache = new Map();
    this.ttl = ttl;
  }

  // Generate a cache key from the prompt and options
  generateKey(prompt, options = {}) {
    const data = JSON.stringify({ prompt, ...options });
    return crypto.createHash('sha256').update(data).digest('hex');
  }

  // Get from cache if exists and not expired
  get(prompt, options = {}) {
    const key = this.generateKey(prompt, options);
    const cached = this.cache.get(key);

    if (!cached) {
      return null;
    }

    // Check if expired
    if (Date.now() > cached.expiresAt) {
      this.cache.delete(key);
      return null;
    }

    console.log('Cache hit:', key.substring(0, 8));
    return cached.value;
  }

  // Store in cache with expiration
  set(prompt, value, options = {}) {
    const key = this.generateKey(prompt, options);
    this.cache.set(key, {
      value,
      expiresAt: Date.now() + this.ttl,
      createdAt: Date.now(),
    });
    console.log('Cache set:', key.substring(0, 8));
  }

  // Clear expired entries
  cleanup() {
    const now = Date.now();
    let cleaned = 0;

    for (const [key, value] of this.cache.entries()) {
      if (now > value.expiresAt) {
        this.cache.delete(key);
        cleaned++;
      }
    }

    console.log(`Cleaned ${cleaned} expired cache entries`);
    return cleaned;
  }

  getStats() {
    return {
      size: this.cache.size,
      ttl: this.ttl,
    };
  }

  clear() {
    this.cache.clear();
  }
}
```
Using the Cache
```javascript
const cache = new PromptCache(60000); // 1 minute TTL

async function callAIWithCache(prompt) {
  // Check cache first
  const cached = cache.get(prompt);
  if (cached !== null) {
    return cached;
  }

  // If not in cache, make the API call
  const result = await callOpenAI(prompt);

  // Store in cache
  cache.set(prompt, result);

  return result;
}

// First call - hits API
const result1 = await callAIWithCache('What is JavaScript?');

// Second call with same prompt - uses cache
const result2 = await callAIWithCache('What is JavaScript?');

console.log('Cache stats:', cache.getStats());
```
Production Best Practices
Combine Multiple Strategies
The best production systems combine multiple strategies:
```javascript
const cache = new PromptCache(3600000); // 1 hour
const queue = new RequestQueue(500);    // 500 RPM

async function robustAICall(prompt) {
  // Check cache first
  const cached = cache.get(prompt);
  if (cached) return cached;

  // Queue the request to respect rate limits
  const result = await queue.add(async () => {
    // Use exponential backoff for the actual call
    return await callOpenAI(prompt, { maxRetries: 3 });
  });

  // Cache the result
  cache.set(prompt, result);

  return result;
}
```
User Experience
Don't leave users in the dark when rate limits slow things down:
```javascript
async function callAIWithFeedback(prompt, onProgress) {
  const queueStats = queue.getStats();

  if (queueStats.queueLength > 0) {
    onProgress({
      status: 'queued',
      position: queueStats.queueLength,
      message: `Request queued. Position: ${queueStats.queueLength}`,
    });
  }

  onProgress({ status: 'processing', message: 'Sending request...' });

  const result = await robustAICall(prompt);

  onProgress({ status: 'complete', message: 'Done!' });

  return result;
}
```
Monitor Rate Limit Usage
Track rate limit headers to stay proactive:
```javascript
function logRateLimitWarnings(headers) {
  const remaining = parseInt(headers.get('x-ratelimit-remaining-requests'), 10);
  const limit = parseInt(headers.get('x-ratelimit-limit-requests'), 10);

  // Headers can be missing on some responses; skip if we can't parse them
  if (!Number.isFinite(remaining) || !Number.isFinite(limit) || limit === 0) {
    return;
  }

  const percentUsed = ((limit - remaining) / limit) * 100;

  if (percentUsed > 80) {
    console.warn(`WARNING: ${percentUsed.toFixed(1)}% of rate limit used`);
  }

  if (percentUsed > 95) {
    console.error('CRITICAL: Approaching rate limit!');
    // Send alert to monitoring service
  }
}
```
Multi-Provider Fallback
If one provider hits rate limits, fall back to another:
```javascript
async function callAIWithFallback(prompt) {
  try {
    return await callOpenAI(prompt, { maxRetries: 1 });
  } catch (error) {
    if (error.status === 429) {
      console.log('OpenAI rate limit hit, trying Claude...');
      return await callClaude(prompt, { maxRetries: 1 });
    }
    throw error;
  }
}
```
Conclusion
Rate limits are a fact of life when building with AI APIs. The strategies in this guide give you the tools to handle them gracefully. Here's when to use each approach:
| Strategy | Complexity | Best For | When to Add |
|---|---|---|---|
| Exponential Backoff | Low | Occasional rate limit hits | Day 1 (always include) |
| Caching | Low | Repeated or similar prompts | Day 1 (easy wins) |
| Request Queue | Medium | Predictable, steady traffic | When you hit limits regularly |
| Token Bucket | High | Burst traffic + sustained rate | High-scale production systems |
| Multi-Provider Fallback | Medium | Mission-critical applications | When downtime isn't acceptable |
Start with exponential backoff and caching. As your usage grows, add a request queue. For high-scale production systems, consider token buckets and multi-provider fallback.
The code examples in this guide are battle-tested and ready to use. The strategies work across providers, even as specific rate limit values change.
For the most accurate rate limits, always check:
- OpenAI: the rate limits page in your OpenAI dashboard
- Anthropic: the rate limits page in the Anthropic Console