The Ultimate Guide to Handling AI API Rate Limits (OpenAI & Claude)
November 27, 2025
You're building with AI APIs. Everything works perfectly in development. Then you deploy to production and get hit with Error 429: Too Many Requests.
Rate limits are the invisible wall between your app and AI providers. Hit them wrong, and your users get errors. Handle them right, and your app scales smoothly.
This guide covers five battle-tested strategies for handling rate limits in OpenAI and Claude APIs, complete with working code you can copy and paste.
Understanding AI API Rate Limits
What Are Rate Limits?
Rate limits control how many requests you can make to an API within a specific time window. AI providers measure limits in three ways:
- RPM (Requests Per Minute): Total number of API calls
- TPM (Tokens Per Minute): Total tokens processed (input + output)
- RPD (Requests Per Day): Daily request cap (some providers)
You'll hit a rate limit if you exceed any of these thresholds. A single large request can consume your entire TPM allowance, even if you're nowhere near your RPM limit.
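To get a feel for how quickly TPM disappears, here is a small sketch. The 4-characters-per-token heuristic and the `TPM_LIMIT` value are assumptions for illustration (real counts come from the provider's tokenizer, and your actual limit depends on your tier):

```javascript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic, not the provider's real tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical Tier 1 budget of 500,000 tokens per minute
const TPM_LIMIT = 500000;

function tpmShare(prompt, expectedOutputTokens = 1000) {
  const inputTokens = estimateTokens(prompt);
  const total = inputTokens + expectedOutputTokens; // TPM counts input + output
  return (total / TPM_LIMIT) * 100; // percent of the per-minute budget
}

const bigPrompt = 'x'.repeat(400000); // ~100K tokens of input
console.log(`One request uses ~${tpmShare(bigPrompt).toFixed(1)}% of TPM`);
```

A handful of requests like that, and the TPM window is gone while your RPM counter has barely moved.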
Important: The rate limit values below are approximations based on third-party sources and recent reports. API providers change limits frequently based on model, tier, region, and usage patterns. Always verify your specific limits in your provider's dashboard: OpenAI or Anthropic.
OpenAI Rate Limits by Tier
OpenAI automatically graduates you to higher tiers based on spending. These are approximate limits for GPT-5 (November 2025):
| Tier | RPM | TPM | Qualification |
|---|---|---|---|
| Free | 3 | Limited | $0 |
| Tier 1 | 500 | 500K | $5+ spent |
| Tier 2 | 500 | 500K | $50+ spent |
| Tier 3 | 5,000 | 2M | $100+ spent |
| Tier 4 | 10,000 | 5M | $250+ spent |
| Tier 5 | 15,000 | 40M | $1,000+ spent |
Note: Limits vary significantly by model. GPT-5-mini has higher TPM limits than GPT-5. The o1 reasoning model has different limits entirely. Always verify your specific limits in the OpenAI dashboard.
Claude (Anthropic) Rate Limits by Tier
Claude uses a tier system similar to OpenAI, but with a key difference: separate limits for input and output tokens. These are approximate values for Sonnet 4.x (November 2025):
| Tier | RPM | ITPM | OTPM | Qualification |
|---|---|---|---|---|
| Tier 1 | 50 | 30K | 8K | $5+ spent |
| Tier 2 | 1,000 | 450K | 90K | $40+ spent |
| Tier 3 | 2,000 | 800K | 160K | $200+ spent |
| Tier 4 | 4,000 | 2M | 400K | $400+ spent |
Note: ITPM = Input Tokens Per Minute, OTPM = Output Tokens Per Minute. Claude separates these to give you more fine-grained control over rate limits.
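Because input and output are metered separately, a request can be blocked by either budget. A minimal sketch of the idea, using the approximate Tier 1 values from the table above (verify your real limits in the Anthropic console):

```javascript
// Claude-style budgets: a request must fit under BOTH per-minute limits.
// The limit values mirror the approximate Tier 1 row above.
const limits = { itpm: 30000, otpm: 8000 };

function fitsBudget(inputTokens, maxOutputTokens, used = { input: 0, output: 0 }) {
  return (
    used.input + inputTokens <= limits.itpm &&
    used.output + maxOutputTokens <= limits.otpm
  );
}

console.log(fitsBudget(25000, 1024)); // fits: both budgets are untouched
console.log(fitsBudget(25000, 1024, { input: 0, output: 7500 })); // blocked by OTPM alone
```

Note that a prompt-heavy workload can exhaust ITPM while OTPM sits idle, and a generation-heavy workload can do the reverse.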
Reading Rate Limit Headers
Both providers return rate limit information in response headers. You can use these to avoid hitting limits in the first place.
OpenAI Headers
```
// Available in every OpenAI API response
x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 499
x-ratelimit-reset-requests: 120ms
x-ratelimit-limit-tokens: 500000
x-ratelimit-remaining-tokens: 495000
x-ratelimit-reset-tokens: 8ms
```
Claude Headers
```
// Available in every Claude API response
anthropic-ratelimit-requests-limit: 50
anthropic-ratelimit-requests-remaining: 49
anthropic-ratelimit-requests-reset: 2025-11-27T10:00:00Z
anthropic-ratelimit-input-tokens-limit: 30000
anthropic-ratelimit-input-tokens-remaining: 28500
anthropic-ratelimit-output-tokens-limit: 8000
anthropic-ratelimit-output-tokens-remaining: 7800
retry-after: 60
```
Strategy 1: Exponential Backoff with Retry
When to use it: You occasionally hit rate limits and need automatic retry logic.
Exponential backoff is the simplest strategy. When you hit a 429 error, wait a bit, then try again. Each retry doubles the wait time.
How It Works
- Try the API request
- If you get a 429 error, wait 1 second
- Try again. If still 429, wait 2 seconds
- Try again. If still 429, wait 4 seconds
- Continue doubling until success or max retries
Generic Implementation
```javascript
async function retryWithExponentialBackoff(fn, maxRetries = 5, baseDelay = 1000) {
  let lastError;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const result = await fn();
      return result;
    } catch (error) {
      lastError = error;

      // Check if it's a rate limit error
      const isRateLimitError =
        error.status === 429 ||
        error.response?.status === 429 ||
        error.message?.includes('rate limit');

      if (!isRateLimitError) {
        // If it's not a rate limit error, throw immediately
        throw error;
      }

      // Don't wait after the last attempt
      if (attempt === maxRetries - 1) {
        break;
      }

      // Calculate delay: baseDelay * 2^attempt
      // Attempt 0: 1s, Attempt 1: 2s, Attempt 2: 4s, Attempt 3: 8s, Attempt 4: 16s
      const delay = baseDelay * Math.pow(2, attempt);

      console.log(`Rate limit hit. Retrying in ${delay}ms (attempt ${attempt + 1}/${maxRetries})`);

      await sleep(delay);
    }
  }

  // If we've exhausted all retries, throw the last error
  throw new Error(`Failed after ${maxRetries} retries: ${lastError.message}`);
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```
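One refinement worth knowing about: when many clients hit a rate limit at the same moment, pure doubling makes them all retry in lockstep. A common fix is "full jitter", waiting a random duration up to the exponential cap. This sketch is an optional variant, not part of the implementation above:

```javascript
// "Full jitter" backoff: wait a random duration between 0 and the
// exponential cap, so concurrent clients don't retry simultaneously.
function jitteredDelay(attempt, baseDelay = 1000, maxDelay = 30000) {
  const cap = Math.min(maxDelay, baseDelay * Math.pow(2, attempt));
  return Math.floor(Math.random() * cap);
}

// Drop-in replacement for the delay calculation above:
//   const delay = jitteredDelay(attempt, baseDelay);
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait up to ${Math.min(30000, 1000 * 2 ** attempt)}ms`);
}
```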
OpenAI with Retry
```javascript
async function callOpenAI(prompt, options = {}) {
  const {
    apiKey = process.env.OPENAI_API_KEY,
    model = 'gpt-4',
    maxRetries = 5,
    baseDelay = 1000,
  } = options;

  return retryWithExponentialBackoff(
    async () => {
      const response = await fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
          model,
          messages: [{ role: 'user', content: prompt }],
        }),
      });

      // Read rate limit headers
      const rateLimitInfo = {
        requestsLimit: response.headers.get('x-ratelimit-limit-requests'),
        requestsRemaining: response.headers.get('x-ratelimit-remaining-requests'),
        tokensLimit: response.headers.get('x-ratelimit-limit-tokens'),
        tokensRemaining: response.headers.get('x-ratelimit-remaining-tokens'),
        resetTime: response.headers.get('x-ratelimit-reset-requests'),
      };

      console.log('Rate limit info:', rateLimitInfo);

      if (!response.ok) {
        const errorData = await response.json().catch(() => ({}));
        const error = new Error(errorData.error?.message || 'OpenAI API request failed');
        error.status = response.status;
        throw error;
      }

      const data = await response.json();
      return data.choices[0].message.content;
    },
    maxRetries,
    baseDelay
  );
}

// Usage
const response = await callOpenAI('Explain rate limiting in one sentence.');
console.log(response);
```
Claude with Retry
```javascript
async function callClaude(prompt, options = {}) {
  const {
    apiKey = process.env.ANTHROPIC_API_KEY,
    model = 'claude-sonnet-4-20250514',
    maxTokens = 1024,
    maxRetries = 5,
    baseDelay = 1000,
  } = options;

  return retryWithExponentialBackoff(
    async () => {
      const response = await fetch('https://api.anthropic.com/v1/messages', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'x-api-key': apiKey,
          'anthropic-version': '2023-06-01',
        },
        body: JSON.stringify({
          model,
          max_tokens: maxTokens,
          messages: [{ role: 'user', content: prompt }],
        }),
      });

      // Read rate limit headers
      // Note: Claude separates input and output token limits
      const rateLimitInfo = {
        requestsLimit: response.headers.get('anthropic-ratelimit-requests-limit'),
        requestsRemaining: response.headers.get('anthropic-ratelimit-requests-remaining'),
        inputTokensLimit: response.headers.get('anthropic-ratelimit-input-tokens-limit'),
        inputTokensRemaining: response.headers.get('anthropic-ratelimit-input-tokens-remaining'),
        outputTokensLimit: response.headers.get('anthropic-ratelimit-output-tokens-limit'),
        outputTokensRemaining: response.headers.get('anthropic-ratelimit-output-tokens-remaining'),
        retryAfter: response.headers.get('retry-after'),
      };

      console.log('Rate limit info:', rateLimitInfo);

      if (!response.ok) {
        const errorData = await response.json().catch(() => ({}));
        const error = new Error(errorData.error?.message || 'Claude API request failed');
        error.status = response.status;
        throw error;
      }

      const data = await response.json();
      return data.content[0].text;
    },
    maxRetries,
    baseDelay
  );
}

// Usage
const response = await callClaude('Explain rate limiting in one sentence.');
console.log(response);
```
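When a 429 response includes a `retry-after` header (in seconds, as in the Claude header table above), the server's own hint is usually a better wait time than a computed backoff. A small sketch of that preference; `parseRetryAfter` is a hypothetical helper, not part of any SDK:

```javascript
// Prefer the server's retry-after hint over a computed backoff delay.
// The header value is in seconds; our delays are in milliseconds.
function parseRetryAfter(headerValue, fallbackMs) {
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds) && seconds > 0) {
    return seconds * 1000;
  }
  return fallbackMs; // header missing or unparseable: use the backoff delay
}

console.log(parseRetryAfter('60', 2000)); // header wins: wait 60000ms
console.log(parseRetryAfter(null, 2000)); // no header: wait the backoff 2000ms
```

Inside a retry loop, you would read `response.headers.get('retry-after')` on the failed attempt and pass it as the first argument.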
Strategy 2: Request Queue
When to use it: You know your rate limits and want to stay under them proactively.
Instead of reacting to 429 errors, a request queue prevents them by controlling how many requests you send. Think of it as a bouncer for your API calls.
Queue Implementation
```javascript
// Note: This implementation is single-threaded (fine for Node.js/browser)
// If porting to multi-threaded environments, add proper locking mechanisms
class RequestQueue {
  constructor(requestsPerMinute) {
    this.requestsPerMinute = requestsPerMinute;
    this.queue = [];
    this.processing = false;
    this.requestCount = 0;
    this.windowStart = Date.now();
  }

  async add(fn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject });
      this.processQueue();
    });
  }

  async processQueue() {
    if (this.processing || this.queue.length === 0) {
      return;
    }

    this.processing = true;

    while (this.queue.length > 0) {
      // Reset the window if a minute has passed
      const now = Date.now();
      if (now - this.windowStart >= 60000) {
        this.requestCount = 0;
        this.windowStart = now;
      }

      // If we've hit the rate limit, wait until the window resets
      if (this.requestCount >= this.requestsPerMinute) {
        const timeToWait = 60000 - (now - this.windowStart);
        console.log(`Rate limit reached. Waiting ${timeToWait}ms until window resets...`);
        await this.sleep(timeToWait);
        this.requestCount = 0;
        this.windowStart = Date.now();
      }

      // Process the next request
      const { fn, resolve, reject } = this.queue.shift();

      try {
        const result = await fn();
        this.requestCount++;
        resolve(result);
      } catch (error) {
        reject(error);
      }

      // Add a small delay between requests to spread them out
      const delayBetweenRequests = Math.floor(60000 / this.requestsPerMinute);
      await this.sleep(delayBetweenRequests);
    }

    this.processing = false;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  getStats() {
    return {
      queueLength: this.queue.length,
      requestsInWindow: this.requestCount,
      timeUntilReset: 60000 - (Date.now() - this.windowStart),
    };
  }
}
```
Using the Queue
```javascript
// For OpenAI Tier 1 (500 RPM)
const queue = new RequestQueue(500);

const prompts = [
  'What is JavaScript?',
  'What is Python?',
  'What is Go?',
  'What is Rust?',
  'What is TypeScript?',
];

// Queue all requests
const promises = prompts.map(prompt =>
  queue.add(async () => {
    console.log(`Processing: ${prompt}`);
    return await callOpenAI(prompt);
  })
);

// Wait for all to complete
const results = await Promise.all(promises);
console.log('All requests completed:', results);
```
Strategy 3: Token Bucket Algorithm
When to use it: You need sophisticated rate limiting with burst support.
Token bucket is the most flexible strategy. It allows burst traffic while maintaining an average rate. Anthropic officially uses this algorithm for Claude's rate limiting, where "your capacity is continuously replenished up to your maximum limit, rather than being reset at fixed intervals."
Token Bucket Implementation
```javascript
class TokenBucket {
  constructor(capacity, refillRate, refillInterval = 1000) {
    this.capacity = capacity;             // Maximum tokens
    this.tokens = capacity;               // Current tokens
    this.refillRate = refillRate;         // Tokens added per interval
    this.refillInterval = refillInterval; // Milliseconds between refills
    this.lastRefill = Date.now();

    // Start the refill process
    this.startRefilling();
  }

  startRefilling() {
    this.refillTimer = setInterval(() => {
      this.refill();
    }, this.refillInterval);
  }

  refill() {
    const now = Date.now();
    const timePassed = now - this.lastRefill;
    const tokensToAdd = (timePassed / this.refillInterval) * this.refillRate;

    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }

  async consume(tokens = 1) {
    // Refill before checking
    this.refill();

    // If we don't have enough tokens, wait until we do
    while (this.tokens < tokens) {
      const tokensNeeded = tokens - this.tokens;
      const timeToWait = (tokensNeeded / this.refillRate) * this.refillInterval;

      console.log(`Not enough tokens. Waiting ${Math.ceil(timeToWait)}ms...`);
      await this.sleep(timeToWait);
      this.refill();
    }

    this.tokens -= tokens;
    return true;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  getStatus() {
    this.refill();
    return {
      tokens: this.tokens,
      capacity: this.capacity,
      percentage: (this.tokens / this.capacity) * 100,
    };
  }

  stop() {
    if (this.refillTimer) {
      clearInterval(this.refillTimer);
    }
  }
}

// Usage: OpenAI Tier 1 (500 RPM)
const bucket = new TokenBucket(500, 500, 60000);

async function makeRequest(prompt) {
  await bucket.consume(1);
  return await callOpenAI(prompt);
}

const result = await makeRequest('What is JavaScript?');
console.log(result);

// IMPORTANT: Call stop() to prevent memory leaks in long-running processes
bucket.stop();
```
Strategy 4: Smart Caching
When to use it: Users might ask similar questions.
The best way to avoid rate limits is to not make the request at all. Caching identical prompts can save you money and improve response times.
Cache Implementation
```javascript
const crypto = require('crypto');

class PromptCache {
  constructor(ttl = 3600000) {
    // Default TTL: 1 hour
    this.cache = new Map();
    this.ttl = ttl;
  }

  // Generate a cache key from the prompt and options
  generateKey(prompt, options = {}) {
    const data = JSON.stringify({ prompt, ...options });
    return crypto.createHash('sha256').update(data).digest('hex');
  }

  // Get from cache if exists and not expired
  get(prompt, options = {}) {
    const key = this.generateKey(prompt, options);
    const cached = this.cache.get(key);

    if (!cached) {
      return null;
    }

    // Check if expired
    if (Date.now() > cached.expiresAt) {
      this.cache.delete(key);
      return null;
    }

    console.log('Cache hit:', key.substring(0, 8));
    return cached.value;
  }

  // Store in cache with expiration
  set(prompt, value, options = {}) {
    const key = this.generateKey(prompt, options);
    this.cache.set(key, {
      value,
      expiresAt: Date.now() + this.ttl,
      createdAt: Date.now(),
    });
    console.log('Cache set:', key.substring(0, 8));
  }

  // Clear expired entries
  cleanup() {
    const now = Date.now();
    let cleaned = 0;

    for (const [key, value] of this.cache.entries()) {
      if (now > value.expiresAt) {
        this.cache.delete(key);
        cleaned++;
      }
    }

    console.log(`Cleaned ${cleaned} expired cache entries`);
    return cleaned;
  }

  getStats() {
    return {
      size: this.cache.size,
      ttl: this.ttl,
    };
  }

  clear() {
    this.cache.clear();
  }
}
```
Using the Cache
```javascript
const cache = new PromptCache(60000); // 1 minute TTL

async function callAIWithCache(prompt) {
  // Check cache first
  const cached = cache.get(prompt);
  if (cached !== null) {
    return cached;
  }

  // If not in cache, make the API call
  const result = await callOpenAI(prompt);

  // Store in cache
  cache.set(prompt, result);

  return result;
}

// First call - hits API
const result1 = await callAIWithCache('What is JavaScript?');

// Second call with same prompt - uses cache
const result2 = await callAIWithCache('What is JavaScript?');

console.log('Cache stats:', cache.getStats());
```
Production Best Practices
Combine Multiple Strategies
The best production systems combine multiple strategies:
```javascript
const cache = new PromptCache(3600000); // 1 hour
const queue = new RequestQueue(500);    // 500 RPM

async function robustAICall(prompt) {
  // Check cache first
  const cached = cache.get(prompt);
  if (cached) return cached;

  // Queue the request to respect rate limits
  const result = await queue.add(async () => {
    // Use exponential backoff for the actual call
    return await callOpenAI(prompt, { maxRetries: 3 });
  });

  // Cache the result
  cache.set(prompt, result);

  return result;
}
```
User Experience
Don't leave users in the dark when rate limits slow things down:
```javascript
async function callAIWithFeedback(prompt, onProgress) {
  const queueStats = queue.getStats();

  if (queueStats.queueLength > 0) {
    onProgress({
      status: 'queued',
      position: queueStats.queueLength,
      message: `Request queued. Position: ${queueStats.queueLength}`,
    });
  }

  onProgress({ status: 'processing', message: 'Sending request...' });

  const result = await robustAICall(prompt);

  onProgress({ status: 'complete', message: 'Done!' });

  return result;
}
```
Monitor Rate Limit Usage
Track rate limit headers to stay proactive:
```javascript
function logRateLimitWarnings(headers) {
  const remaining = parseInt(headers.get('x-ratelimit-remaining-requests'), 10);
  const limit = parseInt(headers.get('x-ratelimit-limit-requests'), 10);

  // Headers can be missing on some responses; skip if we can't parse them
  if (!Number.isFinite(remaining) || !Number.isFinite(limit) || limit === 0) {
    return;
  }

  const percentUsed = ((limit - remaining) / limit) * 100;

  if (percentUsed > 80) {
    console.warn(`WARNING: ${percentUsed.toFixed(1)}% of rate limit used`);
  }

  if (percentUsed > 95) {
    console.error('CRITICAL: Approaching rate limit!');
    // Send alert to monitoring service
  }
}
```
Multi-Provider Fallback
If one provider hits rate limits, fall back to another:
```javascript
async function callAIWithFallback(prompt) {
  try {
    return await callOpenAI(prompt, { maxRetries: 1 });
  } catch (error) {
    if (error.status === 429) {
      console.log('OpenAI rate limit hit, trying Claude...');
      return await callClaude(prompt, { maxRetries: 1 });
    }
    throw error;
  }
}
```
Conclusion
Rate limits are a fact of life when building with AI APIs. The strategies in this guide give you the tools to handle them gracefully. Here's when to use each approach:
| Strategy | Complexity | Best For | When to Add |
|---|---|---|---|
| Exponential Backoff | Low | Occasional rate limit hits | Day 1 (always include) |
| Caching | Low | Repeated or similar prompts | Day 1 (easy wins) |
| Request Queue | Medium | Predictable, steady traffic | When you hit limits regularly |
| Token Bucket | High | Burst traffic + sustained rate | High-scale production systems |
| Multi-Provider Fallback | Medium | Mission-critical applications | When downtime isn't acceptable |
Start with exponential backoff and caching. As your usage grows, add a request queue. For high-scale production systems, consider token buckets and multi-provider fallback.
The code examples in this guide are battle-tested and ready to use. The strategies work across providers, even as specific rate limit values change.
For the most accurate rate limits, always check:
- OpenAI: the rate limits page in your OpenAI dashboard
- Anthropic: the rate limits page in the Anthropic Console