Rate Limits

Understand API rate limits and how to manage them. Rate limits ensure platform stability and a fair experience for all users.

How It Works

Rate limits use a sliding window algorithm to control the number of requests per time unit. When requests exceed the limit, the API returns a 429 Too Many Requests error.

Rate limits are enforced on two dimensions:

  • RPM (Requests Per Minute): Maximum requests allowed per minute
  • Concurrency: Maximum number of simultaneous in-flight requests

When either dimension reaches its limit, subsequent requests are rejected.
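The two checks described above can be sketched roughly as follows. This is an illustrative model only, not the platform's actual implementation: the class name `TwoDimensionLimiter`, its method names, the injectable `clock`, and the choice not to count rejected requests against the window are all assumptions made for the example.

```python
import time
from collections import deque


class TwoDimensionLimiter:
    """Illustrative limiter: sliding-window RPM plus a cap on in-flight requests."""

    def __init__(self, rpm, concurrency, window_seconds=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.concurrency = concurrency
        self.window = window_seconds
        self.clock = clock          # injectable so the behavior can be tested
        self.timestamps = deque()   # start times of accepted requests in the window
        self.in_flight = 0

    def try_start(self):
        """Return True if a request may start now, False if it would be rejected."""
        now = self.clock()
        # Slide the window: drop timestamps that are window_seconds old or older.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm:
            return False  # RPM dimension exhausted
        if self.in_flight >= self.concurrency:
            return False  # concurrency dimension exhausted
        self.timestamps.append(now)
        self.in_flight += 1
        return True

    def finish(self):
        """Mark one in-flight request as completed."""
        self.in_flight = max(0, self.in_flight - 1)
```

A request that finishes frees a concurrency slot immediately, but its timestamp keeps counting toward RPM until it slides out of the window. That is why a burst of short requests can exhaust RPM while concurrency stays low.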

Per-Model Limits

Each model has its own rate limits:

Text Generation Models

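| Model | RPM | Concurrency | Notes |
| --- | --- | --- | --- |
| deepseek-v4-flash | 100 | 20 | High-speed model |
| deepseek-v4-pro | 30 | 5 | High-performance |
| qwen3.7-max | 60 | 10 | General-purpose |
| glm-5.7 | 60 | 10 | General-purpose |
| kimi-k2.6 | 30 | 5 | Long-context model |
| minimax-m3 | 30 | 5 | General-purpose |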

Image Generation Models

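| Model | RPM | Concurrency | Notes |
| --- | --- | --- | --- |
| Kolors | 30 | 5 | Fast generation |
| gpt-image-2 | 20 | 3 | High quality |
| nano-banana-2 | 10 | 2 | High quality |
| doubao-seedream-4.0 | 10 | 2 | Doubao image gen |
| doubao-seedream-5.0-lite | 10 | 2 | Doubao image gen |
| doubao-seedream-4.5 | 10 | 2 | Doubao image gen |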

Video Generation Models

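| Model | RPM | Daily Limit | Concurrency | Notes |
| --- | --- | --- | --- | --- |
| veo-3 | 5 | 100 | 2 | Google Veo |
| wanx2.1-t2v-turbo | 10 | 200 | 3 | Wanx 2.7 |
| doubao-seedance-2-0 | 5 | 100 | 2 | Doubao Seedance 2.0 |

| Model | RPM | Concurrency | Notes |
| --- | --- | --- | --- |
| web-search | 30 | 5 | Per-request |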
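Because limits differ per model, a client usually needs a per-model lookup. The sketch below copies the text generation limits listed earlier in this section into a dictionary and derives the request spacing that keeps a steady stream at or under the RPM limit. The names `TEXT_MODEL_LIMITS` and `min_interval_seconds` are hypothetical, and the values should be kept in sync with the tables by hand.

```python
# (RPM, concurrency) per model, copied from the text generation table.
TEXT_MODEL_LIMITS = {
    "deepseek-v4-flash": (100, 20),
    "deepseek-v4-pro": (30, 5),
    "qwen3.7-max": (60, 10),
    "glm-5.7": (60, 10),
    "kimi-k2.6": (30, 5),
    "minimax-m3": (30, 5),
}


def min_interval_seconds(model, limits=TEXT_MODEL_LIMITS):
    """Spacing between requests that keeps an evenly paced stream within the RPM."""
    rpm, _concurrency = limits[model]
    return 60.0 / rpm


if __name__ == "__main__":
    for name in TEXT_MODEL_LIMITS:
        print(f"{name}: one request every {min_interval_seconds(name):.2f}s")
```

Even spacing is a conservative choice. A sliding window also tolerates bursts, as long as the total inside any one-minute span stays within the limit.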

Rate Limit Response Headers

Responses include the following rate limit headers. Retry-After is sent only with 429 responses:

| Header | Description | Example |
| --- | --- | --- |
| X-RateLimit-Limit | Maximum requests allowed in the current window | 30 |
| X-RateLimit-Remaining | Requests remaining in the current window | 0 |
| X-RateLimit-Reset | Unix timestamp when the window resets | 1705312260 |
| Retry-After | Recommended wait time in seconds (429 only) | 30 |

Use these headers to proactively throttle requests on the client side.
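As a minimal sketch of header-driven throttling, the function below turns the headers into a pause length. The function name `throttle_delay` and the `low_water` threshold are hypothetical. The sketch also assumes that `Retry-After` carries a number of seconds, as the table states, and that `headers` is a mapping such as `response.headers` from the `requests` library (a plain `dict` works too, but its lookups are case-sensitive).

```python
import time


def throttle_delay(headers, now=None, low_water=1):
    """Return how many seconds to pause before the next request.

    `headers` maps header names to string values. `low_water` is the remaining
    budget below which the caller waits for the window to reset.
    """
    now = time.time() if now is None else now

    # Retry-After accompanies 429 responses; honor it first when present.
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return max(0.0, float(retry_after))

    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0.0  # headers missing: nothing to act on

    if int(remaining) >= low_water:
        return 0.0  # budget left in the current window

    # Budget exhausted: wait until the Unix timestamp at which the window resets.
    return max(0.0, float(reset) - now)
```

Calling `time.sleep(throttle_delay(response.headers))` after each response keeps the client from sending requests that are likely to be rejected.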

Handling Rate Limits

Exponential Backoff with Jitter

When you receive a 429 error, use an exponential backoff strategy with jitter:

python
import time
import random
import requests

def call_with_retry(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        # A timeout keeps a stalled connection from blocking the retry loop forever
        response = requests.post(url, headers=headers, json=payload, timeout=30)

        # Anything other than 429 is returned to the caller as-is
        if response.status_code != 429:
            return response

        # Out of attempts: do not sleep again before giving up
        if attempt == max_retries - 1:
            break

        # Prefer the server-suggested wait time
        retry_after = response.headers.get("Retry-After")
        if retry_after:
            wait = float(retry_after)
        else:
            # Exponential backoff + random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)

        print(f"Rate limited, retrying in {wait:.1f}s...")
        time.sleep(wait)

    raise RuntimeError("Max retries exceeded")

javascript
async function callWithRetry(url, headers, payload, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      method: 'POST',
      headers,
      body: JSON.stringify(payload),
    })

    // Anything other than 429 is returned to the caller as-is
    if (response.status !== 429) return response

    // Out of attempts: do not sleep again before giving up
    if (attempt === maxRetries - 1) break

    // Prefer the server-suggested wait time; otherwise use exponential backoff + jitter
    const retryAfter = Number.parseFloat(response.headers.get('Retry-After'))
    const wait = Number.isFinite(retryAfter) ? retryAfter : Math.pow(2, attempt) + Math.random()

    console.log(`Rate limited, retrying in ${wait.toFixed(1)}s...`)
    await new Promise((r) => setTimeout(r, wait * 1000))
  }

  throw new Error('Max retries exceeded')
}
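The retry examples above let the delay double without bound, which gets long if `max_retries` is raised. One common variation, shown here as a sketch rather than a platform recommendation, caps the exponential part. The function name `backoff_delay` and the `base`, `cap`, and `rng` parameters are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Delay in seconds for a 0-based retry attempt: capped exponential plus jitter."""
    # Exponential growth (base, 2*base, 4*base, ...) that never exceeds `cap`.
    exponential = min(cap, base * (2 ** attempt))
    # Up to one second of random jitter spreads out clients that were
    # rate limited at the same moment, so they do not all retry together.
    return exponential + rng.uniform(0, 1)


if __name__ == "__main__":
    for attempt in range(8):
        print(f"attempt {attempt}: wait about {backoff_delay(attempt):.1f}s")
```

With the defaults, the exponential part grows 1, 2, 4, 8, 16, 32 seconds and then stays at 60. Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests.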

Client-Side Throttling

Proactively control request rate at the application layer to avoid hitting server limits:

python
import time
from threading import Lock

class RateLimiter:
    def __init__(self, max_requests, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = []  # start times of requests inside the window
        self.lock = Lock()

    def acquire(self):
        # The lock is held while sleeping, so waiting callers are served one at a time
        with self.lock:
            while True:
                # Monotonic clock: unaffected by system clock adjustments
                now = time.monotonic()
                # Prune entries outside the window
                self.requests = [t for t in self.requests if now - t < self.window]

                if len(self.requests) < self.max_requests:
                    break
                # Window is full: wait until the oldest entry expires, then re-check
                time.sleep(self.window - (now - self.requests[0]))

            self.requests.append(time.monotonic())
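The limiter above covers the RPM dimension only. Concurrency, the second dimension, can be capped on the client with a semaphore. The following is a minimal sketch using the standard library. The names `ConcurrencyLimiter` and `run_all` are hypothetical, and `func` stands in for whatever function actually sends a request.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Caps how many calls run at once; callers beyond the cap block until a slot frees."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, func, *args, **kwargs):
        # Acquire a slot, run the call, and release the slot even if func raises.
        with self._slots:
            return func(*args, **kwargs)


def run_all(limiter, func, items, workers=16):
    """Apply func to every item on a thread pool without exceeding the limiter's cap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in the returned results.
        return list(pool.map(lambda item: limiter.run(func, item), items))
```

Setting `max_in_flight` to the model's concurrency limit, or slightly below it, leaves headroom for other processes that share the same API key.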

Best Practices

  1. Monitor response headers: Read X-RateLimit-Remaining and slow down proactively before hitting the limit
  2. Implement request queues: Use a queue to control request throughput in high-concurrency scenarios
  3. Cache results: Cache identical queries to reduce redundant requests
  4. Use Retry-After: When receiving a 429 response, prefer the server-provided Retry-After value
  5. Distinguish error types: Only retry on 429 and 5xx errors. A 400 error means the request itself is invalid, so fix it before resending
  6. Async tasks are exempt from RPM: Video generation and other async tasks are limited by concurrency, not RPM. Status polling does not count toward RPM.
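Practices 3 and 5 can be sketched in a few lines. Everything here is illustrative: `should_retry` and `ResponseCache` are hypothetical names, `send` stands in for whatever function performs the real request, and the cache has no expiry or size bound, which a production cache would likely need.

```python
import json


def should_retry(status_code):
    """Retry only rate limit (429) and server-side (5xx) failures."""
    return status_code == 429 or 500 <= status_code < 600


class ResponseCache:
    """Memoizes results for identical payloads so repeated queries skip the API."""

    def __init__(self, send):
        self._send = send   # hypothetical callable: payload dict -> result
        self._store = {}
        self.hits = 0

    def call(self, payload):
        # Serialize with sorted keys so equal payloads map to the same cache key
        # regardless of the order in which their fields were written.
        key = json.dumps(payload, sort_keys=True)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = self._send(payload)
        self._store[key] = result
        return result
```

Caching only makes sense for requests whose output is expected to be identical on repeat, for example deterministic lookups. Sampled generations usually differ from call to call.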