Rate Limits

Understand API rate limits and how to manage them. Rate limits ensure platform stability and a fair experience for all users.

How It Works

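Rate limits use a sliding window algorithm: the API counts the requests made within the most recent time window (for RPM, the last 60 seconds) instead of resetting a counter at fixed intervals. When requests exceed the limit, the API returns a 429 Too Many Requests error.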

Rate limits are enforced on two dimensions:

RPM (Requests Per Minute): Maximum requests allowed per minute
Concurrency: Maximum number of simultaneous in-flight requests

When either dimension reaches its limit, subsequent requests are rejected.
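
To make the two dimensions concrete, here is a minimal sketch of a limiter that checks both a sliding-window request count and an in-flight count. The class `TwoDimensionLimiter` and its method names are hypothetical. It illustrates the general technique only and is not the platform's actual server-side implementation.

```python
from collections import deque


class TwoDimensionLimiter:
    """Toy limiter combining a sliding-window RPM check with a concurrency cap.

    Hypothetical illustration; not the platform's real implementation.
    """

    def __init__(self, rpm, concurrency, window_seconds=60.0):
        self.rpm = rpm
        self.concurrency = concurrency
        self.window = window_seconds
        self.accepted = deque()  # Timestamps of accepted requests
        self.in_flight = 0       # Requests started but not yet finished

    def try_start(self, now):
        """Return True if a request arriving at time `now` is accepted."""
        # Slide the window: forget requests older than `window` seconds
        while self.accepted and now - self.accepted[0] >= self.window:
            self.accepted.popleft()

        if len(self.accepted) >= self.rpm:
            return False  # RPM dimension exhausted (would be a 429)
        if self.in_flight >= self.concurrency:
            return False  # Concurrency dimension exhausted (would be a 429)

        self.accepted.append(now)
        self.in_flight += 1
        return True

    def finish(self):
        """Mark one in-flight request as completed."""
        self.in_flight = max(0, self.in_flight - 1)
```

With `rpm=3` and `concurrency=2`, a third request is rejected while two are still running, even though the RPM budget is not yet used up. Once three requests have been accepted inside the window, a fourth is rejected even if nothing is in flight.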

Per-Model Limits

Each model has its own rate limits:

Text Generation Models

Model	RPM	Concurrency	Notes
deepseek-v4-flash	100	20	High-speed model
deepseek-v4-pro	30	5	High-performance
qwen3.7-max	60	10	General-purpose
glm-5.7	60	10	General-purpose
kimi-k2.6	30	5	Long-context model
minimax-m3	30	5	General-purpose

Image Generation Models

Model	RPM	Concurrency	Notes
Kolors	30	5	Fast generation
gpt-image-2	20	3	High quality
nano-banana-2	10	2	High quality
doubao-seedream-4.0	10	2	Doubao image gen
doubao-seedream-5.0-lite	10	2	Doubao image gen
doubao-seedream-4.5	10	2	Doubao image gen

Video Generation Models

Model	RPM	Daily Limit	Concurrency	Notes
veo-3	5	100	2	Google Veo
wanx2.1-t2v-turbo	10	200	3	Wanx 2.7
doubao-seedance-2-0	5	100	2	Doubao Seedance 2.0

Web Search

Model	RPM	Concurrency	Notes
web-search	30	5	Per-request
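
On the client, it can help to keep these limits in one place so throttling code can look them up by model name. The sketch below copies a few rows from the tables above by hand. `ModelLimit`, `MODEL_LIMITS`, and `min_interval_seconds` are hypothetical names, not part of the API, and the values need to be kept in sync with this page manually.

```python
from typing import NamedTuple, Optional


class ModelLimit(NamedTuple):
    rpm: int
    concurrency: int
    daily: Optional[int] = None  # Only video models list a daily limit


# A subset of the tables above, copied by hand (assumption: still current)
MODEL_LIMITS = {
    "deepseek-v4-flash": ModelLimit(rpm=100, concurrency=20),
    "deepseek-v4-pro": ModelLimit(rpm=30, concurrency=5),
    "gpt-image-2": ModelLimit(rpm=20, concurrency=3),
    "veo-3": ModelLimit(rpm=5, concurrency=2, daily=100),
    "web-search": ModelLimit(rpm=30, concurrency=5),
}


def min_interval_seconds(model):
    """Average spacing between requests that matches the model's RPM."""
    return 60.0 / MODEL_LIMITS[model].rpm
```

For example, `min_interval_seconds("deepseek-v4-pro")` gives 2.0 seconds. Pacing requests slightly slower than this value leaves some margin for clock and network jitter.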

Rate Limit Response Headers

API responses include the following rate limit headers. `Retry-After` is sent only with 429 responses:

Header	Description	Example
`X-RateLimit-Limit`	Maximum requests allowed in the current window	`30`
`X-RateLimit-Remaining`	Requests remaining in the current window	`0`
`X-RateLimit-Reset`	Unix timestamp when the window resets	`1705312260`
`Retry-After`	Recommended wait time in seconds (429 only)	`30`

Use these headers to proactively throttle requests on the client side.
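
As one possible way to act on these headers, the sketch below turns them into a pause duration. The function name `throttle_delay` and the `low_water_mark` threshold are assumptions for illustration, not API parameters. It expects a mapping keyed exactly as in the table; the header objects returned by `requests` are case-insensitive, while a plain `dict` is not.

```python
import time


def throttle_delay(headers, now=None, low_water_mark=1):
    """Return how many seconds to pause before sending the next request.

    `headers` is any mapping of response header names to string values.
    `low_water_mark` is a hypothetical client-side threshold.
    """
    now = time.time() if now is None else now

    # On a 429, Retry-After is the most direct hint
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Not a number of seconds; fall through to the other headers

    try:
        remaining = int(headers["X-RateLimit-Remaining"])
        reset_at = float(headers["X-RateLimit-Reset"])
    except (KeyError, ValueError):
        return 0.0  # Headers missing or malformed: no basis for throttling

    if remaining > low_water_mark:
        return 0.0  # Enough budget left in this window
    return max(0.0, reset_at - now)  # Wait until the window resets
```

A caller would typically run `time.sleep(throttle_delay(response.headers))` after each response, so that the pause happens before the limit is hit rather than after a 429.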

Handling Rate Limits

Exponential Backoff with Jitter

When you receive a 429 error, use an exponential backoff strategy with jitter:

python

import time
import random
import requests

def call_with_retry(url, headers, payload, max_retries=5):
    """POST with retries on 429, honoring Retry-After when present."""
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)

        if response.status_code != 429:
            return response

        if attempt == max_retries - 1:
            break  # Out of retries; do not sleep before giving up

        # Prefer the server-suggested wait time (in seconds)
        retry_after = response.headers.get("Retry-After")
        try:
            wait = float(retry_after)
        except (TypeError, ValueError):
            # Header missing or not numeric: exponential backoff + random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)

        print(f"Rate limited, retrying in {wait:.1f}s...")
        time.sleep(wait)

    raise Exception("Max retries exceeded")

javascript

async function callWithRetry(url, headers, payload, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      method: 'POST',
      headers,
      body: JSON.stringify(payload),
    })

    if (response.status !== 429) {
      return response
    }

    if (attempt === maxRetries - 1) {
      break // Out of retries; do not wait before giving up
    }

    // Prefer the server-suggested wait time (in seconds); fall back to
    // exponential backoff plus random jitter if it is missing or not numeric
    const retryAfter = Number.parseFloat(response.headers.get('Retry-After'))
    const wait = Number.isFinite(retryAfter) ? retryAfter : Math.pow(2, attempt) + Math.random()

    console.log(`Rate limited, retrying in ${wait.toFixed(1)}s...`)
    await new Promise((r) => setTimeout(r, wait * 1000))
  }

  throw new Error('Max retries exceeded')
}

Client-Side Throttling

Proactively control request rate at the application layer to avoid hitting server limits:

python

import time
from threading import Lock

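class RateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds."""

    def __init__(self, max_requests, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = []  # Timestamps of requests inside the current window
        self.lock = Lock()

    def acquire(self):
        """Block until a request slot is free, then record the request."""
        with self.lock:
            # Monotonic clock: unaffected by system clock adjustments
            now = time.monotonic()
            # Prune entries outside the window
            self.requests = [t for t in self.requests if now - t < self.window]

            if len(self.requests) >= self.max_requests:
                # Wait until the oldest request leaves the window, then drop it.
                # Sleeping while holding the lock makes other callers queue here.
                wait = self.window - (now - self.requests[0])
                if wait > 0:
                    time.sleep(wait)
                self.requests.pop(0)

            self.requests.append(time.monotonic())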
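
The limiter above addresses the RPM dimension only. For the concurrency dimension, one common approach is a semaphore that caps the number of in-flight requests. The sketch below is a minimal version under that assumption; `ConcurrencyLimiter` and `slot` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class ConcurrencyLimiter:
    """Cap the number of requests that are in flight at the same time."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        """Block until a slot is free; release it when the block exits."""
        self._slots.acquire()
        try:
            yield
        finally:
            self._slots.release()


# Usage sketch: deepseek-v4-pro allows 5 concurrent requests
# limiter = ConcurrencyLimiter(5)
# with limiter.slot():
#     response = call_with_retry(url, headers, payload)
```

Releasing the slot in a `finally` clause matters here: if a request raises, the slot is still returned, so failures do not permanently reduce the available concurrency.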

Best Practices

Monitor response headers: Read `X-RateLimit-Remaining` and slow down before you reach the limit
Implement request queues: Use a queue to control request throughput in high-concurrency scenarios
Cache results: Cache the results of identical queries to avoid redundant requests
Use `Retry-After`: When you receive a 429 response, prefer the server-provided `Retry-After` value over your own backoff delay
Distinguish error types: Retry only on 429 and 5xx errors. A 400 error means the request itself is invalid, so fix the request instead of retrying it
Async tasks are exempt from RPM: Video generation and other async tasks are limited by concurrency, not RPM. Status polling does not count toward RPM.
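
Two of the practices above, distinguishing error types and caching results, can be sketched in a few lines. The helper names `is_retryable` and `ResponseCache` are hypothetical, and the cache assumes that identical payloads may safely share one result, which may not hold for sampled text or image generation.

```python
import json


def is_retryable(status_code):
    """True for 429 and 5xx; other 4xx errors need a corrected request."""
    return status_code == 429 or 500 <= status_code <= 599


class ResponseCache:
    """Tiny in-memory cache keyed by the request payload (hypothetical helper)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(payload):
        # sort_keys makes logically identical payloads produce the same key
        return json.dumps(payload, sort_keys=True)

    def get_or_call(self, payload, send):
        """Return the cached result for `payload`, calling `send` only on a miss."""
        key = self._key(payload)
        if key not in self._store:
            self._store[key] = send(payload)
        return self._store[key]
```

Here `send` is any callable that performs the actual request, for example a small wrapper around `call_with_retry`. This cache is unbounded and not thread-safe, so a long-running or multi-threaded service would likely want eviction and a lock.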
