Rate Limits
Understand API rate limits and how to manage them. Rate limits ensure platform stability and a fair experience for all users.
How It Works
Rate limits use a sliding window algorithm to control the number of requests per time unit. When requests exceed the limit, the API returns a 429 Too Many Requests error.
Rate limits are enforced on two dimensions:
- RPM (Requests Per Minute): Maximum requests allowed per minute
- Concurrency: Maximum number of simultaneous in-flight requests
When either dimension reaches its limit, subsequent requests are rejected.
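To make the two dimensions concrete, here is a minimal toy model of how an RPM sliding window and a concurrency cap might interact. It is only an illustration, not the platform's actual implementation: the class name `TwoDimensionLimit` and its methods are hypothetical, and it assumes that rejected requests do not count toward the window.

```python
from collections import deque


class TwoDimensionLimit:
    """Toy model: an RPM sliding window combined with a concurrency cap."""

    def __init__(self, rpm, concurrency, window_seconds=60.0):
        self.rpm = rpm
        self.concurrency = concurrency
        self.window = window_seconds
        self.timestamps = deque()  # start times of accepted requests
        self.in_flight = 0         # accepted requests that have not finished

    def try_start(self, now):
        """Return True if a request arriving at time `now` would be accepted."""
        # Slide the window: forget requests that started `window` seconds ago or earlier.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm:
            return False  # RPM dimension exhausted (would be a 429)
        if self.in_flight >= self.concurrency:
            return False  # concurrency dimension exhausted (would be a 429)
        self.timestamps.append(now)
        self.in_flight += 1
        return True

    def finish(self):
        """Mark one in-flight request as completed."""
        self.in_flight -= 1


limit = TwoDimensionLimit(rpm=3, concurrency=2)
print(limit.try_start(0.0))   # True
print(limit.try_start(0.1))   # True
print(limit.try_start(0.2))   # False: two requests are already in flight
limit.finish()
print(limit.try_start(0.3))   # True: third request in the window
limit.finish()
print(limit.try_start(0.4))   # False: three requests already in the last 60 s
print(limit.try_start(60.0))  # True: the request from t=0.0 has left the window
```

Note that finishing a request frees a concurrency slot immediately, while an RPM slot only frees up once the request's start time slides out of the window.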
Per-Model Limits
Each model has its own rate limits:
Text Generation Models
| Model | RPM | Concurrency | Notes |
|---|---|---|---|
| deepseek-v4-flash | 100 | 20 | High-speed model |
| deepseek-v4-pro | 30 | 5 | High-performance |
| qwen3.7-max | 60 | 10 | General-purpose |
| glm-5.7 | 60 | 10 | General-purpose |
| kimi-k2.6 | 30 | 5 | Long-context model |
| minimax-m3 | 30 | 5 | General-purpose |
Image Generation Models
| Model | RPM | Concurrency | Notes |
|---|---|---|---|
| Kolors | 30 | 5 | Fast generation |
| gpt-image-2 | 20 | 3 | High quality |
| nano-banana-2 | 10 | 2 | High quality |
| doubao-seedream-4.0 | 10 | 2 | Doubao image gen |
| doubao-seedream-5.0-lite | 10 | 2 | Doubao image gen |
| doubao-seedream-4.5 | 10 | 2 | Doubao image gen |
Video Generation Models
| Model | RPM | Daily Limit | Concurrency | Notes |
|---|---|---|---|---|
| veo-3 | 5 | 100 | 2 | Google Veo |
| wanx2.1-t2v-turbo | 10 | 200 | 3 | Wanx 2.7 |
| doubao-seedance-2-0 | 5 | 100 | 2 | Doubao Seedance 2.0 |
Web Search
| Model | RPM | Concurrency | Notes |
|---|---|---|---|
| web-search | 30 | 5 | Per-request |
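One way to use these tables on the client is to keep the limits in a small lookup and derive a request spacing from them. The sketch below copies three rows from the tables above; the names `MODEL_LIMITS` and `min_interval_seconds` are hypothetical helpers, not part of the API.

```python
# Limits copied from the tables above for three of the listed models.
MODEL_LIMITS = {
    "deepseek-v4-flash": {"rpm": 100, "concurrency": 20},
    "deepseek-v4-pro": {"rpm": 30, "concurrency": 5},
    "veo-3": {"rpm": 5, "concurrency": 2, "daily": 100},
}


def min_interval_seconds(model):
    """Average spacing between requests that keeps a steady stream at the RPM limit."""
    return 60.0 / MODEL_LIMITS[model]["rpm"]


print(min_interval_seconds("deepseek-v4-pro"))  # 2.0
print(min_interval_seconds("veo-3"))            # 12.0
```

Evenly spacing requests is more conservative than the limit strictly requires, since a sliding window also permits short bursts, but it makes staying under the limit easy to reason about.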
Rate Limit Response Headers
When a request hits a rate limit, the response includes these HTTP headers:
| Header | Description | Example |
|---|---|---|
| X-RateLimit-Limit | Maximum requests allowed in the current window | 30 |
| X-RateLimit-Remaining | Requests remaining in the current window | 0 |
| X-RateLimit-Reset | Unix timestamp when the window resets | 1705312260 |
| Retry-After | Recommended wait time in seconds (429 only) | 30 |
Use these headers to proactively throttle requests on the client side.
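As a minimal sketch of header-based throttling, the helper below reads `X-RateLimit-Remaining` and `X-RateLimit-Reset` and returns how long to pause before the next request. The function name `seconds_until_reset` is hypothetical, and the sketch assumes the header values are plain integers as in the examples above.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to pause before sending again, based on rate limit headers.

    Returns 0.0 when requests remain in the window or the headers are missing.
    """
    now = time.time() if now is None else now
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0.0
    if int(remaining) > 0:
        return 0.0
    # Window exhausted: wait until the reset timestamp (never a negative duration).
    return max(0.0, float(reset) - now)


example_headers = {
    "X-RateLimit-Limit": "30",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1705312260",
}
print(seconds_until_reset(example_headers, now=1705312230))  # 30.0
```

A plain `dict` is case-sensitive, so the keys must match exactly here; the `headers` object on a `requests` response is case-insensitive and can be passed in directly.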
Handling Rate Limits
Exponential Backoff with Jitter
When you receive a 429 error, use an exponential backoff strategy with jitter:
```python
import random
import time

import requests


def call_with_retry(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code == 429:
            # Prefer server-suggested wait time
            retry_after = response.headers.get("Retry-After")
            if retry_after:
                wait = int(retry_after)
            else:
                # Exponential backoff + random jitter
                wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited, retrying in {wait:.1f}s...")
            time.sleep(wait)
            continue
        return response
    raise RuntimeError("Max retries exceeded")
```

```javascript
async function callWithRetry(url, headers, payload, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      method: 'POST',
      headers,
      body: JSON.stringify(payload),
    })
    if (response.status === 429) {
      // Prefer server-suggested wait time, else exponential backoff + jitter
      const retryAfter = response.headers.get('Retry-After')
      const wait = retryAfter ? parseInt(retryAfter, 10) : Math.pow(2, attempt) + Math.random()
      console.log(`Rate limited, retrying in ${wait.toFixed(1)}s...`)
      await new Promise((r) => setTimeout(r, wait * 1000))
      continue
    }
    return response
  }
  throw new Error('Max retries exceeded')
}
```

Client-Side Throttling
Proactively control request rate at the application layer to avoid hitting server limits:
```python
import time
from threading import Lock


class RateLimiter:
    def __init__(self, max_requests, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = []  # timestamps of recent requests, oldest first
        self.lock = Lock()

    def acquire(self):
        # Holding the lock while sleeping makes other threads queue up behind this one.
        with self.lock:
            # Monotonic clock: unaffected by system clock adjustments
            now = time.monotonic()
            # Prune entries outside the window
            self.requests = [t for t in self.requests if now - t < self.window]
            if len(self.requests) >= self.max_requests:
                # Wait until the oldest request leaves the window
                wait = self.window - (now - self.requests[0])
                if wait > 0:
                    time.sleep(wait)
            self.requests.append(time.monotonic())
```

Best Practices
- Monitor response headers: Read `X-RateLimit-Remaining` and slow down proactively before hitting the limit
- Implement request queues: Use a queue to control request throughput in high-concurrency scenarios
- Cache results: Cache identical queries to reduce redundant requests
- Use `Retry-After`: When receiving a 429 response, prefer the server-provided `Retry-After` value
- Distinguish error types: Only retry on 429 and 5xx errors; fix 400 errors before retrying
- Async tasks are exempt from RPM: Video generation and other async tasks are limited by concurrency, not RPM. Status polling does not count toward RPM.
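Two of the practices above, caching identical queries and retrying only on 429 and 5xx, can be sketched in a few lines. The names `is_retryable`, `cached_call`, and the `send` callable are hypothetical; `send` stands in for whatever function actually performs the request. The cache is a plain in-memory dict with no expiry, which may not suit responses that change over time.

```python
import json


def is_retryable(status_code):
    """Return True for responses worth retrying: 429 and any 5xx."""
    return status_code == 429 or 500 <= status_code <= 599


_cache = {}


def cached_call(payload, send):
    """Return the cached result for an identical payload; call `send` only on a miss."""
    # Serialise with sorted keys so that key order does not create distinct cache entries.
    key = json.dumps(payload, sort_keys=True)
    if key not in _cache:
        _cache[key] = send(payload)
    return _cache[key]


print(is_retryable(429))  # True
print(is_retryable(503))  # True
print(is_retryable(400))  # False
```

A cache hit uses neither an RPM slot nor a concurrency slot, so caching helps most with repeated identical prompts.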