I've implemented rate limiting three times. The first two were disasters. Here's what I learned about token buckets, sliding windows, and why Redis is your friend.
My first API rate limiting implementation was a disaster. I used a simple counter that reset every minute. Seemed fine in testing. Then a bot hit my API at 11:59:59, made 100 requests, waited 2 seconds, and made another 100. My 'protected' endpoint got hammered.
That's when I learned there's more to rate limiting than counting requests. Token buckets, sliding windows, distributed state - it gets complex fast. But the fundamentals aren't hard once someone explains them properly.
This is what I wish someone had told me before I deployed my first rate limiter. Use the JWT Decoder to inspect rate limit claims if you're embedding limits in tokens.
Several algorithms underlie rate limiting implementations. Each has different characteristics for burst handling, memory usage, and fairness.
Token Bucket maintains a bucket of tokens that refills at a constant rate. Each request consumes a token. When the bucket is empty, requests are rejected. This allows bursts up to the bucket size while enforcing an average rate.
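A minimal token bucket can be sketched in a few lines of JavaScript. The class and method names here are my own, not from any particular library, and passing `now` explicitly is just to keep the logic easy to test:

```javascript
// Token bucket: refills at `ratePerSec` tokens per second, up to `capacity`.
class TokenBucket {
  constructor(capacity, ratePerSec, now = Date.now()) {
    this.capacity = capacity;
    this.ratePerSec = ratePerSec;
    this.tokens = capacity; // start full, so bursts are allowed immediately
    this.lastRefill = now;
  }

  // Returns true if a token was available (request allowed).
  tryConsume(now = Date.now()) {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Note how a full bucket lets a client burn through `capacity` requests instantly, after which it is throttled to the refill rate.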
Leaky Bucket processes requests at a constant rate, queuing excess requests. It smooths traffic to a steady rate, eliminating bursts. This is ideal when downstream systems cannot handle variable load.
Fixed Window counts requests in fixed time windows (e.g., per minute). Simple to implement but allows bursts at window boundaries—a user could make double the limit by timing requests at window edges.
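The boundary problem is easy to demonstrate with a toy fixed-window counter (the function names are illustrative):

```javascript
// Fixed window: count requests per window, keyed by floor(now / windowMs).
function makeFixedWindowLimiter(limit, windowMs) {
  const counts = new Map();
  return function allow(nowMs) {
    const windowKey = Math.floor(nowMs / windowMs);
    const used = counts.get(windowKey) || 0;
    if (used >= limit) return false;
    counts.set(windowKey, used + 1);
    return true;
  };
}

// With a 100/minute limit, 100 requests at t=59s and 100 more at t=61s
// all pass: the two bursts land in different windows, so the client got
// 200 requests through in about 2 seconds.
```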
Sliding Window tracks requests over a moving window, eliminating the boundary burst problem. More accurate than fixed window but requires more memory or computation.
Understanding the tradeoffs between algorithms helps you choose the right one for your use case.
Token Bucket is ideal when you want to allow bursts while enforcing an overall rate. A user with 100 tokens per minute can make 100 requests instantly, then wait for refill. This matches real usage patterns where users work in bursts.
Fixed Window is simplest to implement—increment a counter, reset at window boundaries. The edge burst problem matters less if your limits are generous relative to typical usage. Many APIs use fixed windows successfully.
Sliding Window eliminates edge bursts by considering requests over the past N seconds continuously. Implementation options include sliding logs (store each timestamp) or sliding window counters (approximate using weighted windows).
Sliding Window Counters combine benefits: accuracy approaching sliding logs with memory efficiency closer to fixed windows. Weight the current and previous window by overlap to approximate the true sliding window.
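The weighted approximation is a one-liner. Given the counts from the previous and current fixed windows, weight the previous window by how much of it still overlaps the sliding window (function and parameter names are mine):

```javascript
// Sliding window counter: approximate requests in the last `windowMs`
// by weighting the previous fixed window's count by its remaining overlap.
function slidingCount(prevCount, currCount, nowMs, windowMs) {
  const elapsed = nowMs % windowMs;                // time into the current window
  const prevWeight = (windowMs - elapsed) / windowMs;
  return prevCount * prevWeight + currCount;
}
```

For example, 25% into the current window, a previous-window count of 80 contributes 60, so 80 previous plus 20 current approximates 80 requests in the sliding window.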
Rate limiting typically happens at the API gateway or in middleware before requests reach application logic.
Identify requests by API key, user ID, IP address, or a combination. API keys are most reliable. IP addresses are problematic when users share IPs (NAT, corporate networks). Consider your user base when choosing identifiers.
Store rate limit state in a fast, shared data store. Redis is the most common choice—it is fast, supports atomic operations, and handles expiration. For single-server deployments, in-memory storage works.
Check limits early in the request pipeline. Rejecting rate-limited requests quickly saves resources. Do not authenticate, parse bodies, or run business logic before checking limits.
Return appropriate responses. 429 Too Many Requests is the standard status. Include Retry-After header to tell clients when to retry. Provide helpful error messages.
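Those steps fit naturally into Express-style middleware. This is a sketch, assuming a `limiter` object that exposes a `tryConsume(key)` method returning `{ allowed, retryAfterSec }` (that interface is my invention, not a library API):

```javascript
// Express-style middleware: check the limit before auth or body parsing.
// `limiter.tryConsume(key)` is an assumed interface returning
// { allowed: boolean, retryAfterSec: number }.
function rateLimitMiddleware(limiter) {
  return function (req, res, next) {
    const key = req.headers['x-api-key'] || req.ip; // identify the caller
    const { allowed, retryAfterSec } = limiter.tryConsume(key);
    if (!allowed) {
      res.set('Retry-After', String(retryAfterSec));
      return res.status(429).json({ error: 'rate limit exceeded, retry later' });
    }
    next(); // under the limit: continue to auth, parsing, business logic
  };
}
```

Mounting this before authentication and body-parsing middleware means rejected requests cost almost nothing.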
Standard headers communicate rate limit status to clients. Consistent headers enable clients to adapt their behavior and avoid hitting limits.
X-RateLimit-Limit indicates the maximum requests allowed in the current window. This tells clients their quota.
X-RateLimit-Remaining shows how many requests remain in the current window. Clients can pace themselves as they approach zero.
X-RateLimit-Reset indicates when the limit resets, usually as a Unix timestamp. Use the Timestamp Converter to verify these values during debugging.
Retry-After on 429 responses tells clients how long to wait before retrying. This can be seconds or an HTTP date.
RateLimit-Policy is an emerging standard that provides structured rate limit information including quota, window, and burst capacity.
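Assembling those headers for a response might look like this (a sketch following the de-facto X-RateLimit convention; `nowUnixSec` is passed in rather than read from the clock only to keep the function deterministic):

```javascript
// Build rate limit headers for every response; add Retry-After when limited.
function rateLimitHeaders(limit, remaining, resetUnixSec, nowUnixSec) {
  const headers = {
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(Math.max(0, remaining)),
    'X-RateLimit-Reset': String(resetUnixSec), // Unix timestamp of the reset
  };
  if (remaining <= 0) {
    // Tell the client exactly how many seconds to wait before retrying.
    headers['Retry-After'] = String(Math.max(0, resetUnixSec - nowUnixSec));
  }
  return headers;
}
```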
JWTs can carry rate limit information as claims, enabling per-user limits without database lookups.
Include rate limit tier or plan identifier in JWT claims. The API checks this claim and applies appropriate limits. Different user plans get different limits.
Avoid including current quota usage in JWTs. Tokens are issued once and cannot track changing state. Usage must be tracked server-side.
Use the JWT Decoder to inspect tokens and verify rate limit claims are present and correct. This helps debug why certain users are getting unexpected limits.
Consider custom claims for specialized limits: per-endpoint limits, burst capacity, or grace periods. Structure claims clearly—complex claims become maintenance burdens.
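Reading a tier claim from a token is straightforward in Node. The claim name `rate_tier` and the tier table below are examples I made up, not a standard, and this snippet only decodes the payload - in production you must verify the signature first (e.g. with a library like jsonwebtoken):

```javascript
// Map an example tier claim to a requests-per-minute limit.
// NOTE: decode only - signature verification must happen before trusting claims.
const TIER_LIMITS = { free: 60, pro: 600, enterprise: 6000 }; // illustrative values

function limitForToken(jwt) {
  const payloadB64 = jwt.split('.')[1]; // header.payload.signature
  const payload = JSON.parse(Buffer.from(payloadB64, 'base64url').toString('utf8'));
  return TIER_LIMITS[payload.rate_tier] ?? TIER_LIMITS.free; // unknown tier -> free
}
```

Because the tier lives in the token, the API can pick a limit without a database lookup on every request.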
When your API runs on multiple servers, rate limiting must work across the cluster. Several strategies handle this coordination.
Centralized storage (Redis) is the most common approach. All servers read and write limits to a single Redis instance or cluster. This provides accurate, consistent limits but adds latency and a dependency.
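The classic pattern against Redis is INCR plus EXPIRE on a per-window key. This sketch assumes a `client` exposing promise-based `incr`/`expire` as node-redis does; note the two-step INCR/EXPIRE has a small race (a crash between them leaves a key with no TTL), which a Lua script or `SET ... EX NX` closes:

```javascript
// Fixed-window limit against a shared store (e.g. Redis).
// `client` is assumed to expose promise-based incr/expire like node-redis.
async function checkLimit(client, key, limit, windowSec) {
  const windowId = Math.floor(Date.now() / 1000 / windowSec);
  const windowKey = `rl:${key}:${windowId}`;
  const count = await client.incr(windowKey); // atomic across all servers
  if (count === 1) {
    await client.expire(windowKey, windowSec); // first hit in the window sets TTL
  }
  return count <= limit;
}
```

Because INCR is atomic, concurrent requests from different servers cannot both see the same count, which is exactly the race that breaks naive read-then-write implementations.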
Eventual consistency accepts slight inaccuracies for reduced latency. Each server maintains local counters that sync periodically. A user might slightly exceed limits during sync delays.
Sticky sessions route each user to the same server, enabling local rate limiting. This is simple but limits load balancing flexibility and fails when servers restart.
Token-based quotas assign quota blocks to servers. Each server has a portion of the total limit. This reduces coordination but can leave quota unused on less-loaded servers.
Rate limiting generates valuable signals about API usage and potential abuse. Monitoring these signals enables proactive management.
Track rate limit hits by user and endpoint. Frequent hits from legitimate users suggest limits are too strict or usage patterns are changing. Hits from unknown sources may indicate abuse.
Alert on unusual patterns. Sudden spikes in rate limit hits may indicate attacks or malfunctioning clients. Gradual increases may indicate growing popularity requiring limit adjustments.
Monitor quota utilization. Users consistently near their limits may need upgrades or indicate that plans are poorly sized. Users never approaching limits may be paying for unused capacity.
Log rate limited requests with context. Include user ID, endpoint, current usage, and limit. This enables debugging and abuse investigation.
How clients handle rate limit errors affects user experience. Provide guidance and tools for graceful handling.
Always include Retry-After header with 429 responses. This tells clients exactly how long to wait. Include it in seconds or as an HTTP date.
Provide remaining quota in response headers for all requests, not just rate-limited ones. Clients can proactively slow down as they approach limits.
Document rate limits clearly. Include limits, windows, headers, and recommended handling in API documentation. Clients cannot adapt to limits they do not know about.
Consider degraded responses instead of hard failures. For read endpoints, returning cached or stale data may be acceptable when limits are hit. This maintains partial functionality.
Implement exponential backoff on your side for downstream rate limits. When your API calls external services that rate limit, back off gracefully rather than failing immediately.
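A backoff helper for that case can be sketched like this, honoring Retry-After when the server provides it and otherwise using exponential backoff with full jitter (the defaults are arbitrary starting points, not recommendations):

```javascript
// Delay before retry attempt N against a rate-limited downstream API.
// If the 429 carried Retry-After, honor it exactly; otherwise use
// exponential backoff with full jitter to avoid synchronized retries.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000, retryAfterSec = null) {
  if (retryAfterSec !== null) return retryAfterSec * 1000; // server told us when
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling); // "full jitter"
}
```

The jitter matters: without it, every client that got limited at the same moment retries at the same moment, recreating the spike.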
Following best practices and avoiding common mistakes leads to rate limiting that protects without frustrating users.
Set limits based on real usage data. Analyze how legitimate users consume your API before setting limits. Arbitrary limits often surprise users or fail to protect against real abuse.
Provide limit headroom. If typical usage is 50 requests per minute, setting a limit at 60 forces users to carefully manage their usage. Setting it at 100 or 200 protects against abuse while accommodating normal variation.
Consider burst allowances. Users often need to make many requests quickly, then pause. Token bucket algorithms naturally handle this. Fixed window limits may need explicit burst provisions.
Do not limit by IP alone in production. Multiple users share IPs through NAT, corporate proxies, and VPNs. IP-based limits are suitable for anonymous endpoints but problematic for authenticated APIs.
Test rate limiting under load. Verify your implementation handles high concurrency correctly. Race conditions in rate limit checks can allow limits to be exceeded.
Always return 429 Too Many Requests, and always include a Retry-After header. I made the mistake of returning 503 once - clients thought my server was down. Be specific.
Limit by API key whenever possible. IP limiting breaks for corporate networks (everyone shares an IP), VPNs, and mobile carriers. I use IP only as a last-resort anti-DDoS layer.
Start with fixed window - it's simpler. Upgrade to token bucket only if the edge burst problem actually hits you. Most APIs never need the complexity. I overthought this on my first implementation.
X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset (Unix timestamp). Plus Retry-After on 429 responses. That's it. Don't overcomplicate.
Redis. Just use Redis. I tried other approaches - in-memory with sticky sessions, eventual consistency - and they all had edge cases. Redis atomic operations solve this cleanly.
Founder of CodeUtil. Web developer building tools I actually use. When I'm not coding, I experiment with productivity techniques (with mixed success).