MCP Rate Limiting Patterns: Token Bucket, Leaky Bucket & Sliding Window

Rate limiting is the unsung hero of production MCP servers. Without proper rate limiting, your ChatGPT app can burn through API quotas in minutes, rack up unexpected costs, or get throttled by downstream services. This guide explores three battle-tested rate limiting algorithms—token bucket, leaky bucket, and sliding window—with production-ready implementations you can deploy today.

Whether you're building a fitness studio chatbot that calls external APIs or a restaurant reservation system that queries databases, rate limiting protects your infrastructure from abuse, controls costs, and ensures fair resource allocation across users. OpenAI's ChatGPT Apps platform expects MCP servers to handle concurrent requests gracefully, and rate limiting is your first line of defense.

In this article, we'll implement each algorithm in TypeScript, explore per-user quotas for multi-tenant MCP servers, and build a distributed rate limiter using Redis for high-availability deployments. By the end, you'll have a complete rate limiting toolkit that scales from prototype to production.

Why Rate Limiting Matters for MCP Servers

MCP servers operate in a unique environment: they receive tool calls from ChatGPT's model, which may retry requests, batch multiple tools, or trigger parallel executions. Without rate limiting, three critical problems emerge:

Cost Control: External API calls (geocoding, payment processing, CRM lookups) have per-request costs. A chatbot that geocodes every restaurant query without limits could incur thousands of dollars in monthly API fees.

Quota Management: Third-party APIs enforce rate limits (e.g., 100 requests/minute). Exceeding these limits results in 429 errors, which degrade the user experience and trigger exponential backoff delays.

Fair Resource Allocation: Multi-tenant MCP servers serve multiple ChatGPT apps simultaneously. Rate limiting ensures one app doesn't monopolize database connections, CPU time, or network bandwidth.

OpenAI's Apps SDK documentation recommends implementing rate limiting at the tool level, especially for tools that trigger expensive operations. Let's explore three algorithms that solve these challenges.

Token Bucket Algorithm: Burst Handling with Refills

The token bucket algorithm is ideal for APIs that allow short bursts of traffic. Each user gets a "bucket" of tokens (e.g., 100 tokens). Every request consumes one token. Tokens refill at a fixed rate (e.g., 10 tokens/second). If the bucket is empty, requests are rejected until tokens refill.

Use Case: Geocoding API with 100 requests/minute limit, allowing bursts of 20 requests for bulk address lookups.

Implementation: Token Bucket Rate Limiter

// token-bucket-limiter.ts
// Production-ready token bucket rate limiter with burst support

interface TokenBucket {
  tokens: number;
  maxTokens: number;
  refillRate: number; // tokens per second
  lastRefill: number; // timestamp (ms)
}

export class TokenBucketLimiter {
  private buckets: Map<string, TokenBucket> = new Map();
  private readonly defaultMaxTokens: number;
  private readonly defaultRefillRate: number;

  constructor(maxTokens: number = 100, refillRate: number = 10) {
    this.defaultMaxTokens = maxTokens;
    this.defaultRefillRate = refillRate;
  }

  /**
   * Attempts to consume tokens from the user's bucket.
   * Returns true if tokens were consumed, false if rate limited.
   */
  consume(userId: string, tokens: number = 1): boolean {
    const bucket = this.getOrCreateBucket(userId);
    this.refillBucket(bucket);

    if (bucket.tokens >= tokens) {
      bucket.tokens -= tokens;
      return true; // Request allowed
    }

    return false; // Rate limited
  }

  /**
   * Refills tokens based on elapsed time since last refill.
   */
  private refillBucket(bucket: TokenBucket): void {
    const now = Date.now();
    const elapsed = (now - bucket.lastRefill) / 1000; // seconds
    const tokensToAdd = elapsed * bucket.refillRate;

    bucket.tokens = Math.min(
      bucket.maxTokens,
      bucket.tokens + tokensToAdd
    );
    bucket.lastRefill = now;
  }

  private getOrCreateBucket(userId: string): TokenBucket {
    if (!this.buckets.has(userId)) {
      this.buckets.set(userId, {
        tokens: this.defaultMaxTokens,
        maxTokens: this.defaultMaxTokens,
        refillRate: this.defaultRefillRate,
        lastRefill: Date.now(),
      });
    }
    return this.buckets.get(userId)!;
  }

  /**
   * Returns remaining tokens and the time until the bucket is fully refilled.
   */
  getStatus(userId: string): {
    tokens: number;
    maxTokens: number;
    refillRate: number;
    nextRefillMs: number;
  } {
    const bucket = this.getOrCreateBucket(userId);
    this.refillBucket(bucket);

    const tokensUntilFull = bucket.maxTokens - bucket.tokens;
    const nextRefillMs = (tokensUntilFull / bucket.refillRate) * 1000;

    return {
      tokens: Math.floor(bucket.tokens),
      maxTokens: bucket.maxTokens,
      refillRate: bucket.refillRate,
      nextRefillMs: Math.ceil(nextRefillMs),
    };
  }

  /**
   * Resets a user's bucket (useful for admin overrides or testing).
   */
  reset(userId: string): void {
    this.buckets.delete(userId);
  }

  /**
   * Clears expired buckets (garbage collection).
   */
  clearExpired(maxAgeMs: number = 3600000): void {
    const now = Date.now();
    for (const [userId, bucket] of this.buckets.entries()) {
      if (now - bucket.lastRefill > maxAgeMs) {
        this.buckets.delete(userId);
      }
    }
  }
}

// Example usage in MCP tool handler
import { TokenBucketLimiter } from './token-bucket-limiter';

const geocodingLimiter = new TokenBucketLimiter(100, 10); // 100 tokens, 10/sec refill

export async function geocodeAddress(
  address: string,
  userId: string
): Promise<{ lat: number; lng: number }> {
  if (!geocodingLimiter.consume(userId)) {
    const status = geocodingLimiter.getStatus(userId);
    throw new Error(
      `Rate limit exceeded. ${status.tokens} tokens remaining. ` +
      `Bucket fully refills in ${Math.ceil(status.nextRefillMs / 1000)}s.`
    );
  }

  // Call external geocoding API (a key is required; the env var name here is illustrative)
  const response = await fetch(
    `https://maps.googleapis.com/maps/api/geocode/json` +
      `?address=${encodeURIComponent(address)}&key=${process.env.GOOGLE_MAPS_API_KEY}`
  );
  const data = await response.json();
  return data.results[0].geometry.location;
}

Key Features:

  • Burst Support: Users can consume up to 100 tokens instantly for bulk operations.
  • Automatic Refills: Tokens replenish at 10/second, ensuring smooth long-term throughput.
  • Status Endpoint: getStatus() returns remaining tokens and next refill time for client-side UI.
  • Garbage Collection: clearExpired() removes stale buckets to prevent memory leaks.

When to Use: APIs with burst allowances (geocoding, image processing, PDF generation).
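
To make the burst-and-refill behavior concrete (and to show the clearExpired() housekeeping from the list above in use), here is a minimal sketch built on the TokenBucketLimiter implementation; the small bucket size and intervals are illustrative:

// token-bucket-demo.ts
// Minimal sketch: burst, refill, and periodic garbage collection
import { TokenBucketLimiter } from './token-bucket-limiter';

const limiter = new TokenBucketLimiter(5, 1); // 5-token burst, 1 token/sec refill

async function demo(): Promise<void> {
  // Burst: the first 5 calls succeed immediately, the 6th is rejected
  for (let i = 1; i <= 6; i++) {
    console.log(`request ${i}:`, limiter.consume('user-123') ? 'allowed' : 'rate limited');
  }

  // Wait ~2 seconds; roughly 2 tokens have refilled by then
  await new Promise((resolve) => setTimeout(resolve, 2000));
  console.log('after refill:', limiter.getStatus('user-123'));
}

// Hourly cleanup of buckets idle for more than an hour (prevents unbounded memory growth)
setInterval(() => limiter.clearExpired(3600000), 3600000);

demo();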

Leaky Bucket Algorithm: Smooth Traffic with Queue Management

The leaky bucket algorithm treats requests like water filling a bucket with a hole at the bottom. Requests enter the bucket (queue) at any rate, but leak out (are processed) at a fixed rate. If the bucket overflows, requests are rejected.

Use Case: Database query rate limiting where you want to smooth traffic spikes and prevent connection pool exhaustion.

Implementation: Leaky Bucket Rate Limiter

// leaky-bucket-limiter.ts
// Production-ready leaky bucket rate limiter with queue management

interface LeakyBucket {
  queue: Array<{ timestamp: number; cost: number }>;
  maxCapacity: number;
  leakRate: number; // requests per second
  lastLeak: number; // timestamp (ms)
}

export class LeakyBucketLimiter {
  private buckets: Map<string, LeakyBucket> = new Map();
  private readonly defaultCapacity: number;
  private readonly defaultLeakRate: number;

  constructor(capacity: number = 50, leakRate: number = 5) {
    this.defaultCapacity = capacity;
    this.defaultLeakRate = leakRate;
  }

  /**
   * Attempts to add a request to the bucket.
   * Returns true if added, false if bucket overflows.
   */
  consume(userId: string, cost: number = 1): boolean {
    const bucket = this.getOrCreateBucket(userId);
    this.leakBucket(bucket);

    const currentLoad = bucket.queue.reduce((sum, req) => sum + req.cost, 0);

    if (currentLoad + cost <= bucket.maxCapacity) {
      bucket.queue.push({ timestamp: Date.now(), cost });
      return true; // Request queued
    }

    return false; // Bucket overflow
  }

  /**
   * Leaks (removes) requests from the bucket based on elapsed time.
   */
  private leakBucket(bucket: LeakyBucket): void {
    const now = Date.now();
    const elapsed = (now - bucket.lastLeak) / 1000; // seconds
    const leakCredit = elapsed * bucket.leakRate; // cost units we may drain

    let leaked = 0;
    while (
      bucket.queue.length > 0 &&
      leaked + bucket.queue[0].cost <= leakCredit
    ) {
      leaked += bucket.queue.shift()!.cost; // Remove fully-leaked request
    }

    if (bucket.queue.length === 0) {
      bucket.lastLeak = now;
    } else {
      // Carry forward unused leak credit; otherwise frequent calls with tiny
      // elapsed intervals would reset lastLeak and the queue would never drain.
      bucket.lastLeak = now - ((leakCredit - leaked) / bucket.leakRate) * 1000;
    }
  }

  private getOrCreateBucket(userId: string): LeakyBucket {
    if (!this.buckets.has(userId)) {
      this.buckets.set(userId, {
        queue: [],
        maxCapacity: this.defaultCapacity,
        leakRate: this.defaultLeakRate,
        lastLeak: Date.now(),
      });
    }
    return this.buckets.get(userId)!;
  }

  /**
   * Returns current queue size and estimated wait time.
   */
  getStatus(userId: string): {
    queueSize: number;
    capacity: number;
    leakRate: number;
    estimatedWaitMs: number;
  } {
    const bucket = this.getOrCreateBucket(userId);
    this.leakBucket(bucket);

    const queueSize = bucket.queue.reduce((sum, req) => sum + req.cost, 0);
    const estimatedWaitMs = (queueSize / bucket.leakRate) * 1000;

    return {
      queueSize,
      capacity: bucket.maxCapacity,
      leakRate: bucket.leakRate,
      estimatedWaitMs: Math.ceil(estimatedWaitMs),
    };
  }

  /**
   * Clears a user's queue (useful for cancellations).
   */
  reset(userId: string): void {
    this.buckets.delete(userId);
  }
}

// Example usage in MCP tool handler
import { LeakyBucketLimiter } from './leaky-bucket-limiter';

const dbQueryLimiter = new LeakyBucketLimiter(50, 5); // 50 capacity, 5 req/sec leak

export async function queryDatabase(
  sql: string,
  userId: string
): Promise<any[]> {
  if (!dbQueryLimiter.consume(userId)) {
    const status = dbQueryLimiter.getStatus(userId);
    throw new Error(
      `Rate limit exceeded. Queue size: ${status.queueSize}/${status.capacity}. ` +
      `Estimated wait: ${Math.ceil(status.estimatedWaitMs / 1000)}s.`
    );
  }

  // Execute database query
  const results = await db.query(sql);
  return results;
}

Key Features:

  • Queue Management: Requests queue up during traffic spikes and leak out smoothly.
  • Cost-Based Limiting: Each request has a cost (e.g., complex queries cost more).
  • Wait Time Estimation: Clients know how long until their request processes.

When to Use: Database queries, file uploads, webhook deliveries (where smooth throughput matters).
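
The cost parameter from the feature list above is what makes the leaky bucket useful for heterogeneous workloads. Here is a minimal sketch, assuming the LeakyBucketLimiter implementation above; the cost model based on SQL keywords is purely illustrative:

// query-cost-demo.ts
// Minimal sketch: cost-based limiting with the LeakyBucketLimiter above
import { LeakyBucketLimiter } from './leaky-bucket-limiter';

const limiter = new LeakyBucketLimiter(50, 5); // 50 capacity, leaks 5 cost units/sec

// Illustrative cost model: scans and joins cost more than a simple lookup
function queryCost(sql: string): number {
  if (/JOIN|LIKE|ILIKE/i.test(sql)) return 5; // heavy query
  return 1; // simple lookup
}

export function tryQuery(userId: string, sql: string): boolean {
  const cost = queryCost(sql);
  if (!limiter.consume(userId, cost)) {
    const { estimatedWaitMs } = limiter.getStatus(userId);
    console.warn(`Query rejected (cost ${cost}); retry in ~${Math.ceil(estimatedWaitMs / 1000)}s`);
    return false;
  }
  return true;
}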

Sliding Window Log: Precise Rate Limiting

The sliding window algorithm tracks requests in a rolling time window (e.g., the last 60 seconds). Unlike the token bucket, it enforces a strict cap: no more than 100 requests in any rolling minute, with no extra burst allowance. The implementation below is a sliding window log; it stores a timestamp for every request, which gives exact counts at the cost of memory that grows with request volume.

Use Case: SaaS API quotas where you need precise enforcement (e.g., "100 tool calls per hour" in MakeAIHQ's pricing tiers).

Implementation: Sliding Window Rate Limiter

// sliding-window-limiter.ts
// Production-ready sliding window rate limiter

interface SlidingWindow {
  requests: Array<{ timestamp: number; cost: number }>;
  maxRequests: number;
  windowMs: number;
}

export class SlidingWindowLimiter {
  private windows: Map<string, SlidingWindow> = new Map();
  private readonly defaultMaxRequests: number;
  private readonly defaultWindowMs: number;

  constructor(maxRequests: number = 100, windowMs: number = 60000) {
    this.defaultMaxRequests = maxRequests;
    this.defaultWindowMs = windowMs;
  }

  /**
   * Attempts to add a request to the sliding window.
   * Returns true if within limit, false if exceeded.
   */
  consume(userId: string, cost: number = 1): boolean {
    const window = this.getOrCreateWindow(userId);
    this.pruneExpired(window);

    const currentCount = window.requests.reduce((sum, req) => sum + req.cost, 0);

    if (currentCount + cost <= window.maxRequests) {
      window.requests.push({ timestamp: Date.now(), cost });
      return true; // Request allowed
    }

    return false; // Rate limit exceeded
  }

  /**
   * Removes requests outside the sliding window.
   */
  private pruneExpired(window: SlidingWindow): void {
    const now = Date.now();
    const cutoff = now - window.windowMs;
    window.requests = window.requests.filter(req => req.timestamp > cutoff);
  }

  private getOrCreateWindow(userId: string): SlidingWindow {
    if (!this.windows.has(userId)) {
      this.windows.set(userId, {
        requests: [],
        maxRequests: this.defaultMaxRequests,
        windowMs: this.defaultWindowMs,
      });
    }
    return this.windows.get(userId)!;
  }

  /**
   * Returns current usage and reset time.
   */
  getStatus(userId: string): {
    used: number;
    limit: number;
    windowMs: number;
    resetMs: number;
  } {
    const window = this.getOrCreateWindow(userId);
    this.pruneExpired(window);

    const used = window.requests.reduce((sum, req) => sum + req.cost, 0);
    const oldestRequest = window.requests[0];
    const resetMs = oldestRequest
      ? Math.max(0, oldestRequest.timestamp + window.windowMs - Date.now())
      : 0;

    return { used, limit: window.maxRequests, windowMs: window.windowMs, resetMs };
  }

  reset(userId: string): void {
    this.windows.delete(userId);
  }
}

// Example usage in MCP tool handler
import { SlidingWindowLimiter } from './sliding-window-limiter';

const toolCallLimiter = new SlidingWindowLimiter(1000, 3600000); // 1000/hour

export async function handleToolCall(
  toolName: string,
  args: any,
  userId: string
): Promise<any> {
  if (!toolCallLimiter.consume(userId)) {
    const status = toolCallLimiter.getStatus(userId);
    throw new Error(
      `Hourly quota exceeded. Used ${status.used}/${status.limit} tool calls. ` +
      `Resets in ${Math.ceil(status.resetMs / 60000)} minutes.`
    );
  }

  // Execute tool logic
  return executeTool(toolName, args);
}

Key Features:

  • Precise Enforcement: Never more than the configured limit in any rolling window; there is no separate burst allowance on top of the limit.
  • Rolling Window: Window slides with each request (not fixed intervals).
  • Quota Tracking: Perfect for SaaS pricing tiers (Free: 1K/month, Pro: 50K/month).

When to Use: MCP server tool call quotas, SaaS API limits, fair-use policies.
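
One property worth verifying is that the window really does slide: a rejected request becomes allowed as soon as the oldest request falls out of the window, not at a fixed interval boundary. A small sketch using the SlidingWindowLimiter above, with a deliberately tiny limit and window for readability:

// sliding-window-demo.ts
// Minimal sketch: the window slides per request rather than resetting on fixed boundaries
import { SlidingWindowLimiter } from './sliding-window-limiter';

const limiter = new SlidingWindowLimiter(3, 10000); // 3 requests per rolling 10 seconds

async function demo(): Promise<void> {
  limiter.consume('user-123'); // t = 0s
  await new Promise((r) => setTimeout(r, 4000));
  limiter.consume('user-123'); // t = 4s
  limiter.consume('user-123'); // t = 4s

  console.log(limiter.consume('user-123')); // false: 3 requests already in the last 10s
  console.log(limiter.getStatus('user-123').resetMs); // ~6000ms until the t=0 request expires

  await new Promise((r) => setTimeout(r, 6500));
  console.log(limiter.consume('user-123')); // true: the t=0 request has slid out of the window
}

demo();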

Per-User Quotas: Tier-Based Rate Limiting

Multi-tenant MCP servers need different rate limits per pricing tier. Free users get 1,000 tool calls/month, Professional users get 50,000, Business users get 200,000.

Implementation: Quota Manager with Tier Support

// quota-manager.ts
// Production-ready per-user quota manager with tier support

type PricingTier = 'free' | 'starter' | 'professional' | 'business';

interface QuotaConfig {
  tier: PricingTier;
  monthlyLimit: number;
  burstLimit: number;
  refillRate: number; // tokens per second
}

const TIER_CONFIGS: Record<PricingTier, Omit<QuotaConfig, 'tier'>> = {
  free: { monthlyLimit: 1000, burstLimit: 10, refillRate: 0.5 },
  starter: { monthlyLimit: 10000, burstLimit: 50, refillRate: 5 },
  professional: { monthlyLimit: 50000, burstLimit: 100, refillRate: 25 },
  business: { monthlyLimit: 200000, burstLimit: 500, refillRate: 100 },
};

interface UserQuota {
  tier: PricingTier;
  monthlyUsed: number;
  monthlyLimit: number;
  tokens: number;
  maxTokens: number;
  refillRate: number;
  lastRefill: number;
  resetDate: Date;
}

export class QuotaManager {
  private quotas: Map<string, UserQuota> = new Map();

  /**
   * Initializes quota for a user based on their pricing tier.
   */
  setUserTier(userId: string, tier: PricingTier): void {
    const config = TIER_CONFIGS[tier];
    const quota = this.quotas.get(userId);

    if (quota) {
      // Upgrade/downgrade existing user
      quota.tier = tier;
      quota.monthlyLimit = config.monthlyLimit;
      quota.maxTokens = config.burstLimit;
      quota.refillRate = config.refillRate;
    } else {
      // New user
      this.quotas.set(userId, {
        tier,
        monthlyUsed: 0,
        monthlyLimit: config.monthlyLimit,
        tokens: config.burstLimit,
        maxTokens: config.burstLimit,
        refillRate: config.refillRate,
        lastRefill: Date.now(),
        resetDate: this.getNextResetDate(),
      });
    }
  }

  /**
   * Attempts to consume quota.
   * Checks both monthly limit and token bucket.
   */
  consume(userId: string, cost: number = 1): boolean {
    const quota = this.quotas.get(userId);
    if (!quota) {
      throw new Error(`User ${userId} quota not initialized`);
    }

    // Check if monthly quota needs reset
    if (new Date() >= quota.resetDate) {
      quota.monthlyUsed = 0;
      quota.resetDate = this.getNextResetDate();
    }

    // Check monthly limit
    if (quota.monthlyUsed + cost > quota.monthlyLimit) {
      return false; // Monthly quota exceeded
    }

    // Refill tokens
    this.refillTokens(quota);

    // Check token bucket
    if (quota.tokens < cost) {
      return false; // Burst limit exceeded
    }

    // Consume quota
    quota.monthlyUsed += cost;
    quota.tokens -= cost;
    return true;
  }

  private refillTokens(quota: UserQuota): void {
    const now = Date.now();
    const elapsed = (now - quota.lastRefill) / 1000;
    const tokensToAdd = elapsed * quota.refillRate;

    quota.tokens = Math.min(quota.maxTokens, quota.tokens + tokensToAdd);
    quota.lastRefill = now;
  }

  private getNextResetDate(): Date {
    const now = new Date();
    return new Date(now.getFullYear(), now.getMonth() + 1, 1); // First day of next month
  }

  /**
   * Returns comprehensive quota status.
   */
  getStatus(userId: string) {
    const quota = this.quotas.get(userId);
    if (!quota) {
      throw new Error(`User ${userId} quota not initialized`);
    }

    this.refillTokens(quota);

    return {
      tier: quota.tier,
      monthlyUsed: quota.monthlyUsed,
      monthlyLimit: quota.monthlyLimit,
      monthlyRemaining: quota.monthlyLimit - quota.monthlyUsed,
      burstTokens: Math.floor(quota.tokens),
      maxBurstTokens: quota.maxTokens,
      resetDate: quota.resetDate.toISOString(),
    };
  }
}

// Example usage in MCP server
import { QuotaManager } from './quota-manager';

const quotaManager = new QuotaManager();

// Initialize user quotas based on Stripe subscription
quotaManager.setUserTier('user-123', 'professional');
quotaManager.setUserTier('user-456', 'free');

export async function handleToolCall(toolName: string, userId: string) {
  if (!quotaManager.consume(userId)) {
    const status = quotaManager.getStatus(userId);
    throw new Error(
      `Quota exceeded. Used ${status.monthlyUsed}/${status.monthlyLimit} calls. ` +
      `Resets on ${new Date(status.resetDate).toLocaleDateString()}.`
    );
  }

  // Execute tool
  return executeTool(toolName);
}

Key Features:

  • Tier-Based Limits: Different quotas for Free, Starter, Professional, Business.
  • Dual Limiting: Monthly quota + token bucket burst protection.
  • Automatic Reset: Monthly quotas reset on the first day of each month.

When to Use: SaaS MCP servers with pricing tiers (like MakeAIHQ's 4-tier model).
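
Tier assignment usually comes from your billing system. Here is a hedged sketch of keeping QuotaManager in sync with subscription changes; the webhook route, payload fields, and plan-ID-to-tier mapping are illustrative placeholders, not a real Stripe integration:

// tier-sync.ts
// Hedged sketch: keep QuotaManager tiers in sync with billing events.
// The route, payload shape, and plan-ID mapping below are illustrative assumptions.
import express from 'express';
import { QuotaManager } from './quota-manager';

const quotaManager = new QuotaManager();
const app = express();
app.use(express.json());

// Hypothetical mapping from billing plan identifiers to pricing tiers
const PLAN_TO_TIER: Record<string, 'free' | 'starter' | 'professional' | 'business'> = {
  plan_free: 'free',
  plan_starter: 'starter',
  plan_pro: 'professional',
  plan_business: 'business',
};

app.post('/webhooks/billing', (req, res) => {
  const { userId, planId } = req.body; // assumed payload fields
  const tier = PLAN_TO_TIER[planId] ?? 'free';
  quotaManager.setUserTier(userId, tier); // upgrades/downgrades take effect immediately
  res.json({ ok: true });
});

app.listen(4000);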

Distributed Rate Limiting with Redis

In-memory rate limiters work for single-server deployments, but multi-instance MCP servers (load-balanced Cloud Functions, Kubernetes pods) need distributed coordination. Redis provides atomic counters and TTLs perfect for rate limiting.

Implementation: Redis-Based Distributed Limiter

// redis-rate-limiter.ts
// Production-ready distributed rate limiter using Redis

import { createClient, RedisClientType } from 'redis';

export class RedisRateLimiter {
  private client: RedisClientType;
  private ready: Promise<unknown>;

  constructor(redisUrl: string) {
    this.client = createClient({ url: redisUrl });
    // connect() is asynchronous; keep the promise so commands can await it
    this.ready = this.client.connect();
  }

  /**
   * Token bucket implementation using Redis.
   */
  async consumeTokens(
    userId: string,
    tokens: number,
    maxTokens: number,
    refillRate: number
  ): Promise<boolean> {
    await this.ready; // Wait for the connection before issuing commands
    const key = `rate_limit:token_bucket:${userId}`;
    const now = Date.now();

    const result = await this.client.eval(
      `
      local key = KEYS[1]
      local now = tonumber(ARGV[1])
      local tokens_to_consume = tonumber(ARGV[2])
      local max_tokens = tonumber(ARGV[3])
      local refill_rate = tonumber(ARGV[4])

      local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
      local current_tokens = tonumber(bucket[1]) or max_tokens
      local last_refill = tonumber(bucket[2]) or now

      -- Refill tokens
      local elapsed = (now - last_refill) / 1000
      local tokens_to_add = elapsed * refill_rate
      current_tokens = math.min(max_tokens, current_tokens + tokens_to_add)

      -- Try to consume
      if current_tokens >= tokens_to_consume then
        current_tokens = current_tokens - tokens_to_consume
        redis.call('HMSET', key, 'tokens', current_tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600) -- 1 hour TTL
        return 1 -- Success
      else
        return 0 -- Rate limited
      end
      `,
      {
        keys: [key],
        arguments: [now.toString(), tokens.toString(), maxTokens.toString(), refillRate.toString()],
      }
    );

    return result === 1;
  }

  /**
   * Sliding window implementation using Redis sorted sets.
   */
  async consumeSlidingWindow(
    userId: string,
    maxRequests: number,
    windowMs: number
  ): Promise<boolean> {
    await this.ready; // Wait for the connection before issuing commands
    const key = `rate_limit:sliding_window:${userId}`;
    const now = Date.now();
    const cutoff = now - windowMs;

    const result = await this.client.eval(
      `
      local key = KEYS[1]
      local now = tonumber(ARGV[1])
      local cutoff = tonumber(ARGV[2])
      local max_requests = tonumber(ARGV[3])
      local window_ms = tonumber(ARGV[4])

      -- Remove expired entries
      redis.call('ZREMRANGEBYSCORE', key, 0, cutoff)

      -- Count current requests
      local current_count = redis.call('ZCARD', key)

      -- Try to add request. The member is the timestamp itself, so two requests
      -- in the same millisecond for one user would collide; append a unique
      -- suffix to the member if you need that resolution.
      if current_count < max_requests then
        redis.call('ZADD', key, now, now)
        redis.call('PEXPIRE', key, window_ms)
        return 1 -- Success
      else
        return 0 -- Rate limited
      end
      `,
      {
        keys: [key],
        arguments: [now.toString(), cutoff.toString(), maxRequests.toString(), windowMs.toString()],
      }
    );

    return result === 1;
  }

  async close(): Promise<void> {
    await this.client.quit();
  }
}

// Example usage in Cloud Functions
import { RedisRateLimiter } from './redis-rate-limiter';

const limiter = new RedisRateLimiter(process.env.REDIS_URL!);

export async function handleRequest(userId: string) {
  const allowed = await limiter.consumeTokens(userId, 1, 100, 10);
  if (!allowed) {
    throw new Error('Rate limit exceeded');
  }

  // Process request
  return { success: true };
}

Key Features:

  • Atomic Operations: Lua scripts ensure race-condition-free rate limiting.
  • Distributed Coordination: All MCP server instances share the same Redis counters.
  • Auto-Expiry: TTLs automatically clean up stale data.

When to Use: Load-balanced Cloud Functions, Kubernetes deployments, multi-region MCP servers.
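
One operational decision the Redis limiter forces is what happens when Redis itself is unreachable. A common choice is to fail open (allow the request and log the error) so a cache outage does not take down every tool call; failing closed is safer for strict billing quotas. A hedged sketch of the fail-open wrapper, using the RedisRateLimiter above:

// safe-limiter.ts
// Hedged sketch: fail-open wrapper around the RedisRateLimiter above
import { RedisRateLimiter } from './redis-rate-limiter';

const limiter = new RedisRateLimiter(process.env.REDIS_URL!);

export async function allowRequest(userId: string): Promise<boolean> {
  try {
    return await limiter.consumeTokens(userId, 1, 100, 10);
  } catch (err) {
    // Fail open: a Redis outage should not block every tool call.
    // For strict billing quotas, return false here instead (fail closed).
    console.error('Rate limiter unavailable, allowing request', err);
    return true;
  }
}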

Rate Limit Middleware for Express MCP Servers

Most MCP servers use Express.js for HTTP transport. Here's production-ready middleware that integrates all three algorithms.

Implementation: Express Rate Limit Middleware

// rate-limit-middleware.ts
// Express middleware for MCP rate limiting

import { Request, Response, NextFunction } from 'express';
import { TokenBucketLimiter } from './token-bucket-limiter';
import { SlidingWindowLimiter } from './sliding-window-limiter';

interface RateLimitConfig {
  algorithm: 'token-bucket' | 'sliding-window';
  maxRequests: number;
  windowMs?: number; // For sliding window
  refillRate?: number; // For token bucket
  keyGenerator?: (req: Request) => string;
  handler?: (req: Request, res: Response) => void;
}

export function rateLimitMiddleware(config: RateLimitConfig) {
  const limiter =
    config.algorithm === 'token-bucket'
      ? new TokenBucketLimiter(config.maxRequests, config.refillRate || 10)
      : new SlidingWindowLimiter(config.maxRequests, config.windowMs || 60000);

  const keyGenerator = config.keyGenerator || ((req: Request) => req.ip || 'anonymous');

  const defaultHandler = (req: Request, res: Response) => {
    const userId = keyGenerator(req);
    const status = limiter.getStatus(userId);

    res.status(429).json({
      error: 'Rate limit exceeded',
      ...status,
      retryAfter: Math.ceil(
        ('nextRefillMs' in status ? status.nextRefillMs : status.resetMs) / 1000
      ),
    });
  };

  const handler = config.handler || defaultHandler;

  return (req: Request, res: Response, next: NextFunction) => {
    const userId = keyGenerator(req);
    const allowed = limiter.consume(userId);

    if (allowed) {
      const status = limiter.getStatus(userId);
      if ('nextRefillMs' in status) {
        // Token bucket: expose burst capacity and remaining tokens
        res.setHeader('X-RateLimit-Limit', status.maxTokens);
        res.setHeader('X-RateLimit-Remaining', status.tokens);
      } else {
        // Sliding window: expose window limit and remaining requests
        res.setHeader('X-RateLimit-Limit', status.limit);
        res.setHeader('X-RateLimit-Remaining', status.limit - status.used);
      }
      next();
    } else {
      handler(req, res);
    }
  };
}

// Example usage in MCP server
import express from 'express';
import { rateLimitMiddleware } from './rate-limit-middleware';

const app = express();

// Global rate limit: 100 requests per minute
app.use(
  rateLimitMiddleware({
    algorithm: 'sliding-window',
    maxRequests: 100,
    windowMs: 60000,
  })
);

// Per-user rate limit for tool calls
app.post(
  '/mcp/tools',
  rateLimitMiddleware({
    algorithm: 'token-bucket',
    maxRequests: 50,
    refillRate: 5,
    keyGenerator: (req) => req.body.userId || req.ip,
  }),
  async (req, res) => {
    const result = await handleToolCall(req.body);
    res.json(result);
  }
);

app.listen(3000);

Key Features:

  • Flexible Key Generation: Rate limit by IP, user ID, API key, or custom logic.
  • Standard Headers: Returns X-RateLimit-Limit and X-RateLimit-Remaining headers.
  • Custom Handlers: Override default 429 response with custom error messages.
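
For example, a custom handler can return ChatGPT-friendly error text along with a standard Retry-After header. A minimal sketch, assuming the rateLimitMiddleware above; the 10-second retry value is illustrative and should match your refill rate:

// custom-handler-example.ts
// Minimal sketch: custom 429 handler with a Retry-After header
import { Request, Response } from 'express';
import { rateLimitMiddleware } from './rate-limit-middleware';

const toolLimiter = rateLimitMiddleware({
  algorithm: 'token-bucket',
  maxRequests: 50,
  refillRate: 5,
  keyGenerator: (req: Request) => req.body.userId || req.ip || 'anonymous',
  handler: (req: Request, res: Response) => {
    res.setHeader('Retry-After', '10'); // seconds
    res.status(429).json({
      error: 'rate_limited',
      message: 'Too many tool calls. Please wait a few seconds and try again.',
    });
  },
});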

Quota Monitoring Dashboard

Production MCP servers need observability into rate limit usage. Here's a simple quota monitoring dashboard.

// quota-monitor.ts
// Real-time quota monitoring and alerting

import { QuotaManager } from './quota-manager';

export class QuotaMonitor {
  private quotaManager: QuotaManager;
  private alerts: Array<{ userId: string; message: string; timestamp: Date }> = [];

  constructor(quotaManager: QuotaManager) {
    this.quotaManager = quotaManager;
  }

  /**
   * Checks if user is approaching quota limits.
   */
  checkThresholds(userId: string): void {
    const status = this.quotaManager.getStatus(userId);
    const usagePercent = (status.monthlyUsed / status.monthlyLimit) * 100;

    if (usagePercent >= 90) {
      this.addAlert(
        userId,
        `Critical: ${usagePercent.toFixed(1)}% of monthly quota used (${status.monthlyUsed}/${status.monthlyLimit})`
      );
    } else if (usagePercent >= 75) {
      this.addAlert(
        userId,
        `Warning: ${usagePercent.toFixed(1)}% of monthly quota used (${status.monthlyUsed}/${status.monthlyLimit})`
      );
    }

    if (status.burstTokens < 5) {
      this.addAlert(userId, `Burst tokens depleted: ${status.burstTokens}/${status.maxBurstTokens} remaining`);
    }
  }

  private addAlert(userId: string, message: string): void {
    this.alerts.push({ userId, message, timestamp: new Date() });

    // Send webhook notification
    this.sendWebhook(userId, message);
  }

  private async sendWebhook(userId: string, message: string): Promise<void> {
    // Example: Send to Slack, Discord, or custom webhook
    if (!process.env.ALERT_WEBHOOK_URL) return; // No webhook configured
    await fetch(process.env.ALERT_WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ userId, message, timestamp: new Date() }),
    });
  }

  /**
   * Returns aggregated quota metrics for all users.
   */
  getMetrics(): {
    totalUsers: number;
    criticalUsers: number;
    warningUsers: number;
    avgUsagePercent: number;
  } {
    // Implementation depends on data source (Firestore, Redis, etc.)
    return {
      totalUsers: 0,
      criticalUsers: 0,
      warningUsers: 0,
      avgUsagePercent: 0,
    };
  }

  getRecentAlerts(limit: number = 50): Array<{ userId: string; message: string; timestamp: Date }> {
    return this.alerts.slice(-limit);
  }
}

// Example usage
const monitor = new QuotaMonitor(quotaManager);

// Check quotas after each tool call
app.post('/mcp/tools', async (req, res) => {
  const userId = req.body.userId;
  await handleToolCall(req.body);
  monitor.checkThresholds(userId);
  res.json({ success: true });
});

// Admin dashboard endpoint
app.get('/admin/quota-metrics', (req, res) => {
  res.json({
    metrics: monitor.getMetrics(),
    recentAlerts: monitor.getRecentAlerts(20),
  });
});

Conclusion: Choose the Right Algorithm for Your MCP Server

Rate limiting isn't one-size-fits-all. Choose your algorithm based on your use case:

  • Token Bucket: Best for APIs that allow short bursts (geocoding, image processing, PDF generation). Users can consume 100 tokens instantly, then wait for refills.

  • Leaky Bucket: Best for smoothing traffic spikes (database queries, webhook deliveries). Requests queue up during bursts and process at a steady rate.

  • Sliding Window: Best for strict quota enforcement (SaaS tool call limits, fair-use policies). Exactly 100 requests per hour, no exceptions.

For multi-tenant MCP servers like MakeAIHQ's ChatGPT app builder, combine per-user quotas with token bucket for burst protection and Redis for distributed coordination across Cloud Functions instances.

Start with an in-memory limiter for prototypes, then upgrade to Redis when you deploy to production with load balancing. Monitor quota usage with webhooks and alert users when they approach limits.
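
One way to keep that upgrade path cheap is to code against a small limiter interface so the in-memory and Redis implementations are interchangeable. A sketch of that seam, assuming the TokenBucketLimiter and RedisRateLimiter from this article (the interface and adapter names are illustrative):

// rate-limiter-interface.ts
// Hedged sketch: a common seam so in-memory and Redis limiters are swappable
import { TokenBucketLimiter } from './token-bucket-limiter';
import { RedisRateLimiter } from './redis-rate-limiter';

export interface RateLimiter {
  consume(userId: string, cost?: number): Promise<boolean>;
}

// In-memory adapter for prototypes and single-instance deployments
export class InMemoryLimiter implements RateLimiter {
  private inner = new TokenBucketLimiter(100, 10);
  async consume(userId: string, cost = 1): Promise<boolean> {
    return this.inner.consume(userId, cost);
  }
}

// Redis-backed adapter for load-balanced production deployments
export class DistributedLimiter implements RateLimiter {
  private inner: RedisRateLimiter;
  constructor(redisUrl: string) {
    this.inner = new RedisRateLimiter(redisUrl);
  }
  async consume(userId: string, cost = 1): Promise<boolean> {
    return this.inner.consumeTokens(userId, cost, 100, 10);
  }
}

// Pick the implementation at startup based on environment
export const limiter: RateLimiter = process.env.REDIS_URL
  ? new DistributedLimiter(process.env.REDIS_URL)
  : new InMemoryLimiter();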

Ready to Build Production-Grade MCP Servers?

MakeAIHQ's no-code ChatGPT app builder handles rate limiting, quota management, and cost controls automatically. Build your first MCP-powered app in under 48 hours—no complex algorithms, no Redis setup, no quota headaches.

Start Your Free Trial →

Build once, deploy to ChatGPT App Store and web simultaneously. Join 700+ businesses reaching 800 million ChatGPT users.


Internal Links

  • Building Production MCP Servers: Complete Architecture Guide - Master MCP server architecture, resource management, and deployment patterns
  • MCP Resource Management Best Practices - Optimize memory usage, connection pooling, and caching strategies
  • MCP Server Cost Optimization Strategies - Reduce API costs, database queries, and infrastructure expenses
  • Real-Time MCP Server Monitoring with Prometheus - Track quota usage, error rates, and performance metrics
  • MCP Server Security: Authentication & Authorization - Implement OAuth 2.1, API key validation, and token verification
  • Scaling MCP Servers: Load Balancing & Auto-Scaling - Handle 10,000+ concurrent ChatGPT conversations
  • MCP Server Error Handling & Retry Logic - Graceful degradation, exponential backoff, circuit breakers

About the Author: The MakeAIHQ team builds production-grade MCP servers that power ChatGPT apps for fitness studios, restaurants, and professional services. We've handled billions of tool calls and learned rate limiting the hard way—so you don't have to.

Last Updated: December 25, 2026 | Reading Time: 9 minutes | Skill Level: Intermediate to Advanced