MCP Server Health Checks & Monitoring for ChatGPT Apps
Production MCP servers powering ChatGPT applications require robust health checking and monitoring infrastructure to ensure 99.9%+ uptime and rapid incident detection. Without proper observability, you're flying blind—unable to detect degraded performance, dependency failures, or cascading outages before they impact users.
Health checks serve as the foundation of reliable distributed systems, providing automated detection of unhealthy instances and enabling orchestrators like Kubernetes to automatically restart failed containers. Combined with comprehensive metrics collection and intelligent alerting, health checks transform reactive firefighting into proactive system management.
This guide walks through production-grade health checks, liveness/readiness probes, Prometheus metrics integration, and Grafana dashboards designed specifically for MCP servers serving ChatGPT applications. You'll learn the critical distinction between liveness and readiness probes, how to verify dependency health, how to configure meaningful alerts without causing alert fatigue, and how to build monitoring dashboards that surface actionable insights.
For teams managing MCP servers at scale, proper health monitoring is not optional—it's the difference between delivering a reliable service and experiencing catastrophic downtime. This comprehensive implementation covers everything from basic HTTP health endpoints to advanced observability patterns used by production SaaS platforms.
Understanding Liveness vs Readiness Probes
Kubernetes and modern orchestration platforms distinguish between two critical health check types: liveness probes detect whether a container should be restarted, while readiness probes determine whether a container should receive traffic. This distinction prevents cascading failures and enables zero-downtime deployments.
Liveness probes answer: "Is this process stuck or deadlocked?" A failed liveness probe triggers an automatic container restart. Design liveness checks to be lightweight and fast—they should only verify that the core event loop is responsive. Avoid checking external dependencies in liveness probes, as temporary database unavailability shouldn't trigger restarts.
Readiness probes answer: "Is this instance ready to serve requests?" A failed readiness probe removes the instance from the load balancer pool but does not restart the container. Readiness checks should verify all critical dependencies (database connections, cache availability, downstream APIs) before marking an instance as ready.
Startup probes provide additional protection during initialization. MCP servers with long initialization periods (loading large models, warming caches, establishing connections) benefit from startup probes that delay liveness/readiness checks until initialization completes. This prevents premature restarts during legitimate startup delays.
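In Kubernetes this is expressed as a startupProbe defined alongside the liveness and readiness probes; the sketch below assumes the same /health endpoint and port 8080 used in the probe configuration later in this guide.
# startup-probe.yaml - startup probe sketch (assumes /health on port 8080)
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 300s of initialization before the container is restarted
Liveness and readiness probes are suspended until the startup probe succeeds, so slow initialization no longer triggers restart loops.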
The golden rule: liveness checks verify process health, readiness checks verify service health. A database outage should fail readiness (stop serving traffic) but not liveness (don't restart the process). This pattern enables graceful degradation and prevents restart loops during infrastructure incidents.
Kubernetes probe configuration supports HTTP GET requests, TCP socket checks, and exec commands. HTTP endpoints are preferred for MCP servers, as they enable rich status responses and are easily testable during development. Configure appropriate timeouts (typically 1-3 seconds), failure thresholds (3-5 consecutive failures), and probe intervals (10-30 seconds) based on your service's characteristics.
For detailed MCP server architecture patterns, see our Complete Guide to Building ChatGPT Applications.
Implementing Health Check Endpoints
Production MCP servers require multiple health check endpoints serving different purposes. The standard pattern implements /health (liveness), /ready (readiness), and optionally /startup endpoints with appropriate status codes and diagnostic information.
// health-check-endpoints.ts - Production health check implementation
import express, { Request, Response, NextFunction } from 'express';
import { createServer } from 'http';
import { Logger } from 'pino';
import Redis from 'ioredis';
import { Pool } from 'pg';
interface HealthStatus {
status: 'healthy' | 'degraded' | 'unhealthy';
timestamp: string;
uptime: number;
version: string;
checks: Record<string, DependencyCheck>;
}
interface DependencyCheck {
status: 'healthy' | 'unhealthy';
latency?: number;
error?: string;
lastChecked: string;
}
export class HealthCheckManager {
private app: express.Application;
private startupTime: Date;
private isReady: boolean = false;
private logger: Logger;
private redis?: Redis;
private postgres?: Pool;
constructor(
private port: number,
private version: string,
logger: Logger
) {
this.app = express();
this.startupTime = new Date();
this.logger = logger.child({ component: 'health-check' });
this.setupRoutes();
}
/**
* Liveness probe - verifies process is responsive
* Should NEVER check external dependencies
*/
private setupLivenessProbe(): void {
this.app.get('/health', async (req: Request, res: Response) => {
const uptime = Math.floor((Date.now() - this.startupTime.getTime()) / 1000);
// Verify event loop is responsive
const eventLoopHealthy = await this.checkEventLoop();
if (!eventLoopHealthy) {
this.logger.error('Event loop is blocked - liveness check failed');
return res.status(503).json({
status: 'unhealthy',
timestamp: new Date().toISOString(),
uptime,
version: this.version,
reason: 'Event loop blocked'
});
}
res.status(200).json({
status: 'healthy',
timestamp: new Date().toISOString(),
uptime,
version: this.version,
memoryUsage: process.memoryUsage(),
cpuUsage: process.cpuUsage()
});
});
}
/**
* Readiness probe - verifies service can handle requests
* SHOULD check all critical dependencies
*/
private setupReadinessProbe(): void {
this.app.get('/ready', async (req: Request, res: Response) => {
if (!this.isReady) {
return res.status(503).json({
status: 'unhealthy',
timestamp: new Date().toISOString(),
reason: 'Service still initializing'
});
}
const checks: Record<string, DependencyCheck> = {};
let overallStatus: 'healthy' | 'degraded' | 'unhealthy' = 'healthy';
// Check Redis connection
if (this.redis) {
const redisCheck = await this.checkRedis();
checks.redis = redisCheck;
if (redisCheck.status === 'unhealthy') {
overallStatus = 'degraded';
}
}
// Check PostgreSQL connection
if (this.postgres) {
const postgresCheck = await this.checkPostgres();
checks.postgres = postgresCheck;
if (postgresCheck.status === 'unhealthy') {
overallStatus = 'unhealthy'; // Critical dependency
}
}
const statusCode = overallStatus === 'unhealthy' ? 503 : 200;
const response: HealthStatus = {
status: overallStatus,
timestamp: new Date().toISOString(),
uptime: Math.floor((Date.now() - this.startupTime.getTime()) / 1000),
version: this.version,
checks
};
res.status(statusCode).json(response);
});
}
/**
* Check event loop responsiveness
*/
private checkEventLoop(): Promise<boolean> {
return new Promise((resolve) => {
      const timeout = setTimeout(() => resolve(false), 2000); // respond within the probe's 3s timeout
setImmediate(() => {
clearTimeout(timeout);
resolve(true);
});
});
}
/**
* Check Redis connection health
*/
private async checkRedis(): Promise<DependencyCheck> {
const startTime = Date.now();
try {
await this.redis!.ping();
const latency = Date.now() - startTime;
return {
status: latency < 100 ? 'healthy' : 'unhealthy',
latency,
lastChecked: new Date().toISOString()
};
} catch (error) {
this.logger.error({ error }, 'Redis health check failed');
return {
status: 'unhealthy',
error: error instanceof Error ? error.message : 'Unknown error',
lastChecked: new Date().toISOString()
};
}
}
/**
* Check PostgreSQL connection health
*/
private async checkPostgres(): Promise<DependencyCheck> {
const startTime = Date.now();
try {
await this.postgres!.query('SELECT 1');
const latency = Date.now() - startTime;
return {
status: latency < 200 ? 'healthy' : 'unhealthy',
latency,
lastChecked: new Date().toISOString()
};
} catch (error) {
this.logger.error({ error }, 'PostgreSQL health check failed');
return {
status: 'unhealthy',
error: error instanceof Error ? error.message : 'Unknown error',
lastChecked: new Date().toISOString()
};
}
}
/**
* Register dependencies for health checking
*/
public registerDependencies(deps: {
redis?: Redis;
postgres?: Pool;
}): void {
this.redis = deps.redis;
this.postgres = deps.postgres;
}
/**
* Mark service as ready to receive traffic
*/
public markReady(): void {
this.isReady = true;
this.logger.info('Service marked as ready');
}
/**
* Setup all health check routes
*/
private setupRoutes(): void {
this.setupLivenessProbe();
this.setupReadinessProbe();
}
/**
* Start health check server
*/
public start(): void {
const server = createServer(this.app);
server.listen(this.port, () => {
this.logger.info(`Health check server listening on port ${this.port}`);
});
}
}
// Example Kubernetes probe configuration
/*
apiVersion: v1
kind: Pod
metadata:
name: mcp-server
spec:
containers:
- name: mcp-server
image: mcp-server:latest
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 15
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
*/
This implementation separates liveness checks (event loop responsiveness) from readiness checks (dependency health), preventing restart loops while enabling traffic removal during degraded states. The health check manager runs on a dedicated port (typically 8080) separate from the main MCP server port.
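A minimal bootstrap sketch for wiring the manager above is shown below; the connection strings, version string, and ports are placeholder assumptions rather than part of the implementation.
// bootstrap-health.ts - hypothetical wiring of HealthCheckManager (sketch)
import pino from 'pino';
import Redis from 'ioredis';
import { Pool } from 'pg';
import { HealthCheckManager } from './health-check-endpoints';

async function bootstrap(): Promise<void> {
  const logger = pino();
  // Health checks listen on a dedicated port, separate from the main MCP server port
  const health = new HealthCheckManager(8080, process.env.APP_VERSION ?? 'dev', logger);
  health.start();

  // Placeholder connection details - substitute your own configuration
  const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
  const postgres = new Pool({ connectionString: process.env.DATABASE_URL });

  // Fail fast if the critical dependency is unreachable during startup
  await postgres.query('SELECT 1');
  health.registerDependencies({ redis, postgres });

  // ... initialize the MCP server itself here ...

  // Only after initialization completes should /ready return 200
  health.markReady();
}

bootstrap().catch((err) => {
  console.error('Startup failed', err);
  process.exit(1);
});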
For comprehensive Prometheus metrics integration, see our guide on Prometheus Metrics Collection for ChatGPT Apps.
Prometheus Metrics Collection
Prometheus provides the industry-standard metrics collection and alerting platform for modern cloud-native applications. MCP servers should expose metrics at a /metrics endpoint in the Prometheus text exposition format, tracking request rates, error rates, latency distributions, and custom business metrics.
// prometheus-metrics-exporter.ts - Production metrics implementation
import express from 'express';
import client from 'prom-client';
import { Logger } from 'pino';
/**
* Prometheus metrics exporter for MCP servers
* Implements RED method (Rate, Errors, Duration) + custom metrics
*/
export class PrometheusMetricsExporter {
private app: express.Application;
private register: client.Registry;
private logger: Logger;
// RED Method Metrics
  // Definite-assignment assertions: these are initialized in initializeMetrics(), called from the constructor
  private requestCounter!: client.Counter;
  private errorCounter!: client.Counter;
  private requestDuration!: client.Histogram;
  // MCP-specific metrics
  private activeConnections!: client.Gauge;
  private toolInvocations!: client.Counter;
  private widgetRenders!: client.Counter;
  private cacheHits!: client.Counter;
  private cacheMisses!: client.Counter;
constructor(private port: number, logger: Logger) {
this.app = express();
this.register = new client.Registry();
this.logger = logger.child({ component: 'metrics' });
// Enable default metrics (CPU, memory, event loop, etc.)
client.collectDefaultMetrics({
register: this.register,
prefix: 'mcp_server_'
});
this.initializeMetrics();
this.setupRoutes();
}
private initializeMetrics(): void {
// Request rate metric
this.requestCounter = new client.Counter({
name: 'mcp_server_requests_total',
help: 'Total number of requests',
labelNames: ['method', 'path', 'status'],
registers: [this.register]
});
// Error rate metric
this.errorCounter = new client.Counter({
name: 'mcp_server_errors_total',
help: 'Total number of errors',
labelNames: ['method', 'path', 'error_type'],
registers: [this.register]
});
// Request duration metric (histogram for percentiles)
this.requestDuration = new client.Histogram({
name: 'mcp_server_request_duration_seconds',
help: 'Request duration in seconds',
labelNames: ['method', 'path', 'status'],
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
registers: [this.register]
});
// Active connections gauge
this.activeConnections = new client.Gauge({
name: 'mcp_server_active_connections',
help: 'Number of active WebSocket connections',
registers: [this.register]
});
// MCP tool invocations
this.toolInvocations = new client.Counter({
name: 'mcp_server_tool_invocations_total',
help: 'Total number of tool invocations',
labelNames: ['tool_name', 'status'],
registers: [this.register]
});
// Widget render metrics
this.widgetRenders = new client.Counter({
name: 'mcp_server_widget_renders_total',
help: 'Total number of widget renders',
labelNames: ['widget_type', 'display_mode'],
registers: [this.register]
});
// Cache performance
this.cacheHits = new client.Counter({
name: 'mcp_server_cache_hits_total',
help: 'Total number of cache hits',
labelNames: ['cache_type'],
registers: [this.register]
});
this.cacheMisses = new client.Counter({
name: 'mcp_server_cache_misses_total',
help: 'Total number of cache misses',
labelNames: ['cache_type'],
registers: [this.register]
});
}
/**
* Record HTTP request metrics
*/
public recordRequest(
method: string,
path: string,
status: number,
duration: number
): void {
this.requestCounter.inc({ method, path, status: status.toString() });
this.requestDuration.observe(
{ method, path, status: status.toString() },
duration / 1000 // Convert to seconds
);
}
/**
* Record error metrics
*/
public recordError(
method: string,
path: string,
errorType: string
): void {
this.errorCounter.inc({ method, path, error_type: errorType });
}
/**
* Update active connections gauge
*/
public setActiveConnections(count: number): void {
this.activeConnections.set(count);
}
/**
* Record tool invocation
*/
public recordToolInvocation(toolName: string, success: boolean): void {
this.toolInvocations.inc({
tool_name: toolName,
status: success ? 'success' : 'failure'
});
}
/**
* Record widget render
*/
public recordWidgetRender(widgetType: string, displayMode: string): void {
this.widgetRenders.inc({ widget_type: widgetType, display_mode: displayMode });
}
/**
* Record cache hit
*/
public recordCacheHit(cacheType: string): void {
this.cacheHits.inc({ cache_type: cacheType });
}
/**
* Record cache miss
*/
public recordCacheMiss(cacheType: string): void {
this.cacheMisses.inc({ cache_type: cacheType });
}
/**
* Setup metrics endpoint
*/
private setupRoutes(): void {
this.app.get('/metrics', async (req, res) => {
res.set('Content-Type', this.register.contentType);
res.end(await this.register.metrics());
});
}
/**
* Start metrics server
*/
public start(): void {
this.app.listen(this.port, () => {
this.logger.info(`Metrics server listening on port ${this.port}`);
});
}
}
// Example usage in Express middleware
export function createMetricsMiddleware(metrics: PrometheusMetricsExporter) {
return (req: express.Request, res: express.Response, next: express.NextFunction) => {
const startTime = Date.now();
// Capture response finish event
res.on('finish', () => {
const duration = Date.now() - startTime;
metrics.recordRequest(req.method, req.path, res.statusCode, duration);
if (res.statusCode >= 500) {
metrics.recordError(req.method, req.path, 'server_error');
} else if (res.statusCode >= 400) {
metrics.recordError(req.method, req.path, 'client_error');
}
});
next();
};
}
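The sketch below wires the exporter and middleware into an application server; the ports and example route are illustrative assumptions.
// bootstrap-metrics.ts - hypothetical wiring of the metrics exporter (sketch)
import express from 'express';
import pino from 'pino';
import { PrometheusMetricsExporter, createMetricsMiddleware } from './prometheus-metrics-exporter';

const logger = pino();

// Expose /metrics on its own port (9464 here is an arbitrary choice)
const metrics = new PrometheusMetricsExporter(9464, logger);
metrics.start();

// Instrument the main application with the RED middleware
const app = express();
app.use(createMetricsMiddleware(metrics));

app.get('/example', (_req, res) => {
  // Business metrics can be recorded wherever the exporter is in scope
  metrics.recordToolInvocation('example_tool', true);
  res.json({ ok: true });
});

app.listen(3000, () => logger.info('MCP server listening on port 3000'));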
The RED method (Rate, Errors, Duration) provides the essential metrics for monitoring service health. Track request rates to detect traffic spikes, error rates to identify service degradation, and latency distributions to catch performance regressions.
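For reference, here is a minimal Prometheus scrape configuration for the exporter above; the job name matches the up{job="mcp-server"} selector used in the alert rules below, while the target address and interval are assumptions to adapt to your environment.
# prometheus.yml - minimal scrape configuration (sketch)
scrape_configs:
  - job_name: 'mcp-server'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['mcp-server:9464']   # hostname and port are placeholders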
For building comprehensive monitoring dashboards, see our Grafana Monitoring Dashboards guide.
Configuring Intelligent Alerting
Effective alerting prevents alert fatigue while ensuring rapid incident response. Production alerting strategies balance sensitivity (catching real issues) with specificity (avoiding false positives), typically using multi-window anomaly detection and gradual escalation.
# prometheus-alert-rules.yml - Production alert configuration
groups:
- name: mcp_server_alerts
interval: 30s
rules:
# High error rate alert (multi-window)
- alert: HighErrorRate
expr: |
(
sum(rate(mcp_server_requests_total{status=~"5.."}[5m]))
/
sum(rate(mcp_server_requests_total[5m]))
) > 0.05
for: 3m
labels:
severity: critical
component: mcp-server
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
runbook_url: "https://wiki.company.com/runbooks/high-error-rate"
# High latency alert (p95 > 1s)
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(mcp_server_request_duration_seconds_bucket[5m])) by (le)
) > 1.0
for: 5m
labels:
severity: warning
component: mcp-server
annotations:
summary: "High request latency detected"
description: "P95 latency is {{ $value | humanizeDuration }}"
runbook_url: "https://wiki.company.com/runbooks/high-latency"
# Service down alert
- alert: ServiceDown
expr: up{job="mcp-server"} == 0
for: 1m
labels:
severity: critical
component: mcp-server
annotations:
summary: "MCP server instance is down"
description: "Instance {{ $labels.instance }} has been down for 1 minute"
runbook_url: "https://wiki.company.com/runbooks/service-down"
# High memory usage alert
- alert: HighMemoryUsage
expr: |
          (
            container_memory_working_set_bytes{container="mcp-server"}
            /
            container_spec_memory_limit_bytes{container="mcp-server"}
          ) > 0.80
for: 10m
labels:
severity: warning
component: mcp-server
annotations:
summary: "High memory usage detected"
description: "Memory usage is {{ $value | humanizePercentage }}"
runbook_url: "https://wiki.company.com/runbooks/high-memory"
# Cache hit rate degradation
- alert: LowCacheHitRate
expr: |
(
sum(rate(mcp_server_cache_hits_total[10m]))
/
(
sum(rate(mcp_server_cache_hits_total[10m]))
+
sum(rate(mcp_server_cache_misses_total[10m]))
)
) < 0.60
for: 15m
labels:
severity: warning
component: mcp-server
annotations:
summary: "Cache hit rate below threshold"
description: "Cache hit rate is {{ $value | humanizePercentage }} (threshold: 60%)"
runbook_url: "https://wiki.company.com/runbooks/low-cache-hit-rate"
# Tool invocation failure rate
- alert: HighToolFailureRate
expr: |
          (
            sum by (tool_name) (rate(mcp_server_tool_invocations_total{status="failure"}[5m]))
            /
            sum by (tool_name) (rate(mcp_server_tool_invocations_total[5m]))
          ) > 0.10
for: 5m
labels:
severity: warning
component: mcp-server
annotations:
summary: "High tool invocation failure rate"
description: "Tool failure rate is {{ $value | humanizePercentage }} for {{ $labels.tool_name }}"
runbook_url: "https://wiki.company.com/runbooks/tool-failures"
Alert configuration best practices include:
- Multi-window detection: Use the `for` duration to require sustained violations before firing alerts
- Runbook URLs: Every alert should link to documented remediation steps
- Severity levels: Critical (page on-call immediately), Warning (ticket for business hours), Info (dashboard only) - see the Alertmanager routing sketch after this list
- Rate-based metrics: Use `rate()` instead of raw counters to detect trends
- Percentile thresholds: Alert on p95/p99 latency, not the mean (which hides outliers)
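The severity labels above map naturally onto Alertmanager routing; the following sketch assumes hypothetical receiver names (pagerduty-oncall, ticketing) and should be adapted to your notification stack.
# alertmanager.yml - severity-based routing (sketch)
route:
  receiver: ticketing                 # default for anything not matched below
  group_by: ['alertname', 'component']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall      # page the on-call engineer immediately
    - matchers:
        - severity="warning"
      receiver: ticketing             # file a ticket for business hours
receivers:
  - name: pagerduty-oncall            # pagerduty_configs omitted in this sketch
  - name: ticketing                   # webhook/email configs omitted in this sketch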
For advanced observability patterns including distributed tracing, see our OpenTelemetry Integration guide.
Building Monitoring Dashboards
Grafana dashboards transform raw metrics into actionable insights, surfacing the trends, anomalies, and correlations that enable rapid root-cause analysis during incidents. Effective dashboards follow an inverted-pyramid structure: high-level SLIs at the top, drill-down details below.
// grafana-dashboard-generator.ts - Generate monitoring dashboards
import { Logger } from 'pino';
interface DashboardPanel {
id: number;
title: string;
type: string;
targets: Array<{ expr: string; legendFormat: string }>;
gridPos: { x: number; y: number; w: number; h: number };
}
export class GrafanaDashboardGenerator {
private logger: Logger;
constructor(logger: Logger) {
this.logger = logger.child({ component: 'dashboard-generator' });
}
/**
* Generate complete MCP server monitoring dashboard
*/
public generateMCPServerDashboard(): object {
const panels: DashboardPanel[] = [
// Request rate (top-level SLI)
{
id: 1,
title: 'Request Rate (req/s)',
type: 'graph',
targets: [{
expr: 'sum(rate(mcp_server_requests_total[5m]))',
legendFormat: 'Total Requests'
}],
gridPos: { x: 0, y: 0, w: 12, h: 8 }
},
// Error rate (top-level SLI)
{
id: 2,
title: 'Error Rate (%)',
type: 'graph',
targets: [{
expr: `
sum(rate(mcp_server_requests_total{status=~"5.."}[5m]))
/
sum(rate(mcp_server_requests_total[5m]))
* 100
`,
legendFormat: 'Error Rate'
}],
gridPos: { x: 12, y: 0, w: 12, h: 8 }
},
// P95 latency (top-level SLI)
{
id: 3,
title: 'P95 Latency (seconds)',
type: 'graph',
targets: [{
expr: `
histogram_quantile(0.95,
sum(rate(mcp_server_request_duration_seconds_bucket[5m])) by (le)
)
`,
legendFormat: 'P95'
}],
gridPos: { x: 0, y: 8, w: 12, h: 8 }
},
// Active connections
{
id: 4,
title: 'Active WebSocket Connections',
type: 'graph',
targets: [{
expr: 'mcp_server_active_connections',
legendFormat: '{{ instance }}'
}],
gridPos: { x: 12, y: 8, w: 12, h: 8 }
}
];
return {
dashboard: {
title: 'MCP Server Monitoring',
tags: ['mcp', 'chatgpt', 'production'],
timezone: 'browser',
panels,
schemaVersion: 38,
refresh: '30s'
}
};
}
}
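The generated JSON can be pushed through Grafana's dashboard HTTP API (POST /api/dashboards/db); the sketch below assumes a GRAFANA_URL and GRAFANA_API_TOKEN in the environment and a Node 18+ runtime for the global fetch.
// provision-dashboard.ts - push the generated dashboard to Grafana (sketch)
import pino from 'pino';
import { GrafanaDashboardGenerator } from './grafana-dashboard-generator';

async function provisionDashboard(): Promise<void> {
  const logger = pino();
  const generator = new GrafanaDashboardGenerator(logger);

  // Grafana expects { dashboard: {...}, overwrite: boolean } on /api/dashboards/db
  const payload = { ...generator.generateMCPServerDashboard(), overwrite: true };

  const response = await fetch(`${process.env.GRAFANA_URL}/api/dashboards/db`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.GRAFANA_API_TOKEN}`
    },
    body: JSON.stringify(payload)
  });

  if (!response.ok) {
    throw new Error(`Dashboard provisioning failed: ${response.status} ${await response.text()}`);
  }
  logger.info('Dashboard provisioned');
}

provisionDashboard().catch((err) => {
  console.error(err);
  process.exit(1);
});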
Dashboard design principles:
- Top row: Golden signals (latency, traffic, errors, saturation)
- Middle rows: Service-specific metrics (tool invocations, widget renders, cache performance)
- Bottom rows: Infrastructure metrics (CPU, memory, disk, network)
- Auto-refresh: 30-60 second intervals for production dashboards
- Time range selector: Default to last 1 hour, enable custom ranges
- Annotations: Mark deployments, incidents, and configuration changes (a deployment-marker sketch follows this list)
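Deployment markers can be created through Grafana's annotation API (POST /api/annotations); here is a short hypothetical sketch, reusing the same environment variables as the provisioning example above.
// annotate-deployment.ts - mark a deployment on Grafana dashboards (sketch)
export async function annotateDeployment(version: string): Promise<void> {
  const response = await fetch(`${process.env.GRAFANA_URL}/api/annotations`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.GRAFANA_API_TOKEN}`
    },
    body: JSON.stringify({
      time: Date.now(),                     // epoch milliseconds
      tags: ['deployment', 'mcp-server'],
      text: `Deployed mcp-server ${version}`
    })
  });
  if (!response.ok) {
    throw new Error(`Annotation failed with status ${response.status}`);
  }
}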
For defining SLIs, SLOs, and SLAs that drive your monitoring strategy, see our SLI/SLO/SLA Definition guide.
Advanced Incident Detection
Beyond basic threshold alerts, production systems benefit from anomaly detection, correlation analysis, and predictive alerting that identify issues before they impact users.
// incident-detector.ts - Advanced anomaly detection
import { Logger } from 'pino';
interface TimeSeriesDataPoint {
timestamp: number;
value: number;
}
interface AnomalyResult {
isAnomaly: boolean;
score: number;
threshold: number;
reason?: string;
}
export class IncidentDetector {
private logger: Logger;
private historicalData: Map<string, TimeSeriesDataPoint[]> = new Map();
constructor(logger: Logger) {
this.logger = logger.child({ component: 'incident-detector' });
}
/**
* Detect anomalies using rolling Z-score
*/
public detectAnomalyZScore(
metricName: string,
currentValue: number,
windowSize: number = 100
): AnomalyResult {
const history = this.historicalData.get(metricName) || [];
if (history.length < windowSize) {
// Insufficient data for detection
return { isAnomaly: false, score: 0, threshold: 0 };
}
const recentValues = history.slice(-windowSize).map(d => d.value);
const mean = recentValues.reduce((a, b) => a + b, 0) / windowSize;
const variance = recentValues.reduce((a, b) => a + Math.pow(b - mean, 2), 0) / windowSize;
    const stdDev = Math.sqrt(variance);
    if (stdDev === 0) {
      // A constant series has no variance, so there is no anomaly signal to score
      return { isAnomaly: false, score: 0, threshold: 3.0 };
    }
    const zScore = Math.abs((currentValue - mean) / stdDev);
    const threshold = 3.0; // 3 standard deviations
if (zScore > threshold) {
this.logger.warn({
metric: metricName,
currentValue,
mean,
stdDev,
zScore
}, 'Anomaly detected via Z-score');
return {
isAnomaly: true,
score: zScore,
threshold,
reason: `Value ${currentValue} is ${zScore.toFixed(2)} standard deviations from mean ${mean.toFixed(2)}`
};
}
return { isAnomaly: false, score: zScore, threshold };
}
/**
* Detect sudden spikes using rate of change
*/
public detectSpikeRateOfChange(
metricName: string,
currentValue: number,
percentageThreshold: number = 50
): AnomalyResult {
const history = this.historicalData.get(metricName) || [];
if (history.length < 2) {
return { isAnomaly: false, score: 0, threshold: percentageThreshold };
}
    const previousValue = history[history.length - 1].value;
    if (previousValue === 0) {
      // Guard against division by zero when the previous sample was zero
      return { isAnomaly: false, score: 0, threshold: percentageThreshold };
    }
    const percentChange = ((currentValue - previousValue) / previousValue) * 100;
if (Math.abs(percentChange) > percentageThreshold) {
this.logger.warn({
metric: metricName,
currentValue,
previousValue,
percentChange
}, 'Spike detected via rate of change');
return {
isAnomaly: true,
score: Math.abs(percentChange),
threshold: percentageThreshold,
reason: `Value changed by ${percentChange.toFixed(1)}% (threshold: ${percentageThreshold}%)`
};
}
return { isAnomaly: false, score: Math.abs(percentChange), threshold: percentageThreshold };
}
/**
* Record metric value for historical analysis
*/
public recordMetric(metricName: string, value: number): void {
if (!this.historicalData.has(metricName)) {
this.historicalData.set(metricName, []);
}
const dataPoints = this.historicalData.get(metricName)!;
dataPoints.push({ timestamp: Date.now(), value });
// Keep last 1000 data points per metric
if (dataPoints.length > 1000) {
dataPoints.shift();
}
}
/**
* Correlate multiple metrics for root cause analysis
*/
public correlateMetrics(
metric1: string,
metric2: string,
windowSize: number = 100
): number {
const data1 = this.historicalData.get(metric1) || [];
const data2 = this.historicalData.get(metric2) || [];
if (data1.length < windowSize || data2.length < windowSize) {
return 0;
}
const values1 = data1.slice(-windowSize).map(d => d.value);
const values2 = data2.slice(-windowSize).map(d => d.value);
const mean1 = values1.reduce((a, b) => a + b, 0) / windowSize;
const mean2 = values2.reduce((a, b) => a + b, 0) / windowSize;
let covariance = 0;
let variance1 = 0;
let variance2 = 0;
for (let i = 0; i < windowSize; i++) {
const diff1 = values1[i] - mean1;
const diff2 = values2[i] - mean2;
covariance += diff1 * diff2;
variance1 += diff1 * diff1;
variance2 += diff2 * diff2;
}
    const denominator = Math.sqrt(variance1 * variance2);
    const correlation = denominator === 0 ? 0 : covariance / denominator;
this.logger.info({
metric1,
metric2,
correlation
}, 'Correlation analysis complete');
return correlation;
}
}
This advanced detector identifies anomalies that simple threshold alerts miss: sudden traffic spikes, gradual performance degradation, and correlated metric changes that indicate cascading failures. Deploy it as a sidecar container that continuously analyzes metrics and generates dynamic alerts.
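A usage sketch for the detector is shown below, assuming a hypothetical fetchRequestRate() helper that reads the current request rate (for example, by querying Prometheus) every 30 seconds.
// incident-detector-usage.ts - hypothetical polling loop (sketch)
import pino from 'pino';
import { IncidentDetector } from './incident-detector';

const logger = pino();
const detector = new IncidentDetector(logger);

// Placeholder: replace with a real query against Prometheus or your metrics pipeline
async function fetchRequestRate(): Promise<number> {
  return Math.random() * 100;
}

setInterval(async () => {
  const value = await fetchRequestRate();

  // Evaluate the new sample against history before recording it
  const zScoreResult = detector.detectAnomalyZScore('request_rate', value);
  const spikeResult = detector.detectSpikeRateOfChange('request_rate', value, 50);

  if (zScoreResult.isAnomaly || spikeResult.isAnomaly) {
    logger.warn({ zScoreResult, spikeResult }, 'Potential incident detected');
    // Forward to your alerting pipeline (Alertmanager webhook, PagerDuty, etc.)
  }

  detector.recordMetric('request_rate', value);
}, 30_000);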
Conclusion
Production MCP servers require comprehensive health monitoring infrastructure: liveness probes ensure process responsiveness, readiness probes verify dependency health, Prometheus metrics expose performance characteristics, intelligent alerts prevent alert fatigue, and Grafana dashboards surface actionable insights.
The health check implementation shown here provides the foundation for 99.9%+ uptime by enabling automated failure detection, graceful degradation, and rapid incident response. Combined with advanced anomaly detection, your monitoring system evolves from reactive firefighting to proactive performance management.
Ready to deploy production-ready MCP servers with zero-downtime monitoring? MakeAIHQ provides enterprise health check templates, pre-configured Prometheus exporters, and Grafana dashboards specifically designed for ChatGPT applications. Build reliable, observable MCP servers in minutes, not weeks.
Start monitoring your MCP infrastructure today: Get Started Free
External Resources
- Kubernetes Liveness, Readiness, and Startup Probes - Official Kubernetes health check documentation
- Prometheus Documentation - Complete guide to metrics collection and alerting
- Grafana Dashboards - Building effective monitoring dashboards