Chaos Engineering (Chaos Monkey) for ChatGPT Apps: Production-Ready Failure Testing

When Netflix's Chaos Monkey randomly terminates production instances, it's not sabotage—it's survival training. In the high-stakes world of ChatGPT applications serving millions of conversations daily, hoping your infrastructure survives failures is not a strategy. Chaos engineering transforms hope into certainty by proactively injecting failures in controlled experiments, exposing weaknesses before they cause customer-facing outages.

Traditional testing validates that your system works when everything functions correctly. Chaos engineering validates that your system continues working when critical components fail. It answers the uncomfortable questions: What happens when your primary database region goes offline mid-conversation? How does your app behave when OpenAI's API returns 500 errors for 5 minutes straight? Can your monitoring detect a memory leak before it crashes your MCP server?

ChatGPT apps face unique chaos scenarios that standard web applications never encounter: streaming response failures mid-sentence, widget state corruption during network partitions, authentication token expiration during long-running tool calls, and rate limit exhaustion during traffic spikes. These failures don't happen in isolation—they cascade. A database slowdown triggers timeouts, which trigger retries, which exhaust connection pools, which crash app servers, which trigger failover, which overwhelms the secondary region.

This comprehensive guide implements chaos engineering using three battle-tested approaches: Netflix Chaos Monkey (infrastructure chaos), Chaos Toolkit (experiment framework), and custom failure injection for ChatGPT-specific scenarios. You'll learn how to build automated chaos experiments that run continuously in production, observability integrations that measure blast radius in real-time, and safety guardrails that prevent experiments from escalating into actual disasters.

By the end, you'll have production-ready chaos scripts that randomly terminate instances, inject network latency, exhaust resources, and corrupt data—all while maintaining SLAs and building confidence in your system's resilience. Whether you're preparing for SOC 2 Type II certification or recovering from your third outage this quarter, chaos engineering transforms your ChatGPT app from fragile to antifragile.

Chaos Engineering Principles for ChatGPT Apps

Chaos engineering isn't random destruction—it's the scientific method applied to distributed systems. The Principles of Chaos Engineering provide the foundation:

1. Build a Hypothesis Around Steady State Behavior

Define metrics that represent normal operation: 95th percentile response time < 500ms, error rate < 0.1%, conversation completion rate > 99.5%. Your hypothesis: "The system will maintain these metrics even when [specific failure occurs]."
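
A minimal sketch of encoding that steady state as data every experiment can evaluate before and after fault injection (metric names and thresholds below are illustrative):

// Illustrative steady-state definition evaluated before and after each experiment
interface SteadyStateMetric {
  name: string;
  threshold: number;
  comparison: 'lessThan' | 'greaterThan';
}

const steadyState: SteadyStateMetric[] = [
  { name: 'p95_response_time_ms', threshold: 500, comparison: 'lessThan' },
  { name: 'error_rate', threshold: 0.001, comparison: 'lessThan' },                       // < 0.1%
  { name: 'conversation_completion_rate', threshold: 0.995, comparison: 'greaterThan' },  // > 99.5%
];

// The hypothesis holds only if every metric stays within its bound
function hypothesisHolds(measured: Record<string, number>): boolean {
  return steadyState.every((m) =>
    m.comparison === 'lessThan'
      ? measured[m.name] < m.threshold
      : measured[m.name] > m.threshold
  );
}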

2. Vary Real-World Events

Inject failures that actually happen in production: cloud provider outages (AWS us-east-1 downtime), dependency failures (OpenAI API rate limits), resource exhaustion (memory leaks), network issues (packet loss, latency spikes).

3. Run Experiments in Production

Staging environments can't replicate production traffic patterns, data volumes, or inter-service dependencies. Real chaos happens in production with real users, but within a controlled blast radius.

4. Automate Experiments to Run Continuously

Manual chaos experiments provide one-time validation. Continuous, automated chaos (daily scheduled experiments plus periodic GameDays) catches regressions introduced by new deployments, configuration changes, and dependency updates.

5. Minimize Blast Radius

Start with 1% of traffic, single availability zone, or canary environment. Gradually expand as confidence grows. Always maintain abort mechanisms.

ChatGPT-Specific Chaos Scenarios

Streaming Response Failures: Kill streaming connections mid-sentence to validate client reconnection logic and conversation state recovery.

Widget State Corruption: Inject invalid JSON into window.openai.setWidgetState() calls to test error boundaries and graceful degradation.

Tool Call Timeouts: Delay MCP server responses beyond timeout thresholds to validate retry logic and user feedback mechanisms.

Authentication Failures: Expire OAuth tokens mid-conversation to test token refresh flows and session recovery.

Rate Limit Exhaustion: Flood OpenAI API with requests to trigger rate limiting and validate backoff/retry strategies.
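
These application-level faults rarely come from off-the-shelf tooling; a thin wrapper around your MCP tool handlers covers the timeout and rate-limit scenarios above. A minimal sketch, with the handler shape, environment flag, and function names as illustrative assumptions:

// Hypothetical fault-injection wrapper for MCP tool handlers (names and handler
// shape are illustrative, not part of any official SDK)
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

interface FaultConfig {
  delayMs: number;          // Added latency to exercise timeout/retry paths
  errorProbability: number; // Chance of returning a simulated failure (0.0-1.0)
}

function withChaos(handler: ToolHandler, fault: FaultConfig): ToolHandler {
  return async (args) => {
    // Inject latency to push calls past client timeout thresholds
    await new Promise((resolve) => setTimeout(resolve, fault.delayMs));

    // Occasionally simulate a rate-limit style failure to validate backoff logic
    if (Math.random() < fault.errorProbability) {
      throw new Error('Injected failure: simulated 429 Too Many Requests');
    }
    return handler(args);
  };
}

// Usage: only wrap handlers when chaos is explicitly enabled
const chaosEnabled = process.env.CHAOS_TOOL_FAULTS === 'true';
const lookupOrders: ToolHandler = async (args) => ({ orders: [], query: args });
const lookupOrdersMaybeChaotic = chaosEnabled
  ? withChaos(lookupOrders, { delayMs: 8000, errorProbability: 0.1 })
  : lookupOrders;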

Chaos Monkey Implementation: Infrastructure Failure Injection

Netflix's Chaos Monkey randomly terminates instances during business hours, forcing teams to build systems that survive instance failures. Here's a production-ready implementation for ChatGPT apps running on AWS; the same pattern ports to GCP with its compute APIs.

Chaos Monkey Core Engine

This Python script identifies candidate instances, randomly selects victims based on configured probability, and terminates them while logging all actions for audit trails.

#!/usr/bin/env python3
"""
Chaos Monkey for ChatGPT Apps
Randomly terminates instances to validate resilience

Usage:
  python chaos_monkey.py --config config.yaml --dry-run
  python chaos_monkey.py --config config.yaml --execute
"""

import sys
import random
import logging
import argparse
from dataclasses import dataclass
from datetime import datetime, time
from typing import Dict, List

import yaml
import boto3

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s',
    handlers=[
        logging.FileHandler('chaos_monkey.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)


@dataclass
class ChaosConfig:
    """Chaos Monkey configuration"""
    enabled: bool
    probability: float  # 0.0-1.0 chance of terminating instance
    min_instances: int  # Never reduce below this count
    business_hours_only: bool
    business_hours_start: int  # 9 AM
    business_hours_end: int    # 5 PM
    excluded_tags: List[str]   # Don't terminate instances with these tags
    target_regions: List[str]
    target_services: List[str]
    blast_radius_limit: int    # Max instances to terminate per run


class ChaosMonkey:
    def __init__(self, config: ChaosConfig, dry_run: bool = True):
        self.config = config
        self.dry_run = dry_run
        self.ec2_clients = {
            region: boto3.client('ec2', region_name=region)
            for region in config.target_regions
        }
        self.ecs_clients = {
            region: boto3.client('ecs', region_name=region)
            for region in config.target_regions
        }

    def is_business_hours(self) -> bool:
        """Check if current time is within business hours"""
        if not self.config.business_hours_only:
            return True

        now = datetime.now().time()
        start = time(self.config.business_hours_start, 0)
        end = time(self.config.business_hours_end, 0)

        return start <= now <= end

    def get_candidate_instances(self, region: str) -> List[Dict]:
        """Find instances eligible for termination"""
        ec2 = self.ec2_clients[region]

        # Get running instances with target service tags
        response = ec2.describe_instances(
            Filters=[
                {'Name': 'instance-state-name', 'Values': ['running']},
                {'Name': 'tag:Service', 'Values': self.config.target_services}
            ]
        )

        candidates = []
        for reservation in response['Reservations']:
            for instance in reservation['Instances']:
                # Skip excluded instances
                tags = {tag['Key']: tag['Value'] for tag in instance.get('Tags', [])}
                if any(excluded in tags.values() for excluded in self.config.excluded_tags):
                    continue

                candidates.append({
                    'instance_id': instance['InstanceId'],
                    'region': region,
                    'service': tags.get('Service', 'unknown'),
                    'environment': tags.get('Environment', 'unknown'),
                    'launch_time': instance['LaunchTime']
                })

        return candidates

    def count_healthy_instances(self, service: str, region: str) -> int:
        """Count running instances for service to ensure min_instances"""
        ec2 = self.ec2_clients[region]

        response = ec2.describe_instances(
            Filters=[
                {'Name': 'instance-state-name', 'Values': ['running']},
                {'Name': 'tag:Service', 'Values': [service]}
            ]
        )

        count = sum(
            len(reservation['Instances'])
            for reservation in response['Reservations']
        )
        return count

    def terminate_instance(self, instance: Dict) -> bool:
        """Terminate a single instance"""
        instance_id = instance['instance_id']
        region = instance['region']

        logger.info(f"Terminating instance {instance_id} in {region} "
                   f"(service: {instance['service']}, env: {instance['environment']})")

        if self.dry_run:
            logger.info(f"DRY RUN: Would terminate {instance_id}")
            return True

        try:
            ec2 = self.ec2_clients[region]
            ec2.terminate_instances(InstanceIds=[instance_id])

            logger.info(f"Successfully terminated {instance_id}")
            return True

        except Exception as e:
            logger.error(f"Failed to terminate {instance_id}: {str(e)}")
            return False

    def run_chaos_experiment(self) -> Dict:
        """Execute chaos experiment: randomly terminate instances"""
        if not self.config.enabled:
            logger.info("Chaos Monkey is disabled in configuration")
            return {'status': 'disabled', 'terminated': []}

        if not self.is_business_hours():
            logger.info("Outside business hours, skipping chaos experiment")
            return {'status': 'outside_hours', 'terminated': []}

        # Collect candidates across all regions
        all_candidates = []
        for region in self.config.target_regions:
            candidates = self.get_candidate_instances(region)
            all_candidates.extend(candidates)

        logger.info(f"Found {len(all_candidates)} candidate instances")

        # Select victims based on probability
        victims = []
        pending_terminations = {}  # (service, region) -> victims already selected this run
        for instance in all_candidates:
            if random.random() < self.config.probability:
                # Check min_instances constraint, accounting for victims already queued this run
                service = instance['service']
                region = instance['region']
                key = (service, region)
                healthy_count = (
                    self.count_healthy_instances(service, region)
                    - pending_terminations.get(key, 0)
                )

                if healthy_count > self.config.min_instances:
                    victims.append(instance)
                    pending_terminations[key] = pending_terminations.get(key, 0) + 1
                else:
                    logger.warning(
                        f"Skipping {instance['instance_id']}: "
                        f"would violate min_instances ({healthy_count} available)"
                    )

        # Apply blast radius limit
        if len(victims) > self.config.blast_radius_limit:
            logger.warning(
                f"Limiting victims from {len(victims)} to {self.config.blast_radius_limit} "
                f"(blast_radius_limit)"
            )
            victims = random.sample(victims, self.config.blast_radius_limit)

        # Terminate victims
        terminated = []
        for victim in victims:
            if self.terminate_instance(victim):
                terminated.append(victim)

        logger.info(f"Chaos experiment complete: terminated {len(terminated)} instances")

        return {
            'status': 'completed',
            'candidates': len(all_candidates),
            'victims_selected': len(victims),
            'terminated': terminated,
            'timestamp': datetime.now().isoformat()
        }


def load_config(config_path: str) -> ChaosConfig:
    """Load Chaos Monkey configuration from YAML"""
    with open(config_path, 'r') as f:
        config_data = yaml.safe_load(f)

    return ChaosConfig(
        enabled=config_data.get('enabled', False),
        probability=config_data.get('probability', 0.1),
        min_instances=config_data.get('min_instances', 2),
        business_hours_only=config_data.get('business_hours_only', True),
        business_hours_start=config_data.get('business_hours_start', 9),
        business_hours_end=config_data.get('business_hours_end', 17),
        excluded_tags=config_data.get('excluded_tags', ['production']),
        target_regions=config_data.get('target_regions', ['us-east-1']),
        target_services=config_data.get('target_services', ['chatgpt-mcp-server']),
        blast_radius_limit=config_data.get('blast_radius_limit', 1)
    )


def main():
    parser = argparse.ArgumentParser(description='Chaos Monkey for ChatGPT Apps')
    parser.add_argument('--config', required=True, help='Path to config.yaml')
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument('--dry-run', action='store_true', help='Simulate without terminating')
    mode.add_argument('--execute', action='store_true', help='Actually terminate instances')

    args = parser.parse_args()

    config = load_config(args.config)
    monkey = ChaosMonkey(config, dry_run=args.dry_run)

    result = monkey.run_chaos_experiment()

    print(f"\nChaos Monkey Results:")
    print(f"Status: {result['status']}")
    print(f"Terminated: {len(result.get('terminated', []))} instances")

    if result.get('terminated'):
        print("\nTerminated Instances:")
        for instance in result['terminated']:
            print(f"  - {instance['instance_id']} ({instance['service']}, {instance['region']})")


if __name__ == '__main__':
    main()

Configuration file (config.yaml):

# Chaos Monkey Configuration
enabled: true
probability: 0.2  # 20% chance per instance
min_instances: 3  # Never go below 3 instances
business_hours_only: true
business_hours_start: 9   # 9 AM
business_hours_end: 17    # 5 PM
excluded_tags:
  - production-critical
  - chaos-exclude
target_regions:
  - us-east-1
  - us-west-2
target_services:
  - chatgpt-mcp-server
  - chatgpt-widget-runtime
  - chatgpt-auth-service
blast_radius_limit: 2  # Max 2 instances per run

Automated Instance Terminator with AWS SDK

This script integrates with Auto Scaling Groups to terminate instances while ensuring replacements are launched automatically.

/**
 * Chaos Monkey Instance Terminator for AWS Auto Scaling Groups
 * Terminates instances while maintaining ASG desired capacity
 */

import {
  EC2Client,
  DescribeInstancesCommand,
  TerminateInstancesCommand,
} from '@aws-sdk/client-ec2';
import {
  AutoScalingClient,
  DescribeAutoScalingGroupsCommand,
  TerminateInstanceInAutoScalingGroupCommand,
} from '@aws-sdk/client-auto-scaling';
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

interface ChaosTarget {
  asgName: string;
  region: string;
  minHealthy: number;
}

interface TerminationResult {
  instanceId: string;
  asgName: string;
  success: boolean;
  error?: string;
  timestamp: string;
}

export class AutoScalingChaosMonkey {
  private ec2Client: EC2Client;
  private asgClient: AutoScalingClient;
  private cwClient: CloudWatchClient;

  constructor(region: string = 'us-east-1') {
    this.ec2Client = new EC2Client({ region });
    this.asgClient = new AutoScalingClient({ region });
    this.cwClient = new CloudWatchClient({ region });
  }

  /**
   * Get healthy instance count for ASG
   */
  async getHealthyInstanceCount(asgName: string): Promise<number> {
    const command = new DescribeAutoScalingGroupsCommand({
      AutoScalingGroupNames: [asgName],
    });

    const response = await this.asgClient.send(command);
    const asg = response.AutoScalingGroups?.[0];

    if (!asg) {
      throw new Error(`ASG not found: ${asgName}`);
    }

    // Count instances in healthy state
    const healthyCount = asg.Instances?.filter(
      (instance) =>
        instance.HealthStatus === 'Healthy' &&
        instance.LifecycleState === 'InService'
    ).length || 0;

    return healthyCount;
  }

  /**
   * Select random victim instance from ASG
   */
  async selectVictim(asgName: string): Promise<string | null> {
    const command = new DescribeAutoScalingGroupsCommand({
      AutoScalingGroupNames: [asgName],
    });

    const response = await this.asgClient.send(command);
    const asg = response.AutoScalingGroups?.[0];

    if (!asg?.Instances || asg.Instances.length === 0) {
      return null;
    }

    // Filter to healthy instances only
    const healthyInstances = asg.Instances.filter(
      (instance) =>
        instance.HealthStatus === 'Healthy' &&
        instance.LifecycleState === 'InService'
    );

    if (healthyInstances.length === 0) {
      return null;
    }

    // Random selection
    const victim =
      healthyInstances[Math.floor(Math.random() * healthyInstances.length)];
    return victim.InstanceId || null;
  }

  /**
   * Terminate instance in ASG (ASG will launch replacement)
   */
  async terminateInstance(
    instanceId: string,
    asgName: string,
    decrementCapacity: boolean = false
  ): Promise<TerminationResult> {
    const result: TerminationResult = {
      instanceId,
      asgName,
      success: false,
      timestamp: new Date().toISOString(),
    };

    try {
      const command = new TerminateInstanceInAutoScalingGroupCommand({
        InstanceId: instanceId,
        ShouldDecrementDesiredCapacity: decrementCapacity,
      });

      await this.asgClient.send(command);

      result.success = true;
      console.log(
        `✅ Terminated instance ${instanceId} in ASG ${asgName} ` +
        `(decrement: ${decrementCapacity})`
      );

      // Publish CloudWatch metric
      await this.publishMetric(asgName, 1);

    } catch (error) {
      result.error = error instanceof Error ? error.message : String(error);
      console.error(`❌ Failed to terminate ${instanceId}: ${result.error}`);
    }

    return result;
  }

  /**
   * Publish chaos termination metric to CloudWatch
   */
  async publishMetric(asgName: string, terminationCount: number): Promise<void> {
    const command = new PutMetricDataCommand({
      Namespace: 'ChaosEngineering',
      MetricData: [
        {
          MetricName: 'InstanceTerminations',
          Value: terminationCount,
          Unit: 'Count',
          Timestamp: new Date(),
          Dimensions: [
            {
              Name: 'AutoScalingGroup',
              Value: asgName,
            },
          ],
        },
      ],
    });

    await this.cwClient.send(command);
  }

  /**
   * Run chaos experiment on target ASG
   */
  async runChaosExperiment(target: ChaosTarget): Promise<TerminationResult[]> {
    console.log(`🔥 Starting chaos experiment on ASG: ${target.asgName}`);

    const healthyCount = await this.getHealthyInstanceCount(target.asgName);
    console.log(`Healthy instances: ${healthyCount}, Min required: ${target.minHealthy}`);

    if (healthyCount <= target.minHealthy) {
      console.log(`⚠️  Aborting: Would violate min_healthy constraint`);
      return [];
    }

    const victimId = await this.selectVictim(target.asgName);
    if (!victimId) {
      console.log(`⚠️  No eligible victims found in ${target.asgName}`);
      return [];
    }

    console.log(`🎯 Selected victim: ${victimId}`);

    // Terminate without decrementing capacity (ASG will launch replacement)
    const result = await this.terminateInstance(victimId, target.asgName, false);

    return [result];
  }
}

// Example usage
async function main() {
  const monkey = new AutoScalingChaosMonkey('us-east-1');

  const targets: ChaosTarget[] = [
    {
      asgName: 'chatgpt-mcp-server-asg',
      region: 'us-east-1',
      minHealthy: 3,
    },
    {
      asgName: 'chatgpt-widget-runtime-asg',
      region: 'us-east-1',
      minHealthy: 2,
    },
  ];

  for (const target of targets) {
    const results = await monkey.runChaosExperiment(target);
    console.log(`Results:`, results);
  }
}

if (require.main === module) {
  main().catch(console.error);
}

Schedule Manager for Continuous Chaos

This scheduler runs chaos experiments continuously during business hours, integrating with AWS EventBridge for cron-based execution.

/**
 * Chaos Monkey Schedule Manager
 * Runs experiments on cron schedule with safety checks
 */

import { EventBridgeClient, PutRuleCommand, PutTargetsCommand } from '@aws-sdk/client-eventbridge';
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';

interface ChaosSchedule {
  name: string;
  cronExpression: string;  // e.g., "cron(0 9-17 ? * MON-FRI *)"
  enabled: boolean;
  targets: string[];       // ASG names
  notificationTopic: string;
}

export class ChaosScheduleManager {
  private ebClient: EventBridgeClient;
  private snsClient: SNSClient;

  constructor(region: string = 'us-east-1') {
    this.ebClient = new EventBridgeClient({ region });
    this.snsClient = new SNSClient({ region });
  }

  /**
   * Create EventBridge rule for chaos experiment
   */
  async createSchedule(schedule: ChaosSchedule): Promise<void> {
    console.log(`Creating chaos schedule: ${schedule.name}`);

    // Create EventBridge rule
    const ruleCommand = new PutRuleCommand({
      Name: `chaos-monkey-${schedule.name}`,
      Description: `Chaos engineering schedule for ${schedule.name}`,
      ScheduleExpression: schedule.cronExpression,
      State: schedule.enabled ? 'ENABLED' : 'DISABLED',
    });

    const ruleResponse = await this.ebClient.send(ruleCommand);
    console.log(`Rule ARN: ${ruleResponse.RuleArn}`);

    // Add Lambda target (assumes chaos Lambda exists)
    const targetCommand = new PutTargetsCommand({
      Rule: `chaos-monkey-${schedule.name}`,
      Targets: [
        {
          Id: '1',
          Arn: `arn:aws:lambda:us-east-1:123456789012:function:chaos-monkey-executor`,
          Input: JSON.stringify({
            targets: schedule.targets,
            notificationTopic: schedule.notificationTopic,
          }),
        },
      ],
    });

    await this.ebClient.send(targetCommand);
    console.log(`✅ Schedule created: ${schedule.name}`);
  }

  /**
   * Send chaos experiment notification
   */
  async sendNotification(
    topicArn: string,
    subject: string,
    message: string
  ): Promise<void> {
    const command = new PublishCommand({
      TopicArn: topicArn,
      Subject: subject,
      Message: message,
    });

    await this.snsClient.send(command);
  }
}

// Example schedules
const schedules: ChaosSchedule[] = [
  {
    name: 'weekday-business-hours',
    cronExpression: 'cron(0 9-17 ? * MON-FRI *)',  // Hourly, 9am-5pm UTC, weekdays
    enabled: true,
    targets: ['chatgpt-mcp-server-asg', 'chatgpt-widget-runtime-asg'],
    notificationTopic: 'arn:aws:sns:us-east-1:123456789012:chaos-alerts',
  },
  {
    name: 'weekend-reduced-chaos',
    cronExpression: 'cron(0 12 ? * SAT-SUN *)',  // Noon on weekends
    enabled: true,
    targets: ['chatgpt-mcp-server-asg'],
    notificationTopic: 'arn:aws:sns:us-east-1:123456789012:chaos-alerts',
  },
];
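
A short driver (hypothetical entry point, assuming AWS credentials and the Lambda target already exist) applies these schedules through the manager defined above:

// Apply the example schedules (sketch; values above are illustrative)
async function applySchedules() {
  const manager = new ChaosScheduleManager('us-east-1');

  for (const schedule of schedules) {
    await manager.createSchedule(schedule);
    await manager.sendNotification(
      schedule.notificationTopic,
      `Chaos schedule registered: ${schedule.name}`,
      `Cron: ${schedule.cronExpression}, targets: ${schedule.targets.join(', ')}`
    );
  }
}

applySchedules().catch(console.error);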

Chaos Toolkit: Structured Experiment Framework

Chaos Toolkit provides a declarative experiment format (JSON or YAML) for defining chaos experiments with hypothesis validation, automated rollbacks, and extensible drivers.

Chaos Toolkit Experiment Definition

This experiment validates that your ChatGPT app maintains SLAs when the primary database region experiences 50% packet loss.

# chaos-experiment-database-latency.yaml
# Validates resilience to database network partition

version: 1.0.0
title: "Database Network Latency Resilience"
description: "Validate ChatGPT app continues serving requests when primary database region experiences 50% packet loss"

configuration:
  app_url: "https://api.makeaihq.com/health"
  database_instance: "chatgpt-db-primary"
  packet_loss_percentage: 50
  experiment_duration: 300  # 5 minutes

# Define steady state: what "normal" looks like
steady-state-hypothesis:
  title: "Application remains healthy with acceptable latency"
  probes:
    - name: "health-check-responds"
      type: probe
      tolerance:
        type: "http"
        status: 200
        timeout: 2
      provider:
        type: http
        url: "${app_url}"
        timeout: 5

    - name: "api-latency-acceptable"
      type: probe
      tolerance:
        type: "latency"
        target: "p95"
        lower: 0
        upper: 2000  # 2 seconds max
      provider:
        type: python
        module: chaos_toolkit_addons.probes
        func: measure_api_latency
        arguments:
          url: "${app_url}/api/apps"
          samples: 10

    - name: "error-rate-low"
      type: probe
      tolerance:
        type: "range"
        target: "error_rate"
        lower: 0
        upper: 0.01  # Max 1% errors
      provider:
        type: python
        module: chaos_toolkit_addons.probes
        func: measure_error_rate
        arguments:
          cloudwatch_namespace: "ChatGPTApp"
          metric_name: "5XXErrors"
          period: 60

# Actions to inject failure
method:
  - name: "inject-database-packet-loss"
    type: action
    provider:
      type: python
      module: chaosaws.ec2.actions
      func: inject_packet_loss
      arguments:
        instance_ids:
          - "${database_instance}"
        packet_loss: "${packet_loss_percentage}"
        duration: "${experiment_duration}"
        interface: "eth0"

  - name: "monitor-recovery"
    type: probe
    provider:
      type: python
      module: chaos_toolkit_addons.probes
      func: monitor_metrics
      arguments:
        duration: "${experiment_duration}"
        metrics:
          - name: "DatabaseConnections"
            namespace: "AWS/RDS"
          - name: "ReadLatency"
            namespace: "AWS/RDS"
          - name: "WriteLatency"
            namespace: "AWS/RDS"

# Rollback actions to restore normal state
rollbacks:
  - name: "remove-packet-loss"
    type: action
    provider:
      type: python
      module: chaosaws.ec2.actions
      func: remove_packet_loss
      arguments:
        instance_ids:
          - "${database_instance}"

  - name: "verify-recovery"
    type: probe
    provider:
      type: http
      url: "${app_url}"
      timeout: 5
      tolerance:
        type: "http"
        status: 200

# When to abort the experiment. This is a custom safeguard block consumed by our
# experiment runner; core Chaos Toolkit enforces aborts through controls extensions.
abort-conditions:
  - name: "error-rate-critical"
    type: probe
    tolerance:
      type: "range"
      range: [0, 0.10]  # Abort if >10% errors
    provider:
      type: python
      module: chaos_toolkit_addons.probes
      func: measure_error_rate
      arguments:
        cloudwatch_namespace: "ChatGPTApp"
        metric_name: "5XXErrors"
        period: 60
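
With the Chaos Toolkit CLI and AWS extension installed (pip install chaostoolkit chaostoolkit-aws), the experiment runs with chaos run chaos-experiment-database-latency.yaml: the CLI checks the steady-state hypothesis, executes the method, re-checks the hypothesis, and then applies the rollbacks.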

Steady State Hypothesis Validator

This TypeScript module implements equivalent health probes for your own test harnesses and dashboards. The experiment above assumes matching Python implementations in chaos_toolkit_addons.probes, since Chaos Toolkit's python provider loads Python callables.

/**
 * Chaos Toolkit Custom Probes for ChatGPT Apps
 * Measures steady state metrics during experiments
 */

import axios from 'axios';
import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';

interface ProbeResult {
  success: boolean;
  value: number | string;
  message: string;
}

/**
 * Measure API latency percentiles
 */
export async function measureApiLatency(
  url: string,
  samples: number = 10,
  targetPercentile: number = 95
): Promise<ProbeResult> {
  const latencies: number[] = [];

  for (let i = 0; i < samples; i++) {
    const start = Date.now();
    try {
      await axios.get(url, { timeout: 5000 });
      latencies.push(Date.now() - start);
    } catch (error) {
      latencies.push(5000); // Timeout treated as 5s latency
    }
  }

  latencies.sort((a, b) => a - b);
  const percentileIndex = Math.min(
    Math.floor((targetPercentile / 100) * latencies.length),
    latencies.length - 1
  );
  const percentileLatency = latencies[percentileIndex];

  return {
    success: percentileLatency < 2000, // Success if the target percentile is under 2s
    value: percentileLatency,
    message: `P${targetPercentile} latency: ${percentileLatency}ms`,
  };
}

/**
 * Measure error rate from CloudWatch metrics
 */
export async function measureErrorRate(
  cloudwatchNamespace: string,
  metricName: string,
  period: number = 60
): Promise<ProbeResult> {
  const cwClient = new CloudWatchClient({ region: 'us-east-1' });

  const endTime = new Date();
  const startTime = new Date(endTime.getTime() - period * 1000);

  const command = new GetMetricStatisticsCommand({
    Namespace: cloudwatchNamespace,
    MetricName: metricName,
    StartTime: startTime,
    EndTime: endTime,
    Period: period,
    Statistics: ['Sum'],
  });

  const response = await cwClient.send(command);
  const errorCount = response.Datapoints?.[0]?.Sum || 0;

  // Get total request count
  const totalCommand = new GetMetricStatisticsCommand({
    Namespace: cloudwatchNamespace,
    MetricName: 'RequestCount',
    StartTime: startTime,
    EndTime: endTime,
    Period: period,
    Statistics: ['Sum'],
  });

  const totalResponse = await cwClient.send(totalCommand);
  const totalCount = totalResponse.Datapoints?.[0]?.Sum || 1;

  const errorRate = errorCount / totalCount;

  return {
    success: errorRate < 0.01, // Success if < 1% errors
    value: errorRate,
    message: `Error rate: ${(errorRate * 100).toFixed(2)}%`,
  };
}

/**
 * Monitor metrics during experiment
 */
export async function monitorMetrics(
  duration: number,
  metrics: Array<{ name: string; namespace: string }>
): Promise<ProbeResult> {
  const cwClient = new CloudWatchClient({ region: 'us-east-1' });
  const results: Record<string, number[]> = {};

  const intervalMs = 10000; // Sample every 10s
  const iterations = Math.floor(duration / (intervalMs / 1000));

  for (let i = 0; i < iterations; i++) {
    for (const metric of metrics) {
      const command = new GetMetricStatisticsCommand({
        Namespace: metric.namespace,
        MetricName: metric.name,
        StartTime: new Date(Date.now() - 60000),
        EndTime: new Date(),
        Period: 60,
        Statistics: ['Average'],
      });

      const response = await cwClient.send(command);
      const value = response.Datapoints?.[0]?.Average || 0;

      if (!results[metric.name]) {
        results[metric.name] = [];
      }
      results[metric.name].push(value);
    }

    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }

  // Calculate summary statistics
  const summary = Object.entries(results).map(([name, values]) => {
    const avg = values.reduce((a, b) => a + b, 0) / values.length;
    const max = Math.max(...values);
    return `${name}: avg=${avg.toFixed(2)}, max=${max.toFixed(2)}`;
  });

  return {
    success: true,
    value: JSON.stringify(results),
    message: `Monitored ${iterations} samples: ${summary.join(', ')}`,
  };
}

Automated Rollback System

This module implements automatic rollback when experiments exceed blast radius limits or violate SLA constraints.

/**
 * Chaos Toolkit Rollback Automation
 * Automatically reverts infrastructure changes when experiments fail
 */

import { EC2Client, RevokeSecurityGroupIngressCommand } from '@aws-sdk/client-ec2';
import { ECSClient, UpdateServiceCommand } from '@aws-sdk/client-ecs';

interface RollbackAction {
  type: 'security_group' | 'ecs_service' | 'custom';
  resourceId: string;
  originalState: Record<string, any>;
}

export class ChaosRollbackOrchestrator {
  private ec2Client: EC2Client;
  private ecsClient: ECSClient;
  private rollbackStack: RollbackAction[] = [];

  constructor(region: string = 'us-east-1') {
    this.ec2Client = new EC2Client({ region });
    this.ecsClient = new ECSClient({ region });
  }

  /**
   * Register rollback action for later execution
   */
  registerRollback(action: RollbackAction): void {
    this.rollbackStack.push(action);
    console.log(`Registered rollback: ${action.type} - ${action.resourceId}`);
  }

  /**
   * Execute all rollback actions in reverse order
   */
  async executeRollbacks(): Promise<void> {
    console.log(`Executing ${this.rollbackStack.length} rollback actions`);

    // Execute in reverse order (LIFO)
    while (this.rollbackStack.length > 0) {
      const action = this.rollbackStack.pop()!;
      await this.executeRollback(action);
    }

    console.log('✅ All rollbacks completed');
  }

  /**
   * Execute single rollback action
   */
  private async executeRollback(action: RollbackAction): Promise<void> {
    console.log(`Rolling back: ${action.type} - ${action.resourceId}`);

    try {
      switch (action.type) {
        case 'security_group':
          await this.rollbackSecurityGroup(action);
          break;
        case 'ecs_service':
          await this.rollbackEcsService(action);
          break;
        case 'custom':
          await this.rollbackCustom(action);
          break;
      }
    } catch (error) {
      console.error(`❌ Rollback failed for ${action.resourceId}:`, error);
      throw error;
    }
  }

  /**
   * Rollback security group rule changes
   */
  private async rollbackSecurityGroup(action: RollbackAction): Promise<void> {
    const { securityGroupId, ipPermissions } = action.originalState;

    const command = new RevokeSecurityGroupIngressCommand({
      GroupId: securityGroupId,
      IpPermissions: ipPermissions,
    });

    await this.ec2Client.send(command);
    console.log(`✅ Security group ${securityGroupId} rolled back`);
  }

  /**
   * Rollback ECS service changes (restore original task count)
   */
  private async rollbackEcsService(action: RollbackAction): Promise<void> {
    const { cluster, service, desiredCount } = action.originalState;

    const command = new UpdateServiceCommand({
      cluster,
      service,
      desiredCount,
    });

    await this.ecsClient.send(command);
    console.log(`✅ ECS service ${service} rolled back to ${desiredCount} tasks`);
  }

  /**
   * Custom rollback handler
   */
  private async rollbackCustom(action: RollbackAction): Promise<void> {
    // Execute custom rollback logic
    console.log(`Custom rollback for ${action.resourceId}:`, action.originalState);
  }
}

// Example usage in chaos experiment
async function runExperimentWithRollback() {
  const rollback = new ChaosRollbackOrchestrator('us-east-1');

  try {
    // Register rollback actions BEFORE making changes
    rollback.registerRollback({
      type: 'ecs_service',
      resourceId: 'chatgpt-mcp-server',
      originalState: {
        cluster: 'production',
        service: 'chatgpt-mcp-server',
        desiredCount: 5,
      },
    });

    // Execute chaos action (reduce ECS task count)
    // ... chaos logic here ...
    let experimentFailed = false; // Set to true when steady-state probes report a violation

    // If experiment fails, rollback
    if (experimentFailed) {
      await rollback.executeRollbacks();
    }
  } catch (error) {
    console.error('Experiment error:', error);
    await rollback.executeRollbacks();
  }
}

Observability Integration: Tracking Chaos Experiments

Chaos experiments without observability are guesswork. Integrate with Prometheus, Grafana, and CloudWatch to measure blast radius, detect cascading failures, and validate SLA maintenance.

Experiment Tracker with Prometheus Metrics

This module tracks all chaos experiments, publishes metrics to Prometheus, and integrates with incident management systems.

/**
 * Chaos Experiment Tracker
 * Records experiment metadata and publishes metrics
 */

import { Registry, Counter, Histogram, Gauge } from 'prom-client';
import axios from 'axios';

interface ExperimentMetadata {
  experimentId: string;
  name: string;
  hypothesis: string;
  startTime: Date;
  endTime?: Date;
  status: 'running' | 'success' | 'failure' | 'aborted';
  impactedServices: string[];
  blastRadiusPercentage: number;
}

export class ChaosExperimentTracker {
  private registry: Registry;
  private experimentsCounter: Counter;
  private experimentDuration: Histogram;
  private activeExperiments: Gauge;
  private experiments: Map<string, ExperimentMetadata> = new Map();

  constructor() {
    this.registry = new Registry();

    this.experimentsCounter = new Counter({
      name: 'chaos_experiments_total',
      help: 'Total number of chaos experiments',
      labelNames: ['status', 'experiment_name'],
      registers: [this.registry],
    });

    this.experimentDuration = new Histogram({
      name: 'chaos_experiment_duration_seconds',
      help: 'Duration of chaos experiments',
      labelNames: ['experiment_name', 'status'],
      buckets: [10, 30, 60, 120, 300, 600],
      registers: [this.registry],
    });

    this.activeExperiments = new Gauge({
      name: 'chaos_experiments_active',
      help: 'Number of currently running chaos experiments',
      registers: [this.registry],
    });
  }

  /**
   * Start tracking chaos experiment
   */
  startExperiment(metadata: Omit<ExperimentMetadata, 'startTime' | 'status'>): string {
    const experiment: ExperimentMetadata = {
      ...metadata,
      startTime: new Date(),
      status: 'running',
    };

    this.experiments.set(metadata.experimentId, experiment);
    this.activeExperiments.inc();

    console.log(`🔥 Chaos experiment started: ${metadata.name}`);
    return metadata.experimentId;
  }

  /**
   * End experiment and record results
   */
  endExperiment(experimentId: string, status: 'success' | 'failure' | 'aborted'): void {
    const experiment = this.experiments.get(experimentId);
    if (!experiment) {
      console.error(`Experiment not found: ${experimentId}`);
      return;
    }

    experiment.endTime = new Date();
    experiment.status = status;

    const durationSeconds =
      (experiment.endTime.getTime() - experiment.startTime.getTime()) / 1000;

    // Record metrics
    this.experimentsCounter.inc({ status, experiment_name: experiment.name });
    this.experimentDuration.observe(
      { experiment_name: experiment.name, status },
      durationSeconds
    );
    this.activeExperiments.dec();

    console.log(
      `✅ Chaos experiment ${status}: ${experiment.name} (${durationSeconds}s)`
    );
  }

  /**
   * Get metrics for Prometheus scraping
   */
  async getMetrics(): Promise<string> {
    return this.registry.metrics();
  }

  /**
   * Send experiment report to incident management system
   */
  async sendReport(experimentId: string, webhookUrl: string): Promise<void> {
    const experiment = this.experiments.get(experimentId);
    if (!experiment) return;

    const duration = experiment.endTime
      ? (experiment.endTime.getTime() - experiment.startTime.getTime()) / 1000
      : 0;

    const report = {
      title: `Chaos Experiment: ${experiment.name}`,
      status: experiment.status,
      duration: `${duration}s`,
      hypothesis: experiment.hypothesis,
      impacted_services: experiment.impactedServices.join(', '),
      blast_radius: `${experiment.blastRadiusPercentage}%`,
      timestamp: experiment.startTime.toISOString(),
    };

    await axios.post(webhookUrl, report);
  }
}
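
A usage sketch exposes the tracker's registry on a scrape endpoint and records one experiment lifecycle (Express, the port, and the experiment values are illustrative assumptions):

// Usage sketch: expose tracker metrics for Prometheus scraping
import express from 'express';

const tracker = new ChaosExperimentTracker();
const app = express();

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', 'text/plain');
  res.send(await tracker.getMetrics());
});

app.listen(9464);

// Record a sample experiment lifecycle
const experimentId = tracker.startExperiment({
  experimentId: 'exp-001',
  name: 'mcp-server-instance-termination',
  hypothesis: 'p95 latency stays under 2s while one instance is terminated',
  impactedServices: ['chatgpt-mcp-server'],
  blastRadiusPercentage: 10,
});
setTimeout(() => tracker.endExperiment(experimentId, 'success'), 60_000);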

Prometheus Metrics Collector

This Prometheus scrape configuration collects metrics during chaos experiments to identify the impact on SLAs, error rates, and latency.

# prometheus-chaos-metrics.yaml
# Prometheus scrape config for chaos experiments

global:
  scrape_interval: 10s
  evaluation_interval: 10s

scrape_configs:
  - job_name: 'chaos-experiments'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'chaos-tracker'

  - job_name: 'chatgpt-app'
    static_configs:
      - targets: ['api.makeaihq.com:443']
    metrics_path: '/metrics'
    scheme: https
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(http_request_duration_seconds|http_requests_total|error_rate)'
        action: keep

# Alert rules for chaos experiments
rule_files:
  - 'chaos-alerts.yaml'

Alert rules (chaos-alerts.yaml):

groups:
  - name: chaos_experiment_alerts
    interval: 10s
    rules:
      - alert: ChaosExperimentHighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1m]))
            / sum(rate(http_requests_total[1m])) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Chaos experiment causing high error rate"
          description: "Error rate {{ $value | humanizePercentage }} during chaos experiment"

      - alert: ChaosExperimentHighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Chaos experiment causing high latency"
          description: "P95 latency {{ $value }}s during chaos experiment"

      - alert: ChaosExperimentBlastRadiusExceeded
        expr: chaos_experiments_active > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Multiple chaos experiments running simultaneously"
          description: "{{ $value }} experiments active (blast radius violation)"

Incident Reporter Integration

This module sends real-time chaos experiment updates to incident management systems (PagerDuty, Slack, Opsgenie).

/**
 * Chaos Incident Reporter
 * Sends experiment results to incident management systems
 */

import axios from 'axios';

interface IncidentReport {
  experimentId: string;
  experimentName: string;
  status: 'success' | 'failure' | 'aborted';
  duration: number;
  blastRadius: number;
  impactedServices: string[];
  metrics: {
    errorRate: number;
    p95Latency: number;
    availabilityPercentage: number;
  };
}

export class ChaosIncidentReporter {
  /**
   * Send Slack notification
   */
  async sendSlackNotification(
    webhookUrl: string,
    report: IncidentReport
  ): Promise<void> {
    const statusEmoji = {
      success: '✅',
      failure: '❌',
      aborted: '⚠️',
    };

    const color = {
      success: '#36a64f',
      failure: '#ff0000',
      aborted: '#ffaa00',
    };

    const message = {
      attachments: [
        {
          color: color[report.status],
          title: `${statusEmoji[report.status]} Chaos Experiment: ${report.experimentName}`,
          fields: [
            {
              title: 'Status',
              value: report.status.toUpperCase(),
              short: true,
            },
            {
              title: 'Duration',
              value: `${report.duration}s`,
              short: true,
            },
            {
              title: 'Blast Radius',
              value: `${report.blastRadius}%`,
              short: true,
            },
            {
              title: 'Error Rate',
              value: `${(report.metrics.errorRate * 100).toFixed(2)}%`,
              short: true,
            },
            {
              title: 'P95 Latency',
              value: `${report.metrics.p95Latency}ms`,
              short: true,
            },
            {
              title: 'Availability',
              value: `${report.metrics.availabilityPercentage.toFixed(2)}%`,
              short: true,
            },
            {
              title: 'Impacted Services',
              value: report.impactedServices.join(', '),
              short: false,
            },
          ],
          footer: 'Chaos Engineering',
          ts: Math.floor(Date.now() / 1000),
        },
      ],
    };

    await axios.post(webhookUrl, message);
  }

  /**
   * Send PagerDuty incident
   */
  async sendPagerDutyIncident(
    integrationKey: string,
    report: IncidentReport
  ): Promise<void> {
    if (report.status === 'success') {
      return; // Don't create PagerDuty incidents for successful experiments
    }

    const severity = report.status === 'failure' ? 'critical' : 'warning';

    const event = {
      routing_key: integrationKey,
      event_action: 'trigger',
      payload: {
        summary: `Chaos Experiment Failed: ${report.experimentName}`,
        severity,
        source: 'chaos-engineering',
        custom_details: {
          experiment_id: report.experimentId,
          duration: `${report.duration}s`,
          blast_radius: `${report.blastRadius}%`,
          error_rate: `${(report.metrics.errorRate * 100).toFixed(2)}%`,
          p95_latency: `${report.metrics.p95Latency}ms`,
          impacted_services: report.impactedServices.join(', '),
        },
      },
    };

    await axios.post('https://events.pagerduty.com/v2/enqueue', event);
  }
}

Safety Guardrails: Preventing Chaos from Becoming Disaster

Chaos experiments can escalate into real incidents if blast radius limits are exceeded or rollback mechanisms fail. Implement these guardrails to ensure experiments remain controlled.

Blast Radius Limiter

This module enforces maximum impact constraints: never affect more than X% of instances, Y% of users, or Z concurrent experiments.

/**
 * Blast Radius Limiter
 * Ensures chaos experiments don't exceed safe impact thresholds
 */

interface BlastRadiusConfig {
  maxInstancePercentage: number;    // Max % of instances to affect
  maxUserPercentage: number;         // Max % of users to impact
  maxConcurrentExperiments: number;  // Max simultaneous experiments
  minHealthyInstances: number;       // Never go below this count
}

export class BlastRadiusLimiter {
  constructor(private config: BlastRadiusConfig) {}

  /**
   * Validate experiment doesn't exceed blast radius limits
   */
  async validateExperiment(
    targetInstances: string[],
    totalInstances: number,
    activeExperiments: number
  ): Promise<{ allowed: boolean; reason?: string }> {
    // Check concurrent experiment limit
    if (activeExperiments >= this.config.maxConcurrentExperiments) {
      return {
        allowed: false,
        reason: `Max concurrent experiments reached (${this.config.maxConcurrentExperiments})`,
      };
    }

    // Check instance percentage limit
    const impactPercentage = (targetInstances.length / totalInstances) * 100;
    if (impactPercentage > this.config.maxInstancePercentage) {
      return {
        allowed: false,
        reason: `Would impact ${impactPercentage.toFixed(1)}% of instances (max: ${this.config.maxInstancePercentage}%)`,
      };
    }

    // Check min healthy instances
    const remainingInstances = totalInstances - targetInstances.length;
    if (remainingInstances < this.config.minHealthyInstances) {
      return {
        allowed: false,
        reason: `Would leave ${remainingInstances} healthy instances (min: ${this.config.minHealthyInstances})`,
      };
    }

    return { allowed: true };
  }
}
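
A usage sketch (fleet size and instance IDs are illustrative) gates an experiment behind the limiter:

// Usage sketch: validate blast radius before injecting any fault
async function guardExperiment() {
  const limiter = new BlastRadiusLimiter({
    maxInstancePercentage: 20,
    maxUserPercentage: 5,
    maxConcurrentExperiments: 1,
    minHealthyInstances: 3,
  });

  const verdict = await limiter.validateExperiment(
    ['i-0abc123', 'i-0def456'], // candidate victims
    12,                         // total instances in the fleet
    0                           // experiments currently running
  );

  if (!verdict.allowed) {
    console.warn(`Experiment blocked: ${verdict.reason}`);
    return;
  }
  // ... proceed with fault injection ...
}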

Auto-Rollback Trigger

This script monitors experiment metrics and automatically triggers rollback when SLA violations are detected.

/**
 * Auto-Rollback Trigger
 * Monitors experiments and triggers rollback on SLA violations
 */

import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';

interface SlaThreshold {
  metricName: string;
  namespace: string;
  threshold: number;
  comparison: 'greaterThan' | 'lessThan';
}

export class AutoRollbackTrigger {
  private cwClient: CloudWatchClient;

  constructor(region: string = 'us-east-1') {
    this.cwClient = new CloudWatchClient({ region });
  }

  /**
   * Monitor SLA metrics and trigger rollback if violated.
   * Returns the interval handle so callers can stop monitoring once the experiment ends.
   */
  monitorAndTriggerRollback(
    thresholds: SlaThreshold[],
    rollbackCallback: () => Promise<void>
  ): NodeJS.Timeout {
    const intervalMs = 10000; // Check every 10s

    const interval = setInterval(async () => {
      for (const threshold of thresholds) {
        const violated = await this.checkThreshold(threshold);

        if (violated) {
          console.error(`❌ SLA VIOLATION: ${threshold.metricName} breached threshold ${threshold.threshold}`);
          clearInterval(interval);
          await rollbackCallback();
          return;
        }
      }
    }, intervalMs);

    return interval;
  }

  /**
   * Check if metric exceeds threshold
   */
  private async checkThreshold(threshold: SlaThreshold): Promise<boolean> {
    const command = new GetMetricStatisticsCommand({
      Namespace: threshold.namespace,
      MetricName: threshold.metricName,
      StartTime: new Date(Date.now() - 60000),
      EndTime: new Date(),
      Period: 60,
      Statistics: ['Average'],
    });

    const response = await this.cwClient.send(command);
    const value = response.Datapoints?.[0]?.Average || 0;

    if (threshold.comparison === 'greaterThan') {
      return value > threshold.threshold;
    } else {
      return value < threshold.threshold;
    }
  }
}
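
A usage sketch pairs the trigger with the ChaosRollbackOrchestrator defined earlier; the CloudWatch metric names and module path are illustrative assumptions:

import { ChaosRollbackOrchestrator } from './chaos-rollback-orchestrator'; // hypothetical module path

// Usage sketch: watch error rate and latency, rolling back on the first violation
const trigger = new AutoRollbackTrigger('us-east-1');
const rollback = new ChaosRollbackOrchestrator('us-east-1');

const timer = trigger.monitorAndTriggerRollback(
  [
    { metricName: '5XXErrors', namespace: 'ChatGPTApp', threshold: 0.05, comparison: 'greaterThan' },
    { metricName: 'P95LatencyMs', namespace: 'ChatGPTApp', threshold: 2000, comparison: 'greaterThan' },
  ],
  async () => {
    // Revert everything registered before the experiment started
    await rollback.executeRollbacks();
  }
);

// Stop watching once the experiment window ends cleanly
setTimeout(() => clearInterval(timer), 300_000);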

Production Chaos Engineering Checklist

Before running chaos experiments in production, validate these readiness criteria:

Infrastructure Readiness

  • Auto Scaling Groups configured with min/max capacity
  • Health checks enabled (ELB, Route 53)
  • Multi-region deployment (for region failure experiments)
  • Backup and restore procedures tested
  • Rollback automation tested in staging

Observability Readiness

  • Prometheus/CloudWatch metrics published for all services
  • Grafana dashboards showing real-time SLAs
  • Alerting configured (Slack, PagerDuty, Opsgenie)
  • Distributed tracing enabled (Jaeger, X-Ray)
  • Log aggregation configured (CloudWatch Logs, Datadog)

Safety Guardrails

  • Blast radius limits configured (max 20% of instances)
  • Minimum healthy instance count enforced (min 3)
  • Concurrent experiment limit (max 1)
  • Business hours restriction enabled (9am-5pm weekdays)
  • Auto-rollback triggers configured (error rate > 5%)

Team Readiness

  • On-call engineer notified before experiments
  • Runbooks updated with chaos experiment procedures
  • Incident response plan includes "chaos gone wrong" scenarios
  • GameDay practice runs completed in staging
  • Stakeholders informed of chaos engineering program

Compliance & Audit

  • Chaos experiments logged for audit trail
  • Experiment approval workflow (for production changes)
  • Change management integration (ServiceNow, Jira)
  • Post-experiment reports generated automatically
  • SOC 2 / ISO 27001 compliance validated

Conclusion: Building Antifragile ChatGPT Apps

Chaos engineering transforms your ChatGPT app from "hoping it survives failures" to "proving it thrives under adversity." By continuously injecting controlled failures—instance terminations, network partitions, resource exhaustion—you expose weaknesses before they cause customer-facing outages. Netflix runs Chaos Monkey in production 24/7 because they know that the best time to discover a critical bug is before a real incident, not during one.

The production-ready code examples in this guide provide everything you need to implement chaos engineering today: a Netflix-style Chaos Monkey for infrastructure failures, Chaos Toolkit for structured experiments, observability integration for blast radius measurement, and safety guardrails for preventing chaos from escalating into disasters. Start small—terminate one instance in your staging environment—then gradually expand to production, multi-region failures, and eventually 24/7 continuous chaos.

The goal isn't to cause failures—it's to build confidence that your system survives them. Because in distributed systems, failure is not a possibility; it's a guarantee. The only question is whether you discover it during a controlled experiment or a 3am pager alert.

Ready to build a ChatGPT app that survives chaos? Start building with MakeAIHQ's no-code platform and deploy resilient apps in 48 hours—with built-in observability, auto-scaling, and disaster recovery features designed for chaos engineering.
