Chaos Engineering (Chaos Monkey) for ChatGPT Apps: Production-Ready Failure Testing
When Netflix's Chaos Monkey randomly terminates production instances, it's not sabotage—it's survival training. In the high-stakes world of ChatGPT applications serving millions of conversations daily, hoping your infrastructure survives failures is not a strategy. Chaos engineering transforms hope into certainty by proactively injecting failures in controlled experiments, exposing weaknesses before they cause customer-facing outages.
Traditional testing validates that your system works when everything functions correctly. Chaos engineering validates that your system continues working when critical components fail. It answers the uncomfortable questions: What happens when your primary database region goes offline mid-conversation? How does your app behave when OpenAI's API returns 500 errors for 5 minutes straight? Can your monitoring detect a memory leak before it crashes your MCP server?
ChatGPT apps face unique chaos scenarios that standard web applications never encounter: streaming response failures mid-sentence, widget state corruption during network partitions, authentication token expiration during long-running tool calls, and rate limit exhaustion during traffic spikes. These failures don't happen in isolation—they cascade. A database slowdown triggers timeouts, which trigger retries, which exhaust connection pools, which crash app servers, which trigger failover, which overwhelms the secondary region.
This comprehensive guide implements chaos engineering using three battle-tested approaches: Netflix Chaos Monkey (infrastructure chaos), Chaos Toolkit (experiment framework), and custom failure injection for ChatGPT-specific scenarios. You'll learn how to build automated chaos experiments that run continuously in production, observability integrations that measure blast radius in real-time, and safety guardrails that prevent experiments from escalating into actual disasters.
By the end, you'll have production-ready chaos scripts that randomly terminate instances, inject network latency, exhaust resources, and corrupt data—all while maintaining SLAs and building confidence in your system's resilience. Whether you're preparing for SOC 2 Type II certification or recovering from your third outage this quarter, chaos engineering transforms your ChatGPT app from fragile to antifragile.
Chaos Engineering Principles for ChatGPT Apps
Chaos engineering isn't random destruction—it's the scientific method applied to distributed systems. The Principles of Chaos Engineering provide the foundation:
1. Build a Hypothesis Around Steady State Behavior
Define metrics that represent normal operation: 95th percentile response time < 500ms, error rate < 0.1%, conversation completion rate > 99.5%. Your hypothesis: "The system will maintain these metrics even when [specific failure occurs]."
2. Vary Real-World Events
Inject failures that actually happen in production: cloud provider outages (AWS us-east-1 downtime), dependency failures (OpenAI API rate limits), resource exhaustion (memory leaks), network issues (packet loss, latency spikes).
3. Run Experiments in Production
Staging environments can't replicate production traffic patterns, data volumes, or inter-service dependencies. Real chaos happens in production with real users, but always within a controlled blast radius.
4. Automate Experiments to Run Continuously
Manual chaos experiments provide one-time validation. Continuous chaos (GameDays running daily) catches regressions introduced by new deployments, configuration changes, and dependency updates.
5. Minimize Blast Radius
Start with 1% of traffic, single availability zone, or canary environment. Gradually expand as confidence grows. Always maintain abort mechanisms.
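To make the first principle concrete, the steady-state hypothesis can be captured as a small, testable definition that is evaluated before, during, and after each experiment. A minimal sketch in TypeScript (the thresholds mirror the examples above; the names are illustrative):

// Steady-state hypothesis expressed as code: the experiment only passes if
// these thresholds hold before, during, and after the injected failure.
interface SteadyStateMetrics {
  p95LatencyMs: number;   // 95th percentile response time
  errorRate: number;      // fraction of failed requests
  completionRate: number; // fraction of conversations completed
}

const steadyState: SteadyStateMetrics = {
  p95LatencyMs: 500,
  errorRate: 0.001,
  completionRate: 0.995,
};

function hypothesisHolds(observed: SteadyStateMetrics): boolean {
  return (
    observed.p95LatencyMs <= steadyState.p95LatencyMs &&
    observed.errorRate <= steadyState.errorRate &&
    observed.completionRate >= steadyState.completionRate
  );
}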
ChatGPT-Specific Chaos Scenarios
Streaming Response Failures: Kill streaming connections mid-sentence to validate client reconnection logic and conversation state recovery.
Widget State Corruption: Inject invalid JSON into window.openai.setWidgetState() calls to test error boundaries and graceful degradation.
Tool Call Timeouts: Delay MCP server responses beyond timeout thresholds to validate retry logic and user feedback mechanisms.
Authentication Failures: Expire OAuth tokens mid-conversation to test token refresh flows and session recovery.
Rate Limit Exhaustion: Flood OpenAI API with requests to trigger rate limiting and validate backoff/retry strategies.
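These scenarios don't require cloud-level tooling to rehearse. As a minimal sketch (the wrapper, config shape, and probabilities below are illustrative and not part of any SDK), tool-call latency and failure injection can be added by wrapping your MCP tool handlers:

/**
 * Illustrative fault-injection wrapper for MCP tool handlers.
 * Randomly delays or fails calls to exercise timeout, retry, and
 * streaming-recovery paths on the client side.
 */
interface FaultInjectionConfig {
  latencyProbability: number; // chance of adding artificial latency
  latencyMs: number;          // how much latency to add
  errorProbability: number;   // chance of failing the call outright
}

type ToolHandler<TArgs, TResult> = (args: TArgs) => Promise<TResult>;

export function withChaos<TArgs, TResult>(
  handler: ToolHandler<TArgs, TResult>,
  config: FaultInjectionConfig
): ToolHandler<TArgs, TResult> {
  return async (args: TArgs): Promise<TResult> => {
    // Simulate a slow dependency to trip tool-call timeouts
    if (Math.random() < config.latencyProbability) {
      await new Promise((resolve) => setTimeout(resolve, config.latencyMs));
    }
    // Simulate an upstream failure (dropped stream, 5xx, expired token)
    if (Math.random() < config.errorProbability) {
      throw new Error('chaos: injected tool call failure');
    }
    return handler(args);
  };
}

// Example: 10% of calls gain 8 seconds of latency, 5% fail outright
// const chaoticLookup = withChaos(lookupTool, { latencyProbability: 0.1, latencyMs: 8000, errorProbability: 0.05 });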
Chaos Monkey Implementation: Infrastructure Failure Injection
Netflix's Chaos Monkey randomly terminates instances during business hours, forcing teams to build systems that survive instance failures. Here's a production-ready implementation for ChatGPT apps running on AWS (a GCP version would follow the same pattern using the Compute Engine API).
Chaos Monkey Core Engine
This Python script identifies candidate instances, randomly selects victims based on configured probability, and terminates them while logging all actions for audit trails.
#!/usr/bin/env python3
"""
Chaos Monkey for ChatGPT Apps
Randomly terminates instances to validate resilience
Usage:
python chaos_monkey.py --config config.yaml --dry-run
python chaos_monkey.py --config config.yaml --execute
"""
import os
import sys
import random
import logging
import argparse
from datetime import datetime, time
from typing import List, Dict, Optional
import yaml
import boto3
from dataclasses import dataclass
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s [%(levelname)s] %(message)s',
handlers=[
logging.FileHandler('chaos_monkey.log'),
logging.StreamHandler(sys.stdout)
]
)
logger = logging.getLogger(__name__)
@dataclass
class ChaosConfig:
"""Chaos Monkey configuration"""
enabled: bool
probability: float # 0.0-1.0 chance of terminating instance
min_instances: int # Never reduce below this count
business_hours_only: bool
business_hours_start: int # 9 AM
business_hours_end: int # 5 PM
excluded_tags: List[str] # Don't terminate instances with these tags
target_regions: List[str]
target_services: List[str]
blast_radius_limit: int # Max instances to terminate per run
class ChaosMonkey:
def __init__(self, config: ChaosConfig, dry_run: bool = True):
self.config = config
self.dry_run = dry_run
self.ec2_clients = {
region: boto3.client('ec2', region_name=region)
for region in config.target_regions
}
self.ecs_clients = {
region: boto3.client('ecs', region_name=region)
for region in config.target_regions
}
def is_business_hours(self) -> bool:
"""Check if current time is within business hours"""
if not self.config.business_hours_only:
return True
now = datetime.now().time()
start = time(self.config.business_hours_start, 0)
end = time(self.config.business_hours_end, 0)
return start <= now <= end
def get_candidate_instances(self, region: str) -> List[Dict]:
"""Find instances eligible for termination"""
ec2 = self.ec2_clients[region]
# Get running instances with target service tags
response = ec2.describe_instances(
Filters=[
{'Name': 'instance-state-name', 'Values': ['running']},
{'Name': 'tag:Service', 'Values': self.config.target_services}
]
)
candidates = []
for reservation in response['Reservations']:
for instance in reservation['Instances']:
# Skip excluded instances
tags = {tag['Key']: tag['Value'] for tag in instance.get('Tags', [])}
if any(excluded in tags.get('Environment', '') for excluded in self.config.excluded_tags):
continue
candidates.append({
'instance_id': instance['InstanceId'],
'region': region,
'service': tags.get('Service', 'unknown'),
'environment': tags.get('Environment', 'unknown'),
'launch_time': instance['LaunchTime']
})
return candidates
def count_healthy_instances(self, service: str, region: str) -> int:
"""Count running instances for service to ensure min_instances"""
ec2 = self.ec2_clients[region]
response = ec2.describe_instances(
Filters=[
{'Name': 'instance-state-name', 'Values': ['running']},
{'Name': 'tag:Service', 'Values': [service]}
]
)
count = sum(
len(reservation['Instances'])
for reservation in response['Reservations']
)
return count
def terminate_instance(self, instance: Dict) -> bool:
"""Terminate a single instance"""
instance_id = instance['instance_id']
region = instance['region']
logger.info(f"Terminating instance {instance_id} in {region} "
f"(service: {instance['service']}, env: {instance['environment']})")
if self.dry_run:
logger.info(f"DRY RUN: Would terminate {instance_id}")
return True
try:
ec2 = self.ec2_clients[region]
ec2.terminate_instances(InstanceIds=[instance_id])
logger.info(f"Successfully terminated {instance_id}")
return True
except Exception as e:
logger.error(f"Failed to terminate {instance_id}: {str(e)}")
return False
def run_chaos_experiment(self) -> Dict:
"""Execute chaos experiment: randomly terminate instances"""
if not self.config.enabled:
logger.info("Chaos Monkey is disabled in configuration")
return {'status': 'disabled', 'terminated': []}
if not self.is_business_hours():
logger.info("Outside business hours, skipping chaos experiment")
return {'status': 'outside_hours', 'terminated': []}
# Collect candidates across all regions
all_candidates = []
for region in self.config.target_regions:
candidates = self.get_candidate_instances(region)
all_candidates.extend(candidates)
logger.info(f"Found {len(all_candidates)} candidate instances")
# Select victims based on probability
victims = []
for instance in all_candidates:
if random.random() < self.config.probability:
# Check min_instances constraint
service = instance['service']
region = instance['region']
healthy_count = self.count_healthy_instances(service, region)
if healthy_count > self.config.min_instances:
victims.append(instance)
else:
logger.warning(
f"Skipping {instance['instance_id']}: "
f"would violate min_instances ({healthy_count} running)"
)
# Apply blast radius limit
if len(victims) > self.config.blast_radius_limit:
logger.warning(
f"Limiting victims from {len(victims)} to {self.config.blast_radius_limit} "
f"(blast_radius_limit)"
)
victims = random.sample(victims, self.config.blast_radius_limit)
# Terminate victims
terminated = []
for victim in victims:
if self.terminate_instance(victim):
terminated.append(victim)
logger.info(f"Chaos experiment complete: terminated {len(terminated)} instances")
return {
'status': 'completed',
'candidates': len(all_candidates),
'victims_selected': len(victims),
'terminated': terminated,
'timestamp': datetime.now().isoformat()
}
def load_config(config_path: str) -> ChaosConfig:
"""Load Chaos Monkey configuration from YAML"""
with open(config_path, 'r') as f:
config_data = yaml.safe_load(f)
return ChaosConfig(
enabled=config_data.get('enabled', False),
probability=config_data.get('probability', 0.1),
min_instances=config_data.get('min_instances', 2),
business_hours_only=config_data.get('business_hours_only', True),
business_hours_start=config_data.get('business_hours_start', 9),
business_hours_end=config_data.get('business_hours_end', 17),
excluded_tags=config_data.get('excluded_tags', ['production']),
target_regions=config_data.get('target_regions', ['us-east-1']),
target_services=config_data.get('target_services', ['chatgpt-mcp-server']),
blast_radius_limit=config_data.get('blast_radius_limit', 1)
)
def main():
parser = argparse.ArgumentParser(description='Chaos Monkey for ChatGPT Apps')
parser.add_argument('--config', required=True, help='Path to config.yaml')
parser.add_argument('--dry-run', action='store_true', help='Simulate without terminating')
parser.add_argument('--execute', action='store_true', help='Actually terminate instances')
args = parser.parse_args()
if not args.dry_run and not args.execute:
parser.error('Must specify either --dry-run or --execute')
config = load_config(args.config)
monkey = ChaosMonkey(config, dry_run=args.dry_run)
result = monkey.run_chaos_experiment()
print(f"\nChaos Monkey Results:")
print(f"Status: {result['status']}")
print(f"Terminated: {len(result.get('terminated', []))} instances")
if result.get('terminated'):
print("\nTerminated Instances:")
for instance in result['terminated']:
print(f" - {instance['instance_id']} ({instance['service']}, {instance['region']})")
if __name__ == '__main__':
main()
Configuration file (config.yaml):
# Chaos Monkey Configuration
enabled: true
probability: 0.2 # 20% chance per instance
min_instances: 3 # Never go below 3 instances
business_hours_only: true
business_hours_start: 9 # 9 AM
business_hours_end: 17 # 5 PM
excluded_tags:
- production-critical
- chaos-exclude
target_regions:
- us-east-1
- us-west-2
target_services:
- chatgpt-mcp-server
- chatgpt-widget-runtime
- chatgpt-auth-service
blast_radius_limit: 2 # Max 2 instances per run
Automated Instance Terminator with AWS SDK
This script integrates with Auto Scaling Groups to terminate instances while ensuring replacements are launched automatically.
/**
* Chaos Monkey Instance Terminator for AWS Auto Scaling Groups
* Terminates instances while maintaining ASG desired capacity
*/
import {
EC2Client,
DescribeInstancesCommand,
TerminateInstancesCommand,
} from '@aws-sdk/client-ec2';
import {
AutoScalingClient,
DescribeAutoScalingGroupsCommand,
TerminateInstanceInAutoScalingGroupCommand,
} from '@aws-sdk/client-auto-scaling';
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';
interface ChaosTarget {
asgName: string;
region: string;
minHealthy: number;
}
interface TerminationResult {
instanceId: string;
asgName: string;
success: boolean;
error?: string;
timestamp: string;
}
export class AutoScalingChaosMonkey {
private ec2Client: EC2Client;
private asgClient: AutoScalingClient;
private cwClient: CloudWatchClient;
constructor(region: string = 'us-east-1') {
this.ec2Client = new EC2Client({ region });
this.asgClient = new AutoScalingClient({ region });
this.cwClient = new CloudWatchClient({ region });
}
/**
* Get healthy instance count for ASG
*/
async getHealthyInstanceCount(asgName: string): Promise<number> {
const command = new DescribeAutoScalingGroupsCommand({
AutoScalingGroupNames: [asgName],
});
const response = await this.asgClient.send(command);
const asg = response.AutoScalingGroups?.[0];
if (!asg) {
throw new Error(`ASG not found: ${asgName}`);
}
// Count instances in healthy state
const healthyCount = asg.Instances?.filter(
(instance) =>
instance.HealthStatus === 'Healthy' &&
instance.LifecycleState === 'InService'
).length || 0;
return healthyCount;
}
/**
* Select random victim instance from ASG
*/
async selectVictim(asgName: string): Promise<string | null> {
const command = new DescribeAutoScalingGroupsCommand({
AutoScalingGroupNames: [asgName],
});
const response = await this.asgClient.send(command);
const asg = response.AutoScalingGroups?.[0];
if (!asg?.Instances || asg.Instances.length === 0) {
return null;
}
// Filter to healthy instances only
const healthyInstances = asg.Instances.filter(
(instance) =>
instance.HealthStatus === 'Healthy' &&
instance.LifecycleState === 'InService'
);
if (healthyInstances.length === 0) {
return null;
}
// Random selection
const victim =
healthyInstances[Math.floor(Math.random() * healthyInstances.length)];
return victim.InstanceId || null;
}
/**
* Terminate instance in ASG (ASG will launch replacement)
*/
async terminateInstance(
instanceId: string,
asgName: string,
decrementCapacity: boolean = false
): Promise<TerminationResult> {
const result: TerminationResult = {
instanceId,
asgName,
success: false,
timestamp: new Date().toISOString(),
};
try {
const command = new TerminateInstanceInAutoScalingGroupCommand({
InstanceId: instanceId,
ShouldDecrementDesiredCapacity: decrementCapacity,
});
await this.asgClient.send(command);
result.success = true;
console.log(
`✅ Terminated instance ${instanceId} in ASG ${asgName} ` +
`(decrement: ${decrementCapacity})`
);
// Publish CloudWatch metric
await this.publishMetric(asgName, 1);
} catch (error) {
result.error = error instanceof Error ? error.message : String(error);
console.error(`❌ Failed to terminate ${instanceId}: ${result.error}`);
}
return result;
}
/**
* Publish chaos termination metric to CloudWatch
*/
async publishMetric(asgName: string, terminationCount: number): Promise<void> {
const command = new PutMetricDataCommand({
Namespace: 'ChaosEngineering',
MetricData: [
{
MetricName: 'InstanceTerminations',
Value: terminationCount,
Unit: 'Count',
Timestamp: new Date(),
Dimensions: [
{
Name: 'AutoScalingGroup',
Value: asgName,
},
],
},
],
});
await this.cwClient.send(command);
}
/**
* Run chaos experiment on target ASG
*/
async runChaosExperiment(target: ChaosTarget): Promise<TerminationResult[]> {
console.log(`🔥 Starting chaos experiment on ASG: ${target.asgName}`);
const healthyCount = await this.getHealthyInstanceCount(target.asgName);
console.log(`Healthy instances: ${healthyCount}, Min required: ${target.minHealthy}`);
if (healthyCount <= target.minHealthy) {
console.log(`⚠️ Aborting: Would violate min_healthy constraint`);
return [];
}
const victimId = await this.selectVictim(target.asgName);
if (!victimId) {
console.log(`⚠️ No eligible victims found in ${target.asgName}`);
return [];
}
console.log(`🎯 Selected victim: ${victimId}`);
// Terminate without decrementing capacity (ASG will launch replacement)
const result = await this.terminateInstance(victimId, target.asgName, false);
return [result];
}
}
// Example usage
async function main() {
const monkey = new AutoScalingChaosMonkey('us-east-1');
const targets: ChaosTarget[] = [
{
asgName: 'chatgpt-mcp-server-asg',
region: 'us-east-1',
minHealthy: 3,
},
{
asgName: 'chatgpt-widget-runtime-asg',
region: 'us-east-1',
minHealthy: 2,
},
];
for (const target of targets) {
const results = await monkey.runChaosExperiment(target);
console.log(`Results:`, results);
}
}
if (require.main === module) {
main().catch(console.error);
}
Schedule Manager for Continuous Chaos
This scheduler runs chaos experiments continuously during business hours, integrating with AWS EventBridge for cron-based execution.
/**
* Chaos Monkey Schedule Manager
* Runs experiments on cron schedule with safety checks
*/
import { EventBridgeClient, PutRuleCommand, PutTargetsCommand } from '@aws-sdk/client-eventbridge';
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';
interface ChaosSchedule {
name: string;
cronExpression: string; // e.g., "cron(0 9-17 ? * MON-FRI *)"
enabled: boolean;
targets: string[]; // ASG names
notificationTopic: string;
}
export class ChaosScheduleManager {
private ebClient: EventBridgeClient;
private snsClient: SNSClient;
constructor(region: string = 'us-east-1') {
this.ebClient = new EventBridgeClient({ region });
this.snsClient = new SNSClient({ region });
}
/**
* Create EventBridge rule for chaos experiment
*/
async createSchedule(schedule: ChaosSchedule): Promise<void> {
console.log(`Creating chaos schedule: ${schedule.name}`);
// Create EventBridge rule
const ruleCommand = new PutRuleCommand({
Name: `chaos-monkey-${schedule.name}`,
Description: `Chaos engineering schedule for ${schedule.name}`,
ScheduleExpression: schedule.cronExpression,
State: schedule.enabled ? 'ENABLED' : 'DISABLED',
});
const ruleResponse = await this.ebClient.send(ruleCommand);
console.log(`Rule ARN: ${ruleResponse.RuleArn}`);
// Add Lambda target (assumes chaos Lambda exists)
const targetCommand = new PutTargetsCommand({
Rule: `chaos-monkey-${schedule.name}`,
Targets: [
{
Id: '1',
Arn: `arn:aws:lambda:us-east-1:123456789012:function:chaos-monkey-executor`,
Input: JSON.stringify({
targets: schedule.targets,
notificationTopic: schedule.notificationTopic,
}),
},
],
});
await this.ebClient.send(targetCommand);
console.log(`✅ Schedule created: ${schedule.name}`);
}
/**
* Send chaos experiment notification
*/
async sendNotification(
topicArn: string,
subject: string,
message: string
): Promise<void> {
const command = new PublishCommand({
TopicArn: topicArn,
Subject: subject,
Message: message,
});
await this.snsClient.send(command);
}
}
// Example schedules
const schedules: ChaosSchedule[] = [
{
name: 'weekday-business-hours',
cronExpression: 'cron(0 9-17 ? * MON-FRI *)', // Every hour 9am-5pm weekdays
enabled: true,
targets: ['chatgpt-mcp-server-asg', 'chatgpt-widget-runtime-asg'],
notificationTopic: 'arn:aws:sns:us-east-1:123456789012:chaos-alerts',
},
{
name: 'weekend-reduced-chaos',
cronExpression: 'cron(0 12 ? * SAT-SUN *)', // Noon on weekends
enabled: true,
targets: ['chatgpt-mcp-server-asg'],
notificationTopic: 'arn:aws:sns:us-east-1:123456789012:chaos-alerts',
},
];
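A minimal sketch of applying these schedules (the Lambda and SNS ARNs above are placeholders for your own resources):

// Register every schedule and announce it on the notification topic
async function applySchedules(): Promise<void> {
  const manager = new ChaosScheduleManager('us-east-1');
  for (const schedule of schedules) {
    await manager.createSchedule(schedule);
    await manager.sendNotification(
      schedule.notificationTopic,
      `Chaos schedule ${schedule.name} ${schedule.enabled ? 'enabled' : 'disabled'}`,
      `Cron: ${schedule.cronExpression}\nTargets: ${schedule.targets.join(', ')}`
    );
  }
}

applySchedules().catch(console.error);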
Chaos Toolkit: Structured Experiment Framework
Chaos Toolkit provides a declarative YAML framework for defining chaos experiments with hypothesis validation, automated rollbacks, and extensible drivers.
Chaos Toolkit Experiment Definition
This experiment validates that your ChatGPT app maintains its SLAs when the primary database instance experiences 50% packet loss.
# chaos-experiment-database-latency.yaml
# Validates resilience to database network partition
version: 1.0.0
title: "Database Network Latency Resilience"
description: "Validate ChatGPT app continues serving requests when primary database region experiences 50% packet loss"
configuration:
app_url: "https://api.makeaihq.com/health"
database_instance: "chatgpt-db-primary"
packet_loss_percentage: 50
experiment_duration: 300 # 5 minutes
# Define steady state: what "normal" looks like
steady-state-hypothesis:
title: "Application remains healthy with acceptable latency"
probes:
- name: "health-check-responds"
type: probe
tolerance:
type: "http"
status: 200
timeout: 2
provider:
type: http
url: "${app_url}"
timeout: 5
- name: "api-latency-acceptable"
type: probe
tolerance:
type: "latency"
target: "p95"
lower: 0
upper: 2000 # 2 seconds max
provider:
type: python
module: chaos_toolkit_addons.probes
func: measure_api_latency
arguments:
url: "${app_url}/api/apps"
samples: 10
- name: "error-rate-low"
type: probe
tolerance:
type: "range"
target: "error_rate"
lower: 0
upper: 0.01 # Max 1% errors
provider:
type: python
module: chaos_toolkit_addons.probes
func: measure_error_rate
arguments:
cloudwatch_namespace: "ChatGPTApp"
metric_name: "5XXErrors"
period: 60
# Actions to inject failure
method:
- name: "inject-database-packet-loss"
type: action
provider:
type: python
module: chaosaws.ec2.actions
func: inject_packet_loss
arguments:
instance_ids:
- "${database_instance}"
packet_loss: "${packet_loss_percentage}"
duration: "${experiment_duration}"
interface: "eth0"
- name: "monitor-recovery"
type: probe
provider:
type: python
module: chaos_toolkit_addons.probes
func: monitor_metrics
arguments:
duration: "${experiment_duration}"
metrics:
- name: "DatabaseConnections"
namespace: "AWS/RDS"
- name: "ReadLatency"
namespace: "AWS/RDS"
- name: "WriteLatency"
namespace: "AWS/RDS"
# Rollback actions to restore normal state
rollbacks:
- name: "remove-packet-loss"
type: action
provider:
type: python
module: chaosaws.ec2.actions
func: remove_packet_loss
arguments:
instance_ids:
- "${database_instance}"
- name: "verify-recovery"
type: probe
provider:
type: http
url: "${app_url}"
timeout: 5
tolerance:
type: "http"
status: 200
# When to abort the experiment (not a core Chaos Toolkit key; typically implemented as a safeguards control)
abort-conditions:
- name: "error-rate-critical"
type: probe
tolerance:
type: "range"
target: "error_rate"
lower: 0
upper: 0.10 # Abort if >10% errors
provider:
type: python
module: chaos_toolkit_addons.probes
func: measure_error_rate
arguments:
cloudwatch_namespace: "ChatGPTApp"
metric_name: "5XXErrors"
period: 60
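Assuming the custom chaos_toolkit_addons.probes module is on your PYTHONPATH, the experiment runs with the Chaos Toolkit CLI, which records the run to journal.json by default:

# Install the toolkit plus the AWS driver, then execute the experiment
pip install chaostoolkit chaostoolkit-aws
chaos run chaos-experiment-database-latency.yaml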
Steady State Hypothesis Validator
Chaos Toolkit's python providers load Python modules (such as the chaos_toolkit_addons.probes referenced above); this TypeScript module implements the same health checks so your own scripts and dashboards can validate ChatGPT app behavior during chaos experiments.
/**
* Chaos Toolkit Custom Probes for ChatGPT Apps
* Measures steady state metrics during experiments
*/
import axios from 'axios';
import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';
interface ProbeResult {
success: boolean;
value: number | string;
message: string;
}
/**
* Measure API latency percentiles
*/
export async function measureApiLatency(
url: string,
samples: number = 10,
targetPercentile: number = 95
): Promise<ProbeResult> {
const latencies: number[] = [];
for (let i = 0; i < samples; i++) {
const start = Date.now();
try {
await axios.get(url, { timeout: 5000 });
latencies.push(Date.now() - start);
} catch (error) {
latencies.push(5000); // Timeout treated as 5s latency
}
}
latencies.sort((a, b) => a - b);
const percentileIndex = Math.floor((targetPercentile / 100) * latencies.length);
const p95Latency = latencies[percentileIndex];
return {
success: p95Latency < 2000, // Success if p95 < 2s
value: p95Latency,
message: `P${targetPercentile} latency: ${p95Latency}ms`,
};
}
/**
* Measure error rate from CloudWatch metrics
*/
export async function measureErrorRate(
cloudwatchNamespace: string,
metricName: string,
period: number = 60
): Promise<ProbeResult> {
const cwClient = new CloudWatchClient({ region: 'us-east-1' });
const endTime = new Date();
const startTime = new Date(endTime.getTime() - period * 1000);
const command = new GetMetricStatisticsCommand({
Namespace: cloudwatchNamespace,
MetricName: metricName,
StartTime: startTime,
EndTime: endTime,
Period: period,
Statistics: ['Sum'],
});
const response = await cwClient.send(command);
const errorCount = response.Datapoints?.[0]?.Sum || 0;
// Get total request count
const totalCommand = new GetMetricStatisticsCommand({
Namespace: cloudwatchNamespace,
MetricName: 'RequestCount',
StartTime: startTime,
EndTime: endTime,
Period: period,
Statistics: ['Sum'],
});
const totalResponse = await cwClient.send(totalCommand);
const totalCount = totalResponse.Datapoints?.[0]?.Sum || 1;
const errorRate = errorCount / totalCount;
return {
success: errorRate < 0.01, // Success if < 1% errors
value: errorRate,
message: `Error rate: ${(errorRate * 100).toFixed(2)}%`,
};
}
/**
* Monitor metrics during experiment
*/
export async function monitorMetrics(
duration: number,
metrics: Array<{ name: string; namespace: string }>
): Promise<ProbeResult> {
const cwClient = new CloudWatchClient({ region: 'us-east-1' });
const results: Record<string, number[]> = {};
const intervalMs = 10000; // Sample every 10s
const iterations = Math.floor(duration / (intervalMs / 1000));
for (let i = 0; i < iterations; i++) {
for (const metric of metrics) {
const command = new GetMetricStatisticsCommand({
Namespace: metric.namespace,
MetricName: metric.name,
StartTime: new Date(Date.now() - 60000),
EndTime: new Date(),
Period: 60,
Statistics: ['Average'],
});
const response = await cwClient.send(command);
const value = response.Datapoints?.[0]?.Average || 0;
if (!results[metric.name]) {
results[metric.name] = [];
}
results[metric.name].push(value);
}
await new Promise((resolve) => setTimeout(resolve, intervalMs));
}
// Calculate summary statistics
const summary = Object.entries(results).map(([name, values]) => {
const avg = values.reduce((a, b) => a + b, 0) / values.length;
const max = Math.max(...values);
return `${name}: avg=${avg.toFixed(2)}, max=${max.toFixed(2)}`;
});
return {
success: true,
value: JSON.stringify(results),
message: `Monitored ${iterations} samples: ${summary.join(', ')}`,
};
}
Automated Rollback System
This module implements automatic rollback when experiments exceed blast radius limits or violate SLA constraints.
/**
* Chaos Toolkit Rollback Automation
* Automatically reverts infrastructure changes when experiments fail
*/
import { EC2Client, RevokeSecurityGroupIngressCommand } from '@aws-sdk/client-ec2';
import { ECSClient, UpdateServiceCommand } from '@aws-sdk/client-ecs';
interface RollbackAction {
type: 'security_group' | 'ecs_service' | 'custom';
resourceId: string;
originalState: Record<string, any>;
}
export class ChaosRollbackOrchestrator {
private ec2Client: EC2Client;
private ecsClient: ECSClient;
private rollbackStack: RollbackAction[] = [];
constructor(region: string = 'us-east-1') {
this.ec2Client = new EC2Client({ region });
this.ecsClient = new ECSClient({ region });
}
/**
* Register rollback action for later execution
*/
registerRollback(action: RollbackAction): void {
this.rollbackStack.push(action);
console.log(`Registered rollback: ${action.type} - ${action.resourceId}`);
}
/**
* Execute all rollback actions in reverse order
*/
async executeRollbacks(): Promise<void> {
console.log(`Executing ${this.rollbackStack.length} rollback actions`);
// Execute in reverse order (LIFO)
while (this.rollbackStack.length > 0) {
const action = this.rollbackStack.pop()!;
await this.executeRollback(action);
}
console.log('✅ All rollbacks completed');
}
/**
* Execute single rollback action
*/
private async executeRollback(action: RollbackAction): Promise<void> {
console.log(`Rolling back: ${action.type} - ${action.resourceId}`);
try {
switch (action.type) {
case 'security_group':
await this.rollbackSecurityGroup(action);
break;
case 'ecs_service':
await this.rollbackEcsService(action);
break;
case 'custom':
await this.rollbackCustom(action);
break;
}
} catch (error) {
console.error(`❌ Rollback failed for ${action.resourceId}:`, error);
throw error;
}
}
/**
* Rollback security group rule changes
*/
private async rollbackSecurityGroup(action: RollbackAction): Promise<void> {
const { securityGroupId, ipPermissions } = action.originalState;
const command = new RevokeSecurityGroupIngressCommand({
GroupId: securityGroupId,
IpPermissions: ipPermissions,
});
await this.ec2Client.send(command);
console.log(`✅ Security group ${securityGroupId} rolled back`);
}
/**
* Rollback ECS service changes (restore original task count)
*/
private async rollbackEcsService(action: RollbackAction): Promise<void> {
const { cluster, service, desiredCount } = action.originalState;
const command = new UpdateServiceCommand({
cluster,
service,
desiredCount,
});
await this.ecsClient.send(command);
console.log(`✅ ECS service ${service} rolled back to ${desiredCount} tasks`);
}
/**
* Custom rollback handler
*/
private async rollbackCustom(action: RollbackAction): Promise<void> {
// Execute custom rollback logic
console.log(`Custom rollback for ${action.resourceId}:`, action.originalState);
}
}
// Example usage in chaos experiment
async function runExperimentWithRollback() {
const rollback = new ChaosRollbackOrchestrator('us-east-1');
try {
// Register rollback actions BEFORE making changes
rollback.registerRollback({
type: 'ecs_service',
resourceId: 'chatgpt-mcp-server',
originalState: {
cluster: 'production',
service: 'chatgpt-mcp-server',
desiredCount: 5,
},
});
// Execute chaos action (reduce ECS task count)
// ... chaos logic here ...
// If the steady-state hypothesis is violated, roll back
let experimentFailed = false; // set this from your steady-state probes
if (experimentFailed) {
await rollback.executeRollbacks();
}
} catch (error) {
console.error('Experiment error:', error);
await rollback.executeRollbacks();
}
}
Observability Integration: Tracking Chaos Experiments
Chaos experiments without observability are guesswork. Integrate with Prometheus, Grafana, and CloudWatch to measure blast radius, detect cascading failures, and validate SLA maintenance.
Experiment Tracker with Prometheus Metrics
This module tracks all chaos experiments, exposes metrics for Prometheus to scrape, and reports results to incident management systems.
/**
* Chaos Experiment Tracker
* Records experiment metadata and publishes metrics
*/
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
import axios from 'axios';
interface ExperimentMetadata {
experimentId: string;
name: string;
hypothesis: string;
startTime: Date;
endTime?: Date;
status: 'running' | 'success' | 'failure' | 'aborted';
impactedServices: string[];
blastRadiusPercentage: number;
}
export class ChaosExperimentTracker {
private registry: Registry;
private experimentsCounter: Counter;
private experimentDuration: Histogram;
private activeExperiments: Gauge;
private experiments: Map<string, ExperimentMetadata> = new Map();
constructor() {
this.registry = new Registry();
this.experimentsCounter = new Counter({
name: 'chaos_experiments_total',
help: 'Total number of chaos experiments',
labelNames: ['status', 'experiment_name'],
registers: [this.registry],
});
this.experimentDuration = new Histogram({
name: 'chaos_experiment_duration_seconds',
help: 'Duration of chaos experiments',
labelNames: ['experiment_name', 'status'],
buckets: [10, 30, 60, 120, 300, 600],
registers: [this.registry],
});
this.activeExperiments = new Gauge({
name: 'chaos_experiments_active',
help: 'Number of currently running chaos experiments',
registers: [this.registry],
});
}
/**
* Start tracking chaos experiment
*/
startExperiment(metadata: Omit<ExperimentMetadata, 'startTime' | 'status'>): string {
const experiment: ExperimentMetadata = {
...metadata,
startTime: new Date(),
status: 'running',
};
this.experiments.set(metadata.experimentId, experiment);
this.activeExperiments.inc();
console.log(`🔥 Chaos experiment started: ${metadata.name}`);
return metadata.experimentId;
}
/**
* End experiment and record results
*/
endExperiment(experimentId: string, status: 'success' | 'failure' | 'aborted'): void {
const experiment = this.experiments.get(experimentId);
if (!experiment) {
console.error(`Experiment not found: ${experimentId}`);
return;
}
experiment.endTime = new Date();
experiment.status = status;
const durationSeconds =
(experiment.endTime.getTime() - experiment.startTime.getTime()) / 1000;
// Record metrics
this.experimentsCounter.inc({ status, experiment_name: experiment.name });
this.experimentDuration.observe(
{ experiment_name: experiment.name, status },
durationSeconds
);
this.activeExperiments.dec();
console.log(
`✅ Chaos experiment ${status}: ${experiment.name} (${durationSeconds}s)`
);
}
/**
* Get metrics for Prometheus scraping
*/
async getMetrics(): Promise<string> {
return this.registry.metrics();
}
/**
* Send experiment report to incident management system
*/
async sendReport(experimentId: string, webhookUrl: string): Promise<void> {
const experiment = this.experiments.get(experimentId);
if (!experiment) return;
const duration = experiment.endTime
? (experiment.endTime.getTime() - experiment.startTime.getTime()) / 1000
: 0;
const report = {
title: `Chaos Experiment: ${experiment.name}`,
status: experiment.status,
duration: `${duration}s`,
hypothesis: experiment.hypothesis,
impacted_services: experiment.impactedServices.join(', '),
blast_radius: `${experiment.blastRadiusPercentage}%`,
timestamp: experiment.startTime.toISOString(),
};
await axios.post(webhookUrl, report);
}
}
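For Prometheus to scrape these counters (see the scrape config below), the tracker must be served over HTTP. A minimal sketch using Express; the port and path are assumptions that must match your scrape job:

import express from 'express';

const app = express();
const tracker = new ChaosExperimentTracker();

// Serve metrics in the Prometheus text exposition format
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', 'text/plain; version=0.0.4');
  res.send(await tracker.getMetrics());
});

app.listen(9090, () => console.log('Chaos metrics exposed on :9090/metrics'));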
Prometheus Metrics Collector
This Prometheus configuration scrapes metrics during chaos experiments so you can quantify their impact on SLAs, error rates, and latency.
# prometheus-chaos-metrics.yaml
# Prometheus scrape config for chaos experiments
global:
scrape_interval: 10s
evaluation_interval: 10s
scrape_configs:
- job_name: 'chaos-experiments'
static_configs:
- targets: ['localhost:9090']
metrics_path: '/metrics'
relabel_configs:
- source_labels: [__address__]
target_label: instance
replacement: 'chaos-tracker'
- job_name: 'chatgpt-app'
static_configs:
- targets: ['api.makeaihq.com:443']
metrics_path: '/metrics'
scheme: https
metric_relabel_configs:
- source_labels: [__name__]
regex: '(http_request_duration_seconds|http_requests_total|error_rate)'
action: keep
# Alert rules for chaos experiments
rule_files:
- 'chaos-alerts.yaml'
Alert rules (chaos-alerts.yaml):
groups:
- name: chaos_experiment_alerts
interval: 10s
rules:
- alert: ChaosExperimentHighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) > 0.05
for: 1m
labels:
severity: critical
annotations:
summary: "Chaos experiment causing high error rate"
description: "Error rate {{ $value }}% during chaos experiment"
- alert: ChaosExperimentHighLatency
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
for: 2m
labels:
severity: warning
annotations:
summary: "Chaos experiment causing high latency"
description: "P95 latency {{ $value }}s during chaos experiment"
- alert: ChaosExperimentBlastRadiusExceeded
expr: chaos_experiments_active > 1
for: 1m
labels:
severity: critical
annotations:
summary: "Multiple chaos experiments running simultaneously"
description: "{{ $value }} experiments active (blast radius violation)"
Incident Reporter Integration
This module sends real-time chaos experiment updates to incident management systems (PagerDuty, Slack, Opsgenie).
/**
* Chaos Incident Reporter
* Sends experiment results to incident management systems
*/
import axios from 'axios';
interface IncidentReport {
experimentId: string;
experimentName: string;
status: 'success' | 'failure' | 'aborted';
duration: number;
blastRadius: number;
impactedServices: string[];
metrics: {
errorRate: number;
p95Latency: number;
availabilityPercentage: number;
};
}
export class ChaosIncidentReporter {
/**
* Send Slack notification
*/
async sendSlackNotification(
webhookUrl: string,
report: IncidentReport
): Promise<void> {
const statusEmoji = {
success: '✅',
failure: '❌',
aborted: '⚠️',
};
const color = {
success: '#36a64f',
failure: '#ff0000',
aborted: '#ffaa00',
};
const message = {
attachments: [
{
color: color[report.status],
title: `${statusEmoji[report.status]} Chaos Experiment: ${report.experimentName}`,
fields: [
{
title: 'Status',
value: report.status.toUpperCase(),
short: true,
},
{
title: 'Duration',
value: `${report.duration}s`,
short: true,
},
{
title: 'Blast Radius',
value: `${report.blastRadius}%`,
short: true,
},
{
title: 'Error Rate',
value: `${(report.metrics.errorRate * 100).toFixed(2)}%`,
short: true,
},
{
title: 'P95 Latency',
value: `${report.metrics.p95Latency}ms`,
short: true,
},
{
title: 'Availability',
value: `${report.metrics.availabilityPercentage.toFixed(2)}%`,
short: true,
},
{
title: 'Impacted Services',
value: report.impactedServices.join(', '),
short: false,
},
],
footer: 'Chaos Engineering',
ts: Math.floor(Date.now() / 1000),
},
],
};
await axios.post(webhookUrl, message);
}
/**
* Send PagerDuty incident
*/
async sendPagerDutyIncident(
integrationKey: string,
report: IncidentReport
): Promise<void> {
if (report.status === 'success') {
return; // Don't create PagerDuty incidents for successful experiments
}
const severity = report.status === 'failure' ? 'critical' : 'warning';
const event = {
routing_key: integrationKey,
event_action: 'trigger',
payload: {
summary: `Chaos Experiment Failed: ${report.experimentName}`,
severity,
source: 'chaos-engineering',
custom_details: {
experiment_id: report.experimentId,
duration: `${report.duration}s`,
blast_radius: `${report.blastRadius}%`,
error_rate: `${(report.metrics.errorRate * 100).toFixed(2)}%`,
p95_latency: `${report.metrics.p95Latency}ms`,
impacted_services: report.impactedServices.join(', '),
},
},
};
await axios.post('https://events.pagerduty.com/v2/enqueue', event);
}
}
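A usage sketch (the report values are illustrative; the webhook URL and routing key come from your own environment):

async function reportExperiment(): Promise<void> {
  const reporter = new ChaosIncidentReporter();
  const report: IncidentReport = {
    experimentId: 'exp-2024-001',
    experimentName: 'database-packet-loss',
    status: 'failure',
    duration: 300,
    blastRadius: 10,
    impactedServices: ['chatgpt-mcp-server'],
    metrics: { errorRate: 0.07, p95Latency: 3400, availabilityPercentage: 97.2 },
  };

  // Slack receives every result; PagerDuty only pages on failures or aborts
  await reporter.sendSlackNotification(process.env.SLACK_WEBHOOK_URL!, report);
  await reporter.sendPagerDutyIncident(process.env.PAGERDUTY_ROUTING_KEY!, report);
}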
Safety Guardrails: Preventing Chaos from Becoming Disaster
Chaos experiments can escalate into real incidents if blast radius limits are exceeded or rollback mechanisms fail. Implement these guardrails to ensure experiments remain controlled.
Blast Radius Limiter
This module enforces maximum impact constraints: never affect more than X% of instances, Y% of users, or Z concurrent experiments.
/**
* Blast Radius Limiter
* Ensures chaos experiments don't exceed safe impact thresholds
*/
interface BlastRadiusConfig {
maxInstancePercentage: number; // Max % of instances to affect
maxUserPercentage: number; // Max % of users to impact
maxConcurrentExperiments: number; // Max simultaneous experiments
minHealthyInstances: number; // Never go below this count
}
export class BlastRadiusLimiter {
constructor(private config: BlastRadiusConfig) {}
/**
* Validate experiment doesn't exceed blast radius limits
*/
async validateExperiment(
targetInstances: string[],
totalInstances: number,
activeExperiments: number
): Promise<{ allowed: boolean; reason?: string }> {
// Check concurrent experiment limit
if (activeExperiments >= this.config.maxConcurrentExperiments) {
return {
allowed: false,
reason: `Max concurrent experiments reached (${this.config.maxConcurrentExperiments})`,
};
}
// Check instance percentage limit
const impactPercentage = (targetInstances.length / totalInstances) * 100;
if (impactPercentage > this.config.maxInstancePercentage) {
return {
allowed: false,
reason: `Would impact ${impactPercentage.toFixed(1)}% of instances (max: ${this.config.maxInstancePercentage}%)`,
};
}
// Check min healthy instances
const remainingInstances = totalInstances - targetInstances.length;
if (remainingInstances < this.config.minHealthyInstances) {
return {
allowed: false,
reason: `Would leave ${remainingInstances} healthy instances (min: ${this.config.minHealthyInstances})`,
};
}
return { allowed: true };
}
}
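A usage sketch that gates an experiment before any instance is touched (instance IDs and counts are illustrative):

async function guardedTermination(): Promise<void> {
  const limiter = new BlastRadiusLimiter({
    maxInstancePercentage: 20,
    maxUserPercentage: 5,
    maxConcurrentExperiments: 1,
    minHealthyInstances: 3,
  });

  const targets = ['i-0abc1234567890def', 'i-0fed0987654321cba'];
  const verdict = await limiter.validateExperiment(targets, 10, 0);

  if (!verdict.allowed) {
    console.warn(`Experiment blocked: ${verdict.reason}`);
    return;
  }
  // Safe to proceed: hand the targets to the Chaos Monkey terminator
}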
Auto-Rollback Trigger
This script monitors experiment metrics and automatically triggers rollback when SLA violations are detected.
/**
* Auto-Rollback Trigger
* Monitors experiments and triggers rollback on SLA violations
*/
import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';
interface SlaThreshold {
metricName: string;
namespace: string;
threshold: number;
comparison: 'greaterThan' | 'lessThan';
}
export class AutoRollbackTrigger {
private cwClient: CloudWatchClient;
constructor(region: string = 'us-east-1') {
this.cwClient = new CloudWatchClient({ region });
}
/**
* Monitor SLA metrics and trigger rollback if violated
*/
async monitorAndTriggerRollback(
thresholds: SlaThreshold[],
rollbackCallback: () => Promise<void>
): Promise<void> {
const intervalMs = 10000; // Check every 10s
const interval = setInterval(async () => {
for (const threshold of thresholds) {
const violated = await this.checkThreshold(threshold);
if (violated) {
console.error(`❌ SLA VIOLATION: ${threshold.metricName} exceeded ${threshold.threshold}`);
clearInterval(interval);
await rollbackCallback();
return;
}
}
}, intervalMs);
}
/**
* Check if metric exceeds threshold
*/
private async checkThreshold(threshold: SlaThreshold): Promise<boolean> {
const command = new GetMetricStatisticsCommand({
Namespace: threshold.namespace,
MetricName: threshold.metricName,
StartTime: new Date(Date.now() - 60000),
EndTime: new Date(),
Period: 60,
Statistics: ['Average'],
});
const response = await this.cwClient.send(command);
const value = response.Datapoints?.[0]?.Average || 0;
if (threshold.comparison === 'greaterThan') {
return value > threshold.threshold;
} else {
return value < threshold.threshold;
}
}
}
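Wiring the trigger to the ChaosRollbackOrchestrator from the rollback section might look like this (metric names, namespace, and thresholds are illustrative):

async function runWithAutoRollback(): Promise<void> {
  const trigger = new AutoRollbackTrigger('us-east-1');
  const rollback = new ChaosRollbackOrchestrator('us-east-1');

  // Roll back as soon as error rate or p95 latency breaches its SLA threshold
  await trigger.monitorAndTriggerRollback(
    [
      { metricName: 'ErrorRate', namespace: 'ChatGPTApp', threshold: 0.05, comparison: 'greaterThan' },
      { metricName: 'P95LatencyMs', namespace: 'ChatGPTApp', threshold: 2000, comparison: 'greaterThan' },
    ],
    async () => {
      await rollback.executeRollbacks();
    }
  );
}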
Production Chaos Engineering Checklist
Before running chaos experiments in production, validate these readiness criteria:
Infrastructure Readiness
- Auto Scaling Groups configured with min/max capacity
- Health checks enabled (ELB, Route 53)
- Multi-region deployment (for region failure experiments)
- Backup and restore procedures tested
- Rollback automation tested in staging
Observability Readiness
- Prometheus/CloudWatch metrics published for all services
- Grafana dashboards showing real-time SLAs
- Alerting configured (Slack, PagerDuty, Opsgenie)
- Distributed tracing enabled (Jaeger, X-Ray)
- Log aggregation configured (CloudWatch Logs, Datadog)
Safety Guardrails
- Blast radius limits configured (max 20% of instances)
- Minimum healthy instance count enforced (min 3)
- Concurrent experiment limit (max 1)
- Business hours restriction enabled (9am-5pm weekdays)
- Auto-rollback triggers configured (error rate > 5%)
Team Readiness
- On-call engineer notified before experiments
- Runbooks updated with chaos experiment procedures
- Incident response plan includes "chaos gone wrong" scenarios
- GameDay practice runs completed in staging
- Stakeholders informed of chaos engineering program
Compliance & Audit
- Chaos experiments logged for audit trail
- Experiment approval workflow (for production changes)
- Change management integration (ServiceNow, Jira)
- Post-experiment reports generated automatically
- SOC 2 / ISO 27001 compliance validated
Conclusion: Building Antifragile ChatGPT Apps
Chaos engineering transforms your ChatGPT app from "hoping it survives failures" to "proving it thrives under adversity." By continuously injecting controlled failures (instance terminations, network partitions, resource exhaustion), you expose weaknesses before they cause customer-facing outages. Netflix runs Chaos Monkey against production continuously during business hours because the best time to discover a critical bug is before a real incident, not during one.
The production-ready code examples in this guide provide everything you need to implement chaos engineering today: a Netflix-style Chaos Monkey for infrastructure failures, Chaos Toolkit for structured experiments, observability integration for blast radius measurement, and safety guardrails that keep experiments from escalating into disasters. Start small by terminating one instance in your staging environment, then gradually expand to production, multi-region failures, and eventually continuous, automated chaos.
The goal isn't to cause failures—it's to build confidence that your system survives them. Because in distributed systems, failure is not a possibility; it's a guarantee. The only question is whether you discover it during a controlled experiment or a 3am pager alert.
Ready to build a ChatGPT app that survives chaos? Start building with MakeAIHQ's no-code platform and deploy resilient apps in 48 hours—with built-in observability, auto-scaling, and disaster recovery features designed for chaos engineering.
Internal Resources
- ChatGPT App Testing Guide - Comprehensive testing strategies
- Disaster Recovery Planning for ChatGPT Apps - RTO/RPO optimization and backup automation
- Incident Response Planning - Complete incident lifecycle management
- Load Testing and Capacity Planning - Validate resilience under load
- Performance Testing Guide - Measure SLA compliance
- CI/CD with Continuous Testing - Integrate chaos in deployment pipelines
- Monitoring and Observability - Real-time metrics and alerting
- Enterprise ChatGPT Apps - Production-grade resilience features
- Blue-Green Deployment at Scale - Zero-downtime deployments
- Canary Releases - Gradual rollout with automated rollback
External Resources
- Principles of Chaos Engineering - Foundational chaos engineering principles
- Chaos Toolkit Documentation - Official Chaos Toolkit guides
- Netflix Chaos Monkey - Original Chaos Monkey implementation