Chaos Engineering for ChatGPT Apps: Resilience Testing Guide
Building ChatGPT apps that can withstand real-world failures requires more than traditional testing. Chaos engineering provides a systematic approach to discovering weaknesses before they cause outages. This guide covers implementing chaos experiments, fault injection, network disruption, and automated resilience testing for production ChatGPT applications.
Whether you're running MCP servers, widget backends, or distributed ChatGPT architectures, chaos engineering helps you build confidence in your system's ability to handle turbulent conditions. Learn how to implement continuous chaos experiments, automate GameDay scenarios, and establish resilience as a core engineering practice.
Understanding Chaos Engineering Principles
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. Unlike traditional testing that validates expected behavior, chaos engineering actively introduces failures to discover unknown weaknesses.
The Five Pillars of Chaos
Steady State Hypothesis: Define what "normal" looks like for your ChatGPT app. This includes response times, error rates, throughput, and user experience metrics. For ChatGPT apps, steady state might include 95th percentile response time under 2 seconds, error rate below 0.1%, and MCP tool call success rate above 99.5%.
Real-World Events: Inject failures that mirror production scenarios. For ChatGPT apps, this includes API timeout failures, database connection drops, network partition between MCP server and tools, OAuth token expiration, and widget rendering failures. Focus on failures you've experienced or fear most.
Production Experiments: Run chaos experiments in production, not just staging. Staging environments rarely match production complexity, traffic patterns, or data volumes. Production chaos with proper safeguards reveals real weaknesses that staging cannot.
Minimize Blast Radius: Start small and expand gradually. Begin with 1% of traffic, single availability zones, or canary deployments. Use automated rollback mechanisms and circuit breakers to contain damage. For ChatGPT apps, this might mean running chaos on non-critical widgets before core MCP servers.
Automate Experiments: Manual chaos is not sustainable. Automate experiment execution, monitoring, analysis, and reporting. Schedule regular GameDays, integrate chaos into CI/CD pipelines, and treat resilience testing as continuous validation rather than one-time events.
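The steady-state hypothesis is easiest to enforce when it is code rather than a document. Here is a minimal sketch in Python; the metric names and threshold values are illustrative, not part of any SDK:

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class SteadyStateCheck:
    """One measurable slice of the steady-state hypothesis."""
    name: str
    value: float
    threshold: float
    higher_is_better: bool = False

    def passes(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


def verify_steady_state(checks: Iterable[SteadyStateCheck]) -> Tuple[bool, List[str]]:
    """Return (ok, failing_check_names) so an experiment can abort
    before injecting any chaos if the system is already degraded."""
    failures = [c.name for c in checks if not c.passes()]
    return (len(failures) == 0, failures)


# Illustrative thresholds for a ChatGPT app:
checks = [
    SteadyStateCheck("p95_latency_seconds", value=1.4, threshold=2.0),
    SteadyStateCheck("error_rate", value=0.0005, threshold=0.001),
    SteadyStateCheck("tool_call_success_rate", value=0.998,
                     threshold=0.995, higher_is_better=True),
]
ok, failures = verify_steady_state(checks)
print(ok, failures)  # True []
```

Running the same check before and after each experiment gives you the baseline comparison the later GameDay scripts automate with Prometheus queries.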
Chaos Engineering for ChatGPT Apps
ChatGPT apps present unique chaos engineering challenges. Your MCP server might handle requests perfectly, but what happens when OpenAI's API times out? Your widget might render beautifully, but can it gracefully degrade when backend services fail?
Key failure scenarios to test include MCP protocol failures (malformed requests, timeout during tool execution), widget runtime failures (JavaScript errors, missing dependencies), authentication failures (expired tokens, OAuth flow interruption), database failures (connection pool exhaustion, query timeout), and network failures (latency spikes, packet loss, DNS resolution failures).
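Several of these failure modes — API timeouts, dropped connections, transient partitions — should be absorbed by retry logic before chaos testing ever begins. A hedged sketch of the kind of wrapper worth exercising, where `flaky_tool` stands in for any MCP tool invocation or backend request (the function names and delays are illustrative):

```python
import random
import time


def call_with_retries(fn, attempts=3, base_delay=0.1,
                      retriable=(TimeoutError, ConnectionError)):
    """Retry a flaky call with exponential backoff and a little jitter.

    Only the listed exception types are retried; anything else
    propagates immediately, as it likely indicates a real bug.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # budget exhausted - surface the failure
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            time.sleep(delay)


# Simulate a tool call that times out twice, then succeeds:
state = {"calls": 0}

def flaky_tool():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated MCP tool timeout")
    return "ok"

result = call_with_retries(flaky_tool)
print(result, state["calls"])  # ok 3
```

Chaos experiments then verify the wrapper's behavior under sustained, rather than transient, failure: does it give up cleanly and degrade the widget instead of hanging the conversation?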
Learn more about building resilient ChatGPT apps and MCP server optimization best practices.
Implementing Litmus Chaos for Kubernetes
Litmus Chaos is a CNCF project providing comprehensive chaos engineering for Kubernetes environments. If you're running ChatGPT apps on Kubernetes, Litmus offers declarative experiment definitions, workflow orchestration, and deep integration with observability tools.
Installing Litmus Chaos
# litmus-operator-install.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: litmus
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus
  namespace: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets", "events"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus
subjects:
  - kind: ServiceAccount
    name: litmus
    namespace: litmus
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-operator
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      name: chaos-operator
  template:
    metadata:
      labels:
        name: chaos-operator
    spec:
      serviceAccountName: litmus
      containers:
        - name: chaos-operator
          image: litmuschaos/chaos-operator:latest
          command:
            - chaos-operator
          env:
            - name: WATCH_NAMESPACE
              value: ""
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: "chaos-operator"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-exporter
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-exporter
  template:
    metadata:
      labels:
        app: chaos-exporter
    spec:
      serviceAccountName: litmus
      containers:
        - name: chaos-exporter
          image: litmuschaos/chaos-exporter:latest
          ports:
            - containerPort: 8080
              name: metrics
ChaosEngine for ChatGPT MCP Server
# mcp-server-chaos-engine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: mcp-server-chaos
  namespace: chatgpt-apps
spec:
  appinfo:
    appns: chatgpt-apps
    applabel: "app=mcp-server"
    appkind: deployment
  # Chaos experiment configuration
  engineState: active
  chaosServiceAccount: litmus-admin
  # Monitor chaos progress
  monitoring: true
  # Annotate application resources
  annotationCheck: "true"
  # Job cleanup policy
  jobCleanUpPolicy: retain
  # Experiments to run
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total chaos duration (seconds)
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            # Chaos interval (seconds)
            - name: CHAOS_INTERVAL
              value: "10"
            # Force delete (no graceful shutdown)
            - name: FORCE
              value: "false"
            # Percentage of pods to delete
            - name: PODS_AFFECTED_PERC
              value: "50"
            # Target specific pods by label
            - name: TARGET_PODS
              value: ""
            # Sequence (serial or parallel)
            - name: SEQUENCE
              value: "parallel"
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: NETWORK_LATENCY
              value: "2000"
            - name: JITTER
              value: "500"
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"
            - name: NETWORK_INTERFACE
              value: "eth0"
            - name: TARGET_CONTAINER
              value: "mcp-server"
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "90"
            # Memory to consume (MB)
            - name: MEMORY_CONSUMPTION
              value: "512"
            # Number of workers
            - name: NUMBER_OF_WORKERS
              value: "4"
            - name: TARGET_CONTAINER
              value: "mcp-server"
    - name: container-kill
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "15"
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"
            - name: TARGET_CONTAINER
              value: "mcp-server"
            # Signal to send (SIGKILL, SIGTERM)
            - name: SIGNAL
              value: "SIGKILL"
Chaos Workflow Orchestration
# chatgpt-app-chaos-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: chatgpt-resilience-workflow
  namespace: litmus
spec:
  entrypoint: resilience-pipeline
  serviceAccountName: litmus-admin
  # Cleanup on completion
  ttlStrategy:
    secondsAfterCompletion: 3600
  templates:
    - name: resilience-pipeline
      steps:
        # Step 1: Baseline metrics
        - - name: collect-baseline
            template: prometheus-query
            arguments:
              parameters:
                - name: query
                  value: "avg_over_time(mcp_request_duration_seconds[5m])"
        # Step 2: Network chaos
        - - name: network-latency
            template: chaos-experiment
            arguments:
              parameters:
                - name: experiment
                  value: "pod-network-latency"
                - name: duration
                  value: "180"
        # Step 3: Pod deletion
        - - name: pod-delete
            template: chaos-experiment
            arguments:
              parameters:
                - name: experiment
                  value: "pod-delete"
                - name: duration
                  value: "120"
        # Step 4: Memory stress
        - - name: memory-hog
            template: chaos-experiment
            arguments:
              parameters:
                - name: experiment
                  value: "pod-memory-hog"
                - name: duration
                  value: "150"
        # Step 5: Compare metrics
        - - name: analyze-impact
            template: prometheus-query
            arguments:
              parameters:
                - name: query
                  value: "avg_over_time(mcp_request_duration_seconds[5m])"
        # Step 6: Generate report
        - - name: generate-report
            template: chaos-report
    - name: chaos-experiment
      inputs:
        parameters:
          - name: experiment
          - name: duration
      container:
        image: litmuschaos/litmus-checker:latest
        command: ["/bin/bash"]
        args:
          - -c
          - |
            kubectl apply -f - <<EOF
            apiVersion: litmuschaos.io/v1alpha1
            kind: ChaosEngine
            metadata:
              name: workflow-{{inputs.parameters.experiment}}
              namespace: chatgpt-apps
            spec:
              appinfo:
                appns: chatgpt-apps
                applabel: "app=mcp-server"
                appkind: deployment
              engineState: active
              chaosServiceAccount: litmus-admin
              experiments:
                - name: {{inputs.parameters.experiment}}
                  spec:
                    components:
                      env:
                        - name: TOTAL_CHAOS_DURATION
                          value: "{{inputs.parameters.duration}}"
            EOF
            # ChaosEngine has no "complete" condition; wait on its status field
            kubectl wait --for=jsonpath='{.status.engineStatus}'=completed \
              chaosengine/workflow-{{inputs.parameters.experiment}} \
              -n chatgpt-apps --timeout={{inputs.parameters.duration}}s
    - name: prometheus-query
      inputs:
        parameters:
          - name: query
      container:
        image: alpine:3.19
        command: ["/bin/sh"]
        args:
          - -c
          - |
            # curlimages/curl ships without jq, so install both tools first
            apk add --no-cache curl jq >/dev/null
            curl -s -G "http://prometheus:9090/api/v1/query" \
              --data-urlencode "query={{inputs.parameters.query}}" \
              | jq -r '.data.result[0].value[1]' > /tmp/metric.txt
            cat /tmp/metric.txt
    - name: chaos-report
      container:
        image: python:3.11-slim
        command: ["/bin/bash"]
        args:
          - -c
          - |
            cat > /tmp/report.py <<'PYTHON'
            import json
            import subprocess
            from datetime import datetime

            # Fetch chaos results
            results = subprocess.check_output([
                "kubectl", "get", "chaosresult",
                "-n", "chatgpt-apps",
                "-o", "json"
            ])
            data = json.loads(results)
            print("=" * 60)
            print("ChatGPT App Chaos Engineering Report")
            print("=" * 60)
            print(f"Generated: {datetime.now().isoformat()}")
            print()
            for item in data.get("items", []):
                name = item["metadata"]["name"]
                status = item["status"]["experimentStatus"]
                print(f"Experiment: {name}")
                print(f"Verdict: {status['verdict']}")
                print(f"ProbeSuccess: {status.get('probeSuccessPercentage', 'N/A')}")
                print("-" * 60)
            PYTHON
            python3 /tmp/report.py
Explore Kubernetes deployment strategies for ChatGPT apps and container orchestration patterns.
Network Chaos Engineering
Network failures are among the most common production issues. ChatGPT apps are particularly vulnerable because they depend on external APIs, OAuth providers, databases, and distributed MCP tools. Network chaos helps validate timeout handling, retry logic, circuit breakers, and graceful degradation.
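One behavior network chaos should validate is a circuit breaker that stops hammering a degraded dependency instead of queueing up doomed requests. Here is a deliberately minimal sketch; the thresholds and the injectable clock are illustrative, not a production implementation:

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cool-down.

    `clock` is injectable so the breaker can be tested deterministically.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit one trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


# Deterministic demo with a fake clock:
now = [0.0]
cb = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
for _ in range(2):
    cb.record_failure()
blocked = cb.allow()    # circuit is open
now[0] = 11.0
recovered = cb.allow()  # half-open after the cool-down
print(blocked, recovered)  # False True
```

A latency or packet-loss experiment should confirm the breaker actually trips at the configured threshold and that requests are shed rather than piled onto the struggling dependency.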
Network Latency Injection
# network-latency-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: chatgpt-network-latency
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list", "patch", "delete", "create"]
      - apiGroups: [""]
        resources: ["events"]
        verbs: ["create", "get", "list", "patch", "update"]
    image: litmuschaos/go-runner:latest
    imagePullPolicy: Always
    args:
      - -c
      - ./experiments -name pod-network-latency
    command:
      - /bin/bash
    env:
      # Target network interface
      - name: NETWORK_INTERFACE
        value: "eth0"
      # Latency to inject (ms)
      - name: NETWORK_LATENCY
        value: "2000"
      # Latency variation (jitter, ms)
      - name: JITTER
        value: "500"
      # Chaos duration (seconds)
      - name: TOTAL_CHAOS_DURATION
        value: "180"
      # Container runtime
      - name: CONTAINER_RUNTIME
        value: "containerd"
      # Runtime socket path
      - name: SOCKET_PATH
        value: "/run/containerd/containerd.sock"
      # Target specific container
      - name: TARGET_CONTAINER
        value: "mcp-server"
      # Destination IPs to affect (empty = all)
      - name: DESTINATION_IPS
        value: ""
      # Destination ports (comma-separated)
      - name: DESTINATION_PORTS
        value: "443,5432,6379"
      # Source ports
      - name: SOURCE_PORTS
        value: ""
      # Percentage of packets to drop
      - name: NETWORK_PACKET_LOSS_PERCENTAGE
        value: "0"
      # Percentage of packets to duplicate
      - name: NETWORK_PACKET_DUPLICATION_PERCENTAGE
        value: "0"
      # Percentage of packets to corrupt
      - name: NETWORK_PACKET_CORRUPTION_PERCENTAGE
        value: "0"
    labels:
      name: chatgpt-network-latency
      app.kubernetes.io/part-of: litmus
      app.kubernetes.io/component: experiment-job
      app.kubernetes.io/version: latest
Packet Loss Simulation
# packet-loss-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: mcp-packet-loss
  namespace: chatgpt-apps
spec:
  appinfo:
    appns: chatgpt-apps
    applabel: "app=mcp-server"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            # Packet loss percentage (0-100)
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "30"
            # Chaos duration
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            # Network interface
            - name: NETWORK_INTERFACE
              value: "eth0"
            # Target specific IPs
            - name: DESTINATION_IPS
              value: "10.0.0.0/8,172.16.0.0/12"
            # Target ports (PostgreSQL, Redis, OpenAI API)
            - name: DESTINATION_PORTS
              value: "5432,6379,443"
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"
        probe:
          # HTTP health check probe
          - name: mcp-health-check
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: "http://mcp-server:8080/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 3
              probePollingInterval: 2
          # Command probe for API connectivity
          - name: openai-api-reachable
            type: cmdProbe
            mode: Edge
            cmdProbe/inputs:
              # Without credentials the API returns 401; any HTTP status code
              # proves DNS and TCP connectivity survive the packet loss
              command: curl -s -o /dev/null -w "%{http_code}" https://api.openai.com/v1/models
              comparator:
                type: string
                criteria: contains
                value: "401"
            runProperties:
              probeTimeout: 10
              interval: 5
              retry: 2
          # Prometheus metrics probe
          - name: error-rate-threshold
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              query: "rate(mcp_requests_total{status='error'}[1m])"
              comparator:
                criteria: "<="
                value: "0.05"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 1
DNS Failure Injection
#!/bin/bash
# dns-chaos-experiment.sh
set -e

NAMESPACE="chatgpt-apps"
DEPLOYMENT="mcp-server"
DURATION=300
CHAOS_POD=""

cleanup() {
  echo "Cleaning up DNS chaos..."
  if [ -n "$CHAOS_POD" ]; then
    kubectl exec -n "$NAMESPACE" "$CHAOS_POD" -- \
      sh -c "rm -f /etc/hosts.chaos && \
        [ -f /etc/hosts.backup ] && mv /etc/hosts.backup /etc/hosts || true"
  fi
  echo "DNS chaos cleanup complete"
}
trap cleanup EXIT INT TERM

echo "Starting DNS chaos experiment for $DEPLOYMENT"

# Get target pod
CHAOS_POD=$(kubectl get pods -n "$NAMESPACE" \
  -l "app=$DEPLOYMENT" \
  -o jsonpath='{.items[0].metadata.name}')

if [ -z "$CHAOS_POD" ]; then
  echo "Error: No pods found for deployment $DEPLOYMENT"
  exit 1
fi
echo "Target pod: $CHAOS_POD"

# Backup original /etc/hosts
kubectl exec -n "$NAMESPACE" "$CHAOS_POD" -- \
  sh -c "cp /etc/hosts /etc/hosts.backup"

# Inject DNS failures
cat <<EOF | kubectl exec -i -n "$NAMESPACE" "$CHAOS_POD" -- sh -c "cat > /etc/hosts.chaos"
127.0.0.1 localhost
# DNS chaos - redirect critical domains to non-existent IPs
192.0.2.1 api.openai.com
192.0.2.1 auth.openai.com
192.0.2.1 postgresql.database.svc.cluster.local
192.0.2.1 redis.cache.svc.cluster.local
192.0.2.1 oauth.google.com
192.0.2.1 accounts.google.com
EOF

kubectl exec -n "$NAMESPACE" "$CHAOS_POD" -- \
  sh -c "cat /etc/hosts.chaos > /etc/hosts"
echo "DNS chaos injected. Monitoring for $DURATION seconds..."

# Monitor application health
START_TIME=$(date +%s)
ERROR_COUNT=0
SUCCESS_COUNT=0

while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))
  if [ $ELAPSED -ge $DURATION ]; then
    break
  fi
  # Check pod health
  if kubectl get pod -n "$NAMESPACE" "$CHAOS_POD" \
    -o jsonpath='{.status.phase}' | grep -q "Running"; then
    SUCCESS_COUNT=$((SUCCESS_COUNT + 1))
  else
    ERROR_COUNT=$((ERROR_COUNT + 1))
  fi
  # Check application metrics
  RESPONSE=$(kubectl exec -n "$NAMESPACE" "$CHAOS_POD" -- \
    curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health || echo "000")
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Health check: $RESPONSE"
  sleep 10
done

# Calculate success rate
TOTAL_CHECKS=$((SUCCESS_COUNT + ERROR_COUNT))
if [ $TOTAL_CHECKS -gt 0 ]; then
  SUCCESS_RATE=$((SUCCESS_COUNT * 100 / TOTAL_CHECKS))
  echo "DNS chaos complete. Success rate: $SUCCESS_RATE% ($SUCCESS_COUNT/$TOTAL_CHECKS)"
else
  echo "DNS chaos complete. No health checks performed."
fi
Learn about API resilience patterns and error handling best practices.
Infrastructure Chaos Engineering
Infrastructure chaos tests how your ChatGPT app handles compute, memory, disk, and orchestration failures. These experiments validate resource limits, autoscaling policies, persistent volume handling, and cluster resilience.
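Pod-deletion experiments with FORCE=false are only meaningful if the application honors SIGTERM and drains in-flight requests before exiting. A minimal, POSIX-only sketch of the shutdown flag such experiments exercise; the class name and the self-sent signal are illustrative:

```python
import os
import signal


class GracefulShutdown:
    """Set a flag on SIGTERM so the server loop can stop accepting work,
    finish in-flight MCP requests, and exit within the grace period."""

    def __init__(self):
        self.stopping = False
        # Replace the default handler (which would kill the process outright)
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.stopping = True


shutdown = GracefulShutdown()

# Simulate the kubelet's SIGTERM during a graceful pod deletion:
os.kill(os.getpid(), signal.SIGTERM)
print(shutdown.stopping)  # True
```

A request loop would then check `shutdown.stopping` between units of work; chaos testing verifies the drain completes inside `terminationGracePeriodSeconds`.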
Pod Deletion Chaos
# pod-delete-chaos.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: chatgpt-apps
spec:
  appinfo:
    appns: chatgpt-apps
    applabel: "app=mcp-server"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  # Grace period for terminating chaos resources (seconds)
  terminationGracePeriodSeconds: 30
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total duration (seconds)
            - name: TOTAL_CHAOS_DURATION
              value: "180"
            # Interval between deletions (seconds)
            - name: CHAOS_INTERVAL
              value: "30"
            # Percentage of pods to delete (0-100)
            - name: PODS_AFFECTED_PERC
              value: "50"
            # Force delete without graceful shutdown
            - name: FORCE
              value: "false"
            # Randomize pod selection
            - name: RANDOMNESS
              value: "true"
            # Target specific pods by name
            - name: TARGET_PODS
              value: ""
            # Sequence (serial or parallel)
            - name: SEQUENCE
              value: "parallel"
        probe:
          # Check deployment availability
          - name: deployment-available
            type: k8sProbe
            mode: Continuous
            k8sProbe/inputs:
              group: apps
              version: v1
              resource: deployments
              namespace: chatgpt-apps
              fieldSelector: metadata.name=mcp-server
              operation: present
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 3
          # Check minimum replica count
          - name: min-replicas-running
            type: cmdProbe
            mode: Continuous
            cmdProbe/inputs:
              # Emit the available replica count; the comparator checks >= 2
              command: |
                kubectl get deployment mcp-server -n chatgpt-apps \
                  -o jsonpath='{.status.availableReplicas}'
              comparator:
                type: int
                criteria: ">="
                value: "2"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 2
          # End-to-end API test
          - name: api-functional
            type: httpProbe
            mode: Edge
            httpProbe/inputs:
              url: "http://mcp-server:8080/api/v1/tools"
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 10
              interval: 5
              retry: 3
Memory Stress Experiment
# memory-stress-chaos.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress-chaos
  namespace: chatgpt-apps
spec:
  appinfo:
    appns: chatgpt-apps
    applabel: "app=mcp-server"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            # Memory to consume (MB)
            - name: MEMORY_CONSUMPTION
              value: "1024"
            # Number of workers
            - name: NUMBER_OF_WORKERS
              value: "8"
            # Chaos duration (seconds)
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            # Target specific container
            - name: TARGET_CONTAINER
              value: "mcp-server"
            # Percentage of pods to affect
            - name: PODS_AFFECTED_PERC
              value: "100"
            # Memory consumption percentage (relative to limit)
            - name: MEMORY_PERCENTAGE
              value: "80"
            # Sequence
            - name: SEQUENCE
              value: "parallel"
        probe:
          # Monitor memory usage
          - name: memory-usage-acceptable
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              query: |
                container_memory_working_set_bytes{
                  namespace="chatgpt-apps",
                  pod=~"mcp-server-.*"
                } / container_spec_memory_limit_bytes{
                  namespace="chatgpt-apps",
                  pod=~"mcp-server-.*"
                } * 100
              comparator:
                criteria: "<="
                value: "95"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 1
          # Check for OOM kills
          - name: no-oom-kills
            type: cmdProbe
            mode: OnChaos
            cmdProbe/inputs:
              # Emit the OOMKilling event count; the comparator expects 0
              command: |
                kubectl get events -n chatgpt-apps \
                  --field-selector reason=OOMKilling -o json | \
                  jq '.items | length'
              comparator:
                type: int
                criteria: ==
                value: "0"
            runProperties:
              probeTimeout: 5
              interval: 30
              retry: 1
Node Failure Simulation
#!/usr/bin/env python3
# node-chaos-monkey.py
import json
import random
import subprocess
import time
from datetime import datetime
from typing import Any, Dict, List


class NodeChaosMonkey:
    """Simulates node failures for ChatGPT app resilience testing."""

    def __init__(
        self,
        namespace: str = "chatgpt-apps",
        target_label: str = "app=mcp-server",
        chaos_duration: int = 300,
        node_failure_percent: float = 0.33
    ):
        self.namespace = namespace
        self.target_label = target_label
        self.chaos_duration = chaos_duration
        self.node_failure_percent = node_failure_percent
        self.affected_nodes = []
        self.start_time = None

    def get_nodes_running_workload(self) -> List[str]:
        """Get list of nodes running target workload."""
        try:
            # Get pods for target workload
            result = subprocess.run([
                "kubectl", "get", "pods",
                "-n", self.namespace,
                "-l", self.target_label,
                "-o", "json"
            ], capture_output=True, text=True, check=True)
            pods = json.loads(result.stdout)
            # Extract unique node names
            nodes = set()
            for pod in pods.get("items", []):
                node_name = pod["spec"].get("nodeName")
                if node_name:
                    nodes.add(node_name)
            return list(nodes)
        except subprocess.CalledProcessError as e:
            print(f"Error getting nodes: {e}")
            return []

    def cordon_node(self, node_name: str) -> bool:
        """Mark node as unschedulable."""
        try:
            subprocess.run([
                "kubectl", "cordon", node_name
            ], check=True, capture_output=True)
            print(f"[{datetime.now().isoformat()}] Cordoned node: {node_name}")
            return True
        except subprocess.CalledProcessError as e:
            print(f"Error cordoning node {node_name}: {e}")
            return False

    def drain_node(self, node_name: str, force: bool = False) -> bool:
        """Drain pods from node."""
        try:
            cmd = [
                "kubectl", "drain", node_name,
                "--delete-emptydir-data",
                "--ignore-daemonsets",
                "--timeout=60s"
            ]
            if force:
                cmd.append("--force")
            subprocess.run(cmd, check=True, capture_output=True)
            print(f"[{datetime.now().isoformat()}] Drained node: {node_name}")
            return True
        except subprocess.CalledProcessError as e:
            print(f"Error draining node {node_name}: {e}")
            return False

    def uncordon_node(self, node_name: str) -> bool:
        """Mark node as schedulable."""
        try:
            subprocess.run([
                "kubectl", "uncordon", node_name
            ], check=True, capture_output=True)
            print(f"[{datetime.now().isoformat()}] Uncordoned node: {node_name}")
            return True
        except subprocess.CalledProcessError as e:
            print(f"Error uncordoning node {node_name}: {e}")
            return False

    def check_deployment_health(self) -> Dict[str, Any]:
        """Check deployment health metrics."""
        try:
            result = subprocess.run([
                "kubectl", "get", "deployment",
                "-n", self.namespace,
                "-l", self.target_label,
                "-o", "json"
            ], capture_output=True, text=True, check=True)
            deployments = json.loads(result.stdout)
            health = {
                "healthy": True,
                "total_replicas": 0,
                "available_replicas": 0,
                "unavailable_replicas": 0
            }
            for deployment in deployments.get("items", []):
                status = deployment.get("status", {})
                health["total_replicas"] += status.get("replicas", 0)
                health["available_replicas"] += status.get("availableReplicas", 0)
                health["unavailable_replicas"] += status.get("unavailableReplicas", 0)
            # Consider healthy if at least 50% of replicas are available
            if health["total_replicas"] > 0:
                availability = health["available_replicas"] / health["total_replicas"]
                health["healthy"] = availability >= 0.5
            return health
        except subprocess.CalledProcessError as e:
            print(f"Error checking deployment health: {e}")
            return {"healthy": False, "total_replicas": 0,
                    "available_replicas": 0, "unavailable_replicas": 0}

    def run_chaos_experiment(self):
        """Execute node chaos experiment."""
        print("=" * 60)
        print("Node Chaos Monkey - ChatGPT App Resilience Test")
        print("=" * 60)
        print(f"Namespace: {self.namespace}")
        print(f"Target: {self.target_label}")
        print(f"Duration: {self.chaos_duration}s")
        print(f"Node failure rate: {self.node_failure_percent * 100}%")
        print()
        # Get nodes running workload
        nodes = self.get_nodes_running_workload()
        if not nodes:
            print("Error: No nodes found running target workload")
            return
        print(f"Found {len(nodes)} nodes running workload: {nodes}")
        # Select nodes to fail
        num_nodes_to_fail = max(1, int(len(nodes) * self.node_failure_percent))
        self.affected_nodes = random.sample(nodes, num_nodes_to_fail)
        print(f"Targeting {num_nodes_to_fail} nodes for chaos: {self.affected_nodes}")
        print()
        self.start_time = datetime.now()
        try:
            # Cordon and drain nodes
            for node in self.affected_nodes:
                if self.cordon_node(node):
                    self.drain_node(node, force=False)
                time.sleep(5)
            print()
            print(f"Node failures injected. Monitoring for {self.chaos_duration}s...")
            print()
            # Monitor deployment health
            check_interval = 15
            checks_performed = 0
            healthy_checks = 0
            while (datetime.now() - self.start_time).total_seconds() < self.chaos_duration:
                health = self.check_deployment_health()
                checks_performed += 1
                if health["healthy"]:
                    healthy_checks += 1
                print(f"[{datetime.now().isoformat()}] Health check #{checks_performed}:")
                print(f"  Available: {health['available_replicas']}/{health['total_replicas']}")
                print(f"  Status: {'HEALTHY' if health['healthy'] else 'DEGRADED'}")
                print()
                time.sleep(check_interval)
            # Calculate success rate
            if checks_performed > 0:
                success_rate = (healthy_checks / checks_performed) * 100
                print(f"Chaos experiment complete. Success rate: "
                      f"{success_rate:.1f}% ({healthy_checks}/{checks_performed})")
        finally:
            # Cleanup: uncordon nodes
            print()
            print("Cleaning up node chaos...")
            for node in self.affected_nodes:
                self.uncordon_node(node)
            print("Node chaos cleanup complete")


if __name__ == "__main__":
    monkey = NodeChaosMonkey(
        namespace="chatgpt-apps",
        target_label="app=mcp-server",
        chaos_duration=300,
        node_failure_percent=0.33
    )
    monkey.run_chaos_experiment()
Discover high availability architectures and disaster recovery planning.
Automated Chaos GameDays
Chaos GameDays are time-boxed chaos engineering exercises that test organizational resilience, not just technical resilience. Automated GameDays remove manual coordination overhead and enable continuous resilience validation.
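An automated GameDay also needs an abort guard so a failing scenario cannot take the whole app down while nobody is watching. A sketch of such a guard, with illustrative thresholds; the orchestrator would call it between health checks and tear down the active ChaosEngine when it returns True:

```python
def should_abort(baseline_error_rate: float, current_error_rate: float,
                 max_ratio: float = 3.0, hard_ceiling: float = 0.05) -> bool:
    """Decide whether to stop injecting faults.

    Two tripwires: the error rate crossing an absolute ceiling,
    or growing past `max_ratio` times the pre-GameDay baseline.
    """
    if current_error_rate >= hard_ceiling:
        return True
    if baseline_error_rate > 0 and current_error_rate / baseline_error_rate >= max_ratio:
        return True
    return False


# With a 0.1% baseline: a doubling is tolerated, a quadrupling aborts,
# and anything over the 5% ceiling aborts regardless of baseline.
print(should_abort(0.001, 0.002))  # False
print(should_abort(0.001, 0.004))  # True
print(should_abort(0.0, 0.06))     # True
```

Wiring this into the orchestration loop turns "minimize blast radius" from a principle into an enforced invariant.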
Automated GameDay Orchestration
#!/bin/bash
# chaos-gameday-orchestrator.sh
set -e
NAMESPACE="chatgpt-apps"
GAMEDAY_DURATION=3600 # 1 hour
REPORT_DIR="./chaos-reports"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p "$REPORT_DIR"
echo "========================================="
echo "Chaos GameDay - ChatGPT App Resilience"
echo "========================================="
echo "Start time: $(date)"
echo "Duration: ${GAMEDAY_DURATION}s ($(($GAMEDAY_DURATION / 60)) minutes)"
echo
# Pre-GameDay baseline
echo "Collecting baseline metrics..."
kubectl top pods -n "$NAMESPACE" > "$REPORT_DIR/baseline-pods-$TIMESTAMP.txt"
kubectl top nodes > "$REPORT_DIR/baseline-nodes-$TIMESTAMP.txt"
curl -s "http://prometheus:9090/api/v1/query?query=avg_over_time(mcp_request_duration_seconds[5m])" \
| jq -r '.data.result[0].value[1]' > "$REPORT_DIR/baseline-latency-$TIMESTAMP.txt"
curl -s "http://prometheus:9090/api/v1/query?query=rate(mcp_requests_total{status='error'}[5m])" \
| jq -r '.data.result[0].value[1]' > "$REPORT_DIR/baseline-errors-$TIMESTAMP.txt"
echo "Baseline metrics collected"
echo
# Scenario 1: Network latency (15 minutes)
echo "Scenario 1: Network latency injection"
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: gameday-network-latency
namespace: $NAMESPACE
spec:
appinfo:
appns: $NAMESPACE
applabel: "app=mcp-server"
appkind: deployment
engineState: active
chaosServiceAccount: litmus-admin
experiments:
- name: pod-network-latency
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "900"
- name: NETWORK_LATENCY
value: "3000"
- name: JITTER
value: "1000"
EOF
echo "Waiting 900s for network latency chaos..."
sleep 900
kubectl delete chaosengine gameday-network-latency -n "$NAMESPACE"
echo "Network latency chaos complete"
echo
# Cool-down period
echo "Cool-down period (5 minutes)..."
sleep 300
# Scenario 2: Pod deletion (15 minutes)
echo "Scenario 2: Random pod deletion"
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: gameday-pod-delete
namespace: $NAMESPACE
spec:
appinfo:
appns: $NAMESPACE
applabel: "app=mcp-server"
appkind: deployment
engineState: active
chaosServiceAccount: litmus-admin
experiments:
- name: pod-delete
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "900"
- name: CHAOS_INTERVAL
value: "60"
- name: PODS_AFFECTED_PERC
value: "50"
EOF
echo "Waiting 900s for pod deletion chaos..."
sleep 900
kubectl delete chaosengine gameday-pod-delete -n "$NAMESPACE"
echo "Pod deletion chaos complete"
echo
# Cool-down period
echo "Cool-down period (5 minutes)..."
sleep 300
# Scenario 3: Memory stress (15 minutes)
echo "Scenario 3: Memory stress"
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: gameday-memory-stress
namespace: $NAMESPACE
spec:
appinfo:
appns: $NAMESPACE
applabel: "app=mcp-server"
appkind: deployment
engineState: active
chaosServiceAccount: litmus-admin
experiments:
- name: pod-memory-hog
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "900"
- name: MEMORY_CONSUMPTION
value: "768"
- name: NUMBER_OF_WORKERS
value: "6"
EOF
echo "Waiting 900s for memory stress chaos..."
sleep 900
kubectl delete chaosengine gameday-memory-stress -n "$NAMESPACE"
echo "Memory stress chaos complete"
echo
# Post-GameDay analysis
echo "Collecting post-GameDay metrics..."
kubectl top pods -n "$NAMESPACE" > "$REPORT_DIR/post-pods-$TIMESTAMP.txt"
kubectl top nodes > "$REPORT_DIR/post-nodes-$TIMESTAMP.txt"
curl -s "http://prometheus:9090/api/v1/query?query=avg_over_time(mcp_request_duration_seconds[5m])" \
| jq -r '.data.result[0].value[1]' > "$REPORT_DIR/post-latency-$TIMESTAMP.txt"
curl -s "http://prometheus:9090/api/v1/query?query=rate(mcp_requests_total{status='error'}[5m])" \
| jq -r '.data.result[0].value[1]' > "$REPORT_DIR/post-errors-$TIMESTAMP.txt"
# Fetch chaos results
kubectl get chaosresult -n "$NAMESPACE" -o json > "$REPORT_DIR/chaos-results-$TIMESTAMP.json"
echo "Post-GameDay metrics collected"
echo
# Generate report
python3 <<PYTHON
import json
from datetime import datetime

print("=" * 60)
print("Chaos GameDay Report")
print("=" * 60)
print(f"Timestamp: $TIMESTAMP")
print(f"Generated: {datetime.now().isoformat()}")
print()

# Load chaos results
with open("$REPORT_DIR/chaos-results-$TIMESTAMP.json") as f:
    results = json.load(f)

# In a Litmus ChaosResult, the verdict lives under .status.experimentStatus
total_experiments = len(results.get("items", []))
passed_experiments = sum(1 for item in results.get("items", [])
                         if item["status"]["experimentStatus"]["verdict"] == "Pass")

print(f"Experiments run: {total_experiments}")
print(f"Passed: {passed_experiments}")
print(f"Failed: {total_experiments - passed_experiments}")
print(f"Success rate: {(passed_experiments / total_experiments * 100) if total_experiments > 0 else 0:.1f}%")
print()

# Load baseline and post-chaos metrics
with open("$REPORT_DIR/baseline-latency-$TIMESTAMP.txt") as f:
    baseline_latency = float(f.read().strip())
with open("$REPORT_DIR/post-latency-$TIMESTAMP.txt") as f:
    post_latency = float(f.read().strip())
with open("$REPORT_DIR/baseline-errors-$TIMESTAMP.txt") as f:
    baseline_errors = float(f.read().strip())
with open("$REPORT_DIR/post-errors-$TIMESTAMP.txt") as f:
    post_errors = float(f.read().strip())

print("Performance Impact:")
print(f"  Baseline latency: {baseline_latency:.3f}s")
print(f"  Post-chaos latency: {post_latency:.3f}s")
print(f"  Latency increase: {((post_latency / baseline_latency - 1) * 100) if baseline_latency > 0 else 0:.1f}%")
print()
print(f"  Baseline error rate: {baseline_errors:.4f}")
print(f"  Post-chaos error rate: {post_errors:.4f}")
print(f"  Error rate change: {((post_errors / baseline_errors - 1) * 100) if baseline_errors > 0 else 0:.1f}%")
print()

# Detailed experiment results
print("Experiment Details:")
print("-" * 60)
for item in results.get("items", []):
    name = item["metadata"]["name"]
    verdict = item["status"]["experimentStatus"]["verdict"]
    probe_success = item["status"].get("probeSuccessPercentage", "N/A")
    print(f"  {name}")
    print(f"    Verdict: {verdict}")
    print(f"    Probe Success: {probe_success}%")
    print()

print("=" * 60)
print("Full report saved to: $REPORT_DIR/")
PYTHON
echo
echo "Chaos GameDay complete!"
echo "End time: $(date)"
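The baseline-versus-post comparison in the report step can also gate a CI pipeline, failing the GameDay automatically when recovery falls short. Below is a minimal sketch of that idea, assuming the four metric values have already been read from the files the script writes; the function name and the tolerances (50% latency growth, 100% error-rate growth) are illustrative, not part of the script above.

```python
# gameday_gate.py -- hedged sketch of a regression gate for GameDay metrics.
# Thresholds and the function name are illustrative assumptions.

def regression_violations(baseline_latency: float, post_latency: float,
                          baseline_errors: float, post_errors: float,
                          max_latency_increase: float = 0.5,
                          max_error_increase: float = 1.0) -> list[str]:
    """Return human-readable threshold violations; an empty list means pass."""
    violations = []
    # Relative latency growth, guarded against a zero baseline
    if baseline_latency > 0:
        latency_delta = post_latency / baseline_latency - 1
        if latency_delta > max_latency_increase:
            violations.append(
                f"latency rose {latency_delta:.0%} (limit {max_latency_increase:.0%})")
    # Relative error-rate growth, same zero-baseline guard
    if baseline_errors > 0:
        error_delta = post_errors / baseline_errors - 1
        if error_delta > max_error_increase:
            violations.append(
                f"error rate rose {error_delta:.0%} (limit {max_error_increase:.0%})")
    return violations
```

A wrapper script could feed in the four `$REPORT_DIR` values and `sys.exit(1)` on any violation, so the pipeline fails exactly when the steady-state hypothesis does.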
Monitoring and Alerting for Chaos
# chaos-prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-prometheus-rules
  namespace: monitoring
data:
  chaos-alerts.yml: |
    groups:
      - name: chaos-engineering
        interval: 30s
        rules:
          # Alert when chaos experiment fails
          - alert: ChaosExperimentFailed
            expr: |
              litmuschaos_experiment_verdict{verdict="Fail"} > 0
            for: 1m
            labels:
              severity: critical
              team: sre
            annotations:
              summary: "Chaos experiment {{ $labels.experiment }} failed"
              description: "Experiment {{ $labels.experiment }} in namespace {{ $labels.namespace }} has failed. System may not be resilient to injected failures."

          # Alert when probe success rate is low
          - alert: ChaosProbeSuccessLow
            expr: |
              litmuschaos_probe_success_percentage < 80
            for: 5m
            labels:
              severity: warning
              team: sre
            annotations:
              summary: "Chaos probe {{ $labels.probe }} success rate low"
              description: "Probe {{ $labels.probe }} success rate is {{ $value }}%, below 80% threshold during chaos experiment."

          # Alert when error rate spikes during chaos
          - alert: ErrorRateSpikeInChaos
            expr: |
              rate(mcp_requests_total{status="error"}[5m])
              /
              rate(mcp_requests_total[5m])
              > 0.05
            for: 3m
            labels:
              severity: warning
              team: dev
            annotations:
              summary: "Error rate spike during chaos experiment"
              description: "MCP server error rate is {{ $value | humanizePercentage }}, exceeding 5% threshold during chaos testing."

          # Alert when latency increases significantly
          - alert: LatencyIncreaseDuringChaos
            expr: |
              histogram_quantile(0.95,
                rate(mcp_request_duration_seconds_bucket[5m])
              ) > 5
            for: 5m
            labels:
              severity: warning
              team: dev
            annotations:
              summary: "95th percentile latency high during chaos"
              description: "MCP server 95th percentile latency is {{ $value }}s, exceeding 5s threshold during chaos experiment."

          # Alert when deployment availability drops
          - alert: DeploymentAvailabilityLow
            expr: |
              kube_deployment_status_replicas_available
              /
              kube_deployment_spec_replicas
              < 0.5
            for: 2m
            labels:
              severity: critical
              team: sre
            annotations:
              summary: "Deployment {{ $labels.deployment }} availability low"
              description: "Deployment {{ $labels.deployment }} has only {{ $value | humanizePercentage }} replicas available during chaos experiment."

          # Alert when pod restarts increase
          - alert: PodRestartsDuringChaos
            expr: |
              rate(kube_pod_container_status_restarts_total{namespace="chatgpt-apps"}[10m]) > 0.1
            for: 5m
            labels:
              severity: warning
              team: dev
            annotations:
              summary: "Pod {{ $labels.pod }} restarting frequently"
              description: "Pod {{ $labels.pod }} is restarting at {{ $value }} restarts/second during chaos experiment."
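These rules pair naturally with an automated kill switch: if a critical chaos alert starts firing, halt the experiment rather than wait for an operator. A minimal sketch, assuming Alertmanager's v2 HTTP API and the alert names defined above; in Litmus, the abort itself can be performed by patching the ChaosEngine's engineState to "stop".

```python
# abort_on_alert.py -- hedged sketch of a chaos kill switch. The Alertmanager
# URL and the set of abort-worthy alert names are assumptions matching the
# rules above, not a standard API.
import json
import urllib.request

CRITICAL_CHAOS_ALERTS = {"ChaosExperimentFailed", "DeploymentAvailabilityLow"}

def firing_critical_alerts(alerts: list[dict]) -> list[str]:
    """Return names of active, critical alerts that should abort chaos."""
    return [
        a["labels"]["alertname"]
        for a in alerts
        if a.get("status", {}).get("state") == "active"
        and a.get("labels", {}).get("severity") == "critical"
        and a["labels"].get("alertname") in CRITICAL_CHAOS_ALERTS
    ]

def should_abort(alertmanager_url: str = "http://alertmanager:9093") -> bool:
    """Poll Alertmanager's v2 API and decide whether to stop the experiment."""
    with urllib.request.urlopen(f"{alertmanager_url}/api/v2/alerts") as resp:
        alerts = json.load(resp)
    return bool(firing_critical_alerts(alerts))
```

Run this in a loop alongside the GameDay script; when `should_abort()` returns true, issue something like `kubectl patch chaosengine <name> -n <namespace> --type merge -p '{"spec":{"engineState":"stop"}}'` to terminate the running experiment.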
For deeper coverage of these signals, explore our guides on monitoring ChatGPT apps and alerting best practices.
Conclusion: Building Antifragile ChatGPT Apps
Chaos engineering transforms resilience from an aspiration into a measurable engineering practice. By continuously injecting failures, you discover weaknesses before customers do, validate recovery procedures under realistic conditions, and build organizational confidence in system behavior.
For ChatGPT apps, where user experience depends on multiple external dependencies (OpenAI APIs, OAuth providers, databases, caching layers), chaos engineering is not optional. It's the only way to ensure your app handles the inevitable failures gracefully.
Start with small experiments in non-production environments, gradually expand to production with proper safeguards, automate GameDays to make resilience testing continuous, and treat every incident as an opportunity to expand your chaos experiment library.
Ready to build resilient ChatGPT apps with automated chaos engineering? Try MakeAIHQ.com for instant ChatGPT app creation with built-in resilience patterns. Our platform includes production-ready MCP servers, automated failover, circuit breakers, and retry logic that pass the most demanding chaos experiments.
Related resources:
- Building High Availability ChatGPT Apps
- MCP Server Optimization Guide
- Kubernetes Deployment for ChatGPT Apps
- Monitoring and Observability Best Practices
- API Gateway Patterns for ChatGPT Apps
- Error Handling in MCP Servers
- Disaster Recovery Planning
External references:
- Principles of Chaos Engineering - Official chaos engineering manifesto
- Netflix Chaos Monkey - Original chaos engineering tool
- Litmus Chaos Documentation - CNCF chaos engineering platform