Chaos Engineering for ChatGPT Apps: Resilience Testing Guide

Building ChatGPT apps that can withstand real-world failures requires more than traditional testing. Chaos engineering provides a systematic approach to discovering weaknesses before they cause outages. This guide covers implementing chaos experiments, fault injection, network disruption, and automated resilience testing for production ChatGPT applications.

Whether you're running MCP servers, widget backends, or distributed ChatGPT architectures, chaos engineering helps you build confidence in your system's ability to handle turbulent conditions. Learn how to implement continuous chaos experiments, automate GameDay scenarios, and establish resilience as a core engineering practice.

Understanding Chaos Engineering Principles

Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. Unlike traditional testing that validates expected behavior, chaos engineering actively introduces failures to discover unknown weaknesses.

The Five Pillars of Chaos

Steady State Hypothesis: Define what "normal" looks like for your ChatGPT app. This includes response times, error rates, throughput, and user experience metrics. For ChatGPT apps, steady state might include 95th percentile response time under 2 seconds, error rate below 0.1%, and MCP tool call success rate above 99.5%.
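The steady-state hypothesis can be encoded as a small, testable check. A minimal Python sketch, using the thresholds above as illustrative defaults:

```python
from dataclasses import dataclass

@dataclass
class SteadyStateHypothesis:
    """Steady-state thresholds for a ChatGPT app (values are illustrative)."""
    max_p95_latency_s: float = 2.0        # 95th percentile response time
    max_error_rate: float = 0.001         # error rate below 0.1%
    min_tool_success_rate: float = 0.995  # MCP tool call success above 99.5%

    def holds(self, p95_latency_s: float, error_rate: float,
              tool_success_rate: float) -> bool:
        # The hypothesis holds only if every metric is within its threshold
        return (p95_latency_s <= self.max_p95_latency_s
                and error_rate <= self.max_error_rate
                and tool_success_rate >= self.min_tool_success_rate)
```

An experiment passes only if `holds()` returns True both before and after fault injection; a pre-injection failure means the system was never in steady state to begin with.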

Real-World Events: Inject failures that mirror production scenarios. For ChatGPT apps, this includes API timeout failures, database connection drops, network partition between MCP server and tools, OAuth token expiration, and widget rendering failures. Focus on failures you've experienced or fear most.

Production Experiments: Run chaos experiments in production, not just staging. Staging environments rarely match production complexity, traffic patterns, or data volumes. Production chaos with proper safeguards reveals real weaknesses that staging cannot.

Minimize Blast Radius: Start small and expand gradually. Begin with 1% of traffic, single availability zones, or canary deployments. Use automated rollback mechanisms and circuit breakers to contain damage. For ChatGPT apps, this might mean running chaos on non-critical widgets before core MCP servers.
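The "start small" rule maps directly onto settings like the PODS_AFFECTED_PERC values used later in this guide. A sketch of the arithmetic, assuming the common convention of targeting at least one instance whenever any exist:

```python
import math

def targets_for_blast_radius(total_instances: int, affected_percent: float) -> int:
    """How many pods/nodes to target for a given blast-radius percentage."""
    if total_instances <= 0:
        return 0
    # Round down, but always hit at least one instance so the experiment runs
    return max(1, math.floor(total_instances * affected_percent / 100))
```

Note that at 1% on a ten-pod deployment this still targets one pod, which is why percentage-based blast-radius settings deserve a sanity check on small deployments.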

Automate Experiments: Manual chaos is not sustainable. Automate experiment execution, monitoring, analysis, and reporting. Schedule regular GameDays, integrate chaos into CI/CD pipelines, and treat resilience testing as continuous validation rather than one-time events.
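The automated loop reduces to a small skeleton: verify steady state, inject the fault, verify again, always roll back. A sketch (a real runner would also wait out the chaos duration before the second check):

```python
from typing import Callable

def run_experiment(inject: Callable[[], None],
                   rollback: Callable[[], None],
                   steady_state_ok: Callable[[], bool]) -> str:
    """Minimal automated chaos run: abort early, inject, verify, always clean up."""
    if not steady_state_ok():
        return "aborted"  # never inject chaos into an already-unhealthy system
    inject()
    try:
        return "passed" if steady_state_ok() else "failed"
    finally:
        rollback()  # cleanup runs whether or not the hypothesis held
```

The `finally` block is the important part: rollback must run even when the steady-state check raises, which is exactly the guarantee the bash `trap cleanup` in the DNS experiment below provides.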

Chaos Engineering for ChatGPT Apps

ChatGPT apps present unique chaos engineering challenges. Your MCP server might handle requests perfectly, but what happens when OpenAI's API times out? Your widget might render beautifully, but can it gracefully degrade when backend services fail?

Key failure scenarios to test include MCP protocol failures (malformed requests, timeout during tool execution), widget runtime failures (JavaScript errors, missing dependencies), authentication failures (expired tokens, OAuth flow interruption), database failures (connection pool exhaustion, query timeout), and network failures (latency spikes, packet loss, DNS resolution failures).
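Many of these scenarios come down to transient faults, so the first gap chaos experiments tend to expose is missing retry logic around tool calls. A client-side sketch with exponential backoff (`ToolCallError` is a stand-in for whatever transient error type your transport raises):

```python
import time

class ToolCallError(Exception):
    """Stand-in for a transient MCP transport failure."""

def call_with_retries(tool, max_attempts: int = 3, base_delay_s: float = 0.01):
    """Retry a flaky tool call with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolCallError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure
            time.sleep(base_delay_s * 2 ** (attempt - 1))

# Simulated flaky tool: fails twice, then succeeds
calls = {"n": 0}
def flaky_tool():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ToolCallError("transient failure")
    return "ok"

print(call_with_retries(flaky_tool))  # prints "ok" on the third attempt
```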

Implementing Litmus Chaos for Kubernetes

Litmus Chaos is a CNCF project providing comprehensive chaos engineering for Kubernetes environments. If you're running ChatGPT apps on Kubernetes, Litmus offers declarative experiment definitions, workflow orchestration, and deep integration with observability tools.

Installing Litmus Chaos

# litmus-operator-install.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: litmus
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus
  namespace: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets", "events"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus
subjects:
  - kind: ServiceAccount
    name: litmus
    namespace: litmus
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-operator
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      name: chaos-operator
  template:
    metadata:
      labels:
        name: chaos-operator
    spec:
      serviceAccountName: litmus
      containers:
        - name: chaos-operator
          image: litmuschaos/chaos-operator:latest
          command:
            - chaos-operator
          env:
            - name: WATCH_NAMESPACE
              value: ""
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: "chaos-operator"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-exporter
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-exporter
  template:
    metadata:
      labels:
        app: chaos-exporter
    spec:
      serviceAccountName: litmus
      containers:
        - name: chaos-exporter
          image: litmuschaos/chaos-exporter:latest
          ports:
            - containerPort: 8080
              name: metrics

ChaosEngine for ChatGPT MCP Server

# mcp-server-chaos-engine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: mcp-server-chaos
  namespace: chatgpt-apps
spec:
  appinfo:
    appns: chatgpt-apps
    applabel: "app=mcp-server"
    appkind: deployment

  # Chaos experiment configuration
  engineState: active
  chaosServiceAccount: litmus-admin

  # Monitor chaos progress
  monitoring: true

  # Annotate application resources
  annotationCheck: true

  # Job cleanup policy
  jobCleanUpPolicy: retain

  # Experiments to run
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total chaos duration (seconds)
            - name: TOTAL_CHAOS_DURATION
              value: "60"

            # Chaos interval (seconds)
            - name: CHAOS_INTERVAL
              value: "10"

            # Force delete (no graceful shutdown)
            - name: FORCE
              value: "false"

            # Number of pods to delete
            - name: PODS_AFFECTED_PERC
              value: "50"

            # Target specific pods by label
            - name: TARGET_PODS
              value: ""

            # Sequence (serial or parallel)
            - name: SEQUENCE
              value: "parallel"

    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"

            - name: NETWORK_LATENCY
              value: "2000"

            - name: JITTER
              value: "500"

            - name: CONTAINER_RUNTIME
              value: "containerd"

            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"

            - name: NETWORK_INTERFACE
              value: "eth0"

            - name: TARGET_CONTAINER
              value: "mcp-server"

    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "90"

            # Memory to consume (MB)
            - name: MEMORY_CONSUMPTION
              value: "512"

            # Number of workers
            - name: NUMBER_OF_WORKERS
              value: "4"

            - name: TARGET_CONTAINER
              value: "mcp-server"

    - name: container-kill
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"

            - name: CHAOS_INTERVAL
              value: "15"

            - name: CONTAINER_RUNTIME
              value: "containerd"

            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"

            - name: TARGET_CONTAINER
              value: "mcp-server"

            # Signal to send (SIGKILL, SIGTERM)
            - name: SIGNAL
              value: "SIGKILL"

Chaos Workflow Orchestration

# chatgpt-app-chaos-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: chatgpt-resilience-workflow
  namespace: litmus
spec:
  entrypoint: resilience-pipeline
  serviceAccountName: litmus-admin

  # Cleanup on completion
  ttlStrategy:
    secondsAfterCompletion: 3600

  templates:
    - name: resilience-pipeline
      steps:
        # Step 1: Baseline metrics
        - - name: collect-baseline
            template: prometheus-query
            arguments:
              parameters:
                - name: query
                  value: "avg_over_time(mcp_request_duration_seconds[5m])"

        # Step 2: Network chaos
        - - name: network-latency
            template: chaos-experiment
            arguments:
              parameters:
                - name: experiment
                  value: "pod-network-latency"
                - name: duration
                  value: "180"

        # Step 3: Pod deletion
        - - name: pod-delete
            template: chaos-experiment
            arguments:
              parameters:
                - name: experiment
                  value: "pod-delete"
                - name: duration
                  value: "120"

        # Step 4: Memory stress
        - - name: memory-hog
            template: chaos-experiment
            arguments:
              parameters:
                - name: experiment
                  value: "pod-memory-hog"
                - name: duration
                  value: "150"

        # Step 5: Compare metrics
        - - name: analyze-impact
            template: prometheus-query
            arguments:
              parameters:
                - name: query
                  value: "avg_over_time(mcp_request_duration_seconds[5m])"

        # Step 6: Generate report
        - - name: generate-report
            template: chaos-report

    - name: chaos-experiment
      inputs:
        parameters:
          - name: experiment
          - name: duration
      container:
        image: litmuschaos/litmus-checker:latest
        command: ["/bin/bash"]
        args:
          - -c
          - |
            kubectl apply -f - <<EOF
            apiVersion: litmuschaos.io/v1alpha1
            kind: ChaosEngine
            metadata:
              name: workflow-{{inputs.parameters.experiment}}
              namespace: chatgpt-apps
            spec:
              appinfo:
                appns: chatgpt-apps
                applabel: "app=mcp-server"
                appkind: deployment
              engineState: active
              chaosServiceAccount: litmus-admin
              experiments:
                - name: {{inputs.parameters.experiment}}
                  spec:
                    components:
                      env:
                        - name: TOTAL_CHAOS_DURATION
                          value: "{{inputs.parameters.duration}}"
            EOF

            # ChaosEngine exposes no 'complete' condition; wait on the ChaosResult
            # verdict instead (the result is named <engine-name>-<experiment-name>)
            kubectl wait chaosresult \
              workflow-{{inputs.parameters.experiment}}-{{inputs.parameters.experiment}} \
              -n chatgpt-apps \
              --for=jsonpath='{.status.experimentStatus.verdict}'=Pass \
              --timeout=$(({{inputs.parameters.duration}} + 120))s

    - name: prometheus-query
      inputs:
        parameters:
          - name: query
      container:
        # curlimages/curl does not bundle jq, so use a plain alpine image and install both
        image: alpine:3.19
        command: ["/bin/sh"]
        args:
          - -c
          - |
            apk add --no-cache curl jq >/dev/null
            curl -sG "http://prometheus:9090/api/v1/query" \
              --data-urlencode "query={{inputs.parameters.query}}" \
              | jq -r '.data.result[0].value[1]' > /tmp/metric.txt
            cat /tmp/metric.txt

    - name: chaos-report
      container:
        # NOTE: the image must provide both python3 and kubectl on PATH;
        # plain python:3.11-slim does not ship kubectl
        image: python:3.11-slim
        command: ["/bin/bash"]
        args:
          - -c
          - |
            cat > /tmp/report.py <<'PYTHON'
            import json
            import subprocess
            from datetime import datetime

            # Fetch chaos results
            results = subprocess.check_output([
                "kubectl", "get", "chaosresult",
                "-n", "chatgpt-apps",
                "-o", "json"
            ])

            data = json.loads(results)

            print("=" * 60)
            print("ChatGPT App Chaos Engineering Report")
            print("=" * 60)
            print(f"Generated: {datetime.now().isoformat()}")
            print()

            for item in data.get("items", []):
                name = item["metadata"]["name"]
                # Verdict and probe stats live under status.experimentStatus
                exp_status = item.get("status", {}).get("experimentStatus", {})

                print(f"Experiment: {name}")
                print(f"Verdict: {exp_status.get('verdict', 'N/A')}")
                print(f"ProbeSuccess: {exp_status.get('probeSuccessPercentage', 'N/A')}")
                print("-" * 60)
            PYTHON

            python3 /tmp/report.py

Network Chaos Engineering

Network failures are among the most common production issues. ChatGPT apps are particularly vulnerable because they depend on external APIs, OAuth providers, databases, and distributed MCP tools. Network chaos helps validate timeout handling, retry logic, circuit breakers, and graceful degradation.
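Before running cluster-level network chaos, the same idea can be exercised in-process: wrap a call in injected latency and check it against a deadline. A sketch (a real client would enforce the timeout during the call, e.g. via socket timeouts, rather than after it returns):

```python
import random
import time

def with_injected_latency(fn, latency_s: float = 0.05,
                          jitter_s: float = 0.0, seed: int = 0):
    """Fault injector: delay each call by a fixed latency plus random jitter."""
    rng = random.Random(seed)  # seeded so experiments are reproducible
    def wrapped(*args, **kwargs):
        time.sleep(latency_s + rng.uniform(0, jitter_s))
        return fn(*args, **kwargs)
    return wrapped

def call_with_deadline(fn, deadline_s: float):
    """Run fn and flag calls that blew their latency budget."""
    start = time.monotonic()
    result = fn()
    if time.monotonic() - start > deadline_s:
        raise TimeoutError("deadline exceeded")
    return result
```

This is the in-process analogue of the NETWORK_LATENCY and JITTER settings in the experiments below.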

Network Latency Injection

# network-latency-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: chatgpt-network-latency
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list", "patch", "delete", "create"]
      - apiGroups: [""]
        resources: ["events"]
        verbs: ["create", "get", "list", "patch", "update"]

    image: litmuschaos/go-runner:latest
    imagePullPolicy: Always

    args:
      - -c
      - ./experiments -name pod-network-latency

    command:
      - /bin/bash

    env:
      # Target network interface
      - name: NETWORK_INTERFACE
        value: "eth0"

      # Latency to inject (ms)
      - name: NETWORK_LATENCY
        value: "2000"

      # Latency variation (jitter, ms)
      - name: JITTER
        value: "500"

      # Chaos duration (seconds)
      - name: TOTAL_CHAOS_DURATION
        value: "180"

      # Container runtime
      - name: CONTAINER_RUNTIME
        value: "containerd"

      # Runtime socket path
      - name: SOCKET_PATH
        value: "/run/containerd/containerd.sock"

      # Target specific container
      - name: TARGET_CONTAINER
        value: "mcp-server"

      # Destination IPs to target (comma-separated; empty = all)
      - name: DESTINATION_IPS
        value: ""

      # Destination ports (comma-separated)
      - name: DESTINATION_PORTS
        value: "443,5432,6379"

      # Source ports
      - name: SOURCE_PORTS
        value: ""

      # Percentage of packets to affect
      - name: NETWORK_PACKET_LOSS_PERCENTAGE
        value: "0"

      # Percentage of packets to duplicate
      - name: NETWORK_PACKET_DUPLICATION_PERCENTAGE
        value: "0"

      # Percentage of packets to corrupt
      - name: NETWORK_PACKET_CORRUPTION_PERCENTAGE
        value: "0"

    labels:
      name: chatgpt-network-latency
      app.kubernetes.io/part-of: litmus
      app.kubernetes.io/component: experiment-job
      app.kubernetes.io/version: latest

Packet Loss Simulation

# packet-loss-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: mcp-packet-loss
  namespace: chatgpt-apps
spec:
  appinfo:
    appns: chatgpt-apps
    applabel: "app=mcp-server"
    appkind: deployment

  engineState: active
  chaosServiceAccount: litmus-admin

  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            # Packet loss percentage (0-100)
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "30"

            # Chaos duration
            - name: TOTAL_CHAOS_DURATION
              value: "120"

            # Network interface
            - name: NETWORK_INTERFACE
              value: "eth0"

            # Target specific IPs
            - name: DESTINATION_IPS
              value: "10.0.0.0/8,172.16.0.0/12"

            # Target ports (PostgreSQL, Redis, OpenAI API)
            - name: DESTINATION_PORTS
              value: "5432,6379,443"

            - name: CONTAINER_RUNTIME
              value: "containerd"

            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"

        probe:
          # HTTP health check probe
          - name: mcp-health-check
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: "http://mcp-server:8080/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 3
              probePollingInterval: 2

          # Command probe for API connectivity
          - name: openai-api-reachable
            type: cmdProbe
            mode: Edge
            cmdProbe/inputs:
              command: curl -s -o /dev/null -w "%{http_code}" https://api.openai.com/v1/models
              comparator:
                type: string
                criteria: contains
                value: "200"
            runProperties:
              probeTimeout: 10
              interval: 5
              retry: 2

          # Prometheus metrics probe
          - name: error-rate-threshold
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              query: "rate(mcp_requests_total{status='error'}[1m])"
              comparator:
                criteria: "<="
                value: "0.05"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 1

DNS Failure Injection

#!/bin/bash
# dns-chaos-experiment.sh

set -e

NAMESPACE="chatgpt-apps"
DEPLOYMENT="mcp-server"
DURATION=300
CHAOS_POD=""

cleanup() {
  echo "Cleaning up DNS chaos..."

  if [ -n "$CHAOS_POD" ]; then
    # /etc/hosts is bind-mounted into Kubernetes pods, so restore it by
    # rewriting its contents; renaming over the mount point fails
    kubectl exec -n "$NAMESPACE" "$CHAOS_POD" -- \
      sh -c "rm -f /etc/hosts.chaos; \
             if [ -f /etc/hosts.backup ]; then cat /etc/hosts.backup > /etc/hosts; fi"
  fi

  echo "DNS chaos cleanup complete"
}

trap cleanup EXIT INT TERM

echo "Starting DNS chaos experiment for $DEPLOYMENT"

# Get target pod
CHAOS_POD=$(kubectl get pods -n "$NAMESPACE" \
  -l "app=$DEPLOYMENT" \
  -o jsonpath='{.items[0].metadata.name}')

if [ -z "$CHAOS_POD" ]; then
  echo "Error: No pods found for deployment $DEPLOYMENT"
  exit 1
fi

echo "Target pod: $CHAOS_POD"

# Backup original /etc/hosts
kubectl exec -n "$NAMESPACE" "$CHAOS_POD" -- \
  sh -c "cp /etc/hosts /etc/hosts.backup"

# Inject DNS failures
cat <<EOF | kubectl exec -i -n "$NAMESPACE" "$CHAOS_POD" -- sh -c "cat > /etc/hosts.chaos"
127.0.0.1 localhost

# DNS chaos - redirect critical domains to non-existent IPs
192.0.2.1 api.openai.com
192.0.2.1 auth.openai.com
192.0.2.1 postgresql.database.svc.cluster.local
192.0.2.1 redis.cache.svc.cluster.local
192.0.2.1 oauth.google.com
192.0.2.1 accounts.google.com
EOF

kubectl exec -n "$NAMESPACE" "$CHAOS_POD" -- \
  sh -c "cat /etc/hosts.chaos > /etc/hosts"

echo "DNS chaos injected. Monitoring for $DURATION seconds..."

# Monitor application health
START_TIME=$(date +%s)
ERROR_COUNT=0
SUCCESS_COUNT=0

while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))

  if [ $ELAPSED -ge $DURATION ]; then
    break
  fi

  # Check pod health
  if kubectl get pod -n "$NAMESPACE" "$CHAOS_POD" \
       -o jsonpath='{.status.phase}' | grep -q "Running"; then
    SUCCESS_COUNT=$((SUCCESS_COUNT + 1))
  else
    ERROR_COUNT=$((ERROR_COUNT + 1))
  fi

  # Check application metrics
  RESPONSE=$(kubectl exec -n "$NAMESPACE" "$CHAOS_POD" -- \
    curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health || echo "000")

  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Health check: $RESPONSE"

  sleep 10
done

# Calculate success rate
TOTAL_CHECKS=$((SUCCESS_COUNT + ERROR_COUNT))
if [ $TOTAL_CHECKS -gt 0 ]; then
  SUCCESS_RATE=$((SUCCESS_COUNT * 100 / TOTAL_CHECKS))
  echo "DNS chaos complete. Success rate: $SUCCESS_RATE% ($SUCCESS_COUNT/$TOTAL_CHECKS)"
else
  echo "DNS chaos complete. No health checks performed."
fi
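On the application side, the failure mode this experiment probes is usually mitigated with a last-known-good address cache. A minimal sketch of that fallback (real resolvers also bound how long a stale entry may be served):

```python
import socket

def resolve_with_fallback(host: str, cache: dict) -> str:
    """Resolve a hostname, falling back to the last-known-good IP on DNS failure."""
    try:
        ip = socket.gethostbyname(host)
        cache[host] = ip  # refresh the cache on every successful lookup
        return ip
    except socket.gaierror:
        if host in cache:
            return cache[host]  # serve a stale but previously working address
        raise
```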

Infrastructure Chaos Engineering

Infrastructure chaos tests how your ChatGPT app handles compute, memory, disk, and orchestration failures. These experiments validate resource limits, autoscaling policies, persistent volume handling, and cluster resilience.

Pod Deletion Chaos

# pod-delete-chaos.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: chatgpt-apps
spec:
  appinfo:
    appns: chatgpt-apps
    applabel: "app=mcp-server"
    appkind: deployment

  engineState: active
  chaosServiceAccount: litmus-admin

  # Terminate engine on experiment completion
  terminationGracePeriodSeconds: 30

  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total duration (seconds)
            - name: TOTAL_CHAOS_DURATION
              value: "180"

            # Interval between deletions (seconds)
            - name: CHAOS_INTERVAL
              value: "30"

            # Percentage of pods to delete (0-100)
            - name: PODS_AFFECTED_PERC
              value: "50"

            # Force delete without graceful shutdown
            - name: FORCE
              value: "false"

            # Randomize pod selection
            - name: RANDOMNESS
              value: "true"

            # Target specific pods by name
            - name: TARGET_PODS
              value: ""

            # Sequence (serial or parallel)
            - name: SEQUENCE
              value: "parallel"

        probe:
          # Check deployment availability
          - name: deployment-available
            type: k8sProbe
            mode: Continuous
            k8sProbe/inputs:
              group: apps
              version: v1
              resource: deployments
              namespace: chatgpt-apps
              fieldSelector: metadata.name=mcp-server
              operation: present
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 3

          # Check minimum replica count
          - name: min-replicas-running
            type: cmdProbe
            mode: Continuous
            cmdProbe/inputs:
              command: |
                # Print the available replica count; the comparator below enforces >= 2
                kubectl get deployment mcp-server -n chatgpt-apps \
                  -o jsonpath='{.status.availableReplicas}'
              comparator:
                type: int
                criteria: ">="
                value: "2"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 2

          # End-to-end API test
          - name: api-functional
            type: httpProbe
            mode: Edge
            httpProbe/inputs:
              url: "http://mcp-server:8080/api/v1/tools"
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 10
              interval: 5
              retry: 3

Memory Stress Experiment

# memory-stress-chaos.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress-chaos
  namespace: chatgpt-apps
spec:
  appinfo:
    appns: chatgpt-apps
    applabel: "app=mcp-server"
    appkind: deployment

  engineState: active
  chaosServiceAccount: litmus-admin

  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            # Memory to consume (MB)
            - name: MEMORY_CONSUMPTION
              value: "1024"

            # Number of workers
            - name: NUMBER_OF_WORKERS
              value: "8"

            # Chaos duration (seconds)
            - name: TOTAL_CHAOS_DURATION
              value: "120"

            # Target specific container
            - name: TARGET_CONTAINER
              value: "mcp-server"

            # Percentage of pods to affect
            - name: PODS_AFFECTED_PERC
              value: "100"

            # Memory consumption percentage (relative to limit)
            - name: MEMORY_PERCENTAGE
              value: "80"

            # Sequence
            - name: SEQUENCE
              value: "parallel"

        probe:
          # Monitor memory usage
          - name: memory-usage-acceptable
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              query: |
                container_memory_working_set_bytes{
                  namespace="chatgpt-apps",
                  pod=~"mcp-server-.*"
                } / container_spec_memory_limit_bytes{
                  namespace="chatgpt-apps",
                  pod=~"mcp-server-.*"
                } * 100
              comparator:
                criteria: "<="
                value: "95"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 1

          # Check OOM kills
          - name: no-oom-kills
            type: cmdProbe
            mode: OnChaos
            cmdProbe/inputs:
              command: |
                # Count OOMKilling events; the comparator below expects zero
                kubectl get events -n chatgpt-apps \
                  --field-selector reason=OOMKilling \
                  -o name | wc -l
              comparator:
                type: int
                criteria: ==
                value: "0"
            runProperties:
              probeTimeout: 5
              interval: 30
              retry: 1

Node Failure Simulation

#!/usr/bin/env python3
# node-chaos-monkey.py

import random
import time
import subprocess
import json
from datetime import datetime
from typing import Any, Dict, List

class NodeChaosMonkey:
    """Simulates node failures for ChatGPT app resilience testing."""

    def __init__(
        self,
        namespace: str = "chatgpt-apps",
        target_label: str = "app=mcp-server",
        chaos_duration: int = 300,
        node_failure_percent: float = 0.33
    ):
        self.namespace = namespace
        self.target_label = target_label
        self.chaos_duration = chaos_duration
        self.node_failure_percent = node_failure_percent
        self.affected_nodes = []
        self.start_time = None

    def get_nodes_running_workload(self) -> List[str]:
        """Get list of nodes running target workload."""
        try:
            # Get pods for target workload
            result = subprocess.run([
                "kubectl", "get", "pods",
                "-n", self.namespace,
                "-l", self.target_label,
                "-o", "json"
            ], capture_output=True, text=True, check=True)

            pods = json.loads(result.stdout)

            # Extract unique node names
            nodes = set()
            for pod in pods.get("items", []):
                node_name = pod["spec"].get("nodeName")
                if node_name:
                    nodes.add(node_name)

            return list(nodes)

        except subprocess.CalledProcessError as e:
            print(f"Error getting nodes: {e}")
            return []

    def cordon_node(self, node_name: str) -> bool:
        """Mark node as unschedulable."""
        try:
            subprocess.run([
                "kubectl", "cordon", node_name
            ], check=True, capture_output=True)

            print(f"[{datetime.now().isoformat()}] Cordoned node: {node_name}")
            return True

        except subprocess.CalledProcessError as e:
            print(f"Error cordoning node {node_name}: {e}")
            return False

    def drain_node(self, node_name: str, force: bool = False) -> bool:
        """Drain pods from node."""
        try:
            cmd = [
                "kubectl", "drain", node_name,
                "--delete-emptydir-data",
                "--ignore-daemonsets",
                "--timeout=60s"
            ]

            if force:
                cmd.append("--force")

            subprocess.run(cmd, check=True, capture_output=True)

            print(f"[{datetime.now().isoformat()}] Drained node: {node_name}")
            return True

        except subprocess.CalledProcessError as e:
            print(f"Error draining node {node_name}: {e}")
            return False

    def uncordon_node(self, node_name: str) -> bool:
        """Mark node as schedulable."""
        try:
            subprocess.run([
                "kubectl", "uncordon", node_name
            ], check=True, capture_output=True)

            print(f"[{datetime.now().isoformat()}] Uncordoned node: {node_name}")
            return True

        except subprocess.CalledProcessError as e:
            print(f"Error uncordoning node {node_name}: {e}")
            return False

    def check_deployment_health(self) -> Dict[str, Any]:
        """Check deployment health metrics."""
        try:
            result = subprocess.run([
                "kubectl", "get", "deployment",
                "-n", self.namespace,
                "-l", self.target_label,
                "-o", "json"
            ], capture_output=True, text=True, check=True)

            deployments = json.loads(result.stdout)

            health = {
                "healthy": True,
                "total_replicas": 0,
                "available_replicas": 0,
                "unavailable_replicas": 0
            }

            for deployment in deployments.get("items", []):
                status = deployment.get("status", {})

                health["total_replicas"] += status.get("replicas", 0)
                health["available_replicas"] += status.get("availableReplicas", 0)
                health["unavailable_replicas"] += status.get("unavailableReplicas", 0)

            # Consider healthy if at least 50% replicas available
            if health["total_replicas"] > 0:
                availability = health["available_replicas"] / health["total_replicas"]
                health["healthy"] = availability >= 0.5

            return health

        except subprocess.CalledProcessError as e:
            print(f"Error checking deployment health: {e}")
            return {"healthy": False}

    def run_chaos_experiment(self):
        """Execute node chaos experiment."""
        print("=" * 60)
        print("Node Chaos Monkey - ChatGPT App Resilience Test")
        print("=" * 60)
        print(f"Namespace: {self.namespace}")
        print(f"Target: {self.target_label}")
        print(f"Duration: {self.chaos_duration}s")
        print(f"Node failure rate: {self.node_failure_percent * 100}%")
        print()

        # Get nodes running workload
        nodes = self.get_nodes_running_workload()

        if not nodes:
            print("Error: No nodes found running target workload")
            return

        print(f"Found {len(nodes)} nodes running workload: {nodes}")

        # Select nodes to fail
        num_nodes_to_fail = max(1, int(len(nodes) * self.node_failure_percent))
        self.affected_nodes = random.sample(nodes, num_nodes_to_fail)

        print(f"Targeting {num_nodes_to_fail} nodes for chaos: {self.affected_nodes}")
        print()

        self.start_time = datetime.now()

        try:
            # Cordon and drain nodes
            for node in self.affected_nodes:
                if self.cordon_node(node):
                    self.drain_node(node, force=False)

                time.sleep(5)

            print()
            print(f"Node failures injected. Monitoring for {self.chaos_duration}s...")
            print()

            # Monitor deployment health
            check_interval = 15
            checks_performed = 0
            healthy_checks = 0

            while (datetime.now() - self.start_time).total_seconds() < self.chaos_duration:
                health = self.check_deployment_health()
                checks_performed += 1

                if health["healthy"]:
                    healthy_checks += 1

                print(f"[{datetime.now().isoformat()}] Health check #{checks_performed}:")
                print(f"  Available: {health['available_replicas']}/{health['total_replicas']}")
                print(f"  Status: {'HEALTHY' if health['healthy'] else 'DEGRADED'}")
                print()

                time.sleep(check_interval)

            # Calculate success rate
            if checks_performed > 0:
                success_rate = (healthy_checks / checks_performed) * 100
                print(f"Chaos experiment complete. Success rate: {success_rate:.1f}% ({healthy_checks}/{checks_performed})")

        finally:
            # Cleanup: uncordon nodes
            print()
            print("Cleaning up node chaos...")
            for node in self.affected_nodes:
                self.uncordon_node(node)

            print("Node chaos cleanup complete")

if __name__ == "__main__":
    monkey = NodeChaosMonkey(
        namespace="chatgpt-apps",
        target_label="app=mcp-server",
        chaos_duration=300,
        node_failure_percent=0.33
    )

    monkey.run_chaos_experiment()
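
The script above caps targets with `node_failure_percent`, but it never verifies that the surviving nodes can still satisfy a minimum availability budget before draining begins. A minimal pre-flight guard might look like the following sketch; `min_available_fraction` and the helper itself are assumptions, not part of the script above:

```python
import math

def max_drainable_nodes(total_nodes: int, min_available_fraction: float) -> int:
    """Return how many nodes an experiment may drain while keeping at
    least min_available_fraction of the pool schedulable (and never
    fewer than one surviving node)."""
    if total_nodes <= 0:
        return 0
    must_survive = max(1, math.ceil(total_nodes * min_available_fraction))
    return max(0, total_nodes - must_survive)

# Example: with 6 nodes and a 50% availability budget,
# at most 3 nodes may be drained.
print(max_drainable_nodes(6, 0.5))
```

Clamping `num_nodes_to_fail` with a check like this before cordoning turns "33% of nodes" from a hope into an enforced invariant.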

Discover high availability architectures and disaster recovery planning.

Automated Chaos GameDays

Chaos GameDays are time-boxed chaos engineering exercises that test organizational resilience, not just technical resilience. Automated GameDays remove manual coordination overhead and enable continuous resilience validation.
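
Before automating a GameDay, it helps to express the schedule as data so total runtime and ordering can be audited before anything touches the cluster. A sketch of the three-scenario schedule used below (the dataclass is illustrative, not part of any chaos framework):

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    chaos_seconds: int
    cooldown_seconds: int

# Mirrors the GameDay below: 15 minutes of chaos per scenario,
# with 5-minute cool-downs between scenarios (none after the last).
SCHEDULE = [
    Scenario("pod-network-latency", 900, 300),
    Scenario("pod-delete", 900, 300),
    Scenario("pod-memory-hog", 900, 0),
]

def total_runtime(schedule: list) -> int:
    """Sum chaos and cool-down time across all scenarios."""
    return sum(s.chaos_seconds + s.cooldown_seconds for s in schedule)

print(total_runtime(SCHEDULE))  # 3300 seconds (55 minutes)
```

This makes it obvious that the schedule fits inside the one-hour GameDay window with a small buffer for setup and reporting.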

Automated GameDay Orchestration

#!/bin/bash
# chaos-gameday-orchestrator.sh

set -e

NAMESPACE="chatgpt-apps"
GAMEDAY_DURATION=3600  # 1 hour
REPORT_DIR="./chaos-reports"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)

mkdir -p "$REPORT_DIR"

echo "========================================="
echo "Chaos GameDay - ChatGPT App Resilience"
echo "========================================="
echo "Start time: $(date)"
echo "Duration: ${GAMEDAY_DURATION}s ($(($GAMEDAY_DURATION / 60)) minutes)"
echo

# Pre-GameDay baseline
echo "Collecting baseline metrics..."
kubectl top pods -n "$NAMESPACE" > "$REPORT_DIR/baseline-pods-$TIMESTAMP.txt"
kubectl top nodes > "$REPORT_DIR/baseline-nodes-$TIMESTAMP.txt"

curl -s "http://prometheus:9090/api/v1/query?query=avg_over_time(mcp_request_duration_seconds[5m])" \
  | jq -r '.data.result[0].value[1]' > "$REPORT_DIR/baseline-latency-$TIMESTAMP.txt"

curl -s "http://prometheus:9090/api/v1/query?query=rate(mcp_requests_total{status='error'}[5m])" \
  | jq -r '.data.result[0].value[1]' > "$REPORT_DIR/baseline-errors-$TIMESTAMP.txt"

echo "Baseline metrics collected"
echo

# Scenario 1: Network latency (15 minutes)
echo "Scenario 1: Network latency injection"
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gameday-network-latency
  namespace: $NAMESPACE
spec:
  appinfo:
    appns: $NAMESPACE
    applabel: "app=mcp-server"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "900"
            - name: NETWORK_LATENCY
              value: "3000"
            - name: JITTER
              value: "1000"
EOF

echo "Waiting 900s for network latency chaos..."
sleep 900
kubectl delete chaosengine gameday-network-latency -n "$NAMESPACE"
echo "Network latency chaos complete"
echo

# Cool-down period
echo "Cool-down period (5 minutes)..."
sleep 300

# Scenario 2: Pod deletion (15 minutes)
echo "Scenario 2: Random pod deletion"
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gameday-pod-delete
  namespace: $NAMESPACE
spec:
  appinfo:
    appns: $NAMESPACE
    applabel: "app=mcp-server"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "900"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: PODS_AFFECTED_PERC
              value: "50"
EOF

echo "Waiting 900s for pod deletion chaos..."
sleep 900
kubectl delete chaosengine gameday-pod-delete -n "$NAMESPACE"
echo "Pod deletion chaos complete"
echo

# Cool-down period
echo "Cool-down period (5 minutes)..."
sleep 300

# Scenario 3: Memory stress (15 minutes)
echo "Scenario 3: Memory stress"
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gameday-memory-stress
  namespace: $NAMESPACE
spec:
  appinfo:
    appns: $NAMESPACE
    applabel: "app=mcp-server"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "900"
            - name: MEMORY_CONSUMPTION
              value: "768"
            - name: NUMBER_OF_WORKERS
              value: "6"
EOF

echo "Waiting 900s for memory stress chaos..."
sleep 900
kubectl delete chaosengine gameday-memory-stress -n "$NAMESPACE"
echo "Memory stress chaos complete"
echo

# Post-GameDay analysis
echo "Collecting post-GameDay metrics..."
kubectl top pods -n "$NAMESPACE" > "$REPORT_DIR/post-pods-$TIMESTAMP.txt"
kubectl top nodes > "$REPORT_DIR/post-nodes-$TIMESTAMP.txt"

curl -s "http://prometheus:9090/api/v1/query?query=avg_over_time(mcp_request_duration_seconds[5m])" \
  | jq -r '.data.result[0].value[1]' > "$REPORT_DIR/post-latency-$TIMESTAMP.txt"

curl -s "http://prometheus:9090/api/v1/query?query=rate(mcp_requests_total{status='error'}[5m])" \
  | jq -r '.data.result[0].value[1]' > "$REPORT_DIR/post-errors-$TIMESTAMP.txt"

# Fetch chaos results
kubectl get chaosresult -n "$NAMESPACE" -o json > "$REPORT_DIR/chaos-results-$TIMESTAMP.json"

echo "Post-GameDay metrics collected"
echo

# Generate report
python3 <<PYTHON
import json
from datetime import datetime

print("=" * 60)
print("Chaos GameDay Report")
print("=" * 60)
print(f"Timestamp: $TIMESTAMP")
print(f"Generated: {datetime.now().isoformat()}")
print()

# Load chaos results
with open("$REPORT_DIR/chaos-results-$TIMESTAMP.json") as f:
    results = json.load(f)

total_experiments = len(results.get("items", []))
passed_experiments = sum(1 for item in results.get("items", [])
                         if item.get("status", {})
                                .get("experimentStatus", {})
                                .get("verdict") == "Pass")

print(f"Experiments run: {total_experiments}")
print(f"Passed: {passed_experiments}")
print(f"Failed: {total_experiments - passed_experiments}")
print(f"Success rate: {(passed_experiments / total_experiments * 100) if total_experiments > 0 else 0:.1f}%")
print()

# Load baseline metrics
with open("$REPORT_DIR/baseline-latency-$TIMESTAMP.txt") as f:
    baseline_latency = float(f.read().strip())

with open("$REPORT_DIR/post-latency-$TIMESTAMP.txt") as f:
    post_latency = float(f.read().strip())

with open("$REPORT_DIR/baseline-errors-$TIMESTAMP.txt") as f:
    baseline_errors = float(f.read().strip())

with open("$REPORT_DIR/post-errors-$TIMESTAMP.txt") as f:
    post_errors = float(f.read().strip())

print("Performance Impact:")
print(f"  Baseline latency: {baseline_latency:.3f}s")
print(f"  Post-chaos latency: {post_latency:.3f}s")
print(f"  Latency increase: {((post_latency / baseline_latency - 1) * 100) if baseline_latency > 0 else 0:.1f}%")
print()
print(f"  Baseline error rate: {baseline_errors:.4f}")
print(f"  Post-chaos error rate: {post_errors:.4f}")
print(f"  Error rate change: {((post_errors / baseline_errors - 1) * 100) if baseline_errors > 0 else 0:.1f}%")
print()

# Detailed experiment results
print("Experiment Details:")
print("-" * 60)
for item in results.get("items", []):
    name = item["metadata"]["name"]
    # ChaosResult reports its verdict and probe stats under status, not spec
    exp_status = item.get("status", {}).get("experimentStatus", {})
    verdict = exp_status.get("verdict", "Unknown")
    probe_success = exp_status.get("probeSuccessPercentage", "N/A")

    print(f"  {name}")
    print(f"    Verdict: {verdict}")
    print(f"    Probe Success: {probe_success}%")
    print()

print("=" * 60)
print("Full report saved to: $REPORT_DIR/")
PYTHON

echo
echo "Chaos GameDay complete!"
echo "End time: $(date)"

Monitoring and Alerting for Chaos

# chaos-prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-prometheus-rules
  namespace: monitoring
data:
  chaos-alerts.yml: |
    groups:
      - name: chaos-engineering
        interval: 30s
        rules:
          # Alert when chaos experiment fails
          - alert: ChaosExperimentFailed
            expr: |
              litmuschaos_experiment_verdict{verdict="Fail"} > 0
            for: 1m
            labels:
              severity: critical
              team: sre
            annotations:
              summary: "Chaos experiment {{ $labels.experiment }} failed"
              description: "Experiment {{ $labels.experiment }} in namespace {{ $labels.namespace }} has failed. System may not be resilient to injected failures."

          # Alert when probe success rate is low
          - alert: ChaosProbeSuccessLow
            expr: |
              litmuschaos_probe_success_percentage < 80
            for: 5m
            labels:
              severity: warning
              team: sre
            annotations:
              summary: "Chaos probe {{ $labels.probe }} success rate low"
              description: "Probe {{ $labels.probe }} success rate is {{ $value }}%, below 80% threshold during chaos experiment."

          # Alert when error rate spikes during chaos
          - alert: ErrorRateSpikeInChaos
            expr: |
              rate(mcp_requests_total{status="error"}[5m])
              /
              rate(mcp_requests_total[5m])
              > 0.05
            for: 3m
            labels:
              severity: warning
              team: dev
            annotations:
              summary: "Error rate spike during chaos experiment"
              description: "MCP server error rate is {{ $value | humanizePercentage }}, exceeding 5% threshold during chaos testing."

          # Alert when latency increases significantly
          - alert: LatencyIncreaseDuringChaos
            expr: |
              histogram_quantile(0.95,
                rate(mcp_request_duration_seconds_bucket[5m])
              ) > 5
            for: 5m
            labels:
              severity: warning
              team: dev
            annotations:
              summary: "95th percentile latency high during chaos"
              description: "MCP server 95th percentile latency is {{ $value }}s, exceeding 5s threshold during chaos experiment."

          # Alert when deployment availability drops
          - alert: DeploymentAvailabilityLow
            expr: |
              kube_deployment_status_replicas_available
              /
              kube_deployment_spec_replicas
              < 0.5
            for: 2m
            labels:
              severity: critical
              team: sre
            annotations:
              summary: "Deployment {{ $labels.deployment }} availability low"
              description: "Deployment {{ $labels.deployment }} has only {{ $value | humanizePercentage }} replicas available during chaos experiment."

          # Alert when pod restarts increase
          - alert: PodRestartsDuringChaos
            expr: |
              rate(kube_pod_container_status_restarts_total{namespace="chatgpt-apps"}[10m]) > 0.1
            for: 5m
            labels:
              severity: warning
              team: dev
            annotations:
              summary: "Pod {{ $labels.pod }} restarting frequently"
              description: "Pod {{ $labels.pod }} is restarting at {{ $value }} restarts/second during chaos experiment."

Explore monitoring ChatGPT apps and alerting best practices.

Conclusion: Building Antifragile ChatGPT Apps

Chaos engineering transforms resilience from an aspiration into a measurable engineering practice. By continuously injecting failures, you discover weaknesses before customers do, validate recovery procedures under realistic conditions, and build organizational confidence in system behavior.

For ChatGPT apps, where user experience depends on multiple external dependencies (OpenAI APIs, OAuth providers, databases, caching layers), chaos engineering is not optional. It is the most reliable way to verify that your app degrades gracefully when those dependencies inevitably fail.

Start with small experiments in non-production environments, gradually expand to production with proper safeguards, automate GameDays to make resilience testing continuous, and treat every incident as an opportunity to expand your chaos experiment library.
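
"Proper safeguards" concretely means automated abort conditions: the experiment stops itself the moment the user-facing error budget is at risk, without waiting for a human. A minimal sketch, where the thresholds and the metrics source are assumptions:

```python
def should_abort(current_error_rate: float,
                 steady_state_error_rate: float,
                 tolerance_multiplier: float = 3.0,
                 hard_ceiling: float = 0.05) -> bool:
    """Abort when errors exceed either a multiple of the steady-state
    baseline or an absolute ceiling, whichever trips first."""
    if current_error_rate > hard_ceiling:
        return True
    return current_error_rate > steady_state_error_rate * tolerance_multiplier

# In the experiment's monitoring loop, checked every few seconds:
# if should_abort(observed_rate, baseline_rate):
#     rollback_chaos()  # e.g. delete the ChaosEngine, uncordon nodes
#     break
```

The dual condition matters: a relative threshold alone is useless when the baseline is near zero, and an absolute ceiling alone ignores services that normally run hot.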

Ready to build resilient ChatGPT apps with automated chaos engineering? Try MakeAIHQ.com for instant ChatGPT app creation with built-in resilience patterns. Our platform includes production-ready MCP servers, automated failover, circuit breakers, and retry logic that pass the most demanding chaos experiments.

Related resources:

  • Building High Availability ChatGPT Apps
  • MCP Server Optimization Guide
  • Kubernetes Deployment for ChatGPT Apps
  • Monitoring and Observability Best Practices
  • API Gateway Patterns for ChatGPT Apps
  • Error Handling in MCP Servers
  • Disaster Recovery Planning

External references: