Alerting Strategies & On-Call for ChatGPT Apps
Production ChatGPT applications require intelligent alerting systems that notify the right people at the right time without overwhelming your team with false positives. A well-designed alerting strategy, rather than alert chaos, is what separates confident scaling from constant firefighting.
This comprehensive guide covers production-ready alerting strategies, on-call management, escalation policies, and alert fatigue prevention specifically designed for ChatGPT applications. You'll learn how to design symptom-based alerts, implement PagerDuty integration, configure escalation chains, and prevent the alert fatigue that plagues many production systems.
Whether you're supporting 100 users or 100,000, effective alerting ensures you catch critical issues before users do while maintaining team sanity. Alert design isn't just about setting thresholds—it's about creating actionable, contextual notifications that enable rapid response and resolution.
By the end of this guide, you'll have production-tested alert configurations, escalation policies, and integration patterns that scale from startup to enterprise. Let's build an alerting system that protects your application without burning out your team.
The Foundation: Symptom-Based Alert Design
The most critical principle of effective alerting is focusing on symptoms (user-facing impact) rather than causes (internal component failures). A symptom-based approach ensures every alert represents a real problem requiring immediate attention.
Symptoms vs. Causes
Symptom-based alerts (actionable):
- "API response time p95 > 5 seconds for 5 minutes" (users experiencing slowness)
- "Error rate > 5% for 10 minutes" (users encountering failures)
- "ChatGPT API quota exhausted" (users unable to get responses)
Cause-based alerts (often noise):
- "Container CPU > 80%" (may be normal under load)
- "Redis connection count > 100" (may not impact users)
- "Disk usage > 70%" (not immediately critical)
Alert Severity Levels
Structure alerts into clear severity tiers with defined response expectations:
# alert-rules.yaml - Production Alert Configuration
#
# Severity Levels:
# - P0/Critical: Immediate response required, page on-call
# - P1/High: Response within 15 minutes, notify team
# - P2/Medium: Response within 1 hour, ticket created
# - P3/Low: Review during business hours
groups:
- name: chatgpt_app_critical
interval: 30s
rules:
# P0: Complete Service Outage
- alert: ChatGPTAppDown
expr: up{job="chatgpt-app"} == 0
for: 2m
labels:
severity: critical
team: platform
component: core
annotations:
summary: "ChatGPT app {{ $labels.instance }} is down"
description: "Application has been unreachable for 2 minutes. This is a complete service outage."
runbook_url: "https://docs.company.com/runbooks/app-down"
dashboard_url: "https://grafana.company.com/d/chatgpt-overview"
# P0: High Error Rate
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.05
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Error rate above 5% (current: {{ $value | humanizePercentage }})"
description: "Users are experiencing significant failures. Investigate immediately."
impact: "{{ $value | humanizePercentage }} of user requests failing"
# P0: API Latency Degradation
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
) > 5
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "API latency p95 > 5s on {{ $labels.endpoint }}"
description: "Users experiencing severe slowness. Current p95: {{ $value }}s"
# P1: ChatGPT API Quota Warning
- alert: ChatGPTQuotaNearLimit
expr: |
(
chatgpt_api_quota_used
/
chatgpt_api_quota_limit
) > 0.90
for: 10m
labels:
severity: high
team: platform
annotations:
summary: "ChatGPT API quota at {{ $value | humanizePercentage }}"
description: "Approaching quota limit. May need to throttle or upgrade."
# P1: Memory Pressure
- alert: HighMemoryUsage
expr: |
(
container_memory_usage_bytes{container="chatgpt-app"}
/
container_spec_memory_limit_bytes{container="chatgpt-app"}
) > 0.90
for: 15m
labels:
severity: high
team: platform
annotations:
summary: "Memory usage at {{ $value | humanizePercentage }}"
description: "Risk of OOM kills. Consider scaling horizontally."
# P2: Elevated Error Rate
- alert: ModerateErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[10m]))
/
sum(rate(http_requests_total[10m]))
) > 0.01
for: 10m
labels:
severity: medium
team: platform
annotations:
summary: "Error rate above 1% (current: {{ $value | humanizePercentage }})"
description: "Elevated but not critical error rate. Monitor for escalation."
# P3: Certificate Expiration Warning
- alert: TLSCertificateExpiringSoon
expr: |
(
probe_ssl_earliest_cert_expiry{job="blackbox"}
- time()
) / 86400 < 14
for: 1h
labels:
severity: low
team: platform
annotations:
summary: "TLS certificate expires in {{ $value | humanizeDuration }}"
description: "Renew certificate before expiration to avoid service disruption."
Making Alerts Actionable
Every alert must answer three questions:
- What is wrong? (Clear symptom description)
- Why does it matter? (User impact)
- What should I do? (Runbook link or first steps)
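These three answers can even be checked mechanically before rules ship. The sketch below assumes alert rules are authored (or parsed) as objects carrying the same annotation fields used throughout this guide (summary, description/impact, runbook_url); the lint helper itself is illustrative rather than part of any standard tooling.
// alert-lint.ts - Sketch: flag alert rules that don't answer all three questions
interface AlertRuleDefinition {
  alert: string;                        // Alert name, e.g. "HighErrorRate"
  annotations: Record<string, string>;  // summary, description, runbook_url, ...
}

export function lintAlertRule(rule: AlertRuleDefinition): string[] {
  const problems: string[] = [];
  if (!rule.annotations.summary) {
    problems.push(`${rule.alert}: missing "summary" (what is wrong?)`);
  }
  if (!rule.annotations.description && !rule.annotations.impact) {
    problems.push(`${rule.alert}: missing "description" or "impact" (why does it matter?)`);
  }
  if (!rule.annotations.runbook_url) {
    problems.push(`${rule.alert}: missing "runbook_url" (what should I do?)`);
  }
  return problems;
}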
Threshold Configuration Strategies
Setting appropriate thresholds prevents both alert fatigue (thresholds too sensitive) and missed incidents (thresholds too lax).
Static Thresholds
For well-understood metrics with stable baselines:
// static-thresholds.ts - Fixed Threshold Configuration
interface ThresholdConfig {
metric: string;
threshold: number;
duration: string;
severity: 'critical' | 'high' | 'medium' | 'low';
description: string;
}
const STATIC_THRESHOLDS: ThresholdConfig[] = [
{
metric: 'error_rate',
threshold: 0.05, // 5%
duration: '5m',
severity: 'critical',
description: 'User-facing errors above acceptable threshold'
},
{
metric: 'response_time_p95',
threshold: 5.0, // 5 seconds
duration: '5m',
severity: 'critical',
description: 'API slowness impacting user experience'
},
{
metric: 'chatgpt_api_quota_usage',
threshold: 0.90, // 90%
duration: '10m',
severity: 'high',
description: 'Risk of hitting quota limit'
},
{
metric: 'disk_usage',
threshold: 0.85, // 85%
duration: '1h',
severity: 'medium',
description: 'Disk space running low'
}
];
// Generate Prometheus alert rules from threshold config
export function generateAlertRules(thresholds: ThresholdConfig[]): string {
const rules = thresholds.map(t => `
- alert: ${metricToAlertName(t.metric)}
expr: ${t.metric} > ${t.threshold}
for: ${t.duration}
labels:
severity: ${t.severity}
annotations:
summary: "${t.description}"
current_value: "{{ $value }}"
threshold: "${t.threshold}"
`);
return `groups:\n - name: static_thresholds\n rules:${rules.join('')}`;
}
function metricToAlertName(metric: string): string {
return metric
.split('_')
.map(word => word.charAt(0).toUpperCase() + word.slice(1))
.join('');
}
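One way this generator might be wired into a build or deploy step, assuming STATIC_THRESHOLDS is exported alongside generateAlertRules and the output path matches your Prometheus rule_files configuration:
// generate-rules.ts - Example: write the generated rules to a file at build time
import { writeFileSync } from 'fs';
import { STATIC_THRESHOLDS, generateAlertRules } from './static-thresholds';

const rulesYaml = generateAlertRules(STATIC_THRESHOLDS);
writeFileSync('alerts/static-thresholds.generated.yaml', rulesYaml);
console.log(`Wrote ${STATIC_THRESHOLDS.length} static threshold alert rules`);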
Dynamic Thresholds with Anomaly Detection
For metrics with time-based patterns or seasonal variation:
// dynamic-thresholds.ts - Anomaly Detection for Alerting
// Note: this class only builds PromQL expressions and alert rules as strings,
// so it has no runtime dependency on a Prometheus client library.
interface AnomalyConfig {
  metric: string;
  windowSize: string; // e.g., '1h', '1d', '1w'
  stdDevMultiplier: number; // 2.0 = 2 standard deviations
  minSamples: number;
}
export class AnomalyDetector {
/**
* Generate dynamic alert rule based on historical baseline
*
* Example: Alert if error rate exceeds (baseline + 2*stddev)
*/
async generateDynamicThreshold(config: AnomalyConfig): Promise<string> {
const { metric, windowSize, stdDevMultiplier, minSamples } = config;
// Calculate baseline (mean) over historical window
const baselineQuery = `avg_over_time(${metric}[${windowSize}])`;
// Calculate standard deviation
const stdDevQuery = `stddev_over_time(${metric}[${windowSize}])`;
    // Dynamic threshold = baseline + (stdDev * multiplier)
    // Kept on one line so it can be embedded in the quoted annotation templates below
    const thresholdQuery = `(${baselineQuery} + (${stdDevQuery} * ${stdDevMultiplier}))`;
// Only alert if we have sufficient historical data
const alertExpr = `
(
${metric} > ${thresholdQuery}
and
count_over_time(${metric}[${windowSize}]) >= ${minSamples}
)
`;
return `
- alert: ${metric.toUpperCase()}_ANOMALY
expr: |
${alertExpr}
for: 10m
labels:
severity: high
type: anomaly
annotations:
summary: "Anomaly detected in ${metric}"
current_value: "{{ $value }}"
baseline: "{{ query \\"${baselineQuery}\\" | first | value }}"
threshold: "{{ query \\"${thresholdQuery}\\" | first | value }}"
description: "Metric exceeds baseline by ${stdDevMultiplier} standard deviations"
`;
}
  /**
   * Time-based threshold adjustment
   * Example: stricter threshold during business hours, more tolerance off-hours
   * Note: hour() and day_of_week() evaluate in UTC, and `on()` is needed because
   * these functions return vectors without labels.
   */
  generateTimeBasedThreshold(
    metric: string,
    businessHoursThreshold: number,
    offHoursThreshold: number
  ): string {
    return `
  - alert: ${metric.toUpperCase()}_HIGH
    expr: |
      (
        (${metric} > ${businessHoursThreshold})
        and on()
        (hour() >= 9 and hour() < 17) # Business hours: 9am-5pm UTC
        and on()
        (day_of_week() > 0 and day_of_week() < 6) # Monday-Friday
      )
      or
      (
        (${metric} > ${offHoursThreshold})
        and on()
        (
          hour() < 9 or hour() >= 17
          or day_of_week() == 0 or day_of_week() == 6
        )
      )
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "High ${metric} relative to its time-of-day threshold"
`;
}
}
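As a usage sketch, the detector could append anomaly rules for a few key metrics to a rule file; the metric names and output path below are illustrative:
// generate-anomaly-rules.ts - Example usage of AnomalyDetector (illustrative)
import { appendFileSync } from 'fs';
import { AnomalyDetector } from './dynamic-thresholds';

async function main(): Promise<void> {
  const detector = new AnomalyDetector();
  const metrics = ['chatgpt_request_error_rate', 'chatgpt_tokens_per_request'];

  for (const metric of metrics) {
    const rule = await detector.generateDynamicThreshold({
      metric,
      windowSize: '1d',       // baseline over the previous day
      stdDevMultiplier: 2.0,  // alert beyond two standard deviations
      minSamples: 100         // require enough history before alerting
    });
    appendFileSync('alerts/anomaly-rules.generated.yaml', rule);
  }
}

main().catch(console.error);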
Escalation Policies & On-Call Management
Well-designed escalation policies ensure incidents reach the right people without delay or confusion.
Escalation Chain Configuration
// escalation-policy.ts - PagerDuty-Style Escalation
interface OnCallSchedule {
teamId: string;
primaryOnCall: string;
secondaryOnCall: string;
managerEscalation: string;
schedule: {
timezone: string;
rotationWeeks: number;
};
}
interface EscalationRule {
delay: number; // minutes
targets: string[]; // user IDs or team IDs
notifyChannels: ('sms' | 'email' | 'phone' | 'push')[];
}
interface EscalationPolicy {
id: string;
name: string;
description: string;
rules: EscalationRule[];
acknowledgementTimeout: number; // minutes
autoResolveTimeout: number; // minutes
}
export const CHATGPT_APP_ESCALATION: EscalationPolicy = {
id: 'chatgpt-app-critical',
name: 'ChatGPT App Critical Escalation',
description: 'Escalation policy for P0/P1 incidents',
acknowledgementTimeout: 5,
autoResolveTimeout: 60,
rules: [
{
delay: 0, // Immediate
targets: ['oncall-primary'],
notifyChannels: ['sms', 'push', 'phone']
},
{
delay: 5, // After 5 minutes if not acknowledged
targets: ['oncall-primary', 'oncall-secondary'],
notifyChannels: ['sms', 'phone']
},
{
delay: 10, // After 10 minutes total
targets: ['oncall-primary', 'oncall-secondary', 'team-lead'],
notifyChannels: ['sms', 'phone']
},
{
delay: 20, // After 20 minutes total - critical escalation
targets: ['oncall-primary', 'oncall-secondary', 'team-lead', 'engineering-manager'],
notifyChannels: ['sms', 'phone']
}
]
};
export const CHATGPT_APP_ESCALATION_MEDIUM: EscalationPolicy = {
id: 'chatgpt-app-medium',
name: 'ChatGPT App Medium Priority',
description: 'Escalation for P2/P3 incidents',
acknowledgementTimeout: 15,
autoResolveTimeout: 240,
rules: [
{
delay: 0,
targets: ['oncall-primary'],
notifyChannels: ['push', 'email']
},
{
delay: 15,
targets: ['oncall-primary', 'oncall-secondary'],
notifyChannels: ['push', 'sms']
}
]
};
// Map alert severity to escalation policy
export function getEscalationPolicy(severity: string): EscalationPolicy {
switch (severity) {
case 'critical':
case 'high':
return CHATGPT_APP_ESCALATION;
case 'medium':
case 'low':
return CHATGPT_APP_ESCALATION_MEDIUM;
default:
return CHATGPT_APP_ESCALATION_MEDIUM;
}
}
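The policies above are declarative; actually paging people is usually delegated to a tool like PagerDuty. For illustration, here is a minimal sketch of how a notifier might walk the escalation rules until the incident is acknowledged. The sendNotification and isAcknowledged helpers are hypothetical placeholders for your paging and incident-store integrations:
// escalation-runner.ts - Sketch: walk escalation rules until someone acknowledges
import { getEscalationPolicy } from './escalation-policy';

// Hypothetical integration points - replace with your paging/incident tooling
declare function sendNotification(
  target: string,
  channel: 'sms' | 'email' | 'phone' | 'push',
  message: string
): Promise<void>;
declare function isAcknowledged(incidentId: string): Promise<boolean>;

const sleepMinutes = (minutes: number) =>
  new Promise<void>(resolve => setTimeout(resolve, minutes * 60 * 1000));

export async function runEscalation(
  incidentId: string,
  severity: string,
  message: string
): Promise<void> {
  const policy = getEscalationPolicy(severity);
  let elapsed = 0;

  for (const rule of policy.rules) {
    // Wait until this rule's delay, then stop if someone has acknowledged
    await sleepMinutes(rule.delay - elapsed);
    elapsed = rule.delay;
    if (await isAcknowledged(incidentId)) return;

    // Notify every target on every configured channel for this escalation step
    for (const target of rule.targets) {
      for (const channel of rule.notifyChannels) {
        await sendNotification(target, channel, message);
      }
    }
  }
}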
PagerDuty Integration
// pagerduty-integration.ts - Production PagerDuty Integration
import axios from 'axios';
interface PagerDutyEvent {
routing_key: string; // Integration key
event_action: 'trigger' | 'acknowledge' | 'resolve';
dedup_key?: string; // Unique incident identifier
payload: {
summary: string;
severity: 'critical' | 'error' | 'warning' | 'info';
source: string;
timestamp?: string;
component?: string;
group?: string;
class?: string;
custom_details?: Record<string, any>;
};
links?: Array<{
href: string;
text: string;
}>;
images?: Array<{
src: string;
href?: string;
alt?: string;
}>;
}
export class PagerDutyIntegration {
private readonly apiUrl = 'https://events.pagerduty.com/v2/enqueue';
private routingKey: string;
constructor(routingKey: string) {
this.routingKey = routingKey;
}
/**
* Trigger a new incident in PagerDuty
*/
  async triggerAlert(
    summary: string,
    severity: 'critical' | 'error' | 'warning' | 'info',
    details: {
      source: string;
      component?: string;
      runbookUrl?: string;
      dashboardUrl?: string;
      customDetails?: Record<string, any>;
      dedupKey?: string; // Pass a stable key so the incident can be acknowledged/resolved later
    }
  ): Promise<{ dedup_key: string; status: string }> {
    // Fall back to a generated key; note that Date.now() makes every trigger a new incident,
    // so callers that need deduplication should supply details.dedupKey
    const dedupKey = details.dedupKey ?? `chatgpt-${details.source}-${Date.now()}`;
const event: PagerDutyEvent = {
routing_key: this.routingKey,
event_action: 'trigger',
dedup_key: dedupKey,
payload: {
summary,
severity,
source: details.source,
timestamp: new Date().toISOString(),
component: details.component,
custom_details: details.customDetails
},
links: [
details.runbookUrl && {
href: details.runbookUrl,
text: 'Runbook'
},
details.dashboardUrl && {
href: details.dashboardUrl,
text: 'Dashboard'
}
].filter(Boolean) as any
};
const response = await axios.post(this.apiUrl, event);
return {
dedup_key: dedupKey,
status: response.data.status
};
}
/**
* Acknowledge an existing incident
*/
async acknowledgeAlert(dedupKey: string): Promise<void> {
const event: PagerDutyEvent = {
routing_key: this.routingKey,
event_action: 'acknowledge',
dedup_key: dedupKey,
payload: {
summary: 'Acknowledged',
severity: 'info',
source: 'automation'
}
};
await axios.post(this.apiUrl, event);
}
/**
* Resolve an incident
*/
async resolveAlert(dedupKey: string, resolutionNote?: string): Promise<void> {
const event: PagerDutyEvent = {
routing_key: this.routingKey,
event_action: 'resolve',
dedup_key: dedupKey,
payload: {
summary: resolutionNote || 'Resolved',
severity: 'info',
source: 'automation'
}
};
await axios.post(this.apiUrl, event);
}
/**
* Send alert from Prometheus AlertManager webhook
*/
async handlePrometheusWebhook(alertmanagerPayload: any): Promise<void> {
const alerts = alertmanagerPayload.alerts;
for (const alert of alerts) {
const dedupKey = `${alert.labels.alertname}-${alert.labels.instance || 'global'}`;
if (alert.status === 'firing') {
        await this.triggerAlert(
          alert.annotations.summary || alert.labels.alertname,
          this.mapSeverity(alert.labels.severity),
          {
            source: alert.labels.instance || 'unknown',
            component: alert.labels.component,
            runbookUrl: alert.annotations.runbook_url,
            dashboardUrl: alert.annotations.dashboard_url,
            dedupKey, // Stable key so the matching 'resolved' event closes this incident
            customDetails: {
              labels: alert.labels,
              annotations: alert.annotations,
              startsAt: alert.startsAt,
              generatorURL: alert.generatorURL
            }
          }
        );
} else if (alert.status === 'resolved') {
await this.resolveAlert(
dedupKey,
'Alert resolved automatically'
);
}
}
}
  private mapSeverity(prometheusSeverity: string): 'critical' | 'error' | 'warning' | 'info' {
    switch (prometheusSeverity?.toLowerCase()) {
case 'critical':
return 'critical';
case 'high':
return 'error';
case 'medium':
return 'warning';
default:
return 'info';
}
}
}
// Usage example
const pagerduty = new PagerDutyIntegration(process.env.PAGERDUTY_ROUTING_KEY!);
// Trigger critical alert
await pagerduty.triggerAlert(
'ChatGPT App Down - Complete Outage',
'critical',
{
source: 'prod-chatgpt-app-1',
component: 'core-api',
runbookUrl: 'https://docs.company.com/runbooks/app-down',
dashboardUrl: 'https://grafana.company.com/d/chatgpt-overview',
customDetails: {
instance: 'prod-chatgpt-app-1',
region: 'us-east-1',
downtime_duration: '2m',
last_successful_health_check: '2026-12-25T10:00:00Z'
}
}
);
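In production these events usually arrive via an AlertManager webhook rather than direct calls. A minimal Express receiver might look like the following; the route path and port are examples and should match your AlertManager webhook_config:
// alertmanager-webhook.ts - Sketch: forward AlertManager webhooks to PagerDuty
import express from 'express';
import { PagerDutyIntegration } from './pagerduty-integration';

const app = express();
app.use(express.json());

const pagerduty = new PagerDutyIntegration(process.env.PAGERDUTY_ROUTING_KEY!);

app.post('/webhooks/alertmanager', async (req, res) => {
  try {
    // Triggers firing alerts and resolves cleared ones, reusing stable dedup keys
    await pagerduty.handlePrometheusWebhook(req.body);
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    console.error('Failed to process AlertManager webhook', err);
    // A non-2xx response tells AlertManager to retry delivery
    res.status(500).json({ status: 'error' });
  }
});

app.listen(8080, () => console.log('AlertManager webhook receiver listening on :8080'));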
Alert Fatigue Prevention
Alert fatigue—when teams become desensitized to alerts—is one of the biggest risks in production monitoring. Prevention requires intentional alert design.
Alert Grouping & Deduplication
// alert-grouping.ts - Intelligent Alert Aggregation
export interface Alert {
id: string;
name: string;
severity: string;
labels: Record<string, string>;
annotations: Record<string, string>;
timestamp: Date;
fingerprint: string;
}
interface AlertGroup {
groupKey: string;
alerts: Alert[];
firstFired: Date;
lastFired: Date;
count: number;
}
export class AlertGrouper {
private groups = new Map<string, AlertGroup>();
private groupWindow = 5 * 60 * 1000; // 5 minutes
/**
* Group similar alerts together to prevent notification spam
*
* Example: 10 pods crashing → 1 notification, not 10
*/
addAlert(alert: Alert): { isNew: boolean; group: AlertGroup } {
const groupKey = this.generateGroupKey(alert);
let group = this.groups.get(groupKey);
if (!group) {
// New group
group = {
groupKey,
alerts: [alert],
firstFired: alert.timestamp,
lastFired: alert.timestamp,
count: 1
};
this.groups.set(groupKey, group);
return { isNew: true, group };
}
// Check if alert is within grouping window
const timeSinceLastAlert = alert.timestamp.getTime() - group.lastFired.getTime();
if (timeSinceLastAlert < this.groupWindow) {
// Add to existing group
group.alerts.push(alert);
group.lastFired = alert.timestamp;
group.count++;
return { isNew: false, group };
} else {
// Old group expired, create new one
const newGroup: AlertGroup = {
groupKey,
alerts: [alert],
firstFired: alert.timestamp,
lastFired: alert.timestamp,
count: 1
};
this.groups.set(groupKey, newGroup);
return { isNew: true, group: newGroup };
}
}
/**
* Generate group key based on alert characteristics
*
* Alerts with same name, severity, and component are grouped together
*/
private generateGroupKey(alert: Alert): string {
const keyComponents = [
alert.name,
alert.severity,
alert.labels.component || 'unknown',
alert.labels.environment || 'production'
];
return keyComponents.join(':');
}
/**
* Format grouped alert notification
*/
formatGroupNotification(group: AlertGroup): string {
const firstAlert = group.alerts[0];
if (group.count === 1) {
return `${firstAlert.annotations.summary}`;
}
return `
🚨 ${firstAlert.name} (${group.count} instances)
First occurrence: ${group.firstFired.toISOString()}
Last occurrence: ${group.lastFired.toISOString()}
Affected components:
${this.getAffectedComponents(group)}
Summary: ${firstAlert.annotations.summary}
`.trim();
}
private getAffectedComponents(group: AlertGroup): string {
const components = new Set(
group.alerts
.map(a => a.labels.instance || a.labels.pod || 'unknown')
.slice(0, 10) // Limit to 10 to avoid huge notifications
);
const componentList = Array.from(components).join('\n- ');
const remaining = group.count - components.size;
if (remaining > 0) {
return `- ${componentList}\n... and ${remaining} more`;
}
return `- ${componentList}`;
}
}
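Used in an ingestion path, the grouper only produces a notification for the first alert in each window. A short sketch, where the notify callback stands in for whatever channel you route to:
// Example: notify only when a new group opens; otherwise record the repeat
import { AlertGrouper, Alert } from './alert-grouping';

const grouper = new AlertGrouper();

export function handleIncomingAlert(
  alert: Alert,
  notify: (message: string) => void
): void {
  const { isNew, group } = grouper.addAlert(alert);
  if (isNew) {
    // First alert for this group within the window: send a single notification
    notify(grouper.formatGroupNotification(group));
  } else {
    // Duplicate within the window: no new notification, just track the count
    console.log(`Grouped ${alert.name} (${group.count} occurrences in current window)`);
  }
}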
Alert Silencing & Maintenance Windows
// alert-silencing.ts - Silence Alerts During Maintenance
import type { Alert } from './alert-grouping';
interface Silence {
id: string;
matchers: Array<{
name: string;
value: string;
isRegex: boolean;
}>;
startsAt: Date;
endsAt: Date;
createdBy: string;
comment: string;
}
export class AlertSilencer {
private silences: Silence[] = [];
/**
* Create silence for planned maintenance
*
* Example: Silence all alerts during deployment window
*/
createSilence(
matchers: Silence['matchers'],
duration: number, // minutes
comment: string,
createdBy: string
): Silence {
const silence: Silence = {
id: `silence-${Date.now()}`,
matchers,
startsAt: new Date(),
endsAt: new Date(Date.now() + duration * 60 * 1000),
createdBy,
comment
};
this.silences.push(silence);
return silence;
}
/**
* Check if alert should be silenced
*/
isSilenced(alert: Alert): boolean {
const now = new Date();
return this.silences.some(silence => {
// Check if silence is active
if (now < silence.startsAt || now > silence.endsAt) {
return false;
}
// Check if alert matches silence criteria
return silence.matchers.every(matcher => {
const alertValue = alert.labels[matcher.name];
if (!alertValue) return false;
if (matcher.isRegex) {
return new RegExp(matcher.value).test(alertValue);
}
return alertValue === matcher.value;
});
});
}
  /**
   * Create maintenance window silence
   * (built directly rather than via createSilence so a future startTime is honored)
   */
  createMaintenanceWindow(
    startTime: Date,
    endTime: Date,
    component: string,
    engineer: string
  ): Silence {
    const silence: Silence = {
      id: `silence-${Date.now()}`,
      matchers: [
        { name: 'component', value: component, isRegex: false },
        { name: 'environment', value: 'production', isRegex: false }
      ],
      startsAt: startTime,
      endsAt: endTime,
      createdBy: engineer,
      comment: `Planned maintenance on ${component}`
    };
    this.silences.push(silence);
    return silence;
  }
/**
* Silence all non-critical alerts during incident response
*/
createIncidentFocus(incidentId: string, responder: string): Silence {
return this.createSilence(
[
{ name: 'severity', value: 'medium|low', isRegex: true }
],
60, // 1 hour
`Focusing on incident ${incidentId}`,
responder
);
}
}
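For example, a deploy script could open a maintenance window before rolling out and rely on isSilenced as a gate in the notification path; the component name, times, and email below are illustrative:
// Example: open a maintenance window, then gate notifications on isSilenced
import { AlertSilencer } from './alert-silencing';
import type { Alert } from './alert-grouping';

const silencer = new AlertSilencer();

// Two-hour maintenance window on the vector-store component (illustrative values)
silencer.createMaintenanceWindow(
  new Date('2026-01-10T22:00:00Z'),
  new Date('2026-01-11T00:00:00Z'),
  'vector-store',
  'alice@example.com'
);

export function shouldNotify(alert: Alert): boolean {
  // Drop any alert matched by an active silence
  return !silencer.isSilenced(alert);
}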
Root Cause Alert Suppression
// root-cause-suppression.ts - Suppress Downstream Alerts
interface AlertDependency {
upstream: string; // Alert name
downstream: string[]; // Dependent alert names
suppressionWindow: number; // minutes
}
const ALERT_DEPENDENCIES: AlertDependency[] = [
{
upstream: 'ChatGPTAppDown',
downstream: [
'HighLatency',
'HighErrorRate',
'LowThroughput',
'HealthCheckFailing'
],
suppressionWindow: 30
},
{
upstream: 'DatabaseDown',
downstream: [
'HighDatabaseLatency',
'DatabaseConnectionPoolExhausted',
'HighErrorRate'
],
suppressionWindow: 30
},
{
upstream: 'ChatGPTAPIQuotaExhausted',
downstream: [
'ChatGPTAPIErrors',
'HighLatency',
'UserComplaintsHigh'
],
suppressionWindow: 60
}
];
export class RootCauseSuppressor {
private activeUpstreamAlerts = new Map<string, Date>();
/**
* Record when upstream alert fires
*/
recordUpstreamAlert(alertName: string): void {
this.activeUpstreamAlerts.set(alertName, new Date());
}
/**
* Check if downstream alert should be suppressed
*/
shouldSuppress(alertName: string): boolean {
const now = new Date();
for (const dep of ALERT_DEPENDENCIES) {
if (!dep.downstream.includes(alertName)) continue;
const upstreamFiredAt = this.activeUpstreamAlerts.get(dep.upstream);
if (!upstreamFiredAt) continue;
const timeSinceUpstream = (now.getTime() - upstreamFiredAt.getTime()) / (60 * 1000);
if (timeSinceUpstream < dep.suppressionWindow) {
console.log(
`Suppressing ${alertName} due to upstream alert ${dep.upstream} ` +
`(fired ${timeSinceUpstream.toFixed(1)}m ago)`
);
return true;
}
}
return false;
}
/**
* Clear upstream alert (when resolved)
*/
clearUpstreamAlert(alertName: string): void {
this.activeUpstreamAlerts.delete(alertName);
}
}
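Wiring the suppressor into alert handling is mostly a matter of recording firing alerts and clearing them on resolution; a brief sketch:
// Example: record firing alerts as potential root causes and clear them on resolve
import { RootCauseSuppressor } from './root-cause-suppression';

const suppressor = new RootCauseSuppressor();

export function onAlertFiring(alertName: string): boolean {
  // Record the alert so any dependents listed in ALERT_DEPENDENCIES are muted
  suppressor.recordUpstreamAlert(alertName);
  // Returns true if this alert should still be notified (not a downstream symptom)
  return !suppressor.shouldSuppress(alertName);
}

export function onAlertResolved(alertName: string): void {
  // Stop suppressing dependents once the root cause clears
  suppressor.clearUpstreamAlert(alertName);
}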
Notification Channels & Routing
Different alert severities require different notification mechanisms.
Multi-Channel Notification Strategy
// notification-router.ts - Route Alerts to Appropriate Channels
import { WebClient as SlackClient } from '@slack/web-api';
import { PagerDutyIntegration } from './pagerduty-integration';
import type { Alert } from './alert-grouping';
interface NotificationChannel {
name: string;
type: 'slack' | 'pagerduty' | 'email' | 'webhook';
config: any;
}
export class NotificationRouter {
private slackClient: SlackClient;
private pagerduty: PagerDutyIntegration;
constructor(
slackToken: string,
pagerdutyRoutingKey: string
) {
this.slackClient = new SlackClient(slackToken);
this.pagerduty = new PagerDutyIntegration(pagerdutyRoutingKey);
}
/**
* Route alert to appropriate channels based on severity
*/
async routeAlert(alert: Alert): Promise<void> {
switch (alert.severity) {
case 'critical':
// P0: PagerDuty page + Slack alert
await Promise.all([
this.sendToPagerDuty(alert),
this.sendToSlack(alert, '#incidents-critical', true)
]);
break;
case 'high':
// P1: Slack alert + PagerDuty notification
await Promise.all([
this.sendToSlack(alert, '#incidents-high', true),
this.sendToPagerDuty(alert)
]);
break;
case 'medium':
// P2: Slack notification only
await this.sendToSlack(alert, '#monitoring-alerts', false);
break;
case 'low':
// P3: Log only (or slack during business hours)
if (this.isBusinessHours()) {
await this.sendToSlack(alert, '#monitoring-info', false);
}
break;
}
}
  private async sendToPagerDuty(alert: Alert): Promise<void> {
    // Map internal severities to the values the PagerDuty Events API accepts
    const pdSeverity =
      alert.severity === 'critical' ? 'critical' :
      alert.severity === 'high' ? 'error' :
      alert.severity === 'medium' ? 'warning' : 'info';
    await this.pagerduty.triggerAlert(
      alert.annotations.summary,
      pdSeverity,
      {
        source: alert.labels.instance || 'unknown',
        component: alert.labels.component,
        runbookUrl: alert.annotations.runbook_url,
        dashboardUrl: alert.annotations.dashboard_url,
        customDetails: {
          labels: alert.labels,
          annotations: alert.annotations
        }
      }
    );
  }
private async sendToSlack(
alert: Alert,
channel: string,
mentionOnCall: boolean
): Promise<void> {
const color = this.getSlackColor(alert.severity);
const emoji = this.getEmoji(alert.severity);
    // Typed loosely so the optional actions block can be appended below
    const blocks: any[] = [
{
type: 'header',
text: {
type: 'plain_text',
text: `${emoji} ${alert.name}`,
emoji: true
}
},
{
type: 'section',
fields: [
{
type: 'mrkdwn',
text: `*Severity:*\n${alert.severity.toUpperCase()}`
},
{
type: 'mrkdwn',
text: `*Component:*\n${alert.labels.component || 'Unknown'}`
},
{
type: 'mrkdwn',
text: `*Environment:*\n${alert.labels.environment || 'production'}`
},
{
type: 'mrkdwn',
text: `*Instance:*\n${alert.labels.instance || 'N/A'}`
}
]
},
{
type: 'section',
text: {
type: 'mrkdwn',
text: `*Description:*\n${alert.annotations.description || alert.annotations.summary}`
}
}
];
// Add action buttons
if (alert.annotations.runbook_url || alert.annotations.dashboard_url) {
blocks.push({
type: 'actions',
elements: [
alert.annotations.runbook_url && {
type: 'button',
text: { type: 'plain_text', text: 'Runbook', emoji: true },
url: alert.annotations.runbook_url,
style: 'primary'
},
alert.annotations.dashboard_url && {
type: 'button',
text: { type: 'plain_text', text: 'Dashboard', emoji: true },
url: alert.annotations.dashboard_url
}
].filter(Boolean) as any
});
}
let text = `${emoji} *${alert.name}*`;
if (mentionOnCall) {
text = `<!subteam^S01234ABCDE> ${text}`; // Replace with actual on-call group ID
}
await this.slackClient.chat.postMessage({
channel,
text,
blocks,
attachments: [{
color,
fallback: alert.annotations.summary
}]
});
}
private getSlackColor(severity: string): string {
switch (severity) {
case 'critical': return '#FF0000'; // Red
case 'high': return '#FF6600'; // Orange
case 'medium': return '#FFCC00'; // Yellow
default: return '#0099FF'; // Blue
}
}
private getEmoji(severity: string): string {
switch (severity) {
case 'critical': return '🚨';
case 'high': return '⚠️';
case 'medium': return '⚡';
default: return 'ℹ️';
}
}
private isBusinessHours(): boolean {
const now = new Date();
const hour = now.getHours();
const day = now.getDay();
// Monday-Friday, 9am-5pm
return day >= 1 && day <= 5 && hour >= 9 && hour < 17;
}
}
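Putting the pieces together, one possible end-to-end flow runs every incoming alert through silencing, root-cause suppression, and grouping before it reaches the router. This is a sketch that reuses the classes above; the environment variable names are examples:
// alert-pipeline.ts - Sketch: silence, suppress, group, then route
import { AlertGrouper, Alert } from './alert-grouping';
import { AlertSilencer } from './alert-silencing';
import { RootCauseSuppressor } from './root-cause-suppression';
import { NotificationRouter } from './notification-router';

const grouper = new AlertGrouper();
const silencer = new AlertSilencer();
const suppressor = new RootCauseSuppressor();
const router = new NotificationRouter(
  process.env.SLACK_TOKEN!,
  process.env.PAGERDUTY_ROUTING_KEY!
);

export async function processAlert(alert: Alert): Promise<void> {
  // 1. Drop alerts covered by an active silence (maintenance, incident focus)
  if (silencer.isSilenced(alert)) return;

  // 2. Drop downstream symptoms of a root cause that is already firing
  suppressor.recordUpstreamAlert(alert.name);
  if (suppressor.shouldSuppress(alert.name)) return;

  // 3. Group duplicates; only the first alert in a window gets notified
  const { isNew } = grouper.addAlert(alert);
  if (!isNew) return;

  // 4. Route by severity: PagerDuty + Slack for P0/P1, Slack only for P2/P3
  await router.routeAlert(alert);
}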
Conclusion: Building a Sustainable Alerting Culture
Effective alerting isn't just about configuration—it's about culture. Every alert should represent a real problem requiring human intervention, with clear ownership and actionable guidance.
Alerting Best Practices Recap
- Symptom-based alerts: Focus on user impact, not internal component states
- Appropriate thresholds: Use static thresholds for stable metrics, dynamic for variable ones
- Clear escalation: Define who gets notified when, with appropriate timeouts
- Prevent fatigue: Group related alerts, suppress downstream noise, silence during maintenance
- Actionable notifications: Include runbook links, dashboard URLs, and context
- Multi-channel routing: Route by severity—PagerDuty for P0, Slack for P2
- Regular review: Tune thresholds based on false positive rates and incident data
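On the last point, a lightweight way to make review concrete is to track, per alert name, how often a firing actually required action. The record shape below is an assumption, not tied to any particular incident tool:
// alert-review.ts - Sketch: alert precision report to guide threshold tuning
interface FiredAlertRecord {
  alertName: string;
  actionTaken: boolean; // did a human actually need to do anything?
}

export function alertPrecisionReport(records: FiredAlertRecord[]): Map<string, number> {
  const fired = new Map<string, number>();
  const actionable = new Map<string, number>();

  for (const record of records) {
    fired.set(record.alertName, (fired.get(record.alertName) ?? 0) + 1);
    if (record.actionTaken) {
      actionable.set(record.alertName, (actionable.get(record.alertName) ?? 0) + 1);
    }
  }

  // Precision = actionable firings / total firings; low values are tuning candidates
  const precision = new Map<string, number>();
  for (const [name, count] of fired) {
    precision.set(name, (actionable.get(name) ?? 0) / count);
  }
  return precision;
}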
Take Your ChatGPT App Monitoring Further
Ready to build production ChatGPT applications with world-class monitoring and alerting?
MakeAIHQ provides the complete infrastructure for ChatGPT app development, including:
- Pre-configured monitoring dashboards with Prometheus + Grafana
- Built-in PagerDuty integration for instant alerting
- Production-ready alert rules for ChatGPT-specific metrics
- On-call rotation management and escalation policies
- Real-time incident tracking and postmortem tools
Start building with MakeAIHQ →
Or explore our related guides:
- Prometheus Metrics Collection for ChatGPT Apps
- Grafana Monitoring Dashboards for ChatGPT Apps
- Incident Response & Postmortems for ChatGPT Apps
- Complete Guide to Building ChatGPT Applications
About the Author: The MakeAIHQ team has built and scaled ChatGPT applications serving millions of users, implementing production monitoring systems that catch issues before customers notice. We've learned these alerting strategies through years of on-call experience and hundreds of incidents.