Observability

Production Alerting for an AI Gaming Platform: PagerDuty, Prometheus, Grafana

March 28, 2026 · 8 min read

When your application is live, revenue is flowing, and players are online, a monitoring gap is not a hypothetical risk. It is a countdown.

This is how we built a complete, multi-tier alerting and observability system for an AI-native gaming client whose infrastructure spans Kafka, Redshift, AWS Bedrock, and a real-time game backend in full production.

The starting point

The client had good infrastructure. Built over 2 to 3 years by engineers who knew what they were doing. The problem was not bad code or bad architecture. The problem was that nobody would know when it stopped running well.

Basic CloudWatch alarms existed for EC2 metrics. But there was no structured severity model, no on-call rotation, no escalation path, and no unified view across the full stack. One engineer checking a dashboard when something felt slow is not observability. It is guesswork.

The specific gaps that needed fixing:

  • No visibility into Kafka consumer lag or broker health
  • No alerting on AWS Bedrock API errors or cost anomalies
  • No on-call schedule, no escalation policy, no backup coverage
  • No log aggregation or correlation across services
  • Engineers being woken up for informational events that required no action

The stack we built around

Before writing a single alert rule, we mapped every component that could affect player experience or SLA.

| Layer | Component | Why it matters |
| --- | --- | --- |
| Streaming | Apache Kafka | Game events, player actions, telemetry |
| Data warehouse | Amazon Redshift | Player analytics, billing, reporting |
| AI inference | AWS Bedrock | LLM calls for in-game AI features |
| Infrastructure | Kubernetes on EKS | Workload orchestration |
| Metrics | Prometheus via kube-prometheus-stack | Time-series metrics for everything |
| Logs | Loki + Promtail + CloudWatch integration | Unified log aggregation |
| Visualisation | Grafana | Dashboards and alert evaluation |
| Alert routing | Alertmanager | Deduplication, grouping, routing |
| Incident management | PagerDuty | On-call schedules, escalation, calls |

Designing the three-tier alert model

Not every alert deserves a phone call at 2 AM. Not every issue can wait until morning standup. The first design decision was a tiering model, enforced before writing a single alert rule.

| Tier | Name | What it means | Response |
| --- | --- | --- | --- |
| T1 | Critical | Production down or degraded. SLA breach imminent. | PagerDuty call + SMS immediately. Escalates to backup in 5 minutes. |
| T2 | Warning | Degrading trend. Will become critical if not addressed. | PagerDuty push notification. 30-minute response window. |
| T3 | Advisory | Informational. Cost spike, capacity threshold, minor anomaly. | Slack only. Reviewed in business hours. No on-call wake. |

This model is not complicated. But enforcing it before writing rules means every engineer who touches the system knows exactly what will happen when their alert fires. That predictability is what makes on-call sustainable.
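The tiering model maps cleanly onto Alertmanager's routing tree: a `severity` label on every rule decides where the alert lands. A minimal sketch of that routing, assuming hypothetical receiver names and placeholder integration keys:

```yaml
route:
  receiver: slack-advisory            # default: T3 advisories go to Slack only
  group_by: [alertname, service]
  routes:
    - matchers: ['severity="critical"']   # T1: phone call + SMS via PagerDuty
      receiver: pagerduty-critical
      repeat_interval: 15m
    - matchers: ['severity="warning"']    # T2: push notification only
      receiver: pagerduty-warning
      repeat_interval: 2h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <T1-service-integration-key>   # placeholder
        severity: critical
  - name: pagerduty-warning
    pagerduty_configs:
      - routing_key: <T2-service-integration-key>   # placeholder
        severity: warning
  - name: slack-advisory
    slack_configs:
      - channel: '#alerts-advisory'   # hypothetical channel name
```

The key property: an engineer writing a new rule only chooses a `severity` label; the routing tree guarantees the tiered behaviour described above.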

What we actually monitored

We configured over 50 alert rules across the full stack. Representative examples below.

Kafka: Consumer lag per topic per consumer group (T1 above 10k messages, T2 above 5k). Broker under-replicated partitions (T1 immediately). Producer request failure rate (T2 above 1%, T1 above 5%). Disk utilisation on broker nodes (T2 at 75%, T1 at 90%). Message throughput anomaly detection via 7-day baseline comparison.
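The consumer-lag thresholds above translate directly into Prometheus rules. A sketch, assuming the `kafka_consumergroup_lag` metric exposed by the commonly used kafka_exporter (adjust the metric name to whichever exporter is in place):

```yaml
groups:
  - name: kafka
    rules:
      - alert: KafkaConsumerLagCritical
        expr: sum by (topic, consumergroup) (kafka_consumergroup_lag) > 10000
        for: 5m
        labels:
          severity: critical   # T1: pages immediately
        annotations:
          summary: "Consumer lag above 10k on {{ $labels.topic }} ({{ $labels.consumergroup }})"
      - alert: KafkaConsumerLagWarning
        expr: sum by (topic, consumergroup) (kafka_consumergroup_lag) > 5000
        for: 10m
        labels:
          severity: warning    # T2: push notification, 30-minute window
```

The longer `for:` window on the T2 rule avoids paging on transient lag spikes during normal rebalances.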

Redshift: ETL job failures via CloudWatch Events (T1). Disk space utilisation (T1 at 90%). Long-running queries over 5 minutes (T2 with query ID in alert body). WLM slot contention (T3 advisory).
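Redshift publishes its metrics to CloudWatch rather than a native Prometheus endpoint, so they are scraped via the CloudWatch exporter. A fragment of that exporter's config for the disk-space alert, assuming a hypothetical region:

```yaml
region: eu-west-1                # assumption: set to the cluster's region
metrics:
  - aws_namespace: AWS/Redshift
    aws_metric_name: PercentageDiskSpaceUsed
    aws_dimensions: [ClusterIdentifier]
    aws_statistics: [Average]
```

Once scraped, the T1 rule is a plain threshold on the resulting metric at 90%.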

AWS Bedrock: API error rate by model (T1 above 5%, T2 above 1%). Latency p95 per model (T2 when p95 exceeds 3s, T1 above 8s). Throttling events (T2 when sustained). Model availability via synthetic health check every 60 seconds. Token usage cost spike alerting (T3 advisory, reviewed daily).
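The p95 latency rule can be expressed with `histogram_quantile` if the application wrapping Bedrock calls records them in a Prometheus histogram. A sketch, assuming a hypothetical app-emitted metric named `bedrock_invocation_duration_seconds` with a `model` label:

```yaml
- alert: BedrockLatencyP95High
  expr: |
    histogram_quantile(0.95,
      sum by (model, le) (rate(bedrock_invocation_duration_seconds_bucket[5m]))
    ) > 3
  for: 10m
  labels:
    severity: warning   # T2 at 3s; a parallel rule with > 8 pages as T1
  annotations:
    summary: "Bedrock p95 latency above 3s for {{ $labels.model }}"
```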

Cost and spend: Daily cost anomaly detection using the AWS Cost Explorer API (T3). Per-service budget threshold alerts (T2 at 80%, T1 at 100%). Unexpected Bedrock token spend spikes at 150% of 7-day rolling average (T2).
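The 150%-of-rolling-average spike check is simple enough to show in full. A minimal sketch in Python of the comparison logic; the function name and the idea of feeding it daily spend figures pulled from the Cost Explorer API are our illustration, not a specific library API:

```python
from statistics import mean

def is_spend_spike(daily_spend: list[float], today: float,
                   threshold: float = 1.5) -> bool:
    """Flag today's spend when it exceeds `threshold` times the rolling
    average of the trailing 7 days (150% by default, matching the T2 rule)."""
    if len(daily_spend) < 7:
        return False  # not enough history to establish a baseline
    baseline = mean(daily_spend[-7:])
    return today > threshold * baseline
```

Fed once a day with Cost Explorer output, a `True` result raises the T2 alert.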

Kubernetes: Pod crash looping (T1 immediately). Node not ready (T1 immediately). PVC usage above 80% (T2), above 95% (T1). HPA at max replicas for a sustained period (T2, potential scaling ceiling).
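The crash-loop and node-health rules lean on standard kube-state-metrics series, which kube-prometheus-stack ships by default. A sketch of the two T1 rules (the restart threshold of 3 in 15 minutes is an illustrative choice):

```yaml
- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  labels:
    severity: critical   # T1
  annotations:
    summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready", status="true"} == 0
  for: 2m
  labels:
    severity: critical   # T1
  annotations:
    summary: "Node {{ $labels.node }} is not ready"
```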

Application layer: Synthetic uptime checks every 30 seconds per endpoint (T1 on failure). HTTP 5xx error rate (T1 above 5%, T2 above 1%). Game server connection failures per region (T1 above threshold). API response time p99 per service (T2 when SLA breached).
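The synthetic uptime checks run through the Prometheus blackbox exporter, which probes endpoints over HTTP and exposes the result as `probe_success`. A sketch of the scrape config, with hypothetical endpoint and exporter addresses:

```yaml
scrape_configs:
  - job_name: synthetic-uptime
    metrics_path: /probe
    scrape_interval: 30s
    params:
      module: [http_2xx]          # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://api.example.com/healthz   # hypothetical endpoint
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself, not the target
      - target_label: __address__
        replacement: blackbox-exporter:9115   # hypothetical exporter address
```

The T1 rule then fires on `probe_success == 0` sustained across consecutive probes.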

The PagerDuty setup

We deployed PagerDuty escalation policies with two on-call schedules: primary and backup, on weekly rotation.

Escalation flow: Alertmanager routes to PagerDuty. PagerDuty pages primary on-call. If unacknowledged in 5 minutes, escalates to backup. If unacknowledged in 10 minutes, escalates to engineering lead.

T1 alerts fire phone calls and SMS simultaneously. T2 alerts fire push notifications only. T3 alerts go to Slack and never touch PagerDuty.

Every PagerDuty incident is auto-populated with a Grafana deep-link scoped to the incident time window, the alert context, and a runbook URL. The on-call engineer taps one link and lands on the right dashboard with the right time range. No hunting.
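The deep-link travels as alert annotations, which Alertmanager forwards to PagerDuty as incident context. A sketch of the annotation block on a rule, with hypothetical Grafana and runbook URLs:

```yaml
annotations:
  summary: "Consumer lag critical on {{ $labels.topic }}"
  runbook_url: "https://runbooks.example.com/kafka/consumer-lag"   # hypothetical
  dashboard_url: >-
    https://grafana.example.com/d/kafka-ops?from=now-1h&to=now&var-topic={{ $labels.topic }}
```

Templating the alert's labels into the dashboard variables is what scopes the link to the right topic and time window.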

The Prometheus, Loki, and Grafana setup

Prometheus was deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, and the node exporter together. We extended it with custom ServiceMonitors for each application component.
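A custom ServiceMonitor is a small CRD that tells the Prometheus Operator which Services to scrape. A sketch for one application component, with a hypothetical service name; the `release` label must match whatever selector the Helm release is configured to watch:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: game-backend                     # hypothetical component
  labels:
    release: kube-prometheus-stack       # must match the operator's selector
spec:
  selector:
    matchLabels:
      app: game-backend
  endpoints:
    - port: metrics                      # named port on the Service
      interval: 30s
```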

Loki handles log aggregation. Promtail runs as a DaemonSet collecting container logs and pushing them to Loki. CloudWatch Logs integration brings in AWS-native logs for Redshift, Bedrock, and the EKS control plane. Everything surfaces in one Grafana instance.

Five core dashboards:

  1. Platform overview: golden signals (latency, traffic, errors, saturation) per service
  2. Kafka operations: consumer lag per topic, broker health, throughput trends
  3. AI inference: Bedrock latency, error rates, token usage and cost per model
  4. Cost and spend: daily spend by service, budget burn rate, anomaly detection overlay
  5. Incident timeline: deployments, alert firings, and incidents on a shared annotated time axis

The incident timeline dashboard alone saves 20 minutes per investigation. When something breaks, the first question is always "what changed?" Deployments and alerts overlaid on the same timeline means the answer is usually visible in under 30 seconds.

What changed

| Metric | Before | After |
| --- | --- | --- |
| Undetected production incidents | Unknown, no alerting | Zero since go-live |
| Alert noise | Every CloudWatch alarm paged on-call | 97% reduction. T1 only wakes engineers. |
| Mean time to acknowledge | No defined process | Under 5 minutes for T1 |
| Cost anomaly detection | None | 2 anomalies caught in first 2 weeks |
| Log visibility | CloudWatch only, no cross-service correlation | Unified log search in Grafana across all services |
| On-call coverage | Ad hoc, no rotation | 24/7 with defined escalation and backup |

The most meaningful change is not in the table. Engineers stopped dreading on-call. When every alert that wakes you up is a genuine production issue with full context attached, on-call becomes manageable rather than exhausting.

If you are running AI workloads in production and your monitoring is still someone checking a dashboard, you have a gap that will cost you at the worst possible moment. Our Observability service covers the full setup from alert design to PagerDuty configuration.


Want Results Like These for Your Stack?

We build production-grade infrastructure for AI startups and technical founders. Let's talk about your project.

Book a Free 30-Min Call

Your infra shouldn't be the thing slowing you down.

Book a free 30-minute call. We'll look at your current setup and tell you exactly what's costing you money, what's a deployment risk, and what we'd fix first. No pitch, no fluff.

AWS · Kubernetes · Docker · Terraform · Python · React · ArgoCD · Prometheus