When your application is live, revenue is flowing, and players are online, a monitoring gap is not a hypothetical risk. It is a countdown.
This is how we built a complete, multi-tier alerting and observability system for an AI-native gaming client whose production infrastructure spans Kafka, Redshift, AWS Bedrock, and a real-time game backend.
The starting point
The client had good infrastructure. Built over 2 to 3 years by engineers who knew what they were doing. The problem was not bad code or bad architecture. The problem was that nobody would know when it stopped running well.
Basic CloudWatch alarms existed for EC2 metrics. But there was no structured severity model, no on-call rotation, no escalation path, and no unified view across the full stack. One engineer checking a dashboard when something felt slow is not observability. It is guesswork.
The specific gaps that needed fixing:
- No visibility into Kafka consumer lag or broker health
- No alerting on AWS Bedrock API errors or cost anomalies
- No on-call schedule, no escalation policy, no backup coverage
- No log aggregation or correlation across services
- Engineers being woken up for informational events that required no action
The stack we built around
Before writing a single alert rule, we mapped every component that could affect player experience or SLA.
| Layer | Component | Why it matters |
|---|---|---|
| Streaming | Apache Kafka | Game events, player actions, telemetry |
| Data warehouse | Amazon Redshift | Player analytics, billing, reporting |
| AI inference | AWS Bedrock | LLM calls for in-game AI features |
| Infrastructure | Kubernetes on EKS | Workload orchestration |
| Metrics | Prometheus via kube-prometheus-stack | Time-series metrics for everything |
| Logs | Loki + Promtail + CloudWatch integration | Unified log aggregation |
| Visualisation | Grafana | Dashboards and alert evaluation |
| Alert routing | Alertmanager | Deduplication, grouping, routing |
| Incident management | PagerDuty | On-call schedules, escalation, calls |
Designing the three-tier alert model
Not every alert deserves a phone call at 2 AM. Not every issue can wait until morning standup. The first design decision was a tiering model, enforced before writing a single alert rule.
| Tier | Name | What it means | Response |
|---|---|---|---|
| T1 | Critical | Production down or degraded. SLA breach imminent. | PagerDuty call + SMS immediately. Escalates to backup in 5 minutes. |
| T2 | Warning | Degrading trend. Will become critical if not addressed. | PagerDuty push notification. 30-minute response window. |
| T3 | Advisory | Informational. Cost spike, capacity threshold, minor anomaly. | Slack only. Reviewed in business hours. No on-call wake. |
This model is not complicated. But enforcing it before writing rules means every engineer who touches the system knows exactly what will happen when their alert fires. That predictability is what makes on-call sustainable.
What we actually monitored
We configured over 50 alert rules across the full stack. Representative examples below.
Kafka: Consumer lag per topic per consumer group (T1 above 10k messages, T2 above 5k). Broker under-replicated partitions (T1 immediately). Producer request failure rate (T2 above 1%, T1 above 5%). Disk utilisation on broker nodes (T2 at 75%, T1 at 90%). Message throughput anomaly detection via 7-day baseline comparison.
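As a flavour of what those rules look like, here is a minimal sketch of the consumer-lag pair, assuming the commonly used kafka_exporter metric `kafka_consumergroup_lag`; the exact metric and label names depend on which exporter you run.

```yaml
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: KafkaConsumerLagWarning
        # T2: lag above 5k messages, sustained
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 5000
        for: 10m
        labels: {severity: warning, tier: T2}
        annotations:
          summary: "Lag above 5k on {{ $labels.topic }} / {{ $labels.consumergroup }}"
      - alert: KafkaConsumerLagCritical
        # T1: lag above 10k messages, pages after a short confirmation window
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
        for: 5m
        labels: {severity: critical, tier: T1}
        annotations:
          summary: "Lag above 10k on {{ $labels.topic }} / {{ $labels.consumergroup }}"
```

The severity label is what Alertmanager routes on later; the tier label is purely for the humans reading the alert.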
Redshift: ETL job failures via CloudWatch Events (T1). Disk space utilisation (T1 at 90%). Long-running queries over 5 minutes (T2 with query ID in alert body). WLM slot contention (T3 advisory).
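Redshift does not expose Prometheus metrics natively, so one way to feed the disk-space threshold into Prometheus is the prometheus/cloudwatch_exporter. A sketch of that bridge (region and dimensions are placeholders, and this is not necessarily the exact integration used here):

```yaml
# cloudwatch_exporter config: pull AWS/Redshift disk usage into Prometheus
region: eu-west-1                      # placeholder
metrics:
  - aws_namespace: AWS/Redshift
    aws_metric_name: PercentageDiskSpaceUsed
    aws_dimensions: [ClusterIdentifier]
    aws_statistics: [Average]
    period_seconds: 300
```

The exporter exposes that as a gauge (aws_redshift_percentage_disk_space_used_average in its default naming), which a plain threshold rule then compares against the 90% T1 level.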
AWS Bedrock: API error rate by model (T1 above 5%, T2 above 1%). Latency p95 per model (T2 when p95 exceeds 3s, T1 above 8s). Throttling events (T2 when sustained). Model availability via synthetic health check every 60 seconds. Token usage cost spike alerting (T3 advisory, reviewed daily).
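A sketch of two of the Bedrock rules, assuming the game services export their own request metrics; the `bedrock_request_duration_seconds` histogram and `bedrock_requests_total` counter below are hypothetical names, and CloudWatch's AWS/Bedrock metrics are an alternative source.

```yaml
groups:
  - name: bedrock-inference
    rules:
      - alert: BedrockLatencyP95High
        # T2: p95 latency above 3s for a given model
        expr: |
          histogram_quantile(0.95,
            sum by (model, le) (rate(bedrock_request_duration_seconds_bucket[5m]))
          ) > 3
        for: 10m
        labels: {severity: warning, tier: T2}
      - alert: BedrockErrorRateCritical
        # T1: more than 5% of calls to a model failing
        expr: |
          sum by (model) (rate(bedrock_requests_total{status="error"}[5m]))
            /
          sum by (model) (rate(bedrock_requests_total[5m])) > 0.05
        for: 5m
        labels: {severity: critical, tier: T1}
```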
Cost and spend: Daily cost anomaly detection using the AWS Cost Explorer API (T3). Per-service budget threshold alerts (T2 at 80%, T1 at 100%). Unexpected Bedrock token spend spikes at 150% of 7-day rolling average (T2).
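The 150% rule is the classic rolling-baseline pattern. A PromQL sketch, assuming a hypothetical `bedrock_tokens_total` counter; the daily Cost Explorer anomaly checks are a separate mechanism.

```yaml
groups:
  - name: cost-anomalies
    rules:
      - alert: BedrockTokenSpendSpike
        # T2: current hourly token consumption exceeds 150% of its 7-day average
        expr: |
          sum(rate(bedrock_tokens_total[1h]))
            > 1.5 * sum(avg_over_time(rate(bedrock_tokens_total[1h])[7d:1h]))
        for: 2h
        labels: {severity: warning, tier: T2}
        annotations:
          summary: "Bedrock token usage is running 50%+ above its 7-day baseline"
```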
Kubernetes: Pod crash looping (T1 immediately). Node not ready (T1 immediately). PVC usage above 80% (T2), above 95% (T1). HPA at max replicas for a sustained period (T2, potential scaling ceiling).
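Pod crash loops and not-ready nodes are already covered by the default rules bundled with kube-prometheus-stack; the PVC thresholds are the kind of custom rule layered on top. A sketch built on the kubelet's volume stats metrics:

```yaml
groups:
  - name: pvc-capacity
    rules:
      - alert: PVCUsageWarning
        # T2: volume more than 80% full
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80
        for: 15m
        labels: {severity: warning, tier: T2}
      - alert: PVCUsageCritical
        # T1: volume more than 95% full
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.95
        for: 5m
        labels: {severity: critical, tier: T1}
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} above 95%"
```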
Application layer: Synthetic uptime checks every 30 seconds per endpoint (T1 on failure). HTTP 5xx error rate (T1 above 5%, T2 above 1%). Game server connection failures per region (T1 above threshold). API response time p99 per service (T2 when SLA breached).
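The 30-second synthetic checks are the sort of thing the Prometheus blackbox_exporter handles well. A sketch of one probe job (the endpoint and exporter address are placeholders), with `probe_success == 0` driving the T1 rule:

```yaml
scrape_configs:
  - job_name: synthetic-uptime
    metrics_path: /probe
    params:
      module: [http_2xx]               # expect an HTTP 2xx from the target
    scrape_interval: 30s
    static_configs:
      - targets:
          - https://api.example-game.com/healthz   # placeholder endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115        # where the exporter runs
```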
The PagerDuty setup
We deployed PagerDuty escalation policies backed by two on-call schedules, primary and backup, each on a weekly rotation.
Escalation flow: Alertmanager routes to PagerDuty. PagerDuty pages primary on-call. If unacknowledged in 5 minutes, escalates to backup. If unacknowledged in 10 minutes, escalates to engineering lead.
T1 alerts fire phone calls and SMS simultaneously. T2 alerts fire push notifications only. T3 alerts go to Slack and never touch PagerDuty.
Every PagerDuty incident is auto-populated with a Grafana deep-link scoped to the incident time window, the alert context, and a runbook URL. The on-call engineer taps one link and lands on the right dashboard with the right time range. No hunting.
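On the Alertmanager side, all of this reduces to routing on the severity label and attaching context to the PagerDuty event. A condensed sketch (integration keys, channel, and annotation names are placeholders; the call-versus-push distinction between T1 and T2 lives in PagerDuty's own notification rules, not here):

```yaml
route:
  receiver: slack-advisory                  # default: T3 advisories go to Slack only
  group_by: [alertname, service]
  routes:
    - matchers: [severity = critical]       # T1
      receiver: pagerduty-critical
    - matchers: [severity = warning]        # T2
      receiver: pagerduty-warning

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <t1-integration-key>
        severity: critical
        details:
          dashboard: '{{ .CommonAnnotations.dashboard_url }}'   # Grafana deep-link
          runbook: '{{ .CommonAnnotations.runbook_url }}'
  - name: pagerduty-warning
    pagerduty_configs:
      - routing_key: <t2-integration-key>
        severity: warning
  - name: slack-advisory
    slack_configs:
      - api_url: <slack-webhook-url>
        channel: '#platform-advisories'
```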
The Prometheus, Loki, and Grafana setup
Prometheus was deployed via the kube-prometheus-stack Helm chart, which bundles the Prometheus Operator, Prometheus, Alertmanager, Grafana, node exporter, and kube-state-metrics. We extended it with custom ServiceMonitors for each application component.
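A custom ServiceMonitor is just a small CRD that tells the operator where to scrape. A sketch for one of the game services (names, namespace, and port are illustrative; the release label has to match the chart's serviceMonitorSelector):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: game-backend
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # must match the Helm release the chart selects on
spec:
  selector:
    matchLabels:
      app: game-backend
  namespaceSelector:
    matchNames: [game]
  endpoints:
    - port: http-metrics              # named port on the Service exposing /metrics
      path: /metrics
      interval: 15s
```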
Loki handles log aggregation. Promtail runs as a DaemonSet collecting container logs and pushing them to Loki. CloudWatch Logs integration brings in AWS-native logs for Redshift, Bedrock, and the EKS control plane. Everything surfaces in one Grafana instance.
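The core of the Promtail DaemonSet config is two things: where to push, and how to find container logs. A trimmed sketch (the Loki URL is a placeholder; the full Kubernetes relabelling is longer in practice):

```yaml
clients:
  - url: http://loki-gateway.monitoring.svc/loki/api/v1/push   # placeholder Loki endpoint
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                      # discover pods via the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log    # where the kubelet writes container logs
```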
Five core dashboards:
- Platform overview: golden signals (latency, traffic, errors, saturation) per service
- Kafka operations: consumer lag per topic, broker health, throughput trends
- AI inference: Bedrock latency, error rates, token usage and cost per model
- Cost and spend: daily spend by service, budget burn rate, anomaly detection overlay
- Incident timeline: deployments, alert firings, and incidents on a shared annotated time axis
The incident timeline dashboard alone saves 20 minutes per investigation. When something breaks, the first question is always "what changed?" With deployments and alerts overlaid on the same timeline, the answer is usually visible in under 30 seconds.
What changed
| Metric | Before | After |
|---|---|---|
| Undetected production incidents | Unknown, no alerting | Zero since go-live |
| Alert noise | Every CloudWatch alarm paged on-call | 97% reduction; only T1 alerts wake engineers |
| Mean time to acknowledge | No defined process | Under 5 minutes for T1 |
| Cost anomaly detection | None | 2 anomalies caught in first 2 weeks |
| Log visibility | CloudWatch only, no cross-service correlation | Unified log search in Grafana across all services |
| On-call coverage | Ad hoc, no rotation | 24/7 with defined escalation and backup |
The most meaningful change is not in the table. Engineers stopped dreading on-call. When every alert that wakes you up is a genuine production issue with full context attached, on-call becomes manageable rather than exhausting.
If you are running AI workloads in production and your monitoring is still someone checking a dashboard, you have a gap that will cost you at the worst possible moment. Our Observability service covers the full setup, from alert design to PagerDuty configuration.
Book a free 30-minute call.