Observability

Production Alerting for an AI Gaming Platform: PagerDuty, Prometheus, Grafana

March 28, 2026 · 8 min read

When your application is live, revenue is flowing, and players are online, a monitoring gap is not a hypothetical risk. It is a countdown.

This is how we built a complete, multi-tier alerting and observability system for an AI-native gaming client whose infrastructure spans Kafka, Redshift, AWS Bedrock, and a real-time game backend in full production.

The starting point

The client had good infrastructure. Built over 2 to 3 years by engineers who knew what they were doing. The problem was not bad code or bad architecture. The problem was that nobody would know when it stopped running well.

Basic CloudWatch alarms existed for EC2 metrics. But there was no structured severity model, no on-call rotation, no escalation path, and no unified view across the full stack. One engineer checking a dashboard when something felt slow is not observability. It is guesswork.

The specific gaps that needed fixing:

  • No visibility into Kafka consumer lag or broker health
  • No alerting on AWS Bedrock API errors or cost anomalies
  • No on-call schedule, no escalation policy, no backup coverage
  • No log aggregation or correlation across services
  • Engineers being woken up for informational events that required no action

The stack we built around

Before writing a single alert rule, we mapped every component that could affect player experience or SLA.

| Layer | Component | Why it matters |
| --- | --- | --- |
| Streaming | Apache Kafka | Game events, player actions, telemetry |
| Data warehouse | Amazon Redshift | Player analytics, billing, reporting |
| AI inference | AWS Bedrock | LLM calls for in-game AI features |
| Infrastructure | Kubernetes on EKS | Workload orchestration |
| Metrics | Prometheus via kube-prometheus-stack | Time-series metrics for everything |
| Logs | Loki + Promtail + CloudWatch integration | Unified log aggregation |
| Visualisation | Grafana | Dashboards and alert evaluation |
| Alert routing | Alertmanager | Deduplication, grouping, routing |
| Incident management | PagerDuty | On-call schedules, escalation, calls |

Designing the three-tier alert model

Not every alert deserves a phone call at 2 AM. Not every issue can wait until morning standup. The first design decision was a tiering model, enforced before writing a single alert rule.

| Tier | Name | What it means | Response |
| --- | --- | --- | --- |
| T1 | Critical | Production down or degraded. SLA breach imminent. | PagerDuty call + SMS immediately. Escalates to backup in 5 minutes. |
| T2 | Warning | Degrading trend. Will become critical if not addressed. | PagerDuty push notification. 30-minute response window. |
| T3 | Advisory | Informational. Cost spike, capacity threshold, minor anomaly. | Slack only. Reviewed in business hours. No on-call wake. |

This model is not complicated. But enforcing it before writing rules means every engineer who touches the system knows exactly what will happen when their alert fires. That predictability is what makes on-call sustainable.
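The tiering model maps cleanly onto Alertmanager's routing tree: a `severity` label on every rule decides where the alert lands. A minimal sketch of that routing, assuming hypothetical receiver names and placeholder integration keys:

```yaml
route:
  receiver: slack-advisory            # default: T3 advisories go to Slack only
  group_by: [alertname, service]
  routes:
    - matchers: ['severity="critical"']   # T1: phone call + SMS via PagerDuty
      receiver: pagerduty-critical
      repeat_interval: 15m
    - matchers: ['severity="warning"']    # T2: push notification only
      receiver: pagerduty-warning
      repeat_interval: 2h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <T1-service-integration-key>   # placeholder
        severity: critical
  - name: pagerduty-warning
    pagerduty_configs:
      - routing_key: <T2-service-integration-key>   # placeholder
        severity: warning
  - name: slack-advisory
    slack_configs:
      - channel: '#alerts-advisory'   # hypothetical channel name
```

The key property: an engineer writing a new rule only chooses a `severity` label; the routing tree guarantees the tiered behaviour described above.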

What we actually monitored

We configured over 50 alert rules across the full stack. Representative examples below.

Kafka: Consumer lag per topic per consumer group (T1 above 10k messages, T2 above 5k). Broker under-replicated partitions (T1 immediately). Producer request failure rate (T2 above 1%, T1 above 5%). Disk utilisation on broker nodes (T2 at 75%, T1 at 90%). Message throughput anomaly detection via 7-day baseline comparison.
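The consumer-lag thresholds above translate directly into Prometheus rules. A sketch, assuming the `kafka_consumergroup_lag` metric exposed by the commonly used kafka_exporter (adjust the metric name to whichever exporter is in place):

```yaml
groups:
  - name: kafka
    rules:
      - alert: KafkaConsumerLagCritical
        expr: sum by (topic, consumergroup) (kafka_consumergroup_lag) > 10000
        for: 5m
        labels:
          severity: critical   # T1: pages immediately
        annotations:
          summary: "Consumer lag above 10k on {{ $labels.topic }} ({{ $labels.consumergroup }})"
      - alert: KafkaConsumerLagWarning
        expr: sum by (topic, consumergroup) (kafka_consumergroup_lag) > 5000
        for: 10m
        labels:
          severity: warning    # T2: push notification, 30-minute window
```

The longer `for:` window on the T2 rule avoids paging on transient lag spikes during normal rebalances.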

Redshift: ETL job failures via CloudWatch Events (T1). Disk space utilisation (T1 at 90%). Long-running queries over 5 minutes (T2 with query ID in alert body). WLM slot contention (T3 advisory).
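Redshift publishes its metrics to CloudWatch rather than a native Prometheus endpoint, so they are scraped via the CloudWatch exporter. A fragment of that exporter's config for the disk-space alert, assuming a hypothetical region:

```yaml
region: eu-west-1                # assumption: set to the cluster's region
metrics:
  - aws_namespace: AWS/Redshift
    aws_metric_name: PercentageDiskSpaceUsed
    aws_dimensions: [ClusterIdentifier]
    aws_statistics: [Average]
```

Once scraped, the T1 rule is a plain threshold on the resulting metric at 90%.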

AWS Bedrock: API error rate by model (T1 above 5%, T2 above 1%). Latency p95 per model (T2 when p95 exceeds 3s, T1 above 8s). Throttling events (T2 when sustained). Model availability via synthetic health check every 60 seconds. Token usage cost spike alerting (T3 advisory, reviewed daily).
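The p95 latency rule can be expressed with `histogram_quantile` if the application wrapping Bedrock calls records them in a Prometheus histogram. A sketch, assuming a hypothetical app-emitted metric named `bedrock_invocation_duration_seconds` with a `model` label:

```yaml
- alert: BedrockLatencyP95High
  expr: |
    histogram_quantile(0.95,
      sum by (model, le) (rate(bedrock_invocation_duration_seconds_bucket[5m]))
    ) > 3
  for: 10m
  labels:
    severity: warning   # T2 at 3s; a parallel rule with > 8 pages as T1
  annotations:
    summary: "Bedrock p95 latency above 3s for {{ $labels.model }}"
```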

Cost and spend: Daily cost anomaly detection using the AWS Cost Explorer API (T3). Per-service budget threshold alerts (T2 at 80%, T1 at 100%). Unexpected Bedrock token spend spikes at 150% of 7-day rolling average (T2).
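The 150%-of-rolling-average spike check is simple enough to show in full. A minimal sketch in Python of the comparison logic; the function name and the idea of feeding it daily spend figures pulled from the Cost Explorer API are our illustration, not a specific library API:

```python
from statistics import mean

def is_spend_spike(daily_spend: list[float], today: float,
                   threshold: float = 1.5) -> bool:
    """Flag today's spend when it exceeds `threshold` times the rolling
    average of the trailing 7 days (150% by default, matching the T2 rule)."""
    if len(daily_spend) < 7:
        return False  # not enough history to establish a baseline
    baseline = mean(daily_spend[-7:])
    return today > threshold * baseline
```

Fed once a day with Cost Explorer output, a `True` result raises the T2 alert.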

Kubernetes: Pod crash looping (T1 immediately). Node not ready (T1 immediately). PVC usage above 80% (T2), above 95% (T1). HPA at max replicas for a sustained period (T2, potential scaling ceiling).
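The crash-loop and node-health rules lean on standard kube-state-metrics series, which kube-prometheus-stack ships by default. A sketch of the two T1 rules (the restart threshold of 3 in 15 minutes is an illustrative choice):

```yaml
- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  labels:
    severity: critical   # T1
  annotations:
    summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready", status="true"} == 0
  for: 2m
  labels:
    severity: critical   # T1
  annotations:
    summary: "Node {{ $labels.node }} is not ready"
```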

Application layer: Synthetic uptime checks every 30 seconds per endpoint (T1 on failure). HTTP 5xx error rate (T1 above 5%, T2 above 1%). Game server connection failures per region (T1 above threshold). API response time p99 per service (T2 when SLA breached).
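The synthetic uptime checks run through the Prometheus blackbox exporter, which probes endpoints over HTTP and exposes the result as `probe_success`. A sketch of the scrape config, with hypothetical endpoint and exporter addresses:

```yaml
scrape_configs:
  - job_name: synthetic-uptime
    metrics_path: /probe
    scrape_interval: 30s
    params:
      module: [http_2xx]          # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://api.example.com/healthz   # hypothetical endpoint
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself, not the target
      - target_label: __address__
        replacement: blackbox-exporter:9115   # hypothetical exporter address
```

The T1 rule then fires on `probe_success == 0` sustained across consecutive probes.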

The PagerDuty setup

We deployed PagerDuty escalation policies with two on-call schedules: primary and backup, on weekly rotation.

Escalation flow: Alertmanager routes to PagerDuty. PagerDuty pages primary on-call. If unacknowledged in 5 minutes, escalates to backup. If unacknowledged in 10 minutes, escalates to engineering lead.

T1 alerts fire phone calls and SMS simultaneously. T2 alerts fire push notifications only. T3 alerts go to Slack and never touch PagerDuty.

Every PagerDuty incident is auto-populated with a Grafana deep-link scoped to the incident time window, the alert context, and a runbook URL. The on-call engineer taps one link and lands on the right dashboard with the right time range. No hunting.
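The deep-link travels as alert annotations, which Alertmanager forwards to PagerDuty as incident context. A sketch of the annotation block on a rule, with hypothetical Grafana and runbook URLs:

```yaml
annotations:
  summary: "Consumer lag critical on {{ $labels.topic }}"
  runbook_url: "https://runbooks.example.com/kafka/consumer-lag"   # hypothetical
  dashboard_url: >-
    https://grafana.example.com/d/kafka-ops?from=now-1h&to=now&var-topic={{ $labels.topic }}
```

Templating the alert's labels into the dashboard variables is what scopes the link to the right topic and time window.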

The Prometheus, Loki, and Grafana setup

Prometheus was deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, and the node exporter together. We extended it with custom ServiceMonitors for each application component.
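A custom ServiceMonitor is a small CRD that tells the Prometheus Operator which Services to scrape. A sketch for one application component, with a hypothetical service name; the `release` label must match whatever selector the Helm release is configured to watch:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: game-backend                     # hypothetical component
  labels:
    release: kube-prometheus-stack       # must match the operator's selector
spec:
  selector:
    matchLabels:
      app: game-backend
  endpoints:
    - port: metrics                      # named port on the Service
      interval: 30s
```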

Loki handles log aggregation. Promtail runs as a DaemonSet collecting container logs and pushing them to Loki. CloudWatch Logs integration brings in AWS-native logs for Redshift, Bedrock, and the EKS control plane. Everything surfaces in one Grafana instance.

Five core dashboards:

  1. Platform overview: golden signals (latency, traffic, errors, saturation) per service
  2. Kafka operations: consumer lag per topic, broker health, throughput trends
  3. AI inference: Bedrock latency, error rates, token usage and cost per model
  4. Cost and spend: daily spend by service, budget burn rate, anomaly detection overlay
  5. Incident timeline: deployments, alert firings, and incidents on a shared annotated time axis

The incident timeline dashboard alone saves 20 minutes per investigation. When something breaks, the first question is always "what changed?" Deployments and alerts overlaid on the same timeline means the answer is usually visible in under 30 seconds.

What changed

| Metric | Before | After |
| --- | --- | --- |
| Undetected production incidents | Unknown, no alerting | Zero since go-live |
| Alert noise | Every CloudWatch alarm paged on-call | 97% reduction. T1 only wakes engineers. |
| Mean time to acknowledge | No defined process | Under 5 minutes for T1 |
| Cost anomaly detection | None | 2 anomalies caught in first 2 weeks |
| Log visibility | CloudWatch only, no cross-service correlation | Unified log search in Grafana across all services |
| On-call coverage | Ad hoc, no rotation | 24/7 with defined escalation and backup |

The most meaningful change is not in the table. Engineers stopped dreading on-call. When every alert that wakes you up is a genuine production issue with full context attached, on-call becomes manageable rather than exhausting.

If you are running AI workloads in production and your monitoring is still someone checking a dashboard, you have a gap that will cost you at the worst possible moment. Our Observability service covers the full setup from alert design to PagerDuty configuration.


Want Results Like These for Your Stack?

We build production-grade infrastructure for AI startups and technical founders. Let's talk about your project.

Book a Free 30-Min Call

Your infra shouldn't be the thing slowing you down.

Book a free 30-minute call. We'll look at your current setup and tell you exactly what's costing you money, what's a deployment risk, and what we'd fix first. No pitch, no fluff.

AWS · Kubernetes · Docker · Terraform · Python · React · ArgoCD · Prometheus