Observability
8 min read · April 6, 2026

How We Built a Full-Stack Observability System for a Production AI Gaming Platform

When your application is live, revenue is flowing, and players are online, a monitoring gap is not a hypothetical risk. It is a countdown.

This is the story of how we built a complete, multi-tier alerting and observability system for an AI-native gaming client whose infrastructure spans Kafka, Redshift, AWS Bedrock, and a real-time game backend in full production.

The starting point

The client already had infrastructure. Good infrastructure, built over time, running a gaming platform with AI-driven features including procedural content, real-time recommendations, and player behavior analytics powered by AWS Bedrock. The stack was running. The problem was that nobody would know when it stopped running well.

There was basic CloudWatch in place. But there were no structured alert tiers, no on-call rotation, no escalation paths, and no unified view across the full stack. One engineer checking a dashboard when something felt slow is not observability. It is guesswork.

The stack we built around

Before designing the alerting architecture, we mapped every component that could affect the player experience or the business SLAs.

  • Streaming: Apache Kafka (game events, player actions, telemetry)
  • Data warehouse: Amazon Redshift (player analytics, billing, reporting)
  • AI inference: AWS Bedrock (LLM calls for in-game AI features)
  • Infrastructure: Kubernetes on AWS EKS, Docker, Terraform-managed
  • Application layer: Game servers, API services, matchmaking
  • Metrics: Prometheus, AWS CloudWatch
  • Logs: Loki + Promtail, CloudWatch Logs
  • Visualization: Grafana
  • Alert routing: Alertmanager
  • On-call and incident: PagerDuty

Designing the three-tier alert model

Not every alert deserves a phone call at 2 AM. Not every issue can wait until morning standup.

The first design decision was a clear, enforced tier system: every alert we configured was assigned a tier before its rule was written.

  • T1 Critical: Production is down or degraded for players. SLA breach imminent or underway. Immediate PagerDuty call + SMS. Escalates to backup in 5 minutes if unacknowledged.
  • T2 Warning: Degrading trend that will become critical if not addressed. PagerDuty notification + Slack message. On-call engineer investigates within 30 minutes.
  • T3 Advisory: Informational. Cost spike, capacity threshold approaching, minor anomaly. Slack only. Reviewed during business hours. No on-call wake.

This model is not complicated. But having it written down and enforced before a single alert rule is written changes everything. It means every engineer who touches the system knows exactly what will happen when their alert fires.
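The tier model maps directly onto Alertmanager's routing tree. A minimal sketch, assuming every alert rule carries a `severity` label of `t1`/`t2`/`t3`; the receiver names, Slack channel, and integration keys are illustrative placeholders:

```yaml
# alertmanager.yml (sketch) -- route alerts by tier label.
route:
  receiver: slack-advisory            # default: nothing pages
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity = "t1"
      receiver: pagerduty-critical    # phone call + SMS via PagerDuty
    - matchers:
        - severity = "t2"
      receiver: pagerduty-warning     # push notification only
    - matchers:
        - severity = "t3"
      receiver: slack-advisory        # Slack only, never pages

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <t1-service-integration-key>
  - name: pagerduty-warning
    pagerduty_configs:
      - routing_key: <t2-service-integration-key>
  - name: slack-advisory
    slack_configs:
      - channel: "#alerts-advisory"
```

The point of routing on a single label is that the tier decision lives in one place: an engineer writing a new rule only picks `t1`/`t2`/`t3`, and the delivery behavior follows automatically.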

What we actually monitored

Once the tier model was locked, we mapped every component to its alerts.

Kafka

  • Consumer lag per topic per consumer group (T1 if lag exceeds 10k messages, T2 at 5k)
  • Broker under-replicated partitions (T1 immediately)
  • Producer request failure rate (T2 above 1%, T1 above 5%)
  • Disk utilization on broker nodes (T2 at 75%, T1 at 90%)
  • Message throughput anomaly detection via Grafana alerting on rate deviation from 7-day baseline
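The consumer-lag tiers above can be expressed as Prometheus rules. A sketch, assuming the `kafka_consumergroup_lag` metric exposed by a Kafka exporter; thresholds match the tiers listed:

```yaml
# prometheus-rules.yaml (sketch) -- consumer-lag tiers per topic/group.
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: KafkaConsumerLagCritical
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
        for: 5m
        labels:
          severity: t1
        annotations:
          summary: "Lag > 10k on {{ $labels.topic }} ({{ $labels.consumergroup }})"
      - alert: KafkaConsumerLagWarning
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 5000
        for: 10m
        labels:
          severity: t2
        annotations:
          summary: "Lag > 5k on {{ $labels.topic }} ({{ $labels.consumergroup }})"
```

The `for` durations keep transient rebalance spikes from paging anyone; a T1 only fires after the lag has held above 10k for five minutes.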

Redshift

  • Query queue depth (T2 when queries waiting exceeds threshold)
  • Long-running queries over 5 minutes (T2, with the query ID in the alert body)
  • Disk space utilization (T1 at 90%)
  • WLM slot contention (T3 advisory)
  • Nightly ETL job failure detection via CloudWatch Events (T1)
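The disk-space T1 maps to a standard CloudWatch alarm on the `AWS/Redshift` namespace. A CloudFormation sketch; the cluster identifier and SNS topic are placeholders:

```yaml
# cloudformation (sketch) -- Redshift disk utilization T1 at 90%.
RedshiftDiskCritical:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: redshift-disk-90pct-t1
    Namespace: AWS/Redshift
    MetricName: PercentageDiskSpaceUsed
    Dimensions:
      - Name: ClusterIdentifier
        Value: <cluster-id>             # placeholder
    Statistic: Maximum
    Period: 300
    EvaluationPeriods: 2
    Threshold: 90
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref T1AlertTopic               # SNS topic feeding PagerDuty
```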

AWS Bedrock

  • API error rate by model (T1 above 5% errors, T2 above 1%)
  • Latency p95 per model (T2 when p95 exceeds 3s, T1 above 8s)
  • Token usage rate and cost spike alerting (T3 advisory, reviewed daily)
  • Throttling events (T2 when sustained throttling detected)
  • Model availability via synthetic health check every 60 seconds
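Bedrock publishes per-model metrics to CloudWatch, so the latency tiers translate to alarms on `InvocationLatency` (reported in milliseconds). A sketch of the T2 alarm; the model ID and topic are placeholders:

```yaml
# cloudformation (sketch) -- Bedrock p95 latency T2 at 3 seconds.
BedrockLatencyWarning:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: bedrock-p95-latency-t2
    Namespace: AWS/Bedrock
    MetricName: InvocationLatency
    Dimensions:
      - Name: ModelId
        Value: <model-id>              # placeholder
    ExtendedStatistic: p95
    Period: 300
    EvaluationPeriods: 3
    Threshold: 3000                    # milliseconds
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref T2AlertTopic
```

A matching alarm at `Threshold: 8000` wired to the T1 topic covers the critical tier.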

Cost and pricing

  • Daily cost anomaly detection across all AWS services using Cost Explorer API (T3)
  • Per-service budget threshold alerts (T2 at 80% of monthly budget, T1 at 100%)
  • Unexpected Bedrock token spend spikes (T2 at 150% of rolling 7-day average)
  • ECR storage growth rate monitoring (T3 advisory)
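The budget-threshold tiers can be declared with AWS Budgets. A sketch, assuming one budget per service; the amount, `CostFilters` service value, and SNS topics are illustrative:

```yaml
# cloudformation (sketch) -- per-service budget, T2 at 80%, T1 at 100%.
BedrockMonthlyBudget:
  Type: AWS::Budgets::Budget
  Properties:
    Budget:
      BudgetName: bedrock-monthly
      BudgetType: COST
      TimeUnit: MONTHLY
      BudgetLimit:
        Amount: 5000                   # placeholder monthly limit (USD)
        Unit: USD
      CostFilters:
        Service:
          - Amazon Bedrock             # illustrative filter value
    NotificationsWithSubscribers:
      - Notification:
          NotificationType: ACTUAL
          ComparisonOperator: GREATER_THAN
          ThresholdType: PERCENTAGE
          Threshold: 80                # -> T2
        Subscribers:
          - SubscriptionType: SNS
            Address: !Ref T2AlertTopic
      - Notification:
          NotificationType: ACTUAL
          ComparisonOperator: GREATER_THAN
          ThresholdType: PERCENTAGE
          Threshold: 100               # -> T1
        Subscribers:
          - SubscriptionType: SNS
            Address: !Ref T1AlertTopic
```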

Application and availability

  • Synthetic uptime checks every 30 seconds per service endpoint (T1 on failure)
  • HTTP error rate 5xx (T1 above 5%, T2 above 1%)
  • Game server connection failures per region (T1 above threshold)
  • Matchmaking queue latency (T2 when average wait time doubles baseline)
  • API response time p99 per service (T2 when SLA threshold breached)
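The 5xx error-rate tiers are ratio rules in PromQL. A sketch, assuming a conventional `http_requests_total` counter labelled with `code` and `service`:

```yaml
# prometheus-rules.yaml (sketch) -- 5xx error-rate tiers per service.
groups:
  - name: http-errors
    rules:
      - alert: HighErrorRateCritical
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: t1
        annotations:
          summary: "5xx rate above 5% on {{ $labels.service }}"
      - alert: HighErrorRateWarning
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: t2
        annotations:
          summary: "5xx rate above 1% on {{ $labels.service }}"
```

Alerting on the ratio rather than the raw count means the thresholds stay meaningful whether a service handles ten requests per second or ten thousand.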

Kubernetes cluster

  • Pod crash looping (T1 immediately)
  • Node not ready (T1 immediately)
  • PVC usage above 80% (T2), above 95% (T1)
  • HPA at max replicas for sustained period (T2, potential scaling ceiling)
  • CPU and memory throttling rates per namespace (T3 advisory)
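The crash-loop T1 can be driven from kube-state-metrics, which ships with kube-prometheus-stack. A sketch:

```yaml
# prometheus-rules.yaml (sketch) -- crash-looping pods page immediately.
groups:
  - name: kubernetes-workloads
    rules:
      - alert: PodCrashLooping
        expr: |
          max by (namespace, pod) (
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
          ) == 1
        for: 2m
        labels:
          severity: t1
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
```

The short `for: 2m` window filters out a single restart during a rolling deploy while still paging within minutes of a genuine loop.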

The PagerDuty setup

PagerDuty was configured with two on-call schedules: a primary rotation and a backup escalation. Shifts were set on a weekly rotation.

The escalation policy was straightforward. Alertmanager routes the alert to PagerDuty. PagerDuty pages the primary on-call engineer. If not acknowledged within 5 minutes, it escalates to the backup. If not acknowledged within 10 minutes, it escalates to the engineering lead.

All T1 alerts trigger phone calls and SMS simultaneously. T2 alerts trigger push notifications only. T3 alerts route to Slack and do not touch PagerDuty at all.

Every PagerDuty incident is linked to a Grafana dashboard deep-link in the alert body. When the on-call engineer gets paged, they open their phone, tap the link, and land directly on the relevant dashboard scoped to the time window of the incident. No searching. No context switching.
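Attaching the deep-link happens in the Alertmanager receiver. A sketch, assuming each alert rule sets a `dashboard` annotation containing the scoped Grafana URL; the `links` field on `pagerduty_configs` requires Alertmanager v0.25 or later:

```yaml
# alertmanager receiver (sketch) -- dashboard deep-link on every incident.
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <t1-service-integration-key>
        links:
          - href: '{{ (index .Alerts 0).Annotations.dashboard }}'
            text: "Grafana dashboard (incident time window)"
```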

The Prometheus, Loki, and Grafana setup

Prometheus was deployed inside the Kubernetes cluster via the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, Grafana, the node exporter, and kube-state-metrics in a single deployment. We extended it with custom ServiceMonitors for each application component.
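A ServiceMonitor is a small CRD that tells the operator-managed Prometheus what to scrape. A sketch for one service; the labels and port name are illustrative, and the `release` label must match whatever the Prometheus instance is configured to select:

```yaml
# servicemonitor.yaml (sketch) -- scrape a game-backend service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: matchmaking
  labels:
    release: kube-prometheus-stack   # must match the chart's selector
spec:
  selector:
    matchLabels:
      app: matchmaking
  endpoints:
    - port: metrics                  # named port on the Service
      interval: 30s
      path: /metrics
```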

Loki handles log aggregation. Promtail runs as a DaemonSet, collecting container logs and pushing them to Loki. A CloudWatch Logs integration brings in the AWS-native logs for Redshift, Bedrock, and the EKS control plane. Everything surfaces in Grafana.
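The Promtail side is a standard Kubernetes service-discovery scrape config. A trimmed sketch; the Loki URL is a placeholder, and a production config (such as the one the Helm chart generates) carries more relabeling than shown here:

```yaml
# promtail config fragment (sketch) -- ship pod logs to Loki with
# namespace/pod labels attached.
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log   # tail each container's log files
```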

We built five core Grafana dashboards:

  • Platform overview: Golden signals (latency, traffic, errors, saturation) per service
  • Kafka operations: Consumer lag, throughput, broker health
  • AI inference: Bedrock latency, error rates, token usage by model
  • Cost and spend: Daily spend by service, budget burn rate, anomaly overlay
  • Incident timeline: Annotated view of deployments, alerts, and incidents on a shared time axis

The incident timeline dashboard alone saves 20 minutes per incident investigation. When something breaks, the first question is always "what changed?" Having deployments and alerts overlaid on the same timeline means the answer is usually visible within 30 seconds.

What this gives the engineering team

Before this setup, the question was: "is anything broken?" The answer required someone to look.

After this setup, the question is: "has anything we care about gone outside its expected bounds?" The system answers that automatically, routes it to the right person, at the right severity, at any hour, with the context they need to act.

Three specific things that changed for the engineering team:

  • They sleep better. T2 and T3 alerts do not wake anyone up. Only genuine production issues create phone calls.
  • They debug faster. Every alert fires with dashboard links, relevant log queries, and runbook links built into the alert body.
  • They catch cost leaks early. Cost anomaly alerting flagged two unexpected spend increases in the first two weeks of operation.

If you are running AI workloads in production and your monitoring is still "someone checks the dashboard," you have a gap. The tooling to fix it is not expensive. The time to set it up properly is two to three weeks. The cost of not having it is a 3 AM incident with no context and no escalation path.

Want to see how we would set this up for your stack? Book a free 30-minute call.


© 2026 Eprecisio Technologies LLC. All rights reserved.