Monitoring AI Systems: What to Measure and Why
Your AI system is running. Revenue is flowing. Invoices are processing.
Then, silently, the model’s quality degrades. Responses become shorter. Extraction accuracy drifts from 98% to 87%. Token costs per invocation start creeping up.
You don’t notice for three weeks because you’re not measuring the right things.
Most teams monitor uptime: “Is the system responding?” But for AI, uptime is not enough. The system can be up and still be broken.
I’m going to walk you through what to measure, why it matters, and how to set it up.
Beyond Uptime: The AI Metrics
Uptime is binary: working or not. AI systems degrade gradually. They keep running while silently getting worse.
Metric 1: Token Cost Per Invocation
Every time you call Claude, you pay for input tokens + output tokens. If your average token cost is trending up, something is wrong.
Common causes:
- Prompt is bloated (you’re repeating context).
- Output got longer (model is verbose).
- Input data is more complex (invoices changed format).
Track this in CloudWatch:
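A minimal sketch using boto3. The namespace `AIApp`, the metric name, and the per-token prices are assumptions; substitute your model’s actual rates:

```python
# Assumed pricing per 1K tokens -- replace with your model's actual rates.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

def token_cost_metric(input_tokens: int, output_tokens: int) -> dict:
    """Build a CloudWatch metric datum for the cost of one invocation."""
    cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return {
        "MetricName": "TokenCostPerInvocation",
        "Unit": "None",
        "Value": cost,
    }

def publish(datum: dict) -> None:
    """Send the datum to CloudWatch under an example namespace."""
    import boto3  # requires AWS credentials at runtime
    boto3.client("cloudwatch").put_metric_data(
        Namespace="AIApp", MetricData=[datum]
    )
```

Call `publish(token_cost_metric(...))` after each invocation, using the token counts the API returns in its usage data.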
Set up a dashboard to see your average and trend. Alert if average token cost increases >10% from baseline.
Metric 2: Output Validation Failure Rate
You don’t trust Claude implicitly. You validate the output. “Does it have all required fields? Is the amount a number? Is the date in the right format?”
Track how many outputs fail validation:
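A sketch of the validation step, assuming an invoice-extraction schema; the field names and date format are examples, not a fixed contract:

```python
from datetime import datetime

REQUIRED_FIELDS = ("invoice_number", "amount", "date")  # example schema

def validate_output(output: dict) -> bool:
    """Return True if the model output passes basic structural checks:
    all required fields present, amount numeric, date in ISO format."""
    if not all(field in output for field in REQUIRED_FIELDS):
        return False
    if not isinstance(output["amount"], (int, float)):
        return False
    try:
        datetime.strptime(output["date"], "%Y-%m-%d")
    except (TypeError, ValueError):
        return False
    return True
```

Emit a 0-or-1 metric (e.g. `ValidationFailure`) per invocation with the result; the average over time is your failure rate.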
Alert if the validation success rate drops below 95%. A drop usually points to model drift or input data quality issues.
Metric 3: Latency Percentiles (P50, P95, P99)
How long does an invocation take? Track not just average, but percentiles:
- P50 (median): 50% of calls finish faster than this.
- P95: 95% of calls finish faster than this.
- P99: 99% of calls finish faster than this.
Why percentiles matter: your average might be 500ms, but if P99 is 5 seconds, 1 in 100 users waits ten times longer than the median. That’s a bad experience.
CloudWatch computes P50, P95, and P99 automatically, as long as you publish individual values rather than pre-aggregated statistics. Set up alarms: “Alert if P95 latency > 2 seconds.”
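A percentile alarm can be defined via `put_metric_alarm` with an `ExtendedStatistic`. This sketch builds the arguments as a dict; the alarm name, namespace, and metric name are assumptions:

```python
def p95_latency_alarm(threshold_ms: float = 2000.0) -> dict:
    """Build kwargs for cloudwatch.put_metric_alarm on P95 latency."""
    return {
        "AlarmName": "ai-p95-latency-high",   # example name
        "Namespace": "AIApp",                 # example namespace
        "MetricName": "InvocationLatency",
        "ExtendedStatistic": "p95",           # percentile instead of Statistic
        "Period": 300,                        # evaluate over 5-minute windows
        "EvaluationPeriods": 3,               # breach 3 windows before alarming
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# boto3.client("cloudwatch").put_metric_alarm(**p95_latency_alarm())
```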
Metric 4: Model Drift Detection
Over time, the model’s outputs change. Not catastrophically, but subtly. Responses get shorter. Confidence decreases. Hallucinations increase.
Detect this by comparing output distributions:
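A simple drift check, assuming you track a per-response statistic such as response length: compare a recent window against a baseline window and flag when the recent mean moves too many baseline standard deviations away. The z-score threshold of 3 is an illustrative default:

```python
from statistics import mean, stdev

def drift_alert(baseline: list[float], recent: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag drift when the recent mean of a statistic (e.g. response
    length) is more than z_threshold baseline standard deviations
    away from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    z = abs(mean(recent) - mu) / sigma
    return z > z_threshold
```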
Track “drift alerts” as a metric. If they spike, investigate.
Metric 5: End-to-End Success Rate
This is your business metric. Not “did the AI run,” but “did the full process complete successfully?”
For invoice processing: “Did we extract data, validate it, store it, and send the notification?”
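One way to record this, assuming the pipeline steps are plain callables; the step names and the 0/1 metric convention are illustrative:

```python
def record_pipeline_result(invoice: dict, steps) -> int:
    """Run every step of the pipeline in order; return 1 only if all
    of them complete without raising. Publish the result as a 0-or-1
    EndToEndSuccess metric so its average is your success rate."""
    data = invoice
    try:
        for step in steps:  # e.g. [extract, validate, store, notify]
            data = step(data)
        return 1
    except Exception:
        return 0
```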
Alert if end-to-end success < 98%. This is a customer-impacting metric.
The Dashboard
Set up a CloudWatch dashboard with these metrics:
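A sketch of the dashboard body for `put_dashboard`, laying the five metrics out in a two-column grid. The namespace and metric names are assumptions and must match whatever you actually publish:

```python
import json

NAMESPACE = "AIApp"  # example namespace

def dashboard_body() -> str:
    """Build the JSON body for cloudwatch.put_dashboard: one widget
    per metric, two columns wide."""
    metrics = ["TokenCostPerInvocation", "ValidationFailure",
               "InvocationLatency", "DriftAlert", "EndToEndSuccess"]
    widgets = []
    for i, metric in enumerate(metrics):
        widgets.append({
            "type": "metric",
            "x": (i % 2) * 12, "y": (i // 2) * 6,
            "width": 12, "height": 6,
            "properties": {
                "metrics": [[NAMESPACE, metric]],
                "stat": "Average",
                "period": 300,
                "title": metric,
            },
        })
    return json.dumps({"widgets": widgets})

# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="ai-monitoring", DashboardBody=dashboard_body())
```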
Alerts (What to Set)
- Token cost: Alert if avg >110% of baseline.
- Validation success: Alert if <95%.
- P95 latency: Alert if >2x baseline (or >2 seconds absolute).
- Drift alerts: Alert if any detected in last hour.
- End-to-end success: Alert if <98%.
- Daily spend: Alert if >20% of monthly budget.
The Real Scenario
You’re processing invoices. Everything looks normal. One morning, your dashboard shows:
- Token cost: up 25%
- Validation success: down to 88%
- P95 latency: up to 4 seconds
- Drift alerts: 3 in the last hour
You check the logs. A client uploaded invoices in a new format (two-page PDFs instead of one-page). Claude is processing more tokens per invoice, generating longer responses, and making more mistakes.
You choose to pre-process the PDFs (splitting two-page documents into single pages) before sending them to Claude. Metrics normalize within hours.
Without monitoring, you wouldn’t have noticed for days.
Integration with Datadog
Datadog is richer than CloudWatch, but CloudWatch is inexpensive and sufficient for SMBs. If you want to stream metrics to Datadog as well:
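One option is to mirror each metric to a local DogStatsD agent via the `datadog` Python package. This sketch builds the gauge payload and leaves the actual send commented out, since it requires the package and a running agent; the metric name and tags are examples:

```python
def dd_gauge(name: str, value: float, tags: list[str]) -> dict:
    """Build a DogStatsD-style gauge payload; sending is optional."""
    payload = {"metric": name, "value": value, "tags": tags}
    # from datadog import statsd            # requires the `datadog` package
    # statsd.gauge(name, value, tags=tags)  # sends to a local DogStatsD agent
    return payload
```

Alternatively, CloudWatch Metric Streams can forward metrics to Datadog without application changes.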
Bottom Line
Track five things from day one:
- Token costs
- Validation rates
- Latency percentiles
- Model drift
- End-to-end success
Alert on all of these. This is not overkill. This is how you stay ahead of silent degradation.
Get a ready-to-use CloudWatch dashboard template for AI systems
Pre-built dashboard JSON, validation rules, drift detection thresholds, and alert configuration. Copy, paste, deploy in 15 minutes.
No spam. Unsubscribe anytime.
Running AI in production?
I help SMBs build observability-first AI systems on AWS. Let’s make sure you catch degradation before your customers do.
Book Your Free Discovery Call