April 14, 2026 · 8 min read · Monitoring, CloudWatch, AI Ops

Monitoring AI Systems: What to Measure and Why

Your AI system is running. Revenue is flowing. Invoices are processing.

Then, silently, the model’s quality degrades. Responses become shorter. Extraction accuracy drifts from 98% to 87%. Token costs per invocation start creeping up.

You don’t notice for three weeks because you’re not measuring the right things.

Most teams monitor uptime: “Is the system responding?” But for AI, uptime is not enough. The system is up, but it’s broken.

I’m going to walk you through what to measure, why it matters, and how to set it up.

Beyond Uptime: The AI Metrics

Uptime is a binary: working or not. AI systems degrade gradually. They keep running while silently getting worse.

Metric 1: Token Cost Per Invocation

Every time you call Claude, you pay for input tokens + output tokens. If your average token cost is trending up, something is wrong.

Common causes:

- Prompt templates that have grown over time
- Larger input documents (more pages, more context per call)
- The model generating longer, more verbose responses
- Retry logic silently re-sending the same payload

Track this in CloudWatch:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def log_token_cost(input_tokens, output_tokens, invocation_id):
    total_tokens = input_tokens + output_tokens
    # $3 per million input tokens, $15 per million output tokens
    cost = (input_tokens * 3 / 1_000_000) + (output_tokens * 15 / 1_000_000)
    cloudwatch.put_metric_data(
        Namespace="AISystem",
        MetricData=[
            {
                "MetricName": "TokensPerInvocation",
                "Value": total_tokens,
                "Unit": "Count",
            },
            {
                "MetricName": "CostPerInvocation",
                "Value": cost,
                "Unit": "None",
                "Dimensions": [{"Name": "Operation", "Value": "ExtractInvoice"}],
            },
        ],
    )
```

Set up a dashboard to see your average and trend. Alert if average token cost increases >10% from baseline.
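As a sketch, that baseline alert could be wired up with a static-threshold CloudWatch alarm on the cost metric above. The alarm name, period, and evaluation settings below are illustrative assumptions, and the `put_metric_alarm` call is left commented so you can plug in your own client and SNS action:

```python
def cost_alarm_params(baseline_cost, alarm_name="ai-cost-per-invocation-high"):
    """Build put_metric_alarm parameters: fire when the hourly average cost
    per invocation exceeds 110% of a measured baseline.
    baseline_cost and alarm_name are placeholder values, not fixed names."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AISystem",
        "MetricName": "CostPerInvocation",
        "Statistic": "Average",
        "Period": 3600,               # evaluate hourly averages
        "EvaluationPeriods": 3,       # require 3 consecutive breaches
        "Threshold": baseline_cost * 1.10,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(**cost_alarm_params(baseline_cost=0.012))
```

Requiring several consecutive breaches keeps one expensive outlier invoice from paging you at 3 a.m.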

Metric 2: Output Validation Failure Rate

You don’t trust Claude implicitly. You validate the output. “Does it have all required fields? Is the amount a number? Is the date in the right format?”

Track how many outputs fail validation:

```python
from datetime import datetime

def is_valid_date(value):
    """Helper: accept ISO-format dates (YYYY-MM-DD)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

def validate_extraction(response):
    """Validate extracted data."""
    required_fields = ["vendor", "amount", "date"]
    issues = []
    for field in required_fields:
        if field not in response:
            issues.append(f"Missing {field}")
        elif field == "amount" and not isinstance(response["amount"], (int, float)):
            issues.append(f"Amount not numeric: {response['amount']}")
        elif field == "date" and not is_valid_date(response["date"]):
            issues.append(f"Date invalid: {response['date']}")
    success = len(issues) == 0
    cloudwatch.put_metric_data(
        Namespace="AISystem",
        MetricData=[
            {"MetricName": "ValidationSuccess", "Value": 1 if success else 0, "Unit": "Count"}
        ],
    )
    return success, issues
```

Alert if validation success rate drops below 95%. This indicates model drift or input data quality issues.

Metric 3: Latency Percentiles (P50, P95, P99)

How long does an invocation take? Track not just the average, but percentiles:

- P50: the median request
- P95: the threshold your slowest 5% of requests exceed
- P99: the threshold your slowest 1% of requests exceed

Why percentiles matter: Your average might be 500ms. But if P99 is 5 seconds, 1 in 100 requests takes ten times longer than a typical one. That’s a bad experience.
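To make that concrete, here is a tiny nearest-rank percentile sketch; the sample latencies are invented for illustration:

```python
def percentile(values, p):
    """Nearest-rank percentile over a sorted copy (illustration only)."""
    ordered = sorted(values)
    i = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[i]

latencies_ms = [480, 490, 495, 500, 500, 505, 510, 515, 520, 4800]
# The average (~931 ms) is dragged up by one outlier; the median stays at
# ~500 ms, while the high percentiles expose the 4.8 s request.
```

One slow request in ten is invisible in the median but dominates the mean, which is exactly why you watch both ends of the distribution.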

```python
import time

def measure_latency(start_time):
    duration_ms = (time.time() - start_time) * 1000
    cloudwatch.put_metric_data(
        Namespace="AISystem",
        MetricData=[
            {"MetricName": "InvocationLatency", "Value": duration_ms, "Unit": "Milliseconds"}
        ],
    )
```

CloudWatch computes P50, P95, and P99 automatically from the raw values you publish. Set up alarms: “Alert if P95 latency > 2 seconds.”
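A sketch of that P95 alarm follows; the alarm name and evaluation settings are placeholder assumptions. Note that percentile alarms use the `ExtendedStatistic` field rather than `Statistic`:

```python
def p95_latency_alarm_params(threshold_ms=2000):
    """Build put_metric_alarm parameters: fire when P95 invocation
    latency exceeds threshold_ms for several consecutive periods."""
    return {
        "AlarmName": "ai-invocation-latency-p95-high",  # placeholder name
        "Namespace": "AISystem",
        "MetricName": "InvocationLatency",
        "ExtendedStatistic": "p95",   # percentile stats go here, not Statistic
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# cloudwatch.put_metric_alarm(**p95_latency_alarm_params())
```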

Metric 4: Model Drift Detection

Over time, the model’s outputs change. Not catastrophically, but subtly. Responses get shorter. Confidence decreases. Hallucinations increase.

Detect this by comparing output distributions:

```python
def detect_drift(current_response, baseline_stats):
    """Compare current output to historical baseline."""
    current_length = len(current_response.split())
    if current_length < baseline_stats["min_length"]:
        return "DRIFT_SHORT"
    elif current_length > baseline_stats["max_length"]:
        return "DRIFT_LONG"
    else:
        return "OK"
```

Track “drift alerts” as a metric. If they spike, investigate.
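One simple way to produce that `baseline_stats` dict, assuming you have a sample of known-good responses to learn from (the function name and tolerance margin are illustrative):

```python
def build_baseline_stats(good_responses, margin=0.2):
    """Derive length bounds from known-good outputs, widened by a
    tolerance margin so normal variation doesn't trigger drift alerts."""
    lengths = [len(r.split()) for r in good_responses]
    return {
        "min_length": round(min(lengths) * (1 - margin)),
        "max_length": round(max(lengths) * (1 + margin)),
    }
```

Recompute the baseline periodically from recent validated outputs, or it will slowly stop reflecting normal behavior.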

Metric 5: End-to-End Success Rate

This is your business metric. Not “did the AI run,” but “did the full process complete successfully?”

For invoice processing: “Did we extract data, validate it, store it, and send the notification?”

```python
def track_success(invoice_id, steps_completed):
    """Steps: extract → validate → store → notify."""
    success = len(steps_completed) == 4
    failed_step = None if success else [
        s for s in ["extract", "validate", "store", "notify"]
        if s not in steps_completed
    ][0]
    cloudwatch.put_metric_data(
        Namespace="AISystem",
        MetricData=[
            {"MetricName": "EndToEndSuccess", "Value": 1 if success else 0, "Unit": "Count"},
            {
                "MetricName": "FailurePoint",
                "Value": 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Step", "Value": failed_step or "none"}],
            },
        ],
    )
```

Alert if end-to-end success < 98%. This is a customer-impacting metric.

The Dashboard

Set up a CloudWatch dashboard with these metrics:

[Token Costs]
- Avg tokens/invocation (line chart, 30-day trend)
- Avg cost/invocation (line chart)
- Alert threshold (red line at 110% of baseline)

[Validation]
- Validation success rate (gauge, should be 95%+)
- Failed validations by type (bar chart)

[Latency]
- P50, P95, P99 latency (line charts)

[Drift]
- Drift alerts (line chart, spike = problem)

[Business Metrics]
- End-to-end success rate (gauge, should be 98%+)
- Failed invocations (bar chart by step)

[Cost Tracking]
- Total cost this month (counter)
- Daily spend trend (line chart)
- Invocation count (line chart)
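If you would rather provision the dashboard as code than click through the console, CloudWatch accepts a JSON body via `put_dashboard`. A minimal single-widget sketch, where the widget layout, title, region, and dashboard name are all placeholder assumptions:

```python
import json

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Tokens per invocation",
                "metrics": [["AISystem", "TokensPerInvocation"]],
                "stat": "Average",
                "period": 300,
                "region": "us-east-1",  # assumption: set your own region
            },
        }
    ]
}

# cloudwatch.put_dashboard(
#     DashboardName="ai-system-overview",  # placeholder name
#     DashboardBody=json.dumps(dashboard_body),
# )
```

Keeping the body in version control means the dashboard is reproducible and reviewable like any other infrastructure change.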

Alerts (What to Set)

Every metric above gets an alert counterpart. At minimum:

- Average token cost per invocation > 110% of baseline
- Validation success rate < 95%
- P95 latency > 2 seconds
- Drift alerts spiking above their normal background rate
- End-to-end success rate < 98%

The Real Scenario

You’re processing invoices. Everything looks normal. One morning, your dashboard shows:

- Tokens per invocation climbing well above baseline
- Drift alerts firing (responses longer than usual)
- Validation success rate dropping

You check the logs. A client uploaded invoices in a new format (two-page PDFs instead of one-page). Claude is processing more tokens per invoice, generating longer responses, and making more mistakes.

You choose to pre-process the PDFs (split two-page documents) before sending to Claude. Metrics normalize within hours.

Without monitoring, you wouldn’t have noticed for days.

Integration with Datadog

Datadog is richer than CloudWatch, but CloudWatch is free and sufficient for SMBs. If you want to stream to Datadog:

```python
from datadog import initialize, api

options = {"api_key": "...", "app_key": "..."}
initialize(**options)

api.Metric.send(
    metric="aisystem.tokens_per_invocation",
    points=total_tokens,
    tags=["env:prod", "service:invoice-processor"],
)
```

Bottom Line

Track five things from day one:

  1. Token costs
  2. Validation rates
  3. Latency percentiles
  4. Model drift
  5. End-to-end success

Alert on all of these. This is not overkill. This is how you stay ahead of silent degradation.

Get a ready-to-use CloudWatch dashboard template for AI systems

Pre-built dashboard JSON, validation rules, drift detection thresholds, and alert configuration. Copy, paste, deploy in 15 minutes.

- Pre-built CloudWatch dashboard
- Alert thresholds & rules
- Python snippet library

No spam. Unsubscribe anytime.

or

Running AI in production?

I help SMBs build observability-first AI systems on AWS. Let’s make sure you catch degradation before your customers do.

Book Your Free Discovery Call