April 14, 2026 · 8 min read · Monitoring, CloudWatch, AI Ops

Monitoring AI Systems: What to Measure and Why

Your AI system is running. Revenue is flowing. Invoices are processing.

Then, silently, the model’s quality degrades. Responses become shorter. Extraction accuracy drifts from 98% to 87%. Token costs per invocation start creeping up.

You don’t notice for three weeks because you’re not measuring the right things.

Most teams monitor uptime: “Is the system responding?” But for AI, uptime is not enough. The system is up, but it’s broken.

I’m going to walk you through what to measure, why it matters, and how to set it up.

Beyond Uptime: The AI Metrics

Uptime is a binary: working or not. AI systems degrade gradually. They keep running while silently getting worse.

Metric 1: Token Cost Per Invocation

Every time you call Claude, you pay for input tokens + output tokens. If your average token cost is trending up, something is wrong.

Common causes:

- Prompt templates that have grown over time
- Larger input documents (more pages, more context per call)
- The model generating longer, more verbose responses
- Retry logic silently re-sending the same payload

Track this in CloudWatch:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def log_token_cost(input_tokens, output_tokens, invocation_id):
    total_tokens = input_tokens + output_tokens
    # $3 per million input tokens, $15 per million output tokens
    cost = (input_tokens * 3 / 1_000_000) + (output_tokens * 15 / 1_000_000)
    cloudwatch.put_metric_data(
        Namespace="AISystem",
        MetricData=[
            {
                "MetricName": "TokensPerInvocation",
                "Value": total_tokens,
                "Unit": "Count",
            },
            {
                "MetricName": "CostPerInvocation",
                "Value": cost,
                "Unit": "None",
                "Dimensions": [{"Name": "Operation", "Value": "ExtractInvoice"}],
            },
        ],
    )
```

Set up a dashboard to see your average and trend. Alert if average token cost increases >10% from baseline.
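As a sketch, that baseline alert could be wired up with a static-threshold CloudWatch alarm on the cost metric above. The alarm name, period, and evaluation settings below are illustrative assumptions, and the `put_metric_alarm` call is left commented so you can plug in your own client and SNS action:

```python
def cost_alarm_params(baseline_cost, alarm_name="ai-cost-per-invocation-high"):
    """Build put_metric_alarm parameters: fire when the hourly average cost
    per invocation exceeds 110% of a measured baseline.
    baseline_cost and alarm_name are placeholder values, not fixed names."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AISystem",
        "MetricName": "CostPerInvocation",
        "Statistic": "Average",
        "Period": 3600,               # evaluate hourly averages
        "EvaluationPeriods": 3,       # require 3 consecutive breaches
        "Threshold": baseline_cost * 1.10,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(**cost_alarm_params(baseline_cost=0.012))
```

Requiring several consecutive breaches keeps one expensive outlier invoice from paging you at 3 a.m.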

Metric 2: Output Validation Failure Rate

You don’t trust Claude implicitly. You validate the output. “Does it have all required fields? Is the amount a number? Is the date in the right format?”

Track how many outputs fail validation:

```python
from datetime import datetime

def is_valid_date(value):
    """Helper: accept ISO-format dates (YYYY-MM-DD)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

def validate_extraction(response):
    """Validate extracted data."""
    required_fields = ["vendor", "amount", "date"]
    issues = []
    for field in required_fields:
        if field not in response:
            issues.append(f"Missing {field}")
        elif field == "amount" and not isinstance(response["amount"], (int, float)):
            issues.append(f"Amount not numeric: {response['amount']}")
        elif field == "date" and not is_valid_date(response["date"]):
            issues.append(f"Date invalid: {response['date']}")
    success = len(issues) == 0
    cloudwatch.put_metric_data(
        Namespace="AISystem",
        MetricData=[
            {"MetricName": "ValidationSuccess", "Value": 1 if success else 0, "Unit": "Count"}
        ],
    )
    return success, issues
```

Alert if validation success rate drops below 95%. This indicates model drift or input data quality issues.

Metric 3: Latency Percentiles (P50, P95, P99)

How long does an invocation take? Track not just the average, but percentiles:

- P50: the median request
- P95: the threshold your slowest 5% of requests exceed
- P99: the threshold your slowest 1% of requests exceed

Why percentiles matter: Your average might be 500ms. But if P99 is 5 seconds, 1 in 100 requests takes ten times longer than a typical one. That’s a bad experience.
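To make that concrete, here is a tiny nearest-rank percentile sketch; the sample latencies are invented for illustration:

```python
def percentile(values, p):
    """Nearest-rank percentile over a sorted copy (illustration only)."""
    ordered = sorted(values)
    i = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[i]

latencies_ms = [480, 490, 495, 500, 500, 505, 510, 515, 520, 4800]
# The average (~931 ms) is dragged up by one outlier; the median stays at
# ~500 ms, while the high percentiles expose the 4.8 s request.
```

One slow request in ten is invisible in the median but dominates the mean, which is exactly why you watch both ends of the distribution.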

```python
import time

def measure_latency(start_time):
    duration_ms = (time.time() - start_time) * 1000
    cloudwatch.put_metric_data(
        Namespace="AISystem",
        MetricData=[
            {"MetricName": "InvocationLatency", "Value": duration_ms, "Unit": "Milliseconds"}
        ],
    )
```

CloudWatch computes P50, P95, and P99 automatically from the raw values you publish. Set up alarms: “Alert if P95 latency > 2 seconds.”
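A sketch of that P95 alarm follows; the alarm name and evaluation settings are placeholder assumptions. Note that percentile alarms use the `ExtendedStatistic` field rather than `Statistic`:

```python
def p95_latency_alarm_params(threshold_ms=2000):
    """Build put_metric_alarm parameters: fire when P95 invocation
    latency exceeds threshold_ms for several consecutive periods."""
    return {
        "AlarmName": "ai-invocation-latency-p95-high",  # placeholder name
        "Namespace": "AISystem",
        "MetricName": "InvocationLatency",
        "ExtendedStatistic": "p95",   # percentile stats go here, not Statistic
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# cloudwatch.put_metric_alarm(**p95_latency_alarm_params())
```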

Metric 4: Model Drift Detection

Over time, the model’s outputs change. Not catastrophically, but subtly. Responses get shorter. Confidence decreases. Hallucinations increase.

Detect this by comparing output distributions:

```python
def detect_drift(current_response, baseline_stats):
    """Compare current output to historical baseline."""
    current_length = len(current_response.split())
    if current_length < baseline_stats["min_length"]:
        return "DRIFT_SHORT"
    elif current_length > baseline_stats["max_length"]:
        return "DRIFT_LONG"
    else:
        return "OK"
```

Track “drift alerts” as a metric. If they spike, investigate.
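One simple way to produce that `baseline_stats` dict, assuming you have a sample of known-good responses to learn from (the function name and tolerance margin are illustrative):

```python
def build_baseline_stats(good_responses, margin=0.2):
    """Derive length bounds from known-good outputs, widened by a
    tolerance margin so normal variation doesn't trigger drift alerts."""
    lengths = [len(r.split()) for r in good_responses]
    return {
        "min_length": round(min(lengths) * (1 - margin)),
        "max_length": round(max(lengths) * (1 + margin)),
    }
```

Recompute the baseline periodically from recent validated outputs, or it will slowly stop reflecting normal behavior.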

Metric 5: End-to-End Success Rate

This is your business metric. Not “did the AI run,” but “did the full process complete successfully?”

For invoice processing: “Did we extract data, validate it, store it, and send the notification?”

```python
def track_success(invoice_id, steps_completed):
    """Steps: extract → validate → store → notify."""
    success = len(steps_completed) == 4
    failed_step = None if success else [
        s for s in ["extract", "validate", "store", "notify"]
        if s not in steps_completed
    ][0]
    cloudwatch.put_metric_data(
        Namespace="AISystem",
        MetricData=[
            {"MetricName": "EndToEndSuccess", "Value": 1 if success else 0, "Unit": "Count"},
            {
                "MetricName": "FailurePoint",
                "Value": 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Step", "Value": failed_step or "none"}],
            },
        ],
    )
```

Alert if end-to-end success < 98%. This is a customer-impacting metric.

The Dashboard

Set up a CloudWatch dashboard with these metrics:

[Token Costs]
- Avg tokens/invocation (line chart, 30-day trend)
- Avg cost/invocation (line chart)
- Alert threshold (red line at 110% of baseline)

[Validation]
- Validation success rate (gauge, should be 95%+)
- Failed validations by type (bar chart)

[Latency]
- P50, P95, P99 latency (line charts)

[Drift]
- Drift alerts (line chart, spike = problem)

[Business Metrics]
- End-to-end success rate (gauge, should be 98%+)
- Failed invocations (bar chart by step)

[Cost Tracking]
- Total cost this month (counter)
- Daily spend trend (line chart)
- Invocation count (line chart)
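If you would rather provision the dashboard as code than click through the console, CloudWatch accepts a JSON body via `put_dashboard`. A minimal single-widget sketch, where the widget layout, title, region, and dashboard name are all placeholder assumptions:

```python
import json

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Tokens per invocation",
                "metrics": [["AISystem", "TokensPerInvocation"]],
                "stat": "Average",
                "period": 300,
                "region": "us-east-1",  # assumption: set your own region
            },
        }
    ]
}

# cloudwatch.put_dashboard(
#     DashboardName="ai-system-overview",  # placeholder name
#     DashboardBody=json.dumps(dashboard_body),
# )
```

Keeping the body in version control means the dashboard is reproducible and reviewable like any other infrastructure change.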

Alerts (What to Set)

Every metric above gets an alert counterpart. At minimum:

- Average token cost per invocation > 110% of baseline
- Validation success rate < 95%
- P95 latency > 2 seconds
- Drift alerts spiking above their normal background rate
- End-to-end success rate < 98%

The Real Scenario

You’re processing invoices. Everything looks normal. One morning, your dashboard shows:

- Tokens per invocation climbing well above baseline
- Drift alerts firing (responses longer than usual)
- Validation success rate dropping

You check the logs. A client uploaded invoices in a new format (two-page PDFs instead of one-page). Claude is processing more tokens per invoice, generating longer responses, and making more mistakes.

You choose to pre-process the PDFs (split two-page documents) before sending to Claude. Metrics normalize within hours.

Without monitoring, you wouldn’t have noticed for days.

Integration with Datadog

Datadog is richer than CloudWatch, but CloudWatch is free and sufficient for SMBs. If you want to stream to Datadog:

```python
from datadog import initialize, api

options = {"api_key": "...", "app_key": "..."}
initialize(**options)

api.Metric.send(
    metric="aisystem.tokens_per_invocation",
    points=total_tokens,
    tags=["env:prod", "service:invoice-processor"],
)
```

Bottom Line

Track five things from day one:

  1. Token costs
  2. Validation rates
  3. Latency percentiles
  4. Model drift
  5. End-to-end success

Alert on all of these. This is not overkill. This is how you stay ahead of silent degradation.

Get a ready-to-use CloudWatch dashboard template for AI systems

Pre-built dashboard JSON, validation rules, drift detection thresholds, and alert configuration. Copy, paste, deploy in 15 minutes.

- Pre-built CloudWatch dashboard
- Alert thresholds & rules
- Python snippet library

No spam. Unsubscribe anytime.

or

Running AI in production?

I help SMBs build observability-first AI systems on AWS. Let’s make sure you catch degradation before your customers do.

Book Your Free Discovery Call