June 20, 2026 8 min read AI · SRE · Automation

Building AI Systems That Survive Model Updates

Claude gets updated. So does GPT. So does every other model. The last time your model provider shipped a new version, what happened to your system? If you’re like most teams, you either kept running the old model because switching costs time and testing, or you switched and something broke. There’s a better way: build systems that expect models to change.

The Problem: Implicit Model Dependencies

Most teams write code like this:

response = client.messages.create(
    model="claude-3-5-sonnet",
    max_tokens=1024,  # required by the Messages API
    messages=messages
)

What happens when Anthropic releases a new Claude model?

You either:

Keep using the old model (you miss improvements, you’re stuck on an old version).
Switch to the new model (your output format might change, your tests fail, your production breaks).

That’s vendor risk. Your system is tightly coupled to a specific model, but you don’t know it.

The Solution: Version Pinning + Contracts

Here’s the pattern:

1. Version Pinning in Configuration

Never hardcode the model in your code. Put it in configuration, and be explicit about versions.

Bad:

model = "claude-3-5-sonnet" # In code. Hard to change.

Good:

# config.yaml
models:
  invoice_extraction: "claude-3-5-sonnet-20241022"  # Specific version date
  customer_insights: "claude-opus-4-1-20250805"
  metadata_extraction: "claude-3-5-haiku-20241022"

Then in your code:

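import anthropic
import yaml

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("config.yaml") as f:
    config = yaml.safe_load(f)

model = config["models"]["invoice_extraction"]

response = client.messages.create(
    model=model,
    max_tokens=1024,  # required by the Messages API
    messages=messages
)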

When you want to upgrade, you change the config, not the code. You test the new version separately. You roll back if needed.
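Pinning only helps if every entry really is pinned. A minimal sketch of a startup check follows. It assumes pinned snapshot IDs end in an 8-digit date, as the IDs in config.yaml above do; the function name find_unpinned and the regex are illustrative, not part of any SDK.

```python
import re

# Assumption: a pinned snapshot ID ends in an 8-digit date (YYYYMMDD),
# like the IDs in config.yaml. Adjust the pattern for other providers.
PINNED_ID = re.compile(r"^[a-z0-9-]+-\d{8}$")


def find_unpinned(models: dict[str, str]) -> list[str]:
    """Return the task names whose model ID is not pinned to a dated snapshot."""
    return sorted(
        task for task, model_id in models.items()
        if not PINNED_ID.match(model_id)
    )


if __name__ == "__main__":
    models = {
        "invoice_extraction": "claude-3-5-sonnet-20241022",
        "metadata_extraction": "claude-3-5-sonnet",  # alias, not pinned
    }
    unpinned = find_unpinned(models)
    if unpinned:
        print(f"Unpinned model IDs for: {', '.join(unpinned)}")
```

Running a check like this in CI or at startup turns an accidental alias into a failed build instead of a surprise in production.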

2. Output Schema Contracts

Define what valid output looks like. Your model must conform to it.

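from pydantic import BaseModel, ConfigDict, EmailStr  # EmailStr needs the email-validator package
from typing import Optional

class InvoiceData(BaseModel):
    """Contract for invoice extraction output."""
    # Strict: extra fields are rejected (Pydantic v2 style config)
    model_config = ConfigDict(extra="forbid")

    invoice_number: str
    amount: float
    # Not every invoice states these, so they are optional
    vendor_name: Optional[str] = None
    customer_email: Optional[EmailStr] = None
    due_date: Optional[str] = None  # YYYY-MM-DD format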

When the model returns data, validate it:

import logging

from pydantic import ValidationError

logger = logging.getLogger(__name__)

response = client.messages.create(
    model=model,
    max_tokens=1024,  # required by the Messages API
    messages=messages
)

try:
    invoice = InvoiceData.model_validate_json(response.content[0].text)
    # Valid. Use it.
except ValidationError as e:
    # Invalid. Reject it or retry with a corrected prompt.
    logger.error(f"Validation failed: {e}")
    raise

If a new model version returns different formats (extra fields, missing fields, wrong types), your validation catches it immediately. You don’t silently save bad data.
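The "retry with a corrected prompt" path can be sketched as a small loop. This is an illustration under assumptions, not a definitive implementation: call_model and parse are hypothetical hooks you supply (one wraps your API client, the other wraps your schema validation), and parse is expected to raise ValueError on bad output. json.JSONDecodeError is a ValueError, and Pydantic v2's ValidationError should qualify as well since it derives from ValueError.

```python
import json
from typing import Any, Callable


def extract_with_retry(
    call_model: Callable[[str], str],
    parse: Callable[[str], Any],
    prompt: str,
    max_attempts: int = 3,
) -> Any:
    """Call the model until `parse` accepts the output or attempts run out.

    `call_model` and `parse` are hypothetical hooks, not SDK functions.
    """
    last_error = None
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            return parse(raw)
        except ValueError as err:
            last_error = err
            # Feed the error back so the next attempt can correct itself.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {err}. "
                "Reply with valid JSON only."
            )
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    # Simulated model: first reply is chatty, second is valid JSON.
    replies = iter(["Sure! Here is the invoice:", '{"invoice_number": "12345"}'])
    result = extract_with_retry(lambda p: next(replies), json.loads, "Extract the invoice.")
    print(result)
```

Capping the attempts matters: a model version that never produces valid output should fail loudly, not loop and burn budget.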

3. Regression Testing

Test the model’s output on known inputs before you roll out.

from pydantic import ValidationError  # raised by model_validate_json

# Test data: input -> expected output
test_cases = [
    {
        "input": "Invoice #12345 for $1,000 due May 15",
        "expected": {
            "invoice_number": "12345",
            "amount": 1000.0,
            "due_date": "2026-05-15"
        }
    },
    {
        "input": "Acme Corp owes us $500.50, invoice INV-999, email: contact@acme.com",
        "expected": {
            "invoice_number": "INV-999",
            "amount": 500.50,
            "vendor_name": "Acme Corp",
            "customer_email": "contact@acme.com"
        }
    }
]

def test_model(model_name, test_cases):
    """Test model against known inputs."""
    passed = 0
    failed = 0

    for test in test_cases:
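        # In practice, send the same extraction prompt you use in production.
        response = client.messages.create(
            model=model_name,
            max_tokens=1024,  # required by the Messages API
            messages=[{"role": "user", "content": test["input"]}]
        )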

        try:
            result = InvoiceData.model_validate_json(response.content[0].text)

            # Simple comparison: did we get the key fields right?
            if (result.invoice_number == test["expected"]["invoice_number"] and
                abs(result.amount - test["expected"]["amount"]) < 0.01):
                passed += 1
            else:
                failed += 1
                print(f"FAIL: {test['input']}")
                print(f"Expected: {test['expected']}")
                print(f"Got: {result.model_dump()}")
        except ValidationError:
            failed += 1
            print(f"FAIL (validation): {test['input']}")

    print(f"Results: {passed}/{len(test_cases)} passed")
    return failed == 0

# Test the old model and the new one
old_model_ok = test_model("claude-3-5-sonnet-20241022", test_cases)
new_model_ok = test_model("claude-sonnet-4-20250514", test_cases)  # candidate version

if old_model_ok and new_model_ok:
    # Safe to switch
    print("All tests passed. Ready to upgrade.")
else:
    print("Some tests failed. Do not upgrade yet.")

Run this before switching models in production. Catch regressions before they hit customers.
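The comparison above only checks invoice_number and amount. A sketch of a field-by-field check that covers every key in the expected dict is below. The helper name field_mismatches and the 0.01 tolerance are assumptions chosen to match the test above.

```python
import math


def field_mismatches(expected: dict, actual: dict, tolerance: float = 0.01) -> list[str]:
    """Return names of expected fields that are missing or differ in `actual`."""
    mismatched = []
    for name, want in expected.items():
        got = actual.get(name)
        if isinstance(want, float) and isinstance(got, (int, float)):
            # Compare amounts with an absolute tolerance, not exact equality.
            ok = math.isclose(want, got, abs_tol=tolerance)
        else:
            ok = want == got
        if not ok:
            mismatched.append(name)
    return mismatched


if __name__ == "__main__":
    expected = {"invoice_number": "INV-999", "amount": 500.50, "vendor_name": "Acme Corp"}
    actual = {"invoice_number": "INV-999", "amount": 500.5, "vendor_name": "ACME"}
    print(field_mismatches(expected, actual))
```

Inside test_model you could pass result.model_dump() as actual and report the mismatched field names, which makes it easier to see whether a new model version drifts on one field or on all of them.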

4. Gradual Rollout with Canary Metrics

When you’re ready to upgrade:

Deploy the new model to 5% of traffic. (Use feature flags or weighted traffic.)
Monitor for 1 week: error rates, validation failures, response times.
If metrics are good, roll out to 100%.
If metrics degrade, revert immediately.
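If you don't have a feature-flag service, the 5% split can be sketched with a deterministic hash of a request or customer ID. The function pick_model and the placeholder model IDs are hypothetical; the point is that the same ID always lands on the same model, so one customer does not flip between versions.

```python
import hashlib


def pick_model(request_id: str, stable: str, canary: str, canary_percent: int = 5) -> str:
    """Route a fixed share of requests to the canary model, deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable


if __name__ == "__main__":
    # Placeholder IDs; in practice both would come from config.yaml.
    stable, canary = "stable-model-id", "canary-model-id"
    hits = sum(
        pick_model(f"customer-{i}", stable, canary) == canary
        for i in range(10_000)
    )
    print(f"{hits / 100:.1f}% of requests routed to the canary")
```

Reverting is then a config change: set canary_percent to 0 and all traffic returns to the stable model.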

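Example check against CloudWatch metrics:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

# Monitor validation failure rate
# If it spikes from 0.1% to 5%, page the team
threshold = 5  # percent
alarm_topic = "arn:aws:sns:us-east-1:123456789012:model-alerts"  # placeholder ARN
now = datetime.now(timezone.utc)

def metric_sum(metric_name):
    """Sum a metric over the last hour, in 5-minute datapoints."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="YourApp",
        MetricName=metric_name,
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Sum"]
    )
    # The datapoints live under the "Datapoints" key of the response
    return sum(dp["Sum"] for dp in stats["Datapoints"])

failures = metric_sum("ModelValidationFailures")
total_requests = metric_sum("ModelRequests")  # assumed metric name; use your own request counter

if total_requests > 0:
    failure_rate = (failures / total_requests) * 100
    if failure_rate > threshold:
        sns.publish(TopicArn=alarm_topic, Message=f"Model validation spike: {failure_rate:.1f}%")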
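A fixed threshold catches spikes, but for a canary you usually want to compare against the old model over the same window. Here is a minimal sketch of that decision. The function name canary_decision and all thresholds (2x ratio, 0.5% floor, 500 requests minimum) are illustrative assumptions, not recommendations.

```python
def canary_decision(
    baseline_failures: int,
    baseline_total: int,
    canary_failures: int,
    canary_total: int,
    max_ratio: float = 2.0,
    min_requests: int = 500,
) -> str:
    """Return "promote", "hold", or "rollback" for a canary.

    All thresholds here are illustrative assumptions.
    """
    if canary_total < min_requests or baseline_total == 0:
        return "hold"  # not enough traffic to judge
    baseline_rate = baseline_failures / baseline_total
    canary_rate = canary_failures / canary_total
    # Absolute floor so a near-zero baseline does not trigger on noise.
    allowed = max(baseline_rate * max_ratio, 0.005)
    return "rollback" if canary_rate > allowed else "promote"


if __name__ == "__main__":
    # Baseline: 10 failures in 10,000 requests. Canary: 50 in 1,000.
    print(canary_decision(10, 10_000, 50, 1_000))
```

The "hold" outcome is deliberate: at 5% of traffic, a low-volume system may need the full week before the canary has seen enough requests to say anything.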

Practical Timeline

Week 1: New Claude model is released. You see it announced. Keep your current version pinned in config and wait.

Week 2-3: Run regression tests against the new model. 50+ test cases. Real examples from your production data. Do they pass?

Week 4: If tests pass, canary rollout. 5% of traffic to the new model for 1 week. Monitor errors, latency, validation failures.

Week 5: Full rollout. All traffic to the new model. Keep old model config as a rollback option.

Ongoing: Model providers deprecate old versions. When deprecation is announced, run new tests and plan the upgrade.

Cost reality check: Running 50 regression test cases at $0.003 per call = $0.15. Run it weekly and you’re spending $0.60/month to stay safe. Worth it. A single production outage costs more.

The Mindset

Treat model updates like security patches. They’re good, they’re necessary, but they require testing before rollout.

Your system should expect models to change. It should validate every output. It should have a rapid rollback path. Build that, and you’re future-proof.

We’ve implemented this pattern across 20+ production systems. It’s the difference between breaking on updates and thriving through them.
