March 25, 2026 · 12 min read · SRE · AI · Architecture

The SRE Approach to AI: Why Reliability Engineering Is the Missing Piece in Business Automation

The AI industry has a delivery problem. Not a capability problem. Models can do incredible things, but most AI projects fail not because the AI didn't work — they fail because nobody built the operational scaffolding to keep it working. SRE is that scaffolding.

The 90% Nobody Talks About

When people discuss AI automation, they talk about the model. The prompt engineering. The architecture of the decision logic. The benchmarks. But here's the uncomfortable truth: the model is maybe 10% of a production system.

The other 90% is everything else.

Input validation. Error handling. Retry logic. Monitoring. Alerting. Deployment pipelines. Rollback capability. Cost controls. Documentation. Runbooks. The stuff that makes the difference between a project that delivers consistent ROI and one that becomes shelfware gathering dust in some shared drive.

This isn't glamorous. It doesn't make conference talks or Twitter threads. But it's what separates a demo that works on Tuesday from a system that works on Tuesday, Wednesday, and three years from now when you need to debug why last month's output was wrong.

When we look at failed AI projects, almost none failed because the model was bad. They failed because:

- Nobody defined what "working" meant, so nobody noticed when the system quietly stopped doing it
- Nothing monitored the output, so failures were silent until a customer found them
- Costs spiraled with no budget alarms or controls
- A prompt or model change broke production and there was no way to roll back
- Humans ended up babysitting the system instead of being freed by it

These are operational problems, not AI problems. And Site Reliability Engineering has been solving these exact problems for two decades.

SRE Principles Applied to AI

SRE is fundamentally about reliability through engineering. It takes the operational lessons from companies that run planet-scale systems 24/7 and formalizes them into practices that work at any scale. Those principles translate directly to AI automation. Here's how.

1. Define SLOs and Track Error Budgets

An SLO is a concrete commitment about how well your system should perform. For traditional systems, it might be "99.5% of requests complete within 500ms." For AI automation, it's equally concrete: "95% of model responses must parse as valid JSON. P99 latency must stay under 5 seconds. Silent data loss must be less than 0.1%."

Once you've defined what "working" means, you track your error budget. If you've committed to 99% uptime, you have a budget of 7.2 hours of downtime per month. Once you've spent it, you stop deploying changes and focus on stability. This forces a conversation: Is the new feature worth the stability risk?

Most AI projects have never defined an SLO. They just hope everything works. Hope is not a strategy.

The key insight: You can't manage what you don't measure. Define SLOs before you build. Track them continuously. Let them inform your priorities.
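The error-budget arithmetic above can be sketched in a few lines. This is an illustrative helper, not any standard library: the `SLOTracker` name and its methods are assumptions, using the "95% of responses must parse" SLO from the text.

```python
# Hypothetical sketch: tracking an error budget against an AI pipeline SLO.

class SLOTracker:
    """Tracks success rate against a target SLO and the remaining error budget."""

    def __init__(self, target: float):
        self.target = target    # e.g. 0.95 => 95% of responses must be valid
        self.total = 0
        self.failures = 0

    def record(self, ok: bool) -> None:
        self.total += 1
        if not ok:
            self.failures += 1

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left. 1.0 = untouched, <= 0.0 = exhausted."""
        if self.total == 0:
            return 1.0
        allowed = (1.0 - self.target) * self.total  # failures the SLO permits
        return 1.0 - self.failures / allowed if allowed else 0.0

tracker = SLOTracker(target=0.95)
for ok in [True] * 97 + [False] * 3:  # 97 valid responses out of 100
    tracker.record(ok)
# 5 failures allowed, 3 spent -> 40% of the budget remains
```

When `budget_remaining` goes negative, that is the signal the text describes: stop shipping changes and work on stability instead.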

2. Toil Elimination (Not Automation Theater)

Toil is repetitive manual work that doesn't scale and doesn't provide learning. In traditional SRE, toil is running the same runbook every time a service breaks. In AI automation, toil is babysitting the system.

The irony is that most AI projects create toil instead of eliminating it. You get humans manually validating model output. Humans reprocessing failed batches. Humans debugging why JSON parsing failed. Humans checking whether the pipeline completed successfully.

A well-engineered AI system handles all of that automatically. Validation failures route to a dead letter queue. Failed items are automatically retried with exponential backoff. Cost overruns trigger alarms. Output that doesn't conform to the expected schema triggers an incident. The system degrades gracefully instead of catastrophically failing.

The goal isn't to automate everything. It's to design the system so that humans only get paged when something actually requires human judgment.
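The retry-then-dead-letter pattern above can be sketched as follows. This is a minimal illustration, not a specific framework's API: in production the dead letter queue would be something like SQS, here it is a plain list, and `handler` stands in for the AI call plus output validation.

```python
import time

def process_with_retry(item, handler, dead_letters,
                       max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run handler(item), retrying with exponential backoff; dead-letter terminal failures."""
    for attempt in range(max_attempts):
        try:
            return handler(item)
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Terminal failure: route to the DLQ for later inspection,
                # instead of paging a human or silently dropping the item.
                dead_letters.append((item, str(exc)))
                return None
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Injecting `sleep` keeps the backoff testable; a real system would also cap the delay and add jitter to avoid synchronized retries.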

3. Incident Response and Graceful Degradation

Your AI automation will fail. This isn't pessimism — it's statistics. API providers have outages. Rate limits change. Models occasionally produce invalid output. The question isn't whether failure will happen. It's whether you've designed for it.

SRE has a playbook for this. First: have a runbook. When the system alerts, what's the first thing you check? What's the remediation path? What's the escalation procedure? The runbook should be so clear that someone on-call at 2am can execute it without needing to think.

Second: design for graceful degradation. If the AI API is overloaded, don't cascade the failure to your users. Queue the request. Serve a cached response. Route it to a human reviewer. Give downstream systems a clear signal that this particular item didn't complete, so they can handle it gracefully instead of assuming success.
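That fallback chain can be sketched as a few lines of dispatch logic. The function and field names here are illustrative assumptions, not a specific framework's API:

```python
def handle(request, call_model, cache, review_queue):
    """Try the model; on failure serve a cached answer or defer to a human."""
    try:
        return {"status": "ok", "body": call_model(request)}
    except Exception:
        if request in cache:
            return {"status": "cached", "body": cache[request]}
        review_queue.append(request)                 # route to human review
        return {"status": "deferred", "body": None}  # explicit signal, never a fake success
```

The `status` field is the point: downstream systems get a clear signal about what happened instead of being left to assume success.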

Third: version everything. If a model update causes a 15% increase in output validation failures, you need to roll it back in minutes, not hours. That means the previous version is always available. You can A/B test. You can canary new versions to a small subset of traffic before rolling out globally.

4. Change Management and Deployment Discipline

Every prompt change is a deployment. Every model version update is a deployment. Every schema modification is a deployment. Each one can break production just as thoroughly as shipping bad code.

Treat them that way. Version your prompts in Git. Validate changes against your test suite. Deploy to staging first. Canary to 5% of production traffic before going to 100%. Monitor the metrics that matter — validation failures, latency, cost. Roll back if something goes wrong. Do post-mortems on failures to understand the root cause and prevent repetition.

This sounds like overhead. It's actually the difference between a system you can operate with confidence and one you're constantly terrified to touch.
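Canary routing of the kind described above can be as simple as deterministic hashing. A minimal sketch, assuming requests carry a stable string ID:

```python
import hashlib

def pick_version(request_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically send a fixed slice of traffic to the canary version."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "stable"
```

Hashing makes the assignment sticky: the same request ID always sees the same version, which keeps your canary metrics comparable run over run. Rolling back is just setting `canary_fraction` to zero.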

5. Capacity Planning and Cost Control

AI APIs have rate limits. They have costs that scale with usage. Limits and quotas can change with little warning. Most teams have no idea where their throttling thresholds are until they hit them at scale.

Capacity planning means knowing, before you ship, what your peak usage will be. Can your AI provider handle it? Do you need to request a higher rate limit? What's your fallback if you exceed it? What does the cost curve look like at 10x current volume? At 100x?

Cost control means setting budget alarms. The moment your monthly spend hits 80% of your budget, you should know. If it hits 100%, the system should stop invoking the AI API and fail gracefully. Unchecked cost growth has killed more projects than bad models ever have.
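The 80%/100% policy above reduces to a small decision function. A minimal sketch: the thresholds and actions are the ones from the text, the function name and signature are assumptions.

```python
def budget_action(month_to_date_cost: float, monthly_budget: float) -> str:
    """Return the cost-control action for the current spend level."""
    if month_to_date_cost >= monthly_budget:
        return "halt"   # stop invoking the AI API; fail gracefully
    if month_to_date_cost >= 0.8 * monthly_budget:
        return "alert"  # page someone before the budget is gone
    return "ok"
```

In practice this check would run against billing metrics on every invocation or on a short schedule, so an overrun is caught in minutes, not on next month's invoice.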

Rule of thumb: If you can't answer "What's our P99 latency?" and "What's our monthly cost?" without looking it up, your observability isn't mature.

6. Infrastructure as Code Is Non-Negotiable

If your AI automation was deployed by clicking through the AWS console, it's not production-ready. It's not reproducible. When you need to debug a configuration issue, deploy to a second region, or hand off to another team, you'll be reverse-engineering your own setup from memory and screenshots.

Everything must be in code. Terraform. CloudFormation. CDK. Pick one and commit to it. Every Lambda function. Every API Gateway. Every DynamoDB table. Every IAM permission. Every alarm and dashboard. Version it. Code review it. Merge it to main only after tests pass.

The bar: you should be able to destroy your entire environment and recreate it from code in under 30 minutes. If it takes longer, your infrastructure isn't clean enough.

The Three Moons Stack

We've applied these principles to build a reference architecture that works. It's not the only way to do it, but it's a pattern that scales from proof of concept to production and handles the operational requirements that catch most teams off guard.

# Three Moons AI Automation Architecture
# Data flows left to right, with observability and retry logic throughout

┌─────────────────────────────────────────────────────────────────┐
│  INGEST                                                         │
│  API Gateway (with rate limiting, CORS, request validation)     │
└──────────────────┬──────────────────────────────────────────────┘
                   v
┌─────────────────────────────────────────────────────────────────┐
│  PROCESS                                                        │
│  Lambda (invoke AI model, validate output, handle errors)       │
└──────────────────┬──────────────────────────────────────────────┘
             ┌─────┴──────┬──────────┐
             v            v          v
          Success       Error     Invalid
             v            v          v
         DynamoDB      Retry     DLQ + S3
                       Queue    (inspect)
             └──────────┬───────────┘
                        v
┌─────────────────────────────────────────────────────────────────┐
│  OBSERVE                                                        │
│  CloudWatch (metrics, logs, alarms, dashboards)                 │
│  X-Ray (distributed tracing for debugging)                      │
└─────────────────────────────────────────────────────────────────┘

Each component does one thing well:

AWS Lambda for compute. It auto-scales. You pay only for what you use. No servers to manage, patch, or keep alive. Each invocation starts fresh, which eliminates state leakage bugs. If something breaks, the blast radius is one invocation, not an entire server.

API Gateway for ingress. Built-in throttling prevents your costs from spiraling if a customer misconfigures their integration. Built-in request validation catches malformed input before it reaches your Lambda. CORS configuration means you don't get tripped up by browser security.

DynamoDB for state. On-demand billing means no capacity planning. It auto-scales with your traffic. Transactions ensure that concurrent updates don't corrupt data. Time-to-live policies mean you can set and forget data cleanup.

S3 for data landing zones. It's the most durable storage in AWS. Lifecycle policies automatically archive old data to cheaper storage. Versioning means you can recover from mistakes. Access logs give you visibility into who's reading and writing data.

CloudWatch for observability. Custom metrics track what matters: invocation latency, token usage, validation failures, cost. Alarms page you when thresholds are breached. Dashboards give you the 30-second health check. Logs give you the detailed debugging when something goes wrong.

Terraform for everything. Every Lambda function. Every table. Every alarm. Every policy. Version-controlled, reviewable, reproducible.

GitHub Actions for CI/CD. Linting. Testing. Validation. Deploy only when checks pass. Rollback is a single git revert. History is auditable.

Anthropic Claude for inference. Structured outputs mean the JSON you get back actually parses. Tool use means Claude can integrate with your systems natively. Pricing is predictable. Performance is consistent.
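Even with structured outputs, the pipeline validates before trusting. A minimal sketch using only the standard library; the `category` and `confidence` fields are hypothetical stand-ins for whatever schema your pipeline expects:

```python
import json

# Hypothetical expected schema: field name -> required Python type.
REQUIRED_FIELDS = {"category": str, "confidence": float}

def parse_response(raw: str):
    """Parse the model's reply; return the dict, or None if it fails validation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data
```

A `None` here is exactly the "Invalid" branch in the diagram above: the raw payload goes to the DLQ for inspection rather than flowing downstream as if it were good data.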

This stack is boring. That's the point. It works. It scales. It's operationally sound. It's the opposite of building on whatever the newest, trendiest technology is.

What This Means for Your Business

You get AI automation that works on day 30, not just day 1. The demo was cool. But day 30 is when the real value comes — when the system is actually handling your production traffic and saving money.

You get documentation and runbooks. Your team doesn't stay dependent on us to babysit your system. If something breaks at 2am, your on-call engineer can follow the runbook without panic. If something needs to change, they understand how to change it safely.

You get infrastructure-as-code. Everything is in Git. Everything is reviewable. Everything is reproducible. Want to stand up a staging environment? Deploy to a second region? Migrate to a different team? It's code — you can do all of that without calling us.

You get monitoring so you know the system is working without checking manually. Alarms tell you when something's wrong. Dashboards tell you whether it's working. Metrics tell you whether you're getting ROI.

You get a warranty, essentially. We're confident enough in what we built that we'll commit to SLOs. We'll document the failure modes. We'll help you run it. We'll do post-mortems on any incidents. We're not burning bridges after deployment — we're built for the long term.

The Anti-Pattern Checklist

When evaluating an AI consultant or vendor, watch for these red flags:

- No defined SLOs: "it works" with no measurable definition of working
- No monitoring or alerting: you learn about failures from your customers
- Infrastructure deployed by hand through a console, with no code to reproduce it
- No runbooks or documentation, so only the builder can operate the system
- No rollback path for prompt, model, or schema changes
- No cost controls or budget alarms

Any one of these should make you question whether you're getting a production system or a demo that'll need a complete rewrite when it hits production load.

The Conversation That Matters

Most companies approach AI automation backwards. They start with "What's the coolest thing AI can do?" and work backward from there. We do the opposite.

We start with "What's your highest-value manual process? What are you currently doing manually that's costing you money or time? If we automated that, what would you do with the freed-up capacity?"

Then we ask the hard questions: What are the failure modes? What happens if the system gets a piece of bad data? If it's rate-limited? If it produces output that doesn't meet your schema? If cost spirals? If the provider has an outage?

Then we build a system that handles all of those things, deploy it with SRE discipline, document it, hand it off, and stay available to support it.

That's a very different conversation from "Let me build you an impressive AI demo."

Ready to build AI that actually works?

Book a free discovery call. We'll look at your highest-value automation opportunity and show you how SRE discipline transforms AI from a risky experiment into a reliable business system.

Book Your Free Discovery Call