April 21, 2026 · 7 min read · Operations & Reliability

The Runbook: Your AI System’s Most Important Document

It’s 2am. Your invoice processing pipeline stopped working. Invoices are backing up. Customers are annoyed. Your team is in Slack. Someone asks: “What do we do?” If you don’t have a runbook, the answer is: panic and debug live. If you do, the answer is on page 3.

A runbook is the difference between “we hope someone knows what to do” and “we have a plan.”

What Goes in a Runbook

A runbook is not a tutorial. It’s not documentation of how the system works. It’s a decision tree for people operating the system under pressure.

1. System Overview (One Page)

Start with a 5-minute summary:

INVOICE PROCESSING PIPELINE

Purpose: Parse uploaded invoices, extract line items, store in DynamoDB.

Main components:
- S3 bucket: retail-invoices-prod
- Lambda: invoice-parser (runs on S3 PUT events)
- DynamoDB table: invoices_staging
- DynamoDB table: invoices_final
- CloudWatch Logs: /aws/lambda/invoice-parser

Owner: Charles (charles@threemoonsnetwork.net)
Backup: DevOps team (slack: #devops)
On-call: Check PagerDuty rotation

Keep this visible. Print it. Post it in Slack. New team members start here.

2. Dependencies and Integrations

What does this system rely on? What can break it?

EXTERNAL DEPENDENCIES

Anthropic Claude API
- Used for: Document parsing and line-item extraction
- Failure mode: Rate limit exceeded, API down
- Impact: New invoices don't process
- Fallback: Queue to SQS, retry in 1 hour

AWS Lambda
- Used for: Event-driven processing
- Failure mode: Concurrent execution limit hit
- Impact: Invoices process slowly
- Mitigation: Reserved concurrency set to 100

AWS DynamoDB
- Used for: Storing parsed invoices
- Failure mode: Throttled writes (exceeds capacity)
- Impact: Processing slows, might fail
- Mitigation: On-demand billing enabled (auto-scales)

AWS S3
- Used for: Storing raw PDF files
- Failure mode: Bucket misconfigured, access denied
- Impact: Lambda can't read files, crashes
- Prevention: Bucket policy tested quarterly

Each dependency should list: what it does, how it can fail, what the impact is, how you detect it, and what you do about it.
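
The Claude API row above ("queue to SQS, retry in 1 hour") implies a retry-then-queue policy inside the parser itself. Here is a minimal sketch of that policy; `parse_fn` and `enqueue_fn` are hypothetical stand-ins for the real API call and the SQS send, and the names are illustrative, not this pipeline's actual code:

```python
import time

class RateLimitError(Exception):
    """Raised when the upstream API returns a rate-limit response."""

def process_with_fallback(invoice_key, parse_fn, enqueue_fn,
                          max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try to parse; on repeated rate limits, hand off to a retry queue."""
    for attempt in range(max_attempts):
        try:
            return parse_fn(invoice_key)
        except RateLimitError:
            if attempt < max_attempts - 1:
                # Exponential backoff between in-process retries
                sleep(base_delay * 2 ** attempt)
    # Still rate-limited after all attempts: queue for a delayed retry
    # (SQS with a 1-hour delay, per the dependency table above)
    enqueue_fn(invoice_key, delay_seconds=3600)
    return None
```

The `sleep` parameter is injected so the backoff can be disabled in tests; the same shape works whether the queue is SQS, a DynamoDB retry table, or anything else with a delay.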

3. Failure Modes and Detection

What breaks? How do you know?

FAILURE MODE: Lambda timeout (function hits its configured limit)

Detection:
- CloudWatch alarm: "invoice-parser timeout count > 0"
- Slack notification in #alerts
- Logs show: "Task timed out after 900.00 seconds"

Why it happens:
- Stuck waiting for Claude API response
- Large PDF file takes >10 min to parse
- DynamoDB write throttled

How bad is it:
- Severity: HIGH (invoices not processed, backlog grows)
- Customer impact: Invoices appear received but don't appear in system

Immediate action:
1. Check Lambda CloudWatch Logs for errors
2. Check DynamoDB throttling metrics
3. Check Claude API status (status.anthropic.com)
4. If Claude API is up, check whether per-invoice work can be reduced
   (Lambda's timeout is capped at 15 minutes, so it can't be raised further)
5. Notify charles@threemoonsnetwork.net

FAILURE MODE: DynamoDB write errors

Detection:
- Lambda logs show: "ConditionalCheckFailedException"
- CloudWatch alarm: "invoice-parser errors > 5/min"

Why it happens:
- Duplicate invoice IDs (same file processed twice)
- Schema mismatch (new field added, old code doesn't handle it)

How bad is it:
- Severity: MEDIUM (some invoices fail, others succeed)
- Customer impact: Partial data loss, reconciliation issues

Immediate action:
1. Check CloudWatch Logs for the specific error
2. If ConditionalCheckFailedException:
   - Verify idempotency tracking in DynamoDB
   - Check if the file was already processed (look for content_id)
3. If schema error:
   - Check if the DynamoDB table schema matches the deployed code
   - Review recent deployments
   - Roll back if necessary
4. Escalate to charles@threemoonsnetwork.net

Document every critical failure mode. Include how it manifests (what the user sees, what logs show), root causes, severity, immediate steps, and the escalation path.
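
The duplicate-invoice failure mode above hinges on idempotency tracking keyed by content, not filename. One common shape for that check, sketched with an in-memory set standing in for a DynamoDB conditional write (the `content_id` name comes from the runbook; everything else here is illustrative):

```python
import hashlib

def content_id(pdf_bytes: bytes) -> str:
    """Stable ID derived from file content, so re-uploads of the
    same invoice under a new filename still collide."""
    return hashlib.sha256(pdf_bytes).hexdigest()

class DuplicateTracker:
    """In-memory stand-in for a conditional put on a content_id attribute."""

    def __init__(self):
        self._seen = set()

    def mark_if_new(self, cid: str) -> bool:
        """Return True if this content_id hasn't been processed yet."""
        if cid in self._seen:
            # In DynamoDB this is where a conditional write would fail
            # with ConditionalCheckFailedException
            return False
        self._seen.add(cid)
        return True
```

In the real pipeline the set would be the DynamoDB table itself, with a condition expression on the write, so the check and the insert are atomic.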

4. Remediation Steps

What do you actually do when something breaks?

PROCEDURE: Unblock invoice processing after service outage
Assuming: Claude API is back up, no data is corrupted

Steps:

1. Check current queue depth
   Run: aws sqs get-queue-attributes \
          --queue-url <url> \
          --attribute-names ApproximateNumberOfMessages
   If > 1000: continue to step 2. If < 100: manual retries might be faster.

2. Trigger the replay Lambda
   aws lambda invoke \
     --function-name invoice-parser-replay \
     --payload '{"date": "2024-04-01"}' response.json
   This reprocesses all unprocessed invoices from that date forward.

3. Monitor CloudWatch metrics
   Watch: invoice-parser errors, duration, success rate
   For 5 minutes after triggering the replay
   Expected: errors drop to < 1%, success rate > 99%

4. Verify data integrity
   Compare processed count vs. expected count for the period
   If mismatch > 5%, escalate to charles@threemoonsnetwork.net

5. Update the status page
   Slack: "Invoice processing restored, backlog being replayed"
   Update the incident in PagerDuty

Remediation steps should be specific (not “check the logs” but “run this exact command”), testable (someone new should be able to follow them), and safe (no risk of making things worse).
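
Step 1's thresholds are exactly the kind of logic worth encoding so nobody has to re-derive them at 2am. A sketch; note the middle band (100 to 1000 messages) is left to on-call judgment because the procedure only specifies the two ends:

```python
def replay_decision(queue_depth: int) -> str:
    """Mirror the runbook's step-1 thresholds for the replay procedure.

    > 1000 queued: bulk replay is clearly worth it.
    < 100 queued:  manual retries might be faster.
    In between:    the runbook doesn't say; leave it to the on-call.
    """
    if queue_depth > 1000:
        return "trigger-replay"
    if queue_depth < 100:
        return "manual-retry"
    return "on-call-judgment"
```

A helper like this can sit in the same repo as the replay Lambda and be fed the ApproximateNumberOfMessages value from step 1, so the decision is documented in code as well as in the runbook.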

5. Escalation Paths

Who do you call? When?

ESCALATION

Level 1 (You, the on-call engineer):
- Reboot the Lambda (update and redeploy)
- Clear CloudWatch alarms
- Check API status pages
- Expected time to resolution: 15 minutes

Level 2 (Charles):
- Reached if Level 1 doesn't resolve in 15 min
- Email: charles@threemoonsnetwork.net
- Expected time to respond: 15 minutes

Level 3 (Full team):
- Reached if issue is still unresolved after 30 min
- Slack: #incidents
- Page everyone on rotation
- Expected time to respond: 5 minutes

Critical escalation (customer-facing or financial impact):
- Trigger a PagerDuty incident immediately
- Notify customers via the status page
- Stand up an incident bridge
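
Those timings can also be encoded as a tiny helper for alerting tooling, so the escalation decision is mechanical. A sketch, under the assumption (not stated in the runbook) that critical customer-facing issues jump straight to the full-team level:

```python
def escalation_level(minutes_elapsed: float, customer_facing: bool = False) -> int:
    """Map time-since-detection to the escalation levels above.

    Assumption: critical (customer-facing / financial) incidents
    escalate to Level 3 immediately, regardless of elapsed time.
    """
    if customer_facing:
        return 3
    if minutes_elapsed >= 30:
        return 3   # full team in #incidents
    if minutes_elapsed >= 15:
        return 2   # Charles
    return 1       # on-call engineer works the runbook
```

Wiring this into the alerting bot means the "have we been at this 15 minutes yet?" question answers itself.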

6. Contact Information

Keep it simple and accurate. Update it monthly.

CONTACTS

On-call rotation:
- https://pagerduty.com/incidents

Slack channels:
- #alerts (automated)
- #incidents (manual escalation)
- @charles (direct message)

Email: charles@threemoonsnetwork.net

External:
- Anthropic API Support: support@anthropic.com
- AWS Support: your support plan

Using the Runbook

A runbook only works if people actually use it. Make it easy:

  1. Pin it. Slack channel, GitHub wiki, printed on the wall.
  2. Version it. Date each update. Keep it current.
  3. Test it. Quarterly, run through each procedure. Time yourself.
  4. Refine it. After every incident, update the relevant section.

A Runbook Saves Your Sanity

At 2am, a runbook means: 5 minutes to identify the problem, 10 minutes to execute the fix, and everyone knows exactly what to do.

Without one, you’re debugging production while your system is on fire.

Build the runbook before you need it.

Key takeaway: A runbook isn’t documentation — it’s a decision tree for people under pressure. System overview, dependencies, failure modes, remediation steps, escalation paths, and contacts. Build it before 2am. Test it quarterly. Update it after every incident.