April 21, 2026 · 7 min read · Operations & Reliability

The Runbook: Your AI System’s Most Important Document

It’s 2am. Your invoice processing pipeline stopped working. Invoices are backing up. Customers are annoyed. Your team is in Slack. Someone asks: “What do we do?” If you don’t have a runbook, the answer is: panic and debug live. If you do, the answer is on page 3.

A runbook is the difference between “we hope someone knows what to do” and “we have a plan.”

What Goes in a Runbook

A runbook is not a tutorial. It’s not documentation of how the system works. It’s a decision tree for people operating the system under pressure.

1. System Overview (One Page)

Start with a 5-minute summary:

INVOICE PROCESSING PIPELINE

Purpose: Parse uploaded invoices, extract line items, store in DynamoDB.

Main components:
- S3 bucket: retail-invoices-prod
- Lambda: invoice-parser (runs on S3 PUT events)
- DynamoDB table: invoices_staging
- DynamoDB table: invoices_final
- CloudWatch Logs: /aws/lambda/invoice-parser

Owner: Charles (charles@threemoonsnetwork.net)
Backup: DevOps team (slack: #devops)
On-call: Check PagerDuty rotation

Keep this visible. Print it. Post it in Slack. New team members start here.

2. Dependencies and Integrations

What does this system rely on? What can break it?

EXTERNAL DEPENDENCIES

Anthropic Claude API
- Used for: Document parsing and line-item extraction
- Failure mode: Rate limit exceeded, API down
- Impact: New invoices don't process
- Fallback: Queue to SQS, retry in 1 hour

AWS Lambda
- Used for: Event-driven processing
- Failure mode: Concurrent execution limit hit
- Impact: Invoices process slowly
- Mitigation: Reserved concurrency set to 100

AWS DynamoDB
- Used for: Storing parsed invoices
- Failure mode: Throttled writes (exceeds capacity)
- Impact: Processing slows, might fail
- Mitigation: On-demand billing enabled (auto-scales)

AWS S3
- Used for: Storing raw PDF files
- Failure mode: Bucket misconfigured, access denied
- Impact: Lambda can't read files, crashes
- Prevention: Bucket policy tested quarterly

Each dependency should list: what it does, how it can fail, what the impact is, how you detect it, and what you do about it.
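
The Claude API row above ("queue to SQS, retry in 1 hour") implies a retry-then-queue policy inside the parser itself. Here is a minimal sketch of that policy; `parse_fn` and `enqueue_fn` are hypothetical stand-ins for the real API call and the SQS send, and the names are illustrative, not this pipeline's actual code:

```python
import time

class RateLimitError(Exception):
    """Raised when the upstream API returns a rate-limit response."""

def process_with_fallback(invoice_key, parse_fn, enqueue_fn,
                          max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try to parse; on repeated rate limits, hand off to a retry queue."""
    for attempt in range(max_attempts):
        try:
            return parse_fn(invoice_key)
        except RateLimitError:
            if attempt < max_attempts - 1:
                # Exponential backoff between in-process retries
                sleep(base_delay * 2 ** attempt)
    # Still rate-limited after all attempts: queue for a delayed retry
    # (SQS with a 1-hour delay, per the dependency table above)
    enqueue_fn(invoice_key, delay_seconds=3600)
    return None
```

The `sleep` parameter is injected so the backoff can be disabled in tests; the same shape works whether the queue is SQS, a DynamoDB retry table, or anything else with a delay.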

3. Failure Modes and Detection

What breaks? How do you know?

FAILURE MODE: Lambda timeout (function hits its configured limit)

Detection:
- CloudWatch alarm: "invoice-parser timeout count > 0"
- Slack notification in #alerts
- Logs show: "Task timed out after 900.00 seconds"

Why it happens:
- Stuck waiting for Claude API response
- Large PDF file takes >10 min to parse
- DynamoDB write throttled

How bad is it:
- Severity: HIGH (invoices not processed, backlog grows)
- Customer impact: Invoices appear received but don't appear in system

Immediate action:
1. Check Lambda CloudWatch Logs for errors
2. Check DynamoDB throttling metrics
3. Check Claude API status (status.anthropic.com)
4. If Claude API is up, check whether per-invoice work can be reduced
   (Lambda's timeout is capped at 15 minutes, so it can't be raised further)
5. Notify charles@threemoonsnetwork.net

FAILURE MODE: DynamoDB write errors

Detection:
- Lambda logs show: "ConditionalCheckFailedException"
- CloudWatch alarm: "invoice-parser errors > 5/min"

Why it happens:
- Duplicate invoice IDs (same file processed twice)
- Schema mismatch (new field added, old code doesn't handle it)

How bad is it:
- Severity: MEDIUM (some invoices fail, others succeed)
- Customer impact: Partial data loss, reconciliation issues

Immediate action:
1. Check CloudWatch Logs for the specific error
2. If ConditionalCheckFailedException:
   - Verify idempotency tracking in DynamoDB
   - Check if the file was already processed (look for content_id)
3. If schema error:
   - Check if the DynamoDB table schema matches the deployed code
   - Review recent deployments
   - Roll back if necessary
4. Escalate to charles@threemoonsnetwork.net

Document every critical failure mode. Include how it manifests (what the user sees, what logs show), root causes, severity, immediate steps, and the escalation path.
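
The duplicate-invoice failure mode above hinges on idempotency tracking keyed by content, not filename. One common shape for that check, sketched with an in-memory set standing in for a DynamoDB conditional write (the `content_id` name comes from the runbook; everything else here is illustrative):

```python
import hashlib

def content_id(pdf_bytes: bytes) -> str:
    """Stable ID derived from file content, so re-uploads of the
    same invoice under a new filename still collide."""
    return hashlib.sha256(pdf_bytes).hexdigest()

class DuplicateTracker:
    """In-memory stand-in for a conditional put on a content_id attribute."""

    def __init__(self):
        self._seen = set()

    def mark_if_new(self, cid: str) -> bool:
        """Return True if this content_id hasn't been processed yet."""
        if cid in self._seen:
            # In DynamoDB this is where a conditional write would fail
            # with ConditionalCheckFailedException
            return False
        self._seen.add(cid)
        return True
```

In the real pipeline the set would be the DynamoDB table itself, with a condition expression on the write, so the check and the insert are atomic.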

4. Remediation Steps

What do you actually do when something breaks?

PROCEDURE: Unblock invoice processing after service outage
Assuming: Claude API is back up, no data is corrupted

Steps:

1. Check current queue depth
   Run: aws sqs get-queue-attributes \
          --queue-url <url> \
          --attribute-names ApproximateNumberOfMessages
   If > 1000: continue to step 2. If < 100: manual retries might be faster.

2. Trigger the replay Lambda
   aws lambda invoke \
     --function-name invoice-parser-replay \
     --payload '{"date": "2024-04-01"}' response.json
   This reprocesses all unprocessed invoices from that date forward.

3. Monitor CloudWatch metrics
   Watch: invoice-parser errors, duration, success rate
   For 5 minutes after triggering the replay
   Expected: errors drop to < 1%, success rate > 99%

4. Verify data integrity
   Compare processed count vs. expected count for the period
   If mismatch > 5%, escalate to charles@threemoonsnetwork.net

5. Update the status page
   Slack: "Invoice processing restored, backlog being replayed"
   Update the incident in PagerDuty

Remediation steps should be specific (not “check the logs” but “run this exact command”), testable (someone new should be able to follow them), and safe (no risk of making things worse).
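
Step 1's thresholds are exactly the kind of logic worth encoding so nobody has to re-derive them at 2am. A sketch; note the middle band (100 to 1000 messages) is left to on-call judgment because the procedure only specifies the two ends:

```python
def replay_decision(queue_depth: int) -> str:
    """Mirror the runbook's step-1 thresholds for the replay procedure.

    > 1000 queued: bulk replay is clearly worth it.
    < 100 queued:  manual retries might be faster.
    In between:    the runbook doesn't say; leave it to the on-call.
    """
    if queue_depth > 1000:
        return "trigger-replay"
    if queue_depth < 100:
        return "manual-retry"
    return "on-call-judgment"
```

A helper like this can sit in the same repo as the replay Lambda and be fed the ApproximateNumberOfMessages value from step 1, so the decision is documented in code as well as in the runbook.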

5. Escalation Paths

Who do you call? When?

ESCALATION

Level 1 (You, the on-call engineer):
- Reboot the Lambda (update and redeploy)
- Clear CloudWatch alarms
- Check API status pages
- Expected time to resolution: 15 minutes

Level 2 (Charles):
- Reached if Level 1 doesn't resolve in 15 min
- Email: charles@threemoonsnetwork.net
- Expected time to respond: 15 minutes

Level 3 (Full team):
- Reached if issue is still unresolved after 30 min
- Slack: #incidents
- Page everyone on rotation
- Expected time to respond: 5 minutes

Critical escalation (customer-facing or financial impact):
- Trigger a PagerDuty incident immediately
- Notify customers via the status page
- Stand up an incident bridge
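
Those timings can also be encoded as a tiny helper for alerting tooling, so the escalation decision is mechanical. A sketch, under the assumption (not stated in the runbook) that critical customer-facing issues jump straight to the full-team level:

```python
def escalation_level(minutes_elapsed: float, customer_facing: bool = False) -> int:
    """Map time-since-detection to the escalation levels above.

    Assumption: critical (customer-facing / financial) incidents
    escalate to Level 3 immediately, regardless of elapsed time.
    """
    if customer_facing:
        return 3
    if minutes_elapsed >= 30:
        return 3   # full team in #incidents
    if minutes_elapsed >= 15:
        return 2   # Charles
    return 1       # on-call engineer works the runbook
```

Wiring this into the alerting bot means the "have we been at this 15 minutes yet?" question answers itself.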

6. Contact Information

Keep it simple and accurate. Update it monthly.

CONTACTS

On-call rotation:
- https://pagerduty.com/incidents

Slack channels:
- #alerts (automated)
- #incidents (manual escalation)
- @charles (direct message)

Email: charles@threemoonsnetwork.net

External:
- Anthropic API Support: support@anthropic.com
- AWS Support: your support plan

Using the Runbook

A runbook only works if people actually use it. Make it easy:

  1. Pin it. Slack channel, GitHub wiki, printed on the wall.
  2. Version it. Date each update. Keep it current.
  3. Test it. Quarterly, run through each procedure. Time yourself.
  4. Refine it. After every incident, update the relevant section.

A Runbook Saves Your Sanity

At 2am, a runbook means: 5 minutes to identify the problem, 10 minutes to execute the fix, and everyone knows exactly what to do.

Without one, you’re debugging production while your system is on fire.

Build the runbook before you need it.

Key takeaway: A runbook isn’t documentation — it’s a decision tree for people under pressure. System overview, dependencies, failure modes, remediation steps, escalation paths, and contacts. Build it before 2am. Test it quarterly. Update it after every incident.