April 15, 2026 · 8 min read · Handoff, Consulting, SRE

How I Structure Client Projects for Zero-Downtime Handoff

Here's what happens when an AI consultant leaves without a plan: six months later, something breaks and nobody knows how to fix it. This is the structure that prevents that.

The Handoff Problem

You've spent three months building an AI system for a client. The system works. It's in production. Then you hand it over and disappear—and the client's team has no idea how to operate, debug, or extend it. Within a month, something fails. There's no runbook. The dashboard is cryptic. The code comments assume knowledge you forgot to transfer. The client now has an expensive system they don't understand.

This happens because most AI consulting projects treat handoff as an afterthought. It shouldn't be. Handoff is the product.

Core principle: A new engineer on the client's team should be able to fork the repo, read the documentation, run the setup script, and understand the entire system—architecture, monitoring, deployment, and troubleshooting—in 30 minutes.

I structure every client engagement around this. Here's the framework.

Phase 1: Project Repository Structure

The repo is the single source of truth. Everything about the system lives here. This is not optional.

```
project/
├── README.md            # What this is, how to run it, 5-min quick start
├── ARCHITECTURE.md      # System design, data flows, decision rationale
├── SETUP.md             # Environment setup, dependencies, credentials
├── RUNBOOK.md           # Operational procedures, troubleshooting, escalation
├── DEVELOPMENT.md       # Contributing guidelines, local dev setup
├── src/
│   ├── handlers/        # Lambda handlers, API routes
│   ├── agents/          # AI agent code
│   ├── utils/           # Shared utilities, SDK wrappers
│   └── config.py        # Environment-based configuration
├── terraform/
│   ├── main.tf
│   ├── monitoring.tf
│   ├── variables.tf
│   └── outputs.tf
├── .github/workflows/   # CI/CD pipelines
├── docker-compose.yml   # Local dev environment
├── tests/
├── .env.example
└── .gitignore
```

Every file has a purpose. Every directory is named clearly. Most importantly: no surprises. The client's team should land on a file path and understand immediately why it exists and what it does.
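For `src/config.py`, environment-based configuration can be as simple as a dataclass that fails fast at startup. This is a minimal sketch, assuming variable names like `INVOICES_TABLE` and `ALERT_TOPIC_ARN` (hypothetical; match them to your own `.env.example`):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """All runtime configuration, sourced from the environment."""
    aws_region: str
    invoices_table: str
    alert_topic_arn: str

    @classmethod
    def from_env(cls) -> "Config":
        # Fail fast at startup if a required variable is missing,
        # rather than deep inside a handler at 2 a.m.
        def require(name: str) -> str:
            value = os.environ.get(name)
            if not value:
                raise RuntimeError(f"Missing required env var: {name}")
            return value

        return cls(
            aws_region=require("AWS_REGION"),
            invoices_table=require("INVOICES_TABLE"),
            alert_topic_arn=require("ALERT_TOPIC_ARN"),
        )
```

The point of the frozen dataclass is that configuration is read once, validated once, and then immutable everywhere else in `src/`.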

Phase 2: Documentation That Actually Gets Read

Most consulting documentation is written for you—the consultant. It's thorough but dense. It never gets read after handoff. I flip this: I write for the client.

README.md

The first 100 words must answer: "What does this system do, and why should I care?" Then the quick start. The quick start is 5 minutes. No tangents. Copy-paste-run.
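One way to keep the quick start honestly copy-paste-run is to end it with a preflight script that compares the live environment against `.env.example`. A sketch, assuming a hypothetical `scripts/check_env.py`:

```python
import os
import sys


def missing_vars(env_example: str, environ: dict) -> list[str]:
    """Return the variables listed in .env.example that are unset."""
    required = []
    for line in env_example.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            required.append(line.split("=", 1)[0])
    return [name for name in required if not environ.get(name)]


def main() -> None:
    with open(".env.example") as f:
        missing = missing_vars(f.read(), dict(os.environ))
    if missing:
        sys.exit(f"Missing env vars: {', '.join(missing)}")
    print("Environment looks good.")
```

If the last quick-start step is `python scripts/check_env.py`, a new engineer finds out in seconds whether their setup is complete, instead of discovering a missing credential mid-deploy.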

ARCHITECTURE.md

A diagram (Mermaid, ASCII, whatever). Then a one-paragraph description of the data flow. Then the detailed breakdown: what each component does, why it exists, and what trade-offs were made. Include links to monitoring dashboards, GitHub issues, and relevant AWS docs. Make it clear where the AI logic lives and what it's actually doing.

RUNBOOK.md

This is the survival guide. It answers: "Something is broken. What do I do?" It should include:

  • How to check system health
  • Common failure modes and fixes
  • How to read logs and metrics
  • When to escalate and who to contact
  • How to do a quick manual run if automated execution fails
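The "look for patterns in failed items" step lends itself to a tiny helper the runbook can point at. A sketch, assuming items carry `status` and `error_reason` fields (hypothetical names; the boto3 scan that produces them is not shown):

```python
from collections import Counter


def error_patterns(items: list[dict]) -> Counter:
    """Count error_reason values among failed items."""
    return Counter(
        item.get("error_reason", "unknown")
        for item in items
        if item.get("status") == "ERROR"
    )


# Example: feed it the items from a table scan
sample = [
    {"status": "ERROR", "error_reason": "rate_limited"},
    {"status": "ERROR", "error_reason": "rate_limited"},
    {"status": "OK"},
    {"status": "ERROR", "error_reason": "bad_pdf"},
]
print(error_patterns(sample).most_common())
# → [('rate_limited', 2), ('bad_pdf', 1)]
```

A one-line summary like this turns "scan the table and squint" into a concrete, repeatable diagnostic step.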

Here's an example structure:

```markdown
# RUNBOOK: Invoice Processing System

## Quick Health Check

1. Check CloudWatch Dashboard:
   - Navigate to [Invoice Processing Dashboard]
   - Look for red alerts (error rates > 5%)
   - Check Lambda cold start times (should be < 2s)
2. Check Recent Logs:
   - `aws logs tail /aws/lambda/process-invoice --follow`
   - `aws logs tail /aws/lambda/extract-text --follow`
3. Check DynamoDB:
   - Go to DynamoDB → invoices table
   - Scan for items with status='ERROR'
   - Look at the 'error_reason' field for patterns

## Common Issues

### Error Rate Spiking (> 10%)

- **Cause:** Claude API rate limiting or document parsing failures
- **Fix:**
  1. Check CloudWatch for 429 errors
  2. If rate limited: wait 5 minutes, then manual retry
  3. If parsing errors: check document format (PDF vs image vs scanned)

### Cold Start Delay (> 5s)

- **Cause:** Lambda initialization on first invoke
- **Fix:**
  1. Trigger manual invocation:
     `aws lambda invoke --function-name process-invoice /dev/null`
  2. Keep function warm with a CloudWatch scheduled event (1 min after deploy)

### DynamoDB Throttling

- **Cause:** Write throughput exceeded
- **Fix:**
  1. Go to DynamoDB → invoices table → Billing
  2. Increase on-demand capacity or reserved capacity
  3. Escalate to SRE team

## Manual Invoke (If Automation Fails)

    aws lambda invoke \
      --function-name process-invoice \
      --payload '{"bucket":"invoices","key":"document.pdf"}' \
      response.json
    cat response.json

## Escalation

- **API errors (5xx):** Check CloudWatch Insights, escalate to Charles
- **Data corruption:** Do not attempt recovery, contact immediately
- **Cost spike (> $500/day):** Trigger auto-rollback, contact team
```

Phase 3: Monitoring as Code

A hand-built dashboard is useless the day someone misconfigures or deletes it in the CloudWatch console. Monitoring must live in Terraform. Every critical metric gets an alert. Every alert has a runbook link.

resource "aws_cloudwatch_dashboard" "invoice_processing" { dashboard_name = "invoice-processing" dashboard_body = jsonencode({ widgets = [ { type = "metric" properties = { metrics = [ ["AWS/Lambda", "Duration", { stat = "Average" }], ["AWS/Lambda", "Errors", { stat = "Sum" }], ["AWS/Lambda", "ConcurrentExecutions", { stat = "Maximum" }], ] period = "60" stat = "Average" region = var.aws_region title = "Lambda Health" } }, { type = "log" properties = { query = "fields @timestamp, @message, error_type | stats count() by error_type" region = var.aws_region title = "Error Breakdown (Last Hour)" } } ] }) } resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" { alarm_name = "invoice-processing-errors" comparison_operator = "GreaterThanThreshold" evaluation_periods = "2" metric_name = "Errors" namespace = "AWS/Lambda" period = "300" statistic = "Sum" threshold = "5" alarm_actions = [aws_sns_topic.alerts.arn] alarm_description = "RUNBOOK: See docs/RUNBOOK.md - Error Rate Spiking section" }

The key: alarm descriptions link directly to the runbook section. When an alert fires, the client can click through to the fix.

Phase 4: CI/CD as a Safety Net

The client's team should never need to think about deployment. Every push to main automatically runs tests, builds, deploys, and verifies health. They focus on code. The pipeline handles the rest.

```yaml
name: Deploy Invoice Processor

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: pytest tests/ --cov=src
      - run: ruff check src/

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform -chdir=terraform plan
      - run: terraform -chdir=terraform apply -auto-approve
      - name: Health Check
        run: curl -f https://api.example.com/health || exit 1
```

(Note that `ruff check` runs without `--fix` in CI: auto-fixing files in a throwaway runner accomplishes nothing, and the pipeline should fail loudly instead.)

Phase 5: Live Training and Knowledge Transfer

Documentation is passive. Training is active. Two weeks before handoff, I schedule four 1-hour sessions with the client's team:

  1. Session 1: Architecture Deep Dive — Why the system is structured the way it is. Where the AI logic lives. Trade-offs made.
  2. Session 2: Day-to-Day Operations — How to deploy, how to read logs, how to add features, how to handle common failures.
  3. Session 3: Hands-On Lab — They deploy a change. They break something intentionally and fix it. They check the dashboard and interpret metrics.
  4. Session 4: Q&A and Offboarding Checklist — Anything they didn't understand. Walking through the final handoff checklist item by item.

These sessions are recorded. The client gets a private video library they can reference later.

Phase 6: 30-Day Warranty and Offboarding

The project doesn't end on day 90. It ends 30 days after handoff. During those 30 days, the client can escalate directly—I respond within 24 hours. This is not "support." It's warranty. The system should work. If it doesn't, we fix it together, and I teach them how to prevent it next time.

On day 30, we have a final offboarding call:

  • Review the system's health metrics over 30 days. Are error rates stable? Is performance acceptable? Are costs in line with projections?
  • Did they deploy any changes? What worked? What was hard?
  • Are there any lingering questions or concerns?
  • What's the escalation path if something breaks after day 30? (Typically: check the runbook, escalate to their team lead, or contact me for consulting retainer.)

The guarantee: After day 30, the client's team should be able to operate the system without me. Not perfectly—they'll grow into it—but independently and safely.

The Handoff Checklist

Use this to verify you're ready before handing off:

  • Repository structure is clean and documented
  • README.md has a 5-minute quick start that actually works
  • ARCHITECTURE.md explains the system and its design decisions
  • RUNBOOK.md covers common failure modes and fixes
  • All code has comments explaining the "why," not just the "what"
  • Monitoring dashboard is in Terraform and linked from the runbook
  • All alerts have descriptions that link to runbook sections
  • CI/CD pipeline runs tests, lints, and deploys automatically
  • Credentials are environment-based, never in code
  • .env.example file shows all required variables
  • All four training sessions completed and recorded
  • Client team has deployed at least one change and fixed one "broken" system
  • 30-day warranty period is defined and communicated
  • Offboarding checklist is created and scheduled for day 30
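Several of these items can be verified mechanically before the final call. A sketch of a repo sanity check, assuming the directory layout from Phase 1 (adjust `REQUIRED` to your actual structure):

```python
from pathlib import Path

REQUIRED = [
    "README.md", "ARCHITECTURE.md", "SETUP.md", "RUNBOOK.md",
    "DEVELOPMENT.md", ".env.example", "terraform/monitoring.tf",
    ".github/workflows",
]


def handoff_gaps(repo: Path) -> list[str]:
    """Return required paths that are missing from the repo."""
    return [p for p in REQUIRED if not (repo / p).exists()]
```

Running this as the last step of CI means the checklist's documentation items can never silently regress after you're gone.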

Why This Works

This structure works because it removes ambiguity. The client's team doesn't have to guess how the system works. They don't have to reverse-engineer your code. Every decision is documented. Every operation is scripted. Every failure has a known response.

It also builds confidence. After day 30, they know the system because they've operated it. They've broken it and fixed it. They've deployed changes. They're not dependent on you. The system is theirs.

And it protects you. A clean handoff is a completed project. You can move on to the next client knowing this one doesn't need you.