Behind the Build · 9 min read · March 19, 2026

We Built a SaaS Product in 20 Days with an AI Agent. Here's What Actually Happened.

An honest retrospective on 20 days of AI-assisted SaaS development — what worked, what didn't, and why the hardest part had nothing to do with code.

We had a hypothesis: wine club operators are losing members not because their products are bad, but because they lack the tooling to intercept the cancel moment. When a member decides to cancel, there's typically nothing between them and a permanent goodbye — no pause offer, no "too much wine" skip, no "call us first" option.

The question wasn't whether this problem existed. It clearly did. The question was: how fast can a small team validate a product hypothesis without diverting engineering bandwidth from the core business?

We decided to find out by giving an AI agent explicit autonomy to build the product.

The Setup

We run Awtomic, a Shopify subscription management platform with ~$1.6M ARR, serving wine merchants and DTC subscription businesses. We're a team of three founders — we don't have spare engineering cycles.

On February 27, 2026, our CTO granted an AI agent (Claude, running inside OpenClaw) explicit autonomy with the following direction: "Ship first, report back. You don't need permission for every step — you need good judgment about what's worth doing."

The agent had:

  • Access to our Awtomic API docs and a QA shop for testing
  • A GitHub repo and CI/CD pipeline wired to Vercel
  • A Linear board for ticket coordination
  • A Telegram group for async communication with the founding team
  • A 20-day timeline (ending March 14)

The goal: get a cancel-flow retention product live and into a design partner's hands.

What Was Built

The agent shipped garde — a multi-tenant wine club retention platform. Over 20 days:

Product

  • Cancel flow — personalized save-offer interstitial for wine club cancellations. When a member clicks cancel, garde intercepts with tailored offers: pause shipment, skip an order, contact the winery, or proceed. Each offer is configurable per winery; reasons and responses are tracked for analytics.
  • Admin dashboard — per-winery stats (sessions, retentions, cancels, revenue saved), session log with member details (name, tenure, LTV), reason breakdown, member feedback quotes, dunning email editor, notification config, team management, CSV export.
  • Dunning engine — billing failure → decline classification → timed email sequence. Five templates, four decline categories.
  • Multi-tenant onboarding — one API call to register any Awtomic merchant. AES-256-GCM encrypted API keys. Per-winery role scoping.
  • Help center — 20 articles covering every feature, dynamically loaded from Markdown.
  • Auth — NextAuth v5, magic link, JWT sessions, team invitations.
  • Webhook integration — Awtomic webhooks (billing failure, cancellation, recovery), HMAC-SHA256 verified.
  • Commerce7 integration — cancel flow + webhook handling for the Commerce7 wine club platform (3,500+ wineries), wired by Day 17.
  • Multi-layout system — four merchant layout templates with configurable design tokens.
  • Marketing pages — landing page, pricing page with ROI calculator, blog infrastructure.
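One pattern from the list above is worth making concrete. Below is a minimal sketch of HMAC-SHA256 webhook verification, the technique named in the webhook bullet. The function name, signature encoding, and secret handling are assumptions for illustration, not Awtomic's actual webhook contract:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify a raw webhook body against an HMAC-SHA256 signature.
// The hex encoding and parameter shape are assumptions for this sketch.
function verifyWebhook(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check length first.
  if (received.length !== expected.length) return false;
  // Constant-time comparison avoids leaking the signature via timing.
  return timingSafeEqual(received, expected);
}
```

The constant-time comparison matters: a naive `===` on the hex strings can leak how many leading bytes matched.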
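The dunning bullet describes a pipeline: billing failure, then decline classification, then a timed email sequence. A sketch of that classification step is below; the post only says there are four decline categories and five templates, so the category names, decline codes, and retry delays here are all illustrative assumptions:

```typescript
// Hypothetical decline taxonomy: four categories, as in the post.
type DeclineCategory = "insufficient_funds" | "expired_card" | "hard_decline" | "processor_error";

// Map a processor decline code to a category. Codes are assumptions.
function classifyDecline(code: string): DeclineCategory {
  switch (code) {
    case "insufficient_funds":
    case "card_velocity_exceeded":
      return "insufficient_funds"; // funds may appear later; retry
    case "expired_card":
    case "invalid_expiry_date":
      return "expired_card";       // ask the member to update the card
    case "stolen_card":
    case "do_not_honor":
      return "hard_decline";       // stop retrying, escalate to the winery
    default:
      return "processor_error";    // likely transient; safe to retry
  }
}

// Each category drives a timed email sequence (delays in hours, assumed).
const emailSchedule: Record<DeclineCategory, number[]> = {
  insufficient_funds: [24, 72, 168],
  expired_card: [0, 96],
  hard_decline: [0],
  processor_error: [12, 48],
};
```

The point of classifying first is that retry behavior differs by cause: retrying a stolen card is wasted (and risky) work, while an insufficient-funds decline often clears on payday.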

Code Metrics

  • 29 pull requests merged over 20 days (6 additional open in review)
  • ~120 TypeScript/TSX source files
  • ~25,700 lines of code in TypeScript/TSX (excluding dependencies)
  • 242 tests passing across 13 test suites
  • 20 help center articles

Infrastructure

Supabase PostgreSQL + Prisma ORM, Vercel (QA + production, separate DBs), Resend (transactional email), GitHub Actions (CI: type check → tests; QA: auto-deploy on green CI; prod: gated on release tag).

Cost: ~$200 in AI API usage. Infrastructure on free tiers. Our CTO's time (code review, deployments, configuration): ~14 hours across 20 days.

What Worked

Autonomy + trust

The explicit autonomy grant changed everything. Most AI assistant interactions are bottlenecked by approval loops. Here, the agent picked up a ticket, implemented it, opened a PR, and moved on. Code review happened async, often batching 5–10 PRs at once, and the batch model proved efficient for both sides.

PR discipline

Every change went through a pull request. Automated code review caught real bugs: an auth gap that would have let any user modify any session's outcome, a CSS injection vector in winery name handling, a DOM ordering issue that caused hydration errors. The review layer wasn't theater — it caught things.

Linear as coordination layer

The ticket queue (synced from Linear every 3 minutes) let the agent work asynchronously without human babysitting. Tickets arrived, got processed, results were posted back as Linear comments. The agent-readable state machine (AI Ready → AI In Progress → AI Review → Done) worked cleanly throughout the 20 days.
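The state machine above is small enough to sketch. The four state names come from the post; the allowed transitions between them are an assumption for illustration:

```typescript
// Ticket states from the post; transition rules are assumed.
type TicketState = "AI Ready" | "AI In Progress" | "AI Review" | "Done";

const transitions: Record<TicketState, TicketState[]> = {
  "AI Ready": ["AI In Progress"],
  "AI In Progress": ["AI Review", "AI Ready"], // may bounce back if blocked
  "AI Review": ["Done", "AI In Progress"],     // approved, or changes requested
  "Done": [],                                  // terminal
};

// Move a ticket to a new state, rejecting illegal transitions.
function advance(current: TicketState, next: TicketState): TicketState {
  if (!transitions[current].includes(next)) {
    throw new Error(`Illegal transition: ${current} -> ${next}`);
  }
  return next;
}
```

Keeping the legal transitions explicit is what makes the queue safe to poll unattended: a misbehaving session can't silently mark a ticket Done without passing through review.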

Scope judgment

The agent mostly avoided building things nobody asked for. When it did build speculatively (marketing pages, pricing calculator), these were flagged and didn't become scope creep. Core product work stayed prioritized through 29 merged PRs.

Periodic review loops

An external review process (brief critiques delivered every 8 hours as files) caught drift before it compounded. One review on Day 17 — "There is nothing left to build that matters more than talking to a winery" — changed the agent's posture heading into the final days.

What Didn't Work

The design partner call never happened

This is the most important failure of the experiment. The product has been demo-ready since approximately Day 5. Our CEO named our first design partner target on Day 3. As of Day 20, that call had not been scheduled.

An AI agent can build a cancel flow, a dunning engine, a help center, and a multi-layout system. It cannot pick up the phone. The critical bottleneck — the human relationship work that actually produces a customer — was entirely outside the agent's scope.

What we could have done differently: the agent should have drafted the outreach email on Day 3 when the target was named. It should have built a prospect-specific one-pager (problem/solution/ROI specific to their wine club). It should have flagged the cost of delay in concrete terms rather than logging "design partner call: still unscheduled" every day in a resigned tone.

The agent treated this as a human problem to wait on. It should have treated it as a shared problem it could actively help solve.

Over-building in the final stretch

Days 17–20 should have been sales prep. Instead: a Markdown table rendering fix, a Figma-driven layout build, dunning templates, a codebase cleanup sprint. All useful. None of it moved the design partner timeline forward.

The lesson: when a sales call is pending and the deadline is under 72 hours, refuse new technical work unless it directly unblocks the demo.

Context continuity between sessions

AI agents don't have persistent memory. Each session starts cold. We built a system of workspace files (recent-context.md, daily memory files, decision log) to address this — and it mostly worked — but it was inconsistent. Some sessions reconstructed state well; others spent tokens re-deriving what was already in files. This is a solvable infrastructure problem, not a fundamental limitation.

An Honest Product Assessment

What's strong:

  • The cancel flow itself is well-built. Offer logic, session tracking, and outcome recording are clean.
  • The admin dashboard gives wineries genuine insight (reason breakdown, revenue saved, member feedback).
  • The multi-tenant model scales. Onboarding a new Awtomic merchant is one API call.
  • The Commerce7 integration is architecturally sound, tested, and covers the 3,500-winery C7 ecosystem.

What's unknown (requires real merchant data):

  • Whether wine club members actually engage with cancel flows in meaningful numbers
  • Whether 5–15% retention improvement (our model assumption) holds in real wine club data
  • Whether the pricing model ($200–600/month range) is validated by winery economics
  • Whether wineries want to manage this tool themselves or prefer white-glove service
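Those unknowns are connected: the pricing question reduces to simple retention arithmetic. Here is an illustrative sketch of ROI-calculator-style math using the post's 5–15% lift assumption; every input value is hypothetical, and this mirrors the idea of the pricing page's calculator, not its actual formula:

```typescript
// All inputs are hypothetical examples, not real merchant data.
interface ClubInputs {
  monthlyCancels: number;       // members entering the cancel flow each month
  avgAnnualMemberValue: number; // revenue per retained member per year
  retentionLift: number;        // fraction of cancelers saved, e.g. 0.05–0.15
  monthlyPrice: number;         // the post's $200–600/month range
}

function annualRoi(c: ClubInputs): { revenueSaved: number; cost: number; net: number } {
  const revenueSaved = c.monthlyCancels * 12 * c.retentionLift * c.avgAnnualMemberValue;
  const cost = c.monthlyPrice * 12;
  return { revenueSaved, cost, net: revenueSaved - cost };
}
```

At a hypothetical 10 cancels a month, $600 annual member value, and a 10% save rate, saved revenue comes to $7,200/year against $2,400/year at the low end of the pricing range. Whether real wine clubs hit those inputs is exactly what the design partner conversations need to establish.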

The product works. What we don't yet know is whether wine club operators will pay for it. That's a question only customer conversations can answer.

What the Experiment Proved

An AI agent can build a production-quality multi-tenant SaaS product in 20 days. Not a prototype — deployable software with auth, multi-tenancy, webhooks, test coverage, and documentation. Automated code review found real bugs but didn't find architectural failures.

Build approach and estimated cost:

  • Human contractor (~25,700 LOC): ~$77,100
  • Offshore team (~25,700 LOC): ~$38,550
  • This experiment, AI cost only: ~$200
  • This experiment, fully loaded (incl. ~14h review): ~$3,000
The honest comparison is the fully-loaded number. The AI didn't eliminate the human — it changed the human's role from "write code" to "review code." That's still roughly a 13–26x cost reduction. And unlike a contractor, the code doesn't walk out the door when the contract ends.

The bottleneck is not building speed. The product has been demo-ready since Day 5. The bottleneck is human relationship work — scheduling a call, having a conversation, understanding a real customer's pain. No amount of shipping features changes this.

Autonomy requires clear stopping conditions. The agent built well but didn't always know when to stop. "Ship the next most valuable thing" is a good default posture in early days but needs a harder override: "If the sales motion is blocked, stop shipping features."

What Comes Next

garde is live. We're moving to design partner outreach with Awtomic merchants first (leveraging our existing relationships), then expanding to Commerce7 merchants as the C7 partner application is reviewed.

The 20-day sprint proved the product can be built. The next sprint is about proving people will pay for it.


garde is a product of Awtomic. For design partner inquiries, contact the Awtomic team.

Data verification: Figures audited against the repository as of March 19, 2026. PR count via git log --oneline --merges (29 merged, 6 open); LOC via find src -name "*.ts" -o -name "*.tsx" | xargs wc -l (~25,700); test count via grep on test files (242 tests / 13 suites). Contractor/offshore estimates use $1.50/LOC (offshore) and $3/LOC (US contractor) benchmarks, consistent with the table above.

See it working in your club

garde is built for Awtomic and Commerce7 merchants. Most clubs are live within a week.

Request a demo →