4/5/2026
We Used Two Claude Instances to Architect a Payment System. Here's What the Reviewer Caught.

"I told one AI to design the system, and another to destroy it. The second one was better at its job."
— A project manager watching two robots argue
Our team is modernizing a legacy platform. The biggest gap is payments — and historically, it's also been our most problematic area. Someone had to volunteer to build a replacement: a payment processor that improves scale, reliability, and performance over what we have today. A world-class system, on a tight timeline.
That someone was me.
I needed help. Not just with the work — with the thinking. Payment systems are where architecture mistakes are most expensive and most embarrassing: multiple external integrations with different protocols, failure modes that cross service boundaries, financial accuracy requirements where "close enough" is a bug, regulatory constraints that aren't optional, concurrency issues that only manifest under real load. A single-pass architecture review would catch the obvious things. I needed something more adversarial than that.
So I ran two Claude instances: one as architect, one as adversarial reviewer. Then, once the loop converged, I ran one final human-directed audit with explicit stakes. Five rounds of review. 75 total findings. Zero lines of application code written yet. Here's what happened.
The Problem with Single-Instance Review
If you ask the same Claude instance to review its own architecture, you get politeness. You get "this looks solid, here are some minor suggestions." The instance has context on why it made each decision, and that context creates a sycophantic bias — it's hard to adversarially attack your own reasoning when you remember building it.
The two-instance approach removes that. The reviewer has no emotional investment in the architecture. It reads the documents cold, the same way a new team member would. It doesn't know why the architect chose X over Y — it just sees whether X is fully specified.
That's exactly how the biggest gap got caught.
The Catch That Made Us Sweat
By the end of the architecture phase, we had 17 documents covering provider abstraction, idempotency, event schemas, auto-pay, PCI compliance, and more. The architect instance had designed a clean Provider interface with Charge, Void, Refund, CreateToken. We had event schemas for payments-charge-captured, payments-refund-completed. The system handled the happy path beautifully.
Then the reviewer read the existing system analysis and asked a simple question: what happens when a payment bounces?
Not a refund — a return. An NSF check. A chargeback from Visa 60 days later. A bank reversal that we don't initiate and can't predict.
The architect had designed refunds (customer asks for money back) but had completely missed returns (bank takes the money back). These are fundamentally different operations:
- A refund is something we initiate. We call provider.Refund(), publish an event, done.
- A return is something that happens to us. It arrives via webhook, buried in a daily batch settlement file, or when a clerk notices a bounced check. It reverses the original payment, reopens bills that were marked paid, and may assess a fee.
The reviewer's exact finding:
"We mention refunds but never designed the return flow — a returned payment (bounced check, NSF, chargeback) is different from a refund. It reverses the original payment, reopens bills, and optionally applies a return fee. This is a core payment service operation."
There was no ReturnNotification struct. No BatchReturnProvider interface. No state machine for disputes. No consideration of the three different channels returns arrive through (webhooks, batch files, manual staff action). No return fee logic. No bill-reopening event.
This wasn't a minor gap. If we'd started coding without this, we would have discovered it the first time a customer's ACH payment bounced — in production, with real money, with no way to handle it except manual database surgery.
After the fix, returns-voids-refunds.md became one of the most detailed documents in the repo: three detection channels, a canonical ReturnNotification struct, a BatchReturnProvider interface, a full database schema with lookup tables, a state machine for chargebacks (pending, processed, failed, disputed), and auto-routing logic deciding void vs. refund based on settlement state.
The Exact Workflow
Here's what we actually did, reconstructed from the commit history.
Phase 0: Discovery (Architect Instance)
One Claude instance analyzed the existing payment system and the current platform, then produced:
- legacy-payment-system.md — complete teardown of the old system (12+ tables, 9 payment types, multiple flows)
- current-state.md — what the platform already handles
- gap-analysis.md — feature parity matrix with phased priorities
- cross-team-feature-parity.md — 16 features that belong in other services
Phase 1: Architecture (Architect Instance)
Same instance designed the system across 13 architecture documents: provider abstraction (Strategy + Registry pattern), three-layer idempotency (Redis + DB + provider reference ID), event schema (Pub/Sub topics, message schemas, outbox pattern), auto-pay (fan-out, exponential backoff, balance caching), PCI tokenization (SAQ A-EP, AES-256-GCM, key rotation), database standards, testing strategy, configuration UI, and more.
Phase 2: Review Round 1 (Reviewer Instance)
Fresh instance. Read all 17 documents. Produced 21 findings. This round caught missing features — entire areas the architect hadn't designed.
- 3 high-severity gaps: returns, void/refund routing, fee engine
- 4 medium-severity gaps: IVR, receipts, settlement reconciliation, hosted form lifecycle
- 14 features correctly identified as belonging in other services
Phase 3: Remediation + Review Round 2
Architect addressed all 21 findings. Reviewer re-read everything. 21 new findings — but a completely different category. Not missing features. Missing specifications in features that existed:
- The shared.Money type (int64 cents, rounding rules, parsing) was referenced everywhere but never defined
- The cross-service failure loop was undefined: what happens when the downstream service can't allocate a captured payment?
- Idempotency payload hashing had no canonicalization spec (JSON field ordering affects SHA-256)
- Redis could silently evict idempotency keys under memory pressure
- Two concurrent reversal requests could race past validation
- The circuit breaker was a stated design goal with no specification
- Auto-pay would silently skip charges forever if the balance cache stopped updating
Phase 4: Remediation + Review Round 3
Architect addressed all 21. Added circuit-breaker.md, expanded database-standards.md with the Money type, added table schemas. Reviewer re-read everything. 11 findings — smaller, more focused: naming inconsistencies, a circuit breaker threshold specified two conflicting ways, infrastructure tables used everywhere but never schema-defined.
This round caught consistency issues introduced by the updates themselves.
Phase 5: Remediation + Review Round 4
Architect addressed all 11. Reviewer re-read everything. 1 finding — a low-severity documentation note about two internal infrastructure tables that correctly deviated from the project schema template but didn't explain why. Without a comment, a future developer would likely "fix" them to match the standard and add unnecessary overhead. The fix: a single explanatory comment.
After Round 4, the reviewer's own summary:
"After 4 review rounds (21 + 21 + 11 + 1 = 54 findings), we've converged. The single remaining finding is a documentation clarification, not a design issue. The severity distribution tells the story."
Phase 6: Final Human-Directed Audit
This is where the process changed. The Claude-vs-Claude loop had converged — but I wasn't done. I opened a fresh session and gave it a different kind of prompt. Not "review the architecture." Something with explicit stakes:
This repo contains architectural documentation for a new payment system.
I want you to assume the role of a critical software architect. This
system has to be right, it has to be thorough, it has to be complete.
Bad payment systems are not an option.
Upon assuming this persona, review this repo, all markdown and other
files, for accuracy and completeness. The goal of this process is to
have code-ready instructions when we're done and walk through the full
implementation. We will then begin testing.
I cannot stress enough — this documentation has to lead to a correct
implementation. Help me get there with one big, final review of our
design plan.
The reviewer came back with 21 new findings — including 6 critical ones. Its opening verdict:
"The architecture is 90% code-ready. The remaining 10% is missing table definitions, naming inconsistencies, a formal API spec, and a handful of contradictions that would cause real bugs if they reached production."
The 6 critical findings included a Redis eviction policy contradiction between two documents (noeviction in one, allkeys-lru in another — a double-charge bug waiting to happen), a missing CREATE TABLE for the most-referenced table in the entire system, invalid PostgreSQL primary key definitions on partitioned tables (a primary key on a partitioned table must include every partition key column, so these would fail on CREATE TABLE), and four additional tables listed in the architecture summary with no schema definitions anywhere.
These were real. They would have caused real problems. And they survived four prior review rounds.
The Full Pattern
| Round | Findings | What It Caught |
| ----------- | -------- | --------------------------------------------- |
| Review 1 | 21 | Missing features |
| Review 2 | 21 | Missing specifications |
| Review 3 | 11 | Inconsistencies |
| Review 4 | 1 | Documentation clarity |
| Final audit | 21 | Schema gaps, contradictions, missing API spec |
The final audit didn't undo the prior work — the reviewer explicitly noted that the foundation was strong, all 54 prior findings were confirmed resolved, and the core design decisions were sound. What it found was a different layer: the kind of gaps that only surface when you read 30+ documents with fresh eyes and ask "could a developer actually build this tomorrow?"
Other Catches Worth Mentioning
The cross-service black hole. Payment captured successfully at the provider. Event published. The downstream service tries to allocate, fails (customer was deleted between payment initiation and capture). Now what? The payment exists at the provider, the money moved, but no one in the system knows it failed. The fix: the downstream service publishes an allocation-failed event, the payment service flags the transaction, a daily reconciliation job alerts ops on anything stuck.
The Money type that nobody defined. Every document said "cents-based arithmetic" and NUMERIC(12,2) but no document specified the actual Go type, how to parse "149.99" from a provider response, or what rounding rule to use for percentage-based fee calculations. In a payment system. Without the reviewer asking "where is this type defined?", three developers would have implemented it three different ways. The fix: a concrete spec — int64 cents internally, round half-up after full calculation, no intermediate rounding, no floating point in business logic.
The silent auto-pay failure mode. Auto-pay skips when the customer balance cache is stale (>7 days old). The architect designed this as a safety measure. The reviewer asked: what if the upstream service stops publishing balance events entirely? Every auto-pay run would skip every customer, silently, forever. The fix: two alerts — a charges_skipped_stale_cache counter with a >5% threshold, and a zero-updates-in-24-hours canary.
What "Done" Actually Looks Like
The Claude-vs-Claude loop converges naturally, and the convergence signal is real: severity distribution shifting downward round over round, findings caused by the fixes themselves rather than structural gaps, the reviewer agreeing with decisions rather than challenging them.
But "converged" and "done" are not the same thing.
The loop is optimized for the kind of adversarial pressure it was designed for — catching what the architect missed, finding underspecified features, surfacing inconsistencies. What it's less optimized for is the final pass: reading 30+ documents as a unified whole and asking whether a developer could actually build from them tomorrow.
That requires a different prompt. Explicit stakes. A persona with authority to say "this is not code-ready." The final audit found 21 things the loop didn't — not because the loop failed, but because it was doing a different job.
The full process:
- Claude-vs-Claude loop until severity distribution converges to documentation-only findings
- Human-directed final audit with explicit implementation stakes and a fresh session
Both steps matter. Neither replaces the other.
What to Steal
The Prompt for the Architect Instance
You are designing a [service name] for [platform].
Context:
- [What the service does]
- [What already exists that it integrates with]
- [Key constraints: compliance, performance, etc.]
Discovery documents are in [location]. Read them all.
Design the architecture. For each component, produce a document covering:
- The decision and alternatives considered
- Interface contracts (API schemas, message contracts)
- Database schema (CREATE TABLE with constraints and indexes)
- Event schemas (topic names, message payloads)
- Error handling and failure modes
- How it integrates with adjacent services
Be specific enough that a developer could implement from your docs
without asking clarifying questions.
The Prompt for the Reviewer Instance
Read all the markdown files in this repository. These are architecture
documents for a [service type].
Conduct a deep review. For each finding:
- State what's wrong or missing
- Cite the specific document and section
- Explain why it matters (what breaks if unfixed)
- Rate severity: Critical (fix before coding), High (fix during Phase 1),
Medium (fix before production), Low (nice to have)
Look specifically for:
- Features mentioned in discovery/existing system analysis but not designed
- Interface methods referenced but never specified
- Types used but never defined
- Failure modes with no recovery path
- Cross-service interactions with no error handling
- Inconsistencies between documents
- Implicit assumptions that should be explicit decisions
Do not be polite. If something is missing, say it's missing.
The Final Audit Prompt
This repo contains architectural documentation for a [system type].
Assume the role of a critical software architect. This system has to
be right, thorough, and complete.
Review all documentation for accuracy and completeness. The goal is
code-ready instructions. A developer should be able to implement
directly from these docs without clarifying questions.
For each finding:
- State exactly what's missing or wrong
- Cite the specific document and section
- Explain the consequence if unfixed
- Rate severity: Critical (blocks implementation), High, Medium, Low
Do not be polite. If something is missing or contradictory, say so.
The Iteration Loop
- Architect instance builds the architecture docs
- Reviewer instance reads everything cold, produces findings
- Architect addresses all findings
- Repeat until findings are documentation-only — no behavioral changes
- Run the final audit with explicit implementation stakes in a fresh session
- Address all findings, then implement
Five rounds worked for us (75 findings total: 21 + 21 + 11 + 1 + 21). Your mileage will vary, but the pattern should hold: gaps, then specs, then inconsistencies, then clarity, then a final implementation-readiness check that catches what the loop missed.
The value isn't in either instance being smarter than a human architect. The value is in the adversarial separation — and in knowing when to change the nature of the adversary.
The loop uses a standard reviewer. The final audit uses a critical architect who has been told the stakes explicitly. These produce different findings because they're asking different questions. Use both.
You could do this with two humans and a senior architect doing a final review. But the Claude version runs in hours instead of weeks, costs dollars instead of salaries, and produces a written artifact at every step that you can review, diff, and commit.
We haven't written a line of application code yet. But we have 30+ architecture documents, 5 review documents, and 75 findings all resolved.
I'll never be the developer who isn't afraid they forgot something. That's not how I'm wired. But I feel better about this than I would have — and more importantly, I won't spend weeks in testing discovering edge cases that should have been covered from day one. Those 75 findings would have been production incidents, not review comments. That's the difference.