Designing Money Movement That Survives Failure

Ask an engineer to "move money from A to B" and they'll write you a function in five minutes. Debit one account, credit the other, call the payment rail, return success. It works in the demo. It works in staging. It works for the first ten thousand transactions in production. Then a connection drops at exactly the wrong moment, and you have a customer whose money has left their account, a rail that may or may not have received the instruction, and a support ticket that says "where is my $4,000." That moment, not the happy path, is the actual job. Everything else is plumbing.

I've come to believe the single most useful reframe for anyone building a money-moving product is this: the happy path is trivial and the entire engineering effort is failure handling. If your design starts from "what happens when the transfer succeeds," you've already lost. You start from the failures and work backward. Here's how I think about that.

Money movement is a set of states, not an action

The most common and most expensive mistake is modeling a payment as a verb, something that happens, rather than a noun with a lifecycle. "Send payment" is not an event. It's a state machine that lives for seconds, days, or in the case of disputes, months.

A transfer is never simply "done." It's initiated, then authorized, then captured, then settled, and from almost any of those states it can branch to failed, reversed, or returned. Each of those is a real, persisted state with explicit allowed transitions. You authorize, then you capture; you don't jump from initiated to settled because some webhook fired out of order. The state machine enforces that. An ACH debit can return five business days after it appeared to settle, so your model has to have a state for "looks settled but still reversible." Pretending that window doesn't exist is how you let customers spend money that's about to vanish.

Once you model states explicitly, three things you previously ignored become unavoidable, and that's the point:

Timeouts. Every non-terminal state needs a clock. A payment stuck in "authorized" for six hours is not a payment, it's an incident. The state machine should know that and escalate it.
Recovery transitions. What happens to a capture that fails? It can't just disappear. It needs a defined path: retry, reverse the authorization, or flag for manual resolution.
Terminality. Which states are final and which are not. "Settled" feels final until a chargeback proves it wasn't.
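
To make that concrete, here is a minimal sketch of a transfer state machine with explicit transitions, a clock on each non-terminal state, and a terminality check. The state names follow the lifecycle described above. The transition table, the timeout values, and the Transfer class itself are my own illustrative assumptions, not a standard; real deadlines depend on the rail.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class State(Enum):
    INITIATED = "initiated"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"
    REVERSED = "reversed"
    RETURNED = "returned"


# Allowed transitions (an assumed table for illustration).
# SETTLED is deliberately not terminal: a return or reversal can still arrive.
ALLOWED = {
    State.INITIATED: {State.AUTHORIZED, State.FAILED},
    State.AUTHORIZED: {State.CAPTURED, State.REVERSED, State.FAILED},
    State.CAPTURED: {State.SETTLED, State.REVERSED, State.FAILED},
    State.SETTLED: {State.RETURNED, State.REVERSED},
    State.FAILED: set(),
    State.REVERSED: set(),
    State.RETURNED: set(),
}

# Illustrative deadlines for some non-terminal states; values are made up.
TIMEOUTS = {
    State.INITIATED: timedelta(minutes=5),
    State.AUTHORIZED: timedelta(hours=6),
    State.CAPTURED: timedelta(days=3),
}


class IllegalTransition(Exception):
    pass


@dataclass
class Transfer:
    transfer_id: str
    state: State
    entered_at: datetime  # when the transfer entered its current state

    def transition(self, new_state: State, now: datetime) -> None:
        # Reject anything the table does not explicitly allow,
        # e.g. INITIATED -> SETTLED from an out-of-order webhook.
        if new_state not in ALLOWED[self.state]:
            raise IllegalTransition(f"{self.state.value} -> {new_state.value}")
        self.state = new_state
        self.entered_at = now

    def is_terminal(self) -> bool:
        return not ALLOWED[self.state]

    def is_stuck(self, now: datetime) -> bool:
        # A state with no configured deadline is never reported as stuck.
        limit = TIMEOUTS.get(self.state)
        return limit is not None and now - self.entered_at > limit
```

In a real system the transition would be persisted atomically with the ledger write, and is_stuck would feed an escalation job rather than being called ad hoc.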

This is also why the ledger has to be the source of truth underneath all of this. The state machine governs what can happen next; the ledger records what actually happened, immutably, in double-entry. They are different jobs and you need both.
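
A rough sketch of the ledger half, assuming amounts in integer minor units and a signed-amount convention where the postings of each entry must sum to zero. The Ledger, Entry, and Posting names are hypothetical, and an in-memory list stands in for what would be an append-only table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Posting:
    account: str
    amount: int  # minor units (cents); signed, so one entry nets to zero


@dataclass(frozen=True)
class Entry:
    entry_id: str
    transfer_id: str
    postings: Tuple[Posting, ...]


class Ledger:
    """Append-only: entries are added, never edited or deleted."""

    def __init__(self) -> None:
        self._entries = []

    def post(self, entry: Entry) -> None:
        # Double-entry invariant: every entry must balance.
        if sum(p.amount for p in entry.postings) != 0:
            raise ValueError("unbalanced entry")
        self._entries.append(entry)

    def balance(self, account: str) -> int:
        # Balances are derived from history, not stored and mutated.
        return sum(
            p.amount
            for e in self._entries
            for p in e.postings
            if p.account == account
        )
```

A correction is a new balancing entry that references the same transfer_id, so the history of what happened and what fixed it stays intact.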

The ambiguity that defines the discipline

Here is the scenario that every payments engineer eventually meets: you debit the customer, you call the rail, and the connection drops before you get a response. Did it go through or not? You genuinely do not know. The rail might have processed it and the acknowledgment got lost on the way back. Or the request never landed. From your side, both look identical.

If you retry blindly, you might double-send. If you assume failure and refund, you might refund a payment that actually completed, and now you're out the money. There is no clever way to guess correctly. The only way out is to make the operation safe to repeat and to have a process that finds the truth later.

That's two mechanisms working together:

Idempotency. Every money-moving request carries a client-generated idempotency key. Both the rail and your own system treat a repeated key as "you already asked me this, here's the same answer," not as a new instruction. This is what makes "just retry it" a safe sentence instead of a reckless one. Stripe, Adyen, and most modern payment APIs support idempotency keys precisely because this ambiguity is universal. If you build internal money movement, you build the same guarantee yourself.
Reconciliation. Idempotency lets you retry safely; reconciliation tells you what's actually true. You pull the rail's record of reality (settlement files, transaction reports) and diff it against your own ledger, every single day. Anything that doesn't match is a stuck or ambiguous transaction that gets resolved deterministically. The dropped-connection payment from this morning gets classified tonight, not by a guess but by checking the authoritative source.
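
Here is a minimal sketch of the idempotency half as you might build it internally. TransferService and the send callable are hypothetical names; the in-memory dict stands in for a durable store with a unique constraint on the key, and the sketch ignores concurrency.

```python
class IdempotencyConflict(Exception):
    """Same key reused with a different request body."""


class TransferService:
    def __init__(self, send):
        # `send(key, request)` is an assumed callable that performs the actual
        # movement and forwards the key so the downstream rail can dedupe too.
        self._send = send
        self._results = {}  # key -> (request fingerprint, stored response)

    def submit(self, idempotency_key, request):
        # Fingerprint assumes a flat dict of hashable, comparable values.
        fingerprint = tuple(sorted(request.items()))
        if idempotency_key in self._results:
            stored_fingerprint, response = self._results[idempotency_key]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(idempotency_key)
            return response  # replay the original answer, no new side effect
        # If send raises (for example on a dropped connection) nothing is
        # stored, so the caller can retry with the same key.
        response = self._send(idempotency_key, request)
        self._results[idempotency_key] = (fingerprint, response)
        return response
```

The design choice that matters is that a retry after an ambiguous failure reuses the same key all the way down, so whichever layer actually processed the first attempt can recognize the second.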

Idempotency keeps you from making the ambiguity worse. Reconciliation resolves it. Neither alone is sufficient.
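
The daily diff itself can be sketched in a few lines. This assumes both sides have already been reduced to a mapping from a shared rail reference to an amount in minor units; the function name and the three bucket names are my own, and parsing an actual settlement file is out of scope.

```python
def reconcile(ledger_rows, rail_rows):
    """Diff our ledger against the rail's report.

    Both arguments map rail reference -> amount in minor units.
    Returns the references that need resolution, grouped by kind.
    """
    missing_at_rail = sorted(ref for ref in ledger_rows if ref not in rail_rows)
    missing_in_ledger = sorted(ref for ref in rail_rows if ref not in ledger_rows)
    amount_mismatch = sorted(
        ref
        for ref in ledger_rows
        if ref in rail_rows and ledger_rows[ref] != rail_rows[ref]
    )
    return {
        "missing_at_rail": missing_at_rail,      # we think we sent it, rail has no record
        "missing_in_ledger": missing_in_ledger,  # rail moved money we never recorded
        "amount_mismatch": amount_mismatch,      # both know it, amounts disagree
    }
```

An empty result in all three buckets is the only outcome that counts as reconciled; each non-empty bucket maps to a defined resolution path rather than a judgment call.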

Holds, settlement, and the fact that timelines don't agree

Money does not move on one clock. A card authorization places a hold instantly: the funds are earmarked but nothing has actually moved. Settlement happens later, sometimes a day, sometimes more, and it can settle for a different amount than the hold (think of the tip added after a restaurant pre-auth, or a partial capture). Your system has to hold these two facts simultaneously: "we reserved X" and "we eventually moved Y," and reconcile the gap.

This is where naive systems quietly corrupt their own books. They treat the hold as the transaction, the customer's available balance drifts away from reality, and the divergence only surfaces weeks later in a reconciliation report nobody was reading. Design for the two timelines from day one. A hold and a settlement are related but distinct ledger events, and the relationship between them is itself a thing you track.
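
One way to keep the two timelines apart, sketched under simplifying assumptions: a single account, integer minor units, and one settlement per hold. Hold and Account are hypothetical names. The hold reduces available balance without touching posted balance, and settlement records its own amount next to the held amount so the gap is visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hold:
    hold_id: str
    held: int                      # amount reserved at authorization
    settled: Optional[int] = None  # amount actually moved, once known


class Account:
    def __init__(self, posted: int) -> None:
        self.posted = posted  # money that has actually moved
        self.holds = {}

    def available(self) -> int:
        # Only holds that have not settled yet reduce what can be spent.
        open_holds = sum(h.held for h in self.holds.values() if h.settled is None)
        return self.posted - open_holds

    def place_hold(self, hold_id: str, amount: int) -> None:
        if amount > self.available():
            raise ValueError("insufficient available balance")
        self.holds[hold_id] = Hold(hold_id, amount)

    def settle(self, hold_id: str, amount: int) -> None:
        # The settled amount may differ from the hold (tip, partial capture).
        hold = self.holds[hold_id]
        if hold.settled is not None:
            raise ValueError("hold already settled")
        hold.settled = amount
        self.posted -= amount
```

For example, a 50.00 hold that settles at 60.00 leaves a Hold with held=5000 and settled=6000, which is exactly the relationship the reconciliation report needs to see.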

Refunds fail, and disputes arrive from the past

Two failure classes that teams almost always under-build.

Refunds are money movement too, which means refunds can fail. The destination account is closed. The original card is cancelled. The rail rejects it. A refund is not a database flag that says "refunded: true"; it's a new transaction with its own state machine, its own retries, its own failure modes. If you modeled the original payment as a state machine, good, do exactly the same for the money going back. Teams that treat refunds as a trivial inverse get a special kind of pain: a customer who was told they were refunded and was not.

Chargebacks and disputes arrive weeks later and have to find their original transaction. A dispute on a payment from six weeks ago lands in your system as a new event referencing a transaction that, as far as your happy-path code is concerned, was finished and forgotten. This is why nothing is ever truly deleted and why every transaction carries durable, queryable identifiers that the rail also knows. When a dispute webhook arrives, you must be able to locate the original, attach the dispute, move that transaction into a disputed state, and run the response workflow, all without a human grepping through logs. If your data model can't answer "what payment does this dispute belong to" in one query, you don't have a payments system, you have a liability.
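
As a sketch of what "one query" can look like, here is a small SQLite example. The table layout, column names, and the attach_dispute function are assumptions for illustration; the point is only that the rail's reference is stored as a unique, indexed column on the payment so an incoming dispute can be matched directly.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE payments (
            payment_id     TEXT PRIMARY KEY,
            rail_reference TEXT NOT NULL UNIQUE,  -- identifier the rail also knows
            state          TEXT NOT NULL
        );
        CREATE TABLE disputes (
            dispute_id TEXT PRIMARY KEY,
            payment_id TEXT NOT NULL REFERENCES payments(payment_id),
            reason     TEXT NOT NULL
        );
        """
    )
    return db


def attach_dispute(db, dispute_id, rail_reference, reason):
    # One lookup answers "what payment does this dispute belong to".
    row = db.execute(
        "SELECT payment_id FROM payments WHERE rail_reference = ?",
        (rail_reference,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no payment for rail reference {rail_reference}")
    payment_id = row[0]
    with db:  # commit both writes together, or roll both back
        # OR IGNORE makes a redelivered webhook harmless.
        db.execute(
            "INSERT OR IGNORE INTO disputes VALUES (?, ?, ?)",
            (dispute_id, payment_id, reason),
        )
        db.execute(
            "UPDATE payments SET state = 'disputed' WHERE payment_id = ?",
            (payment_id,),
        )
    return payment_id
```

A LookupError here is itself a signal worth alerting on: it means the rail knows about a transaction that your system cannot find.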

Sagas, not distributed transactions

Engineers coming from a single-database world reach for the transaction: wrap the debit, the rail call, and the credit in one atomic block, roll it all back if anything fails. This is a fantasy across service and network boundaries. You cannot two-phase-commit an external payment rail. The rail is not in your transaction. The network is not reliable. There is no rollback for "money I already told a bank to move."

The real pattern is the saga: a sequence of local steps, each with a defined compensating action. You don't roll back, you move forward into a correcting state. If the credit fails after the debit succeeded, you don't pretend it never happened; you issue a compensating reversal and you record both. Combined with the state machine, this gives you something a distributed transaction never could: a complete, auditable history of what was attempted and what corrected it. The books tell the whole story, including the messy parts.
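
A minimal saga runner, as a sketch. run_saga, the step tuple shape, and the log format are all assumptions of mine; in particular, this version does not handle a compensation that itself fails, which in practice needs its own retries and a manual-resolution state.

```python
def run_saga(steps, log):
    """Run steps in order; on failure, compensate completed steps in reverse.

    steps: list of (name, action, compensation) where action and compensation
           are zero-argument callables.
    log:   list that receives (name, outcome) tuples as an audit trail.
    Returns True if every step succeeded, False if the saga was compensated.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            log.append((name, "failed"))
            # Move forward into a correcting state: undo what already
            # happened, newest first, and record each correction.
            for done_name, done_compensation in reversed(completed):
                done_compensation()
                log.append((done_name, "compensated"))
            return False
        log.append((name, "done"))
        completed.append((name, compensation))
    return True
```

Note that the failed step is never compensated, only the ones that completed before it, and that the log keeps the failure and the correction side by side instead of erasing either.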

"Uptime" is the wrong metric

I'll say the thing plainly. For a money-moving system, uptime is close to a vanity metric. A service can be 99.99% available and still produce wrong balances, double-charge customers, and lose track of disputes. Availability tells you the server answered. It tells you nothing about whether the answer was correct.

The metric that matters is: does the system always resolve to correct books? Every transaction, including every failed and ambiguous one, eventually lands in a known, correct, reconciled state. A payment that's stuck for an hour is acceptable if it resolves correctly; a payment that returns "success" instantly but corrupts the ledger is a catastrophe. Optimize for correctness-on-resolution, not for speed-of-response. They are not the same goal, and when they conflict, correctness wins every time.

That principle reshapes your observability. You don't just alert on 500s and latency. You alert on stuck states: transactions sitting in a non-terminal state past their timeout, reconciliation diffs that didn't clear, retry counts climbing, compensation actions firing more than baseline. Your dashboards are about the shape of the state machine in aggregate: how many payments are in flight, how long they've been there, where they're piling up. A growing population of "authorized but not captured" is a leading indicator of an incident you can catch before a customer does. This is the kind of foundational rigor that separates a fintech that scales from one that quietly accumulates a reconciliation crisis.
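
The aggregate view can start as something this small. The function name, the input shape (transfer id, state name, time it entered that state), and the timeout table are illustrative assumptions; a real version would run as a query against the transfers table and emit metrics.

```python
from collections import Counter
from datetime import datetime, timedelta


def in_flight_summary(transfers, now, timeouts):
    """Summarize non-terminal transfers.

    transfers: iterable of (transfer_id, state, entered_at) tuples.
    timeouts:  dict of state -> timedelta; states absent from it are
               treated as terminal and skipped.
    Returns (count of in-flight transfers per state, sorted ids past deadline).
    """
    counts = Counter()
    stuck = []
    for transfer_id, state, entered_at in transfers:
        limit = timeouts.get(state)
        if limit is None:
            continue
        counts[state] += 1
        if now - entered_at > limit:
            stuck.append(transfer_id)
    return counts, sorted(stuck)
```

Charting the per-state counts over time is what makes a pile-up in "authorized" visible; the stuck list is what pages someone.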

How we build it

At Kunso this is not a thing we bolt on at the end. We model the state machine, the idempotency contract, the reconciliation loop, and the compensation logic first, because the AI-native workflows we use let us generate, test, and exhaustively simulate failure paths at a speed that makes thorough correctness affordable rather than a luxury you cut for the deadline. The happy path was always going to work. We spend our effort where the money actually gets lost.

The lesson underneath all of it: a money-moving system is judged on the bad day, not the demo. Design for the duplicated settlement, the reversal that races a retry, the balance read mid-transaction, and the happy path takes care of itself. Optimize for correct books on resolution rather than uptime on the dashboard, and most of fintech's existential failures shrink back into ordinary engineering.