writing — may 2026 · 6 min
Agents that do bookkeeping, with humans where it matters
Earlier this year I spoke with a US accounting firm about their client bookkeeping operation. Afterwards I wrote them a short architecture concept: how much of the recurring monthly work could AI agents honestly take on today, without pretending the failure modes don't exist. This is that concept.
01 The work
A typical client-accounting team runs a four-step monthly cycle. A junior bookkeeper imports bank and card transactions and assigns each one to a category in the client's chart of accounts. A senior bookkeeper reconciles the ledger against the actual bank statement, line by line. A controller reviews the completed books, makes adjusting entries, and signs off. Then the financial package goes out.
Most of the work in the first three steps follows learnable rules. A vendor maps to the same category month after month; a certain pattern always means a transfer; a certain anomaly always deserves a closer look. The interesting move is shifting the rule-based parts to software so human time concentrates on the parts that actually require judgment.
02 The shape of the system
Three coordinated agents and one human approval gate. The accounting platform stays the system of record, and no agent writes to it directly — every entry passes through a person first.
- bank data: csv exports or a feed
- 01 categorize: history first, the model only for the new
- 02 reconcile: deterministic matching, no llm arithmetic
- 03 qa: exceptions, sorted by severity
- human review: approve, edit, or reject
- ledger: system of record
The first agent categorizes. For each transaction it checks history before it checks intelligence: a vector lookup over the client's past categorizations, so a vendor that has been filed the same way fifty times costs nothing to file the fifty-first time. Only genuinely novel transactions go to the language model, which gets the transaction, the chart of accounts and the closest historical examples, and returns a category with a confidence score. High-confidence proposals are queued for approval; low-confidence ones are flagged for an explicit human decision.
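The routing described above can be sketched in a few lines. This is a minimal illustration, not the concept's actual implementation: `history_lookup` stands in for the vector search, `model_call` stands in for the language model, and the `HIGH_CONFIDENCE` threshold and the `1.0` confidence assigned to history hits are assumptions of mine.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical threshold; the concept does not specify a value.
HIGH_CONFIDENCE = 0.90


@dataclass(frozen=True)
class Proposal:
    category: str
    confidence: float
    source: str  # "history" or "model"
    queue: str   # "approval" or "explicit_decision"


def categorize(
    description: str,
    history_lookup: Callable[[str], Optional[str]],
    model_call: Callable[[str], Tuple[str, float]],
) -> Proposal:
    """Check history first; call the model only for novel transactions."""
    known = history_lookup(description)
    if known is not None:
        # Seen before: no model call, but the result is still queued
        # for human approval like everything else.
        return Proposal(known, 1.0, "history", "approval")

    category, confidence = model_call(description)
    queue = "approval" if confidence >= HIGH_CONFIDENCE else "explicit_decision"
    return Proposal(category, confidence, "model", queue)
```

Both branches end in a queue, never in a write. The only thing the history hit saves is the model call.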
The second agent reconciles — matching recorded entries against the bank statement and flagging what's missing, duplicated, or off by some amount. The third reads the closed period, compares it against prior ones, and produces an exception list sorted by severity: the category that suddenly tripled, the payroll account that went quiet in a month with payroll, the suspiciously round number where amounts are usually messy.
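As a rough sketch of what deterministic reconciliation means here, the example below matches ledger entries to bank lines as multisets of exact (date, amount) pairs. The `Line` type and `reconcile` function are hypothetical names, and exact matching is a simplification: a real matcher would also need date tolerance and near-miss detection for the "off by some amount" case.

```python
from collections import Counter
from datetime import date
from decimal import Decimal
from typing import List, NamedTuple, Tuple


class Line(NamedTuple):
    posted: date
    amount: Decimal  # Decimal, never float, so cents compare exactly


def reconcile(ledger: List[Line], bank: List[Line]) -> Tuple[List[Line], List[Line]]:
    """Return (missing_from_ledger, unmatched_in_ledger).

    Counter subtraction keeps only positive counts, so a ledger entry
    recorded twice against one bank line shows up once as unmatched.
    """
    ledger_counts = Counter(ledger)
    bank_counts = Counter(bank)
    missing = sorted((bank_counts - ledger_counts).elements())
    unmatched = sorted((ledger_counts - bank_counts).elements())
    return missing, unmatched
```

No model is involved at any point; the output is two plain lists that the QA step and the reviewer can read directly.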
The reviewer sees all of it in one dashboard — proposed category, confidence, the system's reasoning, one-click approve, edit, or reject. Bookkeepers stop categorizing hundreds of routine transactions and start reviewing a short list of flagged ones, with a full audit trail of what was decided and why.
03 The decisions that matter
The language model never touches arithmetic. Language models are text predictors with well-documented arithmetic failure modes, and in bookkeeping a single wrong sum can mean a misfiled return. So the split is enforced in code: the model reads descriptions, matches vendors, explains its choices — and deterministic Python does every sum, balance check, and match. If debits and credits don't balance, the pipeline halts. No number that reaches the books was generated by AI. This is the single most consequential decision in the design.
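A minimal sketch of the balance check, assuming each journal line arrives as an (account, debit, credit) tuple. The names `assert_balanced` and `UnbalancedEntryError` are illustrative, not taken from any real system. Amounts use `decimal.Decimal` so that cents add up exactly, which binary floats do not guarantee.

```python
from decimal import Decimal
from typing import Iterable, Tuple


class UnbalancedEntryError(Exception):
    """Raised to halt the pipeline when debits and credits differ."""


def assert_balanced(lines: Iterable[Tuple[str, Decimal, Decimal]]) -> Decimal:
    """Sum debits and credits exactly; raise if they differ.

    Each line is (account, debit, credit). Returns the balanced total.
    """
    total_debits = Decimal("0")
    total_credits = Decimal("0")
    for _account, debit, credit in lines:
        total_debits += debit
        total_credits += credit
    if total_debits != total_credits:
        raise UnbalancedEntryError(
            f"debits {total_debits} != credits {total_credits}"
        )
    return total_debits
```

Raising an exception rather than returning a flag is deliberate in this sketch: a caller that forgets to check cannot accidentally continue.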
The approval gate is code, not policy. The orchestration layer pauses before any write and resumes only on explicit human approval, even when the model reports maximum confidence. Loosening that later, for narrow categories the accountants trust, is their decision to make, not the software's.
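One way to make the gate structural rather than procedural is to have the write path refuse anything without a recorded human decision. The sketch below uses hypothetical names (`PendingWrite`, `approve`, `write_to_ledger`), and the `writer` callable stands in for whatever actually posts to the accounting platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


class ApprovalRequired(Exception):
    """Raised when a write is attempted without human approval."""


@dataclass
class PendingWrite:
    entry_id: str
    confidence: float
    decision: Decision = Decision.PENDING
    decided_by: str = ""


def approve(item: PendingWrite, reviewer: str) -> None:
    """Record an explicit approval; a named reviewer is mandatory."""
    if not reviewer:
        raise ValueError("approval requires a named reviewer")
    item.decision = Decision.APPROVED
    item.decided_by = reviewer


def write_to_ledger(item: PendingWrite, writer: Callable[[str], None]) -> None:
    """Post the entry only if a human approved it.

    Confidence is deliberately not consulted: a score of 1.0 does not
    open the gate.
    """
    if item.decision is not Decision.APPROVED or not item.decided_by:
        raise ApprovalRequired(f"{item.entry_id} has no human approval")
    writer(item.entry_id)
```

If the accountants later choose to auto-approve a narrow category, that becomes a visible change to `write_to_ledger`, not a quiet change in behavior.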
Boring database, on purpose. Decisions, approvals, and write events live in PostgreSQL because financial records need ACID guarantees — a crashed write either commits fully or not at all. The same database does vector search for the history lookups, so the system learns from every approved categorization: accuracy goes up and API cost goes down as history accumulates.
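The all-or-nothing behavior can be illustrated without a database server. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, purely so the example runs anywhere; the table names, the `record_approval` function, and the `simulate_crash` flag are assumptions made for the demonstration.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create two hypothetical tables for decisions and approvals."""
    conn.execute(
        "CREATE TABLE decisions (entry_id TEXT PRIMARY KEY, category TEXT)"
    )
    conn.execute(
        "CREATE TABLE approvals (entry_id TEXT PRIMARY KEY, reviewer TEXT)"
    )
    conn.commit()


def record_approval(
    conn: sqlite3.Connection,
    entry_id: str,
    category: str,
    reviewer: str,
    simulate_crash: bool = False,
) -> None:
    """Write the decision and its approval as one atomic unit."""
    # As a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO decisions (entry_id, category) VALUES (?, ?)",
            (entry_id, category),
        )
        if simulate_crash:
            # Demonstration only: fail between the two inserts.
            raise RuntimeError("simulated crash")
        conn.execute(
            "INSERT INTO approvals (entry_id, reviewer) VALUES (?, ?)",
            (entry_id, reviewer),
        )
```

A failure between the two inserts leaves neither row behind, so the audit trail never shows a decision without its approval.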
Client data stays inside one envelope. The model is accessed through the firm's existing cloud provider rather than a public AI API, so financial data never leaves that infrastructure and nothing is retained for training — the posture a licensed firm's confidentiality obligations actually require.
Work with the APIs you actually get. The dominant accounting platform doesn't expose bank-feed data through its public API, so the pipeline ingests from the source — CSV exports or a bank-data provider — and writes approved entries back through the standard API. And because real client bases are never on one platform, the agents are platform-agnostic: they emit structured journal entries, and an integration layer translates per destination.
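To make "structured journal entries plus a translation layer" concrete, here is a small sketch. The two payload shapes are invented for illustration and do not correspond to any real accounting platform's API; `JournalEntry`, `JournalLine`, and `translate` are hypothetical names.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class JournalLine:
    account: str
    debit: Decimal = Decimal("0")
    credit: Decimal = Decimal("0")


@dataclass(frozen=True)
class JournalEntry:
    entry_date: str  # ISO 8601 date string
    memo: str
    lines: Tuple[JournalLine, ...]


def to_signed_amounts(entry: JournalEntry) -> Dict[str, Any]:
    """Hypothetical destination that wants one signed amount per line."""
    return {
        "date": entry.entry_date,
        "memo": entry.memo,
        "lines": [
            {"account": ln.account, "amount": str(ln.debit - ln.credit)}
            for ln in entry.lines
        ],
    }


def to_debit_credit_columns(entry: JournalEntry) -> Dict[str, Any]:
    """Hypothetical destination that wants separate debit/credit fields."""
    return {
        "posted_on": entry.entry_date,
        "description": entry.memo,
        "rows": [
            {"account": ln.account, "debit": str(ln.debit), "credit": str(ln.credit)}
            for ln in entry.lines
        ],
    }


TRANSLATORS: Dict[str, Callable[[JournalEntry], Dict[str, Any]]] = {
    "platform_a": to_signed_amounts,
    "platform_b": to_debit_credit_columns,
}


def translate(entry: JournalEntry, destination: str) -> Dict[str, Any]:
    """Pick the translator for a destination; unknown destinations fail loudly."""
    translator = TRANSLATORS.get(destination)
    if translator is None:
        raise ValueError(f"no translator registered for {destination!r}")
    return translator(entry)
```

The agents only ever produce `JournalEntry` values, so supporting another platform means adding one function and one dictionary key.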
04 Limits
The design covers the highest-volume, most repetitive work: categorization, reconciliation, quality review. It does not attempt multi-entity consolidations, sales tax, foreign currency, or the journal entries that never come from a bank feed — depreciation, payroll allocations, amortization. The QA agent would notice those are missing; it wouldn't create them. And nothing here touches professional judgment: GAAP questions, audit defense, tax planning stay human.
05 Closing
What I wanted the concept to show is that the recurring portion of this work is genuinely automatable now — with proper respect for the failure modes of language models and the integrity requirements of financial data. This stopped being a research problem. It's an engineering problem, and a solvable one.