
Spend Tokens: Budget Envelopes That Actually Block Overspend

Most AI budget controls are alerts. A spend token is a hard limit. Here's the difference and why it matters.


Most AI "budget controls" are actually budget monitors. They watch your spending and tell you when it's too high. By then, it's already too high.

A spend token is different. It's a pre-authorized budget envelope that the gateway checks — atomically, before the API call is placed — and rejects if the envelope is exhausted. The spending stops. Not after the fact. Before.

The distinction sounds small. In practice, it's the difference between a $14,000 weekend and a clean 402 response.

What a spend token is

A spend token has five properties:

An organization scope. Which org minted it and which org's traffic can use it.

A project scope. Which project the token is authorized for. A token minted for your "search feature" rejects requests tagged for your "summarization feature."

An allowed model list. You can restrict a token to specific models. A token that only allows claude-3-haiku can't be used to make a claude-opus call. This is how you prevent cost escalation through model substitution.

A spend limit. The maximum USD value that can be reserved against this token. When the limit is reached, the next request gets a 402.

A TTL. An expiry time after which the token is invalid regardless of remaining balance. A token for a weekend batch job should expire Monday morning.
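As a rough illustration, the five properties could be modeled as a small record plus a scope check. This is a minimal sketch, not the gateway's actual schema: the field names, the `rejection_reason` helper, and the reason strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal


@dataclass(frozen=True)
class SpendToken:
    # Field names are illustrative, not a real gateway schema.
    token_id: str
    org_id: str
    project_id: str
    allowed_models: frozenset[str]
    limit_usd: Decimal
    expires_at: datetime


def rejection_reason(token, org_id, project_id, model, now):
    """Return why a request is out of scope, or None if it passes these checks."""
    if now >= token.expires_at:
        return "token_expired"
    if org_id != token.org_id:
        return "wrong_org"
    if project_id != token.project_id:
        return "wrong_project"
    if model not in token.allowed_models:
        return "model_not_allowed"
    return None  # the spend limit itself is enforced separately, at reserve time


# Hypothetical token for a weekend job on the "search" project.
now = datetime(2025, 1, 4, 12, 0, tzinfo=timezone.utc)
token = SpendToken(
    token_id="st_demo",
    org_id="org_1",
    project_id="search",
    allowed_models=frozenset({"claude-3-haiku"}),
    limit_usd=Decimal("50.00"),
    expires_at=now + timedelta(days=2),
)
```

With this token, a request for `claude-opus` or one tagged for a "summarization" project is turned away before any budget is touched.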

The atomic reserve-then-call pattern

The critical implementation detail: the gateway reserves estimated cost against the envelope before the request goes to the provider, in a database transaction.

Here's why that matters: without atomic reserve, you can have two concurrent API calls both check the envelope, both see $5 remaining, both proceed with a $4.50 request, and both complete — for a total spend of $9 against a $5 envelope. The limit is enforced on paper but not in practice.

With atomic reserve, the first call reserves $4.50 (envelope now has $0.50 remaining). The second call tries to reserve $4.50, finds insufficient balance, and gets a 402. The envelope is actually protected.
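One common way to get this behavior is a single conditional UPDATE, so the balance check and the reservation happen in the same statement. The sketch below uses SQLite and integer cents purely for illustration; the `envelopes` table, its columns, and the `reserve` function are assumptions, not the gateway's real storage layer.

```python
import sqlite3


def reserve(conn: sqlite3.Connection, token_id: str, cents: int) -> bool:
    """Atomically reserve `cents` against the envelope; False means respond 402."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE envelopes SET reserved_cents = reserved_cents + ? "
            "WHERE token_id = ? AND reserved_cents + ? <= limit_cents",
            (cents, token_id, cents),
        )
        # The WHERE clause matches no row when the reservation would not fit.
        return cur.rowcount == 1


# Hypothetical $5.00 envelope with nothing reserved yet.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE envelopes (token_id TEXT PRIMARY KEY, "
    "limit_cents INTEGER NOT NULL, reserved_cents INTEGER NOT NULL)"
)
conn.execute("INSERT INTO envelopes VALUES ('st_demo', 500, 0)")
conn.commit()

first = reserve(conn, "st_demo", 450)   # fits: 450 <= 500
second = reserve(conn, "st_demo", 450)  # would reach 900 > 500, so it is rejected
```

Because the check lives inside the UPDATE's WHERE clause, there is no window between "read the balance" and "write the reservation" for a second caller to slip through.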

After the request completes, the gateway settles against the reserved amount using the actual cost. If the estimated cost was $4.50 and the actual was $4.20, the $0.30 difference is released back to the envelope. The reservation is an upper bound, not a final charge.
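The settlement arithmetic from this example can be written out directly. The `settle` function is a hypothetical helper, and raising when the actual cost exceeds the reservation is an assumption: the article treats the estimate as an upper bound and does not describe what happens otherwise.

```python
from decimal import Decimal


def settle(reserved_usd: Decimal, actual_usd: Decimal) -> Decimal:
    """Return the amount to release back to the envelope after a call completes."""
    # Assumption: the estimate is an upper bound, so this case should not occur.
    if actual_usd > reserved_usd:
        raise ValueError("actual cost exceeded the reservation")
    return reserved_usd - actual_usd


# $5.00 envelope, $4.50 reserved, $4.20 actually spent.
released = settle(Decimal("4.50"), Decimal("4.20"))
remaining = Decimal("5.00") - Decimal("4.50") + released
```

Here `released` is $0.30, which brings the envelope's remaining balance from $0.50 back up to $0.80.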

Who mints tokens and who uses them

Spend tokens are minted by an operator — someone with gateway admin access. Developers and agents use tokens; they don't create them.

This separation matters. The person who decides how much budget a feature gets (an engineering lead, a product manager, a finance approver) is different from the person writing the code that consumes that budget. The token is the approval, recorded in the system.

In practice: when a new AI feature goes through code review, the approval process includes minting a spend token with an appropriate limit and TTL. The feature's code is configured with that token's ID. The feature cannot spend more than approved. When the TTL expires, the token must be renewed — which is another approval checkpoint.

This is what a lightweight AI governance workflow looks like in an org that doesn't want a full procurement review for every feature.

The validator secret and agent identity

Each spend token has an associated validator secret — an HMAC key that proves a request came from an authorized agent. When the gateway sees a request with a spend token ID, it also requires an x-validator header. The gateway computes HMAC-SHA256(key=validator_secret, data=token_id) and checks it against the stored hash.

This prevents token ID borrowing: knowing a token ID (visible in a log) doesn't let you use it. You also need the validator secret, which should only exist in your agent's environment variables.

For multi-agent systems, this means each agent has its own validator secret. If one agent is compromised, you revoke its token and issue a new one without affecting other agents' budgets.
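The validator check can be sketched with Python's standard `hmac` module. The article does not say exactly what the x-validator header carries, so this sketch assumes the agent sends the hex-encoded HMAC of the token ID and the gateway recomputes it from its copy of the secret; the function names are hypothetical.

```python
import hashlib
import hmac


def sign_token_id(validator_secret: bytes, token_id: str) -> str:
    """HMAC-SHA256 over the token ID, hex-encoded (the encoding is an assumption)."""
    return hmac.new(validator_secret, token_id.encode(), hashlib.sha256).hexdigest()


def validator_matches(validator_secret: bytes, token_id: str, presented: str) -> bool:
    """Check a presented x-validator value using a constant-time comparison."""
    expected = sign_token_id(validator_secret, token_id)
    return hmac.compare_digest(expected.encode(), presented.encode())
```

Using `hmac.compare_digest` rather than `==` avoids leaking, through timing, how many leading characters of a guessed value were correct.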

Spend token ledger as audit trail

Every reserve, every settle, every rejection — recorded in an append-only ledger table. At the end of a batch run, you can look at the ledger for a specific token and see:

  • How many calls were made
  • What the total reserved amount was vs. the total settled amount (the difference tells you about estimation accuracy)
  • How many 402s were issued (a high number might indicate a budget that was too small, or a retry loop)
  • Whether the TTL caused any late rejections

This ledger is also the answer to "what did this feature spend last month?" — not an approximation from a monitoring dashboard, but a complete accounting of every transaction against that budget.
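A summary like the one described above could be computed from ledger rows as follows. The `LedgerEntry` shape, the `kind` values, and the rejection reason strings are assumptions for illustration; a real ledger table would have its own columns.

```python
from collections import Counter
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    kind: str            # "reserve", "settle", or "reject"
    amount_usd: Decimal  # reserved or settled amount; zero for rejections
    reason: str = ""     # for rejections, e.g. "budget_exhausted" or "token_expired"


def summarize(entries):
    """Aggregate one token's ledger into the figures an operator would review."""
    zero = Decimal("0")
    reserved = sum((e.amount_usd for e in entries if e.kind == "reserve"), zero)
    settled = sum((e.amount_usd for e in entries if e.kind == "settle"), zero)
    return {
        "calls": sum(1 for e in entries if e.kind == "settle"),
        "reserved_usd": reserved,
        "settled_usd": settled,
        "overestimate_usd": reserved - settled,  # rough measure of estimation accuracy
        "rejections": dict(Counter(e.reason for e in entries if e.kind == "reject")),
    }


# Hypothetical ledger for one token.
ledger = [
    LedgerEntry("reserve", Decimal("4.50")),
    LedgerEntry("settle", Decimal("4.20")),
    LedgerEntry("reject", Decimal("0"), "budget_exhausted"),
    LedgerEntry("reject", Decimal("0"), "token_expired"),
]
summary = summarize(ledger)
```

Note that reservations still in flight (reserved but not yet settled) would inflate `overestimate_usd`, so this figure is most meaningful after a batch run has finished.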

Practical sizing

The most common question: how do I know what limit to set?

Start from the use case. A feature that processes 1,000 user requests per day, each requiring one model call averaging 2,000 tokens in and 500 tokens out, on a mid-tier model, costs roughly X per day, where X follows from your provider's per-token prices. Multiply by 30 and add a 20% buffer to get your monthly envelope.
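The sizing arithmetic is easy to script. In this sketch, the per-million-token prices are placeholders chosen to make the numbers readable, not real provider rates; substitute your model's actual pricing. The `monthly_envelope_usd` function is a hypothetical helper.

```python
from decimal import Decimal


def monthly_envelope_usd(
    requests_per_day: int,
    tokens_in: int,
    tokens_out: int,
    usd_per_million_in: Decimal,
    usd_per_million_out: Decimal,
    days: int = 30,
    buffer: Decimal = Decimal("0.20"),
) -> Decimal:
    """Estimate a monthly spend limit: daily cost x days, plus a safety buffer."""
    per_request = (
        Decimal(tokens_in) * usd_per_million_in
        + Decimal(tokens_out) * usd_per_million_out
    ) / Decimal(1_000_000)
    daily = per_request * requests_per_day
    return daily * days * (1 + buffer)


# Placeholder prices: $1.00 per million input tokens, $5.00 per million output tokens.
envelope = monthly_envelope_usd(
    requests_per_day=1000,
    tokens_in=2000,
    tokens_out=500,
    usd_per_million_in=Decimal("1.00"),
    usd_per_million_out=Decimal("5.00"),
)
```

Under these placeholder prices each request costs $0.0045, a day costs $4.50, and the 30-day envelope with a 20% buffer comes to $162.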

For agents with less predictable behavior, use smaller envelopes with shorter TTLs. A weekly envelope is easier to monitor than a monthly one. If an agent's behavior is unusually expensive one week, you'll know by Thursday instead of by month-end.

For new features in their first month, set a conservative limit: one that would produce a noticeable 402 rate if the feature behaves unexpectedly, but that wouldn't embarrass you if it triggers. Use the first month's actual spend to calibrate the permanent limit.

What the 402 response looks like

When a request is rejected due to budget exhaustion, the gateway returns:

{
  "error": {
    "type": "budget_exhausted",
    "message": "Spend token budget exhausted",
    "token_id": "st_...",
    "remaining_usd": 0.00
  }
}

This is a clean error that client code can handle. An agent receiving this response should stop, not retry. A user-facing feature receiving this response should show a graceful message ("AI features temporarily paused — your team has reached this month's AI budget") rather than a generic error.
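On the client side, "stop, not retry" might look like the sketch below. It assumes the response body has the shape shown above; the `BudgetExhausted` exception, `check_gateway_response`, and `run_agent_steps` are hypothetical names, and each step is modeled as a callable returning a status code and a body string.

```python
import json


class BudgetExhausted(Exception):
    """Raised when the gateway reports an exhausted spend token."""

    def __init__(self, token_id: str, remaining_usd: float):
        super().__init__(f"spend token {token_id} exhausted")
        self.token_id = token_id
        self.remaining_usd = remaining_usd


def check_gateway_response(status_code: int, body: str) -> None:
    """Raise BudgetExhausted for a budget 402; do nothing for other responses."""
    if status_code != 402:
        return
    error = json.loads(body).get("error", {})
    if error.get("type") == "budget_exhausted":
        raise BudgetExhausted(error.get("token_id", ""), error.get("remaining_usd", 0.0))


def run_agent_steps(steps) -> int:
    """Run steps in order and stop at the first budget rejection, without retrying."""
    completed = 0
    for step in steps:
        status_code, body = step()
        try:
            check_gateway_response(status_code, body)
        except BudgetExhausted:
            break  # the envelope is empty; retrying would only produce more 402s
        completed += 1
    return completed
```

A user-facing feature would catch the same exception and render its "temporarily paused" message instead of breaking out of a loop.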

Handling 402 gracefully is part of building a budget-aware AI application. The spend token gives you a reliable signal; the application layer decides what to do with it.


Spend Tokens are a core feature in Visionality.
