Budget & Limits

The limits section controls how much a workflow or individual block is allowed to spend in cost, tokens, and wall-clock time. When a limit is breached, the engine either kills execution immediately or emits a warning and continues, depending on the on_exceed mode.

Workflow-level limits

Add a limits section at the top level of your workflow YAML:

version: "1.0"
limits:
  cost_cap_usd: 2.50
  token_cap: 100000
  max_duration_seconds: 300
  on_exceed: fail
  warn_at_pct: 0.8
workflow:
  name: research-pipeline
  entry: step_one
blocks:
  step_one:
    type: linear
    soul_ref: analyst

WorkflowLimitsDef fields

| Field | Type | Default | Constraints | Description |
| --- | --- | --- | --- | --- |
| `cost_cap_usd` | `float?` | `None` | `>= 0.0` | Maximum total LLM cost in USD |
| `token_cap` | `int?` | `None` | `>= 1` | Maximum total tokens (prompt + completion) |
| `max_duration_seconds` | `int?` | `None` | `1 -- 86400` | Wall-clock timeout for the entire workflow, in seconds |
| `on_exceed` | `"warn" \| "fail"` | `"fail"` | none | What happens when a cap is breached |
| `warn_at_pct` | `float` | `0.8` | `0.0 -- 1.0` | Fraction of a cap at which early warning events are emitted (`0.8` means 80%) |

All fields are optional. If you omit limits entirely, no budget enforcement is applied.
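The constraints in the table can be read as a set of validation rules. The following is a minimal sketch of those rules in plain Python, not the engine's actual WorkflowLimitsDef class; the names WorkflowLimitsSketch and validate_limits are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkflowLimitsSketch:
    """Hypothetical stand-in for WorkflowLimitsDef; not the engine's real class."""

    cost_cap_usd: Optional[float] = None
    token_cap: Optional[int] = None
    max_duration_seconds: Optional[int] = None
    on_exceed: str = "fail"
    warn_at_pct: float = 0.8

    def __post_init__(self) -> None:
        # Each check mirrors one row of the Constraints column.
        if self.cost_cap_usd is not None and self.cost_cap_usd < 0.0:
            raise ValueError("cost_cap_usd must be >= 0.0")
        if self.token_cap is not None and self.token_cap < 1:
            raise ValueError("token_cap must be >= 1")
        if self.max_duration_seconds is not None and not (
            1 <= self.max_duration_seconds <= 86400
        ):
            raise ValueError("max_duration_seconds must be in 1..86400")
        if self.on_exceed not in ("warn", "fail"):
            raise ValueError('on_exceed must be "warn" or "fail"')
        if not 0.0 <= self.warn_at_pct <= 1.0:
            raise ValueError("warn_at_pct must be in 0.0..1.0")


def validate_limits(raw: dict) -> WorkflowLimitsSketch:
    """Build a validated limits object from a parsed `limits:` mapping."""
    return WorkflowLimitsSketch(**raw)
```

Calling validate_limits({"cost_cap_usd": 2.50}) fills the remaining fields with the documented defaults, while a value such as max_duration_seconds: 0 is rejected.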

Per-block limits

Individual blocks can have their own limits section:

blocks:
  expensive_step:
    type: linear
    soul_ref: analyst
    limits:
      cost_cap_usd: 1.00
      token_cap: 50000
      max_duration_seconds: 120
      on_exceed: fail

BlockLimitsDef fields

| Field | Type | Default | Constraints | Description |
| --- | --- | --- | --- | --- |
| `cost_cap_usd` | `float?` | `None` | `>= 0.0` | Maximum LLM cost for this block |
| `token_cap` | `int?` | `None` | `>= 1` | Maximum tokens for this block |
| `max_duration_seconds` | `int?` | `None` | `1 -- 86400` | Wall-clock timeout for this block |
| `on_exceed` | `"warn" \| "fail"` | `"fail"` | none | What happens when a cap is breached |

Warn mode vs kill mode

The on_exceed field controls what happens when a budget cap is breached:

Kill mode (`on_exceed: "fail"`)

The default. When any cap is exceeded, the engine raises a BudgetKilledException immediately. The exception includes structured metadata:

scope --- "block" or "workflow"
block_id --- which block triggered the breach (if block-scoped)
limit_kind --- "cost_usd", "token_cap", or "timeout"
limit_value --- the configured cap
actual_value --- the value that exceeded the cap

Error route interaction: If the block that triggered the exception has an error_route configured, the BudgetKilledException is caught by the workflow’s generic exception handler. The block result is written with exit_handle: "error" and error metadata, and execution continues on the error route target block. Only when no error_route exists does the exception propagate and terminate the run with status: failed.

Workflow-level timeouts are the exception: When a workflow-level max_duration_seconds fires, the resulting BudgetKilledException is raised outside the block execution loop. No block's error_route can catch it, so it always terminates the run.
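The interaction between a block-scoped breach and error_route can be sketched as follows. This is an illustration under assumptions, not the engine's code: BudgetKilledSketch, run_block, the next_block key, and the "done" exit handle are hypothetical names, and only the metadata fields and exit_handle: "error" come from the description above.

```python
from typing import Callable, Optional


class BudgetKilledSketch(Exception):
    """Hypothetical stand-in for BudgetKilledException and its metadata."""

    def __init__(
        self,
        scope: str,
        limit_kind: str,
        limit_value: float,
        actual_value: float,
        block_id: Optional[str] = None,
    ) -> None:
        super().__init__(
            f"{scope} {limit_kind} cap {limit_value} breached at {actual_value}"
        )
        self.scope = scope
        self.block_id = block_id
        self.limit_kind = limit_kind
        self.limit_value = limit_value
        self.actual_value = actual_value


def run_block(
    block_id: str, body: Callable[[], None], error_route: Optional[str] = None
) -> dict:
    """Run one block body, diverting to error_route if it raises."""
    try:
        body()
    except Exception as exc:  # generic handler, as the text describes
        if error_route is None:
            raise  # no error_route: the failure propagates and the run fails
        return {
            "block_id": block_id,
            "exit_handle": "error",
            "next_block": error_route,
            "error": {
                "type": type(exc).__name__,
                "limit_kind": getattr(exc, "limit_kind", None),
            },
        }
    # "done" is an assumed name for the normal exit handle.
    return {"block_id": block_id, "exit_handle": "done", "next_block": None}
```

The sketch covers only breaches raised inside a block. A workflow-level timeout is raised outside this handler, so it would never reach the except branch.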

Warn mode (`on_exceed: "warn"`)

When a cap is exceeded, the engine logs a warning but execution continues. The run can finish normally even after exceeding a budget cap. Use this for soft budgets where you want visibility but not hard stops.

limits:
  cost_cap_usd: 1.00
  on_exceed: warn
  warn_at_pct: 0.8

With this configuration, a warning event is emitted at 80% of the cost cap ($0.80), and another when the cap is exceeded ($1.00+), but execution continues.
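The two warning points can be expressed as a small pure function. This is a sketch under assumptions: the function name, the event labels, and the choice of >= for the early threshold are hypothetical, and the function is stateless, so unlike a real engine it does not track whether an event has already been emitted.

```python
from typing import List


def budget_events(
    spent_usd: float, cost_cap_usd: float, warn_at_pct: float = 0.8
) -> List[str]:
    """Return the warning events that apply at the current spend level."""
    events = []
    # Early warning once spend reaches the configured fraction of the cap.
    if spent_usd >= cost_cap_usd * warn_at_pct:
        events.append("warn_threshold")
    # Second warning once the cap itself is exceeded; in warn mode nothing is raised.
    if spent_usd > cost_cap_usd:
        events.append("cap_exceeded")
    return events
```

With cost_cap_usd=1.00, a spend of 0.50 yields no events, 0.85 yields only the early warning, and 1.20 yields both.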

How enforcement works

Budget enforcement uses Python contextvars to track the active budget session per asyncio task. The enforcement point is inside LiteLLMClient.achat() --- the single chokepoint for all LLM calls.

The enforcement chain

1. Workflow start: If the workflow has limits, a BudgetSession is created and set as the active budget via _active_budget ContextVar.
2. Block start: If a block has limits, a child BudgetSession is created with the workflow session as its parent. The child session replaces the active budget for the duration of that block.
3. LLM call returns: After each achat() call, the session’s accrue() method adds the cost and tokens. If the session has a parent, costs propagate up the chain automatically.
4. Cap check: After accrual, check_or_raise() walks the entire parent chain. If any session (block or workflow) has exceeded its cap with on_exceed: "fail", a BudgetKilledException is raised.
5. Block end: The block’s budget session is removed and the workflow session is restored.
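The set-and-restore pattern in the block start and block end steps maps onto the standard contextvars API. The sketch below uses plain strings in place of BudgetSession objects, and the run_block and run_workflow functions are hypothetical; only the _active_budget name comes from the text above.

```python
import asyncio
from contextvars import ContextVar
from typing import List, Optional, Tuple

# Strings stand in for BudgetSession objects in this sketch.
_active_budget: ContextVar[Optional[str]] = ContextVar("_active_budget", default=None)


async def run_block(name: str, seen: List[Tuple[str, Optional[str]]]) -> None:
    token = _active_budget.set(f"block:{name}")  # block start: child becomes active
    try:
        await asyncio.sleep(0)  # yield to other tasks, as an LLM call would
        seen.append((name, _active_budget.get()))
    finally:
        _active_budget.reset(token)  # block end: previous session is restored


async def run_workflow() -> List[Tuple[str, Optional[str]]]:
    seen: List[Tuple[str, Optional[str]]] = []
    _active_budget.set("workflow")  # workflow start
    await run_block("research", seen)
    seen.append(("after research", _active_budget.get()))
    # gather() wraps each coroutine in a task with its own copy of the context,
    # so concurrent blocks cannot overwrite each other's active budget.
    await asyncio.gather(run_block("a", seen), run_block("b", seen))
    seen.append(("after gather", _active_budget.get()))
    return seen
```

Running asyncio.run(run_workflow()) shows each block observing its own session while the workflow session is back in place after every block.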

Parent propagation

When a block has its own limits, the BudgetSession is created with the workflow session as parent. Every accrue() call on the child also increments the parent’s counters recursively up the chain. After accrual, check_or_raise() walks the entire parent chain, so both the block’s caps and the workflow’s caps are enforced on every LLM call:

Block accrues $0.50, 1000 tokens
  → Block session: cost=$0.50, tokens=1000
  → Parent (workflow) session: cost=$0.50, tokens=1000 (propagated)

This means a workflow with cost_cap_usd: 2.00 and on_exceed: fail kills the run as soon as its total cost exceeds $2.00, even if the block that made the call has no limits of its own or is still under its own cap. Sub-flow budgets are not independent: the parent workflow’s caps are always enforced because costs propagate upward through the parent chain.
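Upward propagation and the chain walk can be sketched in a few lines. SessionSketch and CapBreached are hypothetical stand-ins for BudgetSession and BudgetKilledException; the method names accrue() and check_or_raise() follow the text, but their signatures here are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


class CapBreached(Exception):
    """Hypothetical stand-in for BudgetKilledException."""

    def __init__(self, scope: str, limit_kind: str) -> None:
        super().__init__(f"{scope} {limit_kind} cap breached")
        self.scope = scope
        self.limit_kind = limit_kind


@dataclass
class SessionSketch:
    scope: str
    cost_cap_usd: Optional[float] = None
    token_cap: Optional[int] = None
    on_exceed: str = "fail"
    parent: Optional["SessionSketch"] = None
    cost_usd: float = 0.0
    tokens: int = 0

    def accrue(self, cost_usd: float, tokens: int) -> None:
        # Add to this session, then recurse so every ancestor sees the spend.
        self.cost_usd += cost_usd
        self.tokens += tokens
        if self.parent is not None:
            self.parent.accrue(cost_usd, tokens)

    def check_or_raise(self) -> None:
        # Walk from this session up to the root, checking each fail-mode cap.
        node: Optional[SessionSketch] = self
        while node is not None:
            if node.on_exceed == "fail":
                if node.cost_cap_usd is not None and node.cost_usd > node.cost_cap_usd:
                    raise CapBreached(node.scope, "cost_usd")
                if node.token_cap is not None and node.tokens > node.token_cap:
                    raise CapBreached(node.scope, "token_cap")
            node = node.parent
```

For example, with a workflow session capped at $2.00 and a block session capped at $3.00, accruing $0.50, $1.00, and $1.00 on the block leaves the block under its cap, yet check_or_raise() on the block raises with scope "workflow".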

Timeout enforcement

Timeouts work differently from cost and token caps:

Workflow timeout: Workflow.run() wraps the main execution loop in asyncio.wait_for(timeout=max_duration_seconds). If the timeout fires, a BudgetKilledException is raised with limit_kind="timeout".
Block timeout: execute_block() wraps the individual block dispatch in asyncio.wait_for(timeout=max_duration_seconds). Block timeouts are independent of the workflow timeout.
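Both timeouts follow the same asyncio.wait_for pattern, which can be sketched as below. TimeoutKilled and run_with_timeout are hypothetical names; the engine's own wrapper lives in Workflow.run() and execute_block() and may differ in detail. The sketch accepts fractional seconds only to keep the demonstration fast.

```python
import asyncio
from typing import Any, Awaitable


class TimeoutKilled(Exception):
    """Hypothetical stand-in for BudgetKilledException with limit_kind='timeout'."""

    def __init__(self, scope: str, limit_value: float) -> None:
        super().__init__(f"{scope} exceeded {limit_value}s")
        self.scope = scope
        self.limit_kind = "timeout"
        self.limit_value = limit_value


async def run_with_timeout(
    work: Awaitable[Any], scope: str, max_duration_seconds: float
) -> Any:
    """Await `work`, converting an asyncio timeout into a budget exception."""
    try:
        # wait_for cancels `work` if it does not finish within the timeout.
        return await asyncio.wait_for(work, timeout=max_duration_seconds)
    except asyncio.TimeoutError:
        raise TimeoutKilled(scope, max_duration_seconds) from None
```

Because each wrapper has its own timer, a block wrapped with a 120-second limit is cut off at 120 seconds regardless of how much of the workflow's own allowance remains.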

Dispatch branch isolation

When a dispatch block fans out to multiple exit branches, each branch gets an isolated child session created via create_isolated_child(branch_id=exit_id). The child inherits the parent’s exact caps (cost_cap_usd, token_cap, max_duration_seconds, on_exceed) but has no parent pointer --- it accumulates costs independently. This prevents concurrent branches from sharing mutable budget state during execution.

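Each child is individually capped at the parent’s full cap value. For example, if the workflow cap is $5.00, each branch is independently allowed up to $5.00, not $5.00 divided by the number of branches. This means a single branch can hit the cap on its own before the others finish. It also means the branches’ combined spend can exceed the parent’s cap while they run, because the parent’s total is only updated and checked after all branches complete.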

After all branches complete via asyncio.gather, each child’s totals are reconciled back to the parent session via reconcile_child(), which adds the child’s cost and token totals to the parent. Then the parent session’s caps are checked with check_or_raise().

Parent session: cost_cap=$5.00
  ├── Branch A (isolated, cap=$5.00): cost=$1.00
  ├── Branch B (isolated, cap=$5.00): cost=$2.00
  ├── Reconciliation: parent cost += $1.00 + $2.00 = $3.00
  └── Parent cap check: $3.00 ≤ $5.00 → passes
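The fan-out, reconcile, and check sequence can be sketched as follows. IsoSession, CapExceeded, run_branch, and dispatch are hypothetical; the method names create_isolated_child() and reconcile_child() follow the text, but the real signatures are not shown in this document, and the sketch tracks cost only.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional


class CapExceeded(Exception):
    """Hypothetical stand-in for BudgetKilledException."""


@dataclass
class IsoSession:
    cost_cap_usd: Optional[float]
    cost_usd: float = 0.0
    branch_id: Optional[str] = None

    def create_isolated_child(self, branch_id: str) -> "IsoSession":
        # Same cap as the parent, zero spend, and no parent pointer.
        return IsoSession(cost_cap_usd=self.cost_cap_usd, branch_id=branch_id)

    def accrue(self, cost_usd: float) -> None:
        self.cost_usd += cost_usd

    def reconcile_child(self, child: "IsoSession") -> None:
        self.cost_usd += child.cost_usd

    def check_or_raise(self) -> None:
        if self.cost_cap_usd is not None and self.cost_usd > self.cost_cap_usd:
            raise CapExceeded(f"cost {self.cost_usd} > cap {self.cost_cap_usd}")


async def run_branch(child: IsoSession, spend_usd: float) -> None:
    await asyncio.sleep(0)  # stands in for the branch's LLM calls
    child.accrue(spend_usd)
    child.check_or_raise()  # each branch is checked against the full cap


async def dispatch(parent: IsoSession, branch_costs: Dict[str, float]) -> None:
    children = [parent.create_isolated_child(b) for b in branch_costs]
    await asyncio.gather(*(run_branch(c, branch_costs[c.branch_id]) for c in children))
    for child in children:
        parent.reconcile_child(child)  # fold branch totals back into the parent
    parent.check_or_raise()  # the parent cap is checked once, after reconciliation
```

With a $5.00 cap, branches costing $1.00 and $2.00 reconcile to $3.00 and pass. Two branches costing $3.00 each both pass their own checks, and the breach only surfaces at the parent check once the total reaches $6.00.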

Complete example

version: "1.0"
limits:
  cost_cap_usd: 5.00
  token_cap: 200000
  max_duration_seconds: 600
  on_exceed: fail
  warn_at_pct: 0.8
workflow:
  name: budget-example
  entry: research
  transitions:
    - from: research
      to: summarize
    - from: summarize
blocks:
  research:
    type: linear
    soul_ref: researcher
    limits:
      cost_cap_usd: 3.00
      max_duration_seconds: 300
      on_exceed: fail
  summarize:
    type: linear
    soul_ref: writer
    limits:
      cost_cap_usd: 1.00
      on_exceed: warn

In this example:

The workflow will hard-stop at $5.00 total or 200k tokens or 10 minutes.
The research block will hard-stop at $3.00 or 5 minutes.
The summarize block will warn at $1.00 but continue running.
Both blocks’ costs propagate to the workflow total.