Eval Test Harness

The eval test harness lets you define test cases directly in your workflow YAML file. Each case specifies inputs, optional fixture outputs, and expected assertions per block. Fixture mode skips all LLM calls, making tests fast, free, and deterministic.

Add an eval: section at the top level of your workflow file, alongside blocks: and workflow::

custom/workflows/research.yaml

```yaml
version: "1.0"
blocks:
  analyze:
    type: llm
    soul_ref: analyst
    prompt_template: "Analyze: {task_instruction}"
    assertions:
      - type: contains
        value: "analysis"
workflow:
  name: research
  entry: analyze
  transitions:
    - from: analyze
      to: null
eval:
  threshold: 0.8
  cases:
    - id: basic_research
      inputs:
        task_instruction: "Research LLMs"
      fixtures:
        analyze: "LLMs have transformed software development. This analysis covers key trends."
      expected:
        analyze:
          - type: contains
            value: "analysis"
```

The eval: section is parsed as an EvalSectionDef model:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| threshold | float | no | 1.0 (when omitted) | Minimum aggregate score for the suite to pass. Range: 0.0 to 1.0 |
| cases | list[EvalCaseDef] | yes | | At least one test case required (min_length=1) |

Case IDs must be unique within the eval section. Duplicates cause a validation error.

Each entry in cases: is an EvalCaseDef:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| id | str | yes | | Unique identifier for this test case |
| description | str | no | None | Human-readable description of what this case tests |
| inputs | dict[str, any] | no | None | Input values passed to the executor |
| fixtures | dict[str, str] | no | None | Block ID to output string mapping. Skips LLM calls for these blocks |
| expected | dict[str, list[dict]] | no | None | Block ID to list of assertion configs. These assertions are evaluated against block outputs |

When a case provides fixtures that cover every block listed in expected, the eval runner operates in fixture mode:

  • No executor is called (no LLM calls, no API calls)
  • A WorkflowState is built directly from fixture strings
  • Assertions run against the fixture values

This makes tests instant and free. You can run hundreds of cases without spending tokens.
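In fixture mode, each assertion config is evaluated directly against the fixture string for its block. A minimal sketch, assuming a hypothetical `check_assertion` helper covering the two deterministic assertion types shown in this doc:

```python
# Sketch of fixture-mode evaluation; check_assertion is a hypothetical
# helper, and only two deterministic assertion types are covered here.
def check_assertion(output: str, cfg: dict) -> float:
    if cfg["type"] == "contains":
        return 1.0 if cfg["value"] in output else 0.0
    if cfg["type"] == "starts-with":
        return 1.0 if output.startswith(cfg["value"]) else 0.0
    raise ValueError(f"unknown assertion type: {cfg['type']}")

def run_fixture_case(fixtures: dict[str, str],
                     expected: dict[str, list[dict]]) -> dict[str, list[float]]:
    # No executor involved: block outputs come straight from fixture strings.
    return {
        block_id: [check_assertion(fixtures[block_id], a) for a in asserts]
        for block_id, asserts in expected.items()
    }
```

Because nothing here touches a model, the same case produces the same scores on every run.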

Fixture mode: all expected blocks have fixtures

```yaml
eval:
  cases:
    - id: fixture_only
      fixtures:
        analyze: "The LLM landscape is evolving rapidly."
        summarize: "Summary: LLMs are improving."
      expected:
        analyze:
          - type: contains
            value: "LLM"
        summarize:
          - type: starts-with
            value: "Summary"
```

If a case has expected blocks without matching fixtures, the runner requires an executor callback. If no executor is provided, it raises a RuntimeError.
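That per-case dispatch might look like the following sketch (`run_case` and its dict-shaped case are assumptions, not the project's actual API):

```python
# Hypothetical sketch of the runner's per-case decision.
def run_case(case: dict, executor=None) -> dict[str, str]:
    """Fixture mode if every expected block has a fixture; else require an executor."""
    fixtures = case.get("fixtures") or {}
    expected = case.get("expected") or {}
    if all(block_id in fixtures for block_id in expected):
        # Fixture mode: build the block-output state straight from fixture strings.
        return {block_id: fixtures[block_id] for block_id in expected}
    if executor is None:
        raise RuntimeError(f"case {case['id']!r} needs an executor for live blocks")
    # Live mode: delegate to the real execution pipeline.
    return executor(case.get("inputs") or {})
```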

When you run evals, the harness loads your workflow YAML, finds the eval: section, and executes each case. Fixture-only cases run instantly with no LLM calls. The result tells you whether the suite passed and gives per-case scores.

The suite result contains:

| Field | Type | Description |
| --- | --- | --- |
| passed | bool | True if score >= threshold |
| score | float | Average of all case scores |
| threshold | float | From eval.threshold (defaults to 1.0 if omitted) |
| case_results | list[EvalCaseResult] | Per-case breakdown |

Each EvalCaseResult contains:

| Field | Type | Description |
| --- | --- | --- |
| case_id | str | Matches id from the YAML case |
| passed | bool | True if all block assertion suites passed |
| score | float | Average of block aggregate scores |
| block_results | dict[str, AssertionsResult] | Block ID to assertion results |

Scores aggregate bottom-up:

  1. Each assertion produces a score (0.0 or 1.0 for deterministic types)
  2. Block score = weighted average of assertion scores
  3. Case score = average of block scores
  4. Suite score = average of case scores
  5. Suite passes if suite_score >= threshold
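The aggregation is plain averaging, so it is easy to check by hand. A sketch that starts from per-block scores (equal assertion weights assumed):

```python
# Sketch of score aggregation: each inner list holds one case's block scores.
def aggregate(case_block_scores: list[list[float]], threshold: float):
    """Case score = mean of its block scores; suite score = mean of case scores."""
    case_scores = [sum(blocks) / len(blocks) for blocks in case_block_scores]
    suite_score = sum(case_scores) / len(case_scores)
    return suite_score, suite_score >= threshold

# One fully passing case, and one case where 1 of its 2 blocks failed:
score, passed = aggregate([[1.0, 1.0], [1.0, 0.0]], threshold=0.8)
# case scores are 1.0 and 0.5, so the suite score is 0.75 and the suite fails
```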

For cases that need actual LLM execution (no fixtures for some blocks), the harness runs each block through the real execution pipeline. This means those cases make live LLM calls, cost tokens, and produce non-deterministic results. Use executor mode when you want to validate actual model behavior rather than testing assertion logic against known outputs.

A single eval section can contain both fixture-only and executor-required cases. The runner decides per-case:

Mixed cases

```yaml
eval:
  threshold: 0.5
  cases:
    - id: fast_fixture_test
      fixtures:
        analyze: "LLMs are powerful language models."
      expected:
        analyze:
          - type: contains
            value: "LLM"
    - id: live_execution_test
      inputs:
        task_instruction: "Research transformers"
      expected:
        analyze:
          - type: contains
            value: "transformer"
```

The fixture case runs without an executor. The live case requires one. If no executor is provided, only fixture cases succeed — the live case raises a RuntimeError.