Regressions

Regressions are quality problems that appear when comparing consecutive runs of the same workflow. Runsight automatically detects three types of regression by comparing each production run against its predecessor.

The EvalService._detect_node_regressions method compares two matching nodes (same node_id and soul_version) across consecutive production runs. It flags three conditions:

| Type | Condition | Delta payload |
| --- | --- | --- |
| `assertion_regression` | `eval_passed` changed from `True` to `False` | `{eval_passed: false, baseline_eval_passed: true}` |
| `cost_spike` | Cost increased more than 20% vs the previous run | `{cost_pct: <percentage>, baseline_cost: <previous cost>}` |
| `quality_drop` | `eval_score` dropped by more than 0.1 | `{score_delta: <negative number>}` |
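The per-node comparison can be sketched as follows. This is an illustrative reconstruction from the table above, not the actual `EvalService._detect_node_regressions` source; the node dict shape and the `cost` field name are assumptions.

```python
def detect_node_regressions(current: dict, baseline: dict) -> list[dict]:
    """Compare one (node_id, soul_version)-matched node pair and
    return zero or more regression issues (hypothetical sketch)."""
    issues = []

    # assertion_regression: eval_passed flipped True -> False
    if baseline.get("eval_passed") is True and current.get("eval_passed") is False:
        issues.append({
            "type": "assertion_regression",
            "delta": {"eval_passed": False, "baseline_eval_passed": True},
        })

    # cost_spike: cost grew more than 20% vs the previous run
    base_cost = baseline.get("cost") or 0.0
    cur_cost = current.get("cost") or 0.0
    if base_cost > 0 and (cur_cost - base_cost) / base_cost > 0.20:
        issues.append({
            "type": "cost_spike",
            "delta": {
                "cost_pct": round((cur_cost - base_cost) / base_cost * 100, 1),
                "baseline_cost": base_cost,
            },
        })

    # quality_drop: eval_score fell by more than 0.1
    base_score = baseline.get("eval_score")
    cur_score = current.get("eval_score")
    if base_score is not None and cur_score is not None and base_score - cur_score > 0.1:
        issues.append({
            "type": "quality_drop",
            "delta": {"score_delta": round(cur_score - base_score, 3)},
        })

    return issues
```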

These fields live on the RunNode entity and are populated by the EvalObserver when a block with assertions completes:

  • eval_score (Optional[float]): The weighted average of all assertion scores for that node. A value of 1.0 means every assertion scored perfectly; 0.0 means all failed.
  • eval_passed (Optional[bool]): True only if every individual assertion passed. A node can have a high score but still fail if one assertion did not pass.
  • eval_results (Optional[Dict]): Detailed breakdown with per-assertion type, passed, score, and reason.

When a block has no assertions, all three fields are None and the node is excluded from regression detection.
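A minimal sketch of how these three fields could be derived from per-assertion results. The doc says `eval_score` is a weighted average but does not specify the weights, so this sketch assumes equal weights; the function name and result shape are hypothetical.

```python
def summarize_assertions(results: list[dict]) -> dict:
    """Derive eval_score / eval_passed / eval_results for one node.

    Each result dict carries: type, passed (bool), score (0.0-1.0), reason.
    A block with no assertions leaves all three fields as None, which
    excludes the node from regression detection.
    """
    if not results:
        return {"eval_score": None, "eval_passed": None, "eval_results": None}
    return {
        # Average of all assertion scores (equal weights assumed here)
        "eval_score": sum(r["score"] for r in results) / len(results),
        # Passes only if every individual assertion passed
        "eval_passed": all(r["passed"] for r in results),
        # Detailed per-assertion breakdown
        "eval_results": {"assertions": results},
    }
```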

Regression detection works at the run level via the EvalService:

For a single run (GET /api/runs/{run_id}/regressions):

  1. Find all production runs for the same workflow, ordered by created_at
  2. Identify the previous production run before the target run
  3. Match nodes between the two runs by (node_id, soul_version)
  4. For each matching pair, check the three regression conditions
  5. Return {count: N, issues: [...]} — or {count: 0, issues: []} if this is the first run
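The five steps above can be sketched like this. Run and node shapes, and the injected `compare_nodes` callback, are illustrative assumptions rather than Runsight's actual internals.

```python
def run_regressions(target_run: dict, all_prod_runs: list[dict], compare_nodes) -> dict:
    """Return {count, issues} for one run vs its predecessor (sketch)."""
    # 1-2. Order production runs and find the run just before the target
    ordered = sorted(all_prod_runs, key=lambda r: r["created_at"])
    idx = next(i for i, r in enumerate(ordered) if r["id"] == target_run["id"])
    if idx == 0:
        return {"count": 0, "issues": []}  # first run: no baseline to compare
    previous = ordered[idx - 1]

    # 3. Match nodes between the two runs by (node_id, soul_version)
    baseline = {(n["node_id"], n["soul_version"]): n for n in previous["nodes"]}
    issues = []
    for node in target_run["nodes"]:
        match = baseline.get((node["node_id"], node["soul_version"]))
        if match:
            # 4. Check the three regression conditions for each pair
            issues.extend(compare_nodes(node, match))

    # 5. Aggregate into the response shape
    return {"count": len(issues), "issues": issues}
```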

For a workflow (GET /api/workflows/{id}/regressions):

  1. Get all production runs for the workflow, ordered by created_at
  2. For each consecutive pair, compare matching nodes
  3. Each issue includes run_id and run_number to identify which run introduced it
  4. Return the aggregate across all run pairs
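The workflow-level aggregation walks consecutive pairs; a sketch under the same assumptions as above (shapes and the `compare_nodes` callback are illustrative):

```python
def workflow_regressions(prod_runs: list[dict], compare_nodes) -> dict:
    """Aggregate issues across every consecutive production-run pair (sketch)."""
    ordered = sorted(prod_runs, key=lambda r: r["created_at"])
    issues = []
    for prev, cur in zip(ordered, ordered[1:]):
        baseline = {(n["node_id"], n["soul_version"]): n for n in prev["nodes"]}
        for node in cur["nodes"]:
            match = baseline.get((node["node_id"], node["soul_version"]))
            for issue in (compare_nodes(node, match) if match else []):
                # Tag each issue with the run that introduced it
                issue.update(run_id=cur["id"], run_number=cur["run_number"])
                issues.append(issue)
    return {"count": len(issues), "issues": issues}
```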

The first production run of a workflow always has zero regressions since there is no baseline to compare against.

Each regression issue in the response contains:

| Field | Type | Description |
| --- | --- | --- |
| `node_id` | `str` | The block that regressed |
| `node_name` | `str` | Display name of the node |
| `type` | `str` | One of `assertion_regression`, `cost_spike`, `quality_drop` |
| `delta` | `dict` | Type-specific comparison data (see table above) |
| `run_id` | `str` | (workflow endpoint only) Which run introduced this regression |
| `run_number` | `int` | (workflow endpoint only) Sequential run number |

Run-level pass rate is tracked via eval_pass_pct on the RunResponse schema. This field represents the percentage of eval-bearing nodes that passed in a run. The RunResponse also includes:

| Field | Type | Description |
| --- | --- | --- |
| `eval_pass_pct` | `Optional[float]` | Percentage of nodes with `eval_passed = True` |
| `eval_score_avg` | `Optional[float]` | Average `eval_score` across all eval-bearing nodes |
| `regression_count` | `Optional[int]` | Number of regression issues detected for this run |
| `regression_types` | `list[str]` | List of regression type strings found (e.g., `["assertion_regression", "cost_spike"]`) |
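A sketch of how these aggregates could be computed, assuming only eval-bearing nodes (those with `eval_passed` set) count toward the percentages; the function name and input shapes are hypothetical.

```python
def run_eval_summary(nodes: list[dict], issues: list[dict]) -> dict:
    """Compute the run-level eval fields for a RunResponse (sketch)."""
    # Only nodes that carry eval results participate in the aggregates
    evaled = [n for n in nodes if n.get("eval_passed") is not None]
    summary = {
        "regression_count": len(issues),
        "regression_types": sorted({i["type"] for i in issues}),
    }
    if not evaled:
        summary.update(eval_pass_pct=None, eval_score_avg=None)
    else:
        summary.update(
            eval_pass_pct=100.0 * sum(n["eval_passed"] for n in evaled) / len(evaled),
            eval_score_avg=sum(n["eval_score"] for n in evaled) / len(evaled),
        )
    return summary
```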

These fields appear in the runs list (GET /api/runs) and single run detail (GET /api/runs/{id}), enabling dashboards to show pass rate trends across runs.

The Run Detail view displays a priority banner when regressions are detected. The banner shows the regression count (e.g., “3 regressions found”) and appears at the top of the run detail page.

The frontend fetches regression data via useRunRegressions(runId) and displays the count. The workflow-level regression query (useWorkflowRegressions(workflowId)) provides cross-run regression history.

The regression response schema on the frontend validates three regression types: assertion_regression, cost_spike, and quality_drop. Unknown types are rejected by the WorkflowRegressionSchema validator.

The EvalService.get_attention_items method scans production runs from the last 24 hours and surfaces regressions as attention items on the dashboard. It flags the same three conditions as the regression endpoints, plus a new_baseline info item for the first production run of a soul version. Items are sorted by severity (warnings before info) and recency.
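The ordering rule (warnings before info, newer first within each severity) can be sketched as below; the item shape and severity labels are assumptions based on the description above.

```python
def sort_attention_items(items: list[dict]) -> list[dict]:
    """Order attention items by severity, then recency (sketch).

    Warnings sort before info items; within a severity, more recent
    items (larger created_at timestamp) come first.
    """
    severity_rank = {"warning": 0, "info": 1}
    return sorted(
        items,
        key=lambda i: (severity_rank.get(i["severity"], 2), -i["created_at"]),
    )
```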