# Retry Policies

> Configure automatic retries and failure handling to make SirenSpec workflows resilient to transient provider errors.

## Overview

LLM provider APIs fail transiently: rate limits (429), overloaded servers (503), and flaky networks are normal in production. Without retry logic, a single transient error can kill an otherwise healthy run. SirenSpec's retry policy system gives you configurable backoff, jitter, and structured failure handling with zero boilerplate.

***

## `retry` block

Add a `retry` block to any node to override the retry behaviour for that node.

```yaml theme={null}
nodes:
  classify:
    agent: classifier
    writes: working.intent
    retry:
      max_attempts: 3
      backoff: exponential      # linear | exponential | constant
      base_delay: 1.0           # seconds before the first retry
      max_delay: 30.0           # upper bound on any computed delay, in seconds
      jitter: true              # add ±20 % random variation
      on: [429, 500, 502, 503, network_error]
```

### Fields

| Field                | Required | Default                | Description                                                                                                                                                      |
| -------------------- | -------- | ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_attempts`       | No       | `1`                    | Total number of attempts, including the first. The default of `1` means no retries.                                                                             |
| `backoff`            | No       | `constant`             | Delay growth strategy: `constant`, `linear`, or `exponential`.                                                                                                   |
| `base_delay`         | No       | `1.0`                  | Base delay in seconds: the wait before the first retry (attempt 2) and the input to the backoff formula.                                                        |
| `max_delay`          | No       | `60.0`                 | Maximum delay regardless of backoff math.                                                                                                                        |
| `jitter`             | No       | `false`                | When `true`, adds a random ±20 % offset to each delay.                                                                                                           |
| `on`                 | No       | `[429, network_error]` | Trigger conditions. Integers match HTTP status codes; `network_error` matches connection failures; `guardrail_violation` matches an output `GuardrailViolation`. |
| `retry_on_guardrail` | No       | `false`                | When `true`, output guardrail checks run **inside** the retry loop so a `GuardrailViolation` re-runs the LLM call instead of failing immediately.                |
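
To illustrate how the `on` list could be evaluated, here is a minimal Python sketch of the matching rules in the table above. The `Failure` dataclass and the `should_retry` function are hypothetical names invented for this page, not part of the SirenSpec API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

Trigger = Union[int, str]


@dataclass
class Failure:
    """Hypothetical description of one failed attempt (not a SirenSpec type)."""
    status_code: Optional[int] = None   # HTTP status, if the provider answered
    network_error: bool = False         # connection-level failure
    guardrail_violation: bool = False   # an output guardrail rejected the response


def should_retry(on: Sequence[Trigger], failure: Failure) -> bool:
    """Return True if any trigger in `on` matches the failure."""
    for trigger in on:
        if isinstance(trigger, int) and trigger == failure.status_code:
            return True
        if trigger == "network_error" and failure.network_error:
            return True
        if trigger == "guardrail_violation" and failure.guardrail_violation:
            return True
    return False


default_on = [429, "network_error"]
print(should_retry(default_on, Failure(status_code=429)))  # True
print(should_retry(default_on, Failure(status_code=503)))  # False
```

With the default list, a 503 is not retried; it has to be added to `on` explicitly, as in the example at the top of this section.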

### Backoff strategies

| Strategy      | Delay formula                |
| ------------- | ---------------------------- |
| `constant`    | `base_delay` on every retry  |
| `linear`      | `base_delay × attempt`       |
| `exponential` | `base_delay × 2^(attempt-1)` |

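In these formulas, `attempt` is the 1-based number of the attempt that just failed, so the first retry waits `base_delay` under every strategy. Every computed delay is clamped to `max_delay`.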
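
The formulas can be checked with a short Python sketch. It reads `attempt` as the 1-based number of the attempt that just failed, which matches the `base_delay` description and the trace example on this page. The function names are hypothetical, and applying jitter after the clamp is an assumption; this page does not state the order.

```python
import random


def backoff_delay(strategy: str, base_delay: float, attempt: int,
                  max_delay: float = 60.0) -> float:
    """Delay in seconds after failed attempt number `attempt` (1-based)."""
    if strategy == "constant":
        delay = base_delay
    elif strategy == "linear":
        delay = base_delay * attempt
    elif strategy == "exponential":
        delay = base_delay * 2 ** (attempt - 1)
    else:
        raise ValueError(f"unknown backoff strategy: {strategy!r}")
    return min(delay, max_delay)  # every strategy is clamped to max_delay


def with_jitter(delay: float, rng: random.Random) -> float:
    """Scale the delay by a random factor within ±20 % (assumed to follow the clamp)."""
    return delay * rng.uniform(0.8, 1.2)


# Exponential backoff, base_delay 1.0, max_delay 30.0, six failed attempts:
schedule = [backoff_delay("exponential", 1.0, n, max_delay=30.0) for n in range(1, 7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

The sixth delay would be 32 seconds by the formula, so the clamp reduces it to 30.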

***

## Retrying on guardrail violations

By default, retries fire only on transient transport errors (HTTP status codes and network failures). Output guardrails, such as `schema`, run *after* the retry loop, so a malformed-but-successful response fails the run outright.

Set `retry_on_guardrail: true` to fold output guardrail checks into the retry loop. A `GuardrailViolation` then counts as a retryable error and triggers another LLM call, giving the model additional chances to produce output that satisfies the guardrail. Add `guardrail_violation` to the `on` list to make the trigger explicit. An explicit `on` list replaces the default `[429, network_error]`, so keep those entries in the list if the node should still retry transport errors.

```yaml theme={null}
nodes:
  extract:
    agent: extractor          # must return JSON matching a schema guardrail
    writes: output.person
    retry:
      max_attempts: 3
      retry_on_guardrail: true
      on: [guardrail_violation]
```

This is the recommended pattern whenever an agent must satisfy a `schema` (or other output) guardrail — see [Guardrails](/guardrails).
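
The difference between checking the guardrail after the loop and inside it can be sketched in Python. Everything here is a stand-in for illustration: `check_schema`, `run_node`, and the fake model are hypothetical, the `GuardrailViolation` class is a local placeholder for the real exception, and backoff delays are omitted for brevity.

```python
import json


class GuardrailViolation(Exception):
    """Local stand-in for the output guardrail error named above."""


def check_schema(raw: str) -> dict:
    """Toy schema guardrail: output must be a JSON object with a "name" key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GuardrailViolation(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or "name" not in data:
        raise GuardrailViolation('missing required key "name"')
    return data


def run_node(call_llm, max_attempts: int, retry_on_guardrail: bool) -> dict:
    """Call the model, checking the guardrail inside or after the retry loop."""
    for attempt in range(1, max_attempts + 1):
        raw = call_llm()
        if not retry_on_guardrail:
            break  # transport succeeded; the guardrail runs once, after the loop
        try:
            return check_schema(raw)
        except GuardrailViolation:
            if attempt == max_attempts:
                raise  # attempts exhausted; on_failure handling would start here
    return check_schema(raw)


def make_fake_llm():
    """Fake model: malformed output first, valid JSON on the second call."""
    responses = iter(["Sure! The name is Ada.", '{"name": "Ada"}'])
    return lambda: next(responses)


print(run_node(make_fake_llm(), max_attempts=3, retry_on_guardrail=True))
# {'name': 'Ada'}
```

Calling the same function with `retry_on_guardrail=False` raises `GuardrailViolation` on the first malformed response, which mirrors the default behaviour described at the start of this section.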

***

## `on_failure` block

`on_failure` controls what happens when all retry attempts are exhausted.

```yaml theme={null}
nodes:
  classify:
    agent: classifier
    writes: working.intent
    retry:
      max_attempts: 3
      on: [429, 503]
    on_failure:
      action: fallback             # abort | fallback | skip | use_default
      fallback_node: classify_safe # used when action: fallback
      default_output: "unknown"    # used only when action: use_default
```

### Fields

| Field            | Required | Default | Description                                                                  |
| ---------------- | -------- | ------- | ---------------------------------------------------------------------------- |
| `action`         | No       | `abort` | What to do when retries are exhausted.                                       |
| `fallback_node`  | No       | —       | Node ID to route to when `action: fallback`.                                 |
| `default_output` | No       | —       | Static value written to the node's `writes` path when `action: use_default`. |

### Actions

| Action        | Behaviour                                                                            |
| ------------- | ------------------------------------------------------------------------------------ |
| `abort`       | Raises `RetryExhaustedError` and stops the run immediately.                          |
| `fallback`    | Routes execution to `fallback_node`. The failed node is marked skipped.              |
| `skip`        | Skips the node without raising; downstream nodes that read its output get no value.  |
| `use_default` | Writes `default_output` to the node's `writes` path and continues normally.          |
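
The four actions amount to a small dispatch, sketched below in Python. The `apply_on_failure` function, the flat `state` dictionary keyed by the `writes` path, and the local `RetryExhaustedError` class are simplifications made up for this example; the real runtime's state model and signatures may differ.

```python
from typing import Any, Dict, Optional


class RetryExhaustedError(Exception):
    """Local stand-in for the error raised by `action: abort`."""


def apply_on_failure(on_failure: Dict[str, Any], writes: str,
                     state: Dict[str, Any]) -> Optional[str]:
    """Apply an on_failure policy; return the node ID to route to, if any."""
    action = on_failure.get("action", "abort")  # `abort` is the default
    if action == "abort":
        raise RetryExhaustedError("all retry attempts failed")
    if action == "fallback":
        return on_failure["fallback_node"]  # route execution to the fallback node
    if action == "skip":
        return None  # nothing is written for this node
    if action == "use_default":
        state[writes] = on_failure["default_output"]  # write the static value
        return None
    raise ValueError(f"unknown on_failure action: {action!r}")


state: Dict[str, Any] = {}
apply_on_failure({"action": "use_default", "default_output": "unknown"},
                 "working.intent", state)
print(state)  # {'working.intent': 'unknown'}

next_node = apply_on_failure({"action": "fallback", "fallback_node": "classify_safe"},
                             "working.intent", state)
print(next_node)  # classify_safe
```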

***

## Workflow-level defaults

Set retry and failure defaults under the top-level `defaults` key of your workflow. Every node that does not specify its own `retry` or `on_failure` block inherits the corresponding default block.

```yaml theme={null}
version: "0.1"

defaults:
  retry:
    max_attempts: 3
    backoff: exponential
    base_delay: 1.0
    on: [429, network_error]
  on_failure:
    action: abort

agents: { ... }
nodes: { ... }
```

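Per-node `retry` and `on_failure` blocks completely override the defaults for that node; they are not merged field-by-field. A field omitted from a node-level block therefore takes its built-in default, not the workflow-level value.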
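
The replace-not-merge rule is easy to get wrong, so here is a small Python sketch of the resolution. `effective_retry` is a hypothetical helper, and the sketch assumes that fields missing from the chosen block fall back to the built-in defaults listed in the `retry` fields table.

```python
from typing import Any, Dict

# Built-in field defaults, copied from the `retry` fields table.
BUILTIN_RETRY: Dict[str, Any] = {
    "max_attempts": 1,
    "backoff": "constant",
    "base_delay": 1.0,
    "max_delay": 60.0,
    "jitter": False,
    "on": [429, "network_error"],
    "retry_on_guardrail": False,
}


def effective_retry(workflow_defaults: Dict[str, Any],
                    node: Dict[str, Any]) -> Dict[str, Any]:
    """Use the node's `retry` block if present, else the workflow default block."""
    block = node["retry"] if "retry" in node else workflow_defaults.get("retry", {})
    # The chosen block is layered over the built-ins only; the two blocks never merge.
    return {**BUILTIN_RETRY, **block}


defaults = {"retry": {"max_attempts": 3, "backoff": "exponential"}}

inherited = effective_retry(defaults, {"agent": "classifier"})
overridden = effective_retry(defaults, {"agent": "classifier",
                                        "retry": {"max_attempts": 5}})

print(inherited["max_attempts"], inherited["backoff"])    # 3 exponential
print(overridden["max_attempts"], overridden["backoff"])  # 5 constant
```

The second node asked only for more attempts, yet it lost the exponential backoff from the workflow defaults. Repeat every field you rely on when you override a block.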

***

## Error types

| Exception             | When raised                                                                                           |
| --------------------- | ----------------------------------------------------------------------------------------------------- |
| `RetryExhaustedError` | All retry attempts failed and `on_failure.action` is `abort` (or unset). Subclass of `ProviderError`. |

`RetryExhaustedError` carries the node ID, the number of attempts made, and the last upstream exception so you can log and inspect the root cause.
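
A handler for this error might look like the Python sketch below. The classes are local stand-ins so the example runs on its own, and the attribute names `node_id`, `attempts`, and `last_error` are assumptions; check the actual exception for the names it exposes.

```python
class ProviderError(Exception):
    """Local stand-in for the base class named in the table above."""


class RetryExhaustedError(ProviderError):
    """Local stand-in; the attribute names are illustrative assumptions."""

    def __init__(self, node_id: str, attempts: int, last_error: BaseException):
        super().__init__(f"node {node_id!r} failed after {attempts} attempts: {last_error}")
        self.node_id = node_id
        self.attempts = attempts
        self.last_error = last_error


def run_workflow() -> None:
    """Placeholder for a real run; always fails, for demonstration."""
    raise RetryExhaustedError("classify", 3, ConnectionError("connection reset"))


try:
    run_workflow()
except RetryExhaustedError as exc:
    # Catch the specific error first; a broader `except ProviderError` would also match.
    summary = (f"{exc.node_id}: {exc.attempts} attempts, "
               f"root cause {type(exc.last_error).__name__}")
    print(summary)  # classify: 3 attempts, root cause ConnectionError
```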

***

## Tracing

Every retry attempt is recorded in the run trace under the node's `retry_attempts` list:

```json theme={null}
{
  "id": "classify",
  "retry_attempts": [
    { "attempt": 1, "delay_seconds": 1.0, "error": "HTTP 429: Too Many Requests" },
    { "attempt": 2, "delay_seconds": 2.0, "error": "HTTP 429: Too Many Requests" }
  ]
}
```

Silent retries make debugging impossible, so every attempt, delay, and error is always recorded in the trace.
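
Because the trace is plain JSON, retry behaviour can be summarised with a few lines of Python. This sketch parses the node entry shown above from an inline string; how you obtain the trace from a real run is outside its scope.

```python
import json

trace_node = json.loads("""
{
  "id": "classify",
  "retry_attempts": [
    { "attempt": 1, "delay_seconds": 1.0, "error": "HTTP 429: Too Many Requests" },
    { "attempt": 2, "delay_seconds": 2.0, "error": "HTTP 429: Too Many Requests" }
  ]
}
""")

attempts = trace_node["retry_attempts"]
total_delay = sum(a["delay_seconds"] for a in attempts)  # time spent backing off
errors = sorted({a["error"] for a in attempts})          # distinct error messages

print(f'{trace_node["id"]}: {len(attempts)} recorded attempts, '
      f'{total_delay:.1f}s spent waiting')
# classify: 2 recorded attempts, 3.0s spent waiting
print(errors)  # ['HTTP 429: Too Many Requests']
```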

***

## Full example

```yaml theme={null}
version: "0.1"

defaults:
  retry:
    max_attempts: 3
    backoff: exponential
    base_delay: 1.0
    on: [429, network_error]
  on_failure:
    action: abort

agents:
  classifier:
    model: "anthropic:claude-haiku-4-5-20251001"
    system: "Classify the ticket."

  safe_classifier:
    model: "openai:gpt-4o-mini"
    system: "Classify the ticket. Reply with a single word."

nodes:
  classify:
    agent: classifier
    writes: working.intent
    retry:
      max_attempts: 5
      backoff: exponential
      base_delay: 1.0
      max_delay: 30.0
      jitter: true
      on: [429, 500, 502, 503, network_error]
    on_failure:
      action: fallback
      fallback_node: classify_safe

  classify_safe:
    agent: safe_classifier
    writes: working.intent
    on_failure:
      action: use_default
      default_output: "unknown"
```
