Error Handling & Retries in OpenSearch ISM and Cross-Cluster Replication

Index State Management (ISM) and Cross-Cluster Replication (CCR) both run as asynchronous state machines that reconcile on a fixed interval. When an action fails they do not raise an exception you can catch: the managed index simply parks in its current state with a failed: true flag and a spent retry counter, while the follower checkpoint quietly stops advancing. Without deliberate error handling this silent stall compounds: a rollover that could not create its target index keeps the write alias pinned to an oversized shard, disk fills, ingestion backs up, and by the time an alert fires the OpenSearch cluster is already yellow. This guide covers how to classify ISM and CCR failures, encode bounded retry logic inside both the policy and the surrounding Python automation, and build the guardrails that turn a transient 503 into a self-healing recovery rather than a 3 a.m. page. It extends the automation surface established in the ISM Policy Implementation & Python Automation guide.

Failure classification reference

Recovery starts with classifying the failure signature, because the correct response to a 409 (wait and retry the same action) is the opposite of the correct response to a 400 (stop and alert, since retrying a malformed policy just burns the counter). ISM surfaces the underlying cause through the _plugins/_ism/explain endpoint; CCR surfaces its state through _plugins/_replication/<index>/_status. The table below maps the signatures you are most likely to meet in production to whether each is worth retrying and what the automation should do. Treat the “Retryable” column as the contract the Python orchestrator enforces.

| Signature | Source | Typical cause | Retryable | Automated response |
| --- | --- | --- | --- | --- |
| `409 Conflict` | ISM action | Concurrent policy edit, rollover already in flight, alias contention | Yes | Backoff, re-`retry` the failed step |
| `503 Service Unavailable` | ISM / cluster | Master election, thread-pool saturation, circuit breaker trip | Yes | Backoff on cluster health, then `retry` |
| `429 Too Many Requests` | ISM / bulk | Write queue or `force_merge` throttling | Yes | Longer backoff, reduce concurrency |
| `400 Bad Request` | ISM policy | Malformed policy JSON, invalid threshold syntax | No | Fail fast, alert, hold the counter |
| `failed_step` exhausted | ISM state | Action retried out per its own `retry` block | No (auto) | Diagnose blocker, `retry` once cleared |
| `REPLICATION_FAILED` | CCR follower | Leader unreachable, index force-merged/deleted upstream | Conditional | Verify leader, pause/resume checkpoint |
| `REPLICATION_LAG` | CCR follower | Follower I/O saturation, leader throttling | Yes | Monitor lag, throttle leader writes |
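One way to make the "Retryable" column executable is a small lookup the orchestrator consults before it decides to back off or alert. The sketch below is illustrative only: the names `RETRYABLE_STATUS`, `PERMANENT_STATUS`, and `classify_status` are hypothetical, and sending every unlisted status code to human review is an assumption rather than anything ISM prescribes.

```python
# Hypothetical lookup mirroring the "Retryable" column of the table above.
RETRYABLE_STATUS = {409, 429, 503}   # back off, then re-issue the retry
PERMANENT_STATUS = {400, 403}        # fail fast and alert


def classify_status(status_code: int) -> str:
    """Return 'retry', 'alert', or 'review' for an HTTP status code."""
    if status_code in RETRYABLE_STATUS:
        return "retry"
    if status_code in PERMANENT_STATUS:
        return "alert"
    # Assumption: anything unlisted goes to a human rather than being retried.
    return "review"


if __name__ == "__main__":
    for code in (409, 503, 429, 400, 403, 500):
        print(code, classify_status(code))
```

Keeping the sets in one place means the policy review, the orchestrator, and the alerting rules can all point at the same contract.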

The _plugins/_ism/explain payload is the primary diagnostic vector. It returns the exact failed_step, the number of retries already consumed, and the underlying cause, so a recovery routine can target the precise execution boundary rather than blindly reapplying the whole policy. How those steps map onto the state graph is described in Phase Transition Logic; a retry is only ever an attempt to re-run the single step that failed, never a restart of the lifecycle.

Shell

curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-prod-*?pretty"

JSON

{
  "logs-prod-2026.07.04-000001": {
    "policy_id": "logs-hot-warm-cold",
    "action": {
      "name": "rollover",
      "failed": true,
      "consumed_retries": 3,
      "last_retry_time": 1751606400000
    },
    "step": { "name": "attempt_rollover", "status": "failed" },
    "info": {
      "cause": "rollover index [logs-prod-2026.07.04-000002] already exists",
      "message": "Rollover failed due to target index conflict"
    }
  }
}

The cause string is the fork in the road. already exists and unavailable are transient — a later cycle succeeds. invalid or mapping are permanent — retries are wasted. The automation in the sections below reads exactly these fields.
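As a rough illustration of that fork, the sketch below classifies the cause string from an explain payload shaped like the one above. The function names `classify_cause` and `failed_indices` are hypothetical, the marker lists contain only the four substrings named in the text, and checking permanent markers first is an assumed tie-break for messages that match both lists.

```python
from typing import Any

# Marker substrings taken from the prose above; extend them for your own causes.
TRANSIENT_MARKERS = ("already exists", "unavailable")
PERMANENT_MARKERS = ("invalid", "mapping")


def classify_cause(cause: str) -> str:
    """Return 'permanent', 'transient', or 'unknown' for an ISM cause string."""
    text = cause.lower()
    # Assumption: check permanent markers first so a mixed message is never retried.
    if any(marker in text for marker in PERMANENT_MARKERS):
        return "permanent"
    if any(marker in text for marker in TRANSIENT_MARKERS):
        return "transient"
    return "unknown"


def failed_indices(explain: dict[str, Any]) -> dict[str, str]:
    """Map each failed index in an explain payload to its classification."""
    result: dict[str, str] = {}
    for index, meta in explain.items():
        if not isinstance(meta, dict):
            continue  # skip non-index keys such as counters
        if meta.get("action", {}).get("failed"):
            cause = meta.get("info", {}).get("cause", "")
            result[index] = classify_cause(cause)
    return result


if __name__ == "__main__":
    sample = {
        "logs-prod-2026.07.04-000001": {
            "action": {"name": "rollover", "failed": True, "consumed_retries": 3},
            "info": {"cause": "rollover index [logs-prod-2026.07.04-000002] already exists"},
        }
    }
    print(failed_indices(sample))
```

An "unknown" result is deliberately distinct from "transient": a cause the classifier has never seen is safer to route to review than to retry.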

The retry lifecycle

ISM has two nested layers of retry, and confusing them is the most common source of runaway recovery loops. The inner layer is the retry block declared inside each policy action: OpenSearch itself re-runs a failed action up to count times with its own backoff before flipping failed: true. The outer layer is your operational retry — the POST _plugins/_ism/retry call that resets a fully-exhausted index so the engine will attempt the step again. Outer retries must only fire after the inner counter is spent and the blocking condition has cleared; otherwise you re-arm an index that immediately fails again and thrash the scheduler.

Backoff is bounded exponential with jitter so that a whole batch of managed indices failing at once does not synchronise their retries into a thundering herd against a recovering master. The delay for attempt $n$ is

$$t_n = \min\left(t_{\max},\; t_{base}\cdot 2^{n}\right) + \text{rand}(0, j)$$

where $t_{base}$ is the base delay, $t_{\max}$ caps the wait, and $j$ is a jitter window. Align $t_{base}$ with the OpenSearch cluster’s real recovery time — long enough for a translog flush, segment merge, and metadata commit to finish, which for indices governed by Rollover Trigger Configuration is typically tens of seconds, not milliseconds. The size and age boundaries that decide when those actions fire are the subject of Threshold Tuning Strategies.
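The formula translates almost directly into code. The sketch below is a minimal version, assuming delays in seconds; the function name `backoff_delay` and the default values for base, cap, and jitter are placeholders chosen for illustration, not recommendations.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 10.0, cap: float = 120.0,
                  jitter: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Compute t_n = min(cap, base * 2**attempt) + rand(0, jitter) in seconds."""
    rng = rng or random.Random()
    bounded = min(cap, base * 2 ** attempt)   # exponential growth, capped at t_max
    return bounded + rng.uniform(0.0, jitter)  # additive jitter window j


if __name__ == "__main__":
    for n in range(6):
        print(n, round(backoff_delay(n), 2))
```

Note that tenacity's `wait_random_exponential`, used later in this guide, draws the whole delay uniformly between zero and the capped exponential, so it is a related scheme rather than an exact implementation of this formula.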

1. Encode retry blocks in the policy

The first line of defence lives inside the policy itself. Every action that touches cluster metadata should carry an explicit retry block so OpenSearch absorbs transient failures before your automation ever sees them. The backoff mode may be exponential, constant, or linear; count bounds the inner loop; delay seeds the wait.

JSON

{
  "policy": {
    "description": "Lifecycle with per-action retry and error notification",
    "default_state": "hot",
    "error_notification": {
      "destination": { "slack": { "url": "https://hooks.slack.com/services/XXX" } },
      "message_template": {
        "source": "ISM failed on {{ctx.index}} under policy {{ctx.policy_id}}"
      }
    },
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
            "rollover": { "min_primary_shard_size": "40gb", "min_index_age": "7d" }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "14d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "retry": { "count": 5, "backoff": "exponential", "delay": "30m" },
            "force_merge": { "max_num_segments": 1 }
          }
        ],
        "transitions": []
      }
    ]
  }
}

The error_notification block fires once per action failure and is the difference between discovering a stall from a Slack message versus from a full disk. Give force_merge a generous count and delay — it is I/O-bound and legitimately slow on a busy warm tier, so an aggressive inner retry only adds load.
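A cheap way to enforce the "every action carries a retry block" rule is to lint policies before they are uploaded. The sketch below assumes the policy is available as a Python dict with the same shape as the JSON above; the helper name `actions_missing_retry` is hypothetical.

```python
from typing import Any


def actions_missing_retry(policy_doc: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (state, action) pairs whose action has no explicit retry block."""
    missing: list[tuple[str, str]] = []
    for state in policy_doc.get("policy", {}).get("states", []):
        for action in state.get("actions", []):
            if "retry" in action:
                continue
            # The remaining key names the action type (for example "rollover");
            # "timeout" is per-action configuration, not an action type.
            names = [key for key in action if key not in ("retry", "timeout")]
            missing.append((state.get("name", "?"), names[0] if names else "?"))
    return missing


if __name__ == "__main__":
    policy = {
        "policy": {
            "states": [
                {"name": "hot", "actions": [
                    {"retry": {"count": 3, "backoff": "exponential", "delay": "10m"},
                     "rollover": {"min_index_age": "7d"}}]},
                {"name": "warm", "actions": [
                    {"force_merge": {"max_num_segments": 1}}]},
            ]
        }
    }
    print(actions_missing_retry(policy))
```

Run as a pre-deployment check, a non-empty result can fail the pipeline before a policy without retry coverage reaches the cluster.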

2. Wire the diagnostic and recovery endpoints

Three endpoints carry the whole recovery workflow and are worth memorising. explain reads state; retry resets an exhausted index so the engine reattempts the failed step; change_policy is the escape hatch when the failure is the policy itself.

Shell

# Read the failure signature for a set of managed indices
curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-prod-*?pretty"

# Reset an exhausted index so ISM reattempts the failed step
curl -s -X POST "https://<cluster>:9200/_plugins/_ism/retry/logs-prod-2026.07.04-000001"

# Swap the policy when the failure is a bad policy, not a transient blocker
curl -s -X POST "https://<cluster>:9200/_plugins/_ism/change_policy/logs-prod-*" \
  -H "Content-Type: application/json" \
  -d '{"policy_id":"logs-hot-warm-cold-v2"}'

The account that issues these calls must hold write permission on the _plugins/_ism/* endpoints; scoping that service role without over-granting is covered in Security & Access Boundaries. A recovery job that returns 403 is not a retryable failure — it is a misconfigured role, and the automation should surface it as such rather than backing off forever.

3. Automate bounded recovery in Python

A resilient orchestrator reads the diagnostic endpoint, filters to genuinely-failed indices, respects the consumed-retry ceiling, and only issues an outer retry when the failure is classified as transient. The script below uses the official opensearch-py client so the connection, auth, and TLS come from one place, and tenacity for the backoff-with-jitter around the recovery call itself. Structured logging with a correlation id makes every recovery attempt auditable.

Python

import logging
import uuid
from typing import Any

from opensearchpy import ConnectionError as OSConnectionError
from opensearchpy import OpenSearch, TransportError
from tenacity import (
    retry, stop_after_attempt, wait_random_exponential,
    retry_if_exception, before_sleep_log,
)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s %(message)s",
)
logger = logging.getLogger("ism_recovery")

MAX_CONSUMED_RETRIES = 5          # outer ceiling before manual queue
TRANSIENT_MARKERS = ("already exists", "unavailable", "timeout", "rejected")
RETRYABLE_STATUS = {409, 429, 503}  # 400 and 403 are never worth a backoff


def _is_retryable(exc: BaseException) -> bool:
    # Back off only on dropped connections and transient statuses; a 400 or
    # 403 is re-raised at once so the caller can alert instead of waiting.
    if isinstance(exc, OSConnectionError):
        return True
    return isinstance(exc, TransportError) and exc.status_code in RETRYABLE_STATUS


class ISMRecoveryClient:
    def __init__(self, client: OpenSearch, max_retries: int = MAX_CONSUMED_RETRIES):
        self.client = client
        self.max_retries = max_retries

    def explain(self, index_pattern: str) -> dict[str, Any]:
        # Call the endpoint directly so the script does not depend on a
        # client-version-specific ISM helper.
        return self.client.transport.perform_request(
            "GET", f"/_plugins/_ism/explain/{index_pattern}"
        )

    def _is_transient(self, cause: str) -> bool:
        # Substring match against the lower-cased cause from explain.
        return any(marker in cause.lower() for marker in TRANSIENT_MARKERS)

    @retry(
        stop=stop_after_attempt(4),
        wait=wait_random_exponential(multiplier=10, max=120),
        retry=retry_if_exception(_is_retryable),
        before_sleep=before_sleep_log(logger, logging.WARNING),
        reraise=True,
    )
    def _retry_index(self, index: str) -> None:
        self.client.transport.perform_request(
            "POST", f"/_plugins/_ism/retry/{index}"
        )

    def recover(self, index_pattern: str) -> None:
        corr = uuid.uuid4().hex[:8]
        state = self.explain(index_pattern)
        for index, meta in state.items():
            if not isinstance(meta, dict):
                continue                       # skip _explain metadata keys
            action = meta.get("action", {})
            if not action.get("failed"):
                continue
            consumed = action.get("consumed_retries", 0)
            cause = meta.get("info", {}).get("cause", "")
            if consumed >= self.max_retries:
                logger.error("[%s] %s exhausted (%d) — manual review, cause=%s",
                             corr, index, consumed, cause)
                continue
            if not self._is_transient(cause):
                logger.error("[%s] %s permanent failure — not retrying, cause=%s",
                             corr, index, cause)
                continue
            try:
                self._retry_index(index)
                logger.info("[%s] %s retry issued (step=%s)",
                            corr, index, meta.get("step", {}).get("name"))
            except TransportError as exc:
                logger.error("[%s] %s retry failed after backoff: %s",
                             corr, index, exc)


if __name__ == "__main__":
    os_client = OpenSearch(
        hosts=[{"host": "localhost", "port": 9200}],
        http_auth=("automation", "…"),
        use_ssl=True, verify_certs=True,
    )
    ISMRecoveryClient(os_client).recover("logs-prod-*")

The routine never touches an index that is not failed, never exceeds the consumed-retry ceiling, and never retries a permanent failure — three invariants that keep it safe to run on a schedule. Wrapping the same logic in a supervised async worker so many indices recover in parallel is the domain of Async Execution Patterns, and the broader scheduling scaffold lives in Python Orchestration Frameworks.

4. Recover CCR followers

Cross-Cluster Replication does not expose a _retry endpoint for follower indices, so its recovery path is different from ISM’s. When a follower reports REPLICATION_FAILED or PAUSED, a naive re-attach can corrupt the checkpoint; the safe sequence is to verify the leader, then pause and resume to re-anchor the follower to the leader’s current checkpoint.

Shell

# Inspect follower replication state and lag
curl -s "https://<follower>:9200/_plugins/_replication/logs-prod-000001/_status?pretty"

# Re-anchor a failed follower: pause then resume against the leader
curl -s -X POST "https://<follower>:9200/_plugins/_replication/logs-prod-000001/_pause" \
  -H "Content-Type: application/json" -d '{}'
curl -s -X POST "https://<follower>:9200/_plugins/_replication/logs-prod-000001/_resume" \
  -H "Content-Type: application/json" -d '{}'

Before any resume, confirm the leader index still exists and has not been force-merged out from under the follower, and confirm the follower cluster has tier capacity to allocate the replicated shards — an inherited routing filter pointing at a missing tier will strand them, a failure mode traced in Node Role Allocation and Fallback Routing Strategies. See the official OpenSearch Cross-Cluster Replication documentation for checkpoint reconciliation and lag-tolerance tuning.
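The checks above can be folded into a small decision function that the automation runs before it touches a follower. This is a sketch under stated assumptions: the status payload is assumed to carry a `status` string and a `syncing_details` object with `leader_checkpoint` and `follower_checkpoint`, the function name `ccr_next_step` and the `max_lag` threshold are hypothetical, and treating an already-paused follower as needing only a resume is an assumption.

```python
from typing import Any


def ccr_next_step(status: dict[str, Any], leader_exists: bool,
                  max_lag: int = 1000) -> str:
    """Pick a recovery step for one follower index.

    Returns 'alert', 'resume', 'pause_then_resume', 'monitor_lag', or 'ok'.
    """
    if not leader_exists:
        # Nothing to re-anchor to, so a resume cannot succeed.
        return "alert"
    state = str(status.get("status", "")).upper()
    if state == "PAUSED":
        # Assumption: an already-paused follower only needs the resume call.
        return "resume"
    if state == "REPLICATION_FAILED":
        return "pause_then_resume"
    details = status.get("syncing_details") or {}
    lag = details.get("leader_checkpoint", 0) - details.get("follower_checkpoint", 0)
    return "monitor_lag" if lag > max_lag else "ok"


if __name__ == "__main__":
    sample = {"status": "SYNCING",
              "syncing_details": {"leader_checkpoint": 5000,
                                  "follower_checkpoint": 3200}}
    print(ccr_next_step(sample, leader_exists=True))
```

Separating the decision from the HTTP calls keeps the pause and resume requests in one place and makes the branching easy to unit-test without a live cluster.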

Verification

After a recovery run, confirm the index actually advanced rather than re-failed on the next cycle. Re-read explain and check that failed has cleared and the step has moved forward.

Shell

curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-prod-2026.07.04-000001?pretty" \
  | grep -E '"failed"|"name"|"step"'

JSON

{ "action": { "name": "rollover", "failed": false, "consumed_retries": 0 },
  "step": { "name": "attempt_rollover", "status": "completed" } }

A consumed_retries that has reset to 0 with failed: false means the outer retry took. If failed is still true with the same cause, the blocker has not cleared — do not keep retrying; route the index to the manual queue instead.
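The same check can be automated by comparing the explain entry for an index before and after the retry. The sketch below is a minimal illustration; the function name `recovery_outcome` and the `"reclassify"` outcome for a changed cause are assumptions, since the text only defines the recovered and same-cause cases.

```python
from typing import Any


def recovery_outcome(before: dict[str, Any], after: dict[str, Any]) -> str:
    """Compare one index's explain entries from before and after a retry."""
    if not after.get("action", {}).get("failed"):
        return "recovered"
    old_cause = before.get("info", {}).get("cause")
    new_cause = after.get("info", {}).get("cause")
    if old_cause == new_cause:
        # Same blocker is still in place: stop retrying and escalate.
        return "manual_queue"
    # Assumption: a different cause is a new failure and gets classified afresh.
    return "reclassify"


if __name__ == "__main__":
    before = {"action": {"failed": True, "consumed_retries": 3},
              "info": {"cause": "rollover index already exists"}}
    after = {"action": {"failed": False, "consumed_retries": 0}}
    print(recovery_outcome(before, after))
```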

Operational guardrails

Automated recovery is only safe with hard limits around it. The settings below are the contract the orchestrator and policy enforce together; tune the values to your cluster’s throughput but keep every ceiling in place.

| Guardrail | Setting / mechanism | Recommended value | Rationale |
| --- | --- | --- | --- |
| Inner action retries | policy `retry.count` | 3–5 | Absorb transients without masking real failures |
| Inner backoff | policy `retry.backoff` / `delay` | `exponential`, 10–30m | Give merges and metadata commits time to finish |
| Outer retry ceiling | `MAX_CONSUMED_RETRIES` | 5 | Beyond this, escalate to a human, never loop |
| Backoff cap | `wait_random_exponential(max=…)` | 120s | Prevents unbounded waits during long outages |
| Circuit breaker | halt on `cluster.health == red` | stop all recovery | Never pile actions onto a red cluster |
| Blast-radius limit | halt if failed indices > 10% | pause + alert | A mass failure is systemic, not per-index |
| Disk headroom | `cluster.routing.allocation.disk.watermark.high` | 90% | Rollover/allocation reject above the watermark |

The circuit breaker is the most important row: if cluster health is red or more than a tenth of monitored indices fail at once, the failure is almost certainly systemic — a master flap, a full disk tier, a network partition — and issuing retries makes it worse. Halt, alert, and let a human decide. Idempotency underpins all of it: because explain is read-only and retry only re-arms an already-failed step, the recovery job is safe to re-run, which is what makes it schedulable.
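The two halting conditions are simple enough to express as a single guard that runs before the recovery loop. The sketch below assumes the caller has already fetched the cluster health status and counted failed versus monitored indices; the function name `should_halt` and its parameter names are hypothetical.

```python
def should_halt(cluster_status: str, failed: int, monitored: int,
                max_failed_fraction: float = 0.10) -> bool:
    """Return True when automated recovery must stop and alert instead."""
    if cluster_status.lower() == "red":
        return True                      # never pile retries onto a red cluster
    if monitored == 0:
        return False                     # nothing monitored, nothing to halt on
    # Blast-radius limit: a large share of failures is systemic, not per-index.
    return failed / monitored > max_failed_fraction


if __name__ == "__main__":
    print(should_halt("yellow", failed=11, monitored=100))
    print(should_halt("green", failed=2, monitored=100))
```

Calling the guard once per run, ahead of any retry, keeps the halt decision in one place.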

Troubleshooting

Rollover fails with target index already exists. A previous rollover created the next index but crashed before repointing the write alias. Inspect the conflict, then remove the orphan or advance the alias manually before retrying.

Shell

curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-prod-*?pretty" | grep -A2 cause
# Fix: DELETE the orphaned empty index, or POST _aliases to move the write alias, then _ism/retry

Index parked in WAITING with retries exhausted. The inner retry.count is spent and failed: true is set. The engine will not touch it again until you reset it — but only after the blocker clears.

Shell

curl -s "https://<cluster>:9200/_plugins/_ism/explain/<index>?pretty"
# Fix: clear the underlying cause, then curl -X POST .../_plugins/_ism/retry/<index>

Every retry immediately re-fails. The failure is permanent (bad policy, invalid mapping) but the automation keeps re-arming it. Read the cause; if it contains invalid or mapping, swap the policy instead of retrying.

Shell

curl -s "https://<cluster>:9200/_plugins/_ism/explain/<index>?filter_path=**.cause"
# Fix: POST _plugins/_ism/change_policy/<index> with a corrected policy_id

Recovery job returns 429 under load. OpenSearch is throttling; the orchestrator’s concurrency is too high. Back off harder and cap parallel retries.

Shell

curl -s "https://<cluster>:9200/_cluster/health?pretty" | grep -E 'status|active_shards_percent'
# Fix: raise the backoff multiplier, lower worker concurrency, halt if status is red

CCR follower stuck at REPLICATION_LAG. The follower cannot keep pace with the leader’s write rate. Confirm follower I/O headroom before throttling the leader.

Shell

curl -s "https://<follower>:9200/_plugins/_replication/<index>/_status?pretty" | grep -i lag
# Fix: add follower I/O capacity, or throttle leader ingest until the checkpoint catches up

Frequently asked questions

What is the difference between a policy retry block and the _ism/retry API?

The policy retry block is the inner loop OpenSearch runs automatically — it re-attempts a failed action count times with its own backoff before flipping the index to failed: true. The POST _plugins/_ism/retry API is the outer, manual reset: it only does anything once the inner counter is exhausted, and it re-arms the index so the engine reattempts the failed step. Use the block to absorb transients silently; use the API only after the blocking condition has actually cleared.

Should my automation retry a 400 Bad Request?

No. A 400 means the request itself is malformed (a bad policy, an invalid threshold, an incompatible mapping) and no amount of waiting changes that. Retrying only burns the consumed-retry counter and delays the alert. Classify 4xx responses other than 409 and 429 as permanent, fail fast, and fix the policy with change_policy.

How do I stop a mass failure from triggering thousands of retries?

Put a blast-radius circuit breaker ahead of the recovery loop: if cluster health is red, or more than ~10% of monitored indices report failed at once, halt all automated retries and alert instead. A failure at that scale is systemic — a master flap, a full disk tier, a partition — and per-index retries pile load onto an already-struggling cluster.

Why did my CCR follower not recover after I re-attached it?

Re-attaching without pausing first can leave the follower anchored to a stale checkpoint. The safe sequence is pause then resume, which re-anchors the follower to the leader’s current checkpoint. First verify the leader index still exists and has not been force-merged, and that the follower cluster has tier capacity to allocate the replicated shards.

Related

Implementing retry logic for stuck ISM transitions — the step-by-step recovery procedure for a single stalled index.
Phase Transition Logic — how failed steps map onto the state graph a retry re-runs.
Rollover Trigger Configuration — the action most likely to fail transiently, and how to time it.
Async Execution Patterns — recovering many indices in parallel without overloading the OpenSearch cluster.
Python Orchestration Frameworks — the scheduling scaffold that runs the recovery job.
Security & Access Boundaries — scoping the service role that issues retry and change_policy calls.
