Handling async ISM policy execution failures

This guide isolates and recovers indices whose OpenSearch Index State Management (ISM) phase actions failed silently on a background scheduler tick, so a 200 OK on policy attach never leaves a stalled index undetected.

ISM never runs a phase action inside the request that attaches the policy. The _plugins/_ism/add call returns 200 OK the moment the policy is registered; the opensearch-job-scheduler plugin then evaluates and executes the action on a later tick governed by plugins.index_state_management.job_interval (default 5, expressed in minutes). That decoupling is what the parent Async Execution Patterns model is built on, and it is exactly why a rollover, allocation, or snapshot can fail minutes after the API reported success. Reliable recovery therefore means polling execution state rather than trusting the attach response, a discipline the broader ISM Policy Implementation & Python Automation workflow depends on. This procedure extracts the failure signature from the explain API, forces a targeted retry, tunes the retry block that keeps re-failing, automates remediation across a fleet with opensearch-py, and defines a deterministic rollback when the action can never succeed.
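As a rough illustration of polling execution state instead of trusting the attach response, the sketch below waits until an index reports a finished or failed step. The function name wait_for_ism_outcome and the injected fetch_explain callable are hypothetical; the field names follow the explain output shown in this guide, and the step_status fallback is an assumption about other ISM versions.

```python
import time
from typing import Any, Callable, Dict


def wait_for_ism_outcome(
    fetch_explain: Callable[[], Dict[str, Any]],
    index_name: str,
    timeout_seconds: float = 900,
    poll_seconds: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll explain output until the step completes, fails, or the timeout passes.

    fetch_explain is a hypothetical callable that returns the parsed body of
    GET _plugins/_ism/explain/<index_name>.
    """
    deadline = clock() + timeout_seconds
    while True:
        entry = fetch_explain().get(index_name) or {}
        action = entry.get("action") or {}
        step = entry.get("step") or {}
        # Assumption: some ISM versions name this field step_status.
        status = step.get("status") or step.get("step_status")
        if action.get("failed") or status == "failed":
            return "failed"
        if status == "completed":
            return "completed"
        if clock() >= deadline:
            return "pending"
        sleep(poll_seconds)
```

A timeout somewhat longer than job_interval gives the scheduler at least one tick to act before the caller concludes the index is still pending.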

Prerequisites

Confirm each item before issuing a retry. Retrying blindly against an unmet infrastructure constraint just re-queues the same failure and inflates retry_failed_count.

The ISM plugin and opensearch-job-scheduler are installed on every data and cluster-manager node, and plugins.index_state_management.job_interval is known so you can distinguish “still pending” from “failed”.
The automation service account holds fine-grained access to every ISM endpoint this procedure calls (_plugins/_ism/explain, _plugins/_ism/retry, _plugins/_ism/change_policy, _plugins/_ism/remove, and _plugins/_ism/add), scoped per Security & Access Boundaries.
Disk watermarks (low, high, flood_stage) are set for your hardware, since a breached flood_stage raises cluster_block_exception before any ISM action can run.
You can identify whether a target index is a Cross-Cluster Replication (CCR) follower — follower indices cannot execute rollover, shrink, or delete natively and must be driven from the leader.
The intended lifecycle sequence is documented, per Index Lifecycle Basics, so a forced state retry advances to the correct phase rather than skipping one.

Step-by-step procedure

1. Extract the failure signature from the explain API

ISM records the current state, action, step, and error on each managed index. The explain API is the only authoritative source of execution status — _cat/indices never shows it:

HTTP

GET _plugins/_ism/explain/<index_name>?pretty

Expected output for a failed rollover looks like this:

JSON

{
  "logs-app-2026.07": {
    "index.plugins.index_state_management.policy_id": "observability-lifecycle",
    "state": { "name": "hot", "start_time": 1751600000000 },
    "action": { "name": "rollover", "failed": true },
    "step": { "name": "attempt_rollover", "status": "failed" },
    "retry_failed_count": 3,
    "info": {
      "message": "Rollover failed",
      "cause": "cluster_block_exception … blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]"
    }
  },
  "total_managed_indices": 1
}

Gotcha: failure is signalled by action.failed: true or step.status: "failed", never by a state literally named FAILED. Also note that index names are top-level keys alongside the integer total_managed_indices — filter that key out before iterating.
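The filtering rule above can be sketched as a small pure function. This is a minimal illustration that assumes the explain body has the shape shown in this step; extract_failures is a hypothetical helper name, and the step_status fallback is an assumption about other ISM versions.

```python
from typing import Any, Dict, List


def extract_failures(explain: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one failure signature per index that reports a failed action or step."""
    failures = []
    for index_name, entry in explain.items():
        # total_managed_indices is an integer sibling of the per-index entries.
        if not isinstance(entry, dict):
            continue
        action = entry.get("action") or {}
        step = entry.get("step") or {}
        info = entry.get("info") or {}
        # Assumption: some ISM versions name this field step_status.
        status = step.get("status") or step.get("step_status")
        if not (action.get("failed") or status == "failed"):
            continue
        failures.append({
            "index": index_name,
            "state": (entry.get("state") or {}).get("name"),
            "action": action.get("name"),
            "step": step.get("name"),
            "cause": info.get("cause") or info.get("message") or "unknown",
        })
    return failures
```

Keeping the parser free of network calls makes it easy to unit test against captured explain responses.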

2. Classify the root cause before retrying

Map info.cause to a remediation path; a retry only helps once the underlying block is cleared:

cluster_block_exception / read-only / allow delete — a flood_stage watermark breach put the index into read-only. Free disk or scale the tier, then clear the block with PUT /<index>/_settings {"index.blocks.read_only_allow_delete": null}.
replication_conflict / replication_exception — the index is a CCR follower holding a write lock. Drive the transition from the leader or stop replication first; do not retry the follower.
unassigned shards / NO decision — allocation could not place a shard for the target tier. This is the failure that Fallback Routing Strategies exist to absorb; add capacity or a fallback attribute before retrying.
timeout_exception on force_merge / snapshot — the action exceeded its window. Raise the action-level timeout rather than retrying at the default.

Gotcha: transient causes (a brief master_not_discovered_exception) clear on their own; persistent causes (no node carries the tier attribute) will re-fail every retry. Separate the two before automating, or you build an infinite retry loop.
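One way to encode the mapping above is a substring classifier that refuses to auto-retry anything it does not recognize. This is a sketch only: the Remediation names are hypothetical, and the match strings are taken from the causes listed in this step rather than from an exhaustive catalogue of ISM errors.

```python
from enum import Enum
from typing import Optional


class Remediation(Enum):
    CLEAR_WRITE_BLOCK = "clear_write_block"
    DRIVE_FROM_LEADER = "drive_from_leader"
    ADD_CAPACITY = "add_capacity"
    RAISE_TIMEOUT = "raise_timeout"
    TRANSIENT_RETRY = "transient_retry"
    INVESTIGATE = "investigate"


# Ordered (substring, remediation) pairs; the first match wins.
_RULES = [
    ("read-only", Remediation.CLEAR_WRITE_BLOCK),
    ("allow delete", Remediation.CLEAR_WRITE_BLOCK),
    ("cluster_block_exception", Remediation.CLEAR_WRITE_BLOCK),
    ("replication", Remediation.DRIVE_FROM_LEADER),
    ("unassigned", Remediation.ADD_CAPACITY),
    ("timeout_exception", Remediation.RAISE_TIMEOUT),
    ("master_not_discovered_exception", Remediation.TRANSIENT_RETRY),
]


def classify_cause(cause: Optional[str]) -> Remediation:
    """Map an info.cause string to a remediation path from this step."""
    text = (cause or "").lower()
    for needle, remediation in _RULES:
        if needle in text:
            return remediation
    # Unknown causes are never retried automatically, to avoid a retry loop.
    return Remediation.INVESTIGATE
```

Only TRANSIENT_RETRY is safe to hand to an automated retry; every other value needs the underlying condition changed first.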

3. Force an immediate retry of the failed step

Once the block is cleared, re-queue the step explicitly rather than waiting out the policy's retry backoff. An empty body re-runs the failed step in the current state and clears the failure flag:

HTTP

POST _plugins/_ism/retry/<index_name>

To skip a permanently blocked action and jump straight to a known-good phase, pass an explicit state:

HTTP

POST _plugins/_ism/retry/<index_name>
{
  "state": "warm"
}

Expected response — a failures: false payload confirms the flag was cleared and the step re-queued:

JSON

{ "updated_indices": 1, "failures": false, "failed_indices": [] }

Gotcha: the retry only re-queues work; execution still happens on the next tick. Supplying a state bypasses the blocked action entirely, so use it only when skipping that action (for example a failing shrink) is genuinely safe for the lifecycle defined in Index Lifecycle Basics.
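The two request shapes and the success check can be wrapped in a pair of helpers so callers cannot send a state by accident. This is a minimal sketch; build_retry_request and retry_was_accepted are hypothetical names, and the response fields are the ones shown in the expected response above.

```python
from typing import Any, Dict, Optional, Tuple


def build_retry_request(
    index_name: str, state: Optional[str] = None
) -> Tuple[str, str, Dict[str, Any]]:
    """Return (method, path, body) for an ISM retry.

    With no state the body is empty and the failed step is re-run in place.
    With a state the blocked action is skipped, so pass one deliberately.
    """
    body: Dict[str, Any] = {"state": state} if state else {}
    return "POST", f"/_plugins/_ism/retry/{index_name}", body


def retry_was_accepted(response: Dict[str, Any]) -> bool:
    """True when the retry response reports no failures for any index."""
    return response.get("failures") is False and not response.get("failed_indices")
```

The tuple can be passed to whatever HTTP client is in use, for example the transport.perform_request call that the worker in Step 5 relies on.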

4. Tune the retry block for causes that keep re-failing

If the same action exhausts its retries every cycle, the policy’s own retry block is too aggressive or the trigger fights cluster capacity. Widen the backoff and align the rollover trigger with real shard pressure — the same threshold discipline covered in Threshold Tuning Strategies:

JSON

{
  "policy": {
    "description": "Rollover with conservative retry and capacity-aware trigger",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "retry": { "count": 5, "backoff": "exponential", "delay": "10m" },
            "rollover": { "min_index_age": "1d", "min_primary_shard_size": "40gb" }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      }
    ]
  }
}

Gotcha: editing a policy with PUT _plugins/_ism/policies/<policy_id> bumps its seq_no but does not re-apply it to already-managed indices. Run POST _plugins/_ism/change_policy/<index> with the policy_id in the request body so live indices pick up the new retry block. Set the backoff delay at or above the job_interval so retries never overlap a pending scheduler tick.
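The rule that the retry delay should not be shorter than job_interval can be checked before a policy is uploaded. The sketch below assumes the policy document has the shape shown in this step and that job_interval is expressed in minutes; short_retry_delays is a hypothetical helper, and the parser handles only the ms, s, m, h, and d suffixes.

```python
import re
from typing import Any, Dict, List, Tuple

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration_seconds(value: str) -> float:
    """Convert a time value such as '10m' or '30s' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def short_retry_delays(
    policy_doc: Dict[str, Any], job_interval_minutes: int = 5
) -> List[Tuple[str, str, str]]:
    """List (state, action, delay) entries whose retry delay is below the job interval."""
    floor = job_interval_minutes * 60
    findings = []
    for state in policy_doc["policy"].get("states", []):
        for action in state.get("actions", []):
            retry = action.get("retry")
            if not retry or "delay" not in retry:
                continue
            # The action type is the key that is not a shared option.
            names = [key for key in action if key not in ("retry", "timeout")]
            action_name = names[0] if names else "unknown"
            if parse_duration_seconds(retry["delay"]) < floor:
                findings.append((state["name"], action_name, retry["delay"]))
    return findings
```

An empty result means every configured delay is at or above the interval; any entry returned is a candidate for widening.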

5. Automate detection and remediation across the fleet

Manual polling does not scale past a handful of indices. This opensearch-py worker queries the explain endpoint for a pattern, isolates only genuinely failed indices, logs the signature, and issues at most one retry per failed index per run. It mirrors the resilient client patterns in Error Handling & Retries:

Python

import os
import logging
from opensearchpy import OpenSearch
from opensearchpy.exceptions import ConnectionError, TransportError

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("ism_async_remediation")

def remediate_ism_failures(hosts: list, index_pattern: str) -> dict:
    """Detect ISM async failures for a pattern and issue targeted retries."""
    client = OpenSearch(
        hosts=hosts,
        http_auth=(os.environ["OPENSEARCH_USER"], os.environ["OPENSEARCH_PASSWORD"]),
        use_ssl=True,
        verify_certs=True,
        timeout=30,
    )
    results = {"retried": [], "skipped": []}

    try:
        explain = client.transport.perform_request(
            "GET", f"/_plugins/_ism/explain/{index_pattern}"
        )
    except (ConnectionError, TransportError) as e:
        logger.error("Explain request failed for %s: %s", index_pattern, e)
        return results

    for idx, data in explain.items():
        # total_managed_indices is an int, not a per-index dict — skip it.
        if not isinstance(data, dict):
            continue

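        # Guard against missing or null sections in the explain entry.
        action = data.get("action") or {}
        step = data.get("step") or {}
        info = data.get("info") or {}
        # Assumption: some ISM versions name the step outcome step_status.
        step_status = step.get("status") or step.get("step_status")
        if not (action.get("failed") or step_status == "failed"):
            continue

        cause = info.get("cause") or info.get("message") or "unknown"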
        logger.error("Index %s failed on action '%s': %s", idx, action.get("name"), cause)

        # Persistent infrastructure blocks must be cleared out-of-band, not retried.
        if "replication" in str(cause).lower() or "read-only" in str(cause).lower():
            logger.warning("Skipping retry for %s — needs manual clearance", idx)
            results["skipped"].append(idx)
            continue

        try:
            client.transport.perform_request("POST", f"/_plugins/_ism/retry/{idx}", body={})
            logger.info("Queued retry for %s", idx)
            results["retried"].append(idx)
        except TransportError as e:
            logger.warning("Retry rejected for %s: %s", idx, e)
            results["skipped"].append(idx)

    return results

Gotcha: the worker skips CCR and read-only causes on purpose — retrying those is the classic infinite loop. Schedule it as a Kubernetes CronJob at an interval longer than job_interval so each run observes the effect of the previous retry rather than stacking duplicates.
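To keep duplicates from stacking even when runs are close together, a worker could consult a per-index retry budget before calling the retry endpoint. The sketch below is one possible shape, not part of opensearch-py: RetryBudget is a hypothetical class, and it keeps its history in memory, so it assumes a long-lived process (a CronJob would need to persist the history between runs).

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class RetryBudget:
    """Allow at most max_retries per index inside a sliding time window."""

    def __init__(
        self,
        max_retries: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_retries
        self._window = window_seconds
        self._clock = clock
        self._history: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, index_name: str) -> bool:
        """Record and permit a retry if the index still has budget left."""
        now = self._clock()
        history = self._history[index_name]
        # Drop attempts that have aged out of the window.
        while history and now - history[0] >= self._window:
            history.popleft()
        if len(history) >= self._max:
            return False
        history.append(now)
        return True
```

A worker would call budget.allow(idx) just before issuing the retry and treat a False result the same way as a skipped index.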

6. Roll back to a fallback policy when the action can never succeed

When retries are exhausted and the constraint is structural (a tier with no capacity, a follower that cannot roll over), stop the scheduler churn. Detach the failing policy, verify index health, then attach a simplified policy that enforces only essential retention:

HTTP

POST _plugins/_ism/remove/<index_name>

Then attach the fallback that skips the impossible action:

HTTP

POST _plugins/_ism/add/<index_name>
{
  "policy_id": "retention-only-fallback"
}

Gotcha: remove clears ISM metadata immediately, so record the last known state.name from Step 1 first — you need it to resume the full lifecycle later. Document every terminal signature in a runbook so the next incident is a lookup, not a re-diagnosis.
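Recording the last known state before detaching can be made hard to forget by building the rollback as data first. This sketch assumes the explain body has the shape shown in Step 1; build_rollback_plan is a hypothetical helper, and it only prepares the requests without sending them.

```python
from typing import Any, Dict

POLICY_ID_KEY = "index.plugins.index_state_management.policy_id"


def build_rollback_plan(
    index_name: str, explain: Dict[str, Any], fallback_policy_id: str
) -> Dict[str, Any]:
    """Capture what remove will erase, then list the detach and attach requests."""
    entry = explain.get(index_name)
    if not isinstance(entry, dict):
        raise KeyError(f"{index_name} is not present in the explain output")
    record = {
        "index": index_name,
        "previous_policy_id": entry.get(POLICY_ID_KEY),
        "last_state": (entry.get("state") or {}).get("name"),
    }
    requests = [
        ("POST", f"/_plugins/_ism/remove/{index_name}", None),
        ("POST", f"/_plugins/_ism/add/{index_name}", {"policy_id": fallback_policy_id}),
    ]
    return {"record": record, "requests": requests}
```

The record belongs in the runbook entry for the incident; the index health check from this step still sits between the two requests when they are executed.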

Verification

Confirm the retry actually cleared the failure and the index resumed its lifecycle.

Confirm the step completed and the counter reset:

Shell

curl -s "https://<cluster-endpoint>:9200/_plugins/_ism/explain/logs-app-2026.07?pretty"

A healthy result shows step.status: "completed" (or an advanced state.name), action.failed: false, and retry_failed_count: 0. A counter still climbing means the root cause from Step 2 is not yet cleared.

Confirm no lingering write block:

Shell

curl -s "https://<cluster-endpoint>:9200/logs-app-2026.07/_settings?filter_path=**.blocks"

An empty JSON object ({}) is correct. A read_only_allow_delete: "true" means a watermark breach still holds the index read-only, and the retry cannot succeed until it is cleared.

Confirm the retry did not cascade into shard failures:

Shell

curl -s "https://<cluster-endpoint>:9200/_cluster/health/logs-app-2026.07?pretty"

Expect "status": "green" and "unassigned_shards": 0. A drop to red during the retry window points at an allocation failure that needs capacity, not another retry.
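The three checks can be folded into one function that takes the parsed responses and returns whatever is still wrong. This is a sketch under the assumption that the responses have the shapes described in this section; recovery_problems is a hypothetical name, and the step_status fallback is an assumption about other ISM versions.

```python
from typing import Any, Dict, List


def recovery_problems(
    explain_entry: Dict[str, Any],
    settings: Dict[str, Any],
    health: Dict[str, Any],
) -> List[str]:
    """Return a list of outstanding problems; an empty list means recovered."""
    problems = []
    action = explain_entry.get("action") or {}
    step = explain_entry.get("step") or {}
    # Assumption: some ISM versions name this field step_status.
    status = step.get("status") or step.get("step_status")
    if action.get("failed") or status == "failed":
        problems.append("ISM still reports a failed action or step")
    if (explain_entry.get("retry_failed_count") or 0) > 0:
        problems.append("retry_failed_count has not reset")
    # With filter_path=**.blocks the settings body is {} when no block exists.
    for index_name, body in settings.items():
        blocks = ((body.get("settings") or {}).get("index") or {}).get("blocks") or {}
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            problems.append(f"{index_name} is still read-only")
    if health.get("status") != "green":
        problems.append(f"cluster health is {health.get('status')}")
    if (health.get("unassigned_shards") or 0) > 0:
        problems.append("unassigned shards remain")
    return problems
```

Feeding it the three curl responses after a retry gives a single pass or fail signal for alerting.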

Common failures

| Symptom | Root cause | Fix command |
| --- | --- | --- |
| `action.failed: true` on `rollover`, index read-only | `flood_stage` watermark breach set the index read-only | `PUT /<index>/_settings {"index.blocks.read_only_allow_delete": null}` then `POST _plugins/_ism/retry/<index>` |
| Retry succeeds but action fails again next tick | Policy `retry` block too aggressive; trigger fights capacity | `PUT _plugins/_ism/policies/<id>` with widened backoff, then `POST _plugins/_ism/change_policy/<index>` |
| `replication_conflict` on a follower index | CCR follower cannot execute the phase action | Drive the transition from the leader; `POST _plugins/_replication/<index>/_stop` before local action |
| `retry_failed_count` climbs, shards `UNASSIGNED` | No node carries the target tier attribute | `GET _cluster/allocation/explain` then add capacity or a fallback attribute before retrying |
| Retry returns `failures: true` | Missing rollover alias or policy/index mismatch | `GET _plugins/_ism/explain/<index>` to read the cause, correct the alias, then retry |

Frequently asked questions

Why does _plugins/_ism/add return 200 even when the action later fails?

add only registers the policy against the index; it does not execute any action. Execution is dispatched asynchronously by the job scheduler on its next tick (default every 5 minutes). The 200 confirms attachment, not successful phase execution — always poll _plugins/_ism/explain to confirm the action ran.

Does an empty retry body re-run the whole state or just the failed step?

An empty body re-executes only the failed step in the current state and clears the failure flag. Passing {"state": "<name>"} skips the blocked action and transitions the index directly to the named state, which is how you bypass a permanently failing action like shrink or force_merge.

Why does my retry keep failing with the same cluster_block_exception?

The retry re-queues the action but does not clear the underlying block. A cluster_block_exception referencing read-only / allow delete is a flood_stage watermark breach — free disk (or raise the watermark temporarily), clear index.blocks.read_only_allow_delete, and only then retry. Retrying before the block is cleared just increments retry_failed_count.

Can I retry a Cross-Cluster Replication follower that failed a rollover?

No. Follower indices cannot execute rollover, shrink, or delete natively — those actions must run on the leader and replicate. Retrying the follower produces a replication_conflict. Manage the follower’s lifecycle from the leader side or stop replication before any local action.

Related guides

Implementing retry logic for stuck ISM transitions — exponential backoff and circuit-breaker patterns for the _retry API.
Configuring index size and age thresholds for rollover — aligning triggers with capacity so actions stop failing.
Python automation for dynamic ISM policy updates — applying corrected policy DSL to live managed indices.
