Python Automation for Dynamic ISM Policy Updates

This guide shows how to read a live OpenSearch Index State Management (ISM) policy, mutate it in Python, and write it back — and then push the new version onto already-managed indices — without triggering version_conflict_engine_exception or leaving indices running a stale policy.

Editing an ISM policy from a script is deceptively risky. A PUT to _plugins/_ism/policies/{id} that carries missing or stale version numbers is rejected with version_conflict_engine_exception, and a script that works around the rejection by re-sending its old payload under fresh numbers silently loses a concurrent administrator’s change. The failure mode most teams miss is a different one: updating the policy document does not retroactively touch indices that are already running an older version of it. The procedure below performs a strict read-modify-write against _seq_no/_primary_term, re-attaches the new version with change_policy, and finally rolls the change across many indices under bounded concurrency with an automatic rollback path. It extends the control-plane patterns in Python Orchestration Frameworks and applies the broader ISM Policy Implementation & Python Automation execution model.

Prerequisites

Confirm every item before you point this automation at a production cluster. A single omitted version parameter or an unscoped write can corrupt policy state across coordinating nodes.

The target policy already exists and is attached to at least one index — verify with GET _plugins/_ism/explain/<index>.
The automation service account holds fine-grained access to POST/PUT/GET _plugins/_ism/*, scoped per Security & Access Boundaries — a read-only role cannot write policies or call change_policy.
You understand which phase the change affects; align threshold edits with the evaluation rules in Phase Transition Logic.
Python 3.9+ with requests, aiohttp, and (optionally) jsonschema installed in the runtime.
A writable backups/ path so the rollback step can persist a known-good snapshot before each mutation.
If Cross-Cluster Replication (CCR) is active, know which policies are inherited by follower indices — updating a leader policy does not replicate the policy document to the follower cluster.

Step-by-step procedure

1. Fetch the current policy and its version metadata

OpenSearch ISM enforces optimistic concurrency through _seq_no and _primary_term. Every dynamic update must first read the current document and capture those two integers; you will echo them back on the write so OpenSearch can reject the update if anything changed in between. Configure the session with exponential backoff so transient 429/5xx responses do not abort a rollout.

Python

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from typing import Dict, Any, Tuple

def get_ism_session() -> requests.Session:
    session = requests.Session()
    retry_strategy = Retry(
        total=4,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "PUT"]
    )
    session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
    return session

def fetch_policy_metadata(
    session: requests.Session,
    base_url: str,
    policy_id: str
) -> Tuple[Dict[str, Any], int, int]:
    endpoint = f"{base_url}/_plugins/_ism/policies/{policy_id}"
    resp = session.get(endpoint, headers={"Accept": "application/json"})
    resp.raise_for_status()
    payload = resp.json()
    return (
        payload["policy"],
        payload["_seq_no"],
        payload["_primary_term"]
    )

Expected shape of the response (trimmed) — note the version fields sit at the top level, beside policy:

JSON

{
  "_id": "observability-lifecycle",
  "_seq_no": 17,
  "_primary_term": 3,
  "policy": { "policy_id": "observability-lifecycle", "default_state": "hot", "states": [ ] }
}

Gotcha: _seq_no is per-shard sequencing, not a monotonic global counter — never hard-code or reuse a value from a previous run. Always fetch it fresh immediately before the write.
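A small guard makes the "fetch it fresh" rule harder to violate by accident. The sketch below assumes the response shape shown above; `extract_version` is a hypothetical helper name, not part of any OpenSearch client.

```python
from typing import Any, Dict, Tuple


def extract_version(payload: Dict[str, Any]) -> Tuple[int, int]:
    """Return (_seq_no, _primary_term) from a GET policy response.

    Raises ValueError when either field is missing or not an integer, so a
    malformed or truncated response fails fast instead of reaching the write.
    """
    try:
        seq_no = payload["_seq_no"]
        primary_term = payload["_primary_term"]
    except KeyError as exc:
        raise ValueError(f"response is missing version field {exc}") from exc
    # bool is a subclass of int in Python, so exclude it explicitly.
    for name, value in (("_seq_no", seq_no), ("_primary_term", primary_term)):
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
    return seq_no, primary_term


sample = {
    "_id": "observability-lifecycle",
    "_seq_no": 17,
    "_primary_term": 3,
    "policy": {"policy_id": "observability-lifecycle"},
}
print(extract_version(sample))  # (17, 3)
```

Calling it on the payload immediately before the PUT, rather than caching its result, keeps the pair tied to the document that was actually read.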

2. Mutate the policy document in memory

ISM policies are nested JSON state machines. Mutate them by locating the correct state rather than assuming states[0], and edit the action fields in place. Getting the field names wrong here produces a policy that validates but never rolls over.

Python

def mutate_rollover_and_transition(
    policy_doc: Dict[str, Any],
    min_size_gb: int,
    min_age_hours: int,
    force_merge_segments: int = 1
) -> Dict[str, Any]:
    states = policy_doc.get("states", [])
    if not states:
        raise ValueError("Policy missing 'states' array")

    # Locate the default (hot) state rather than assuming index 0.
    default_state = policy_doc.get("default_state")
    hot_state = next((s for s in states if s.get("name") == default_state), states[0])
    hot_actions = hot_state.get("actions", [])

    for action in hot_actions:
        if "rollover" in action:
            # ISM rollover conditions are fields directly on the action object;
            # the size condition is `min_size` (there is no `max_size` / `conditions`).
            action["rollover"]["min_size"] = f"{min_size_gb}gb"
            action["rollover"]["min_index_age"] = f"{min_age_hours}h"
            break
    else:
        raise KeyError("Rollover action not found in default state")

    if not any("force_merge" in a for a in hot_actions):
        hot_actions.append({"force_merge": {"max_num_segments": force_merge_segments}})

    return policy_doc

Gotcha: ISM uses min_size and min_index_age as fields directly on the rollover action — not the Elasticsearch-style conditions: { max_size, max_age } object. Copying an ILM snippet here is the most common cause of a silently non-firing rollover. Validate the mutated document with jsonschema or a dry-run before committing, and set thresholds using the sizing math in Configuring index size and age thresholds for rollover.
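Before reaching for jsonschema, a plain-Python pre-flight check can catch the ILM copy-paste mistake described above. This is a minimal sketch: `find_ilm_style_rollovers` and `ILM_ONLY_KEYS` are hypothetical names, and the key list covers only the fields this guide names as ILM-style.

```python
from typing import Any, Dict, List

# Keys this guide identifies as Elasticsearch ILM rollover syntax, not ISM.
ILM_ONLY_KEYS = {"conditions", "max_size", "max_age"}


def find_ilm_style_rollovers(policy_doc: Dict[str, Any]) -> List[str]:
    """Return one message per rollover action that carries an ILM-style key."""
    problems: List[str] = []
    for state in policy_doc.get("states", []):
        for action in state.get("actions", []):
            rollover = action.get("rollover")
            if not isinstance(rollover, dict):
                continue  # not a rollover action
            bad = sorted(ILM_ONLY_KEYS.intersection(rollover))
            if bad:
                problems.append(f"state {state.get('name')!r}: unexpected {bad}")
    return problems


good = {"states": [{"name": "hot", "actions": [
    {"rollover": {"min_size": "50gb", "min_index_age": "24h"}}]}]}
bad = {"states": [{"name": "hot", "actions": [
    {"rollover": {"conditions": {"max_size": "50gb"}}}]}]}

print(find_ilm_style_rollovers(good))  # []
print(find_ilm_style_rollovers(bad))   # ["state 'hot': unexpected ['conditions']"]
```

An empty list means the document passed this one check only; it says nothing about whether the rest of the policy is valid.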

3. Write the update back with optimistic concurrency

Echo the captured _seq_no and _primary_term as if_seq_no and if_primary_term query parameters. If any concurrent change advanced the version, OpenSearch returns 409 instead of clobbering it.

HTTP

PUT _plugins/_ism/policies/observability-lifecycle?if_seq_no=17&if_primary_term=3
{
  "policy": {
    "policy_id": "observability-lifecycle",
    "default_state": "hot",
    "states": [ ]
  }
}

Expected response — the version fields advance, confirming the write landed on top of the exact document you read:

JSON

{
  "_id": "observability-lifecycle",
  "_version": 6,
  "_seq_no": 18,
  "_primary_term": 3
}

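Gotcha: omitting if_seq_no/if_primary_term is not a shortcut to an unconditional overwrite. ISM treats a PUT without them as a create request, so against an existing policy it fails with version_conflict_engine_exception, the same error a stale pair produces. Treat a 409 as a signal to re-fetch (Step 1), re-apply the mutation to the fresh document (Step 2), and write again; never resend the old payload just to force the write through.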
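The same request can be assembled in Python so the version pair and the body always travel together. A minimal sketch: `build_conditional_put` is a hypothetical helper, and `https://localhost:9200` is a placeholder endpoint.

```python
from typing import Any, Dict, Tuple


def build_conditional_put(
    base_url: str,
    policy_id: str,
    policy: Dict[str, Any],
    seq_no: int,
    primary_term: int,
) -> Tuple[str, Dict[str, int], Dict[str, Any]]:
    """Return (url, query params, JSON body) for a version-checked policy update."""
    url = f"{base_url.rstrip('/')}/_plugins/_ism/policies/{policy_id}"
    params = {"if_seq_no": seq_no, "if_primary_term": primary_term}
    body = {"policy": policy}  # the document goes under a top-level "policy" key
    return url, params, body


url, params, body = build_conditional_put(
    "https://localhost:9200",
    "observability-lifecycle",
    {"policy_id": "observability-lifecycle", "default_state": "hot", "states": []},
    seq_no=17,
    primary_term=3,
)
# With a requests session from Step 1: session.put(url, params=params, json=body)
print(url)     # https://localhost:9200/_plugins/_ism/policies/observability-lifecycle
print(params)  # {'if_seq_no': 17, 'if_primary_term': 3}
```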

4. Re-apply the new version to already-managed indices

This is the step most scripts skip. Updating the policy document changes the definition stored in OpenSearch, but indices that are already managed keep executing the version of the policy they were attached with. Force them onto the new version with change_policy, optionally pinning the state so an index does not jump mid-lifecycle.

HTTP

POST _plugins/_ism/change_policy/logs-app-*
{
  "policy_id": "observability-lifecycle",
  "state": "hot",
  "include": [ { "state": "hot" } ]
}

Expected response (every matched index reports the reassignment; anything listed under failed_indices is still running the old version):

JSON

{ "updated_indices": 12, "failures": false, "failed_indices": [] }

Gotcha: change_policy only re-evaluates an index at the next ISM job run (plugins.index_state_management.job_interval, 5 minutes by default). Do not assume the new thresholds are live the instant the call returns. The include filter is a safety rail: without it, an index sitting in warm can be yanked back to a state the new policy defines differently.
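In a script, the request body and the success check are easier to keep correct when they are built by small functions. The sketch below uses only the request and response fields shown above; `build_change_policy_body` and `change_policy_ok` are hypothetical names.

```python
from typing import Any, Dict, List, Optional


def build_change_policy_body(
    policy_id: str,
    state: Optional[str] = None,
    include_states: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a change_policy body, adding the state pin and include filter if given."""
    body: Dict[str, Any] = {"policy_id": policy_id}
    if state is not None:
        body["state"] = state
    if include_states:
        body["include"] = [{"state": s} for s in include_states]
    return body


def change_policy_ok(response: Dict[str, Any]) -> bool:
    """True only when the response reports no failures and no failed indices."""
    return not response.get("failures", False) and not response.get("failed_indices")


print(build_change_policy_body("observability-lifecycle", "hot", ["hot"]))
# {'policy_id': 'observability-lifecycle', 'state': 'hot', 'include': [{'state': 'hot'}]}
print(change_policy_ok({"updated_indices": 12, "failures": False, "failed_indices": []}))
# True
```

A True result means the reassignment was accepted, not that the new thresholds are already in force; that still waits for the next ISM job run.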

5. Roll the update across many indices without overwhelming OpenSearch’s cluster manager

Serial PUTs across hundreds of policies queue work onto OpenSearch’s cluster-manager node and exhaust HTTP keep-alive connections. Parallelise with asyncio under a bounded semaphore, sharing one connection pool, and treat a 409 as retryable rather than fatal — the same discipline covered in Handling async ISM policy execution failures.

Python

import asyncio
from typing import Any, Dict, List, Tuple

import aiohttp

async def update_policy_async(
    session: aiohttp.ClientSession,
    base_url: str,
    policy_id: str,
    seq_no: int,
    primary_term: int,
    updated_policy: Dict[str, Any]
) -> bool:
    endpoint = f"{base_url}/_plugins/_ism/policies/{policy_id}"
    params = {"if_seq_no": seq_no, "if_primary_term": primary_term}
    try:
        # Wrap the document under "policy", matching the request body in Step 3.
        body = {"policy": updated_policy}
        async with session.put(endpoint, json=body, params=params) as resp:
            if resp.status == 200:
                return True
            if resp.status == 409:
                # Version conflict: caller should re-fetch and retry.
                return False
            # Raises ClientResponseError for 4xx/5xx (caught below).
            resp.raise_for_status()
            # Any other status is unexpected for an update: report failure
            # explicitly instead of falling through and returning None.
            return False
    except aiohttp.ClientError as e:
        print(f"Request failed for {policy_id}: {e}")
        return False

async def rollout_policies(
    base_url: str,
    policy_updates: List[Tuple[str, int, int, Dict[str, Any]]],
    max_concurrency: int = 10
) -> None:
    semaphore = asyncio.Semaphore(max_concurrency)

    # Share a single session/connection pool across all updates (and close it).
    async with aiohttp.ClientSession() as session:
        async def bounded_update(args):
            async with semaphore:
                return await update_policy_async(session, base_url, *args)

        tasks = [bounded_update(update) for update in policy_updates]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    # Count anything that is not an explicit True (False, None, or an exception).
    failed = sum(1 for r in results if r is not True)
    print(f"Rollout complete: {failed}/{len(policy_updates)} failed")

Keep max_concurrency well below OpenSearch’s cluster-manager write thread-pool capacity. The retry delay between conflict re-fetches should grow geometrically:

$$
\text{delay}_{\text{attempt}} = \text{base} \times 2^{\text{attempt}}
$$
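Worked through in code, the schedule looks like this. The sketch assumes base = 0.5 seconds, the value used by the retry loop in Step 6; the `cap` and `jitter` parameters are additions for illustration and are not part of the formula above.

```python
import random


def backoff_delay(
    attempt: int, base: float = 0.5, cap: float = 30.0, jitter: bool = False
) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    delay = min(cap, base * (2 ** attempt))  # geometric growth, bounded by cap
    if jitter:
        # Optional: spread concurrent retries so they do not collide again.
        delay = random.uniform(0, delay)
    return delay


print([backoff_delay(a) for a in range(5)])  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

The cap matters once a rollout retries many times: without it, attempt 10 alone would wait 512 seconds.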

Gotcha: never build the aiohttp.ClientSession inside the per-task coroutine. One session per rollout reuses the connection pool; a session per task re-runs the TLS handshake for every policy and exhausts sockets under load.

6. Guard the mutation with an automatic rollback path

For unattended runs, persist a known-good snapshot before each write and restore it if a non-recoverable error (for example a persistent illegal_argument_exception) survives the retry budget. This is the deterministic recovery contract described in Implementing retry logic for stuck ISM transitions.

Python

import json
import time
from pathlib import Path

# Reuses requests, Dict, Any and fetch_policy_metadata from Step 1.
def apply_with_rollback(
    session: requests.Session,
    base_url: str,
    policy_id: str,
    new_policy: Dict[str, Any],
    seq_no: int,
    primary_term: int,
    max_retries: int = 3
) -> bool:
    backup_path = Path(f"backups/{policy_id}_{int(time.time())}.json")
    backup_path.parent.mkdir(parents=True, exist_ok=True)

    # Persist the baseline (current policy + its version) before mutating.
    current, cur_seq, cur_term = fetch_policy_metadata(session, base_url, policy_id)
    with open(backup_path, "w") as f:
        json.dump({"policy": current, "seq_no": cur_seq, "primary_term": cur_term}, f)

    endpoint = f"{base_url}/_plugins/_ism/policies/{policy_id}"
    params = {"if_seq_no": seq_no, "if_primary_term": primary_term}

    for attempt in range(max_retries):
        resp = session.put(endpoint, json={"policy": new_policy}, params=params)
        if resp.status_code == 200:
            return True
        if resp.status_code != 409:
            break  # Non-recoverable error: fall through to rollback.
        # Someone else advanced the version: back off first, then re-fetch so the
        # version pair is as fresh as possible when the next attempt is sent.
        # Caution: new_policy is resent as-is; re-run the Step 2 mutation on the
        # re-fetched document if the concurrent change must be preserved.
        time.sleep(0.5 * (2 ** attempt))
        _, seq_no, primary_term = fetch_policy_metadata(session, base_url, policy_id)
        params = {"if_seq_no": seq_no, "if_primary_term": primary_term}
    else:
        # Retry budget spent on conflicts only: nothing was written, and restoring
        # the baseline here would overwrite the other writer's change.
        return False

    # Restore the baseline captured above. The version pair stored in the backup
    # may be stale by now, so fetch the current one right before the rollback PUT.
    with open(backup_path, "r") as f:
        baseline = json.load(f)
    _, cur_seq, cur_term = fetch_policy_metadata(session, base_url, policy_id)
    session.put(endpoint, json={"policy": baseline["policy"]}, params={
        "if_seq_no": cur_seq,
        "if_primary_term": cur_term
    })
    return False

Gotcha: the rollback PUT also needs the current version parameters. Re-fetch them right before restoring: a concurrent writer may have advanced the sequence number since the snapshot was taken, and a rollback sent with the stale pair will itself return 409.
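Because the version pair stored in a snapshot goes stale, the policy body is the only part of a backup file worth reusing. When several snapshots exist, the restore step also has to pick the right one. A minimal sketch, assuming the backups/<policy_id>_<epoch>.json naming used above; `latest_backup` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def latest_backup(backups_dir: Path, policy_id: str) -> Optional[Dict[str, Any]]:
    """Load the newest '<policy_id>_<epoch>.json' snapshot, or None if absent."""
    candidates = []
    for path in backups_dir.glob(f"{policy_id}_*.json"):
        suffix = path.stem[len(policy_id) + 1:]
        if suffix.isdigit():  # skip other policies whose id shares this prefix
            candidates.append((int(suffix), path))
    if not candidates:
        return None
    _, newest = max(candidates)  # highest epoch timestamp wins
    return json.loads(newest.read_text())


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "observability-lifecycle_1700000000.json").write_text(
        json.dumps({"policy": {"v": "old"}}))
    (d / "observability-lifecycle_1700000500.json").write_text(
        json.dumps({"policy": {"v": "new"}}))
    print(latest_backup(d, "observability-lifecycle"))  # {'policy': {'v': 'new'}}
```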

Verification

After a rollout, confirm three things: the stored document is the new version, managed indices are actually executing it, and no index is stuck retrying.

Confirm the policy document advanced:

Shell

curl -s "https://<cluster-endpoint>:9200/_plugins/_ism/policies/observability-lifecycle?pretty" \
  | jq '{seq_no: ._seq_no, primary_term: ._primary_term}'

A healthy result shows the _seq_no you expect from Step 3’s response — a lower number means your write never landed.

Confirm indices picked up the new version:

Shell

curl -s "https://<cluster-endpoint>:9200/_plugins/_ism/explain/logs-app-*?pretty"

Each index should report "policy_seq_no" matching the updated policy and "policy_primary_term" in step. An index still showing the old policy_seq_no never received the change_policy from Step 4.
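The comparison can be automated against the parsed explain response. The sketch below assumes each index entry carries policy_seq_no and policy_primary_term as described above, and that non-index entries (such as a summary counter, shown here as total_managed_indices) may sit at the top level; `stale_indices` is a hypothetical helper.

```python
from typing import Any, Dict, List


def stale_indices(
    explain: Dict[str, Any], expected_seq_no: int, expected_primary_term: int
) -> List[str]:
    """Return the indices whose attached policy version differs from the expected one."""
    expected = (expected_seq_no, expected_primary_term)
    stale = []
    for name, meta in explain.items():
        if not isinstance(meta, dict):
            continue  # skip non-index entries such as summary counters
        if (meta.get("policy_seq_no"), meta.get("policy_primary_term")) != expected:
            stale.append(name)
    return sorted(stale)


explain = {
    "logs-app-000001": {"policy_seq_no": 18, "policy_primary_term": 3},
    "logs-app-000002": {"policy_seq_no": 17, "policy_primary_term": 3},
    "total_managed_indices": 2,
}
print(stale_indices(explain, 18, 3))  # ['logs-app-000002']
```

Any index returned here is a candidate for another change_policy call once the ISM job interval has elapsed.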

Confirm no index is wedged after the change:

Shell

curl -s "https://<cluster-endpoint>:9200/_plugins/_ism/explain/logs-app-*" \
  | jq -r 'to_entries[] | select(.value | type == "object") | select(.value.retry_info.failed == true) | .key'

An empty result is success. Any index listed has a failed action under the new policy and needs a POST _plugins/_ism/retry/<index>.

Common failures

Symptom	Root cause	Fix command
`version_conflict_engine_exception` on `PUT`	`if_seq_no`/`if_primary_term` stale or omitted	Re-fetch: `GET _plugins/_ism/policies/<id>` then re-apply with the returned values
Policy updated but indices keep old thresholds	`change_policy` never called; managed indices pinned to prior version	`POST _plugins/_ism/change_policy/<index-pattern>` with the new `policy_id`
`illegal_argument_exception` mentioning `conditions`	Used ILM-style `max_size`/`conditions` instead of ISM `min_size`	Rewrite the action to `min_size`/`min_index_age` fields (Step 2)
Rollout hangs / `ClientConnectorError` under load	New `aiohttp` session per task exhausting sockets	Build one `ClientSession` for the whole rollout; bound with a semaphore
CCR follower index runs an outdated policy	Policy document is not replicated to the follower cluster	Re-`PUT` the policy on the follower cluster, then `change_policy` its indices

Frequently asked questions

Does updating an ISM policy automatically apply to indices already managed by it?

No. A PUT to _plugins/_ism/policies/{id} only changes the stored definition. Indices attached before the update keep running the version they were bound to until you call POST _plugins/_ism/change_policy for them. This is why Step 4 is mandatory, not optional.

What is the difference between _seq_no and _version in the response?

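_version is a simple per-document counter that increments on every write. _seq_no and _primary_term are the coordinates OpenSearch uses for optimistic concurrency: _seq_no numbers each operation on the shard that holds the document, and _primary_term increments whenever a new primary is promoted for that shard, so together they identify exactly one write and let OpenSearch reject a stale update. Always send _seq_no/_primary_term (as if_seq_no/if_primary_term) on conditional writes, not _version.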

How do I avoid a rollout stampede against OpenSearch's cluster-manager node?

Bound concurrency with an asyncio.Semaphore set below OpenSearch’s cluster-manager write thread-pool size, share one connection pool, and stagger conflict retries with exponential backoff. For a deeper treatment of failure isolation during concurrent execution, see Handling async ISM policy execution failures.

Can I pin the state when moving indices to the new policy?

Yes. Pass a state and an include filter to change_policy so only indices currently in that state adopt the new version, and they resume in the state you name. This prevents an index in warm from being reset to a hot definition the new policy describes differently.

Related guides

Writing Python scripts for automated ISM rollover triggers: force immediate rotation once a dynamic threshold is written.
Configuring index size and age thresholds for rollover: how to compute the min_size/min_index_age values you feed into Step 2.
Handling async ISM policy execution failures: isolating and recovering the parallel writes in Step 5.
Implementing retry logic for stuck ISM transitions: the retry and rollback contract behind Step 6.
