Phase Transition Logic

Phase transition logic is the deterministic machinery that decides when an OpenSearch Index State Management (ISM) index leaves one lifecycle state and enters the next, and in what order the actions inside each state run. It is easy to write a policy that looks correct and still stalls in production: transitions fire on a background poll cadence rather than in real time, the allocation action races the force_merge that follows it, and Cross-Cluster Replication (CCR) followers refuse to advance until the leader has committed. When that timing is misjudged, indices lock read-only mid-flight, shards strand UNASSIGNED at a tier boundary, and storage grows unbounded because the delete state is never reached. This guide covers the evaluation cadence, the state-commit sequence, exact policy payloads, watermark guardrails, and the Python automation needed to drive and verify transitions at scale, building on the ISM Policy Implementation & Python Automation execution model.

Tier alignment for transition timing

A phase transition is only ever as reliable as the hardware waiting on the far side of it. Each state in the policy targets a tier, and if that tier’s storage profile or compute ratio cannot absorb the migration wave the transition triggers, the commit stalls even though the policy JSON is valid. The table below maps each lifecycle state to the node profile it should transition onto, the canonical routing attribute the allocation action must reference, and the workload that dictates its transition cadence. The node-role mechanics behind these attributes are covered under Node Role Allocation, and how the tier ratios are sized is the subject of Hot-Warm-Cold Tier Design.

Lifecycle state	Storage profile	vCPU : RAM ratio	Routing attribute	Transition driver
Hot	Local NVMe SSD	1 : 4 (compute-heavy)	`node.attr.data: hot`	Rollover on shard size or age; fastest cadence
Warm	SATA/SAS SSD	1 : 6	`node.attr.data: warm`	`min_index_age` after rollover; relocation + merge
Cold	High-density HDD	1 : 8 (storage-heavy)	`node.attr.data: cold`	Age-based; slowest relocation, watermark-sensitive
Delete	n/a (removal)	n/a	n/a	Terminal `min_index_age`; irreversible commit

The cadence column matters as much as the hardware: the hot tier evaluates most frequently because rollover decisions are size-driven and time-sensitive, while cold transitions are dominated by relocation bandwidth on dense HDD nodes and must be paced so the migration wave never backs up onto the warm tier.

Evaluation cadence and the state-commit sequence

ISM does not react to events in real time. A background job scheduler on the coordinator node periodically polls index metadata, shard allocation metrics, and replication health, then resolves each managed index through three deterministic stages: condition evaluation, action execution, and state commit. Every transition requires explicit trigger definitions mapped to measurable cluster metrics — document count, primary shard size, or index age — and nothing moves between poll cycles.

The coordinator maintains a sweep queue that batches policy evaluations across all managed indices. This prevents thread-pool exhaustion during peak ingestion but introduces a predictable latency between the moment a trigger is satisfied and the moment the state actually mutates. The job interval is governed cluster-wide by plugins.index_state_management.job_interval (default 5 minutes); any orchestration script that expects a transition to be visible must account for at least one full interval of lag, plus relocation time. For high-throughput logging pipelines where rapid rollover and tier migration are critical, tightening this interval trades scheduler overhead for lower transition latency — a balance covered in depth under Threshold Tuning Strategies.
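
As a rough illustration of that lag budget, the sketch below turns a job interval and a relocation estimate into the point after which a missing transition is worth flagging. The function name, the `safety_intervals` parameter, and the relocation estimate are hypothetical; only the 5-minute default comes from the text above.

```python
def transition_deadline_seconds(
    job_interval_minutes: int = 5,
    relocation_estimate_minutes: float = 0.0,
    safety_intervals: int = 1,
) -> float:
    """Seconds to wait before treating a transition as overdue.

    Hypothetical helper: budgets one full sweep interval, any extra
    safety intervals, and the caller's own relocation estimate.
    """
    if job_interval_minutes <= 0:
        raise ValueError("job_interval_minutes must be positive")
    if relocation_estimate_minutes < 0 or safety_intervals < 0:
        raise ValueError("estimates must be non-negative")
    sweeps = 1 + safety_intervals  # the mandatory sweep plus the margin
    return 60.0 * (sweeps * job_interval_minutes + relocation_estimate_minutes)
```

With the defaults this waits two sweep intervals; a script watching a cold move would pass its own relocation estimate on top.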

The order of operations inside a state is where most silent failures originate. Within a single state the actions run in the sequence they are declared, so a rollover must be listed before any allocation or shrink that depends on a fresh write index existing. Pairing allocation with wait_for: true holds the state commit until relocation completes, so a downstream force_merge or read_only never executes against a half-migrated index. The lifecycle these transitions walk — the hot → warm → cold → delete progression itself — is grounded in Index Lifecycle Basics, and when a target tier has no eligible node the outcome is decided by Fallback Routing Strategies.
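
The ordering rules above can be checked mechanically before a policy is uploaded. The following is a minimal lint sketch, assuming a state is the plain dict found under `states` in a policy body; the function names and the two dependency sets are hypothetical and cover only the actions named in this section.

```python
from typing import Dict, List

# Actions that assume a fresh write index already exists (per the text above).
NEEDS_FRESH_WRITE_INDEX = {"allocation", "shrink"}
# Actions that should only run once relocation has finished.
NEEDS_FINISHED_RELOCATION = {"force_merge", "read_only"}


def action_name(action: Dict) -> str:
    """Return the action type, skipping modifier keys such as retry/timeout."""
    for key in action:
        if key not in ("retry", "timeout"):
            return key
    return ""


def lint_state(state: Dict) -> List[str]:
    """Report ordering problems among the actions of a single ISM state."""
    name = state.get("name", "?")
    actions = state.get("actions", [])
    names = [action_name(a) for a in actions]
    problems = []
    if "rollover" in names:
        rollover_at = names.index("rollover")
        for early in names[:rollover_at]:
            if early in NEEDS_FRESH_WRITE_INDEX:
                problems.append(f"{name}: rollover declared after {early}")
    for i, action in enumerate(actions):
        if names[i] != "allocation":
            continue
        gated = bool(action["allocation"].get("wait_for", False))
        dependents = [n for n in names[i + 1:] if n in NEEDS_FINISHED_RELOCATION]
        if dependents and not gated:
            problems.append(f"{name}: {dependents[0]} follows allocation without wait_for")
    return problems
```

An empty list means the state passes these two checks only; it says nothing about whether the target tier can actually accept the shards.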

Threshold calibration for deterministic firing

Threshold configuration dictates the exact moment a transition fires, and misaligned thresholds are a common cause of orphaned shards and CCR follower lag. For size-based transitions, min_primary_shard_size should reserve headroom for segment merges, translog flushes, and replication buffers rather than filling the node. A practical target is 70–80% of usable per-node volume capacity, divided across the primary shards that share the node:

$$S_{\text{shard}} = w_{\text{target}} \times \frac{D_{\text{node}}}{N_{\text{shards}}}$$

where $S_{\text{shard}}$ is the rollover threshold, $w_{\text{target}}$ is the target fill fraction (0.7–0.8), $D_{\text{node}}$ is usable disk per hot node, and $N_{\text{shards}}$ is the number of primary shards competing for that node. Age-based triggers are measured from the index creation timestamp (index.creation_date), not from document timestamps, so size every age threshold against when the index was created; reasoning from document time makes backfill and late-arriving data appear to cross a boundary prematurely.
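
A worked instance of the formula, as a small sketch. The function name and the sample numbers (1,600 GB of usable disk, 30 primary shards per node) are illustrative assumptions, not measurements from a real cluster.

```python
def rollover_threshold_gb(
    node_disk_gb: float,
    primary_shards_per_node: int,
    target_fill: float = 0.75,
) -> float:
    """S_shard = w_target * D_node / N_shards, expressed in gigabytes."""
    if not 0 < target_fill <= 1:
        raise ValueError("target_fill must be in (0, 1]")
    if primary_shards_per_node < 1:
        raise ValueError("primary_shards_per_node must be at least 1")
    return target_fill * node_disk_gb / primary_shards_per_node


# Illustrative numbers only: 0.75 * 1600 GB / 30 shards
threshold = rollover_threshold_gb(node_disk_gb=1600, primary_shards_per_node=30)
```

Rounding the result down to a convenient value keeps the remaining fraction of the node free for merges and translog growth.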

Cross-cluster replication adds a synchronization constraint on top of the cadence. A follower index cannot transition states until the leader has committed the change and replication lag falls below the configured checkpoint threshold. During cold-tier moves this is the difference between a clean archival and a retry storm: monitor _plugins/_replication/follower_stats and gate cold transitions on follower checkpoint alignment so a lagging follower never inherits a require attribute for a tier it has not yet caught up to.
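
One way to express that gate is a pure function over an already-parsed follower stats response. This is a sketch under stated assumptions: the `index_stats`, `leader_checkpoint`, and `follower_checkpoint` field names are assumed to match what your replication plugin version returns, and the function names and `max_lag_ops` tolerance are hypothetical.

```python
from typing import Dict


def follower_lag(index_stats: Dict) -> int:
    """Operations the follower still has to apply (never negative)."""
    leader = int(index_stats.get("leader_checkpoint", 0))
    follower = int(index_stats.get("follower_checkpoint", 0))
    return max(0, leader - follower)


def cold_move_allowed(stats: Dict, index: str, max_lag_ops: int = 0) -> bool:
    """True only when the follower index is within max_lag_ops of its leader.

    Indices missing from the response are treated as not ready, which is
    the conservative choice for follower indices.
    """
    per_index = stats.get("index_stats", {}).get(index)
    if per_index is None:
        return False
    return follower_lag(per_index) <= max_lag_ops
```

Because it takes a dict rather than a client, the same check can run against a live response or a captured fixture in CI.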

Step-by-step transition configuration

The four steps below stand up deterministic transition logic for a logs-* index set: node attributes first, then a template so new indices start hot, then the policy that walks them down the tiers, then a verification pass that confirms both the declared state and the physical placement.

1. Node configuration

Declare the tier attribute on every data node in opensearch.yml. This string is the exact value every template and every allocation action references — a case mismatch produces a shard that can never route to that tier, which surfaces later as a stalled transition rather than an obvious error.

YAML

# opensearch.yml — value matches the node's physical tier
node.name: os-data-hot-01
node.roles: [ data, ingest ]
node.attr.data: hot          # hot | warm | cold
plugins.index_state_management.job_interval: 5   # minutes between evaluation sweeps
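
Because a case mismatch only shows up later as a stalled transition, it can help to check the declared values up front. The sketch below assumes rows shaped like the JSON output of the node-attributes cat API (`node`, `attr`, `value` keys); the function name and the `VALID_TIERS` set are hypothetical.

```python
from typing import Dict, Iterable, List

VALID_TIERS = {"hot", "warm", "cold"}


def find_tier_mismatches(rows: Iterable[Dict[str, str]], attr: str = "data") -> List[str]:
    """List nodes whose tier attribute is not an exact, lowercase tier name."""
    problems = []
    for row in rows:
        if row.get("attr") != attr:
            continue  # unrelated node attribute
        value = row.get("value", "")
        if value not in VALID_TIERS:
            hint = " (case mismatch)" if value.lower() in VALID_TIERS else ""
            problems.append(f"{row.get('node', '?')}: {attr}={value!r}{hint}")
    return problems
```

Running it against every data node before publishing a policy catches the `Hot` versus `hot` class of error while it is still cheap to fix.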

2. Index template

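Bake the hot-tier routing filter and the policy binding into a template so freshly rolled-over indices start hot immediately, without waiting for ISM’s first evaluation sweep. Two caveats apply. First, the index.plugins.index_state_management.policy_id template setting is deprecated in favor of the policy’s own ism_template block, which the policy in step 3 also declares, so check whether your version still honors it. Second, the rollover action needs a write alias named by index.plugins.index_state_management.rollover_alias unless the indices are backed by a data stream; this template leaves that setting for you to add.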

HTTP

PUT _index_template/log_transition_tmpl
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.number_of_replicas": 1,
      "index.routing.allocation.require.data": "hot",
      "index.plugins.index_state_management.policy_id": "log_transition_policy",
      "index.refresh_interval": "5s"
    }
  },
  "priority": 100,
  "version": 2
}

3. Policy JSON

The policy defines one state per tier. The critical detail is action ordering and commit gating: rollover runs first in hot; in warm and cold the allocation action carries wait_for: true so the state commit waits for relocation before force_merge or read_only runs; and each allocation action carries an explicit retry block so a watermark blip or a briefly unavailable target node degrades into bounded retries instead of a stuck index.

HTTP

PUT _plugins/_ism/policies/log_transition_policy
{
  "policy": {
    "description": "Deterministic phase transition logic for log indices",
    "default_state": "hot",
    "ism_template": [
      { "index_patterns": ["logs-*"], "priority": 100 }
    ],
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d",
              "min_primary_shard_size": "40gb"
            }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "2d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "allocation": { "require": { "data": "warm" }, "wait_for": true },
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" }
          },
          { "force_merge": { "max_num_segments": 1 } }
        ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [
          {
            "allocation": { "require": { "data": "cold" }, "wait_for": true },
            "retry": { "count": 3, "backoff": "exponential", "delay": "30m" }
          },
          { "read_only": {} }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "30d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ]
      }
    ]
  }
}

Attach the policy to any indices that predate the template using the ISM add endpoint:

HTTP

POST _plugins/_ism/add/logs-*
{
  "policy_id": "log_transition_policy"
}
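
The same attach call can be issued from Python. This is a minimal sketch, assuming an opensearch-py client and reusing the `transport.perform_request` call that the monitor later in this guide relies on; `build_add_policy_request` and `send` are hypothetical helper names.

```python
from typing import Any, Dict, Tuple
from urllib.parse import quote

Request = Tuple[str, str, Dict[str, Any]]


def build_add_policy_request(index_pattern: str, policy_id: str) -> Request:
    """Build the method, path, and body for the ISM add endpoint."""
    if not index_pattern or not policy_id:
        raise ValueError("index_pattern and policy_id are required")
    # Keep wildcards and commas literal so patterns such as logs-* survive.
    path = f"/_plugins/_ism/add/{quote(index_pattern, safe='*,-._')}"
    return ("POST", path, {"policy_id": policy_id})


def send(client: Any, request: Request) -> Dict[str, Any]:
    """Send a prepared request through the client's transport layer."""
    method, path, body = request
    return client.transport.perform_request(method, path, body=body)
```

Separating request construction from sending keeps the payload testable without a cluster; inspect the response for per-index failures before assuming the policy was attached.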

4. Verification

Never assume a transition committed — confirm ISM’s own view of the state, the declared routing setting, and the physical shard placement, because those three diverge exactly when a transition has silently stalled.

Shell

# a) Ask ISM which state the index is in and whether any action failed
curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-2026.07.04?pretty"

# b) Confirm the require attribute matches the current phase
curl -s "https://<cluster>:9200/logs-2026.07.04/_settings" | grep -o '"require":{[^}]*}'

# c) Confirm shards physically sit on nodes of that tier
curl -s "https://<cluster>:9200/_cat/shards/logs-2026.07.04?v&h=index,shard,prirep,state,node"
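
The three checks can be folded into one comparison once their outputs are parsed. The sketch below assumes the state names equal the tier names (as in this guide's policy), shard rows shaped like the JSON form of the cat shards output, and a hypothetical `node_tiers` mapping from node name to tier that you build from your own inventory.

```python
from typing import Dict, Iterable, List, Optional


def placement_problems(
    ism_state: str,
    required_tier: str,
    shard_rows: Iterable[Dict[str, Optional[str]]],
    node_tiers: Dict[str, str],
) -> List[str]:
    """Compare ISM's state, the require setting, and physical placement."""
    problems = []
    if required_tier != ism_state:
        problems.append(f"require={required_tier} but ISM state is {ism_state}")
    for row in shard_rows:
        label = f"shard {row.get('shard')}{row.get('prirep')}"
        if row.get("state") != "STARTED":
            problems.append(f"{label} is {row.get('state')}")
            continue
        node = row.get("node") or ""
        tier = node_tiers.get(node, "unknown")
        if tier != required_tier:
            problems.append(f"{label} sits on {node} ({tier}), expected {required_tier}")
    return problems
```

An empty result means all three views agree; any entry points at the layer where the transition stalled.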

Python automation for driving and verifying transitions

Manual verification does not scale past a handful of indices, and a transition that stalls at 3 a.m. needs to be detected and retried without a human. The opensearch-py routine below polls _plugins/_ism/explain across an index pattern, classifies each index as advancing, stuck (retries exhausted), or waiting, and force-retries the stuck ones — with structured logging, transport retries, and SSL verification for production use. It complements the concurrent attach-and-verify patterns in Async Execution Patterns and the broader tooling in Python Orchestration Frameworks.

Python

import os
import logging
from opensearchpy import OpenSearch
from opensearchpy.exceptions import TransportError, NotFoundError

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


class TransitionMonitor:
    def __init__(self, host: str, port: int = 9200, auth: tuple = None):
        self.client = OpenSearch(
            hosts=[{"host": host, "port": port}],
            http_auth=auth,
            use_ssl=True,
            verify_certs=True,
            timeout=30,
            max_retries=3,
            retry_on_timeout=True,
        )

    def explain(self, index_pattern: str) -> dict:
        """Return ISM's view of every managed index matching the pattern."""
        try:
            return self.client.transport.perform_request(
                "GET", f"/_plugins/_ism/explain/{index_pattern}"
            )
        except NotFoundError:
            logger.warning("No managed indices match '%s'.", index_pattern)
            return {}
        except TransportError as exc:
            logger.error("Explain call failed: %s", exc)
            return {}

    def retry_index(self, index: str) -> bool:
        """Ask ISM to retry the failed action on this index."""
        try:
            self.client.transport.perform_request(
                "POST", f"/_plugins/_ism/retry/{index}"
            )
            logger.info("Retry issued for stuck index '%s'.", index)
            return True
        except TransportError as exc:
            logger.error("Retry failed for '%s': %s", index, exc)
            return False

    def sweep(self, index_pattern: str) -> dict:
        """Classify managed indices and auto-retry any that are stuck."""
        stats = {"advancing": 0, "waiting": 0, "stuck": 0}
        for index, meta in self.explain(index_pattern).items():
            if not isinstance(meta, dict) or "state" not in meta:
                continue
            action = meta.get("action", {}) or {}
            failed = action.get("failed", False)
            state = meta.get("state", {}).get("name", "unknown")
            if failed:
                stats["stuck"] += 1
                logger.warning(
                    "Index '%s' stuck in state '%s': %s",
                    index, state, meta.get("info", {}).get("message", "no detail"),
                )
                self.retry_index(index)
            elif action.get("name"):
                stats["advancing"] += 1
            else:
                stats["waiting"] += 1
        logger.info("Sweep complete: %s", stats)
        return stats


if __name__ == "__main__":
    monitor = TransitionMonitor(
        host=os.getenv("OPENSEARCH_HOST", "localhost"),
        port=int(os.getenv("OPENSEARCH_PORT", "9200")),
        auth=(os.getenv("OPENSEARCH_USER", "admin"), os.getenv("OPENSEARCH_PASS", "admin")),
    )
    monitor.sweep("logs-*")

Schedule this on an interval slightly longer than job_interval so each sweep observes the result of the previous ISM evaluation, and wire it into your alerting so a rising stuck count pages before storage backs up. The retry-with-backoff patterns that make the auto-retry safe under transient failures are detailed in Error Handling & Retries.
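
A minimal scheduling sketch for that advice follows. It is not tied to any particular scheduler: `sweep_period_seconds`, `run_sweeps`, the 30-second margin, and the `alert` callback are all hypothetical, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Dict, Optional


def sweep_period_seconds(job_interval_minutes: int = 5, margin_seconds: int = 30) -> int:
    """Sweep slightly less often than ISM evaluates."""
    return job_interval_minutes * 60 + margin_seconds


def run_sweeps(
    sweep: Callable[[], Dict[str, int]],
    period_seconds: int,
    max_sweeps: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
    alert: Optional[Callable[[int], None]] = None,
) -> int:
    """Run sweep() repeatedly; call alert(stuck) whenever the stuck count rises."""
    previous_stuck = 0
    done = 0
    while max_sweeps is None or done < max_sweeps:
        stats = sweep()
        done += 1
        stuck = stats.get("stuck", 0)
        if alert is not None and stuck > previous_stuck:
            alert(stuck)
        previous_stuck = stuck
        if max_sweeps is None or done < max_sweeps:
            sleep(period_seconds)  # no sleep after the final sweep
    return done
```

With the monitor above, a call would look like `run_sweeps(lambda: monitor.sweep("logs-*"), sweep_period_seconds(5), alert=page_oncall)`, where `page_oncall` stands in for your own alerting hook.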

Operational guardrails

Disk watermarks gate every allocation, so a tier at capacity rejects an incoming transition even when the require filter is correct — the single most common cause of a stalled cold move. The settings below keep transitions flowing while leaving the flood stage below 100% so a runaway migration cannot lock a tier read-only.

Setting	Recommended value	Effect on transition logic
`plugins.index_state_management.job_interval`	`5` minutes (tune to `1`–`2` for hot pipelines)	Latency between trigger satisfaction and commit
`cluster.routing.allocation.disk.watermark.high`	`90%`	Above this, incoming tier transitions are blocked
`cluster.routing.allocation.disk.watermark.flood_stage`	`95%`	Forces indices read-only; blocks migration in
`cluster.routing.allocation.node_concurrent_recoveries`	`3` (NVMe) / `1–2` (HDD)	Caps parallel relocations per transition wave
ISM action `retry`	`count 3`, `backoff exponential`	Bounds recovery on transient allocation/snapshot faults

HTTP

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    "cluster.routing.allocation.node_concurrent_recoveries": 3
  }
}
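
Before a transition wave, the target tier can be checked against those thresholds. The sketch below assumes rows shaped like the JSON form of the cat allocation output, with `node` and `disk.percent` keys and percentages delivered as strings; the function names and status labels are hypothetical.

```python
from typing import Dict, Iterable, Optional


def classify_node(disk_percent: float, high: float = 90.0, flood: float = 95.0) -> str:
    """Label a node by where its disk usage sits relative to the watermarks."""
    if disk_percent >= flood:
        return "flood_stage"
    if disk_percent >= high:
        return "blocked"
    return "ok"


def tier_headroom(
    rows: Iterable[Dict[str, Optional[str]]],
    high: float = 90.0,
    flood: float = 95.0,
) -> Dict[str, str]:
    """Map each node with a reported disk percentage to its watermark status."""
    result = {}
    for row in rows:
        percent = row.get("disk.percent")
        if percent in (None, ""):
            continue  # e.g. the UNASSIGNED pseudo-row carries no disk figures
        result[str(row.get("node"))] = classify_node(float(percent), high, flood)
    return result
```

A tier in which every node reports `blocked` or `flood_stage` will reject the incoming move regardless of how the policy is written.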

Transitions into the delete state are irreversible once committed, so treat any policy change that shortens a delete condition as a destructive operation: validate it against _plugins/_ism/explain in staging with production-scale data before rolling it out, and version-control every policy so the mutation carries an audit trail. When a threshold fires prematurely, the _plugins/_ism/change_policy API can redirect affected indices to a corrective state machine before the delete action runs.
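
A policy review step can flag that kind of destructive change automatically. The following sketch compares two policy bodies (the object under the `policy` key) and reports whether the new one reaches delete sooner; the helper names are hypothetical and the duration parser handles only simple values such as `30d` or `12h`.

```python
import re
from typing import Dict, Optional

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def age_seconds(value: str) -> int:
    """Parse a simple duration such as '30d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def delete_age(policy: Dict, delete_state: str = "delete") -> Optional[int]:
    """Smallest min_index_age on any transition into the delete state.

    A transition without min_index_age counts as 0, the most cautious reading.
    """
    ages = []
    for state in policy.get("states", []):
        for transition in state.get("transitions", []):
            if transition.get("state_name") != delete_state:
                continue
            age = transition.get("conditions", {}).get("min_index_age")
            ages.append(age_seconds(age) if age else 0)
    return min(ages) if ages else None


def shortens_delete(old: Dict, new: Dict) -> bool:
    """True when the new policy can delete an index earlier than the old one."""
    old_age, new_age = delete_age(old), delete_age(new)
    if new_age is None:
        return False  # new policy never transitions to delete
    return old_age is None or new_age < old_age
```

Wiring this into a pull-request check gives the version-controlled policy an explicit gate for the one change that cannot be undone.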

Troubleshooting transition failures

Each failure mode below pairs a diagnosis command with the corrective action.

Transition never fires despite the condition being met. The job scheduler has not swept the index yet, or the condition references the wrong timestamp. Confirm ISM’s view and the effective interval:

Shell

curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-2026.07.04?pretty"
# Fix: wait one job_interval, or lower plugins.index_state_management.job_interval for time-sensitive pipelines

allocation action retried out and the index is parked. The target tier has no eligible node or is above the high watermark, so relocation can never complete. Read the failure reason, then retry once the tier has headroom:

Shell

curl -s -X POST "https://<cluster>:9200/_cluster/allocation/explain" \
  -H "Content-Type: application/json" \
  -d '{"index":"logs-2026.07.04","shard":0,"primary":true}'
curl -s -X POST "https://<cluster>:9200/_plugins/_ism/retry/logs-2026.07.04"

CCR follower stuck one state behind the leader. The follower cannot advance until the leader checkpoint propagates and replication lag clears. Check follower lag before forcing anything:

Shell

curl -s "https://<cluster>:9200/_plugins/_replication/follower_stats?pretty"
# Fix: resolve the network/resource lag; never force a follower transition ahead of its checkpoint

force_merge runs against a half-migrated index. The allocation action was missing wait_for: true, so the merge started before relocation finished. Add the gate and retry:

Shell

curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-2026.07.04?pretty"
# Fix: set "wait_for": true on the allocation action so state commit waits for relocation

Index reached delete sooner than expected. The min_index_age threshold was sized against document time rather than index creation time, or the condition was simply too aggressive. Redirect survivors to a safe state before more are removed:

Shell

curl -s -X POST "https://<cluster>:9200/_plugins/_ism/change_policy/logs-*" \
  -H "Content-Type: application/json" \
  -d '{"policy_id":"log_transition_policy","state":"cold"}'
# Fix: base age conditions on index.creation_date and re-test in staging

Frequently asked questions

How long after a condition is met does a transition actually fire?

At least one plugins.index_state_management.job_interval (default 5 minutes), plus any relocation or merge time the state’s actions require. ISM evaluates on a background sweep, not in real time, so orchestration scripts should poll with a margin longer than the interval rather than expecting an immediate state change.

Why does action order inside a state matter?

Actions run in declaration order and share the same state commit. A rollover must precede any action that assumes a fresh write index exists, and an allocation that a later force_merge or read_only depends on must carry wait_for: true so the commit waits for relocation. Get the order wrong and the dependent action runs against a half-migrated index.

Can a CCR follower be forced to transition ahead of its leader?

No — and you should not try. A follower cannot commit a transition until the leader has committed the same change and the replication checkpoint has propagated. Forcing it risks divergence between leader and follower. Resolve the replication lag instead and let the follower advance on its own sweep.

How do I recover an index that transitioned to delete by mistake?

You cannot recover the deleted index itself — restore it from a snapshot. To stop the bleed for other indices still in flight, use _plugins/_ism/change_policy to redirect them to a safe state (for example cold) before their delete condition fires, then fix the age condition to reference index.creation_date.

Related

Rollover Trigger Configuration — the conditions that end the hot state and start the transition chain.
Threshold Tuning Strategies — calibrating the size and age thresholds that fire each transition.
Async Execution Patterns — concurrent attach-and-verify across hundreds of managed indices.
Error Handling & Retries — bounded recovery for stuck transitions and failed actions.
Python Orchestration Frameworks — structuring the automation that drives these transitions in CI/CD.
Hot-Warm-Cold Tier Design — sizing the tiers each transition migrates onto.
