Node Role Allocation

Node role allocation decides which physical nodes are allowed to hold each shard, and getting it wrong is the single most common reason an Index State Management (ISM) transition parks an index in WAITING and never recovers. In production, allocation is driven by a chain of deciders — OpenSearch’s cluster-wide allocation switch, disk watermarks, shard-count limits, and finally the index-level routing filters that ISM rewrites at every phase transition. When the routing attribute an index requires does not match any live node attribute, shards go UNASSIGNED, rollovers stall, and Cross-Cluster Replication (CCR) checkpoints stop advancing. This guide covers the exact node roles, template and policy payloads, watermark calibration, and Python automation needed to make shard placement deterministic across a tiered cluster, building on the OpenSearch ISM Architecture & Fundamentals execution model.

Node and hardware alignment

Allocation only produces predictable placement when each tier is a distinct hardware pool with one canonical routing attribute that every index template and ISM allocation action references. In OpenSearch that attribute is a custom node attribute such as node.attr.data: a node holds shards because node.roles includes data, and the index-level require filter matches node attributes, not role names. The data_hot, data_warm, data_cold, and data_frozen names in the table's node-role column follow Elasticsearch's data-tier naming and are not built-in OpenSearch roles, so treat them as tier labels and do not rely on them for routing. Pick one attribute name per cluster and hold it. The table below is the contract the rest of this page enforces: the value in the routing-attribute column must appear verbatim in opensearch.yml, in the index template, and in the policy.

Tier	Node role	Storage profile	vCPU : RAM ratio	Routing attribute	Primary workload
Hot	`data_hot`	Local NVMe SSD, high IOPS	1 : 4 (compute-heavy)	`node.attr.data: hot`	Active ingest, rollover, real-time search, aggregations
Warm	`data_warm`	SATA/SAS SSD, moderate IOPS	1 : 6	`node.attr.data: warm`	Recent history, read-mostly search, force-merged segments
Cold	`data_cold`	High-density HDD	1 : 8 (storage-heavy)	`node.attr.data: cold`	Compliance retention, infrequent queries
Frozen	`data_frozen`	Object storage / searchable snapshots	1 : 8 (minimal compute)	`node.attr.data: frozen`	Archival, rarely-searched snapshots

How you choose the size ratio between these pools — how long data stays hot before it moves down, and how many replicas each tier carries — is the subject of Hot-Warm-Cold Tier Design. This page is concerned with the mechanic beneath it: declaring the attributes, stamping them onto indices, and proving that the shards actually land where the attribute says they should. The exact strings that map each tier to a node role are enumerated in Mapping data tiers to OpenSearch node roles.

The allocation decision pipeline

OpenSearch evaluates every shard against a set of allocation deciders, and a node can receive the shard only if none of them says "no". Four matter most in a tiered cluster: the cluster-wide cluster.routing.allocation.enable setting gates everything; the disk watermark thresholds exclude any node above capacity; the same_shard and shard-count limits stop two copies of a shard landing together or a node exceeding its quota; and the index-level filter, index.routing.allocation.require.<attr>, is matched against the node's declared attributes. ISM's allocation action changes placement by rewriting that filter, which means a policy transition can only succeed if a node carrying the matching attribute also passes every other decider.
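
The veto rule can be made concrete with a toy model. The sketch below is illustrative Python, not OpenSearch's implementation: the `Node` class, the `eligible_nodes` helper, and the 0.90 threshold are hypothetical names and values chosen for the example, and only three of the checks are modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a data node."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    disk_used: float = 0.0  # fraction of capacity in use


def eligible_nodes(nodes: List[Node], require: Dict[str, str],
                   allocation_enabled: bool = True,
                   high_watermark: float = 0.90) -> Tuple[List[str], Dict[str, str]]:
    """Return the nodes that pass every modelled check, plus a veto reason for the rest."""
    eligible: List[str] = []
    vetoes: Dict[str, str] = {}
    for node in nodes:
        if not allocation_enabled:
            vetoes[node.name] = "allocation disabled"
        elif node.disk_used >= high_watermark:
            vetoes[node.name] = "disk watermark"
        elif any(node.attrs.get(key) != value for key, value in require.items()):
            vetoes[node.name] = "require filter"
        else:
            eligible.append(node.name)
    return eligible, vetoes


nodes = [
    Node("hot-01", {"data": "hot"}, 0.40),
    Node("warm-01", {"data": "warm"}, 0.95),
    Node("warm-02", {"data": "warm"}, 0.50),
]
ok, vetoes = eligible_nodes(nodes, {"data": "warm"})
print(ok)      # ['warm-02']
print(vetoes)  # {'hot-01': 'require filter', 'warm-01': 'disk watermark'}
```

Here warm-01 carries the right attribute and is still rejected, which is the case that makes a correctly written policy stall.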

The routing targets ISM moves an index between are the tiers themselves: shards relocate down the progression from hot to warm to cold (and on to frozen where that tier exists) as the policy stamps a new require attribute at each phase.

Two behaviours cause most surprises. First, follower indices under CCR inherit the leader’s allocation filters by default, so a follower whose cluster lacks matching tier capacity will refuse to allocate until you override the routing explicitly. Second, an attribute mismatch is silent at index-creation time and only surfaces at the transition — the index is happily hot, then stalls the instant the policy asks for warm. Both are governed by the same rule: the attribute referenced in a template or policy must exactly match a value declared in opensearch.yml. How ISM stamps those attributes at each phase, and what to do when a tier has no eligible node, are covered in Data Tier Routing Patterns and Fallback Routing Strategies.
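
Because the comparison is an exact string match, a pre-flight check can flag near-misses before a transition does. This is a minimal sketch, assuming you have already collected the set of declared node.attr.data values (for example from _cat/nodeattrs); `diagnose_attribute` is a hypothetical helper, not part of any OpenSearch client.

```python
from typing import Iterable


def diagnose_attribute(required: str, declared: Iterable[str]) -> str:
    """Classify a required routing value against the values nodes actually declare."""
    values = sorted(set(declared))
    if required in values:
        return "match"
    # Same text once case and surrounding whitespace are ignored: a typo, not a missing tier.
    near = [v for v in values if v.strip().lower() == required.strip().lower()]
    if near:
        return f"near-miss: nodes declare {near[0]!r}, index requires {required!r}"
    return f"no node declares {required!r}"


print(diagnose_attribute("warm", {"hot", "Warm", "cold"}))
# near-miss: nodes declare 'Warm', index requires 'warm'
```

Running a check like this against every value referenced by a template or policy turns the silent creation-time mismatch into an explicit failure.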

Step-by-step allocation configuration

The four steps below stand up deterministic allocation for a logs-prod-* index set. Apply them in order: attributes on the nodes, a template so new indices start on the hot tier, a policy so aging indices are rerouted, and a verification pass that confirms both the declared setting and the physical shard placement.

1. Node role configuration

Declare the tier attribute on every data node in opensearch.yml. The value here is the exact string every template and policy will reference — a trailing space or a case mismatch produces a shard that can never route to that node.

YAML

# opensearch.yml — value matches this node's physical tier
node.name: os-data-hot-01
node.roles: [ data, ingest ]   # data is the shard-holding role; the tier comes from node.attr.data
node.attr.data: hot          # hot | warm | cold | frozen
node.attr.rack_id: az-1      # spread replicas across failure domains

Restart nodes sequentially to preserve quorum, then confirm the roles and attributes actually propagated before you deploy any policy:

Shell

# Roles per node
curl -s "https://<cluster>:9200/_nodes?filter_path=nodes.*.roles" | jq '.nodes[].roles'

# node.attr.data per node — this is what the require filter matches
curl -s "https://<cluster>:9200/_cat/nodeattrs?v&h=node,attr,value&s=attr" | grep data

A node holds shards only if its roles include data, and a require.data filter selects it only if its node.attr.data value matches exactly. If either is missing from the output, fix opensearch.yml before continuing.
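
The attribute check can also be scripted. This is a minimal sketch, assuming the rows come from `_cat/nodeattrs?format=json&h=node,attr,value`; `nodes_by_tier` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict
from typing import Dict, List


def nodes_by_tier(rows: List[dict], attr: str = "data") -> Dict[str, List[str]]:
    """Group node names by the value of one node attribute."""
    tiers = defaultdict(list)
    for row in rows:
        if row.get("attr") == attr:
            tiers[row["value"]].append(row["node"])
    return {tier: sorted(names) for tier, names in tiers.items()}


# Made-up rows in the shape the cat API returns with format=json.
rows = [
    {"node": "os-data-hot-01", "attr": "data", "value": "hot"},
    {"node": "os-data-hot-01", "attr": "rack_id", "value": "az-1"},
    {"node": "os-data-warm-02", "attr": "data", "value": "warm"},
    {"node": "os-data-warm-01", "attr": "data", "value": "warm"},
]
print(nodes_by_tier(rows))
# {'hot': ['os-data-hot-01'], 'warm': ['os-data-warm-01', 'os-data-warm-02']}
```

A tier that the policy references but that is absent from this mapping has no node to route to.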

2. Index template routing

New indices inherit their creation-time placement from an Index Template v2. Pin new indices to the hot tier and attach the policy in the same template so allocation and lifecycle are declared together:

HTTP

PUT _index_template/logs-prod-template
{
  "index_patterns": ["logs-prod-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.routing.allocation.require.data": "hot",
      "index.plugins.index_state_management.policy_id": "logs-prod-allocation"
    }
  },
  "priority": 500,
  "version": 2
}

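The require.data: hot filter keeps freshly created indices off warm and cold nodes. Composable templates always take precedence over legacy v1 templates, so the priority of 500 only has to exceed that of any other composable template matching logs-prod-*. ISM's allocation action overrides this filter later, at each transition. Note that the policy_id index setting is documented as deprecated in favour of an ism_template block inside the policy; if your version does not attach the policy from the template, declare the index pattern there instead.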

3. ISM allocation policy JSON

The policy is where allocation becomes lifecycle-driven. Each state that changes tier must run an allocation action that rewrites require.data, and wait_for: true blocks the transition until relocation actually completes — without it the policy races ahead and marks the state done while shards are still in flight.

HTTP

PUT _plugins/_ism/policies/logs-prod-allocation
{
  "policy": {
    "description": "Deterministic tier allocation for logs-prod-*",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_primary_shard_size": "50gb", "min_index_age": "1d" } }
        ],
        "transitions": [{ "state_name": "warm", "conditions": { "min_index_age": "7d" } }]
      },
      {
        "name": "warm",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
            "allocation": { "require": { "data": "warm" }, "wait_for": true }
          }
        ],
        "transitions": [{ "state_name": "cold", "conditions": { "min_index_age": "30d" } }]
      },
      {
        "name": "cold",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "30m" },
            "allocation": { "require": { "data": "cold" }, "wait_for": true }
          },
          { "force_merge": { "max_num_segments": 1 } }
        ]
      }
    ]
  }
}

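The explicit retry block turns a transient disk-pressure or network event into a bounded, backed-off retry instead of a hard stall. The conditions that time each move (min_index_age and min_size on transitions, min_primary_shard_size on rollover) are grounded in Index Lifecycle Basics. Rollover itself only runs when the index is written through an alias named in index.plugins.index_state_management.rollover_alias or belongs to a data stream, so put one of the two in place before relying on the hot state.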
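
Before deploying a policy, it can help to confirm that every tier it asks for actually exists. This is a minimal sketch, assuming the policy is held as a Python dict shaped like the request body above; `required_tiers` and `unroutable_states` are hypothetical helpers, and the set of live tiers is supplied by the caller.

```python
from typing import Dict, Set


def required_tiers(policy: dict, attr: str = "data") -> Dict[str, str]:
    """Map state name -> required attribute value for states with an allocation action."""
    found: Dict[str, str] = {}
    for state in policy["policy"]["states"]:
        for action in state.get("actions", []):
            value = action.get("allocation", {}).get("require", {}).get(attr)
            if value is not None:
                found[state["name"]] = value
    return found


def unroutable_states(policy: dict, live_tiers: Set[str], attr: str = "data") -> Dict[str, str]:
    """Return the states whose required value no live node declares."""
    return {state: value for state, value in required_tiers(policy, attr).items()
            if value not in live_tiers}


# Trimmed-down policy with the same structure as the full example.
policy = {"policy": {"states": [
    {"name": "hot", "actions": [{"rollover": {"min_index_age": "1d"}}]},
    {"name": "warm", "actions": [{"allocation": {"require": {"data": "warm"}, "wait_for": True}}]},
    {"name": "cold", "actions": [{"allocation": {"require": {"data": "cold"}, "wait_for": True}}]},
]}}
print(unroutable_states(policy, live_tiers={"hot", "warm"}))  # {'cold': 'cold'}
```

A non-empty result means the named state would stall on its allocation action, so the check is a reasonable gate in CI.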

4. Verification

Never trust that a require setting was applied — confirm both the declared filter and the real placement. _cat/shards shows where shards live; _cluster/allocation/explain shows why an unassigned shard was rejected, which _cat/shards cannot.

Shell

# Declared routing filter for the index
curl -s "https://<cluster>:9200/logs-prod-2026.07/_settings?filter_path=**.routing.allocation.require"

# Actual placement — node column must be a warm-tier node after the warm transition
curl -s "https://<cluster>:9200/_cat/shards/logs-prod-2026.07?v&h=index,shard,prirep,state,node&s=state"

# Why is a shard unassigned? Read the decider verdicts.
curl -s -X POST "https://<cluster>:9200/_cluster/allocation/explain" \
  -H "Content-Type: application/json" \
  -d '{"index":"logs-prod-2026.07","shard":0,"primary":true}'
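
The placement check above can be automated by comparing shard rows against the expected tier's node list. This is a minimal sketch, assuming the rows come from `_cat/shards/<index>?format=json&h=index,shard,prirep,state,node`; `misplaced_shards` is a hypothetical helper and the sample rows are made up.

```python
from typing import Iterable, List


def misplaced_shards(shard_rows: List[dict], tier_nodes: Iterable[str]) -> List[dict]:
    """Return rows that are not STARTED on one of the expected tier's nodes."""
    expected = set(tier_nodes)
    return [row for row in shard_rows
            if row.get("state") != "STARTED" or row.get("node") not in expected]


# Made-up rows: one shard on a warm node, one replica left on hot, one unassigned primary.
rows = [
    {"index": "logs-prod-2026.07", "shard": "0", "prirep": "p", "state": "STARTED", "node": "os-data-warm-01"},
    {"index": "logs-prod-2026.07", "shard": "0", "prirep": "r", "state": "STARTED", "node": "os-data-hot-02"},
    {"index": "logs-prod-2026.07", "shard": "1", "prirep": "p", "state": "UNASSIGNED", "node": None},
]
bad = misplaced_shards(rows, ["os-data-warm-01", "os-data-warm-02"])
print([(r["shard"], r["prirep"], r["state"]) for r in bad])
# [('0', 'r', 'STARTED'), ('1', 'p', 'UNASSIGNED')]
```

An empty list after a transition is the evidence that the declared filter and the physical placement agree.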

Production automation with Python

Manual routing does not survive a growing index set. The class below enforces deterministic allocation programmatically: it keeps the template current, starts CCR followers with an explicit routing override so they never inherit an unroutable leader filter, and pulls the allocation-explain payload for diagnostics. It calls the REST API directly through a pooled requests session rather than the opensearch-py client, with exponential backoff and structured logging so transient 429/503 responses degrade into retries rather than failures.

Python

import os
import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger(__name__)


class OpenSearchAllocator:
    def __init__(self, base_url: str, username: str, password: str) -> None:
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.auth = (username, password)
        self.session.verify = True
        retry = Retry(total=3, backoff_factor=1.5,
                      status_forcelist=[429, 500, 502, 503, 504])
        self.session.mount("https://", HTTPAdapter(max_retries=retry))

    def update_index_template(self, name: str, attr_value: str, pattern: str) -> bool:
        """Pin new indices matching `pattern` to the `attr_value` tier."""
        payload = {
            "index_patterns": [pattern],
            "template": {
                "settings": {
                    "index.routing.allocation.require.data": attr_value,
                    "index.number_of_shards": 3,
                    "index.number_of_replicas": 1,
                }
            },
            "priority": 500,
            "version": 2,
        }
        url = f"{self.base_url}/_index_template/{name}"
        try:
            self.session.put(url, json=payload, timeout=10).raise_for_status()
            logger.info("Template '%s' set to require.data=%s", name, attr_value)
            return True
        except requests.exceptions.RequestException as exc:
            logger.error("Template update failed: %s", exc)
            return False

    def start_ccr_follower(self, follower: str, attr_value: str,
                           leader_alias: str, leader_index: str) -> bool:
        """Start replication with an explicit routing override (never inherit the leader filter)."""
        payload = {
            "leader_alias": leader_alias,
            "leader_index": leader_index,
            "settings": {
                "index.routing.allocation.require.data": attr_value,
                "index.number_of_replicas": 1,
            },
        }
        url = f"{self.base_url}/_plugins/_replication/{follower}/_start"
        try:
            self.session.put(url, json=payload, timeout=10).raise_for_status()
            logger.info("CCR follower '%s' started on require.data=%s", follower, attr_value)
            return True
        except requests.exceptions.RequestException as exc:
            logger.error("CCR override failed: %s", exc)
            return False

    def explain_allocation(self, index: str, shard: int = 0, primary: bool = True) -> dict:
        """Return the decider verdicts for one shard — the authoritative 'why unassigned'."""
        url = f"{self.base_url}/_cluster/allocation/explain"
        payload = {"index": index, "shard": shard, "primary": primary}
        try:
            resp = self.session.post(url, json=payload, timeout=10)
            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.RequestException as exc:
            logger.error("Allocation explain failed: %s", exc)
            return {}


if __name__ == "__main__":
    allocator = OpenSearchAllocator(
        base_url=os.getenv("OPENSEARCH_URL", "https://localhost:9200"),
        username=os.getenv("OPENSEARCH_USER", "admin"),
        password=os.getenv("OPENSEARCH_PASS", "admin"),
    )
    allocator.update_index_template("logs-prod-template", "hot", "logs-prod-*")
    allocator.start_ccr_follower(
        "logs-prod-2026.07-follower", "warm", "leader-cluster", "logs-prod-2026.07"
    )
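    explain = allocator.explain_allocation("logs-prod-2026.07-follower")
    # Unassigned shards report allocate_explanation / unassigned_info;
    # assigned shards only report current_state.
    reason = (explain.get("allocate_explanation")
              or explain.get("unassigned_info", {}).get("reason")
              or explain.get("current_state", "UNKNOWN"))
    logger.info("Allocation state: %s", reason)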

Run this from CI on every template change, and on a short cron to catch drift. The endpoints it touches are privileged, so scope the service account that runs it per Security & Access Boundaries rather than reusing an admin credential.

Operational guardrails and watermark calibration

Disk watermarks veto a node no matter what the routing filter says, so a tier at capacity rejects a correctly filtered shard and stalls the transition. A shard relocation into a target node that is still below the low watermark is admitted only while:

$$\text{disk}_{\text{used}} + \text{shard}_{\text{size}} \le w_{\text{high}} \times \text{capacity}_{\text{node}}$$

The single-tier defaults (85% / 90% / 95%) are usually too aggressive for a multi-tier cluster during migration windows. Reserve headroom so an ISM allocation action never pushes a node into flood stage mid-transition, and cap concurrent recoveries to the storage medium’s queue depth.
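
As a worked check of the inequality, the sketch below plugs in round numbers. The function name and the 2,000 GB node are hypothetical; the default of 0.88 is only an example threshold.

```python
def admits_shard(disk_used_gb: float, shard_size_gb: float,
                 capacity_gb: float, high_watermark: float = 0.88) -> bool:
    """True when the node stays at or under the high watermark after receiving the shard."""
    return disk_used_gb + shard_size_gb <= high_watermark * capacity_gb


# 2,000 GB node: an 88% high watermark sits at 1,760 GB.
print(admits_shard(1650, 50, 2000))   # True  (1,700 GB after the move)
print(admits_shard(1700, 80, 2000))   # False (1,780 GB after the move)
```

The second call shows why headroom matters: the node is under the watermark before the move but would cross it once the shard arrives, so the relocation is refused.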

Setting	Recommended value	Effect on allocation
`cluster.routing.allocation.disk.watermark.low`	`82%`	Stops new shards routing to a filling node
`cluster.routing.allocation.disk.watermark.high`	`88%`	Triggers relocation off the node
`cluster.routing.allocation.disk.watermark.flood_stage`	`93%`	Forces indices read-only; blocks migration in
`cluster.routing.allocation.node_concurrent_recoveries`	`3–4` (NVMe) / `1–2` (HDD)	Caps parallel relocations per node
`cluster.routing.allocation.node_initial_primaries_recoveries`	`6`	Speeds primary recovery after a restart
`cluster.routing.allocation.disk.threshold_enabled`	`true`	Watermarks must stay enabled for tier safety

HTTP

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "82%",
    "cluster.routing.allocation.disk.watermark.high": "88%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "93%",
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.node_concurrent_recoveries": 3,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 6
  }
}

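node_concurrent_recoveries is a cluster-wide setting, so it cannot be tuned per tier: size it for the slowest storage that receives shards during a migration wave. Drop it to 1–2 while HDD-backed warm and cold nodes are the relocation targets to avoid saturating their queue depth, and keep it at 3–4 only when recoveries land on NVMe.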

Troubleshooting allocation failures

Each failure mode below pairs a diagnosis command with the corrective action.

Shards UNASSIGNED after a tier transition. The target tier has no node carrying the required attribute — usually a missing, mistyped, or case-mismatched node.attr.data. Ask the decider exactly why, then fix the attribute or relax the filter:

Shell

curl -s -X POST "https://<cluster>:9200/_cluster/allocation/explain" \
  -H "Content-Type: application/json" \
  -d '{"index":"logs-prod-2026.07","shard":0,"primary":true}'
# Fix: set node.attr.data on a node in that tier, or override index.routing.allocation.require
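
The explain response is verbose, so it is often easier to reduce it to the deciders that voted NO. This is a minimal sketch over an already-fetched response; `blocking_deciders` is a hypothetical helper, and the sample payload is a made-up, heavily trimmed example of the `node_allocation_decisions` structure.

```python
from typing import Dict, List, Tuple


def blocking_deciders(explain: dict) -> Dict[str, List[Tuple[str, str]]]:
    """Map node name -> (decider, explanation) pairs that returned NO."""
    blocked: Dict[str, List[Tuple[str, str]]] = {}
    for decision in explain.get("node_allocation_decisions", []):
        vetoes = [(d["decider"], d.get("explanation", ""))
                  for d in decision.get("deciders", [])
                  if d.get("decision") == "NO"]
        if vetoes:
            blocked[decision.get("node_name", "?")] = vetoes
    return blocked


# Made-up payload: one node fails the attribute filter, one fails the disk check.
sample = {"node_allocation_decisions": [
    {"node_name": "os-data-hot-01", "deciders": [
        {"decider": "filter", "decision": "NO", "explanation": "node does not match require filter"}]},
    {"node_name": "os-data-warm-01", "deciders": [
        {"decider": "disk_threshold", "decision": "NO", "explanation": "node is above the watermark"}]},
]}
for node, vetoes in blocking_deciders(sample).items():
    print(node, [name for name, _ in vetoes])
```

If every node reports only the filter decider, the problem is the attribute; if a matching node reports disk_threshold, the problem is capacity.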

ISM allocation action retried out. The action exhausted its retry count and the index is parked in its current state. Read the failure reason, then retry the managed index once the blocker clears:

Shell

curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-prod-2026.07?pretty"
curl -s -X POST "https://<cluster>:9200/_plugins/_ism/retry/logs-prod-2026.07"
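
To decide which indices need a retry, the ISM explain response can be filtered for failed actions. This is a sketch under assumptions: the field names (`state.name`, `action.name`, `action.failed`, `info.message`) follow the shape I understand the explain API to return and should be verified against your OpenSearch version; `failed_managed_indices` is a hypothetical helper and the sample payload is made up.

```python
from typing import Dict


def failed_managed_indices(ism_explain: dict) -> Dict[str, dict]:
    """Return a summary for every managed index whose current action is marked failed."""
    failures: Dict[str, dict] = {}
    for index, meta in ism_explain.items():
        if not isinstance(meta, dict):
            continue  # skip scalar entries such as total_managed_indices
        action = meta.get("action") or {}
        if action.get("failed"):
            failures[index] = {
                "state": (meta.get("state") or {}).get("name"),
                "action": action.get("name"),
                "message": (meta.get("info") or {}).get("message"),
            }
    return failures


# Made-up response for one index stuck on its allocation action.
sample = {
    "logs-prod-2026.07": {
        "policy_id": "logs-prod-allocation",
        "state": {"name": "warm"},
        "action": {"name": "allocation", "failed": True, "consumed_retries": 3},
        "info": {"message": "made-up failure message for the example"},
    },
    "total_managed_indices": 1,
}
print(failed_managed_indices(sample))
```

Only the indices returned here need the retry call, and only after the blocker named in the message has been cleared.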

Transition stuck at the disk watermark. The target tier is above watermark.high, so the disk decider rejects every incoming shard even when the routing filter matches. Find the pressured node, then add capacity or temporarily raise the threshold after confirming headroom:

Shell

curl -s "https://<cluster>:9200/_cat/allocation?v&h=node,disk.percent,disk.avail&s=disk.percent:desc"
# Fix: expand tier disk, or PUT a transient watermark bump, then revert after recovery
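
The same listing can be filtered in code to name the nodes that are blocking a migration. This is a minimal sketch, assuming the rows come from `_cat/allocation?format=json&h=node,disk.percent`, where numeric values arrive as strings; `nodes_over_watermark` is a hypothetical helper and the sample rows are made up.

```python
from typing import List, Tuple


def nodes_over_watermark(rows: List[dict], high_watermark_pct: int = 88) -> List[Tuple[str, int]]:
    """Return (node, disk percent) pairs at or above the threshold, fullest first."""
    over = []
    for row in rows:
        pct = row.get("disk.percent")
        if pct is None:
            continue  # rows without disk figures (e.g. the UNASSIGNED summary) are skipped
        if int(pct) >= high_watermark_pct:
            over.append((row["node"], int(pct)))
    return sorted(over, key=lambda item: item[1], reverse=True)


rows = [
    {"node": "os-data-warm-01", "disk.percent": "91"},
    {"node": "os-data-warm-02", "disk.percent": "74"},
    {"node": "os-data-warm-03", "disk.percent": "89"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
print(nodes_over_watermark(rows))  # [('os-data-warm-01', 91), ('os-data-warm-03', 89)]
```

If every node in the target tier appears in the result, raising the threshold only postpones the problem and the tier needs more disk.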

CCR follower will not allocate. The follower inherited the leader’s routing filter, which points at a tier the follower cluster does not have. Inspect the inherited setting, then restart replication with an explicit override:

Shell

curl -s "https://<cluster>:9200/<follower>/_settings?filter_path=**.routing.allocation.require"
# Fix: stop and re-start the follower with settings.index.routing.allocation.require.data set locally

Recovery backlog after a full-cluster restart. Many primaries queue for allocation at once and sit UNASSIGNED with the reason CLUSTER_RECOVERED. Confirm the backlog is recovery, not misrouting, then raise the initial-primaries limit temporarily:

Shell

curl -s "https://<cluster>:9200/_cat/shards?v&h=index,shard,state,unassigned.reason&s=state" | grep -i unassign
# Fix: raise cluster.routing.allocation.node_initial_primaries_recoveries, then revert once green
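
Counting the unassigned reasons makes the distinction between recovery and misrouting explicit. This is a minimal sketch, assuming the rows come from `_cat/shards?format=json&h=index,shard,state,unassigned.reason`; `unassigned_reasons` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter
from typing import List


def unassigned_reasons(shard_rows: List[dict]) -> Counter:
    """Tally the unassigned.reason value across all UNASSIGNED shards."""
    return Counter(row.get("unassigned.reason") or "UNKNOWN"
                   for row in shard_rows
                   if row.get("state") == "UNASSIGNED")


rows = [
    {"index": "logs-prod-2026.06", "shard": "0", "state": "UNASSIGNED", "unassigned.reason": "CLUSTER_RECOVERED"},
    {"index": "logs-prod-2026.06", "shard": "1", "state": "UNASSIGNED", "unassigned.reason": "CLUSTER_RECOVERED"},
    {"index": "logs-prod-2026.07", "shard": "0", "state": "UNASSIGNED", "unassigned.reason": "ALLOCATION_FAILED"},
    {"index": "logs-prod-2026.07", "shard": "1", "state": "STARTED", "unassigned.reason": None},
]
print(dict(unassigned_reasons(rows)))  # {'CLUSTER_RECOVERED': 2, 'ALLOCATION_FAILED': 1}
```

A tally dominated by CLUSTER_RECOVERED that shrinks over time is ordinary recovery; other reasons, or a count that never falls, point back to the routing and watermark checks.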

Frequently asked questions

Should I use data_hot-style roles or node.attr.data tags?

Use node.attr.data tags. The index.routing.allocation.require.* filters and the ISM allocation action match custom node attributes (plus built-ins such as _name, _ip, and _host), not role names, and the data_hot/data_warm/data_cold/data_frozen roles belong to Elasticsearch's data tiers rather than to OpenSearch's built-in role set. Whichever attribute name you choose, keep a single convention per cluster; mixing names invites the exact attribute mismatches that strand shards.

Why does my index stay hot even though it is past the age threshold?

The transition condition fired but the allocation action could not complete — almost always because no node carries the target require.data value, or the target tier is above its high watermark. Run _plugins/_ism/explain for the failure reason and _cluster/allocation/explain for the decider verdict; the two together isolate whether it is a routing or a capacity problem.

What does wait_for: true actually change?

Without it, the allocation action rewrites the require filter and immediately reports success, so the policy advances while shards are still relocating — a later action can then run against a half-migrated index. With wait_for: true, the state blocks until relocation completes, which is what you want for any transition that pairs allocation with force_merge or shrink.

How do I stop CCR followers from inheriting an unroutable filter?

Always pass settings.index.routing.allocation.require.data when you start the follower, set to a tier the follower cluster actually has. This overrides the inherited leader filter at creation time and is the safest way to keep follower placement independent of leader hardware.

Related

Mapping data tiers to OpenSearch node roles — the exact strings that bind each tier to a node role and template.
Hot-Warm-Cold Tier Design — sizing the pools this page routes shards between.
Data Tier Routing Patterns — how ISM stamps the routing attribute at each phase.
Fallback Routing Strategies — graceful degradation when a target tier has no eligible node.
Index Lifecycle Basics — the transition conditions that time each allocation change.
Security & Access Boundaries — scoping the roles that run allocation and CCR actions.
