Python Orchestration Frameworks for OpenSearch ISM & CCR Operations

Python orchestration frameworks are the deterministic control plane that sits between version-controlled policy JSON and a live OpenSearch cluster, driving Index State Management (ISM) and Cross-Cluster Replication (CCR) at production scale. Declarative policies alone cannot absorb ingestion spikes, react to replication topology shifts, or repair state drift across multi-tenant clusters — the engine only reconciles what you have already asserted. Without an orchestrator that enforces idempotent API calls, explicit state reconciliation, and structured failure recovery, an operator is left hand-clicking _plugins/_ism/* endpoints, indices stall silently in WAITING, and follower checkpoints diverge without anyone noticing until search results go stale. This page shows how to architect that control plane in Python, wire it to the ISM and CCR plugin APIs, and keep it observable and safe. It extends the automation surface established in the ISM Policy Implementation & Python Automation guide and depends on the timing rules documented in Phase Transition Logic.

Choosing an orchestration model

There is no single “framework” for ISM automation; there is a spectrum of scheduling and execution models, and the right one depends on how many indices you manage, how tightly you need to react to cluster events, and where you want reconciliation state to live. A one-shot cron script is fine for attaching a handful of policies nightly; a long-lived async reconciler is what you need when rollover decisions must track ingestion velocity in near real time. The table below aligns each model with its execution characteristics and its best-fit workload so you can pick deliberately rather than defaulting to cron and discovering its blind spots in an incident.

| Orchestration model | Execution model | Concurrency handling | State persistence | Best-fit workload |
| --- | --- | --- | --- | --- |
| Cron + sync script | Fire-and-forget, blocking | None (serial) | Stateless, re-derives each run | Nightly bulk policy attach, low index count |
| systemd + APScheduler | Long-lived, interval jobs | Thread pool | In-process, ephemeral | Steady-state reconciliation on one cluster |
| Async reconciler (`asyncio`) | Long-lived event loop | `asyncio` tasks, connection pool | In-process + external checkpoint | Real-time rollover/CCR polling at scale |
| Airflow / Prefect DAG | Scheduled workflow, task graph | Worker parallelism | Metadata DB (durable) | Multi-stage pipelines, audit trail, retries |
| Kubernetes CronJob | Containerized, per-invocation | Pod-level | External (ConfigMap/DB) | GitOps clusters, ephemeral compute |

The load-bearing distinction is between stateless models (cron, Kubernetes CronJob) that re-derive the desired state on every invocation and stateful models (async reconciler, Airflow) that hold a checkpoint between cycles. Stateless is simpler and crash-safe but blind to transient failures that resolve between runs; stateful reacts faster and can implement circuit breakers, but must be defended against its own crashes with an external checkpoint. For CCR follower management, where a paused follower must be resumed at a precise checkpoint, prefer a stateful model or externalize the checkpoint to a durable store. The rest of this page builds the async reconciler because it exercises every primitive (connection pooling, backoff, circuit breaking, and structured logging), of which the simpler models use only a subset.
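
As a rough illustration of externalizing the checkpoint, the sketch below persists reconciler state to a local JSON file with an atomic rename. The function names, the file location, and the shape of the state dictionary are all hypothetical; a production deployment might use a database or object store instead.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Atomically persist reconciler state so a crash never leaves a torn file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic replacement of the old checkpoint
    except BaseException:
        os.unlink(tmp_name)  # clean up the partial temp file, then re-raise
        raise


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or an empty dict on first start."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}
```

A restarted reconciler would call `load_checkpoint` once at startup and `save_checkpoint` at the end of each cycle, so a crash costs at most one cycle of state.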

The reconciliation state model

An orchestrator is a reconciliation loop: read the observed state from OpenSearch, compare it to the desired state from your policy source, and issue the minimal set of API calls that closes the gap. Model each managed index as a small finite state machine whose nodes mirror the ISM phases (hot → warm → cold → delete) and whose CCR counterpart tracks follower status (SYNCING → PAUSED → RESUMED → FAILED_OVER). The orchestrator never drives individual transitions itself; the ISM daemon owns that. Instead it asserts the desired policy and verifies convergence, retrying or escalating only when the engine reports a Failed action. That division of labour is what keeps automation safe: the same guarantee that the scheduler never runs two actions concurrently on one index (covered in Async Execution Patterns) means your loop can poll idempotently without racing the engine.
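
The compare-and-close-the-gap step can be sketched as a pure function, which keeps it testable without a cluster. The function name and the dictionary shapes are assumptions made for illustration; the two operation labels are meant to map onto the ISM `add` and `change_policy` endpoints.

```python
from typing import Dict, List, Optional, Tuple


def plan_actions(
    desired: Dict[str, str],
    observed: Dict[str, Optional[str]],
) -> List[Tuple[str, str, str]]:
    """Return (operation, index, policy_id) tuples that close the gap.

    desired  maps index name -> policy_id that should be attached.
    observed maps index name -> policy_id currently attached (None if unmanaged).
    """
    plan = []
    for index in sorted(desired):
        want, have = desired[index], observed.get(index)
        if have is None:
            plan.append(("attach", index, want))
        elif have != want:
            plan.append(("change_policy", index, want))
        # have == want: already converged, so emit nothing
    return plan


desired = {name: "hot-warm-delete" for name in
           ("logs-prod-000001", "logs-prod-000002", "logs-prod-000003")}
observed = {"logs-prod-000001": "hot-warm-delete",
            "logs-prod-000002": "legacy",
            "logs-prod-000003": None}
print(plan_actions(desired, observed))
```

Because a converged index produces no entry, running the plan twice in a row is a no-op on the second pass, which is the idempotence property the loop relies on.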

Run the reconciliation loop on a fixed interval of 15–30 seconds, independent of the cluster’s own plugins.index_state_management.job_interval. Polling faster than the engine evaluates does not make transitions happen sooner and costs connections, so there is little to gain below 15 seconds; polling much slower widens the window in which a failed action or a paused follower goes unremediated. Each cycle polls _plugins/_ism/explain for phase and action status, polls _plugins/_replication/<follower-index>/_status for follower health, and evaluates a circuit breaker against thread-pool queue depth before issuing any mutating call.

The circuit-breaker threshold is a ratio of the write thread pool’s queued tasks to its queue capacity. Trip the breaker when

$$\frac{\text{queue}_{\text{used}}}{\text{queue}_{\text{capacity}}} \ge 0.80$$

so that reconciliation stops adding mutating load while OpenSearch is already saturated. The management and write thread pools are the ones ISM actions contend for; read their live queue and queue_size values from _cat/thread_pool before you decide the breaker is safe to reset.
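
A worked instance of the ratio, as a minimal sketch: the function name is hypothetical, and treating an unknown or unbounded capacity as "do not trip" is an assumption that may be too permissive for some clusters.

```python
def breaker_should_trip(
    queue_used: int, queue_capacity: int, threshold: float = 0.80
) -> bool:
    """Return True when queue_used / queue_capacity reaches the threshold."""
    if queue_capacity <= 0:
        # Assumption: no usable capacity figure means no ratio to evaluate.
        return False
    return queue_used / queue_capacity >= threshold


# With a 10,000-slot queue the breaker trips at 8,000 queued tasks.
print(breaker_should_trip(7_999, 10_000))  # False
print(breaker_should_trip(8_000, 10_000))  # True
```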

Step-by-step configuration

The four steps below stand up an async orchestrator against a logs-prod-* index set: provision a scoped service account, configure the client and its connection pool, define the reconciliation loop, then verify that the loop both observes and enforces correctly.

1. Service account and access scope

The orchestrator authenticates as a dedicated service account, never as admin. Scope it to exactly the _plugins/_ism/* and _plugins/_replication/* endpoints plus the index patterns it manages — the role design is covered in Security & Access Boundaries. A minimal role for an ISM/CCR orchestrator looks like this:

JSON

{
  "cluster_permissions": [
    "cluster:admin/opendistro/ism/*",
    "cluster:admin/plugins/replication/*",
    "cluster:monitor/stats",
    "cluster:monitor/nodes/stats"
  ],
  "index_permissions": [
    {
      "index_patterns": ["logs-prod-*"],
      "allowed_actions": [
        "indices:admin/opensearch/ism/*",
        "indices:monitor/stats",
        "indices:admin/settings/update"
      ]
    }
  ]
}

2. Client and connection pool configuration

Instantiate a single long-lived AsyncOpenSearch client and share it across every task — creating a client per request exhausts HTTP keep-alive connections and defeats pooling. Size maxsize (the per-host connection cap) to your loop’s concurrency, enable retry_on_timeout, and set a bounded request timeout so a single hung call cannot stall the whole cycle.

Python

from opensearchpy import AsyncOpenSearch

client = AsyncOpenSearch(
    hosts=["https://opensearch-cluster.local:9200"],
    http_auth=("ism-orchestrator", "<from-secrets-manager>"),
    timeout=30,             # per-request ceiling; a hung call cannot stall the cycle
    max_retries=3,
    retry_on_timeout=True,
    maxsize=16,             # per-host connection pool; align to loop concurrency
    verify_certs=True,
)

3. Reconciliation loop definition

The loop discovers managed indices, reads their ISM state, and — only when the circuit breaker is closed — evaluates rollover and CCR conditions. Keep the discovery step explicit (an index pattern or a _cat/indices query) rather than hard-coding names, so newly created indices are picked up automatically. Rollover threshold evaluation itself is delegated to the patterns in Rollover Trigger Configuration; the loop’s job is to decide when to evaluate, not to re-implement the thresholds documented in Threshold Tuning Strategies.
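
The discovery step might look like the sketch below, which filters rows in the shape returned by `_cat/indices?format=json` against the managed patterns. The function name is hypothetical, and skipping dot-prefixed indices is an assumption about what the loop should leave alone.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List


def discover_managed(
    cat_rows: Iterable[Dict[str, str]], patterns: List[str]
) -> List[str]:
    """Filter `_cat/indices?format=json` rows down to indices the loop manages."""
    managed = []
    for row in cat_rows:
        name = row["index"]
        if name.startswith("."):
            continue  # assumption: never manage system or hidden indices
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            managed.append(name)
    return sorted(managed)


rows = [
    {"index": "logs-prod-000002"},
    {"index": ".opendistro-ism-config"},
    {"index": "metrics-2024"},
    {"index": "logs-prod-000001"},
]
print(discover_managed(rows, ["logs-prod-*"]))
```

Keeping discovery as a pure filter means the pattern list is the only thing that changes when a new index family comes under management.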

4. Verification

Before trusting the loop, verify it against a known index. Confirm the observed state matches the engine’s own report and that a deliberately induced failure is retried rather than swallowed:

Shell

# Confirm the orchestrator sees the same phase the engine reports
curl -s "https://<cluster>:9200/_plugins/_ism/explain/logs-prod-000001?pretty" \
  | jq '.["logs-prod-000001"].state.name'

# Confirm CCR follower status the loop polls
curl -s "https://<cluster>:9200/_plugins/_replication/logs-prod-000001/_status?pretty" \
  | jq '.status'

Production automation

The orchestrator below ties the pieces together: it attaches policies idempotently, monitors phase progression, manages CCR follower topology, and enforces exponential backoff behind a circuit breaker. Every plugin API request routes through one wrapper so retry, backoff, and breaker logic live in a single place. All state changes are logged in structured form so the loop is auditable after the fact; the failure-handling contract this depends on is detailed in Error Handling & Retries.

Python

import asyncio
import logging
from typing import Dict, List, Optional
from opensearchpy import AsyncOpenSearch
from opensearchpy.exceptions import ConnectionTimeout, TransportError, NotFoundError

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("opensearch_ism_ccr")


class ISMCCROrchestrator:
    """Async control plane for OpenSearch ISM + CCR reconciliation."""

    def __init__(self, hosts: List[str], http_auth: tuple, max_retries: int = 3):
        self._max_retries = max_retries
        self.client = AsyncOpenSearch(
            hosts=hosts,
            http_auth=http_auth,
            timeout=30,
            max_retries=max_retries,
            retry_on_timeout=True,
            maxsize=16,
        )
        self._circuit_open = False
        self._reconciliation_interval = 15.0

    async def _request_with_retry(
        self, method: str, path: str, body: Optional[Dict] = None
    ):
        """Idempotent request wrapper: exponential backoff + circuit breaker."""
        # Reads stay available while the breaker is open; only mutations halt.
        if self._circuit_open and method.upper() != "GET":
            raise RuntimeError("Circuit breaker open: halting mutating requests.")

        for attempt in range(self._max_retries):
            try:
                return await self.client.transport.perform_request(
                    method=method, url=path, body=body
                )
            except ConnectionTimeout:
                delay = 2 ** attempt
                logger.warning(
                    "Timeout on %s, retrying in %ss (attempt %s)",
                    path, delay, attempt + 1,
                )
                await asyncio.sleep(delay)
            except TransportError as exc:
                if exc.status_code == 429:  # too many requests: back off, retry
                    await asyncio.sleep(5)
                    continue
                raise
        raise RuntimeError(f"Max retries exceeded for {method} {path}")

    async def circuit_check(self) -> None:
        """Trip the breaker when any node's write queue is >= 80% full."""
        # Node info reports the configured queue_size; node stats report live depth.
        info = await self.client.transport.perform_request(
            "GET", "/_nodes/thread_pool"
        )
        stats = await self.client.transport.perform_request(
            "GET", "/_nodes/stats/thread_pool"
        )
        for node_id, node in stats.get("nodes", {}).items():
            queue = node["thread_pool"]["write"].get("queue", 0)
            node_info = info.get("nodes", {}).get(node_id, {})
            capacity = (
                node_info.get("thread_pool", {}).get("write", {}).get("queue_size", 0)
            )
            if capacity > 0 and queue / capacity >= 0.80:
                if not self._circuit_open:
                    logger.error("Write queue %s/%s: opening circuit.", queue, capacity)
                self._circuit_open = True
                return
        self._circuit_open = False

    async def attach_ism_policy(self, index_pattern: str, policy_id: str):
        """Attach an ISM policy to matching indices (idempotent)."""
        path = f"/_plugins/_ism/add/{index_pattern}"
        return await self._request_with_retry("POST", path, {"policy_id": policy_id})

    async def get_ism_state(self, index: str) -> Dict:
        """Retrieve current ISM phase and action status for one index."""
        path = f"/_plugins/_ism/explain/{index}"
        try:
            return await self._request_with_retry("GET", path)
        except NotFoundError:
            return {}

    async def manage_ccr_follower(
        self, follower_index: str, leader_index: str, leader_cluster: str
    ):
        """Start or verify a CCR follower against a remote leader."""
        path = f"/_plugins/_replication/{follower_index}/_start"
        body = {"leader_alias": leader_cluster, "leader_index": leader_index}
        return await self._request_with_retry("PUT", path, body)

    async def evaluate_rollover_triggers(self, index: str, size_gb: float) -> bool:
        """Check the size threshold before delegating rollover to OpenSearch."""
        stats = await self.client.indices.stats(index=index, metric="store")
        primary_bytes = stats["indices"][index]["primaries"]["store"]["size_in_bytes"]
        current_gb = primary_bytes / (1024 ** 3)
        if current_gb >= size_gb:
            logger.info("Size threshold breached for %s: %.2fGB", index, current_gb)
            return True
        return False

    async def run_reconciliation_loop(self, patterns: List[str]):
        """Main loop: observe state, gate on the breaker, enforce convergence."""
        logger.info("Starting ISM/CCR reconciliation loop.")
        while True:
            try:
                await self.circuit_check()
                for pattern in patterns:
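                    # Avoid ignore=[404]: an ignored 404 returns the error body,
                    # whose keys would then be iterated as if they were index names.
                    resp = await self.client.indices.get(
                        index=pattern, allow_no_indices=True, ignore_unavailable=True
                    )
                    for index in resp: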
                        state = await self.get_ism_state(index)
                        phase = (
                            state.get(index, {}).get("state", {}).get("name")
                        )
                        if phase == "hot" and not self._circuit_open:
                            await self.evaluate_rollover_triggers(index, size_gb=50.0)
                await asyncio.sleep(self._reconciliation_interval)
            except Exception as exc:  # loop must never die on a single-cycle error
                logger.error("Reconciliation cycle failed: %s", exc)
                await asyncio.sleep(30)


async def main():
    orchestrator = ISMCCROrchestrator(
        hosts=["https://opensearch-cluster.local:9200"],
        http_auth=("ism-orchestrator", "<from-secrets-manager>"),
    )
    await orchestrator.run_reconciliation_loop(patterns=["logs-prod-*", "metrics-*"])


if __name__ == "__main__":
    asyncio.run(main())

Deploy the loop as a systemd service or a container with a liveness probe wired to the reconciliation heartbeat, and export the queue-depth, retry-count, and phase-transition-latency counters so an alert fires when the breaker trips or retries climb. For runtime policy edits — fetching an active policy, mutating its conditions or actions in memory, and pushing it back — see the child guide, Python Automation for Dynamic ISM Policy Updates.
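
One simple way to expose the reconciliation heartbeat to a liveness probe is a timestamp file, sketched below. The file-based approach, the function names, and the staleness budget are assumptions chosen for illustration; an HTTP health endpoint would serve the same purpose.

```python
import time
from pathlib import Path
from typing import Optional


def beat(path: Path) -> None:
    """Record that a reconciliation cycle just completed."""
    path.write_text(str(time.time()), encoding="utf-8")


def is_alive(path: Path, max_age_s: float, now: Optional[float] = None) -> bool:
    """Return True if the last heartbeat is no older than max_age_s seconds."""
    try:
        last = float(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, ValueError):
        return False  # no heartbeat yet, or an unreadable one: treat as dead
    current = time.time() if now is None else now
    return current - last <= max_age_s
```

The loop would call `beat` at the end of every cycle, including cycles that ended in the exception branch, and the probe would call `is_alive` with a budget of a few reconciliation intervals so that one slow cycle does not trigger a restart.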

Operational guardrails

The orchestrator’s safety comes from a handful of tunables. Set them explicitly; the defaults are tuned for interactive clients, not for a loop hammering plugin endpoints every 15 seconds. The table records the settings that matter, a production-sane starting value, and why it exists.

| Setting | Starting value | Rationale |
| --- | --- | --- |
| Reconciliation interval | 15–30 s | Fast enough to catch failed actions, slow enough not to flood the pool |
| Client `timeout` | 30 s | Bounds a single hung call so it cannot stall the whole cycle |
| Connection pool `maxsize` | 16 | Caps concurrent connections per host; align to loop concurrency |
| `max_retries` / backoff | 3 / `2**attempt` s | Absorbs transient timeouts without amplifying a real outage |
| Circuit-breaker threshold | 0.80 of write queue | Stops mutating load before OpenSearch saturates |
| `429` back-off | 5 s fixed | Honours OpenSearch rejection signalling before retrying |
| Loop exception sleep | 30 s | Prevents a crash loop when OpenSearch is unreachable |

Two guardrails are non-negotiable. First, snapshot before bulk mutation — capture the current policy set before pushing changes across hundreds of indices, so a bad edit is recoverable. Second, gate every mutating call on the circuit breaker, never just the read calls; a breaker that only guards observation still lets the loop push writes into a saturated cluster.
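
The snapshot guardrail can be as small as writing the current policy set to a timestamped file before any bulk change. The sketch below assumes the caller has already fetched the policy listing (for example from `GET _plugins/_ism/policies`) and passes the parsed response in; the function name, file naming scheme, and output directory are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Dict


def snapshot_policies(policies_response: Dict, out_dir: Path) -> Path:
    """Write the current policy set to a timestamped JSON file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    target = out_dir / f"ism-policies-{stamp}.json"
    target.write_text(
        json.dumps(policies_response, indent=2, sort_keys=True),
        encoding="utf-8",
    )
    return target
```

Storing the raw response rather than a transformed copy keeps the snapshot usable as input for a later restore, whatever fields the listing happens to carry.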

Troubleshooting

| Failure mode | Diagnosis command | Fix |
| --- | --- | --- |
| Loop swallows failed actions, index stuck in `Failed` | `curl -s ".../_plugins/_ism/explain/<idx>" \| jq '.["<idx>"].action'` | Call `POST _plugins/_ism/retry/<idx>` from the loop’s failure branch instead of only logging |
| Connection pool exhausted, `ConnectionTimeout` storms | `curl -s ".../_nodes/stats/http" \| jq '.nodes[].http.current_open'` | Reuse one client; raise `maxsize`; ensure tasks `await` and release connections |
| Breaker never resets after a spike | `curl -s ".../_cat/thread_pool/write?v&h=name,queue,queue_size,rejected"` | Confirm `queue` drained; compare it against the real `queue_size` from the same output instead of hard-coding a capacity |
| CCR follower silently `PAUSED` | `curl -s ".../_plugins/_replication/<idx>/_status" \| jq '.status'` | Resume with `POST _plugins/_replication/<idx>/_resume`; override an unroutable inherited `require` filter |
| `429 Too Many Requests` on every attach | `curl -s ".../_cat/thread_pool/management?v"` | Lower reconciliation frequency; batch attaches by index pattern rather than per-index |
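
For the first row, the failure branch needs to know which indices to retry. The sketch below picks them out of an explain response; it assumes each per-index entry carries an `action` object with a boolean `failed` field, and the function name is hypothetical. Verify the field against the explain output of your OpenSearch version before relying on it.

```python
from typing import Dict, List


def indices_needing_retry(explain: Dict) -> List[str]:
    """Return index names whose current ISM action is reported as failed."""
    failed = []
    for index, detail in explain.items():
        if not isinstance(detail, dict):
            continue  # skip scalar top-level fields such as a managed-index count
        action = detail.get("action") or {}
        if action.get("failed") is True:  # assumed field name, see lead-in
            failed.append(index)
    return sorted(failed)


sample = {
    "logs-prod-000001": {"state": {"name": "hot"},
                         "action": {"name": "rollover", "failed": True}},
    "logs-prod-000002": {"state": {"name": "warm"},
                         "action": {"name": "force_merge", "failed": False}},
    "total_managed_indices": 2,
}
print(indices_needing_retry(sample))
```

Each returned name would then be passed to `POST _plugins/_ism/retry/<idx>`, subject to the circuit breaker like any other mutating call.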

Frequently asked questions

Do I need an orchestrator at all, or can ISM run on its own?

ISM reconciles the policy you have already attached, on its own scheduler — for a static set of indices with a fixed policy, no orchestrator is required. You need one the moment attachment, rollover timing, or CCR follower lifecycle has to react to runtime conditions (ingestion velocity, capacity, failover) that a declarative policy cannot express, or when you need auditable, version-controlled deploys instead of hand-clicked changes.

Should the reconciliation interval match plugins.index_state_management.job_interval?

No. Keep them independent. The engine’s job_interval (default 5 minutes) controls how often ISM evaluates transitions; your loop’s 15–30 s interval controls how often you observe and remediate. Polling faster than the engine acts does not speed up transitions, but it does shorten the time to detect a failed action or a paused follower, so matching the loop to job_interval would only delay detection, and coupling the two makes the loop brittle to an OpenSearch-side setting change.

Why route through transport.perform_request instead of native client methods?

opensearch-py does not yet ship first-class wrapper methods for every _plugins/_ism/* and _plugins/_replication/* endpoint, so transport.perform_request is the stable way to reach them with explicit method, path, and body. Where a native helper exists (e.g. client.indices.stats), prefer it — it is typed and validated — and reserve the raw transport for the plugin routes that have no wrapper.

How do I stop the loop from crash-looping when OpenSearch is unreachable?

Wrap the loop body in a broad except that logs and sleeps (30 s here) rather than letting the exception kill the process, and put the reconciliation heartbeat behind a liveness probe so the supervisor restarts a genuinely wedged process. The circuit breaker handles cluster saturation; the outer try/except handles cluster unavailability — you need both.

Related

Async Execution Patterns: the concurrency model the reconciliation loop is built on.
Error Handling & Retries: the retry and backoff contract behind the request wrapper.
Phase Transition Logic: how the engine decides transitions the loop observes.
Rollover Trigger Configuration: the size and age triggers the loop evaluates.
Python Automation for Dynamic ISM Policy Updates: mutating live policies from the orchestrator.
