// Module included in the following assemblies:
//
// * ai_workloads/index.adoc

:_mod-docs-content-type: CONCEPT
[id="ai-operators_{context}"]
= Operators for running AI workloads
[role="_abstract"]
You can use Operators to run artificial intelligence (AI) and machine learning (ML) workloads on {product-title}. With Operators, you can build a customized environment that meets your specific AI/ML requirements while continuing to use {product-title} as the core platform for your applications.
{product-title} provides several Operators that can help you run AI workloads:
{kueue-name}::
You can use {kueue-name} to provide structured queues and prioritization so that workloads are handled fairly and efficiently. Without proper prioritization, important jobs might be delayed while less critical jobs occupy resources.
+
For more information, see "Introduction to {kueue-name}".
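+
As a hypothetical sketch of how such a queue might be defined, the following upstream-style Kueue `ClusterQueue` sets nominal quota for CPU, memory, and GPU resources. The queue name, flavor name, and quota values are examples only:
+
[source,yaml]
----
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ai-team-queue
spec:
  namespaceSelector: {} # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4
----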
{lws-operator}::
You can use the {lws-operator} to enable large-scale AI inference workloads to run reliably across nodes, with synchronization between leader and worker processes. Without proper coordination, these large workloads might fail or stall.
+
For more information, see "{lws-operator} overview".
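+
As a hypothetical sketch, the following upstream-style `LeaderWorkerSet` runs two replicas of a serving group, each consisting of one leader and three workers. The resource name and image are examples only:
+
[source,yaml]
----
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: inference-server
spec:
  replicas: 2        # two independent leader/worker groups
  leaderWorkerTemplate:
    size: 4          # one leader plus three workers per group
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: registry.example.com/inference-runtime:latest
----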
{js-operator} (Technology Preview)::
You can use the {js-operator} to manage and run large-scale, coordinated workloads such as high-performance computing (HPC) and AI training. The {js-operator} can help you achieve fast recovery and efficient resource use through features such as multi-template job support and stable networking.
+
For more information, see "{js-operator} overview".
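+
As a hypothetical sketch, the following upstream-style `JobSet` groups a driver job and a set of worker jobs so that they are submitted, scheduled, and managed as a single unit. The names, replica counts, and image are examples only:
+
[source,yaml]
----
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run
spec:
  replicatedJobs:
  - name: driver      # single coordinating job
    replicas: 1
    template:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: registry.example.com/trainer:latest
  - name: workers     # parallel worker pods managed with the driver
    replicas: 1
    template:
      spec:
        parallelism: 3
        completions: 3
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/trainer:latest
----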
////
Keep for future use (JobSet and DRA) - From Gaurav (PM):
AI in OpenShift – Focus Areas
What We’re Building
- Smarter Resource Allocation (DRA) – enhancing how accelerators and devices are requested, bound, and shared to maximize efficiency and utilization.
- Coordinated Distributed Jobs (LWS) – enabling large-scale AI training workloads to run reliably across many nodes with proper synchronization between lead and worker processes.
- Intelligent Queuing and Scheduling (Kueue) – providing structured queues and prioritization so workloads are handled fairly, respecting policies while improving throughput.
- Batch and Group Workload Management (Job Set) – allowing sets of jobs to be submitted, scheduled, and managed together, making it easier to run multi-step AI pipelines.
The Problems We’re Solving
- Resource waste and inefficiency (DRA) – current systems often over- or under-allocate accelerators, increasing cost.
- Complexity of distributed AI training (LWS) – without coordination, large training runs can fail or stall.
- Unfair or unpredictable scheduling (Kueue) – important jobs may be delayed while less critical ones consume resources.
- Lack of support for pipelines (Job Set) – multi-job workflows are hard to manage and monitor as a single unit.
////