// Module included in the following assemblies:
//
// * ai_workloads/index.adoc

:_mod-docs-content-type: CONCEPT
[id="ai-operators_{context}"]
= Operators for running AI workloads
[role="_abstract"]
You can use Operators to run artificial intelligence (AI) and machine learning (ML) workloads on {product-title}. With Operators, you can build a customized environment that meets your specific AI/ML requirements while continuing to use {product-title} as the core platform for your applications.
{product-title} provides several Operators that can help you run AI workloads:
{kueue-name}::
You can use {kueue-name} to provide structured queues and prioritization so that workloads are handled fairly and efficiently. Without proper prioritization, important jobs might be delayed while less critical jobs occupy resources.
+
For more information, see "Introduction to {kueue-name}".
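+
As a hypothetical sketch of how such a queue might be defined, the following upstream-style Kueue `ClusterQueue` sets nominal quota for CPU, memory, and GPU resources. The queue name, flavor name, and quota values are examples only:
+
[source,yaml]
----
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ai-team-queue
spec:
  namespaceSelector: {} # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4
----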
{lws-operator}::
You can use the {lws-operator} to enable large-scale AI inference workloads to run reliably across nodes, with synchronization between leader and worker processes. Without proper coordination, these large workloads might fail or stall.
+
For more information, see "{lws-operator} overview".
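+
As a hypothetical sketch, the following upstream-style `LeaderWorkerSet` runs two replicas of a serving group, each consisting of one leader and three workers. The resource name and image are examples only:
+
[source,yaml]
----
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: inference-server
spec:
  replicas: 2        # two independent leader/worker groups
  leaderWorkerTemplate:
    size: 4          # one leader plus three workers per group
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: registry.example.com/inference-runtime:latest
----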
{js-operator} (Technology Preview)::
You can use the {js-operator} to manage and run large-scale, coordinated workloads such as high-performance computing (HPC) and AI training. The {js-operator} can help you achieve fast recovery and efficient resource use through features such as multi-template job support and stable networking.
+
For more information, see "{js-operator} overview".
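+
As a hypothetical sketch, the following upstream-style `JobSet` groups a driver job and a set of worker jobs so that they are submitted, scheduled, and managed as a single unit. The names, replica counts, and image are examples only:
+
[source,yaml]
----
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run
spec:
  replicatedJobs:
  - name: driver      # single coordinating job
    replicas: 1
    template:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: registry.example.com/trainer:latest
  - name: workers     # parallel worker pods managed with the driver
    replicas: 1
    template:
      spec:
        parallelism: 3
        completions: 3
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/trainer:latest
----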
////
Keep for future use (JobSet and DRA) - From Gaurav (PM):
AI in OpenShift – Focus Areas
What We’re Building
- Smarter Resource Allocation (DRA) – enhancing how accelerators and devices are requested, bound, and shared to maximize efficiency and utilization.
- Coordinated Distributed Jobs (LWS) – enabling large-scale AI training workloads to run reliably across many nodes with proper synchronization between lead and worker processes.
- Intelligent Queuing and Scheduling (Kueue) – providing structured queues and prioritization so workloads are handled fairly, respecting policies while improving throughput.
- Batch and Group Workload Management (Job Set) – allowing sets of jobs to be submitted, scheduled, and managed together, making it easier to run multi-step AI pipelines.
The Problems We’re Solving
- Resource waste and inefficiency (DRA) – current systems often over- or under-allocate accelerators, increasing cost.
- Complexity of distributed AI training (LWS) – without coordination, large training runs can fail or stall.
- Unfair or unpredictable scheduling (Kueue) – important jobs may be delayed while less critical ones consume resources.
- Lack of support for pipelines (Job Set) – multi-job workflows are hard to manage and monitor as a single unit.
////