// Module included in the following assemblies:
//
// * ai_workloads/index.adoc
:_mod-docs-content-type: CONCEPT
[id="ai-operators_{context}"]
= Operators for running AI workloads

[role="_abstract"]
You can use Operators to run artificial intelligence (AI) and machine learning (ML) workloads on {product-title}. With Operators, you can build a customized environment that meets your specific AI/ML requirements while continuing to use {product-title} as the core platform for your applications.

{product-title} provides several Operators that can help you run AI workloads:

{kueue-name}::
You can use {kueue-name} to provide structured queues and prioritization so that workloads are handled fairly and efficiently. Without proper prioritization, important jobs might be delayed while less critical jobs occupy resources.
+
For more information, see "Introduction to {kueue-name}".
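+
For illustration only, a minimal queue configuration might look like the following sketch, which assumes the upstream Kueue `v1beta1` API. The queue name, flavor name, and quota values are hypothetical, not taken from this document:
+
[source,yaml]
----
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-queue          # hypothetical name
spec:
  namespaceSelector: {}     # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor  # hypothetical ResourceFlavor
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4
----
+
A workload then targets this capacity through a namespaced `LocalQueue` that points at the `ClusterQueue`, and jobs opt in with the `kueue.x-k8s.io/queue-name` label.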
{lws-operator}::
You can use the {lws-operator} to run large-scale AI inference workloads reliably across multiple nodes, with synchronization between leader and worker processes. Without this coordination, large distributed runs might fail or stall.
+
For more information, see "{lws-operator} overview".
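+
For illustration only, a leader and worker group might be declared as in the following sketch, which assumes the upstream LeaderWorkerSet `v1` API. The name, image, and group sizes are hypothetical:
+
[source,yaml]
----
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: inference-server     # hypothetical name
spec:
  replicas: 2                # two independent leader-worker groups
  leaderWorkerTemplate:
    size: 4                  # one leader plus three workers per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: registry.example.com/model-server:latest  # hypothetical image
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: registry.example.com/model-server:latest  # hypothetical image
----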
{js-operator} (Technology Preview)::
You can use the {js-operator} to manage and run large-scale, coordinated workloads such as high-performance computing (HPC) and AI training. The {js-operator} helps you achieve fast recovery and efficient resource use through features such as multi-template job support and stable networking.
+
For more information, see "{js-operator} overview".
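+
For illustration only, a coordinated multi-template workload might be declared as in the following sketch, which assumes the upstream JobSet `v1alpha2` API. The names, images, and counts are hypothetical:
+
[source,yaml]
----
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run         # hypothetical name
spec:
  replicatedJobs:
  - name: driver             # one template for the coordinating job
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: registry.example.com/trainer:latest   # hypothetical image
  - name: workers            # a second template for the worker jobs
    replicas: 1
    template:
      spec:
        parallelism: 4
        completions: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/trainer:latest   # hypothetical image
----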
////
Keep for future use (JobSet and DRA) - From Gaurav (PM):
AI in OpenShift Focus Areas
What We're Building
- Smarter Resource Allocation (DRA): enhancing how accelerators and devices are requested, bound, and shared to maximize efficiency and utilization.
- Coordinated Distributed Jobs (LWS): enabling large-scale AI training workloads to run reliably across many nodes with proper synchronization between leader and worker processes.
- Intelligent Queuing and Scheduling (Kueue): providing structured queues and prioritization so workloads are handled fairly, respecting policies while improving throughput.
- Batch and Group Workload Management (JobSet): allowing sets of jobs to be submitted, scheduled, and managed together, making it easier to run multi-step AI pipelines.
The Problems We're Solving
- Resource waste and inefficiency (DRA): current systems often over- or under-allocate accelerators, increasing cost.
- Complexity of distributed AI training (LWS): without coordination, large training runs can fail or stall.
- Unfair or unpredictable scheduling (Kueue): important jobs may be delayed while less critical ones consume resources.
- Lack of support for pipelines (JobSet): multi-job workflows are hard to manage and monitor as a single unit.
////