mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 12:46:18 +01:00
openshift-docs/modules/about-jobset.adoc
2025-11-04 19:15:10 +00:00

// Module included in the following assemblies:
//
// * ai_workloads/jobset_operator/index.adoc
:_mod-docs-content-type: CONCEPT
[id="js-about_{context}"]
= About the {js-operator}
[role="_abstract"]
Use the {js-operator} on {product-title} to manage large, distributed, and coordinated computing workloads, such as high-performance computing (HPC) or artificial intelligence (AI) training, and gain automatic stability, coordination, and failure recovery.

The {js-operator} is based on the link:https://jobset.sigs.k8s.io/docs/overview/[JobSet] open source project.

The {js-operator} is designed to manage a group of jobs as a single, coordinated unit. This is especially useful in fields such as HPC and large-scale AI model training, where a group of machines must run together for hours or days.

You can use the {js-operator} to run workloads that are too large or too complex for a standard {product-title} job. The {js-operator} provides coordination, stability, and recovery.

The {js-operator} automatically sets up a stable headless service so that workers can find and communicate with each other by using stable DNS hostnames, even after a failure and restart. It also provides automatic failure recovery. If one small part of a large training job fails, you can configure the Operator to restart the entire group of workers so that the workload can resume from a saved checkpoint. This saves time and computing costs.
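
As an illustration of the networking and recovery behavior described above, the following fragment is a minimal sketch of a `JobSet` resource. It assumes the upstream `jobset.x-k8s.io/v1alpha2` API. The resource name, subdomain, image, and restart limit are hypothetical placeholders, and the fields that your installed version of the Operator supports might differ.

[source,yaml]
----
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-example # hypothetical name
spec:
  network:
    enableDNSHostnames: true # give each pod a stable DNS hostname through the headless service
    subdomain: training-example # hypothetical; used as the name of the headless service
  failurePolicy:
    maxRestarts: 3 # assumed limit on how many times the whole group is restarted before the JobSet fails
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        completionMode: Indexed # each pod gets a stable index that is part of its hostname
        parallelism: 4
        completions: 4
        backoffLimit: 0 # fail the job on the first pod failure so that the failure policy applies
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/training/worker:latest # hypothetical image
----

In this sketch, the training application is assumed to load its own saved checkpoint when the workers restart.
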
The {js-operator} offers startup control, which allows you to define a specific startup sequence to ensure that dependencies are met. For example, you can ensure that the leader is running before any workers attempt to connect.
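
The leader-before-workers sequence can be sketched with a startup policy. This fragment assumes the upstream `jobset.x-k8s.io/v1alpha2` API and its `startupPolicy` field. The resource, job, and image names are hypothetical placeholders.

[source,yaml]
----
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: leader-worker-example # hypothetical name
spec:
  startupPolicy:
    startupPolicyOrder: InOrder # start replicated jobs in the order in which they are listed
  replicatedJobs:
  - name: leader # listed first, so it is started before the workers
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: leader
              image: registry.example.com/training/leader:latest # hypothetical image
  - name: workers # listed second, so it is started after the leader
    replicas: 1
    template:
      spec:
        parallelism: 4
        completions: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/training/worker:latest # hypothetical image
----

If the order does not matter, the upstream API also accepts `AnyOrder` for `startupPolicyOrder`.
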
The {js-operator} makes it easier to manage large, distributed, and coordinated computing tasks on {product-title} by turning many individual components into one resilient and manageable system.