// Module included in the following assemblies:
//
// * ai_workloads/jobset_operator/index.adoc

:_mod-docs-content-type: CONCEPT
[id="js-about_{context}"]
= About the {js-operator}

[role="_abstract"]
Use the {js-operator} on {product-title} to manage large, distributed, and coordinated computing workloads, such as high-performance computing (HPC) or artificial intelligence (AI) training, and gain automatic stability, coordination, and failure recovery.

The {js-operator} is based on the link:https://jobset.sigs.k8s.io/docs/overview/[JobSet] open source project.

The {js-operator} manages a group of jobs as a single, coordinated unit. This is especially useful for HPC workloads and for training large AI models, where a group of machines must run together for hours or days.

You can use the {js-operator} to run workloads that are too large or too complex for a single standard {product-title} job. The {js-operator} provides coordination, stability, and recovery.

The {js-operator} automatically sets up a headless service that gives each worker a stable network identity, so workers can find and communicate with each other, even after a failure and restart. It also provides automatic failure recovery. If one small part of a large training job fails, you can configure the Operator to restart the entire group of workers so that the workload can resume from a saved checkpoint instead of starting over. This saves time and computing costs.
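
The following is a minimal sketch of how the networking and failure recovery behavior might be expressed in a `JobSet` object. It assumes the upstream JobSet `v1alpha2` API; the object name, the `workers` job name, and the container image are placeholders, and the field names should be verified against the version of the {js-operator} that is installed on your cluster.

[source,yaml]
----
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-example # hypothetical name
spec:
  network:
    enableDNSHostnames: true # request stable DNS hostnames for pods through a headless service
  failurePolicy:
    maxRestarts: 3 # restart the whole group of jobs at most 3 times
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        completionMode: Indexed # each pod gets a stable completion index
        parallelism: 4
        completions: 4
        backoffLimit: 0 # a single pod failure fails the job and triggers the failure policy
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: trainer
              image: registry.example.com/training:latest # placeholder image
----

In this sketch, the application is assumed to save and load its own checkpoints. The `JobSet` object only controls when the group of workers is restarted.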

The {js-operator} offers startup control, which allows you to define a specific startup sequence to ensure that dependencies are met. For example, you can make sure that the leader is running before any workers attempt to connect.
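
The following is a minimal sketch of a leader-first startup sequence. It assumes the `startupPolicy` field of the upstream JobSet `v1alpha2` API, where the `InOrder` value starts the replicated jobs in the order in which they are listed. The object name, job names, and container image are placeholders; verify the fields against the version of the {js-operator} that is installed on your cluster.

[source,yaml]
----
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: leader-worker-example # hypothetical name
spec:
  startupPolicy:
    startupPolicyOrder: InOrder # start replicated jobs in the listed order
  replicatedJobs:
  - name: leader # listed first, so it starts first
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: leader
              image: registry.example.com/training:latest # placeholder image
  - name: workers # starts after the leader job is ready
    replicas: 1
    template:
      spec:
        parallelism: 4
        completions: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/training:latest # placeholder image
----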

The {js-operator} makes it easier to manage large, distributed, and coordinated computing tasks on {product-title} by turning many individual components into one resilient and manageable system.