Job parallelism default incorrect

2026-02-05 12:46:18 +01:00 · 2019-11-25 13:02:47 -05:00
parent d2ccd8cff0
commit c12d707812
3 changed files with 98 additions and 35 deletions
--- a/modules/nodes-nodes-jobs-about.adoc
+++ b/modules/nodes-nodes-jobs-about.adoc
@@ -3,29 +3,49 @@
 // * nodes/nodes-nodes-jobs.adoc

 [id="nodes-nodes-jobs-about_{context}"]
-= Understanding jobs and CronJobs
+= Understanding Jobs and CronJobs

-A job tracks the overall progress of a task and updates its status with information
-about active, succeeded, and failed pods. Deleting a job will clean up any pods it created.
+A Job tracks the overall progress of a task and updates its status with information
+about active, succeeded, and failed pods. Deleting a Job will clean up any pods it created.
 Jobs are part of the Kubernetes API, which can be managed
 with `oc` commands like other object types.

 There are two possible resource types that allow creating run-once objects in {product-title}:

 Job::
-A regular job is a run-once object that creates a task and ensures the job finishes.
+A regular Job is a run-once object that creates a task and ensures the Job finishes.
+
+There are three main types of task suitable to run as a Job:
+
+* Non-parallel Jobs:
+** A Job that starts only one Pod, unless the Pod fails.
+** The Job is complete as soon as its Pod terminates successfully.
+
+* Parallel Jobs with a fixed completion count:
+** A Job that starts multiple pods.
+** The Job represents the overall task and is complete when there is one successful Pod for each value in the range `1` to the `completions` value.
+
+* Parallel Jobs with a work queue:
+** A Job with multiple parallel worker processes in a given pod.
+** {product-title} coordinates pods to determine what each should work on, or uses an external queue service.
+** Each Pod is independently capable of determining whether or not all peer pods are complete and that the entire Job is done.
+** When any Pod from the Job terminates with success, no new Pods are created.
+** When at least one Pod has terminated with success and all Pods are terminated, the Job is successfully completed.
+** When any Pod has exited with success, no other Pod should be doing any work for this task or writing any output. Pods should all be in the process of exiting.
+
+For more information about how to make use of the different types of Job, see link:https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#job-patterns[Job Patterns] in the Kubernetes documentation.

 CronJob::

 A CronJob can be scheduled to run multiple times, use a CronJob.

-A _CronJob_ builds on a regular job by allowing you to specify
-how the job should be run. CronJobs are part of the
+A _CronJob_ builds on a regular Job by allowing you to specify
+how the Job should be run. CronJobs are part of the
 link:http://kubernetes.io/docs/user-guide/cron-jobs[Kubernetes] API, which
 can be managed with `oc` commands like other object types.

 CronJobs are useful for creating periodic and recurring tasks, like running backups or sending emails.
-CronJobs can also schedule individual tasks for a specific time, such as if you want to schedule a job for a low activity period.
+CronJobs can also schedule individual tasks for a specific time, such as if you want to schedule a Job for a low activity period.

 ifdef::openshift-online[]
 [IMPORTANT]
@@ -38,52 +58,55 @@ endif::[]

 [WARNING]
 ====
-A CronJob creates a job object approximately once per execution time of its
-schedule, but there are circumstances in which it fails to create a job or
-two jobs might be created.  Therefore, jobs must be idempotent and you must
+A CronJob creates a Job object approximately once per execution time of its
+schedule, but there are circumstances in which it fails to create a Job or
+two Jobs might be created. Therefore, Jobs must be idempotent and you must
 configure history limits.
 ====

 [id="jobs-create_{context}"]
-= Understanding how to create jobs
+= Understanding how to create Jobs

-Both resource types require a job configuration that consists of the following key parts:
+Both resource types require a Job configuration that consists of the following key parts:

 - A pod template, which describes the pod that {product-title} creates.
-- An optional `parallelism` parameter, which specifies how many pods running in parallel at any point in time should execute a job. If not specified, this defaults to
- the value in the `completions` parameter.
-- An optional `completions` parameter, specifying how many successful pod completions are needed to finish a job. If not specified, this value defaults to one.
+- The `parallelism` parameter, which specifies how many pods running in parallel at any point in time should execute a Job. 
+** For non-parallel Jobs, leave unset. When unset, defaults to `1`.
+- The `completions` parameter, specifying how many successful pod completions are needed to finish a Job. 
+** For non-parallel Jobs, leave unset. When unset, defaults to `1`.
+** For parallel Jobs with a fixed completion count, specify a value.
+** For parallel Jobs with a work queue, leave unset. When unset, defaults to the `parallelism` value.

 [id="jobs-set-max_{context}"]
-== Understanding how to set a maximum duration for jobs
+== Understanding how to set a maximum duration for Jobs

-When defining a job, you can define its maximum duration by setting
+When defining a Job, you can define its maximum duration by setting
 the `activeDeadlineSeconds` field. It is specified in seconds and is not
 set by default. When not set, there is no maximum duration enforced.

 The maximum duration is counted from the time when a first pod gets scheduled in
-the system, and defines how long a job can be active. It tracks overall time of
-an execution. After reaching the specified timeout, the job is terminated by {product-title}.
+the system, and defines how long a Job can be active. It tracks overall time of
+an execution. After reaching the specified timeout, the Job is terminated by {product-title}.

 [id="jobs-set-backoff_{context}"]
-== Understanding how to set a job back off policy for pod failure
+== Understanding how to set a Job back off policy for pod failure

 A Job can be considered failed, after a set amount of retries due to a
 logical error in configuration or other similar reasons. Failed Pods associated with the Job are recreated by the controller with
 an exponential back off delay (`10s`, `20s`, `40s` …) capped at six minutes. The
 limit is reset if no new failed pods appear between controller checks.

-Use the `spec.backoffLimit` parameter to set the number of retries for a job.
+Use the `spec.backoffLimit` parameter to set the number of retries for a Job.

 [id="jobs-artifacts_{context}"]
 == Understanding how to configure a CronJob to remove artifacts

-CronJobs can leave behind artifact resources such as jobs or pods.  As a user it is important
-to configure history limits so that old jobs and their pods are properly cleaned.  There are two fields within CronJob's spec responsible for that:
+CronJobs can leave behind artifact resources such as Jobs or pods. As a user, it is important
+to configure history limits so that old Jobs and their pods are properly cleaned up. Two fields in the CronJob spec are responsible for that:

-* `.spec.successfulJobsHistoryLimit`. The number of successful finished jobs to retain (defaults to 3).
+* `.spec.successfulJobsHistoryLimit`. The number of successful finished Jobs to retain (defaults to 3).

-* `.spec.failedJobsHistoryLimit`. The number of failed finished jobs to retain (defaults to 1).
+* `.spec.failedJobsHistoryLimit`. The number of failed finished Jobs to retain (defaults to 1).

 [TIP]
 ====
@@ -101,10 +124,10 @@ Doing this prevents them from generating unnecessary artifacts.
 [id="jobs-limits_{context}"]
 = Known limitations

-The job specification restart policy only applies to the _pods_, and not the _job controller_. However, the job controller is hard-coded to keep retrying jobs to completion.
+The Job specification restart policy only applies to the _pods_, and not the _job controller_. However, the job controller is hard-coded to keep retrying Jobs to completion.

-As such, `restartPolicy: Never` or `--restart=Never` results in the same behavior as `restartPolicy: OnFailure` or `--restart=OnFailure`. That is, when a job fails it is restarted automatically until it succeeds (or is manually discarded). The policy only sets which subsystem performs the restart.
+As such, `restartPolicy: Never` or `--restart=Never` results in the same behavior as `restartPolicy: OnFailure` or `--restart=OnFailure`. That is, when a Job fails it is restarted automatically until it succeeds (or is manually discarded). The policy only sets which subsystem performs the restart.

-With the `Never` policy, the _job controller_ performs the restart. With each attempt, the job controller increments the number of failures in the job status and create new pods. This means that with each failed attempt, the number of pods increases.
+With the `Never` policy, the _job controller_ performs the restart. With each attempt, the job controller increments the number of failures in the Job status and creates new pods. This means that with each failed attempt, the number of pods increases.

-With the `OnFailure` policy, _kubelet_ performs the restart. Each attempt does not increment the number of failures in the job status. In addition, kubelet will retry failed jobs starting pods on the same nodes.
+With the `OnFailure` policy, _kubelet_ performs the restart. Each attempt does not increment the number of failures in the Job status. In addition, kubelet retries failed Jobs by starting pods on the same nodes.
--- a/modules/nodes-nodes-jobs-creating.adoc
+++ b/modules/nodes-nodes-jobs-creating.adoc
@@ -34,12 +34,22 @@ spec:
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: OnFailure    <6>
 ----
-1. Optional value for how many pod replicas a job should run in parallel; defaults to `completions`.
-2. Optional value for how many successful pod completions are needed to mark a job completed; defaults to one.
-3. Optional value for the maximum duration the job can run.
-4. Option value to set the number of retries for a job. This field defaults to six.
-5. Template for the pod the controller creates.
-6. The restart policy of the pod. This does not apply to the job controller.
+1. Optionally, specify how many pod replicas a job should run in parallel; defaults to `1`.
+* For non-parallel jobs, leave unset. When unset, defaults to `1`.
+2. Optionally, specify how many successful pod completions are needed to mark a job completed.
+* For non-parallel jobs, leave unset. When unset, defaults to `1`.
+* For parallel jobs with a fixed completion count, specify the number of completions.
+* For parallel jobs with a work queue, leave unset. When unset, defaults to the `parallelism` value.
+3. Optionally, specify the maximum duration the job can run.
+4. Optionally, specify the number of retries for a job. This field defaults to six.
+5. Specify the template for the pod the controller creates.
+6. Specify the restart policy of the pod:
+* `Never`. Do not restart the pod.
+* `OnFailure`. Restart the pod only if it fails.
+* `Always`. Always restart the pod. This value is not valid for Jobs, which accept only `Never` or `OnFailure`.
+
+For details on how {product-title} uses restart policy with failed containers, see
+the link:https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#example-states[Example States] in the Kubernetes documentation.

 . Create the job:
 +
--- a/nodes/jobs/nodes-nodes-jobs.adoc
+++ b/nodes/jobs/nodes-nodes-jobs.adoc
@@ -14,6 +14,36 @@ about active, succeeded, and failed pods. Deleting a job will clean up any pod
 replicas it created. Jobs are part of the Kubernetes API, which can be managed
 with `oc` commands like other object types.

+.Sample Job specification
+
+[source,yaml]
+----
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: pi
+spec:
+  parallelism: 1    <1>
+  completions: 1    <2>
+  activeDeadlineSeconds: 1800 <3>
+  backoffLimit: 6   <4>
+  template:         <5>
+    metadata:
+      name: pi
+    spec:
+      containers:
+      - name: pi
+        image: perl
+        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
+      restartPolicy: OnFailure    <6>
+----
+1. The number of pod replicas a job should run in parallel.
+2. The number of successful pod completions needed to mark a job completed.
+3. The maximum duration the job can run.
+4. The number of retries for a job.
+5. The template for the pod the controller creates.
+6. The restart policy of the pod.
+
 See the http://kubernetes.io/docs/user-guide/jobs/[Kubernetes documentation] for
 more information about jobs.