Automatically Clean Up Dangling Jobs with Policy Engine

Preface

Last year, I was reading about the PSP deprecation and started wondering what the future solutions might be. Fortunately, there are already several policy engines available, such as OPA Gatekeeper and Kyverno.

With the help of a policy engine, not only can we ensure workloads comply with selected, predefined rules, but we can also implement custom company policies like:

  • Schedule workloads to spot instances based on certain criteria for better cost savings, and put delicate ones on on-demand instances.
  • Add preStop hooks for containers that have ports open (like ingress-nginx!)1; a rough sketch follows this list.
  • Patch image versions to leverage the local cache and speed things up (e.g., a fixed version for amazon/aws-cli).
  • Restrict home-made services from exposing endpoints that are not ready yet (publishNotReadyAddresses).
  • Restrict service load balancers.
  • Restrict modifications to Ingress annotations that try to set an arbitrary proxy buffer size.
  • …and many, many more, without requiring any intervention or modification from users.
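
To give a flavor of what these look like in practice, here is a rough, untested sketch of the preStop idea above, written as a Kyverno ClusterPolicy. The policy name, the sleep duration, and the decision to target every container of a Pod (rather than only containers with open ports) are illustrative assumptions, not the exact policy we run:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-prestop-sleep   # hypothetical name, for illustration only
spec:
  rules:
    - name: add-prestop-sleep
      match:
        resources:
          kinds:
            - Pod
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              # The (name): "*" conditional anchor matches every container;
              # the +() anchor only adds the lifecycle block when the
              # container doesn't define one already.
              - (name): "*"
                +(lifecycle):
                  preStop:
                    exec:
                      command: ["sleep", "15"]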

Policy engines are just fascinating. I also learned a few things by writing my own admission webhooks, though you should be able to achieve most requirements with a policy engine alone.

Policies are meant to be enforced. Documents and meetings alone just won’t stop bad uses of Kubernetes (intentional or unintentional).

Your cluster, your rules2.

Jobs that just won’t go away

We constantly maintain and improve our cluster policies because operation issues never end.

Recently, I noticed more and more dangling Jobs floating around, and the number kept increasing. Apparently, these Jobs were created directly (whether by a service or a user) instead of by a controller like CronJob.

It’s not necessarily a bad practice, but the Completed (or Failed) Jobs just won’t disappear on their own.

Fortunately, there is a TTL-after-finished Controller that can help.

To quote from the enhancement proposal:

Motivation

… it’s difficult for the users to clean them up automatically, and those Jobs and Pods can accumulate and overload a Kubernetes cluster very easily.

User Stories

The users keep creating Jobs in a small Kubernetes cluster with 4 nodes. The Jobs accumulate over time, and 1 year later, the cluster ended up with more than 100k old Jobs. This caused etcd hiccups, long high latency etcd requests, and eventually made the cluster unavailable.

Our clusters are definitely nowhere near 100k at this point, but I’ve seen 3k finished Jobs in a really small cluster before, and even that was terrifying.

The answer to this problem seems very straightforward: just add .spec.ttlSecondsAfterFinished to your Job and it’s done.
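
For reference, here is a minimal Job manifest with the field in place (the name, image, and 900-second TTL are placeholder choices):

apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-task   # hypothetical name
spec:
  # Delete this Job (and its Pods) 15 minutes after it finishes,
  # regardless of whether it succeeded or failed.
  ttlSecondsAfterFinished: 900
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox:1.36
          command: ["sh", "-c", "echo done"]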

But is it really that “happily ever after”?

Yes and no. You can’t expect that everyone who creates a Job directly will always set that field. So what should we do now?

Since this is a post about policy engines, yeah, let’s leverage the policy engine.

We will set .spec.ttlSecondsAfterFinished on a Job whenever it has no .metadata.ownerReferences defined (i.e., it was not created by a controller like CronJob).

Prerequisites

  • Kubernetes >= 1.12 (before 1.21 you need to enable the TTLAfterFinished feature gate; it’s on by default since 1.21 and GA since 1.23)
  • Your policy engine of choice.

Here we will use Kyverno’s ClusterPolicy as an example, but you should be able to implement it with any other solution on the market.

Example ClusterPolicy for Kyverno

Info
Special thanks to Chip Zoller from Nirmata for the hint about preconditions!
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  annotations:
    policies.kyverno.io/title: Add TTL to dangling Job
    policies.kyverno.io/category: The cool company policy collection
    policies.kyverno.io/description: >-
      Automatically clean dangling jobs by adding TTL to spec.      
  name: add-ttl-to-dangling-job
spec:
  background: false
  failurePolicy: Ignore
  validationFailureAction: enforce
  rules:
    - name: add-ttl-to-dangling-job
      match:
        resources:
          kinds:
            # We only deal with Job in this policy
            - "Job"
      preconditions:
        any:
          # If the Job was created by a CronJob, it will have a ".metadata.ownerReferences" field,
          # which is an array. The exact value doesn't really matter here;
          # we just want to know whether the field exists.
          #
          # The following line says:
          # if there is no ".metadata.ownerReferences", fall back to an empty string ('').
          - key: "{{ request.object.metadata.ownerReferences || '' }}"
            operator: Equals
            # And if the value is an empty string, there is no ".metadata.ownerReferences".
            # That's the kind of Job on which we want to set ".spec.ttlSecondsAfterFinished".
            value: ''
      mutate:
        patchStrategicMerge:
          spec:
            # Add ".spec.ttlSecondsAfterFinished" (only when it's not specified),
            # so the Job will be deleted 15 minutes after completion.
            # Set to the value you want.
            +(ttlSecondsAfterFinished): 900

Policy engine in action

It’s important to validate whether the policy actually works; let’s leverage k3d again.

Start k3d

$ k3d cluster create
# ...omitted
INFO[0008] Starting Node 'k3d-k3s-default-serverlb'
INFO[0015] Injecting records for hostAliases (incl. host.k3d.internal) and for 3 network members into CoreDNS configmap...
INFO[0017] Cluster 'k3s-default' created successfully!
# ...omitted

Install Kyverno with Helm Chart

$ helm repo add kyverno https://kyverno.github.io/kyverno/
$ helm repo update
$ helm install kyverno kyverno/kyverno --namespace kyverno --create-namespace
NAME: kyverno
LAST DEPLOYED: Sun Jul 10 01:25:23 2022
# ...omitted
Thank you for installing kyverno! Your release is named kyverno.
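
Before applying any policy, it doesn’t hurt to confirm the admission controller is actually up; the pod names and count depend on the chart version you installed:

$ kubectl get pods -n kyverno
# Wait until the Kyverno pod(s) report STATUS "Running" before moving on.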

Apply the ClusterPolicy

First, save the ClusterPolicy above as a file, e.g. add-ttl-to-dangling-job.yaml.

$ kubectl apply -f add-ttl-to-dangling-job.yaml
# Or if you are feeling lazy, use the following command:
# Caution: Always check what's in the file first before applying anything!
$ kubectl apply -f https://blog.wtcx.dev/2022/07/09/automatically-clean-up-dangling-jobs-with-policy-engine/add-ttl-to-dangling-job.yaml
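
You can quickly confirm the policy was accepted (Kyverno also registers the short name cpol for ClusterPolicy):

$ kubectl get clusterpolicy add-ttl-to-dangling-job
# The policy should be listed; if it isn't, check the Kyverno logs.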

Create Job directly

You can use the Job example from the Kubernetes documentation:

$ kubectl apply -f https://kubernetes.io/examples/controllers/job.yaml
job.batch/pi created

Examine the Job

$ kubectl get job pi -o yaml
apiVersion: batch/v1
kind: Job
metadata:
  # There is no ".ownerReferences" under "metadata".
  annotations:
    # ...omitted
    # You can see the modifications done by Kyverno here
    policies.kyverno.io/last-applied-patches: |
      add-ttl-to-dangling-job.add-ttl-to-dangling-job.kyverno.io: added /spec/ttlSecondsAfterFinished
  # ...omitted
  name: pi
  namespace: default
  # ...omitted
spec:
  # ...omitted
  # The following field is added by the ClusterPolicy
  ttlSecondsAfterFinished: 900

From what we can see here, the ClusterPolicy works as expected.
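
If you only care about that single field, a jsonpath query is quicker than reading the whole manifest:

$ kubectl get job pi -o jsonpath='{.spec.ttlSecondsAfterFinished}'
900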

Now, let’s make sure Kyverno doesn’t touch Jobs created by a CronJob.

Create a CronJob

Again, let’s just use the CronJob example from the Kubernetes documentation:

$ kubectl apply -f https://kubernetes.io/examples/application/job/cronjob.yaml
cronjob.batch/hello created

Examine the Job (created by CronJob)

The Job created by the CronJob will be named with a suffix. Get its name first.

$ kubectl get job
NAME             COMPLETIONS   DURATION   AGE
pi               1/1           36s        10m
hello-27623144   1/1           8s         32s
$ kubectl get job hello-27623144 -o yaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    # There is no Kyverno annotation here
    batch.kubernetes.io/job-tracking: ""
  # ...omitted
  name: hello-27623144
  namespace: default
  # This is the ".metadata.ownerReferences" we kept talking about before!
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: CronJob
    name: hello
    uid: d4910a7c-dc57-4563-8611-e6f58a1cb5e1
  # ...omitted
spec:
  # You won't see the ".spec.ttlSecondsAfterFinished" field here.
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 3dbaeff8-3163-4b32-9946-ececad06e965
  suspend: false
  template:
    # ...omitted

Cleanup

$ k3d cluster delete

Clean dangling Jobs manually one last time

The following gives you an idea of which Jobs are not owned by higher-level controllers (and have no active Pods):

$ kubectl get job -o json -A | jq -r '.items[] | select(.metadata.ownerReferences == null and .status.active == null) | .metadata.name'

To delete these Jobs:

$ kubectl get job -o json -A | jq -r '.items[] | select(.metadata.ownerReferences == null and .status.active == null) | "kubectl delete job -n " + .metadata.namespace + " " + .metadata.name' | xargs -I {} bash -c "{}"
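
If you’d rather preview what would be deleted before running the command above for real, the same pipeline works with a client-side dry run appended to each generated command:

$ kubectl get job -o json -A | jq -r '.items[] | select(.metadata.ownerReferences == null and .status.active == null) | "kubectl delete job -n " + .metadata.namespace + " " + .metadata.name + " --dry-run=client"' | xargs -I {} bash -c "{}"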

Conclusion

It’s pretty common that users are not aware of potential issues like massive numbers of dangling Jobs.

However, problems usually come from the areas no one pays attention to. At the end of the day, it’s still the admin’s job (no pun intended) to make sure things run as smoothly as possible.

Further Readings

Cover: https://unsplash.com/photos/znfc7DF7M7U


  1. It’s somewhat sad that you can’t expect everyone to know and implement graceful shutdown. ↩︎

  2. Well, it’s more like the company’s cluster. ↩︎
