Automatically Clean Up Dangling Jobs with Policy Engine

Preface

Last year, I was reading about the PSP deprecation and started wondering what the future solutions might be. Fortunately, there are already several policy engines available, such as OPA Gatekeeper and Kyverno.

With the help of a policy engine, not only can we ensure workloads comply with selected, predefined rules, but we can also implement custom company policies like:

  • Schedule workloads to spot instances based on certain criteria for better cost savings, and put delicate ones on on-demand instances.
  • Add preStop hooks for containers that have ports open (like ingress-nginx!)1; a rough sketch follows this list.
  • Patch image versions to leverage the local cache and speed things up (e.g., a fixed version for amazon/aws-cli).
  • Restrict home-made services from exposing endpoints that are not ready yet (publishNotReadyAddresses).
  • Restrict service load balancers.
  • Restrict modifications to Ingress annotations that try to set an arbitrary proxy buffer size.
  • …and many, many more, without requiring any intervention or modification from users.
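
To give a flavor of what these look like in practice, here is a rough, untested sketch of the preStop idea above, written as a Kyverno ClusterPolicy. The policy name, the sleep duration, and the decision to target every container of a Pod (rather than only containers with open ports) are illustrative assumptions, not the exact policy we run:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-prestop-sleep   # hypothetical name, for illustration only
spec:
  rules:
    - name: add-prestop-sleep
      match:
        resources:
          kinds:
            - Pod
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              # The (name): "*" conditional anchor matches every container;
              # the +() anchor only adds the lifecycle block when the
              # container doesn't define one already.
              - (name): "*"
                +(lifecycle):
                  preStop:
                    exec:
                      command: ["sleep", "15"]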

Policy engines are just fascinating. I also learned a few things by writing my own admission webhooks, though you should be able to achieve most requirements with a policy engine alone.

Policies are meant to be enforced. Documents and meetings alone just won’t stop bad uses of Kubernetes (intentional or unintentional).

Your cluster, your rules2.

Jobs that just won’t go away

We constantly maintain and improve our cluster policies because operation issues never end.

Recently, I noticed more and more dangling Jobs floating around, and the number kept increasing. Apparently, these Jobs were created directly (whether by a service or a user) instead of by a controller like CronJob.

It’s not necessarily a bad practice, but the Completed (or Failed) Jobs just won’t disappear on their own.

Fortunately, there is a TTL-after-finished Controller that can help.

To quote from the enhancement proposal:

Motivation

… it’s difficult for the users to clean them up automatically, and those Jobs and Pods can accumulate and overload a Kubernetes cluster very easily.

User Stories

The users keep creating Jobs in a small Kubernetes cluster with 4 nodes. The Jobs accumulate over time, and 1 year later, the cluster ended up with more than 100k old Jobs. This caused etcd hiccups, long high latency etcd requests, and eventually made the cluster unavailable.

Our clusters are definitely nowhere near 100k at this point, but I’ve seen 3k finished Jobs in a really small cluster before, and even that was terrifying.

The answer to this problem seems very straightforward: just add .spec.ttlSecondsAfterFinished to your Job and it’s done.
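
For reference, here is a minimal Job manifest with the field in place (the name, image, and 900-second TTL are placeholder choices):

apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-task   # hypothetical name
spec:
  # Delete this Job (and its Pods) 15 minutes after it finishes,
  # regardless of whether it succeeded or failed.
  ttlSecondsAfterFinished: 900
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox:1.36
          command: ["sh", "-c", "echo done"]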

But is it really that “happily ever after”?

Yes and no. You can’t expect that everyone who creates a Job directly will always set that field. So what should we do now?

Since this is a post about policy engines, yeah, let’s leverage the policy engine.

We will set .spec.ttlSecondsAfterFinished on a Job whenever it has no .metadata.ownerReferences defined (i.e., it was not created by a controller like CronJob).

Prerequisites

  • Kubernetes >= 1.12 (before 1.21 you need to enable the TTLAfterFinished feature gate; it’s on by default since 1.21 and GA since 1.23)
  • Your policy engine of choice.

Here we will use Kyverno’s ClusterPolicy as an example, but you should be able to implement it with any other solution on the market.

Example ClusterPolicy for Kyverno

Info
Special thanks to Chip Zoller from Nirmata for the hint about preconditions!
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  annotations:
    policies.kyverno.io/title: Add TTL to dangling Job
    policies.kyverno.io/category: The cool company policy collection
    policies.kyverno.io/description: >-
      Automatically clean dangling jobs by adding TTL to spec.      
  name: add-ttl-to-dangling-job
spec:
  background: false
  failurePolicy: Ignore
  validationFailureAction: enforce
  rules:
    - name: add-ttl-to-dangling-job
      match:
        resources:
          kinds:
            # We only deal with Job in this policy
            - "Job"
      preconditions:
        any:
          # If the Job was created by a CronJob, it will have a ".metadata.ownerReferences" field,
          # which is an array. The exact value doesn't really matter here;
          # we just want to know whether the field exists.
          #
          # The following line says:
          # if there is no ".metadata.ownerReferences", fall back to an empty string ('').
          - key: "{{ request.object.metadata.ownerReferences || '' }}"
            operator: Equals
            # And if the value is an empty string, there is no ".metadata.ownerReferences".
            # That's the kind of Job on which we want to set ".spec.ttlSecondsAfterFinished".
            value: ''
      mutate:
        patchStrategicMerge:
          spec:
            # Add ".spec.ttlSecondsAfterFinished" (only when it's not specified),
            # so the Job will be deleted 15 minutes after completion.
            # Set to the value you want.
            +(ttlSecondsAfterFinished): 900

Policy engine in action

It’s important to validate whether the policy actually works; let’s leverage k3d again.

Start k3d

$ k3d cluster create
# ...omitted
INFO[0008] Starting Node 'k3d-k3s-default-serverlb'
INFO[0015] Injecting records for hostAliases (incl. host.k3d.internal) and for 3 network members into CoreDNS configmap...
INFO[0017] Cluster 'k3s-default' created successfully!
# ...omitted

Install Kyverno with Helm Chart

$ helm repo add kyverno https://kyverno.github.io/kyverno/
$ helm repo update
$ helm install kyverno kyverno/kyverno --namespace kyverno --create-namespace
NAME: kyverno
LAST DEPLOYED: Sun Jul 10 01:25:23 2022
# ...omitted
Thank you for installing kyverno! Your release is named kyverno.
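
Before applying any policy, it doesn’t hurt to confirm the admission controller is actually up; the pod names and count depend on the chart version you installed:

$ kubectl get pods -n kyverno
# Wait until the Kyverno pod(s) report STATUS "Running" before moving on.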

Apply the ClusterPolicy

First, save the ClusterPolicy above as a file, e.g. add-ttl-to-dangling-job.yaml.

$ kubectl apply -f add-ttl-to-dangling-job.yaml
# Or if you are feeling lazy, use the following command:
# Caution: Always check what's in the file first before applying anything!
$ kubectl apply -f https://blog.wtcx.dev/2022/07/09/automatically-clean-up-dangling-jobs-with-policy-engine/add-ttl-to-dangling-job.yaml
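
You can quickly confirm the policy was accepted (Kyverno also registers the short name cpol for ClusterPolicy):

$ kubectl get clusterpolicy add-ttl-to-dangling-job
# The policy should be listed; if it isn't, check the Kyverno logs.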

Create Job directly

You can use the Job example from the Kubernetes documentation:

$ kubectl apply -f https://kubernetes.io/examples/controllers/job.yaml
job.batch/pi created

Examine the Job

$ kubectl get job pi -o yaml
apiVersion: batch/v1
kind: Job
metadata:
  # There is no ".ownerReferences" under "metadata".
  annotations:
    # ...omitted
    # You can see the modifications done by Kyverno here
    policies.kyverno.io/last-applied-patches: |
      add-ttl-to-dangling-job.add-ttl-to-dangling-job.kyverno.io: added /spec/ttlSecondsAfterFinished
  # ...omitted
  name: pi
  namespace: default
  # ...omitted
spec:
  # ...omitted
  # The following field is added by the ClusterPolicy
  ttlSecondsAfterFinished: 900

From what we can see here, the ClusterPolicy works as expected.
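
If you only care about that single field, a jsonpath query is quicker than reading the whole manifest:

$ kubectl get job pi -o jsonpath='{.spec.ttlSecondsAfterFinished}'
900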

Now, let’s make sure Kyverno doesn’t touch Jobs created by a CronJob.

Create a CronJob

Again, let’s just use the CronJob example from the Kubernetes documentation:

$ kubectl apply -f https://kubernetes.io/examples/application/job/cronjob.yaml
cronjob.batch/hello created

Examine the Job (created by CronJob)

The Job created by the CronJob will be named with a suffix. Get its name first.

$ kubectl get job
NAME             COMPLETIONS   DURATION   AGE
pi               1/1           36s        10m
hello-27623144   1/1           8s         32s
$ kubectl get job hello-27623144 -o yaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    # There is no Kyverno annotation here
    batch.kubernetes.io/job-tracking: ""
  # ...omitted
  name: hello-27623144
  namespace: default
  # This is the ".metadata.ownerReferences" we kept talking about before!
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: CronJob
    name: hello
    uid: d4910a7c-dc57-4563-8611-e6f58a1cb5e1
  # ...omitted
spec:
  # You won't see the ".spec.ttlSecondsAfterFinished" field here.
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 3dbaeff8-3163-4b32-9946-ececad06e965
  suspend: false
  template:
    # ...omitted

Cleanup

$ k3d cluster delete

Clean dangling Jobs manually one last time

The following gives you an idea of which Jobs are not owned by higher-level controllers (and have no active Pods):

$ kubectl get job -o json -A | jq -r '.items[] | select(.metadata.ownerReferences == null and .status.active == null) | .metadata.name'

To delete these Jobs:

$ kubectl get job -o json -A | jq -r '.items[] | select(.metadata.ownerReferences == null and .status.active == null) | "kubectl delete job -n " + .metadata.namespace + " " + .metadata.name' | xargs -I {} bash -c "{}"
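
If you’d rather preview what would be deleted before running the command above for real, the same pipeline works with a client-side dry run appended to each generated command:

$ kubectl get job -o json -A | jq -r '.items[] | select(.metadata.ownerReferences == null and .status.active == null) | "kubectl delete job -n " + .metadata.namespace + " " + .metadata.name + " --dry-run=client"' | xargs -I {} bash -c "{}"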

Conclusion

It’s pretty common that users are not aware of potential issues like massive numbers of dangling Jobs.

However, problems usually come from the areas no one pays attention to. At the end of the day, it’s still the admin’s job (no pun intended) to make sure things run as smoothly as possible.

Further Readings

Cover: https://unsplash.com/photos/znfc7DF7M7U


  1. It’s somewhat sad that you can’t expect everyone to know and implement graceful shutdown. ↩︎

  2. Well, it’s more like the company’s cluster. ↩︎
