Automatically Clean Up Dangling Jobs with Policy Engine
Last year, I was reading about the PSP (PodSecurityPolicy) deprecation and started wondering what would replace it. Fortunately, several policy engines, such as OPA Gatekeeper and Kyverno, are already available.
With the help of a policy engine, we can not only ensure that workloads comply with selected, predefined rules, but also implement custom company policies like:
- Schedule workloads to spot instances based on certain criteria to save costs, and keep delicate ones on on-demand instances.
- Add preStop hooks for containers that have ports open (like ingress-nginx!)1.
- Patch image versions to leverage the local cache and speed things up (e.g., a fixed version for …).
- Restrict home-made services from exposing endpoints that are not ready at the moment (…).
- Restrict service load balancers.
- Restrict modifications to Ingress annotations that try to set an arbitrary proxy buffer size.
- …and many, many more, without requiring users to intervene or change anything on their side.
Policy engines are just fascinating. I also learned a few things by writing my own admission webhooks, though you should be able to cover most requirements with a policy engine alone.
Policies are meant to be enforced. Documents and meetings alone just won’t stop misuse of Kubernetes, intentional or not.
Your cluster, your rules2.
Jobs that just won’t go away
We constantly maintain and improve our cluster policies because operational issues never end.
Recently, I noticed that more and more dangling Jobs were piling up. Apparently, these Jobs were created directly (whether by a service or by a user) instead of by a controller like CronJob.
That is not necessarily bad practice, but finished Jobs (including Errored ones) just won’t disappear.
Fortunately, there is a TTL-after-finished Controller that can help.
To quote from the enhancement proposal:
… it’s difficult for the users to clean them up automatically, and those Jobs and Pods can accumulate and overload a Kubernetes cluster very easily.
The users keep creating Jobs in a small Kubernetes cluster with 4 nodes. The Jobs accumulates over time, and 1 year later, the cluster ended up with more than 100k old Jobs. This caused etcd hiccups, long high latency etcd requests, and eventually made the cluster unavailable.
The situation of our clusters is definitely nowhere close to 100k at this point. But I’ve seen 3k finished jobs in a really small cluster before, and that already made me feel terrified.
The answer to this problem seems very straightforward: just add .spec.ttlSecondsAfterFinished to your Job and it’s done.
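As a minimal sketch, a Job that cleans up after itself could look like the following. The file name job-with-ttl.yaml, the Job name and the 100-second TTL are illustrative assumptions; the container spec is borrowed from the pi example in the Kubernetes documentation.

```shell
# Write a Job manifest that carries its own TTL (all names and values are illustrative).
cat > job-with-ttl.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  # Ask the TTL controller to delete this Job (and its Pods)
  # 100 seconds after it has finished.
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
EOF
```

Applying it with `kubectl apply -f job-with-ttl.yaml` should give you a Job that is removed, together with its Pods, roughly 100 seconds after it completes or fails.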
But is it really that “happily ever after”?
Yes and no. You can’t expect everyone who creates a Job directly to always set that field. So what should we do now?
Since this post is about policy engines, let’s leverage one.
We will set .spec.ttlSecondsAfterFinished on a Job whenever it has no .metadata.ownerReferences defined (i.e. it was not created by a controller like CronJob).
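You will need:
- Kubernetes >= 1.12, the release where the TTL-after-finished controller first appeared as an alpha feature. On versions before 1.21 the TTLAfterFinished feature gate has to be enabled explicitly.
- Your policy engine of choice.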
Here we will use Kyverno’s ClusterPolicy as an example, but you should be able to implement the same thing with any other solution on the market.
ClusterPolicy for Kyverno
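A minimal sketch of such a policy is shown below. It assumes a reasonably recent Kyverno release (the `match.any` / `preconditions.any` layout and the `+( )` add-if-absent anchor depend on the version), and the policy name, rule name, file name and the 900-second TTL are all illustrative choices rather than required values.

```shell
# Save the policy to a file. The quoted 'EOF' keeps the backticks and braces literal.
cat > add-ttl-jobs.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-ttl-jobs
spec:
  rules:
  - name: add-ttl-seconds-after-finished
    match:
      any:
      - resources:
          kinds:
          - Job
    preconditions:
      any:
      # Only mutate Jobs without an owner, i.e. Jobs that were not
      # created by a controller such as CronJob.
      - key: "{{ request.object.metadata.ownerReferences || `[]` }}"
        operator: Equals
        value: []
    mutate:
      patchStrategicMerge:
        spec:
          # "+( )" adds the field only if the Job does not set it already.
          +(ttlSecondsAfterFinished): 900
EOF
```

Once Kyverno is running in the cluster, `kubectl apply -f add-ttl-jobs.yaml` should register the policy. Because of the add-if-absent anchor, a Job that already declares its own TTL keeps that value.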
Policy engine in action
It’s important to validate whether the policy actually works; let’s leverage k3d again.
Install Kyverno with Helm Chart
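The installation could look roughly like this, assuming `k3d`, `helm` and `kubectl` are on your PATH. The cluster name, release name and namespace are illustrative, and chart options may differ between Kyverno versions.

```shell
# Namespace for the Kyverno release (an illustrative choice).
KYVERNO_NAMESPACE=kyverno

# Throwaway local cluster for the experiment (the cluster name is hypothetical).
k3d cluster create kyverno-demo

# Add the Kyverno chart repository and install the chart into its own namespace.
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno --namespace "$KYVERNO_NAMESPACE" --create-namespace

# Check that the Kyverno Pods are up before applying any policy.
kubectl --namespace "$KYVERNO_NAMESPACE" get pods
```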
First, save the ClusterPolicy above as a file and apply it.
You can use the Job example from the Kubernetes documentation:
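A sketch of that test, assuming the ClusterPolicy has already been applied and using job.yaml as a hypothetical file name. The manifest itself deliberately sets no TTL.

```shell
# The pi Job from the Kubernetes documentation; note that it sets no TTL.
cat > job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
EOF

kubectl apply -f job.yaml

# Print the TTL that ended up on the stored object.
kubectl get job pi -o jsonpath='{.spec.ttlSecondsAfterFinished}{"\n"}'
```

If the policy is active, the last command should print the TTL configured in the policy, even though job.yaml never mentions the field.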
So, from what we can see here, we know the ClusterPolicy actually works as expected.
Now, let’s make sure Kyverno doesn’t touch the Job created by CronJob.
Again, let’s just use the CronJob example from the Kubernetes documentation:
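For reference, the CronJob example could be applied as follows. This sketch assumes a cluster that serves CronJob under `batch/v1` (older clusters use `batch/v1beta1`), and cronjob.yaml is a hypothetical file name.

```shell
# The hello CronJob from the Kubernetes documentation; it fires every minute.
cat > cronjob.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
EOF

kubectl apply -f cronjob.yaml
```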
Job (created by CronJob)
The Job created by the CronJob gets a generated suffix in its name, so look up the name first.
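One way to do that lookup is a small helper like the one below. `first_job_of` is a hypothetical name, and the sketch assumes the CronJob is called hello and lives in the current namespace.

```shell
# Hypothetical helper: print the first Job whose name starts with "<cronjob>-".
# `kubectl get jobs -o name` prints one "job.batch/<name>" per line.
first_job_of() {
  kubectl get jobs -o name | grep "^job.batch/$1-" | head -n 1
}

# Example (needs a cluster where the "hello" CronJob has already fired):
#   job="$(first_job_of hello)"
#   kubectl get "$job" -o jsonpath='{.metadata.ownerReferences[0].kind}{"\n"}'
#   kubectl get "$job" -o jsonpath='{.spec.ttlSecondsAfterFinished}{"\n"}'
```

The first query should report CronJob as the owner, and the second should print an empty line if Kyverno left the Job alone.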
If you need to clean up these Jobs by hand one last time…
The following gives you an idea of which Jobs are not owned by higher-level controllers:
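A possible way to list them with nothing but `kubectl` and `awk` is sketched here. It relies on `kubectl` printing `<none>` in a custom column when the field is missing; `list_unowned_jobs` is a hypothetical helper name.

```shell
# Print "namespace/name" for every Job that has no ownerReferences.
list_unowned_jobs() {
  kubectl get jobs --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[*].kind' |
    awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Example (needs access to a cluster):
#   list_unowned_jobs
```

Note that this lists every unowned Job, including ones that are still running, so review the output before acting on it.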
To delete these Jobs:
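A cautious sketch of the clean-up, using a hypothetical helper named `delete_unowned_jobs`. It selects Jobs purely by the absence of ownerReferences, so it would also remove unowned Jobs that are still running.

```shell
# Delete every Job that has no ownerReferences, one namespace/name pair at a time.
# Extra arguments are passed through to `kubectl delete`.
delete_unowned_jobs() {
  kubectl get jobs --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[*].kind' |
    awk '$3 == "<none>" { print $1, $2 }' |
    while read -r ns name; do
      # </dev/null keeps kubectl from consuming the loop's input.
      kubectl delete job --namespace "$ns" "$name" "$@" </dev/null
    done
}

# Example (needs access to a cluster); try a dry run first:
#   delete_unowned_jobs --dry-run=client
#   delete_unowned_jobs
```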
It’s pretty common for users to be unaware of potential issues like a massive pile of dangling Jobs.
However, problems usually come from the areas no one pays attention to. At the end of the day, it’s still the admin’s job (no pun intended) to make sure things run as smoothly as possible.
- PodSecurityPolicy Deprecation: Past, Present, and Future
- Writing Policies
- Automatic Clean-up for Finished Jobs