Automatically Clean Up Dangling Jobs with Policy Engine

Preface
Last year, I was reading about PSP deprecation and started wondering what the solutions might look like going forward. Fortunately, there are already several policy engines available, such as OPA Gatekeeper and Kyverno.
With the help of a policy engine, not only can we ensure workloads comply with selected, predefined rules, but we can also enforce custom company policies like:
- Schedule workloads to spot instances based on certain criteria for better cost-saving, and put delicate ones to on-demand instances.
- Add `preStop` hooks for containers that have ports open (like ingress-nginx!)1.
- Patch image versions to leverage the local cache and speed things up (e.g., a fixed version for `amazon/aws-cli`).
- Restrict home-made services from exposing endpoints that are not ready at the moment (`publishNotReadyAddresses`).
- Restrict service load balancers.
- Restrict modifications to ingress annotations that try to use an arbitrary proxy buffer size.
- …and many, many more, without any other users’ interventions and/or modifications.
Policy engines are just fascinating. I also learned a few things from them by making my own admission webhooks. You should be able to achieve most requirements with policy engines alone, though.
Policies are meant to be enforced. Documents and meetings alone just won’t stop bad use of Kubernetes (intentional or unintentional).
Your cluster, your rules2.
`Jobs` that just won’t go away
We constantly maintain and improve our cluster policies because operational issues never end.
Recently, I noticed more and more dangling jobs floating around, and the number keeps increasing. Apparently, these jobs were created directly (whether by a service or a user) instead of by a controller like `CronJob`.
It’s not necessarily a bad practice – but the `Completed` (or `Error`ed) jobs just won’t disappear.
Fortunately, there is a TTL-after-finished Controller that can help.
To quote from the enhancement proposal:
Motivation
… it’s difficult for the users to clean them up automatically, and those Jobs and Pods can accumulate and overload a Kubernetes cluster very easily.
User Stories
The users keep creating Jobs in a small Kubernetes cluster with 4 nodes. The Jobs accumulates over time, and 1 year later, the cluster ended up with more than 100k old Jobs. This caused etcd hiccups, long high latency etcd requests, and eventually made the cluster unavailable.
The situation of our clusters is definitely nowhere close to 100k at this point. But I’ve seen 3k finished jobs in a really small cluster before, and that already made me feel terrified.
The answer to this problem seems very straightforward: just add `.spec.ttlSecondsAfterFinished` to your `Job` and it’s done.
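For instance, a minimal sketch of what that looks like (the Job name, image, and TTL value below are placeholders of mine, not from the original examples):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-task              # hypothetical Job name
spec:
  ttlSecondsAfterFinished: 86400  # delete the Job (and its Pods) 1 day after it finishes
  template:
    spec:
      containers:
      - name: task
        image: busybox:1.28
        command: ["sh", "-c", "echo done"]
      restartPolicy: Never
```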
But is it really that “happily ever after”?
Yes and no. You can’t expect that everyone who creates a `Job` directly will always set that field. So what should we do now?
Since this is a post about policy engines, well, let’s leverage the policy engine.
We will set `.spec.ttlSecondsAfterFinished` on a `Job` whenever there is no `.metadata.ownerReferences` defined (i.e., it’s not created by a controller like `CronJob`).
Prerequisites
- Kubernetes >= 1.12
  - The TTL-after-finished Controller’s feature state is `alpha` in 1.12, `beta` in 1.21, and `stable` in 1.23.
  - If you are using Amazon EKS like me, features are only available after they enter the `beta` feature state. That is, you can only use the TTL-after-finished Controller on Amazon EKS >= 1.21.
- Your policy engine of choice.
Here we will use Kyverno’s `ClusterPolicy` as an example, but you should be able to implement this with any other solution on the market.
Example `ClusterPolicy` for Kyverno
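The original code block was lost in extraction, so here is a sketch of what such a Kyverno mutation policy could look like. The rule name and the TTL value (86400 seconds) are my own placeholders; the policy name simply matches the `add-ttl-to-dangling-job.yaml` filename used below.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-ttl-to-dangling-job
spec:
  rules:
  - name: add-ttl-seconds-after-finished
    match:
      any:
      - resources:
          kinds:
          - Job
    # Only mutate Jobs with no ownerReferences, i.e. not created by a controller like CronJob
    preconditions:
      all:
      - key: "{{ request.object.metadata.ownerReferences[] || `[]` | length(@) }}"
        operator: Equals
        value: 0
    mutate:
      patchStrategicMerge:
        spec:
          # "+( )" add-if-not-present anchor: keep any TTL the user already set explicitly
          +(ttlSecondsAfterFinished): 86400
```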
Policy engine in action
It’s important to validate whether the policy actually works; let’s leverage k3d again.
Start k3d
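The original command isn’t shown here; a minimal sketch, assuming any recent k3d release (the cluster name is arbitrary):

```bash
# Spin up a throwaway single-node cluster for testing
k3d cluster create policy-test
```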
Install Kyverno with Helm Chart
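Something along the lines of the standard Helm-based install from the Kyverno docs should do (chart version pinning omitted here):

```bash
# Add the Kyverno Helm repository and install it into its own namespace
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno --namespace kyverno --create-namespace
```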
Apply the ClusterPolicy
First, save the `ClusterPolicy` above as a file, e.g. `add-ttl-to-dangling-job.yaml`.
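Then apply it, something like:

```bash
kubectl apply -f add-ttl-to-dangling-job.yaml

# Confirm the policy is registered
kubectl get clusterpolicy
```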
Create a `Job` directly
You can use the Job example from the Kubernetes documentation:
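The pi-computation Job from the official docs is a convenient choice here (the exact image tag may differ depending on the docs version):

```yaml
# pi-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
```

Apply it with `kubectl apply -f pi-job.yaml`. Note that there is no `ttlSecondsAfterFinished` in the manifest itself – whatever shows up later comes from the policy.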
Examine the Job
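A quick way to check is to read the field back from the API server; with the sketch policy above (86400 is my placeholder TTL), you should see the value Kyverno injected:

```bash
# The manifest didn't set a TTL, so any value here was added by the mutation policy
kubectl get job pi -o jsonpath='{.spec.ttlSecondsAfterFinished}'
# Expected output (with the sketch policy above): 86400
```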
So, from what we can see here, we know the ClusterPolicy actually works as expected.
Now, let’s make sure Kyverno doesn’t touch a `Job` created by a `CronJob`.
Create a CronJob
Again, let’s just use the CronJob example from the Kubernetes documentation:
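This is roughly the “hello” CronJob from the docs; it spawns a Job every minute:

```yaml
# hello-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
```

Apply it with `kubectl apply -f hello-cronjob.yaml` and wait a minute or so for the first Job to appear.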
Examine the `Job` (created by `CronJob`)
The `Job` created by the `CronJob` will be named with a generated suffix, so get the name first.
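A sketch of how to pick out the spawned Job and verify that the policy left it alone (it has an `ownerReferences` entry pointing back at the CronJob):

```bash
kubectl get jobs

# Grab the first Job spawned by the "hello" CronJob
JOB=$(kubectl get jobs -o name | grep 'hello-' | head -n 1)

# Expect empty output: the policy's precondition skips Jobs that have ownerReferences
kubectl get "$JOB" -o jsonpath='{.spec.ttlSecondsAfterFinished}'
```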
Cleanup
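Tear the test cluster down when you’re done (the cluster name matches whatever you used earlier):

```bash
k3d cluster delete policy-test
```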
Clean dangling `Jobs` manually for one last time
The following gives you an idea of which `Jobs` are not owned by higher-level controllers:
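One way to do it, assuming `jq` is available (the policy only mutates Jobs created from now on, so existing ones still need a manual sweep):

```bash
# List Jobs in every namespace that have no ownerReferences at all
kubectl get jobs --all-namespaces -o json \
  | jq -r '.items[]
           | select(.metadata.ownerReferences == null)
           | "\(.metadata.namespace)/\(.metadata.name)"'
```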
To delete these `Jobs`:
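A sketch building on the same `jq` filter; review the list from the previous step before running it:

```bash
# Delete every Job that has no ownerReferences, namespace by namespace
kubectl get jobs --all-namespaces -o json \
  | jq -r '.items[]
           | select(.metadata.ownerReferences == null)
           | "\(.metadata.name) --namespace \(.metadata.namespace)"' \
  | xargs -L1 kubectl delete job
```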
Conclusion
It’s pretty common that users are not aware of potential issues like massive numbers of dangling jobs.
However, problems usually come from the areas where no one pays attention. At the end of the day, it’s still the admin’s job (no pun intended) to make sure things run as smoothly as possible.