The Long Way to Windows Container on Amazon EKS: VPC Resource Controller
Amazon EKS, previously called Amazon Elastic Container Service for Kubernetes, is AWS's managed Kubernetes service. We have had some services (Linux containers) running in production environments for a while. It is battle-tested and works well, if you ask me.
Because several of our Windows-based services are unlikely to run on other platforms in the foreseeable future, we have been discussing the feasibility of running Windows Containers on EKS.
I am aware that Amazon announced that EKS Windows Container support became generally available last October; however, I hadn’t tried it until now.
To find out whether it works as we expect, I fired up an EKS cluster (version 1.14), then followed the Windows Support guide to add…well, Windows support.
There are a few ways to do it, be it eksctl or going step by step with the commands they provide. I chose the latter so I would at least know what I was doing.
However, here comes the plot twist. I applied the IIS sample YAML, but the pod never came up.
When I described the pod, there were events like:
Strangely, I had never seen errors like Insufficient vpc.amazonaws.com/PrivateIPv4Address before. I went to the EC2 console and checked the Windows worker instance: apart from the primary private IP, there were no private IP addresses attached to that instance, though there should have been.
So yeah, it looked like the only way out would be everybody’s best friend, Google. At that time, there were only a few GitHub issues, and they remained open. I found one interesting blog post that seemed helpful, but it was written in Russian, so I could only comprehend the message from AWS’ support engineer.
The support engineer from AWS in that post said the private subnets that the worker nodes were using were not associated with any route tables. So I checked our VPC console, and the subnets we were using were all associated with the main route table! I thought maybe we were having different issues and just moved on.
Digging into the logs of the vpc-resource-controller that I had deployed while following the Windows Support guide, I found it throwing errors:
I noticed the “failed to find” messages and thought, maybe there was a DNS resolution issue? I then exec'd into that pod to see what was going on:
So, it wasn’t a DNS resolution issue. Out of ideas, I opened a support ticket. After some back-and-forth, the support engineer couldn’t reproduce the problem and suggested that I redeploy vpc-resource-controller.
And I checked logs again:
That was new and looked somehow related to…route tables. But the errors reverted to the previous ones after I deleted the pod and let it be re-created.
Finally, the support engineer confirmed that they were able to reproduce the issue and mentioned that our private subnets weren’t associated with any route table.
I thought I had checked that before. So I went to the VPC console and checked again, and it looked like these subnets were associated!
The support engineer also suggested that I use the AWS CLI to check:
But the output contained only an empty array:
The support engineer pointed out that subnets without an explicitly associated route table implicitly use the main route table (I should’ve read the documentation, though).
…You can explicitly associate a subnet with a particular route table. Otherwise, the subnet is implicitly associated with the main route table.
The VPC console usually shows the route table that a subnet is currently using. That doesn’t necessarily mean the subnet is explicitly associated with that route table.
And you guessed it: the vpc-resource-controller uses the DescribeRouteTables API, and it couldn’t find the route table, just like the result above, hence the errors.
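To make the implicit versus explicit distinction concrete, here is a minimal Python sketch. It assumes data shaped like the RouteTables list in a DescribeRouteTables response for a single VPC. The function name and every ID in it are made up for illustration, and it is not how the vpc-resource-controller itself is written.

```python
from typing import Optional, Tuple


def effective_route_table(route_tables: list, subnet_id: str) -> Tuple[Optional[str], bool]:
    """Return (route_table_id, is_explicit) for a subnet.

    route_tables is assumed to be the "RouteTables" list of a
    DescribeRouteTables response covering one VPC.
    """
    main_table_id = None
    for table in route_tables:
        for assoc in table.get("Associations", []):
            # An explicit association names the subnet directly.
            if assoc.get("SubnetId") == subnet_id:
                return table["RouteTableId"], True
            # Remember the main route table as the implicit fallback.
            if assoc.get("Main"):
                main_table_id = table["RouteTableId"]
    # No explicit association found: the subnet implicitly uses the main table.
    return main_table_id, False


# Hypothetical sample data: one main route table and one explicitly
# associated public subnet. "subnet-private" appears nowhere.
SAMPLE_ROUTE_TABLES = [
    {
        "RouteTableId": "rtb-main",
        "Associations": [{"Main": True, "RouteTableId": "rtb-main"}],
    },
    {
        "RouteTableId": "rtb-public",
        "Associations": [
            {"Main": False, "RouteTableId": "rtb-public", "SubnetId": "subnet-public"}
        ],
    },
]

if __name__ == "__main__":
    print(effective_route_table(SAMPLE_ROUTE_TABLES, "subnet-public"))   # ('rtb-public', True)
    print(effective_route_table(SAMPLE_ROUTE_TABLES, "subnet-private"))  # ('rtb-main', False)
```

A lookup that only matches on SubnetId finds nothing for the second subnet, which is the situation our private subnets were in: the console shows the main route table, while a query by subnet comes back empty.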
Let’s associate the subnets with the main route table like the support engineer suggested:
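The same fix can be sketched in Python, assuming a boto3 EC2 client is passed in. The function name and the IDs in the usage comment are hypothetical, and pagination is ignored for brevity, so treat it as a starting point, not a finished tool.

```python
def associate_with_main_route_table(ec2, vpc_id, subnet_ids):
    """Explicitly associate subnets with the VPC's main route table.

    ec2 is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Subnets that already have an explicit association are left alone.
    Returns the list of subnet IDs that were newly associated.
    """
    # Look up the main route table of the VPC.
    main_tables = ec2.describe_route_tables(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "association.main", "Values": ["true"]},
        ]
    )["RouteTables"]
    if not main_tables:
        raise RuntimeError(f"no main route table found for {vpc_id}")
    main_table_id = main_tables[0]["RouteTableId"]

    newly_associated = []
    for subnet_id in subnet_ids:
        # An empty result here means the subnet has no explicit association.
        explicit = ec2.describe_route_tables(
            Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
        )["RouteTables"]
        if explicit:
            continue
        ec2.associate_route_table(RouteTableId=main_table_id, SubnetId=subnet_id)
        newly_associated.append(subnet_id)
    return newly_associated


# Usage (requires boto3 and AWS credentials; the IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   associate_with_main_route_table(ec2, "vpc-EXAMPLE", ["subnet-EXAMPLE"])
```

Taking the client as a parameter keeps the function easy to dry-run against a stub before pointing it at a real account.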
After that, vpc-resource-controller didn’t show errors any more and the IIS server pod was finally working.
At the time of writing, you must explicitly associate the subnets you are using with the route table you need, otherwise vpc-resource-controller will fail.
I am not sure whether they will modify the vpc-resource-controller, but this is something you need to be aware of for now if you need to schedule Windows workloads on Amazon EKS.