After dealing with the vpc-resource-controller, I could finally see the IIS page. But a running sample doesn’t prove much, so I put together a few deployment YAML files to see whether our own workloads would work.
To schedule a Windows workload correctly, we need to select nodes whose os label is set to windows. Otherwise, the pod could be scheduled on a Linux worker node and just get stuck there.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: beta.kubernetes.io/os
          operator: In
          values:
          - windows
I applied the YAML file, but the pod never started. Situations like this are really frustrating.
But we still had to find out why, so:
$ k describe pod <poor-pod>
(...omitted)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m4s default-scheduler Successfully assigned fake-namespace/fake-pod-name to ip-10-xx-xx-xx.ap-northeast-1.compute.internal
Warning FailedCreatePodSandBox 3m58s kubelet, ip-10-xx-xx-xx.ap-northeast-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "0a32725f52780fbb1099efee02d1da4523981e12be0987a694ba38acf48be829" network for pod "fake-pod-name": NetworkPlugin cni failed to set up pod "fake-pod-name_fake-namespace" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address, failed to clean up sandbox container "0a32725f52780fbb1099efee02d1da4523981e12be0987a694ba38acf48be829" network for pod "fake-pod-name": NetworkPlugin cni failed to teardown pod "fake-pod-name_fake-namespace" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address]
Normal SandboxChanged 8s (x16 over 3m55s) kubelet, ip-10-xx-xx-xx.ap-northeast-1.compute.internal Pod sandbox changed, it will be killed and re-created.
Hmm, the error NetworkPlugin cni failed to teardown pod "fake-pod-name_fake-namespace" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address seems a little bit familiar. Didn’t we fix a similar issue before?
Oddly enough, when I checked the previous sample IIS workload, it still worked as usual. I had to start the Google journey all over again.
A few hours later, I still had no idea. So I switched to the tab with the Windows support document and checked the deployment manifest again.
I compared that YAML with mine, which I had slightly modified from the output of the following command:
$ k run <deployment-name> --image <image> --dry-run -o yaml
- Was my manifest legit? Of course, otherwise it couldn’t have been applied in the first place.
- Did I specify imagePullPolicy? No, it’s not needed.
- Did I expose the containerPort? No, I don’t even need to expose any ports.
- Did I need to use command? No, I just don’t need it.
- Did I use node selector? No, I used node affinity for scheduling.
- Did I add additional things? Yes, I had to add imagePullSecrets to pull the image from our private docker registry.
The only difference left would be the image. But when I logged in to that Windows EC2 instance and manually ran the container with docker run, it just worked. Besides, it could hardly be the container’s issue, since the pod never came up.
So I tried adding the fields back one by one to see what would happen.
imagePullPolicy? Check. containerPort? command? No, I couldn’t even convince myself to add it; it simply made no sense.
Node Selector? OK, let me just replace node affinity with node selector, although the effect here is basically the same.
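For reference, the node selector form of the same constraint looks roughly like this. It is only a sketch: it assumes the same beta.kubernetes.io/os label used in the affinity block earlier, and that it sits under the pod template’s spec in place of the affinity section.

```yaml
# Sketch only: placed under spec.template.spec of the Deployment,
# replacing the affinity block shown earlier.
nodeSelector:
  beta.kubernetes.io/os: windows
```

Both forms ask the scheduler for nodes whose os label is windows; the only change is which field carries the request.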
I then re-applied the deployment manifest.
And the pod started. 🤦♂️
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8s default-scheduler Successfully assigned fake-namespace/fake-pod-name to ip-10-xx-xx-xx.ap-northeast-1.compute.internal
Normal Pulled 7s kubelet, ip-10-xx-xx-xx.ap-northeast-1.compute.internal Container image "fake-registry/image:fake-tag" already present on machine
Normal Created 7s kubelet, ip-10-xx-xx-xx.ap-northeast-1.compute.internal Created container fake-container-name
Normal Started 6s kubelet, ip-10-xx-xx-xx.ap-northeast-1.compute.internal Started container fake-container-name
So, that was it. You can’t use Node Affinity…for now. You can only use node selector.
To confirm this, I opened another support ticket. The support engineer verified the issue and reported it to the internal team.
Update 2020/01/11:
Apparently, this is a “common feature request”. And yet this information is nowhere to be seen.