Recently, we started migrating our EKS 1.16 clusters to brand new 1.19 ones.
Of all the changes, I am most excited about NodeLocal DNSCache. Not only did it greatly reduce DNS query latency (mostly caused by those search domains), it also mitigated issues like conntrack races, all at very little expense.
The hidden problem#
Networking is hard, especially in Kubernetes. DNS is often one of the problems.
Normally, no matter how many CoreDNS pods you run, there is always a chance your DNS queries travel across instances, or even worse, availability zones. This increases latency and the chance of failure, and hurts the performance of apps that need high throughput.
For instance, applications like log shippers and push notification workers tend to send many requests in short periods, and the volume only grows when they are under pressure. If you check CoreDNS' logs, you will notice a lot of NXDOMAIN results caused by the search domains. Although these results get cached, the time wasted travelling between nodes and zones is still expensive.
I've reduced ndots for cluster-level log shippers (a sketch of that follows below), but it's impractical to ask everyone else to do the same.
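For reference, lowering ndots is just a dnsConfig tweak on the pod spec. Here is a minimal sketch; the workload name, image and value are only an illustration:
apiVersion: v1
kind: Pod
metadata:
  name: log-shipper                        # hypothetical workload
spec:
  containers:
    - name: shipper
      image: example/log-shipper:latest    # placeholder image
  dnsConfig:
    options:
      - name: ndots
        value: "1"   # names containing a dot are tried as absolute first, skipping most search-domain lookups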
Install the cluster addon#
If I remember correctly, GKE has a simple checkbox to install NodeLocal DNSCache. But since we are using AWS, we apparently don't deserve that option. Anyway, it's still pretty easy to install if you follow the instructions.
It creates a daemonset, a configmap and a service account for node-local-dns, plus services to expose itself and the upstream CoreDNS. You don't need to change any of your workloads to benefit from this. The daemonset manipulates iptables rules (in a good way) to intercept DNS queries destined for the CoreDNS service cluster IP and check whether the answer is already cached locally.
It feels great, just like back when you were using VMs and never imagined you would one day worry about DNS queries being too slow.
Wait, I can’t connect to services with a stub domain?#
There is always a “but”. We have some “special” domains that need a stub domain block in the Corefile to resolve, like:
apiVersion: v1
kind: ConfigMap
data:
  Corefile: |
    .:53 {
      # ...(omitted)
    }
    some.corp:53 {
      cache 30
      forward . 10.1.2.3 10.4.5.6
    }
It always worked before, and I naively thought it would continue to work after I installed NodeLocal DNSCache, until a colleague told me that he couldn’t connect to a certain database from the new cluster.
When something can't connect, I usually check DNS first and reach for tools like nc later. And since we are all in this article, of course, it's a DNS issue.
I was dumbfounded and wondered how that was possible. If the cache misses, it will still ask the upstream CoreDNS, won't it?
I removed NodeLocal DNSCache, and the stub domain resolved again. I applied NodeLocal DNSCache once more, and it stopped resolving. After that, I started searching around, hoping my mentors Google and StackOverflow would shed some light.
There was a commit that added support for kube-dns' stubDomains. I immediately checked the daemonset manifest and found there is indeed an optional volume mounting the kube-dns configmap. I tried pointing it at the coredns configmap instead and hoped it would work.
It doesn’t.
Why? Because the stub domain format is different between kube-dns and CoreDNS, as you can see in the “CoreDNS configuration equivalent to kube-dns” document:
apiVersion: v1
data:
  stubDomains: |
    {"abc.com" : ["1.2.3.4"], "my.cluster.local" : ["2.3.4.5"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]
kind: ConfigMap
This is what kube-dns' stubDomains looks like, and the CoreDNS format is already presented above. Of course it doesn't work.
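To make the difference concrete, the kube-dns stubDomains above would translate into CoreDNS server blocks roughly like this (a sketch, following the same pattern as the some.corp block earlier):
Corefile: |
  # ...(omitted)
  abc.com:53 {
    cache 30
    forward . 1.2.3.4
  }
  my.cluster.local:53 {
    cache 30
    forward . 2.3.4.5
  }
The upstreamNameservers part would likewise become a forward to 8.8.8.8 and 8.8.4.4 inside the default .:53 block.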
Take a good hard look at the configmap#
I searched through a few pages on Google, never suspecting that the answer had been in the manifest all along.
After all the sed substitutions, the configmap in the manifest will look like:
Corefile: |
  cluster.local:53 {
    # ...(omitted)
    forward . __PILLAR__CLUSTER__DNS__ {
      force_tcp
    }
    # ...(omitted)
  }
  in-addr.arpa:53 {
    # ...(omitted)
    forward . __PILLAR__CLUSTER__DNS__ {
      force_tcp
    }
    # ...(omitted)
  }
  ip6.arpa:53 {
    # ...(omitted)
    forward . __PILLAR__CLUSTER__DNS__ {
      force_tcp
    }
    # ...(omitted)
  }
  .:53 {
    # ...(omitted)
    forward . __PILLAR__UPSTREAM__SERVERS__
    # ...(omitted)
  }
So, for queries in the zones cluster.local, in-addr.arpa and ip6.arpa, go ask __PILLAR__CLUSTER__DNS__.
What is __PILLAR__CLUSTER__DNS__?#
__PILLAR__CLUSTER__DNS__ will be replaced at runtime with c.clusterDNSIP. What is c.clusterDNSIP, you say? According to the flag parsing, it comes from the upstreamsvc flag, which defaults to kube-dns.
Let's go back to the manifest, where we will find:
# ...(omitted)
containers:
  - name: node-cache
    image: k8s.gcr.io/dns/k8s-dns-node-cache:1.17.0
    resources:
      requests:
        cpu: 25m
        memory: 5Mi
    args: [ "-localip", "169.254.20.10,172.20.0.10", "-conf", "/etc/Corefile", "-upstreamsvc", "kube-dns-upstream" ]
# ...(omitted)
…and the service snippet:
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "KubeDNSUpstream"
spec:
  ports:
    - name: dns
      port: 53
      protocol: UDP
      targetPort: 53
    - name: dns-tcp
      port: 53
      protocol: TCP
      targetPort: 53
  selector:
    k8s-app: kube-dns
So, __PILLAR__CLUSTER__DNS__ is an “alternative” ClusterIP service for the upstream CoreDNS, because queries to the original CoreDNS service IP get intercepted (172.20.0.10 in this case). The manifest creates another ClusterIP service so that NodeLocal DNSCache can still reach the upstream. Therefore, queries for the zones cluster.local, in-addr.arpa and ip6.arpa go to the upstream CoreDNS.
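In other words, once the pillar is substituted, the cluster zones in the node-local Corefile forward to that new service's ClusterIP. Assuming kube-dns-upstream was assigned 172.20.153.83 (a made-up IP, purely for illustration), the rendered block would look roughly like:
cluster.local:53 {
  # ...(omitted)
  forward . 172.20.153.83 {
    force_tcp
  }
  # ...(omitted)
}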
What is __PILLAR__UPSTREAM__SERVERS__?#
We now understand that stub domains are not included in the zones above, so those queries go to __PILLAR__UPSTREAM__SERVERS__. According to this configmap.go and the zero value of UpstreamNameservers, __PILLAR__UPSTREAM__SERVERS__ will be replaced with /etc/resolv.conf at runtime.
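So, with no upstream nameservers configured, the default zone in the rendered Corefile effectively becomes something like this sketch:
.:53 {
  # ...(omitted)
  forward . /etc/resolv.conf
  # ...(omitted)
}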
Let's look back at the manifest again:
# ...(omitted)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
# ...(omitted)
spec:
  # ...(omitted)
  spec:
    # ...(omitted)
    hostNetwork: true
    dnsPolicy: Default  # Don't use cluster DNS.
# ...(omitted)
We can find an explanation of dnsPolicy: Default in the Pod's DNS Policy document:
“Default”: The Pod inherits the name resolution configuration from the node that the pods run on. See related discussion for more details.
And what nameserver will be present in the node's /etc/resolv.conf? Yes, the reserved DNS IP address of the VPC, which may look something like 10.0.0.2.
Fallback to upstream CoreDNS, not Route53#
After all that trouble, the fix is simple: replace __PILLAR__UPSTREAM__SERVERS__ with __PILLAR__CLUSTER__DNS__.
If you are not using stub domains, letting queries fall through to Route53 won't cause any problems. In our case, however, we need the fallback to go to the upstream CoreDNS instead, so that the stub domain blocks in its Corefile still apply.
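Concretely, that means editing the default zone in the node-local-dns configmap so it forwards to the upstream CoreDNS service rather than the node's resolv.conf. A minimal sketch of the change:
Corefile: |
  # ...(omitted)
  .:53 {
    # ...(omitted)
    forward . __PILLAR__CLUSTER__DNS__
    # ...(omitted)
  }
At runtime the pillar is substituted with the kube-dns-upstream ClusterIP, so queries for the stub domains miss the local cache, land on the upstream CoreDNS, and hit the some.corp block again.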