Query Stub Domains with CoreDNS and NodeLocal DNSCache

Recently, We started migrating our EKS 1.16 clusters to brand new 1.19 ones.

Of all the changes, I am most excited about NodeLocal DNSCache. It not only greatly reduced the time of DNS query latency (mostly because of these serach domains), but issues like conntrack races at very little expense.

The hidden problem

Networking is hard, especially in Kubernetes. DNS is often one of the problems.

Normally, no matter how many CoreDNS pods you run, there are always chances your DNS queries fly over your head across instances, or even worse, availability zones. This increases the latency, failure possibility and lowers the app’s performance that needs high throughput.

For instance, applications like log shipper and push notification workers tend to send many requests in short periods and will only increase when under pressure. If you check CoreDNS’ logs, you will notice there are so many NXDOMAIN results because of the search domains. Although these results are getting cached, the time wasted on traveling between nodes and zones is still expensive.

I’ve reduced ndots for cluster-level log shippers, but it’s unlikely and impractical to ask everyone else to do the same.

Install the cluster addon

If I remember correctly, GKE has a simple checkbox to install NodeLocal DNSCache. But since we are using AWS, we seem not to deserve the option. Anyway, it’s still pretty easy to install if you follow the instructions.

It creates a daemonset, a configmap, a service account for the node-local-dns and services to expose itself and the upstream CoreDNS. You don’t need to change any of your workloads to benefit from this. This daemonset will manipulate iptables rules (in a good way) to intercept DNS queries to the CoreDNS service cluster IP and check if it has the cache.

It feels great, just like you were using VMs and never thought you would be worrying about DNS queries being too slow one day.

Wait, I can’t connect to the services with stub domain?

There is always a “but”. We have some “special” domains that need to set up a stub domain block in Corefile to make it work. Like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
apiVersion: v1
kind: ConfigMap
data:
  Corefile: |
    .:53 {
        # ...(omitted)
    }
    some.corp:53 {
        cache 30
        forward . 10.1.2.3 10.4.5.6
    }    

It always worked before, and I naively thought it would continue to work after I installed NodeLocal DNSCache, until a colleague told me that he couldn’t connect to a certain database from the new cluster.

When something can’t be connected, I usually check DNS first and then use tools like nc later. And since we are all in this article, of course, it’s a DNS issue.

I was dumbfounded and wondering how it is possible. If cache misses, then it will still ask upstream CoreDNS, won’t it?

I removed NodeLocal DNSCache, then found the stub domain resolves. I applied the NodeLocal DNSCache again, no result. After that, I started searching around and hoping my mentors Google and StackOverflow would shed some light.

There was a commit that added support for stubDomains of kube-dns. I immediately checked daemonset manifest and found there is indeed an optional volume mounting the kubedns configmap. I tried to change the configmap to coredns and hope it will work.

It doesn’t.

Why? Because the stub domains format is different between kube-dns and CoreDNS. As you can see in CoreDNS configuration equivalent to kube-dns document:

1
2
3
4
5
6
7
apiVersion: v1
data:
  stubDomains: |
        {"abc.com" : ["1.2.3.4"], "my.cluster.local" : ["2.3.4.5"]}        
  upstreamNameservers: |
        ["8.8.8.8", "8.8.4.4"]        
kind: ConfigMap

This is what kube-dnsstubDomains looks like. And the CoreDNS one is already presented above. Of course it doesn’t work.

Take a good hard look at the configmap

I’ve searched a few pages on Google but haven’t thought that answer can be in the manifest all along.

After all these sed substitutions, the configmap in manifest will look like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
  Corefile: |
    cluster.local:53 {
        # ...(omitted)
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        # ...(omitted)
        }
    in-addr.arpa:53 {
        # ...(omitted)
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        # ...(omitted)
        }
    ip6.arpa:53 {
        # ...(omitted)
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        # ...(omitted)
        }
    .:53 {
        # ...(omitted)
        forward . __PILLAR__UPSTREAM__SERVERS__
        # ...(omitted)
        }    

So, when it comes to zone cluster.local, in-addr.arpa and ip6.arpa, go ask __PILLAR__CLUSTER__DNS__.

What is __PILLAR__CLUSTER__DNS__?

__PILLAR__CLUSTER__DNS__ will be replaced at runtime with c.clusterDNSIP. What is c.clusterDNSIP you say? According to the flag parsing, it’s upstreamsvc flag that defaults to kube-dns.

Let’s go back to the manifest, we will find:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# ...(omitted)
      containers:
      - name: node-cache
        image: k8s.gcr.io/dns/k8s-dns-node-cache:1.17.0
        resources:
          requests:
            cpu: 25m
            memory: 5Mi
        args: [ "-localip", "169.254.20.10,172.20.0.10", "-conf", "/etc/Corefile", "-upstreamsvc", "kube-dns-upstream" ]
# ...(omitted)

…and the service snippet:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "KubeDNSUpstream"
spec:
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kube-dns

So, __PILLAR__CLUSTER__DNS__ is an “alternative” upstream CoreDNS ClusterIP service, since the original IP of CoreDNS service will be intercepted (172.20.0.10 in this case). This will create another ClusterIP service for NodeLocal DNSCache to contact with upstream. Therefore, when it comes to zone cluster.local, in-addr.arpa and ip6.arpa, go ask upstream CoreDNS.

What is __PILLAR__UPSTREAM__SERVERS__?

We now understand that stub domains are not included in the zones above, so it will ask __PILLAR__UPSTREAM__SERVERS__.

According to this comfigmap.go and the zero value of UpstreamNameservers, __PILLAR__UPSTREAM__SERVERS__ will be replaced with /etc/resolv.conf at runtime.

Let’s look back to the manifest again:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# ...(omitted)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
# ...(omitted)
spec:
    # ...(omitted)
    spec:
      # ...(omitted)
      hostNetwork: true
      dnsPolicy: Default  # Don't use cluster DNS.
# ...(omitted)

We can get explanations about dnsPolicy: Default from the Pod’s DNS Policy document, that:

“Default”: The Pod inherits the name resolution configuration from the node that the pods run on. See related discussion for more details.

What nameserver will present in node’s /etc/resolv.conf? Yes, the reserved IP address for DNS in VPC, which may be something like 10.0.0.2.

Fallback to upstream CoreDNS, not Route53

After all that troubles, we can just replace __PILLAR__UPSTREAM__SERVERS__ with __PILLAR__CLUSTER__DNS__.

If you are not using stub domains, then using Route53 won’t cause any problems.

However, in our case, we should fallback to upstream CoreDNS instead.

Further readings

Cover: https://unsplash.com/photos/1cqIcrWFQBI

0%