Ongoing issues in the Linux kernel's UDP connection tracking have caused problems with DNS, and these bugs particularly affect DNS in Kubernetes in its default configuration; we saw elevated rates of DNS failures that seemed to increase with load on our clusters. Other developers have reported these problems in blog posts such as "Racy conntrack and DNS lookup timeouts" and "A reason for unexplained connection timeouts on Kubernetes/Docker". In these cases, a race condition in the kernel causes the response to one of two queries to be lost when both are made within a short period of time. A Kubernetes extension called node-local solves these issues, but introduces a few new ones. To address some of these new problems, we developed a new implementation of node-local as a CoreDNS plugin: coredns-nodecache.
To see why this especially affects Kubernetes, it's important to understand two options that Kubernetes sets in /etc/resolv.conf: "ndots" and "search" (the search path). Kubernetes lets you use the name of a service as a shortcut to connect pods to each other. For example, if you call http://nginx-proxy, the search path defined in /etc/resolv.conf is used to expand the request to http://nginx-proxy.default.svc.cluster.local. This expansion happens automatically whenever the domain you are calling contains fewer than "ndots" dots, and it is attempted for each domain in the search path. In our case, the search path contains five domains.
As a consequence, a request to www.google.com will generate several requests:
www.google.com.
www.google.com.cluster.local.
www.google.com.svc.cluster.local.
www.google.com.namespace.svc.cluster.local.
...
But it's actually worse than that. Each lookup is made for both IPv4 (A) and IPv6 (AAAA) records, and all requests are made in parallel for the sake of performance; so every time a pod tries to communicate with www.google.com, it is not one but ten DNS queries that are made in parallel. This increases the risk of triggering the kernel bug. The larger your nodes, the more DNS queries each one makes, which further increases the risk of triggering the problem.
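To make the multiplication concrete, here is a minimal sketch of how a resolver expands a name according to "ndots" and the search path, following the behavior described in resolv.conf(5). The search list and ndots value below are assumptions (typical Kubernetes defaults for a pod in the "default" namespace), not the five-domain list from our clusters, and the function names are made up for illustration.

```python
# Assumed values: a typical Kubernetes search list and ndots setting,
# not the five-domain search path mentioned above.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5


def candidate_names(name, search=SEARCH, ndots=NDOTS):
    """Return the names a resolver may try, in order, for `name`."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no expansion.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Fewer dots than ndots: walk the search list first, then the name as-is.
    return expanded + [absolute]


def worst_case_queries(name, record_types=("A", "AAAA")):
    """Pair every candidate name with every record type."""
    return [(n, t) for n in candidate_names(name) for t in record_types]


queries = worst_case_queries("www.google.com")
print(len(queries))                               # 8 with the assumed 3-domain list
print(len(worst_case_queries("www.google.com.")))  # 2: trailing dot skips expansion
```

This is a worst case: a resolver stops as soon as one candidate name resolves, so the real number of queries depends on which names exist.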
There are a few options that, while not solving the problem, help reduce its impact:
Add periods at the end of each domain: Make your request to "www.google.com." instead of "www.google.com". By specifying a Fully-Qualified Domain Name (FQDN), domain expansion is avoided and you only make two requests (IPv4 and IPv6). However, this is not always practical because the domains might be hardcoded in libraries you use.
Disable IPv6 lookups: This would halve the number of DNS requests made. However, if some of your containers are based on Alpine, as many are, this is not possible: Alpine uses musl instead of glibc, and musl doesn't support disabling IPv6 lookups.
Configure your pods to use TCP for DNS instead of UDP: Again, this is not possible in Alpine.
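Where you do control the calling code, the first two options can be approximated at the application level. The sketch below is only an illustration under that assumption: `to_fqdn` and `resolve_ipv4_only` are hypothetical helper names, and whether restricting the address family actually suppresses the AAAA query depends on the libc in use.

```python
import ipaddress
import socket


def to_fqdn(host):
    """Append a trailing dot so the resolver skips search-path expansion."""
    try:
        ipaddress.ip_address(host)
        return host  # IP literals are never expanded; leave them alone.
    except ValueError:
        pass
    return host if host.endswith(".") else host + "."


def resolve_ipv4_only(host, port):
    """Resolve `host` as a fully qualified name, asking only for IPv4 addresses."""
    return socket.getaddrinfo(
        to_fqdn(host), port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
```

For example, `resolve_ipv4_only("www.google.com", 443)` looks up "www.google.com." and requests only IPv4 results. This does nothing for hostnames resolved inside third-party libraries, which is the limitation noted above.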
Others have worked on a Kubernetes Enhancement Proposal (KEP) called node-local. This solution deploys a DNS cache on each node of a cluster, greatly reducing the number of outgoing DNS queries. It also upgrades the DNS requests to TCP and disables connection tracking (conntrack) on these connections. I recommend this talk by one of the KEP authors to understand it better.
The DNS cache deployed on each node for node-local is called node-cache. It’s a thin layer around CoreDNS that creates a dummy interface for the k8s node to bind onto. It also adds several iptables rules and removes them and the interface when shutting down.
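To give a rough idea of what that setup and teardown involve, the sketch below only builds the `ip` and `iptables` command lines and does not execute them. The link-local address 169.254.20.10 and the interface name "nodelocaldns" are assumptions taken from the upstream node-local defaults as I remember them, and the rule set is deliberately partial; the real node-cache manages more rules than this.

```python
def setup_commands(ip="169.254.20.10", interface="nodelocaldns", port=53):
    """Build commands that create a dummy interface and skip conntrack for DNS."""
    commands = [
        ["ip", "link", "add", interface, "type", "dummy"],
        ["ip", "addr", "add", f"{ip}/32", "dev", interface],
    ]
    for proto in ("udp", "tcp"):
        match = ["-d", f"{ip}/32", "-p", proto, "--dport", str(port)]
        # The raw table runs before connection tracking, so NOTRACK here
        # keeps these packets out of the conntrack table entirely.
        commands.append(["iptables", "-t", "raw", "-A", "PREROUTING", *match, "-j", "NOTRACK"])
        commands.append(["iptables", "-t", "raw", "-A", "OUTPUT", *match, "-j", "NOTRACK"])
        # Untracked packets must be accepted explicitly.
        commands.append(["iptables", "-t", "filter", "-A", "INPUT", *match, "-j", "ACCEPT"])
    return commands


def teardown_commands(ip="169.254.20.10", interface="nodelocaldns", port=53):
    """Build commands that undo setup_commands: delete rules, then the interface."""
    rules = [c for c in setup_commands(ip, interface, port) if c[0] == "iptables"]
    deletions = [["-D" if arg == "-A" else arg for arg in rule] for rule in reversed(rules)]
    return deletions + [["ip", "link", "del", interface]]


for command in setup_commands():
    print(" ".join(command))
```

Running these commands requires root privileges and the iptables binary, a point that matters for the image and privilege concerns discussed later.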
This is a new approach that’s used by most organizations to solve DNS problems. However, it brings its own set of concerns:
Node-cache is deployed as a DaemonSet; there’s exactly one pod per node. If this pod is updated or crashes, you’ll lose a number of DNS queries until it restarts.
The codebase of node-cache was in its early stages and pinned to an older version of CoreDNS, which had its own set of bugs.
The test coverage is insufficient, and during our testing phase it once failed to start on a node, which was left without DNS.
This enhancement proposal is still in beta, and high availability is a condition for release.
We got to work and developed coredns-nodecache, a plugin for CoreDNS. It uses the CoreDNS plugin interface, which is stable from version to version. This allows us to easily update CoreDNS. The configuration is done directly in the CoreDNS configuration file (the Corefile). coredns-nodecache also supports a high-availability setup, and has been in use in production on hundreds of nodes for several months.
A few words about high availability: Linux supports "shared sockets" through the SO_REUSEPORT socket option, which CoreDNS already uses by default when binding to its interfaces. Using this was suggested early on in the enhancement proposal. Our setup involves deploying two different DaemonSets with similar configurations that bind to the same port. One of the two instances of coredns-nodecache is configured to create the interface and iptables rules; the second binds to the interface created by the first one. An option tells coredns-nodecache not to delete the interface when it shuts down. In practice, this works very well.
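The mechanism can be seen in isolation with a few lines of Python on Linux. This is a minimal sketch of SO_REUSEPORT itself, not of how CoreDNS or coredns-nodecache is implemented: two UDP sockets, standing in for the two instances, bind to the same loopback address and port, and the kernel hands each datagram to one of them.

```python
import select
import socket


def bind_shared_udp(host, port):
    """Bind a UDP socket that allows other sockets to share its address and port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Every socket sharing the port must set SO_REUSEPORT before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    return sock


# Port 0 lets the kernel pick a free port; the second socket then joins it.
first = bind_shared_udp("127.0.0.1", 0)
port = first.getsockname()[1]
second = bind_shared_udp("127.0.0.1", port)

# Send one datagram and see which of the two sockets the kernel chose.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"query", ("127.0.0.1", port))
ready, _, _ = select.select([first, second], [], [], 1.0)
receivers = [s for s in ready if s.recv(512) == b"query"]
print(len(receivers))  # exactly one of the two sockets gets the datagram
```

If one of the two sockets is closed, the remaining one keeps receiving traffic on the same port, which is what lets one instance restart while the other continues to answer queries.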
We’re now running this in production at Contentful with great success!
However, this solution could still be improved. To create the iptables rules and the interface, coredns-nodecache (just like node-cache) needs special privileges as well as the iptables binary, which means building a Docker image derived from Alpine instead of an image built FROM scratch, and running CoreDNS as root. To fix this, I started a project to delegate the creation of the interface and iptables rules to a K8s operator that would be deployed as a DaemonSet. This would make it possible to use the vanilla CoreDNS images as the cache, and to run them with lower privileges.