317 points by fanf2 9 days ago
I operate authoritative name servers for almost 10,000 domains. Originally, I used a default TTL of 2 days, as recommended by RIPE-203¹ (which is also compatible with the recommendations of RFC 1912²), but this was not accepted by users, who didn’t want to wait two days. Therefore, for all records except SOA and NS records, I changed the default TTL to one hour, which I still use as the default value unless a change is scheduled and/or planned, in which case I lower it to 5 minutes. I do not want to lower it any more, as I’ve heard rumors of buggy resolvers interpreting “too low” TTLs as bad, and reverting to some very-high default TTL, thereby wrecking my carefully planned DNS changeover. I have, however, not seen any real numbers or good references on what numbers are “too low”, and would like to hear from anyone who might have some information on this.
Unless you have insight into the end users' DNS deployments I would say this is the appropriate amount of caution to apply. Besides just TTL being low, a frequent issue I had when first migrating to AWS years ago was CNAME to CNAME records not resolving among some end users. Primary schools were the worst offenders; I assume some of them still have Novell deployed.
Would you mind sharing the service?
The service? What service? What do you mean?
You said you run authoritative servers- I assume you provide DNS hosting. I'm curious which provider you run.
I am hesitant to say; we only target the local area, and our home page isn’t even available in English. Our main role is as a domain name registrar, also providing, in increasingly tangential order, domain name strategy planning, some trade mark strategy, DNS hosting, HTTP redirects, e-mail, and web hosting. Our main value proposition is support; call us and talk to us directly, or send an e-mail, and get an answer more or less immediately. We only very reluctantly provide self-service control panels; we don’t mention their availability unless people directly ask for them, and we generally discourage their use, preferring that people simply tell us what they want done in their DNS. Some people, including some very large companies, prefer this arrangement, and if you are one of them, and you are part of our local market, I’m sure you’ll be able to find us.
Probably way better to not say than to say. Little upside versus who knows what downside.
The irony of all of this is that those TTLs are almost meaningless as a server operator anyway. Even if you set your TTL to 5 minutes, there are a whole lot of clients that will ignore it.
When I made a DNS switch at reddit, even with a 5 minute TTL, it still took an hour for 80% of the traffic to shift. After a week, only 95% had shifted. After two weeks we still had 1% of traffic going to the old IP.
And after a month there was still some traffic at the old endpoint. At some point I just shut off the old endpoint with active traffic (mostly scrapers with hard coded IPs at that point as far as I could tell).
One of my friends who ran an ISP in Alaska told me that they would ignore all TTLs and set them all to 7 days because they didn't have enough bandwidth to make all the DNS queries to the lower 48.
So yeah, set your TTL to 40 hours. It won't matter anyway. In an emergency, you'll need something other than DNS to rapidly shift your traffic (like a routed IP where you can change the router configs).
It was some time ago, but I've had similar trouble. Back when I had a small webhost (I stopped when shared/reseller hosting descended too far into an overselling-and-deliberately-misleading-advertising-to-compete-on-price race to the bottom) a customer who "left" (was told to get lost due to non-payment) demanded I keep his content up because some users were still ending up there. As far as I could tell, none of the records had ever had a TTL longer than four hours (my default) while they were pointing at that address, yet more than a month later some traffic was still coming in to that address for that domain. I didn't look too deeply into it due to the history of the client in question, but it was certainly a real problem at that point in time.
So for my own stuff if there is a controlled change I try to keep the original address operating as a relay to the new address for a chunk of time, and for a while after that have it host a message saying "your DNS resolution seems to be broken, you shouldn't have been sent here, please report this to your ISP or local SysAdmin, if you want me to try fix it for you here is a list of my consulting fees".
(Weeping in agreement.)
The lesson I took away from hosting operations is that the implementation of internet standards has a long tail of customization; it's part of the job to accommodate them graciously. :)
Do you mean gracefully?
No reason one can't be both graceful and gracious.
Yes, thank you. (Past edit window.)
I've had a lot of issues with this for some of our customers. Nobody wants to run old deploy environments for weeks..
Turns out their routers somehow set an insane TTL (like max int), and report this to their clients, which in turn also get stuck with an insane TTL. You have to reboot or flush both the router and the clients to get it unstuck. I don't know if it's a "feature" or some sort of memory corruption / race condition.
The routers were almost always Asus routers, ex: Asus RT66. And the same customers were repeat offenders (not every time, but often enough).
In the end we had them set DNS on their computers to something like 18.104.22.168
I know that HTTP clients in some platforms like .NET will not resend the DNS query until the underlying TCP connection is closed. So if the server is using keep-alive and you keep sending requests, you might actually end up using the same IP address for a long time.
I think this is actually the desirable behavior. If you've got very long-running connections and you want to force a switch-over you can always drop the connection even if you don't have some in-band mechanism. If you're constantly watching for a DNS change all it takes is one DNS failure to kill a connection (or all connections). In general transient network issues are probably going to be more common than IP changes on your infrastructure and issues caused by the former are harder to debug.
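The connection-pinning behavior described above can be sketched in a few lines. This is a hypothetical illustration (not the actual .NET implementation): a connection pool that only consults the resolver when it has to open a new connection, so keep-alive traffic keeps using the originally resolved IP.

```python
# Sketch of "resolve once per connection" with an injected resolver, so we
# can observe exactly when DNS is consulted. All names here are invented.
class Connection:
    def __init__(self, host, ip):
        self.host, self.ip, self.open = host, ip, True

class PinnedPool:
    def __init__(self, resolver):
        self.resolver = resolver      # e.g. socket.gethostbyname in real code
        self.conns = {}               # host -> live Connection

    def get(self, host):
        conn = self.conns.get(host)
        if conn is None or not conn.open:
            # The ONLY place a DNS lookup happens: on new-connection setup.
            conn = Connection(host, self.resolver(host))
            self.conns[host] = conn
        return conn

lookups = []
def fake_resolver(host):
    lookups.append(host)
    return "192.0.2.1"                # TEST-NET address, placeholder only

pool = PinnedPool(fake_resolver)
pool.get("example.com"); pool.get("example.com")
assert len(lookups) == 1              # keep-alive reuse: no re-resolution
pool.conns["example.com"].open = False  # connection dropped
pool.get("example.com")
assert len(lookups) == 2              # a new connection forces a fresh lookup
```

This is why dropping the connection server-side (as the comment above suggests) is the reliable way to force such clients onto a new IP.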
This is part of the cancer, though. People notice that even though their TTL is low, changes don't propagate, so they set it even lower in an attempt to compensate.
It goes both ways, a vicious circle.
TTLs aren't respected -> people try lowering them even more.
Lots of low TTLs -> people configure a minimum or ignore them.
As an infrastructure provider with limited connectivity, seeing lots of low DNS TTLs means I'll just configure a saner minimum (say 1h) and consider it not my problem.
>ISP in Alaska told me that they would ignore all TTLs and set them all to 7 days
This is exactly why people set TTLs lower; they get ignored.
As someone who overrides TTL's for all domains on my home network, I agree with this. I use Unbound DNS to query upstream servers over a VPN. I override min ttl to 20 minutes and that has never caused any issues as far as I can tell. I have been doing this for many years.
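For reference, the override described above maps to Unbound's `cache-min-ttl` option (the 20-minute value from the comment):

```
server:
    # Treat any answer TTL below 20 minutes as 20 minutes.
    cache-min-ttl: 1200
```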
> One of my friends who ran an ISP in Alaska told me that they would ignore all TTLs and set them all to 7 days
If you’re hosting for customers the difference here is that this is your friend's fault, not yours.
The question is what percentage of your legitimate customers respect DNS TTLs. Amazon relies heavily on DNS; this is how they shift traffic from one datacenter to another.
Even with short TTLs, I often see Facebook mobile app users lingering for days(!) on the old IP in the logs, long after all other traffic is gone.
I'm not really sure what's up with that, as no-one has ever reported the site not being reachable in the Facebook app after an IP change. Either it does some very aggressive caching, or something is pretending to be the Facebook app.
> Why are DNS records set with such low TTLs?
The author seems to be missing one of the big reasons ridiculously low TTLs are used: it lets passive eavesdroppers discover a good approximation of your browsing history. Passive logging of HTTP has (fortunately) been hindered as most traffic moved to HTTPS, but DNS is still plaintext.
Low TTLs mean a new DNS request happens (approximately) every time someone clicks a link. Seeing which domain names someone is interacting with every 60s (or less!) is enough to build a very detailed pattern-of-life. Remember, it's probably not just one domain name per click; the set of domain names that are requested to fetch the js/css/images/etc for each page can easily fingerprint specific activities within a domain.
Yes, TTLs need to have some kind of saner minimum. Even more important is moving to an encrypted protocol. Unfortunately DoH doesn't solve this problem; it just moves the passive eavesdropping problem to a different upstream server (e.g. Cloudflare). The real solution is an encrypted protocol that allows everyone to do the recursive resolution locally.
> The author seem to be missing one of the big reasons ridiculously low TTLs are used: it lets passive eavesdroppers discover a good approximation of your browsing history.
I operate DNS for hundreds of thousands of domains. I've tried to reassemble browsing history from DNS logs, and I can tell you it is damn near impossible. You have DNS caches in the browser, the OS, broadband routers, and ISPs/public resolvers to account for - and half of them don't respect TTLs anyways.
The reason people set low TTLs is they don't want to wait around for things to expire when they want to make a change. DNS operators encourage low TTLs because it appears broken to the user when they make a change and "it doesn't work" for anywhere from a few hours to a few days.
> I operate DNS for hundreds of thousands of domains. I've tried to reassemble browsing history from DNS logs
To make sure others can't do the same.
The problem is that your ISP can log and mine your DNS requests, regardless of the servers you use. They definitely do this and one can only assume they then sell it after some sort of processing.
I’ve worked for a few, in Europe mind you, but I can say with certainty we did not do this.
It would be naive to think none do of course.
The comment you're replying to specifies caching at the browser, OS, and router level. Not one of the three would show up as DNS refreshes with the ISP because the DNS is not being refreshed.
Don't browsers and operating systems mostly respect ttls?
So if some things are cached, you won't get a complete picture, but the picture you get might be enough.
I can't tell. I run Firefox at home, and set up my own DoH server (mainly because I saw the writing on the wall and if Mozilla/Google are going to shove this down my throat, I want it shoved down on my terms, but I digress). If I visit my blog (which has a DNS TTL of 86,400) I get a query for my domain not only on every request, but even if I just hover over the link. It will also do a query when I click on a link to news.ycombinator.com (with a TTL of 300) but not when I hover over a link. It's bizarre.
Mostly, yes. In my experience (as a service provider) Chrome has a bad habit of caching records occasionally for much much longer than it should. Maybe bug maybe intentional, I dunno.
HTTPS does not hinder that type of tracking.. in fact, using SNI (which is unencrypted) would be more accurate than trying to do it with DNS... since it's sent with every request.
SNI is sent in the clear once per tls connection, not once per http request.
While the OP had the wrong method, it still means ISP boxes end up tracking that TLS connection.
(Though this is being fixed-- both Firefox and Cloudflare implement the eSNI draft).
I'm sure the destination IP:443 tells about as accurate a story as the DNS lookups?
Particularly with SNI.
I seem to remember a paper a few years ago that (IIRC) tested this by setting a very low TTL (like 60), changing the value, and seeing how long they continued to receive requests at the old value... and most updated within the TTL, but there were some that took up to (I want to say) an hour. I'm probably getting bits of this wrong though..
I did find this paper:
The violations in that paper that are important are those that have increased the TTL. Reducing the TTL increases costs for the DNS provider, but isn't important here. The slowest update was about 2 hours (with the TTL set to 333).
Of those that violated the TTL, we don't know what portion of those would function correctly with a different TTL (increasing the TTL indicates they're already not following spec). So I wouldn't assume that increasing the TTL would get them to abide by your requested TTL. They're following their own rules, and those could by anything.
Considering how common low TTLs are... you're worrying about a DNS server that's already potentially causing errors for major well known websites.
It is important to note that this study used active probes asking selected recursive resolvers around the world.
From my own experience when changing records and seeing when the long tail of clients stops calling the old addresses (with the name), it is a really long tail. An extreme example that lasted almost six months was a web spider that just refused to update their DNS records and continued to request websites using the old addresses.
Is there a lot of custom written code that does their own DNS caching? Yes. One other example is internal DNS servers that shadow external DNS. There is a lot of very old DNS software running year after year. Occasionally at work we stumble onto servers which are very clearly handwritten by someone a few decades ago by people with only a vague idea of what the RFCs actually say. Those are not public resolvers of major ISPs, so the above study would not catch them.
Naturally if you have a public resolver where people are constantly accessing common sites with low TTLs then issues would crop up quickly and the support cost would get them to fix the resolver. If it's an internal resolver inside a company where non-work sites are blocked then you might not notice until the company moves to a new web hosting solution and suddenly all employees can't access the new site, an hour later they call the public DNS hosting provider, the provider diagnoses the issue to be internal of the customer's network, and then finally several hours later the faulty resolver gets fixed.
> An extreme example that lasted almost six months was a web spider that just refused to update their DNS records and continued to request websites using the old addresses.
It may have been Java client that was not restarted. At least for older versions of Java the default was to cache result forever.
Yep, older java versions had some ridiculous caching of both positive and negative DNS responses. That was some weird problem to troubleshoot. We ended up writing our own caching then, back in Java7ish. And the first version of our DNS caching was broken and promptly triggered load alerts on 2 DNS servers of our operations team by issuing ... a lot of DNS queries very very quickly :)
Could it be that it wasn't using DNS at all? Just hardcoded the ip address?
> Of course, a service can switch to a new cloud provider, a new server, a new network, requiring clients to use up-to-date DNS records. And having reasonably low TTLs helps make the transition friction-free. However, no one moving to a new infrastructure is going to expect clients to use the new DNS records within 1 minute, 5 minutes or 15 minutes. Setting a minimum TTL of 40 minutes instead of 5 minutes is not going to prevent users from accessing the service.
Note that you can still get the benefit of a low TTL during a planned switch to a new cloud provider, server, or network even if you run with a high TTL normally. You just have to lower it as you approach the switch.
For example, let's say you normally run with a TTL of 24 hours. 25 hours before you are going to throw the switch on the provider change, change the TTL to 1 hour. 61 minutes before the switch, change TTL to 1 minute.
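The staged lowering described above can be computed mechanically: each TTL change has to land at least one old-TTL before the cutover. A sketch of the schedule from the example (how you actually push the change depends on your provider's API):

```python
from datetime import datetime, timedelta

# (lead time before cutover, TTL to set) — mirrors the 24h -> 1h -> 1min
# example above: each change lands one old-TTL (plus a minute) early.
SCHEDULE = [
    (timedelta(hours=25), 3600),   # 25 h out: drop the 24 h TTL to 1 h
    (timedelta(minutes=61), 60),   # 61 min out: drop to 1 min
]

def ttl_changes(cutover: datetime):
    """Return (when, new_ttl) pairs for a planned cutover time."""
    return [(cutover - lead, ttl) for lead, ttl in SCHEDULE]

changes = ttl_changes(datetime(2020, 1, 10, 12, 0))
assert changes[0] == (datetime(2020, 1, 9, 11, 0), 3600)
assert changes[1] == (datetime(2020, 1, 10, 10, 59), 60)
```

After the cutover succeeds, the same idea runs in reverse: raise the TTL back to its normal value.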
Wouldn't you be canarying your switch over a period of longer than 24 hours anyway?
I can still imagine a benefit to short TTLs in the sense that you can maybe roll out your canary in a more controlled way. But that's a lot more complicated than the issue of quick switching.
If it's planned, yes.
If your cloud provider does an oopsie (e.g. https://news.ycombinator.com/item?id=20064169) and takes down your entire infrastructure, or you have to move quickly for some other reason, or you're recovering from a misconfiguration, the long TTL can add 24 hours to your mitigation time.
If you're just playing around with your personal project/web site, you just added a giant round of whack-a-cache to your "let's finally clean up my personal server mess" evening.
As most people who've ever worked in web hosting can confirm, small business customers often have no idea of what they're doing, and I've talked to many people who switched providers after seeing an ad for cheap hosting, without realising that they have to a) wait for the DNS changes to propagate, and b) actually move their web site from one provider to another.
Subsequently, my previous employer lowered the default TTL simply because it got rid of all the bad Trustpilot ratings about customers being "prevented from leaving", and started offering a "move my WordPress site for me" service to profit from all the panicking newcomers who had no idea about how to do trivial things like importing/exporting a database and transferring files.
It would have been interesting to see actual delay rather than qualitative results of the nature "<x>% wasn't in cache so this is horrible!". Admins and users don't care if it's in cache, they care what the impact to operations and load time is. https://www.dnsperf.com/dns-speed-benchmark says lookup times for my personal domain result in 20ms-40ms. Ironically the same DNS test for 00f.net is taking 100ms-150ms.
99% of apps will gladly trade a 30ms increase in session start (assuming the browser's prefetcher hasn't already beaten them to it) to not have to worry about things taking an hour to change. Not all efficiency is about how technically slick something is.
I just tested 00f.net and got numbers as low as 6ms. Latency is a question of network traffic between the client and the server, and unless you use anycast you will get different latency depending on what place in the world the client and server reside in; if you use anycast it depends on how good the contracts and spread of the anycast network are.
Very true, looks to be hosted out of Europe. The point about 00f.net optimizing for ease of operation vs optimizing milliseconds of performance only holds doubly true with this information though.
Hug of death by the looks of it. Maybe they need to quickly change their DNS entries to point to a better server :-)
Doesn’t work for me. This does:
> I’m not including “for failover” in that list. In today’s architectures, DNS is not used for failover any more.
I mean, my company does this for certain failure scenarios involving our CDNs. Can anyone tell me why we're idiots, or is this just hyperbole?
This is very common (Dynect, NS1, AWS, GCP, etc all depend on this for monitoring and failover). The author is incorrect.
amazon.com, reddit.com, facebook.com, and others use low TTLs on their domains for this reason. Anyone who can't maintain an anycast infrastructure around the world and doesn't want to depend on Cloudflare will use this method.
For literally my entire career in SRE, well over a decade now, I’ve only interacted with systems that use DNS for this purpose, from small shops to parts of every HN reader’s life. That sentence in your quote, and the assertive tone of the post on such a weak foundation, are sufficient to disqualify a hiring candidate on account of lack of experience, despite the authored software presented. It simply does not align with reality when presented with two logically separate networks and a required mechanism to transition between them.
The only other alternative for that scenario is using anycast addressing, and that has a colorful bag of limitations that are quite different from those of low-TTL DNS (including being out of reach for most).
DNS failover is used extensively, especially in the cloud world.
I see nothing wrong with using low DNS TTLs for failover - really don't understand the author's objections here, and them claiming that "DNS is not used for failover any more" significantly discredits them, IMO.
> I mean, my company does this for certain failure scenarios involving our CDNs. Can anyone tell me why we're idiots, or is this just hyperbole?
I came here to say exactly that. Our company uses DNS entries with low TTL for failover and load balancing purposes as well -- it's a very common approach. Services like AWS Route 53 and CloudFlare make it very easy to setup and low cost. I was surprised that the author didn't give much acknowledgement to this type of usage.
Don't worry many companies use DNS for failover (including Amazon).
You aren't idiots if you're using it where there are no better alternatives - it's preferable to use load balancers etc where available, but there are places where it's very much "DNS or nothing".
How would I use a load balancer to fail traffic between, say, London and Amsterdam with no fiber in place between them? Where would the load balancer physically exist in that scenario and how would it fail to the other when power is lost in one location? Would I make a third PoP to isolate it? What would then be my redundancy story for that PoP? How would I relocate traffic to my backup load balancer PoP number four?
Within a single network, sure, load balance all you want. That’s not the scenario low TTLs go after.
> How would I use a load balancer to fail traffic between, say, London and Amsterdam with no fiber in place between them?
What people use in those situations is Anycast.
Of course, DNS itself, or e-mail, don’t need this kind of redundancy, since the NS (or MX) records themselves provide a list of failover servers. The corresponding alternative for HTTP, SRV records, has been consistently stonewalled by standard writers for HTTP/2, QUIC, etc.
There is an interesting draft RFC which I am keeping an eye on, but I don’t want to get my hopes up:
> What people use in those situations is Anycast.
This requires that you blow a publicly-advertisable prefix for every unique combination of services you would want to fail-over.
E.g., if you wanted to be able to have independent fail-over between your customer-facing self-service portal and your webmail interface (each relying on specific state that you can't replicate synchronously, and can't guarantee to replicate consistently with each other), you would need two /24s: one dedicated to anycast for the webmail interface, one for the self-service portal, and separate from any services which are active-active.
Whereas using DNS, you could use your other existing public /24s that you are already using for your active-active services.
In the last days of IPv4, an extra 2 /24s just for this is quite an expense.
Some folks are using anycast to accomplish this, but that involves a different set of problems.
Within a region, Anycast is how many big companies move things around mostly seamlessly. Why inside a region? If RIPE or ARIN catch you advertising their IP in the other's territory, they will send nasty emails and threaten to take back your CIDR blocks. I have no idea if they will follow through as we always stopped violating their rules when told to.
Outside of a region, they use DNS and sometimes a combination of WAN accelerators and VPN's, dark fiber.
"Where there are no better alternatives." Were you in such a hurry to show off you couldn't be bothered parsing my sentence?
That can be an acceptable price for minimizing the impact of accidental DNS misconfiguration. Which has probably happened to every sysadmin.
Or is there a better way to quickly invalidate DNS caches in case of emergency?
No, there isn’t. The specification as implemented requires no invalidation mechanism, which means no such mechanism across all caches exists, nor will it ever. The long tail kills you in such a failure scenario, and remember, people who make kitchen appliances write DNS resolvers.
Google Public DNS, at least, provides a public web form you can use to clear cached DNS records.
Not sure about any others.
Yeah, this was my first thought...I am guessing the author has never accidentally pushed out a bad DNS entry and needed to revert/update.
Everyone probably starts with higher TTLs, then the first time they mess up an update they switch to a short one.
The author seems to not appreciate how big a problem a misconfigured, long TTL DNS entry could be for someone.
Other reasons for short TTLs... maybe an IP gets blocked/flagged by a large network and they need to change fast... or a network path is slow and they need to move to a new location.
"The urban legend that DNS-based load balancing depends on TTLs (it doesn’t)"
So what's the solution? We are using AWS ALB/ELB and the docs state that we should have a low TTL, and it makes sense. Servers behind the LB scale up and down. What is option B?
In fact, if you use Route 53 with an alias to an ELB, the TTL is hard-coded at 60s -- it is not even configurable. If it were, we'd follow the practice of lowering it prior to changes, and raising it again once things are stable, but as it is, that's not an option (moving DNS off AWS would be a hard sell, not because it's terribly hard but, as far as I'm concerned, there's not really any value to doing it).
I would maintain that if you are experiencing poor performance for a web site, there are MUCH more fruitful places to look than DNS latency. Third party objects, excessive page sizes, lack of overall optimization based on device are just the tip of the iceberg.
For many apps I've worked on, the DB connection setup was always the slow part (use PgBouncer). After that, the slow part was the queries. DNS, gzipped CSS/JS - chasing a red herring.
Yeah definitely. A poorly crafted SQL query can wreak havoc on performance, especially at scale!
> Here’s another example of a low-TTL-CNAME+low-TTL-records situation, featuring a very popular name:
> $ drill detectportal.firefox.com @22.214.171.124
Is captive portal detection not a valid use case for low TTL? The entire point is to detect DNS hijacking of a known domain, which takes longer when you cache the DNS results...
Captive portal detection involves more than just checking for DNS hijacking. The browser tries to load http://detectportal.firefox.com/success.txt and acts based on how that goes. Having a short TTL does not help.
If your captive portal is implemented by intercepting your DNS queries, then having a short TTL should ensure that the captive portal actually has a query to intercept.
But sure, there are other implementation approaches (e.g. injecting HTTP redirects), which I imagine is one reason why Firefox doesn't literally inspect the DNS reply.
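The check described above boils down to fetching a known URL and comparing the body. A minimal sketch (the URL is the real one named in the thread; the classification logic is a simplified illustration, not Firefox's actual implementation):

```python
import urllib.request

DETECT_URL = "http://detectportal.firefox.com/success.txt"

def classify(body: str, status: int) -> str:
    """Interpret a portal-probe response (simplified sketch of the idea)."""
    if status == 200 and body.strip() == "success":
        return "open"        # expected content came back: no portal in the way
    return "captive"         # anything else: something intercepted the request

def probe(url: str = DETECT_URL) -> str:
    # Plain HTTP on purpose: a captive portal can rewrite it to a login page.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return classify(resp.read().decode(), resp.status)

assert classify("success\n", 200) == "open"
assert classify("<html>Hotel Wi-Fi login</html>", 200) == "captive"
```

Because the check compares the HTTP body, it catches redirect-injecting portals as well as DNS-hijacking ones, which is the point made above: the short TTL isn't what makes detection work.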
I run authoritative DNS for a very busy domain - 30B queries per month. Originally we had 6 hour TTLs, but now I use 60s. We have had no problems. Uptime and fast failover comes before anything else.
There was a DNS record looked up primarily by large supercomputers that had a 0 TTL. It was used for stats via a UDP packet (because it was non-blocking, never mind that the DNS query was blocking). This was set to 0 for "failover" but it hadn't changed in years. I worked out that our systems alone had caused billions of queries for this name.
After I complained I think they upped the ttl.. to 60.
Reminds me of a server pair at the last healthcare place I worked. Between the two of them they'd generate something around 1200 DNS lookups per second (about 60% of the load on the DNS servers) of their own name. I think the logic was if the name stopped responding then server A was primary. If the name was responding, the server that owned the IP it was responding for was primary. If the servers wanted to swap primary/secondary they would issue a DDNS request.
After about 8 years we were restructuring our DNS infrastructure for performance and I rate limited those two to 10 or so queries per second each. In that time there must have been 300 billion or so requests from those two boxes alone.
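The back-of-the-envelope number above checks out, assuming the ~1,200 queries/second rate was sustained for the full 8 years:

```python
qps = 1200
seconds_per_year = 86400 * 365
total = qps * seconds_per_year * 8
assert total == 302_745_600_000   # ~300 billion, matching the estimate above
```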
In my experience that sort of thing is from the local hostname not being present in /etc/hosts and (of course) a caching resolver not in use.
Some process on the system wants to connect to itself, which then causes a dns lookup. Add a high transaction rate on top of that, and 1,200/second is easy.
The funniest one I remember seeing was thousands and thousands of lookups for HTACCESS. Turned out apache was running on top of a web root stored in AFS and not configured to stop looking for .htaccess files at the project root, so it would try to open
This would be done something like twice for every incoming web request.
Just anecdotally, from running Pi-hole and looking at the logs, I have some sites being resolved 12K times over 11 days... That's over a thousand requests a day.
$ echo min-cache-ttl=300 | \
sudo tee /etc/dnsmasq.d/99-min-cache-ttl.conf
Don't forget to run
$ sudo pihole restartdns
Probably ad or analytic sites? I know one app in particular where every other update seems to result in it sending a request per second to some blocked analytics site.
Can anyone explain why ping.ring.com needs to have such a low TTLs?
$ drill ping.ring.com @126.96.36.199
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 36008
;; flags: qr rd ra ; QUERY: 1, ANSWER: 2, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;; ping.ring.com. IN A
;; ANSWER SECTION:
ping.ring.com. 3 IN CNAME iperf.ring.com.
iperf.ring.com. 3 IN CNAME ap-southeast-2-iperf.ring.com.
;; AUTHORITY SECTION:
ring.com. 573 IN SOA ns-385.awsdns-48.com. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
The command output has the answer, it's a CNAME to whatever random AWS instance happens to be up and running. They probably let the instances autoscale to load and don't guarantee they'll be around for any amount of time and rather than configure an additional service for heatbeating they just used DNS.
There are caching nameservers that allow you to override the minimum TTL but be aware the device is likely relying on this being immediately up to date and may not work during a change with an extended TTL set.
That's actually a pretty long TTL by Amazon standards.
;; ANSWER SECTION:
amazon.com. 60 IN A 188.8.131.52
amazon.com. 60 IN A 184.108.40.206
amazon.com. 60 IN A 220.127.116.11
;; ANSWER SECTION:
glacier.us-east-1.amazonaws.com. 60 IN A 18.104.22.168
;; ANSWER SECTION:
s3.ap-northeast-1.amazonaws.com. 5 IN A 22.214.171.124
;; ANSWER SECTION:
dynamodb.us-east-1.amazonaws.com. 5 IN A 126.96.36.199
From time to time, you need to do something with customer facing infrastructure: Remove the DNS entry, watch the traffic drain over the next 5-10 minutes, and then do what you need to do on the device, test, and then add it back in the DNS again, from which you can watch traffic return to normal levels and verify everything is good.
You're looking at the SOA (573) record for ring.com, but he's asking about the CNAME (3) for ping.ring.com.
Please excuse any ignorant use of terminology, I am not a DNS expert like others on here, but I can share some experience in the smaller business world.
A company I worked with a couple of years ago was using Dyn as their DNS provider, and one day we got a notification that we had passed the usage limits for our account. This seemed impossible considering our site was getting a couple of hundred unique visitors a day. A few things came out of the analytics.
1) A short TTL on an A record had been left on from a website migration project. The majority of the requests were coming from our internal website administrators. I moved it up to a couple of hours and this went away.
2) We were getting a huge number of AAAA record hits. I think most modern browsers/OSes try AAAA first? We didn't have IPv6 configured, and therefore the negative resolution had a TTL set to the minimum on the SOA record, which was 1 second! Changing this to 60 caused a huge reduction in requests. I suppose I should have set up IPv6, but I didn't.
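For context, negative answers (NXDOMAIN/NODATA) are cached per RFC 2308: the effective TTL is the smaller of the SOA record's own TTL and its MINIMUM field, which is why a 1-second MINIMUM hurt so much. A trivial sketch:

```python
# Per RFC 2308, the negative-caching TTL is the lesser of the SOA
# record's TTL and the MINIMUM field in its RDATA.
def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    return min(soa_ttl, soa_minimum)

# With the setup described above, a MINIMUM of 1 second means every
# failed AAAA lookup is re-sent almost immediately:
print(negative_cache_ttl(3600, 1))   # 1
print(negative_cache_ttl(3600, 60))  # 60
```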
3) When we sent out stuff to our mailing list, the SPF (or rather TXT) records saw a peak that was off the chart. We had a pretty settled infrastructure, so I moved that TTL to a day (I think, from memory) and it flattened the peak somewhat.
4) There was a large peak in MX requests around 9am. I put this down to people opening their email when they got to work and replying to us. I had to set the TTL to a couple of days (of course) to smooth that one.
I like to think it was worthwhile and improved things for users. I at least had a nice warm glow that I had saved the internet from a bunch of junk requests, and it just felt tidier.
Thankfully this is one of those things that you don't need to respect. TTLs are just suggested values in the end (the standard may disagree).
I just checked, and I actually have TTL forced to 1 day in dnscrypt-proxy. My internet experience is fine. I guess I never noticed in the last 2 years or so.
Why does DNS cache expiration need to be in the critical path?
Instead of a browser doing
1. Local DNS lookup (resulting in expired entry)
2. DNS query
3. DNS response
4. HTTP request
why not do
2.1. DNS query
2.2. HTTP request
4. If DNS response changed and HTTP request failed, HTTP request again
Maybe use two expiration lengths, one that results in flow 2 and a much longer one that results in flow 1.
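What this describes is essentially serve-stale behavior (cf. RFC 8767). A rough sketch of the two-expiration idea, using an invented `ServeStaleCache` class with an injected `resolve` function (real implementations would also coalesce concurrent refreshes):

```python
import threading
import time

class ServeStaleCache:
    """Sketch of the two-TTL idea: within soft_ttl serve from cache;
    between soft_ttl and hard_ttl serve the stale entry immediately
    and refresh in the background (flow 2); past hard_ttl block on a
    fresh lookup (flow 1)."""

    def __init__(self, resolve, soft_ttl=60, hard_ttl=86400):
        self.resolve = resolve          # function: name -> address
        self.soft_ttl = soft_ttl
        self.hard_ttl = hard_ttl
        self.cache = {}                 # name -> (address, fetched_at)
        self.lock = threading.Lock()

    def _refresh(self, name):
        addr = self.resolve(name)
        with self.lock:
            self.cache[name] = (addr, time.monotonic())
        return addr

    def lookup(self, name):
        with self.lock:
            entry = self.cache.get(name)
        now = time.monotonic()
        if entry is None or now - entry[1] > self.hard_ttl:
            return self._refresh(name)           # flow 1: blocking lookup
        if now - entry[1] > self.soft_ttl:       # flow 2: stale but usable
            threading.Thread(target=self._refresh, args=(name,),
                             daemon=True).start()
        return entry[0]
```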
Yeah, this is roughly what the FB apps do. DNS rarely blocks, and changes are seen quickly.
Probably because the gain in milliseconds is not worth the code complexity of executing the requests in parallel.
Well, in my case it makes sense, I think: I host my server at home, and have a dynamic IPv4. I don't know when it could change, so I just set the TTL to something low.
Since the traffic is low, though, I can afford to check for an IP change every ~5min, and although I set a TTL of ~15min on most services, the main CNAME (ovh-provided dynamic dns service, TTL set by them) is set to 60s.
My IPv6 record was set to 1h, but I'll look into increasing it. It is true that my mobile phone often pings my server, so I imagine that it could reduce the battery usage.
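The ~5-minute polling loop described above can be tiny. A sketch where the IP fetch and the provider update are both injected as callables, since every dynamic-DNS API differs (ipify is used purely as an example echo service):

```python
import urllib.request

def fetch_public_ip(probe_url="https://api.ipify.org"):
    """One way to learn the current public IPv4; any equivalent
    plain-text echo endpoint works the same way."""
    with urllib.request.urlopen(probe_url, timeout=10) as resp:
        return resp.read().decode("ascii").strip()

def check_and_update(last_ip, fetch_ip, publish):
    """Compare the current address with the last one seen and call the
    (provider-specific, hence injected) publish hook only on change."""
    ip = fetch_ip()
    if ip != last_ip:
        publish(ip)
    return ip
```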
I'm wondering what it would take to make a DNS caching service with updates based on reliable notifications rather than polling? After all, every cellphone does it.
This already exists (NOTIFY), but it's only used for master-slave setups (i.e. a bunch of DNS servers serving some authoritative zone which want changes to be transmitted to all slaves ASAP).
It would be interesting to (ab)use this mechanism in the way you suggest. A recursive DNS server could ask to be NOTIFY'ed of changes in the zone they are querying...it would, of course, add load to the server, and it would need strict limits to avoid DoS, but it seems an interesting idea.
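For the curious, a NOTIFY message is small enough to build by hand. A sketch using only the stdlib (RFC 1996 defines opcode 4; the socket send/retry/TSIG logic a real implementation needs is omitted):

```python
import struct

def build_notify(zone: str, msg_id: int = 0x1234) -> bytes:
    """Minimal DNS NOTIFY request (RFC 1996): opcode 4, AA bit set,
    one question for the zone's SOA record. Illustration only."""
    flags = (4 << 11) | (1 << 10)            # opcode=NOTIFY(4), AA=1
    header = struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 6, 1)  # QTYPE=SOA, QCLASS=IN
    return header + question
```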
The big problem, to the extent there is one, is between the client and the recursive server. Not as much the recursive and the authoritative. Cost is highly amortized between recursive and authoritative for busy names.
DNS Push Notifications could be tacked on to resolvers and clients.
The author says low TTLs are bad because of latency but never attempts to quantify how much latency we are actually talking about. It's hard to know how outraged I'm supposed to be without actually seeing the numbers.
It seems that a lot of sites are ok with slightly higher latencies if it means greater operational flexibility.
Latency depends on many things, like DNS server location: accessing my website from Australia will take 500ms for the DNS lookup (or twice as much if I'm using CNAMEs). If this is not cached somewhere, that's 500ms every few seconds with those <1 minute TTLs. If I'm on GPRS or similar, that will add a bunch more hundreds of ms to every useless DNS resolution, incl. unpredictable variability.
So there's no single latency to report.
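A back-of-the-envelope model of why there's no single number (all names and figures below are made up for illustration): if a resolver sees a query for a name roughly every `mean_visit_interval_s` seconds, about `interval/ttl` of lookups miss the cache, and each miss costs one lookup RTT.

```python
def avg_dns_penalty_ms(lookup_rtt_ms, ttl_s, mean_visit_interval_s):
    """Rough expected extra latency per lookup: the cache expires once
    per ttl_s, so with one query every mean_visit_interval_s seconds,
    roughly min(1, interval/ttl) of lookups pay the full RTT."""
    miss_rate = min(1.0, mean_visit_interval_s / ttl_s)
    return miss_rate * lookup_rtt_ms
```

So the same 60s TTL costs almost nothing behind a busy shared resolver, but close to the full RTT on a lightly used one.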
"It was DNS" is the root cause of enough postmortems to justify low TTL values, in my opinion.
I run my own dnsmasq server on an old laptop and force really long TTL caching regardless of what the records come back with. I even cache nxdomain. It works great, except once or twice a month I have to flush the cache because Slack seems to not handle it well.
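dnsmasq can be told to do this directly; a sketch of the relevant options (values illustrative — note that stock builds cap `min-cache-ttl` at one hour unless recompiled):

```
# /etc/dnsmasq.conf -- clamp cached TTLs upward, as described above
min-cache-ttl=3600   # treat every answer as cacheable for at least an hour
neg-ttl=3600         # supply a TTL for negative answers that arrive without one
```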
I'm doing the same but "only" three hours and it's working just fine. Not a slack user though.
I’m honestly not sure what this author is complaining about. If the infrastructure can handle it and the zone owner is willing to pay for the excessive traffic, and DNS cache operators are fine with it, then this seems like a call for premature optimization.
The missing data in the article, which has many graphs, is how often the records truly changed.
I have worked in a place using GTM to fail over from a bad data center to a good data center, with maybe a few minutes' TTL. I worried about it, but availability is much higher this way, especially combined with a policy of only changing one data center at a time.
I'm kind of surprised that I can't see any other comments talking about GTMs (I assume you mean F5 Global Traffic Managers).
Where I am at the moment, GTMs are used everywhere, and everywhere the TTL is set to 30s.
The only part of this that really annoys me is that with the global default configuration, rather than serving up a subset of the list of IP addresses, only a single IP address is returned when you resolve down to the A record.
When I've pressed the issue that _at least_ on our internal GTMs we should just return a bunch of IP addresses every time someone resolves the address, I've been told that it would break load balancing... which blows my mind, because who on earth is relying on DNS to load balance traffic with a 30s TTL? I would have thought that the normal thing to do, if you actually wanted load to balance, would be to return a subset of IP addresses in a different order, with a different subset each time. That way other DNS servers with resolvers that cache that record can at least return multiple addresses to all the clients they serve, as opposed to everyone using that resolver getting stuck to a single address for 30 seconds...
But all of that being said, it would make perfect sense to me to just return, say, 4 IP addresses publicly for every resolution, rotating which ones, and set the TTL to something like 30s. That way clients could spend 30s iterating through the A records they have cached, then hit your resolver up again and get a different site's addresses back if your site had gone down...
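The rotation described above could be sketched like this (hypothetical helper; a real GTM would also weigh in health checks and locality before choosing the pool):

```python
import itertools

def rotating_answers(addresses, k=4):
    """Generator of k-address answer sets: rotate the pool one position
    per response, so successive (and cached) responses start from a
    different record -- the 'different subset each time' idea above."""
    pool = list(addresses)
    for start in itertools.count():
        i = start % len(pool)
        rotated = pool[i:] + pool[:i]
        yield rotated[:k]
```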
To avoid delay when migrating a website's IP, what I usually do is first migrate onto an HAProxy (like 2 days before switching) so all ISP DNS caches are updated, and on D-day I switch my backend to the new website/VM.
And then I change my DNS again to the new IP.
You have to tune a bit to get the right IP in your logs but so far it works.
>> The urban legend that DNS-based load balancing depends on TTLs (it doesn’t - since Netscape Navigator, clients pick a random IP from a RR set, and transparently try another one if they can’t connect)
Unless you do not return an RR set and what you return is based on geolocation and data center health.
Uhm, how does this work with "global" DNS services which people tend to use more and more? (E.g. Cloudflare's 1.1.1.1 or Google's 8.8.8.8/8.8.4.4)
Basically, your request is coming from them and wherever their servers are (US, I guess, though they probably have several data centers) and they route it to the final user.
I think using DNS-based geolocation sounds like a really bad idea: what am I missing?
The EDNS0 client-subnet extension exists for this exact reason.
Thanks. It seems, unfortunately, that only Google DNS and OpenDNS (Cisco iirc) include the data as of now. Older articles even mention how you have to have your website (well, nameservers) whitelisted for them to forward client subnet as part of DNS queries, not sure if that is still the case.
Of course, caching gets more complicated and less useful with this.
DNS Caching: Running on Zero
This entire analysis is just plain wrong.
They've collected data on DNS queries "for a few hours". By definition, clients who have DNS cached (iow, most clients, since browsers and resolv calls in operating systems will do that for you), will not issue DNS requests for any records that have a TTL that has not yet expired.
So, they've caught all (well, all that were re-requested) the TTLs shorter than whatever "a few hours" is, and only those longer ones that expired exactly during the experiment and were re-requested.
To run a proper experiment testing for "short" vs "regular" (let's say 1-3 days), you need to collect data for days (eg. at least 7 days, preferably at least 30), but even that would not report most TTLs longer than 7/30 days.
Articles like this are bad because they can easily confuse even knowledgeable people, like the HN crowd.
Should DNS providers have a setting that increases TTLs over time automatically? I.e. the longer I leave my DNS entry pointing at the same IP, the longer my TTL gets.
Obviously it would be possible to opt out for situations where you genuinely need a low TTL on a domain.
This is definitely a feature I've also thought would be useful to have in DNS providers.
I've worked on managing thousands of (sub)domains and the administrative overhead of changing the TTLs for everything manually would be considerable. I'd certainly like an automated way to say "These records should gradually increase TTL up to <X> time over <Y> time" (e.g., gradually raise TTL to 2 days over 2 weeks if there are no changes).
There are downsides to high TTLs though: (1) you need to remember to preemptively lower them ahead of any planned changes (if you want those changes to take effect quickly), and (2) you can't change the records quickly in an emergency. But, fortunately, lots of record types are ones that you probably don't need to change in an emergency -- and for ones that you do, you can use a low TTL.
Anyway, I'd personally like to see automated TTL management as a feature in DNS software.
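One possible shape for such a policy, as a hedged sketch (the function name, defaults, and linear ramp are all invented; a real provider would track per-record change timestamps):

```python
def auto_ttl(seconds_since_change, floor=300, ceiling=172800, ramp=1209600):
    """Hypothetical policy: start at `floor` right after a record
    change and grow linearly to `ceiling` over `ramp` seconds of
    stability -- e.g. up to 2 days over 2 weeks, as suggested above."""
    if seconds_since_change >= ramp:
        return ceiling
    return floor + (ceiling - floor) * seconds_since_change // ramp
```

Any change to the record would reset the clock, which also addresses downside (1) above: the TTL drops back to the floor on its own.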
Maybe up to a point, but really the TTL should be set for how long is acceptable for traffic to continue to flow to the old destination after a change. That's not necessarily correlated with time between changes: just because a service IP hasn't changed for two years doesn't mean I would want to wait a day for most traffic to move.
Of course, the reality is some traffic will continue to flow to the old destination for as long as you care to measure. There's plenty of absurdly broken DNS caching out there.
Yes. I have low TTL on some domains because I chose it for a cautious update and never bothered/remembered to come back and increase it.
I bet this is one of the main reasons for low TTLs.
I've seen some that have a somewhat reasonable minimum time. You can go below it but it will reset to their minimum after a day or two.
But it's a risky play for providers. It reduces their DNS load (which is pretty cheap to handle), but increases the risk that a customer will come yelling why they couldn't fix their outage quickly because the algorithm increased their TTL to something large.
What DNS providers should do is transition to traffic-based billing instead, if TTL is such an issue for them. There's a variety of useful use cases handled with low TTLs.
For those running CloudFlare, proxied DNS records have an unchangeable TTL of 5 minutes.
I hate low DNS TTLs. They are a stupid way to do load balancing.
However they wouldn't be quite as bad if web pages didn't load useless crap from 60 different domains.
What if you accidentally make changes to your authoritative nameserver? You want the recovery to be as fast as possible, because it's a complete outage.
Nothing precludes you from upping the TTL after the change. Traditionally DNS admins progressively drop the TTL prior to a change to reduce the time an RRSet is in flux (so if your TTL is N, N + 1 seconds prior you drop it to half N, and again and again until its your preferred window size) and cautious ones slowly ramp back up again to the regular value.
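The traditional halving ramp-down can be written out as a small schedule generator; a sketch (invented helper, times in seconds):

```python
def ttl_rampdown(current_ttl, target_ttl):
    """Return (wait_seconds, new_ttl) pairs for the progressive drop
    described above: publish the halved TTL, wait one old-TTL-worth of
    time so caches have expired the previous value, then halve again."""
    steps = []
    ttl = current_ttl
    while ttl > target_ttl:
        new_ttl = max(ttl // 2, target_ttl)
        steps.append((ttl, new_ttl))   # wait `ttl` seconds, then publish new_ttl
        ttl = new_ttl
    return steps
```

Running the same schedule in reverse gives the cautious ramp back up afterwards.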
Obvs. the author's server isn't seeing longer TTLs because there's no need for clients to keep querying his server for them?
Am I missing something, or is the reason most of the queries observed have low TTL because, well, they have a low TTL? IOW, the higher TTL responses would be cached downstream and so you'd see them less often. If that is the case, the distribution shown is not all that surprising.
It's weird how people are not understanding this: perhaps it's the way you phrased it. Or perhaps you failed to mention the core part from the article: the experiment was only run "for a few hours". This means that many a DNS record (well, most) with a TTL greater than the experiment duration would not show up in the data.
FWIW, I've learned in the past that while there are plenty of people who claim to want communication to be as succinct as possible, the majority are unable to understand when somebody is really terse (while still saying exactly enough). I've learned to follow up such a terse statement with examples and longer explanations for the majority that does not get it.
But maybe it's just that people don't expect the mathematics-level precision on the internet :)
Maybe DNS servers should support push updates rather than relying on polling.
I don't think they meant to authoritative servers.
Seems like DNS TTL was a big issue before HTTP 1.1.
Connections are cached and reused.
If you're talking about keep-alive, this times out far before most of the DNS TTLs mentioned in this article, and connections don't persist after being closed anyway.
That's not true. It all depends on the implementation...
Also what are you talking about? If the connection is closed, how would it be used?
Connections should only be closed due to? Inactivity. If a connection is closed, don't you think you would probably want to do another DNS request?
Also, if you're doing proper layer 4 load balancing using BGP, DNS is a moot point...
> Connections should only be closed due to? Inactivity.
Or if the user closes the browser of if the server/proxy restarts. But yes, mostly inactivity on the order of a couple of minutes.
> If a connection is closed, don't you think you would probably want to do another DNS request?
That's the whole point of the DNS TTL, to say how long to go before doing another lookup rather than doing it each time you reconnect.
> Also if your doing proper layer 4 load balancing using BGP, DNS is a moot point...
BGP load balancing operates on layer 3, and is irrelevant as you still need to DNS lookup an anycast address. EDNS client subnet is better anyways.
An anycast address doesn't change. I mean come on. And it actually operates on layer 4. It uses layer 4 to actually work?
I hope you didn't pay for your education.
Anycast addresses change all the time. Ask Google, Microsoft, Amazon, Akamai, Cloudflare and so on if you don't believe me. About the only anycast IPs that don't change are public DNS resolvers, but that's also true of unicast resolvers as well.
By that logic BGP is a layer 7 load balancer since it has an application layer. BGP only exchanges layer 3 reachability information to update route tables therefore you can only load balance layer 3 with it.
Personal attacks and other things in your comments are against the HN guidelines. The goal is to talk about DNS/TTLs and their impact on performance not insult each other. https://news.ycombinator.com/newsguidelines.html
Also, wtf are you talking about? Did you even read the article?
Most sit between? 0-15 minutes. WTF is your timeout?
> Also, wtf are you talking about?
> Did you even read the article?
Sure did which is why I referenced it.
> Most sit between? 0-15 minutes.
This is true in that the range encapsulates most keep-alive timeouts, not in that keep-alives longer than a minute or two are actually the majority. nginx defaults to 100 seconds, Apache is less than that. Most don't mess with these, let alone bump them to 900. Generally 60 to 120 is considered standard, with some cap on the number of active keep-alive sessions as well. Some go ultra-low or disable it altogether; very few go ultra-high.
> WTF is your timeout?
Also please try to keep the conversations together.
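For reference, the nginx directives being argued about look like this (per the ngx_http_core_module docs, the timeout default is 75s; the values below are illustrative):

```
# nginx.conf -- keep-alive tuning (ngx_http_core_module)
http {
    keepalive_timeout  75s;    # documented default; raise to hold idle connections longer
    keepalive_requests 1000;   # cap on requests served per connection
                               # (older versions defaulted to 100)
}
```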
TTLS isn't a thing.
Your argument about keep-alive makes no sense. You're confusing the nginx documentation. 100 is the default number of connections it will hold open; 60 is the default timeout. Also, it's sent to the client in the HTTP response headers...
Here you need to read up: https://nginx.org/en/docs/http/ngx_http_core_module.html#kee...
The timeout only matters if you're not making requests. Setting a low keep-alive will actually result in more DNS requests... doh
Next you're going to tell me you only write blocking code and use a thread per connection...
I guess you haven't created many 10k concurrent app servers.
Meant "TTLs" as in the plural of TTL but my phone capitalized the whole block on me.
> You're argument about keep alive makes no sense. You're confusing the nginx documentation. 100 is the number of default connections it will hold open. 60 is the default timeout. Also it's sent as an HTTP response to the client in the headers... Here you need to read up: http://nginx.org/en/docs/http/ngx_http_core_module.html#keep...
You're right about the 100 being the default number of active keepalives, not the default timeout. According to your own nginx link, though, 75 is the timeout, not 60.
Either way, 75/60/100/120 are significantly far off from 15 minutes.
> The timeout only matters if you're not making requests.
Or if the server reaches max connections.
> Setting a low keep alive will actually result in more DNS requests...doh
Which is how the discussion on DNS TTLs comes about in the first place. It's trivial to set the DNS TTL astronomically higher than the HTTP keep-alive, in which case the browser & OS won't actually make a lookup request since it's cached.
Cloud providers actually use low TTLs to route traffic globally away from regional failures. You're not seeing that go away anytime soon, there aren't other options.
Uh? I always used 24 hours TTL for DNS. I reluctantly move it to 1 hour for some tests, then quickly set it back to 24 hours. What are these people thinking?
One use case where short TTLs make sense is running a service on a residential network, where a power outage or router reboot can trigger an IP address change. If the IP address changes, then you won't be offline for too long.
Yes, it is not exactly great, but at least it works well enough for self-hosting services.