Cloudflare 1.1.1.1 Incident on July 14, 2025

homebrewer

This is a good time to mention that dnsmasq lets you set up several DNS servers and race them. The first responder wins. You won't ever notice one of the services being down:

  all-servers
  server=8.8.8.8
  server=9.9.9.9
  server=1.1.1.1
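
To illustrate the "first responder wins" idea, here is a minimal Python sketch. It is not how dnsmasq is implemented; `race` and the stand-in resolvers `fast`, `slow`, and `down` are hypothetical names, and the resolvers are plain functions rather than real network clients.

```python
import concurrent.futures
import time
from typing import Callable, Iterable


def race(resolvers: Iterable[Callable[[str], str]], name: str, timeout: float = 2.0) -> str:
    """Send the same query to every resolver at once; return the first successful answer."""
    pool = concurrent.futures.ThreadPoolExecutor()
    try:
        futures = [pool.submit(resolve, name) for resolve in resolvers]
        # as_completed yields futures in the order they finish, so a dead
        # upstream only matters if every other upstream is dead as well.
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("every upstream failed")
    finally:
        # Do not wait for the losers of the race.
        pool.shutdown(wait=False, cancel_futures=True)


# Hypothetical stand-ins for real upstream resolvers.
def fast(name: str) -> str:
    return "192.0.2.1"


def slow(name: str) -> str:
    time.sleep(0.3)
    return "192.0.2.2"


def down(name: str) -> str:
    raise TimeoutError("no response")


print(race([down, slow, fast], "example.com"))
```

With one upstream down, the caller still gets an answer from whichever healthy upstream replies first.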

karel-3d

dnsdist is AMAZINGLY easy to set up as a secure local resolver that forwards all queries to DoH (and checks SSL) and checks liveness every second

I need to do a write-up one day

anthonyryan1

Additionally, as long as you don't set strict-order, dnsmasq will automatically use all-servers for retries.

If you were using systemd-resolved however, it retries all servers in the order they were specified, so it's important to interleave upstreams.

Using the servers in the above example, and assuming IPv4 + IPv6:

    1.1.1.1
    2001:4860:4860::8888
    9.9.9.9
    2606:4700:4700::1111
    8.8.8.8
    2620:fe::fe
    1.0.0.1
    2001:4860:4860::8844
    149.112.112.112
    2606:4700:4700::1001
    8.8.4.4
    2620:fe::9
will fail over faster and more successfully on systemd-resolved than if you specify all Cloudflare IPs together, then all Google IPs, etc.
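
A small Python sketch of the interleaving idea, assuming a plain round-robin across providers. The helper name `interleave` is made up for illustration, and the output order is not identical to the hand-written list above (which also alternates IPv4 and IPv6), but it follows the same principle: consecutive entries belong to different providers.

```python
from itertools import zip_longest


def interleave(*providers):
    """Round-robin across providers: first address of each, then second of each, and so on."""
    return [
        addr
        for group in zip_longest(*providers)
        for addr in group
        if addr is not None  # shorter lists simply run out early
    ]


cloudflare = ["1.1.1.1", "2606:4700:4700::1111", "1.0.0.1", "2606:4700:4700::1001"]
google = ["8.8.8.8", "2001:4860:4860::8888", "8.8.4.4", "2001:4860:4860::8844"]
quad9 = ["9.9.9.9", "2620:fe::fe", "149.112.112.112", "2620:fe::9"]

# resolved.conf takes a space-separated list in its DNS= setting.
print("DNS=" + " ".join(interleave(cloudflare, google, quad9)))
```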

Also note that Quad9 filters by default on this IP while the other two do not, so you could get intermittent differences in resolution behavior. If this is a problem, don't mix filtered and unfiltered resolvers. You definitely shouldn't mix DNSSEC-validating and non-validating resolvers if you care about that (all of the above are DNSSEC-validating).

mnordhoff

Even without "all-servers", DNSMasq will race servers frequently (after 20 seconds, unless it's changed), and when retrying. A sudden outage should only affect you for a few seconds, if at all.

v5v3

> For many users, not being able to resolve names using the 1.1.1.1 Resolver meant that basically all Internet services were unavailable.

Don't you normally have two DNS servers listed on any device? So was the second also down? If not, why didn't it fall back to that?

rom1v

On Android, in Settings, Network & internet, Private DNS, you can only provide one in "Private DNS provider hostname" (AFAIK).

Btw, I really don't understand why it does not accept an IP (1.1.1.1), so you have to give a hostname (one.one.one.one). It would be more sensible to configure a DNS server by IP than by a hostname that itself has to be resolved by a DNS server :/

quacksilver

Private DNS on Android refers to 'DNS over HTTPS' and would normally only accept a hostname.

Normal DNS can normally be changed in your connection settings for a given connection on most flavours of Android.

fs111

No, it is not DNS over HTTPS it is DNS over TLS, which is different.

eptcyka

Cloudflare has valid certs for 1.1.1.1

quaintdev

It's DNS over TLS. Android does not support DNS over HTTPS except for Google's DNS.

rom1v

> Private DNS on Android refers to 'DNS over HTTPS'

Yes, sorry, I did not mention it.

So if you want to use DNS over HTTPS on Android, it is not possible to provide a fallback.

Macha

Cloudflare's own suggested config is to use their backup server 1.0.0.1 as the secondary DNS, which was also affected by this incident.

stingraycharles

TBH at this point the failure modes in which 1.1.1.1 would go down and 1.0.0.1 would not are not that many. At Cloudflare’s scale, it’s hardly believable that a single one of these DNS servers would go down; it’s more likely a large-scale system failure.

But I understand why Cloudflare can’t just say “use 8.8.8.8 as your backup”.

bombcar

At least some machines/routers do NOT have a primary and backup but instead randomly round-robin between them.

Which means that you’d be on cloudflare half the time and on google half the time which may not be what you wanted.

sschueller

Yes, I would also highly recommend using the DNS server closest to you (if your ISP doesn't mess around with its DNS, e.g. by blocking, you usually get much better response times), plus several from different providers.

If your device doesn't support proper failover use a local DNS forwarder on your router or an external one.

In Switzerland I would use Init7 (an ISP that doesn't filter) -> Quad9 (unfiltered version) -> EU dns0 (unfiltered version)

Gieron

I think normally you pair 1.1.1.1 with 1.0.0.1 and, if I understand this correctly, both were down.

moontear

Just pair 1.1.1.1 with 9.9.9.9 (Quad9) so you have fault tolerance in terms of provider as well.

Aachen

I became a bit disillusioned with quad9 when they started refusing to resolve my website. It's like wetransfer but supporting wget and without the AI scanning or interstitials. A user had uploaded malware and presumably sent the link to a malware scanner. Instead of reporting the malicious upload or blocking the specific URL¹, the whole domain is now blocked on a DNS level. The competing wetransfer.com resolves just fine at 9.9.9.9

I haven't been able to find any recourse. The malware was online for a few hours but it has been weeks and there seems to be no way to clear my name. Someone on github (the website is open source) suggested that it's probably because they didn't know of the website, like everyone heard of wetransfer and github and so they don't get the whole domain blocked for malicious user content. I can't find any other difference, but also no responsible party to ask. The false-positive reporting tool on quad9's website just reloads the page and doesn't do anything

¹ I'm aware DNS can't do this, but with a direct way of contacting a very responsive admin (no captchas or annoying forms, just email), I'd not expect scanners to resort to blocking the domain outright to begin with, at least not after they heard back the first time and the problematic content has been cleared swiftly

rvnx

Quad9 is reselling the traffic logs, so it means if you connect to secret hosts (like for your work), they will be leaked

baobabKoodaa

Windows 11 does not allow using this combination

Algent

Yeah, pretty much. In a perfect world you would pair it with another service, I guess, but usually you use the official backup IP because it's not supposed to break at the same time.

carlhjerpe

I would rather fall back to the slow path of resolving through root servers than fall back from one recursive resolver to another.

rvnx

8.8.8.8 + 1.1.1.1 is stable and mostly safe

baobabKoodaa

Windows 11 does not allow using this combination

zamadatix

1.1.1.1 is also what they call the resolver service as a whole; the impact section seems to be saying that both 1.0.0.0/24 and 1.1.1.0/24 were affected (among other ranges).

ahoka

Or run your own, if you are able to.

Bluescreenbuddy

Yup. I have Cloudflare and Quad9

bmicraft

My Mikrotik router (and afaict all of them) doesn't support more than one DoH address.

jallmann

Good writeup.

> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.

Interesting, I was affected by this yesterday. My router (supposedly) had Cloudflare DoH enabled but nothing would resolve. Changing the DNS server to 8.8.8.8 fixed the issues.

bauruine

How does DoH work? Somehow you need to know the IP of cloudflare-dns.com first. Maybe your router uses 1.1.1.1 for this.

maxloh

Yeah, your operating system will first need to resolve cloudflare-dns.com. This initial resolution will likely occur unencrypted via the network's default DNS. Only then will your system query the resolved address for its DoH requests.

Note that this adds one extra query per DNS request whenever the cached address has expired. For this reason, I've been using https://1.1.1.1/dns-query instead.

In theory, this should eliminate that overhead. Your operating system can validate the server's IP address against the Subject Alternative Name (SAN) field of the certificate presented by the DoH server: https://g.co/gemini/share/40af4514cb6e
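
For readers wondering what a DoH request to an IP-based endpoint looks like, here is a hedged Python sketch that only builds the request URL and sends nothing. It assumes the GET form from RFC 8484, where the DNS query in wire format is base64url-encoded without padding and passed as the `dns` parameter; `build_query` and `doh_get_url` are hypothetical helper names. A real client would also send an `Accept: application/dns-message` header and verify the certificate.

```python
import base64
import struct


def build_query(name: str, qtype: int = 1) -> bytes:
    """Encode a single-question DNS query (qtype 1 = A record) in wire format."""
    # Header: ID 0, flags 0x0100 (recursion desired), one question, no other records.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # Each label is prefixed with its length; a zero byte terminates the name.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class 1 = IN


def doh_get_url(endpoint: str, name: str) -> str:
    """Build an RFC 8484 GET URL: base64url of the query with the padding stripped."""
    encoded = base64.urlsafe_b64encode(build_query(name)).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


print(doh_get_url("https://1.1.1.1/dns-query", "www.example.com"))
# https://1.1.1.1/dns-query?dns=AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
```

Because the endpoint is an IP address, no bootstrap lookup is needed before the first request.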

stingraycharles

Yeah I don’t understand this part either, maybe it’s supposed to be bootstrapped using your ISP’s DNS server?

tom1337

Pretty much that. You set up a bootstrap DNS server (could be your ISPs or any other server) which then resolves the IP of the DoH server which then can be used for all future requests.

stavros

Are we meant to use a domain? I've always just used the IP.

landgenoot

You need a domain in order to get the s in https to work

ta1243

And even if you have already resolved it the TTL is only 5 minutes

nelox

[flagged]

k1t

Smells like AI and completely fails to answer the question.

How is the IP address of the DoH server obtained?

noduerme

Funny. I was configuring a new domain today, and for about 20 minutes I could only reach it through Firefox on one laptop. Google's DNS tools showed it active. SSH to an Amazon server that could resolve it. My local network had no idea of it. Flush cache and all. Turns out I had that one FF browser set up to use Cloudflare's DoH.

sneak

I disagree. The actual root cause here is shrouded in jargon that even experienced admins such as myself have to struggle to parse.

It’s corporate newspeak. “legacy” isn’t a clear term, it’s used to abstract and obfuscate.

> Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.

I know what this means, but there’s absolutely no reason for it to be written in this inscrutable corporatese.

stingraycharles

I disagree, the target audience is also going to be less technical people, and the gist is clear to everyone: they just deploy this config from 0 to 100% to production, without feature gates or rollback. And they made changes to the config that weren’t deployed for weeks until some other change was made, which also smells like a process error.

I will not say whether or not it’s acceptable for a company of their size and maturity, but it’s definitely not hidden in corporate lingo.

I do believe they could have elaborated more on the follow-up steps they will take to prevent this from happening again. I don’t think staggered rollouts are the only answer to this; they’re just a safety net.

willejs

If you carry on reading, it's quite obvious they misconfigured a service and routed production traffic to that instead of the correct service, and the system used to do that was built in 2018 and is considered legacy (probably because you can easily deploy bad configs). Given that, I wouldn't say the summary is "inscrutable corporatese", whatever that is.

bigiain

I agree it's not "inscrutable corporatese"

It's carefully written so my boss's boss thinks he understands it, and that we cannot possibly have that problem because we obviously don't have any "legacy components" because we are "modern and progressive".

It is, in my opinion, closer to "intentionally misleading corporatese".

sathackr

Good writeup except the entirely false timeline shared at the beginning of the post

bartvk

You need to clarify such a statement, in my opinion.

Hamuko

My (Unifi) router is set to automatic DoH, and I think that means it's using Cloudflare and Google. Didn't notice any disruptions so either the Cloudflare DoH kept working or it used the Google one while it was down.

zahrc

Check Jallmann’s response https://news.ycombinator.com/item?id=44578490#44578917

TLDR; DoH was working

Thorrez

AFAICS, Jallmann just left 1 comment and it was top-level. I'm not sure what you mean by "Jallmann’s response".

chrismorgan

I’m surprised at the delay in impact detection: it took their internal health service more than five minutes to notice (or at least alert) that their main protocol’s traffic had abruptly dropped to around 10% of expected and was staying there. Without ever having been involved in monitoring at that kind of scale, I’d have pictured alarms firing for something that extreme within a minute. I’m curious for a description of how and why that might be, and whether it’s reasonable or surprising to professionals in that space too.

perlgeek

There's a constant tension between speed of detection and false positive rates.

Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.

If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction will be "let's wait and see if it fixes itself".

I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
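
The "three failed checks in a row" rule can be sketched in a few lines of Python. This is an illustration of the idea only, not Nagios or Icinga configuration; the class name `ConsecutiveFailureAlert` is hypothetical.

```python
class ConsecutiveFailureAlert:
    """Open an alert only after `threshold` failed checks in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True while an alert should be open."""
        # A single passing check resets the streak, which filters out spurious failures.
        self.failures = 0 if check_ok else self.failures + 1
        return self.failures >= self.threshold


alert = ConsecutiveFailureAlert(threshold=3)
history = [False, False, True, False, False, False]
print([alert.record(ok) for ok in history])
```

The trade-off described above falls out directly: with a check interval of one minute, a threshold of three delays detection by at least three minutes.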

sbergot

I work on the SSO stack in a b2b company with about 200k monthly active users. One blind spot in our monitoring is when an error occurs on the client's identity provider because of a problem on our side. The service is unusable and we don't have any error logs to raise an alert. We tried to set up an alert based on expected vs actual traffic but we concluded that it would create more problems for the reason you provided.

grinich

This is off topic, but I’m the founder of WorkOS and we love hiring people with your experience. (WorkOS powers SSO for OpenAI, Anthropic, Cursor, etc.)

Send me an email if you’re ever looking for a new job? mg@workos.com

chrismorgan

At Cloudflare’s scale on 1.1.1.1, I’d imagine you could do something comparatively simple like track ten-minute and ten-second rolling averages (I know, I know, I make that sound much easier and more practical than it actually would be), and if they differ by more than 50%, sound the alarm. (Maybe the exact numbers would need to be tweaked, e.g. 20 seconds or 80%, but it’s the idea.)

Were it much less than 1.1.1.1 itself, taking longer than a minute to alarm probably wouldn’t surprise me, but this is 1.1.1.1; they’re dealing with vast amounts of probably fairly consistent traffic.
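
A rough Python sketch of the two-window comparison described above, assuming one traffic sample per second. The names `RollingMean` and `TrafficDropDetector` and the default numbers (600 samples, 10 samples, 50%) are illustrative only and say nothing about how Cloudflare's monitoring works.

```python
from collections import deque


class RollingMean:
    """Mean of the most recent `size` samples."""

    def __init__(self, size: int):
        self.samples = deque(maxlen=size)

    def add(self, value: float) -> None:
        self.samples.append(value)

    def mean(self) -> float:
        return sum(self.samples) / len(self.samples)


class TrafficDropDetector:
    """Alarm when the short-window mean deviates too far from the long-window mean."""

    def __init__(self, long_window: int = 600, short_window: int = 10, max_deviation: float = 0.5):
        self.long = RollingMean(long_window)
        self.short = RollingMean(short_window)
        self.max_deviation = max_deviation

    def observe(self, queries_per_second: float) -> bool:
        """Feed one per-second sample; return True if the alarm should sound."""
        self.long.add(queries_per_second)
        self.short.add(queries_per_second)
        baseline = self.long.mean()
        if baseline == 0:
            return False  # no baseline yet, nothing to compare against
        return abs(self.short.mean() - baseline) / baseline > self.max_deviation


detector = TrafficDropDetector()
for _ in range(600):
    detector.observe(1000.0)  # ten minutes of steady traffic
alarms = [detector.observe(100.0) for _ in range(10)]  # traffic drops to 10%
print(alarms.index(True) + 1, "seconds until the alarm fires")
```

In this toy setup a drop to 10% of normal trips the alarm after 6 seconds. Real traffic is noisier, which is where the false-positive concern raised elsewhere in the thread comes in.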

perlgeek

I'm sure some engineer at Cloudflare is evaluating something like this right now, and testing on historical data how many false positives it would have generated in the past, if any.

Thing is, it's probably still some engineering effort, and most orgs only really improve their monitoring after it turned out to be sub-optimal.

briangriffinfan

I would want to make sure we avoid "We should always do the exact specific thing that would have prevented this exact specific issue"-style thinking.

bombcar

This is one of those graphs that would have been on the giant wall in the NOC in the old days - someone would glance up and see it had dropped and say “that’s not right” and start scrambling.

TheDong

I'm not surprised.

Let's say you've got a metric aggregation service, and that service crashes.

What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.

Most orchestration systems take a sec to redeploy in this case, assuming that it could be a temporary outage of the node (like a network blip of some sort).

Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.

What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.

I know you often can differentiate no data and real drops, but the overall point, of "if you page people constantly, people will quit" I think is the important one. If people keep getting paged for too tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.

croemer

It's not rocket science. You do a 2 stage thing: Why not check if the aggregation service has crashed before firing the alarm if it's within the first 5 minutes? How many types of false positives can there be? You just need to eliminate the most common ones and you gradually end up with fewer of them.

Before you fire a quick alarm, check that the node is up, check that the service is up etc.
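
The two-stage idea could look roughly like this in Python. `should_page` and the precondition callables are hypothetical; in practice each precondition would be a real health check against the node or the aggregation service.

```python
from typing import Callable, Iterable


def should_page(metric_dropped: bool, preconditions: Iterable[Callable[[], bool]]) -> bool:
    """Page only if the metric dropped AND the monitoring pipeline itself looks healthy."""
    if not metric_dropped:
        return False
    # If any pipeline check fails, the "drop" is more likely missing data
    # than a real outage, so hold the quick alarm back.
    return all(check() for check in preconditions)


# Hypothetical stand-ins for real health checks.
def node_is_up() -> bool:
    return True


def aggregator_is_up() -> bool:
    return False


print(should_page(True, [node_is_up, aggregator_is_up]))
```

A slower fallback alarm would still be needed for the case where the pipeline stays broken, since this sketch stays silent as long as a precondition fails.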

mentalgear

It's not wrong for smaller companies. But there's an argument that a big, system-critical company/provider like Cloudflare should be able to afford its own always-on team with a night shift.

misiek08

Please don’t. It doesn’t make sense, doesn’t help, doesn’t improve anything, and is just a waste of money, time, power and people.

Now without crying: I have seen multiple big companies get rid of their NOC and replace it with on-call duty in multiple focused teams. Instead of 12 people sitting 24/7 in groups of 4, doing some basic analysis and steps before calling others, you page the correct people in 3-5 minutes with an exact and specific alert.

Incident resolution times went down greatly (2-10x, depending on the company), people don’t have to sit overnight and sleep most of the time, and no stupid actions like service restarts are taken that slow down incident resolution.

And I don’t like that some platforms hire 1500 people for a job that could be done with 50-100, but in terms of incident response, if you already have teams with separated responsibilities, then a NOC is "legacy".

amelius

I think it is reasonable if the alarm trigger time is, say 5-10% of the time required to fix most problems.

chrismorgan

Not even a night shift, just normal working hours in another part of the world.

philipwhiuk

Remember they have no SLA for this service.

chrismorgan

So?

They have a rather significant vested interest in it being reliable.

kachapopopow

Interesting to see that they probably lost 20% of 1.1.1.1 usage from a roughly 20 minute incident.

Not sure how Cloudflare keeps struggling with issues like these; this isn't the first (and probably won't be the last) time they have had these 'simple', 'deprecated', 'legacy' issues occurring.

8.8.8.8+8.8.4.4 hasn't had a global(1) second of downtime for almost a decade.

1: localized issues did exist, but that's really the fault of the internet and they did remain running when google itself suffered severe downtime in various different services.

Tepix

There's more to DNS than just availability (granted, it's very important). There's also speed and privacy.

European users might prefer one of the alternatives listed at https://european-alternatives.eu/category/public-dns over US corporations subject to the CLOUD act.

adornKey

I think just setting up Unbound is even less trouble. Servers come and go. Getting rid of the dependency altogether is better than having to worry who operates the DNS-servers and how long it's going to be available.

daneel_w

Everyone, European or not, should prefer anything but Cloudflare and Google if they feel that privacy has any value.

immibis

HN users might prefer to run their own. It's a low maintenance service. It's not like running a mail server.

daneel_w

I think that might be overestimating the technical prowess of HN readers on the whole. Sure, it doesn't require wizardry to set up e.g. Unbound as a catch-all DoT forwarder, but it's not the click'n'play most people require. It should be compared to just changing the system resolvers to dns0, Quad9 etc.

perlgeek

An outage of roughly 1 hour is 0.13% of a month or 0.0114% of a year.

It would be interesting to see the service level objective (SLO) that cloudflare internally has for this service.

I've found https://www.cloudflare.com/r2-service-level-agreement/ but this seems to be for paid services, so this outage would put July in the "< 99.9% but >= 99.0%" bucket, so you'd get a 10% refund for the month if you paid for it.

philipwhiuk

Probably 99.9% or better annually just from a 'maintaining reputation for reliability' standpoint.

stingraycharles

What really matters with these percentages is whether it’s per month or per year. 99.9% per year allows for much longer outages than 99.9% per month.
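
The difference is easy to check with a little arithmetic. This Python sketch assumes a 30-day month and a 365-day year; the function name is made up for illustration.

```python
def allowed_downtime_minutes(availability_percent: float, period_hours: float) -> float:
    """Downtime budget, in minutes, for a given availability target over a period."""
    return period_hours * 60 * (1 - availability_percent / 100)


MONTH_HOURS = 30 * 24   # assumption: 30-day month
YEAR_HOURS = 365 * 24   # assumption: non-leap year

monthly = allowed_downtime_minutes(99.9, MONTH_HOURS)
yearly = allowed_downtime_minutes(99.9, YEAR_HOURS)
print(f"99.9% per month allows about {monthly:.1f} minutes of downtime")
print(f"99.9% per year allows about {yearly:.1f} minutes of downtime")
```

So an outage of roughly one hour blows a 99.9% monthly target (about 43 minutes) but fits comfortably inside a 99.9% yearly target (about 8.8 hours).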

alyandon

  Cloudflare's 1.1.1.1 Resolver service became unavailable to the Internet starting at 21:52 UTC and ending at 22:54 UTC
Weird. According to my own telemetry from multiple networks they were unavailable for a lot longer than that.

CuteDepravity

It's crazy that both 1.1.1.1 and 1.0.0.1 were affected by the same change.

I guess now we should start using a completely different provider as DNS backup, maybe 8.8.8.8 or 9.9.9.9.

sammy2255

1.1.1.1 and 1.0.0.1 are served by the same service. It's not advertised as a redundant fully separate backup or anything like that...

yjftsjthsd-h

Wait, then why does 1.0.0.1 exist? I'll grant I've never seen it advertised/documented as a backup, but I just assumed it must be because why else would you have two? (Given that 1.1.1.1 already isn't actually a single point, so I wouldn't think you need a second IP for load balancing reasons.)

kalmar

I don't know if it's the reason, but inet_aton[0] and other parsing libraries that match its behaviour will parse 1.1 as 1.0.0.1. I use `ping 1.1` as a quick connectivity test.

[0] https://man7.org/linux/man-pages/man3/inet_aton.3.html#DESCR...
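
This is easy to verify from Python, whose `socket.inet_aton` accepts the same shorthand forms. The helper name `expand` is just for illustration.

```python
import socket


def expand(short: str) -> str:
    """Expand a shorthand IPv4 address the way inet_aton parses it."""
    # In the two-part form "a.b", the second part fills the lower 24 bits.
    return socket.inet_ntoa(socket.inet_aton(short))


print(expand("1.1"))    # 1.0.0.1
print(expand("127.1"))  # 127.0.0.1
```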

tom1337

Wasn’t it also because a lot of hotel / public routers used 1.1.1.1 for captive portals and therefore you couldn’t use 1.1.1.1?

immibis

Because operating systems have two boxes for DNS server IP addresses, and Cloudflare wants to be in both positions.

ta1243

Far quicker to type ping 1.1 than ping 1.1.1.1

1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere. Indeed right now 1.1.1.1 from my laptop goes via 141.101.71.63 and 1.0.0.1 via 141.101.71.121, which are both hosts on the same LINX/LON1 peer but presumably from different routers, so there is some resilience there.

Given DNS is about the easiest thing to avoid a single point of failure on I'm not sure why you would put all your eggs in a single company, but that seems to be the modern internet - centralisation over resilience because resilience is somehow deemed to be hard.

0xbadcafebee

In general, the idea of DNS's design is to use the DNS resolver closest to you, rather than the one run by the largest company.

That said, it's a good idea to specifically pick multiple resolvers in different regions, on different backbones, using different providers, and not use an Anycast address, because Anycast can get a little weird. However, this can lead to hard-to-troubleshoot issues, because DNS doesn't always behave the way you expect.

ben0x539

Isn't the largest company most likely to have the DNS resolver closest to me?

sschueller

No, your ISP can have a server closer before any external one.

fragmede

Your ISP should have a DNS resolver closer to you. "Should" doesn't necessarily mean faster, however.

dontTREATonme

What’s your recommendation for finding the dns resolver closest to me? I currently use 1.1 and 8.8, but I’m absolutely open to alternatives.

LeoPanthera

The closest DNS resolver to you is the one run by your ISP.

baobabKoodaa

Windows 11 doesn't allow using that combination

codingminds

Hasn't that always been the case?

bigiain

I mean, aren't we already?

My Pi-holes both use OpenDNS, Quad9, and CloudFlare for upstream.

Most of my devices use both of my Pi-holes.

globular-toast

In general there's no such thing as "DNS backup". Most clients just arbitrarily pick one from the list, they don't fall back to the other one in case of failure or anything. So if one went down you'd still find many requests timing out.

JdeBP

The reality is that it's rather complicated to say what "most clients" do, as there is some behavioural variation amongst the DNS client libraries when they are configured with multiple IP addresses to contact. So whilst it's true to say that fallback and redundancy does not always operate as one might suppose at the DNS client level, it is untrue to go to the opposite extreme and say that there's no such thing at all.

Mindless2112

Interesting that traffic didn't return to completely normal levels after the incident.

I recently started using the "luci-app-https-dns-proxy" package on OpenWrt, which is preconfigured to use both Cloudflare and Google DNS, and since DoH was mostly unaffected, I didn't notice an outage. (Though if DoH had been affected, it presumably would have failed over to Google DNS anyway.)

caconym_

> Interesting that traffic didn't return to completely normal levels after the incident.

Anecdotally, I figured out their DNS was broken before it hit their status page and switched my upstream DNS over to Google. Haven't gotten around to switching back yet.

radicaldreamer

What would be a good reason to switch back from Google DNS?

Algent

After trying both several times, I have stayed with Google because Cloudflare always returned really bad IPs for anything involving a CDN. Having users complain that stuff takes ages to load because you got matched to an IP on the opposite side of the planet is a bit problematic, especially when it rarely happens on other DNS providers. Maybe there is a way to fix this, but I admit I went for the easier option of going back to good old 8.8.8.8

sammy2255

Depends who you trust more with your DNS traffic. I know who I trust more.

anon7000

They go into that more towards the end, sounds like some smaller % of servers needed more direct intervention

motorest

> Interesting that traffic didn't return to completely normal levels after the incident.

Clients cache DNS resolutions to avoid having to do that request each time they send a request. It's plausible that some clients held on to their cache for a significant period.

nu11ptr

Question: Years ago, back when I used to do networking, Cisco Wireless controllers used 1.1.1.1 internally. They seemed to literally blackhole any comms to that IP in my testing. I assume they changed this when 1.0.0.0/8 started routing on the Internet?

blurrybird

Yeah part of the reason why APNIC granted Cloudflare access to those very lucrative IPs is to observe the misconfiguration volume.

The theory is CF had the capacity to soak up the junk traffic without negatively impacting their network.

yabones

The general guidance for networking has been to only use IPs and domains that you actually control... But even 5-8 years ago, the last time I personally touched a cisco WLC box, it still had 1.1.1.1 hardcoded. Cisco loves to break their own rules...

0xbadcafebee

> A configuration change was made for the same DLS service. The change attached a test location to the non-production service; this location itself was not live, but the change triggered a refresh of network configuration globally.

Say what now? A test triggered a global production change?

> Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.

You have a process that allows some other service to just hoover up address routes already in use in production by a different service?

i_niks_86

Many commenters assume fallback behavior exists between DNS providers, but in practice, DNS clients, especially at the OS or router level, rarely implement robust failover for DoH. If you're using cloudflare-dns(.)com and it goes down, unless the stub resolver or router explicitly supports multi-provider failover (and uses a trust-on-first-use or pinned cert model), you’re stuck. The illusion of redundancy with DoH needs serious UX rethinking.

tankenmate

I use routedns[0] for this specific reason. It handles almost all DNS protocols: UDP, TCP, DoT, DoH, DoQ (including 0-RTT). But more importantly, it has very configurable route steering, even down to a record-by-record basis if you want to put up with all the configuration involved. It's very robust and very handy. I use 1.1.1.1 on my desktops and servers, and when the incident happened I didn't even notice as the failover "just worked". I had to actually go look at the logs because I didn't notice.

[0] https://github.com/folbricht/routedns