Webflow Down for >31 Hours
65 comments · July 29, 2025 · stackskipton
qcnguy
More likely that their core database hit some scaling limit and fell over. Their status page talks constantly about them working with their "upstream database provider" (presumably AWS) to find a fix.
My guess: they use AWS-hosted PostgreSQL, autovacuuming fell permanently behind without them noticing and can't keep up with organic growth, and they can't scale vertically because they already maxed that out. So they have to do crash migrations of data off their core DB, which is why it's taking so long.
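If it really is vacuum lag, it would show up in the stock Postgres stats views. A rough sketch of the check (the connection string is a placeholder and everything here is guesswork about their setup):

    # Sketch: spot tables where autovacuum has fallen behind, using the
    # standard pg_stat_user_tables view. The DSN is a placeholder.
    import psycopg2

    QUERY = """
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
    """

    with psycopg2.connect("dbname=app host=db.example.internal") as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for relname, live, dead, last_av in cur.fetchall():
                # A large dead/live ratio plus an old last_autovacuum timestamp
                # is the classic "vacuum can't keep up" signature.
                ratio = dead / max(live, 1)
                print(f"{relname}: {dead} dead vs {live} live "
                      f"(ratio {ratio:.2f}), last autovacuum {last_av}")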
esafak
If so, it is probably a good time to apply for an SRE position there, unless they really do not get it!
acedTrex
An outage of this magnitude is almost ALWAYS the direct and immediate fault of senior leadership's priorities and focus. Pushing too hard in some areas, not listening to engineers on needed maintenance tasks, etc.
AsmodiusVI
And engineers are never the cause of mistakes? There can't possibly be any data to back up that major outages are more often caused by leadership. I've been in severe incidents simply because someone pushed a bad change to a switch network and caused an outage. Statements like these only go to show how much we have to learn, humble ourselves, and stop blaming others all the time.
acedTrex
PROLONGED outages are a failure mode that, more often than not, requires organizational dysfunction to happen.
AlotOfReading
Leadership can include engineers responsible for technical priorities. If you're down for that long, though, it's usually an organizational fuck-up, because the priorities didn't include identifying and mitigating systemic failure modes. The proximate cause isn't all that important, and the people who set organizational priorities are by and large not engineers.
bravesoul2
Think of airplane safety. I think it is similar. A good culture can make sure $root-cause is more likely detected, tested, isolated, monitored, easy to roll back and so on.
nusl
My sympathy for those in the mud dealing with this. Never a fun place to be. Hope y'all figure it out and manage to de-stress :)
mattbillenstein
We're sorry https://www.youtube.com/watch?v=9u0EL_u4nvw
Edit: an outage of this length smells of bad systems architecture...
hinkley
Prediction: Someone confidently broke something, then confidently 'fixed' it, with the consequence of breaking more things instead. And now they have either been pulled off of the cleanup work or they wish they had been.
bravesoul2
Wow, >31h. I am surprised they couldn't rebuild their entire systems in parallel on new infra in that time. Can be hard if data loss is involved tho (a guess). Would love to see the postmortem so we all can learn.
stackskipton
I doubt it's an infra failure; more likely a software failure. Their bad design has caught up with them and they can't toss more hardware at it for some reason. Most companies have this https://xkcd.com/2347/ in their stack and it's fallen over.
wavemode
CEO's statement: https://www.reddit.com/r/webflow/comments/1mcmxco/from_webfl...
progbits
> 99.99%+ uptime is the standard we need to meet, and lately, we haven’t.
Four nines is not what I would be citing at this point. (That's less than an hour per year, so they've burned that budget for the next three decades.)
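Back-of-envelope, for anyone checking the math (a quick sketch, nothing more):

    # Downtime budget per year at a few availability targets, and how many
    # years of a four-nines budget a 31-hour outage burns.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for target in (0.99, 0.999, 0.9999):
        budget = MINUTES_PER_YEAR * (1 - target)
        print(f"{target:.2%} uptime -> {budget:.0f} min/year of downtime")

    outage_min = 31 * 60
    print(f"a 31h outage is ~{outage_min / (MINUTES_PER_YEAR * 0.0001):.0f} "
          f"years of 99.99% budget")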
Maybe aim for 99% first.
Otherwise a pretty honest and solid response, kudos for that!
zamadatix
One could have nearly 3 such incidents per year and still have hit 99%.
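Quick check (back-of-envelope):

    # 1% of a year, expressed in 31-hour outages.
    hours_per_year = 365 * 24              # 8760
    budget_hours = hours_per_year * 0.01   # 87.6 hours of downtime at 99%
    print(budget_hours / 31)               # ~2.8 outages of this length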
I always strive for 7 9s myself, just not necessarily consecutive digits.
manquer
It could be consecutive too, and even start with a 9 and be all nines. Here you go: 9.9999999%
Spivak
I strive for one 9, thank you. No need to overcomplicate. We use Lambda on top of Glacier.
jeeyoungk
why go for 9's when you can go for 8s? you can aim for 88.8888888!
theideaofcoffee
Lots get starry-eyed and aim for five nines right out of the gate where they should have been targeting nine fives and learning from that. Walk before you run.
edoceo
Interesting that the phrase "I'm sorry" was in there. Almost feels like someone in the Big Chair taking a bit of responsibility. Cheers to that.
thih9
> Change controls are tighter, and we’re investing in long-term performance improvements, especially in the CMS.
This reads as if overall performance was an afterthought, which doesn't seem practical; it should be a business metric, since it is important to the users after all.
Then again, it’s easy to comment like this in hindsight. We’ll see what happens long term.
newZWhoDis
As a former webflow customer I can assure you performance was always an afterthought.
stackskipton
I mean, if customers don't leave them over this, the higher-ups likely won't care after the dust settles.
bravesoul2
Decent update. Guess people are really waiting for a fix tho!
willejs
Hugops to the people working on this for the last 31+ hours. Running incidents of this significance is hard and draining and requires a lot of effort; this going on for so long must be very difficult for all involved.
bravesoul2
Hopefully they are rotating teams rather than having people stay awake for a dangerous amount of time.
dangoodmanUT
Hugs for their SREs sweating bullets rn
sangeeth96
Hugs to the ones dealing with this, and to the Webflow users who invested in the platform for their clientele. Hoping they'll release a full postmortem once the sky clears up.
betaby
I'm more surprised that WordPress-like platforms are profitable businesses in 2025.
bravesoul2
Because imagine your local biz can either pay a designer 1k a year, or DIY and pay GoDaddy 200 bucks, or 30 bucks for WordPress plus 20 hours of fiddling and asking their cousin for help.
It's not great by our standards, but I bet many of us drink the house wine rather than something more sophisticated, right :)
bogzz
Why? Genuinely asking. Did you mean because there are free alternatives to self-host? I don't think that it would be so easy for someone in the market for a WYSIWYG blog builder to set everything up themselves.
newZWhoDis
We moved away from Webflow because it was slow (it got the nickname "web-slow" internally).
Plus, despite marketing begging for the WYSIWYG interface, they actually weren't creative enough to generate new content at a pace that required it.
We massively increased conversion rates by going full native and having one engineer churn out parts kits and kitbash LPs from said kits.
Scale for reference: ~$10M/month
wewewedxfgdf
Companies get very good at handling disasters - after the disaster has happened.
dylan604
The problem is they get good at that specific disaster. They can only plug a hole in the dike after the hole exists; then they look at the hole and make a plug the exact shape of that hole. The next hole starts the process over for it specifically. Each time. There's no generic plug that can be used every time. So sure, they get very good at making specific plugs. They never get to the point of making a better dike that doesn't spring so many leaks.
wewewedxfgdf
It is the job of the CTO to ensure the company has anticipated as many such situations as possible.
It's not a very interesting thing to do however.
dylan604
Okay, and? The CTO isn't the last word in anything. If they are overruled in favor of releasing new features, acquiring new users/clients, and sales-forward dev cycles, then the whole thing has the potential to collapse under its own weight.
It's actually the job of the CEO to keep all of the C-suite people doing their jobs. Doesn't seem to stop the CEO salary explosions.
esseph
You just described every company.
(And also why security is always a losing battle)
plutaniano
Will the company survive long enough to produce a postmortem?
My SRE brain, reading between the lines, says they have been a feature factory and tech debt finally caught up with them.
My guess is that the reason they've been down so long is that they don't have good rollback, so they're attempting to fix forward with limited success.
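To be concrete about what "good rollback" would mean, here's a rough sketch (hypothetical names, not anything Webflow actually runs): keep the last known-good release around and fail back to it as soon as health checks go red, instead of debugging forward while the site is down.

    # Hypothetical deploy wrapper illustrating "roll back first, debug later".
    # activate() and health_check() stand in for whatever the real release
    # tooling would be.
    import time

    def deploy_with_rollback(new_release, previous_release,
                             activate, health_check,
                             checks=5, interval_s=30.0):
        """Activate new_release; revert to previous_release if checks fail."""
        activate(new_release)
        for _ in range(checks):
            time.sleep(interval_s)
            if not health_check():
                # Fail back to the known-good release immediately rather
                # than attempting a fix-forward under pressure.
                activate(previous_release)
                return False
        return True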