w10-1 2 hours ago

Kudos to Cloudflare for clarity and diligence.

When talking of their earlier Lua code:

> we have never before applied a killswitch to a rule with an action of “execute”.

I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?

It tracks what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true.

I had a manager who came from defense electronics in the 1980s. He said that in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.

  • braiamp an hour ago

    This is funny, considering that someone who worked in the defense industry (on a guided missile system) found a memory leak in one of their products at that time. They told him that they knew about it, but that it was timed just right for the range the system would be used at, so it didn't matter.

    • Etheryte 33 minutes ago

      This paraphrased urban legend has nothing to do with quality engineering though? As described, it's designed to the spec and working as intended.

    • mopsi 35 minutes ago

      ... until the extended-range version is ordered and no one remembers to fix the leak. :]

  • zwnow an hour ago

    "Kudos"? This is like the South Park episode in which the oil company guy just excuses himself while the company just continues to fuck up over and over again. There's nothing to praise; this shouldn't happen twice in a month. It's inexcusable.

Scaevolus 4 hours ago

> Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules

Warning signs like this are how you know that something might be wrong!

  • testplzignore an hour ago

    > They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

    This is what jumped out at me as the biggest problem. A wild west deployment process is a valid (but questionable) business decision, but if you do that then you need smart people in place to troubleshoot and make quick rollback decisions.

    Their timeline:

    > 08:47: Configuration change deployed and propagated to the network

    > 08:48: Change fully propagated

    > 08:50: Automated alerts

    > 09:11: Configuration change reverted and propagation start

    > 09:12: Revert fully propagated, all traffic restored

    2 minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds before the changes even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.

    Ok, so they want to take the approach of roll forward instead of immediate rollback. Again, that's a valid approach, but you need to be prepared. At 08:48, they would have had tens of millions of "init.lua:314: attempt to index field 'execute'" messages being logged per second. Exact line of code. Not a complex issue. They should have had engineers reading that code and piecing this together by 08:49. The change you just deployed was to disable an "execute" rule. Put two and two together. Initiate rollback by 08:50.

    How disconnected are the teams that do deployments vs the teams that understand the code? How many minutes were they scratching their butts wondering "what is init.lua"? Are they deploying while their best engineers are sleeping?

  • philipwhiuk 3 hours ago

    > Warning signs like this are how you know that something might be wrong!

    Yes, as they explain, it's the rollback triggered by seeing these errors that broke stuff.

    • Scaevolus 3 hours ago

      They saw errors and decided to do a second rollout to disable the component generating errors, causing a major outage.

    • 8cvor6j844qw_d6 3 hours ago

      Would be nice if the outage dashboards are directly linked to this instead of whatever they have now.

uyzstvqs 2 hours ago

What I'm missing here is a test environment. Gradual or not; why are they deploying straight to prod? At Cloudflare's scale, there should be a dedicated room in Cloudflare HQ with a full isolated model-scale deployment of their entire system. All changes should go there first, with tests run for every possible scenario.

Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you, good procedure will.
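
A minimal Python sketch of the "gradual deployment plus big red button" idea. Everything in it (the stage names, the `apply`, `rollback`, and `error_rate` callables, the 1% threshold) is hypothetical and for illustration only; it says nothing about how Cloudflare's systems actually work.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[str],
    apply: Callable[[str], None],
    rollback: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,
) -> bool:
    """Apply a change stage by stage; undo everything if a stage looks unhealthy.

    All names and the threshold are hypothetical placeholders.
    Returns True if every stage was applied and stayed healthy.
    """
    applied = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if error_rate(stage) > max_error_rate:
            # The "big red button": revert in reverse order of application.
            for done in reversed(applied):
                rollback(done)
            return False
    return True
```

The point of the design is that the change never reaches a later stage unless the earlier one stayed under the error threshold, and the rollback path is exercised by the same code that does the rollout.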

lionkor 3 hours ago

Cloudflare is now below 99.9% uptime, for anyone keeping track. I reckon my home PC is at least 99.9%.
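
For scale, the downtime that an availability target allows is simple arithmetic. A quick Python sketch (the function name is made up, and it assumes a 365-day year):

```python
def allowed_downtime_minutes(availability: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability (0..1)."""
    return (1.0 - availability) * days * 24 * 60


print(round(allowed_downtime_minutes(0.999)))   # minutes per year at 99.9%
print(round(allowed_downtime_minutes(0.9999)))  # minutes per year at 99.99%
```

At 99.9% that works out to about 526 minutes, or roughly 8.8 hours, per year.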

  • markus_zhang an hour ago

    TBF, it depends on the number of local outages. In my area there is an outage with every thunderstorm or snowstorm, so unfortunately the uptime of my laptop, even with the help of a large portable battery charging station (which can charge multiple laptops at the same time), does not look great.

    I sometimes fancy that I could just take cash, go into the woods, build a small solar array, collect & cleanse river water, and buy a Starlink console.

    • roguecoder 18 minutes ago

      Costco had a deal on solid-state UPS & solar panels a while back that I was happy to partake of

flaminHotSpeedo 4 hours ago

What's the culture like at Cloudflare re: ops/deployment safety?

They saw errors related to a deployment, and because it was related to a security issue instead of rolling it back they decided to make another deployment with global blast radius instead?

Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Pure speculation, but to me that sounds like there's more to the story: this sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.

  • dkyc 3 hours ago

    One thing to keep in mind when judging what's 'appropriate' is that Cloudflare was effectively responding to an ongoing security incident outside of their control (the React Server RCE vulnerability). Part of Cloudflare's value proposition is being quick to react to such threats. That changes the equation a bit: any hour you wait longer to deploy, your customers are actively getting hacked through a known high-severity vulnerability.

    In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.

    That isn't to say it didn't work out badly this time, just that the calculation is a bit different.

    • flaminHotSpeedo 3 hours ago

      To clarify, I'm not trying to imply that I definitely wouldn't have made the same decision, or that cowboy decisions aren't ever the right call.

      However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems?" but "has Cloudflare learned from its recent mistakes to respond appropriately to events?"

      • vlovich123 2 hours ago

        > doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage

        There’s no other deployment system available. There’s a single system for config deployment, and it’s all that was available, as they haven’t done the progressive rollout implementation yet.

        • locknitpicker 11 minutes ago

          > There’s no other deployment system available.

          Hindsight is always 20/20, but I don't know how that sort of oversight could happen in an organization whose business model rides on reliability. Small shops understand the importance of safeguards such as progressive deployments or one-box-style deployments with a baking period, so why not the likes of Cloudflare? Don't they have anyone on their payroll who warns about the risks of global deployments without safeguards?

        • edoceo an hour ago

          OK, sure. But shouldn't they have some beta/staging/test area they could deploy to, run tests for an hour, then do the global blast?

          • vlovich123 25 minutes ago

            Config changes are distinctly more difficult to have that set up for and as the blog says they’re working on it. They just don’t have it ready yet and are pausing any more config changes until it’s set up. They just did this one in response to try to mitigate an ongoing security vulnerability and missed the mark.

            I’m happy to see they’re changing their systems to fail open which is one of the things I mentioned in the conversation about their last outage.

    • cowsandmilk 33 minutes ago

      Cloudflare had already decided this was a rule that could be rolled out using their gradual deployment system. They did not view it as being so urgent that it required immediate global roll out.

    • Already__Taken 3 hours ago

      The CVE isn't a zero-day, though. How come Cloudflare wasn't at the table for early disclosure?

      • flaminHotSpeedo 2 hours ago

        Do you have a public source about an embargo period for this one? I wasn't able to find one

        • charcircuit 2 hours ago

          Considering there were patched libraries at the time of disclosure, those libraries' authors must have been informed ahead of time.

        • Pharaoh2 2 hours ago

          https://react.dev/blog/2025/12/03/critical-security-vulnerab...

          Privately disclosed: Nov 29. Fix pushed: Dec 1. Publicly disclosed: Dec 3.

          • drysart 2 hours ago

            Then even in the worst-case scenario, they were addressing this issue two days after it was publicly disclosed. So this wasn't a "rush to fix the zero day ASAP" scenario, which makes it harder to justify ignoring errors that started occurring in a small-scale rollout.

    • udev4096 3 hours ago

      Clownflare did what it does best, mess up and break everything. It will keep happening again and again

      • toomuchtodo 2 hours ago

        Indeed, but it is what it is. Cloudflare comes out of my budget, and even with downtime, it's better than not paying them. Do I want to deal with what Cloudflare offers? I do not; I have higher-value work to focus on. I want to pay someone else to deal with this, and just like when cloud providers are down, it'll be back up eventually. Grab a coffee or beer and hang; we aren't saving lives, we're just building websites. This is not laziness or nihilism, but simply being rational and pragmatic.

  • liampulles 3 hours ago

    Rollback is a reliable strategy when the rollback process is well understood. If a rollback process is not well known and well experienced, then it is a risk in itself.

    I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.

    • programd 33 minutes ago

      Global rollout of security code on a timeframe of seconds is part of Cloudflare's value proposition.

      In this case they got unlucky with an incident before they finished work on planned changes from the last incident.

    • newsoftheday an hour ago

      Rollback carries with it the contextual understanding of complete atomicity; otherwise it's slightly better than a yeet. It's similar to backups that are untested.

      • marcosdumay 39 minutes ago

        Complete atomicity carries with it the idea that the world is frozen, and any data only needs to change when you allow it to.

        That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.

        • newsoftheday 8 minutes ago

          No, complete atomicity doesn't require a frozen state; it requires common sense and fail-proof, fool-proof guarantees derived from assurances gained through testing.

          There is another name for rolling forward, it's called tripping up.

  • crote 2 hours ago

    > They saw errors related to a deployment, and because it was related to a security issue instead of rolling it back they decided to make another deployment with global blast radius instead?

    Note that the two deployments were of different components.

    Basically, imagine the following scenario: A patch for a critical vulnerability gets released, during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer instead, you roll out a GPO to use a blank screensaver instead of the intended corporate branding, a crash in a script parsing the GPOs on this new value prevents users from logging in.

    There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.

    Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.

  • lukeasrodgers 3 hours ago

    Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.

    • flaminHotSpeedo 2 hours ago

      Like the other poster said, roll back should be the right answer the vast majority of the time. But it's also important to recognize that roll forward should be a replacement for the deployment you decided not to roll back, not a parallel deployment through another system.

      I won't say never, but a situation where the right answer to avoid a rollback (that it sounds like was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global blast radius, near instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit

      • crote an hour ago

        Is a roll back even possible at Cloudflare's size?

        With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough you've got enough developers that half a dozen PRs will have been merged since the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issues?

        Realistically the best you're going to get is merging a revert of the problematic changeset, but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.

        • newsoftheday an hour ago

          If companies like Cloudflare haven't figured out how to do reliable rollbacks, there seems little hope for any of us.

        • yuliyp an hour ago

          I'd presume they have the ability to deploy a previous artifact vs only tip-of-master.

    • echelon 3 hours ago

      You want to build a world where roll back is 95% the right thing to do. So that it almost always works and you don't even have to think about it.

      During an incident, the incident lead should be able to say to your team's on call: "can you roll back? If so, roll back" and the oncall engineer should know if it's okay. By default it should be if you're writing code mindfully.

      Certain well-understood migrations are the only cases where roll back might not be acceptable.

      Always keep your services in "roll back able", "graceful fail", "fail open" state.

      This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.

      Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.

      I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.

      • drysart 2 hours ago

        "Fail open" state would have been improper here, as the system being impacted was a security-critical system: firewall rules.

        It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.

  • this_user 3 hours ago

    The question is perhaps what the shape and status of their tech stack is. Obviously, they are running at massive scale, and they have grown extremely aggressively over the years. What's more, especially over the last few years, they have been adding new product after new product. How much tech debt have they accumulated with that "move fast" approach that is now starting to rear its head?

    • sandeepkd 2 hours ago

      I think this is probably a bigger root cause and is going to show up in different ways in the future. The mere act of adding new products to an existing architecture/system is bound to create knowledge silos around operations and tech debt. There is a good reason why big companies keep smart people on their payroll just to change a couple of lines after a week of debate.

  • otterley 3 hours ago

    From the post:

    “We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

    “We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”

  • NicoJuicy 2 hours ago

    Where I work, all teams were notified about the React CVE.

    Cloudflare made it less of an expedite.

  • ignoramous 2 hours ago

    > this sounds like the sort of cowboy decision

    Ouch. Harsh, given that Cloudflare is being over-honest (about disabling the internal tool) and that the outage had relatively limited impact (in both duration and number of customers). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap, Dec 5 it's Lua's turn with its dynamic typing.

    Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...

    cf TFA:

      if rule_result.action == "execute" then
        rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
      end
    
      This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist ... error in the [Lua] code, which had existed undetected for many years ... prevented by languages with strong type systems. In our replacement [FL2 proxy] ... code written in Rust ... the error did not occur.
    
    [0] https://news.ycombinator.com/item?id=44159166

  • rvz 3 hours ago

    > Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

    Also there seems to be insufficient testing before deployment with very junior level mistakes.

    > As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

    Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and rollback. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".

    I guess those at Cloudflare are not learning anything from the previous disaster.

  • deadbabe 4 hours ago

    As usual, Cloudflare is the man in the arena.

    • samrus 3 hours ago

      There are other men in the arena who aren't tripping on their own feet

      • usrnm 3 hours ago

        Like who? Which large tech company doesn't have outages?

        • k8sToGo 3 hours ago

          It's not about outages. It's about the why. Hardware can fail. Bugs can happen. But to continue a rollout despite warning signs and without understanding the cause and impact is on another level. Especially if it is related to the same problem as last time.

          • udev4096 2 hours ago

            And yet, it's always clownflare breaking everything. Failures are inevitable, which is widely known; therefore we build resilient systems to overcome the inevitable

            • deadbabe 2 hours ago

              It is healthy for tech companies to have outages, as they will build experience in resolving them. Success breeds complacency.

        • nish__ 2 hours ago

          Google does pretty good.

          • hansonkd 25 minutes ago

            Google docs was just down a couple weeks ago almost the whole day.

        • k__ 3 hours ago

          "tripping on their own feet" == "not rolling back"

  • nine_k 3 hours ago

    > more to the story

    From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)

  • NoSalt 3 hours ago

    Ooh ... I want to be on a cowboy decision making team!!!

miyuru 4 hours ago

What's going on with Cloudflare's software team?

I have seen similar bugs in the Cloudflare API recently as well.

There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
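
To illustrate the ordering problem, here is a purely hypothetical Python sketch. The names `User`, `PlanError`, and `enable_feature` are invented for the example and this is not Cloudflare's API code. The entitlement check belongs before any work is done:

```python
from dataclasses import dataclass, field


class PlanError(Exception):
    """Raised when the caller's plan does not include the feature."""


@dataclass
class User:
    name: str
    plan: str
    steps_run: list = field(default_factory=list)


def enable_feature(user: User) -> None:
    # Check the plan first, so a non-enterprise user is rejected
    # before any side effects happen.
    if user.plan != "enterprise":
        raise PlanError(f"{user.name}: feature requires an enterprise plan")
    user.steps_run.append("validate_input")
    user.steps_run.append("write_config")
```

With the check at the top, a rejected request leaves nothing half-done that would later need cleaning up.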

  • archon810 2 hours ago

    I recently ran into an issue with a Cloudflare API feature that, if you want to roll it back, requires contacting the support team, because there's no way to roll it back with the API or GUI. Even when the exact issue was pointed out, it took multiple days to change the setting, and to my knowledge there's still no API fix available.

    https://www.answeroverflow.com/m/1234405297787764816

  • LelouBil 2 hours ago

    Can you elaborate? I'm not sure what you mean by "at the last step"

    • Etheryte 30 minutes ago

      I'm not sure which endpoint gp meant, but as I understood it, as an example, imagine a three-way handshake that's only available to enterprise users. Instead of failing a regular user on the first step, they allow steps one and two, but then do the check on step three and fail there.

  • 65 2 hours ago

    My guess? Code written by AI

    • system2 an hour ago

      100%. Upper management tries to cut costs and hires remote bullshitters.

cpncrunch 2 hours ago

I've noticed that in recent months, even apart from these outages, cloudflare has been contributing to a general degradation and shittification of the internet. I'm seeing a lot more "prove you're human", "checking to make sure you're human", and there is normally at the very least a delay of a few seconds before the site loads.

I don't think this is really helping the site owners. I suspect it's mainly about AI extortion:

https://blog.cloudflare.com/introducing-pay-per-crawl/

  • james2doyle 2 hours ago

    You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious? I would say Cloudflare is giving these site owners an option to protect their content and as a byproduct, reduce their own costs of subsidizing their thieves. They can choose to turn off the crawl protection. If they aren't, that tells you that they want it, doesn’t it?

    • cpncrunch 33 minutes ago

      > You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious?

      You can easily block ChatGPT and most other AI scrapers if you want:

      https://habeasdata.neocities.org/ai-bots

      • james2doyle 8 minutes ago

        This is just using robots.txt and asking "pretty please, don’t scrape me".

        Here is an article (from TODAY) about the case where Perplexity is being accused of ignoring robots.txt: https://www.theverge.com/news/839006/new-york-times-perplexi...

        If you think a robots.txt is the answer to stopping the billion-dollar AI machine from scraping you, I don’t know what to say.

  • NooneAtAll3 2 hours ago

    it can't even spy on us silently, damn

paradite 3 hours ago

The deployment pattern from Cloudflare looks insane to me.

I've worked at one of the top fintech firms. Whenever we do a config change or deployment, we are supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.

The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.

I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
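
A rough Python sketch of automating that "spike on the dashboard" check. The class name, window size, and 5% threshold are all assumptions for illustration and are not taken from any real system:

```python
from collections import deque


class ErrorRateMonitor:
    """Track the most recent responses and flag a 5xx spike."""

    def __init__(self, window: int = 1000, threshold: float = 0.05) -> None:
        self.statuses = deque(maxlen=window)  # oldest entries fall off automatically
        self.threshold = threshold

    def record(self, status_code: int) -> None:
        self.statuses.append(status_code)

    def error_rate(self) -> float:
        if not self.statuses:
            return 0.0
        errors = sum(1 for code in self.statuses if code >= 500)
        return errors / len(self.statuses)

    def should_roll_back(self) -> bool:
        return self.error_rate() > self.threshold
```

A deploy script could poll `should_roll_back()` during the monitoring period and trigger the prepared rollback plan as soon as it returns True.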

  • vlovich123 2 hours ago

    That is also true at Cloudflare for what it’s worth. However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release, especially since there’s a 5 min lag (if I recall correctly) in the monitoring dashboards to get all the telemetry from thousands of servers worldwide.

    Comparing the difficulty of running the world’s internet traffic with hundreds of customer products with your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.

    • autoexec 32 minutes ago

      > However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release

      This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective and it's starting to look like a liability for large numbers of people there are obvious solutions for that.

      • pulkitsh1234 14 minutes ago

        Genuinely curious: how would you actually implement detection systems for large-scale global infra that work with a < 1 minute SLO, given that cost is no constraint?

      • vlovich123 26 minutes ago

        Can you name a major cloud provider that doesn’t have major outages?

        If this were purely a money problem it would have been solved ages ago. It’s a difficult problem to solve. Also, they’re the youngest of the major cloud providers and have a fraction of the resources that Google, Amazon, and Microsoft have.

        • autoexec 17 minutes ago

          > Can you name a major cloud provider that doesn’t have major outages?

          The fact that no major cloud provider is actually good is not an argument that Cloudflare isn't bad, or even that they couldn't/shouldn't do better than they are. They have fewer resources than Google or Microsoft but they're also in a unique position that makes us differently vulnerable when they fuck up. It's not all their fault, since it was a mistake to centralize the internet to the extent that we have in the first place, but now that they are responsible for so much they have to expect that people will be upset when they fail.

  • dehrmann 2 hours ago

    Cloudflare is orders of magnitude larger than any fintech. Rollouts likely take much longer, and having a human monitoring a dashboard doesn't scale.

    • notepad0x90 an hour ago

      That means they engineered their systems incorrectly then? Precisely because they are much bigger, they should be more resilient. You know who's bigger than Cloudflare? Tier-1 ISPs. If they had an outage the whole internet would know about it, and they do have outages, except they don't cascade into a global mess like this.

      Just speculating based on my experience: It's more likely than not that they refused to invest in fail-safe architectures for cost reasons. Control-plane and data-plane should be separate, a react patch shouldn't affect traffic forwarding.

      Forget manual rollbacks, there should be automated reversion to a known working state.

      • vlovich123 20 minutes ago

        > Control-plane and data-plane should be separate

        They are separate.

        > a react patch shouldn't affect traffic forwarding.

        If you can’t even bother to read the blog post maybe you shouldn’t be so confident in your own analysis of what should and shouldn’t have happened?

        This was a configuration change to change the buffered size of a body from 256kb to 1mib.

        The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.

    • cowsandmilk 27 minutes ago

      > Rollouts likely take much longer

      Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.

  • markus_zhang 3 hours ago

    My guess is that CF has so many external customers that they need to move fast and try not to break things. My hunch is that their culture always favors moving fast. As long as they are not breaking too many things, customers won't leave them.

    • paradite 3 hours ago

      There is nothing wrong with moving fast and deploying fast.

      I'm more talking about how slow it was to detect the issue caused by the config change, and perform the rollback of the config change. It took 20 minutes.

    • linhns 35 minutes ago

      I think everyone favors moving fast. We humans want to see results of our action early.

  • theideaofcoffee 3 hours ago

    Same, my time at a F100 ecommerce retailer showed me the same. Every change control board justification needed an explicit back-out/restoration plan with exact steps to be taken, what was being monitored to ensure that was being held to, contacts of prominent groups anticipated to have an effect, emergency numbers/rooms for quick conferences if in fact something did happen.

    The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).

    • prdonahue 3 hours ago

      And you moved at a glacial pace compared to Cloudflare. There are tradeoffs.

      • theideaofcoffee 2 hours ago

        Yes, of course, I want the organization that inserted itself into handling 20% of the world's internet traffic to move fast and break things. Like breaking the internet on a bi-weekly basis. Yep, great tradeoff there.

        Give me a break.

        • jimmydorry an hour ago

          While you're taking your break, exploits gain traction in the wild, and one of the value propositions for using a service provider like CloudFlare is catching and mitigating these exploits as fast as possible. From the OP, this outage was in relation to handling a nasty RCE.

        • wvenable 2 hours ago

          But if your job is to mitigate attacks/issues, then things can get very broken while you're being slow to mitigate them.

        • JeremyNT 38 minutes ago

          Lest we forget, they initially rose to prominence by being cheaper than the existing solutions, not better, and I suppose this is a tradeoff a lot of their customers are willing to make.

    • lljk_kennedy an hour ago

      This sounds just as bad as yolo-merges, just on the other end of the spectrum.

roguecoder 11 minutes ago

I notice that this is the kind of thing that solid sociable tests ought to have caught. I am very curious how testable that code is (random procedural if-statements don't inspire high confidence.)

rachr 3 hours ago

Time for Cloudflare to start using the BOFH excuse generator. https://bofh.d00t.org/

  • bit1993 21 minutes ago

    Thank you for this resource. Now an indispensable tool for my toolbox.

liampulles 3 hours ago

The lesson presented by the last few big outages is that entropy is, in fact, inescapable. The comprehensibility of a system cannot keep up with its growing and aging complexity forever. The rate of unknown unknowns will increase.

The good news is that a more decentralized internet with human brain scoped components is better for innovation, progress, and freedom anyway.

  • agentifysh 10 minutes ago

    yet my dedicated server has been up since 2015 with zero downtimes

    i dont think this is an entropy issue its human error bubbling up and cloudflare charges a premium for it

    my faith in cloudflare is shoook for sure two major outages weeks apart and this wont be the last

  • hnthrowaway0328 2 hours ago

    I'm not sure how decentralization helps though. People in a bazaar are going to care even less about sharing shadow knowledge. Linux IMO succeeds not because of the bazaar but because of Linus.

    • marcosdumay 24 minutes ago

      You don't keep a bazaar running with shadow knowledge. Either the important things are published or it doesn't run.

    • liampulles 41 minutes ago

      What is the shadow knowledge in this case?

aeyes an hour ago

How hard can it be for a company with 1000 engineers to create a canary region before blasting their centralized changes out to everyone.

Every change is a deployment, even if it's config. Treat it as such.

Also you should know that a strongly typed language won't save you from every type of problem. And especially not if you allow things like unwrap().

It is just mind boggling that they very obviously have completely untested code which proxies requests for all their customers. If you don't want to write the tests then at least fuzz it.
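
A minimal sketch of the "canary first" idea, assuming nothing about how Cloudflare's deployment systems actually work. The names `staged_rollout`, `apply`, `healthy` and `rollback` are hypothetical; the point is only that a config change goes to a small stage first, and later stages are never touched if that stage looks unhealthy.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    stages: Iterable[List[str]],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Apply a change stage by stage; undo everything on the first unhealthy stage."""
    applied: List[str] = []
    for stage in stages:
        for target in stage:
            apply(target)
            applied.append(target)
        # Check the stage that was just changed before going any wider.
        if not all(healthy(target) for target in stage):
            for target in reversed(applied):
                rollback(target)
            return False
    return True
```

Called with stages like `[["canary"], ["eu", "us"]]`, a failing health check on `canary` means `eu` and `us` are never changed. The same wrapper works whether the change is a binary or a config value, which is the "treat it as a deployment" part.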

jakub_g 2 hours ago

The interesting part:

After rolling out a bad ruleset update, they tried a killswitch (rolled out immediately to 100%) which was a code path never executed before:

> However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset

> a straightforward error in the code, which had existed undetected for many years
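
A rough sketch of the failure mode the quotes describe, in Python rather than the Lua from the postmortem. Every name here (`RuleResult`, `evaluate_rule`, `summarize_buggy`, the `execute` field) is hypothetical and not taken from Cloudflare's code; it only illustrates a result-processing step that assumes a field is always populated, and breaks the first time an earlier branch skips populating it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleResult:
    action: str
    # Only populated when an "execute" rule actually ran its sub-ruleset.
    execute: Optional[dict] = None


def evaluate_rule(action: str, killswitched: bool) -> RuleResult:
    result = RuleResult(action=action)
    if action == "execute" and not killswitched:
        result.execute = {"results": ["sub-rule matched"]}
    return result


def summarize_buggy(result: RuleResult) -> list:
    # Assumes every "execute" result carries sub-ruleset output.
    if result.action == "execute":
        return result.execute["results"]  # raises TypeError when execute is None
    return []


def summarize_safe(result: RuleResult) -> list:
    # Handles the skipped case explicitly.
    if result.action == "execute" and result.execute is not None:
        return result.execute["results"]
    return []
```

The buggy version works for every input until the combination "execute" plus killswitch shows up, which matches the "never before applied" detail: a unit test that builds that one combination would exercise the path without needing the rest of the system.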

  • 8cvor6j844qw_d6 2 hours ago

    > have never before applied a killswitch to a rule with an action of “execute”

    One might think a company on the scale of Cloudflare would have a suite of comprehensive tests to cover various scenarios.

    • hnthrowaway0328 2 hours ago

      I kinda think most companies out there are like that. Moving fast is the motto I heard the most.

      They are probably OK with occasional breaks as long as customers don't mind.

xnorswap 4 hours ago

My understanding, paraphrased: "In order to gradually roll out one change, we had to globally push a different configuration change, which broke everything at once".

But a more important takeaway:

> This type of code error is prevented by languages with strong type systems

  • jsnell 4 hours ago

    That's a bizarre takeaway for them to suggest, when they had exactly the same kind of bug with Rust like three weeks ago. (In both cases they had code implicitly expecting results to be available. When the results weren't available, they terminated processing of the request with an exception-like mechanism. And then they had the upstream services fail closed, despite the failing requests being to optional sidecars rather than on the critical query path.)

    • pdimitar 2 hours ago

      To be precise, the previous problem with Rust was because somebody copped out and used a temporary escape hatch function that absolutely has no place in production code.

      It was mostly an amateur mistake. Not Rust's fault. Rust could never gain adoption if it didn't have a few escape hatches.

      "Damned if they do, damned if they don't" kind of situation.

      There are even lints for the usage of the `unwrap` and `expect` functions.

      As the other sibling comment points out, the previous Cloudflare problem was an acute and extensive organizational failure.

    • Hamuko 37 minutes ago

      Yeah, my first thought was that had they used Rust, maybe we would've seen them point out a rule_result.unwrap() as the issue.

    • littlestymaar 3 hours ago

      In fairness, the previous bug (with the Rust unwrap) should never have happened: someone explicitly called the panicking function, the review didn't catch it and the CI didn't catch it.

      It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)

      • greatgib 3 hours ago

        The issue would also not have happened, if someone did the right code, tests, and the review or CI caught it...

        • marcosdumay 20 minutes ago

          It's different to expect somebody to write the correct program every time than to expect somebody not to call the "break_my_system" procedure that has warnings all over it telling people it's there for quick learning-to-use examples or other things you'll never run.

  • debugnik 4 hours ago

    Prevented unless they assert the wrong invariant at runtime like they did last time.

  • skywhopper 3 hours ago

    This is the exact same type of error that happened in their Rust code last time. Strong type systems don’t protect you from lazy programming.

    • inejge 15 minutes ago

      It's not remotely the same type of error -- error non-handling is very visible in the Rust code, while the Lua code shows the happy path, with no indication that it could explode at runtime.

      Perhaps it's the similar way of not testing the possible error path, which is an organizational problem.

gkoz 4 hours ago

I sometimes feel we'd be better off without all the paternalistic kitchensink features. The solid, properly engineered features used intentionally aren't causing these outages.

  • ilkkao 3 hours ago

    Agreed, I don't really like Cloudflare trying to magically fix every web exploit there is in frameworks my site has never used.

    • nish__ 2 hours ago

      Honestly. This feels outside of their domain.

bradly 15 minutes ago

Dang… I don’t even use React and it still brings down my sites. Good beats I guess.

8cvor6j844qw_d6 3 hours ago

Are there some underlying factors that resulted in the recent outages (e.g., new processes, layoffs, etc.) or is it just a series of pure coincidences?

  • Elucalidavah 2 hours ago

    Sounds like their "FL1 -> FL2" transition is involved in both.

    • Someone1234 2 hours ago

      It was involved in the previous one, but not in this latest one. All FL2 did was prevent the outage being even wider spread than it was. None of this had anything to do with migration.

      • tetha 2 hours ago

        If FL2 didn't have the outage, and FL1 did, the pace of the migration did have an impact.

        Though this is showing the problem with these things: Migrating faster could have reduced the impact of this outage, while increasing the impact of the last outage. Migrating slower could have reduced the impact of the last outage, while increasing the impact of this outage.

        This is a hard problem: How fast do you rip old working infrastructure out and risk finding new problems in the new stack, yet, how long do you tolerate shortcomings of the old stack that caused you to build the new stack?

dznodes 29 minutes ago

When should we just give up on Cloudflare? Seems like this just keeps happening. Like some kind of backdoor triggered willy nilly, Hmmm?

egorfine 3 hours ago

> provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis.

I have a mixed feeling about this.

On the one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me or not. Today it's protection, tomorrow it's censorship.

At the same time this is exactly what CloudFlare is good for - to protect sites from malicious requests.

  • udev4096 2 hours ago

    We need a decentralized ddos mitigation network based on incentives. Donate X amount of bandwidth, get Y amount of protection from other peers. Yes, we gotta do TLS inspection on every end for effective L7 mitigation but at least filtering can be done without decrypting any packets

borplk 7 minutes ago

Every time they screw up they write an elaborate postmortem and pat themselves on the back. Don't get me wrong, better have the postmortem than not. But at this point it seems like the only thing they are good at is writing incident postmortem blog posts.

Bender 2 hours ago

Suggestion for Cloudflare: Create an early adopter option for free accounts.

Benefit: Earliest uptake of new features and security patches.

Drawback: Higher risk of outages.

I think this should be possible since they already differentiate between free, pro and enterprise accounts. I do not know how the routing for that works but I bet they could do this. Think crowd-sourced beta testers. Also a perk for anything PCI audit or FEDRAMP security prioritized over uptime.

  • LelouBil an hour ago

    I would for sure enable this, my personal server can handle being unreachable for a few hours in exchange for (potentially) interesting features.

markus_zhang 2 hours ago

I wonder if anyone internal could share the culture a bit. I'm mostly interested in the following part:

If someone messes up royally, is there someone who says "if you break the build/whatever super critical, then your ass is the grass and I'm the lawn mower"?

hrimfaxi 3 hours ago

Having their changes fully propagate within 1 minute is pretty fantastic.

  • denysvitali 3 hours ago

    This is most likely a strong requirement for such a large-scale deployment of DDoS protection and detection, which explains their architectural choices (ClickHouse & co) and the need for super-low-latency config changes.

    Since attackers might rotate IPs more frequently than once per minute, this effectively means that the whole fleet of servers should be able to quickly react depending on the decisions made centrally.

  • chatmasta 3 hours ago

    The coolest part of Cloudflare’s architecture is that every server is the same… which presumably makes deployment a straightforward task.

rany_ 3 hours ago

> As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?

  • redslazer 3 hours ago

    If the request data is larger than the limit it doesn’t get processed by the Cloudflare system. By increasing buffer size they process (and therefore protect) more requests.

  • boxed 3 hours ago

    I think the buffer size is the limit on what they check for malicious data, so the old 128k would mean it would be trivial to circumvent by just having 128k ok data and then put the exploit after.

    • whs an hour ago

      I got curious and I checked AWS WAF. Apparently AWS WAF default limit for CloudFront is 16KB and max is 64KB.

MagicMoonlight an hour ago

If you had a 99.99% availability requirement they will have already cost you a fortune
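
For scale, a quick worked calculation of what a 99.99% target actually allows. This is only arithmetic on the target itself; the helper name `downtime_budget_minutes` is made up for the example, and it assumes the whole window counts toward the target.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: int) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


yearly = downtime_budget_minutes(0.9999, 365)
monthly = downtime_budget_minutes(0.9999, 30)
print(f"99.99% over a year allows about {yearly:.1f} minutes of downtime")
print(f"99.99% over 30 days allows about {monthly:.1f} minutes of downtime")
```

That works out to roughly 52.6 minutes per year, or about 4.3 minutes per 30 days. Taking the 20 minutes mentioned elsewhere in this thread as the outage length, a single incident of that size is more than four times a monthly budget.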

_pdp_ 3 hours ago

So no static compiler checks and apparently no fuzzers used to ensure these rules work as intended?

  • perching_aix 2 hours ago

    Such tooling exists for Lua? Didn't know.

nish__ 2 hours ago

Is it crazy to anyone else that they deploy every 5 minutes? And that it's not just config updates, but actual code changes with this "execute" action.

  • kccqzy 34 minutes ago

    Config updates are not so clear cut from code changes.

    Once I worked with a team in the anti-abuse space where the policy is that code deployments must happen over 5 days and config updates can take a few minutes. Then an engineer on the team argued that deploying new Python code doesn’t count as a code change because the CPython interpreter did not change; it didn’t even restart. And indeed given how dynamic Python is, it is totally possible to import new Python modules that did not exist when the interpreter process is launched.

snafeau 4 hours ago

A lot of these kinds of bugs feel like they could be caught by a simple review bot like Greptile... I wonder if Cloudflare uses an equivalent tool internally?

  • nkmnz 3 hours ago

    What makes greptile a better choice compared to claude code or codex, in your opinion?

  • nish__ 2 hours ago

    Any bot that runs an AI model should not be called "simple".

antiloper 4 hours ago

Make faster websites:

> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why is the Next.js limit 1 MB? It's not enough for uploading user generated content (photographs, scanned invoices), but a 1 MB request body for even multiple JSON API calls is ridiculous. These frameworks need to at least provide some pushback to unoptimized development, even if it's just a lower default request body limit. Otherwise all web applications will become as slow as the MS office suite or reddit.

  • ramon156 3 hours ago

    The update was to update it to 3MB (paid 10MB)

  • AmazingTurtle 3 hours ago

    a) They serialize tons of data into requests b) Headers. Mostly cookies. They are a thing. They are being abused all over the world by newbies.

mmmlinux 2 hours ago

Messing around on a Friday? Brave.

  • orphea an hour ago

    If you're afraid of deploying on Friday, you're doing it wrong.

kachapopopow 4 hours ago

why does this seem oddly familiar (fail-closed logic)

rudedogg an hour ago

I’m really sick of constantly seeing cloudflare, and their bullshit captchas. Please, look at how much grief they’re causing trying to be the gateway to the internet. Don’t give them this power

nish__ 2 hours ago

No love lost, no love found.

system2 an hour ago

Is it just me, or did CloudFlare outages increase since LLM "engineers" were hired remotely? Do you think there is a correlation?

  • roguecoder a few seconds ago

    They've always been flakey. At least these only impacted their own customers instead of taking down the internet.

iLoveOncall 3 hours ago

The most surprising thing from this article is that CloudFlare handles only around 85M TPS.

  • blibble 2 hours ago

    it can't really be that small, can it?

    that's maybe half a rack of load

    • nish__ 2 hours ago

      Given the number of lua scripts they seem to be running, it has to take more than half a rack.

lapcat 3 hours ago

> This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test, otherwise the straightforward error would have been detected immediately, and their implied solution seems to be not testing their code when written, or even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.

  • JohnMakin 2 hours ago

    Large scale infrastructure changes are often by nature completely untestable. The system is too large, there are too many moving parts to replicate with any kind of sane testing, so often, you do find out in prod, which is why robust and fast rollback procedures are usually desirable and implemented.

    • lapcat 2 hours ago

      > Large scale infrastructure changes are often by nature completely untestable.

      You're changing the subject here and shifting focus from the specific to the vague. The two postmortems after the recent major Cloudflare outages both listed straightforward errors in source code that could have been tested and detected.

      Theoretical outages could theoretically have other causes, but these two specific outages had specific causes that we know.

      > which is why robust and fast rollback procedures are usually desirable and implemented.

      Yes, nobody is arguing against that. It's a red herring with regard to my point about source code testing.

      • JohnMakin an hour ago

        I am not changing any subject. These are glue logic scripts connecting massive pieces of infra together, spanning what is likely several teams and orgs over the course of many years. It is impossible to blurt something out like "well, source code testing" for something like this, when the source code inputs are not possibly testable outside the scale of the larger system. They're often completely unknowable as well.

        With all due respect, it sounds like you have not worked on these types of systems, but out of curiosity - what type of test do you think would have prevented this?

        • lapcat 39 minutes ago

          With all due respect, it sounds like you have never heard of unit tests.

          Cloudflare states that the compiler would prevent the bug in certain programming languages. So it seems ridiculous to suggest that the bug can't be detected outside the scale of a larger system.

blibble 2 hours ago

amateur level stuff again

jgalt212 3 hours ago

I do kind of like how they are blaming React for this.

theoldgreybeard 2 hours ago

This is total amateur shit. Completely unacceptable for something as critical as Cloudflare.

Uptrenda an hour ago

Can't believe one shitty website can take down most of the mainstream web.

guluarte 2 hours ago

is it me or are critical software bugs more and more common?

rvz 3 hours ago

> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

Doesn't Cloudflare rigorously test their changes before deployment to make sure that this does not happen again? This better not have been used to cover for the fact that they are using AI to fix issues like this one.

There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all, and I expected Cloudflare to learn from the previous outage very quickly.

This is quite a pattern, though; we might need to consider ranking their unreliability next to GitHub's (which goes down every week).

fidotron 4 hours ago

> This change was being rolled out using our gradual deployment system, and, as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules. As this was an internal tool, and the fix being rolled out was a security improvement, we decided to disable the tool for the time being as it was not required to serve or protect customer traffic.

Come on.

This PM raises more questions than it answers, such as why exactly China would have been immune.

  • skywhopper 3 hours ago

    China is probably a completely separate partition of their network.

    • fidotron 3 hours ago

      One that doesn't get proactive security rollouts, it would seem.

      • roguecoder 3 minutes ago

        The deploys are very unlikely to be managed from the same system.

      • skywhopper 3 hours ago

        I assume it was next on the checklist, or assigned to a different ops team.

da_grift_shift 4 hours ago

It's not an outage, it's an Availability Incident™.

https://blog.cloudflare.com/5-december-2025-outage/#what-abo...

  • perching_aix 2 hours ago

    You jest, but recently I also felt compelled to stop using the word (planned) outage where I work, because it legitimately creates confusion around the (expected) character of impact.

    Outage is the nuclear wasteland situation, which given modern architectural choices, is rather challenging to manifest. To avoid it is face-saving, but also more correct.

kosolam 2 hours ago

Some nonsense again. The level of negligence there is astounding. This is frightening because this entity is daily exposed to a large portion of our personal data which goes over the wire. As well as business data. It’s just a matter of time before a disaster is going to occur. Some regulatory body must take control in their hands right now.

websiteapi 4 hours ago

i wonder why they cannot partially rollout. like the other outage they have to do a global rollout.

  • usrnm 3 hours ago

    I really don't see how it would've helped. In Go or Rust you'd just get a panic, which is in no way different.

  • denysvitali 3 hours ago

    The article mentions that this Lua-based proxy is the old generation one, which is going to be replaced by the Rust based one (FL2) and that didn't fail on this scenario.

    So, if anything, their efforts towards a typed language were justified. They just didn't manage to migrate everything in time before this incident, which is ironically a good thing since this incident was caused mostly by a rushed change in response to an actively exploited vulnerability.

    • websiteapi 3 hours ago

      yes, but as the article states why are they doing global fast rollouts?

      • denysvitali 3 hours ago

        I think (would love to be corrected) that this is the nature of their service. They probably push multiple config changes per minute to mitigate DDOS attacks. For sure the proxies have a local list of IPs that, for a period of time, are blacklisted.

        For DDOS protection you can't really rely on multiple-hours rollouts.

denysvitali 4 hours ago

Ironically, this time around the issue was in the proxy they're going to phase out (and replace with the Rust one).

I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.

HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.

At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kinds of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.

Say what you want, but I'd prefer to trust CloudFlare, who admit and act upon their fuckups, rather than trying to cover them up or downplaying them like some other major cloud providers.

@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems

  • iLoveOncall 3 hours ago

    > I truly believe they're really going to make resilience their #1 priority now

    I hope that was their #1 priority from the very start given the services they sell...

    Anyway, people always tend to overthink about those black-swan events. Yes, 2 happened in a quick succession, but what is the average frequency overall? Insignificant.

    • roguecoder 4 minutes ago

      This is Cloudflare. They've repeatedly broken DNS for years.

      Looking across the errors, it points to some underlying practices: a lack of systems metaphors, modularity, testability, and a reliance on super-generic configuration instead of software with enforced semantics.

    • denysvitali 2 hours ago

      I think they have to strike a balance between being extremely fast (reacting to vulnerabilities and DDOS attacks) while still being resilient. I don't think it's an easy situation

  • trashburger 3 hours ago

    I would very much like for him not to ignore the negativity, given that, you know, they are breaking the entire fucking Internet every time something like this happens.

    • denysvitali 3 hours ago

      This is the kind of comment I wish he would ignore.

      You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.

      I don't think they do these things on purpose. Of course given their good market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDOS protection system (or WAF) you don't want or have the luxury to wait for days until your rule is applied.

      • beanjuiceII 3 hours ago

        I hope he doesn't ignore it. The internet has been forgiving enough toward Cloudflare's string of failures... it's getting pretty old, and creates a ton of chaos. I work with life saving devices; being impacted in any way in data monitoring has a huge impact in many ways. "sorry ma'am we can't give your child t1d readings on your follow app because our provider decided to break everything in the pursuit of some react bug." has a great ring to it

        • Anon1096 2 hours ago

          Cloudflare and other cloud infra providers are only providing primitives to use, in this case WAF. They have target uptimes and it's never 100%. It's up to the people actually making end user services (like your medical devices) to judge whether that is enough and if not to design your service around it.

          (and also, rolling your own version of WAF is probably not the right answer if you need better uptime. It's exceedingly unlikely a medical devices company will beat CF at this game.)

        • esseph 2 hours ago

          Half your medical devices are probably opening up data leakage to China.

          https://www.csoonline.com/article/3814810/backdoor-in-chines...

          Most hospital and healthcare IT teams are extremely under funded, undertrained, overworked, and the software, configurations and platforms are normally not the most resilient things.

          I have a friend at one in the North East right now going through a hell of a security breach for multiple months now and I'm flabbergasted no one is dead yet.

          When it comes to tech, I get the impression most organizations are not very "healthy" in the durability of systems.

      • nish__ 2 hours ago

        Maybe not on purpose but there's such a thing as negligence.

  • fidotron 3 hours ago

    > HugOps

    This childish nonsense needs to end.

    Ops are heavily rewarded because they're supposed to be responsible. If they're not then the associated rewards for it need to stop as well.

    • denysvitali 3 hours ago

      I have never seen an Ops team being rewarded for avoiding incidents (focusing on tech debt reduction), but instead they get the opposite - blamed when things go wrong.

      I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.

      • fidotron 3 hours ago

        > I have never seen an Ops team being rewarded for avoiding incidents

        That's why their salaries are so high.

        • denysvitali 3 hours ago

          Depending on the tech debt, the ops team might just be in "survival mode" and not have the time to fix every single issue.

          In this particular case, they seem to be doing two things:

          - Phasing out the old proxy (Lua based) which is replaced by FL2 (Rust based, the one that caused the previous incident)

          - Reacting to an actively exploited vulnerability in React by deploying WAF rules

          and they're doing them in a relatively careful way (test rules) to avoid fuckups, which caused this unknown state, which triggered the issue

          • fidotron 3 hours ago

            They deliberately ignored an internal tool that started erroring out at the given deployment and rolled it out anyway without further investigation.

            That's not deserving of sympathy.

        • esseph 3 hours ago

          Ops salaries are high??? Where?!?!

          • hnthrowaway0328 2 hours ago

            Definitely commands better salaries than us pitty DEs.

    • esseph 3 hours ago

      Ops has never been "rewarded" at any org I've ever been at or heard about, including physical infra companies.

  • da_grift_shift 3 hours ago

    [ Removed by Reddit ]

    • denysvitali 3 hours ago

      Wow. The three comments below parent really show how toxic HN has become.

      • beanjuiceII 3 hours ago

        being angry about something doesn't make it toxic, people have a right to be upset

        • denysvitali 3 hours ago

          The comment, before the edit, was what I would consider toxic. No wonder it has been edited.

          It's fine to be upset, and especially rightfully so after the second outage in less than 30 days, but this doesn't justify toxicity.

jpeter 4 hours ago

Unwrap() strikes again

  • dap 4 hours ago

    I guess you’re being facetious but for those who didn’t click through:

    > This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

    • skywhopper 3 hours ago

      That bit may be true, but the underlying error of a null reference that caused a panic was exactly the same in both incidents.

      • roguecoder 10 minutes ago

        Yep: it is wild for them to claim that a strongly-typed language would have saved them when it didn't.

        Relying on language features instead of writing code well will always eventually backfire.

  • throwawaymaths 4 hours ago

    this time in lua. cloudflare can't catch a break

    • RoyTyrell 4 hours ago

      Or they're not thoroughly testing changes before pushing them out. As I've seen some others say, CloudFlare at this point should be considered critical infrastructure. Maybe not like power but dang close.

      • esseph 2 hours ago

        My power goes out every Wednesday around noon and normally if the weather is bad. In a major US metro.

        I hope cloudflare is far more resilient than local power.

    • gcau 4 hours ago

      The 'rewrite it in lua' crowd are oddly silent now.

      • infrcg 3 hours ago

        [flagged]

        • jcmfernandes 3 hours ago

          Did you really go through the trouble of creating an account just to spit trash? Damn!

    • rvz 3 hours ago

      Time to use boring languages such as Java and Go.

barbazoo 4 hours ago

> Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.

Interesting.

  • flaminHotSpeedo 4 hours ago

    They kinda buried the lede there, 28% failure rate for 100% of customers isn't the same as 100% failure rate for 28% of customers

dreamcompiler 3 hours ago

"Honey we can't go on that vacation after all. In fact we can't ever take a vacation period."

"Why?"

"I've just been transferred to the Cloudflare outage explanation department."