Let's blame the dev who pressed "Deploy" - by Dmitry Kudryavtsev

Mac@programming.dev · edit-2 4 months ago

quinkin@lemmy.world · 4 months ago

If a single person can make the system fail then the system has already failed.

corsicanguppy@lemmy.ca · 4 months ago

If a single outsider can make your system fail then it’s already failed.

Now consider in the context of supply-chain tuberculosis like npm.

slazer2au@lemmy.world · 4 months ago

Left-pad right?

Eager Eagle@lemmy.world · 4 months ago

Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.

over_clox@lemmy.world · edit-2 4 months ago

Hey, why not just ask Dave Plummer, former Windows developer…

https://youtube.com/watch?v=wAzEJxOo1ts

Anywhere from 8.5 million to over a billion systems went down (the numbers I’ve read so far vary significantly). Either way, that’s way too much failure for a simple borked update to a kernel-level driver, one not even made by Microsoft.

Eager Eagle@lemmy.world · 4 months ago

that’s a huge sign that their rollout process is garbage

people_are_cute@lemmy.sdf.org · 4 months ago

That means spending time and money on developing such a system, which means increasing costs in the short term… which is kryptonite for current-day CEOs

Eager Eagle@lemmy.world · edit-2 4 months ago

Right. More than money, I say it’s about incentives. You might change the entire C-suite, management, and engineering teams, but if the incentives remain the same (e.g. developers are evaluated by number of commits), the new staff is bound to make the same mistakes.

azertyfun@sh.itjust.works · edit-2 4 months ago

I strongly believe in no-blame mindsets, but “blame” is not the same as “consequences” and lack of consequences is definitely the biggest driver of corporate apathy. Every incident should trigger a review of systemic and process failures, but in my experience corporate leadership either sucks at this, does not care, or will bury suggestions that involve spending man-hours on a complex solution if the problem lies in that “low likelihood, big impact” corner.
Because likely when the problem happens (again) they’ll be able to sweep it under the rug (again) or will have moved on to greener pastures.

What the author of the article suggests is actually a potential fix; if developers (in a broad sense of the word and including POs and such) were accountable (both responsible and empowered) then they would have the power to say No to shortsighted management decisions (and/or deflect the blame in a way that would actually stick to whoever went against an engineer’s recommendation).

morbidcactus@lemmy.ca · edit-2 4 months ago

Edit: see my response, realised the comment was about engineering accountability which I 100% agree with, leaving my original post untouched aside from a typo that’s annoying me.

I respectfully disagree coming from a reliability POV, you won’t address culture or processes that enable a person to make a mistake. With the exception of malice or negligence, no one does something like this in a vacuum; insufficient or incorrect training, unreasonable pressure, poorly designed processes, a culture that enables actions that lead to failure.

Example I recall from when I worked manufacturing, operator runs a piece of equipment that joins pieces together in manual rather than automatic, failed to return it to a ready flag and caused a line stop. Yeah, operator did something outside of process and caused an issue, clear cut right? Send them home? That was a symptom, not a cause, the operator ran in manual because the auto cycle time was borderline causing linestops, especially on the material being run. The operator was also using manual as there were some location sensors that had issues with that material and there was incoming quality issues, so running manually, while not standard procedure, was a work around to handle processing issues, we also found that culturally, a lot of the operators did not trust the auto cycles and would often override. The operator was unlucky, if we just put all the “accountability” on them we’d never have started projects to improve reliability at that location and change the automation to flick over that flag the operator forgot about if conditions were met regardless.

Accountability is important, but it needs to be applied where appropriate, if someone is being negligent or malicious, yeah there’s consequences, but it’s limiting to focus on that only. You can implement what you suggest that the devs get accountability for any failure so they’re “empowered”, but if your culture doesn’t enable them to say no or make them feel comfortable to do so, you’re not doing anything that will actually prevent an issue in the future.

Besides, I’d almost consider it a PPE control and those are on the bottom of the controls hierarchy with administrative just above it, yes I’m applying oh&s to software because risk is risk conceptually, automated tests, multi phase approvals etc. All of those are better controls than relying on a single developer saying no.

azertyfun@sh.itjust.works · 4 months ago

Oh I was talking in the context of my specialty, software engineering. The main difference between an engineer and an operator is that one designs processes while the other executes on those processes. Negligence/malice aside the operator is never to blame.

If the dev is “the guy who presses the ‘go live’ button” then he’s an operator. But what is generally being discussed is all the engineering (or lack thereof) around that “go live” button.

As a software engineer I get queasy when it is conceivable that a noncritical component reaches production without the build artifact being thoroughly tested (with CI tests AND real usage in lower environments).
The fact that CrowdStrike even had a button that could push a DOA update on such a highly critical component points to their processes being so far outside industry standards that no software engineer would have signed off on anything… if software engineers actually had the same accountability as Civil Engineers. If a bridge gets built outside the specifications of the Civil Engineer who signed off on the plans, and that bridge crumbles, someone is getting their tits sued off. Yet there is no equivalent accountability in Software Engineering (except perhaps in super safety-critical stuff like automotive/medical/aerospace/defense applications, and even there I think we’d be surprised).

morbidcactus@lemmy.ca · 4 months ago

I realised you meant this over lunch, I’m a mech eng who changed disciplines into software (data and systems mainly) over my career, I 100% feel you, I have seen enough colleagues do things that wouldn’t fly in other disciplines, it’s definitely put me off a number of times. I’m personally for rubber stamping by a PEng and the responsibility that comes with that. There’s enough regulatory and ethical considerations just in data usage that warrants an engineering review, systems designed for compliance should be stamped too.

Really bothers me sometimes how wildwest things are.

lad@programming.dev · 4 months ago

This might help in some regard, but this will also create a bottleneck of highly skilled highly expensive Engineers with the accountability certificate. I’ve seen what happens when this is cornerstone even without the accountability that would make everything even more expensive: the company wants to cut expenses so there’s only one high level engineer per five or so projects. Said engineer has no time and no resources to dig into what the fuck actually happens on the projects. Changes are either under reviewed or never released because they are forever stuck in review.

On the other hand, maybe we do move a tad bit too fast, and some industries could do with a bit of thinking before doing. Not every software company should do that, though. To continue the bridge analogy, most software developers are more akin to carpenters, even if they think of themselves as architects of buildings and bridges. If a table fails, nothing good is going to happen, and some damage is likely to occur, but the scale is very different from what happens if a condo fails.

azertyfun@sh.itjust.works · 4 months ago

But a company that hires carpenters to build a roof will be held liable if that roof collapses on the first snow storm. Plumbers and electricians must be accredited AFAIK, have the final word on what is good enough by their standards, and signing off on shoddy work exposes them to criminal negligence lawsuits.

Some software truly has no stakes (e.g. a free mp3 converter), but even boring office productivity tools can be more critical than my colleagues sometimes seem to think. Sure, we work on boring office productivity tools, but hospitals buy those tools and unreliable software means measurably worse health outcomes for the patients.

Engineers signing off on all software is an extreme end of the spectrum, but there are a whole lot of options between that and the current free-for-all where customers have no way to know if the product they’re buying is following industry standard practices, or if the deployment process is “Dave receives a USB from Paula and connects to the FTP using a 15 year-old version of FileZilla and a post-it note with the credentials”.

lad@programming.dev · 3 months ago

True, there is a spectrum of options, and some will work much better than what we have now. It’s just that when I read about holding people accountable I don’t quite imagine it’s going to be implemented in the optimal way, not in the first hundred years or so

over_clox@lemmy.world · edit-2 4 months ago

If you were a developer that knew you were responsible for developing ring zero code, massively deployed across corporate systems across the world, then you should goddamned properly test the update before deploying it.

This isn’t a simple glitch like a calculation rounding error or some shit, the programmers of any ring zero code should be held fully responsible, for not properly reviewing and testing the code before deploying an update.

Edit: Why not just ask Dave Plummer, former Windows developer…

https://youtube.com/watch?v=wAzEJxOo1ts

Aceticon@lemmy.world · edit-2 4 months ago

If your system depends on a human never making a mistake, your system is shit.

It’s not by chance that for example, Accountants have since forever had something which they call reconciliation where the transaction data entered from invoices and the like then gets cross-checked with something else done differently, for example bank account transactions - their system is designed with the expectation that humans make mistakes hence there’s a cross-check process to catch those.
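For anyone who hasn't seen reconciliation, here is a minimal sketch of the idea in Python: two records produced independently get compared, and anything that appears in only one of them is flagged. The entry format and the names `reconcile`, `ledger`, and `bank` are made up for illustration and are not taken from any real accounting system.

```python
from collections import Counter


def reconcile(ledger, bank):
    """Return the entries that appear in one record but not the other.

    Each entry is a (reference, amount_in_cents) tuple. Both lists are
    hypothetical stand-ins for records that were produced independently.
    """
    ledger_counts = Counter(ledger)
    bank_counts = Counter(bank)
    # Counter subtraction keeps only the entries with a positive surplus.
    only_in_ledger = list((ledger_counts - bank_counts).elements())
    only_in_bank = list((bank_counts - ledger_counts).elements())
    return only_in_ledger, only_in_bank


# Hand-entered invoices versus bank transactions; INV-2 has transposed digits.
ledger = [("INV-1", 12000), ("INV-2", 4550), ("INV-3", 990)]
bank = [("INV-1", 12000), ("INV-2", 4505), ("INV-3", 990)]

missing_from_bank, missing_from_ledger = reconcile(ledger, bank)
print(missing_from_bank)
print(missing_from_ledger)
```

The point is that the check does not care who made the typo; it only assumes that somebody eventually will.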

Clearly Crowdstrike did not have a secondary part of the process designed to validate what’s produced by the primary (in software development that would usually be Integration Testing), so their process was shit.

Blaming the human that made a mistake for essentially being human and hence making mistakes, rather than the process around him or her not having been designed to catch human failure and stop it from having nasty consequences, is the kind of simplistic ignorant “logic” that only somebody who has never worked in making anything that has to be reliable could have.

My bet, from decades of working in the industry, is that some higher up in Crowdstrike didn’t want to pay for the manpower needed for the secondary process checking the primary one before pushing stuff out to production because “it’s never needed” and then the one time it was needed, it wasn’t there, things really blew up massively, and here we are today.

over_clox@lemmy.world · 4 months ago

Indeed, I fully agree. They obviously neglected on testing before deployment. So you can split the blame between the developer that goofed on the null pointer dereferencing and the blank null file, and the higher ups that apparently decided that proper testing before deployment wasn’t necessary.

Ultimately, it still boils down to human error.

Eager Eagle@lemmy.world · 4 months ago

Finding people to blame is, more often than not, useless.

Systematic changes to the process might prevent it from happening again.

Replacing “guilty” people with other fallible humans won’t do it.

over_clox@lemmy.world · 4 months ago

Still, with billions of dollars in losses across the globe and all the various impacts it’s having on people’s lives, is nobody gonna be held accountable? Will they just end up charging CrowdStrike as a whole a measly little fine compared to the massive losses the event caused?

One of their developers goofed up pretty bad, but in a fairly simple and forgivable way. The real blame should go on the higher ups that decided that full proper testing wasn’t necessary before deployment.

So yes, they really need to review their policies and procedures before pressing that deploy button.

Eager Eagle@lemmy.world · edit-2 4 months ago

is nobody gonna be held accountable?

Likely someone will, but legal battles between companies are more about who has more money and leverage than actual accountability, so I don’t see them as particularly useful for preventing incidents or for society.

The only good thing that might come out of this and is external to CrowdStrike, is regulation.

lad@programming.dev · 4 months ago

with billions of dollars in losses

But the real question we should be asking ourselves is “how much did the higher-ups save over the course of the years without proper testing?”

It probably is what they are concerned about, and I really wish I knew the answer to this question.

I think, this is absolutely not the way to do business, but maybe that’s because I don’t have one ¯\_(ツ)_/¯

Aceticon@lemmy.world · edit-2 4 months ago

Making a mistake once in a while on something one does all the time is to be expected - even somebody with a 0.1% rate of mistakes will fuck up once in a while if they do something with high enough frequency, especially if they’re too time constrained to validate.

Making a mistake on something you do just once, such as setting up the process for pushing virus definition files to millions of computers in such a way that they’re not checked inhouse before they go into Production, is a 100% rate of mistakes.

A rate of mistakes of 0.1% is generally not incompetence (it depends on how simple the process is and how much you’re paying for that person’s work), whilst a rate of 100% definitely is.

The point being that those designing processes, who have lots of time to do it, check it and cross check it, and who generally only do it once per place they work (maybe twice), really have no excuse to fail the one thing they had to do with all the time in the World, whilst those who do the same thing again and again under strict time constraints definitely have a valid excuse to once in a blue moon make a mistake.
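The arithmetic behind the 0.1% figure can be sketched in a few lines of Python. This assumes every repetition fails independently with the same probability, which is a simplification, and the function name is made up for illustration.

```python
def p_at_least_one_mistake(rate, repetitions):
    """Chance of at least one mistake in `repetitions` independent attempts.

    Assumes each attempt fails independently with probability `rate`.
    """
    return 1.0 - (1.0 - rate) ** repetitions


# How quickly a 0.1% per-attempt rate adds up as the attempts pile up.
for n in (1, 100, 1000, 5000):
    print(n, round(p_at_least_one_mistake(0.001, n), 3))
```

Under that assumption, a thousand repetitions at a 0.1% rate already give roughly a 63% chance of at least one mistake, which is why the process has to be built to catch it.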

over_clox@lemmy.world · 4 months ago

Watch the video that I linked as an edit from Dave Plummer, he explains it rather well. The driver was signed, it was the rolling update definition files from CrowdStrike that were unsigned.

solrize@lemmy.world · 4 months ago

Note: Dmitry Kudryavtsev is the article author and he argues that the real blame should go to the Crowdstrike CEO and other higher-ups.

Mac@programming.dev · 4 months ago

Edited the title to have a by in front to make that a bit more clear

iAvicenna@lemmy.world · 4 months ago

sure it is the dev who is to blame and not the clueless managers who evaluate devs based on number of commits/reviews per day and CEOs who think such managers are on top of their game.

Kissaki@programming.dev · 4 months ago

Is that the case at CrowdStrike?

iAvicenna@lemmy.world · 4 months ago

I don’t have any information on that, this was more like a criticism of where the world seems to be leading to

FlorianSimon@sh.itjust.works · 4 months ago

I’ve been working as a professional programmer for many years and have never ever seen this kind of evaluation, not even once. I’m pretty convinced it’s an exception rather than a rule. And I’d add that it’s probably a very rare exception.

iAvicenna@lemmy.world · edit-2 4 months ago

NGL I am also only a second-hand witness to it. This particular example may be rare, but there are a lot of others to the same effect: evaluating performance based on number of lines of code, trying to combine multiple dev responsibilities into a single position, unrealistic deadlines which can usually be met only very superficially, managers looking for opportunities to replace coders with AI and further tasking other devs with AI code checking responsibilities, replacing experienced coders with new graduates because they are willing to work more for less. All of these are some form of quantity over quality and usually end up with some sort of crisis.

Ephera@lemmy.ml · 4 months ago

Yeah, and at the end of the day, it is just as much a very rare exception that a dev actually gets enough time to complete their work at a level of quality they would take responsibility for.
Hell, it is standard industry practice to ship things and then start fixing the issues that crop up.

im sorry i broke the code@sh.itjust.works · 3 months ago

Nono listen to me, it’s agile

Kissaki@programming.dev · 4 months ago

CrowdStrike ToS, section 8.6 Disclaimer

[…] THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES, COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY DAMAGE. […]

It’s about safety, but truly ironic how it mentions aircraft-related twice, and communication systems (very broad).

It certainly doesn’t inspire confidence in the overall stability. But it’s also general ToS-speak, and may only be noteworthy now, after the fact.

goferking0@lemmy.sdf.org · 4 months ago

Weren’t the issues at airports because of the ticketing and scheduling systems going down, not anything with aircraft?

Kissaki@programming.dev · 3 months ago

Yes, I think so.

lad@programming.dev · 4 months ago

That’s just covering up, like a disclaimer that your software is intended to only be used on 29ᵗʰ of February. You don’t expect anyone to follow that rule, but you expect the court to rule that the user is at fault.

Luckily, it doesn’t always work that way, but we will see how it turns out this time

v9CYKjLeia10dZpz88iU@programming.dev · edit-2 3 months ago

Lawful Masses with Leonard French covered this yesterday. He is a copyright attorney. He starts the video with the opinion that the ToS wouldn’t protect CrowdStrike.

trolololol@lemmy.world · 4 months ago

I’m pretty sure if a client pays for use in any of that they’ll shut up and take the money. Pretty ethical.

ByteOnBikes@slrpnk.net · 4 months ago

It’s never a single person who caused a failure.

jonne@infosec.pub · 3 months ago

Yeah exactly. You’d think they’d have a test suite before pushing an update, or do a staggered rollout where they only push it to a sample amount of machines first. Just blaming one guy because you had an inadequate UAT process is ridiculous.

The Snark Urge@lemmy.world · 3 months ago

Allow me to introduce myself

over_clox@lemmy.world · edit-2 4 months ago

I hope this incident shines more light on the greedy rich CEOs and the corners they cut, the taxes they owe, the underpaid employees and understaffed facilities, and now probably some hefty fines, as just a slap on the wrist of course…

BB_C@programming.dev · 4 months ago

Yesterday I was browsing /r/programming

:tabclose

polle@feddit.org · 4 months ago

Microsoft also started blaming the EU. It’s such a shitshow it’s ridiculous.

https://www.tomshardware.com/software/windows/microsofts-eu-agreement-means-it-will-be-hard-to-avoid-crowdstrike-like-calamities-in-the-future

trolololol@lemmy.world · 4 months ago

OMG the article conflates kennel API calls and kennel drivers such as what crowdstrike actually does. I refuse to read it until the end.

Gestrid@lemmy.ca · 4 months ago

Kennel? You mean kernel?

trolololol@lemmy.world · 3 months ago

Oops, my dumb keyboard still hasn’t learned what I do

MTK@lemmy.world · edit-2 4 months ago

If only we had terms for environments that were meant for testing, staging, early release and then moving over to our servers that are critical…

I know it’s crazy, really a new system that only I came up with (or at least I can sell that to CrowdStrike as it seems)

gedhrel@lemmy.world · 4 months ago

Check Crowdstrike’s blurb about the 1-10-60 rule.

You can bet that they have a KPI that says they can deliver a patch in under 15m; that can preclude testing.

Although that would have caught it, what happened here is that 40k of nuls got signed and delivered as config. Which means that unparseable config on the path from CnC to ring0 could cause a crash and was never covered by a test.

It’s a hell of a miss, even if you’re prepared to accept the argument about testing on the critical path.

(There is an argument that in some cases you want security systems to fail closed; however that’s an extreme case - PoS systems don’t fall into that - and you want to opt into that explicitly, not due to a test omission.)
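To make the "unparseable config" point concrete, here is a rough Python sketch of the kind of sanity check that would reject a file of nuls before it reaches the privileged component. The `MAGIC` header and both function names are hypothetical; this is not CrowdStrike's actual file format or loader.

```python
MAGIC = b"CFG1"  # hypothetical header, not the real file format


def is_plausible_config(data: bytes) -> bool:
    """Cheap sanity checks to run before privileged code parses the file."""
    if not data:
        return False
    if data.count(0) == len(data):
        # Every byte is zero, as in the 40k-of-nuls case described above.
        return False
    return data.startswith(MAGIC)


def load_config(data: bytes, last_known_good: bytes) -> bytes:
    """Fall back to the previous config instead of crashing on bad input."""
    return data if is_plausible_config(data) else last_known_good


# A 40 KiB file of zero bytes is rejected and the old config stays in place.
bad_update = bytes(40 * 1024)
current = load_config(bad_update, last_known_good=MAGIC + b"old rules")
print(current)
```

Falling back to the last known good config is a fail-open choice. Whether that is acceptable is exactly the opt-in decision mentioned above, so treat it as one option rather than the right answer.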

Mischala@lemmy.nz · 4 months ago

That’s the crazy thing. This config can’t ever have been booted on a win10/11 machine before it was deployed to the entire world.

Not once, during development of the new rule, or in any sort of testing CS does. Then once again, never booted by MS during whatever verification process they (should) have before signing.

The first win11/10 to execute this code in the way it was intended to be used, was a customer’s machine.

Insane.

gedhrel@lemmy.world · 3 months ago

Possibly the thing that was intended to be deployed was. What got pushed out was 40kB of all zeroes. Could’ve been corrupted some way down the CI chain.

jonne@infosec.pub · 3 months ago

Which definitely wouldn’t have been a single developer’s fault.

gedhrel@lemmy.world · 3 months ago

Developers aren’t the ones at fault here.

Miaou@jlai.lu · 3 months ago

Not the most at fault, but if you sign off on a shitty process, you are still partially responsible

gedhrel@lemmy.world · 3 months ago

That depends entirely on the ability to execute change. CTO is the role that should be driving this.

Kissaki@programming.dev · edit-2 4 months ago

It’s a systematic multi-layered problem.

The simplest, least effort thing that could have prevented the scale of issues is not automatically installing updates, but waiting four days and triggering it afterwards if no issues.

Automatically forwarding updates is also forwarding risk. The higher the impact area, the more worth it safe-guards are.

Testing/Staging or partial successive rollouts could have also mitigated a large number of issues, but requires more investment.
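A partial successive rollout can be sketched roughly like this in Python. The wave sizes and the `deploy` and `is_healthy` callbacks are placeholders picked for illustration; a real pipeline would also wait between waves and use far richer health signals.

```python
def staged_rollout(hosts, deploy, is_healthy,
                   wave_fractions=(0.01, 0.1, 0.5, 1.0)):
    """Deploy to growing waves of hosts and stop at the first unhealthy wave.

    Returns (updated_hosts, halted). `deploy` and `is_healthy` are
    caller-supplied callbacks; the fractions are arbitrary example values.
    """
    updated = []
    for fraction in wave_fractions:
        target = max(1, int(len(hosts) * fraction))
        # Only the hosts not covered by an earlier wave.
        wave = hosts[len(updated):target]
        for host in wave:
            deploy(host)
        updated.extend(wave)
        if not all(is_healthy(host) for host in wave):
            return updated, True  # halt: the remaining hosts stay untouched
    return updated, False


# Simulate an update that breaks every machine it lands on.
hosts = [f"host-{i}" for i in range(1000)]
deployed = []
updated, halted = staged_rollout(hosts, deployed.append, lambda host: False)
print(len(updated), halted)
```

In this simulation the bad update stops after the first wave of 10 hosts out of 1000, instead of reaching all of them at once.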

wizardbeard@lemmy.dbzer0.com · 4 months ago

The update that crashed things was an anti-malware definitions update, Crowdstrike offers no way to delay or stage them (they are downloaded automatically as soon as they are available), and there’s good reason for not wanting to delay definition updates as it leaves you vulnerable to known malware longer.

merc@sh.itjust.works · 3 months ago

And there’s a better reason for wanting to delay definition updates: this outage.

Kissaki@programming.dev · 3 months ago

How does a definitions update crash windows with a BSOD?

Gestrid@lemmy.ca · 4 months ago

Four days for an update to malware definitions is how computers get infected with malware. But you’re right that they should at least do some sort of simple test. “Does the machine boot, and are its files not getting overzealously deleted?”

Kissaki@programming.dev · 3 months ago

One of the fixes was deleting a System32 driver file. Is a Windows driver how they update definitions?

Gestrid@lemmy.ca · edit-2 3 months ago

The driver was one installed on the computer by the security company. The driver would look for and block threats incoming via the internet or intranet.

The definitions update included a driver update, and most of the computers the software was used on were configured to automatically restart to install the update. Unfortunately, the faulty driver update caused computers to BSOD and enter a boot loop.

Because of the boot loop, the driver could only be removed manually by entering Safe Mode. (That’s the thing you saw about deleting that file.) Then the updated driver, the one they released when they discovered the bug, would ideally be able to be installed normally after exiting Safe Mode.

🏴 hamid abbasi [he/him] 🏴@vegantheoryclub.org · edit-2 3 months ago

Crowdstrike CEO should go to jail. The corporation should get the death sentence.

Edit: For the downvoters, they for real negligently designed a system that killed people when it fails. The CEO as an officer of the company holds liability. If corporations want rights like people when they are grossly negligent they should be punished. We can’t put them in jail so they should be forced to divest their assets and be “killed.” This doesn’t even sound radical to me, this sounds like a basic safe guard against corporate overreach.

Seasm0ke@lemmy.world · 4 months ago

Reading between the lines, crowdstrike is certainly going to be sued for damages, putting a Dev on the hook means nobody gets - or pays - anything so long as one guy’s life gets absolutely ruined. Great system

luciole (he/him)@beehaw.org · 4 months ago

That is a lot of bile even for a rant. Agreed that it’s nonsensical to blame the dev though. This is software, human error should not be enough to cause such massive damage. Real question is: what’s wrong with the test suites? Did someone consciously decide the team would skimp on them?

As for blame, if we take the word of Crowdstrike’s CEO then there is no individual negligence nor malice involved. Therefore it is the company’s responsibility as a whole, plain and simple.

thingsiplay@beehaw.org · 4 months ago

Real question is: what’s wrong with the test suites?

This is what I’m asking myself too. If they had tested it, and they should have, then this massive error would not have happened: a) controlled test suites and machines in their labs, b) at least one test machine connected through the internet and acting like a customer, tested by a real human, c) updates in waves throughout the day. They can’t tell me that they did all 3 of these steps. -meme

Hector_McG@programming.dev · 4 months ago

Therefore it is the company’s responsibility as a whole.

The governance of the company as a whole is the CEO’s responsibility. Thus a company-wide failure is 100% the CEO’s fault.

If the CEO does not resign over this, the governance of the company will not change significantly, and it will happen again.

Umbrias@beehaw.org · 4 months ago

I don’t know enough about the CrowdStrike stuff in particular to have much of an opinion on it, but I will say that software devs/engineers have long skirted by without any of the accountability present in other engineering fields. If software engineers want to be called engineers, and they should, then this may be an excellent opportunity to introduce accountability associations and ethics requirements which prevent or reduce systemic company issues and empower software engineers to enforce good practices.

Mubelotix@jlai.lu · 4 months ago

I blame the users for using that software in the first place
