Crowdstrike – was it really the problem?

For folks not into tech, I’ll go ahead and answer the headline:

No.

Ok, all the luddites gone? Ok? Let’s go.

The TLDR is Crowdstrike makes anti-malware software. Some may call it anti-virus, but the capabilities of these packages have far surpassed simple virus scanning and interdiction. These are complicated products that use complicated mathematical methods to compare code found running on a system to signatures of known threats. The products can then respond by alerting, interdicting, halting, quarantining, or even deleting the threats.

If you haven’t heard of the Crowdstrike problem from today (7/19/2024), here’s some coverage:

https://www.npr.org/2024/07/19/g-s1-12222/microsoft-outage-banks-airlines-broadcasters

https://www.foxbusiness.com/fox-news-global-economy/global-technology-outage-disrupts-major-airlines-911-services-businesses

https://www.cnn.com/business/live-news/global-outage-intl-hnk/index.html

As the cybersecurity world moves quickly and new threats are discovered almost daily, products like Crowdstrike’s are updated frequently. Crowdstrike has an efficient content distribution network that allows those updates to push out to their customers very rapidly.

In this case, there was a defect in one of these updates, and that defect led to an unrecoverable fault in the Windows OS that resulted in the dreaded Blue Screen of Death (BSOD). Further, the systems continued to BSOD and crash every time they tried to load the Crowdstrike service on boot. The fix was to boot each system into Safe Mode and delete a file associated with the update, then restart the system into normal mode and push the repaired update back down to it.

This is an extremely labor-intensive process. Recovery from this is going to require humans to manually interrupt the boot process of every affected system on the planet and fix the issue on the system’s local console. Since there’s no networking available in plain Safe Mode, there is no way to remediate the issue at scale. It’s literally IT walking around to every PC with a floppy to install a fix, a common thing back in the 1980s and 1990s.

This could take weeks.

Here’s why I’m saying Crowdstrike isn’t the problem:

What’s happened here shouldn’t be possible. A service like this should not be able to induce a crash in a modern operating system. Even if it does, the OS’s crash recovery routines should identify the source of the crash and prevent that software from being launched on the next boot.

The much-maligned (but really misunderstood) systemd service manager for Linux can do this.

IBM z/OS can do this.

Most commercial Unixes could do this.
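
To make the idea concrete, here is a minimal Python sketch of that kind of crash-loop guard. It is not how systemd, z/OS, or any real OS implements it; the `BootGuard` class, the JSON state file, the `MAX_FAILURES` threshold, and the `simulate_boots` helper are all names I made up for illustration. The trick is simply to write a marker to disk before loading a risky component and clear it afterward. If the marker is still there on the next boot, you know what was loading when the machine died.

```python
import json
import tempfile
from pathlib import Path

MAX_FAILURES = 2  # hypothetical threshold: quarantine after two crashed boots


class BootGuard:
    """Tracks which component was loading when the previous boot died."""

    def __init__(self, state_file: Path):
        self.state_file = state_file
        if state_file.exists():
            self.state = json.loads(state_file.read_text())
        else:
            self.state = {"in_progress": None, "failures": {}}
        # A leftover marker means the last boot crashed while loading it.
        culprit = self.state["in_progress"]
        if culprit is not None:
            failures = self.state["failures"]
            failures[culprit] = failures.get(culprit, 0) + 1
            self.state["in_progress"] = None
            self._save()

    def _save(self):
        self.state_file.write_text(json.dumps(self.state))

    def is_quarantined(self, name):
        return self.state["failures"].get(name, 0) >= MAX_FAILURES

    def load(self, name, loader):
        """Run loader() unless name is quarantined. Returns True if loaded."""
        if self.is_quarantined(name):
            return False
        self.state["in_progress"] = name
        self._save()   # the marker reaches disk before the risky step
        loader()       # a crash here leaves the marker behind
        self.state["in_progress"] = None
        self._save()
        return True


def simulate_boots(boots):
    """Simulate repeated boots with a component that always crashes."""
    state_file = Path(tempfile.mkdtemp()) / "boot_state.json"

    def bad_driver():
        raise SystemError("simulated kernel crash")

    results = []
    for _ in range(boots):
        guard = BootGuard(state_file)  # each boot rereads the saved state
        try:
            loaded = guard.load("bad_driver", bad_driver)
            results.append("loaded" if loaded else "skipped")
        except SystemError:
            results.append("crashed")
    return results
```

Calling `simulate_boots(4)` crashes on the first two boots and skips the offending component on the third and fourth, which is the behavior I’m arguing an OS should give you for free.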

Why can Windows not do this yet?

Further, why do companies continue to subject themselves to the abuse Microsoft’s Operating system inflicts?

Microsoft Windows has a history of low quality and even outright fraud that stretches all the way back to its inception. In 1982, Bill Gates saw a demo of a graphical shell running on a PC at COMDEX and vowed to have a similar product by the following year.

At COMDEX in 1983, Bill demoed Windows 1.0. The problem? His demo was a mock-up. It wasn’t functioning software. It was a series of hand-drawn screens with extremely limited functionality. It took two more years for Microsoft to release the real Windows 1.0. Microsoft was lampooned widely at the time for advertising vaporware.

But release it did, and Windows basically did nothing in the market until version 3.0 came out, when somehow it took off in the business market. Yet, it was an extremely limited product that basically stretched the 16-bit 80286 platform to its limits. If a user dared to run more than one large app at a time, they were almost guaranteed to not get through the workday without having to reboot the computer several times. Even the action of cutting and pasting between Word and Excel could knock your PC completely out.

Fast forward to the famous COMDEX demo of Windows 98, when the OS bluescreened on stage the moment somebody plugged in a scanner.

https://www.theregister.com/2018/04/20/windows_98_comdex_bsod_video/

The audience roared because they knew. Windows was not reliable. It never was.

And yet, during the 1990s, Microsoft decided to make a push into the datacenter. At the time, PC-based servers were dominated by Novell NetWare for file and printer sharing, and IBM OS/2 for apps and middleware. Almost all database work was done on mainframes or mid-range systems running commercial Unix products like AIX, Solaris, and HP-UX.

But Microsoft used marketing and anti-competitive bullying to force its way into the server OS market with Windows NT. But they didn’t do it alone. They were aided by STUPID BUSINESS USERS that wanted their server environments to look like their desktops. Stupid humans buy what they think they know, not what they should.

So we got Windows NT 3.1, which actually wasn’t a bad product. David Cutler’s team did their homework. The problem came with Windows NT 4.0, when the desire to unify desktop and server resulted in third-party drivers, such as display drivers, being moved out of isolated user space and into the kernel to improve performance. Windows NT 4.0, Windows 2000, Windows XP, and the corresponding server versions suffered from driver instability that regularly crashed systems for the next two decades.

Further, Windows is a GUI-based OS. Its built-in command shell offered little in the way of looping structures beyond GOTO and a limited FOR command until PowerShell arrived in 2006 (Unix had a fully programmable shell – several of them, actually – almost since inception in the early 1970s). Automating tasks in Windows was difficult and typically required a software developer to write and compile an application. This was the trap of Windows. The GUI made it seem easy to use, but its lack of robust scripting and its complicated object-oriented APIs made automating tasks and operating the OS at scale really difficult. That made supporting Windows extremely labor-intensive and expensive.

On top of that, Windows doesn’t perform. Even back in the 1990s and early 2000s, Linux builds could post I/O performance that embarrassed what Windows could do on the same hardware. This suited Microsoft and their hardware partners just fine, because if a customer decided to put Windows in their datacenter, they were guaranteed to have to buy at least twice as many servers to accomplish the same amount of work as a customer using NetWare or Unix for the same roles.

Even today, you can see this in practice. Go set up an EC2 instance at AWS, and one using the same vCPU and memory footprint and OS at Azure, then run UnixBench inside each of them. Azure will be slower.

Why? Azure’s product uses the Windows Hyper-V hypervisor, while AWS’s current Nitro hypervisor is built on Linux KVM. Hyper-V depends on Windows to handle device I/O. No matter what OS you run on top of it, you can only go as fast as Windows can go, and Windows still can’t go as fast as its competition, even on the same hardware.

Windows NT’s network stack was once so crappy you could crash the entire machine simply by pinging it too quickly from another system.

The filesystem, NTFS, to this day, still journals only its metadata, not file contents. My employer had to restore some systems from backups because the crash ruined the filesystems. In 30 years, I have never experienced an unrecoverable filesystem on a Unix system, and I’ve been through a ReiserFS hash collision.

That brings us to today. Sure, Crowdstrike shipped a bad update. They have a lot of inward reflection to go through. Better code review, better testing before wide release, phasing the updates so they don’t roll out so quickly that they end up crippling the entire planet.
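
For anyone wondering what phasing an update looks like in practice, here is a rough Python sketch. I have no idea how Crowdstrike’s pipeline actually works, so treat everything here as an assumption: the ring sizes in `RINGS`, the `MAX_FAILURE_RATE` threshold, and the `bucket` and `staged_rollout` functions are hypothetical. The idea is to hash each host into a stable bucket, push to a small slice of the fleet first, and stop the moment a ring fails too often.

```python
import hashlib

RINGS = [0.01, 0.10, 0.50, 1.00]  # hypothetical cumulative fleet fractions
MAX_FAILURE_RATE = 0.02           # hypothetical halt threshold per ring


def bucket(host_id: str) -> float:
    """Map a host ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(host_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def staged_rollout(hosts, apply_update):
    """Push an update ring by ring and stop if a ring fails too often.

    apply_update(host) must return True on success.
    Returns (updated_hosts, halted).
    """
    updated = []
    previous = 0.0
    for limit in RINGS:
        # Hosts whose bucket falls in this ring's slice of [0, 1).
        ring = [h for h in hosts if previous <= bucket(h) < limit]
        failures = 0
        for host in ring:
            if apply_update(host):
                updated.append(host)
            else:
                failures += 1
        if ring and failures / len(ring) > MAX_FAILURE_RATE:
            return updated, True  # halt before touching the next ring
        previous = limit
    return updated, False
```

With a bad update (`apply_update` always returning False), the rollout halts after the first non-empty ring, so only a small slice of the fleet ever sees the defect instead of the entire planet.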

But the real problem is Microsoft Windows. It still suffers from fatal defects that make it truly unsuitable for mission critical work. What happened today would not have been possible on a modern Unix, or even a Unix from 20 years ago.

Business and government alike should take note of that and thoroughly examine why they insist on punishing themselves and their customers so often.
