Saturday, 20 July 2024

How a bug in a little-known piece of software caused a global meltdown.

Extract from ABC News

On Friday afternoon, a little-known computer program triggered an IT meltdown that took down digital infrastructure on a global scale.

Seemingly all at once, millions of computers around the world became unusable and could not be rebooted, showing what's known in the industry as the "Blue Screen of Death".

The ABC's broadcast systems were some of the earliest to be visibly affected, with computer after computer inexplicably breaking down, leaving journalists scrambling to deal with the exact issue they were trying to report on.

"It's possible we'll have difficulties updating this story due to computer systems affecting us here at the ABC," wrote the ABC's Joseph Dunstan on the blog at 4pm AEST.

Meanwhile, the ABC's TV presenters continued to broadcast without teleprompters or on-screen graphics.

A blue screen on computer screens in a newsroom
ABC Perth newsroom during the outage. (ABC News: Keane Bourke)

The scale of the outage — which some experts have called unprecedented — soon became clear as reports came in from almost every conceivable business.

Airports were thrown into chaos. Check-outs at supermarkets were down. Government departments, emergency services, universities, law firms, mines, media — no industry seemed to be spared.

Lucas and Liz Gibson were caught at Sunshine Coast airport, unable to check in for their scheduled flight to Sydney.

"The service desk staff can't help, computers are down, they don't know what's going on, they're just waiting and seeing," Mr Gibson told the ABC.

A Jetstar-branded screen announces a "global systems outage"
Airports across Australia were thrown into chaos as computer systems went down. (ABC News: Adam Griffiths)

It was mid-afternoon in Australia when the outage first hit, while other parts of the globe were still sleeping.

From Berlin to Dubai, the United States to India, the problem was the same.

"Across Europe, at airports in Spain, in Germany, there have been incidents that have been reported at almost all of the airports," said the ABC's Michelle Rimmer in London.

As the massive scope of the outage was still revealing itself, its cause was already being dissected online.

There was early speculation that Microsoft was responsible, largely because only computers running the US tech giant's operating systems had been affected.

Adding to the confusion, Microsoft had reported a major technical outage with its cloud services earlier in the day.

Ultimately though, that was a red herring.

Microsoft would later attribute the issue to "an update from a third-party software platform".

A little-known culprit

Less than an hour after computers had begun melting down, another US-based software company began to be linked to the outage.

CrowdStrike — valued at $125 billion at the close of US markets overnight — was being named by multiple affected organisations.

"Like a number of other organisations, global issues affecting CrowdStrike and Microsoft are disrupting some of our systems," said a Telstra spokesperson.

By 3.20pm AEST, CrowdStrike had informed its customers it was investigating "widespread reports of [Blue Screen of Death] on Windows hosts".

When a Reuters reporter called the company's technical support, they received a prerecorded message: "CrowdStrike is aware of reports of crashes on Windows … related to the Falcon sensor".

While few outside of the IT industry had heard the name before, CrowdStrike's Falcon software turned out to be deeply embedded in the world's computing infrastructure.

Used by IT departments to monitor for cybersecurity threats in real time, Falcon is close to ubiquitous on Windows computers, including laptops, desktops and servers.

To do this job, it needs to be deeply intertwined in the inner workings of the computers it is monitoring.

"Antivirus software is typically given access to a deep set of permissions (kernel-level access) on computers to protect against viruses and malware," said Professor Salil Kanhere, an expert in cyber security at UNSW.

"The flip side, however, is that if this very software malfunctions, then it can crash the computer."

Not a cyber attack

While a cybersecurity program had been identified as the root cause of the global outage, officials in Australia were saying it did not appear to be a cyber attack.

Australia's National Cyber Security Coordinator said there was "no information to suggest it is a cybersecurity incident".

An hour after it had notified customers of the issue, CrowdStrike pushed out a fix.

Technology experts soon reported successfully applying it, but warned that manual intervention was required in some cases.

"The trouble is getting the fix onto the computers means IT teams are going to have to touch every keyboard," said CyberCX executive Alastair MacGibbon.

And given the timing, it was bad news for IT administrators across the country.

"It's a Friday afternoon. People have gone home and gone away for the weekend," said StickmanCyber CEO Ajay Unni, whose team was working into the night to help its clients recover from the outage.

Dr Suelette Dreyfus, a computing and IT specialist from the University of Melbourne, warned that a resolution to the outage could be delayed by a heavy backlog.

CrowdStrike meets with Australian authorities

By 6pm — multiple hours after the outage began — Australia's governmental response had clicked into gear as well.

A national emergency meeting was called to discuss the outage with representatives from CrowdStrike.

The display on the side of a train says "not in service"
Train lines in the UK were temporarily unable to run during the outage. (AP: Peter Byrne/PA)

Following the meeting, Home Affairs Minister Clare O'Neil confirmed there was no evidence of a cybersecurity incident.

"This is a technical issue, caused by a CrowdStrike update to its customers," she said.

"They have issued a fix for this, allowing affected companies and organisations to reboot their systems without the problem.

"Given the size and nature of this incident, it may take some time to resolve."

Four hours after the incident began, CrowdStrike CEO George Kurtz made the company's first public statement: "The issue has been identified, isolated and a fix has been deployed.

"Our team is fully mobilized to ensure the security and stability of CrowdStrike customers."

A handwritten note on a shop door says "IT system down!"
Shop owners across the country were forced to close early on Friday. (ABC News: Jay Bowman)

It will take a gargantuan effort from IT teams around the world, not just CrowdStrike's, to reboot the millions of affected computers and restart the world's digital infrastructure.
