According to Microsoft, only about 8.5 million Windows devices were affected by the outage, which is less than 1% of all Windows machines worldwide. Sources have reported that 60 to 70% of Fortune 500 companies were impacted by the downtime, and the faulty deployment (update) was reportedly live for only about 90 minutes. The monetary losses are expected to run into the billions of dollars. This was a significant mistake by CrowdStrike, and it could have been prevented.
Putting all these facts together, I wonder whether this global company follows any DevOps best practices. What is the actual figure related to the outage, and what do their code review and deployment processes entail? Do they have a functional QA team? What is their process for logging incidents and automating corrections? What percentage of their customers did the deployment affect? And lastly, what could have been done differently?
As a DevOps-certified engineer, I will use this article to highlight preventive and corrective measures that could have been employed to avoid an incident like this. Let's call these measures “lines of defense”.
Human errors are inevitable; it’s for this reason that we have a code peer review process. The premise is simple: Mr. A writes code, and before it can be merged or used in staging or production, it has to be reviewed by Ms. B at the very least. A task as simple as this prevents many bugs from reaching production, but even that isn’t enough on its own, which leads to the second line of defense.
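As a minimal sketch of how such a gate can be enforced in CI (not CrowdStrike's actual process; the repository name and environment variables here are hypothetical), a pipeline step could query GitHub's pull request review API and refuse to proceed until at least one approval exists:

```python
import os
import sys
import requests

# Hypothetical repository; in CI, the PR number and token would come from the pipeline context.
REPO = "example-org/sensor-content"
PR_NUMBER = os.environ["PR_NUMBER"]
TOKEN = os.environ["GITHUB_TOKEN"]

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/pulls/{PR_NUMBER}/reviews",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()

# Count reviews that ended in an approval.
approvals = [r for r in resp.json() if r["state"] == "APPROVED"]
if not approvals:
    print("Blocking merge: at least one peer approval is required.")
    sys.exit(1)

print(f"{len(approvals)} approval(s) found; merge may proceed.")
```

In practice, platforms such as GitHub and GitLab can enforce the same rule with branch protection settings, without any custom scripting, so nobody can merge their own unreviewed change.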
Build a staging environment that closely mimics the production environment. It’s not enough to “just” have a staging environment. In fact, the staging environment should be identical to the production environment, differing mainly in scale (size).
To keep the environments identical, it is highly recommended to provision both staging and production using Infrastructure as Code (IaC). There are numerous benefits to this approach, one of which is eliminating the human error involved in manually configuring devices or servers. If the same code is used to provision staging and production, we can be far more confident that the two environments are identical, all things being equal.
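To illustrate, here is a hedged sketch using Pulumi's Python SDK (one IaC option among many; the article doesn't say which tool, if any, CrowdStrike uses). The same code provisions both environments, and only the per-stack configuration, such as instance count and size, differs:

```python
import pulumi
import pulumi_aws as aws

# Per-environment settings (set via `pulumi config set` on the "staging" and
# "production" stacks); everything else below is identical code for both.
config = pulumi.Config()
instance_count = config.require_int("instanceCount")      # e.g. 2 for staging, 20 for production
instance_type = config.get("instanceType") or "t3.micro"  # bigger type in production if needed

# Look up the latest Amazon Linux 2 AMI (illustrative choice of base image).
ami = aws.ec2.get_ami(
    most_recent=True,
    owners=["amazon"],
    filters=[{"name": "name", "values": ["amzn2-ami-hvm-*-x86_64-gp2"]}],
)

servers = [
    aws.ec2.Instance(
        f"app-server-{i}",
        ami=ami.id,
        instance_type=instance_type,
        tags={"Environment": pulumi.get_stack()},
    )
    for i in range(instance_count)
]

pulumi.export("server_ids", [s.id for s in servers])
```

Because the only difference between stacks is configuration data, drift between staging and production becomes a reviewable code change rather than an undocumented manual tweak.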
Regardless of the prior two lines of defense, errors are still inevitable. Servers can unexpectedly stop, reboot, hang, or freeze. Code that previously worked may suddenly fail. Customers shouldn't have to be the ones notifying you about issues like a broken image in your application, for example.
Application logs and changes should be monitored, and remedial actions should be automated. For example, in this case, the remediation could involve pausing the deployment, rolling back the deployment, and alerting the relevant team. This alone would have drastically reduced the scale of the downtime.
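As a hedged sketch of what that automated remediation might look like (the thresholds are illustrative, and every helper below is a hypothetical stub standing in for your metrics platform and deployment tooling):

```python
import logging

# Illustrative thresholds; tune these for your own application.
MAX_ERROR_RATE = 0.02    # roll back if more than 2% of requests fail
MAX_CRASHING_HOSTS = 3   # roll back if more than 3 hosts report crash loops

# Hypothetical stubs; in a real system these would call your monitoring and deployment APIs.
def get_error_rate(deployment_id: str) -> float:
    return 0.0

def get_crashing_hosts(deployment_id: str) -> int:
    return 0

def pause_deployment(deployment_id: str) -> None:
    logging.warning("Pausing deployment %s", deployment_id)

def rollback_deployment(deployment_id: str) -> None:
    logging.warning("Rolling back deployment %s", deployment_id)

def page_on_call_team(deployment_id: str) -> None:
    logging.warning("Paging on-call team for deployment %s", deployment_id)

def monitor_rollout(deployment_id: str) -> None:
    """Pause, roll back, and alert if the new release misbehaves."""
    error_rate = get_error_rate(deployment_id)
    crashing_hosts = get_crashing_hosts(deployment_id)
    if error_rate > MAX_ERROR_RATE or crashing_hosts > MAX_CRASHING_HOSTS:
        logging.error(
            "Deployment %s unhealthy (error_rate=%.2f%%, crashing_hosts=%d)",
            deployment_id, error_rate * 100, crashing_hosts,
        )
        pause_deployment(deployment_id)     # stop pushing to any more devices
        rollback_deployment(deployment_id)  # revert devices already updated
        page_on_call_team(deployment_id)    # alert the relevant team
```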
Automating incident response is highly challenging, if not impossible, without Infrastructure as Code, as mentioned above. This is one of the key advantages of using IaC.
Have simple health checks that work. It’s that simple! Health checks are designed to monitor the vitals of an application. Without appropriate health checks, we cannot monitor the application, nor can we mitigate or automate the response to critical application downtime.
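For instance, a minimal HTTP health check endpoint (sketched here with Flask purely as an illustration, not as anyone's actual implementation) might look like this; a load balancer, orchestrator, or deployment pipeline can then probe it and pull unhealthy instances out of rotation:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def database_is_reachable() -> bool:
    # Hypothetical dependency check; replace with a real ping to your database,
    # message queue, or any other critical dependency.
    return True

@app.route("/healthz")
def healthz():
    # Report 200 when the vitals look good, 503 when a critical dependency is down.
    if database_is_reachable():
        return jsonify(status="ok"), 200
    return jsonify(status="degraded"), 503

if __name__ == "__main__":
    app.run(port=8080)
```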
In DevOps practice, there are several deployment strategies, such as rolling updates and blue-green deployments. For example, in a rolling update, a few instances (buckets) are updated first, and the next bucket is updated only when the previous one has been successfully deployed and is healthy. If a solid deployment and rollback strategy is in place, only a small fraction of devices will be affected by a bad release.
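A hedged sketch of that rolling, health-gated logic (the helper functions are illustrative placeholders, not any vendor's actual rollout system):

```python
import time

def deploy_to_bucket(bucket: list[str]) -> None:
    # Hypothetical: push the update to every device in this bucket.
    print(f"Deploying update to {len(bucket)} devices...")

def bucket_is_healthy(bucket: list[str]) -> bool:
    # Hypothetical: query the health checks of every device in the bucket.
    return True

def rollback_bucket(bucket: list[str]) -> None:
    # Hypothetical: revert the bucket to the previous known-good version.
    print(f"Rolling back {len(bucket)} devices...")

def rolling_update(buckets: list[list[str]], soak_seconds: int = 300) -> bool:
    """Update one bucket at a time; stop and roll back at the first unhealthy bucket."""
    for i, bucket in enumerate(buckets, start=1):
        deploy_to_bucket(bucket)
        time.sleep(soak_seconds)  # let the bucket "soak" before judging its health
        if not bucket_is_healthy(bucket):
            rollback_bucket(bucket)
            print(f"Halting rollout at bucket {i}/{len(buckets)}.")
            return False
        print(f"Bucket {i}/{len(buckets)} healthy; continuing.")
    return True
```

With a gate like this, a bad update stops at the first small bucket instead of reaching millions of devices.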
It’s common practice to use a percentage to roll out deployments. However, this approach may not be advisable when deploying to 8 million devices, because even a small percentage is still an enormous number: 10% of 8 million devices is 800,000 critical devices. In this case, I’d suggest categorizing devices into sample buckets. These samples can be grouped by a common configuration, operating system, or device type, for instance. Each sample bucket can then contribute about 3 identical devices, which together form the initial test rollout.
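As an illustration of that bucketing idea (the field names and the tiny sample fleet below are made up), group the fleet by configuration, operating system, and device type, then pick about 3 devices from each group for the initial rollout:

```python
from collections import defaultdict

# Hypothetical fleet inventory; in practice this would come from your device-management system.
fleet = [
    {"id": "host-001", "os": "Windows 10", "type": "laptop", "config": "standard"},
    {"id": "host-002", "os": "Windows 10", "type": "laptop", "config": "standard"},
    {"id": "host-003", "os": "Windows 11", "type": "server", "config": "hardened"},
    # ... millions more in reality
]

SAMPLE_SIZE = 3  # roughly 3 identical devices per bucket for the initial test rollout

def initial_rollout_sample(devices: list[dict]) -> list[dict]:
    """Group devices by (os, type, config) and take a few from each group."""
    buckets: dict[tuple, list[dict]] = defaultdict(list)
    for device in devices:
        buckets[(device["os"], device["type"], device["config"])].append(device)
    return [d for bucket in buckets.values() for d in bucket[:SAMPLE_SIZE]]

print(initial_rollout_sample(fleet))
```

Only after this small, representative sample proves healthy would the rollout widen to the rest of each bucket.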
In conclusion, DevOps is a fascinating practice that mitigates exactly this type of occurrence. Had these 5 simple steps been followed, I can almost guarantee that the scale of the devastation would have been minimal, or perhaps unnoticeable. But sadly, being a Fortune 500 company doesn’t mean a business adheres to best practices or evolves with technology. Don’t be like CrowdStrike: hire a certified DevOps engineer to review your workflow.