‘IT Stories from the Road’ is a series of first-person stories told by IT professionals. If you’d like to share a story, email us at [email protected]!
—
When Automation Goes Off the Rails
When most people think about automation, they picture clean processes that kick off quickly, run efficiently, and manage themselves. Automation is the new ‘Easy Button’. Why have people do things manually? Manual work is slow, error-prone, and expensive, and automation can solve all of those problems. But those of us working in IT know the risk that comes with automation. It’s true that automation can be slick when it works correctly, but when it doesn’t, you’d better look out.
The following example was documented as a Halloween horror story last year, and I thought I’d bring it here too because IT nightmares, unfortunately, aren’t limited to Halloween. They can be regular occurrences that keep IT professionals up all night, cost an organization thousands of dollars, and even risk irreversible damage to a company’s or an individual’s reputation.
Once Upon a Time…
The story starts years ago, after one of my colleagues had developed an Oracle upgrade which would support roughly 2,000 users of a single application. At the time, we had many different versions of Oracle for different PeopleSoft apps across the company, including asset management, payroll, general ledger, and other critical systems – none of which had been tested with the update. Because those other applications had not yet been tested against the upgrade, it needed to be deployed only to those 2,000 specific devices. The impact on everything else was unknown, so it was better to leave those devices alone.
Unfortunately, the engineer made a critical mistake. Instead of targeting just those specific devices in SMS, he dragged the install package to a collection called “All Devices”. The next morning the upgrade began to install on every single device in the company – potentially affecting up to 45,000 employees.
The implications of this mistake rippled across the entire company. Every device that received the update would be out of commission for some time and would go through an unanticipated, forced mid-day reboot. Because the update was unexpected for most users, we saw a huge uptick in help desk calls. Worst of all, because the upgrade hadn’t been tested with the other applications, we fully expected application issues that threatened a major business impact we weren’t prepared for.
It was essentially impossible to stop the download or install once it was received by the device. When you tell SMS to do something, it does it. And telling it to stop is just another command that gets put into the queue. Because execution reporting was so slow, the team had no way to know the scope, the business impact, or what steps to take for remediation. If the update installed on a device that didn’t previously have the app, we needed to remove it. But if it upgraded a device that had a prior version, we couldn’t just remove it – we needed to know what version and configuration had been on the device previously. Without knowing exactly who had received the update, we didn’t even know which devices needed remediation. In short, we were in the middle of an IT nightmare, and the operations teams were blind.
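The remediation reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Device record, its field names, and the action labels are all hypothetical, and it presumes exactly the per-device data (who received the update, and what was installed before) that we could not get from SMS at the time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical per-device record; none of these fields come from SMS."""
    name: str
    in_intended_scope: bool       # one of the ~2,000 intended targets
    received_update: bool         # did the rogue push reach this device?
    prior_version: Optional[str]  # None means the app was not installed before


def remediation_action(device: Device) -> str:
    """Return a hypothetical action label for one device."""
    if not device.received_update:
        return "none"       # the push never reached it; nothing to undo
    if device.in_intended_scope:
        return "keep"       # this device was supposed to get the upgrade
    if device.prior_version is None:
        return "uninstall"  # app was never there, so removing it is safe
    # A prior version existed, so a plain removal would break the device:
    # the earlier version (and its configuration) has to be restored.
    return f"restore:{device.prior_version}"


if __name__ == "__main__":
    fleet = [
        Device("pc-001", True, True, "old"),
        Device("pc-002", False, True, None),
        Device("pc-003", False, True, "older"),
        Device("pc-004", False, False, "older"),
    ]
    for d in fleet:
        print(d.name, remediation_action(d))
```

The point of the sketch is that every branch depends on a fact about the device's state before and after the push. Without that visibility, none of the branches can be evaluated, which is exactly where we were stuck.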
Historical Problems Still Experienced Today
Here’s the silver lining: in today’s advanced world of digital employee experience (DEX), situations like this can be avoided. With the right DEX solution, you can track delivery, installation, and reboot status minute by minute, and see any resulting application errors in real time. These advances mean that teams today can quickly reverse the consequences of a single mistake, whether manual or automated. And in these days of automated updates and SaaS services, seeing the impact of updates is more critical than ever.
Mitigation is possible by communicating with only the users affected by the rogue update, or those at risk of being affected. We could have taken this crucial step by developing a real-time mitigation script to stop the client activity at the endpoint, but at the time all we had available was the slower, imprecise SMS client.
Looking back, there was a sense of panic that washed over us as we worked frantically to come up with a solution that could erase the problem, or even just let us fully understand it. In this example, we were thankfully able to mitigate the issue to some extent. We prevented the interruption from hitting all 45,000 employees, but it did impact around 5,000 employees beyond the original distribution.
You Can Avoid Nightmare Scenarios
This story is just one example of the stressful situations IT professionals will find themselves in over the course of their careers, and there are hundreds more like it that have cost IT workers sleep.
By improving visibility across all devices – whether in remote settings or in the office – you can significantly reduce the number of IT nightmares your service team will experience. IT mishaps are bound to happen, but arming IT professionals with the right tools to mitigate issues, whether human error or technology failure, can make all the difference.