OK, here’s my first IT Horror Story for Saturday night. It's a bit long.
I was at a client, a healthcare facility, to replace some hard drives. They didn’t want to spend money, so we had to keep the current setup running, which was outdated and unreliable.
Now I can say it: they didn’t want to spend because the general manager’s goal was to give the work to a company run by his friends, who were already providing support on two VMs, and he wanted to hand everything over to them. The IT manager hadn’t yet understood these people's financial interests and still believed everything was in good faith, that they really didn’t have the funds. We were holding everything together with duct tape, but it was working and stable. I had set up a "cluster" with OpenNebula and GlusterFS for storage (later replaced with MooseFS and then with Ceph), using all the available hardware.
We scheduled an intervention and notified everyone to disconnect and shut down their machines by 12:30. By 13:00 we had completed the backups, aiming to start the intervention by 14:00 and have everything back up and running by 15:30. The goal was to update the systems and check the disks. We shut down all the VMs and had everyone disconnect. It was lunchtime.
We updated the servers and rebooted them. One of the disks started throwing errors. GlusterFS, for some reason I never really investigated (I have my theories, which I’ll share later, but from that day forward, GlusterFS no longer exists for me), decided to overwrite both that disk and its replica with zeros. I hadn’t changed anything.
Panic: there were backups, but on a USB 2.0 disk (old servers, no USB 3)! I immediately stopped everything. I nearly fainted. The IT manager didn’t understand what had happened, so I explained it to him. He announced to everyone that we would cancel the rest of the intervention, restore from backup, and have everything back up by 15:30 as planned, prioritizing the most critical VMs.
The "competing" company reached out. They had powered on one of their VMs and started "doing their own interventions," even though they had been warned not to touch anything. And they complained that "we" had lost some of their data. Of course, the general manager and those guys seized the moment ("carpe diem"): they accused me, saying I had undoubtedly made a mistake that led to "lost data." I wrote a technical report explaining what I had seen, noting several SSH logins from those guys during the intervention. The shell "history" had been erased. Of course, not by me.
They continued to harass me for a while. The last thing they asked was for me to go to a meeting "to explain in person." They tried to schedule it the day before my wedding. And they knew it.
They threatened to demand an unspecified (high) amount of financial compensation for "the lost data." What lost data? The data that, allegedly, the other company had entered in the meantime.
Final result: no consequences for me (I hadn’t done anything wrong), and the backups were fine. They only calmed down when I proved, logs in hand, that I wasn’t the only one connected to that machine, and when my witnesses (two colleagues and the IT manager), who had seen all my actions, confirmed I hadn’t done anything wrong.
In the end, I realized they would do it again, so I dropped the client. Even the IT manager decided to resign and change jobs. The general manager managed to bring in the company he wanted, at astronomical prices. After two months, they got their hands on the system and broke it. They asked me for assistance, which I refused. At any price.
A few years later, I found out that the general manager had ended up in jail for corruption, bribes, and favoring his friends' companies in many sectors.
That day, I celebrated.
#HorrorStory #ITHorrorStory #ITSupport #TechTroubles #SysAdmin