The first outage I caused.

The first time I caused a real outage, I wiped the NTFS permissions off every home folder in the company.

I was early in my career, working on an NT4 file server. I opened the security tab on the top-level home folders share, ticked Take Ownership, ticked Replace Permissions on Subdirectories, and clicked OK without really understanding what I was about to push down the tree. The dialog closed. The ACLs on every user's home folder went with it. Hundreds of people. Every personal directory. Gone in the time it took the change to propagate.

There was a moment when the dialog had closed and I understood what I had just done, and then a longer moment when I had to walk down the hall and tell someone.

Two colleagues stayed back with me that night. We pulled the backups, worked out the restore order, mapped what had to come back first so people could actually log in the next morning, and watched progress bars until the sun came up. Nobody made me feel small about it. They just sat down and started solving the problem with me, which is something I will never forget.

The technical lesson was the obvious one. Understand what a checkbox actually does before you click it, especially when the checkbox is going to recurse. Test destructive changes somewhere they cannot hurt anyone. Take backups yourself before you touch anything, even if there is supposed to be one already.
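Those three habits can be sketched in a few lines of Python. This is an illustration only, not what we ran that night: it assumes POSIX permission bits rather than NTFS ACLs, and every function name and the JSON backup file are hypothetical. The idea is to preview what a recursive change would touch, save the current state yourself, and keep a way back.

```python
import json
import os
import stat


def snapshot_modes(root):
    """Record the permission bits of root and everything beneath it."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    # Symlinks are skipped: os.chmod would follow them out of the tree.
    return {
        p: stat.S_IMODE(os.lstat(p).st_mode)
        for p in paths
        if not os.path.islink(p)
    }


def plan_recursive_chmod(root, new_mode):
    """Dry run: list (path, old_mode, new_mode) for every path that would change."""
    return [
        (path, old, new_mode)
        for path, old in snapshot_modes(root).items()
        if old != new_mode
    ]


def apply_with_backup(root, new_mode, backup_file):
    """Write our own backup first, then push the change down the tree."""
    snapshot = snapshot_modes(root)
    with open(backup_file, "w") as fh:
        json.dump(snapshot, fh, indent=2)
    # Children before parents, so a restrictive mode on a directory
    # cannot lock us out of its contents halfway through.
    for path in reversed(list(snapshot)):
        os.chmod(path, new_mode)
    return len(snapshot)


def restore(backup_file):
    """Put back the modes recorded by apply_with_backup, parents first."""
    with open(backup_file) as fh:
        snapshot = json.load(fh)
    for path, mode in snapshot.items():
        os.chmod(path, mode)
```

Running `plan_recursive_chmod` on a scratch copy of the tree and reading the list before calling `apply_with_backup` is the part I skipped. A list several hundred entries long is a far louder warning than a checkbox.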

The bigger lesson took longer. Senior engineers are not the ones who never break anything. They are the ones who have broken something that mattered, sat with the weight of it, and changed how they work afterwards. The juniors I worry about are the ones who have never had a night like that, because it usually means they have not yet been trusted with anything large enough to drop.

I am more careful and reflective now. I am also more patient with the person on the other end of the ticket who has just done something they cannot undo, because I remember exactly what that feels like.
