Writing

IT posture, infrastructure fundamentals, and the patterns that keep showing up.

A place to collect thinking on IT posture, infrastructure fundamentals, and the kinds of problems that appear in practice across different organisations and sectors. Most of it is observational. Some of it is opinionated. None of it is sponsored.

Posts

  1. The cause is usually older than the trigger, and sometimes the cause is you.

    The worst change I ever made to a production environment took months to fully surface, and I'd made it deliberately, carefully, and for good reasons.

    Microsoft had renamed the user shell folder from Documents to My Documents, or back the other way depending on which era you're counting from, and the Group Policy that redirected user folders needed updating to reflect the new path. I made the change. It applied cleanly. Spot checks looked fine. New logons picked up the new path, redirection worked, files were where they were supposed to be, and the change moved out of my head and into the pile of things that were done.

    The problems started turning up weeks later and didn't look like a Group Policy problem. A user couldn't find a file they were sure they'd saved. Another user's application was writing to a path that no longer existed for anyone else. A third user had two folders with overlapping contents and no clear sense of which was current. Each one looked like a user error or an application quirk in isolation, and each one got handled in isolation, which is the worst possible way to handle a pattern.

    The mechanism was that the rename hadn't been clean for every profile. Existing profiles, profiles that had logged on before the change, profiles that had been roamed, profiles that had been touched by older policies still cached somewhere, all behaved slightly differently from the freshly minted ones I'd tested against. Some users were reading from one path and writing to another. The redirection was working exactly as configured. The configuration was correct. The environment underneath it wasn't uniform enough for "correct" to mean the same thing for every user.
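    A minimal sketch of the kind of check that would have shown the split early, in Python. Everything in it is an assumption for illustration: the `ProfilePaths` record, the idea that an audit export exists with a read path and a write path per user, and the sample users and server name are all made up. It is not what I ran at the time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfilePaths:
    """Hypothetical per-user record; in practice this would come from an audit export."""
    user: str
    read_path: str   # where the shell resolves the folder for this user
    write_path: str  # where the user's applications are actually saving


def audit(profiles):
    """Group users by resolved path and list users whose read and write paths differ."""
    by_path = defaultdict(list)
    split = []
    for p in profiles:
        # Windows paths are case-insensitive, so compare in lower case.
        by_path[p.read_path.lower()].append(p.user)
        if p.read_path.lower() != p.write_path.lower():
            split.append(p.user)
    return dict(by_path), split


# Made-up sample data, not real users or servers.
SAMPLE = [
    ProfilePaths("alice", r"\\fs01\home\alice\Documents", r"\\fs01\home\alice\Documents"),
    ProfilePaths("bob", r"\\fs01\home\bob\My Documents", r"\\fs01\home\bob\Documents"),
    ProfilePaths("carol", r"\\fs01\home\carol\My Documents", r"\\fs01\home\carol\My Documents"),
]

by_path, split_users = audit(SAMPLE)
# The last path component shows how many different folder names are live at once.
folder_names = {path.rsplit("\\", 1)[-1] for path in by_path}
print("Folder names in use:", sorted(folder_names))
print("Users reading and writing different paths:", split_users)
```

    More than one folder name in use, or any user in the second list, is the non-uniformity that a spot check against fresh profiles will not show.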

    What made it take months to surface was that nothing failed loudly. No error dialogs, no event log entries that pointed at the policy, no helpdesk ticket that said "since the Group Policy change". Worse, the helpdesk started doing "Profile Resets", which caused even more harm and user data loss.

    The symptoms were all downstream and all human: missing files, confusion, duplicated work, the slow erosion of trust in the file server. By the time I traced it back to the change I'd made, the incident had stopped being a Group Policy problem and started being a data hygiene problem across dozens of profiles, which is a much harder thing to fix than the original change would have been to roll back.

    The lesson I took from it, and the reason I still think about it nearly 15 years later, is that a change being technically correct is not the same as a change being safe in the environment you're applying it to. The test environment was uniform. Production wasn't. Nothing in my process at the time was designed to find that difference before users did.

    The technical trigger is always recent. The cause is usually older than the trigger, and sometimes the cause is you.

  2. Event logs and the story behind every incident.

    This one is close to my heart and probably the most important thing I would say to anyone starting out in IT.

    Over the years, I've learned that most systems are already telling you what went wrong. Not in summaries or dashboards, but in the event logs, quietly and often long before anyone notices there's a problem.

    When incidents escalate, it is common to see teams restart services, reapply configurations or focus on the last visible failure. The pressure to "do something" is understandable, but it often skips over the place where the full story already exists.

    Event logs are rarely neat. They're noisy, repetitive, and sometimes frustrating to interpret, but taken together they describe sequence, timing, dependency and cause in a way no single alert ever can.

    With experience, patterns start to stand out: the same warnings appearing before different failures, the same authentication errors preceding broader outages, and the same timing gaps that point back to an earlier dependency failure rather than the symptom everyone is reacting to.

    Reading logs well isn't about memorising event IDs or knowing every subsystem in advance. It's about learning how systems narrate their own behaviour, and trusting that narrative even when it contradicts first impressions.

    I've lost count of the number of times an incident felt complex and multi-dimensional, only for the logs to show a very ordinary sequence of events once they were read in order.
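    "Read in order" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tool: the log line format (an ISO timestamp, a space, then a message), the host names `dc01` and `app01`, and the messages themselves are invented. It also assumes the clocks on both sources agree, which is exactly the kind of thing the logs may be telling you is not true.

```python
import heapq
from datetime import datetime
from typing import NamedTuple


class Entry(NamedTuple):
    when: datetime
    source: str
    message: str


def parse(source, lines):
    """Parse 'ISO-timestamp message' lines from one source and sort them by time."""
    entries = []
    for line in lines:
        stamp, _, message = line.partition(" ")
        entries.append(Entry(datetime.fromisoformat(stamp), source, message))
    return sorted(entries)


def timeline(*streams):
    """Merge already-sorted per-source streams into one chronological list."""
    return list(heapq.merge(*streams, key=lambda e: e.when))


# Invented sample lines; the second source is deliberately out of order.
DC = parse("dc01", [
    "2024-03-01T02:14:07 time sync failed",
    "2024-03-01T02:31:40 authentication errors rising",
])
APP = parse("app01", [
    "2024-03-01T02:33:12 service could not authenticate",
    "2024-03-01T02:20:55 warning: clock skew detected",
])

for e in timeline(DC, APP):
    print(e.when.isoformat(), e.source, e.message)
```

    Interleaved like this, the last line is the symptom everyone reacts to and the first line is where the story starts.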

    The fundamentals show up here again. Time matters, identity matters and dependencies matter. The logs reflect all of it, if you're willing to slow down and listen.

    For me, this is where troubleshooting stopped feeling reactive and started feeling deliberate, not because problems disappeared, but because the story was already there, waiting to be read.

  3. The first outage I caused.

    The first time I caused a real outage, I wiped the NTFS permissions off every home folder in the company.

    I was early in my career, working on an NT4 file server. I opened the security tab on the top-level home folders share, ticked Take Ownership, ticked Replace Permissions on Subdirectories, and clicked OK without really understanding what I was about to push down the tree. The dialog closed. The ACLs on every user's home folder went with it. Hundreds of people. Every personal directory. Gone in the time it took the change to propagate.

    There was a moment where the dialog had closed and I understood what I had just done, and then a longer moment where I had to walk down the hall and tell someone.

    Two colleagues stayed back with me that night. We pulled the backups, worked out the restore order, mapped what had to come back first so people could actually log in the next morning, and watched progress bars until the sun came up. Nobody made me feel small about it. They just sat down and started solving the problem with me, which is something I will never forget.

    The technical lesson was the obvious one. Understand what a checkbox actually does before you click it, especially when the checkbox is going to recurse. Test destructive changes somewhere they cannot hurt anyone. Take backups yourself before you touch anything, even if there is supposed to be one already.
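    A rough Python sketch of the habit, assuming a throwaway directory tree rather than a real file server. The function name `snapshot_modes` and the sample folders are hypothetical, and the permission bits it records are only a stand-in: they are not NTFS ACLs, and on a Windows file server you would export the real ACLs with the platform's own tooling before touching anything.

```python
import os
import stat
import tempfile


def snapshot_modes(root):
    """Walk root and record the current permission bits of every directory and file.

    This doubles as a dry run: the keys are every object a recursive change
    applied at root would touch.
    """
    modes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, n) for n in filenames]
        for path in paths:
            modes[path] = stat.S_IMODE(os.stat(path).st_mode)
    return modes


# Build a small throwaway tree standing in for a home folders share.
with tempfile.TemporaryDirectory() as root:
    for user in ("alice", "bob"):
        home = os.path.join(root, user)
        os.mkdir(home)
        with open(os.path.join(home, "notes.txt"), "w") as f:
            f.write("example")

    before = snapshot_modes(root)
    touched = len(before)
    print(f"A recursive change at {root} would touch {touched} objects")
```

    The point is the order of operations: count what the recursion will reach, keep a record of the current state, and only then decide whether to click OK.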

    The bigger lesson took longer. Senior engineers are not the ones who never break anything. They are the ones who have broken something that mattered, sat with the weight of it, and changed how they work afterwards. The juniors I worry about are the ones who have never had a night like that, because it usually means they have not yet been trusted with anything large enough to drop.

    I am more careful and reflective now. I am also more patient with the person on the other end of the ticket who has just done something they cannot undo, because I remember exactly what that feels like.

Most of the writing lives on LinkedIn. The posts here are a selection of the longer or more reference-worthy pieces.

Let's talk.

Interested in working together, or just want to connect? Drop me a line and I'll get back to you.

rob@robswain.au