What do you do when you need to make a change to a system used by several thousand people? Any change to a complex system carries the possibility of unintended consequences, so the answer is this: test, test, test. And when you’re done testing and ready to deploy your change, be sure to have a backup plan.
I found myself in this very situation last week. We found an issue with our email system that would affect only a few users in a very specific scenario; however, it did need to be fixed. After researching the problem and several possible solutions, I settled on one. Then I caught myself: “Stop. What if this solution causes other problems?” We host well over 2,000 mailboxes on our email system, and while I consider myself a competent engineer, no amount of staring at a potential solution is going to reveal all of its possible side effects.
After consulting with more seasoned colleagues, we decided to implement the solution on a limited number of mailboxes first, with a safety net in place in case things went wrong. We would watch for a day, and if no one reported any issues, we would start rolling out the fix to other mailboxes.
Half a day into testing our solution, a report of undelivered mail came in; after investigating, I found it was indeed a corner case I hadn’t considered. My stomach got that all-too-familiar feeling: the “what would have happened if I had done that with all 2,000+ mailboxes we host” feeling. Luckily, I had my safety net to fall back on, and I was able to redeliver the messages that had not gone through the first time.
Let’s take the example above and construct a list of general steps for solving a problem on a complex system:
- Think about the problem. What caused it? What are the solutions? Which solution has the least impact on existing operations?
- Do a small-scale test of the proposed solution. This test should not affect any person or process in production. Adjust the solution and repeat this step until no side effects are found.
- Test the proposed solution on a subset of the production system for an extended period. Have a backup plan in case something goes wrong. Adjust the solution and repeat this step until no side effects are found.
- Take the leap: deploy your change on a wider scale on the production system. This should probably be incremental, as there is still a chance of side effects that earlier testing did not uncover.
- After the solution has been fully deployed, continue to monitor the situation for unusual activity.
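
The last three steps can be sketched in code. What follows is only a rough illustration in Python, not the procedure we actually ran on our mail system: `staged_rollout`, its parameters, and the stage sizes are all hypothetical names and values chosen for the example. The caller supplies three callbacks: one that applies the change to a single target, one that reports whether everything changed so far still looks healthy, and one that undoes the change (the safety net).

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    targets: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[List[str]], bool],
    rollback: Callable[[str], None],
    stage_fractions: Iterable[float] = (0.01, 0.10, 0.50, 1.0),
) -> List[str]:
    """Apply a change in growing stages; undo everything if a stage looks unhealthy.

    All names here are hypothetical. Returns the targets that ended up
    changed, which is an empty list after a rollback.
    """
    changed: List[str] = []
    for fraction in stage_fractions:
        # How many targets should be changed once this stage is done;
        # at least one, so the first stage is never empty.
        goal = max(1, int(len(targets) * fraction))
        for target in targets[len(changed):goal]:
            apply_change(target)
            changed.append(target)
        # In practice this check would follow a long observation period
        # (we watched for a day); here it is a single callback.
        if not is_healthy(changed):
            # The safety net: revert in reverse order of application.
            for target in reversed(changed):
                rollback(target)
            return []
    return changed


# Example with made-up mailboxes: the change "breaks" one of them,
# which only shows up once the rollout reaches the third stage.
mailboxes = [f"user{i}@example.com" for i in range(200)]
fixed = set()
result = staged_rollout(
    mailboxes,
    apply_change=fixed.add,
    is_healthy=lambda changed: "user50@example.com" not in changed,
    rollback=fixed.discard,
)
```

In this example the problem mailbox is only reached in the 50% stage, so the function rolls back the 100 mailboxes changed so far and the other 100 are never touched. A real rollout would more likely pause and adjust the fix than revert everything automatically, but the shape is the same: small stages, a health check between them, and a way back.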