Complex System Problem Solving
Topics: Technical

What do you do when you need to make a change to a system used by several thousand people? When making a change to any complex system, there is always the possibility of unintended consequences. The answer is this: Test, Test, Test. When you’re done testing and you’re ready to deploy your change, be sure to have a backup plan.

I found myself in this very situation last week. We found an issue with our email system that would affect only a few users in a very specific scenario; however, it did need to be fixed. After researching the problem and several possible solutions, I settled on one. Then I stopped myself: “What if this solution causes other problems?” We host well over 2000 mailboxes on our email system, and while I consider myself to be a competent engineer, no amount of staring at a potential solution is going to reveal all of the possible side-effects.

After consulting with more seasoned colleagues, we decided to implement the solution on a limited number of mailboxes first, with a safety net in place in case things went wrong. We would watch for a day, and if no one reported any issues, we would start rolling out the fix to other mailboxes.

After half a day of testing our solution, a report of undelivered mail came in. When I investigated, I found it was indeed a corner case I hadn’t considered. My stomach got that all-too-familiar feeling: the “what would have happened if I had done that with all 2000+ mailboxes we host” feeling. Luckily, I had my safety net to fall back on, and I was able to re-deliver the messages that had not gone through the first time.

Let’s take the example above and construct a list of general steps to solving a problem on a complex system:

  1. Think about the problem. What caused it? What are the solutions? Which solution has the least impact on existing operations?
  2. Do a small-scale test of the proposed solution. This test should not affect any person or process in production. Adjust the solution and repeat this step until no side-effects are found.
  3. Test the proposed solution on a subset of the production system for an extended period. Have a backup plan in case something goes wrong. Adjust the solution and repeat this step until no side-effects are found.
  4. Take the leap: deploy your change on a wider scale on the production system. This should probably be incremental, since earlier testing may still have missed some side-effects.
  5. After the solution has been fully deployed, continue to monitor the situation for unusual activity.
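
Steps 3 through 5 can be sketched in code. What follows is only a minimal illustration in Python, not the tooling we actually used: the function `staged_rollout`, its callbacks (`apply_change`, `undo_change`, `is_healthy`), and the stage sizes are all hypothetical names and values chosen for the example. The idea is to change a small slice of the system, check it, and widen the slice only if nothing looks wrong, undoing everything otherwise.

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    targets: Sequence[str],
    apply_change: Callable[[str], None],
    undo_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Iterable[float] = (0.01, 0.10, 0.50, 1.0),
) -> List[str]:
    """Apply a change in growing stages; undo everything if a check fails.

    Returns the targets that were changed (an empty list after a rollback).
    All callbacks are supplied by the caller; they are assumptions of this
    sketch, not part of any real mail-system API.
    """
    changed: List[str] = []
    for fraction in stage_fractions:
        # How many targets should be changed by the end of this stage.
        goal = max(1, int(len(targets) * fraction))
        for target in targets[len(changed):goal]:
            apply_change(target)
            changed.append(target)
        # Stand-in for the observation window (hours or a day in practice).
        if not all(is_healthy(t) for t in changed):
            # The backup plan: undo in reverse order, then stop.
            for target in reversed(changed):
                undo_change(target)
            return []
    return changed
```

In a real rollout the health check would be a person watching reports and logs for a day rather than a function call, and the undo step might also have to replay anything that was lost in the meantime, such as undelivered mail.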