Problem:

At Release To Web (RTW) for the Office Online site, we saw features breaking randomly for no apparent reason. A lot of debugging gave no indication of what the problem was.
 
Cause:

Testers did not catch the issue while testing features before RTW, and our test environments did not really mimic the production environment. Each feature was tested as a standalone unit and worked perfectly on its own, but the features did not work together as a whole.
 
Issue:

The problem was the hardware (H/W) load balancer in the production environment, which directed all requests, including the cookies/headers that we set per feature, to available servers for processing. The load balancer had an upper limit of 4 KB on cookie size, and we exceeded that limit when all the features were used on the site. When this happened, the load balancer truncated the cookies that had been set first to keep the total below 4 KB. As a result, the features that were used first stopped working because the cookies they required were missing.
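To make the failure mode concrete, here is a minimal sketch of the behaviour described above. It is a simplified model, not the load balancer's actual logic: it assumes whole cookies are dropped in the order they were set, and the feature names and cookie sizes are hypothetical.

```python
LIMIT_BYTES = 4096  # the 4 KB limit described above


def header_size(cookies):
    """Size in bytes of a Cookie header value built from (name, value) pairs."""
    header = "; ".join(f"{name}={value}" for name, value in cookies)
    return len(header.encode("utf-8"))


def drop_oldest_until_fits(cookies, limit=LIMIT_BYTES):
    """Drop cookies from the front (those set first) until the header fits.

    Assumption: truncation is modelled as whole-cookie removal in
    insertion order; the real appliance may behave differently.
    """
    kept = list(cookies)
    while kept and header_size(kept) > limit:
        kept.pop(0)
    return kept


# Hypothetical features, each setting one 1500-byte cookie, in order of use.
cookies = [
    ("feature_a", "x" * 1500),
    ("feature_b", "x" * 1500),
    ("feature_c", "x" * 1500),
]
surviving = drop_oldest_until_fits(cookies)
print([name for name, _ in surviving])  # ['feature_b', 'feature_c']
```

Each feature on its own stays well under the limit, which is why unit-level testing passed. Only the combination pushes the header over 4 KB, and the feature used first is the one that loses its cookie.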
 
Result:

We immediately fixed the issue by re-thinking our cookie story and keeping the total cookie size under the limit.
 
Lessons learnt:

Always make your test environments mimic production in terms of H/W and design.

  1. Test your features as a complete set with the entire product, not just as individual units.
  2. This incident led to the discovery of more potential issues over the next couple of releases, which opened our eyes to how we write and use cookies.
    1. Reduce cookie size to reduce the performance impact and improve the end-user experience.
    2. Discard cookies when they are no longer needed to keep the total down to a reasonable size.
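
One way to act on these lessons is a whole-product check that adds up every cookie the site sets and fails when the total exceeds the budget. The sketch below is illustrative only: the function names, the cookie names, and the sizes are assumptions, and the 4096-byte budget simply mirrors the limit from this story.

```python
COOKIE_BUDGET_BYTES = 4096  # assumed budget, matching the 4 KB limit above


def cookie_bytes(name, value):
    """Bytes one cookie contributes to the header as 'name=value'."""
    return len(f"{name}={value}".encode("utf-8"))


def cookie_header_bytes(cookies):
    """Total bytes of the Cookie header value for a dict of cookies."""
    header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return len(header.encode("utf-8"))


def check_cookie_budget(cookies, budget=COOKIE_BUDGET_BYTES):
    """Return (total_bytes, names_largest_first).

    The list of names is empty when the total fits within the budget;
    otherwise it ranks cookies by size so the biggest are reviewed first.
    """
    total = cookie_header_bytes(cookies)
    if total <= budget:
        return total, []
    ranked = sorted(
        cookies,
        key=lambda name: cookie_bytes(name, cookies[name]),
        reverse=True,
    )
    return total, ranked


# Hypothetical cookies collected with all features enabled at once.
all_features = {
    "prefs": "p" * 2000,
    "session": "s" * 1800,
    "tour_seen": "1" * 600,
}
total, largest = check_cookie_budget(all_features)
print(total, largest)  # 4428 ['prefs', 'session', 'tour_seen']
```

Running a check like this against the full product, rather than per feature, would flag the combined size before a production load balancer does.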

 -- Vamshidar Rawal, SDET

 

Do you have a bug whose story you love to tell? Let me know!