We had instrumented our code to emit timing markers at various points in our pages, and automation tracked those markers to measure actual end-user download times 24/7 on the production site. The results showed that 25% of the time, download times were in excess of 20 seconds, regardless of whether it was peak usage or a quiet weekend. That made no sense: debugging and investigation turned up no cause for the behavior, and it was really hurting our end-user experience.
How the cause was determined:
Even after a few months we had neither a cause nor a solution. I monitored the live site's end-user experience for some time and noticed that a couple of times a month, the 20-second download times dropped to 5 seconds. This triggered another investigation into what happened, or what changed, on the live site during these 2-hour windows when performance improved. It turned out that during these windows, twice a month, we were updating the production site with newer content and code. Our operations team takes half the servers out of rotation, updates their code and content, then swaps them back in while the other half is taken out and updated. During these windows the long download times disappeared. It stunned us that the site worked better on half the number of servers; that didn't make sense either.
The clue was that with half the servers in rotation, the number of connections between tiers was also cut, which pointed at the full mesh of connections between our front-end (FE) and back-end (BE) servers as the bottleneck. This led us to discuss a better way of arranging and linking our FE and BE servers in production so that the number of connections was reduced. We ended up with a site that had 6 servers in a middle tier between the FE and the BE, which solved the issue with the excessive connections.
-- Vamshidar Rawal, SDET
Do you have a bug whose story you love to tell? Let me know!