Keep it Running

I have not had a chance to post much since I started this blog. My intent was to lay some groundwork for building scale-out, high-performance systems for medium-scale sites. It seems the best-laid plans are always interrupted, and my vision for this blog is no different.

These plans were interrupted by production issues that needed to be addressed in the systems we support. We recently implemented several changes that will ultimately help us, but caused us some short-term grief. I am not going to go into all the issues, but I will try to cover a few of the things we learned in the last two weeks.

The Database is Not Always the Problem

Of all the issues we have had to hunt down in the last few weeks, two-fifths were not database issues, even though they looked that way at first.

The first non-database issue was interesting. At first glance, it appeared that our web services were not responding. This is the exact symptom we see when there is a database issue, so it was natural to look there first. After looking at the database servers, we found that they were working normally. The performance numbers on our VMs looked good too. Ultimately, the cause was a runaway process on the VM host that was starving the VMs of CPU.

The second non-database issue was very interesting (at least to me) to find. The first indication of a problem came during a SQL trace of a production machine: the same query with the same parameters was being executed about every 200 milliseconds. This was strange, since we were supposed to be caching these kinds of requests in our front end servers. Investigating this problem actually turned up two issues: our caching was misconfigured on our front end servers, and our VMs did not have enough memory to effectively cache the amount of data we wanted to cache.
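The kind of pattern we spotted by eye in the trace can also be looked for mechanically. The following is only a minimal sketch, assuming the trace rows have already been flattened into (timestamp in milliseconds, query text, parameters) tuples. That tuple layout, the function name, the thresholds and the sample queries are all made up for illustration and are not part of any SQL Server tool.

```python
from collections import defaultdict


def find_repeated_queries(events, min_repeats=5, max_gap_ms=1000):
    """Return {(query, params): count} for statements that recur rapidly.

    events: iterable of (timestamp_ms, query_text, params) tuples, a
    hypothetical flattened form of trace rows; params must be hashable.
    """
    seen = defaultdict(list)
    for timestamp_ms, query, params in events:
        seen[(query, params)].append(timestamp_ms)

    repeated = {}
    for key, stamps in seen.items():
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        # Flag the statement only if it repeats often and every gap is short.
        if len(stamps) >= min_repeats and gaps and max(gaps) <= max_gap_ms:
            repeated[key] = len(stamps)
    return repeated


# Made-up trace: one lookup repeats every 200 ms, another runs only twice.
trace = [(t, "SELECT name FROM Products WHERE id = @id", (42,))
         for t in range(0, 2000, 200)]
trace += [(100, "SELECT COUNT(*) FROM Orders", ()),
          (1900, "SELECT COUNT(*) FROM Orders", ())]

suspects = find_repeated_queries(trace)
for (query, params), count in suspects.items():
    print(f"{count} executions: {query} {params}")
```

A statement that shows up here with identical parameters is a good candidate for "this should have been served from cache", which is exactly the question that led us to the front end servers.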

The caching mechanism itself was working exactly as it was supposed to. I can't go into all the details of how we configure our machines, but the problem was that one of our applications on the VM was consuming most of the available cache on the machine. That application was tagging the data it cached as high priority. All of the other data was being cached and then immediately flushed from the cache, giving the impression that it was never cached at all. To find this problem, we had to run a SQL trace at the same time we were debugging a production server. This was one of those problems where you scratch your head a little bit, identify what could be causing the problem, and then check for each condition in turn.
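To make the effect concrete, here is a toy model of a bounded cache that evicts the lowest-priority, least recently used entry first. It is a sketch under stated assumptions, not the cache we actually run: the PriorityCache class, the capacity of 10 and the key names are all hypothetical, and a real cache has more nuanced eviction rules.

```python
from collections import OrderedDict

HIGH, NORMAL = 2, 1


class PriorityCache:
    """Toy bounded cache: evicts the lowest-priority, least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> (priority, value), oldest first

    def put(self, key, value, priority=NORMAL):
        self.entries.pop(key, None)
        self.entries[key] = (priority, value)
        while len(self.entries) > self.capacity:
            self._evict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key][1]

    def _evict(self):
        lowest = min(priority for priority, _ in self.entries.values())
        for key, (priority, _) in self.entries.items():
            if priority == lowest:  # first match is the least recently used
                del self.entries[key]
                return


cache = PriorityCache(capacity=10)

# One application fills most of the cache with high-priority entries.
for i in range(9):
    cache.put(f"app-a:{i}", "blob", priority=HIGH)

# Every other application has to share the single remaining slot.
misses = 0
for _ in range(3):
    for key in ("user:1", "user:2", "user:3"):
        if cache.get(key) is None:
            misses += 1  # a miss means another trip to the database
            cache.put(key, "row")

print(f"normal-priority misses: {misses} of 9 lookups")
```

In this model every normal-priority lookup misses, because each new entry pushes out the previous one while the high-priority entries never move. From the database side that looks just like having no cache at all.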

Sometimes the Database is the Problem

One of the changes we recently made was the migration from SQL Server 2005 to SQL Server 2008. If you haven't used SQL 2008, it rocks. However, any time you change engines like this, there are bound to be little problems that pop up. For us, there were two.

Both issues manifested themselves as performance problems that looked exactly the same. On SQL 2005, we had already learned that for our application, we needed to be careful with the max degree of parallelism (MAXDOP) setting. In fact, we had set it to 1 for our production environment. Parallelism is a great feature, but in some conditions it can actually hurt performance, and our system is definitely one of them. Unfortunately, during the migration to SQL 2008, the default setting was not changed to what it should have been. As you can imagine, this resulted in a performance problem that was ultimately fixed by just modifying the setting on SQL 2008.
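A simple post-migration check would have caught this one. The following is a minimal sketch, assuming the settings of the new server have already been read into a dictionary (for example from the sys.configurations catalog view). The EXPECTED_SETTINGS dictionary and the find_config_drift helper are hypothetical names for illustration, not part of our tooling.

```python
# Settings the production servers are expected to carry after a migration.
# The value 1 comes from the post; the dictionary layout is an assumption.
EXPECTED_SETTINGS = {"max degree of parallelism": 1}


def find_config_drift(expected, actual):
    """Return {name: (expected_value, actual_value)} for mismatched settings."""
    drift = {}
    for name, wanted in expected.items():
        found = actual.get(name)  # None if the setting was not reported
        if found != wanted:
            drift[name] = (wanted, found)
    return drift


# Values as they might be read from a fresh install, where max degree of
# parallelism defaults to 0 (use all available processors).
new_server = {"max degree of parallelism": 0}

drift = find_config_drift(EXPECTED_SETTINGS, new_server)
for name, (wanted, found) in drift.items():
    print(f"{name}: expected {wanted}, found {found}")
```

On the server itself, the setting is changed with sp_configure 'max degree of parallelism' followed by RECONFIGURE. It is an advanced option, so 'show advanced options' has to be enabled first.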

The second problem is more a result of changes in 2008 that affect us in a negative way. Our system has complex queries that include XML (XQuery) components. When these queries passed through the query optimizer, enormous query plans were generated. The optimizer doesn't keep statistics on all XML, so when it tried to optimize these complex queries, it did a poor job on the parts that involved XML. In fact, when we compared the query plans over time, we would get different plans for the same query. In all cases, we found that we were spending far more time in the XQuery parts of the query plan than we had in the past. This problem had been present in 2005, but for a number of reasons, we didn't really experience it until we were on SQL 2008. The ultimate fix was to refactor the SQL code into simpler logic.

 

Posted: Wednesday, October 15, 2008 6:21 AM by VRohr
