I went away on holiday with the servers running very well. Our average CPU was hovering about 4% and the peak was close to 10%. When I came back, the average was at about 40% and the peak was in the neighborhood of 90%. I was not out of the office that long and this kind of a change was something I did not expect so I began to ask questions about what was done while I was out.
If you look back at some of my previous posts, you will see that we use a scaled out architecture that utilizes read and write servers. Our read servers were running 'hot'. The load on our write servers was higher then expected, but not yet in the 'hot' range in my opinion. Replication between servers was also running slower. As a system, something was not quite right. We asked our operations people to help out and run a trace on the SQL boxes to see if there was something causing this problem. They immediately came back and said there was. They recommended a non-clustered index on one of our critical tables which already had a clustered index on the same table and here is why:
The table being hit was being queried against an index, which was clustered. If you're not familiar with the workings of a DB, this means that the index lives in the database on the same physical page as the data the index references. This is pretty common and in most cases, a very efficient way to create an index. It is so common that the default creation of a primary key in SQL is to make it a clustered index.
By contrast, a non-clustered index is built almost like a separate table in the server. It only contains the minimal amount of data needed to find the data you are looking for. Since this is not a complete copy of the data, you can represent much more data per page of 'memory' used. This is important in a couple of ways.
1. You can create 'mini views' of your data that are actually stored on disk that the query optimizer will use in some cases without ever looking at the table. This concept is known as 'coverage' and is a great way to improve performance for some types of queries. 2. You can reduce the number of pages of data that need to be traversed while doing an index scan.
1. You can create 'mini views' of your data that are actually stored on disk that the query optimizer will use in some cases without ever looking at the table. This concept is known as 'coverage' and is a great way to improve performance for some types of queries.
2. You can reduce the number of pages of data that need to be traversed while doing an index scan.
It is this last optimization that was important in this case. When we looked at the query plan being generated, the query was using an index scan to locate the data we were interested in. The index it was using was a clustered index. This means that the index scan was physically equivalent to a table scan. Now there is a time and place for table scans and this situation was not one of them.
After adding the non-clustered index to the 'hot' tables, the CPU dropped back down to where it was prior to my holidays. The impact this had on the rest of the servers was the cpu on the write server(s) dropped and the time for replications dropped. This was a small amount of work with a huge impact.
An now the question that remains is why didn't this happen before the holidays? Well that was easy to figure out. A large amount of data was migrated to the new system while I was out of office. This was enough data that it changed how SQL was caching things. This problem was there before, but it was masked by having lots of ram and a smaller amount of data.
If you found this information useful, you can thank a member of our team that asked why this helped. I figured that if one person might ask this kind of question, I am sure others will as well and decided to post it to the blog.