I recently wrote about some issues that we had to diagnose in our production environment. I wanted to take a moment to document an interesting problem that had significant performance impact, had interesting symptoms, and was a rather easy fix once it was identified.
"I've just picked up a fault in the AE35 unit. It's going to go 100% failure in 72 hours." *
Our first clue that we had a problem was that the system was running 'slower' than normal. During normal SQL profiling, we discovered that there were some queries being run about every 100 milliseconds which were requesting the same data each time. In a few minutes, we would see around 7,500 requests for the same data.
"I'm sorry Dave, I'm afraid I can't do that." *
At this point, the proverbial alarm bells were being sounded. A concern was raised that maybe caching wasn't working in our code. Our test team ran testing against the old code and the new code. In both cases, the data indicated that caching was working. We decided to debug the cache code in production and found that the system was caching the results of that query as expected, and immediately dumping the data from cache, giving the appearance of a non-functional cache.
"I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do." *
Now that we knew why it looked like caching wasn't working, we needed to find the cause. Our first step was to look at the performance metrics that the OS provided. Most specifically, the amount of memory that was available.
The above graph shows four different machines under a production level of load. Notice that the available memory is practically a flat line. There was more memory available on the systems, but it wasn't being used. Now what could cause such a problem? Our suspicion was that there was a configuration limit we were hitting that was artifically keeping the cache from growing. To test this out, we create some code that we could use to force new data to be loaded into the cache. The idea was that if there was memory available and caching was working, then asking for new data to be cached should displace some of the available memory. When we ran the test, the line continued to stay flat. That was an 'aha' moment. We were now about 95% certain that the issue was related to the configuration of the cache.
After a little bit of research, we came across this information: 'cache Element for caching (ASP.NET Settings Schema)'
After digging around and talking with some people, we were able to get some detailed information about how caching is configured:
“If you set <cache privateBytesLimit> and/or a private bytes memory limit for the worker process (in IIS Manager), we use what you set, or the minimum if you set both. If you don’t set either, on an x64 box we use MIN(60%*physical memory, 1 Terabyte). The private byte limit for the cache is independent of percentagePhysicalMemoryUsedLimit. We have two limits, one for private bytes used by the process, and one for % physical memory used on the machine. They’re independent of each other.”
"It can only be attributable to human error." *
After configuring a system with the cache settings, we ran the same test as before. This time, the available memory was affected.
This graph shows the memory fluctuating across multiple machines when our test is run. Exactly what we expected.
Running a SQL trace on our systems with the same level of load on our IIS servers now shows that those pesky repeat queries are practically non-existant. Now we see the query about once every 15 minutes instead of once every 100 milliseconds. Remember that the symptom was our SQL queries were running slower than expected but the problem was not a SQL problem. It was a cache configuration problem which resulted in a huge number of queries being run that shouldn't have.
* Quotes from '2001: A Space Odyssey', 1968