.NET StockTrader was always designed as a high-performance application (it's been benchmarked many, many times with published results). When performance tuning and benchmarking previous on-premise versions, I always focused on achieving top-end throughput for the implementations, as measured by peak web pages served per second to requesting clients. Interestingly, with the Azure version, while this is still important, it is also important to think about latency, since the clients and the app itself are not running on the same high-speed internal network--the Internet is involved. In this blog post, I am going to give an overview of some of the key things to look at when tuning a .NET server app for throughput and making it more efficient, which of course can also impact latency in a positive way. However, one key point is that there are ways to dramatically improve latency that may trade off some top-end throughput. Basically, making several simultaneous asynchronous requests based on a single user request. For example, in StockTrader there are two pages that each make 3 independent web service requests to WCF before returning the page to the user. Each of these WCF operations makes at least one database call to SQL Server/SQL Azure. It seems natural to simply launch 2 additional threads so that all three calls can happen in parallel, since they are not dependent on each other. Then, simply block the main ASP.NET processing thread in PreRender until the two other threads return. When they do, build the page in PreRender (not Page_Load). In this way the main ASP.NET thread is building any user controls, and potentially making its own WCF call (the third call, in the case of the two StockTrader pages async was added to), at the same time the other two threads are making their WCF calls and waiting for responses from the remote service.
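
To make that concrete, here is a minimal sketch of the pattern, assuming a hypothetical TradeServiceClient proxy and Quote type (as generated by a service reference) and two hypothetical grid controls--illustrative names only, not the actual StockTrader code:

    using System;
    using System.Threading;
    using System.Web.UI;

    // Sketch only: TradeServiceClient, Quote, GetQuote, and the two grid controls
    // are hypothetical names standing in for the real service proxy and page markup.
    public partial class QuotesPage : Page
    {
        private Quote _quoteA, _quoteB;
        private readonly ManualResetEvent[] _done =
            { new ManualResetEvent(false), new ManualResetEvent(false) };

        protected void Page_Load(object sender, EventArgs e)
        {
            // Fire the two independent WCF calls on worker threads...
            ThreadPool.QueueUserWorkItem(delegate
            {
                using (var svc = new TradeServiceClient()) _quoteA = svc.GetQuote("AAA");
                _done[0].Set();
            });
            ThreadPool.QueueUserWorkItem(delegate
            {
                using (var svc = new TradeServiceClient()) _quoteB = svc.GetQuote("BBB");
                _done[1].Set();
            });
            // ...while the main ASP.NET thread keeps going: it builds the user
            // controls and makes the third WCF call itself, in parallel with the two above.
        }

        protected override void OnPreRender(EventArgs e)
        {
            base.OnPreRender(e);
            WaitHandle.WaitAll(_done);                  // block until both background calls return
            quoteGridA.DataSource = new[] { _quoteA };  // now build the page from the results
            quoteGridB.DataSource = new[] { _quoteB };
            DataBind();
        }
    }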

However, this technique will not improve top-end throughput, and should actually have a slight negative impact on it, since one user request now has the overhead (albeit slight) of launching/managing multiple threads to get the same amount of work done. But the benefit is potentially reducing response times for these key pages by 66%--a huge number. If each of the three WCF calls takes 1 second, doing them one at a time would take 3 seconds. Doing them all at once should take ~1 second--a big improvement for end users. More on this in the next post. But, *before* thinking about raw page response times and making such async calls, you need to first go through the exercise of optimizing throughput, because this metric is still the one that gets you to an efficient system that will best scale, utilize the available hardware the app sits on, and service the most concurrent users on that hardware before queuing requests. So with that said, let's focus on bottlenecks that keep throughput low. If an ASP.NET app has a bottleneck external to the pages themselves (same for WCF services), it will be evident because you cannot get to ~100% CPU utilization no matter how many load test clients you throw at it.

In this scenario, load is driven by some test tool, and the metrics in this tool are set to capture pages/sec or WCF requests/sec. If you look at perfmon and see the host server never gets beyond, say, 50% CPU utilization, then you have an external bottleneck (you actually likely have several, but you cannot uncover the second until you get rid of the first). The first bottleneck, for a typical web app or WCF service, could be one of many things:

a) Available database connections from the ASP.NET app. Just forget to close one ADO.NET connection in your code and drive some load--you'll see this real quick. An app capable of doing, say, 6,000 page requests per second might top out at 50 pages/sec. ADO.NET pools connections automatically, based on the user authentication and database connection string properties--basically the connection string just has to match, since a lookup is done in the pool to see if an open connection to the database that precisely matches the connection string is available. If so, this already-opened connection is returned immediately to the app, with basically no delay and no need to make an expensive request to open a new connection. If you run out of connections in the pool, say by forgetting to close a connection in your data access logic (which is called from the ASP.NET Web Form), performance will just tank, by orders of magnitude, because your app, after hitting the site a few times with a browser, will be leaving connections open--meaning not released back to the ADO.NET connection pool, and hence not available for the next page request. After hitting say 20 or so pages (you can adjust min and max pooled connections in the connection string), all of a sudden you will see 30-60 second pauses on each request, or worse, timeout errors from ASP.NET via the SqlClient. Some might think the answer is to just increase the max pool connections in the connection string--make the pool bigger. Of course this only delays the issue for another few requests; it does nothing to fix it--and more open connections to the database is more overhead for the database. So you have to make sure you close off your connections after using them, close your ADO.NET data readers, then return from your DB logic--at which point ASP.NET will go off on its merry way processing the page and returning it to the user, but the connection is already available for another user request before ASP.NET has even finished processing the first page. That's good design. StockTrader on-premise can serve up to ~2,500 concurrent requests on a single quad-core with the max connection pool size at no more than 30 pooled connections. One special note here: ADO.NET maintains a different connection pool for each distinct connection string--because the connection properties are potentially different. If you want or need to have multiple connection pools, even for the same database, you can just make the two strings slightly different (say, reverse the order of userid=xxx;password=xxx to password=xxx;userid=xxx)--even if they are logically the same, ADO.NET will create two independent pools. See MSDN: http://msdn.microsoft.com/en-us/library/8xx3tyca.aspx for a good read.
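
As an illustration of that pattern (the connection string name, table, and columns are made up, not the actual StockTrader schema), a DAL method that wraps the connection, command, and reader in using blocks hands the connection back to the pool the moment the data access is done:

    using System.Collections.Generic;
    using System.Configuration;
    using System.Data.SqlClient;

    // Hypothetical DAL method: connection, command, and reader are all disposed
    // before ASP.NET ever starts rendering the page, so the pooled connection is
    // free for the next request almost immediately.
    public static class OrderData
    {
        public static List<string> GetOpenOrderSymbols(string accountId)
        {
            var symbols = new List<string>();
            string connStr = ConfigurationManager.ConnectionStrings["TradeDB"].ConnectionString;

            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "SELECT Symbol FROM Orders WHERE AccountID = @acct AND Status = 'open'", conn))
            {
                cmd.Parameters.AddWithValue("@acct", accountId);
                conn.Open();   // with pooling, this typically hands back an already-open connection
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        symbols.Add(reader.GetString(0));
                }
            }   // Dispose() here closes the reader and returns the connection to the
                // pool, even if an exception was thrown
            return symbols;
        }
    }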

b) The database. NOTE: for this performance tuning step, you need to be testing against a well-loaded test database. Put lots of records in it, just like the real production database will have. Otherwise, you will never see the issue.
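
One quick way to do that, if you don't have a copy of production data handy, is a small bulk-load utility along these lines (the connection string, table, columns, and row counts are all illustrative):

    using System;
    using System.Data;
    using System.Data.SqlClient;

    // Loads a test table with realistic volume before load testing. SqlBulkCopy does
    // one bulk insert instead of a million single INSERT statements.
    class LoadTestData
    {
        static void Main()
        {
            string connStr = "Data Source=.;Initial Catalog=TradeTest;Integrated Security=True";

            var table = new DataTable();
            table.Columns.Add("AccountID", typeof(string));
            table.Columns.Add("Symbol", typeof(string));
            table.Columns.Add("Quantity", typeof(int));

            var rand = new Random();
            for (int i = 0; i < 1000000; i++)
                table.Rows.Add("acct-" + (i % 50000), "SYM" + rand.Next(2000), rand.Next(1, 500));

            using (var bulk = new SqlBulkCopy(connStr) { DestinationTableName = "Orders", BatchSize = 10000 })
                bulk.WriteToServer(table);   // bulk insert the generated rows
        }
    }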

Even with proper data access logic, eventually a Web app (or middle-tier service) will push the backing database to its threshold, and the database will be the bottleneck. It's unlikely to happen for a well-designed app with one server running the Web app (and a reasonably-sized SQL Server DB), but try a web farm with 4 servers all at peak throughput attached to a single central SQL Server box. However, if you are running just a single web/ASP.NET/WCF machine against the single SQL Server, and it's SQL Server that is running at ~90% CPU while the app server never gets close to 100% CPU utilization, then you have an issue you need to fix--and it may be quite easy to fix. You can scale SQL Server up (make the box bigger with more cores, more memory, and faster storage), of course. It is incredibly scalable. (Hint: for really good SQL performance, use two storage controllers, each attached to its own separate RAID array. Put the transaction log on one array when creating your database, and the data file on the other.) You can also shard your database into many, each served by a separate DB server. But for most performance issues I see when SQL Server is involved, it's a simple matter of a missing index on a large table. You always need to look at your SQL WHERE clauses and make sure there is an index that makes sense, or else SQL Server (or Oracle) will be doing a table scan on every request, and if there are a hundred thousand or a million records or more in that table, look out! You are eating huge amounts of SQL Server CPU and disk I/O all at once. Again, an app capable of serving 6,000 pages a second might top out at 50 pages a second just because of one simple missing index--something that takes 10 seconds to fix. You do not want too many indexes, because they make inserts more expensive, but most tables are read-heavy anyway. So I always start with indexes matched against my WHERE clauses, then use the SQL Server tuning wizard/query analysis tool (this is very cool in SQL Management Studio), which analyzes my app's queries and tells me if they stink or if one or more indexes can be added to improve performance. And it will even add the indexes automatically in the wizard.
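
As a purely hypothetical illustration (the table, columns, and index name are made up, not the real StockTrader schema): if the DAL runs the first statement below on every page hit, the second is the kind of one-time index that turns a full table scan into a cheap index seek:

    // Illustrative only: the query the DAL runs on every page request, and the
    // one-time DDL that supports its WHERE clause (run the DDL once against the
    // database, e.g. from SQL Management Studio or a deployment script).
    static class HoldingsQueries
    {
        public const string HoldingsByAccount =
            "SELECT Symbol, Quantity, PurchasePrice FROM Holdings WHERE AccountID = @acct";

        // The INCLUDE columns let SQL Server answer the query entirely from the index.
        public const string SupportingIndex =
            "CREATE NONCLUSTERED INDEX IX_Holdings_AccountID " +
            "ON dbo.Holdings (AccountID) INCLUDE (Symbol, Quantity, PurchasePrice)";
    }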

In past performance presentations I have done the following demo: launch my load test tool against my ASP.NET server, and show perfmon with CPU utilization for both the web app and the database. At first, we see the app server running at 25% CPU utilization while SQL Server (the same is true of Oracle) has hit 100% CPU utilization. Of course, the web tier will never get beyond 25%, and never serve more pages per second, because the DB is the bottleneck. At this point the app is serving 100 concurrent users, with a one-second think time and 100 pages per second returned--that's its max. I then stop the test, take my SQL query, and run it through the SQL Server tuning wizard. It tells me I am missing an index. I add the index. I start the test again. At 100 concurrent test users, I now see the app server again at 25% CPU load, but SQL Server is running at not even 2% CPU! It's that big a deal. I can now jack my concurrent users up to 400 before the web tier is (a good thing) at 100% CPU load. What a difference! I am now servicing 400 concurrent users, 4 times as many, without any response delay. But better yet--the database is still at only 8-10% CPU load. Wow. I can now add several more scale-out servers on my web tier, and get even more concurrent users with no response delay!

But even with the most optimized queries, a database server can only do so much--it lives in a constrained environment in terms of CPU, network, memory, and especially disk I/O. So caching result sets from ADO.NET becomes super important--no large web site runs without some sort of caching. In StockTrader 5, just the market summary is cached, using simple ASP.NET output caching--it's cached for 30 seconds (might be 60, I can't remember). This is a super expensive query, and we do not care if the market summary is up to 30 seconds stale for users. So, imagine the StockTrader app with 1,000 concurrent users. If each is hitting the home page once every 10 seconds, we are doing the super expensive query 100 times a second on SQL Server. SQL Server will be maxed out just doing this one query. If we cache this output for 30 seconds, we are then doing the query just 2 times per minute (assuming one web server)! Every user gets the same results from the cache in this case--since it's a market summary, the data is the same for every user anyway. Otherwise we might key the data when we cache it by user id, or maybe use ASP.NET Session State (but now you need more memory for the cache of course--since every user is caching his/her own objects, and we need to centralize the state somewhere if we are running in a scale-out scenario like Azure instances or load-balanced Windows Servers or Hyper-V VMs--more on this in a later post).
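
Declarative output caching in Web Forms is just a page directive (<%@ OutputCache Duration="30" VaryByParam="none" %>); if you instead need to cache a result set in code and control the key yourself, a simple cache-aside sketch looks like this (the model type and query method are hypothetical):

    using System;
    using System.Web;
    using System.Web.Caching;

    // Cache-aside sketch for an expensive, user-independent result like the market
    // summary: the first request runs the query, everyone else reads from the
    // in-process cache for the next 30 seconds. MarketSummaryModel and
    // GetMarketSummaryFromDatabase are hypothetical names.
    public static class MarketSummaryCache
    {
        private const string CacheKey = "MarketSummary";

        public static MarketSummaryModel Get()
        {
            var cached = (MarketSummaryModel)HttpRuntime.Cache[CacheKey];
            if (cached != null)
                return cached;                           // served from memory--no database hit

            var fresh = GetMarketSummaryFromDatabase();  // the expensive SQL query
            HttpRuntime.Cache.Insert(
                CacheKey, fresh, null,
                DateTime.UtcNow.AddSeconds(30),          // absolute expiration: 30-second staleness cap
                Cache.NoSlidingExpiration);
            return fresh;
        }

        private static MarketSummaryModel GetMarketSummaryFromDatabase()
        {
            // ...ADO.NET query omitted for brevity...
            return new MarketSummaryModel();
        }
    }

    public class MarketSummaryModel { /* gainers, losers, volumes, etc. */ }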

c) Network. At some point, you might be serving a ton of pages per second; multiply that by the average bytes per page, and you are using at least that much bandwidth on your NIC(s) and network infrastructure.

So these are some of the most common bottlenecks in a typical data-driven Web app like StockTrader 5. The art of benchmarking is, by definition, the art of performance tuning. The only way to figure out if there is a bottleneck that can be alleviated is to either deploy the app and wait for users to yell at you, or use some tool to drive load at the app, see the performance, and fix the bottleneck. Once you fix the first bottleneck, you test again, because your throughput will now peak out higher (meaning you can serve many more concurrent users); but you will be at the second bottleneck--so you figure out what that is (ASP.NET trace output is great, and of course perfmon), fix it, then test again. If you just spend 7 days doing this, I guarantee the app will likely perform an order of magnitude better than if you did not spend this time.

A well-tuned ASP.NET Web app should come close to MAXING OUT CPU UTILIZATION on the web server box once you drive enough load at it via the load test tool. If it does not, it's got a bottleneck--see a, b, c above--and you should figure out what it is, and test again. Today with ASP.NET, for most apps there is no need to jack with tuning settings, thread pools, and the like. It is basically tuned (along with IIS) out of the box to scale and handle tons of traffic, from small boxes to large boxes, just depending on their power. For some apps this will not be true, but for the vast, vast majority of apps, if you cannot max out CPU utilization with your ASP.NET web app while a load test tool drives more and more client requests--you have an app-level bottleneck you need to investigate; otherwise you will never be able to fully utilize the CPU power of the servers serving up your ASP.NET Web app. The same is true, of course, of JSP apps, PHP apps, etc.

So for base level performance tuning, the two most important starting tools are:

1) A load test tool that can drive load against your system. You do need adequate client hardware to drive load with the tool, but even old XP client boxes can be used with some load test tools. A single benchmark agent might be able to drive hundreds of simulated users depending on the hardware it's on. Also note that Visual Studio 2010 Ultimate and MSDN Ultimate subscribers can now get an unlimited number of load test agents for free, using the VS 2010 load test tool. See: http://blogs.msdn.com/b/vstsqualitytools/archive/2011/03/08/announcement-unlimited-load-testing-for-visual-studio-2010-ultimate-with-msdn-subscribers-now.aspx

2) Perfmon, with just the CPU % utilization metric, looking at both the ASP.NET web server and the database. For a single app server, you should max it out long before SQL Server is even hitting 50% CPU utilization, given a reasonably-sized DB with reasonable disk I/O and well-tuned queries. StockTrader 5 does about 2,500 ASP.NET page requests per second in my on-premise setup when the SQL Server is a nice HP dual quad-core and the ASP.NET IIS web server is a single HP quad-core blade system. But for each page in StockTrader 5, there are on average 3 SQL queries executed in the DAL layer. So SQL Server is actually doing close to 7,500 SQL statements a second on my HP dual quad-core database at that point. That's a lot! (Most are reads at this point.) And, for the StockTrader 5 workload, SQL Server at this point is only at about 30% CPU utilization with no disk queuing (I/O issues against disks), since I have enough memory in my database server that selects are typically coming from SQL Server's own cache.

Three final notes for Part 1. 

1. What happens if you throw a lot of concurrent users at your system under test (call it the SUT if you want to sound like an expert), but neither the ASP.NET web server nor the SQL Server database gets near 100% CPU utilization no matter how many users you add--you peak out at, say, 500 pages/sec, and then you add more users and the response times get longer but the pages/sec never increase? You may well have a client-side bottleneck (or network bottleneck). If you are trying to run too many simulated users per client test machine, this can happen easily. You need to figure out how many users, for the pages you are testing, is reasonable per test client machine. Maybe it's 100, maybe it's 50 or 300. You will need more client boxes (or beefier boxes) for the test clients if they cannot saturate your system.

2. When testing to improve just your processing logic, for the items listed above, don't have your test agents download images. Image downloads (all those GIFs, JPGs, etc.) put a heavier strain on your IIS server, the network, and your test clients. You are masking what you are trying to determine--the efficiency of your code and the DB access/query performance. Either turn off image download on the agents, or turn on the agent's simulated browser cache, if it has one. You can worry about images later. Also, browsers have this wonderful thing called a browser cache anyway, if the same set of users is consistently using your site.

3. Watch out for testing on a network hooked up to your corpnet domain. You may well be sending test agents off to your corporate proxy server on every request, which then routes the request/response to your server. It's a wonderful test of your IT department's corporate proxy performance, but your results will have little to do with your app's actual performance capabilities. When testing WCF services, I have done this accidentally even in my private lab (which has its own proxy to the Internet). Many default WCF bindings specify "use default proxy server". This means a client request to a named server is likely going to this proxy first. You can turn off IE's proxy (the default proxy is whatever IE is showing in its connection config tab) on the test client machines, or you can drive load against an IP address rather than the DNS name of the server.
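
If you would rather fix this in the test client code than in IE's settings, the WCF binding exposes the switch directly--a quick sketch (the service contract and IP address here are illustrative, not the actual StockTrader ones):

    using System.ServiceModel;

    // Hypothetical contract standing in for the real service interface.
    [ServiceContract]
    public interface ITradeService
    {
        [OperationContract]
        string Ping();
    }

    public static class TestClientFactory
    {
        public static ITradeService Create()
        {
            var binding = new BasicHttpBinding
            {
                UseDefaultWebProxy = false   // don't send load test traffic through the system/IE proxy
            };
            // IP-based address avoids the DNS-name routing through the proxy described above.
            var address = new EndpointAddress("http://10.0.0.25/trade/TradeService.svc");
            return new ChannelFactory<ITradeService>(binding, address).CreateChannel();
        }
    }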

In subsequent posts I will drill into more perf tuning topics at the app level--including for on-premise and Azure/SQL Azure apps.

-Greg