One of the first questions we asked when Visual Studio Online (which was still tfspreview.com back then) was released was: which performance counters should we be watching? How can we tell that something is wrong?
Now, almost four years later, we know a lot more about the things that can go wrong and affect the service, but the list is still growing. For now, I want to share some of the things we learned to watch for through experience. For many of these I am using the default thresholds recommended here for web-role-specific counters. I also suggest reading ASP.NET Performance, Troubleshooting, and Debugging.
Process and machine health
The alert thresholds for all of these performance counters are very application specific. You will be able to tell what is OK and what is not by watching patterns over normal and peak hours.
Again, these are application specific. You want as little contention as possible, ideally zero, but that's not always feasible.
We learned this one the hard way. When things go wrong with your service, disk usage isn't the first thing that comes to mind, particularly if you haven't seen an increase in activity or disk usage. Remember that in the cloud you do not own the machines, so you can get different hardware every time, with a new set of challenges.
If you have lots of dependencies, you will probably want to add custom counters for traffic to each of them.
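As a rough illustration of turning observed patterns into a threshold, here is a minimal Python sketch. It assumes you have already collected samples of a counter over normal and peak hours. The function names, the sample values, and the "mean plus three standard deviations" rule are all assumptions made for this example, not the thresholds we use on the service.

```python
import statistics


def suggest_threshold(samples, k=3.0):
    """Suggest an alert threshold: mean plus k sample standard deviations.

    `samples` is a hypothetical list of baseline readings for one counter.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    return statistics.mean(samples) + k * statistics.stdev(samples)


def is_alert(value, threshold):
    """Return True when a new reading exceeds the suggested threshold."""
    return value > threshold


# Made-up CPU readings (percent) gathered over normal and peak hours.
baseline_cpu = [35.0, 40.0, 45.0, 50.0, 55.0]
threshold = suggest_threshold(baseline_cpu)
```

A starting point like this still needs tuning by hand: a counter with spiky but healthy behavior may call for a larger `k` or a percentile-based rule instead.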
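To make the idea of per-dependency counters concrete, here is a small Python sketch that tracks calls, failures, and average latency per dependency in memory. The class, its methods, and the dependency names are hypothetical; a real service would publish these values through its platform's counter or metrics mechanism rather than keep them in a dictionary.

```python
from collections import defaultdict


class DependencyCounters:
    """Hypothetical in-memory traffic counters, keyed by dependency name."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.failures = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def record(self, dependency, seconds, ok=True):
        """Record one call to a dependency and how long it took."""
        self.calls[dependency] += 1
        self.total_seconds[dependency] += seconds
        if not ok:
            self.failures[dependency] += 1

    def average_seconds(self, dependency):
        """Average call duration, or 0.0 if the dependency was never called."""
        calls = self.calls[dependency]
        return self.total_seconds[dependency] / calls if calls else 0.0


# Made-up traffic to two illustrative dependencies.
counters = DependencyCounters()
counters.record("sql", 0.020)
counters.record("sql", 0.040, ok=False)
counters.record("blob-storage", 0.100)
```

Keeping the counters separate per dependency is the point: when latency climbs, you can see which downstream call is responsible instead of only seeing the aggregate.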
For obvious reasons, it is never a good sign if you find this falling to zero frequently :)
This is the first thing we look at when we have a performance problem. These counters are very well documented in the MSDN article ASP.NET Performance Monitoring, and When to Alert Administrators.
SQL connections and activity if you are using SQL:
If ReclaimedConnections is anything greater than zero, alert! Someone is leaking connections!
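The rule is simple enough to sketch in a few lines of Python. A reclaimed connection is one that the garbage collector had to clean up because the application never closed or disposed it, which is why any nonzero value points to a leak. The function name and the shape of the `readings` dictionary are assumptions for illustration; how you actually read the counter depends on your monitoring setup.

```python
def leaking_connections(readings, counter="ReclaimedConnections"):
    """Return True when the reclaimed-connections reading is above zero.

    `readings` is a hypothetical mapping of counter name to latest value.
    A missing counter is treated as zero.
    """
    return readings.get(counter, 0) > 0


# Made-up readings: one healthy sample and one showing a leak.
healthy = {"ReclaimedConnections": 0}
leaking = {"ReclaimedConnections": 3}
```

Unlike most of the counters in this post, this one needs no baseline: the threshold is zero regardless of load.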
CLR (you might need more or fewer counters depending on your service)
We have a lot more perf counters that are application specific. I would love to hear your thoughts and experiences about things that broke when you were not watching the right counters!