One of the first questions we asked when Visual Studio Online (still tfspreview.com back then) was released was: which performance counters should we be watching, and how can we tell that something is wrong?

Now, almost 4 years later, we know a lot more about the things that can go wrong and affect the service, but the list is still growing. For now, I want to share some of the things we learned to watch for through experience. For many of these I am using the default thresholds recommended here for web role specific counters.
I also suggest reading ASP.NET Performance, Troubleshooting, and Debugging.

Process and machine health

The alert thresholds for all of these performance counters are very application specific. You will be able to tell what is OK and what is not by watching patterns over normal and peak hours.

  • Process(w3wp)\Private Bytes
  • Process(w3wp)\Virtual Bytes
  • Process(w3wp)\Thread Count
  • Process(w3wp)\Handle Count
  • TCPv4\Connections Established
  • Processor(_Total)\% Processor Time
  • Memory\Available MBytes
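
One way to turn "watch the patterns over normal and peak hours" into a number is to derive a threshold from a baseline of observed samples. The sketch below is only an illustration, not the method used by the service: the function names, the sample values, and the choice of mean plus three standard deviations are all assumptions, and a percentile or a fixed limit may suit your application better.

```python
import statistics

def baseline_threshold(samples, k=3.0):
    """Return mean + k * standard deviation of the baseline samples.

    Hypothetical helper: 'k' controls how far above normal a reading
    must be before it counts as a breach.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return statistics.fmean(samples) + k * statistics.pstdev(samples)

def breaches(samples, threshold):
    """Return (index, value) pairs for readings above the threshold."""
    return [(i, v) for i, v in enumerate(samples) if v > threshold]

# Made-up Process(w3wp)\Private Bytes readings, in MB, covering normal and peak hours.
normal_and_peak = [900, 950, 1000, 1050, 1100]
limit = baseline_threshold(normal_and_peak, k=3.0)

# New readings to check against the baseline.
new_readings = [980, 1020, 1900]
print(breaches(new_readings, limit))  # [(2, 1900)]
```

The baseline should include peak-hour samples, otherwise normal peak load will trip the alert.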

Contentions

Again, these are application specific. You want as little contention as possible, ideally zero, but that’s not always feasible.

  • .NET CLR LocksAndThreads(_Global_)\Current Queue Length
  • .NET CLR LocksAndThreads(_Global_)\Contention Rate / sec

Disk Performance

We learned this one the hard way. When things go wrong with your service, disk usage isn’t the first thing that comes to mind, particularly if you haven’t seen an increase in activity or disk usage. Remember that in the cloud you do not own the machines, so you can get different hardware every time, with a new set of challenges.

  • LogicalDisk(*)\% Free Space
  • PhysicalDisk(*)\Avg. Disk Queue Length
  • PhysicalDisk(*)\Avg. Disk sec/Read
  • PhysicalDisk(*)\Avg. Disk sec/Transfer
  • PhysicalDisk(*)\Avg. Disk sec/Write
  • PhysicalDisk(*)\Disk Reads/sec
  • PhysicalDisk(*)\Disk Transfers/sec
  • PhysicalDisk(*)\Disk Writes/sec
  • LogicalDisk(*)\Avg. Disk sec/Transfer
  • LogicalDisk(*)\Disk Bytes/sec
  • LogicalDisk(*)\Disk Transfers/sec

Network Usage

If you have lots of dependencies, you will probably want to add custom counters for the traffic to each of them.

  • Network Interface(*)\Bytes Received/sec
  • Network Interface(*)\Bytes Sent/sec
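
As a rough illustration of per-dependency traffic counters, the sketch below keeps byte totals in memory, keyed by dependency name. The class name, method names, and dependency names are all hypothetical. It does not register real Windows performance counters. It only shows the bookkeeping you would do at each call site before publishing the values through whatever monitoring mechanism you use.

```python
from collections import defaultdict

class DependencyTraffic:
    """In-memory byte totals per dependency (hypothetical helper)."""

    def __init__(self):
        self._sent = defaultdict(int)
        self._received = defaultdict(int)

    def record(self, dependency, sent=0, received=0):
        """Add the bytes sent to and received from one dependency call."""
        self._sent[dependency] += sent
        self._received[dependency] += received

    def snapshot(self):
        """Return {dependency: (bytes_sent, bytes_received)} sorted by name."""
        names = sorted(set(self._sent) | set(self._received))
        return {n: (self._sent[n], self._received[n]) for n in names}

traffic = DependencyTraffic()
traffic.record("sql", sent=1200, received=5400)
traffic.record("blob-storage", sent=300, received=80_000)
traffic.record("sql", sent=800, received=600)
print(traffic.snapshot())
# {'blob-storage': (300, 80000), 'sql': (2000, 6000)}
```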

For obvious reasons, it is never a good sign if you find this counter dropping back to zero frequently :)

  • System\System Up Time
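
A drop in up time between two consecutive samples means the machine restarted in between. A minimal sketch of that check, assuming you already have a list of up time readings in seconds (the function name and the sample values are made up for illustration):

```python
def count_restarts(uptime_samples):
    """Count how many times the up time went down between consecutive samples."""
    restarts = 0
    for previous, current in zip(uptime_samples, uptime_samples[1:]):
        if current < previous:
            # Up time only grows while the machine stays up, so a smaller
            # value means it was restarted since the previous sample.
            restarts += 1
    return restarts

# Made-up System\System Up Time readings in seconds, sampled once a minute.
samples = [3600, 3660, 3720, 45, 105, 30, 90]
print(count_restarts(samples))  # 2
```

A restart is missed if the machine comes back and passes the previous reading before the next sample, so the sampling interval needs to be short relative to how long the machine normally stays up.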

ASP.NET Counters

These are the first thing we look at when we have a performance problem. They are very well documented in the MSDN article ASP.NET Performance Monitoring, and When to Alert Administrators.

  • ASP.NET Applications(__Total__)\Requests/Sec
  • ASP.NET Applications(__Total__)\Requests Executing
  • ASP.NET Applications(__Total__)\Request Wait Time
  • ASP.NET Applications(__Total__)\Request Execution Time
  • ASP.NET Applications(__Total__)\Requests In Application Queue
  • ASP.NET\Application Restarts
  • ASP.NET\Requests Rejected
  • ASP.NET\Worker Process Restarts

SQL connections and activity if you are using SQL:

If NumberOfReclaimedConnections is anything greater than zero, alert! Someone is leaking connections!

  • .NET Data Provider for SqlServer(*)\NumberOfPooledConnections
  • .NET Data Provider for SqlServer(*)\NumberOfReclaimedConnections
  • .NET Data Provider for SqlServer(*)\NumberOfActiveConnectionPools

CLR (you might need more or fewer counters depending on your service)

  • .NET CLR Exceptions(*)\# of Exceps Thrown / sec
  • .NET CLR Memory(_Global_)\# Bytes in all Heaps
  • .NET CLR Memory(_Global_)\% Time in GC
  • .NET CLR Memory(_Global_)\Allocated Bytes/sec

Web Service

  • Web Service\Current Connections
  • Web Service\ISAPI Extension Requests/sec

 

We have a lot more perf counters that are application specific. I would love to hear your thoughts and experiences about things that broke when you were not watching the right counters!