The Groove Relay Server is a store-and-forward messaging system that stores Groove data for offline clients.  It also serves presence information and uses several techniques to improve connection and communication efficiency.  Groove's data traffic patterns depend heavily on usage, so it is particularly important to monitor the Relay Server's health to assess whether it can support its current user base.

As with any server, common performance counters such as memory and CPU utilization are important to watch.  But there are a few Relay-specific counters that I recommend paying attention to as well.  Here is a quick list of the key performance counters I typically watch, in order of diminishing importance:

| Counter to watch | Uh oh… | What it means |
| --- | --- | --- |
| `\Memory\Available Bytes` | < 2 GB (or < 25% of physical memory) | Available memory dropping too low could be a sign that the relay is falling behind in processing some component of its traffic, or could indicate a memory leak in the system. The Relay requires at least 25% of total physical memory free to optimally cache disk transactions, so operation below this threshold should be avoided. |
| `\Processor(_Total)\% Processor Time` | Sustained > 95% for more than 20 minutes | Spikes to 100% are OK, but a sustained CPU % > 95% is a sign the relay is being overdriven. The system might seem unresponsive (slow). |
| `\PhysicalDisk(*)\Avg. Disk Queue Length` (or the similar `\LogicalDisk(*)\Avg. Disk Queue Length`) | Significant number of spikes / periods > 25 on Data drive | This indicates the disk subsystem is falling behind. When this occurs it's always good to verify that all disk subsystem components are performing optimally (e.g. no failed disks, no failed disk cache batteries). |
| `\Groove Relay Store\StoreFFQHelperOutstandingCommands` | > 500 for longer than 15 minutes | Indicates that messages are backing up while being written to the relay. Could indicate a slow disk subsystem or a data overload. |
| `\Paging File(C:\pagefile.sys)\% Usage` | Sustained > 15% | Means the system is overloaded and paging. When paging is high, the system is not running at optimal performance. |
| `\Groove Relay Flow Control Client\FlowControlStopSendingCount` | > 1000 for long periods of time (hours) | Indicates either a poorly performing disk subsystem or a large number of clients that are using an excessive amount of bandwidth. |
| `\LogicalDisk(E:)\Free Megabytes` | < 5% of disk free on Data drive | A disk this low on free space will degrade performance and eventually crash the relay. |
| `\TCPv4\Segments Retransmitted/sec` | > 15-20 | More a sign of a config issue. Usually indicates a bad or poorly configured network connection. |
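
Several of the thresholds above only matter when they are sustained (for example CPU > 95% for more than 20 minutes), which is easy to get wrong when eyeballing a graph. Here is a minimal Python sketch of how that check could be expressed against samples you have already collected. The names `Rule`, `sustained_breach`, and `RULES` are hypothetical and not part of any Groove tooling. Treating "2 GB" as 2 * 1024**3 bytes, and flagging low memory on a single sample, are my assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# One sample: (seconds since the start of the capture, counter value).
Sample = Tuple[float, float]


@dataclass(frozen=True)
class Rule:
    counter: str                     # performance counter path
    is_bad: Callable[[float], bool]  # True when a single value is past the threshold
    sustain_s: float                 # how long the condition must hold; 0 = any single sample


def sustained_breach(samples: Sequence[Sample], rule: Rule) -> bool:
    """Return True if consecutive bad samples span at least rule.sustain_s seconds."""
    run_start = None
    for ts, value in samples:
        if rule.is_bad(value):
            if run_start is None:
                run_start = ts       # first bad sample of a new run
            if ts - run_start >= rule.sustain_s:
                return True
        else:
            run_start = None         # a good sample resets the run
    return False


# Three of the thresholds from the list, as an illustration only.
RULES = [
    Rule(r"\Processor(_Total)\% Processor Time", lambda v: v > 95, 20 * 60),
    Rule(r"\Groove Relay Store\StoreFFQHelperOutstandingCommands", lambda v: v > 500, 15 * 60),
    # Assumption: no duration is given for memory, so any single low sample counts.
    Rule(r"\Memory\Available Bytes", lambda v: v < 2 * 1024**3, 0),
]
```

With one-minute samples, a CPU that sits at 97% for 20 minutes trips the first rule, while repeated short spikes to 100% do not, which matches the "spikes are OK" guidance.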