This post is a starting point for collecting basic information to monitor or troubleshoot an AppFabric (AF) cache setup (host/cluster and clients). It first points to resources on the available logging features and then, through a sample scenario, walks through the decisions taken to implement a logging solution. It also answers a frequent customer question: which basic performance counters are recommended for collection?
Here is a quick walkthrough on how to format and generate logs for ease of management. Some of the links within it are pre-release, so you may want to refer to these more up-to-date ones: server log sink settings and client log sink settings. Together they give a fairly good idea of the logging capabilities offered in AF Cache; please review them, as that knowledge will help with the rest of this post. With these concepts in hand, the discussion and planning of the most suitable logging solution for your specific implementation can start.
Assume that memory pressure is a concern on the host side. The default event trace level of ERROR would not be enough, because a more detailed sense of which objects are being cached on the host is needed. This can be achieved by overriding the default host log sink to collect Information-level logs, which enables more detailed log analysis in the case of memory-related errors. Five levels are provided: No Tracing (-1), Error (0), Warning (1), Information (2), and Verbose (3). In this sample, the Information level is used.
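To make the level numbers concrete, here is a minimal Python sketch of the five values listed above. The `is_recorded` helper is hypothetical and rests on an assumption not stated in this post: that a sink records every event whose level is at or below its configured level, as is usual for trace levels.

```python
from enum import IntEnum


class TraceLevel(IntEnum):
    """The five trace levels and their numeric values, as listed in the post."""
    NO_TRACING = -1
    ERROR = 0
    WARNING = 1
    INFORMATION = 2
    VERBOSE = 3


def is_recorded(event_level: TraceLevel, sink_level: TraceLevel) -> bool:
    """Return True if an event would be written by a sink set to sink_level.

    Assumption (not confirmed by the post): a sink records events whose
    level is at or below its configured level, and -1 turns tracing off.
    """
    if sink_level == TraceLevel.NO_TRACING:
        return False
    return TraceLevel.ERROR <= event_level <= sink_level


if __name__ == "__main__":
    # With the default ERROR level, Information events would be dropped;
    # raising the sink to INFORMATION (2) keeps them.
    print(is_recorded(TraceLevel.INFORMATION, TraceLevel.ERROR))
    print(is_recorded(TraceLevel.INFORMATION, TraceLevel.INFORMATION))
```

Under that assumption, moving from Error (0) to Information (2) adds Warning and Information events while still leaving out Verbose ones.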
The next decision is whether the configuration change should be made via code or XML (configuration file). In this sample, the organization decides that its infrastructure personnel can handle the required changes via XML, with no programmers (and no code) required, so the XML solution is the simplest.
Next is the type of logging. As the same infrastructure team will also be analyzing the logs, a file-based log sink is agreed upon (versus console or Event Tracing for Windows, ETW). For the sake of simplicity and to ease the understanding of the sample in this blog, AF Cache logging was chosen; ETW logging is a viable option as well. Since the logs will be written to an existing central shared location on the network, the NETWORK SERVICE account is given rights to the share (in the case of a cluster, each host's NETWORK SERVICE account has to be given write access to this share). NOTE that the AF Cache service ONLY runs under the NETWORK SERVICE account of the server (which is the account assigned to the computer by the domain); the service cannot run as any other account, such as an independent (not server/local machine account) network user or group.
In the case of an outage the logs could be overwritten. To alleviate this, it is agreed that the process-specific character ($) will be used within the log name. Also, the log generation interval is set to every hour (dd-hh).
Similarly, since memory pressure on the web servers (AF cache clients) is also a concern, the client log sink needs similar changes. The final customType element for client and host within the fabric object will then look similar to the following:
<!-- For the client machines the log name is modified: sinkParam="\\CentralLogs\\AFCache\\Client1-$/dd-hh" -->
<customType
className="System.Data.Fabric.Common.EventLogger,FabricCommon"
sinkName="System.Data.Fabric.Common.FileEventSink,FabricCommon"
sinkParam="\\CentralLogs\\AFCache\\Server1-$/dd-hh"
defaultLevel="2"
/>
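Because the host and client entries differ only in the log name, the element can be generated instead of hand-edited per machine. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper names `build_file_sink` and `render` are hypothetical, and the attribute values are copied from the sample above. Building the element programmatically also guarantees quoted attributes and well-formed XML.

```python
import xml.etree.ElementTree as ET

# Attribute values copied from the sample in this post.
CLASS_NAME = "System.Data.Fabric.Common.EventLogger,FabricCommon"
FILE_SINK = "System.Data.Fabric.Common.FileEventSink,FabricCommon"


def build_file_sink(sink_param: str, default_level: int = 2) -> ET.Element:
    """Build a customType element for a file-based log sink (hypothetical helper)."""
    return ET.Element(
        "customType",
        {
            "className": CLASS_NAME,
            "sinkName": FILE_SINK,
            "sinkParam": sink_param,
            "defaultLevel": str(default_level),
        },
    )


def render(element: ET.Element) -> str:
    """Serialize the element to a string with properly quoted attributes."""
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    # Raw strings keep the path exactly as written in the post.
    host = build_file_sink(r"\\CentralLogs\\AFCache\\Server1-$/dd-hh")
    client = build_file_sink(r"\\CentralLogs\\AFCache\\Client1-$/dd-hh")
    print(render(host))
    print(render(client))
```

The output still has to be placed by hand inside the existing logging section of the host or client configuration file; this sketch does not touch the real configuration.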
Logs are a good way to collect application-specific or, as in the case above, scenario-specific information that allows for ad-hoc or error-driven analysis. Similarly, collecting performance counters can give a window into the internal operations of not only the particular application (AF cache) but also the overall system.
As a complement to the above, here is a list of recommended performance counters to aid in monitoring (for later analysis) and troubleshooting AF cache issues. The table contains the names for the performance monitor counters and the important instances to monitor. A small comment follows each counter grouping.
| Counter Name | Running Instance | Counter Instance | Comments |
|---|---|---|---|
| AppFabric Caching:Host | N/A | All (*): Cache Miss Percentage, Total Client Requests, Total Client Requests /sec, Total Data Size Bytes, Total Evicted Objects, Total Eviction Runs, Total Expired Objects, Total Get Requests, Total Get Requests /sec, Total GetAndLock Requests, Total GetAndLock Requests /sec, Total Memory Evicted, Total Notification Delivered, Total Object Count, Total Read Requests, Total Read Requests /sec, Total Write Operations, Total Write Operations /sec | For obvious reasons all the host counters are included. For memory troubleshooting and monitoring, the counters "Total Data Size Bytes" and "Total Object Count" are relevant. Also, a high level of evictions (Total Memory Evicted) may indicate memory pressure. |
| .NET CLR Memory | DistributedCacheService | # Gen 0 Collections, # Gen 1 Collections, # Gen 2 Collections, # of Pinned Objects, % Time in GC, Large Object Heap size, Gen 0 heap size, Gen 1 heap size, Gen 2 heap size | These counters give a fuller picture of what is taking place with the CLR memory. For instance, a large "% Time in GC" would indicate that too much garbage collection (GC) is taking place, and the CPU will probably be working extra to process the GC. Memory pressure may therefore be an issue. See this blog for further details. |
| Memory | | Available MBytes | The preference is to keep this above 10% of total memory and to plan for available space during high-traffic times. |
| Process | | % Processor Time, Thread Count, Working Set | A small working set could imply memory pressure from AF Cache as opposed to other processes in the box. |
| Process | _Total | % Processor Time | Monitoring the total processor time will tell which application(s) are competing for CPU resources. |
| Network Interface | | All (*): Bytes Received/sec, Bytes Sent/sec, Current Bandwidth | As long as memory and garbage collection are not an issue, the CPU should be expected to work well and a lot of throughput can be handled. At that point a slow network card, or any network point from there to the client, may become the bottleneck. Monitor the network interfaces to ensure that they are not saturated. |
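When adding counters by hand or from a script, each entry in the table becomes a counter path of the form `\Object(Instance)\Counter`, or `\Object\Counter` when the object has no instances. The sketch below composes those paths in Python. The helper names are hypothetical, reading "All (*)" for Network Interface as "all instances" is an assumption, and the Process row without a listed instance is left out rather than guessed.

```python
# (counter object, instance or None, counter names), taken from the table above.
RECOMMENDED = [
    ("AppFabric Caching:Host", None, ["*"]),
    (".NET CLR Memory", "DistributedCacheService", [
        "# Gen 0 Collections", "# Gen 1 Collections", "# Gen 2 Collections",
        "# of Pinned Objects", "% Time in GC", "Large Object Heap size",
        "Gen 0 heap size", "Gen 1 heap size", "Gen 2 heap size",
    ]),
    ("Memory", None, ["Available MBytes"]),
    ("Process", "_Total", ["% Processor Time"]),
    # Assumption: "All (*)" means every network interface instance.
    ("Network Interface", "*", [
        "Bytes Received/sec", "Bytes Sent/sec", "Current Bandwidth",
    ]),
]


def counter_path(obj, counter, instance=None):
    """Compose a performance counter path (hypothetical helper)."""
    if instance is None:
        return f"\\{obj}\\{counter}"
    return f"\\{obj}({instance})\\{counter}"


def recommended_paths():
    """Expand RECOMMENDED into one counter path per counter."""
    return [
        counter_path(obj, counter, instance)
        for obj, instance, counters in RECOMMENDED
        for counter in counters
    ]


if __name__ == "__main__":
    # One path per line, a layout that counter-file options of tools such
    # as typeperf (-cf) are expected to accept; verify against your version.
    print("\n".join(recommended_paths()))
```

Writing the printed lines to a text file gives a reusable counter list that can be kept alongside the configuration files.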
To avoid having to include all the counters above one-by-one, download this performance counter data collector set template and import it to the server, as follows (alternatively, you can also use these more generic instructions):
Logs and performance counters collected together are the first step in being ready to analyze errors or monitor for specific concerns or conditions (e.g., memory pressure) with AppFabric Caching.
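As an illustration of monitoring for such conditions, here is a minimal Python sketch of two checks derived from the counter comments: keeping Available MBytes above 10% of total memory, and estimating network saturation. The function names are hypothetical. The network estimate assumes that Current Bandwidth is reported in bits per second and that adding both directions is an acceptable approximation; on a full-duplex link each direction has its own capacity, so treat the result as a rough upper bound.

```python
def available_memory_ok(available_mbytes, total_mbytes, min_fraction=0.10):
    """True if available memory is above min_fraction of total memory.

    The 10% default comes from the comment on the Memory counter.
    """
    if total_mbytes <= 0:
        raise ValueError("total_mbytes must be positive")
    return available_mbytes / total_mbytes > min_fraction


def network_utilization(bytes_received_per_sec, bytes_sent_per_sec,
                        current_bandwidth_bits_per_sec):
    """Rough fraction of the interface bandwidth in use.

    Assumptions: Current Bandwidth is in bits per second, and summing
    both directions is a reasonable approximation of saturation.
    """
    if current_bandwidth_bits_per_sec <= 0:
        raise ValueError("current bandwidth must be positive")
    total_bits = (bytes_received_per_sec + bytes_sent_per_sec) * 8
    return total_bits / current_bandwidth_bits_per_sec


if __name__ == "__main__":
    # Hypothetical sample values, not measurements from a real host.
    print(available_memory_ok(2048, 16384))   # 12.5% available
    print(network_utilization(50_000_000, 50_000_000, 1_000_000_000))
```

What counts as saturated is left to the reader; this post does not give a network threshold.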
Since this is a big subject, I will look into further exploring the reasons behind the performance counters recommendation in a future blog.
Author: Jaime Alva Bravo
Reviewers: Mark Simms; James Podgorski; Rama Ramani
I am assuming all this is valid for On-Premise world?
All of this is from on-prem. For the cloud, the Memory, Process (_Total) and Network Interface counters are the only ones available. To get to them, you have to remote desktop to your Azure compute node.