Windows Server AppFabric uses a monitoring data store to capture tracking data generated by the execution of WCF and WF services. Out of the box, the data store is implemented as a SQL Server database.
Before discussing how to choose a deployment configuration that mitigates bottlenecks, let's look at the flow of data from the Event Tracing for Windows (ETW) system used by the AppFabric runtime components, through the Event Collection service, and into the monitoring store data structures. The following diagram and steps conceptually describe the processing sequence:
Performance lab tests conducted by the AppFabric CAT team show that the SQL Agent job is capable of processing between 3,500 and 4,500 staging records (events) per second on a BL680c server with four quad-core CPUs, 32 GB of RAM, and a disk storage/file configuration aligned with common best practices for SQL Server deployments.
As a reference, let's look at the following simple, short-running workflow service:
Using the "Health Monitoring" level, the end-to-end execution of a single workflow instance generates about 13-14 tracking events (and therefore that many staging records). This means that the rate of records arriving in the staging table breaks even with the processing (drain) rate of the SQL Agent staging job at ~285 service calls per second (4,000 records per second drain rate / 14 events per workflow instance = ~285 workflow instances per second). Any higher throughput for this workflow will start building a backlog in the staging table.
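The break-even arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation: the function name and parameters are illustrative and not part of AppFabric, and the 4,000 records per second figure is simply the midpoint of the lab-measured range.

```python
def break_even_rate(drain_rate_per_sec: float, events_per_instance: float) -> float:
    """Workflow instances per second at which staging inflow equals the drain rate."""
    if events_per_instance <= 0:
        raise ValueError("events_per_instance must be positive")
    return drain_rate_per_sec / events_per_instance


# Figures taken from the text: ~4,000 records/s drain, 13-14 events per instance.
for events in (14, 13):
    rate = break_even_rate(4000, events)
    print(f"{events} events per instance -> break-even at {rate:.1f} instances/s")
```

Plugging in the lower and upper ends of the measured drain range (3,500 and 4,500) gives a feel for how much the threshold can move on different hardware.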
Ultimately, the impact of this backlog is that the data in the "normalized" tables (and therefore the AppFabric Dashboard statistics) will be out of date, missing the information from the records that are still pending in the staging table. Depending on the load on the system, this backlog may put the AppFabric Dashboard hours behind what is actually happening in the system.
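To get a rough feel for how quickly the Dashboard falls behind, the following sketch models the staging table as a simple queue. It assumes a constant load, a constant drain rate, and first-in-first-out processing of staging records; these are simplifying assumptions, not documented behavior of the staging job. The 400 calls per second load is a hypothetical figure.

```python
def backlog_records(seconds: float, calls_per_sec: float,
                    events_per_call: float, drain_rate: float) -> float:
    """Staging records still pending after `seconds` of constant load."""
    inflow = calls_per_sec * events_per_call
    return max(0.0, (inflow - drain_rate) * seconds)


def dashboard_lag_seconds(seconds: float, calls_per_sec: float,
                          events_per_call: float, drain_rate: float) -> float:
    """Age of the oldest unprocessed record, assuming FIFO processing."""
    inflow = calls_per_sec * events_per_call
    if inflow <= drain_rate:
        return 0.0  # the job keeps up, so no lag accumulates
    # After t seconds the job has drained drain_rate * t records, which
    # arrived during the first (drain_rate / inflow) * t seconds.
    return seconds * (1.0 - drain_rate / inflow)


# Hypothetical load: 400 calls/s at 14 events each, against a 4,000 records/s drain.
hour = 3600
pending = backlog_records(hour, 400, 14, 4000)
lag = dashboard_lag_seconds(hour, 400, 14, 4000)
print(f"pending records: {pending:,.0f}, dashboard lag: {lag / 60:.1f} minutes")
```

Under these assumptions the lag grows linearly with time, so a sustained overload is what eventually leaves the Dashboard hours behind.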
Note: For pure code-based WCF services, AppFabric has the ability to aggregate operation call statistics prior to sending the tracked information to the monitoring store. The default sampling/aggregation interval is 5 seconds, resulting in a single event being emitted to the monitoring store for each service operation, no matter how many times it has been called within the configured sampling interval.
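The effect of this aggregation on event volume can be illustrated with a small simulation. This is a conceptual sketch only, not AppFabric's implementation: it groups calls into fixed 5-second buckets and emits one aggregated record per operation per bucket. The function name, the tuple layout, and the "GetOrder" operation are hypothetical.

```python
from collections import Counter


def aggregate_calls(calls, interval_sec=5.0):
    """Group (timestamp_seconds, operation_name) pairs into sampling intervals.

    Returns a Counter keyed by (interval_index, operation_name); each key
    stands for one aggregated event and its value is the number of calls
    that the event summarizes.
    """
    buckets = Counter()
    for timestamp, operation in calls:
        buckets[(int(timestamp // interval_sec), operation)] += 1
    return buckets


# 1,000 calls to one hypothetical operation, spread over just under 10 seconds.
calls = [(i * 0.01, "GetOrder") for i in range(1000)]
events = aggregate_calls(calls)
print(f"{len(calls)} calls -> {len(events)} aggregated events")
```

In this model the number of emitted events depends on the number of distinct operations and elapsed intervals, not on the call volume.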
If a single WF service within the environment has a constant load that generates more records than the staging job drain rate, then the mitigation options are limited to:
If a number of services jointly contribute to an incoming staging records rate higher than the backlog threshold, the best option is to provision multiple monitoring stores and configure the services to capture their tracking data into different monitoring stores. Using PowerShell cmdlets, the sequence to achieve this is:
The resulting topology is depicted below, without a single SQL Agent job as a bottleneck:
As long as the disk I/O of your storage configuration can support the load, the physical location of the monitoring data stores may vary: the same SQL Server instance (potentially using different disk volumes for each database), different SQL Server instances on the same server, or completely separate SQL Server installations. Because each monitoring database comes with its own SQL Server Agent job, the total staging throughput increases by virtue of "partitioning" the monitoring infrastructure.
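When planning such a partitioned layout, one way to decide which services share a store is a simple first-fit-decreasing packing that keeps the combined event rate of each store at or below the drain rate of its staging job. The sketch below is a planning aid under stated assumptions, not an AppFabric feature: the service names and per-service rates are hypothetical, the 4,000 records per second capacity is the lab midpoint, and the sketch does not account for services that must share a store for other reasons.

```python
def assign_to_stores(service_rates, drain_rate):
    """Pack services into as few stores as first-fit-decreasing finds.

    service_rates maps a service name to its staging records per second.
    Returns a list of stores, each a dict with its total load and services.
    """
    stores = []
    ordered = sorted(service_rates.items(), key=lambda kv: kv[1], reverse=True)
    for name, rate in ordered:
        if rate > drain_rate:
            # Partitioning cannot help a single service that exceeds the
            # drain rate on its own; it needs a leaner tracking profile.
            raise ValueError(f"{name} alone exceeds the drain rate")
        for store in stores:
            if store["load"] + rate <= drain_rate:
                store["load"] += rate
                store["services"].append(name)
                break
        else:
            stores.append({"load": rate, "services": [name]})
    return stores


# Hypothetical services and their staging record rates (records/s).
rates = {"Orders": 2500, "Billing": 2000, "Shipping": 1500, "Audit": 1000}
for index, store in enumerate(assign_to_stores(rates, drain_rate=4000), start=1):
    print(f"store {index}: {store['services']} at {store['load']} records/s")
```

In practice it is safer to plan against the low end of the measured drain range (3,500 records per second) so that each store keeps some headroom.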
It is important to keep in mind that the AppFabric End-to-End Activity monitoring level only works for services that use the same monitoring store. It is therefore not available across services whose tracking data has been split into different stores.
The two mitigation techniques can be combined to ensure that AppFabric can support the monitoring load generated by the services deployed to the environment. With a hybrid approach, services that generate large volumes of tracked events, or that require end-to-end tracking as a group, can be configured with tracking profiles that capture only key events (such as "milestone" activities), while independent services (or groups of services) forward their tracking data to a different monitoring store or stores.
As long as the limitations imposed by the throttled staging job are recognized during the deployment planning phase, the flexible AppFabric monitoring infrastructure can be configured to support large volumes of tracking data – 1) by carefully selecting the information to be captured and recorded, and 2) by scaling out to multiple monitoring stores.