SharePoint 2013 uses AppFabric to host a range of cached data in a cache cluster. This is awesome mainly because each web-front-end server no longer has its own cache island that can’t be shared with the other web-front-ends, so the likelihood of a cache-miss is much lower (especially if the load-balancer doesn’t do sticky-sessions) and the caching becomes more effective.

AppFabric is used to cache all sorts of info for SharePoint installations and farms. Here are some of the types of data that are cached:

  • ASP.Net view-state cache
  • Logon token cache
  • User “my sites” activity feed cache
  • WFE to app-server access token cache
  • Security trimming cache (search)
  • More caches

Each one of those has its own timeout settings, so the first thing to do is work out which cache is timing out. That’s typically easy, as you can see from these example errors:

  1. Unexpected error occurred in method 'GetObject' , usage 'SPViewStateCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>
  2. Unexpected error occurred in method 'Put' , usage 'SPViewStateCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException
  3. Unexpected error occurred in method 'GetObject' , usage 'Distributed Logon Token Cache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.. Additional Information : The client was trying to communicate with the server : net.tcp://servername:22233
  4. Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage 'DistributedLogonTokenCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.

No prizes for guessing which cache containers we need to tweak the settings for here, although that last one is timing out while trying to open a new connection rather than while reusing an existing one.
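If there are a lot of these errors to wade through, the cache name can be pulled out of each message with a regular expression. This is only a minimal sketch: the two sample messages are shortened copies of the errors above, and the pattern simply assumes the name always appears as usage '...'.

```powershell
# Sample messages copied (and shortened) from the example errors above.
$messages = @(
    "Unexpected error occurred in method 'GetObject' , usage 'SPViewStateCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>",
    "Unexpected error occurred in method 'Put' , usage 'SPViewStateCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException",
    "Unexpected error occurred in method 'GetObject' , usage 'Distributed Logon Token Cache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.."
)

# Extract the text between the quotes after "usage", then count per cache.
$messages |
    ForEach-Object {
        if ($_ -match "usage '([^']+)'") { $Matches[1] }
    } |
    Group-Object |
    Sort-Object Count -Descending |
    Select-Object Count, Name
```

The cache with the highest count is the obvious first candidate for new timeout values.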

Side-note: there are two types of timeout for SharePoint/AppFabric: one to open a channel, and another to send requests once that channel has been opened. Having separate values would normally let you tune the performance of the client-calling code, but as nobody bar the SharePoint product-group & support has access to said “client” code, the difference is a somewhat academic point for us; just configure both as if they were the same. For the record, opening a new channel to AppFabric involves authenticating that new connection, so there is a performance hit.

Configuring New Timeout Values

So once we’ve figured out which cache container we want to tweak timeout settings for, we do it with this PowerShell:

$set = Get-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache
$set.RequestTimeout = 100 # milliseconds; normally it’s 20ms
$set.ChannelOpenTimeOut = 100 # milliseconds; normally it’s 20ms
# maybe change the other values too
# the settings object has to be passed by name via -DistributedCacheClientSettings
Set-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache -DistributedCacheClientSettings $set

You can see two different container configurations if you just get the settings for another one:

[Screenshot: Get-SPDistributedCacheClientSetting output for two different containers]

Just for the record, the default configured values are:

[Screenshot: default client settings]
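To compare the timeout values across every container in one go, something like the following should work from the SharePoint Management Shell. Treat it as a sketch: the list of container names is my understanding of the ten types SharePoint 2013 accepts for -ContainerType, so check it against your own farm before relying on it.

```powershell
# Load the SharePoint cmdlets if this isn't the SharePoint Management Shell.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Assumed list of SharePoint 2013 container types; verify on your farm.
$containerTypes = @(
    'DistributedDefaultCache',
    'DistributedAccessCache',
    'DistributedActivityFeedCache',
    'DistributedBouncerCache',
    'DistributedLogonTokenCache',
    'DistributedServerToAppServerAccessTokenCache',
    'DistributedSearchCache',
    'DistributedSecurityTrimmingCache',
    'DistributedActivityFeedLMTCache',
    'DistributedViewStateCache'
)

# Read each container's client settings and show the values discussed here.
$containerTypes | ForEach-Object {
    $settings = Get-SPDistributedCacheClientSetting -ContainerType $_
    [pscustomobject]@{
        Container              = $_
        RequestTimeout         = $settings.RequestTimeout
        ChannelOpenTimeOut     = $settings.ChannelOpenTimeOut
        MaxConnectionsToServer = $settings.MaxConnectionsToServer
    }
} | Format-Table -AutoSize
```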

After you’re done you need to restart the AppFabric service everywhere it’s running (although I’ve never confirmed this; I think some settings don’t actually require a service restart). Remember, it can take 5 minutes before a cache-node has initialised properly from its peers, so leave 5 minutes between AppFabric restarts (10 if you’re paranoid). That gives the cache cluster time to move objects around as nodes go down and come up, so the cache doesn’t die.
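A rolling restart with the pause built in could look roughly like this. It is a sketch under several assumptions: the host names are made up, PowerShell remoting is enabled on the cache hosts, and the Windows service is called AppFabricCachingService on your servers.

```powershell
# Hypothetical host names: replace with the servers that run AppFabric.
$cacheHosts = @('CACHE01', 'CACHE02')

# 5 minutes between nodes, as suggested above (use 600 if you're paranoid).
$settleSeconds = 300

foreach ($cacheHost in $cacheHosts) {
    # Restart the caching service on one node only.
    Invoke-Command -ComputerName $cacheHost -ScriptBlock {
        Restart-Service -Name AppFabricCachingService
    }

    # Give the node time to initialise from its peers before touching the next one.
    Write-Host ("Restarted AppFabric on {0}; waiting {1} seconds" -f $cacheHost, $settleSeconds)
    Start-Sleep -Seconds $settleSeconds
}
```

Restarting one node at a time is the whole point here; restarting them all together is what empties the cache.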

How to Tell What Timeout Values to Use

Good question. It depends on system load to a large extent, obviously. A good start at least is the general network latency between each node, found by pinging each machine with AppFabric from each web-front-end (WFE):

[Screenshot: ping response times from a WFE to an AppFabric server]

Here we can see that most of these network calls, had they been for AppFabric, would have exceeded the 20ms default and timed out (despite receiving a response). Work out the latency between each AppFabric & WFE server this way while the whole farm is under stress:

[Diagram: ping tests between the WFE and AppFabric servers]

That would be 4 ping tests in total for just testing response-times between AppFabric servers. Important: you also need to factor in that AppFabric will need to chat to AD when connections are initially opened. Pick the highest response time seen from any one of the X tests; double it; make that your starting point. If a WFE is also an AppFabric machine, you need to test from every other WFE to the same machine regardless.
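The ping-and-double rule can be scripted so it is easy to repeat from each WFE. This is a rough sketch with made-up host names; it assumes Windows PowerShell, where Test-Connection returns objects with a ResponseTime property in milliseconds, and it measures network latency only, not the AD authentication cost mentioned above.

```powershell
# Hypothetical host names: replace with the machines that run AppFabric.
$cacheHosts = @('CACHE01', 'CACHE02')
$worstMs = 0

foreach ($cacheHost in $cacheHosts) {
    # 20 pings per host; run this while the farm is under load.
    $pings = Test-Connection -ComputerName $cacheHost -Count 20 -ErrorAction SilentlyContinue
    if (-not $pings) {
        Write-Warning ("No reply from {0}" -f $cacheHost)
        continue
    }

    # Keep the slowest response seen for this host.
    $hostMax = ($pings | Measure-Object -Property ResponseTime -Maximum).Maximum
    Write-Host ("{0}: slowest response {1} ms" -f $cacheHost, $hostMax)

    if ($hostMax -gt $worstMs) { $worstMs = $hostMax }
}

# Double the worst case to get a starting point for the timeout values.
$suggestedTimeoutMs = 2 * $worstMs
Write-Host ("Suggested starting timeout: {0} ms" -f $suggestedTimeoutMs)
```

Run it from every WFE and take the largest suggestion across all of them.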

This error occurs because a call from a WFE to an AppFabric machine, at that exact moment in time, took longer than the values configured for that specific cache container, most likely because of high latency. Give the timeout values plenty of room; there’s not actually much risk in doing so…

AppFabric Timeout Reconfiguration Risks

These values are at the end-of-the-day for AppFabric and not SharePoint. AppFabric is a generic cache-clustering solution and isn’t specific to SharePoint at all so these values basically ensure the “client” can’t denial-of-service-attack AppFabric, inadvertently or otherwise.

Now that said, as AppFabric in a SharePoint installation is only ever called by SharePoint code, the value in these protective values (no pun intended) is diminished somewhat, as we have only core-product code invoking core-product code. Given how much investment we put into stress-testing & quality control for our software compared to a normal bespoke project for example (which is partly what AppFabric is meant for), raising these values for SharePoint only presents a very minor risk indeed.

Remember though: 100 is the maximum number of connections that SharePoint supports, and going over it will break the whole thing, so don’t go too crazy. Also bear in mind that there are 10 cache-containers in all, so setting the maximum-connections value to 100 on every one of them means approximately 1000 active connections to all machines that host AppFabric in the farm. If one of those machines is also responsible for serving web-pages under load, that could have quite a negative effect for users.
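The arithmetic is simple enough, but it is worth redoing before committing to a value. A small sketch, with the variable names being purely illustrative:

```powershell
# 10 cache containers, each allowed up to the same number of connections.
$containerCount = 10
$maxConnectionsPerContainer = 100   # the highest value SharePoint supports

# Approximate number of active connections in the worst case.
$totalConnections = $containerCount * $maxConnectionsPerContainer
Write-Host ("Approximate active connections: {0}" -f $totalConnections)   # 1000
```

Dropping the per-container value to 20, say, brings the total down to about 200.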

Monitoring AppFabric for SharePoint

First, remember that if SharePoint times-out while calling AppFabric that’s not a problem for AppFabric; at worst it’ll be responding to requests that’ll get ignored, or not even called at all if the timeouts are too low. In short that means there’s not much to monitor for AppFabric although there are plenty of guides to help you better than I could.

On the SharePoint side it’s actually not that easy to monitor as there’s no cache-is-broken specific event ID (apart from the generic fatal error event ID 6398 that mentions ‘Microsoft.Office.Server.UserProfiles.LMTRepopulationJob threw an exception. More information is included below. Unexpected exception in FeedCacheService.BulkLMTUpdate: Unable to create a DataCache. SPDistributedCache is probably down’) – but that’s for another caching container.

What you need to do is keep an eye-out for SharePoint ULS events “agyfw” and “ah24w” – most AppFabric messages of doom will have one of those IDs. Unfortunately at the time of writing I can’t think of a clever, System-Center type way of automating those checks but at least the event itself is clear & unambiguous.
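Short of a proper System Center rule, a manual check for those two IDs might look like this. It is only a sketch: it assumes Get-SPLogEvent is run on each server you care about, since it reads the ULS logs of the local machine, and the one-hour window is an arbitrary choice.

```powershell
# Load the SharePoint cmdlets if this isn't the SharePoint Management Shell.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# The two ULS event IDs mentioned above.
$eventIds = @('agyfw', 'ah24w')

# Look back over the last hour of ULS entries on this server.
Get-SPLogEvent -StartTime (Get-Date).AddHours(-1) |
    Where-Object { $eventIds -contains $_.EventID } |
    Select-Object Timestamp, EventID, Level, Message
```

No output means no matching cache errors were logged on that server in the last hour.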

Finally, when testing your new configuration values you might still see errors for a while afterwards as the cache cluster gets its act together again. Long story, but basically if you think you’re seeing errors when you shouldn’t, go have a coffee and check again a bit later :)

Wrap-up

That’s it for now; I hope this helps. If this becomes a hot topic I’ll consider expanding on it. Happy SharePointing!

// Sam Betts