Recently, a deployment with cache servers on domain1 and cache clients (app servers or web servers) on domain2 led to some debugging challenges when the cache clients were unable to communicate with the servers. Whether this is your DEV servers accessing cache servers or a production topology that is common in your enterprise, this blog might help unblock you with a quick workaround.

Symptoms

In such a deployment, your cache client might receive the following exception:

Message : ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server. Result of the request is unknown.

We understand that this message is misleading, especially when the object being stored is only a few bytes, such as a short string. You might also notice that instantiating the DataCacheFactory and getting a reference to the DataCache object both succeed, and that the exception is thrown only when the first cache operation (GET or PUT) is executed.

Here is a set of things to check before concluding that this is the problem:

  • Export the cache cluster configuration and verify that the cache server names specified under the 'hosts' section match the server names specified in the cache client configuration file.
  • In the cache client configuration file, try changing the cache server names to the Fully Qualified Domain Name (FQDN), for example SERVER1.DOMAIN1.com, and redo the operation.
  • Run a simple ping command from the cache client machine, first using just the server name (SERVER1) and then using the FQDN. You might notice that the ping with the FQDN succeeds while the other one fails.
  • Capture a trace session from the client machine when this issue happens and analyze the output. For tracelog instructions, please refer to this blog.
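
The name comparison from the ping step can also be scripted. The following is a minimal Python sketch, assuming the placeholder names SERVER1 and SERVER1.DOMAIN1.com used in this post. The helpers `resolve` and `check_names` are hypothetical names introduced here for illustration and are not part of any cache client API. Note that it only tests name resolution, not whether the host answers a ping.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve(name: str) -> Optional[str]:
    """Return an IPv4 address for name, or None if resolution fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror and socket.herror are subclasses of OSError
        return None


def check_names(
    names: Iterable[str],
    resolver: Callable[[str], Optional[str]] = resolve,
) -> Dict[str, Optional[str]]:
    """Map each name to its resolved address, or None when it does not resolve.

    The resolver parameter is an assumption of this sketch; it lets the
    lookup be replaced with a stub so the logic can be tried without a network.
    """
    return {name: resolver(name) for name in names}
```

Calling `check_names(["SERVER1", "SERVER1.DOMAIN1.com"])` on the cache client machine should, in the situation described here, map the short name to `None` while the FQDN maps to an address.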

Network trace analysis

Here is an extract from a trace file captured when this problem occurred.

2010-9-15 13:33:01.466  DistributedCache.ClientChannel.Client1  0x000005CC  Creating channel for [net.tcp://SERVER1.DOMAIN1:22233].

----

2010-9-15 13:33:01.664  DistributedCache.DRM.Client1  0x00000A1C  '2:-1' PUT;Routed;MyCache;Default_Region_0982;1975349082;test key;Version = 0:0 - Starting to process.

2010-9-15 13:33:01.665  DistributedCache.DRM.Client1  0x00000A1C  Config for [MyCache,1975349082] is [net.tcp://SERVER1:22233 (120)].

   

The problem is that the DataCacheFactory instantiation uses the FQDN (SERVER1.DOMAIN1), as seen in the first trace entry above. Subsequently, the internal routing table references only the short server name (SERVER1), as seen in the last trace entry. This causes a failure when a cache operation is executed, because DNS on the cache client machine (app server or web server) is unable to resolve SERVER1.
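
This mismatch can be spotted mechanically by pulling the `net.tcp://` endpoints out of the trace messages and flagging any host that carries no domain part. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, and the regular expression assumes endpoints are written the way they appear in the extract above (`net.tcp://HOST:PORT`).

```python
import re
from typing import List

# Assumes endpoints of the form net.tcp://HOST:PORT, as in the trace extract.
ENDPOINT = re.compile(r"net\.tcp://([^:/\]\s]+):(\d+)")


def endpoint_hosts(trace_text: str) -> List[str]:
    """Return host names from net.tcp endpoints, in order of first appearance."""
    hosts: List[str] = []
    for match in ENDPOINT.finditer(trace_text):
        host = match.group(1)
        if host not in hosts:
            hosts.append(host)
    return hosts


def short_names(hosts: List[str]) -> List[str]:
    """Return hosts without a dot, i.e. names that rely on DNS suffix search."""
    return [host for host in hosts if "." not in host]
```

Run against the two endpoint messages in the extract, `endpoint_hosts` would yield both SERVER1.DOMAIN1 and SERVER1, and `short_names` would single out SERVER1 as the name the client must be able to resolve on its own.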

Workaround

  1. Modify the c:\windows\system32\drivers\etc\hosts file on the cache client machine (web or app server) by adding an IP address entry for each cache server, then retry the operation.

     

  2. If the above step does not succeed, try changing the DNS search suffix order in the network connection settings, as shown below:

The above snapshots were taken from a customer engagement, which resulted in this key lesson learnt. We deeply appreciate such feedback and the customer's patience in working with us to identify the root cause.

We have raised this issue with the product team, who are fixing it in the next release.

Author: Rama Ramani

Reviewers: Jaime Alva Bravo, Rahul Kaura, Jason Roth