Distributed Cache bug in SharePoint Server 2013

Distributed Cache is a new component of SharePoint 2013 that is used to cache data for activity feeds, news feeds, search queries, authentication tokens, security trimming, Apps-related data and views. Even though it’s making its debut, it’s a pretty critical component of a SharePoint farm.

The Distributed Cache service uses Windows AppFabric caching technology behind the scenes.

The cache can consume a lot of memory and needs constant access to the stored data, so for best performance Microsoft recommends including dedicated Distributed Cache servers in your farm. In large server farms this makes a lot of sense, though for smaller farms you can usually make do without the dedicated servers.

On a recent project, I ran into an issue with Distributed Cache: requests for items in the cache kept timing out, which caused delays in other components that relied on the cached data. These weren’t occasional failures either; there were hundreds of timeouts every second. Something was up with the service.

Tracing through the logs, we saw that when a user accesses a page, SharePoint attempts to authorize the user to ensure they have access. SharePoint stores the user’s token in the user’s browser session and in the DistributedLogonTokenCache container. When SharePoint tried to retrieve the token from Distributed Cache, the request would time out or no connection would be available, and the comparison would fail. Since it couldn’t validate the presented token, SharePoint had no choice but to log the user out and redirect them to the sign-in page.

One of the interesting things about this issue was that when I consulted MSDN about the timeout values, the documentation didn’t state their units. I had no idea if the timeouts were in milliseconds or seconds.

What are the units for ChannelOpenTimeOut and RequestTimeout? The ChannelInitializationTimeout is much larger at 60000, so maybe it’s milliseconds. Are RequestTimeout and ChannelOpenTimeOut then 20 milliseconds? That seems really small. Maybe it’s 20 seconds? The MSDN page for RequestTimeout doesn’t provide an answer, so we initially had to guess. In our development environments we were able to reproduce the issue when we reduced the timeouts to a value of “5”. So we tried increasing them to 40 in the test environments. Then 60. Then 120. The issue persisted.

With the help of Microsoft Support I sorted out these initial questions, but the issue continued even after we increased the timeouts to larger values. Microsoft called in help from their development support team and, with some additional logging, determined that the issue was actually caused by the way AppFabric handles garbage collection. AppFabric 1.1 Cumulative Update 1 is a prerequisite for SharePoint 2013, and in this version garbage collection “takes too long.”

In AppFabric 1.1 CU1, imagine that garbage collection is done by a little man who walks around the computer’s memory with one of those sticks with a nail on the end. While he works, AppFabric pauses. When the man finds things lying around that AppFabric no longer needs, he stabs the garbage with the nail-stick and takes it away. He keeps looking for other pieces to clean up, and for a room that is 14 GB in size this can take quite some time. He tells AppFabric once he’s done, and then AppFabric un-pauses and continues where it left off. Since everything is waiting for our garbage collector to finish checking everything lying around, other dependent services get tired of waiting and move on. Sometimes this results in having to perform the original operation again (like a search query), and sometimes it means there is no data available to the requesting service. Sometimes it results in an exception, and sometimes, as in our case, the user gets logged out of the site.

So Microsoft wrote a hotfix that changes the way garbage collection happens in AppFabric. Instead of telling everything to wait for our garbage collector and asking him to go find all of the trash, the hotfix now tells the garbage collector to walk around looking for trash to pick up forever. With our man on the ground always tidying up, AppFabric can now just request things without waiting.
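
To put rough numbers on the analogy, here is a toy Python model. It is only a sketch under made-up assumptions: the request rate, the 300 ms pause and the function name `count_timeouts` are invented for illustration and are not measurements from AppFabric. Background collection is simplified to "no long freeze at all", which is more generous than reality.

```python
def count_timeouts(arrivals_ms, pauses_ms, timeout_ms):
    """Count requests that would wait longer than timeout_ms.

    arrivals_ms: times (in ms) at which cache requests arrive.
    pauses_ms:   (start, end) intervals during which the cache is frozen.
    A request that arrives inside a pause is served only when the pause ends.
    """
    timed_out = 0
    for t in arrivals_ms:
        for start, end in pauses_ms:
            if start <= t < end and end - t > timeout_ms:
                timed_out += 1
                break  # count each request at most once
    return timed_out


# Made-up workload: one request every 10 ms for one second.
arrivals = list(range(0, 1000, 10))
# Made-up blocking collection that freezes the cache from 200 ms to 500 ms.
blocking_gc = [(200, 500)]
# Background collection, simplified here to "no long freeze at all".
background_gc = []

print(count_timeouts(arrivals, blocking_gc, timeout_ms=20))    # 28
print(count_timeouts(arrivals, blocking_gc, timeout_ms=120))   # 18
print(count_timeouts(arrivals, background_gc, timeout_ms=20))  # 0
```

In this model, raising the timeout only trims the number of failures while the pause exists, which is consistent with what we saw when we kept increasing the timeout values; removing the long pause is what makes the failures go away.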

As of this writing, the most recent AppFabric CU is AppFabric Cumulative Update 4. I recommend applying this update to your SharePoint 2013 farms if you’re experiencing lots of timeouts when calling Distributed Cache. Once it is applied, you need to modify the Distributed Cache configuration file, which is typically found at C:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe.config. Add the following section within the configuration element, between the configSections and dataCacheConfig elements:

<appSettings><add key="backgroundGC" value="true"/></appSettings>

So you end up with something like this:


<?xml version="1.0" encoding="utf-8"?>
<configuration>
   <configSections>
      ... other configurations ...
   </configSections>
   <appSettings><add key="backgroundGC" value="true"/></appSettings>
   <dataCacheConfig>
      ... other configurations ...
   </dataCacheConfig>
   ... other configurations ...
</configuration>

(Update January 30, 2014: HT to Aben Samuel and Gavin Barron for discovering that the appSettings element needs to go between configSections and dataCacheConfig)
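
If you have several cache hosts to update, the edit can be scripted. The following is a minimal Python sketch of that idea, not an official tool: the function names `enable_background_gc` and `patch_config_file` are my own, and the script assumes the config file has the layout described above. Note that Python's ElementTree discards XML comments when it rewrites a file, so back up the original first.

```python
import xml.etree.ElementTree as ET


def enable_background_gc(root):
    """Ensure <appSettings> holds backgroundGC=true under <configuration>.

    A new <appSettings> element is placed directly after <configSections>
    (or first, if <configSections> is absent). An existing <appSettings>
    is left where it is. Returns True if the tree was changed.
    """
    app_settings = root.find("appSettings")
    if app_settings is None:
        app_settings = ET.Element("appSettings")
        index = 0
        for i, child in enumerate(list(root)):
            if child.tag == "configSections":
                index = i + 1
                break
        root.insert(index, app_settings)

    for entry in app_settings.findall("add"):
        if entry.get("key") == "backgroundGC":
            if entry.get("value") == "true":
                return False  # already configured, nothing to do
            entry.set("value", "true")
            return True

    ET.SubElement(app_settings, "add", {"key": "backgroundGC", "value": "true"})
    return True


def patch_config_file(path):
    """Apply enable_background_gc to a config file on disk.

    Assumption: the caller has already backed up the file. Comments in the
    original XML are not preserved by ElementTree's default parser.
    """
    tree = ET.parse(path)
    changed = enable_background_gc(tree.getroot())
    if changed:
        tree.write(path, encoding="utf-8", xml_declaration=True)
    return changed


if __name__ == "__main__":
    # In-memory demonstration on a stripped-down stand-in for the real file.
    sample = "<configuration><configSections/><dataCacheConfig/></configuration>"
    root = ET.fromstring(sample)
    enable_background_gc(root)
    print([child.tag for child in root])
```

Running the function a second time on the same tree makes no further changes, so the script is safe to re-run on a host that has already been patched.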
