Posts Tagged ‘SharePoint 2013’

SharePoint Distributed Cache Service ….DANGER WILL ROBINSON!!

 

This is a late-night post, so apologies for the brevity and the less-than-cohesive train of thought. I just needed to get this out there, as much for my benefit as anyone else's.

So I have to admit, this service has always been a bit of a mystery to me. Simply put, it always worked. Never had to dig. Never had to care. Just a few cmdlets and it was running, and I went on my merry way. Well, I recently had a case where I lost a host and had to dig when it would not come back.

What a scary experience. There are so many tools to choose from: the AppFabric cache tool, the AppFabric cache cmdlets, the SharePoint cmdlets, the Services MMC, and multiple places in Central Administration. And then the warnings: if you use the wrong ones, you may need to rebuild your farm. Yes, rebuild the farm. Rebuild a 13-server farm. Can you say holy crap!

I found a lot of blogs on fixing it and a lot of MSDN and TechNet articles on AppFabric cache architecture. Then I got a virtual machine and began dissecting it. I looked here (http://almondlabs.com/blog/manage-the-distributed-cache/) and got some great pointers. I looked here (http://www.microsoft.com/en-us/download/details.aspx?id=35557) and got some good reference material. I found lots of cmdlets for configuring the cluster and manhandling it, including pushing in providers (where the documentation tells you your choices are XML or SQL), and I found plenty of hacks. In many areas the lines between SharePoint 2013 and AppFabric blurred, and people were straying into dangerous territory without realizing it. Then the clouds parted around 2200, after 3 cups of coffee. The answer came to me when I looked at the HKEY_LOCAL_MACHINE -> SOFTWARE\Microsoft\AppFabric\V1.0\Providers\AppFabricCaching registry key. The provider listed was a third…SPDistributedCacheClusterProvider.
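If you want to look at that registry key yourself, here is a minimal read-only sketch. The key path is the one mentioned above. Get-CacheProviderInfo is a helper name I made up for illustration, and exactly what shows up under the key (subkeys versus values) is an assumption that may differ on your build, so the sketch simply returns both.

```powershell
# Read-only look at the AppFabric provider registration on a cache host.
# Get-CacheProviderInfo is a hypothetical helper, not a built-in cmdlet.
function Get-CacheProviderInfo {
    param(
        [string]$KeyPath = 'HKLM:\SOFTWARE\Microsoft\AppFabric\V1.0\Providers\AppFabricCaching'
    )

    if (-not (Test-Path -Path $KeyPath)) {
        # Not a cache host, or AppFabric is not installed on this machine.
        return $null
    }

    [pscustomobject]@{
        # Names of any subkeys (assumption: providers may be registered as subkeys).
        SubKeys = @(Get-ChildItem -Path $KeyPath | ForEach-Object { $_.PSChildName })
        # Values stored directly on the key itself.
        Values  = Get-ItemProperty -Path $KeyPath
    }
}

Get-CacheProviderInfo
```

Nothing here writes to the registry. It only reads, which is about as far as I would go outside the SharePoint cmdlets.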

The fog lifted

So, for those who want to better understand it, in VERY simple terms: an AppFabric cluster is a set of cache hosts tied together through a central point, which is either a SQL Server database or an XML file on a file share. This is similar to SharePoint and its reliance on a config DB to keep all the servers in line. In the case of SharePoint 2013, that cluster is managed through the SharePoint configuration database and the set of Distributed Cache service instances in the farm. The SharePoint Distributed Cache service sits on top of AppFabric. It manages it. It maintains the AppFabric cluster underneath and keeps everything in line. This is why you only run the SharePoint cmdlets to add or remove hosts, and so on. It is also why using alternate tools like the AppFabric cluster tool can kill your farm. A SharePoint farm's greatest weakness has always been its config DB. It is the one point where you can kill the farm. Using another tool risks putting the cache cluster in AppFabric out of sync with the configuration database in SharePoint. Once that happens, well, anything goes. You might be able to recover it, or you might never get another Distributed Cache service running on your farm again.

If you want to manage this service, then, stick to those cmdlets. Stick to the guidance in the Microsoft documents at http://www.microsoft.com/en-us/download/details.aspx?id=35557 for maintaining it. Be VERY careful with any modifications made outside those cmdlets. The farm is literally at stake here. It is a very dangerous game, and from the blog posts I have seen out there, a lot of folks are hacking in fixes.
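To make "stick to those cmdlets" concrete, here is a rough sketch of taking a server out of the cache cluster and putting it back using only the SharePoint cmdlets. It assumes you are in the SharePoint 2013 Management Shell on the cache host itself. Reset-LocalCacheHost is a name I made up to wrap the steps, and this is not necessarily the exact sequence that fixes any given broken host, so check the Microsoft guidance first.

```powershell
# Assumes the SharePoint 2013 Management Shell, run on the cache host itself.
# Reset-LocalCacheHost is a hypothetical wrapper, not a built-in cmdlet.
function Reset-LocalCacheHost {
    [CmdletBinding(SupportsShouldProcess = $true)]
    param()

    if ($PSCmdlet.ShouldProcess('local server', 'Remove and re-add Distributed Cache instance')) {
        # Stop the local cache service gracefully through SharePoint, not AppFabric.
        Stop-SPDistributedCacheServiceInstance -Graceful
        # Unregister this server from the cache cluster.
        Remove-SPDistributedCacheServiceInstance
        # Register it again and start the service.
        Add-SPDistributedCacheServiceInstance
    }
}

# Dry run: reports what would happen without touching the farm.
Reset-LocalCacheHost -WhatIf
```

Drop the -WhatIf only once you are sure. Everything in the function goes through SharePoint, so the config DB and the AppFabric cluster should stay in step.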

Another great article, which just about laid it all out for me, is here (I am dense and did not get the simple truth of it right away when I first read it): http://social.technet.microsoft.com/wiki/contents/articles/20348.sharepoint-2013-appfabric-and-distributed-cache-service.aspx

 

Let me say it: I love PowerShell. For building out predictable, repeatable configuration, installation, and maintenance processes, nothing beats it. So, like many others, I have built up a large PowerShell script library. I have scripts for full farm configuration, for site provisioning, for site collection configuration, and many, many more.

It is sort of scary that I never came across this before. While working with a client, we put together a script that provisioned some site collections, turned on some features, installed and activated some custom features, set a custom page layout and content type as the default on the Pages library, and removed all other content types and page layouts from that library. The script failed halfway through, while activating a custom feature that had been installed in the previous step. It returned the error: “the feature is not a farm level features and is not found in a site level defined by the URL…”. This was confusing, because in Central Administration I could see that the feature was there, was installed, and was at the correct scope. After Binging this for a while, I found very little useful information; most reports of this error came down to an incorrectly entered feature name. We had the GUID in there and it definitely matched. Also, we could re-run the offending line of script in the PowerShell window and it would run fine the second time around.

After a little tinkering, we found that by chunking up the PowerShell script and running it in pieces, it all ran to completion without error. So we knew how to get the script to work, but not really why it had failed to begin with. I run many large scripts without issue. The key was in the install/activate code: adding a WSP to a farm and deploying it is an asynchronous process. It is a really fast one for many solutions, but it runs asynchronously nonetheless. Our script ran so fast that it tried to activate the feature before the solution was fully deployed, so we got the error message that the feature could not be found.
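Instead of chunking the script by hand, you can have it wait for the deployment to finish before activating anything. The sketch below is one way to do that, not the script we actually ran. Wait-ForCondition and Install-SolutionAndEnableFeature are names I made up, and I am assuming a globally deployed solution (add -WebApplication to Install-SPSolution if yours targets a web application). It polls the Deployed and JobExists properties on the solution.

```powershell
# Generic polling helper: re-evaluates $Condition until it returns $true
# or the timeout expires. Wait-ForCondition is a hypothetical helper name.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory = $true)][scriptblock]$Condition,
        [int]$TimeoutSeconds = 300,
        [int]$PollSeconds = 2
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while (-not (& $Condition)) {
        if ((Get-Date) -ge $deadline) { return $false }
        Start-Sleep -Seconds $PollSeconds
    }
    return $true
}

# Hypothetical wrapper: add and deploy a WSP, wait for the timer job, then activate.
# Requires the SharePoint 2013 Management Shell; only defined here, not called.
function Install-SolutionAndEnableFeature {
    param([string]$WspPath, [string]$FeatureId, [string]$SiteUrl)

    $name = Split-Path -Path $WspPath -Leaf
    Add-SPSolution -LiteralPath $WspPath | Out-Null
    # Returns immediately; the actual deployment runs as a timer job.
    Install-SPSolution -Identity $name -GACDeployment

    $done = Wait-ForCondition -Condition {
        $s = Get-SPSolution -Identity $name
        $s.Deployed -and -not $s.JobExists
    }
    if (-not $done) { throw "Timed out waiting for $name to deploy." }

    # Safe to activate now that the solution is fully deployed.
    Enable-SPFeature -Identity $FeatureId -Url $SiteUrl
}
```

The polling helper has nothing SharePoint-specific in it, so the same wait works for any other step that kicks off a timer job and returns early.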

So, lesson learned: PowerShell is fast and effective, sometimes too much so. When you run a large script and hit strange errors that appear to make no sense (at the time), try breaking it up and executing it in chunks. It also helps to go through your code and make sure you know which parts run asynchronously and which do not. This may help you avoid this frustrating error.