Making the Case for a True SharePoint Staging Farm.

Posted: September 29, 2010 in MOSS 2007, SharePoint 2010

One of the more common recurring battles I have at client sites is over the need for a full-fledged, multi-server Staging/Test farm. Many clients refuse to see the cost justification for this setup. They look at it as more of a luxury or indulgence. Ironically, most of these same clients also have virtualized environments, where the cost of configuring these systems is smaller than in the old world where this required physical boxes. Granted, SAN space does not come cheap, but I would argue that the cost of engaging your technical staff in an emergency Disaster Recovery process as a result of a simple hotfix tends to be more expensive, especially when you have new OS patches every month that could break SharePoint (I seem to remember a particular IIS patch that caused some serious pain). Now add in the service packs, cumulative updates, and hotfixes we encountered with SharePoint 2007.

So first my recommendation, then the why’s. I would strongly recommend that a staging farm mimic production in server count, networking, SQL clustering, etc., as much as possible. At the very least, the staging farm should have multiple servers (1 WFE, 1 APP/Index, 1 SQL). I would also suggest that a proper staging farm is not a luxury or a nice-to-have but an absolute requirement for an organization wanting to run long term on the SharePoint platform. On networking, as expensive as it can be, I would recommend implementing NLB if you are using it in production, with the SAME hardware/software.
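To make "mimic production" concrete, here is a minimal sketch of a parity check between the two farms. Everything in it is an assumption for illustration: the `FarmTopology` fields, the `parity_gaps` helper, and the sample values are hypothetical and not part of any SharePoint tooling. You would fill in the values by hand from your own farm documentation.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class FarmTopology:
    """Hypothetical hand-written summary of a farm (illustrative fields only)."""
    wfe_servers: int
    app_index_servers: int
    sql_clustered: bool
    load_balancer: str  # model/version string, or "none"


def parity_gaps(production: FarmTopology, staging: FarmTopology) -> List[str]:
    """Return one message per attribute where staging differs from production."""
    gaps = []
    for field in fields(FarmTopology):
        prod_value = getattr(production, field.name)
        stage_value = getattr(staging, field.name)
        if prod_value != stage_value:
            gaps.append(
                f"{field.name}: production={prod_value!r}, staging={stage_value!r}"
            )
    return gaps


if __name__ == "__main__":
    # Sample values are made up for the example.
    production = FarmTopology(2, 1, True, "vendor-x model B")
    staging = FarmTopology(1, 1, True, "vendor-x model A")
    for gap in parity_gaps(production, staging):
        print("Variance from production:", gap)
```

Each line it prints is a variance you are choosing to accept, which is a useful list to have in front of the people making the cost decision.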

In using the staging farm, there should be a couple of governance guidelines as well. First, all customizations installed in staging should be installed exactly as they are in production. This means WSP deployments. If you are not using WSPs, you need to get on the bandwagon; it is not just a good idea, it is the way things need to be done in SharePoint whenever possible. This also means the same site collections, the same web applications, security, profiles, profile import, search settings, etc.

Now the why’s. I will go into actual scenarios I have seen. There are a lot I could use, but these are the shining examples.

Company A, a large insurance company with 20,000-plus users on a 5-server farm. They had to meet some very specific compliance needs, which required applying GPOs to the system. Many of these GPOs had nothing to do with SharePoint. Like many companies, they got everything up and running in staging, then production, and then proceeded with the lockdown. They had a staging environment. The GPOs were applied in groups of 10-100 depending on perceived risk. What followed was a 2-month endeavor in which staging was down 90% of the time. The GPOs blocked the OS from working and in some cases created communication issues between servers that would NEVER have shown up without a multi-server staging farm. There were zero production outages associated with the application of GPOs, because of the staging farm. With a system like SharePoint, which relies on multiple servers functioning as one unit, the only way to truly reach compliance without risking the production farm is to have a multi-server staging farm on which to work out how to reach compliance.

Another example, Company A again: MOSS Service Pack 1 was rolled out in staging first. DCOM permission issues popped up and declarative workflows ceased to function. What was worse, we could not roll back. There was absolutely no way to uninstall the service pack, and we had to go into DR mode on the farm, bringing it down and restoring server images and DBs. Staging was down for a full day. Production never went down.

Company B, a large pharmaceutical company with 10,000 employees and a SharePoint-based intranet, refused to implement a staging farm, citing costs. They implemented a large WSP-deployed branding solution consisting of custom master pages, page layouts, themes, feature stapling, event handlers, custom web parts, and custom CSS. Deployment on the development server was completely successful. Deployment on the 5-server production farm appeared to go smoothly. Testing revealed sporadic outages in production almost immediately. For 3 days, users in production had to deal with sporadic outages, data loss, and other issues. After 3 days the issue was traced to one of the branding features, which had failed to fully deploy on one of the WFEs. The sporadic outages were a result of NLB bouncing the user back and forth between WFE servers. Using production as a testing platform (yes, I said we were forced to use production as a testing platform, lacking a multi-server staging farm), we determined a way to force the deployment to succeed on multi-server environments and got it working. Company B stood up a multi-server staging environment on the tail end of that effort and has NOT had a production outage in the 8 months since staging was implemented, though they have had plenty of staging outages.
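The kind of per-server check that would have caught this sooner can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`fingerprint`, `inconsistent_servers`) are hypothetical, and the deployed files are passed in as plain dictionaries rather than read from real WFEs, since how you collect them depends on your environment.

```python
import hashlib
from typing import Dict, List


def fingerprint(files: Dict[str, bytes]) -> str:
    """Hash a server's deployed files (path -> content) into one digest."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


def inconsistent_servers(
    deployed: Dict[str, Dict[str, bytes]], reference: str
) -> List[str]:
    """Return the servers whose deployed files differ from the reference server."""
    expected = fingerprint(deployed[reference])
    return sorted(
        name for name, files in deployed.items() if fingerprint(files) != expected
    )


if __name__ == "__main__":
    # Made-up sample data: WFE2 is missing one branding file.
    deployed = {
        "WFE1": {"branding/custom.master": b"<master/>", "branding/site.css": b"body{}"},
        "WFE2": {"branding/custom.master": b"<master/>"},
    }
    print(inconsistent_servers(deployed, reference="WFE1"))
```

The point of the design is to compare every WFE individually instead of testing through the load-balanced URL, where a good server can mask a bad one.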

Company A again: they wanted SSL on their site but wanted it terminated on the load balancers. Implemented in staging, it went fine. Implemented in production, it caused an immediate outage. It turned out their load balancers in staging were not the same as the production ones. The staging ones were actually older, cheaper models and worked fine. The “good” production ones stripped off host headers after decrypting the packets. Without host headers, IIS never sent the traffic to the correct web application. When we suggest the SAME version in production and staging, this is why. This is the only time we took production down for Company A. It was a painful lesson for an otherwise VERY careful company, but one they learned well.
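Why a missing host header matters can be shown with a deliberately simplified model of host-header routing. This is not how IIS is implemented; real bindings also involve IP address and port, and a site with a blank host header can catch unmatched requests. The function name, hostnames, and web application names below are all made up for the example.

```python
from typing import Dict, Optional


def route_by_host_header(
    bindings: Dict[str, str], headers: Dict[str, str]
) -> Optional[str]:
    """Pick a web application by the request's Host header (simplified model).

    `bindings` maps a lowercase hostname to a web application name.
    Returns None when no binding matches, for example when the header is missing.
    """
    host = headers.get("Host", "")
    hostname = host.split(":")[0].strip().lower()  # drop any ":port" suffix
    return bindings.get(hostname)


if __name__ == "__main__":
    bindings = {
        "intranet.example.com": "Intranet web application",
        "mysites.example.com": "My Sites web application",
    }
    # Request as forwarded by a load balancer that preserves the header.
    print(route_by_host_header(bindings, {"Host": "intranet.example.com:443"}))
    # Request as forwarded by a load balancer that strips the header.
    print(route_by_host_header(bindings, {}))
```

In this model the second request matches nothing, which is the same class of failure described above: the traffic arrives, but nothing can tell which web application it belongs to.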

Company C, a large healthcare organization with 40,000-plus users and a SharePoint 2007 intranet. Company C instituted a multi-server staging farm per our suggestion. They have a heavily branded solution with a moderate number of coding customizations, deployed through multiple WSPs for branding, custom event handlers, themes, custom web parts, and other customizations. The initial WSP deployment was smooth. Updates caused issues as a result of the self-referencing issue with master pages and page layouts in SharePoint 2007. We were able to build a customized upgrade path for the client’s implementation without bringing production down. In staging it took 2 days to develop this. In production, as a result of that effort, we rolled out the changes and implemented the upgrade plan in 10 minutes.

Company D, a large insurance organization with 15,000-plus users and a SharePoint 2010 intranet, implemented without any staging farm in a heavy GPO environment. They performed an in-place upgrade from SharePoint 2007 to SharePoint 2010 and pushed out multiple customizations. The result was an immediate failure of the production system. GPOs disabled a number of key components in the OS, such as IIS, DCOM, and ACLs. The end result was a complete repave of the production servers (half a dozen of them, actually) and major production outages over the 3 weeks it took to troubleshoot their couple of thousand GPO settings.

Company E, a large aerospace firm with 80,000+ users and a SharePoint 2007-based intranet on a 5-server farm. They engaged us at the tail end of a development effort for architecture guidance. Per our recommendation, they implemented SharePoint in a full-fledged staging system. Immediately upon deployment, numerous security issues with the development customizations occurred. Mainly they encountered NTLM double-hop issues, but also some other deployment issues. They had to implement Kerberos with constrained delegation in staging (and eventually production). As in many very large organizations, we had a lot of free rein in staging to implement Kerberos settings and other administrative tasks. They were handled locally, so we were able to do these quickly. The production farm was not the same; it was managed on the other side of the country. Implementing a new Kerberos setting or any custom setting took 3 weeks from formal request to implementation. We were able to implement all items in staging in 2 weeks, and as a result we needed only a single request for all settings in production.

I could go on with many other examples. The fact is that with SharePoint 2007 we saw it over and over, and I would expect nothing but the same with 2010; it is a much more complex system with a lot more rich features to break. A multi-server staging farm is the best way to keep your production farm up and running. Even if you have no customizations and only have a small farm, sooner or later, despite their best efforts, Microsoft will issue a patch, update, KB, hotfix, etc. that breaks your SharePoint farm. Unless you are extremely lucky and/or deliberately keep your farm well behind on patching (even that will not always work), you will sooner or later have an issue related to patching.

  1. Minesh says:

    So what would you roughly advise for a staging environment for an SP 2010 7-server farm:
    1. 2 SQL servers, clustered,
    2. 2 WFEs, Windows load balanced (these 2 are virtualized on 1 physical server),
    3. 1 index,
    4. 1 PPS server & 1 Excel Services server (these 2 are virtualized on 1 physical server).

    • mstarr13 says:

      I will always advise that a staging farm match as close as possible (realizing that financials don’t always allow an identical farm) to production. Each variance from production is a chance for an issue to get through Stage and to impact your production farm. Stage is your chance to demo all patching, updates, config mods, etc that you will put to production. It is your last chance to know how production will react, whether your install/update docs/process are correct before impacting production uptime. Another key is if you will use Stage to verify performance in production. If you need it to perform identically, then you need to duplicate it.

      As a minimum for the environment you mention, the load balancing is a key point I would duplicate, and the SQL cluster is another critical point. The others are where I would look first to consolidate IF you have to, so long as stage is not intended to verify performance. If it is, then I would push really hard to get the other servers in as well.

      A good example: I recently put a hotfix on a client farm. It ran fine in Dev and Test, and when it got to staging it hosed a service app (not the first time I have had that happen to me). It turned out that a single step in the documented guidance for applying the patch had been missed. In the full-fledged production configuration, which was present in stage, it created a serious issue. Luckily, all we did was bring stage down for a day.

      It is a matter of risk analysis. How much risk is your organization willing to accept vs. the cost to match the environment? Often you are willing to accept some risk and allow stage to be scaled down. Also for performance, sometimes you are willing to accept that a certain performance metric in a scaled-down stage farm is good enough. As is frequently the case, the business/financial concerns may be harder to address.
