NLB (Network Load Balancing) and SharePoint... Troubleshooting and Configuration tips.
Joys of NLB.
A customer was recently asking me about issues they were having with NLB and Kerberos.
They came across a recent KB, 325608, which explains that Kerberos authentication delegation isn't straightforward in load balanced environments.
There are some recommendations for how to configure NLB with Kerberos in a load balanced environment.
The KB references a very good article: the "Kerberos authentication for load balanced web sites" white paper.
For additional information about network load balancing, visit the following Microsoft Web site:
A couple of free tips...
Make your sessions sticky
No matter what you use for load balancing, try to keep the sessions sticky. In NLB they call it affinity. With load balancing you may think you want exactly 50% of the traffic on each server, with every request balanced individually. In reality, it's not bad to have a 52%/48% split where the sessions are sticky and users stay on the same server throughout their session. The experience will be more consistent, and problems are easier to pin down. For example: "Hey, I'm getting a 500 error or server error." "Works for me." "Well, I can repro it on this machine, but not on that client." That tells you there's a problem with one of the servers in the load balanced cluster. Had every request been balanced across both servers, the report would be "I see intermittent outages, but when I refresh, the page comes up. It's really *weird*."
The other thing: let's say it's an expired certificate on one node. You can imagine the experience with that. Or let's say you decided to use Basic authentication over SSL. You can imagine your users having to log in twice! Also, imagine trying to track a session across two servers, working out from the IIS logs which action happened on which node. You'll likely have to track down the right node anyway, but once you get there, you don't want to have to keep jumping back and forth.
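To make the affinity idea concrete, here's a minimal sketch in Python. It is not how NLB actually hashes traffic. It simply assumes a stable hash of the client IP picks the node, and the node names (WFE1, WFE2) and function names are made up for illustration. The point is that a sticky client always lands on the same node, while per-request balancing gives a perfect split but bounces the same client around.

```python
import hashlib
from collections import Counter

NODES = ["WFE1", "WFE2"]  # hypothetical node names


def node_for_client(client_ip, nodes=NODES):
    """Sticky: a stable hash of the client IP always picks the same node."""
    digest = hashlib.md5(client_ip.encode("ascii")).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]


def round_robin(requests, nodes=NODES):
    """Not sticky: every request goes to the next node in turn."""
    return [nodes[i % len(nodes)] for i, _ in enumerate(requests)]


def split(client_ips, nodes=NODES):
    """Percentage of clients that land on each node under the sticky scheme."""
    counts = Counter(node_for_client(ip, nodes) for ip in client_ips)
    total = sum(counts.values())
    return {n: round(100 * counts[n] / total, 1) for n in nodes}


if __name__ == "__main__":
    clients = [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(1000)]
    print("sticky split:", split(clients))
    # The same client, four requests in a row, with and without stickiness.
    print("sticky:     ", [node_for_client("10.0.0.7") for _ in range(4)])
    print("round robin:", round_robin(["10.0.0.7"] * 4))
```

The sticky split will rarely be exactly 50/50, which is the trade-off described above: a slightly uneven load in exchange for a consistent experience per user.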
Make sure the servers can talk
Another tip for ya. Let's say you put both your WFEs (web front ends) in an NLB cluster, but you can't ping between them. Our DC guys in IT were always fine with that. "You don't need an extra network, a crossover cable, or a hub," they used to say. NOT SO in SharePoint, especially in 2003! If the WFEs can't talk to each other, the Admin service gets upset, and either pages load very slowly the first time or search requests are super slow the first time. You may see this and think it's an IIS or SharePoint .NET compilation or server caching issue. Not so. If you know the page has already been assembled (it's hot) and you still get a slow response the first time but not the second, check for communication issues between your WFEs, and between your WFEs and your search servers. In IT we used to use crossover cables for the two-node clusters; it was a cheap solution. We'd use perfmon to determine that the load balancing was really working. If the split was even 40-60, we'd sometimes have to call it good. You can simply use the Web Service connections counter. It is a funny and controversial counter, but it will give you an idea of whether there are a lot more current sessions on one node than the other.
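Here's a rough sketch of both checks in Python, under a few assumptions: the node names are hypothetical, the TCP port you test is whichever one your WFEs actually need to reach each other on, and the connection counts are numbers you read off perfmon yourself. The script doesn't read performance counters.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def connection_split(current_connections):
    """Turn per-node connection counts into percentages of the total."""
    total = sum(current_connections.values())
    if total == 0:
        return {node: 0.0 for node in current_connections}
    return {
        node: round(100 * count / total, 1)
        for node, count in current_connections.items()
    }


if __name__ == "__main__":
    # Hypothetical peer name and port; run this from one WFE against the other.
    print("WFE2 reachable:", can_connect("WFE2", 80))
    # Counts copied by hand from perfmon on each node.
    print(connection_split({"WFE1": 40, "WFE2": 60}))
```

Run the connectivity check from each node against the other, and against the search servers, since the slow first request can come from either path.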
Avoid Collisions for communication
In 3-node NLB clusters we would either use a hub or put pressure on the network folks to put the second NIC on a backend LAN with the SQL environment. That was COOL. We saw a lot of our network perf issues work themselves out simply by having the front end NICs take end user traffic while the second NIC handled communication with the SQL server and server-to-server communication. At another company we used a second NIC on the SQL servers for backups. They had an entire "backup" VLAN, and it worked well for them. Setting up static routes required documentation, plus some tribal knowledge that had to be passed on for troubleshooting, so from that perspective I preferred hubs and crossover cables.
When determining your load balancing strategy, figure out how much you're willing to pay, what your security requirements are, and what your availability requirements are. NLB is cheap, but it does lack some of the intelligence you might think it has. It does *not* know when the web service is stopped, for example. It doesn't know when SharePoint is down. It's not very smart. I've told the NLB folks this, but for the price of a NIC, and with some of your own intelligence, it is a little bit better than round robin DNS.
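"Some of your own intelligence" can start as small as a probe that asks each node directly whether it answers HTTP. Here's a minimal sketch in Python, standard library only. The URL is hypothetical, and treating a 401 as "alive" is my assumption for sites that use Windows authentication, where an anonymous probe gets challenged even though IIS is up. Adjust that to taste.

```python
import urllib.error
import urllib.request

# Talk to the node directly; ignore any proxy configured in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=5.0, also_ok=(401,)):
    """Return True if the URL answers with 2xx/3xx, or a status in also_ok."""
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 401 or 500.
        return err.code in also_ok
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, malformed URL, and so on.
        return False


if __name__ == "__main__":
    # Hypothetical node URL; probe each node by its own name, not the cluster IP.
    print("WFE1 healthy:", probe("http://WFE1/default.aspx"))
```

Probe each node by its own address rather than the cluster address, otherwise the answer only tells you about whichever node happened to take the request.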
If you decide to get serious with load balancing and start writing scripts around it to integrate with your MOM environment and/or web sites and services and you use ISA, you may want to look at the ISA 2006 load balancing stuff. It's got some intelligence in it to determine that the web service isn't responding properly.
At one point I was trying to build some intelligence in a Web Sites and Services script that would force NLB to stop services, recycle IIS, and come back up after checking itself. I never got past the Visio diagram stage, but did share it with the NLB Windows guys. I'm attaching my Visio "SharePoint Uptime" logic from a past life. In this world where you are the "person of the year," maybe you can build this and share it for the rest of us.
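If you want a starting point, here's a minimal sketch in Python of the kind of decision loop described above. To be clear, this is not the logic from the Visio diagram. It's one guess at a reasonable shape, and every name in it (NodeState, next_action, the action labels, the threshold of three failures) is made up for illustration. It only decides what to do. Wiring the actions to real commands (something like `wlbs drainstop`, `iisreset`, and `wlbs start`, assuming those fit your build) is left to you.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """What the script remembers about one node between health checks."""
    in_rotation: bool = True
    consecutive_failures: int = 0


def next_action(state, healthy, failure_threshold=3):
    """Update state from one health check result and return an action label."""
    if healthy:
        state.consecutive_failures = 0
        if not state.in_rotation:
            # The node checked out fine after a recycle: put it back.
            state.in_rotation = True
            return "rejoin"
        return "none"

    state.consecutive_failures += 1
    if not state.in_rotation:
        # Already drained and still unhealthy: try recycling IIS again.
        return "recycle"
    if state.consecutive_failures >= failure_threshold:
        # Too many failures in a row: take the node out, then recycle IIS.
        state.in_rotation = False
        return "drain_and_recycle"
    # A blip below the threshold: leave the node alone for now.
    return "none"


if __name__ == "__main__":
    state = NodeState()
    for result in [True, False, False, False, False, True]:
        print(result, "->", next_action(state, result))
```

Requiring several failures in a row before draining keeps a single slow response from pulling a node out of the cluster.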
As far as good "how to" resources for configuring or planning NLB, here are some references:
The documentation on NLB is in development. There are references to it being included in the WSS deployment guide.