Authentication Load - How many DCs for SharePoint?
First let me start off with what you can find for yourself if you dig. There's an interesting reference at the bottom of a TechNet capacity article: Additional performance and capacity planning factors (Office SharePoint Server)
"Domain controllers: It is possible for authentication to become a performance bottleneck in your SharePoint environment if the domain controller (DC) receives requests more quickly than it can respond. For environments using user authentication such as NTLM, we recommend a ratio of 3 WFEs per DC. If your tests indicate that the authentication load at 3 WFEs per DC is acceptable, you can add one more WFE per DC for a supported limit of 4 WFEs per DC."
My personal experience is... yes, DCs can be bottlenecks. With NTLM, the most common auth, I dare you to look at your security log. Crack open your IIS log: see all those 401s before you see the 200s? A 401 is the server asking the browser for credentials. If the client can provide them directly, as in Basic, the request is handled. With NTLM, a challenge/response exchange follows. See the KB articles below for how IIS authenticates browser clients. The result is a lot of requests and round trips to the DC. It was not uncommon for us to see over 1 million entries in the security log in a day for a single server, with 30,000 unique users hitting a node (WFE). More than once, in SPS 2003/WSS 2.0, we would get a blank page with just the chrome, and if you refreshed it you'd see an ASP.NET error, Unable to Authenticate. Some of those long delays we blamed on network latency or congestion were exactly that, but sometimes it was a DC that was simply overloaded from Exchange/SharePoint authentication traffic. Don't underestimate the DC as a bottleneck.
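If you want a quick number for how many of those 401s you are generating, here is a minimal Python sketch that tallies status codes from an IIS log. It assumes the log is in W3C extended format with a `#Fields:` header that includes `sc-status`; the function name `count_statuses` and the sample lines are made up for illustration, not taken from a real server.

```python
from collections import Counter
from typing import Iterable


def count_statuses(lines: Iterable[str]) -> Counter:
    """Tally HTTP status codes from W3C extended log lines.

    Assumes a '#Fields:' directive names the columns and that one of
    them is 'sc-status'. Lines that do not match the field count are skipped.
    """
    fields = []
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # Column layout can change mid-file, so re-read it each time.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or "sc-status" not in fields:
            continue
        parts = line.split()
        if len(parts) != len(fields):
            continue
        counts[parts[fields.index("sc-status")]] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE = [
    "#Fields: date time cs-method cs-uri-stem cs-username sc-status",
    "2007-05-21 10:00:00 GET /default.aspx - 401",
    "2007-05-21 10:00:00 GET /default.aspx - 401",
    "2007-05-21 10:00:00 GET /default.aspx CONTOSO\\alice 200",
    "2007-05-21 10:00:01 GET /_layouts/styles/core.css CONTOSO\\alice 200",
]

counts = count_statuses(SAMPLE)
print(counts["401"], counts["200"])
```

To run it against a real log, pass an open file object instead of `SAMPLE`. A high share of 401s tells you how chatty the handshake is; it does not by itself tell you how many of them reached a DC.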
To be honest, there was once a customer who noticed that their DCs' CPU was climbing and spiking while their SharePoint environment was experiencing perf issues. I asked them, have you tried Kerberos? They hadn't. They also hadn't anticipated DC load, or needing to purchase or beef up DCs, or even monitoring authentication traffic in relation to their SharePoint deployment. It's true, it is rare for anyone to say, oh, you're deploying SharePoint, how many DCs are you adding? Most companies have 2 different teams managing Exchange/AD vs. SharePoint. Maybe not so in your company? Either way, you may want to use netmon to sniff the traffic and see how many RTs (round trips) there are and what the delay looks like.
One other story for you. We managed a deployment in Europe, but our team was out of Redmond. We created a service account in our domain since that was easy for us to get. After weeks or even months of investigating perf issues with that farm, we were looking at netmon traffic and noticed that the auth traffic was coming back to Redmond! Amazing. No wonder we were having slow page render time and no wonder requests would queue up. When we changed the service account to a regional domain that had DCs actually in Europe... boom perf was quick. App pool memory stayed much lower... life was good again...
No matter whether you are running WSS 2.0/SPS 2003 or WSS 3.0/MOSS 2007, auth traffic matters. Make sure you consider it, and if you're doing backend LOB integration or FBA, ask yourself about perf and the dependencies in those situations. Obviously FBA will reduce round trips, but you still have dependencies for auth. NTLM vs. Kerberos... Kerberos has a larger header, so the individual packet will be larger, but if those sessions last 10 minutes, you'll be glad you chose Kerberos for the many pages, js, css, gifs, etc...
Word of warning: Don't discount the DC traffic! But don't stop a deployment if you didn't purchase a DC in a new environment. It's ok to watch it and keep it in the back of your head. There's obviously some capacity and your AD team should be planning for growth. Give them a heads up that you'll be using some perf for auth and potentially profile imports and work out the details with them. They may convince you to use Kerberos. :)
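To give your AD team a concrete starting number, the TechNet ratio quoted earlier (3 WFEs per DC, with 4 as the supported limit) can be turned into a quick calculation. This is only a sketch: the function name `dcs_needed` is mine, and it assumes the DCs in question serve nothing but SharePoint authentication, which is rarely true in practice.

```python
def dcs_needed(wfe_count: int, wfes_per_dc: int = 3) -> int:
    """Minimum DC count for a given number of WFEs.

    Uses the TechNet guidance of 3 WFEs per DC by default; 4 is the
    stated supported limit, so anything above 4 is rejected.
    """
    if wfe_count < 0:
        raise ValueError("wfe_count must not be negative")
    if not 1 <= wfes_per_dc <= 4:
        raise ValueError("wfes_per_dc must be between 1 and 4")
    # Ceiling division without floating point.
    return -(-wfe_count // wfes_per_dc)


print(dcs_needed(7))      # recommended ratio of 3 WFEs per DC
print(dcs_needed(8, 4))   # supported limit of 4 WFEs per DC
```

Treat the result as a conversation starter, not a purchase order: Exchange and other workloads on the same DCs will change the picture.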
<Update 5/21>
One of the key things I failed to mention here, which may have raised more questions than answers, was... how do I tell how much load I'm introducing? The other aspect is that, of course, not all 401s are as simple as I explained. Chris Gideon provides a more verbose example... (Thanks Chris!)
"When the browser makes a request, the first request is always Anonymous. Therefore, it does not send any credentials. This will generate a 401 and a list of the authentication types that are supported. No trip to the DC will occur in this case.
If Windows Integrated is the only supported method, the server sends Negotiate (Kerberos) first. If this fails, then the client falls back to NTLM. From a SharePoint perspective this is a little tricky. For WSSv2 and SPS2003, when a virtual server was extended, the NTAuthenticationProviders metabase key was set to NTLM. This was true until Service Pack 2, at which time Negotiate (Kerberos) became the default: the value was not set, so it inherited the default value of Negotiate (set in code). In WSSv3 and MOSS the value of the key is set to the value chosen during creation of the web app. But if you take the defaults, it's NTLM.
If the client tries Kerberos and it fails, there will be a 401.2 generated, the client will try again with NTLM, and this will succeed (depending on configuration). In this scenario you will get 2 trips to a DC: one for the 401.2 and one to validate the client for the NTLM login. If Kerberos is used, then there will be a trip to a DC for what we call PAC validation. This will use the secure channel established at boot by calling DsGetDC. There are several caches in WSSv2 and WSSv3 that make it difficult with Kerberos to pin down the exact number of minutes before the group membership sent in the Kerberos ticket (TGS) from the client is re-examined. The range is 10-15 minutes for group membership validations. However, you can avoid trips to the DC for the PAC validation (group membership) of the ticket by implementing the registry setting that became available with Service Pack 2 for Windows Server 2003. See: You experience a delay in the user-authentication process when you run a high-volume server program on a domain member in Windows 2000 or Windows Server 2003 http://support.microsoft.com/default.aspx?scid=kb;EN-US;906736 .
If Basic is the only supported method (or if Anonymous fails), then a dialog box appears in the browser to get the credentials. It attempts to send the credentials up to three times (each resulting in a trip to the DC and a 401). Basic makes a trip to the DC if we are using domain accounts. Basic is a transport (base64 encoded, in the clear) but is really NTLM behind the scenes after the password and user name arrive.
If both Basic and Windows Integrated are supported, the browser determines which method is used. If the browser supports Kerberos or NTLM, it uses this method. It does not fall back to Basic. If NTLM and Kerberos are not supported, the browser uses Basic, or Digest if it supports these. When Internet Explorer has established a connection with the server by using an authentication method other than Anonymous, it automatically passes the credentials for every new request during the duration of the session.
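The selection rules above can be sketched as a small Python function. This is a simplified model, not IE's actual logic: the name `choose_scheme` is hypothetical, the relative order of Digest and Basic is my assumption (the text does not state it), and the runtime fallback from a failed Kerberos attempt to NTLM is not modeled.

```python
from typing import Optional, Set

# Windows Integrated schemes, tried before anything else.
INTEGRATED = ("Negotiate", "NTLM")
# Fallback schemes; Digest-before-Basic is an assumption of this sketch.
FALLBACK = ("Digest", "Basic")


def choose_scheme(server_offers: Set[str], browser_supports: Set[str]) -> Optional[str]:
    """Pick the scheme the browser would use after the anonymous 401.

    If any Windows Integrated scheme is usable, it wins and there is no
    fallback to Basic. Otherwise the browser uses Digest or Basic if both
    sides support one of them. Returns None when nothing matches.
    """
    for scheme in INTEGRATED + FALLBACK:
        if scheme in server_offers and scheme in browser_supports:
            return scheme
    return None


print(choose_scheme({"Negotiate", "NTLM", "Basic"}, {"Negotiate", "NTLM", "Basic"}))
print(choose_scheme({"NTLM", "Basic"}, {"Basic"}))
```

The point of writing it down is the DC cost attached to each outcome: the scheme that gets picked determines how many of the round trips described above you pay for.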
As you can see from the information above, profiling your authentication volume is not an easy task, nor is there a single answer. The best methods employ a team of people from different disciplines. For example, a network trace taken in conjunction with Netlogon.log (this log does not get more granular than a second) will tell you how many authentications per second result in a DC request. It is important to control which DC is responding during your test by using NLTEST to force your secure channel to a particular DC. Then use Server Performance Advisor on the DC to profile the load. This will tell you where the bottlenecks are on your DC. You can also use Server Performance Advisor on the WFE to determine the overall render time for the client, and run tests with both NTLM and Kerberos to see which performs best in your environment.
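The per-second counting from Netlogon.log described above can be sketched in Python. This assumes each log line starts with an `MM/DD HH:MM:SS` timestamp followed by a `[LOGON]` tag, and that each logon request writes one line ending in `Entered`; check that against your own log before trusting the numbers. The function name and sample lines are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line shape: "MM/DD HH:MM:SS [LOGON] ... Entered"
LOGON_ENTERED = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[LOGON\] .*\bEntered$")


def logons_per_second(lines: Iterable[str]) -> Counter:
    """Count logon requests per one-second bucket.

    Only 'Entered' lines are counted so that the matching 'Returns'
    line for the same request is not counted twice.
    """
    per_second = Counter()
    for raw in lines:
        match = LOGON_ENTERED.match(raw.strip())
        if match:
            per_second[match.group(1)] += 1
    return per_second


# Made-up sample lines, for illustration only.
SAMPLE = [
    "05/21 10:15:32 [LOGON] SamLogon: Network logon of CONTOSO\\alice from WFE01 Entered",
    "05/21 10:15:32 [LOGON] SamLogon: Network logon of CONTOSO\\alice from WFE01 Returns 0x0",
    "05/21 10:15:32 [LOGON] SamLogon: Network logon of CONTOSO\\bob from WFE01 Entered",
    "05/21 10:15:33 [LOGON] SamLogon: Network logon of CONTOSO\\carol from WFE02 Entered",
    "05/21 10:15:33 [MISC] some other entry",
]

per_second = logons_per_second(SAMPLE)
peak = max(per_second.values())
print(per_second, peak)
```

Feed it the lines of the DC's Netlogon.log for the test window and compare the peak against what the network trace shows for the same seconds.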
To turn on Netlogon logging, see: Enabling debug logging for the Net Logon service http://support.microsoft.com/default.aspx?scid=kb;EN-US;109626 .
264921 How IIS authenticates browser clients
http://support.microsoft.com/default.aspx?scid=kb;EN-US;264921
907273 Troubleshooting HTTP 401 errors in IIS
http://support.microsoft.com/default.aspx?scid=kb;EN-US;907273
</update>