I'm using forms or kerb auth and search/crawl (indexing) isn't working
<Updated Jan 10 with latest info from the testers.>
Got this question from a reader:
Question: The consultant that helped us develop the SharePoint architecture plan said that all web applications that have sites that need to be crawled need to use NTLM as the "default" authentication method and then can be extended using Kerberos in the Intranet zone. He said that this was because the crawler could not crawl a Kerberos site, only the NTLM address.
In my research to learn SharePoint I have not run across this requirement in any other location. No one else ever mentions it. Was the consultant off base? Or is it just not mentioned often?
Answer: The consultant was right, based on the information on TechNet. On Jan 9th the testers narrowed the issue down to custom ports, although this is still debated by some MVPs who attest that they have it working (more on that when those results are verified). If you use standard ports with Kerberos and set up your SPNs correctly, you can get indexing to work.
If you are testing Kerberos authentication on a single server where SQL Server, SQL Express, or the Windows Internal Database engine is installed locally, this hasn't been seen as an issue. More details will come from the team; in the coming months you should see more content on crawling and authentication as the docs on TechNet are updated.
There is also an issue that prevents the correct start address from being created if the default zone uses digest authentication. The timer still picks it as valid even though it is not. The workaround has been to make the default zone NTLM and have some other zone use digest.
These are the docs that are currently misleading and will be updated:
http://technet2.microsoft.com/Office/en-us/library/40117fda-70a0-4e3d-8cd3-0def768da16c1033.mspx?mfr=true
Start with the section titled: “Order in which the crawler accesses zones”
When planning the zones for a Web application, consider the polling order in which the crawler accesses zones when attempting to authenticate. The polling order is important, because if the crawler encounters a zone configured to use Kerberos or digest authentication, authentication fails and the crawler does not attempt to access the next zone in the polling order. If this occurs, the crawler will not crawl content on that Web application.
<etc>
Planning zones for your authentication design
If you plan to implement more than one authentication method for a Web application by using zones, use the following guidelines:
• Use the default zone to implement your most secure authentication settings. If a request cannot be associated with a specific zone, the authentication settings and other security policies of the default zone are applied. The default zone is the zone that is created when you initially create a Web application. Typically, the most secure authentication settings are designed for end-user access. Consequently, the default zone will likely be the zone that is accessed by end users.
• Use the minimum number of zones that is required by the application. Each zone is associated with a new IIS site and domain for accessing the Web application. Only add new access points when these are required.
• If you want content within the Web application to be included in search results, ensure that at least one zone is configured to use NTLM authentication. NTLM authentication is required by the index component to crawl content. Do not create a dedicated zone for the index component unless necessary.
Troy: Yeah, we explicitly documented that Search cannot crawl over Kerberos authentication.
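To make the documented behavior concrete, here is a minimal Python sketch of the polling rule quoted above. It is a toy model, not SharePoint code: the `can_crawl` function, the zone dictionary, and the polling order passed in are all assumptions for illustration. The excerpt above does not list the actual order, so check the TechNet article before relying on the one used here.

```python
# Toy model of the crawler behavior described in the TechNet excerpt.
# Nothing here calls SharePoint; names and the zone order are illustrative.

BLOCKING = {"kerberos", "digest"}  # per the excerpt, the crawler stops at these


def can_crawl(zones, polling_order):
    """Return the zone the crawler would use, or None if the crawl fails.

    zones: dict mapping zone name -> authentication method (lowercase)
    polling_order: zone names in the order the crawler tries them (assumed)
    """
    for name in polling_order:
        auth = zones.get(name)
        if auth is None:
            continue  # zone not configured on this web application
        if auth in BLOCKING:
            return None  # authentication fails; later zones are never tried
        if auth == "ntlm":
            return name  # the crawler can authenticate here
        # Other methods: the excerpt does not say, so this sketch moves on.
    return None


# Assumed polling order, for illustration only.
order = ["default", "intranet", "internet", "custom", "extranet"]

# Kerberos on the default zone blocks the crawl even though NTLM exists later.
print(can_crawl({"default": "kerberos", "intranet": "ntlm"}, order))  # None

# The consultant's layout: NTLM default, Kerberos on an extended zone.
print(can_crawl({"default": "ntlm", "intranet": "kerberos"}, order))  # default
```

Under this model, the consultant's advice (NTLM on the default zone, Kerberos on an extended zone) is the layout that lets the crawl succeed.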
Additional thoughts by Joel:
I can imagine people asking about the overhead of the second web app for the already existing web apps for content. My recommendation would be to consolidate the app pools for all of the secondary "crawling only" web apps into one and set it to shut down when idle. The overhead of those other web apps is really in the app pool and hence the worker processes, so by consolidating them you can quite quickly eliminate the additional overhead.
Not sure if it's clear anywhere, but I've seen people get burned by this in the past: make sure you always extend all web apps on all web front ends in the farm. The timer service should take care of it, but check just in case. Likewise, keep the app pool configuration consistent across your WFEs. If you're targeting your indexing at one WFE, or having your index server double as the WFE it crawls, then shutting down the crawl-only web app on the servers that aren't using it will save resources. Hope this additional tip is useful and clear.
Let me explain it one more time with an example to make sure it's clear. Let's say you've got a medium farm: 2 WFE/Query servers in load balancing, an Index server, and a SQL cluster. Let's say you're using forms authentication and create additional web apps for all the web apps you want to index. You make the Index server also a web front end. All of the web apps should be consistent across all 3 servers (2 load-balanced WFE/Query, 1 WFE/Index). On each of these you consolidate the crawl-only web apps to use a single app pool that is configured to shut down the worker process when idle on the WFEs. Hence, you don't have the overhead of worker processes that don't do anything. When you get worried about having lots of web apps, the concern should really be focused on the worker processes. That's where your resources, memory and CPU, are consumed.
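As a rough back-of-the-envelope check on the example above, the Python sketch below counts worker processes for the crawl-only web apps. It assumes one worker process per active app pool per server (no web gardens), and the number of crawl-only web apps (5) is made up for illustration; `worker_processes` is a hypothetical helper, not anything SharePoint or IIS provides.

```python
# Toy count of IIS worker processes used by crawl-only web apps.
# Assumes one worker process per active app pool per server.

def worker_processes(crawl_web_apps, servers, shared_pool):
    """Upper bound on worker processes for the crawl-only web apps."""
    if crawl_web_apps == 0:
        return 0
    pools_per_server = 1 if shared_pool else crawl_web_apps
    return pools_per_server * servers


# The example farm: 3 servers host the web apps (2 WFE/Query + 1 WFE/Index).
print(worker_processes(5, 3, shared_pool=False))  # 15
print(worker_processes(5, 3, shared_pool=True))   # 3
```

With idle shutdown configured, the consolidated figure is a ceiling rather than a constant: on servers the crawler never targets, the shared pool's worker process should not be running at all.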