In part 1, I described the load balancing that occurs among Query Components (i.e. SSA level load balancing). In this post, I'm expanding the focus to the load balancing involved when there are multiple instances of the SQ&SS (i.e. Farm level load balancing) and, building on this, I'll begin to explain why the "Internal Server Error Exception Occurred" failure is intentionally ambiguous.
I'll then wrap up this series in a third installment that builds upon this foundation to describe troubleshooting tactics for Query failures. As noted before, this series of posts will focus on SharePoint 2010, but I'm already working on a similar post that will focus on SharePoint 2013.
Search Query & Site Settings (SQ&SS) Revisited
Let's start with a quick review of the load balancing in SharePoint 2010 within Search. In SharePoint 2010, the SQ&SS acts as the Query Processor for the SSA by:
When a user issues a query (e.g. http://loadBalancedURL/results.aspx?k=foo), the WFE (more specifically, the SP Web Application) has no idea how to process a Search query. However, the Web App does know how to talk to an SSA's WCF Service EndPoint (implemented by the SQ&SS) defined in the [default] Service Connection (aka “Service App Proxy”). After the results are processed, the SQ&SS returns the result set as XML back to the Search Web Parts, where they are then rendered.
Primer on WCF EndPoints and SharePoint Service Applications
When it comes to explaining the SharePoint Service Application architecture (an essential aspect for understanding Farm level load balancing), no one has explained it better than Spencer Harbar in his post:
"When you start a service machine instance for which there is an associated Service Application, an IIS Virtual Application will be created within the SharePoint Web Services IIS Web site. This will include the Service Application Endpoint (a WCF or ASMX). Each service application must expose a service application endpoint. The service application endpoint is only created on the machine(s) hosting the service machine instance."
In short, the WCF EndPoint is the point of interaction between a client (in this case, the WFE) and the application being consumed (in this case, the SSA).
Reference: WCF Endpoints: Addresses, Bindings, and Contracts "All communication with a Windows Communication Foundation (WCF) service occurs through the endpoints of the service. Endpoints provide clients access to the functionality offered by a WCF service. Each endpoint consists of four properties: an address that indicates where the endpoint can be found, a binding that specifies how a client can communicate with the endpoint, a contract that identifies the operations available, and a set of behaviors that specify local implementation details of the endpoint."
For an overly simplistic analogy for WCF, think of a company [that represents an application]. If you [the client] wanted to talk to a live person, you would have to call 555-555-1234 [the "address"] on the telephone [the binding] and navigate through the automated message tree [the contract]. Further, this company may have multiple phone numbers (555-555-4321, 555-555-6789), but each phone number gets you to the same company (e.g. multiple endpoint addresses to the same application).
Similarly, SharePoint Service Applications can have multiple WCF EndPoints as well. For example, the WCF EndPoint for Search is structured as http://[someServerName]:32843/-ssa-guid-/SearchService.svc (where someServerName is the server hosting this service EndPoint) and is provided when the SQ&SS is started on a SharePoint server. For example, in my farm where the SQ&SS is started on two servers, the two EndPoints are http://initech:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc and http://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc.
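As a small illustration of that address structure, here is a minimal Python sketch that builds the EndPoint URL for each server running the SQ&SS. The function name search_endpoint is hypothetical (it is not part of any SharePoint API); the port, GUID, and server names are the ones from my farm above.

```python
def search_endpoint(server: str, ssa_id: str, port: int = 32843) -> str:
    """Build the Search WCF EndPoint address for one SQ&SS server.

    Hypothetical helper for illustration only; it just mirrors the
    http://[someServerName]:32843/-ssa-guid-/SearchService.svc pattern.
    """
    return f"http://{server}:{port}/{ssa_id}/SearchService.svc"


SSA_ID = "066239ec05e347a88106bde8749f8cc9"  # the SSA's id in my farm
SQSS_SERVERS = ["initech", "swingline"]      # servers where the SQ&SS is started

endpoints = [search_endpoint(server, SSA_ID) for server in SQSS_SERVERS]

for address in endpoints:
    print(address)
```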
At the SharePoint Farm level, the Application Load Balancing Service Application (the “Farm Topology” service) keeps track of the state of each Service Application's WCF EndPoint(s) and helps the WFEs load balance across each Service Connection (aka “Service App Proxy”). For all SharePoint Service Applications, a Web Application Service Connection is essentially a reference to the particular Service Application's WCF EndPoint(s).
Reference: "How I Learned to Stop Worrying and Love the SharePoint Topology Service" provides a great deep dive into the SharePoint Topology load balancing
Multiple SQ&SS in the Farm
Let's now extend our original diagram to illustrate multiple SQ&SS components. To keep it simple, I'm intentionally showing just one Query Component to focus specifically on the SharePoint load balancing (see part 1 for how an SQ&SS load balances at the SSA level when there are multiple Index Partitions and/or multiple Mirrors per Partition):
In this example, the WFE would send the query to one instance of the SQ&SS via WCF SOAP requests. If a user then submitted a second query, the WFE would then round-robin this second query to the next instance of the SQ&SS such as:
To follow this in ULS (Hint: This is the BEST way to start troubleshooting query failures):
12/03/2013 09:06:58.21 w3wp.exe (0x2958) 0x2AB4 SharePoint Foundation Topology e5mc Medium WcfSendRequest: RemoteAddress: 'http://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc' Channel: 'Microsoft.Office.Server.Search.Administration.ISearchServiceApplication' Action: 'http://tempuri.org/ISearchQueryServiceApplication/Execute' MessageId: 'urn:uuid:eb7296b7-ef26-4f7d-87f7-dceb6251d2f5' f41cb190-945e-458e-b924-77ec2fd066d4
This ULS entry above (to reiterate, from the WFE server) provides a few key pieces of data:
From here, go to the ULS on the "swingline" server and filter by the same Correlation Id (in this case "f41cb190-945e-458e-b924-77ec2fd066d4") to see the corresponding "WcfReceiveRequest", which is the acknowledgement in ULS that the request has been received at the specified "LocalAddress":
10/03/2013 09:06:58.48 w3wp.exe (0x2A8C) 0x130C SharePoint Foundation Topology e5mb Medium WcfReceiveRequest: LocalAddress: 'http://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc' Channel: 'System.ServiceModel.Channels.ServiceChannel' Action: 'http://tempuri.org/ISearchQueryServiceApplication/Execute' MessageId: 'urn:uuid:eb7296b7-ef26-4f7d-87f7-dceb6251d2f5' f41cb190-945e-458e-b924-77ec2fd066d4
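If you'd rather pull these fields out with a script than by eye, the following is a rough Python sketch. It assumes the entry is available as a single line of text shaped like the ones above (real ULS files are tab-delimited, and your columns may differ), and the function name parse_wcf_entry is my own, not anything from SharePoint.

```python
import re
from urllib.parse import urlparse

# Quoted key/value pairs inside the WcfSendRequest / WcfReceiveRequest message.
FIELD = re.compile(r"(RemoteAddress|LocalAddress|Channel|Action|MessageId): '([^']*)'")
# Assumption: the Correlation Id is the GUID at the very end of the line.
CORRELATION = re.compile(r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\s*$")
EVENT = re.compile(r"\b(WcfSendRequest|WcfReceiveRequest)\b")


def parse_wcf_entry(line):
    """Return the interesting fields of a Wcf*Request ULS line, or None."""
    event = EVENT.search(line)
    if event is None:
        return None
    entry = {"Event": event.group(1)}
    entry.update(dict(FIELD.findall(line)))
    correlation = CORRELATION.search(line)
    entry["CorrelationId"] = correlation.group(1).lower() if correlation else None
    return entry


send_line = (
    "12/03/2013 09:06:58.21 w3wp.exe (0x2958) 0x2AB4 SharePoint Foundation Topology e5mc Medium "
    "WcfSendRequest: RemoteAddress: 'http://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc' "
    "Channel: 'Microsoft.Office.Server.Search.Administration.ISearchServiceApplication' "
    "Action: 'http://tempuri.org/ISearchQueryServiceApplication/Execute' "
    "MessageId: 'urn:uuid:eb7296b7-ef26-4f7d-87f7-dceb6251d2f5' f41cb190-945e-458e-b924-77ec2fd066d4"
)

entry = parse_wcf_entry(send_line)
target_server = urlparse(entry["RemoteAddress"]).hostname  # which SQ&SS server to check next
print(target_server, entry["CorrelationId"])
```

The server name and Correlation Id that come out of this are exactly what you need in order to know which server's ULS to open next and what to filter by.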
Troubleshooting: the Basics
Although I'm going to defer troubleshooting strategies to a subsequent blog post, I will note that you should then find all of the related query processing activity (e.g. Tripoli connections to the various Query Component(s), queries to the Property Store, merging/sorting results, security trimming, and removing [near] duplicates), and these will all maintain the same Correlation Id that we found on the WFE. In other words, this Correlation Id will span from server-to-server for all events related to this query (Russ Maxwell's post does a great job walking through a query step-by-step).
Also, if you can't find the corresponding WcfReceiveRequest on the SQ&SS server (in this case, "swingline"), then the request from the WFE to the SQ&SS server was most likely lost in-flight (proxy issues, network failures, and/or a missing SharePoint Web Services IIS Web site are the most common causes that I've seen).
In other words, if the request were to fail within SharePoint per se, then you would see the WcfReceiveRequest and then some failure. However, if the "WcfReceiveRequest" is missing, then the SharePoint/SQ&SS never received the request.
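That decision can be written down as a tiny Python sketch. It assumes you have already collected the relevant ULS lines from the WFE and from the SQ&SS server as plain strings; the function name triage and its return labels are purely illustrative.

```python
def triage(correlation_id, wfe_lines, sqss_lines):
    """Classify a query by which Wcf* entries exist for its Correlation Id.

    Illustrative only: substring matching on ULS text, nothing more.
    """
    cid = correlation_id.lower()

    def has(lines, marker):
        return any(marker in line and cid in line.lower() for line in lines)

    if not has(wfe_lines, "WcfSendRequest"):
        return "not-sent"        # wrong WFE, or wrong Correlation Id
    if not has(sqss_lines, "WcfReceiveRequest"):
        return "lost-in-flight"  # proxy, network, or missing IIS Web site
    return "received"            # any failure is further downstream in SharePoint


cid = "f41cb190-945e-458e-b924-77ec2fd066d4"
wfe = ["... e5mc Medium WcfSendRequest: RemoteAddress: '...' " + cid]
sqss = ["... e5mb Medium WcfReceiveRequest: LocalAddress: '...' " + cid]

print(triage(cid, wfe, sqss))  # received
print(triage(cid, wfe, []))    # lost-in-flight
```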
And finally, if you happen to receive the completely generic "Internal Server Error Exception Occurred" with a query, this simply means that some error occurred when the WFE sent the request to the SSA, whether in-flight or somewhere further downstream within SharePoint. I've seen countless blog posts, forum threads, and so on suggesting fixes for this error. However, this error is completely ambiguous and not intended to indicate what error occurred... only that some error occurred.
That being said, with the error being ambiguous by nature, there is no single fix when this message occurs. If you receive this message, use the corresponding Correlation Id to find the request on the applicable WFE and step through the flow of events from the WFE-to-SQ&SS to see what actual error occurred.
In Summary - Three Layers of Load Balancing
As I noted in part 1, seemingly “sporadic” query problems are often just straightforward failures being masked by the three levels of load balancing involved with a SharePoint 2010 Search Query. My goal here has been to help unravel all the moving pieces.
When a user submits a query (e.g. to the Enterprise Search Center), the request is typically first load balanced by an NLB to a particular WFE. The WFE then sends the request to the SSA's WCF Service EndPoint (implemented by the SQ&SS), but it is the SharePoint Farm topology load balancing that determines which WCF EndPoint the WFE will contact - and subsequent requests will be round-robin'd (this isn't a word, but I'm going with it) to the next SQ&SS EndPoint.
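To make the Farm level round-robin concrete, here is a deliberately simplified Python sketch using the two EndPoints from my farm. It only models the rotation; the real Topology service also tracks the state of each EndPoint, which this ignores, so treat it as an illustration of the behavior rather than of SharePoint's implementation.

```python
from itertools import cycle

# The two SQ&SS EndPoints from the example farm.
endpoints = [
    "http://initech:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc",
    "http://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc",
]

# Each query from the WFE takes the next EndPoint in the rotation.
rotation = cycle(endpoints)
picks = [next(rotation) for _ in range(3)]

for number, address in enumerate(picks, start=1):
    print(f"query {number} -> {address}")
```

This is also why a failure on just one SQ&SS server can look "sporadic" from the user's side: with two EndPoints in the rotation, only every other query lands on the broken one.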
Once an SQ&SS receives the request from the WFE, the SQ&SS needs to reach out to one Query Component for each Index Partition (part 1 goes deeper into load balancing at the SSA level). If an Index Partition has multiple Query Components (as in the earlier example, where Partition1 => QC1a,QC1b and Partition2 => QC2a,QC2b), then the SQ&SS will round-robin its requests to one QC mirror from each Partition. More specifically, in the first query, the SQ&SS may reach out to QC1a and QC2b. Then, in a second query, the SQ&SS cycles to QC1b and QC2a. In the third query, the SQ&SS again returns to QC1a and QC2b.
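The same rotation at the SSA level can be sketched in a few lines of Python. The names QC1a, QC1b, QC2a, and QC2b come from the example above; the starting position within each Partition is an assumption I picked so that the output matches the sequence just described, and the function pick_mirrors is hypothetical.

```python
# One list of mirrors per Index Partition. Partition2 is listed starting at
# QC2b purely so the rotation reproduces the sequence described in the text.
partitions = {
    "Partition1": ["QC1a", "QC1b"],
    "Partition2": ["QC2b", "QC2a"],
}


def pick_mirrors(query_number):
    """Pick one Query Component per Partition for the Nth query (0-based)."""
    return {
        name: mirrors[query_number % len(mirrors)]
        for name, mirrors in partitions.items()
    }


for n in range(3):
    print(f"query {n + 1}: {pick_mirrors(n)}")
```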
In upcoming posts, I plan to dive deeper into troubleshooting Query failures as well as write a related post that focuses on SharePoint 2013.