The customers issue was that they were trying to provision a Project site within the Project SharePoint Application. This was done via a PowerShell script that they ran on one of the SharePoint App Servers.

They had two SharePoint App Servers – AppServerA and AppServerB. They had indicated that the provisioning would fail on either App Server and it started failing around November of last year (4 months ago). The error that they would see when the failure occurred was the following from the SharePoint ULS Logs:

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        880i        High        System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        880j        High        SqlError: 'A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)' Source: '.Net SqlClient Data Provider' Number: 53 State: 0 Class: 20 Procedure: '' LineNumber: 0 Server: ''        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        tzkv        High        SqlCommand: 'CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END ' CommandType: Text CommandTimeout: 0        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Provisioning        6935        Critical        Error provisioning database. Script: C:\Program Files\Microsoft Office Servers\14.0\Sql\Project Server\Core\addqueue1timesheetsps12.sql, Line: 0, Error: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server), Line: CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END .        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.89        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Provisioning        6971        Critical        Failed to provision site /CMS with error: Microsoft.Office.Project.Server.Administration.ProvisionException: Failed to provision databases. ---> Microsoft.Office.Project.Server.Administration.ProvisionException: CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END ---> System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) at

One thing they had mentioned was that if they increased the Connection Timeout to 60 seconds, it would sometimes work. My thought process on this is that if connection timeout would sometimes allow it to work that we may have had a timeout when actually connecting to SQL Server, but that wasn’t the error.

Looking at the actual error we can draw some conclusions.

provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server

By default, we should be using TCP. If there is a serious error with that, we will use Named Pipes. The error Named Pipes got back was that we couldn’t open the connection. Not a timeout. Think of this as “SQL Server does not exist or access denied”. SQL Server in this case was also a default instance Cluster. Not a Named Instance, so SQL Browser was not coming into the picture. This is a straight shot to port 1433 via TCP.

Which machine was getting the error?

For troubleshooting, we need to consider which machines are involved. One thing that we noticed over the course of troubleshooting was that the error always occurred on AppServerB and we were always starting the script from AppServerA. If you think about how SharePoint works with its App Servers, when a service is running, you can have it started on individual App Servers and control the load.  The fact that we were always seeing the error on AppServerB led me to believe that the Project Application Server Service was only started on AppServerB and not AppServerA.  Looking in Central Admin, this was correct.  So, we want to concentrate data collection from AppServerB.

Network Traces

The first thing that was looked at was getting a network trace. We collected network traces from AppServerB and the SQL Server.  If we go back to error that was happening, we recall that we know that TCP was not working as expected and then Named Pipes was failing.  Named Pipes uses the SMB protocol to talk.  This will first reach out to TCP port 445.  We didn’t see any traffic in the Network trace going to that.  We also didn’t see any SMB traffic that was relevant to the error.  We only saw browser announcements which had nothing to do with us.  This tells me that we never hit the wire.  So, the network traces wouldn’t be helpful.

BIDTrace

Enter BIDTrace.  BIDTrace is really just diagnostic logging within our client providers and server SNI stack.  Think Event Tracing for Windows (ETW).  I’m not going to dive into how to set this up as it would take its own blog post.  You can read more about it in the following MSDN Page:

Data Access Tracing in SQL Server 2012
http://msdn.microsoft.com/en-us/library/hh880086.aspx

Typically I won’t go this route unless I know what I’m looking to get out of it.  It is actually pretty rare that I’ll jump to this.  In this particular case, it was an excellent case.  We have some evidence that we are not getting far enough to hit the wire, and we know we are getting an error when trying to make a connection to SQL.  So, what I’m looking for here is if there is some Windows Error that we are getting that wasn’t presented in the actual exception.

Here is the Logman command that I used to start the capture after getting the BIDTrace items configured.

Logman start MyTrace -pf ctrl.guid -ct perf -o Out%d.etl -mode NewFile -max 150 –ets

A few things I’ll point out with this comment.  The output file has a %d in it.  This is a format string because we will end up with multiple files.  -mode is used to tell it to create a new file after hitting the max size that is listed.  We then set –max to 150 which means that we want to cap the size of the file to 150MB in size.  I did this because when we first went for it with a single file, the ETL file was 300MB and when I went to convert it to text it was over 1GB in size.  That’s a lot to look through.  I also had troubles opening it.  So, I decided to break it up.  Of note, it took about 4-5 minutes to reproduce the issue.  That’s a long time to capture a BIDTrace.  When you go to capture a BIDTrace, it is better to get a small window to capture if you can.  These files fill up fast.

Here is the ctrl.guid that I used to capture.  This is effectively the event providers that I wanted to capture:

{8B98D3F2-3CC6-0B9C-6651-9649CCE5C752}  0x630ff  0   MSDADIAG.ETW
{914ABDE2-171E-C600-3348-C514171DE148}  0x630ff  0   System.Data.1
{C9996FA5-C06F-F20C-8A20-69B3BA392315}  0x630ff  0   System.Data.SNI.1

The capture will produce ETL files which are binary files.  You need to convert them after you are done.  I use TraceRPT to do this.  It is part of Windows.  Here is the command I used to output it to a CSV file to look at.

TraceRPT out5.etl –of CSV

In our case, it had generated 5 etl files – remember the %d?  So, we grabbed the last file that was produced which was out5.etl and converted it.  Although at first, I didn’t know it was out5.etl.  I actually started with out4.etl.  One problem is though is I didn’t have timestamps within the CSV output.  I had clock CPU time which is hard to visualize compared to an actual timestamp.

Enter Message Analyzer! Message Analyzer is a replacement for Network Monitor.  But it has another awesome ability in that it can open ETL files.  One other thing I had was the timestamp of the error from the SharePoint ULS Log on the attempt that we made when we captured the BIDTrace.

02/06/2014 13:14:20.55     OWSTIMER.EXE (0x2024)                       0x1C4C    Project Server                    Database                          880i    High        System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)     at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)     at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)     at System.Data.SqlClient.TdsParser.Connect(ServerInfo serverInfo, SqlInternalConnectionTds connHandler, Boolean ignoreSniOpenTimeout, Int64 timerExpire, Boolean encrypt, Boolean trustServerCert, Boolean integratedSec...    bc7aaa60-93fc-4873-8f75-416d802aa55b

02/06/2014 13:14:20.55     OWSTIMER.EXE (0x2024)                       0x1C4C    Project Server                    Provisioning                      6993    Critical    Provisioning '/Test3': Failed to provision databases. An exception occurred: CREATE PROCEDURE dbo.MSP_TimesheetQ_Get_Job_Count_Simple   @correlationID UID ,    @groupState int ,    @msgType int  AS BEGIN    IF @@TRANCOUNT > 0    BEGIN              RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1)       RETURN    END     SELECT COUNT(*) FROM dbo.MSP_QUEUE_TIMESHEET_GROUP        WHERE CORRELATION_UID = @correlationID       AND   GRP_QUEUE_STATE = @groupState       AND   GRP_QUEUE_MESSAGE_TYPE = @msgType END .    bc7aaa60-93fc-4873-8f75-416d802aa55b

Our issue occurred at 1:14:20.55 Server Time. We can also see the statement it was going to try and run.  If we open the ETL file within Message Analyzer, we can see the timestamps that are covered within the file. 

image

We can see that this went up to 12:14:04 local time.  We were looking for 12:14:20.55.  So, out4.etl was not the file I was looking for.  Which left Out5.etl.  Technically you can read the data within Message Analyzer as you can see from the lower right of the screenshot.  It’s unicode data, and we see l.e.a.v.e.  I still prefer the output from TraceRPT when going to CSV as I can get the readable text from that.  It is just a little easier to work with.

So, I have the CSV output from out5.etl, but what do we look for?  Well, we know the statement that it was trying to make, so lets look for that - MSP_TimesheetQ_Get_Job_Count_Simple. We get a hit and it looks like this:

System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213871,       7080,      21510,        2, "<sc.SqlCommand.set_CommandText|API> 4187832#, '"
System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213910,       7080,      21510,        2, "CREATE PROCEDURE dbo.MSP_TimesheetQ_Get_Job_Count_Simple   @correlationID UID ,    @groupState int ,    @msgType int  AS BEGIN    IF @@TRANCOUNT > 0    BEGIN              RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1)       RETURN    END     SELECT COUNT(*) FROM dbo.MSP_QUEUE_TIMESHEET_GROUP        WHERE CORRELATION_UID = @correlationID       AND   GRP_QUEUE_STATE = @groupState       AND   GRP_QUEUE_MESSAGE_TYPE = @msgType END "
System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213935,       7080,      21510,        2, "' "

Not the prettiest, but when looking in notepad or some other text reader, we can just go over to the right to get a better view.

image

The first time you look at this it can be a little overwhelming.  Especially if you aren’t familiar with how SNI/TDS works.  If we go through the results, we’ll see a few interesting things.

<prov.DbConnectionHelper.ConnectionString_Set|API> 4184523#, 'Data Source=<server>;Initial Catalog=<database>;Integrated Security=True;Pooling=False;Asynchronous Processing=False;Connect Timeout=15;Application Name="Microsoft Project Server"' "

<GetProtocolEnum|API|SNI>

<Tcp::FInit|API|SNI>

<Tcp::SocketOpenSync|API|SNI>

<Tcp::SocketOpenSync|RET|SNI> 10055{WINERR}

<Tcp::Open|ERR|SNI> ProviderNum: 7{ProviderNum}, SNIError: 0{SNIError}, NativeError: 10055{WINERR} <-- 10055 = WSAENOBUFS

<Np::FInit|RET|SNI> 0{WINERR}

<Np::OpenPipe|API|SNI> 212439#, szPipeName: '\\<server>\PIPE\sql\query', dwTimeout: 5000

<Np::OpenPipe|ERR|SNI> ProviderNum: 1{ProviderNum}, SNIError: 40{SNIError}, NativeError: 53{WINERR} <-- ERROR_BAD_NETPATH = network path was not found

We can get the Connection string, which was also available in the SharePoint ULS Log.  We will also see some entries around Protocol Enumeration.  This is where we look at the Client Registry items to see what Protocols we will go through and in what order (TCP, NP, LPC, etc…).  Then we see TCP trying to connect.  You’ll recall I mentioned that we try TCP first by default.  We then see that this received a Windows error of 10055 (WSAENOBUFS).  We then see Named Pipes fail with Error 53 which is ERROR_BAD_NETPATH.  We got what we were looking for out of the BIDTrace.

WSAENOBUFS is the key here.  It is a WinSock error which we actually have a KB Article on.

When you try to connect from TCP ports greater than 5000 you receive the error 'WSAENOBUFS (10055)'
http://support.microsoft.com/kb/196271

There is a registry key called MaxUserPort which can increase the number of dynamic ports that are available.  In Windows 2003, this was under 5000.  Starting in Windows 2008, this was increased as we use to see a lot of problems here.  Especially when connection pooling was not being used.  Here is the port range on my Windows 8.1 machine.

image

And for a Windows 2008 R2 Server, which the customer was using:

image

I have 64510 ports available.  On the customer’s machine, they had mentioned that for a previous issue, the engineer had asked them to add this registry key, and they set the value to 4999.  By setting it to 4999, we are effectively limiting the number of ports that would have otherwise been available.  If you look back at the connection string, you can see that Pooling was set to False.  This means we are turning off connection pooling, and every time we go to connect, we will establish a new hard connection.  This eats up a port.  You can look at NETSTAT to see what it looks like.  We did then when running the provisioning scripts and we saw it get up to around 3000 or so before it was done.  You will also see a lot of ports in a TIME_WAIT status.  When you disconnect and the port is released, it will go into a TIME_WAIT state for a set amount of time.  The default of which is around 4 minutes.  That’s 4 minutes you can’t use that port.  If you are opening and closing connections a lot, you will run out of ports because a lot will be in the TIME_WAIT state.  That’s typically when we would bump up the number of ports using the MaxUserPort registry key.  However, this is never really a fix, you are just putting  a bandaid on without understanding the problem.

End result…

In our case, Project Server was turning off connection pooling.  I don’t know why they are doing that, but that, in conjunction with the MaxUserPort being set to 4999, was causing this issue.  We had removed the MaxUserPort registry key and rebooted AppServerB, and it started working after that.  Of note, we had also started the Project Application Server on AppServerA and cleaned up the TCP registry keys on that machine as well so that they could effectively balance their load on the SharePoint App Server.

 

Adam W. Saxton | Microsoft SQL Server Escalation Services
http://twitter.com/awsaxton