• SQL Server Cast

    Kerberos Error Configuring SQL Server Linked Server for Delegation

    Today I want to post about my findings on a case involving a linked server between two SQL Server servers. Before anything else, I want to thank my colleague Ignacio Alonso Portillo, who was kind enough to help me not only with this support case but with many others too.

    My customer called Microsoft Support looking for help to configure a linked server using a three-tier client/server model, where a client connects to an intermediate server first and the client credentials are forwarded from the intermediate server to the backend server. This is commonly known as a “double-hop” scenario; the following diagram illustrates this situation:

    [Image: double-hop scenario diagram]

    The process that takes care of forwarding the client credentials from the intermediate server (“SQL Server 1” in our example) to the back-end server (“SQL Server 2”) is called delegation.

    In our scenario the client was trying to access the remote SQL Server (“SQL Server 2”) by means of a linked server connection on the intermediate (“SQL Server 1”) box. Testing the linked server was consistently failing with the following error:

    ===================================
    "The test connection to the linked server failed."
    ===================================
    An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
    ------------------------------
    Program Location:
    at Microsoft.SqlServer.Management.Common.ServerConnection.ExecuteNonQuery(String sqlCommand, ExecutionTypes executionType)
    at Microsoft.SqlServer.Management.Smo.ExecutionManager.ExecuteNonQuery(String cmd)
    at Microsoft.SqlServer.Management.Smo.LinkedServer.TestConnection()
    at Microsoft.SqlServer.Management.UI.VSIntegration.ObjectExplorer.LinkedServerConnectionTest.Invoke()
    ===================================
    TCP Provider: An existing connection was forcibly closed by the remote host.
    Login failed for user '(null)'. Reason: Not associated with a trusted SQL Server connection.
    OLE DB provider "SQLNCLI" for linked server "SRVSQL2000" returned message "Communication link failure". (.Net SqlClient Data Provider)
    ------------------------------
    Server Name: SRVSQL2005
    Error Number: 10054
    Severity: 16
    State: 1

    This is how the Security properties of this Linked server were configured:

    [Image: Security properties dialog of the linked server]

    In this case we are allowing the user CONTOSO\SQLADMIN to connect to the remote server using the credentials given to the local one. If any other user tries to connect to the remote server by means of this linked connection, access will be denied (hence the “Not be made” option). Pay attention to this security dialog; it can be a little tricky. As the “Local Login” we can specify either a SQL Server login or a Windows account; however, the “Remote User” cannot be a Windows user, only a SQL Server user. Selecting the “Impersonate” option disables the “Remote User” and “Remote Password” fields, since we are already using the credentials given to the local server to impersonate (i.e., authenticate) the user on the remote server. The configuration in this dialog box was not responsible for the error we were hitting, but I am showing it here to illustrate how the dialog works in SQL Server Management Studio.

    The “Login failed for user '(null)'” error is typically raised when Kerberos is not being used, or when Kerberos has been configured but delegation is not working. For this setup to work, Kerberos has to be used, because NTLM authentication cannot handle a double-hop scenario. We made sure the Kerberos requirements were met in our environment; a summary of these requirements can be found in the following diagram:

    [Image: Kerberos delegation requirements diagram]

    We had followed all the steps outlined in this MSDN entry to configure delegation, but the error was still the same, so we decided to capture network traces at the intermediate server while reproducing the error. After loading the traces in Network Monitor and applying a filter (use the "KerberosV5" keyword as the filter in Network Monitor), we found the following network traffic:

    452   2010-11-16 12:22:59.334000    10.64.100.37      10.64.101.218     KRB5  TGS-REQ
    453   2010-11-16 12:22:59.334000    10.64.101.218     10.64.100.37      KRB5  KRB Error: KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN
    537   2010-11-16 12:23:01.599625    10.64.112.104     10.64.100.37      DCERPC      Bind: call_id: 70 IOXIDResolver V0.0
    539   2010-11-16 12:23:01.599625    10.64.100.37      10.64.112.104     DCERPC      Bind_ack: call_id: 70 accept max_xmit: 5840 max_recv: 5840
    540   2010-11-16 12:23:01.599625    10.64.112.104     10.64.100.37      DCERPC      Alter_context: call_id: 70 IOXIDResolver V0.0

    The two typical reasons why you would get a KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN error, provided DNS registration is working fine and forward/reverse name resolution returns correct information, are:

    1. The SPN is simply not registered for the account under which the SQL Server service is running.
    2. The SPN is registered for the account under which SQL Server is running, but it is also registered under a different account (i.e., duplicate SPNs).

    We had already confirmed that the correct SPNs existed for the SQL Server service account, so we then needed to search for duplicate SPNs. One way to find these duplicate entries is the setspn tool included in Windows Server 2008 and later, or the updated setspn version for Windows Server 2003. Personally, I prefer to use the not-so-intuitive Windows ldifde utility. Running ldifde.exe /? will show you the syntax required to output a list of SPNs for your domain. In the following example we use ldifde to return only those SPNs that have been registered for the SQL Server service:

    LDIFDE -f filename.txt -t 3268 -d "DC=contoso,DC=com" -l serviceprincipalname -r (serviceprincipalname=MSSQLSvc/*) -p subtree

    Here was the output for this instruction in our case:

    dn: CN=SQLServer,OU=Contoso Users,OU=Managed Accounts,DC=contoso,DC=lab
    changetype: add
    servicePrincipalName: MSSQLSvc/SRVSQL2005.CONTOSO.lab:1433
    servicePrincipalName: MSSQLSvc/SRVSQL2005.CONTOSO.lab
    servicePrincipalName: MSSQLSvc/SRVSQL2005:1433
    servicePrincipalName: MSSQLSvc/SRVSQL2005
    servicePrincipalName: MSSQLSvc/newhost10.fabrikam.lan:1433
    servicePrincipalName: MSSQLSvc/fabrikam.intranet.lan:1433
    servicePrincipalName: MSSQLSvc/fabrikam.intranet.lan
    servicePrincipalName: MSSQLSvc/newhost05.fabrikam.lan:1433
    servicePrincipalName: MSSQLSvc/newhost05.fabrikam.lan

    dn: CN=SRVSQL2005,OU=Cluster,OU=Servers,OU=Managed Computers,DC=contoso,DC=lab
    changetype: add
    servicePrincipalName: MSSqlSvc/SRVSQL2005.CONTOSO.lab:1433
    servicePrincipalName: MSSQLSvc/SRVSQL2005:1433
    servicePrincipalName: MSClusterVirtualServer/SRVSQL2005.CONTOSO.lab
    servicePrincipalName: MSClusterVirtualServer/SRVSQL2005
    servicePrincipalName: HOST/SRVSQL2005.CONTOSO.lab
    servicePrincipalName: HOST/SRVSQL2005

    As you can see, the SPN required for our SQL Server 2005 server existed not only for the SQL Server service account running on our server but also for another account, resulting in a duplicate SPN on the network. We removed the duplicate SPN using setspn -D.
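    The check we did by eye can be automated. The following Python sketch (my own, not part of the original case) parses an ldifde export like the one above and reports any SPN registered under more than one account. Note that SPNs are case-insensitive, which is exactly why the MSSqlSvc/MSSQLSvc pair above counts as a duplicate:

```python
from collections import defaultdict

def find_duplicate_spns(ldif_text):
    """Group each servicePrincipalName by the account (dn) it is registered
    under and return only the SPNs that appear under more than one account."""
    spn_owners = defaultdict(set)
    current_dn = None
    for line in ldif_text.splitlines():
        line = line.strip()
        if line.lower().startswith("dn:"):
            current_dn = line[3:].strip()
        elif line.lower().startswith("serviceprincipalname:") and current_dn:
            # SPNs are case-insensitive, so normalize before comparing
            spn = line.split(":", 1)[1].strip().lower()
            spn_owners[spn].add(current_dn)
    return {spn: owners for spn, owners in spn_owners.items()
            if len(owners) > 1}
```

    Feeding it the filename.txt produced by the LDIFDE command above would report MSSQLSvc/SRVSQL2005.CONTOSO.lab:1433 and MSSQLSvc/SRVSQL2005:1433, the two SPNs registered under both accounts.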

    The following MSDN entry lists some of the actions that can lead to duplicate SPNs on the network:

    There are several scenarios where an administrator can duplicate the service-principal-names (SPN) in the domain directory that will cause Kerberos authentication to fail. These include the following:

    - Making changes to the domain account under which the instance of SQL Server runs
    If SetSpn.exe is run while running an instance of SQL Server as one domain account, such as DOMAIN\User1, and then the domain account that is used to run SQL Server is changed, such as DOMAIN\User2, when SetSPN.exe is run again, it will cause the same SPN to be inserted in the directory under both accounts.

    - Installing multiple instances of SQL Server that run under different accounts
    If you install multiple instances of SQL Server and then run each instance under a different account, if SetSpn.exe is run on each instance, there will be duplicate accounts in the directory under each SQL Server service account. This applies for both instances that are running under a domain user and also the local System account.

    - Removing and reinstalling an instance of SQL Server under a different account
    If you install SQL Server under one account, register the SPNs, remove and reinstall SQL Server under a different account, and then reregister the SPNs, each domain account will have the same SPNs. This means the SPNs will be duplicated.

    Bear in mind that Kerberos tickets are cached for a certain amount of time, so the Kerberos ticket originally provided by the KDC, although invalid, remains in the cache. If we test the linked server connection right after fixing the duplicate SPN, we will still be unable to connect because of these cached tickets. The Windows Resource Kit contains the Kerbtray utility, which we can use to purge cached Kerberos tickets. The tool has to be executed via "run as" (from the Windows GUI or a Command Prompt) using the SQL Server service account; otherwise we would be deleting the tickets for our current account and not for the SQL Server one.


    “Cannot open the datafile <path_to_file>” Error Accessing a Flat File in SSIS Using a Proxy Account


    I found this error on a Windows Server 2008 and SQL Server 2008 installation, but as you will see, the error is not tied to a specific SQL Server or Windows version.

    In this case my customer was trying to run a SQL Server Agent job with 17 different steps. These steps were of different types, like running T-SQL code, ActiveX scripts, Operating System (CmdExec) commands, etc. The customer was looking to run all these job steps using a proxy account, which is a way to restrict the security context in which a job step runs in SQL Server; here you can read the proxy account definition from MSDN:

    SQL Server Agent lets the database administrator run each job step in a security context that has only the permissions required to perform that job step, which is determined by a SQL Server Agent proxy. To set the permissions for a particular job step, you create a proxy that has the required permissions and then assign that proxy to the job step. A proxy can be specified for more than one job step. For job steps that require the same permissions, you use the same proxy.

    When using the proxy account, two different job steps were failing with the following error:

    The error when running SSIS-step (11,12) using a proxy is:
    10.0.2531.0 for 64-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    Started:  15:27:31  Error: 2010-10-16 15:27:32.38     Code: 0xC020200E     Source: Copy Data from AgentSource to AgentDestination Task Flat File Source [1]     Description: Cannot open the datafile "V:\MySharedFolder\MyTextFile.txt".  End Error  Error: 2010-10-16 15:27:32.38     Code: 0xC004701A     Source: Copy Data from AgentSource to AgentDestination Task SSIS.Pipeline     Description: component "Flat File Source" (1) failed the pre-execute phase and returned error code 0xC020200E.  End Error  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  15:27:31  Finished: 15:27:32  Elapsed:  0.687 seconds.  The package execution failed.  The step failed.

    In this case the job owner was the ‘sa’ account, so we first tried to avoid the error by changing the job owner from ‘sa’ to the proxy account, but this did not fix the problem. During the troubleshooting process I worked in a remote session on the customer’s server to try to determine where the error was coming from; during that session I noticed that the V: drive the job was pointing to did not exist on the server. The customer explained to me that the drive mapping was taking place during the first job step.

    The mapping of the V: drive was taking place in the security context of the proxy account, so we first checked that this account had the correct xp_cmdshell permissions, as described in this MSDN entry:

    When it is called by a user that is not a member of the sysadmin fixed server role, xp_cmdshell connects to Windows by using the account name and password stored in the credential named ##xp_cmdshell_proxy_account##. If this proxy credential does not exist, xp_cmdshell will fail.
    The proxy account credential can be created by executing sp_xp_cmdshell_proxy_account. As arguments, this stored procedure takes a Windows user name and password. For example, the following command creates a proxy credential for Windows domain user SHIPPING\KobeR that has the Windows password sdfh%dkc93vcMt0.

    EXEC sp_xp_cmdshell_proxy_account 'SHIPPING\KobeR','sdfh%dkc93vcMt0'

    In our case the permissions required by the proxy account were correctly assigned, so we tried to map the network drive outside of the job execution, using our own session instead of the proxy account session, but the result and the error were exactly the same. Nevertheless, both the customer and I kept thinking that the problem was related to the mapping of this drive, so we created a new Agent job with the following single step to isolate the problem (abridged):

    net use v: \\server\share\ /user:username /persistent:yes
    c:\program files (…) dtexec.exe package.dtsx (…)
    dir v:\*.* > c:\out.txt

    This test worked. The difference here was that we were mapping the drive and running the package in the same job step, while the customer was mapping the drive in the first step and executing the SSIS tasks in the subsequent job steps. As explained in the KB180362 article, this was precisely the problem with the Agent job:

    When the system establishes a redirected drive, it is stored on a per-user basis. Only the user himself can manipulate the redirected drive. The system keeps track of redirected drives based on the user's Logon Security Identifier (SID). The Logon SID is a unique identifier for the user's Logon Session. A single user can have multiple, simultaneous logon sessions on the system.

    If a service is configured to run under a user account, the system will always create a new logon session for the user and then launch the service in that new logon session. Thus, the service cannot manipulate the drive mappings that are established within the user's other session(s).

    At the beginning of the same KB article you can also read:

    A service (or any process that is running in a different security context) that must access a remote resource should use the Universal Naming Convention (UNC) name to access the resource. UNC names do not suffer from the limitations described in this article.

    In our case the solution was as easy as replacing the mapping to the V: drive with the UNC path (i.e., \\servername\sharename). Another option in this scenario is to perform the mapping in the same job step as the one where the action takes place.
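    To illustrate the fix, here is a minimal Python sketch (my own; the server and share names are placeholders, not the customer's real ones) that rewrites a mapped-drive path into its UNC equivalent, which is the form a service running in a different logon session can still resolve:

```python
import ntpath

def to_unc(path, drive_map):
    """Rewrite a mapped-drive path into its UNC equivalent. drive_map maps
    drive letters to UNC roots and must be maintained by hand, since a
    service in another logon session cannot see our drive mappings."""
    drive, tail = ntpath.splitdrive(path)
    unc_root = drive_map.get(drive.upper())
    if unc_root is None:
        return path  # not a mapped drive we know about; leave it alone
    return unc_root.rstrip("\\") + tail

# \\server\share is a placeholder for the real share behind the V: drive
mapped = {"V:": r"\\server\share"}
print(to_unc(r"V:\MySharedFolder\MyTextFile.txt", mapped))
```

    Applying this kind of rewrite to the package's connection strings removes the dependency on the per-session drive letter altogether.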


    What are Sparse Files and Why Should I Care as SQL Server DBA?

    Last week I worked with one of my co-workers from the Windows Platform Support group on a backup-related issue with SQL Server data files. We both learned several things from that case, and I think it is worth the time to share this information with you.

    Our customer was using Microsoft Data Protection Manager to back up his Windows servers and SQL Server databases and was hitting several errors when copying the SQL Server databases. During our research, the customer found the following MSDN post with a detailed explanation of the problem: “Did your backup program/utility leave your SQL Server running in an squirrely scenario?” If you have never heard of sparse files and how they affect SQL Server, I highly recommend reading this post from the CSS SQL Server Engineers’ blog.

    What are sparse files?

    Sparse files are created during SQL Server operations such as database snapshots or DBCC CHECKDB statements. In the case of a snapshot, the sparse file is deleted when the snapshot is deleted, while in the case of a DBCC CHECKDB operation the sparse file is automatically deleted by SQL Server right after the execution. You can find extended information on the role of sparse files in SQL Server in the two following links:

    Internal Database Snapshot:

    DBCC CHECKDB uses an internal database snapshot for the transactional consistency needed to perform these checks. This prevents blocking and concurrency problems when these commands are executed. For more information, see Understanding Sparse File Sizes in Database Snapshots and the DBCC Internal Database Snapshot Usage section in DBCC (Transact-SQL).

    Understanding Sparse File Sizes in Database Snapshots:

    Sparse files are a feature of the NTFS file system. Initially, a sparse file contains no user data, and disk space for user data has not been allocated to it. For general information about the use of sparse files in database snapshots and how database snapshots grow, see How Database Snapshots Work.
    When first created, a sparse file takes up little disk space. As data is written to the sparse file, NTFS allocates disk space gradually. Potentially, a sparse file can grow very large. If a database snapshot runs out of space, it is marked as suspect, and it must be dropped. The source database, however, is not affected; actions on it continue normally.
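    If you want to see the concept in action, the following Python sketch (my own illustration; it uses a POSIX file system rather than NTFS, since Python’s os.stat exposes allocated blocks only there, and on Windows you would instead query the attribute with fsutil sparse queryflag) creates a file whose logical size is far larger than the space actually allocated:

```python
import os
import tempfile

# Create a 10 MB file that contains a single byte of real data; the file
# system can then allocate far less physical space than the logical size.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.seek(10 * 1024 * 1024 - 1)
    f.write(b"\0")
    path = f.name

st = os.stat(path)
logical_size = st.st_size            # what a directory listing reports
allocated = st.st_blocks * 512       # physical space actually allocated
print(logical_size, allocated)
os.unlink(path)
```

    The gap between the reported size and the allocated size is exactly what makes sparse files cheap for snapshots, and also what confuses tools that assume both numbers match.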

    How does this happen?

    Sparse files are a feature provided by the NTFS file system. Under some NTFS.SYS file versions, the sparse file bit becomes sticky for the main (parent) file. Here is one example:

    1. You run DBCC CHECKDB on your SQL Server database as part of your maintenance plan; it completes with no errors.
    2. You perform a backup of your databases.
    3. Your original database files are then marked as ‘sparse’.

    What is the problem?

    SQL Server does not support backups over sparse files, so you can put yourself in an unsupported scenario without even knowing it.

    How can I check if I am in this situation?

    Run the following query and look for values other than 0 (zero) in the ‘is_sparse’ column:

    USE <DatabaseName>;
    GO
    SELECT is_sparse, * FROM sys.database_files;

    The following example shows a database with several sparse files:

    [Image: sys.database_files output showing several sparse files]

    SQL Server 2008 R2 Best Practices Analyzer includes a new rule to detect this condition too; you can find this information in the Microsoft KB2028447 article.

    What can I do to avoid the problem from occurring?

    You should update the NTFS.SYS version in your Windows Servers as a first step. The latest NTFS.SYS versions prevent this problem from happening. Here is a list of the hotfixes and NTFS.SYS versions you should keep in mind:

    For Windows Server 2003:

    You cannot restore large files in the NTFS file system when all the data streams that have sparse attributes are deleted in the 64-bit version of Windows XP SP2 or in Windows Server 2003 SP2, http://support.microsoft.com/kb/973886

    Error message when you run the "chkdsk" command together with the "/v" switch on a Windows Server 2003-based computer: "Correcting sparse file record segment <number>", http://support.microsoft.com/default.aspx?scid=kb;en-us;932021

    For Windows Server 2008:

    Disk Manager reports incorrect disk usage a while after the operating system starts in Windows Server 2008 or in Windows Vista, http://support.microsoft.com/kb/981891

    For SQL Server 2005 and 2008:

    In addition to the abovementioned Windows hotfixes, make sure you have the latest SQL Server service packs and cumulative updates installed on both platforms to prevent this issue from happening.

    How can I fix the problem?

    Having your Windows and SQL Server boxes updated with the latest service packs and hotfixes will prevent the problem from happening again, but it will not fix the problem if it already exists in the database. The previously mentioned MSDN blog post explains how to fix this problem in SQL Server 2005 and 2008: “Did your backup program/utility leave your SQL Server running in an squirrely scenario?”

    Another option you can use to fix this problem is to empty the contents of the database file into another file in the same filegroup. You can achieve this by using the EMPTYFILE option of the DBCC SHRINKFILE T-SQL statement; here is an example:

    USE AdventureWorks2008R2;
    GO
    -- Create a data file and assume it contains data.
    ALTER DATABASE AdventureWorks2008R2
    ADD FILE (
        NAME = Test1data,
        FILENAME = 'C:\t1data.ndf',
        SIZE = 5MB
        );
    GO
    -- Empty the data file.
    DBCC SHRINKFILE (Test1data, EMPTYFILE);
    GO
    -- Remove the data file from the database.
    ALTER DATABASE AdventureWorks2008R2
    REMOVE FILE Test1data;
    GO

    This method cannot be used if the file with the ‘sparse’ sticky bit is the first file in the PRIMARY filegroup, because that file contains several system objects that cannot be moved.

    After following all these steps I still find my SQL Server data files marked as ‘sparse’!

    After following these steps you may find that the Windows fsutil utility reports the file as not sparse while SQL Server keeps reporting it as sparse via sys.database_files. If this is the case, make sure you detach and re-attach the database files; this is needed because the sys.database_files information has to be refreshed.


    “The IP Address 'x.x.x.x' is already in use. To continue, specify a different IP address” SQL Server 2008 cluster setup error


    Today I want to talk about my experience with one of the strangest setup errors I have seen installing SQL Server in a cluster. In this particular case I was helping a customer install a new SQL Server 2008 R2 instance in a three-node cluster. The Windows cluster was already running two instances, each one on a different cluster node, and the customer wanted to install this third instance on the third cluster node. The instances on the other cluster nodes had been running for a while with no issues.

    The customer was trying to run the installation on the third cluster node, but the setup was consistently failing with the following error:

    SQL Server Setup has encountered the following error: The IP Address ‘10.246.18.118’ is already in use. To continue, specify a different IP address. Error code 0x84B40000.

    [Image: SQL Server setup error dialog]

    The initial question, “is there any other machine with that IP address on the network?”, was quickly answered: neither PING nor NSLOOKUP showed any other host owning that IP address. As usual with setup problems, I looked into the SQL Server setup log files. The “Summary.txt” file reported the same error as the GUI:

    Exception type: Microsoft.SqlServer.Chainer.Infrastructure.InputSettingValidationException
        Message:
            The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address
    .

    The “Detail.txt” setup log file had more information. We were able to see that indeed the IP address 10.246.18.118 did not exist on the network during the initial setup phase:

    2010-11-12 14:53:09 Slp: IP Addresses have been specified so no defaults will be generated.
    2010-11-12 14:53:38 Slp: SendARP didn't return a MAC address for IP address '10.246.18.118'.  The message was 'The network name cannot be found.'.  This indicates the address is valid to create.

    […]

    2010-11-12 14:59:33 Slp: SendARP didn't return a MAC address for IP address '10.246.18.118'.  The message was 'The network name cannot be found.'.  This indicates the address is valid to create.

    … but all of a sudden, the ARP request succeeded in finding a valid host with that same IP address, causing the setup to halt:

    2010-11-12 15:00:25 Slp: SendARP for IP Address '10.246.18.118' succeeded.  The found MAC address is '00:00:5e:00:01:65'.  The IP address is already in use.  Pick another IP address to continue.
    2010-11-12 15:00:28 Slp: SendARP didn't return a MAC address for IP address '10.246.16.118'.  The message was 'The network name cannot be found.'.  This indicates the address is valid to create.
    2010-11-12 15:00:28 Slp: Hosting object: Microsoft.SqlServer.Configuration.ClusterConfiguration.ClusterIPAddressPrivateConfigObject failed validation
    2010-11-12 15:00:28 Slp: Validation for setting 'FAILOVERCLUSTERIPADDRESSES' failed. Error message: The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address.
    2010-11-12 15:00:28 Slp: Error: Action "Microsoft.SqlServer.Configuration.SetupExtension.ValidateFeatureSettingsAction" threw an exception during execution.
    2010-11-12 15:00:28 Slp: Microsoft.SqlServer.Setup.Chainer.Workflow.ActionExecutionException: The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address. ---> Microsoft.SqlServer.Chainer.Infrastructure.InputSettingValidationException: The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address. ---> Microsoft.SqlServer.Chainer.Infrastructure.InputSettingValidationException: The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address.
    2010-11-12 15:00:28 Slp:    --- End of inner exception stack trace ---

    […]

    2010-11-12 15:05:26 Slp: Error result: -2068578304
    2010-11-12 15:05:26 Slp: Result facility code: 1204
    2010-11-12 15:05:26 Slp: Result error code: 0

    In SQL Server 2008, a cluster installation is divided into two main phases: the first one copies the instance files onto the target machine and registers all the components, while the second one creates the clustered resources. In our case the setup was failing at the very end, where the cluster resources are created. As you can see, the “Detail.txt” file was also reporting the MAC address of the offending host.

    As you may know, ARP is a network protocol that resolves IP addresses to physical (MAC) addresses. This information is stored in memory in the ARP table. The ARP -a command displays the information stored by Windows in the ARP cache table. After uninstalling once more the components left behind by the failed setup, we did a quick ARP check based on what the setup was doing and found the following information:

    [Image: ARP -a output on the customer’s cluster node]

    The IP addresses 10.246.16.100 and 10.246.18.100 were the two virtual IP addresses that we were trying to configure on the cluster. I am far from being an expert in networking, but finding an IP address from the x.x.18.x/24 network under the x.x.16.x/24 interface was strange enough to make me think of a name resolution problem. The IP address and MAC address were in fact the same ones that were causing the setup to fail. We tried to PING and NSLOOKUP the same IP address again, but nothing came back.
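    This kind of sanity check can be scripted. The following Python sketch (my own, using the standard ipaddress module; the first MAC address is made up for illustration) flags ARP entries whose IP address does not belong to the subnet of the interface they were listed under, which is exactly what made the 10.246.18.118 entry look suspicious:

```python
import ipaddress

def out_of_subnet(interface_ip, prefix_len, arp_entries):
    """Return the ARP entries whose IP address is outside the subnet of
    the interface they were listed under."""
    network = ipaddress.ip_network(f"{interface_ip}/{prefix_len}", strict=False)
    return [(ip, mac) for ip, mac in arp_entries
            if ipaddress.ip_address(ip) not in network]

# Entries as seen under the 10.246.16.x interface; the first MAC is a
# placeholder, the second is the offending one from the failed setup.
entries = [
    ("10.246.16.100", "00:15:5d:aa:bb:01"),
    ("10.246.18.118", "00:00:5e:00:01:65"),
]
strays = out_of_subnet("10.246.16.100", 24, entries)
print(strays)
```

    Running something like this against the parsed output of ARP -a on each node would have surfaced the stray entry without eyeballing the cache table.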

    I double-checked with one of the SQL Server 2008 R2 clusters in my lab and found that every network address was correctly shown under its corresponding network interface:

    [Image: ARP -a output on a working lab cluster, with each address under its corresponding interface]

    Strangely enough, the “bogus” ARP entry was created as dynamic in the customer’s cluster, so we were expecting it to be removed from the ARP cache table after several seconds, but this was not happening. I was not sure what was announcing the IP address via ARP on the network; I can only think of a problem with the NIC teaming, but we ran out of time and were not able to test this hypothesis. After removing the offending ARP entry with ARP -d, we ran a new setup, which finished successfully this time. If you have had a similar experience with this error, or have any idea where that ARP entry could have come from, please let me know.
