• SQL Server Cast

    Changing Gears …


After 4 years working in Microsoft Support, I have decided to change gears and move into the PFE organization (Premier Field Engineer). The PFE team helps Microsoft Premier customers with proactive, rather than reactive, services, so chances are I will come across fewer troubleshooting scenarios from now on.

At this point I am not sure if I will find content in my new role to continue filling this board. I hope so :). For those of you who have followed this blog regularly, I hope the information and tips provided so far have been useful.

    Cheers,
    Jorge

  • SQL Server Cast

    Database Transaction Log Cannot be Shrunk Because it is Marked as “REPLICATION” (and Replication has not been Configured)

This week I got involved in a case where a Microsoft Dynamics AX database log could not be reduced in size. In the past, I have found several issues with SQL Server 2005 transaction logs that could not be shrunk due to ghost records; this is discussed in Microsoft KB953991. In this case, however, the database was running on SQL Server 2008.

There are different reasons why a transaction log can block the shrink process; a list of these reasons can also be found in this MSDN article. Finding out why the transaction log cannot be truncated requires querying sys.databases and looking at the log_reuse_wait_desc column. Most of the time the description in this column will be LOG_BACKUP, indicating that a transaction log backup operation is needed in order to truncate the transaction log file. In our case the result from this query was REPLICATION (I am using the AdventureWorks sample database here to illustrate this):

    USE master;
    GO
    SELECT name, log_reuse_wait_desc, * FROM sys.databases
    WHERE name = 'AdventureWorks';

[image: sys.databases output showing log_reuse_wait_desc = REPLICATION]

After discussing the case with the customer, I found out that a DBCC CHECKDB process had been run on this database several weeks earlier using the REPAIR_ALLOW_DATA_LOSS repair option. I am not sure these two events are related, but I experienced a similar situation on a SQL Server 2005 case where a database log was marked as REPLICATION after running DBCC CHECKDB with this repair option.

Trying to restore the status of the transaction log file using the SQL Server Replication stored procedure sp_repldone will not fix the issue, as the stored procedure will complain that the database is not configured for replication. The fastest and easiest way to get rid of the REPLICATION mark in the transaction log file is to configure Snapshot replication for the database and then remove this configuration from the server afterwards.

To configure Snapshot replication using SQL Server Management Studio you can follow this 10-minute TechNet video from Ty Anderson. The first half of the video shows how to configure your SQL Server instance as its own Publisher and Distributor, while the second half shows how to set up the Subscriber. In our case we do not need to configure any subscriber; just remember the following when completing the Snapshot Replication wizard:

    • Select the affected database as the database for Replication
• You do not need to select all the objects for the Replication process; selecting a single table in the database is enough
    • Remember to create an initial Snapshot when asked to do so during the wizard
    • You can safely use the SQL Server Agent service account in the Security settings as this Replication will only be used temporarily

Once the Snapshot Replication wizard completes successfully, run the previous SELECT against sys.databases again; at this point the ‘log_reuse_wait_desc’ column should show either NOTHING or LOG_BACKUP:

[image: sys.databases output after configuring replication, showing NOTHING or LOG_BACKUP]

    If the status is still REPLICATION, execute the following instruction to force all pending transactions to be distributed:

EXEC sp_repldone @xactid = NULL, @xact_segno = NULL, @numtrans = 0, @time = 0, @reset = 1;

Then we need to remove all replication objects from the database, as we no longer need any Replication for this database. First we drop the Publication (we do not need to drop subscribers first because we did not configure any):

    USE AdventureWorks;
    GO
EXEC sp_droppublication @publication = N'AW_Test_Publication';
USE master;
GO
    EXEC sp_replicationdboption @dbname = N'AdventureWorks', @optname = N'publish', @value = N'false';

    Note that in this example our Publication name is “AW_Test_Publication”. In your case this name will vary depending on what you have specified during the Snapshot Replication Wizard.

    Then we need to drop the Distributor:

    USE master;
    GO
EXEC sp_dropdistributor @no_checks = 1;

Finally, make sure no replication objects remain in the database by running the following stored procedure:

    USE master;
    GO
EXEC sp_removedbreplication 'AdventureWorks';
  • SQL Server Cast

    Synchronize two tables using SQL Server Integration Services (SSIS)-Part II of II


    Please, refer to this link for the first part of this post.

    In this part we will start with the same sample scenario where “table A” in “database A” needs to be synchronized with “table B” in data warehouse “database B”. During the first part of this post we discussed how to deal with new records in “table A” but we did not explain how to proceed when existing records are updated in “table A”.

We can implement different methods within SSIS to discover the records that differ between the source and destination tables. For example, we can use the EXCEPT operator to extract those differences and update the destination table accordingly; however, I found that the tablediff utility is probably the fastest way to implement this in SSIS. tablediff is used in SQL Server replication to compare two tables and return detailed information about what differences exist between them.
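For completeness, this is roughly what the EXCEPT approach looks like; a minimal sketch, assuming both databases live on the same instance and reusing the SSISDBSource/SSISDBDest sample names that appear later in this post:

-- Rows that exist in the source but are missing or different in the destination
SELECT * FROM SSISDBSource.dbo.Customer
EXCEPT
SELECT * FROM SSISDBDest.dbo.Customer;

The result set is the list of rows you would need to insert or update in the destination table.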

The great thing about tablediff is that it can not only compare tables but also script out the differences between them; this way, tables can be synchronized just by running the generated script. The caveat is that this utility only works between SQL Server servers, so if you are trying to synchronize tables hosted in a different database engine you are out of luck with this post.

    In my case this is what I did to synchronize the two tables:

1. Ensure the tablediff utility is installed on your SQL Server. The tablediff.exe executable is found under C:\Program Files\Microsoft SQL Server\<version>\COM

    2. Add the tablediff.exe path to the Windows system path using Computer Properties > Advanced System Settings > Advanced > Environment Variables > System Variables > PATH

3. Make sure the xp_cmdshell advanced SQL Server option is enabled by running sp_configure from SQL Server Management Studio (SSMS):

[image: sp_configure output showing the xp_cmdshell option]
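If the option is disabled on your server, a minimal sequence to enable it looks like this (read the security warning below before doing so):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;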

Please, carefully read the xp_cmdshell documentation and understand the implications of using this option in your environment. xp_cmdshell creates a Windows process with the same security rights as the SQL Server service account, which means that members of the sysadmin role can access OS functions that their own Windows accounts may not have. By default, only members of the SQL Server sysadmin fixed server role are authorized to run this extended procedure. If your company has no guidelines about what permissions are granted to the SQL Server service accounts and who belongs to the sysadmin fixed server role, carefully evaluate the xp_cmdshell option.

    4. Using the same SSIS project we created during the first part of this post, create two “Execute SQL Task” objects under the Control Flow section in BIDS. The first task, called “Execute tablediff” in my example, will take care of executing the tablediff.exe command. Here is a sample of the code in my case:

    exec master..xp_cmdshell 'tablediff.exe -sourceserver SQL2008R2\KILIMANJARO64 -sourcedatabase SSISDBSource -sourcetable Customer -destinationserver SQL2008R2\KILIMANJARO64 -destinationdatabase SSISDBDest -destinationtable Customer -f C:\Temp\Diff'

[image: “Execute tablediff” Execute SQL Task configuration]

The important part here is the -f switch, which creates a T-SQL script with the changes that have to be implemented on the destination table for it to contain the same information as the source. Here is an example of this automatically generated script:

[image: sample T-SQL script generated by tablediff]

    The second task, called “Execute SQL Script” in my example, will take care of running the saved C:\Temp\Diff.sql script in the database, implementing the changes from the source table on the destination table:

[image: “Execute SQL Script” Execute SQL Task configuration]
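If you prefer to shell out to the command line instead of using an Execute SQL Task, the generated script can also be run with sqlcmd; a sketch, reusing the same server and file names as above:

exec master..xp_cmdshell 'sqlcmd -S SQL2008R2\KILIMANJARO64 -d SSISDBDest -E -i C:\Temp\Diff.sql'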


5. Optionally, you can combine the “Data Flow” task we created during the first part of this blog with these two “Execute SQL Task” objects to have a full synchronization package.

[image: complete Control Flow combining the Data Flow task and the two Execute SQL Tasks]

  • SQL Server Cast

    Synchronize two tables using SQL Server Integration Services (SSIS)–Part I of II


There are situations where a SQL Server DBA needs to keep two different tables in sync. Usually, the DBA needs to keep a copy of a table in a data warehouse repository that is used as a solution for archiving and/or reporting.

SQL Server provides a robust method for keeping data synchronized across databases using Replication, but there are situations when all we need is to keep an online copy of a single table for archiving or reporting purposes, and we would prefer not to tinker with SQL Server replication.

This post is divided into two parts: Part I explains how to update a destination table with the information that is added to a source table, while Part II explains how to replicate into the destination any change that happens to existing records in the source table. Please take into account that this is just one way of doing this; there are many other possibilities, but the one listed here is probably one of the fastest to implement and modify on demand.

    This procedure is based on the following scenario: A “table A” on “database A” is periodically replicated using SSIS into “table B” on “database B”. “Table A” is updated with new records and we need to copy those records into “table B” too. Our final implementation will look like this in SQL Server Business Intelligence Development Studio:

[image: final Data Flow design in Business Intelligence Development Studio]


    Let’s see how this works:

1. “Source Table” is “table A” in “database A” while “Dest Table” is destination “table B” in “database B”. We start by creating two different OLEDB connections in the Data Flow component in SSIS, using both the source and the destination table as data sources.

    2. We need to perform a JOIN operation on the two different data sources to copy the information we need from one table into the other. For this JOIN to work correctly, the data has to be sorted; this is described in the following MSDN link:

    In Integration Services, the Merge and Merge Join transformations require sorted data for their inputs. The input data must be sorted physically, and sort options must be set on the outputs and the output columns in the source or in the upstream transformation. If the sort options indicate that the data is sorted, but the data is not actually sorted, the results of the merge or merge join operation are unpredictable.

The “Merge Join” operation is where we separate the data that has been added to the source table (the rows we need) from the data that was already there (the rows we do not need). In our case the source table (on the left) contributes all the columns we want to copy, while from the destination table (on the right) we only take the Primary Key column “No_”. Here is the description of this task:

[image: Merge Join transformation configuration]

Here is the key part of the process: the Left Outer Join retrieves all the records in the source table, but for those that do not exist on the destination table the Join Key column “No_” is retrieved as NULL. This is also described in the product documentation:

    To include all products, regardless of whether a review has been written for one, use an ISO left outer join. The following is the query:

    USE AdventureWorks2008R2;
    GO
    SELECT p.Name, pr.ProductReviewID
    FROM Production.Product p
    LEFT OUTER JOIN Production.ProductReview pr
    ON p.ProductID = pr.ProductID

    The LEFT OUTER JOIN includes all rows in the Product table in the results, whether or not there is a match on the ProductID column in the ProductReview table. Notice that in the results where there is no matching product review ID for a product, the row contains a null value in the ProductReviewID column.

3. Then we have to split the data we need from the data we do not need. For this we use a “Conditional Split” task that keeps only those records where the Join Key “No_” was NULL, i.e. only the new records. Here is a description of this conditional split task:

[image: Conditional Split transformation configuration]
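The split condition itself can be as simple as the following SSIS expression; a sketch, where Dest_No_ is an assumed alias for the destination key column coming out of the Merge Join:

ISNULL(Dest_No_)

Records for which this expression evaluates to TRUE are the new ones and flow on to the destination.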

4. Finally, we INSERT the resulting data from the Conditional Split task into the destination table, which is the same table we used as a data source at the very beginning.

In this example the Join task is configured so that you can reuse the tasks as many times as you want: the records on the destination table will not be duplicated with the information from the source table; only the new records will be populated into the destination table.
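For reference, the whole data flow is logically equivalent to the following T-SQL; a sketch, assuming both databases live on the same instance and that “No_” is the shared primary key (the database and table names are the generic ones from this scenario):

INSERT INTO DatabaseB.dbo.TableB
SELECT s.*
FROM DatabaseA.dbo.TableA AS s
LEFT OUTER JOIN DatabaseB.dbo.TableB AS d
ON s.No_ = d.No_
WHERE d.No_ IS NULL; -- only rows not present in the destination yet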

  • SQL Server Cast

    Kerberos Error Configuring SQL Server Linked Server for Delegation

Today I want to post about my findings on a case involving a linked server between two SQL Server servers. Before anything else I want to give thanks to my colleague Ignacio Alonso Portillo, who was kind enough to help me not only with this support case but with many others too.

My customer called Microsoft Support looking for help to configure a linked server using a three-tier client/server model, where a client connects to an intermediate server first and the client credentials are transferred from the intermediate server to the back-end server. This is commonly known as a “double-hop” scenario; the following diagram illustrates this situation:

[image: double-hop client/server diagram]

    The process that takes care of forwarding the client credentials from the intermediate server (“SQL Server 1” in our example) to the back-end server (“SQL Server 2”) is called delegation.

In our scenario the client was trying to access the remote SQL Server (“SQL Server 2”) by means of a Linked server connection on the intermediate (“SQL Server 1”) box. Testing the Linked server was consistently failing with the following error:

    ===================================
    "The test connection to the linked server failed."
    ===================================
    An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
    ------------------------------
    Program Location:
    at Microsoft.SqlServer.Management.Common.ServerConnection.ExecuteNonQuery(String sqlCommand, ExecutionTypes executionType)
    at Microsoft.SqlServer.Management.Smo.ExecutionManager.ExecuteNonQuery(String cmd)
    at Microsoft.SqlServer.Management.Smo.LinkedServer.TestConnection()
    at Microsoft.SqlServer.Management.UI.VSIntegration.ObjectExplorer.LinkedServerConnectionTest.Invoke()
    ===================================
    TCP Provider: An existing connection was forcibly closed by the remote host.
    Login failed for user '(null)'. Reason: Not associated with a trusted SQL Server connection.
    OLE DB provider "SQLNCLI" for linked server "SRVSQL2000" returned message "Communication link failure". (.Net SqlClient Data Provider)
    ------------------------------
    Server Name: SRVSQL2005
    Error Number: 10054
    Severity: 16
    State: 1

    This is how the Security properties of this Linked server were configured:

[image: Linked server Security properties dialog]

In this case we are allowing the user CONTOSO\SQLADMIN to connect to the remote server using the credentials given to the local one. If any other user tries to connect to the remote server by means of this linked connection, access will be denied (hence the “Not be made” option). Pay attention to this security dialog, it can be a little bit tricky. As “Local Login” we can specify either a SQL Server login or a Windows account; however, the “Remote User” cannot be a Windows user, only a SQL Server user. Selecting the “Impersonate” option disables the “Remote User” and “Remote Password” options since we are already using the credentials given to the local server to impersonate (i.e., authenticate) the user on the remote server. The configuration on this dialog box was not responsible for the error we were hitting, but I am showing it here to illustrate how this dialog works in SQL Server Management Studio.

The “Login failed for user '(null)'” error is typically raised when Kerberos is not being used, or when Kerberos has been configured but Delegation is not working. For this setup to work, Kerberos has to be used because NTLM authentication cannot manage a double-hop scenario. We made sure that the Kerberos requirements were met in our environment; a summary of these requirements can be found in the following diagram:

[image: Kerberos delegation requirements for the double-hop scenario]

We followed all the steps outlined in this MSDN entry to configure Delegation, but the error was still the same, so we decided to capture network traces at the intermediate server while reproducing the error. After loading the traces in Network Monitor and applying a filter (use the "KerberosV5" keyword as the filter in Network Monitor) we found the following network traffic:

    452   2010-11-16 12:22:59.334000    10.64.100.37      10.64.101.218     KRB5  TGS-REQ
    453   2010-11-16 12:22:59.334000    10.64.101.218     10.64.100.37      KRB5  KRB Error: KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN
    537   2010-11-16 12:23:01.599625    10.64.112.104     10.64.100.37      DCERPC      Bind: call_id: 70 IOXIDResolver V0.0
    539   2010-11-16 12:23:01.599625    10.64.100.37      10.64.112.104     DCERPC      Bind_ack: call_id: 70 accept max_xmit: 5840 max_recv: 5840
    540   2010-11-16 12:23:01.599625    10.64.112.104     10.64.100.37      DCERPC      Alter_context: call_id: 70 IOXIDResolver V0.0

    The two typical reasons why you would get a KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN error, provided DNS registration was working fine and direct/reverse name resolution were returning correct information, are:

1. The SPN is not actually registered for the account under which the SQL Server service is running
2. The SPN is registered for the account under which SQL Server is running, but it is also registered under a different account (i.e. duplicate SPNs).

We had confirmed previously that the correct SPNs existed for the SQL Server service account, so we then needed to look for duplicate SPNs. One way to find these duplicate entries is by using the setspn tool included in Windows Server 2008 and later, or the updated setspn version for Windows Server 2003. Personally, I prefer to use the not-so-intuitive Windows ldifde utility. Running ldifde.exe /? will show you the required syntax to output a list of SPNs for your domain. In the following example we use ldifde to return only those SPNs that have been registered for the SQL Server service:

    LDIFDE -f filename.txt -t 3268 -d "DC=contoso,DC=com" -l serviceprincipalname -r (serviceprincipalname=MSSQLSvc/*) -p subtree

    Here was the output for this instruction in our case:

    dn: CN=SQLServer,OU=Contoso Users,OU=Managed Accounts,DC=contoso,DC=lab
    changetype: add
    servicePrincipalName: MSSQLSvc/SRVSQL2005.CONTOSO.lab:1433
    servicePrincipalName: MSSQLSvc/SRVSQL2005.CONTOSO.lab
    servicePrincipalName: MSSQLSvc/SRVSQL2005:1433
    servicePrincipalName: MSSQLSvc/SRVSQL2005
    servicePrincipalName: MSSQLSvc/newhost10.fabrikam.lan:1433
    servicePrincipalName: MSSQLSvc/fabrikam.intranet.lan:1433
    servicePrincipalName: MSSQLSvc/fabrikam.intranet.lan
    servicePrincipalName: MSSQLSvc/newhost05.fabrikam.lan:1433
    servicePrincipalName: MSSQLSvc/newhost05.fabrikam.lan

    dn: CN=SRVSQL2005,OU=Cluster,OU=Servers,OU=Managed Computers,DC=contoso,DC=lab
    changetype: add
    servicePrincipalName: MSSqlSvc/SRVSQL2005.CONTOSO.lab:1433
    servicePrincipalName: MSSQLSvc/SRVSQL2005:1433
    servicePrincipalName: MSClusterVirtualServer/SRVSQL2005.CONTOSO.lab
    servicePrincipalName: MSClusterVirtualServer/SRVSQL2005
    servicePrincipalName: HOST/SRVSQL2005.CONTOSO.lab
    servicePrincipalName: HOST/SRVSQL2005

As you can see, the SPN required for our SQL Server 2005 server existed not only under the SQL Server service account running on our server but also under a second account, resulting in a duplicate SPN on the network. We removed the duplicate SPN using setspn -D.
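For example, based on the ldifde output above, the duplicate entries registered under the SRVSQL2005 computer account could be removed along these lines (a sketch; always double-check which account should keep the SPN before deleting anything):

setspn -D MSSqlSvc/SRVSQL2005.CONTOSO.lab:1433 SRVSQL2005
setspn -D MSSQLSvc/SRVSQL2005:1433 SRVSQL2005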

    The following MSDN entry lists some of the actions that can lead to duplicate SPNs on the network:

    There are several scenarios where an administrator can duplicate the service-principal-names (SPN) in the domain directory that will cause Kerberos authentication to fail. These include the following:

    - Making changes to the domain account under which the instance of SQL Server runs
    If SetSpn.exe is run while running an instance of SQL Server as one domain account, such as DOMAIN\User1, and then the domain account that is used to run SQL Server is changed, such as DOMAIN\User2, when SetSPN.exe is run again, it will cause the same SPN to be inserted in the directory under both accounts.

    - Installing multiple instances of SQL Server that run under different accounts
    If you install multiple instances of SQL Server and then run each instance under a different account, if SetSpn.exe is run on each instance, there will be duplicate accounts in the directory under each SQL Server service account. This applies for both instances that are running under a domain user and also the local System account.

    - Removing and reinstalling an instance of SQL Server under a different account
    If you install SQL Server under one account, register the SPNs, remove and reinstall SQL Server under a different account, and then reregister the SPNs, each domain account will have the same SPNs. This means the SPNs will be duplicated.

Bear in mind that Kerberos tickets are cached for a certain amount of time, so the Kerberos ticket originally provided by the KDC, although invalid, remains in the cache. If we test the Linked server connection right after fixing the duplicate SPN we will still be unable to connect because of these cached tickets. The Windows Resource Kit contains the Kerbtray utility, which we can use to purge cached Kerberos tickets. The tool has to be executed via "run as" (from the Windows GUI or a Command Prompt) using the SQL Server service account; otherwise we will be deleting the tickets for our current account and not for the SQL Server one.
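On Windows Vista/Windows Server 2008 and later you can mirror the same "run as" approach with the built-in klist utility instead of Kerbtray; a sketch, where the service account name is illustrative:

runas /user:CONTOSO\sqlservice "klist purge"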

  • SQL Server Cast

    “Cannot open the datafile <path_to_file>” Error Accessing a Flat File in SSIS Using a Proxy Account


I found this error on a Windows Server 2008 and SQL Server 2008 installation, but as you will see the error is not tied to a specific SQL Server or Windows version.

In this case my customer was trying to run a SQL Server Agent job with 17 different steps. These steps were of different types, like running T-SQL code, ActiveX scripts, Operating System (CmdExec) commands, etc. The customer was looking to run all these job steps using a Proxy account, which is a way to restrict the security context a job step runs under in SQL Server; here you can read the Proxy account definition from MSDN:

    SQL Server Agent lets the database administrator run each job step in a security context that has only the permissions required to perform that job step, which is determined by a SQL Server Agent proxy. To set the permissions for a particular job step, you create a proxy that has the required permissions and then assign that proxy to the job step. A proxy can be specified for more than one job step. For job steps that require the same permissions, you use the same proxy.
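For reference, a proxy along these lines can be created with T-SQL; a minimal sketch, where the credential name, Windows account, and password are placeholders:

-- Store the Windows account the job step should run as
CREATE CREDENTIAL AgentProxyCred
WITH IDENTITY = N'CONTOSO\JobRunner', SECRET = N'StrongPassword!';
GO
-- Create the SQL Server Agent proxy and allow it for CmdExec job steps
EXEC msdb.dbo.sp_add_proxy @proxy_name = N'AgentProxy', @credential_name = N'AgentProxyCred';
EXEC msdb.dbo.sp_grant_proxy_to_subsystem @proxy_name = N'AgentProxy', @subsystem_name = N'CmdExec';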

When using the Proxy account, two different job steps were failing with the following error:

    The error when running SSIS-step (11,12) using a proxy is:
    10.0.2531.0 for 64-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    Started:  15:27:31  Error: 2010-10-16 15:27:32.38     Code: 0xC020200E     Source: Copy Data from AgentSource to AgentDestination Task Flat File Source [1]     Description: Cannot open the datafile "V:\MySharedFolder\MyTextFile.txt".  End Error  Error: 2010-10-16 15:27:32.38     Code: 0xC004701A     Source: Copy Data from AgentSource to AgentDestination Task SSIS.Pipeline     Description: component "Flat File Source" (1) failed the pre-execute phase and returned error code 0xC020200E.  End Error  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  15:27:31  Finished: 15:27:32  Elapsed:  0.687 seconds.  The package execution failed.  The step failed.

In this case the job owner was the ‘sa’ account, so we first tried to avoid the error by changing the job owner from ‘sa’ to the Proxy account, but this did not fix the problem. During the troubleshooting process I worked on the customer’s server using a remote session to try to determine where the error was coming from; during that session I noticed that the V: drive the job was pointing to did not exist on the server. The customer explained to me that the drive mapping was taking place during the first job step.

The mapping of the V: drive was taking place in the security context of the Proxy account, so we first checked that this account had the correct xp_cmdshell permissions, as described in this MSDN entry:

    When it is called by a user that is not a member of the sysadmin fixed server role, xp_cmdshell connects to Windows by using the account name and password stored in the credential named ##xp_cmdshell_proxy_account##. If this proxy credential does not exist, xp_cmdshell will fail.
    The proxy account credential can be created by executing sp_xp_cmdshell_proxy_account. As arguments, this stored procedure takes a Windows user name and password. For example, the following command creates a proxy credential for Windows domain user SHIPPING\KobeR that has the Windows password sdfh%dkc93vcMt0.

    EXEC sp_xp_cmdshell_proxy_account 'SHIPPING\KobeR','sdfh%dkc93vcMt0'

In our case the permissions required by the Proxy account were correctly assigned, so we tried to map the network drive outside of the job execution, using our own session instead of the Proxy account session, but the result and error were exactly the same. Nevertheless, both the customer and I kept thinking that the problem was related to the mapping of this drive, so we created a new Agent job with the following single step to isolate the problem (abridged):

    net use v: \\server\share\ /user:username /persistent:yes
    c:\program files (…) dtexec.exe package.dtsx (…)
    dir v:\*.* > c:\out.txt

This test worked. The difference here was that we were mapping the drive and running the package in the same job step, while the customer was mapping the drive in the first step and executing the SSIS tasks in the subsequent job steps. As explained in the KB180362 article, this was precisely the problem with the Agent job:

    When the system establishes a redirected drive, it is stored on a per-user basis. Only the user himself can manipulate the redirected drive. The system keeps track of redirected drives based on the user's Logon Security Identifier (SID). The Logon SID is a unique identifier for the user's Logon Session. A single user can have multiple, simultaneous logon sessions on the system.

    If a service is configured to run under a user account, the system will always create a new logon session for the user and then launch the service in that new logon session. Thus, the service cannot manipulate the drive mappings that are established within the user's other session(s).

    At the beginning of the same KB article you can also read:

    A service (or any process that is running in a different security context) that must access a remote resource should use the Universal Naming Convention (UNC) name to access the resource. UNC names do not suffer from the limitations described in this article.

In our case the solution was as easy as replacing the mapping to the V: drive with the UNC path (i.e. \\servername\sharename). Another option in this scenario is to perform the mapping in the same job step as the one where the action takes place.

  • SQL Server Cast

What are Sparse Files and Why Should I Care as a SQL Server DBA?

Last week I worked with one of my co-workers from the Windows Platform Support group on a backup-related issue with SQL Server data files. We both learned several things from that case, and I think it is worth the time to share this information with you.

Our customer was using Microsoft Data Protection Manager to back up his Windows servers and SQL Server databases and was finding several errors when copying the SQL Server databases. During our research, the customer found the following MSDN post with a detailed explanation of the problem: “Did your backup program/utility leave your SQL Server running in an squirrely scenario?” If you have never heard of sparse files and how they affect SQL Server, I highly recommend reading this post from the CSS SQL Server Engineers’ blog.

    What are sparse files?

Sparse files are created during SQL Server operations such as database snapshots or DBCC CHECKDB statements. In the case of a snapshot, the sparse file is deleted when the snapshot is deleted, while in the case of a DBCC CHECKDB operation the sparse file is automatically deleted by SQL Server right after the execution. You can find extended information on the role of sparse files in SQL Server in the following two links:

    Internal Database Snapshot:

    DBCC CHECKDB uses an internal database snapshot for the transactional consistency needed to perform these checks. This prevents blocking and concurrency problems when these commands are executed. For more information, see Understanding Sparse File Sizes in Database Snapshots and the DBCC Internal Database Snapshot Usage section in DBCC (Transact-SQL).

Understanding Sparse File Sizes in Database Snapshots:

    Sparse files are a feature of the NTFS file system. Initially, a sparse file contains no user data, and disk space for user data has not been allocated to it. For general information about the use of sparse files in database snapshots and how database snapshots grow, see How Database Snapshots Work.
    When first created, a sparse file takes up little disk space. As data is written to the sparse file, NTFS allocates disk space gradually. Potentially, a sparse file can grow very large. If a database snapshot runs out of space, it is marked as suspect, and it must be dropped. The source database, however, is not affected; actions on it continue normally.

How does this happen?

Sparse files are a feature provided by the NTFS file system. Under some NTFS.SYS file versions, the sparse file bit becomes sticky for the main (parent) file. Here is one example:

    1. You run a DBCC CHECKDB on your SQL Server database as part of your maintenance plan, this completes with no errors
    2. You perform a backup of your databases
    3. Your original database files are then marked as ‘sparse’

    What is the problem?

    SQL Server does not support backup over sparse files so you can put yourself in an unsupported scenario without even knowing it.

    How can I check if I am in this situation?

    Run the following query and look for values other than 0 (zero) under the ‘is_sparse’ column:

USE <DatabaseName>;
GO
SELECT is_sparse, * FROM sys.database_files;

    The following example shows a database with several sparse files:

[image: sys.database_files output showing is_sparse set for several files]

SQL Server 2008 R2 Best Practices Analyzer includes a new rule to detect this condition too; you can find this information in the Microsoft KB2028447 article.

    What can I do to avoid the problem from occurring?

    You should update the NTFS.SYS version in your Windows Servers as a first step. The latest NTFS.SYS versions prevent this problem from happening. Here is a list of the hotfixes and NTFS.SYS versions you should keep in mind:

    For Windows Server 2003:

    You cannot restore large files in the NTFS file system when all the data streams that have sparse attributes are deleted in the 64-bit version of Windows XP SP2 or in Windows Server 2003 SP2, http://support.microsoft.com/kb/973886

Error message when you run the "chkdsk" command together with the "/v" switch on a Windows Server 2003-based computer: "Correcting sparse file record segment <number>", http://support.microsoft.com/default.aspx?scid=kb;en-us;932021

    For Windows Server 2008:

Disk Manager reports incorrect disk usage a while after the operating system starts in Windows Server 2008 or in Windows Vista, http://support.microsoft.com/kb/981891

    For SQL Server 2005 and 2008:

In addition to the abovementioned Windows hotfixes, make sure you have the latest SQL Server Service Packs and Cumulative Updates installed on both platforms to prevent this issue from happening.

    How can I fix the problem?

Having your Windows and SQL Server boxes updated with the latest service packs and hotfixes will prevent the problem from happening again, but will not fix the problem if it already exists in the database. The previously mentioned MSDN blog post explains how to fix this problem in SQL Server 2005 and 2008: “Did your backup program/utility leave your SQL Server running in an squirrely scenario?”

Another option you can use to fix this problem is to empty the content of the database file into another file in the same filegroup. You can achieve this by using the EMPTYFILE option of the DBCC SHRINKFILE T-SQL statement; here is an example:

    USE AdventureWorks2008R2;
    GO
    -- Create a data file and assume it contains data.
    ALTER DATABASE AdventureWorks2008R2
    ADD FILE (
        NAME = Test1data,
        FILENAME = 'C:\t1data.ndf',
        SIZE = 5MB
        );
    GO
    -- Empty the data file.
    DBCC SHRINKFILE (Test1data, EMPTYFILE);
    GO
    -- Remove the data file from the database.
    ALTER DATABASE AdventureWorks2008R2
    REMOVE FILE Test1data;
    GO

This method cannot be used if the file with the ‘sparse’ sticky bit is the first file in the PRIMARY filegroup, because this file contains several system objects that cannot be moved.

    After following all these steps I still find my SQL Server data files as ‘sparse’!

After following these steps you may find that the Windows fsutil utility reports the file as ‘not sparse’ while SQL Server keeps reporting the file as ‘sparse’ via sys.database_files. If this is the case, make sure you detach and re-attach the database files; this is needed for the sys.database_files information to be updated.
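To check the sparse attribute at the file system level you can use fsutil; a sketch, where the file path is illustrative:

fsutil sparse queryflag "C:\Data\MyDatabase.mdf"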

  • SQL Server Cast

    “The IP Address 'x.x.x.x' is already in use. To continue, specify a different IP address” SQL Server 2008 cluster setup error


Today I want to talk about my experience with one of the strangest setup errors I have seen installing SQL Server in a cluster. In this particular case I was helping the customer install a new SQL Server 2008 R2 instance in a three-node cluster. The Windows cluster was already running two instances, each one on a different cluster node, and the customer wanted to install this third instance on the third cluster node. The instances on the other cluster nodes had been running for a while with no issues.

The customer was trying to install the instance on the third cluster node, but the setup was consistently failing with the following error:

    SQL Server Setup has encountered the following error: The IP Address ‘10.246.18.118’ is already in use. To continue, specify a different IP address. Error code 0x84B40000.

[image: SQL Server setup error dialog]

The initial question, “is there any other machine with that IP address on the network?”, was quickly answered: neither PING nor NSLOOKUP showed any other host owning that IP address. As usual with setup problems, I looked into the SQL Server setup log files. The “Summary.txt” file had the same error reported by the GUI:

    Exception type: Microsoft.SqlServer.Chainer.Infrastructure.InputSettingValidationException
        Message:
            The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address
    .

    The “Detail.txt” setup log file had more information. We were able to see that indeed the IP address 10.246.18.118 did not exist on the network during the initial setup phase:

    2010-11-12 14:53:09 Slp: IP Addresses have been specified so no defaults will be generated.
    2010-11-12 14:53:38 Slp: SendARP didn't return a MAC address for IP address '10.246.18.118'.  The message was 'The network name cannot be found.'.  This indicates the address is valid to create.

    […]

    2010-11-12 14:59:33 Slp: SendARP didn't return a MAC address for IP address '10.246.18.118'.  The message was 'The network name cannot be found.'.  This indicates the address is valid to create.

    … but all of a sudden, the ARP request succeeded in finding a valid host with that same IP address, causing the setup to halt:

    2010-11-12 15:00:25 Slp: SendARP for IP Address '10.246.18.118' succeeded.  The found MAC address is '00:00:5e:00:01:65'.  The IP address is already in use.  Pick another IP address to continue.
    2010-11-12 15:00:28 Slp: SendARP didn't return a MAC address for IP address '10.246.16.118'.  The message was 'The network name cannot be found.'.  This indicates the address is valid to create.
    2010-11-12 15:00:28 Slp: Hosting object: Microsoft.SqlServer.Configuration.ClusterConfiguration.ClusterIPAddressPrivateConfigObject failed validation
    2010-11-12 15:00:28 Slp: Validation for setting 'FAILOVERCLUSTERIPADDRESSES' failed. Error message: The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address.
    2010-11-12 15:00:28 Slp: Error: Action "Microsoft.SqlServer.Configuration.SetupExtension.ValidateFeatureSettingsAction" threw an exception during execution.
    2010-11-12 15:00:28 Slp: Microsoft.SqlServer.Setup.Chainer.Workflow.ActionExecutionException: The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address. ---> Microsoft.SqlServer.Chainer.Infrastructure.InputSettingValidationException: The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address. ---> Microsoft.SqlServer.Chainer.Infrastructure.InputSettingValidationException: The IP Address '10.246.18.118' is already in use. To continue, specify a different IP address.
    2010-11-12 15:00:28 Slp:    --- End of inner exception stack trace ---

    […]

    2010-11-12 15:05:26 Slp: Error result: -2068578304
    2010-11-12 15:05:26 Slp: Result facility code: 1204
    2010-11-12 15:05:26 Slp: Result error code: 0

In SQL Server 2008 a cluster installation is divided into two main phases; the first one takes care of copying the instance files to the target machine and registering all the components, while the second one takes care of creating the clustered resources. In our case the setup was failing at the very end, where the cluster resources are created. As you can see, the “Detail.txt” file was also pointing to the MAC address of the offending host.

As you may know, ARP is a network protocol that takes care of resolving IP addresses to physical (MAC) addresses. This information is stored in memory in the ARP table. The ARP -a command displays the information stored by Windows in the ARP cache table. After uninstalling the components left behind by the failed setup one more time, we did a quick ARP check based on what the setup was doing and found the following information:

[image: ARP -a output on the cluster node]

The IP addresses 10.246.16.100 and 10.246.18.100 were the two virtual IP addresses that we were trying to configure on the cluster. I am far from being an expert in networking, but having an IP address of the x.x.18.x/24 network listed under the x.x.16.x/24 interface was strange enough to make me think of a network resolution problem. The IP address and MAC address were in fact the same ones that were causing the setup to fail. We tried to PING and NSLOOKUP the same IP address again, but nothing came back.

    I double-checked with one of the SQL Server 2008 R2 clusters in my lab and found that every network address was correctly shown under its corresponding network interface:

[image: ARP -a output from a working lab cluster, with each address under its corresponding interface]

Strangely enough, the “bogus” ARP entry was created as dynamic in the case of the customer’s cluster, so we were expecting it to be removed from the ARP cache table after several seconds, but this was not happening. I was not sure what was announcing the IP address via ARP on the network; I can only think of a problem with the NIC teaming, but we ran out of time and were not able to test this hypothesis. After removing the offending ARP entry with ARP -d we ran a new setup, which finished successfully this time. If you have had a similar experience with this error or have any idea of where that ARP entry could come from, please let me know.
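For anyone hitting the same issue, these are the two commands involved, run from an elevated command prompt (the IP address is the one from this case):

arp -a
arp -d 10.246.18.118

The first lists the entries in the ARP cache table; the second deletes the cached entry for the offending IP address.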

  • SQL Server Cast

    “OnlineThread: SQL Cluster shared data upgrade failed” error (ID 19019) installing Cumulative Update in SQL Server cluster


Last week I was working with a customer on this obscure SQL Server error, which appeared right after installing Cumulative Update 9 (CU9) on a SQL Server 2008 two-node cluster. As you will see, the root cause of the problem was easy to understand, but getting to the point of finding this root cause was not.

In this case the customer was running a virtual instance of SQL Server 2008 SP1; this instance was running with no errors, and moving the clustered resources back and forth was not causing any problem. The customer was installing CU9 to avoid a database issue that was affecting the performance of a CRM system. Following best practices in SQL Server 2008/R2, she was installing the CU on the passive node first, in order to minimize the amount of downtime in the production CRM system. The problem was that trying to start the SQL Server service on the updated cluster node failed immediately.

    We looked into the Windows Event log and Cluster log but the information from these two sources did not provide any clue:

    Application Event log:

    10/09/2010 08:08:15 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3) 
    10/09/2010 08:08:11 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3) 
    10/09/2010 08:08:10 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3) 
    10/09/2010 07:56:26 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3) 
    10/09/2010 07:55:26 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3)

    Cluster log:

    00000d78.000010dc::2010/09/14-16:50:34.060 ERR   [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed                            (status 0, Worker retval = 3)
    00000d78.000010dc::2010/09/14-16:50:34.060 ERR   [RHS] Online for resource SQL Server (SQL2008) failed.
    000009dc.000011cc::2010/09/14-16:50:34.060 INFO  [RCM] HandleMonitorReply: ONLINERESOURCE for 'SQL Server (SQL2008)', gen(0) result 5018.
    000009dc.000011cc::2010/09/14-16:50:34.060 INFO  [RCM] TransitionToState(SQL Server (SQL2008)) OnlinePending-->ProcessingFailure.
    000009dc.00001168::2010/09/14-16:50:34.060 ERR   [RCM] rcm::RcmResource::HandleFailure: (SQL Server (SQL2008))
    000009dc.00001168::2010/09/14-16:50:34.060 INFO  [RCM] resource SQL Server (SQL2008): failure count: 1, restartAction: 2.

The error was preventing SQL Server from creating and writing any ERRORLOG file, so we could not rely on this log either. In this kind of situation Sysinternals Process Monitor (Procmon) is usually a good option, so we ran this utility while reproducing the service start failure. After capturing with Procmon, we filtered the trace by the rhs.exe process, the Resource Hosting Subsystem process responsible for loading SQSRVRES.DLL and running the IsAlive check. Through this test we could see the DLL was detecting that the SQL Server component required an update, because the PatchLevel and SharedDataPatchLevel values in the Windows registry were different. The error was taking place during this update:

[image: Procmon trace filtered by rhs.exe]

So Procmon was useful in telling us what was failing, but not why it was failing. Working with my colleague João Loureiro I learned that, by setting the following registry value to “1” (one), the Windows cluster.log file could provide more verbose output:

    HKLM\Cluster\Resources\<ID for SQL Server Instance>\Parameters\VerboseLogging
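For reference, the value can be set from an elevated command prompt with reg.exe; a sketch, assuming the value is stored as a string (substitute the actual resource ID, and consider backing up the key first):

reg add "HKLM\Cluster\Resources\<ID for SQL Server Instance>\Parameters" /v VerboseLogging /t REG_SZ /d 1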

After modifying this key, we failed the resources over to the updated node again and checked the cluster.log file one more time; this time the information in the file was the following:

    000014c8.00001eb0::2010/10/06-16:34:54.991 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: enter; VirtualServerName=NETNAME; InstanceName=SQL2008
    000014c8.00001eb0::2010/10/06-16:34:54.991 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: ServiceName=MSSQL$SQL2008
    000014c8.00001eb0::2010/10/06-16:34:55.022 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] ClusterResourceControl, found the network name: NETNAME.
    000014c8.00001eb0::2010/10/06-16:34:55.022 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: ServerName=NETNAME\SQL2008
    000014c8.00001eb0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Calling SQLClusterResourceWorker::WaitForCompletion (200)
    000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Worker thread starting ...
    000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Entering SQLClusterSharedDataUpgradeWorker thread.
000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): SqlDataRoot = 'R:\MSSQL10.SQL2008\MSSQL',         CurrentPatchLevel = '10.1.2789.0',         SharedDataPatchLevel = '10.1.2531.0'
000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): FTDataPath = 'R:\MSSQL10.SQL2008\MSSQL\FTData',         SQLGroup = 'S-1-5-80-2586167408-2258694634-1203600018-1002233561-722790735'
000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Entering DoSQLDataRootApplyACL (R:\MSSQL10.SQL2008\MSSQL\Data).
    000014c8.000017f0::2010/10/06-16:34:55.054 WARN  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): DoSQLDataRootApplyACL : Failed to create directory tree at SQLDataRoot.
    000014c8.000017f0::2010/10/06-16:34:55.069 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Exiting DoSQLDataRootApplyACL (3).
    000014c8.000017f0::2010/10/06-16:34:55.069 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Exiting SQLClusterSharedDataUpgradeWorker thread (3).
    000014c8.000017f0::2010/10/06-16:34:55.069 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Worker thread exiting (retval = 3) ...
    000014c8.00001eb0::2010/10/06-16:34:55.069 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Calling SQLClusterResourceWorker::WaitForCompletion (4294967295)
    000014c8.00001eb0::2010/10/06-16:34:55.069 ERR   [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3)
    000014c8.00001eb0::2010/10/06-16:34:55.069 ERR   [RHS] Online for resource SQL Server (SQL2008) failed.

Here we could see much more detailed information. After reviewing this log with the help of the customer, we found that the R: drive did not exist on the cluster. Only then were we able to understand why the SQLClusterSharedDataUpgradeWorker thread was failing. Checking the following registry key:

    HKLM\Software\Microsoft\Microsoft SQL Server\MSSQL10.SQL2008\Setup\SQLDataRoot

we found drive R: listed instead of the actual SQL Server data root drive. From this point, fixing the problem was a matter of changing this registry key on one cluster node at a time, right after installing the required CU. It is clear that the problem was not related to this CU9 specifically; using any other CU or SP would probably have led to a similar situation.
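If you want to inspect this value on your own nodes, reg.exe can query it directly; a sketch using the instance ID from this case:

reg query "HKLM\Software\Microsoft\Microsoft SQL Server\MSSQL10.SQL2008\Setup" /v SQLDataRoot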

Neither the customer nor we were able to understand how this path could have been changed. We can only imagine some ALTER DATABASE statement was run at some point, leaving a wrong registry entry for the SQLDataRoot key.

  • SQL Server Cast

    Unexpected Behavior Setting Up Firewall Rules with Clustered SQL Server Instances

If you have ever tried to set up firewall rules in your network to manage access to SQL Server, you have probably found a somewhat weird behavior when connecting to a clustered instance. In this particular case my customer was finding that the rules she had configured on her Windows firewall were working fine when connecting to a SQL Server 2008 cluster, but were failing when connecting to a SQL Server 2005 cluster instead.

This behavior is indeed clearly documented in the KB944390 article, but in our case it took some time to match what we were seeing in the network traces to this specific Knowledge Base article. I am posting this information here so you can see what to expect when running your preferred network sniffer while tracking down this problem.

If you want to go ahead and try this yourself, grab your favorite network sniffing utility and a PC with SQL Server Management Studio (SSMS), and locate a SQL Server 2005 cluster and a SQL Server 2008 cluster on your network. The test only involves opening SSMS on your client and capturing some network traces, so there is no risk for clusters on a production system. In my case I used Microsoft Network Monitor because I find it easy to interpret the SQL Server packet information using the available SQL Server parser; you can find and download this parser here.

Open Microsoft Network Monitor and make sure the correct SQL Server parser is selected; the new Network Monitor version makes this easier than ever:

[image: Network Monitor parser selection]

Click on New Capture to open a new capture window and then click on the Start button to initiate the capture process. Then open SSMS and connect to the clustered SQL Server 2008 instance using the default settings; as soon as the connection completes, go back to Network Monitor and stop the capture process.

As you may know, when connecting to SQL Server named instances the SQL Browser service plays an important role in giving clients the required information. SQL Browser runs as a local service on each standalone or clustered server where SQL Server is installed (you won’t find the SQL Browser service on SQL Server 2000; it was implemented as a stand-alone service starting with SQL Server 2005). Here is, in summary, what SQL Server Browser does, from the MSDN documentation:

    “On startup, SQL Server Browser starts and claims UDP port 1434. SQL Server Browser reads the registry, identifies all SQL Server instances on the computer, and notes the ports and named pipes that they use. When a server has two or more network cards, SQL Server Browser will return all ports enabled for SQL Server. SQL Server 2005 and SQL Server Browser support ipv6 and ipv4. […]

    When SQL Server 2000 and SQL Server 2005 clients request SQL Server resources, the client network library sends a UDP message to the server using port 1434. SQL Server Browser responds with the TCP/IP port or named pipe of the requested instance.”

The component in SQL Browser that takes care of all this is called the SQL Server Resolution Protocol, or SSRP for short. When connecting to my SQL Server 2008 clustered named instance I was able to see the expected pattern of network packets going in and out. First we can see my client “JORGEPC-SQL” trying to access the remote clustered instance “SQL2008\KATMAI64”; the first Browser request is identified by a UDP SSRP packet to destination port 1434:

[image: SSRP request packet from the client]

    Then we can see the response from the SQL Server Browser service using the same SSRP protocol; the response contains all the information required by the client to complete the connection successfully:

[image: SSRP response packet from the SQL Server 2008 cluster]

    As you can see, the answer in the SSRP traffic comes from the SQL Server Network Name resource in the cluster (sql2008.contoso.lab), not from the cluster node where the SQL Browser service is running. This is expected because I am initiating the communication with the virtual server name in SQL Server, not with the cluster node.

See now what happens when I perform the very same action connecting to a clustered SQL Server 2005 named instance called “SQL2005\YUKON”. Here you have the screen captures of the SSRP request from my client (first) and the SSRP response from the server (second):

[image: SSRP request to the SQL Server 2005 cluster]


[image: SSRP response from the SQL Server 2005 cluster node]

As you can see from the second image, the Browser service is providing the correct information to the client via SSRP, including the instance name, version, and TCP and NP information (you can also see from this frame that there are two instances in the cluster: one for SQL Server 2000, represented by Version = 8, and another one for 2005, represented by Version = 9). However, the SSRP response in this case is not coming from the SQL Server Network Name “SQL2005” but from the cluster node where the instance is currently active, “JORGEPC-VM-CN1”.

If you are not using a firewall, or if your firewall rules do not block all unknown incoming traffic, this behavior will go unnoticed. However, if you block all unknown incoming traffic on your client and use a firewall that does not support loose source mapping, the connection attempt will fail. Your firewall will log the following information in this case:

    2010-10-28 15:23:21 DROP UDP 192.168.0.50 192.168.0.40 1434 56039 122 - - - - - - - RECEIVE
    2010-10-01 15:23:22 DROP UDP 192.168.0.50 192.168.0.40 1434 56040 122 - - - - - - - RECEIVE
    2010-10-01 15:23:23 DROP UDP 192.168.0.50 192.168.0.40 1434 56041 122 - - - - - - - RECEIVE
    2010-10-01 15:23:25 DROP UDP 192.168.0.50 192.168.0.40 1434 56042 122 - - - - - - - RECEIVE

The KB944390 article mentioned before explains this situation in more detail:

    “When you connect to the named instance, the client network library sends a User Datagram Protocol (UDP) request packet to the IP address of the named instance. Then, SQL Server Browser returns a UDP response packet that contains the information about the connection endpoints.
    However, in the UDP response packet, the source IP address may not be the IP address to which the UDP request packet was sent. If the named instance is a failover cluster instance, the source IP address is the IP address of the physical computer instead of the virtual IP address of the remote server. If the remote server has multiple IP addresses, the source IP address may be any of the IP addresses that are assigned to the remote server”

You can learn more about loose source mapping in this MSDN link.

These results depend not only on the SQL Server version but also on the Windows version. The network traces will show the expected behavior (the incoming SSRP traffic source is the same as the destination) in SQL Server 2008 SP1 and above when installed on Windows Server 2008 or above; other SQL Server and Windows combinations below these two versions will suffer from the same problem, for example:

    • SQL Server 2008 (RTM) on Windows Server 2008
    • SQL Server 2008 SP1/SP2 on Windows Server 2003 cluster
    • SQL Server 2005 on Windows Server 2008

For the SQL Server and Windows Server versions affected by this problem, 3 different workarounds are provided in the abovementioned KB944390 article.
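One simple way to sidestep SQL Browser altogether, where it fits your environment, is to specify the TCP port of the instance directly in the connection, so no UDP 1434 traffic is involved; a sketch, where the port number is illustrative:

sqlcmd -S tcp:sql2005.contoso.lab,1433 -E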
