This Blog provides information about running SAP Applications on the Microsoft Platform. The Blog is written by people who are working with SAP on the Microsoft Platform for decades.
A while back one of our SAP customers benefited from Database Mirroring after the NAS device attached to the database principal did crash. Since the customer was using SQL Server database mirroring in a synchronized manner without automatic failover, the customer issued a failover to the secondary to keep their SAP ERP system running. Unfortunately the mirror server in this specific case was sized smaller than their database principal. Therefore they failed back over to the principal after the SAN device was up and running again. However executing a consistency check (dbcc checkdb) did reveal that the table GLPCA had several corruptions. Some on the data layer, some others in the B-Tree of the clustered index. In order to make sure whether the database mirror is consistent, the customer initiated a database snapshot (SQL Server feature since SQL Server 2005) against the mirrored database and ran a consistency check against GLPCA (dbcc checktable). As expected the table GLPCA on the mirror was consistent and did not have any corruptions.
Purely from the damage detected on the corrupt pages on the principal, it would have been a great case for the Autorepair feature we added to Database Mirroring in SQL Server 2008. However the customer still did run SQL Server 2005. Nevertheless Database Mirroring offered a great possibility to fix the damage to GLPCA even using SQL Server 2005. Here is the way how it can be done:
· A failover is done to the mirror again. After the failover production can continue
· Switch the database mirroring mode to asynchronous (reason will be described in one of the following articles on this blog)
· Rebuild the index(es) on the table(s) which are damaged on the former principal side (now acting as mirror). In most of the cases, these indexes can be rebuilt online. Hence production can continue. In the particular customer case it was good enough to rebuild the clustered index since a rebuild of the clustered index would rebuild the data layer as well.
· After rebuild of the index(es) is finished switch back to synchronous mirroring. After the two sides are synchronized one can take a snapshot on the current mirror side (former principal) to run consistency checks against the tables which got indexes rebuild or one can failover to the mirror (former principal) again.
· Expected result is that the corruptions on the former principal are resolved and no more existing.
This opens two questions. Why was it expected that the database mirror was not affected? Why did the rebuild of an index on the principal fix a corrupted index on the mirror?
Answer to the question one is that database mirroring is not transferring pages like storage replication methods do. SQL Server transfers the transaction log entries which describe changes applied to data. These changes described in the transaction log then will be applied to the mirror database. This also means that any damages occurring in the I/O path on the principal server will not be transferred over to the mirror. This would mostly not apply to storage replication where damages could be transferred to the secondary immediately.
Why did the clustered index rebuild on the former mirror fix the issue on the former principal? When SQL Server mirrors activities like creating indexes get executed, it doesn’t log the fact that it is building an index. But it does log every page allocation, every insert of a row of the old pages into the new allocated pages of the rebuild index or data layer. Means any data moved ends up in the transaction log, which again gets mirrored to the acting mirror. At the mirror these activities are repeated. Rebuilding an index then ends with de-allocating the old index structure (including data layer in case of a clustered index) and activation of the new index structure. At the end this means the data moved on the acting mirror is originated from the acting principal which had non-corrupted data. This is the reason why the repair as described above worked.