This Blog will provide information about running SAP applications on SQL Server and Windows. The Blog is written by folks of Microsoft who are working with SAP and SQL Server for more than a decade.
What is database corruption? Why should you be concerned about corruption in a database? What should you do when you encounter corruption in an SAP database on SQL Server? This blog post is relevant for handling a corrupt SAP NetWeaver database on SQL Server.
Physical corruption happens on all database platforms with any Relational Database Management System (RDBMS). Data is persisted on disks and natural, man-made, and manufacturing disasters happen. In our support experience with corrupted SAP systems, we have seen physical corruption happen due to: system power failure, bad physical memory, SAN infrastructure problems, etc.
What about SQL Server’s High Availability / Disaster Recovery (HA/DR) solutions: 2012 Always On, Database Mirroring, and Log Shipping? Yes, those all would usually help because at least one or more extra copies of the physical data is present. But SAP does not require that customers utilize SQL Server HA/DR solutions so not all systems are protected with multiple physical copies of the data.
This blog will discuss four main points:
(1) How is physical corruption normally detected in a SQL Server Database?
(2) What is the simplest, best way to recover from corruption?
(3) What is the difference between physical and logical corruption and why is this so important for an SAP database?
(4) The workflow to use in order to recover best from a corruption
1.How is physical corruption normally detected in a SQL Server Database?
In SQL Server Support for SAP, we periodically process customer messages where a physical corruption has been detected via various errors. The most common ways corruption is detected are via:
2.What is the simplest, best way to recover from corruption?
SAP Note 142731 [via SAPNet HERE]documents that the only officially supported process of handling corruption in SAP Databases on SQL Server is to restore from a known clean backup. But it also correctly dfferentiates the two classes of corruption that can be encountered: Those that are non-critical and can be repaired fairly easily and those that cannot be repaired from within the database itself. The former usually happens when the corruption occurs on index pages and the latter when the corruption resides on data or metadata pages. When the minimum level of repair needed to fix the corruption is stated as either “repair_fast” or “repair_rebuild” please follow the “Non-critical corruption” guidelines described in that SAP Note.
If you find evidence of significant corruption (where repair_allow_data_loss is listed as the minimum repair level needed), you must make every effort to restore from a clean backup. Some customers fear that restoring a full database and applying all the transaction logs is risky action because of the length of time it can take and the complexity. However, it is always far more complex, time consuming, and error-prone to try to repair or recover data manually.
Unfortunately, some customers do not have a recent clean backup which they can restore. In these cases, an attempt to repair or recover data is the only option available. Before moving into the repair or recovery steps, we will first explain some important principles about the SAP application that you must understand before trying to repair or recover physically corrupted data.
3.What is the difference between physical and logical corruption and why is this so important for an SAP database?
When corruption errors are logged, the first advice is to immediately run a SQL Server Database Consistency Check (or DBCC). When you do this, the DBCC output may direct you to execute a specific repair option. However, before running any repair command you must understand that the complete consequences of running the repair options.
The Microsoft documentation about DBCC CHECKDB and its repair options shows that there are really only two DBCC CHECKDB repair options:
The documentation then states that if you execute REPAIR_ALLOW_DATA_LOSS then you should afterwards run DBCC CHECKCONSTRAINTS. The reason is that if your database has used primary and foreign key relationships to ensure application data (or logical) consistency, then any data deleted by the repair which would result in a logical inconsistency will be identified with DBCC CHECKCONSTRAINTS.
However, many Enterprise Database Applications (like SAP) have their own proprietary metadata repository (such as the SAP Data Dictionary, or DDIC) that allows for self-management of referential integrity relationships, cluster tables, etc. The self-management means it does not rely on the RDBMS to provide the Declarative Referential Integrity (DRI) to manage the Primary Key and Foreign Key relationships. Almost all SAP database tables have Primary Key constraints specified but without any foreign keys that are associated to them. This is significant in that there is no way for the database to assess the logical consistency between tables and their data when those relationships aren’t defined. That can only be done within the realm of the SAP Application module itself. More information on Application and RDBMS consistency and corruption can be found HERE.
Therefore, if you execute DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS on an SAP NetWeaver database you risk losing data in multiple SAP application areas which cannot be detected afterward by running any database statement. And the application level logical inconsistencies can be far broader than simple data records. Any or all of the following examples could happen:
Identifying such application logical inconsistencies is almost impossible to do. There is no such thing as a DBCC CHECKSAPDB utility to assess the application consistency, or to ‘fix it’ when is broken.
In other words, one can try to use REPAIR_ALLOW_DATA_LOSS to ‘clean things up’ and actually end up making things worse. You could remove all of the blatant physical corruption but you would be creating multiple instances of logical corruption for which the application has no awareness. This would result in incorrect business data and a complete loss of confidence in the database store.
This is why the first and best advice when you experience corruption in your SAP NetWeaver database is to restore your database from backup. But if no backup is available for restore, you should open an SAP customer Support message in the SQL Server application area (BC-DB-MSS) so that you can get expert assistance. Be aware though that it is your responsibility to maintain backups of your data. If you have no backup, SAP Support can try to help you recover data but there is no guarantee it can be done and you may have to pay for the data recovery. Read more about this in SAP Note 1597910 [via SAPNet HERE].
If you try to fix such corruption either on your own or with the help of a third party, please consider these thoughts as you proceed to work on the situation. Specifically:1) [CONFIRM-BAD] Confirm the corruption actually exists (in some rare situations transient errors can give a false positive impression that corruption exists)
i.e. Don’t restore from a backup onto a database that is actually NOT corrupt. Always take time to confirm there actually is a problem first. And understand the landscape too. SAN replication can deliver unexpected results in terms of corruption being present at one time and not present at another (in these cases the corruption is only in one node of the SAN paired set so depending on which SAN you’re currently reading from, the data is corrupt or not).
2) [PRESERVE] Preserve the current state often, even when thought to be bad. Never make things worse. For example, backup frequently during your repair efforts as a contingency plan in case a mistake is made. Don’t restore from a backup onto a production system and then find out the backup is more corrupt than the restored to image, which you now have overwritten. Backup first.
3) [UNDERSTAND] Understand the root cause of the corruption. Most all such physical corruptions are sourced in hardware, firmware, environment, or human error. Try to identify the cause and the time when the first problematic symptoms were observed.
4) [ELIMINATE] Eliminate the underlying root problem. Don’t waste time fixing corruption on a system where corruption has already happened without eliminating the root cause. If it has happened once and you don’t understand the root cause, you cannot correct the root cause. This means it will happen again. If the root cause is ‘unknown’ then consider swapping out the entire hardware system if possible.
5) [FIX] Fix the corrupt state and any symptoms of it. Again, the best way to do this is to restore from a known good, clean backup. It is the best way to get a critical system back into production as quickly as possible.
Try to never use REPAIR_ALLOW_DATA_LOSS (as described HERE), but if that is your only option:
a) Open a customer support message to Microsoft SQL Server Support or SAP Support so that you can work with an expert before you execute REPAIR_ALLOW_DATA_LOSS.
142731 – DBCC checks in SQL Server
1297986 – Backup and Restore strategy for MS SQL Server
1420452 – FAQ: Restore and recovery with MS SQL Server
1597910 – Handling of database corruptions on SQL Server
b) Identify the affected tables and the type of tables.
c) Rescue as much information as possible
d) Document the impact. Have a complete list of all the affected tables where data loss is expected, the size of the tables, and the number of rows believed to be deleted.
e) For each affected table, open an SAP customer message in the application area component (e.g., an FI support message for FI tables, an SD message for SD tables, etc.). Sometimes with the preservation of Primary Keys it might be possible for responsible application experts to identify lost records, reconstruct data from other tables in the same system or pull data from a different system in the landscape.
6) [CONFIRM-GOOD] Get a final CHECKDB to ensure no more corruptions are reported. This is a critical step because the initial CHECKDB output can sometimes only report the first layer of corruptions. Once you have repaired that first layer, sometimes the DBCC check can then progress further and find new corruptions.This means you might need to execute a DBCC check and repair cycle several times until the DBCC check finally reports no further corruptions. Again, we recommend that you backup frequently so if you repair a first set of corruptions and new ones are reported you should backup the system before starting the second cycle of corruption repairs.Once the DBCC check reports there are no further corruptions, backup this good state. Then, run any test scripts available to assess the application logical consistency of the database (these might be supplied by the SAP application area colleagues when you open the SAP customer support message). Lastly, make sure that CHECKSUM is set for the database and the backup as mentioned near the start HERE.
7) [REVIEW] Learn from this event in order to either better prevent a corruption recurrence or to establish a disaster recovery plan. For example, frequently customers do not have a clearly defined disaster recovery plan. At a minimum, you should restore from a production backup at least once a month and run a DBCC check on the restored database. This will confirm that the media is readable and that your backups do not contain corrupted data. Customers who do this can safely restore a backup when corruption is experienced and can completely avoid the lengthy data repair/rescue process.Following are some specific suggestions for you to review after a corruption:
A more generic and detailed list of how to repair database corruptions in Enterprise class databases can be found HERE.
For handling corrupt SAP databases on SQL Server
Thanks to Leslie Moser for help with the text.