Common bad advice around disaster recovery


I've just been itching to post this one.

I'm doing a chalk-talk (AKA cabana session, this year AKA Technical Learning Centre session) at TechEd next Wednesday (8.30am 6/14) called 'DBCC CHECKDB: Magic, Monsters, and Myths'. A lot of the material in this will be included in the mega-whitepaper on DBCC I'll be writing over the summer, but it also slices nicely into blog-size posts so I'm going to do a series of them over the next month or two covering everything from today's topic to very deep drilldowns into how specific repairs work. If there's anything specific you'd like to see, let me know.

Billy Connolly (a great Scottish comedian) tells a story from his childhood about his father listening to the radio at breakfast while eating his oatmeal, and every so often shouting 'No!!!' at the radio. Eventually the radio had a nice spackling of dried oatmeal. I usually read the newsgroups and various MS and non-MS forums every day or so - and if I were to read them at breakfast, my laptop screen would very quickly get a nice spackling too (not of oatmeal though - nasty stuff).

I really despair to see some of the 'advice' that's out there, even though it's all well-intentioned. Here are some of the gems related to disaster recovery - and I'll try not to rant.

"Just run REPAIR_ALLOW_DATA_LOSS and you'll be fine..."

  • What about your backups? What's the point in taking backups if you don't use them? See my previous post for how to choose whether to repair or restore. Of course, you do have backups, right?
  • What about the fact that you may lose data? Do these advisors understand what repair's going to do to fix the errors that have been posted? I'm guessing not.
  • What about working out what happened? You should always do some root-cause analysis (RCA) to work out why the problem occurred. This is so important; otherwise you're just patching holes with duct tape when you repair or restore.
  • I've seen people recommend running repair to 'fix' 823 errors caused by unreadable disk pages...
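If repair genuinely is the last resort (no usable backups, and you've accepted the potential data loss), the mechanics look roughly like this. This is a hedged sketch; 'SalesDB' is a hypothetical database name:

```sql
-- Last-resort sketch only; 'SalesDB' is a placeholder name.
-- REPAIR_ALLOW_DATA_LOSS requires the database to be in single-user mode.
ALTER DATABASE SalesDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'SalesDB', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS;
ALTER DATABASE SalesDB SET MULTI_USER;
-- Afterwards: work out exactly what was lost, and do root-cause analysis.
```

Even then, restoring from a backup is almost always the better option when one exists.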

"Just rebuild your transaction log using these steps..."

  • This one kills me every time. Very, very rarely is it advised in the right context - a corrupt transaction log with no backups - and even then the ramifications aren't explained.
  • All the same arguments as above apply here too, but additionally:
  • Do you know what rebuilding a transaction log does to the database? It's about the worst thing you can do - ripping out all the uncommitted and unrecovered transactions and doing nothing to fix things up. This is why it's undocumented and unsupported for use except under the guidance of Product Support in last-chance situations.
  • I've seen people advise doing this if recovery is taking too long...
  • Now, in SQL Server 2005 we have a documented and supported solution for these situations, Emergency Mode Repair. See 'DBCC Statements' in Books Online and I'll cover it in detail in a future post.
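For SQL Server 2005, the supported path looks roughly like this. A hedged sketch assuming a hypothetical database 'SalesDB' whose log is corrupt and for which no backups exist:

```sql
-- Sketch of emergency mode repair in SQL Server 2005; 'SalesDB' is a
-- placeholder. Only appropriate when the log is corrupt and no backups exist.
ALTER DATABASE SalesDB SET EMERGENCY;
ALTER DATABASE SalesDB SET SINGLE_USER;
DBCC CHECKDB (N'SalesDB', REPAIR_ALLOW_DATA_LOSS);
-- If repair succeeds the database comes back online, but transactional
-- consistency is NOT guaranteed - the same caveat as any log rebuild.
```

The difference from a manual log rebuild is that this path is documented and supported, and CHECKDB gets a chance to fix up the resulting inconsistencies.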

"Just restore your database and carry on..."

  • This one's not really bad except that it's missing the crucial investigation into what happened. Chances are that whatever caused the problem this time will repeat and lead to another outage. Always do RCA - even if it's just copying the database off somewhere for later analysis.

"Run CHECKALLOC, then CHECKDB, then CHECKTABLE on all your tables, then..."

  • It is a common misconception that DBCC CHECKDB does something different from DBCC CHECKALLOC and DBCC CHECKTABLE. It doesn't. At the simplest level, DBCC CHECKDB runs DBCC CHECKALLOC and then runs DBCC CHECKTABLE on every table in the database.
  • Granted, in SQL Server 2000 and before, it didn't run DBCC CHECKCATALOG, but we fixed that in SQL Server 2005. (Yes, we'll get to DBCC CHECKCONSTRAINTS too at some point).
  • Another thing I see is advice to run DBCC CHECKALLOC to determine the damage caused by an 823 or 824 error. That won't cut it - given the distribution of page types, the odds are that it's not an allocation bitmap that's affected, so you should run DBCC CHECKDB.
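In other words, one command is enough. A sketch of the equivalence, using a hypothetical database name 'SalesDB':

```sql
-- This single command checks the whole database:
DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- It is roughly equivalent (in SQL Server 2005) to running:
DBCC CHECKALLOC (N'SalesDB');
DBCC CHECKCATALOG (N'SalesDB');
-- ...plus DBCC CHECKTABLE on every table and indexed view in the database.
```

Running the individual commands after CHECKDB just repeats work that's already been done.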

"Just flick the power switch on and off a few times on one of the drives..."

  • This is in response to 'how can I create a corrupt database to test my custom DBCC script?'. Nice - not only do you get a corrupt database, you also get a fried drive PSU. I'm sure your hardware support team would love you for that.
  • One thing that I used to recommend to people in pre-SQL Server 2005 days was to create a test database and manually delete a row from the sysindexes table. This will cause all manner of problems. Now, in SQL Server 2005 that's not really possible any more, so...
  • Here's the best way to create a test corrupt database: use a hex editor on the raw database files. Shut down the server so the file isn't locked (don't detach the database, because if you corrupt the 'wrong' page you may not be able to attach it again). Pick an offset more than, say, 100 pages into the file (at least 819200 bytes) but make sure it's aligned on an 8192-byte boundary (a page boundary). This avoids critical metadata pages and allocation bitmaps, so you'll be able to start the database and run DBCC CHECKDB on it. Write a few bytes of zeroes into the file at the chosen offset and you're almost guaranteed some page header corruption errors.
  • Now remember, I'm only advocating doing this to create a test corrupt database, don't even think of doing this to anything you value.
  • If you want to test a specific error, let me know - we have tools that can engineer any single or combination of corruptions (so we can test the check and repair algorithms)
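The hex-editor recipe above can be sketched end-to-end. This is a hypothetical walk-through assuming a throwaway database named 'CorruptMe' - never do this to anything you value:

```sql
-- Hypothetical sketch; 'CorruptMe' is a placeholder throwaway database.
-- 1. Create and populate the test database, then shut SQL Server down
--    cleanly so the data file isn't locked.
-- 2. With a hex editor, pick a page-aligned offset at least 100 pages
--    into the .mdf file:
--      offset = page_number * 8192, with page_number >= 100
--    (i.e. at least 819200 bytes in), and overwrite a few bytes with zeroes.
-- 3. Restart SQL Server and see what broke:
DBCC CHECKDB (N'CorruptMe') WITH NO_INFOMSGS, ALL_ERRORMSGS;
-- You should see page header corruption errors for the page you zeroed.
```

Because the offset is page-aligned and well past the start of the file, the critical metadata and allocation pages survive, so the database still starts and CHECKDB can report the damage.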

These are the worst ones around disaster recovery. I could go on all day about not shrinking databases and not blindly rebuilding all indexes every night, but it's lunchtime so I won't go there - yet :-)

  • Now you are forgetting about some of my favorites... I am an "Expert" on experts-exchange... I answer questions in the SQL Topic Area there, and some of my favorites there don't center around DR per se, but they sound something like this:

    "Just change to simple recovery mode and backup your log with truncate only.. then your log will be shrunk"

    "Just run the following trace flag to change the database to emergency mode...."

  • Paul,

    You wrote: "If you want to test a specific error, let me know - we have tools that can engineer any single or combination of corruptions "

    I have a monitoring program which runs a number of dbcc commands on a nightly basis.  I would like to get a hold of the tools that you mentioned for testing purposes.  Are they available online somewhere?

    Thanks!
  • Adding a note so I can get email updates to this thread.
  • Hi cgould,

    No, the tools aren't available outside of the engineering team here in Redmond and won't be at any time in the future. If you want, I could create a corrupt database and attach it to a new post for people to play with?

    Thanks
  • Paul, thanks for the offer.  I need to check a specific db.  Your hex editor trick above works for checkdb and after a bit more research I think I can script all the other types of problems I need to check for.

    Thanks anyway!
  • I work in the DR environment. Thousands of companies test at our facility. Many fail to recover. The problem that I see daily is that some have a recovery plan, and they really try to recover from a disaster, and they still fail. Just imagine the ones that don't have a plan. The biggest problems that I have seen are: 1. Backup software: reaction to the OS. 2. Restoring to dissimilar hardware. 3. Time constraints. Now put them all together with about 20 techs who have been working for at least 8 hrs, and you have backup software running that does not tell you whether it is running or has errored out - especially with large DBs like SQL. How would you like to be running a restore for 6 hrs or so, and find out then that it crapped out 4 hrs ago... More to come....

  • Just think about this for a moment... Let's use Tivoli as our backup software. We're backing up a SQL database using LTO 2 tape drives. Our current servers are DL380 G3s: dual processors, 4 GB memory, licensed for 2 processors. (Recovering to an ML570 G3 server: 32 GB memory, quad processors, quad NICs, GigE connections, P600 RAID controller, etc.) DO YOU THINK YOU WILL RECOVER, OR "blue screen"?

  • See... backup is not the problem, but rather utilizing that backup file to restore your DB. Backup software and media play an enormous role in how and what you will restore to in order to recover.
