Author: Thomas Kejser Reviewers: Peter Byrne, Sunil Agarwal, Prem Mehra, Peter Scharlock, Lindsey Allen, Mark Souza
As part of your database monitoring, you may be keeping track of the free space in the transaction log. One reason to do this, is to detect a rogue query consuming too much transaction log space. If you have such monitoring set up, you need to be a bit careful about tempdb.
Recall that unused space in the transaction log in a FULL and BULK LOGGED recovery mode database is reclaimed by a checkpoint operations, followed by a transaction log backup. Databases in SIMPLE recovery mode automatically reclaim the unused log space whenever a checkpoint is run – without the need to run a backup. Checkpoints can be initiated by running the CHECKPOINT command, by the transaction log becoming more than 70% full or if the time specified in the recovery interval (min) configuration option has been exceeded. There are other operations that cause checkpoints too, which you can read about in Books Online, but for the purpose of this blog, we focus on these three listed here.
Tempdb, a SIMPLE recovery mode database, is a special case: Unlike user databases, tempdb is recreated at server startup. Because recovery never needs to run in tempdb, SQL Server does not run regular checkpoints in tempdb with the frequency set in recovery interval (min). Usually, unused transaction log space is reclaimed in tempdb is when the transaction log runs 70% full or if a user initiated CHECKPOINT command is run.
I just paid a visit to a customer that has an alarm going off whenever a database (including tempdb) has less than 50% unused transaction log space. The customer monitored for this condition proactively discover long running, transaction log consuming queries. Such queries can take a long time to rollback and can sometimes be a result of a user error – for example an accidental delete of many records. The DBA at this customer knew he had to take action when log space was low, to investigate such potential user error.
With the monitoring scripts used by the customer, the alarm went off regularly on tempdb. Because of the checkpoint behavior described above, unused transaction log space is only reclaimed at 70% log usage of tempdb. The customer still wanted to headroom of the 50% limit, so he had to be able to distinguish between the “by design” checkpoint at 70% and a troublemaking, long running transaction preventing log space from being reclaimed.
The solution for the customer was to monitor to the column log_reuse_wait_desc in sys.databases in combination with the unused transaction log space. If this column has the value different than CHECKPOINT and NOTHING, the log usage is a result of a long running log space consuming transaction, in tempdb.
The script below illustrates how this is done:
SELECT name , db.log_reuse_wait_desc , ls.cntr_value AS size_kb , lu.cntr_value AS used_kb , CAST(lu.cntr_value AS FLOAT) / CAST(ls.cntr_value AS FLOAT) AS used_percent , CASE WHEN CAST(lu.cntr_value AS FLOAT) / CAST(ls.cntr_value AS FLOAT) > .5 THEN CASE /* tempdb special monitoring */ WHEN db.name = 'tempdb' AND log_reuse_wait_desc NOT IN ('CHECKPOINT', 'NOTHING') THEN 'WARNING' /* all other databases, monitor foor the 50% fill case */ WHEN db.name <> 'tempdb' THEN 'WARNING' ELSE 'OK' END ELSE 'OK' END AS log_status FROM sys.databases db JOIN sys.dm_os_performance_counters lu ON db.name = lu.instance_name JOIN sys.dm_os_performance_counters ls ON db.name = ls.instance_name WHERE lu.counter_name LIKE 'Log File(s) Used Size (KB)%' AND ls.counter_name LIKE 'Log File(s) Size (KB)%'
Using the technique described in the blog, you can now distinguish between the long running transactions and the by design behavior of checkpoint in tempdb.
Note the resulting used_percent column presents a fraction of the true percentage and could be rewritten as follow for clarity to match Performance Monitor counters:
, 100 * (CAST(lu.cntr_value AS FLOAT) / CAST(ls.cntr_value AS FLOAT)) AS used_percent
Getting divide by zero error on that query. Please hold my hand.
Use NULLIF to get around divide by zero errors. You might be having Database snapshots for instance whose log file size shows up as 0