Thursday, August 25, 2011

Possibly a faster DBCC CHECKDB

This article describes a utility that is able to report the most common DBCC CHECKDB errors significantly faster than DBCC CHECKDB does itself. This can be very important in determining quickly the correctness of data in large databases.
Now that I have your attention I should point out this utility is NOT a replacement for DBCC CHECKDB. However, I believe it will report on the most common cause of errors significantly faster that the usual DBCC CHECKDB routines.

Background

DBCC CHECKDB is great utility for ensuring data is correct. In my experience, most shops that have large databases run it on a weekly basis, since it can be a time consuming operation. It is a typical and recommended practice to restore a backup of the production database to another server and the DBCC CHECKDB is performed against this, thus freeing up the production server for other work.

The Problem

The main problem I hear about DBCC CHECKDB is that for large databases in particular, it can take a long time to run. This is exacerbated by the fact that if errors are found, it digs deeper into the data and takes even longer to complete.
I've known a typical DBCC CHECKDB on a 5 terabyte databases to take more than 10 hours to complete. However, when there were errors, the DBCC CHECKDB took in excess of 48 hours to complete.
It is possible to run a cut down version of DBCC CHECKDB using the PHYSICAL_ONLY option, however, even this takes significantly longer to run than the utility proposed in this article.

The Utility

The T-SQL code used to create this utility is attached to this article. The utility described here has the following advantages over DBCC CHECKDB:
  • It runs much faster (typically in 10% of the time of DBCC CHECKDB)
  • It stops immediately you get an error. This can represent a considerable saving in time over DBCC CHECKDB. On average it would identify an error in less than 5% of the time DBCC CHECKDB would take. Additionally DBCC CHECKDB takes even longer if an error is found.
  • It processes heaps and clustered indexes first, this is important in determining how critical a problem is. A problem with heaps and clustered indexes is more serious since non-clustered indexes there can be rebuild from the heaps/clustered indexes.
There are also some limitations:
  • It only reports on IO errors. However, a short email conversation with Paul Randal (who has written much of the DBCC CHECKDB code), confirmed that most errors from DBCC are due to IO problems. So maybe this limitation is not much of a disadvantage.
  • Assumes the databases's page_verify_option is set to 2 (CHECKSUM). This is a typical and sensible setting.

How it works

This utility reads every page of the databases' heaps, clustered and non-clustered indexes, reporting immediately when it encounters any IO problems.
The first part of the script clears the buffer pool. This is the area of memory that contains data that has been read from the underlying physical disks. The buffer pool allows much faster access to data that is needed again. By clearing the buffer pool, using the command DBCC DROPCLEANBUFFERS, we ensure any data is read from the underlying physical disks, and this is where we want to find any errors.
The next part of the script gets details of the heaps, clustered indexes and non-clustered indexes to check. These details are stored in a temporary table named #IndexDetails.
Next we dynamically build up the SQL we want to execute. For each heap, clustered index and non-clustered index in #IndexDetails, we create the SQL that will count the number of rows in the underlying table, using an index hint. This index hint forces the underlying data to be read using the index number associated with the named heap, clustered index, or non-clustered index. Since we have cleared the buffer pool, the data is read from the physical disk, thus allowing us to check for any IO errors.
A typical example of the SQL we want to execute is:
SELECT COUNT_BIG(*) AS [TableName: NameOfTable. IndexName: NameOfIndex. IndexId: 1] 
 FROM [dbo].[NameOfTable] WITH (INDEX( 1));
The keyword COUNT_BIG is used because the tables may have more rows than the maximum value of the int data type.
The heap and clustered indexes are processed first (as shown by the ORDER BY statement, heaps have an indexid of 0 and clustered indexes have an indexid of 1). This ensures the most important underlying data structures (i.e. heaps and clustered indexes) are processed first.
For each heap, clustered index and non-clustered index processed, the name of the table, index, and indexid is output. This can be used to show the progress of the executing SQL. An example of this output is shown in figure 1 below:

Figure 1: example output showing the underlying table, index and indexid.
If an error occurs, the script stops, and an error message output. You can determine the heap/index in error by comparing the expected normal output (shown in figure 1) with the debug output shown in the message tab of SQL Server management Studio. Additionally inspecting the table msdb..suspect_pages will also provide error details.
An example error message is given below:
Msg 824, Level 24, State 2, Line 3
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:28254611; actual 0:0). It occurred during a read of page (1:28254611) in database ID 10 at offset 0x000035e4326000 in file 'K:\MSSQL.1\MSSQL\Data\Paris_Paris3.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Conclusion

The utility provided in this article can identify the most common IO errors significantly faster than DBCC CHECKDB. As such it should be useful to quickly identify errors. It should be noted again, this is NOT a replacement for DBCC CHECKDB which does many more things.

Credits

Ian Stirk has been working in IT as a developer, designer, and architect since 1987. He holds the following qualifications: M.Sc., MCSD.NET, MCDBA, and SCJP. He is a freelance consultant working with Microsoft technologies in London England. His new book, SQL Server DMVS in Action, was published in May 2011. He can be contacted at Ian_Stirk@yahoo.com.

The SQL Code

-- I. Stirk. ian_stirk@yahoo.com LightweightPageChecker utility...
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED-- Ensure buffer pool is empty.
DBCC DROPCLEANBUFFERS
-- Get details of all heaps, clustered indexes and non-clustered indexes to check.
SELECT 
 ss.name AS SchemaName
 , st.name AS TableName
 , s.name AS IndexName 
 , s.rowcnt AS 'Row Count' 
 , s.indidINTO #IndexDetailsFROM sys.sysindexes s INNER JOIN sys.tables st ON st.[object_id] = s.[id]
INNER JOIN sys.schemas ss ON ss.[schema_id] = st.[schema_id]
WHERE s.id > 100 -- Only user tables
AND s.rowcnt >= 1 -- Ignore stats rows 
ORDER BY s.indid, [Row Count] DESC -- Do heaps and clustered first
DECLARE @CheckIndexesSQL NVARCHAR(MAX)
SET @CheckIndexesSQL = ''
-- Build SQL to read each page in each index (including clustered index).
SELECT @CheckIndexesSQL = @CheckIndexesSQL + CHAR(10) 
 + 'SELECT COUNT_BIG(*) AS [TableName: ' + SchemaName + '.' 
 + TableName + '. IndexName: ' + ISNULL(IndexName, 'HEAP') 
 + '. IndexId: ' + CAST(indid AS VARCHAR(3)) + '] FROM ' 
 + QUOTENAME(SchemaName) + '.' + QUOTENAME(TableName) 
 + ' WITH (INDEX(' + CAST(indid AS VARCHAR(3)) + '));'
FROM #IndexDetails-- Debug.
DECLARE @StartOffset INT
DECLARE @Length INT
SET @StartOffset = 0
SET @Length = 4000
WHILE (@StartOffset < LEN(@CheckIndexesSQL))
BEGIN
 PRINT SUBSTRING(@CheckIndexesSQL, @StartOffset, @Length)
 SET @StartOffset = @StartOffset + @LengthEND
PRINT SUBSTRING(@CheckIndexesSQL, @StartOffset, @Length)
-- Do work.
EXECUTE sp_executesql @CheckIndexesSQL-- Tidy up.
DROP TABLE #IndexDetails

Resources:

A faster DBCC.doc