Welcome to MSDN Blogs Sign in | Join | Help

Implementing uniqueness constraints on large columns

SQL Server uniqueness constraints create an underlying unique index. SQL Server index keys may not be more than 900 bytes. Below I discuss how to implement uniqueness constraints with hash indexes when the key size can exceed 900 bytes and give the results of some tests on the relative performance of the hash index approach.

Calculating the size of an index

If there are variable length columns, SQL Server may allow the uniqueness constraint to be created but oversized rows will not be insertable. Below is a formula to determine when an index may overflow the 900 byte limit. The maximum size of a unique constraint index key can be calculated using Table 1 as:

clip_image002 where clip_image004 is the overhead, clip_image006 is the variable sized column overhead for clip_image008 variable sized columns, clip_image010 is the nullable column overhead for clip_image008 nullable columns, clip_image012 is the type cost for column clip_image014, clip_image016 is the number of variable sized columns, clip_image018is the number of nullable columns and clip_image020 is the number of columns. The overhead clip_image004 varies depending on whether or not the underlying index is clustered (see Table 1).

Table 1 Table of contributions to the size of an index key. To calculate the maximum size of an index key in bytes add the size of every column in the index key and the primary key and add the overhead.

 

Object

Size (bytes)

bigint

8

binary(n)

clip_image008 

binary(max)

clip_image024 

bit

1

char

clip_image008 

date

3

datetime

8

datetime2

8

datetimeoffset

10

decimal

17

float

8

geography

clip_image024 

geometry

clip_image024 

hierarchyid

892

image

clip_image024 

int

4

money

8

nchar(n)

clip_image026 

nchar(max)

clip_image024 

ntext

clip_image024 

numeric

17

nvarchar(n)

clip_image026 

nvarchar(max)

clip_image024 

real

4

smalldatetime

4

smallint

2

smallmoney

4

sql_variant

8016

sysname

256

text

clip_image024 

time

5

timestamp

8

tinyint

1

uniqueidentifier

16

varbinary

clip_image008 

varbinary(max)

clip_image024 

varchar

clip_image008 

xml

clip_image024 

CLR type

?

variable size columns (n)

clip_image028 (0 if n = 0)

nullable columns (n)

clip_image030 (0 if n = 0)

clustered overhead

4

nonclustered overhead

7

Using hash indexes to enforce uniqueness constraints

If the index key will always be 900 bytes or less, then an ordinary uniqueness constraint may be used. If the index key could be greater than 900 bytes, then a uniqueness constraint can be enforced with an after-trigger. To avoid table scans, a hash index can be generated for the columns with the unique constraint. The uniqueness constraint can be enforced with the scheme:

create table [Schema].[Table]
(
  [Id]      int not null constraint [PK_Table] primary key,
  ,
  [Column1] Column1TypeDeclaration not null,
  [Column2] Column2TypeDeclaration null,
  ,
  [Internal_Hash_Column1_Column2] as checksum([Column1], [Column2]) persisted not null
);
go

create
index [Unique_Hash_Table_Column1_Column2] on [Schema].[Table] ([Internal_Hash_Column1_Column2]);
go

create
trigger [ConstraintInsertTrigger_Extent] on [Schema].[Table]
  after of insert, update as
begin
  if (update([Column1]) or update([Column2])) and
      exists (select *
              from [Schema].[
Table] as T
                   inner join inserted as I on I.[
Internal_Hash_Column1_Column2] = T. [Internal_Hash_Column1_Column2] and
                                               I.[
Column1] = T.[Column1] and
                                               (I.[
Column2] = T.[Column2] or (I.[Column2] is null and T.[Column2] is null))
       group by I.[
Internal_Hash_Column1_Column2], I.[Column1], I.[Column2]
       having count(*) > 1)
 
 begin
    if @@trancount > 0
    begin
      rollback transaction;
    end;

    raiserror(N'Error: Values in the columns [Schema].[Table] ([Column1], [Column2]) must be unique.', 16, 0);

  end;
end;
go

Essentially, the strategy is to store an indexed hash of the columns in the uniqueness constraint. The after trigger makes sure there are no duplicates, by searching for rows that both hash to the same value as one of the inserted rows and match on the constrained columns. Using hashes avoids scanning the entire table for matches and has good space efficiency. The above scheme shows how to handle multiple columns and both nullable and non-nullable columns.

The reason to use checksum and neither hashbytes nor binary_checksum is that checksum respects collation. For example,

checksum(N'X') = checksum(N'x')

but

binary_checksum(N'X')binary_checksum(N'x').

However, hashbytes and binary_checksum are suitable hashing functions for non-text columns. Also if the constrained columns are unlikely to contain common prefixes, a prefix of the columns (e.g. left([Column], 10)) may be a very good hashing function.

Using a uniqueness constraint on a hashed column to enforce uniqueness constraints

A common alternative strategy is to put a unique constraint over a persisted column defined using hashbytes. The idea is that the chance of a collision for, say, a SHA1 hash is so low that other bugs or hardware failures are much more likely to cause a failure. However, because of the Birthday Fallacy/Paradox people often underestimate the chance of a collision happening. If a hashing function is perfect then the chance of there being a collision among n randomly chosen values in a b-bit hash is approximately:

clip_image032 In general, a collision is highly likely in a sample of size

clip_image034

Thus for, say, the 160 bit SHA1 hash the chance of a collision in even a trillion random rows seems very low. However, there are some flaws in this argument:

·         Cryptographic hashing functions are poorly understood but there are strong indications that MD5 and SHA1 are weaker than perfect hashing functions.

·         For a given value, it is often feasible to find another value that hashes to the same hash (i.e. a hash collision). Thus if an application could be compromised by a hash collision and an adversary has some control over the values the application uses then this strategy is unsuitable.

·         Cryptographic hash functions do not respect collations. Thus, two strings that are equal like ‘X’ and ‘x’ (in a case insensitive collation) may hash to different hashes.

·         The input to the hashing function is usually not random.

Relative performance of the hash index approach for columns under 900 bytes

It seems likely that the hash index approach is more efficient than the unique index based approach at enforcing uniqueness constraints. Table 2 and Table 3 show the average time to insert and delete 1 000 and 100 000 rows for various sized strings with a uniqueness constraint. The SQL Server unique constraint and the hash index approach are compared for both the no prefix and the 20 byte (10 Unicode character) common prefix cases. From this analysis, it is clear that for column sets under 900 bytes the hash index approach is unlikely to perform better than a traditional unique index based approach. The test was carried out on a dedicated workstation with an Intel Pentium D 3.00GHz processor and 4 GB of RAM. The test script is attached to this blog entry.

Table 2: The average time, in seconds, to insert and delete 1 000 rows with a Unicode string with a uniqueness constraint. Both the unique and hash index approaches to uniqueness constraint enforcement are tabulated. Both random strings and strings with a fixed 20 byte (10 Unicode character) prefix are tabulated. The first time in each pair, is the time to insert 1 000 rows at once. The second time in each pair, is the time to insert 1 000 rows separately (i.e. 1 at a time).

 

Unique index

Hash index

Length

No prefix

With prefix

No prefix

With prefix

50

0.10

0.43

0.11

0.46

0.12

0.59

0.12

0.59

100

0.18

0.45

0.18

0.44

0.14

0.58

0.14

0.57

200

0.26

0.56

0.26

0.54

0.28

0.62

0.24

0.60

400

0.37

0.62

0.39

0.66

0.31

0.65

0.42

0.70

Table 3: The average time, in seconds, to insert and delete 100 000 rows with a Unicode string with a uniqueness constraint. Both the unique and hash index approaches to uniqueness constraint enforcement are tabulated. Both random strings and strings with a fixed 20 byte (10 Unicode character) prefix are tabulated. The first time in each pair, is the time to insert 100 000 rows at once. The second time in each pair, is the time to insert 100 000 rows separately (i.e. 1 at a time).

 

Unique index

Hash index

Length

No prefix

With prefix

No prefix

With prefix

50

6.71

36.22

6.92

35.05

20.82

52.60

20.54

60.33

100

7.09

37.36

7.33

37.02

39.92

57.03

36.14

57.46

200

16.38

43.13

15.01

44.09

65.81

59.12

62.35

59.55

400

33.59

55.07

24.99

56.63

149.95

70.06

151.21

71.07

SQL Server Modeling Services announcement

The code name “Oslo” repository now has the official name “SQL Server Modeling Services”. SQL Server Modeling Services will be a SQL Server workload like SQL Server Reporting Services. Details will be announced at the PDC where the most relevant session is http://microsoftpdc.com/Sessions/SVR19. You can also read about it online at http://msdn.microsoft.com/en-us/library/dd129586(VS.85).aspx.

In Microsoft, like most organizations, it is hard to get things done when you are in different organizational units and on different schedules. This move makes the repository team firmly part of the SQL Server team and on the same schedule. Thus this change will make alignment and deep integration of the repository with rest of the SQL Server platform easier. Also it provides a simple setup experience and much simpler licensing model for customers. Already, there are many planned changes for the next version of SQL Server that will make for a much stronger metadata store and overall platform.

Paging SQL Server result sets

Paging through result sets is an approach for reducing the network and client resources used to display large result sets. Essentially, the approach is to load only a page (e.g. 100 rows) of data at a time. It is likely that a user will only want to see a page of data. If they want to see more data then the next page is retrieved etc. Because the entire result set is probably not retrieved and does not need to be stored on the client the network and client resource demands will be lower. Ideally, latency for showing the first rows of data should also be lower since only a subset of data is needed.

The most commonly advocated SQL Server paging approach is based on the row_number ranking function.

with NumberedResult as
(
  select row_number() over (order by M.[Id]) as [Row], M.[Id], M.[Name]
  from [System_Runtime].[Methods] as M
)
select NR.[Id], NR.[Name]
from NumberedResult as NR
where NR.[Row] between @startRow and @stopRow;

This works by associating a row number with each row and fetching blocks of row numbers. For this to work correctly there must be a fixed row order defined by the over expression and the result set must be unchanging. Otherwise each call to the query may return results in a different order or even different results destroying the illusion of paging through a point in time result set.

A generally better approach is to use top to retrieve a page of data.

select top(@pageSize) M.[Id], M.[Name]
from [System_Runtime].[Methods] as M
where M.[Id] > @lastId
order by M.[Id]

As before, for this to work correctly there must be a fixed row order defined by the order by clause and the result set must be unchanging. The reason this will usually work better is that if the order by clause is covered by an index then the data can be found without sorting the entire data set. In theory, row_number could work this way also but in practice does not. The difference can be seen by looking at query plans—typically a row_number based query will form the entire result set while a top based approach will seek to the first index entry and then return the next page of rows in the index.

If there is no fixed row order or the result set can change then data can be read into a temporary table and processed by either of the two previous approaches.

ADO.Net has good support for paging data http://msdn.microsoft.com/en-us/library/tx1c9c2f.aspx. The Fill method has a specific overload to support paging in of data.

Below I compare the approaches for two interesting performance questions for a table of .Net Framework with 207 966 rows:

·         What is the cost of retrieving the first page?

·         What is the cost of retrieving all the pages?

Below are the results in seconds to perform the tests 100 times. As always it is wise to not make assumptions in performance or concurrency critical code and to test various approaches yourself.

Approach

First page (sec)

All pages (sec)

Row number

2.78

7 310.47

Top

0.26

51.17

Temporary table

111.13

139.94

 

When applicable, the top approach has clear performance advantages. Also it will typically use less server resources than the other approaches.

Here are the test scripts (run with the query option to discard the result set after execution but to collect client statistics).

-- row_number approach: First page
dbcc freesystemcache ('All')

declare @repetitions int = 100;

declare @i int = 0;
while @i < @repetitions
begin
  declare @pageSize bigint = 100;
  declare @lastrow bigint = (select count(*) from [System_Runtime].[Methods]);

  declare @startRow bigint = 1;

  declare @stopRow bigint = @startRow + @pageSize - 1;

  with NumberedResult as
 
(
    select row_number() over (order by M.[Id]) as [Row], M.[Id], M.[Name]
    from [System_Runtime].[Methods] as M
  )
  select NR.[Id], NR.[Name]
  from NumberedResult as NR
  where NR.[Row] between @startRow and @stopRow;

  set @i += 1;
end;

-- row_number approach: All pages
dbcc freesystemcache ('All')

declare @repetitions int = 100;

declare @i int = 0;
while @i < @repetitions
begin
  declare @pageSize bigint = 100;
  declare @lastrow bigint = (select count(*) from [System_Runtime].[Methods]);

  declare @startRow bigint = 1;

  while @startRow < @lastRow
  begin
    declare @stopRow bigint = @startRow + @pageSize - 1;

    with NumberedResult as
   
(
      select row_number() over (order by M.[Id]) as [Row], M.[Id], M.[Name]
      from [System_Runtime].[Methods] as M
    )
    select NR.[Id], NR.[Name]
    from NumberedResult as NR
    where NR.[Row] between @startRow and @stopRow;

    set @startRow += @pageSize;
  end;

  set @i += 1;
end;

-- top approach: First page
dbcc freesystemcache ('All')

declare @repetitions int = 100;

declare @i int = 0;
while @i < @repetitions
begin
  declare @pageSize bigint = 100;
  declare @maxId bigint = (select max(M.Id) from [System_Runtime].[Methods] as M);

  declare @lastId bigint = 0;

  select top(@pageSize) M.[Id], M.[Name]
  from [System_Runtime].[Methods] as M
  where M.[Id] > @lastId
  order by M.[Id]

  set @i += 1;
end;

-- top approach: All pages
dbcc freesystemcache ('All')

declare @repetitions int = 100;

declare @i int = 0;
while @i < @repetitions
begin
  declare @pageSize bigint = 100;
  declare @maxId bigint = (select max(M.Id) from [System_Runtime].[Methods] as M);

  declare @lastId bigint = 0;

  while @lastId < @maxId
  begin
    select top(@pageSize) M.[Id], M.[Name]
    from [System_Runtime].[Methods] as M
    where M.[Id] > @lastId
    order by M.[Id]

    set @lastId += @pageSize;
  end;

  set @i += 1;
end;

-- temporary table approach: First page
dbcc freesystemcache ('All')

declare @repetitions int = 100;

declare @i int = 0;
while @i < @repetitions
begin
  declare @pageSize bigint = 100;

  create table #results
  (
    [Row]  bigint not null primary key,
    [Id]   bigint not null,
    [Name] nvarchar(max) not null
  );

  insert into #results ([Row], [Id], [Name])
    select row_number() over (order by M.[Id]) as [Row], M.[Id], M.[Name]
    from [System_Runtime].[Methods] as M

  declare @startRow bigint = 1;

  declare @stopRow bigint = @startRow + @pageSize - 1;

  select R.[Id], R.[Name]
  from #results as R
  where R.[Row] between @startRow and @stopRow;

  drop table #results;

  set @i += 1;
end;

-- temporary table approach: All pages
dbcc freesystemcache ('All')

declare @repetitions int = 100;

declare @i int = 0;
while @i < @repetitions
begin
  declare @pageSize bigint = 100;

  create table #results
  (
    [Row]  bigint not null primary key,
    [Id]   bigint not null,
    [Name] nvarchar(max) not null
  );

  insert into #results ([Row], [Id], [Name])
    select row_number() over (order by M.[Id]) as [Row], M.[Id], M.[Name]
    from [System_Runtime].[Methods] as M

  declare @lastrow bigint = (select count(*) from #results);
  declare @startRow bigint = 1;

  while @startRow < @lastRow
  begin
    declare @stopRow bigint = @startRow + @pageSize - 1;

    select R.[Id], R.[Name]
    from #results as R
    where R.[Row] between @startRow and @stopRow;

    set @startRow += @pageSize;
  end;

  drop table #results;

  set @i += 1;
end;

"Oslo" repository lifecycle/versioning whitepaper

Repositories face issues to do with versioning schema and data as well as integrating with an organization's lifecycle processes. Here is a whitepaper I wrote on how the "Oslo" team thinks about handling various lifecycle/versioning issues http://msdn.microsoft.com/en-us/library/dd638048.aspx. The whitepaper deals with the following issues:

Application lifecycle management (ALM): Facilitating team development of software in a continuous cycle of steps (that is, definition, design, development, testing, deployment, and management). Traditionally, Source Code Control Systems (SCCS) and build systems, like Visual Studio Team Foundation Server, automate ALM.

Schema evolution: Changing schemas over time in such a way that applications continue to run unchanged or with minimal changes.

"Oslo" product versioning: Upgrading a customer's "Oslo" repository, from version n to version n+1, in such a way that the data is preserved.

Application versioning: The ability to upgrade applications, stored in the "Oslo" repository, to new versions. The upgrade process has the option to retain the data from older versions of the application.

Operational versioning: The ability to understand operational/management changes to a deployed application.

How to make a copy of the repository

Here is a nice article on how to make a copy of the “Oslo” repository so you can later restore it to its previous state http://blogs.msdn.com/modelcitizen/archive/2009/04/06/how-to-clean-up-the-repository.aspx. This is great for experimenting with the repository and for edit debug cycles with domains. The steps are essentially what we do internally at Microsoft.

Posted by AnthonyBloesch | 1 Comments
Filed under: ,

Favorite keyboard shortcuts

One of the duties of working at Microsoft is the day-to-day use of pre-beta and pre-release products—we call it eating our own dog food. The idea is to find and fix real-life product issues before customers see them. In honor of the wonderful experience I have had with Windows 7 dog-food, attached are a list of my favorite shortcuts in US Letter docx, A4 docx, US Letter XPS and A4 XPS formats. I have included Windows 7, Internet Explorer 8 and SQL Server Management Studio 2008 shortcuts.

A thorough list of shortcuts for major Microsoft products can be found at http://www.microsoft.com/enable/products/keyboard.aspx.

SQL Server 2008 error handling best practice

Error handling in SQL Server 2008 needs careful implementation. The Microsoft “Oslo” Repository’s API has the further problem that we cannot mandate the error handling logic in our callers. Thus a stored procedure call could be in a transaction or not and in a try-catch block or not. Below is the pattern we have chosen based on experimentation and best practice guidance from the SQL Server team and other SQL Server experts. A test script for all the interesting cases can be found at http://blogs.msdn.com/anthonybloesch/attachment/9469577.ashx.

For a good overview of SQL Server error handling see http://www.sommarskog.se/error-handling-I.html and http://www.sommarskog.se/error-handling-II.html. In our design, we had the following main issues and mitigations:

·         A transaction rollback will rollback to the outermost transaction but if there is an outer transaction we would like to only rollback the changes to the module’s inner transaction. The solution is to use SQL Server save points if there is an outer transaction.

·         A sufficiently severe raiserror will terminate a module if there is an outer try-catch block but if there is none then execution will continue in the module. The solution is to always explicitly return after raising an error.

·         Some developers like to use stored procedure return values to encode error states. The solution is to return appropriate error codes.

·         SQL Server will raise warnings if the transaction depth entering and leaving a transaction do not match. The solution is to be careful.

·         Triggers have an implicit transaction. The solution is to use a simplified pattern for triggers where a transaction is never started.

·         Save points need unique names if modules can nest otherwise you can rollback to the wrong save point. The solution is to use a GUID to name the save points.

Here is the pattern for stored procedures (eliding our special error reporting routines):

if parameter error
begin
  raiserror(N'…', 16, 0);

  return -1;
end
else
begin
  begin try
    declare @hasOuterTransaction bit = case when @@trancount > 0 then 1 else 0 end;
    declare @rollbackPoint nchar(32) = replace(convert(nchar(36), newid()), N'-', N'');
   
    if @hasOuterTransaction = 1
    begin      
      save transaction @rollbackPoint;
    end
    else
    begin
      begin transaction @rollbackPoint;
    end;

   
Do work;

    if @hasOuterTransaction = 0
    begin
      commit transaction;
    end;
  end try
  begin catch
    if xact_state() = 1
    begin
      rollback transaction @rollbackPoint;
    end;

    execute
Standard module error handler;

    return -error_number();
  end catch;
end;

Here is the pattern for triggers (eliding our special error reporting routines):

if parameter error
begin

  rollback transaction;


  raiserror(N'…', 16, 0);

 

  return;
end
else
begin
  begin try

   
Do work;

  end try
  begin catch
    rollback transaction;

    execute
Standard module error handler;

    return;
  end catch;
end;

 

 

Testing strings for equality counting trailing spaces

The SQL standard requires that string comparisons, effectively, pad the shorter string with space characters. This leads to the surprising result that N'' ≠ N' ' (the empty string equals a string of one or more space characters) and more generally any string equals another string if they differ only by trailing spaces. This can be a problem in some contexts. Unicode combining marks further complicate the issue because for many collations two distinct byte strings should compare as equal. For example, \u212b () should be equal to \u0041 (A) + \u030a (combining ring above).

There are a number of approaches to the problem that people have suggested but they all have flaws:

1.        Convert the strings to varbinary values and then test for equality. This does not handle combining marks correctly.

2.       Replace all the space characters with an unlikely character like the unit separator (\u001f) and then test for equality. This can cause two strings to test equal if they match except for space characters corresponding to the unlikely character.

3.       Append a character that is not treated oddly by any collation like ‘X’ to each string and compare them. This could cause an error if the string is already of maximum size.

4.      Use the like operator to do the equality check. This may cause two strings to test equal if they match because of wildcard characters in the second string.

5.       Use XQuery’s string comparison. This does not handle combining marks correctly.

6.      Test for equality normally and check that the data lengths match. This does not handle combining marks correctly.

Overall I like 3 the best as it has lowest risk.

Here is a simple test script for testing various approaches:

declare @test table

(

   string1 nvarchar(100) collate Latin1_General_100_CI_AS_KS_WS,

   string2 nvarchar(100) collate Latin1_General_100_CI_AS_KS_WS,

   equal   bit

);

 

insert into @test (string1, string2, equal)

  values (N' ', N' ', 1),

         (N' ', N'  ', 0),

         (nchar(0x212b), nchar(0x0041) + nchar(0x030a), 1),

         (N'a', N'%', 0),

         (N' ', nchar(0x0001f), 0);

        

       

select convert(varbinary(10), T.string1) as [String1], convert(varbinary(10), T.string2) as [String2],

       case when case when T.string1 = T.string2 then 1 else 0 end = T.equal then N'OK'

            else N'wrong'

       end as [=],

       case when case when convert(varbinary(max), T.string1) = convert(varbinary(max), T.string2) then 1 else 0 end = T.equal then N'OK'

            else N'wrong'

       end as [varbinary],

       case when case when replace(T.string1, N' ', nchar(0x001f)) = replace(T.string2, N' ', nchar(0x0001f)) then 1 else 0 end = T.equal then N'OK'

            else N'wrong'

       end as [replace],

       case when case when T.string1 + N'X' = T.string2 + N'X' then 1 else 0 end = T.equal then N'OK'

            else N'wrong'

       end as [append],

       case when case when T.string1 like T.string2 then 1 else 0 end = T.equal then N'OK'

            else N'wrong'

       end as [like],

       case when convert(xml, N'').exist(N'/.[sql:column("T.string1") eq sql:column("T.string2")]') = T.equal then N'OK'

            else N'wrong'

       end as [XQuery],

       case when case when T.string1 = T.string2 and datalength(T.string1) = datalength(T.string2) then 1 else 0 end = T.equal then N'OK'

            else N'wrong'

       end as [datalength]

from @test as T;

This results in the table:

String1

String2

=

varbinary

replace

append

like

XQuery

datalength

0x2000

0x2000

OK

OK

OK

OK

OK

OK

OK

0x2000

0x20002000

wrong

OK

OK

OK

OK

OK

OK

0x2B21

0x41000A03

OK

wrong

OK

OK

OK

wrong

wrong

0x6100

0x2500

OK

OK

OK

OK

wrong

OK

OK

0x2000

0x1F00

OK

OK

wrong

OK

OK

OK

OK

 

Instead of triggers over views (part 1)

Views are useful for creating a business entity based view data while allowing for an efficient logical schema. However, views are normally not updatable--limiting their utility. However, SQL Server’s instead of triggers allow many of these views to be updatable. SQL Server view-based instead of triggers can be a tricky to use. Below are some experiments that show how to write them and some of their properties. A complete script is attached.

In instead of triggers, two pseudo-tables inserted and deleted contain the before and after values of the operation. The update and columns_updated functions indicate if a column is updated by the operation causing the trigger to fire.

Imagine the following table:

create table [Test].[TestsTable]

(

  [Id]     int not null identity constraint [PK_TestsTable] primary key,

  [Value1] nvarchar(100) not null,

  [Value2] nvarchar(100),

  [Value1and2] as [Value1] + [Value2]

);

with the following view:

create view [Test].[Tests]

as

  select T.[Id], T.[Value1], T.[Value2]

  from [Test].[TestsTable] as T;

The following instead of triggers make the view updatable. The highlighted portions are to show the behavior of the inserted and deleted pseudo-tables and the update and columns_updated functions:

create trigger [Test].[InsertTestTrigger] on [Test].[Tests]

  instead of insert as

begin

  select case when update([Id]) then N'yes' else N'no' end as [Id Updated],

         case when update([Value1]) then N'yes' else N'no' end  as [Value1 Updated],

         case when update([Value2]) then N'yes' else N'no' end  as [Value2 Updated],

         [Test].[BinaryToBinaryNumber](columns_updated()) as [Column Update Mask];

 

  select *

  from inserted

  order by [Id];

 

  select *

  from deleted

  order by [Id];

 

  insert into [Test].[TestsTable] ([Value1], [Value2])

    select I.[Value1], I.[Value2]

    from inserted as I;

end;

go

 

create trigger [Test]. [UpdateTestTrigger] on [Test].[Tests]

  instead of update as

begin

  select case when update([Id]) then N'yes' else N'no' end as [Id Updated],

         case when update([Value1]) then N'yes' else N'no' end  as [Value1 Updated],

         case when update([Value2]) then N'yes' else N'no' end  as [Value2 Updated],

         [Test].[BinaryToBinaryNumber](columns_updated()) as [Column Update Mask];

 

  select *

  from inserted

  order by [Id];

 

  select *

  from deleted

  order by [Id];

 

  if update([Id])

  begin

    raiserror (N'Error: May not updatethe column [Test].[TestsTable].[Id]', 16, 0);

  end

  else

  begin

    update [Test].[TestsTable]

      set [Value1] = I.[Value1],

          [Value2] = I.[Value2]

      from inserted as I

      where [Test].[TestsTable].[Id] = I.[Id];

  end;

end;

go

 

create trigger [Test].[DeleteTestTrigger] on [Test].[Tests]

  instead of delete as

begin

  select case when update([Id]) then N'yes' else N'no' end as [Id Updated],

         case when update([Value1]) then N'yes' else N'no' end  as [Value1 Updated],

         case when update([Value2]) then N'yes' else N'no' end  as [Value2 Updated],

         [Test].[BinaryToBinaryNumber](columns_updated()) as [Column Update Mask];

 

  select *

  from inserted

  order by [Id];

 

  select *

  from deleted

  order by [Id];

 

  delete from [Test].[TestsTable]

    from deleted as D

    where [Test].[TestsTable].[Id] = D.[Id];

end;

Suppose the following insert statement is executed.

insert into [Test].[Tests] ([Value1], [Value2])

  values (N'1-1', N'1-2'),

         (N'2-1', null);

Then the following update pattern is the result. Because this is an insert operation all columns are marked as updated. Notice that the bit corresponding to the first column is the low bit not the high bit. Also, the bit set is based on the column ID not the column ordinal.

Id Updated

Value1 Updated

Value2 Updated

Column Update Mask

Yes

Yes

Yes

0b00000111

 

The inserted pseudo-table is:

Id

Value1

Value2

0

1-1

1-2

0

2-1

NULL

 

The deleted pseudo-table is empty:

Id

Value1

Value2

 

 

 

 

The resulting view is:

Id

Value1

Value2

1

1-1

1-2

2

2-1

NULL

 

Suppose the following insert statement is executed.

insert into [Test].[Tests] ([Value1])

  values (N'3-1'),

         (N'4-1');

Then the following update pattern is the result. Because this is an insert operation all columns are marked as updated even unspecified columns.

Id Updated

Value1 Updated

Value2 Updated

Column Update Mask

Yes

Yes

Yes

0b00000111

 

The inserted pseudo-table is:

Id

Value1

Value2

0

3-1

NULL

0

4-1

NULL

 

The deleted pseudo-table is:

Id

Value1

Value2

 

 

 

 

The resulting view is:

Id

Value1

Value2

1

1-1

1-2

2

2-1

NULL

3

3-1

NULL

4

4-1

NULL

 

Suppose the following update statement is executed.

update [Test].[Tests]

  set [Value1] += N' updated';

The following update pattern is the result. Only the Value1 column is marked as updated.

Id Updated

Value1 Updated

Value2 Updated

Column Update Mask

No

Yes

No

0b00000010

 

The inserted pseudo-table is:

Id

Value1

Value2

1

1-1 updated

1-2

2

2-1 updated

NULL

 

The deleted pseudo-table is:

Id

Value1

Value2

1

1-1

1-2

2

2-1

NULL

 

The resulting view is:

Id

Value1

Value2

1

1-1 updated

1-2

2

2-1 updated

NULL

3

3-1

NULL

4

4-1

NULL

 

Suppose the following delete statement is executed.

delete from [Test].[Tests]

  where [Id] > 2;

The following update pattern is the result. Since this is a delete operation no column is marked as updated.

Id Updated

Value1 Updated

Value2 Updated

Column Update Mask

No

No

No

0b

 

The inserted pseudo-table is:

Id

Value1

Value2

 

 

 

 

The deleted pseudo-table is:

Id

Value1

Value2

3

3-1

NULL

4

4-1

NULL

 

The resulting view is:

Id

Value1

Value2

1

1-1 updated

1-2

2

2-1 updated

NULL

 

In the next part I will look at some issues you need to be aware of when working with instead of triggers.

Large tuple uniqueness constraints in SQL Server

SQL Server 2008 limits unique constraints to 900 bytes of data per tuple. Here is a technique for enforcing uniqueness over larger tuple sizes. In many cases, the technique is more efficient than SQL Server’s unique constraints and can be used for performance critical cases.

Hash index based implementation

Essentially, the approach stores a hash of the columns in that need to be unique and uses this to minimize the number of rows that need to be checked when a row is inserted or updated. A trigger is used to enforce the constraint. A script with test code can be found attached.

As an example, suppose you want to add a unique constraint over Column1 and Column2 to the table:

create table [Schema].[Table]

(

  [Id]      int not null constraint [PK_Table] primary key,

  [Column1] nvarchar(max) not null,

  [Column2] nvarchar(max) null

);

Firstly add a column to store the hash and an index over the hash column:

create table [Schema].[Table]

(

  [Id]      int not null constraint [PK_Table] primary key,

  [Column1] nvarchar(max) not null,

  [Column2] nvarchar(max) null,

  [Internal_Hash_Column1_Column2] as checksum([Column1], [Column2]) persisted not null

);

 

create index [Unique_Hash_Table_Column1_Column2] on [Schema].[Table] ([Internal_Hash_Column1_Column2]) include ([Column1], [Column2]);

If the columns are typically large then you may get better overall performance by removing the included columns from the index. Secondly, add a trigger to enforce the constraint:

create trigger [ConstraintInsertTrigger_Table] on [Schema].[Table]

  after insert, update as

begin

       if (update([Column1]) or update([Column2])) and

          exists (select *

                        from [Schema].[Table] as T

                             inner join inserted as I on I.[Internal_Hash_Column1_Column2] = T. [Internal_Hash_Column1_Column2] and

                                                         I.[Column1] = T.[Column1] and

                                                         (I.[Column2] = T.[Column2] or (I.[Column2] is null and T.[Column2] is null))

                        group by I.[Internal_Hash_Column1_Column2], I.[Column1], I.[Column2]

                        having count(*) > 1)

       begin

         if @@trancount > 0

         begin

              rollback transaction;

         end;

 

         raiserror(N'Error: Values in the columns [Schema].[Table] ([Column1], [Column1]) must be unique.', 16, 0);

       end;

end;

Care is needed in handling nullable columns like Column2.

Below are some simple test cases. The insert statement:

insert into [Schema].[Table] ([Id], [Column1], [Column2])

  values (1, N'1A', N'2A'),

         (2, N'1B', N'2B');

results in the table:

Id

Column1

Column2

Internal_Hash_Column1_Column2

1

1A

2A

585901004

2

1B

2B

-1712577588

 

The insert statement:

insert into [Schema].[Table] ([Id], [Column1], [Column2])

  values (3, N'1A', N'2A'),

         (4, N'1C', N'2C');

fails because the tuple (1A, 2A) is already in the table. The insert statement:

insert into [Schema].[Table] ([Id], [Column1], [Column2])

  values (3, N'1A', null),

         (4, N'1A', null);

fails because the tuple (1A, null) is duplicated. The insert statement:

insert into [Schema].[Table] ([Id], [Column1], [Column2])

  values (3, N'1a', N'2a'),

         (4, N'1C', N'2C');

fails because the collation is case insensitive so the tuple (1a, 2a) is matches the tuple (1A, 2A) which is already in the table. Finally to test for the case where two different tuples have the same hash value, the insert statement:

insert into [Schema].[Table] ([Id], [Column1], [Column2])

  values (3, N'1a', N'2a'),

         (4, N'1C', N'2C');

results in the table:

Id

Column1

Column2

Internal_Hash_Column1_Column2

1

1A

2A

585901004

2

1B

2B

-1712577588

5

aaabab

2E

250031328

6

aaaaam

2E

250031328

 

Hashbytes based implimentation

There is another technique that almost always works and is very fast. Create a new persistent column using hashbytes and put a unique constraint on it. There is a very small chance that a valid tuple will violate the constraint. For some algorithms, a malicious user can generate collisions fairly easily.

Top 25 most dangerous programming errors

The paper “Top 25 Most Dangerous Programming Errors” (http://cwe.mitre.org/top25) is an interesting study of the families of programming errors that lead to security issues. The paper is based on input from a wide variety of security experts and is worth reading. The complete list of errors is at http://cwe.mitre.org/data/lists/699.html.

From a T-SQL perspective, however, it seems off mark but the same group has a database of all the common weaknesses. Here is a list of Common SQL specific errors I was able to extract:

·         Access Control Bypass Through User-Controlled SQL Primary Key (http://cwe.mitre.org/data/definitions/566.html).

·         Individual Definition in a New Window Dangling Database Cursor (aka 'Cursor Injection') (http://cwe.mitre.org/data/definitions/619.html).

·         Failure to Preserve SQL Query Structure (aka 'SQL Injection') (http://cwe.mitre.org/data/definitions/89.html).

·         Failure to Sanitize Data within XQuery Expressions (aka 'XQuery Injection') (http://cwe.mitre.org/data/definitions/652.html).

·         SQL Injection: Hibernate (http://cwe.mitre.org/data/definitions/564.html).

 

There are many errors that apply to SQL and other languages. Here is a list that I thought was most relevant to SQL development:

·         Algorithmic Complexity (http://cwe.mitre.org/data/definitions/407.html).

·         Authentication Bypass by Alternate Name (http://cwe.mitre.org/data/definitions/289.html).

·         Client-Side Enforcement of Server-Side Security (http://cwe.mitre.org/data/definitions/602.html).

·         Discrepancy Information Leaks (http://cwe.mitre.org/data/definitions/203.html).

·         Error Handling (http://cwe.mitre.org/data/definitions/388.html).

·         Error Message Information Leak (http://cwe.mitre.org/data/definitions/209.html).

·         Failure to Encrypt Sensitive Data (http://cwe.mitre.org/data/definitions/311.html).

·         Failure to Handle Missing Value (http://cwe.mitre.org/data/definitions/230.html).

·         Improper Access Control (Authorization) (http://cwe.mitre.org/data/definitions/285.html).

·         Incorrect Ownership Assignment (http://cwe.mitre.org/data/definitions/708.html).

·         Incorrect Privilege Assignment (http://cwe.mitre.org/data/definitions/266.html).

·         Leftover Debug Code (http://cwe.mitre.org/data/definitions/489.html).

·         Not Using Password Aging (http://cwe.mitre.org/data/definitions/262.html).

·         Privacy Leak through Data Queries (http://cwe.mitre.org/data/definitions/202.html).

·         Unchecked Input for Loop Condition (http://cwe.mitre.org/data/definitions/606.html).

·         Use of Obsolete Functions (http://cwe.mitre.org/data/definitions/477.html).

Posted by AnthonyBloesch | 1 Comments
Filed under: , , ,

An insiders guide to giving feedback to Microsoft

Geoff Snowman had the great suggestion of a blog entry on how developers can give Microsoft feedback to Microsoft. I have worked on both sides—giving feedback to Microsoft and receiving it from customers. Here are the best ways I know of to have your voice heard:

1.        When installing a program select the “Send usage data to Microsoft.”/ “Send feedback to Microsoft” features of our products. This turns on instrumentation that, for example, records the program state during a crash and feature usage data. Development teams use this data to both eliminate high frequency crashes and prioritize features.

A great deal of care is taken to remove personal data from these reports so I do not hesitate to check these options. These options effectively give you a vote in what the development team does in the future. I have used the data to make many product decisions and to drive which bugs are fixed first. Development teams find the crash reports are particularly valuable since they are a highly effective way to find crashes that testing missed.

2.       Feature request, bugs and issues can be logged with Microsoft connect (https://connect.microsoft.com). Issues are automatically passed to development teams. Thus, your issue will receive the direct attention of, say, the developer responsible for the feature. I rarely use internal bug reporting tools to log bugs against other teams as Microsoft connect works better. It should work just as well for you.

3.       MSDN forums (http://social.msdn.microsoft.com/forums) are a great way to connect with other users. Development teams read the forums and many team members are active participants. Through the forums, you can discuss your issue with, say, the feature’s designer. Also forum participation can get you noticed and asked to participate in customer councils etc.

4.      MSDN documentation often has a “Send feedback” link. If you discover an issue with the documentation, this is a great way to get it fixed. Documentation teams are grateful for the feedback and giving it drives quality into the documentation for everyone. I have got hundreds of documentation issues fixed this way.

5.       Volume licensing agreements typically have 24×7 support agreements. Microsoft customers lose a lot of value by not using these fully. These are a great way to get critical issues fixed. Also, your Microsoft Representative is a great source of relevant information—they are paid to keep you happy.

6.      Microsoft often asks customers to participate in usability, focus groups and customer councils. These are great opportunities to give your feedback in a very interactive way. Often customers are invited to participate based on providing good feedback, in the past, through other channels.

7.       Support calls receive a lot of attention at Microsoft. Development teams regularly receive reports on the top support issues. Development teams work hard to eliminate or reduce these calls or at least provide a fix for customers that experience the issue. Also, it is the only way to get certain patches. If you have an issue it is really worth following up with support.

The n habits of highly effective developers

There is a famous book “The Seven Habits of Highly Effective People” by Stephen R. Covey, I have been thinking about what are the n habits of a great developer. Below is my list, which has some clear parallels to the book. Suggestions? Missing items?

1.        When fixing a defect a great developer searches for similar defects and fixes them. Frequently, a defect is caused by a mental lapse that is repeated throughout the code. A great developer searches for other defects caused by similar lapses. Since defects are expensive to find and fix this can make the developer much more productive.

2.       A great developer removes systemic issues. Code patterns, like poorly factored APIs and duplicate code, can cause lost productivity. Great developers look for these issues and fix the ones that are more costly, over the long term, to leave in the code than to fix. This can mesh well with habit 1 if by, say, adding a static analysis test a developer not only finds all the existing similar defects but prevents them from reoccurring.

3.       A great developer files defect reports and suggestions against other teams. This not only improves other people’s experience but also helps ensure the great developer will not see the issue in the future. Microsoft makes this easy to do through https://connect.microsoft.com and the various “Send feedback to Microsoft” features of Microsoft’s products. As well, this can improve the great developer’s network by creating contacts with other great developers.

4.      Great developers create a simple working system with unit tests early and then refine it. This approach drives confidence, makes it easier for testers and documenters but most importantly prevents over engineering. It is easy to create code that solves non-existent problems. By working this way, great developers only create the code they need to.

5.       Great developers are intentional about their career and self-improvement. They put effort into honestly assessing their weaknesses and strengths then take steps to reinforce their strengths and eliminate their weaknesses. The easiest way to do this is to schedule half an hour every week and use the time to ask yourself what you did right and what you did wrong during the week and then take steps to improve—I have often seen this turn a mediocre developer into a star.

6.      Great developers enable scenarios and experiences not features. By understanding customer’s needs, in an end-to-end fashion, great developers can create products that customers love and save effort building useless features. By thinking about the experience as a whole and not just their area, they improve the entire system not just their individual applications.

 

Posted by AnthonyBloesch | 1 Comments
Filed under:

SQL Server fragmentation

Recently we started shredding “Oslo” daily builds into the “Oslo” Repository. We did this to help with internal development but also to get realistic operational data to validate our design and get realistic traces to use for stress and performance testing. We are learning a lot and have found a major bug that sometimes leads to deadlocks. One of the great pieces of data we now have relates to database fragmentation. So far, the design is holding up and we have only one fragmentation issue to deal with. However, since database fragmentation is a major cause of poor performance I thought a discussion of how to minimize and deal with database fragmentation in SQL Server might be generally useful.

 

Essentially, three types of fragmentation affect SQL Server:

·         File (disk) fragmentation. SQL Server is optimized to access a data file’s logical pages sequentially. File fragmentation means that many logically sequential pages are not physically sequential and time is wasted doing disk seeks.

File fragmentation can be fixed with normal windows defragmentation utilities. File fragmentation is most frequently caused by autogrowing files instead of initially allocating sufficient space. This is why I tend to oppose efforts to initialize the Repository to a small size and then autogrow it. If accounts are setup correctly then fast file initialization means that there is only a minimal setup performance penalty to allocating sufficient space. If autogrow is needed then it should be in large chunks.

Putting log and data files on separate physical disk arrays is a best practice for reliability and performance reasons but it also helps minimize file fragmentation.

·         Data file pages that are out-of-order. Because pages are out-of-order, time is wasted doing disk seeks during, say, an index scan and disk caches are not used efficiently. Write performance can also suffer since, typically, the page to update needs to be found.

Out of order pages can be fixed by reorganizing the index with an alter index statement or with a SQL Server maintenance plan. Out of order pages are most frequently caused by poor index key choices relative to insert and delete behavior. For example, an externally generated GUID based key will mean that inserts are at random positions in the index leading to frequent page splits. Since most extents (i.e. 8 contiguous pages) will be filled up, the new page will have to go in another extent and the index pages will be out of order. Natural keys also frequently suffer from this problem since they are inserted out-of-order.

·         Data file pages with a lot of free space. Because pages are not full, time is wasted reading more pages than necessary during, say, an index seek or an index scan. Write performance can also suffer since, typically, the page to update needs to be found. However, sometimes partially full pages can improve performance because when data is added to an already full page it will be split which is often expensive. The index fill factor option can be used to fine tune how full an index leaf page is initially. By default, SQL Server fills leaf pages.

Poor page utilization can be fixed by rebuilding the index with an alter index statement or with an SQL Server maintenance plan. Poor page utilization is most frequently caused by poor index key choices relative to insert and delete behavior. For example, an externally generated GUID based key will mean that inserts are at random positions in the index leading to frequent page splits and poor page utilization. Natural keys also frequently suffer from this problem since they are inserted out-of-order.

 

My advice for minimizing page fragmentation, in SQL Server, is to:

1.        Understand how SQL Server indexes are structured (see http://msdn.microsoft.com/en-us/library/ms190969.aspx, http://msdn.microsoft.com/en-us/library/ms180978.aspx and http://blogs.msdn.com/sqlserverstorageengine/archive/2006/06/23/644607.aspx).

2.       Understand how data will be inserted and deleted and expected data volumes

3.       Make sure you database files are sized right and autogrowth is set to a suitable number

4.      Try to structure indexes and keys so most inserts will be at the end of the table/index and deletes will be in contiguous chunks. Typically, this means avoiding natural keys and externally allocated GUIDs.

5.       Monitor fragmentation

6.      Set up a maintenance plan to reorganize/rebuild indexes that are prone to fragmentation

 

To find any database fragmentation you can run the various Disk Usage reports in the database right click in SSMS. Alternatively, you can use various SQL Server management views and functions. For example:

select S.name as [Schema], O.name as [Object], I.name as [Index],

       round(P.avg_fragmentation_in_percent, 1) as [Percentage Out-of-Order],

       round(100.0 - P.avg_page_space_used_in_percent, 1) [Percentage Free Space],

       P.page_count * 8 [Size (Kb)]

from sys.dm_db_index_physical_stats(db_id(N'DailyBuilds'), null, null, null, N'detailed') as P

     inner join sys.objects as O on O.object_id = P.object_id

     inner join sys.indexes as I on I.index_id = P.index_id and I.object_id = P.object_id

     inner join sys.schemas as S on S.schema_id = O.schema_id

where P.avg_fragmentation_in_percent >= 10 or

      P.avg_page_space_used_in_percent <= 80

order by [Schema], [Object], [Index]

Architectural design tenets and corollaries

Now that the codename “Oslo” repository is public I can discuss its architecture and design.

When designing a new product or component I like to:

1.        Understand the business landscape. For example, what is the market opportunity, what are the major competitors, where are the competitors likely to be by the time the product is released, how will the product fit into customer’s business processes. Creating new products is difficult and expensive so it is important to get this part right. Also an in-depth understanding of the business landscape can help enormously in day-to-day decisions. For example, can a feature wait until version 2?

2.       Create design tenets. Design tenets are the bedrock on which good architecture is built. Any non-trivial design must make tradeoffs—design tenets help drive decisions by succinctly describing what is important. Without design tenets, scope creep and over-engineering can easily destroy a design.

I did an extensive multi-version business analysis for the “Oslo” repository which, of course, I cannot share with you but here are the “Oslo” repositories design tenets:

Integrate well with Oslo. Oslo should provide a seamless platform. The “Oslo” repository should integrate seamlessly with the rest of the Oslo platform.

Support a wide variety of tools. No toolset can fully anticipate customer needs. Support for ad hoc queries and updates, using familiar tools, will allow customers to circumvent the limitations in our tools. For example, it should be straightforward to build an Access based application over the “Oslo” repository or write a query that finds code vulnerable to a new security exploit.

The need to support traditional tools implies that the “Oslo” repository uses a straightforward relational representation of data as its primary API. Further to make queries easy to write, the metadata stored in the “Oslo” repository should form an integrated and consistent whole.

Scale well. Deployment and code analysis data (e.g. call graphs) can be very large. The “Oslo” repository needs to scale well to provide consistently good performance even when it contains a large enterprise’s metadata. In particular, good geo-scaling is likely to be important in “Oslo” repository scenarios.

Enterprises vary in their scaling strategies, the “Oslo” repository needs to support both scale-up and scale-out strategies. In particular, the “Oslo” repository should be designed to align with the SQL Server’s scale-out strategy since it will offer customers the most cost effective scale-out architecture for the near future. In practice, this means that “Oslo” repository data should be easily partitionable into related segments.

Support high availability. To be credible, the Oslo server will need to be highly available. As a critical component of the Oslo Server, the “Oslo” repository will also need to support high availability.

As well as being highly available itself, the “Oslo” repository should also support building fault resilient components that can survive “Oslo” repository failures. In particular, the “Oslo” repository should be designed to align with the SQL Server scale-out strategy and OCS (Occasionally Connected Systems) changes since they will offer customers the most cost effective high availability architecture for the near future. In practice, this means that “Oslo” repository data should be easily partitionable into related segments.

Be easily extensible. To quickly instantiate the “Oslo” repository with domain models for a wide variety of metadata and allow partners and customers to extend Microsoft tools or create new ones, the “Oslo” repository needs to be easily and cheaply extensible, supporting domain model extension and composition.

Ideally, the “Oslo” repository should make it easy to design and build domain models with, preferably, a wide variety of standard development tools or, less desirably, special purpose tools.

Support schema evolution. As technologies evolve, so will our domain models. The “Oslo” repository must support the easy and robust migration of applications to new domain model versions.

The “Oslo” repository should use good relational database design to ensure easy schema evolution. Because object-oriented meta-models are fragile, fixed object-oriented domain APIs should be avoided.

 

More Posts Next page »
 
Page view tracker