Problem 

Now days we are dealing with lot of data every day. We process more and more data in our application and it may be from upstream, same system, for downstream etc. Consider an upstream system from where we get 100,000 to 200,000 records every day and the data does not have any differentiator to the data definition i.e. we cannot find whether the data is modified / added (no timestamp column). If the upstream system uses a timestamp or exposes differential data, we are fine. But if the source doesn’t do that, we are end of consuming all data and do a merge with our destination. In this case also we need to process all the records from source. Is it really needed to process all the records from source (imagine out of the big numbers only 100s of records really need to be processed)? Can we find a delta data from source even the source doesn’t provide delta row info?

Proposal

We can use the Encryption algorithm to compare data. As the encryption algorithm will return same value for the same input. If a table has 30 columns, we can replace a single column to replace all 30 columns. We can use HASHBYTES(‘algorithm’,<input data>) function. Its output is a varbinary format. The supported algorithms are mentioned below.

<algorithm>::= MD2 | MD4 | MD5 | SHA | SHA1 | SHA2_256 | SHA2_512

SHA1 algorithm is the ideal one, but may give repeated data after many years. If the application doesn’t need to store the hash data, we can manage with this algorithm otherwise we can rely on SH2-256 algorithm which will generate 256 bit output data.

How it will help

We can use this algorithm to compare two datasets and find the delta data, so that we can process only the data which are modified.

SELECT

                TAB1.COL1, TAB1.COL2

FROM (

                 SELECT

                            COL1, COL2, HASHBYTES(‘SHA1’,ISNULL(COL1,’NA’) + CAST(ISNULL(COL2,0) AS VARCHAR(10))) HSHKey

                  FROM  TAB) TAB1

LEFT JOIN (

                  SELECT

                               COL1, COL2, HASHBYTES(‘SHA1’,ISNULL(COL1,’NA’) + CAST(ISNULL(COL2,0) AS VARCHAR(10))) HSHKey

                   FROM  TAB_BAK) TAB2

ON
           TAB1. HSHKey  = TAB2. HSHKey

WHERE                   
           TAB2. HSHKey IS NULL

Not only this will help in finding delta data, it also helps in joining tables where we join more columns. All joining columns can be represented by a single hash key and can be joined.

Things to remember

  1. The columns should not be NULL as the output is dependent on data. So the Null value should be handled with a default one.
  2. Since the functions accepts 2 arguments, the second one we can add all the needed columns and convert the int, numeric / decimal values to char format.
  3. We can use HASHBYTES(‘SHA1’,ISNULL(COL1,’NA’) +’|’+ CAST(ISNULL(COL2,0) AS VARCHAR(10))) to differentiate the output value of each row after adding all the columns. EX, R|Andy|Jo and Randy|J|O.
  4. The output value depends on case sensitive of the data. Hash value for a is different from A.

Conclusion

  1. This is a method of data comparison and we can create a computed column to store this varbinary data.
  2. By creating an index on this column we can enhance the performance of data joining of tables with more column join. The data compare performance can be improved.
  3. If there is any change to the data of any column the computed column also changes.
  4. We can process only the delta data, which will help in performance of data processing.

For more info - http://technet.microsoft.com/en-us/library/ms174415.aspx

SHA1 info - http://en.wikipedia.org/wiki/SHA-1