We recently had a situation where we needed to interface with an external non-relational data source that happened to use really long strings to identify entity instances. These identity strings had theoretically unbounded length, but in practice they were never more than 1000-2000 characters. The database layer needed to accept an identifier of this sort and use it to retrieve some information from the relational store that was related to that object. I’ll call the entity type “widget”, and the long string identifier from the external system the “external key”.
We could just store the external key in the widget table and use it to locate the correct widget, but we couldn’t afford to scan the widget table for every lookup. The number of rows in the widget table was expected to be large, so the lookup needed to be indexed. Unfortunately, SQL Server limits index keys to 900 bytes, and the external key values can be longer than that. If you create an index on a combination of columns that could contain more than 900 bytes, you get this warning:
Warning! The maximum key length is 900 bytes. The index 'xyz' has maximum length of 4000 bytes. For some combination of large values, the insert/update operation will fail.
Then if you try to insert a row into the table that has more than 900 bytes in the indexed columns, it will fail with this error:
Msg 1946, Level 16, State 3, Line 11
Operation failed. The index entry of length 1501 bytes for the index 'xyz' exceeds the maximum length of 900 bytes.
Even if this restriction didn’t exist in SQL Server, I wouldn’t ever think of creating a table with a >1KB primary key. A surrogate primary key, such as an IDENTITY column or a GUID/UNIQUEIDENTIFIER column, should be used instead. You’d use the surrogate as your primary key, and all lookups and foreign keys within the data tier would use the surrogate. But the external system doesn’t know anything about the surrogate keys used in the data tier; the only identifier that the external system can pass to the database is that >900-byte external key value. So while a surrogate primary key is a good idea here, it doesn’t solve the immediate problem – we still need to be able to do an efficient (indexed) lookup of an external key value in the database, even though a column that could store the external key values would be too large to index.
If you find yourself in a similar situation, here are some alternatives to consider:
A. Change the external system to use an identifier that is always less than 900 bytes long. Examine the entity’s attributes: is there some combination of attributes that together make up a candidate key (in other words, is there some combination of other attributes that is guaranteed to be unique for each entity instance)? Even if the external system doesn’t internally use those columns as an identifier, it could still pass them to the data tier when it wanted to look up a widget. In the database, you’d create a UNIQUE index on this column or combination of columns.
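As a sketch of this option, suppose each widget were uniquely identified by, say, a manufacturer code plus a serial number (hypothetical columns – substitute whatever combination of attributes is actually a candidate key for your entity). The supporting index might look like this:

```sql
-- Hypothetical candidate key: the combination of [manufacturer_code] and
-- [serial_number] is assumed to be unique per widget, and both columns are
-- short enough to fit comfortably within the 900-byte index key limit.
CREATE UNIQUE NONCLUSTERED INDEX ix_widgets_candidate_key
    ON widgets (manufacturer_code, serial_number);
```

The external system would then pass these attribute values, rather than the long external key, whenever it needed to look up a widget.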
B. Require non-key columns as additional criteria for each lookup. For example, suppose a widget has a “Name” attribute that is always short enough to be indexed. Name is not guaranteed to be globally unique, so neither the external system nor the relational model can use this as a key. But if the column is fairly selective it may be sufficient to give you lookups that are efficient enough that it’s almost like doing an index seek on the external key. In this example you would index the [name] column in the database, and use a routine for lookups like this.
CREATE PROC usp_get_widget
    @external_key VARCHAR(MAX),
    @name NVARCHAR(60)
AS
    SELECT * FROM widgets
    WHERE external_key = @external_key
        AND name = @name;
SQL Server will use a seek on the nonclustered [name] index to narrow the results down to a tiny percentage of the table (hopefully no more than a few rows), then it will apply a residual filter to get rid of all but the single row that matches the external key value. Note that you can add more than one supplemental search column if necessary to increase the selectivity of the non-key index seek. Also note that you still need the filter on the [external_key] column: that predicate won’t be satisfied by an index seek, but it’s still necessary for correctness. Consider making the full external key an INCLUDEd column in the nonclustered index, to minimize the number of bookmark lookups the engine needs to do to find the single matching row.
One obvious downside to this approach is that the additional required search criteria complicate the interface between the database and the external component.
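The index described above, with the full external key carried as an INCLUDEd column, might be created like this (assuming the [widgets] table sketched in the later options):

```sql
-- Seek on [name]; the full external key is stored at the index leaf level
-- so the residual filter on [external_key] doesn't need a bookmark lookup.
CREATE NONCLUSTERED INDEX ix_widgets_name
    ON widgets (name)
    INCLUDE (external_key);
```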
C. Truncate the external key to 900 bytes and index the truncated portion of the key. For example, your widget table might look like this:
CREATE TABLE widgets (
    surrogate_key INT PRIMARY KEY IDENTITY,
    external_key VARCHAR(MAX),
    -- The CAST is needed because LEFT() of a VARCHAR(MAX) input returns
    -- VARCHAR(MAX), which cannot be used as an index key.
    external_key_fragment AS CAST (LEFT (external_key, 900) AS VARCHAR(900)),
    ...
You’d create a (nonunique) index on the [external_key_fragment] column, and your lookup procedure might look something like this:
CREATE PROC usp_get_widget
    @external_key VARCHAR(MAX)
AS
    SELECT * FROM widgets
    WHERE external_key = @external_key
        AND external_key_fragment = LEFT (@external_key, 900);
Of course, this assumes that some predictable 900-byte portion of the external key is reasonably selective. If it’s possible for a large portion of the rows to have the same initial 900 bytes (assuming the fragment is generated in the way shown above), then this won’t help any – the optimizer will correctly estimate that a scan would be more efficient than a seek, and you’ll still end up scanning the entire table for each lookup.
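For completeness, the supporting index for the fragment column might be created like this (again INCLUDEing the full key so the residual filter can be applied at the index leaf level):

```sql
-- Non-unique: different external keys may share the same first 900 bytes.
CREATE NONCLUSTERED INDEX ix_widgets_external_key_fragment
    ON widgets (external_key_fragment)
    INCLUDE (external_key);
```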
D. Break up or parse the external key into its component parts, and index a selective subset of the parts. This assumes that the long external key is really a serialized version of a composite key. It must have some underlying structure, and it must be practical to break the key into its child parts within the data tier. It also assumes that some portion of this structured key is guaranteed to be unique enough to support an efficient index seek that only returned a few rows.
For example, suppose that the external key was a string that actually encoded a hierarchy like “RootIdentifierX\ChildIdentifierY\GrandChildIdentifierZ\...” If you’re willing to push knowledge of the structure of this serialized key down into the data tier, the database could parse this apart and use the identifiers for the first few levels of the hierarchy to do a secondary lookup that could be satisfied via an index seek. You’d need to store those parts of the external key in separate columns (so they could be indexed) in addition to the full external key. Once you’ve indexed the portions that you can, consider adding the remainder as an INCLUDEd column in the index; that would allow the index seek to locate the single matching row without resorting to a bookmark lookup. (It would also make it less likely that the optimizer would choose to scan the table because it guessed that a seek on the component parts wouldn’t be selective enough.)
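A sketch of what this might look like for the backslash-delimited hierarchy above (the column names and lengths are hypothetical, and the parsing itself could happen at insert time or in computed columns):

```sql
CREATE TABLE widgets (
    surrogate_key INT PRIMARY KEY IDENTITY,
    external_key VARCHAR(MAX),
    -- First two levels of the hierarchy, parsed out of the external key so
    -- that they can be indexed. Assumed selective enough to narrow a seek
    -- down to a handful of rows.
    root_identifier VARCHAR(200),
    child_identifier VARCHAR(200)
);

CREATE NONCLUSTERED INDEX ix_widgets_key_parts
    ON widgets (root_identifier, child_identifier)
    INCLUDE (external_key);
```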
E. Hash the external key and index the hash. This is essentially building a hash table and persisting it in the database. With this approach, your widgets table might look something like this:
CREATE TABLE widgets (
    surrogate_key INT PRIMARY KEY IDENTITY,
    external_key VARCHAR(MAX),
    -- A computed column that hashes the external key (MD5 returns a
    -- 128-bit hash).
    external_key_hash AS CAST (HASHBYTES('MD5', external_key) AS VARBINARY(16)),
    ...
The [external_key_hash] column would be indexed, and the lookup proc would do this:
CREATE PROC usp_get_widget
    @external_key VARCHAR(MAX)
AS
    SELECT * FROM widgets
    WHERE external_key = @external_key
        AND external_key_hash = CAST (HASHBYTES('MD5', @external_key) AS VARBINARY(16));
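The supporting index on the hash column might be created like this:

```sql
-- Non-unique, because two distinct external keys can hash to the same
-- 16-byte value; the residual filter on [external_key] in the lookup proc
-- resolves any collisions.
CREATE NONCLUSTERED INDEX ix_widgets_external_key_hash
    ON widgets (external_key_hash)
    INCLUDE (external_key);
```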
It could be argued that the CAST(HASHBYTES(…)) stuff should be hidden within a UDF (ditto for alternatives (C) and (D)):
CREATE FUNCTION dbo.fn_hash_external_key (@urn NVARCHAR(MAX))
RETURNS VARBINARY(16) WITH SCHEMABINDING
AS
BEGIN
    RETURN HASHBYTES ('MD5', @urn);
END;
The computed column and the lookup proc would both reference this function so that the details of the hash function lived in a single place. That would indeed be cleaner, but before taking that route be sure you’re aware of the potential problems with scalar UDF performance that are described at http://www.developerfusion.com/community/blog-entry/8389063/tsql-scalar-functions-are-evil/ and http://sqlblog.com/blogs/adam_machanic/archive/2006/08/04/scalar-functions-inlining-and-performance-an-entertaining-title-for-a-boring-post.aspx.
Of all of these, option (A) is the only one that I find very aesthetically pleasing. The others could work in a pinch, though. Please drop me a comment if I’m overlooking some other alternative here.
A practical and complete round-up of countermeasures for this common problem. Liked reading it and will try to remember a few techniques.
When using the hash option, beware of hash collisions. Two distinct (but large) values might have the same hash value, in which case you would get a primary or unique key violation even though the values themselves are unique.
You're correct. But the index on the hash shouldn't be unique, and as noted above you still need to include a filter on the unhashed value. That means that the occasional collision wouldn't cause any problems.
I'm facing a similar constraint at the moment. I'm less concerned about the time to scan, and more concerned about the uniqueness. Thankfully, though expensive, a check constraint will have no such limitations.
In SQL Server 2008, it won't give the error. But in SQL Server 2012, it will give the error "The index entry of length 919 bytes for the index 'PIPVALUES_0_1' exceeds the maximum length of 900 bytes." Could you please give me some comments?