Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
So I was chatting with Kim and Paul after that .NET DA meeting I mentioned the other day. The one where Kim kind of laid out the way that SQL Server did its searches, whether unindexed, indexed via a clustered key, and indexed without one.
Probably the best description of it I have ever heard, by the way. Kim really is a SQL goddess!
After her talk about the internals followed by Paul's talk about index fragmentation, they had me thinking about an additional piece of information that might be of interest.
I mentioned it in terms of how it can be a problem that can come up when you add multiple language-based indexes. Like you get when you use that technique I mentioned in Making SQL Server index usage a bit more deterministic.
Basically you have the interesting case where some collations are basically identical.
Like if you are dealing with Unicode columns and you use two collations that are only different due to the fact that they support different code pages. Like Latin1_General_CS_AS vs. Arabic_CS_AS vs. Greek_CS_AS vs. Hebrew_CS_AS vs. Russian_CS_AS, for example.
Or one like Bosnian_Cyrillic_100_CI_AS vs. Bosnian_Latin_100_CI_AS vs. Croatian_100_CI_AS vs. Serbian_Cyrillic_100_CI_AS vs. Serbian_Latin_100_CI_AS (some of these are identical but split out for political reasons proving that in SQL Server 2008 the have learned the lession I mentioned previously and others are the same because there was no good reason not to include the identical data in two different sorts where the users would reasonably expect the data to sort for them properly in either script, another lesson that they have "learned" well by picking up the new data).
The list goes on.
If you create multiple indexes on a column in order to assure better international user support, this is a good thing.
But if you go too far and create indexes that literally duplicate the same information as previously created indexes then all you are doing is taking up space (the size of the extra index) and hurting performance (the cost of index maintenance is 100% parasitic if the extra index is extraneous).
I asked someone on the SQL Server development team about this and he confirmed what I noticed through experimentation. Though he was unclear on how common the scenario was (he technically has a point since the technique is really only well documented here at SiaO, though I am philosophically opposed to ever relying on poor documentation as a justification for a bug!).
Now in regard to size luckily indexes are not huge since SQL Server does not use NLS-type sort keys for their indexes -- they use B-trees created via CompareString-type results.
But the size hit is greater than zero and if one has a mulei-million row table keeping two B-trees around that return the same results is hardly in anyone's best interests.
Plus the hit of adding entries when one has filled the level they are at in the B-tree and have to split the entry, but having to do so twice? A person is taking one of the really bad sides of index fragmentation and doubling it ir worse, for no reason!
Note that there is no warning for when this happens -- the new index is cheerfully created (to the extent that SQL Server is cheerful about such things!).
And although one could through research work to find out which collations give identical results for Unicode columns, it is not easy to find, and the information is really not exposed in any way to query it (other than asking me or something!).
The problem is worse for Exchange and the Jet Blue engine (which may beg the question of why I slant this blog so much toward SQL Server, but I think that these kinds of scenarios are much worse conceptually in SQL Server given how huge the optimization information is there both inside and outside of Microsoft -- people just seem to care a lot about database optimization in SQL Server).
But getting back to ESENT for a moment, the fact is that an e.g. JetCreateIndex2 call will cheerfully succeed for English, German, and Dutch, no matter how long each one takes -- and it will actually also take the huge space hit, far beyond the hit SQL Server does. Since it has no optimization to partially combine identical sorts -- and those three truly are identical, as I mentioned before).
I suppose in the ideal world both database engines would have -- either through SQL DDL or function call -- two things:
there really is no good way for a developer to take advantage of this potentially huge size/performance optimization.
Well, there is the whole "just ask Michael Kaplan" thing but that hardly scales. :-)
So, is this important?
Well, that kind of depends.
If one is taking the time to be interested in index architecture and index fragmentation in general, and one cares bout multilingual applications, then I'd argue that yes, it can be quite important.
Though I admit that YMMV (your mileage may vary).
In the long run, adding the feature to query about or possibly even detect such cases is probably the best strategy for the various technologies and products. This would translate to features in NLS for the sake of clients like ESENT, and features in SQL Server.
This blog brought to you by ṃ (U+1e43, aka LATIN SMALL LETTER M WITH DOT BELOW)