This weekend I played with partitioning. I have a table with about 10 million rows spanning the last 6 years of stock history for AMEX, NYSE, Indexes, and NASDAQ. I partitioned each year into a separate filegroup so that should make it much easier to purge archive the old years and try to speed up queries on date ranges that involved multiple tables by storage-aligning the indexes. I aligned a related table used for storing calculations and metrics along with aligning an indexed view. Although a few queries actually ran slower, the most critical ones ran faster with the partitioning, since I typically retrieve data grouped together in a fairly tight range.
It seems that in order to really get storage alignment to work well with partitioned tables, you may want to move away from simple surrogate primary keys to compound primary keys where one segment is the partitioning key and the other segment is the unique key. One of the requirements for a partitioning a table is that the partitioning column be part of the clustered key, so unless you are partitioning on what is already the primary clustered key, you’re looking at either no longer using the primary key as the clustering key or creating a compound key. And until recently, the advice has typically been to use simple surrogate primary clustered keys for best performance, which means you would not typically want to partition on an existing surrogate key when moving to partitioning, since by definition a surrogate key generally has no significance, and thus typically not a relevant partitioning parameter.
Therefore partitioning introduces some new dynamics in the way one designs primary keys.
What would be helpful would be to create an identity column grouped on another column (nested). Say I have a partitioning key of year, then I could have my order Id be reset for each year. The advantage is that the overall index space can be reduced since the numbers are only unique within a range for each of the partitioning keys, allowing a smaller data type to be used for the nested identity key.
For example, if I have over 2^32 billion rows (or about 4.2 billion rows), I would have to use a bigint for a surrogate primary key, even if using the negative range (which is unlikely, so the real limit is more like 2.1 billion). If I create 20 partitions, I can now go to 80 billion rows using a simple surrogate integer plus a tinyint partitioning key, assuming non-negative identity values. This means that using the partitioning scheme, I actually use only 5 bytes per unique index value, rather than 8 bytes, and am able to accommodate up to 255 times more data than if using the simple 4 byte key alone.
The other scenario is where you commonly need to keep items in a static order within a parent, i.e. a bill of materials assembly where the line number represents the sequence in which an assembly is to be constructed. To be able to automatically generate sequence numbers guaranteed to be unique within the context of a parent assembly number would be useful.
This could be done with a trigger of course, but that’s not real efficient, and presents more challenges in bulk-loading scenarios.
What I’m thinking of is some syntax along the lines of
CREATE TABLE Table-name
Grouped-Identity-Key datatype identity [start,increment] group by PartitioningId
And then you just partition on the Grouping-key for the primary key on a partitioning scheme as in:
ALTER TABLE Table-name
CONSTRAINT [Primary-key-Constraint-Name] PRIMARY KEY CLUSTERED ( [Grouping-Key], [Grouped-Identity-Key] )
And storage-aligning the other indexes can be done by just referencing the grouping-key in the same way.
This would cause the GroupdIdentityKey to reset back to the initial value each time the Grouping-key changes. The table could then easily be partitioned on the grouping-key, which can be very compact to just do the partitioning, such as the year from an order date.
Thought on this? Is there a way to accomplish this that I am missing?