A clustered column store index stores rows in columnar storage format in group of rows, referred to as rowgroups. There are two types of rowgroups as follows:
The data compression achieved by columnar storage depends upon the number of rows in the rowgroup. In our experimentation, we have found that any rowgroup with 102400+ rows can achieve good column compression. Based on this, the clustered column store index follows these guidelines
For details please refer to http://msdn.microsoft.com/en-us/library/gg492088(v=sql.120).aspx. Besides the bulk import, the factors that impact the size of rowgroup are
Let us walk through each of these factors with examples
Here I have a table where I have loaded million rows. Now I create a clustered column store index (CCI) with different DOPs
CREATE TABLE dbo.t_colstore (
c1 int NOT NULL,
c2 INT NOT NULL,
c3 char(40) NOT NULL,
c4 char(1000) NOT NULL
set nocount on
-- load 2000000 rows
declare @outerloop int = 0
declare @i int = 0
while (@outerloop < 2000000)
Select @i = 0
while (@i < 2000)
insert t_colstore values (@i + @outerloop, @i + @outerloop, 'a',
concat (CONVERT(varchar, @i + @outerloop), (replicate ('b', 950))))
set @i += 1;
set @outerloop = @outerloop + @i
set @i = 0
CREATE CLUSTERED COLUMNSTORE INDEX t_colstore_cci ON t_colstore with (maxdop = 1)
Here is what I see for the rowgroup when querying the catalog view sys.colunm_store_row_groups. Note, you have 1 row group with 1 million rows and other rowgroup with < 1 million rows because this was the last rowgroup with < 1 million rows.
Now, let us recreate the index without DOP=1
ALTER INDEX t_colstore_cci ON t_colstore REBUILD
Here you see one extra rowgroup. This is because multiple threads are creating rowgroups and they run out of rows before reaching the million rows
Dictionary size limitation
As part of column compression, a dictionary (primary/secondary) is used to store commonly occurring (n)char/(n)varchar values. The size of dictionary is limited to 16MB. The size of rowgroup is automatically reduced if dictionary gets full. Let us walk through an illustrative example using the same table as before but after drop/recreate. In the data load below, notice the distinct prefix for column C4. This creates separate entries, each containing 950 bytes, causing the dictionary to grow larger than 16MB, which forces smaller rowgroups.
-- load 1100000 rows
while (@outerloop < 1100000)
Now, I create the CCI as follows and check the rowgroups using the catalog view sys.column_store_row_groups and you will see I have 22 row groups instead of 2
Now, let us try another variation by loading the 1.1 million rows through delta row groups and see how many compress row groups we get. Here are the additional rowgroups after loading the data.
Note, both of these are delta rowgroups. The closed rowgroup has 1 million rows as you would expect. Why did this not get split into multiple rowgroups due to dictionary? The simple answer is that dictionary is only relevant in the context of compressed rowgroup. Let us run the REORGANIZE command as follows to force compression of the delta rowgroup.
ALTER INDEX t_colstore_cci ON t_colstore REORGANIZE
Here is the output of the catalog view sys.column_store_row_groups. Note, that the closed 'delta' rowgroup' was broken into smaller rowgroups as part of compression
When compressing rowgroups, it is required that all the rows in the rowgroup must fit in-memory including the dictionary. If low memory is detected, the server closes the current segment even if the segment row count does not reach 1 million (i.e. leading to smaller sized rowgroup). As part of memory pressure, the index create/rebuild automatically adjusts the DOP to enable successful completion of index create/rebuild.