With a new storage engine come new processing options. Here are some fast facts about processing tabular models, in Q&A format.
The following table shows the available processing options for each object.
The following table describes each processing option.
Any time after you do a Process Clear or Process Data, you must issue a Process Recalc. Also, after you merge partitions in script, you must perform a Process Recalc. Your model may be unqueryable if you skip the Process Recalc. Process Recalc is always the last thing you do, and you can batch up as many processing operations as you want before issuing that last Process Recalc.
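To illustrate the batching pattern, here is a sketch of an XMLA batch that runs Process Data on two tables and finishes with a single database-level Process Recalc. The database and object IDs are hypothetical; note that in a Denali tabular database, tables surface as dimensions in the XMLA object model.

```xml
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Load data for two tables; neither is queryable until the Recalc runs -->
  <Process>
    <Type>ProcessData</Type>
    <Object>
      <DatabaseID>SalesModel</DatabaseID>
      <DimensionID>FactSales</DimensionID>
    </Object>
  </Process>
  <Process>
    <Type>ProcessData</Type>
    <Object>
      <DatabaseID>SalesModel</DatabaseID>
      <DimensionID>DimCustomer</DimensionID>
    </Object>
  </Process>
  <!-- One Process Recalc at the database level, always last -->
  <Process>
    <Type>ProcessRecalc</Type>
    <Object>
      <DatabaseID>SalesModel</DatabaseID>
    </Object>
  </Process>
</Batch>
```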
You do not need to Process Recalc if you do a Process Add, Process Full, or Process Default. All of those commands recalculate the affected objects (and only the affected objects) for you. Also, you don’t need to issue a Process Recalc after you merge partitions through the SSMS UI, as the UI automatically executes a Process Recalc for you.
Keep in mind that you can only issue Process Recalc at the database level. This means that after a Process Data operation on a dimension, you may be stuck recalculating a large fact table even though it was not impacted by the initial processing operation. If you want to limit the scope of recalculation, you might want to do a Process Full or Process Default instead of Process Data. However, there is a trade-off: if you perform multiple Process Full or Process Default operations in a row, the engine performs the recalculation every time. So your choice is multiple small recalcs every time vs. a single large recalc. Choose wisely based on your specific processing scenario.
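By contrast, a single Process Full on one table handles its own recalculation, so no separate Process Recalc is needed. A minimal sketch, again with hypothetical IDs:

```xml
<!-- Process Full on one table: loads data and recalculates
     the affected objects in one command -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Type>ProcessFull</Type>
  <Object>
    <DatabaseID>SalesModel</DatabaseID>
    <DimensionID>DimCustomer</DimensionID>
  </Object>
</Process>
```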
If you often add or remove partitions from a table, or if you often reprocess individual partitions in a table, you should Process Defrag regularly but judiciously. Process Defrag can be an expensive operation for a large table.
If you Process Data at the table level, a Process Defrag is never necessary, because clean new dictionaries are built when the table is processed.
Here is an example where performing a Process Defrag is helpful. Imagine you have a table with 12 partitions, containing a rolling window of monthly data for a year. When January 2012 comes, you drop the January 2011 partition and add a new one for January 2012. However, deleting the January 2011 partition does not delete the column values in the dictionary for the table, even though there may not be any rows in the table that reference these values. The dictionary is now too large. Issuing a Process Defrag on this table gets rid of the useless entries, thus saving space and improving performance.
Usage Note: You can Process Defrag a database, even though that option is not exposed in the UI. This operation has a side effect in Denali – the data for unprocessed tables is loaded as well. You need to be aware of this side effect before issuing a Process Defrag on a database.
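For the rolling-window scenario above, a table-level Process Defrag might look like the following sketch (IDs are hypothetical; the table is addressed as a dimension in XMLA):

```xml
<!-- Rebuild the table's dictionaries, dropping entries no longer
     referenced by any row (e.g. after deleting an old partition) -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Type>ProcessDefrag</Type>
  <Object>
    <DatabaseID>SalesModel</DatabaseID>
    <DimensionID>FactSales</DimensionID>
  </Object>
</Process>
```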
There are two ways to incrementally add data to a table:
If you are using Process Add, you should keep in mind that there is no UI for adding a specific query binding to use for the Process Add operation. So to make sure you’re not adding duplicate rows, you need to do one of the following:
If you incrementally load by processing new partitions with Process Data, you must Process Recalc on the database afterwards before moving on with your life. (Process Add leaves the model in a queryable state, so it does not require one.) If you do this enough, you'll want to Process Defrag too.
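One way to scope a Process Add without a UI for query bindings is an out-of-line binding in the XMLA batch, which overrides the partition's source query for that one operation. The sketch below is an assumption about the exact shape, not a verified command; the IDs, data source name, and the `LoadDate` filter column are all hypothetical.

```xml
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Process>
    <Type>ProcessAdd</Type>
    <Object>
      <DatabaseID>SalesModel</DatabaseID>
      <DimensionID>FactSales</DimensionID>
    </Object>
  </Process>
  <!-- Out-of-line binding: restrict the rows fetched for this
       operation so no duplicates are added -->
  <Bindings>
    <Binding>
      <DatabaseID>SalesModel</DatabaseID>
      <DimensionID>FactSales</DimensionID>
      <Source xsi:type="QueryBinding">
        <DataSourceID>SqlServerSource</DataSourceID>
        <QueryDefinition>
          SELECT * FROM dbo.FactSales WHERE LoadDate = '2012-01-23'
        </QueryDefinition>
      </Source>
    </Binding>
  </Bindings>
</Batch>
```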
Parallel processing is supported for different tables, but not for different partitions within a table. This restriction is enforced by the engine when scheduling the processing job. You don’t get any notification of this restriction.
To clarify with an example: say you send the engine a parallel batch processing command. You ask the engine to process Table A with Partitions 1, 2, 3 and Table B with Partitions 4, 5, 6. The engine then schedules your processing job. It is possible for Partitions 1 (from Table A) and 4 (from Table B) to be processed in parallel. However, because processing is scheduled sequentially within a table, you will never see Partitions 1 and 2, or 1 and 3, processed in parallel.
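A batch like the one in this example can be expressed with the XMLA `Parallel` element, sketched below for two partitions per table. The database, cube, measure group, and partition IDs are hypothetical; in tabular, each table maps to a measure group under the model cube.

```xml
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <!-- The engine may overlap Table A and Table B, but partitions
         within the same table are still processed sequentially -->
    <Process>
      <Type>ProcessData</Type>
      <Object>
        <DatabaseID>SalesModel</DatabaseID>
        <CubeID>Model</CubeID>
        <MeasureGroupID>TableA</MeasureGroupID>
        <PartitionID>Partition1</PartitionID>
      </Object>
    </Process>
    <Process>
      <Type>ProcessData</Type>
      <Object>
        <DatabaseID>SalesModel</DatabaseID>
        <CubeID>Model</CubeID>
        <MeasureGroupID>TableA</MeasureGroupID>
        <PartitionID>Partition2</PartitionID>
      </Object>
    </Process>
    <Process>
      <Type>ProcessData</Type>
      <Object>
        <DatabaseID>SalesModel</DatabaseID>
        <CubeID>Model</CubeID>
        <MeasureGroupID>TableB</MeasureGroupID>
        <PartitionID>Partition4</PartitionID>
      </Object>
    </Process>
    <Process>
      <Type>ProcessData</Type>
      <Object>
        <DatabaseID>SalesModel</DatabaseID>
        <CubeID>Model</CubeID>
        <MeasureGroupID>TableB</MeasureGroupID>
        <PartitionID>Partition5</PartitionID>
      </Object>
    </Process>
  </Parallel>
</Batch>
```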
This error means you are trying to bring duplicate rows into a table that requires a unique column.
There are a few situations in which you must have a unique column in a table:
In all of the above cases, you are not allowed to have duplicate rows in a table. Processing will fail if you try.
See my post on managing tabular models for some automation options.
No. Everything is done on the fly at query time.
Thank you Akshai Mirchandani and Edward Melomed for providing source material via email for this post.
Edited 9/30 to add the aggregation question
Edited 1/23/2012 - Correcting an error. Process Add leaves the model in a queryable state and doesn't require a Process Recalc.