Offical team blog for SSDT, a tool for on and off-premise database development
In many cases a database needs to be deployed with data pre-populated in some tables. Loading static data into tables that are referenced from other tables is a common case – examples might be customer types, address types, data states, etc. In some cases these data values will never change, in other cases they may be initial values that can be added to by applications. This class of data is variously called reference data, seed data, domain data or static data. Regardless of what you call it, this data must be present before the database can be used. We can think of this as extending the contract between your database project and the database to include deployment of data as well as schema.
SSDT uses a declarative approach to defining schema, which has the advantage that it is idempotent - you don’t need to concern yourself with the state of the target database when you deploy your design. Ideally, the same approach would apply to deploying data, which in this context can be thought of as part of the schema – rather like enum values that are defined within application code. Ideally, SSDT would allow you to supply data values for any table as part of the definition of the table. These data values would be part of the schema model so would be validated against the table definition, included in refactoring and schema comparisons, and cause appropriate data deployment scripts to be generated. While SSDT does not support defining data in the schema yet it’s certainly on our radar.
So what to do in the meantime? Well, if you use SQL Server Database Projects you can augment the schema deployment with pre- and post-deployment scripts. While these scripts are not declarative and don’t participate in the build, they can be included in the scope of refactoring. And while these scripts are imperative there are ways to write data population scripts so that they are idempotent. This is important as you need these scripts to deploy their data regardless of the content of the tables at the time they are executed and with minimum impact on the integrity of surrounding data.
You can write a composite INSERT, UPDATE, and DELETE script operating over a temp table to do this but SQL Server 2008 added MERGE support which can be used to merge data into a target table in a more compact manner. By including an appropriate MERGE script for each reference data table in your post-deployment script you can deploy data to any number of tables declaratively. And for smaller data sets, with less than a thousand rows per table (the sweet spot for most reference data), MERGE allows you to define the data inline in the script without needing to load it into a temp table. Let’s look at an example…
Let's populate an AddressType table to be used by reference to describe the purpose of addresses held in other referencing tables (not defined).
First let's create a new SQL database project in SSDT and add the reference data table definition below to it. (You can add a new Table using the item template or write this into a build script).
CREATE TABLE [AddressType] ( [AddressTypeID] INT NOT NULL PRIMARY KEY, [Name] NVARCHAR (50) NOT NULL, );
Next add a post-deployment script and copy the data population script below into it. Note the difference between the build action property value on the object definition script and the post-deployment script. You’re allowed only one active post-deployment script per project.
-- Reference Data for AddressType MERGE INTO AddressType AS Target USING (VALUES (0, N'Undefined'), (1, N'Billing'), (2, N'Home'), (3, N'Main Office'), (4, N'Primary'), (5, N'Shipping'), (6, N'Archive') ) AS Source (AddressTypeID, Name) ON Target.AddressTypeID = Source.AddressTypeID -- update matched rows WHEN MATCHED THEN UPDATE SET Name = Source.Name -- insert new rows WHEN NOT MATCHED BY TARGET THEN INSERT (AddressTypeID, Name) VALUES (AddressTypeID, Name) -- delete rows that are in the target but not the source WHEN NOT MATCHED BY SOURCE THEN DELETE
As you can see it’s a compact syntax, with the data nicely encapsulated in one part of the statement.
The example above assumes a closed set of values, so it includes a delete clause that removes any extra rows that may have been added to the table. You might feel this is unnecessary if the table is suitably protected, but it does tighten the contract. Alternatively, you could modify this behavior if deletion is an issue or you need to cater for the possibility that extra rows may already have been referenced.
Here’s what it looks like in SSDT. The screen shot below was taken after publishing the project to the database. The post-deployment script is selected and visible in the upper tab and properties window, and the deployed data is visible in the Data Editor in the lower tab, opened by selecting View Data on the AddressType table in the SQL Server Object Explorer.
Some variations of this approach include:
While SSDT has no built-in solution for including data as part of the schema yet, using MERGE from a post-deployment script works with all SQL Server and SQL Azure databases since SQL Server 2008, is straightforward to implement and gets the job done.
Good post! Thanks Bill.
My two cents though: since reference data is more related to the application domain than to a database schema I would recommend keeping it within your application (inside application project) and to import during application init stage, this way you could for example populate AddressType db table from AddressType enum in your code without duplicating enum values in .sql (or .csv etc.) reference data files, also you will have an ability do update reference data more often if needed without republishing your database.
I have to disagree with Konstantin. I don't want a developer to deploy application code and make data changes. Deployment of a database project is different from the deployment of an application project. For one thing, you want to back up the database(s) before applying a data change. The database project has a built-in deployment option to make sure that happens, while an application project doesn't. I also run database unit tests as a step in the automated deployment of our database solution. Each of my database projects has an assocated test project containing the database unit tests for the database that will be deployed using that project. Keeping the reference data with the database project is a natural extension of this process. Everything relating to the database is contained in one location. I don't want to have to hunt through application code to make a simple data change.
How would you handle large amounts of data like over 5GB or 10GB or reference data?