Editor's Note: The following MVP Monday post is by SQL Server MVP Matija Lah and is part of our special series on SQL Server 2012.
SQL Server 2012: Migrating BLOBs to FILETABLEs
The storage of large data used to present quite a challenge in previous versions of SQL Server, especially prior to SQL Server 2008. Although appropriate data types were available in earlier versions of the platform (e.g. IMAGE, VARBINARY(MAX), or even TEXT or NTEXT), maintaining and consuming large data stored in the database was cumbersome at best. This left many database developers with two choices: either store large data in the database and face all the challenges of moving it in and out, or place it outside the database and deal with a similarly difficult set of challenges in keeping the files in sync with the metadata.
Large data management changed significantly with the arrival of SQL Server 2008, which introduced FILESTREAM, an alternative method of storing large data in a dedicated filegroup (rather than together with the rest of the data) that also allows client applications direct access to large object files. After receiving an appropriate file handle from the Database Engine, the client application can consume the file as a stream, rather than having to load it in its entirety – and twice at that: first into the Database Engine, and then into the application.
SQL Server 2012 introduces yet another alternative for storing large data inside a SQL Server database, and in this article you are going to – briefly – learn what it is and how to implement it.
The examples in this article are based on two typical scenarios utilizing large data management techniques built on the SQL Server platform. Each scenario represents a document management solution with documents being acquired, maintained, and consumed in the form of binary files. In both cases, document acquisition and maintenance take place in an operational (OLTP) data store, which is separated from the consumption (DW) data store.
Document maintenance must be transactional, meaning that changes to the files must be part of the same transaction together with any changes to the document metadata.
Document consumption is provided online, via a web site, where the final versions of all the documents, after being approved by the authors, the internal reviewers, and the editors, are published, and made available to the end users. The ASP.NET application used by the web site should provide read-only access to the published documents to multiple concurrent users.
In the first scenario, illustrated in Figure 1 below, both the operational and the consumption data stores are distributed between a SQL Server 2012 relational database, holding all document metadata, and a set of folders in the file system, holding the actual document files.
Figure 1: Distributed Maintenance
This particular configuration presents several drawbacks:
· First of all, all files are stored outside the database; therefore the client application cannot rely on transaction handling provided by SQL Server alone, but must provide complete transaction management logic, maintaining the ACID properties of all operations, regardless of whether they were performed against document metadata (inside SQL Server) or against document contents (outside SQL Server).
· Second, the database server is not “aware” of the files in the file system, and neither is the operating system “aware” of any document metadata in the database (signified by horizontal dashed blue lines, connecting Document Metadata tables with Document File folders, in Figure 1 above). The existence of orphan data (e.g. metadata without corresponding document files, or files without corresponding metadata) is possible.
· Third, the solution implements two distinct security models: access to metadata is subject to SQL Server authorization, and access to document files is subject to Windows operating system authorization. The solution must implement additional logic in order to expose two separate security models to the application, as if they were one and the same.
· Fourth, the solution also implements two distinct backup strategies, making it virtually impossible to keep document metadata backups properly synchronized with document file backups. Backup maintenance utilizes two completely isolated techniques, and is distributed across two separate backup stores, making restore operations far from trivial.
On the other hand, this configuration does provide two significant benefits:
· Since the documents are not stored in the database, document files can be streamed to the client application from the file system directly, completely bypassing the SQL Server Database Engine.
· The transfer of documents from the operational data store to the consumption data warehouse is reduced to a simple file copy operation.
The following migration objectives can be derived from the facts presented above:
· Implement complete transparency and transactional integrity of any and all modification operations, regardless of whether they are done against documents or document metadata.
· Implement a single, uniform, security model.
· Implement a single, homogeneous, backup strategy.
· Maintain direct file access.
In the second scenario, illustrated in Figure 2 below, the operational data store is implemented in a SQL Server database in its entirety, and only the consumption data store is distributed between a SQL Server database and the file system. In the former, both the document contents and the document metadata are maintained in a relational database; in the latter, the metadata is stored relationally, while the files are maintained in the file system, as was the case in Scenario 1.
Figure 2: Distributed Consumption
The following three drawbacks are also present in Scenario 2:
· Two distinct security models are required.
· Two separate, incompatible, backup strategies may be required (unless, in case of failure, the consumption data store is rebuilt, rather than restored).
· The consumption database is not “aware” of the document files, and the operating system is not “aware” of the document metadata.
Compared to Scenario 1, Scenario 2 presents one additional drawback:
· The transfer of documents between the operational data store and the consumption data store, designated by the red arrow in Figure 2, is not quite as simple as a file copy; the documents must be retrieved from the database, using a dedicated client application, before they can be saved as files to the consumption file store.
However, the configuration of Scenario 2 does provide two significant benefits:
· Data maintenance, being restricted to the homogeneous operational data store, is fully transactional and transparent, and referential integrity between document metadata and document contents is guaranteed (designated by the solid blue line connecting the Document Metadata and Document Content tables in Figure 2 above).
· Direct file access is available to the client application in the consumption data store.
As before, the following additional migration objective can be derived for Scenario 2:
· Simplify data transfer between the operational and the consumption data store.
Most of the migration objectives presented earlier are common to both scenarios; three of them – transactional integrity, a unified security model, and a unified backup strategy – could be achieved if the documents were stored inside the database. That way all data access could be transactional, only the SQL Server security model would be required, and SQL Server backups would satisfy the disaster recovery requirements for document metadata as well as document contents.
Also, if data is kept in tables, the process of transferring it from the operational data store to the consumption data store can remain trivial, implemented as a row set copy between two tables; in other words: data transfer could be simplified significantly for Scenario 2, if document contents were stored in the database.
However, one important objective could not be achieved by moving file data into the database. Direct file access, where files can be accessed without going through SQL Server Database Engine, would no longer be available:
· Upon retrieval, any large object stored in a table (for instance, in a VARBINARY(MAX) column, but not using the FILESTREAM feature), would have to be loaded in its entirety into the Database Engine memory before being passed to the client application.
· FILESTREAM data, on the other hand, could be accessed by the client application directly, but still not before access to the file has been granted by the Database Engine (see Books Online and MSDN articles entitled “FILESTREAM (SQL Server)” at http://msdn.microsoft.com/en-us/library/gg471497(SQL.110).aspx, and “FILESTREAM Data in SQL Server 2008” at http://msdn.microsoft.com/en-us/library/cc716724(VS.110).aspx, for details on how to access SQL Server FILESTREAM data from client applications).
Microsoft SQL Server 2012 introduces a new alternative to storing large objects in a SQL Server database: the FILETABLE. A comprehensive definition of this new feature is outside the scope of this article, so please refer to Books Online for details; you can start in the article entitled “FileTables (SQL Server)” at http://msdn.microsoft.com/en-us/library/ff929144(SQL.110).aspx, or even join me at the SQL University Summit 2012 in Bratislava (http://www.sqlu.com), on 21st September 2012, where I will present this significant new feature.
In brief, a FILETABLE is a special type of table with a fixed schema, whose contents are exposed to the operating system as if they were a folder in the file system. FILETABLEs are implemented on top of FILESTREAM, which allows direct, transactional access to large data. However, in contrast to FILESTREAM columns, FILETABLEs can also be configured to allow non-transactional access – i.e. the ability to access files without prior authorization from the Database Engine.
To enable FILETABLEs, the SQL Server instance, and the database, must be configured appropriately:
1. Unless already enabled, FILESTREAM must be enabled on the instance and configured for both Transact-SQL and remote access, with an appropriate UNC name through which FILETABLE data stored in the instance in question will be exposed to the operating system.
2. Unless already available, a FILESTREAM filegroup must be added to the database in question (of course, only user databases are supported).
3. A root folder, through which FILETABLE data of a particular database will be exposed to the operating system, must also be created. A single database-level root folder is reserved for every FILETABLE-enabled database.
As many FILETABLEs as necessary can be created in a FILETABLE-enabled database, each with its own table-level root folder, exposing the data stored in a particular FILETABLE.
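The instance- and database-level configuration steps above can be sketched in Transact-SQL. In this sketch the database name (DocumentStore), the file path, and the directory names are hypothetical; note also that the instance-level share name for remote access is configured through SQL Server Configuration Manager, not Transact-SQL.

```sql
-- 1. Enable FILESTREAM at the instance level (2 = Transact-SQL plus
--    Win32 streaming access; the share/UNC name itself is set in
--    SQL Server Configuration Manager).
EXEC sp_configure 'filestream access level', 2;
RECONFIGURE;

-- 2. Add a FILESTREAM filegroup, and a file, to the user database.
ALTER DATABASE DocumentStore
    ADD FILEGROUP DocumentFilesFG CONTAINS FILESTREAM;
ALTER DATABASE DocumentStore
    ADD FILE (NAME = N'DocumentFiles',
              FILENAME = N'C:\Data\DocumentFiles')
    TO FILEGROUP DocumentFilesFG;

-- 3. Reserve the database-level root folder (DIRECTORY_NAME), and allow
--    non-transactional access, so that FILETABLE contents can later be
--    reached through the file system.
ALTER DATABASE DocumentStore
    SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL,
                    DIRECTORY_NAME = N'DocumentStore');
```

The NON_TRANSACTED_ACCESS level can also be set to READ_ONLY, or turned OFF, which is one way of restricting file system access to a consumption data store.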
In document databases, the document represents the principal business entity. Many of a document's attributes are usually maintained relationally – as columns, or relationships, in one or more tables – e.g. a title, an abstract, authorship information, publication data, etc. When documents are stored as files, they may have additional attributes which – strictly speaking – are not document attributes per se, but rather file attributes – e.g. the time of last modification, the size of the file, the location of the file, etc.
Typically, in document databases, and especially in distributed data stores such as the scenarios presented in this article, the document management application must maintain what are strictly file properties (available in the file system) as part of the document metadata (in the database). As a consequence, the same data is effectively maintained in two different places: in the file system as well as in the database. So, by moving the files into the database, one additional objective can be achieved: the elimination of unnecessary redundancy.
The FILETABLE schema is described in the Books Online article entitled “FileTable Schema” at http://msdn.microsoft.com/en-us/library/gg492084(SQL.110).aspx; study it carefully in order to understand which document file properties are maintained by the Database Engine automatically.
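To illustrate which file properties the Database Engine maintains on your behalf, the following query reads a few of the fixed-schema columns of a FILETABLE; the table name (dbo.DocumentFiles) is hypothetical, while the column names come from the fixed FILETABLE schema.

```sql
-- File attributes below are maintained automatically by the Database
-- Engine, and no longer need to be tracked in the metadata table.
SELECT name,                -- file name, as seen in the file system
       file_type,           -- extension, computed from the name
       cached_file_size,    -- size in bytes
       creation_time,
       last_write_time,
       file_stream.GetFileNamespacePath() AS file_path  -- path below the FILETABLE root
FROM dbo.DocumentFiles
WHERE is_directory = 0;     -- files only; folders are rows as well
```

Note that folders are also represented as rows (with is_directory = 1), so a FILETABLE exposes an entire directory hierarchy, not just a flat set of files.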
After the instance and the databases have been enabled for FILETABLEs, the following steps need to be performed in both scenarios:
1. A new FILETABLE must be created;
2. Data must be copied to the FILETABLE;
3. A foreign key constraint must be established between the document metadata table and the FILETABLE; an appropriate column has to be added to the document metadata table if one does not already exist; and finally
4. Columns that were used to maintain document file attributes before migration can be removed.
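Steps 1 and 3 of the list above can be sketched as follows; the table, column, and constraint names are hypothetical, while the FILETABLE schema itself (including the stream_id key referenced by the foreign key) is fixed.

```sql
-- 1. Create the FILETABLE; the schema is fixed, only the names and the
--    table-level options vary.
CREATE TABLE dbo.DocumentFiles AS FILETABLE
WITH (FILETABLE_DIRECTORY = N'DocumentFiles',
      FILETABLE_COLLATE_FILENAME = database_default);

-- 3. Reference the FILETABLE's unique stream_id key from the document
--    metadata table (adding the column, if it does not already exist).
ALTER TABLE dbo.DocumentMetadata
    ADD DocumentFileId UNIQUEIDENTIFIER NULL
        CONSTRAINT FK_DocumentMetadata_DocumentFiles
        REFERENCES dbo.DocumentFiles (stream_id);
```

Step 2 (copying the data) differs between the two scenarios and is discussed below; step 4 is an ordinary ALTER TABLE … DROP COLUMN operation.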
In Scenario 1, document files are initially stored outside the database. Before the FILETABLE can be created, a FILESTREAM filegroup will have to be added to the database.
After the FILETABLE has been created, the files must be transferred into the database. The least effort is required if the files are simply copied into the root folder of the newly created FILETABLE, e.g. using the robocopy utility. Of course, the appropriate non-transacted access level must be set on the affected database for the file copy to succeed.
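A sketch of this step follows; the database name, the share name, and the folder paths are all hypothetical, and the exact UNC path depends on the instance-level share name and the DIRECTORY_NAME values configured earlier.

```sql
-- Non-transactional access must be allowed before files can be copied
-- into the FILETABLE root folder through the file system.
ALTER DATABASE DocumentStore
    SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL);

-- The files can then be copied with an operating system utility, e.g.
-- (run from a command prompt, not from Transact-SQL):
--   robocopy D:\Documents \\SERVER\MSSQLSERVER\DocumentStore\DocumentFiles /E
```

After the copy completes, the non-transacted access level can be tightened again (e.g. to READ_ONLY) if direct file system writes should no longer be permitted.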
To complete the migration, remove any redundant columns from the document metadata table.
In Scenario 2, document files are already stored in the database; therefore a FILESTREAM filegroup may already exist; otherwise it would have to be added to the database.
After the FILETABLE has been created, the data can be copied using Transact-SQL (i.e. using an INSERT…SELECT statement). To copy the files into appropriate subfolders, however, a file copy utility such as robocopy would be the more convenient alternative to Transact-SQL. Again, the appropriate non-transacted access level must be set on the affected database before attempting to copy the files.
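The Transact-SQL variant of the copy can be sketched as follows; the source table and its column names (dbo.DocumentContent, FileName, Content) are hypothetical, while name and file_stream are fixed FILETABLE columns.

```sql
-- Copy document contents from the original table into the FILETABLE;
-- all remaining FILETABLE columns are populated by the Database Engine.
INSERT INTO dbo.DocumentFiles (name, file_stream)
SELECT d.FileName,   -- becomes the file name in the FILETABLE root folder
       d.Content     -- the VARBINARY(MAX) document contents
FROM dbo.DocumentContent AS d;
```

If the files must land in specific subfolders, the path_locator column can be set explicitly as well, but, as noted above, a file copy utility is usually the simpler option in that case.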
To complete the migration, remove the table formerly used to store document data, as well as any redundant columns from the document metadata table.
In the end, the data store for both scenarios, and even for both the OLTP and the DW environment, could be homogeneous, as illustrated in Figure 3 below.
Figure 3: Result
Both document metadata, as well as document contents, are now stored inside a SQL Server 2012 database, with the document contents also accessible directly, via the file system. A single security model is used, a unified backup strategy can be implemented, complete transactional integrity is guaranteed, and data can now be transferred between the operational data store and the consumption data store using a row set copy, or a file copy operation.
For those of you who prefer a hands-on approach to learning, I have prepared a set of Transact-SQL scripts and sample files that you can use for practice.
You can download the file from the following location: http://milambda.net/snippets/FileTable_Migration_Practice.zip
To complete the exercises, follow the instructions in the ReadMe.txt file, located in the ZIP archive.
Think about other SQL Server features that have not been mentioned in this article, and could not have been implemented in the consumption (DW) data store of either of the two scenarios, but can be implemented if document contents are stored in the database.
Can you imagine which specific SQL Server feature I am referring to? It is a highly important feature of any document management solution, especially for document consumption. Please, post your responses in the comments to this article, and explain why you think the feature(s) you have listed are important in document management.
Matija Lah has more than a decade of experience working with Microsoft SQL Server, mostly architecting data-centric solutions in the legal domain. His contributions to the SQL Server community have led to the Microsoft Most Valuable Professional award in 2007 (Windows Server System SQL Server). In 2008 Matija joined SolidQ (http://www.solidq.com/) as a Mentor, located in Central and Eastern Europe. He spends most of his time on projects involving advanced information management, and natural language processing. Matija also publishes some of his articles in his blog, entitled "snaps & snippets" (http://milambda.blogspot.com/).
About MVP Mondays
The MVP Monday Series is created by Melissa Travers. In this series we work to provide readers with a guest post from an MVP every Monday. Melissa is a Community Program Manager for Dynamics, Excel, Office 365, Platforms and SharePoint in the United States. She has been working with MVPs since her early days as a Microsoft Exchange Support Engineer, when MVPs would answer all the questions in the old newsgroups before she could get to them.