As promised, I will add more technical information about File Classification Infrastructure (FCI) in a series of technical posts devoted to FCI architecture and internals.
We started the classification project with an ambitious vision: to provide a simple, open and comprehensive way to organize data at enterprise level, primarily for files stored on shares. The classification process has to be simple in order to be understood intuitively by any administrator under pressure of managing millions, maybe billions of files. It needs to be open (i.e. extensible) as we need to deal with a huge matrix of file formats, classification methods and types of management policies on top of classification. Finally, it needs to be comprehensive - by installing the latest version of Windows, administrators should be able to get enough value “out of the box” to get started in organizing their information and apply policies right away.
Here is a quick introduction in classification, if you don’t know anything about it yet (By the way, a detailed overview of the problem space is presented here, and an functional overview is presented here. There is also a Microsoft Technet step-by-step guide for enabling classification and file management tasks here.)
In essence, classification allows you to assign metadata to every file on a file server. Specifically, classification deals with actionable file metadata, the kind of metadata that can be used to drive file management policies (such as encryption, protection, retention, expiration, etc). This metadata is just a set of classification properties (in the form Name=value) that is attached to files in some form or another, either through explicit or automatic classification. It is important to note that you (the user) are the one responsible in designing your own classification schema for your solution. By default, Windows does not provide (nor enforce) a predefined set of classification properties available by default.
Classification information can be “produced” in several different ways:
Classification can be consumed in several ways:
Of course, producing/consuming always occurs in parallel – which makes classification look a little bit complex at the first sight.
The whole process is simple to understand if you use the classification pipeline as a simple underlying concept. The pipeline defines how the classification process is executed in each combination of the scenarios above that produce/consume classification:
As you can see, classification works in five stages on a file-by-file basis. Let’s drill down into each of them:
1) Discover Files
In this phase we “assemble” a list of files that will be run through the pipeline.
After this step, we generate a stream of “property bag” objects (one per file) that is passed through the pipeline. You can think of the property bag object as an in-memory object holding the set of “Name=value” properties for a given file. Of course, NTFS metadata (file name, attributes, timestamps, etc) is extracted at this stage and inserted in the property bag.
2) Extract classification (retrieve existing classification from files)
In this phase, we try to extract existing properties from files, that have been previously stored by other FCI-unaware applications such as Microsoft Word or Sharepoint. Some of these mechanisms might parse the file content (such as in the case of Office files) or use other techniques (such as discovering classification information that has been “cached” from a previous classification run).
One important thing to note is that existing classification data might be extracted from different sources, and therefore we have a precise process of reconciliation of potentially conflicting metadata. This process is called aggregation – and it is one of the central concepts in the pipeline architecture.
Extracting/storing property values is done through dedicated software components called storage modules. Each of these storage modules implements a standard COM interface to essentially receive a “stream” of property bags (from the discovery/scanning process) and enhance them in the way through extracted property values.
In Windows Server 2008 R2 we ship several storage module by default:
One more thing to note – you can also define your own storage modules that can store metadata either in the file content, or in an external database. It is recommended, though, that any storage module that parses file content should live in its own process with tightened privileges for security reasons (as parsing file content could be subject to buffer overruns).
3) Classify files
At this stage, we identify which rules affect the current file being processed, and then run each rule to produce new classification property values. Again, we use aggregation to reconcile potentially conflicting values coming from different rules (and potentially from the previous stage as well).
Rules are implemented with the help of a different type of pipeline modules called classifiers. A classifier implements a similar interface as a storage module – it accepts a stream of incoming property bags, and it outputs a stream of “enhanced” property bags containing extra properties discovered during classification.
A classifier can look at the file content, the file system metadata, external storage, or all of them to determine the new property values to be assigned. There are many “flavors” of classifiers which we will detail later.
A pipeline could contain one or several different classifiers. There are two classifiers shipped by default with Windows:
The content classifier can also operate on non-text files (such as Office documents) through the IFilter technology, which basically offers a way to “convert” a certain Office file into a simplified pure text format. One more exciting thing to note – Windows Server 2008 R2 ships with an IFilter-based OCR engine, so the content classifier will also work against text contained in TIFF scanned images! (If you have seen the TechEd keynote demo you know what I mean :-)
4) Store classification properties
This is the natural counterpart of step 2 and is implemented, not surprisingly, by the same storage modules. In this stage the basic philosophy is the following: we store properties in as many locations as possible so we can reconstruct it later.
Of course, we store the aggregated property values as mentioned before.
5) Apply policy based on classification
This is the most interesting of the pipeline steps, and the final stage in the classification pipeline. After all, the whole point of classification is to run some policies against the classification data. Classification in itself is not interesting – what’s truly interesting is the fact that you can run declarative policies on top of classification data.
Policies come in several flavors:
That’s it for now – in next posts we will drill down even more in each of these phases.
(update – small typos)