Tuesday, March 08, 2005 9:59 AM
Michael S. Kaplan
I coffee, therefore IFilter (or, Language-specific processing #1)
Apologies for the title; I still cannot resist that sort of thing. Maybe one day....
If you have not read it yet, look at Language-specific processing #0 for more info about this series!
IFilter is one interface that you can use to lower the barriers between the engines that do the work of indexing and the data that may be sitting in proprietary formats. The documentation probably explains it better than I could here:
The IFilter interface scans documents for text and properties (also called attributes). It extracts chunks of text from these documents, filtering out embedded formatting and retaining information about the position of the text. It also extracts chunks of values, which are properties of an entire document or of well-defined parts of a document. IFilter provides the foundation for building higher-level applications such as document indexers and application-independent viewers.
Several shipping implementations of this feature will immediately come to mind: Full Text Search in SQL Server, SharePoint, Exchange, and Index Server for starters. And then there are others, like MSN Desktop Search, as well. In all of these cases, search supports additional file formats. Imagine being able to get in on the fun and make sure your own format is supported for some type of indexing/searching!
This is a COM interface, so to implement it you have to implement the IUnknown methods (AddRef/Release/QueryInterface) as always. The additional methods you have to implement:
- IFilter::Init - Initializes a filtering session.
- IFilter::GetChunk - Positions filter at beginning of first or next chunk and returns a descriptor.
- IFilter::GetText - Retrieves text from the current chunk.
- IFilter::GetValue - Retrieves values from the current chunk.
- IFilter::BindRegion - Retrieves an interface representing the specified portion of object. Currently reserved for future use (for now you would always return E_NOTIMPL).
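To make the calling pattern behind those methods a little more concrete, here is a rough, portable sketch of the loop an indexer might run: Init once, GetChunk until there are no more chunks, and GetText repeatedly within each text chunk. Everything in it (ToyFilter, Status, ChunkInfo, ExtractAllText) is a hypothetical stand-in that I made up for illustration, not the real COM interface. The real methods return HRESULTs such as FILTER_E_END_OF_CHUNKS and FILTER_E_NO_MORE_TEXT, describe chunks with a STAT_CHUNK, and take more parameters than shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the real status codes (hypothetical names).
enum class Status { Ok, EndOfChunks, NoMoreText };

// Simplified stand-in for the real chunk descriptor (hypothetical type).
struct ChunkInfo {
    unsigned id;   // chunk identifier, starting at 1
    bool isText;   // true: read with GetText; false: would be read with GetValue
};

// Toy filter over an in-memory list of text chunks. A real filter would
// parse a proprietary file format instead.
class ToyFilter {
public:
    explicit ToyFilter(std::vector<std::wstring> chunks)
        : chunks_(std::move(chunks)) {}

    // Starts (or restarts) a filtering session.
    void Init() { next_ = 0; offset_ = 0; haveChunk_ = false; }

    // Positions the filter at the first or next chunk and describes it.
    Status GetChunk(ChunkInfo& info) {
        if (next_ >= chunks_.size()) {
            haveChunk_ = false;
            return Status::EndOfChunks;
        }
        current_ = next_++;
        offset_ = 0;
        haveChunk_ = true;
        info = ChunkInfo{static_cast<unsigned>(current_ + 1), true};
        return Status::Ok;
    }

    // Copies at most `count` characters of the current chunk into `buffer`
    // and sets `count` to the number of characters actually written.
    Status GetText(std::size_t& count, wchar_t* buffer) {
        if (!haveChunk_ || offset_ >= chunks_[current_].size()) {
            count = 0;
            return Status::NoMoreText;
        }
        const std::wstring& text = chunks_[current_];
        const std::size_t n = std::min(count, text.size() - offset_);
        text.copy(buffer, n, offset_);
        offset_ += n;
        count = n;
        return Status::Ok;
    }

private:
    std::vector<std::wstring> chunks_;
    std::size_t next_ = 0;
    std::size_t current_ = 0;
    std::size_t offset_ = 0;
    bool haveChunk_ = false;
};

// The consumer side: drains every text chunk, one small buffer at a time,
// and joins the chunks with newlines.
std::wstring ExtractAllText(ToyFilter& filter) {
    constexpr std::size_t kBufLen = 8;  // deliberately tiny to force several GetText calls
    wchar_t buffer[kBufLen];
    std::wstring out;
    ChunkInfo info{};

    filter.Init();
    while (filter.GetChunk(info) == Status::Ok) {
        if (!info.isText) continue;          // a value chunk would go through GetValue
        if (!out.empty()) out += L'\n';
        for (;;) {
            std::size_t count = kBufLen;     // in: capacity, out: characters written
            if (filter.GetText(count, buffer) != Status::Ok) break;
            assert(count <= kBufLen);        // the filter must never overfill the buffer
            out.append(buffer, count);
        }
    }
    return out;
}
```

The part worth noticing is the in/out count on GetText: the caller passes the buffer capacity in and gets the number of characters written back, which is what lets a chunk of any size flow through a fixed-size buffer.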
The general topic about the IFilter interface has pointers to summaries, samples, instructions on building, applying and testing filters, as well as methods to bind to already existing IFilter implementations.
It is also nice to see such a great effort on the security side -- links and information to help guarantee that ISVs who write code against this interface do it securely. Throughout there are good warnings:
Caution IFilters for Indexing Service run in the Local System security context. They should be written to manage buffers and to stack correctly. All string copies must have explicit checks to guard against buffer overruns. You should always verify the allocated size of the buffer. You should always test the size of the data against the size of the buffer.
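As one small illustration of what "test the size of the data against the size of the buffer" can look like in practice, here is a minimal sketch of a checked string copy. CopyChecked is a hypothetical helper name of my own, not something from the IFilter documentation or the Windows SDK, and a real filter would likely lean on whatever safe string routines its toolchain provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: copies `src` into `dest` only if the whole string,
// including its terminator, fits in `destCapacity` characters.
// On failure, `dest` (when usable) is left as a valid empty string.
bool CopyChecked(wchar_t* dest, std::size_t destCapacity, const wchar_t* src) {
    if (dest == nullptr || destCapacity == 0) return false;
    dest[0] = L'\0';
    if (src == nullptr) return false;

    // Measure the source, but never scan further than could possibly fit.
    std::size_t len = 0;
    while (len < destCapacity && src[len] != L'\0') ++len;
    if (len == destCapacity) return false;  // no room left for the terminator

    std::wmemcpy(dest, src, len);
    dest[len] = L'\0';
    assert(len < destCapacity);             // invariant: terminator is inside the buffer
    return true;
}
```

The design choice here is to refuse the copy outright rather than silently truncate, so the caller has to decide what to do when the data does not fit.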
That and a link to secure code practices to consider when implementing these interfaces are a welcome touch as far as I am concerned (as it does no good for Microsoft to write secure code if an ISV writes a component with a security issue!).
Now note that this interface, this IFilter, is not really about language-specific processing as much as it is about format-specific processing. But one of the greatest strengths of a service like MS Search is the ability to apply it to different file formats. It makes IFilter a very important interface to stretch the boundaries of what can be searched.
And it gives the future topics that deal with the more linguistic aspects of language-specific processing a much wider reach than they would otherwise have. So I will give IFilter an honorary "cool" status that I would usually reserve for things more linguisticalish :-)
This post was sponsored by "F" (U+0046, a.k.a. LATIN CAPITAL LETTER F)
A letter that realized it would never get to sponsor any of the fun "F" words while I am working for Microsoft, so it thought it should take "Filter" while it was available.