Microsoft Search uses IFilters (also called filters) to extract text and metadata from documents. An IFilter is a COM object that implements the IFilter interface. One example is the HTML filter, which extracts the text of HTML documents. In addition to text, it can also emit metadata such as titles and links. The IFilter COM interface is documented on MSDN.
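To make the HTML filter's job more concrete, here is a rough sketch of the kind of output it produces: body text plus metadata such as the title and links. This is not the IFilter COM interface and not how the real HTML filter is implemented; it is a small stand-in written with Python's standard `html.parser` module, and the names `HtmlTextExtractor` and `extract` are made up for illustration.

```python
from html.parser import HTMLParser


class HtmlTextExtractor(HTMLParser):
    """Hypothetical stand-in for an HTML filter: collects text, title and links."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []   # visible text, in document order
        self.title = ""         # metadata: contents of <title>
        self.links = []         # metadata: href values of <a> tags
        self._in_title = False
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            # Keep only non-empty text that is not script or style content.
            self.text_chunks.append(data.strip())


def extract(html):
    """Return the text and metadata found in an HTML string."""
    parser = HtmlTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "links": parser.links,
        "text": " ".join(parser.text_chunks),
    }


if __name__ == "__main__":
    sample = (
        "<html><head><title>Hello</title><style>p{}</style></head>"
        "<body><p>Some <a href='http://example.com/'>linked</a> text.</p>"
        "</body></html>"
    )
    print(extract(sample))
```

A real IFilter hands this information to the indexer in chunks through COM calls rather than returning a single dictionary, so treat the sketch only as a picture of what "text plus metadata" means for an HTML document.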
For a list of IFilters developed by independent software vendors, please see http://addins.msn.com/
About Shajan Dasan
Leads a development team in Microsoft Search, responsible for the crawler, text extraction, and linguistic components. Products that ship these components include Office 12, Windows Desktop Search, SQL Server, and Index Server. Before Search, implemented the type-safety verifier in the .NET just-in-time compiler and did the initial implementation of the Code Access Security system, including the Security Policy System, Stack Walk, Link Demands, and IsolatedStorage.