Hoping this might be a good topic for some people, because I guarantee there are very few people who have been through integrating these products.
I have been working on integrating the two products at a client for the past 5 months, so hopefully this will help someone out.
Legal discovery has very different requirements from enterprise search. MOSS is a relevance-based search engine, which means that searches conducted by a user are matched against content in the MOSS index using various rankings such as link depth from authoritative sources, document type (e.g. Word is ranked higher than Excel), word repetition within a document, and the number of times a document is linked to. Only a certain portion of each document is indexed (e.g. the first 15 MB); content beyond that point is not considered relevant. Documents over a certain size are also not indexed out of the box. All of this can be adjusted through configuration, but my point is that MOSS is purely focused on making the most relevant content available to a searcher. 95% of the time the average user never goes beyond the first page of results. This is why relevancy is so important for the average user and why the system is tuned for it.
Legal discovery seems to be focused on a very different kind of search. The legal user often wants to view all search results and to search the full text of every document indexed by MOSS. The legal user also wants to put those results into a workflow so they can be saved on a case-by-case basis. MOSS is not tuned to these needs. I will detail the biggest risks that MOSS introduces for a legal discovery search.
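To make the difference concrete, here is a toy sketch in Python. It is not how MOSS works internally: the function names, the character-based index limit, and the ranking by word repetition alone are all assumptions made for illustration. It only shows how a truncated index plus a "first page" cutoff can hide a document that an exhaustive, discovery-style search would return.
```python
def build_index(docs, max_chars):
    """Index only the first max_chars characters of each document (toy stand-in for an index size limit)."""
    index = {}
    for doc_id, text in docs.items():
        for word in text[:max_chars].lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def relevance_search(index, docs, term, page_size):
    """Return only the first page of hits, ranked by how often the term repeats (one toy ranking signal)."""
    term = term.lower()
    hits = index.get(term, set())
    ranked = sorted(hits, key=lambda d: (-docs[d].lower().split().count(term), d))
    return ranked[:page_size]


def exhaustive_search(docs, term):
    """Return every document whose full text contains the term, in no relevance order."""
    term = term.lower()
    return sorted(d for d, text in docs.items() if term in text.lower().split())


# Hypothetical documents; the third buries the term past the index limit.
docs = {
    "memo.doc": "contract contract contract renewal terms",
    "notes.doc": "meeting notes about the contract",
    "long.doc": "filler " * 10 + "contract clause buried deep",
}

index = build_index(docs, max_chars=40)
first_page = relevance_search(index, docs, "contract", page_size=1)
everything = exhaustive_search(docs, "contract")
```
With these inputs the relevance search returns only memo.doc, while the exhaustive search returns all three documents, including long.doc, which never made it into the truncated index.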
Gaps and Mitigations
There is currently no iFilter for TIFF. I’m guessing that’s the format files are being scanned in as.
Capatris is developing a TIFF iFilter - Capatris IFilter
Symantec Enterprise Vault is capable of archiving file shares. This means files on file shares will be replaced with placeholders. The SharePoint crawler does not recognize the difference between a placeholder and a file on the file system.
The question becomes: do you want to crawl files that have been archived?
Our answer was no. Since Symantec Enterprise Vault builds its own index, why have two? Skipping archived files also lightens the load on the MOSS index. The behavior we witnessed is that MOSS would continue to pull items out of the vault after they were archived. The only way to prevent this is to put Symantec Enterprise Vault in backup mode and add the crawl account to the backup group. This, however, floods the crawl log with errors, so it is not the best option either if you want search to remain manageable.
The best option would be for the crawl account to ignore documents with the offline attribute. This is not currently available in MOSS, but hopefully one day it will be.
Update: This has been fixed in the MOSS August 2008 hotfix. It's not very clear from the notes, but here it is: http://support.microsoft.com/kb/956056/. “When you try to crawl a content source, the offline files in the content source are indexed unexpectedly. The offline files are the files that have the PR_FILE_ATTRIBUTE_OFFLINE attribute set. Note After you apply the hotfix, offline files will are not indexed any longer”
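If you want to get a feel for how many files on a share carry the offline attribute, a small script can report them. This is a minimal sketch in Python, not part of MOSS or Enterprise Vault: the function names are made up for illustration, and it assumes the files you care about are marked with the standard Windows FILE_ATTRIBUTE_OFFLINE bit (0x1000). Python only exposes st_file_attributes on Windows, so on other platforms the script simply reports nothing.
```python
import os

# Documented Windows value of FILE_ATTRIBUTE_OFFLINE.
FILE_ATTRIBUTE_OFFLINE = 0x1000


def is_offline(attributes: int) -> bool:
    """Return True if the offline bit is set in a Windows file-attribute bitmask."""
    return bool(attributes & FILE_ATTRIBUTE_OFFLINE)


def find_offline_files(root: str):
    """Yield paths under root whose attributes include the offline bit.

    st_file_attributes exists only on Windows; elsewhere it is treated as 0,
    so nothing is reported.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # Skip files that cannot be inspected (permissions, broken links).
                continue
            if is_offline(getattr(st, "st_file_attributes", 0)):
                yield path
```
For example, `for path in find_offline_files(r"\\server\share"): print(path)` would list the candidates on a hypothetical share; the path here is a placeholder, not one from our environment.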
If you want to search MOSS and Enterprise Vault from a single interface today, your only option is custom code calling the EV COM interfaces. This is very complicated code to write. Luckily, around the end of the year Symantec will release a new version that supports Search Federation, so a single interface can be used.
Search Federation still does not cover the Discovery Accelerator (DA) product. It's the one product I feel should be integrated into MOSS. It would be beneficial if there were workflows that could launch out of MOSS search into DA. I have heard, however, that DA will be able to federate queries back to MOSS, so when a legal user is within the DA interface, MOSS search results can appear. This is due in the next release of Symantec EV.
Hope I helped someone with my thoughts around legal discovery. Please feel free to comment.