I have heard a lot of complaints from my friends and the community that desktop search interferes with their daily use of the computer. These applications, such as Windows Search (also known as WDS, Windows Desktop Search, and the built-in search engine in Vista/2008) and Google Desktop Search, index your files while the computer is idle. In theory, this should not affect your PC’s performance. However, sometimes you will notice your CPU usage going up to 100% (or 50% on a dual-core system, with one of the cores at 100%) for quite a few minutes, even when you are busy working in some application. This not only drags down the performance of your current applications, but also shortens battery life if you are using a laptop on the go, and brings annoying fan noise.
So, it is really a nasty problem. When you bring up Task Manager, you can see that something related to Search is sucking power from your CPU. In my experience, the process name is SearchFilterHost.exe. Let’s take a look at it with Sysinternals’ Process Explorer to understand the relationship between the processes.
This is a child process of Windows Search. Without a second thought, I killed this process, and after a short while it restarted, again with 50%+ CPU usage. Nasty. It’s quite hard to diagnose a Windows Search problem because it cannot tell you what it is working on (except for those who can read minidumps and trace into the process, who are really a minority among us IT guys). But from its name, I know it is a filter host, and together with the other child process in this tree, I can tell that the two processes do the search work: one handles the protocol of the file, the other handles the filter of the file.
The filter daemon is a common part of the Microsoft Search architecture. SharePoint, SQL Server and Windows Search all use this daemon to load IFilters and extract information from different types of files. If this process takes a lot of CPU power, it is quite possible that an IFilter is in trouble, most likely because it has encountered something it cannot process.
Let’s take a look at the threads of SearchFilterHost. You will notice that one of the threads, in RPCRT4.dll, is sucking CPU power. RPC stands for Remote Procedure Call, so this could be another piece of evidence.
Now I suspect that a corrupt or malformed file is causing the problem. Because the file is corrupt, the IFilter might not be able to process it correctly, and the wasted cycles drain all the processing power of the core. But with no activity log from Windows Search, how can I know which file is the culprit?
Process Monitor is the tool this time. It is also from Sysinternals, and it combines and replaces FILEMON/REGMON. Run it with a filter that includes all the related processes, monitor only file activity, and wait for the problem to reappear.
After a short while, CPU usage goes up again. Stop the capture, and take a look at the log.
Only SearchIndexer is working, and this has already lasted for quite a while. This is abnormal, because it should load the protocol daemon and the filter daemon to process different files. It is more evidence for the suspected corrupt file. Now scroll up and try to find the last file it accessed.
Now it is clear: the indexer loaded “KurzfassungvonInhalt.docx” into memory and got stuck there for a few minutes. That file should be the root cause of the problem.
I really had no idea why this file was on my hard disk. But then I remembered that it was sent to me by a friend in Germany. She told me she had a Word document that she could not open any more, and asked me to try to fix it if possible. If you open this file in Word, you will see an error message.
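If the capture is large, scrolling by hand gets tedious. Process Monitor can save a capture as CSV, so the hunt for the last file touched can be scripted. Here is a minimal Python sketch, assuming the export uses the default columns ("Process Name", "Operation", "Path" and so on) and that rows are in chronological order. The helper name, the sample rows and the C:\Docs paths are made up for illustration.

```python
import csv
import io

def last_file_accessed(csv_text, process_name="SearchIndexer.exe"):
    """Return the Path of the last event logged for process_name, or None."""
    last_path = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keep overwriting, so the final value is the most recent event.
        if row.get("Process Name") == process_name and row.get("Path"):
            last_path = row["Path"]
    return last_path

# Made-up sample in the shape of a Process Monitor CSV export.
sample = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:01:02.1","SearchIndexer.exe","1234","ReadFile","C:\Docs\notes.txt","SUCCESS",""
"10:01:02.4","explorer.exe","888","ReadFile","C:\Docs\todo.txt","SUCCESS",""
"10:01:03.0","SearchIndexer.exe","1234","ReadFile","C:\Docs\KurzfassungvonInhalt.docx","SUCCESS",""
'''

print(last_file_accessed(sample))
```

For a real export you would read the file contents first; opening it with encoding="utf-8-sig" is a safe choice in case the file starts with a byte-order mark.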
I removed the file, and the CPU usage problem went away.
In a similar case, I observed that some doc files produced by WPS Pro edition (an Office clone from China; its personal edition does not have the problem) caused the same issue. These files can be opened in MS Word, but cannot be processed by the IFilter. I have no experience with doc files from OpenOffice, but these experiences might help you identify the cause if you are suffering from the same problem.
Process Explorer and Process Monitor can be downloaded from http://technet.microsoft.com/en-us/sysinternals/default.aspx, or www.sysinternals.com. Don’t capture too many events with Process Monitor at one time, otherwise you will run out of RAM.
Many people have been asking for best practices or a guide to properly maintain the Lotus Notes indexing function in SharePoint Search. So here it is. This is not an official guide, but our experience from several big customers. I will write this in a Q/A format, so you can jump to whichever question applies to your current problem.
Q1. How many Lotus Notes content sources can I crawl at the same time? A1: One content source per Domino server. If all of your content is on a single Domino server, you have to crawl the content sources one by one. But if you have several Domino servers to index, you can index them at the same time. This is a limitation of the IBM Lotus Notes C++ API, so you may need to set schedules carefully to crawl these content sources.
Q2. How many Lotus Notes content sources should I crawl at the same time? A2: The only difference from the first question is CAN versus SHOULD. There should be a limit on this number, but what is it? I don’t have a direct answer, because the number depends on your hardware performance, memory usage, network latency and bandwidth, and so many other factors. For recent hardware with 8GB of RAM, I would recommend 3, with scheduled memory recycling (we will talk about this later).
Q3. I have a Notes database indexed, but how come an incremental crawl takes nearly as long as a full crawl? A3: During an incremental crawl, the SharePoint search engine checks the LastModifiedTime property of the target documents/items to determine whether the target object should be fully retrieved into its index. However, for certain content sources this property is not retrieved, or is mapped to something else by mistake, so the engine can only fetch all the content and check whether anything differs. I’m looking into a possible solution for this problem, and will post an update if I find something.
Q4. Should I use x86 or x64 for Lotus Notes indexing? A4: Because of a limitation of the IBM Notes C++ API, the Notes Protocol Handler can only run on an x86 box. However, you can still use x64 query servers and WFEs. Remember: do not mix x64 and x86 boxes within the same tier, but you can have an x86 indexer tier with x64 query and x64 WFE tiers; this is the recommended setup for Notes search in SharePoint 2007/Search Server 2008. (IBM recently released an x64 version of their API, but it is impossible to make the current NotesPH work with it, because many things changed.)
Q5. You mentioned memory recycling. What does that mean? A5: Due to the x86 limitation, the memory available to each process is capped. And because we are calling the Notes client through the API, it is quite possible that the MSSEARCH/MSSDMN processes will hit the memory limit after crawling a large number of documents. So I recommend that you recycle these processes at regular intervals. This can prevent the crawl from getting stuck. To do this, you might need to write your own scheduling program with the SharePoint search administration APIs, and restart the osearch service when needed. I will also add this function to SharePoint Search Admin 0.81 and later in a few days.
Q6. Any ideas about security trimming support? What should I do on the Domino side? A6: You can use Lotus Notes users and groups to control security, and map them to AD users to achieve security trimming of search results in SharePoint. But it is generally advised not to use Lotus Notes roles for security control, as there is no corresponding concept in Active Directory.
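The two constraints from Q1 and Q2 (never two content sources from the same Domino server at once, and only a handful of crawls in total) can be turned into a simple schedule. Here is a rough Python sketch that is not tied to any SharePoint API. The function name, the content source names and the server names are all hypothetical, and the default limit of 3 is just the figure suggested in A2.

```python
from collections import defaultdict, deque

def plan_crawl_batches(sources, max_concurrent=3):
    """Group (content_source, domino_server) pairs into batches to run one after another.

    Within a batch, every content source lives on a different Domino server,
    and a batch never holds more than max_concurrent content sources.
    """
    queues = defaultdict(deque)
    for name, server in sources:
        queues[server].append(name)

    batches = []
    while any(queues.values()):
        # Serve the servers with the most pending work first (ties broken by name).
        ready = sorted((s for s in queues if queues[s]),
                       key=lambda s: (-len(queues[s]), s))
        batches.append([queues[s].popleft() for s in ready[:max_concurrent]])
    return batches

# Hypothetical content sources and servers.
sources = [("HR", "domino1"), ("Sales", "domino1"), ("Wiki", "domino2"),
           ("Mail", "domino3"), ("Archive", "domino4")]
print(plan_crawl_batches(sources))
# [['HR', 'Wiki', 'Mail'], ['Sales', 'Archive']]
```

Each batch would then map to one time slot in the crawl schedules of the content sources.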
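To see why a missing timestamp hurts so much, here is a simplified Python model of the decision described in A3. It only illustrates the logic and is not SharePoint’s actual implementation; the function name, item names and dates are invented.

```python
from datetime import datetime

def must_fetch_fully(last_modified, last_crawl):
    """Decide whether an item has to be retrieved in full during an incremental crawl."""
    if last_modified is None:
        # No usable timestamp: the only way to detect a change is to fetch the content.
        return True
    return last_modified > last_crawl

last_crawl = datetime(2008, 6, 1)
items = {
    "unchanged.nsf/doc1": datetime(2008, 5, 20),
    "edited.nsf/doc2": datetime(2008, 6, 3),
    "no-timestamp.nsf/doc3": None,
}
to_fetch = [name for name, ts in items.items() if must_fetch_fully(ts, last_crawl)]
print(to_fetch)
# ['edited.nsf/doc2', 'no-timestamp.nsf/doc3']
```

If every item in a content source looks like the third one, every item gets fetched, and the incremental crawl costs about as much as a full one.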
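The core of such a scheduling program is a small decision: recycle when the processes have run long enough or grown big enough. Here is a minimal Python sketch of that decision. It does not use the SharePoint search administration APIs; as a stand-in it only lists the standard Windows `net stop` / `net start` commands for the osearch service without running them. The 6-hour interval and the 1.5 GB threshold are placeholder assumptions, not recommendations, and how you measure the memory of MSSEARCH/MSSDMN is left out.

```python
from datetime import datetime, timedelta

def should_recycle(now, last_restart, memory_bytes,
                   max_age=timedelta(hours=6), memory_limit=1_500_000_000):
    """Return True when the search processes are due for a restart."""
    too_old = now - last_restart >= max_age      # scheduled recycling
    too_big = memory_bytes >= memory_limit       # getting close to the x86 ceiling
    return too_old or too_big

def restart_commands(service="osearch"):
    """Commands a scheduler could run to restart the service (not executed here)."""
    return [["net", "stop", service], ["net", "start", service]]

started = datetime(2008, 1, 1, 0, 0)
if should_recycle(started + timedelta(hours=7), started, memory_bytes=400_000_000):
    for command in restart_commands():
        print(" ".join(command))
```

A real scheduler would run these checks from a scheduled task and pass the commands to something like subprocess.run.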
Q7. To be added. A7:
Btw, I’m moving to a new position in IW PMG, as a Technical Product Manager driving SharePoint IT Pro readiness. So in the future, more things like SharePoint Governance will appear on this blog :).