At Microsoft, it almost feels like we are installing a new CRM deployment every week. On the one hand, it is exciting to see that most deployments use the email router. On the other hand, one needs to worry about scalability and reliability. In case you are not familiar with the email router, this optional component enables automatic tracking of emails received by unattended mailboxes (e.g. email@example.com). It can also be used to automatically track emails received by Outlook users (no need to manually click on the “track in CRM” button). Using server-side rules, received emails can be forwarded to forward mailboxes (typically one per CRM deployment). The email router logs in to those forward mailboxes to process emails. Based on various user settings (e.g. track all emails vs. track only CRM-related emails), the email router decides whether to track or discard each email. If the decision is to track, an email activity is created and linked to other CRM records (contacts, accounts, etc.).
At the time of writing, Microsoft runs all CRM-related email processing using a single email router, shared by 83 CRM deployments. That’s a lot of deployments! Microsoft senior systems engineer Ken Charlton collected performance data during a representative 24-hour window to see how things shaped up. During this period, the email router logged on 91000 times to 83 forward mailboxes. 9400 forwarded emails were received, out of which over 2400 were tracked in CRM. About 6300 were discarded. Therefore, roughly 25% of forwarded emails were tracked in CRM. The remaining 75% were not tracked, most of them dismissed as not interesting. This percentage is quite specific to the way Microsoft CRM users set up email tracking options. For example, a call center may want to systematically track all received emails. Another type of business may track emails very selectively. Note that forward rules can be customized to control precisely what types of emails should be forwarded.
Ken uses various tools to manage and monitor the email router and forward mailboxes. For example, MOM is configured to report memory, CPU, and disk statistics. Statistics generated by the email router’s performance counters are collected as well. To prevent forward mailboxes from growing (and to detect delivery failures), Ken implemented a utility to check the size of each forward mailbox and generate weekly reports. Each report shows, for each forward mailbox, how many emails are sitting in the inbox (waiting to be processed), and how many were moved to the undeliverable folder (because they could not be delivered to CRM). MOM is configured to send alerts if the service stops or generates too many error / warning events. Finally, a web page allows CRM administrators to request the registration of new forward mailboxes (see this).
The email router seems to be able to keep up with our 83 internal deployments, but how much effort does it put into it? Is it stretched to the limit, or can it perform the task effortlessly? After all, our email router is running on an out-of-warranty ProLiant DL380 G3 server purchased in 2003. CPU usage remained negligible. Ken had an interesting comment about this: “Given the small CPU load, this could probably be done in a VM.” Interesting idea! As for memory usage, the average footprint was 49 Megs. For comparison's sake, opening a new notepad.exe instance consumes 4 Megs of memory. Overall, no performance issues, but what about reliability?
If you did the math, you probably noticed that the numbers above did not quite add up. Out of 9400 processed emails, 608 were reported as “suspect”. So what makes an email message “suspect”? Suppose a user sets up the forward rule, but later on his CRM account is disabled. Unless someone remembers to delete the user’s forward rule, the email router will be left wondering why an unknown CRM user is forwarding emails. So a warning is logged. Alternatively, suppose that a spammer guesses the email address of a forward mailbox and bombards it with junk emails. Warnings will be logged for the same reason. Out of the 9400 emails we processed, 118 were also reported as “failed” and moved to undeliverable folders. This makes the error rate close to 1%, which seems relatively high. The email router uses quite a bit of retry logic when processing emails. Rebooting a CRM web server, for example, should not cause delivery failures (just some retries). There is good evidence that those failures come from abandoned deployments. The email router is simply trying to deliver emails to CRM servers which no longer exist. Other factors may include corrupted emails and long-lasting network outages.
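If you want to redo the math yourself, here is a small Python sketch that turns the counts quoted above into percentages. Keep in mind that the tracked and discarded counts are rounded in this post (“over 2400”, “about 6300”), so the results are approximate, and the names `outcomes` and `shares` are just illustrative.

```python
# Counts quoted in the post. "tracked" and "discarded" are rounded there,
# so the resulting percentages are approximate.
received = 9400
outcomes = {"tracked": 2400, "discarded": 6300, "suspect": 608, "failed": 118}


def shares(counts, total):
    """Return each outcome as a percentage of total, rounded to one decimal."""
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


percentages = shares(outcomes, received)
for name, pct in percentages.items():
    print(f"{name:>9}: {pct}%")
```

Discarded emails come out near two thirds of the total; suspect and failed emails account for most of the gap between that figure and the 75% that were not tracked.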
Here are a few tips and tricks to achieve high scalability and reliability.
When configured to process multiple forward mailboxes, the email router uses a round-robin polling strategy. Each polling cycle consists of logging on to forward mailboxes in sequence, followed by a sleep. To control the duration of the sleep, set registry key “PollingPeriod” (seconds). If many forward mailboxes must be processed, you may not want the service to sleep for too long. However, the longer you sleep, the fewer system resources are consumed.
Consider the case where a single forward mailbox receives a sudden blast of emails. You may not want the email router to process thousands of emails before advancing to the next forward mailbox (since processing all these emails may take hours). To introduce some level of fairness, set registry key “MaxMessageCount”. The email router will automatically advance to the next forward mailbox after processing up to the specified count.
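To make the interplay between “PollingPeriod” and “MaxMessageCount” more concrete, here is a rough Python simulation of the polling behavior described above. It is only a sketch of the idea, not the email router's actual code: the function names, the in-memory mailboxes, and the injectable `sleep` are all assumptions made for illustration.

```python
from collections import deque


def poll_cycle(mailboxes, max_message_count):
    """Visit each mailbox once, processing at most max_message_count emails.

    mailboxes maps a mailbox name to a deque of pending emails.
    Returns the (mailbox, email) pairs in the order they were processed.
    """
    processed = []
    for name, inbox in mailboxes.items():
        # Cap the work per visit so one busy mailbox cannot starve the others.
        for _ in range(min(max_message_count, len(inbox))):
            processed.append((name, inbox.popleft()))
    return processed


def run(mailboxes, max_message_count, polling_period, cycles,
        sleep=lambda seconds: None):
    """Run several polling cycles, sleeping polling_period seconds after each.

    The default sleep does nothing so the simulation finishes instantly;
    pass time.sleep to wait for real.
    """
    history = []
    for _ in range(cycles):
        history.append(poll_cycle(mailboxes, max_message_count))
        sleep(polling_period)
    return history


# A mailbox with a backlog ("sales") next to a quiet one ("support").
demo = {"sales": deque(range(5)), "support": deque(["a"])}
demo_history = run(demo, max_message_count=2, polling_period=60, cycles=3)
```

With a cap of 2, the lone “support” email is handled in the first cycle instead of waiting behind the whole “sales” backlog, which takes three cycles to drain.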
Also consider the case where a single email is sent to 1000 users, all members of the same CRM deployment, and all forwarding to the same mailbox. The email router uses an in-memory cache to quickly filter out copies of the same email. You can set registry key CacheCapacity to control the size of this filtering cache. Increase the cache size to match the number of deployments and users.
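Here is a minimal Python sketch of what such a bounded filtering cache could look like. The post does not say how the email router picks entries to drop when the cache is full, so the oldest-first eviction below is an assumption, as are the class name `DuplicateFilter` and the use of a message ID as the key.

```python
from collections import OrderedDict


class DuplicateFilter:
    """Bounded cache of recently seen message IDs (hypothetical sketch)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, message_id):
        """Return True if message_id was seen recently, else remember it."""
        if message_id in self._seen:
            # Refresh the entry so frequently repeated emails stay cached.
            self._seen.move_to_end(message_id)
            return True
        self._seen[message_id] = None
        if len(self._seen) > self.capacity:
            # Assumed policy: drop the least recently seen ID.
            self._seen.popitem(last=False)
        return False
```

The sketch also shows why the capacity matters: if it is too small, the ID of a widely distributed email can be evicted before all of its copies have arrived, and later copies are no longer recognized as duplicates.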
To monitor the email router, you could use the CRM MOM pack. You should adjust default rules to match environment characteristics. If you cannot use MOM, the Windows performance monitor (perfmon) is useful. Just add counters found under performance object “MSCRMExRouterService”. For verbose logging, set the email router service registry key “LogLevel” to a value of 4. Also frequently monitor the size of inbox folders to make sure your forward mailbox is not accumulating too many emails (possibly because of a spam attack, or because the email router was shut down, etc.).
Because connections to some forward mailboxes may be unreliable, set registry key “ConnectionTimeout” (seconds) to a small value to prevent the email router from getting stuck on a particular mailbox. To control how long an email will be retried before it is moved to the undeliverable folder, set registry key “MessageExpiry” (seconds). If this value is too small, a simple CRM server reboot may cause failures. If this value is too high, excessive retries may waste system resources.
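The trade-off around “MessageExpiry” can be illustrated with a toy Python simulation of one email being retried while a CRM server is down. This is a sketch under simplifying assumptions: the fixed `retry_interval` is hypothetical (the post does not describe the router's retry schedule), and the function name is made up for the example.

```python
def deliver_with_retries(outage_seconds, message_expiry, retry_interval):
    """Simulate retrying one email while the CRM server is unavailable.

    Returns ("delivered", attempts) if an attempt lands after the outage,
    or ("undeliverable", attempts) if the next retry would exceed
    message_expiry. retry_interval is a hypothetical fixed delay.
    """
    if retry_interval <= 0:
        raise ValueError("retry_interval must be positive")
    elapsed = 0
    attempts = 0
    while True:
        attempts += 1
        if elapsed >= outage_seconds:
            return ("delivered", attempts)
        if elapsed + retry_interval > message_expiry:
            return ("undeliverable", attempts)
        elapsed += retry_interval


# A 5-minute reboot with a 2-minute expiry versus a 1-hour expiry.
short_expiry = deliver_with_retries(300, 120, 30)
long_expiry = deliver_with_retries(300, 3600, 30)
```

With the short expiry the email gives up before the server is back; with the long one it is delivered after a handful of retries.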
From a security standpoint, stay away from using forward mailbox email addresses that could easily be guessed. Secure access to prevent delivery outside of the intranet, unless you need to enable receiving forwarded emails from the internet. Also carefully monitor event logs / performance counters reporting “suspect” emails as those may signal security problems.
I hope these comments will give you an idea of how we use the email router at Microsoft. Expect many improvements in the next version. I wish I could say more, but we are not yet allowed to comment on this.
The report that includes the mailboxes (Inbox and undeliverable) is generated *every* day. This allows the CRM support team to know if a mailbox is having problems; they can compare one day to the next.
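The day-to-day comparison can be sketched in a few lines of Python. The report format below (a dict mapping each mailbox name to its inbox and undeliverable counts) and the function name are assumptions for illustration; the actual report utility is not shown here.

```python
def growing_mailboxes(yesterday, today, threshold=0):
    """Return the sorted names of mailboxes whose counts grew since yesterday.

    Each report maps a mailbox name to (inbox_count, undeliverable_count).
    A mailbox missing from yesterday's report is treated as (0, 0).
    """
    flagged = []
    for name, (inbox, undeliverable) in today.items():
        prev_inbox, prev_undeliverable = yesterday.get(name, (0, 0))
        if (inbox - prev_inbox > threshold
                or undeliverable - prev_undeliverable > threshold):
            flagged.append(name)
    return sorted(flagged)


report_day1 = {"sales": (3, 0), "support": (10, 2)}
report_day2 = {"sales": (2, 0), "support": (450, 2), "new": (1, 0)}
flagged = growing_mailboxes(report_day1, report_day2)
```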
Another recent development is a custom MOM rule to monitor MSCRM processing. If the number of mailbox logons does not increase by at least 5 in 15 minutes, a MOM alert is generated. Typically we see 80+ logons every 1-2 minutes.
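The logic of that rule can be sketched in Python as follows. This is not the MOM rule itself, only an illustration of the check: the sample format (timestamp in seconds, cumulative logon count) and the function name are assumptions.

```python
def logons_stalled(samples, now, window_seconds=15 * 60, min_increase=5):
    """Return True if logons grew by fewer than min_increase over the window.

    samples is a list of (timestamp_seconds, cumulative_logons) pairs,
    sorted by timestamp. The baseline is the newest sample that is at
    least window_seconds old; without one, there is not enough history
    to judge and no alert is raised.
    """
    baseline = None
    latest = None
    for timestamp, count in samples:
        if timestamp <= now - window_seconds:
            baseline = count
        if timestamp <= now:
            latest = count
    if baseline is None or latest is None:
        return False
    return latest - baseline < min_increase


# One sample every 5 minutes; logons nearly stop after the second sample.
samples = [(0, 1000), (300, 1400), (600, 1402),
           (900, 1403), (1200, 1403), (1500, 1404)]
```

Since a healthy router logs 80 or more logons every minute or two, a threshold of 5 in 15 minutes only fires when processing has effectively stopped.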
We found that a couple of the web servers are not responsive to the router, and processing will halt until the time-out period is reached. If there are enough messages in the inbox, the router takes a long, long time to process a single inbox. The alert lets us know there is a problem. The workaround is to:
1) Increase logging to 4
2) Cycle the router service
3) Review the logs to determine the faulty web server
4) Remove the entry from the registry
5) Decrease logging to 1
6) Cycle the router service
7) Notify the web server owner there is a problem. When the problem is remedied, the entry is added back to the registry.