I no longer work at Microsoft, so please don't bother leaving a comment here or trying to contact me through my MSDN blog.
You can find my new blog at http://www.technologytoolbox.com/blog/jjameson. My new site also provides copies of all posts from my MSDN blog.
One of the things I like most about running System Center Operations Manager in the "Jameson Datacenter" (a.k.a. my home lab) is that it greatly reduces the amount of effort required to monitor numerous servers.
For example, in my environment I am currently monitoring 10 servers 24x7. To some infrastructure folks out there, this might seem like a laughable number -- but keep in mind that my primary role is development, not infrastructure.
While it's nice to have a number of different physical and virtual machines available to help me in my day-to-day work, I can't spend significant time managing the environment -- or else I simply wouldn't have time to do the things that really matter (like ensuring I always deliver the features committed for a particular sprint, or spending time with my wife and daughter).
Consequently, one of the few changes I've made to the out-of-the-box SCOM 2007 R2 configuration is to create a couple of rules so I get an email whenever there is a "hiccup" on any of the monitored servers in my environment. [This is in addition to installing a number of management packs for things like Active Directory, SQL Server, SharePoint, and TFS.]
In other words, I created two rules that generate alerts whenever an error occurs in the Application event log or the System event log on any server being monitored by Operation Manager. While there may be errors generated in other event logs (such as the custom "Operations Manager" log that gets created when you install the SCOM agent on a server), I don't attempt to monitor every event log. My assumption is that errors in other event logs will either be detected by a suitable management pack or will manifest themselves as errors in the Application or System event log if the problem is severe enough.
When adding custom monitors and rules in Operation Manager, you first have to decide which management pack should contain the customizations. Based on what I've read, the best practice is to avoid the Default Management Pack and instead create custom management packs that are specific to a particular area. For my custom "event log error" rules, I created a new management pack called Windows Core Library - Customizations.
To create a rule that generates an alert whenever an error occurs in the Application event log:
Source: $Data/EventSourceName$Event ID: $Data/EventDisplayNumber$Event Category: $Data/EventCategory$User: $Data/UserName$Computer: $Data/LoggingComputer$Event Description: $Data/EventDescription$
Repeat the process to create a similar alert for errors in the System event log.
If you do not specify any fields in the Alert Suppression dialog, then you may receive numerous alerts within a short period of time (for example, when SharePoint Server 2010 floods the Application event log due to an issue with least-privilege configuration).When this occurs, Operations Manager will detect the high frequency of alerts and temporarily suspend the notification, and display a different alert instead:Alert rule: Alert generation was temporarily suspended due to too many alerts.Alert description: A rule has generated 50 alerts in the last 60 seconds. Usually, when a rule generates this many alerts, it is because the rule definition is misconfigured. Please examine the rule for errors. In order to avoid excessive load, this rule will be temporarily suspended until ...
The reason why I choose to set the Severity to Warning (instead of the default -- Critical) is so that when an event log error generates a similar alert in one of the other management packs, I immediately focus on the "primary" alert (rather than the "duplicate" generated by the custom rule).
In order to minimize the effort required to investigate errors in the event logs, I include details from the event in the alert. This is especially useful for quickly understanding errors on a server since it is also included in email generated by the alert.
Generating alerts for any errors that occur in the Application and System event logs will definitely motivate you to take corrective action to resolve the errors. It will also encourage you to try to prevent the same errors from occurring again in the future.
I've attached my sample management pack with the two custom rules -- just in case you want to save yourself the 5 minutes or so it takes to configure the rules.