I described alert storm mitigation in one of my previous posts. This mitigation works out of the box in a current OpsMgr 2007 R2 deployment once the system management packs are imported.
Because I forgot about this post (which explains how to insert alerts from the SDK), I also failed to realize that the alert suppression feature will not work unless you change your custom management pack.
Markus reminded me that I need to provide an update so that this feature works properly for SDK-inserted data that generates too many alerts within a short period of time.
History of SDK inserted alert requirements:
When we designed this SDK feature, we decided that it was best for the MP (and connector) author to copy the condition detection module System.Connectors.GenericAlertMapper, the publisher write action System.Connectors.PublishAlert, and the final composite module type System.Connectors.GenerateAlertFromSdkEvent into their own management pack.
That is why we did not include these definitions in the Microsoft management packs (some things have changed since then, so it is possible that we will make them available as public modules from one of our system MPs in the future).
A management pack with sample modules can also be downloaded directly by clicking this.
Unfortunately, this decision has a negative impact on the “alert storm mitigation” feature, which was delivered as part of the current OpsMgr release. The implementation of this feature included a minor change in the configuration of the module that converts the input data type to an alert (System.Connectors.GenericAlertMapper).
Since your current implementation still works, it is clear that this change did not break backward compatibility.
However, unless the configuration is also changed for your own modules that generate alerts for SDK-inserted data, the runtime will not initialize the component responsible for alert storm recognition and suspension.
Customization needed:
You must change the version of all referenced system management packs to the OpsMgr R2 version. This ensures that the MP cannot be imported into previous releases of OpsMgr, where the feature is not implemented and the alert-generating workflow would be unloaded. (The unload would be caused by a failure during configuration processing while loading the module responsible for alert generation.)
<Version>6.1.7221.0</Version>
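As a rough illustration, the references in the manifest of a custom management pack might look like the fragment below once they are moved to the R2 version. The aliases System and Health match the System! and Health! prefixes used by the module definitions in this post. The PublicKeyToken value is an assumption (it is the token commonly seen on Microsoft sealed management packs) and should be checked against the packs actually imported in your environment.

```xml
<References>
  <!-- Alias "System" backs the System! prefix used by the module types in this post -->
  <Reference Alias="System">
    <ID>System.Library</ID>
    <Version>6.1.7221.0</Version>
    <!-- Assumed token; verify against your imported system MPs -->
    <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
  </Reference>
  <!-- Alias "Health" backs the Health! prefix (AlertSchema, AlertUpdateData) -->
  <Reference Alias="Health">
    <ID>System.Health.Library</ID>
    <Version>6.1.7221.0</Version>
    <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
  </Reference>
</References>
```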
You must also add the ManagementGroupName element to the configuration of the module(s). The additions, visible as the ManagementGroupName lines in the XML representation of the module types below, should be self-descriptive for both the alert-generating module and the final composite module type.
<ConditionDetectionModuleType ID="System.Connectors.GenericAlertMapper" Accessibility="Internal" Batching="false" Stateful="false" PassThrough="false">
<Configuration>
<IncludeSchemaTypes>
<SchemaType>Health!System.Health.AlertSchema</SchemaType>
</IncludeSchemaTypes>
<xsd:element name="Priority">
<xsd:simpleType>
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="0" />
<xsd:maxInclusive value="2" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="Severity">
<xsd:simpleType>
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="0" />
<xsd:maxInclusive value="2" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="ManagedEntityId" type="xsd:string" />
<xsd:element name="AlertName" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="AlertDescription" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="AlertOwner" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="AlertMessageId" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="AlertParameters" type="System.Health.AlertParameters" minOccurs="0" maxOccurs="1" />
<xsd:element name="Suppression" type="System.Health.Suppression" minOccurs="0" maxOccurs="1" />
<xsd:element name="WorkflowId" type="xsd:string" />
<xsd:element name="Custom1" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom2" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom3" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom4" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom5" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom6" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom7" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom8" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom9" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom10" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="ManagementGroupName" type="xsd:string" />
</Configuration>
<ModuleImplementation Isolation="Any">
<Native>
<ClassID>2325018e-eef4-41a3-8c17-db831b85c93b</ClassID>
</Native>
</ModuleImplementation>
<OutputType>Health!System.Health.AlertUpdateData</OutputType>
<InputTypes>
<InputType>System!System.BaseData</InputType>
</InputTypes>
</ConditionDetectionModuleType>
<WriteActionModuleType ID="System.Connectors.GenerateAlertFromSdkEvent" Accessibility="Public" Batching="false">
<Configuration>
<IncludeSchemaTypes>
<SchemaType>Health!System.Health.AlertSchema</SchemaType>
</IncludeSchemaTypes>
<xsd:element name="Priority">
<xsd:simpleType>
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="0" />
<xsd:maxInclusive value="2" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="Severity">
<xsd:simpleType>
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="0" />
<xsd:maxInclusive value="2" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="AlertName" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="AlertDescription" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="AlertOwner" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="AlertMessageId" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="AlertParameters" type="System.Health.AlertParameters" minOccurs="0" maxOccurs="1" />
<xsd:element name="Suppression" type="System.Health.Suppression" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom1" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom2" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom3" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom4" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom5" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom6" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom7" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom8" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom9" type="xsd:string" minOccurs="0" maxOccurs="1" />
<xsd:element name="Custom10" type="xsd:string" minOccurs="0" maxOccurs="1" />
</Configuration>
<ModuleImplementation Isolation="Any">
<Composite>
<MemberModules>
<ConditionDetection ID="Mapper" TypeID="System.Connectors.GenericAlertMapper">
<Priority>$Config/Priority$</Priority>
<Severity>$Config/Severity$</Severity>
<ManagedEntityId>$Data/ManagedEntityId$</ManagedEntityId>
<AlertName>$Config/AlertName$</AlertName>
<AlertDescription>$Config/AlertDescription$</AlertDescription>
<AlertOwner>$Config/AlertOwner$</AlertOwner>
<AlertMessageId>$Config/AlertMessageId$</AlertMessageId>
<AlertParameters>$Config/AlertParameters$</AlertParameters>
<Suppression>$Config/Suppression$</Suppression>
<WorkflowId>$MPElement$</WorkflowId>
<Custom1>$Config/Custom1$</Custom1>
<Custom2>$Config/Custom2$</Custom2>
<Custom3>$Config/Custom3$</Custom3>
<Custom4>$Config/Custom4$</Custom4>
<Custom5>$Config/Custom5$</Custom5>
<Custom6>$Config/Custom6$</Custom6>
<Custom7>$Config/Custom7$</Custom7>
<Custom8>$Config/Custom8$</Custom8>
<Custom9>$Config/Custom9$</Custom9>
<Custom10>$Config/Custom10$</Custom10>
<ManagementGroupName>$Target/ManagementGroup/Name$</ManagementGroupName>
</ConditionDetection>
<WriteAction ID="WA" TypeID="System.Connectors.LibraryPublishAlert" />
</MemberModules>
<Composition>
<Node ID="WA">
<Node ID="Mapper" />
</Node>
</Composition>
</Composite>
</ModuleImplementation>
<InputType>System!System.BaseData</InputType>
</WriteActionModuleType>
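Because the composite module fills ManagementGroupName itself from $Target/ManagementGroup/Name$, a workflow that already uses the composite type should not need any new configuration of its own. The following is only a sketch of such a write action, assuming the module types above live in the same management pack as the workflow; the module ID and the alert texts are hypothetical.

```xml
<!-- Hypothetical usage inside a rule; only the composite type name comes from this post -->
<WriteAction ID="GenerateAlert" TypeID="System.Connectors.GenerateAlertFromSdkEvent">
  <!-- Priority and Severity are required and restricted to 0..2 by the schema above -->
  <Priority>1</Priority>
  <Severity>2</Severity>
  <!-- Optional elements; the texts are illustrative only -->
  <AlertName>Sample SDK alert</AlertName>
  <AlertDescription>Alert raised for data inserted through the SDK.</AlertDescription>
  <!-- No ManagementGroupName here: the composite passes it to the mapper internally -->
</WriteAction>
```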
Enjoy alert storm mitigation (customization is possible as described in the original post) after importing the newly customized management pack!
Did you ever wonder what the state of an instance is as known to the runtime (health service) monitoring it? Did you ever suspect that some state changes are unaccounted for? Did you see a discrepancy in Health Explorer?
I believe many of you would answer yes to at least one of these questions.
Right now, there really is no good guidance on how to troubleshoot state change problems, but since the OpsMgr 2007 SP1 release there has been a way to at least display the states of the monitors targeting an instance, as recorded by the runtime during state calculation. This led me to create a tool that returns those states from the runtime. It also provides a visual comparison against the “real” Health Explorer (where states are returned from the Ops DB) and is integrated with the OpsMgr console through a console task. This task targets instances of the “HealthService” managed entity type. The tool uses a Health Explorer-like view of monitors for each active instance monitored by the specific runtime. Following is a snapshot of the tool executed against my Root Management Server. Please observe that I created a view listing all health service instances, as well as a console task associated with this type and accessible through the “Actions” pane.
It may still be a long way before we can recognize all such issues and take corrective actions automatically. That is why this tool provides at least a manual way to synchronize the states of the monitors associated with an instance into the operational DB: right-click anywhere in the tree control and select “Synchronize to DB”. Unfortunately, this corrective action cannot synchronize the state of a dependency rollup monitor. I will try to find a way to achieve this, although the plumbing is not yet present in the current implementation of the runtime!
Honestly, I’m not expecting a big download count. But it is possible that this tool will help someone investigate health state issues, and that is the main reason for this post. The tool works with OpsMgr 2007 SP1 and R2!
Based on the feedback (if any), I may try to extend the feature set in future versions too (bandwidth permitting).
DISCLAIMER:
Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft.
Link to x86 installation package
Link to x64 installation package
System Center Operations Manager 2007 R2 RTM!!
New Functionality - Operations Manager 2007 R2 introduces key new and enhanced functionality, including:
· Enhanced application performance and availability across heterogeneous platforms
o Delivers monitoring across Windows, Linux and UNIX servers–all through a single console.
o Extends end to end monitoring of distributed applications to any workload running on Windows, UNIX and Linux platforms.
o Maximize availability of virtual workloads with integration with System Center Virtual Machine Manager 2008.
· Improved management of applications in the data center
o Delivers on the scale requirements of URL monitoring of your business.
o Meet agreed service levels with enhanced reporting showing application performance and availability.
o More efficient problem identification and action to resolve issues.
· Increased speed of access to information and functionality to drive management
o Faster load times for views and results.
o Improved and simplified management pack authoring experience
Where and when can I obtain the bits?
The RTM release is build 7221.
- At General Availability (July 1)
Where can I find collateral, training, and more on Operations Manager 2007 R2?
· Newly released collateral includes the following:
o Whitepaper: Introduction to Operations Manager 2007 R2
o Datasheet: What’s New in Operations Manager 2007 R2
o Datasheet: Reducing the cost of data center management with Operations Manager 2007 R2
o Datasheet: Monitoring UNIX/Linux with Operations Manager 2007 R2
o Datasheet: Tracking Service Levels with Operations Manager 2007 R2
o Datasheet: Interoperability Connectors for Operations Manager 2007 R2
- Released on the TechNet: Webcast Series on Operations Manager 2007 R2:
How else can I extend Operations Manager 2007 R2?
· Service Level Dashboard v2 from the Solution Accelerators team lets you measure and report application or system performance & availability in near real time across your organization. Using the Dashboard, you can easily spot trends and head off problems—before they occur. The Dashboard also lets you create role-specific dashboards to support different departments, like HR, Finance, or Operations. Download it today from Microsoft Connect.
· Operations Manager 2007 R2 Interoperability Connectors provide the ability to synchronize alerts and status between Operations Manager 2007 R2 and other management systems. Beta connectors for Tivoli Enterprise Console, HP OpenView Operations, and the new Universal Connector can be obtained from the Operations Manager R2 download on Connect. Download the Interop Connectors from the System Center Catalog.
· Operations Manager 2007 R2 Visio Add-in delivers the ability to link status and health information gathered by Operations Manager 2007 R2 into normally-static Visio diagrams, adding life and interaction to those diagrams. Download it today from Microsoft Connect.
· New Exchange Server 2007 Management Pack (MP) Beta, which provides enhancements over the current Exchange MP such as reducing alert noise and enhanced performance. Download it today from Microsoft Connect.
· BridgeWays MP Beta Program, providing beta MPs for MySQL, Apache, and Oracle running on Windows, Linux or Solaris. For more information, and to register into the BridgeWays MP Beta Program, visit http://www.bridgeways.ca/bw_management-pack-beta-program-signup_form.php
Cameron had a nice real-life example of using the new R2 process monitoring feature, but it raised a question about a feature he wanted to use. Following is a report of his issue and how we can help him address his challenge using an already existing feature of OpsMgr 2007.
Scenario: Monitoring a system with a process monitor. Define a recovery to reboot the system if it’s not running the process required. Run this recovery automatically on critical state.
Problem: In OpsMgr prior to R2, when a recovery was created it had an option to “Reset monitor”, which would put the monitor into a healthy state. In R2, this option now says “Recalculate State Monitor”. This presents a challenge, as described below:
Challenge: Recalculating the state may keep the monitor in a critical state until the system has been rebooted successfully and is in fact running the process. If the process does not start correctly after reboot, it gets stuck in the critical state and the recovery will not run again. With a Reset of this monitor to a Healthy state, this would work properly, but without that option available I am not seeing an effective way to make this work.
Workaround: A recovery is no different from other workflows loaded by OpsMgr and is rather similar to a task. It consists of modules that are chained together and should perform some corrective action so that the monitor can fix its state. For that reason, the first module can be a module that resets the state of the monitor.
Following is a module that can be used directly in a recovery. It resets the state of the monitor specified in its configuration.
<WriteActionModuleType ID="Microsoft.SystemCenter.Community.Health.ResetTargetStateAction" Accessibility="Public" Batching="false">
<Configuration>
<xsd:element minOccurs="1" name="MonitorId" type="xsd:string" />
</Configuration>
<OverrideableParameters>
<OverrideableParameter ID="MonitorId" Selector="$Config/MonitorId$" ParameterType="string" />
</OverrideableParameters>
<ModuleImplementation Isolation="Any">
<Composite>
<MemberModules>
<WriteAction ID="Health.ResetStateAction" TypeID="Microsoft.SystemCenter.Community.Health.ResetStateAction">
<ManagementGroupId>$Target/ManagementGroup/Id$</ManagementGroupId>
<ManagedEntityId>$Target/Id$</ManagedEntityId>
<MonitorId>$Config/MonitorId$</MonitorId>
</WriteAction>
</MemberModules>
<Composition>
<Node ID="Health.ResetStateAction" />
</Composition>
</Composite>
</ModuleImplementation>
<OutputType>System!System.BaseData</OutputType>
<InputType>System!System.BaseData</InputType>
</WriteActionModuleType>
Next is another module that can be used as well. It resets the state of the monitor first and then executes a command.
<WriteActionModuleType ID="Microsoft.SystemCenter.Community.Health.ResetTargetStateCommandExecuterAction" Accessibility="Public" Batching="false">
<Configuration>
<IncludeSchemaTypes>
<SchemaType>System!System.CommandExecuterSchema</SchemaType>
</IncludeSchemaTypes>
<xsd:element minOccurs="1" name="ApplicationName" type="xsd:string" />
<xsd:element minOccurs="1" name="WorkingDirectory" type="xsd:string" />
<xsd:element minOccurs="1" name="CommandLine" type="xsd:string" />
<xsd:element minOccurs="1" name="TimeoutSeconds" type="xsd:integer" />
<xsd:element minOccurs="1" name="RequireOutput" type="xsd:boolean" />
<xsd:element minOccurs="1" name="MonitorId" type="xsd:string" />
</Configuration>
<ModuleImplementation Isolation="Any">
<Composite>
<MemberModules>
<WriteAction ID="Command" TypeID="System!System.CommandExecuter">
<ApplicationName>$Config/ApplicationName$</ApplicationName>
<WorkingDirectory>$Config/WorkingDirectory$</WorkingDirectory>
<CommandLine>$Config/CommandLine$</CommandLine>
<TimeoutSeconds>$Config/TimeoutSeconds$</TimeoutSeconds>
<RequireOutput>$Config/RequireOutput$</RequireOutput>
</WriteAction>
<WriteAction ID="Reset" TypeID="Microsoft.SystemCenter.Community.Health.ResetTargetStateAction">
<MonitorId>$Config/MonitorId$</MonitorId>
</WriteAction>
</MemberModules>
<Composition>
<Node ID="Command">
<Node ID="Reset" />
</Node>
</Composition>
</Composite>
</ModuleImplementation>
<OutputType>System!System.BaseData</OutputType>
<InputType>System!System.BaseData</InputType>
</WriteActionModuleType>
A sealed MP with both modules is attached to this post.
Sample: Also attached is an example that uses the modules with a simple event-based monitor. The monitor targets the “Root Management Server” instance, which is why the management pack also defines a view for the state of this entity. When you display “Health Explorer”, you should easily be able to locate the sample monitor.
One of the recoveries present in the attached MP runs automatically on the WARNING state. The MPElement replacement in MonitorId identifies the monitor you want to reset. (It should be the same as the value of the Monitor attribute!) Also, please observe that using just the reset module causes its output to be displayed in the “Context” tab, and that the two state changes will appear to have the “same” time of change in Health Explorer.
<Recovery ID="Microsoft.SystemCenter.Community.Monitors.RecoverySample.StateWarningResetRecovery" Accessibility="Internal" Enabled="onStandardMonitoring" Target="SC!Microsoft.SystemCenter.RootManagementServer" Monitor="Microsoft.SystemCenter.Community.Monitors.RecoverySample.EventBasedMonitor" RecalculateMonitor="false" ExecuteOnState="Warning" Remotable="true" Timeout="300">
<Category>Maintenance</Category>
<WriteAction ID="Reset" TypeID="MicrosoftSystemCenterCommunityMonitorsExtensions!Microsoft.SystemCenter.Community.Health.ResetTargetStateAction">
<MonitorId>$MPElement[Name="Microsoft.SystemCenter.Community.Monitors.RecoverySample.EventBasedMonitor"]$</MonitorId>
</WriteAction>
</Recovery>
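For comparison, a recovery built on the second module (reset first, then run a command) could look roughly like the sketch below. It reuses the target, monitor and reference alias from the recovery above, and the configuration elements follow the order declared by ResetTargetStateCommandExecuterAction. The recovery ID, the module ID and all command values are hypothetical; the command here only echoes a string, and the reboot command from Cameron's scenario would take its place.

```xml
<!-- Hypothetical recovery ID; not necessarily present in the attached MP -->
<Recovery ID="Microsoft.SystemCenter.Community.Monitors.RecoverySample.StateWarningResetAndCommandRecovery" Accessibility="Internal" Enabled="onStandardMonitoring" Target="SC!Microsoft.SystemCenter.RootManagementServer" Monitor="Microsoft.SystemCenter.Community.Monitors.RecoverySample.EventBasedMonitor" RecalculateMonitor="false" ExecuteOnState="Warning" Remotable="true" Timeout="300">
  <Category>Maintenance</Category>
  <WriteAction ID="ResetAndRun" TypeID="MicrosoftSystemCenterCommunityMonitorsExtensions!Microsoft.SystemCenter.Community.Health.ResetTargetStateCommandExecuterAction">
    <!-- Illustrative command values; replace with the real corrective action -->
    <ApplicationName>%windir%\system32\cmd.exe</ApplicationName>
    <WorkingDirectory>%windir%\system32</WorkingDirectory>
    <CommandLine>/c echo recovery ran</CommandLine>
    <TimeoutSeconds>60</TimeoutSeconds>
    <RequireOutput>false</RequireOutput>
    <!-- Must name the same monitor as the Monitor attribute above -->
    <MonitorId>$MPElement[Name="Microsoft.SystemCenter.Community.Monitors.RecoverySample.EventBasedMonitor"]$</MonitorId>
  </WriteAction>
</Recovery>
```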
DISCLAIMER:
Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft.
Attachment(s): Microsoft.SystemCenter.Community.Monitors.MPs.zip
Pleased to announce that a small but important update to the Cluster Management Pack has now been released. This update focuses on improving scalability of the Management Pack.
The following summarizes the changes to the MP:
· Introduced cook-down for workflows that query the state of clustered resources hosted by the node. The benefit comes from more effective WMI use: one query now handles all resources, where previously each resource needed its own query.
· Changed the frequency for some workflows and documented how to override them.
· Added various other documentation improvements.
The subject of this post is advanced authoring that combines the security features of OpsMgr 2007 with workflows, along with an explanation of how to troubleshoot the alerts that may be raised at the end of such a process. Using a simple example, I present a tool I developed to help resolve ambiguous or unclear obstacles that may surface in this scenario.
I’m not going to discuss why; let’s just say I need to create my own Run As profile. This profile is then populated with a custom Run As account that I created as well. These steps need to be done manually.
· Open OpsMgr console
· Navigate to “Administration”, then “Run As Configuration”
· Create a “Windows Credentials” account (do not distribute it to any computer)

· Create a new profile and associate it with the previously created account.

Note that this post does not aim to explain the internals of the association between profile and account, nor the details of account distribution; there are (or will be) official guides available for exactly that purpose.
Let’s also assume a simple rule that generates an alert when event 123 is raised in the Application log by EventCreate. When the created profile is used with this rule while the Run As account has not been distributed to the computer where the target instance is monitored, event 1108 is raised during configuration load, and the workflow using this profile is not loaded until the issue is corrected.
· Open OpsMgr authoring console
· Create an NT event-based rule and use this profile with the event data source module.
Because we are using an unsealed MP, this rule must be created in the same file as the previously created profile.
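In management pack XML, the result of these steps could look roughly like the fragments below. This is only a sketch under several assumptions: the profile ID and module ID are hypothetical, the Windows alias is assumed to reference the Windows library MP, and the rule is assumed to target a Windows computer (which is where the ComputerName expression comes from). The filter matches event 123 published by EventCreate in the Application log, as in the example above.

```xml
<!-- Under TypeDefinitions: the Run As profile (hypothetical ID) -->
<SecureReferences>
  <SecureReference ID="Sample.RunAs.Profile" Accessibility="Internal" Context="System!System.Entity" />
</SecureReferences>

<!-- Inside the rule: event data source running under the profile defined above -->
<DataSource ID="EventDS" RunAs="Sample.RunAs.Profile" TypeID="Windows!Microsoft.Windows.EventProvider">
  <ComputerName>$Target/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
  <LogName>Application</LogName>
  <Expression>
    <And>
      <Expression>
        <SimpleExpression>
          <ValueExpression>
            <XPathQuery Type="UnsignedInteger">EventDisplayNumber</XPathQuery>
          </ValueExpression>
          <Operator>Equal</Operator>
          <ValueExpression>
            <Value Type="UnsignedInteger">123</Value>
          </ValueExpression>
        </SimpleExpression>
      </Expression>
      <Expression>
        <SimpleExpression>
          <ValueExpression>
            <XPathQuery Type="String">PublisherName</XPathQuery>
          </ValueExpression>
          <Operator>Equal</Operator>
          <ValueExpression>
            <Value Type="String">EventCreate</Value>
          </ValueExpression>
        </SimpleExpression>
      </Expression>
    </And>
  </Expression>
</DataSource>
```

The RunAs attribute refers to the profile by its plain ID, without an alias, which is consistent with keeping the rule and the profile in the same unsealed file.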
Event 1108 is picked up by the OpsMgr MP, and an alert is raised to notify that distribution was not set when the Run As account was associated with the Run As profile.
Dialogs and wizards were redesigned in this milestone to notify about the need to distribute during creation!
Unfortunately, this new alert may in some cases contain somewhat cryptic information, which increases the cost of investigating it. If the alert is closed without investigating the root cause, it will appear again either 24 hours after its original creation or when the health service is restarted.
To simplify investigation of the affected Run As profile (which would otherwise require querying the DB), I created an SDK tool and associated it with the product as a “console task”. Upon execution, the tool retrieves all alerts related to Run As profiles and provides user-friendly information about the affected Run As profile (as long as it is present in the DB).
Another alert that this tool can help investigate is based on event 1107 and can be simulated by importing the attached MP.
DISCLAIMER:
Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.
x86 installation package
x64 installation package
Attachment(s): Microsoft.SystemCenter.Runtime.RunAs.xml
Multiple network names scenario
The customer has a cluster resource group with multiple network name resources associated with an IP address resource. This scenario was described by one of the customers:
Configuring DTC in the cluster resource group that also contains the default instance of SQL is our standard practice. This is based on documentation written by Mike Grasso (Microsoft). This practice was never questioned during the Best Practice Review done by Microsoft either, and that is why it has remained the same over the years.
Monitoring a cluster-aware application requires discovering an instance of “Virtual Server” and targeting the discovery of the application at this instance. In OpsMgr 2007, a Virtual Server is defined as a cluster resource group with network name and IP address properties.
Previous releases discovered only a single instance of Virtual Server and assigned the first network name resource as this instance’s property. In the case mentioned above, that network name was frequently the one belonging to DTC and not SQL. That is why this approach was error prone when multiple network names were assigned to a group and often caused “loss of monitoring”. A workaround that consists of changing the order of the network name properties is not really feasible and is extremely costly to attempt in an enterprise environment.
Solution at a glance
An override has been introduced that allows each IP address -> Network name dependency to be discovered as an instance of Virtual Server. This override must be turned on explicitly. The decision was made not to change the default discovery behavior because, based on CSS cases to date, only specific installations of SQL Server required this feature.
How to
1. Open OpsMgr2007 R2 console.
2. Navigate to Authoring, then Object Discoveries
3. Change the scope to Virtual Server (to simplify location of discovery)
4. Select Windows Clustering Discovery and then select “Override the discovery”
5. Pick Multiple Servers Discovery and change override value to true
Upon dialog closure and configuration reload, multiple Virtual Servers should be discovered where feasible.
Another override worth mentioning!
Imagine that an undesired instance of Virtual Server is discovered after the override has been applied. Can I remove it? Definitely …
Use the Excluded Servers override (also visible in the picture above)! That override has always been present in all up-to-date releases of OpsMgr 2007 (although I’m not aware of anyone using it). When you select it, its description should make clear what “value” to use with the override:
String which contains semicolon delimited fully qualified names of virtual servers to be excluded from discovery.
Applying this to the customer scenario above: when the instance of Virtual Server representing DTC is not required, simply place its FQDN into the Excluded Servers override value and wait for the configuration reload. The instance will then be deleted and will disappear from views.
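For those who prefer to see the override as XML, a discovery configuration override carrying such a semicolon-delimited value might look like the sketch below. Every identifier in it is a placeholder: the override ID, the Cluster alias, the context class, the discovery ID and the parameter name are assumptions for illustration and are not taken from the cluster management pack. Exporting the unsealed MP that holds an override created in the console should show the real names. The FQDNs are made up as well.

```xml
<!-- Placeholder IDs throughout; look up the real discovery, class and parameter names -->
<DiscoveryConfigurationOverride ID="Sample.Override.ExcludeDtcVirtualServer"
                                Context="Cluster!Hypothetical.Cluster.Class"
                                Enforced="false"
                                Discovery="Cluster!Hypothetical.WindowsClustering.Discovery"
                                Parameter="ExcludedServers">
  <!-- Semicolon-delimited fully qualified names of the virtual servers to exclude -->
  <Value>dtc01.contoso.com;dtc02.contoso.com</Value>
</DiscoveryConfigurationOverride>
```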
A hotfix providing the same functionality on OpsMgr 2007 SP1 is in the works too, for those who will not have OpsMgr 2007 R2 deployed in their environments.
What is new?
The OpsMgr 2007 R2 Release Candidate has finally been released and can be downloaded from Connect. What is new in this release? PLENTY! Some of you got a glimpse of those features while evaluating the Beta; some will see most of the improvements for the first time … very exciting!
That is why I would like to start a small series in which I comment on some of the changes or additions. With this post, I would like to mention a design change that suspends alert creation in order to prevent an alert storm. Yes, we did bring the MOM2005 feature back (at least for rules)!
Alert storm mitigation at a glance:
I need to clarify that we are not trying to solve the generic data storm problem; that is a vNext scenario. We only addressed the case of a possible “rogue” alert-generating rule flooding the operational DB and/or raising too many notifications.
The settings used to recognize such a problem are per agent (across all targeted instances) and per individual management group (in a multi-homed scenario there are settings for multiple groups in the registry). The default throttle settings are 50/60/10. This means that if one rule generates more than 50 alerts within 60 seconds, that rule is suspended for 10 minutes (alert generation is disabled).
The option to customize the threshold values still exists … Customization will not work in one very special deployment scenario: an OpsMgr 2007 R2 agent multi-homed to at least one management group monitored by an OpsMgr 2007 SP1 server (the reason is that such an agent is forced to use SP1 management packs, and those obviously lack the new configuration required for threshold customization). In order for the runtime to recognize customized values, the health service must be restarted!
When the runtime recognizes that a possible storm is happening, event 5399 is raised. Following is the English text of this event:
;// Suspend alert generating rule
;// %1 = management group name
;// %2 = workflow name
;// %3 = name of targeted instance
;// %4 = instance id
;// %5 = alert origin (name or message id)
;// %6 = count
;// %7 = time
;// %8 = disabled time
MessageId=5399
SymbolicName=MSG_HS_HM_ALERT_SUSPENDED
Severity=Warning
Language=English
A rule has generated %6 alerts in the last %7 seconds. Usually, when a rule generates this many alerts, it is because the rule definition is misconfigured. Please examine the rule for errors. In order to avoid excessive load, this rule will be temporarily suspended until %8.
%nRule: %2
%nInstance: %3
%nInstance ID: %4
%nManagement Group: %1.
OpsMgr 2007 R2 health monitoring will recognize this event and raise an alert to notify the operator about the problem. The alert needs to be closed manually when corrective action has been taken or when the conditions causing the possible storm have been mitigated.
Following is an example of customized threshold values. It shows the customization 15/30/5 (15 alerts within 30 seconds will cause suspension for 5 minutes, that is, 300 seconds). It also shows where in the registry such customization should be done. One must create “Alert Count”, “Alert Count Interval” and “Alert Suspend Interval” under “HKLM\System\CurrentControlSet\Services\HealthService\Parameters\Management Groups\<name of MG>”.
I hope you enjoy this product as much as we hope you will. I always feel happy; this time I also feel rather confident about its quality and value! If you have questions, comments or feedback (anything), please let me know. I will try to continue this series often (so if there is anything in particular you want covered, scream and I will move it higher on my TODO list!).
We are very excited to announce that the System Center Operations Manager 2007 R2 Release Candidate is now available on Connect!
Operations Manager 2007 R2 Release Candidate
Operations Manager 2007 R2 introduces key new and enhanced functionality, including:
Enhanced application performance and availability across heterogeneous platforms
· Delivers monitoring across Windows, Linux and Unix servers–all through a single console
· Extends end to end monitoring of distributed applications to any workload running on Windows, Unix and Linux platforms
· Maximize availability of virtual workloads with integration with System Center Virtual Machine Manager 2008
Improved management of applications in the data center
· Delivers on the scale requirements of URL monitoring of your business
· Meet agreed service levels with enhanced reporting showing application performance and availability
· More efficient problem identification and action to resolve issues
Increased speed of access to information and functionality to drive management
· Faster load times for views and results
· Improved and simplified management pack authoring experience
For those who are evaluating the Beta release, this Release Candidate offers a number of enhancements over the Operations Manager R2 Beta, including:
· New Power Management MP template (Monitored system must be Windows Server 2008 R2 or Win7)
· Updated branding across all User Interfaces
· Improved trace configuration tools on the CD to help support issues escalated to Customer Support (where applicable)
· Improved Run As Account Distribution Configuration
· Ability to run inline tasks for non-Microsoft servers
· Support for upgrade from Beta deployments to the Release Candidate
· New and updated documentation, including the Usage Guide, Design Guide, Deployment Guide, Upgrade Guide, Security Guide and Operations Guide
In addition to the build, we are providing the Release Notes as well as other key documentation including:
· Operations Manager 2007 Supported Configurations
· Operations Manager 2007 R2 RC Design Guide
· Operations Manager 2007 R2 RC Deployment Guide
· Reporting Deployment and Usage Troubleshooting
· Operations Manager 2007 R2 RC Upgrade Guide
· Operations Manager 2007 R2 RC Security Guide
· Operations Manager 2007 R2 RC Operations Guide
· Operations Manager 2007 R2 RC Usage Guide
· Operations Manager Management Pack Guide for Operations Manager 2007 R2
Operations Manager 2007 R2 Release Candidate Release Notes
Operations Manager 2007 R2 Release Candidate Documentation
We would also like to provide you access to the Service Level Dashboard 2.0 Beta for Operations Manager 2007 R2, available here:
Service Level Dashboard v2 Beta for Operations Manager 2007 R2
Lastly, please post your RC feedback/bugs/suggestions here:
https://connect.microsoft.com/feedback/CreateFeedbackForm.aspx?FeedbackFormConfigurationID=1872&FeedbackType=1&SiteID=446
I am writing this in response to the release of the hotfix. I would like to bring this fix to your attention for your consideration. It should increase the reliability of health state monitoring in numerous cases where a dependency monitor is used.
An issue was discovered where a dependency monitor may indicate the wrong state due to a race condition during monitor registration. This could surface when the contributing instances are not available or are in maintenance mode during registration, when the target instance is leaving maintenance mode, and sometimes during distributed application creation.
The main symptoms may include unexpected alerts and an incorrect state with respect to the rollup algorithm and the state of the contributing monitors. (In many cases the state is not reflected at all and shows “Not Monitored”, especially for distributed applications.)
The DA issue was tried and evaluated by a customer, and the fix addressed their problem (this should not be taken as advice to deploy into production immediately; you are encouraged to perform your own evaluation in your pre-production environment). If this hotfix does not help in your case, please report it through the Connect site so I have a chance to investigate your scenario.
The hotfix should be deployed to every computer experiencing issues with dependency monitors. In the majority of cases, the monitor resides on the RMS only.
IMPORTANT NOTE: Application of this hotfix will reset the Health Service configuration state on each computer where it is installed. It is therefore important to review unhealthy state within the Operations Manager console and resolve where possible symptoms causing unhealthy state prior to hotfix installation. Failure to do so may cause event based monitors to be reset to Healthy state and related Alerts automatically resolved, which may lead to loss of visibility into issues impacting the monitored environment.
In my opinion, Boris did a kick a** job driving this puppy out, especially knowing that a Windows hotfix is required for its functionality. This one should help with monitoring the infrastructure of a clustered RMS with future releases of OpsMgr 2007 R2 as well (not the beta, though)! You can download it at:
http://www.microsoft.com/downloads/details.aspx?FamilyId=AC7F42F5-33E9-453D-A923-171C8E1E8E55&displaylang=en&displaylang=en
This release contains some fixes for library and discovery so even Win2k3 cluster users should consider downloading. Thanks for being patient with us!
This is not something I recommend; one could say I almost regret that we did not prohibit closing alerts generated by monitors (especially when the auto-resolve feature is used). But recently I learned about some ticketing systems closing alerts where it is unclear whether the issue was corrected, so I see some necessity to automate resetting the monitor health state, so that the monitor state change is re-generated when the issue is still present after the ticket was closed.
Problem Description:
Again, there may be legitimate situations where a customer needs to reset monitor health once an alert generated by its state change has been resolved. Such scenarios include automated ticketing systems resolving alerts without enough evidence that the issue was indeed addressed, and operators resolving a batch of alerts without investigating their root cause (after a network outage) and/or by mistake.
Recently I saw this type of request from multiple independent sources, so I decided to provide what I believe may be the only way to achieve this functionality: an OpsMgr connector.
Analyzing proposal:
An OpsMgr connector is a nice feature that allows subscribing to alert changes for members of a specified group. It also allows reacting to such a change, in our case by locating the monitor associated with the alert and requesting its state reset through an SDK call.
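The decision the connector has to make for each alert change can be sketched roughly as follows. This is not the attached implementation and does not call the OpsMgr SDK: the record type, its field names and the assumption that resolution state 255 means "closed" are illustrative only.

```python
from dataclasses import dataclass

CLOSED = 255  # assumed resolution state value for a closed alert


@dataclass(frozen=True)
class AlertChange:
    """Hypothetical record describing one alert returned by the subscription."""
    alert_id: str
    is_monitor_alert: bool   # False when the alert was raised by a rule
    resolution_state: int
    monitor_id: str
    instance_id: str


def monitors_to_reset(changes):
    """Return unique (instance_id, monitor_id) pairs that need a state reset."""
    requests = []
    seen = set()
    for change in changes:
        # Only closed alerts that were raised by a monitor are of interest.
        if not change.is_monitor_alert or change.resolution_state != CLOSED:
            continue
        key = (change.instance_id, change.monitor_id)
        if key not in seen:
            seen.add(key)
            requests.append(key)
    return requests


if __name__ == "__main__":
    batch = [
        AlertChange("a1", True, 255, "mon-1", "srv-1"),
        AlertChange("a2", False, 255, "", "srv-1"),
        AlertChange("a3", True, 0, "mon-2", "srv-2"),
    ]
    print(monitors_to_reset(batch))
```

In a real connector each returned pair would be turned into a reset request through the SDK; that part is deliberately left out.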
Note:
I will not discuss connector internals (registration, used subscription …) but will provide source code for possible reverse engineering of my implementation.
Solution:
Attached you can find the source code for my solution as well as the binary, which you should copy into your RMS product folder. You need to initialize the connector when you start it. Initialization imports the MP with the group definition (if the MP was not imported already), creates the connector and its subscription (again, if necessary) and starts a worker thread to receive monitor-raised alerts.

You should see the connector created (in the Administration section of the Operations console) after successful initialization.

Currently the connector uses a group that is populated with computer instances. This can be adjusted (steps are described later), and you should be able to see all members after the group calculation rule finishes (in the Authoring section of the Operations console).

Below is an example of an alert raised by a test event-based monitor. When this alert is resolved, the state of the monitor resets.


Customization:
As mentioned, this connector responds to all alerts generated by any monitor that belongs to an instance of Windows Computer. It is rather simple to customize the managed entity type you want to use, though.
First, export the connector management pack:
Then edit the management pack in an XML editor of your choice. You need to change the type used with the relationship as well as the group population rule:

After the changes are saved and you have imported your management pack, please restart the connector application (the initialization button will not overwrite changes to the MP, but the remove button will delete the MP from OpsMgr when removing the connector from your environment). You can see in the source code that the worker thread starts after 3 minutes (to give group calculation time to populate the group) and that the subscription uses a 1-minute polling interval to retrieve all alerts as per the subscription definition.
One more word of caution: a connector like this may not scale well in big environments, and additional work could be needed. This post can still serve as a nice example and a foundation for a more advanced application.
DISCLAIMER:
Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.
Attachment(s): http://msutara.members.winisp.net/Blog/Tools/ResetMonitorConnector/ResetMonitorConnector.zip
My initial and last posts described the restart monitoring tool, which represents my idea of how one can approach a “clean up” of the monitoring (the scenario is described in the first post).
Since then I have received additional, valid feedback, and mainly thanks to the Microsoft PFE (Premier Field Engineering) team I can now offer a somewhat improved (mainly speed-wise) version of this tool. Besides speed, the feedback I received pointed out that for the vast majority of instances the recalculate monitoring task causes no change and has no visible effect. (For those interested, this is due to the fact that most monitor types do not define “On-Demand” detection, which is a real pity, as many would, in my opinion, benefit from the ability to get the state of the monitor “now” rather than wait for the next regular monitor detection to make a state change.) This feedback led to the addition of a “pure” RESET task (for any instance of the group).
Here is the list of changes:
1. Fixed deployment and support upgrade for both tools (SDK application and Web application).
2. Addition of “Reset Monitoring” task, for plain monitor reset which is not followed by request to recalculate instance state.
3. Changed task invocation from PartialMonitoringObject to MonitoringState. The reason is that a timeout can be specified and one does not need to wait for task completion: the runtime (currently) does not return task status to the SDK, which means that once the task is spawned, it finishes its action asynchronously (unless low memory or other system-type errors occur).
4. The task now outputs the DisplayName of each instance directly affected by the action, meaning those that were reset (or eventually recalculated). Indirect results should be visible through dependencies: one can locate all instances to which an affected instance contributes its state and observe whether a state change for those was necessary (doable through Health Explorer)!
Additional info – command line options:
Location:
%ProgramFiles%\System Center Operations Manager 2007 Restart Monitoring Tool
Usage:
Microsoft.SystemCenter.Community.RestartMonitoring.App [/s][/o][/r] /instances id [, id ...]
Options:
/s “Pure” command-line tool. If this option is missing, the WinForm version of the tool is executed.
/o Include information about the affected instance(s) in the output.
/r Request instance recalculation after its state reset has been performed.
Reset:
Microsoft.SystemCenter.Community.RestartMonitoring.App /s /o /instances <guid – get from PowerShell (all groups (like root management group) are visible thru initial DIR)>
Restart:
Microsoft.SystemCenter.Community.RestartMonitoring.App /s /o /r /instances <guid>
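For scripting, the two invocations above can be assembled with a small helper like the sketch below. It only builds the command line as text, following the usage line literally; joining several ids with ", " is my reading of that usage line rather than something verified against the tool, and the helper name is made up.

```python
import uuid

APP = "Microsoft.SystemCenter.Community.RestartMonitoring.App"


def build_command(instance_ids, recalculate=False, include_output=True):
    """Build the command-line text for a reset (default) or restart (recalculate=True)."""
    if not instance_ids:
        raise ValueError("at least one instance id is required")
    # uuid.UUID raises ValueError for anything that is not a well-formed GUID.
    ids = [str(uuid.UUID(str(instance_id))) for instance_id in instance_ids]
    parts = [APP, "/s"]            # /s: run without the WinForm UI
    if include_output:
        parts.append("/o")         # /o: list affected instances in the output
    if recalculate:
        parts.append("/r")         # /r: recalculate after the reset
    parts.append("/instances")
    parts.append(", ".join(ids))   # assumed separator, taken from the usage line
    return " ".join(parts)


if __name__ == "__main__":
    group = "11111111-2222-3333-4444-555555555555"  # placeholder GUID
    print(build_command([group]))                    # reset
    print(build_command([group], recalculate=True))  # restart
```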
Attached, you can find NEW VERSION of this tool. Attachment for my old posts will update automatically thru my ISP site.
DISCLAIMER:
Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.
Attachment(s): http://msutara.members.winisp.net/Blog/Tools/RestartMonitoring/RestartMonitoringTool.zip
My last post tried to introduce a tool which restarts monitoring of an OpsMgr environment. Cameron and others are actively looking at its use, and here is an update based on some feedback.
1. The following was a state view for a computer group, which I achieved by selecting “Discovered Inventory” followed by “Change Target Type …” from the “State Action” pane. I then selected “View all targets” and picked “Computer Group”.

To avoid this manual intervention, my next version adds a plain state view for the target of my “Restart Monitoring” task.
2. The “Restart Monitoring” task was using a Timeout attribute set to 5 minutes. This is not long enough for the task to complete, so it often failed with a “Timeout Expired” error. This is the timeout for task execution, which one cannot set through the authoring UI. I changed it manually and bumped it to 1 hour. Please customize it if that is still not long enough (remember that the complete containment relationship tree is crawled prior to restarting environment monitoring, although the tool would be flawed if it took that long; remember: feedback is ALWAYS appreciated).
<Task ID="Microsoft.SystemCenter.Community.RestartMonitoring.Task" Accessibility="Internal" Enabled="true" Target="SCLibrary!Microsoft.SystemCenter.ComputerGroup" Timeout="3600" Remotable="true">
  <Category>Maintenance</Category>
  <WriteAction ID="PA" TypeID="System!System.CommandExecuter" Target="SCLibrary!Microsoft.SystemCenter.RootManagementServer">
    <ApplicationName><![CDATA[%ProgramFiles%\System Center Operations Manager 2007 Restart Monitoring Tool\Microsoft.SystemCenter.Community.RestartMonitoring.App.exe]]></ApplicationName>
    <WorkingDirectory>.</WorkingDirectory>
    <CommandLine>/s $Target/Id$</CommandLine>
    <TimeoutSeconds>3600</TimeoutSeconds>
    <RequireOutput>true</RequireOutput>
  </WriteAction>
</Task>
3. I fixed x64 deployment. My old packages were installing into “Program Files(x86)”, which prevented the task targeted at “Computer Group” from succeeding. If you have the old version installed, you can simply move “System Center Operations Manager 2007 Restart Monitoring Tool” into “Program Files” (removing the installation through Windows Installer will break with this approach, though).
4. It appears that deployment of the WebApp needs to be executed as a user who is allowed to create a Web site as well as an AppPool.
Attached, you can find NEW VERSION of this tool. Attachment for my old post will update automatically thru my ISP site.
DISCLAIMER:
Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.
Attachment(s): http://msutara.members.winisp.net/Blog/Tools/RestartMonitoring/RestartMonitoringTool.zip
My friend Cameron and I discussed the following issue as one of the challenges he occasionally faces with his customers. To minimize his TCO and manual interventions, I promised to help even though this design request is not making it into the feature set for our next release.
Problem Description:
There may be legitimate situations where a customer needs to reset many health monitors at once. For example, after a network outage a significant number of alerts may have been generated and the health state of various items may have become unhealthy. An incorrect approach to Maintenance Mode may cause a similar outcome, especially when manual-reset monitors or alerts generated without the “auto-resolve” feature are present on the instances involved in maintenance.
To address this type of situation, the bulk of the alerts from the outage needs to be closed (which can be done with a PowerShell script). The health state of multiple systems also needs to be reset, but there is no viable way to do that in bulk, so manual intervention is needed.
His proposal was that it should be possible to select multiple servers and force their health back to green. Specifically, the health model for those instances would be walked and each monitor not Healthy is reset. This would “restart” the environment to green so that only real issues would resurface as alerts recurred and the states would be updated.
Analyzing proposal:
It is already possible to use SDK tasks to accomplish this proposal. It is even possible to “speed up” the recognition of real issues by submitting an additional “recalculate” state task for a given instance. This task forces recalculation of what the state of the instance should be at the time of execution by using on-demand detection (assuming such detection is defined for the monitor types used to monitor that instance).
My approach to implementing this proposal was a little different from the one stated above. I do not look for every unhealthy monitor; instead I crawl the relationship tree for the selected instance, recursively adding each instance that contributes to the overall health. Each instance is included just once, and the reset request against each of those instances affects the health state of all other instances that depend on its state, either directly or indirectly.
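The crawl described above boils down to a graph traversal with a "seen" set. Here is a minimal sketch that uses an explicit stack instead of recursion and a plain dictionary in place of the SDK's relationship queries; the function name, the dictionary shape and the instance names are all invented for the example.

```python
def collect_contributing_instances(root, contributors):
    """Return root plus every instance contributing to its health, each exactly once.

    contributors maps an instance id to the ids that contribute to its health directly.
    """
    seen = set()
    ordered = []
    stack = [root]
    while stack:
        instance = stack.pop()
        if instance in seen:
            continue  # already collected; also protects against cycles
        seen.add(instance)
        ordered.append(instance)
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(contributors.get(instance, [])))
    return ordered


if __name__ == "__main__":
    relationships = {
        "group": ["srvA", "srvB"],
        "srvA": ["disk", "db"],
        "srvB": ["db"],   # "db" contributes to both servers
    }
    for instance in collect_contributing_instances("group", relationships):
        print("reset", instance)
```

Every collected instance would then receive a reset request, optionally followed by a recalculate request.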
Note:
The following post contains a video that tries to describe the difference between the Reset and Recalculate tasks. It also touches on what “on-demand” detection means, etc. Please contact me through the comments if I should try to provide an additional or different explanation of those monitor features.
Solution:
Attached you can find the source code for my solution as well as installers for deploying the already built binaries. I provide two types of integration with our Operations console.
The first is a task associated with the managed entity “Microsoft.SystemCenter.ComputerGroup”. It becomes available once installation of “RestartMonitoringSetup” for the particular SKU succeeds. The following screenshot shows how the task is used:

The second possible integration uses the fact that the console is able to act like a browser. Deployment is performed by RestartMonitoringWebSetup and consists of creating a Web application and importing an MP. The Web application allows a regular web browser to act as the tool that triggers the requested restart action. The MP associated with this approach contains the following Web view to allow integration with the console:

Choosing the group option allows a “restart” of monitoring for all instances contained within all selected groups. Such an operation may become rather resource consuming because, as I hinted above, the instance space is crawled and all necessary instances (contributing directly or indirectly) are asked to reset and then recalculate their state.


The option to restart monitoring for instances with an active alert performs an operation similar to the one for groups; the only difference is that the likelihood of having many instances contributing to the overall health state is smaller than it is with a group (or multiple groups, for that matter).
Note:
An additional warning: the tool is not smart enough to recognize whether an alert was raised by a monitor. Restarting may therefore have no effect when the monitors were in fact healthy and the alert was generated by a rule. This may change in future versions.

DISCLAIMER:
Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.
Attachment(s): http://msutara.members.winisp.net/Blog/Tools/RestartMonitoring/RestartMonitoringTool.zip