My friend Cameron and I discussed the following issue, one of the challenges he occasionally faces with his customers. To help minimize his TCO and manual interventions, I promised to help, even though this design request is not making it into the feature set for our next release.
There may be legitimate situations where a customer needs to reset many health monitors at once. For example, after a network outage, a significant number of alerts may have been generated and the health state of various items may have become unhealthy. Another case is an incorrect approach to Maintenance Mode, which may cause a similar outcome, especially when manual-reset monitors or alerts generated without the “auto-resolve” feature are present in the instances involved in maintenance.
To address this type of situation, the bulk of the alerts from the outage need to be closed (which can be done with a PowerShell script). The health state of multiple systems also needs to be reset, but there is no viable way to do that in bulk, so manual intervention is needed.
His proposal was that it should be possible to select multiple servers and force their health back to green. Specifically, the health model for those instances would be walked and each monitor that is not Healthy would be reset. This would “restart” the environment to green, so that only real issues would resurface as alerts recurred and states were updated.
It is already possible to use SDK tasks to accomplish this proposal. It is even possible to speed up the recognition of real issues by submitting an additional “recalculate” state task for a given instance. This task forces a recalculation of what the state of the instance should be at the time of execution by using on-demand detection, assuming that such detection is defined for the monitor types used to monitor that instance.
My approach to implementing this proposal was a little different from the one stated above. I do not find every unhealthy monitor; instead, I crawl the relationship tree of the selected instance recursively, adding each instance that contributes to its overall health. Each instance is included only once, and the result of a reset request against any of those instances affects the health state of all other instances that depend on its state, either directly or indirectly.
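The crawl described above can be illustrated with a minimal sketch. It is written in Python for readability and is not the attached solution, which is built on the SDK. The `CONTRIBUTORS` dictionary and the `collect_contributors` function are hypothetical stand-ins for the relationship tree and the crawler; the only points being shown are the recursion and the "present just once" rule.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the health model: each instance maps to the
# instances whose state rolls up into it. Names are illustrative only.
CONTRIBUTORS: Dict[str, List[str]] = {
    "ServerA": ["ServerA/OS", "ServerA/SQL"],
    "ServerA/OS": ["ServerA/Disk C:"],
    "ServerA/SQL": ["ServerA/Disk C:"],  # shared contributor, must appear once
    "ServerA/Disk C:": [],
}


def collect_contributors(root: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return root plus every instance contributing to its health, each once."""
    seen: Set[str] = set()
    ordered: List[str] = []

    def visit(instance: str) -> None:
        # Skip instances already collected; this also guards against cycles.
        if instance in seen:
            return
        seen.add(instance)
        ordered.append(instance)
        for child in graph.get(instance, []):
            visit(child)

    visit(root)
    return ordered


instances = collect_contributors("ServerA", CONTRIBUTORS)
# instances == ["ServerA", "ServerA/OS", "ServerA/Disk C:", "ServerA/SQL"]
```

The shared disk instance is reachable through two parents but is collected only once, so it would receive a single reset request.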
Note: The following post contains a video that tries to describe the difference between the Reset and Recalculate tasks. It also touches on what “on-demand” detection means. Please contact me through the comments if I should try to provide an additional or different explanation of those monitor features.
Attached, you can find the source code for my solution as well as installers for deploying the already built binaries. I provide two types of integration with our operations console.
The first is a task associated with the managed entity “Microsoft.SystemCenter.ComputerGroup”. It becomes available once the installation of “RestartMonitoringSetup” for the particular SKU succeeds. The following screenshot shows the self-descriptive use of the task:
The second possible integration uses the fact that the console is able to act like a browser. Deployment is performed by RestartMonitoringWebSetup and consists of creating a web application and importing an MP. The web application allows a regular web browser to act as the tool that triggers the requested restart action. The MP associated with this approach contains the following web view to allow integration with the console:
Choosing the group option allows a “restart” of monitoring for all instances contained within all selected groups. Such an operation may be rather costly: as I hinted above, the instance space is crawled and all necessary instances (contributing directly or indirectly) are asked to reset and then recalculate their state.
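To show why the group option can get expensive, here is a rough sketch of the flow in Python. It is an illustration under assumptions, not the shipped tool: `restart_monitoring`, `submit_task`, and the sample `GROUP_GRAPH` are hypothetical names, and running all resets before all recalculations is one possible ordering that the description above does not confirm. Each distinct instance costs two task submissions, so the work grows with the size of the crawled instance space.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical containment data: two groups that share one member.
GROUP_GRAPH: Dict[str, List[str]] = {
    "GroupA": ["Srv1", "Srv2"],
    "GroupB": ["Srv2"],
    "Srv1": ["Srv1/OS"],
    "Srv2": [],
}


def restart_monitoring(
    groups: Iterable[str],
    graph: Dict[str, List[str]],
    submit_task: Callable[[str, str], None],
) -> int:
    """Reset, then recalculate, every instance reachable from the groups.

    Returns the number of distinct instances touched.
    """
    pending = list(groups)
    seen: Set[str] = set()
    ordered: List[str] = []
    # Iterative crawl; an instance shared by several groups is kept once.
    while pending:
        instance = pending.pop()
        if instance in seen:
            continue
        seen.add(instance)
        ordered.append(instance)
        pending.extend(graph.get(instance, []))

    # Assumed ordering: force everything to healthy first, then ask each
    # instance to recalculate so that real problems resurface.
    for instance in ordered:
        submit_task("reset", instance)
    for instance in ordered:
        submit_task("recalculate", instance)
    return len(ordered)


submitted: List[tuple] = []
touched = restart_monitoring(
    ["GroupA", "GroupB"], GROUP_GRAPH, lambda task, inst: submitted.append((task, inst))
)
```

In this sample, five distinct instances are touched and ten tasks are submitted, even though `Srv2` belongs to both groups.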
The option to restart monitoring for instances with an active alert performs a similar operation to the one for groups. The only difference is that the likelihood of having many instances contributing to the overall health state is smaller than it is with a group (or multiple groups, for that matter).
Note: An additional warning is that the tool is not smart enough to recognize whether an alert was raised by a monitor. Restarting may therefore have no effect when the monitors were in fact healthy and the alert was generated by a rule. This may change in future versions.
Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties, and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created as time and requests allow.