Tool: OpsMgr 2007 - RuntimeHealthExplorer

Did you ever wonder what the state of an instance is, as known to the runtime (health service) monitoring it? Did you suspect that some state changes are unaccounted for? Did you see a discrepancy in Health Explorer?

I believe many of you may answer yes to one of these questions.

Right now, there really is no good guidance on how to troubleshoot state change problems. Since the OpsMgr 2007 SP1 release, however, there has been a way to at least display the states of the monitors targeting an instance, as recorded by the runtime during state calculation. This led me to create a tool that returns those states from the runtime. It also provides a visual comparison against the “real” Health Explorer (whose states are returned from the Ops DB) and integrates with the OpsMgr console through a console task. This task targets instances of the “HealthService” managed entity type. The tool uses a Health Explorer-like view of monitors for each active instance monitored by the specific runtime. The following is a snapshot of the tool executed against my Root Management Server. Please observe that I created a view listing all health service instances, as well as a console task associated with this type and accessible through the “Actions” pane.

Runtime Health Explorer 

There may still be a long way to go before we recognize all the issues and take corrective actions automatically. That is why this tool provides at least a manual way to synchronize the states of the monitors associated with an instance into the operational DB: right-click anywhere in the tree control and select “Synchronize to DB”. Unfortunately, this corrective action is unable to synchronize the state of a dependency rollup monitor. I will try to find a way to achieve this, although the plumbing is not yet present in the current implementation of the runtime!

Honestly, I’m not expecting a big download count. But it is possible this tool will help someone investigate health state issues, and that is the main reason for this post anyway. The tool works with OpsMgr 2007 SP1 and R2!

Based on the feedback (if any), I may try to extend the feature set in future versions too (bandwidth permitting).

DISCLAIMER:

Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft.

 

Link to x86 installation package

Link to x64 installation package

Posted by MSutara | 2 Comments

News: OpsMgr 2007 R2 RTM ...

 

System Center Operations Manager 2007 R2  RTM!!

 

 

New Functionality - Operations Manager 2007 R2 introduces key new and enhanced functionality, including:

 

·         Enhanced application performance and availability across heterogeneous platforms

·         Delivers monitoring across Windows, Linux and UNIX servers–all through a single console.

·         Extends end to end monitoring of distributed applications to any workload running on Windows, UNIX and Linux platforms.

·         Maximize availability of virtual workloads with integration with System Center Virtual Machine Manager 2008.

 

·         Improved management of applications in the data center

·         Delivers on the scale requirements of URL monitoring of your business.

·         Meet agreed service levels with enhanced reporting showing application performance and availability.

·         More efficient problem identification and action to resolve issues.

 

·         Increased speed of access to information and functionality to drive management

·         Faster load times for views and results.

·         Improved and simplified management pack authoring experience

 

Where and when can I obtain the bits?

 

The RTM release is build 7221.

 

Where can I find collateral, training, and more on Operations Manager 2007 R2?

 

·         Newly released collateral includes the following:

o   Whitepaper: Introduction to Operations Manager 2007 R2

o   Datasheet: What’s New in Operations Manager 2007 R2

o   Datasheet: Reducing the cost of data center management with Operations Manager 2007 R2

o   Datasheet: Monitoring UNIX/Linux with Operations Manager 2007 R2

o   Datasheet: Tracking Service Levels with Operations Manager 2007 R2

o   Datasheet: Interoperability Connectors for Operations Manager 2007 R2

 

How else can I extend Operations Manager 2007 R2?

 

·         Service Level Dashboard v2 from the Solution Accelerators team lets you measure and report application or system performance & availability in near real time across your organization.  Using the Dashboard, you can easily spot trends and head off problems—before they occur.  The Dashboard also lets you create role-specific dashboards to support different departments, like HR, Finance, or Operations.   Download it today from Microsoft Connect.

·         Operations Manager 2007 R2 Interoperability Connectors provide the ability to synchronize alerts and status between Operations Manager 2007 R2 and other management systems.  Beta connectors for Tivoli Enterprise Console, HP OpenView Operations, and the new Universal Connector can be obtained from the Operations Manager R2 download on Connect.  Download the Interop Connectors from the System Center Catalog.

·         Operations Manager 2007 R2 Visio Add-in delivers the ability to link status and health information gathered by Operations Manager 2007 R2 into normally-static Visio diagrams, adding life and interaction to those diagrams.   Download it today from Microsoft Connect.

·         New Exchange Server 2007 Management Pack (MP) Beta, which provides enhancements over the current Exchange MP such as reducing alert noise and enhanced performance.   Download it today from Microsoft Connect.

·         BridgeWays MP Beta Program, providing beta MPs for MySQL, Apache, and Oracle running on Windows, Linux or Solaris.  For more information, and to register into the BridgeWays MP Beta Program, visit http://www.bridgeways.ca/bw_management-pack-beta-program-signup_form.php  

 

Posted by MSutara | 1 Comments

What is new: OpsMgr 2007 R2 - How to reset monitor state with recovery?

Cameron had a nice example of using the new R2 process monitoring feature in real life, but it raised a question about a feature he wanted to use. The following is a report of his issue and of how we can help him address his challenges using an already existing feature of OpsMgr 2007.

 

Scenario: Monitoring a system with a process monitor. Define a recovery to reboot the system if it’s not running the process required. Run this recovery automatically on critical state.

 

Problem: In OpsMgr prior to R2, when a recovery was created it had an option to “Reset monitor”, which would put the monitor into a healthy state. In R2, this option now says “Recalculate State Monitor”. This presents a challenge, as described below:

 

Recovery wizard

 

Challenge: Recalculating the state may keep the monitor in a critical state until the system has been rebooted successfully and is in fact running the process. If the process does not start correctly after reboot, it gets stuck in the critical state and the recovery will not run again. With a Reset of this monitor to a Healthy state, this would work properly, but without that option available I am not seeing an effective way to make this work.

 

Workaround: A recovery is no different from other workflows loaded by OpsMgr and is rather similar to a task. It consists of modules that are chained together and should provide some corrective action so that the monitor can fix its state. For that reason, the first module could be a module which resets the state of the monitor.
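To see why resetting first matters, here is a minimal Python sketch of the scenario. It is only a toy model under my own assumptions: a recovery fires once per transition into the critical state, each list entry is one detection of the process, and the function name count_recovery_runs is hypothetical (it is not part of OpsMgr).

```python
def count_recovery_runs(process_running_per_check, reset_before_recovery):
    """Count how often a recovery bound to the Critical state would run.

    process_running_per_check: one boolean per monitor detection
    (True means the monitored process was found running).
    reset_before_recovery: True models a recovery whose first module
    resets the monitor to Healthy before the corrective action runs.
    """
    state = "Healthy"
    runs = 0
    for running in process_running_per_check:
        if running:
            state = "Healthy"
            continue
        if state != "Critical":
            # Assumption: only a state CHANGE into Critical triggers the recovery.
            state = "Critical"
            runs += 1
            if reset_before_recovery:
                # The reset module puts the monitor back to Healthy, so the
                # next failed detection is a fresh state change.
                state = "Healthy"
    return runs


# The process never comes back after the reboot: three failed detections.
print(count_recovery_runs([False, False, False], reset_before_recovery=False))  # 1
print(count_recovery_runs([False, False, False], reset_before_recovery=True))   # 3
```

Without the reset, the monitor stays critical after the first failed detection and the recovery never runs again. With the reset, every failed detection is a new state change.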

 

The following is a module that can be used with a recovery directly. It resets the state of the monitor specified in its configuration.

 

<WriteActionModuleType ID="Microsoft.SystemCenter.Community.Health.ResetTargetStateAction" Accessibility="Public" Batching="false">
  <Configuration>
    <xsd:element minOccurs="1" name="MonitorId" type="xsd:string" />
  </Configuration>
  <OverrideableParameters>
    <OverrideableParameter ID="MonitorId" Selector="$Config/MonitorId$" ParameterType="string" />
  </OverrideableParameters>
  <ModuleImplementation Isolation="Any">
    <Composite>
      <MemberModules>
        <WriteAction ID="Health.ResetStateAction" TypeID="Microsoft.SystemCenter.Community.Health.ResetStateAction">
          <ManagementGroupId>$Target/ManagementGroup/Id$</ManagementGroupId>
          <ManagedEntityId>$Target/Id$</ManagedEntityId>
          <MonitorId>$Config/MonitorId$</MonitorId>
        </WriteAction>
      </MemberModules>
      <Composition>
        <Node ID="Health.ResetStateAction" />
      </Composition>
    </Composite>
  </ModuleImplementation>
  <OutputType>System!System.BaseData</OutputType>
  <InputType>System!System.BaseData</InputType>
</WriteActionModuleType>

 

Next is another module which can be used as well. It resets the state of the monitor first and then executes a command.

 

<WriteActionModuleType ID="Microsoft.SystemCenter.Community.Health.ResetTargetStateCommandExecuterAction" Accessibility="Public" Batching="false">
  <Configuration>
    <IncludeSchemaTypes>
      <SchemaType>System!System.CommandExecuterSchema</SchemaType>
    </IncludeSchemaTypes>
    <xsd:element minOccurs="1" name="ApplicationName" type="xsd:string" />
    <xsd:element minOccurs="1" name="WorkingDirectory" type="xsd:string" />
    <xsd:element minOccurs="1" name="CommandLine" type="xsd:string" />
    <xsd:element minOccurs="1" name="TimeoutSeconds" type="xsd:integer" />
    <xsd:element minOccurs="1" name="RequireOutput" type="xsd:boolean" />
    <xsd:element minOccurs="1" name="MonitorId" type="xsd:string" />
  </Configuration>
  <ModuleImplementation Isolation="Any">
    <Composite>
      <MemberModules>
        <WriteAction ID="Command" TypeID="System!System.CommandExecuter">
          <ApplicationName>$Config/ApplicationName$</ApplicationName>
          <WorkingDirectory>$Config/WorkingDirectory$</WorkingDirectory>
          <CommandLine>$Config/CommandLine$</CommandLine>
          <TimeoutSeconds>$Config/TimeoutSeconds$</TimeoutSeconds>
          <RequireOutput>$Config/RequireOutput$</RequireOutput>
        </WriteAction>
        <WriteAction ID="Reset" TypeID="Microsoft.SystemCenter.Community.Health.ResetTargetStateAction">
          <MonitorId>$Config/MonitorId$</MonitorId>
        </WriteAction>
      </MemberModules>
      <Composition>
        <Node ID="Command">
          <Node ID="Reset" />
        </Node>
      </Composition>
    </Composite>
  </ModuleImplementation>
  <OutputType>System!System.BaseData</OutputType>
  <InputType>System!System.BaseData</InputType>
</WriteActionModuleType>
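
The Composition of this second module nests “Reset” inside “Command”. In a composite module the nested node feeds its parent, so the innermost node runs first. The following Python sketch lists the member modules in that order; the helper name execution_order is hypothetical and exists only for this illustration.

```python
import xml.etree.ElementTree as ET


def execution_order(composition_xml):
    """Return module IDs of a <Composition> fragment, innermost node first."""
    root = ET.fromstring(composition_xml)
    order = []

    def visit(node):
        # Nested nodes produce the input for their parent, so visit them first.
        for child in node.findall("Node"):
            visit(child)
        order.append(node.get("ID"))

    for top in root.findall("Node"):
        visit(top)
    return order


composition = """
<Composition>
  <Node ID="Command">
    <Node ID="Reset" />
  </Node>
</Composition>
"""
print(execution_order(composition))  # ['Reset', 'Command']
```

This matches the description above: the monitor is reset first and the command is executed afterwards.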

 

A sealed MP with both modules is attached to this post.

 

Sample: Also attached is an example that uses the modules with a simple event-based monitor. The monitor targets the instance of “Root Management Server”, which is why the management pack also defines a state view for this entity. When you choose to display “Health Explorer”, you should easily be able to locate the sample monitor.

 

Initial Configuration 

 

One of the recoveries present in the attached MP runs automatically on the WARNING state. The value of the MonitorId element is the MPElement replacement representing the monitor you want to reset (it should be the same as the value of the Monitor attribute!). Also, please observe that using just the reset module causes its output to be displayed in the “Context” tab, and the two state changes will appear to have the “same” time of change in Health Explorer.

 

<Recovery ID="Microsoft.SystemCenter.Community.Monitors.RecoverySample.StateWarningResetRecovery" Accessibility="Internal" Enabled="onStandardMonitoring" Target="SC!Microsoft.SystemCenter.RootManagementServer" Monitor="Microsoft.SystemCenter.Community.Monitors.RecoverySample.EventBasedMonitor" RecalculateMonitor="false" ExecuteOnState="Warning" Remotable="true" Timeout="300">
  <Category>Maintenance</Category>
  <WriteAction ID="Reset" TypeID="MicrosoftSystemCenterCommunityMonitorsExtensions!Microsoft.SystemCenter.Community.Health.ResetTargetStateAction">
    <MonitorId>$MPElement[Name="Microsoft.SystemCenter.Community.Monitors.RecoverySample.EventBasedMonitor"]$</MonitorId>
  </WriteAction>
</Recovery>

 

Warning Recovery Context 

 

DISCLAIMER:

Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft.

What is new: New version of cluster MP released !

Pleased to announce that a small but important update to the Cluster Management Pack has now been released. This update focuses on improving scalability of the Management Pack.

 

The following summarizes the changes to the MP:

 

·         Introduced cook-down for workflows that query the state of clustered resources hosted by the node. The benefit comes from more effective WMI use (one query now handles all resources, compared to one query per resource before).

·         Changed the frequency for some workflows and documented how to override them.

·         Added various other documentation improvements.

Posted by MSutara | 1 Comments

Tool: OpsMgr 2007 R2 - What to do with Secure Reference Override Alert?

The subject of this post is advanced authoring that combines the security features of OpsMgr 2007 with workflows, along with an explanation of how to troubleshoot the alerts which may be raised at the end of such a process. Using a simple example, I show a tool I developed to help resolve ambiguous or unclear obstacles which may surface in this scenario.

 

I’m not going to discuss why; let’s just say I have a need to create my own Run As profile. This profile is then populated with a custom Run As account that I created as well. These steps need to be done manually.

·         Open OpsMgr console

·         Navigate to “Administration”, then “Run As Configuration”

·         Please create “Windows Credentials” account (do not distribute to any computer)

RunAs account

·         Please create new profile and associate with previously created account.

RunAs profiles

account in profile 

Note that this post doesn’t aim to explain the internals of the association between profile and account, nor the details of account distribution; there are (or will be) official guides available for exactly that purpose.

Let’s also assume a simple rule which generates an alert when event 123 is raised in the Application log by EventCreate. When the created profile is used with this rule while the Run As account has not been distributed to the computer where the target instance is monitored, event 1108 is raised during configuration load, and the workflow for this profile is not loaded until the issue is corrected.

·         Open OpsMgr authoring console

·         Create NT event based rule and use this profile with Event data source module.

Because we are using an unsealed MP, this rule must be created in the same file as the initially created profile.

Profile in module 

event 1108 

This event 1108 is picked up by the OpsMgr MP, and an alert is raised to notify that distribution was not set when the Run As account was associated with the Run As profile.

Dialogs and wizards were re-designed in this milestone to notify about the need to distribute during the creation!

Unfortunately, this new alert may in some cases contain somewhat cryptic information, increasing the TCO of its investigation. If the alert is closed without investigating the root cause, it will appear again either 24 hours after its original creation or when the health service is restarted.

console task integration 

To simplify investigation of the affected Run As profile (which would otherwise require querying the DB), I created an SDK tool and associated it with the product as a “console task”. Upon execution, the tool retrieves all alerts related to Run As profiles and provides user-friendly information about the affected Run As profile (as long as it is present in the DB).

 

Another alert that this tool can help investigate is based on event 1107 and can be simulated by importing the attached MP.

 

DISCLAIMER:

Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.

 

x86 installation package

x64 installation package

Posted by MSutara | 1 Comments

Attachment(s): Microsoft.SystemCenter.Runtime.RunAs.xml

What is new: OpsMgr 2007 R2 - Discover multiple cluster network names (Virtual Server)

Multiple network names scenario

 

A customer has a cluster resource group with multiple network name resources associated with an IP address resource. This scenario was described by one of our customers:

Configuring DTC in the cluster resource group that also contains the default instance of SQL is our standard practice. This is based on documentation written by Mike Grasso (Microsoft). This practice was never questioned during the Best Practice Review done by Microsoft either, and that is the reason why it has remained the same over the years.

Cluster-aware application monitoring requires discovering an instance of “Virtual Server” and targeting the discovery of such an application at this instance. In OpsMgr 2007, the definition of a Virtual Server is a cluster resource group with network name and IP address properties.

Previous releases discovered only a single instance of Virtual Server and assigned the first network name resource as this instance’s property. For the case study mentioned above, that network name was frequently the one belonging to DTC and not SQL. That is why this approach was error-prone when multiple network names were assigned to a group, and it often caused “loss of monitoring”. The workaround of changing the order of the network name properties is not really feasible and is extremely costly to attempt in an enterprise environment.

Solution at a glance

 

An override has been introduced to allow each IP address -> network name dependency to be discovered as an instance of Virtual Server. This override must be turned on explicitly. The decision was taken not to change the default discovery behavior because, based on CSS cases to date, only specific installations of SQL Server required this feature.
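
The difference between the default behavior and the override can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the actual discovery code; the function name, the tuple shape and the sample names are all hypothetical.

```python
def discover_virtual_servers(group_name, network_names, multiple_servers=False):
    """Return (group, network name) pairs discovered as Virtual Server instances.

    network_names: network name resources of the group, in resource order.
    multiple_servers: models the override that enables one instance per
    network name; the default (False) keeps only the first name.
    """
    if not network_names:
        return []
    if multiple_servers:
        return [(group_name, name) for name in network_names]
    return [(group_name, network_names[0])]


# Hypothetical resource group where the DTC name happens to come first.
names = ["SQLCLU-DTC", "SQLCLU-SQL"]
print(discover_virtual_servers("SQL Group", names))
# [('SQL Group', 'SQLCLU-DTC')]
print(discover_virtual_servers("SQL Group", names, multiple_servers=True))
# [('SQL Group', 'SQLCLU-DTC'), ('SQL Group', 'SQLCLU-SQL')]
```

With the default behavior only the DTC name is discovered, which is exactly the “loss of monitoring” case for SQL.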

How to

 

1.   Open OpsMgr2007 R2 console.

2.   Navigate to Authoring, then Object Discoveries

3.   Change the scope to Virtual Server (to simplify location of discovery)

4.   Select Windows Clustering Discovery and then select “Override the discovery”

5.   Pick Multiple Servers Discovery and change override value to true

 

Upon dialog closure and configuration reload, multiple Virtual Servers should be discovered where feasible.

 

Override snapshot 

Another override worth mentioning!

 

Imagine that an undesired instance of Virtual Server is discovered after the override has been applied. Can I remove it? Definitely …

Use the Excluded Servers override (also visible in the picture above)! That override has always been present in all up-to-date releases of OpsMgr 2007 (although I’m not aware of anyone using it). Once it is selected, its description should make it self-explanatory what “value” to use with the override:

String which contains semicolon delimited fully qualified names of virtual servers to be excluded from discovery.

Applying this to the customer scenario above: when the instance of Virtual Server representing DTC is not required, simply place its FQDN into the Excluded Servers override value and wait for the configuration reload. The instance will then be deleted and will disappear from views.
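
If you maintain a longer exclusion list, a small helper can build and check the semicolon-delimited value. The Python sketch below is hypothetical: the function names and sample FQDNs are mine, and the case-insensitive comparison is an assumption based on how DNS names normally behave, not something taken from the discovery implementation.

```python
def build_excluded_servers(fqdns):
    """Join FQDNs into a semicolon-delimited override value, skipping blanks and repeats."""
    cleaned = []
    seen = set()
    for name in fqdns:
        name = name.strip()
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            cleaned.append(name)
    return ";".join(cleaned)


def is_excluded(fqdn, override_value):
    """Check whether an FQDN appears in the override value (assumed case-insensitive)."""
    excluded = {part.strip().lower() for part in override_value.split(";") if part.strip()}
    return fqdn.strip().lower() in excluded


value = build_excluded_servers(["sqlclu-dtc.contoso.com", " SQLCLU-DTC.contoso.com ", "oldvs.contoso.com"])
print(value)                                         # sqlclu-dtc.contoso.com;oldvs.contoso.com
print(is_excluded("SQLCLU-DTC.contoso.com", value))  # True
print(is_excluded("sqlclu-sql.contoso.com", value))  # False
```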

 

A HOTFIX allowing this same functionality on OpsMgr 2007 SP1 is in the works too, for those who will not have OpsMgr 2007 R2 deployed in their environments.

Posted by MSutara | 0 Comments

What is new: OpsMgr 2007 R2 - Alert Storm Recognition (possible rule misconfiguration)

What is new?

 

The OpsMgr 2007 R2 Release Candidate has finally been released and can be downloaded from Connect. What is new in this release? PLENTY! Some of you got a glimpse of those features while evaluating the Beta; some will see most of the improvements for the first time … very exciting!

That is why I would like to start a small series where I comment on some of the changes or additions. With this post, I would like to mention a design change that suspends alert creation in order to prevent an alert storm. Yes, we did bring the MOM2005 feature (at least for rules) back!

Alert storm mitigation at a glance:

 

I need to clarify that we are not trying to solve the generic data storm problem; that is a vNext scenario. We only addressed the case of a possible “rogue” alert-generating rule flooding our operational DB and/or raising too many notifications.

The settings for recognizing this problem are per agent (across all targeted instances) and per individual management group (in a multi-homed scenario, each group has its own settings in the registry). The default throttle settings are 50/60/10. This means that if one rule generates more than 50 alerts within 60 seconds, that rule is suspended for 10 minutes (its alert generation is disabled).
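
To make the 50/60/10 rule concrete, here is a minimal Python sketch of a per-rule sliding-window throttle. It is not the runtime’s implementation: the class name, the parameter names and the choice to drop the alert that crosses the threshold are my assumptions for illustration.

```python
from collections import deque


class AlertThrottle:
    """Sliding-window sketch of per-rule alert suspension (defaults 50/60/10)."""

    def __init__(self, alert_count=50, count_interval_s=60, suspend_interval_s=600):
        self.alert_count = alert_count
        self.count_interval_s = count_interval_s
        self.suspend_interval_s = suspend_interval_s
        self._recent = {}           # rule name -> deque of recent alert timestamps
        self._suspended_until = {}  # rule name -> time at which suspension ends

    def allow(self, rule, now):
        """Return True if an alert from `rule` at time `now` (seconds) may be created."""
        if now < self._suspended_until.get(rule, float("-inf")):
            return False
        window = self._recent.setdefault(rule, deque())
        # Forget alerts that fell out of the counting interval.
        while window and now - window[0] >= self.count_interval_s:
            window.popleft()
        window.append(now)
        if len(window) > self.alert_count:
            # More than `alert_count` alerts inside the interval: suspend the rule.
            self._suspended_until[rule] = now + self.suspend_interval_s
            window.clear()
            return False
        return True


throttle = AlertThrottle()
created = [throttle.allow("rogue rule", now=t) for t in range(51)]
print(created.count(True))                    # 50
print(throttle.allow("rogue rule", now=100))  # False
```

One alert per second from a single rule is enough to trip the default settings: the 51st alert arrives inside the 60 second window, and the rule then stays silent for the next 10 minutes.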

The option to customize the threshold values still exists. Customization will not work in one very special deployment scenario: an OpsMgr2007 R2 agent multi-homed to at least one management group monitored by an OpsMgr2007 SP1 server. (The reason is that such an agent is forced to use SP1 management packs, and those obviously lack the new configuration required for threshold customization.) For the runtime to recognize customized values, the health service must be restarted!

When the runtime recognizes that a possible storm is happening, event 5399 is raised. The following is the English text of this event:

;// Suspend alert generating rule
;// %1 = management group name
;// %2 = workflow name
;// %3 = name of targeted instance
;// %4 = instance id
;// %5 = alert origin (name or message id)
;// %6 = count
;// %7 = time
;// %8 = disabled time

MessageId=5399
SymbolicName=MSG_HS_HM_ALERT_SUSPENDED
Severity=Warning
Language=English

A rule has generated %6 alerts in the last %7 seconds.  Usually, when a rule generates this many alerts, it is because the rule definition is misconfigured.  Please examine the rule for errors. In order to avoid excessive load, this rule will be temporarily suspended until %8.
%nRule: %2
%nInstance: %3
%nInstance ID: %4
%nManagement Group: %1.
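
For readers unfamiliar with message definitions, %1 to %8 are insertion parameters and %n is a line break. The following Python sketch renders the text above with sample values. The render function and every sample value are hypothetical, and the message text has been joined into one string for the example.

```python
import re

TEMPLATE = (
    "A rule has generated %6 alerts in the last %7 seconds.  "
    "Usually, when a rule generates this many alerts, it is because the rule "
    "definition is misconfigured.  Please examine the rule for errors. "
    "In order to avoid excessive load, this rule will be temporarily "
    "suspended until %8."
    "%nRule: %2%nInstance: %3%nInstance ID: %4%nManagement Group: %1."
)


def render(template, params):
    """Replace %1..%N with params (1-based) and %n with a newline."""
    def substitute(match):
        token = match.group(1)
        if token == "n":
            return "\n"
        return str(params[int(token) - 1])

    return re.sub(r"%(n|\d+)", substitute, template)


# Sample values only; they follow the %1..%8 order documented above.
params = ["MG1", "Sample.Rule", "server01.contoso.com", "instance-id",
          "Sample.Rule", 51, 60, "12:10:00"]
print(render(TEMPLATE, params))
```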

OpsMgr 2007 R2 health monitoring recognizes this event and raises an alert to notify the operator about the problem. The alert needs to be closed manually once corrective action has been taken or the conditions causing the possible storm have been mitigated.

The following is an example of customized threshold values. It shows the customization 15/30/5 (15 alerts within 30 seconds will cause suspension for 5 minutes, that is 300 seconds). It also shows where in the registry such customization should be done: one must create “Alert Count”, “Alert Count Interval” and “Alert Suspend Interval” under “HKLM\System\CurrentControlSet\Services\HealthService\Parameters\Management Groups\<name of MG>”.

threshold customizations 

I hope you enjoy this product as much as we hope you will. I always feel happy, but this time I also feel rather confident about its quality and value! If you have questions, comments or feedback (anything), please let me know. I will try to continue this series often, so if there is anything in particular you want covered, scream and I will move it higher on my TODO list!

Posted by MSutara | 0 Comments

What is new: OpsMgr 2007 R2 RC released! (publicly available)

We are very excited to announce that the System Center Operations Manager 2007 R2 Release Candidate is now available on Connect!

Operations Manager 2007 R2 Release Candidate

 

Operations Manager 2007 R2 introduces key new and enhanced functionality, including:

Enhanced application performance and availability across heterogeneous platforms

·         Delivers monitoring across Windows, Linux and Unix servers–all through a single console
·         Extends end to end monitoring of distributed applications to any workload running on Windows, Unix and Linux platforms
·         Maximize availability of virtual workloads with integration with System Center Virtual Machine Manager 2008

Improved management of applications in the data center

·         Delivers on the scale requirements of URL monitoring of your business
·         Meet agreed service levels with enhanced reporting showing application performance and availability
·         More efficient problem identification and action to resolve issues

Increased speed of access to information and functionality to drive management

·         Faster load times for views and results
·         Improved and simplified management pack authoring experience

For those who are evaluating the Beta release, this Release Candidate offers a number of enhancements over the Operations Manager R2 Beta, including:

·         New Power Management MP template (Monitored system must be Windows Server 2008 R2 or Win7)
·         Updated branding across all User Interfaces
·         Improved trace configuration tools on the CD to help support issues escalated to Customer Support (where applicable)
·      Improved Run As Account Distribution Configuration
·      Ability to run inline tasks for non-Microsoft servers
·         Support for upgrade from Beta deployments to the Release Candidate
·         New and updated documentation, including the Usage Guide, Design Guide, Deployment Guide, Upgrade Guide, Security Guide and Operations Guide

In addition to the build, we are providing the Release Notes as well as other key documentation including:

·         Operations Manager 2007 Supported Configurations

·         Operations Manager 2007 R2 RC Design Guide

·         Operations Manager 2007 R2 RC Deployment Guide

·         Reporting Deployment and Usage Troubleshooting

·         Operations Manager 2007 R2 RC Upgrade Guide

·         Operations Manager 2007 R2 RC Security Guide

·         Operations Manager 2007 R2 RC Operations Guide

·         Operations Manager 2007 R2 RC Usage Guide

·         Operations Manager Management Pack Guide for Operations Manager 2007 R2

 

Operations Manager 2007 R2 Release Candidate Release Notes

Operations Manager 2007 R2 Release Candidate Documentation

 

 

We would also like to provide you access to the Service Level Dashboard 2.0 Beta for Operations Manager 2007 R2, available here:

Service Level Dashboard v2 Beta for Operations Manager 2007 R2

 

 

Lastly, please post your RC feedback/bugs/suggestions here:

https://connect.microsoft.com/feedback/CreateFeedbackForm.aspx?FeedbackFormConfigurationID=1872&FeedbackType=1&SiteID=446

 

Posted by MSutara | 0 Comments

Dependency Monitor Hotfix to increase health state calculation reliability

I am writing this in response to the release of the hotfix. I would like to bring this fix to your attention for your consideration. It should increase the reliability of health state monitoring in numerous cases where a dependency monitor is used.

An issue was discovered where a dependency monitor may indicate the wrong state due to a race condition during monitor registration. This could surface when the contributing instances are not available or are in maintenance mode during registration, when the target instance is leaving maintenance mode, and sometimes during distributed application creation.

The main symptoms may include unexpected alerts being generated and an incorrect state being indicated given the rollup algorithm and the state of the contributing monitors. (In many cases the state is not reflected at all and shows “Not Monitored”, especially for distributed applications.)

The DA issue was tried and evaluated by a customer, and the fix addressed their problem. (This referral should not be taken as advice to deploy into production immediately; you are encouraged to perform your own evaluation in your pre-production environment.) If this hotfix does not help in your case, please report it through the Connect site so I have a chance to investigate your scenario.

The hotfix should be deployed to every computer experiencing issues with dependency monitors. In the majority of cases, the monitor resides on the RMS only.

IMPORTANT NOTE: Application of this hotfix will reset the Health Service configuration state on each computer where it is installed. It is therefore important to review unhealthy state within the Operations Manager console and resolve where possible symptoms causing unhealthy state prior to hotfix installation. Failure to do so may cause event based monitors to be reset to Healthy state and related Alerts automatically resolved, which may lead to loss of visibility into issues impacting the monitored environment.

 

Posted by MSutara | 2 Comments

Windows Server 2008 Cluster MP released to web !!!

In my opinion, Boris did kick a** job driving this puppy out, especially knowing that Windows hotfix is required for its functionality. This one should help monitoring infrastructure of clustered RMS with future releases of OpsMgr 2007 R2 as well (not beta though)! You can download at:

http://www.microsoft.com/downloads/details.aspx?FamilyId=AC7F42F5-33E9-453D-A923-171C8E1E8E55&displaylang=en

This release contains some fixes for library and discovery so even Win2k3 cluster users should consider downloading. Thanks for being patient with us!

Posted by MSutara | 2 Comments

How to: Reset monitor when closing alert?

This is not something I recommend; one could say I almost regret that we did not prohibit closing alerts generated by a monitor (especially when the auto-resolve feature is used). But recently I learned about some ticketing systems that close alerts when it is unclear whether the issue was corrected, so I see some need to automate resetting the monitor health state, so that the monitor state change is re-generated when the issue is still present after the ticket was closed.

Problem Description:

Again, as said, there may be legitimate situations where a customer needs to reset monitor health once an alert generated by its state change has been resolved. Such scenarios include automated ticketing systems resolving alerts without providing enough evidence that the issue was indeed addressed, and situations where an operator resolves a batch of alerts without investigating their root cause (for example after a network outage) and/or by mistake.

Recently I saw this type of request from multiple independent sources, so I decided to provide what I believe may be the only way to achieve this functionality: an OpsMgr connector.

Analyzing proposal:

An OpsMgr connector is a nice feature that allows subscribing to alert changes happening for members of a specified group. It also allows reacting to such a change, in our case by locating the monitor associated with the alert and requesting its state reset through an SDK call.
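
The core of that reaction can be sketched independently of the SDK. The Python below is a simplified stand-in, not my connector’s source: the Alert class, its field names and the reset_monitor callback are hypothetical, and the value 255 is the resolution state OpsMgr uses for closed alerts.

```python
from dataclasses import dataclass

CLOSED = 255  # resolution state of a closed alert


@dataclass(frozen=True)
class Alert:
    alert_id: str
    is_monitor_alert: bool
    resolution_state: int
    monitor_id: str
    entity_id: str


def handle_alert_batch(alerts, reset_monitor):
    """Reset each monitor once for every closed, monitor-generated alert in the batch.

    reset_monitor(entity_id, monitor_id) stands in for the SDK call that
    requests the state reset. Returns the (entity, monitor) pairs that were reset.
    """
    done = []
    seen = set()
    for alert in alerts:
        if not alert.is_monitor_alert or alert.resolution_state != CLOSED:
            continue  # rule alerts and still-open alerts are ignored
        key = (alert.entity_id, alert.monitor_id)
        if key in seen:
            continue  # one reset per monitor and instance is enough
        seen.add(key)
        reset_monitor(alert.entity_id, alert.monitor_id)
        done.append(key)
    return done


batch = [
    Alert("a1", True, CLOSED, "monitor-1", "computer-1"),
    Alert("a2", False, CLOSED, "", "computer-1"),       # generated by a rule
    Alert("a3", True, 0, "monitor-2", "computer-2"),    # still new
]
print(handle_alert_batch(batch, lambda entity, monitor: None))
# [('computer-1', 'monitor-1')]
```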

Note:
I will not discuss connector internals (registration, the subscription used …) but will provide the source code so that you can reverse engineer my implementation.

Solution:

Attached you can find the source code for my solution, as well as the binary, which you should copy into your RMS product folder. You need to initialize the connector when you start it. This action imports the MP with the group definition (if the MP has not been imported already), creates the connector and its subscription (again, only if necessary) and starts a worker thread to receive monitor-raised alerts.

Connector initialization.

You should see connector created (in Administration section of operations console) after successful initialization.

Connector initialized.

Currently the connector uses a group which is populated with instances of computer. This can be adjusted (steps described later), and you should be able to see all members after the group calculation rule finishes (in the Authoring section of the operations console).

Connector group

Group members. 

Below is an example of an alert raised by a test event-based monitor. When this alert is resolved, the state of the monitor resets.

Monitor with alert.

Monitor reset its state.

Customization:

As mentioned, this connector responds to all alerts generated by any monitor that belongs to an instance of Windows Computer. It is rather simple to customize the managed entity type you want to use, though.

First export connector management pack:

Export MP. 

Then edit management pack in XML editor of your choice. You need to change type used with relationship as well as group population rule:

Customize relationship. 

Customize group population.

After the changes are saved and you have imported your management pack, please restart the connector application (the initialization button will not overwrite changes to the MP, but the remove button will delete the MP from OpsMgr when removing the connector from your environment). You can see in the source code that the worker thread starts after 3 minutes (to give group calculation time to populate the group) and that the subscription uses a 1 minute polling interval to retrieve all alerts matching the subscription definition.

One more word of caution: a connector like this may not scale well in big environments, and additional work could be needed. This post can still serve as a nice example and a foundation for a more advanced application.

DISCLAIMER:

Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.

HOWTO: “restart” monitoring of my environment? (another version UPDATE)

My initial and last posts described restart monitoring tool representing my idea how one can approach “clean up” of the monitoring (scenario described in the first post).

Since then I received some additional and valid feedback and, mainly thanks to the Microsoft PFE (Premier Field Engineering) team, I can now offer a somewhat improved (mainly speed-wise) version of this tool. Besides speed, the feedback I received was that in the vast majority of cases the recalculate monitoring task causes no change and has no visible effect. (For those interested, this is due to the fact that most monitor types do not define “On-Demand” detection, which is a real pity, as many would, in my opinion, benefit from the ability to get the state of the monitor “now” rather than wait for the next regular monitor detection to make a state change.) This feedback led to the addition of a “pure” RESET task (for any instance of the group).

Here is the list of changes:

 

1.       Fixed deployment and support upgrade for both tools (SDK application and Web application).

2.       Addition of “Reset Monitoring” task, for plain monitor reset which is not followed by request to recalculate instance state.

3.       Changed task invocation from PartialMonitoringObject to MonitoringState. The reason is that a timeout can be specified and one does not need to wait for task completion. The runtime (currently) doesn’t return task status to the SDK, which means that once the task is spawned, it finishes its action asynchronously (unless low memory or other system-type errors occur).

4.       The task now outputs the DisplayName of the instances that were directly affected by the action, meaning those that were reset (or eventually recalculated). The indirect result should be visible through dependencies: one can locate all instances to which an affected instance contributes its state and observe whether a state change for those was necessary (doable through Health Explorer)!

 

Additional info – command line options:

Location:
%ProgramFiles%\System Center Operations Manager 2007 Restart Monitoring Tool

Usage:
 Microsoft.SystemCenter.Community.RestartMonitoring.App [/s][/o][/r] /instances id [, id ...]


Options:
/s            “Pure” command line tool. When this option is missing, the WinForm version of the tool is executed.
/o           Include information about the affected instance(s) in the output.
/r            Request instance recalculation after its state reset has been performed.

Reset:
Microsoft.SystemCenter.Community.RestartMonitoring.App /s /o /instances <guid – get from PowerShell (all groups (like root management group) are visible thru initial DIR)>

Restart:
Microsoft.SystemCenter.Community.RestartMonitoring.App /s /o /r /instances <guid>
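If you want to script the two invocations above, the argument list can be assembled as in the following Python sketch. Only the executable name and the /s, /o, /r and /instances switches come from the usage text. The helper name `build_command` is hypothetical, and joining several ids into one comma-separated argument is my assumption about how "id [, id ...]" is parsed, so please verify it against the attached source before relying on it.

```python
import uuid
from typing import List, Sequence

TOOL = "Microsoft.SystemCenter.Community.RestartMonitoring.App"


def build_command(instance_ids: Sequence[str],
                  recalculate: bool = False,
                  include_output: bool = True) -> List[str]:
    """Build the argument list for a Reset (default) or Restart (recalculate=True) run."""
    if not instance_ids:
        raise ValueError("at least one instance id is required")
    # Validate and normalise ids; uuid.UUID raises ValueError for non-GUID input.
    ids = [str(uuid.UUID(value)) for value in instance_ids]
    args = [TOOL, "/s"]          # /s selects the pure command line mode
    if include_output:
        args.append("/o")        # report affected instances in the output
    if recalculate:
        args.append("/r")        # recalculate after the reset (Restart)
    # Assumption: several ids are passed as one comma-separated argument.
    args += ["/instances", ",".join(ids)]
    return args
```

The returned list can be handed to `subprocess.run` from the tool's installation folder. Validating the GUID up front avoids starting the tool with an id that cannot match any instance.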

Attached, you can find NEW VERSION of this tool. Attachment for my old posts will update automatically thru my ISP site.

DISCLAIMER:

Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.

 

HOWTO: “restart” monitoring of my environment? (UPDATE)

My last post tried to introduce a tool which restarts monitoring of OpsMgr environment. Cameron (and others) is (are) actively looking at its use and here is an update based on some feedback.

1.       Following was a state view for computer group which I achieved by selecting “Discovered Inventory” followed by “Change Target Type …” from “State Action” pane. I then selected “View all targets” and picked “Computer Group”.

Computer group restart monitoring

To avoid this manual intervention, my next version adds a plain state view for the target of my “Restart Monitoring” task.

Computer Group state view 

2.       The “Restart Monitoring” task was using a Timeout attribute set to 5 minutes. This is not long enough for the task to complete, so it often failed with a “Timeout Expired” error. This is the timeout for task execution, which one cannot set through the authoring UI. I changed it manually and bumped it to 1 hour. Please customize it if that is still not long enough (remember that the complete containment relationship tree is crawled prior to restarting environment monitoring, although the tool would be flawed if it took that long; remember: feedback is ALWAYS appreciated).

<Task ID="Microsoft.SystemCenter.Community.RestartMonitoring.Task" Accessibility="Internal" Enabled="true" Target="SCLibrary!Microsoft.SystemCenter.ComputerGroup" Timeout="3600" Remotable="true">
  <Category>Maintenance</Category>
  <WriteAction ID="PA" TypeID="System!System.CommandExecuter" Target="SCLibrary!Microsoft.SystemCenter.RootManagementServer">
    <ApplicationName><![CDATA[%ProgramFiles%\System Center Operations Manager 2007 Restart Monitoring Tool\Microsoft.SystemCenter.Community.RestartMonitoring.App.exe]]></ApplicationName>
    <WorkingDirectory>.</WorkingDirectory>
    <CommandLine>/s $Target/Id$</CommandLine>
    <TimeoutSeconds>3600</TimeoutSeconds>
    <RequireOutput>true</RequireOutput>
  </WriteAction>
</Task>

 

3.       I fixed x64 deployment. My old packages were installing into “Program Files (x86)”, and that prevented the task targeted to “Computer Group” from succeeding. If you have the old version installed, you can simply move “System Center Operations Manager 2007 Restart Monitoring Tool” into “Program Files” (although removing the installation through Windows Installer will break with this approach).

 

4.       It appears that deployment of the WebApp needs to be executed as a user who is allowed to create a WebSite as well as an AppPool.

 

Attached, you can find NEW VERSION of this tool. Attachment for my old post will update automatically thru my ISP site.

DISCLAIMER:

Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.

 

HOWTO: “restart” monitoring of my environment?

 

My friend Cameron and I discussed the following issue as one of the challenges he occasionally faces with his customers. In order to minimize his TCO and manual interventions, I promised to help even though this design request is not making it into the feature set for our next release.

Problem Description:

There may be legitimate situations where a customer needs to reset many health monitors at once. For example, after a network outage a significant number of alerts may have been generated and the health state of various items may have become unhealthy. Another case is an incorrect approach to Maintenance Mode, which may cause a similar outcome, especially when manual reset monitors or alerts generated without the “auto-resolve” feature are present in instances involved in maintenance.

To address this type of situation, the bulk of alerts from the outage needs to be closed (which can be done with a PowerShell script). Resetting the health state of multiple systems is also required, but there is no viable way to do it in bulk, so manual intervention is needed.

His proposal was that it should be possible to select multiple servers and force their health back to green. Specifically, the health model for those instances would be walked and each monitor not Healthy is reset. This would “restart” the environment to green so that only real issues would resurface as alerts recurred and the states would be updated.

 

Analyzing proposal:

It is already possible to use SDK tasks to accomplish this proposal. It is even achievable to “speed up” the recognition of real issues by submitting an additional “recalculate” state task for a given instance. This task forces a recalculation of what the state of the given instance should be (at the time of execution) by working with on-demand detection, assuming that such detection is defined for the monitor types used to monitor that instance.

My approach to implementing this proposal was a little different from the one stated above. I’m not finding every unhealthy monitor; instead I crawl the relationship tree for the selected instance, recursively adding each instance contributing to the overall health. While making sure each instance is present just once, a reset request is issued against each of those instances, and its result affects the health state of all other instances that depend on its state either directly or indirectly.
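The crawl described above can be sketched in a few lines of Python. This is an illustration only, not the tool's source: the function name `collect_contributors` and the dictionary that maps an instance to the instances contributing to its health are hypothetical stand-ins for what the SDK returns. The point of the sketch is the recursive walk with a visited set, so that every instance is listed (and later reset) just once.

```python
from typing import Dict, Iterable, List


def collect_contributors(root: str,
                         contributors: Dict[str, Iterable[str]]) -> List[str]:
    """Return root plus every instance contributing to its health, each listed once."""
    ordered: List[str] = []
    visited = set()

    def visit(instance: str) -> None:
        if instance in visited:
            return  # already collected; this also guards against cycles
        visited.add(instance)
        ordered.append(instance)
        for child in contributors.get(instance, ()):
            visit(child)

    visit(root)
    return ordered


if __name__ == "__main__":
    # Hypothetical instance space: a group with two servers sharing one component.
    graph = {
        "group": ["srv1", "srv2"],
        "srv1": ["disk", "os"],
        "srv2": ["os"],
    }
    for name in collect_contributors("group", graph):
        print("reset", name)
```

A reset request would then be sent for every entry of the returned list, optionally followed by a recalculate request.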

Note:
The following post contains a video trying to describe the difference between the Reset and Recalculate tasks. It also touches on what “On-Demand” detection means, etc. Please contact me thru comments if I should try to provide an additional or different explanation of those monitor features.

 

Solution:

Attached, you can find source code for my solution as well as installers for deployment of already built binaries. I provide two types of integration with our operations console.

First is having a task associated with managed entity “Microsoft.SystemCenter.ComputerGroup”. This will become present when installation of “RestartMonitoringSetup” for particular SKU succeeds. Following is screenshot providing self-descriptive use of the task:

Computer group restart monitoring task.

 

Second possible integration is using the fact that console is able to act like a browser. Deployment is performed by RestartMonitoringWebSetup and consists of creating Web application and MP import. Web application allows regular web browser to act as the tool which triggers requested restart action. MP associated with this approach contains following WEB view to allow integration with console:

Web application view.

 

Choosing the option with groups allows a “restart” of the monitoring for all instances contained within all selected groups. Such an operation may become rather expensive because, as I hinted above, the instance space is crawled and all necessary instances (contributing directly or indirectly) are asked to reset and then recalculate their state.

Choosing to restart group(s).

Action confirmation.

 

The option to restart monitoring for instances with an active alert performs a similar operation to the one made for a group. The only difference is that the likelihood of having many instances contributing to the overall health state is smaller than it is with a group (or multiple groups, for that matter).

Note:
An additional warning: the tool is not smart enough to recognize whether an alert was raised by a monitor. Restarting may therefore have no effect when the monitors were in fact healthy and the alert was generated by a rule. This may change in future versions.

Choosing to restart instance(s).

 

DISCLAIMER:

Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft. Future versions of this tool may be created based on time and requests.

 

Clustered Virtual Server 2005 R2 and Operations Manager 2007

 

Problem:

 

Monitoring of clustered virtual machines (guests) is unreliable with Operations Manager 2007. Instances of Virtual Machine are left unmonitored for no apparent reason.

Scenario:

 

Consider a simple wolfpack cluster (implemented thru Microsoft Cluster Services, MSCS) with just a quorum cluster resource group and clustered Virtual Server 2005 R2 (following this guide). Operations Manager 2007 is installed and the health service is pushed to every cluster node. Agents are also pushed to every virtual machine (and those computers become agent-managed computers). Finally, the Server Virtualization Management Pack for Microsoft System Center Operations Manager 2007 is imported after successful installation and deployment of OpsMgr.

 

Following is the list of issues one can observe with such setup:

 

1.       Instance of virtual machine (guest) or instance representing windows computer is NEVER monitored when virtual machine cluster resource group is active on the same cluster node as quorum resource group

2.       Instance of virtual machine (guest) or instance representing windows computer is monitored fine when its corresponding cluster resource group is not active on the same cluster node as quorum resource group

3.       Everything works as expected when Server Virtualization MP is not present!

 

Root cause:

 

Based on investigation, all instances that are to be monitored by the health service running on the virtual machine agent (guest) are disabled. It appears that the Server Virtualization MP was not designed to work in cluster scenarios, mainly because it associates the Virtual Machine Host with the virtual computer (cluster).

 

Possible solution:

 

Let’s quickly review some MP implementation details. Main discovery targets Microsoft.Windows.Server.Computer (and cluster is discovered as instance of Microsoft.Windows.Cluster.VirtualServer where this type extends Windows Server type). This means that discovery executed on cluster nodes is discovering instances of the VMHost while their virtual machines are active on particular node, but it also means that discovery is executed against quorum cluster resource group (virtual computer).

 

As mentioned in the “Root cause” section, this is, I believe, where the MP author may have made a mistake with health modeling: you do not need to associate the quorum with the virtual machine group, as they are able to coexist independently. I base this comment on the fact that the virtual machine resource group stays online when the quorum moves to a different cluster node. It also remains fully functional and accessible thru the VS Administration Website using the name of the physical computer (cluster node).

 

Translating this to real life example, having VM1 on node1 and VM2 plus quorum on node2, following seems discovered:

 

·         Instance of VM host for VM1 associated with physical computer 1.

·         Instance of VM host for VM2 associated with physical computer 2.

·         Instance of VM host for VM2 associated with virtual computer (cluster = quorum)

 

Investigation revealed that avoiding an association of VM host with virtual computer will fix the issue. This can be done by having DiscoveryPropertyOverride to disable VM host discovery in the context of Microsoft.Windows.Cluster.VirtualServer. (Solution verified while investigating issue.)

 

<DiscoveryPropertyOverride ID="VirtualServer.2005R2.Discovery.Override" Context="Cluster!Microsoft.Windows.Cluster.VirtualServer" Enforced="false" Discovery="VirtualServer!Microsoft.Virtualization.VirtualServer.2005R2.DiscoveryRule" Property="Enabled">
  <Value>false</Value>
</DiscoveryPropertyOverride>

 

Implementing workaround:

 

Really should be as simple as importing attached MP. But I had an environment where following steps had to be taken:

 

1.       Open Operations Console

2.       Go to “Administration” -> ManagementPack

3.       Select “Microsoft Virtualization Reports” -> right click -> delete

4.       Select “Microsoft Virtual Server 2005 R2” -> right click -> delete. If another dependency exists, delete it as well (please be careful if that dependency is the default MP, as different steps need to be taken first)

5.       Right click in “Administration” pane -> Import management packs

6.       Add Microsoft.Virtualization.Reports.mp

7.       Add Microsoft.Virtualization.VirtualServer.2005R2.mp

8.       Add attached MP

 

When all management packs are imported at the same time, the override is properly applied. This allowed my test environment to work and monitor virtual machines.

 

DISCLAIMER:

Please evaluate in your test environment first! As expected, this solution is provided AS-IS, with no warranties and confers no rights. Use is subject to the terms specified at Microsoft.
