Last contacted, better SQL query

My post about how to retrieve the approximate last contacted time generated good feedback. Thanks to Robin Drake for rewriting my SQL script so that it avoids cursors and returns a single result grid. Here is his version:

use OperationsManager
go

declare @substract float
declare @numberOfMissing float
declare @interval float

-- Get the number of missing heartbeats
select @numberOfMissing = SettingValue from GlobalSettings GS
join ManagedTypeProperty MTP with(nolock) on GS.ManagedTypePropertyId = MTP.ManagedTypePropertyId
where MTP.ManagedTypePropertyName = 'NumberOfMissingHeartBeatsToMarkMachineDown'

-- Get the heartbeat interval (in seconds)
select @interval = SettingValue from GlobalSettings GS
join ManagedTypeProperty MTP with(nolock) on GS.ManagedTypePropertyId = MTP.ManagedTypePropertyId
where MTP.ManagedTypePropertyName = 'HeartbeatInterval'

-- Calculate the amount of lapsed time before a system is marked as not contactable.
-- A datetime cast to float counts days, so seconds are converted using 86400 (seconds per day).
select @substract = (@numberOfMissing * @interval) / 86400.0

select B.DisplayName, AH.TimeStarted,
    cast((cast(TMP.MaxTimeStarted as float) - @substract) as datetime) as [ApproxLastContactedTime (UTC)],
    -- UTC-7 is Pacific Daylight Time; use -8 for Pacific Standard Time
    dateadd(hh, -7, cast((cast(TMP.MaxTimeStarted as float) - @substract) as datetime)) as [ApproxLastContactedTime (Pacific)]
from Availability A
join BaseManagedEntity B with(nolock) on B.BaseManagedEntityId = A.BaseManagedEntityId
join AvailabilityHistory AH with(nolock) on AH.BaseManagedEntityId = A.BaseManagedEntityId
join
(
    -- Latest availability change per (non-deleted) entity
    select MAX(AHTMP.TimeStarted) as MaxTimeStarted, BME.BaseManagedEntityId
    from AvailabilityHistory AHTMP
    join BaseManagedEntity BME with(nolock) on BME.BaseManagedEntityId = AHTMP.BaseManagedEntityId
    where BME.IsDeleted = 0
    group by BME.BaseManagedEntityId
) TMP on TMP.BaseManagedEntityId = AH.BaseManagedEntityId and AH.TimeStarted = TMP.MaxTimeStarted
where A.IsAvailable = 0 and B.IsDeleted = 0

Robin, Thanks again!

Posted by MSutara | 6 Comments

“Net view” as diagnostic used when heartbeat is missing

This all started with a question about how “Computer Not Reachable” works. It was asked in a newsgroup, and I thought I would help and provide some insight. First, I would like to repeat that, in my opinion, basing computer-not-reachable detection on a missing heartbeat while targeting “Health Service Watcher” is at least questionable and not very fortunate (it is almost like trying to find causality without a root-cause engine). There were performance limitations behind this design: we did not want to end up with as many workflows doing an ICMP ping (a blocking operation from the OpsMgr perspective) as there are computers in the topology, and we had no efficient targeting story for such computer-not-reachable recognition. So we now have this “nasty beast” where we execute a diagnostic on the heartbeat monitor state change, which is later supposed to (through a set of recoveries) cause a state change of the “Computer Not Reachable” monitor. This whole design was not very reliable in RTM, and I had to introduce some mitigations and reliability improvements in SP1 (which of course was not entirely positive, because two alerts are now raised instead of the one raised in RTM, but believe me, you want reliability :)).

Regardless of my rant, I decided to show how to add a custom diagnostic for the Heartbeat monitor, and eventually to answer the question and provide a management pack that sets the state of a custom monitor based on recovery output. This may be advanced material; please comment or ask questions through this blog if something requires further explanation (I will not discuss this on the newsgroup, because all of this information is about undocumented functionality that may break in any future release!).

Adding a custom diagnostic is really easy. In this case everything we need (the monitor and the target) is public and accessible outside of the management pack where those elements are defined. In the end it is just a matter of adding the following raw XML to a custom management pack (assuming you have all MP references set properly).

 

<Diagnostic ID="Microsoft.SystemCenter.Community.Diagnostic.NetViewDiagnostic" Comment="In response to heartbeat failure, net view machine" Accessibility="Internal" Enabled="true" Target="SCLibrary!Microsoft.SystemCenter.HealthServiceWatcher" Monitor="SC2007!Microsoft.SystemCenter.HealthService.Heartbeat" ExecuteOnState="Error" Remotable="true" Timeout="300">

  <Category>Maintenance</Category>

  <ProbeAction ID="Command" TypeID="System!System.CommandExecuterProbe">

    <ApplicationName><![CDATA[%windir%\System32\net.exe]]></ApplicationName>

    <WorkingDirectory />

    <CommandLine>view $Target/Property[Type="SCLibrary!Microsoft.SystemCenter.HealthServiceWatcher"]/HealthServiceName$</CommandLine>

    <TimeoutSeconds>30</TimeoutSeconds>

    <RequireOutput>true</RequireOutput>

  </ProbeAction>

</Diagnostic>

 

Acting on the result is where the real fun starts! First I need to mention that “Computer Not Reachable” is defined as internal, so it is not accessible, and a new monitor must be defined rather than disabling the original through an override included with the management pack. I will not say much about why it is an “aggregate” monitor; the only thing I will say is that in OpsMgr an aggregate monitor is the only kind of monitor without a workflow, and the runtime magically makes all the state changes. In our case we set the state through a recovery, so we do not want to “WASTE” a workflow (and there would be as many of them as there are monitored computers in the enterprise) if we never expect that workflow to set the state anyway. Also, there are some “magic” modules I WILL NOT DISCUSS (maybe ever, but we will see in the next release) that set the state of the monitor. (Some of you may be willing to do some reverse engineering, and you may get an idea of how we set the state to critical when the recovery result reports that the “net view” command failed, and how we set the state when the command succeeded.) So here is the recap through screenshots:

 

1.       After import, the monitor state is NEVER set until the net view command is executed

new computer has no state set

2.       When the RMS recognizes that the heartbeat is missing, we execute the “net view” command inside the diagnostic. When net view succeeds, we set the state of the monitor to “Healthy”; otherwise we set it to “Critical”

net view succeeded

success state change

net view failed

failure state change
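
The decision rule in step 2 can be sketched in a few lines of Python. This only illustrates the rule as described above; it is not the undocumented set of modules OpsMgr actually uses. The function names `state_from_exit_code` and `run_net_view` are hypothetical, and treating exit code 0 as success is an assumption about how `net view` reports its result.

```python
import subprocess


def state_from_exit_code(exit_code: int) -> str:
    """Map a command exit code to the monitor state described in step 2.

    Assumption: exit code 0 means "net view" succeeded.
    """
    return "Healthy" if exit_code == 0 else "Critical"


def run_net_view(computer_name: str, timeout_seconds: int = 30) -> int:
    """Run "net view <computer>" (Windows only) and return its exit code."""
    completed = subprocess.run(
        ["net", "view", computer_name],
        capture_output=True,
        timeout=timeout_seconds,
    )
    return completed.returncode


if __name__ == "__main__":
    # The mapping itself needs no network access.
    print(state_from_exit_code(0))
    print(state_from_exit_code(2))
```

On a Windows machine, `state_from_exit_code(run_net_view("SOMEHOST"))` would combine the two steps; `SOMEHOST` is a placeholder name.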

Attached is a management pack that provides this functionality. It may be used AS IS; it confers no rights and comes with no support. Enjoy!

Last contacted

Some customers want to know not only that the health service hosted on a computer stopped heartbeating (remember, this is very different from our approach to recognizing that a computer is down, which, purely in my own opinion, is a rather unfortunate attempt to detect something the health service watcher was not originally designed for), but also at least some information about the last time that health service contacted its management server. This post describes one possible solution.

This information is already present in Health Explorer (in some form, as we will see later), but it is not easy to locate and carries a “big” TCO. Here is how to find it in the SP1 version of OpsMgr 2007 anyway:

1.       Open “Health Explorer” for a health service watcher that was marked critical. (Health service watcher views are located in the “Operations Manager” folder, “Agent” subfolder, for those of you who never had to wander there.)

2.       Locate the “Health Service Heartbeat Failure” monitor and open the “State Change Events” tab.

3.       The context of the topmost state change to critical carries the data type that caused the state change, and its “Date and Time” carries the time when the runtime recognized that the heartbeat was missing.

Following is a screenshot from the next version of OpsMgr. We made some improvements in unavailability recognition and changed the internal plumbing for some of the “Health Service Watcher” monitors (that is outside the scope of this post; I may write another one describing the changes once the release date approaches). It shows that the data type still contains the same information about when the runtime recognized that the health service was not heartbeating, and that this information is present in “Date and Time” within the context of the state change:

 

Unavailability as recognized thru health explorer

 

As you can see, this is highly inefficient when multiple health services are not heartbeating and you want a quick view of when each missed heartbeat was recognized and approximately when each health service last contacted its server.

Before we do this, I need to explain a little about how availability is stored in our operational database. There is a table named “Availability”, and one of its columns is “LastModified”. Its value is the time when the runtime notified the SDK service about an availability change. That is not the time when the runtime was last contacted by the health service, though. The last contacted time can be calculated from the heartbeat interval and the number of heartbeats that must be missed before the SDK is notified that the heartbeat is missing. The interval and the number of misses are stored in global settings. That gives us the opportunity to create the following SQL script:

use OperationsManager

declare @substract float
declare @numberOfMissing float
declare @interval float

-- Number of heartbeats that must be missed before the machine is marked down
select @numberOfMissing = SettingValue from GlobalSettings GS
    join ManagedTypeProperty MTP with(nolock) on GS.ManagedTypePropertyId = MTP.ManagedTypePropertyId
    where MTP.ManagedTypePropertyName = 'NumberOfMissingHeartBeatsToMarkMachineDown'

-- Heartbeat interval (in seconds)
select @interval = SettingValue from GlobalSettings GS
    join ManagedTypeProperty MTP with(nolock) on GS.ManagedTypePropertyId = MTP.ManagedTypePropertyId
    where MTP.ManagedTypePropertyName = 'HeartbeatInterval'

-- A datetime cast to float counts days, so convert seconds using 86400 (seconds per day)
select @substract = (@numberOfMissing * @interval) / 86400.0

declare availCursor cursor
for
    select B.DisplayName, AH.TimeStarted from Availability A
    join BaseManagedEntity B with(nolock) on B.BaseManagedEntityId = A.BaseManagedEntityId
    join AvailabilityHistory AH with(nolock) on AH.BaseManagedEntityId = A.BaseManagedEntityId
    join (
        -- Latest availability change per (non-deleted) entity
        select MAX(AHTMP.TimeStarted) as MaxTimeStarted, BME.BaseManagedEntityId
        from AvailabilityHistory AHTMP
        join BaseManagedEntity BME with(nolock) on BME.BaseManagedEntityId = AHTMP.BaseManagedEntityId
        where BME.IsDeleted = 0
        group by BME.BaseManagedEntityId
    ) TMP on TMP.BaseManagedEntityId = AH.BaseManagedEntityId and AH.TimeStarted = TMP.MaxTimeStarted
    where A.IsAvailable = 0 and B.IsDeleted = 0

open availCursor

declare @name nvarchar(255)
declare @time datetime
declare @approxTime datetime

fetch next from availCursor into @name, @time
while @@FETCH_STATUS = 0
begin
    select @approxTime = cast((cast(@time as float) - @substract) as datetime)

    -- One result grid per unavailable entity; UTC-7 is Pacific Daylight Time
    select @name as Name, @time as Recognized, @approxTime as 'ApproxLastContactedTime (UTC)', dateadd(hh, -7, @approxTime) as 'ApproxLastContactedTime (Pacific)'

    fetch next from availCursor into @name, @time
end

close availCursor
deallocate availCursor

 

Results for last contacted

 

Comparing the result with the Health Explorer data, we can see that the recognized time is “equal” and the approximate last contacted time is calculated (by default it should be around 3 minutes before recognition). Maybe I will create a report in the future that displays this information in a more unified manner, but that is not my intent right now ...
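
The subtraction in the script can be checked by hand. A SQL Server `datetime` cast to `float` counts days, so an offset in seconds is converted to days by dividing by 86,400. The sketch below redoes that arithmetic in Python. The values of 3 missed heartbeats and a 60-second interval are assumptions chosen to match the “around 3 minutes” default mentioned above, and the function name and sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 86400


def approx_last_contacted(recognized: datetime, missed: int, interval_seconds: int) -> datetime:
    """Subtract (missed * interval) seconds, expressed in days, from the recognition time."""
    offset_days = (missed * interval_seconds) / SECONDS_PER_DAY
    return recognized - timedelta(days=offset_days)


recognized = datetime(2008, 5, 1, 12, 0, 0)  # hypothetical TimeStarted value (UTC)
last_contact = approx_last_contacted(recognized, missed=3, interval_seconds=60)
print(last_contact)  # 2008-05-01 11:57:00
```

With these assumed settings the offset is 180 seconds, or about 0.00208 days, which is the 3-minute gap between recognition and approximate last contact.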

Posted by MSutara | 2 Comments

Customizing monitor from sealed MP ("Device Status Check").

I used the Device Status Check monitor in my last post as a realistic sample of how to define alert settings through overrides. Unfortunately, mainly because override values are restricted with regard to parameter replacement, the description of such an alert could not benefit from property values of its target instance (it could not use the device name, IP address, and so on in the alert description). This led me to work on this post, although it is quite possible that others have already done similar work …

So, to help Tim, I decided to customize that monitor without forcing the customization through overrides. I used a regular text editor to create everything in raw XML, but here is how I would approach this problem if I had no access to the “internal” unsealed version of the shipped MP and needed to use our authoring tools.

 

·         Create a new empty management pack in our authoring console. I only need to do this if I do not already have an MP with customizations and/or overrides for that particular managed entity type.

·         Next, find the management pack holding the definition of the monitor I want to customize, and export it. (The UI won’t allow exporting a sealed MP, but we have PowerShell!)

$monitor = Get-Monitor -Criteria "DisplayName = 'Device Status Check'"

$monitor.GetManagementPack() | Export-ManagementPack -Path c:\

·         Open this MP in a regular text editor (it is stored in the path passed to the cmdlet) and copy the unit monitor I want to customize into my custom MP file. This whole approach stands or falls on the accessibility of the managed entity used as the target, as well as the accessibility of the monitor type the unit monitor implements. The target of your monitor MUST be public (to be accessible outside of the MP where it is defined), and the same goes for the monitor type you are implementing (the monitor type is given by the TypeID attribute). Please be aware of referencing: when copying the monitor into your custom MP, you need to provide a reference to the MP where those elements are defined. Please observe how it is done in the attached file. The ID of this monitor also needs to be changed; it is sufficient to simply append the string Custom to the original ID. Finally, this monitor needs to go into the Monitoring -> Monitors section; please create those tags if they are not present.

NOTE: If you do not make this effort, your MP will not pass verification and the authoring console will not load your MP for further customization!

It is also a very good idea to copy the KnowledgeBase article for your new custom monitor (if a KB article exists for the original monitor) into your custom MP. Just remember to adjust the ID of this article to match the ID you chose for your monitor. (The attached MP file should be a sufficient sample to follow for this paragraph.)

·         Now save this file and re-open it in the authoring console. Go to the Monitors section and provide all display strings so that the SDK can display user-friendly strings rather than “ugly” MP element ID strings.

·         Make changes to the alert settings of your custom monitor. Remember that you can benefit from all replacements here.

·         Disable the original monitor. You can do this in the UI through the operations console, or you can manually add an override by editing the raw XML (again, we do not have overrides in the shipped authoring console yet). The latter is, in my opinion, the much better option: it lets you deploy a compact solution to your customer that disables the original monitor and provides the customized approach in one MP import.

 

Original monitor is disabled

 

·         “MPVerify” your file (you can do this through the Tools section of the authoring console), save the file, import it into the product, and test its functionality. Once satisfied with the results, seal the MP and ship it to your customers …
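
The copy-and-rename step from the list above can also be scripted. The following is a minimal Python sketch, assuming the monitor has already been copied out of the exported MP into a string. The helper name `customize_monitor_id` and the trimmed sample XML are illustrative only; a real custom MP additionally needs the references, display strings and overrides described in the steps above.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative copy of a unit monitor; a real one carries more attributes and children.
SAMPLE_MONITOR = """
<UnitMonitor ID="Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus"
             Accessibility="Public" Enabled="true">
  <Category>PerformanceHealth</Category>
</UnitMonitor>
"""


def customize_monitor_id(monitor_xml: str, suffix: str = "Custom") -> ET.Element:
    """Append a suffix to the monitor ID and wrap it in Monitoring -> Monitors."""
    monitor = ET.fromstring(monitor_xml.strip())
    monitor.set("ID", monitor.get("ID") + suffix)

    monitoring = ET.Element("Monitoring")
    monitors = ET.SubElement(monitoring, "Monitors")
    monitors.append(monitor)
    return monitoring


result = customize_monitor_id(SAMPLE_MONITOR)
print(ET.tostring(result, encoding="unicode"))
```

The printed fragment still has to be pasted into the custom MP and verified there; the sketch only covers the ID change and the Monitoring -> Monitors wrapper.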

 

Next is a screenshot that shows the alert raised after the customized monitor picked up a problem with SNMP. My customized MP is attached in the hope that it benefits some of you; just please remember that its use is bound by the usual AS-IS clause…

Alert raised for custom monitor

Dynamic alert settings for monitor which never had any (done thru overrides)

Tim from QuickenLoans contacted me and wanted to verify that the AlertMessage override (discussed in my previous post) can be used with a unit monitor for an SNMP network device. He was using the “Device Status Check” monitor, and we can see that the initial requirement for override use is met: the monitor is PUBLIC.

<UnitMonitor ID="Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus" Accessibility="Public" Target="Microsoft.SystemCenter.NetworkDevice" Enabled="true" TypeID="Microsoft.SystemCenter.NetworkDevice.CheckDeviceState" ParentMonitorID="Health!System.Health.AvailabilityState">

  <Category>PerformanceHealth</Category>

  <OperationalStates>

    <OperationalState HealthState="Success" MonitorTypeStateID="DeviceUp" ID="Success" />

    <OperationalState HealthState="Error" MonitorTypeStateID="DeviceDown" ID="Error" />

  </OperationalStates>

  <Configuration>

    <Interval>120</Interval>

    <IsWriteAction>false</IsWriteAction>

    <IP>$Target/Property[Type="Microsoft.SystemCenter.NetworkDevice"]/IPAddress$</IP>

    <CommunityString>$Target/Property[Type="Microsoft.SystemCenter.NetworkDevice"]/CommunityString$</CommunityString>

    <Version>$Target/Property[Type="Microsoft.SystemCenter.NetworkDevice"]/Version$</Version>

    <SnmpVarBinds>

      <SnmpVarBind>

        <OID>.1.3.6.1.2.1.1.5.0</OID>

        <Syntax>1</Syntax>

        <Value VariantType="8" />

      </SnmpVarBind>

    </SnmpVarBinds>

  </Configuration>

</UnitMonitor>

 

As is evident above (and to my surprise), this monitor doesn’t have any alert associated with its state change! Luckily, the overrides required to raise an alert are accessible through the UI, which leads me to believe that some of you have been able to raise an alert already. For those who never did, or never needed to, we will use the GenerateAlert, AutoResolve, AlertOnState, AlertPriority and AlertSeverity overrides. I’m going to include those overrides in my management pack directly and will not spend much time on them, as they should be self-explanatory.

<!-- generate alert -->

<MonitorPropertyOverride ID="MonitorPropertyOverrideGenerateAlert" Context="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice" Enforced="false" Monitor="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus" Property="GenerateAlert">

  <Value>true</Value>

</MonitorPropertyOverride>

 

<!-- auto-resolve this alert when state improves -->

<MonitorPropertyOverride ID="MonitorPropertyOverrideAutoResolve" Context="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice" Enforced="false" Monitor="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus" Property="AutoResolve">

  <Value>true</Value>

</MonitorPropertyOverride>

 

<!-- minimal state used for alert creation -->

<MonitorPropertyOverride ID="MonitorPropertyOverrideAlertOnState" Context="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice" Enforced="false" Monitor="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus" Property="AlertOnState">

  <Value>Error</Value>

</MonitorPropertyOverride>

 

<!-- priority -->

<MonitorPropertyOverride ID="MonitorPropertyOverrideAlertPriority" Context="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice" Enforced="false" Monitor="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus" Property="AlertPriority">

  <Value>Normal</Value>

</MonitorPropertyOverride>

 

<!-- severity -->

<MonitorPropertyOverride ID="MonitorPropertyOverrideAlertSeverity" Context="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice" Enforced="false" Monitor="NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus" Property="AlertSeverity">

  <Value>MatchMonitorHealth</Value>

</MonitorPropertyOverride>
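
The five overrides above differ only in their ID suffix, property name and value, so they can be generated from a small table. Here is a minimal Python sketch, assuming the `NetworkDeviceLibrary` alias used above. The function name `build_overrides` is hypothetical, and the output is only the override fragment, not a complete management pack.

```python
import xml.etree.ElementTree as ET

TARGET = "NetworkDeviceLibrary!Microsoft.SystemCenter.NetworkDevice"
MONITOR = TARGET + ".CheckDeviceStatus"

# Property name -> value, copied from the overrides shown above
ALERT_OVERRIDES = {
    "GenerateAlert": "true",
    "AutoResolve": "true",
    "AlertOnState": "Error",
    "AlertPriority": "Normal",
    "AlertSeverity": "MatchMonitorHealth",
}


def build_overrides(properties: dict) -> list:
    """Build one MonitorPropertyOverride element per property/value pair."""
    overrides = []
    for name, value in properties.items():
        element = ET.Element(
            "MonitorPropertyOverride",
            {
                "ID": "MonitorPropertyOverride" + name,
                "Context": TARGET,
                "Enforced": "false",
                "Monitor": MONITOR,
                "Property": name,
            },
        )
        ET.SubElement(element, "Value").text = value
        overrides.append(element)
    return overrides


for override in build_overrides(ALERT_OVERRIDES):
    print(ET.tostring(override, encoding="unicode"))
```

Keeping the values in one dictionary makes it easy to see at a glance which alert settings the MP changes.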

 

But as some of you are probably aware, those overrides only cause the alert to be raised; the alert will not have any user-friendly description. This is where my previous post comes in handy: it provides a guide on how to customize an alert description. Our new alert needs a description, which amounts to customizing it anyway, so the same process applies here as well. First we need to define the alert description and retrieve its GUID from the database after it has been imported (described in my previous post).

<StringResource ID="Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus.Override.AlertMessageResourceID" />

 

<DisplayString ElementID="Microsoft.SystemCenter.NetworkDevice.CheckDeviceStatus.Override.AlertMessageResourceID">

  <Name>Network device is down</Name>

  <Description>Network device identified by commu