Some customers want to know not only that a health service hosted on a computer has stopped heartbeating (remember, this is quite different from our approach to recognizing that a computer is down, which, purely in my own opinion, is a rather unfortunate attempt to detect something the health service watcher was not originally designed for), but also when that health service last contacted its management server. This post describes one possible solution.

This information is already present in Health Explorer (in some form, as we will see later), but it is not easy to locate and comes with a “big” TCO. Here is how to find it in the SP1 version of OpsMgr 2007 anyway:

1.       Open “Health Explorer” from the health service watcher that was marked critical. (Health service watcher views are located in the “Operations Manager” folder, “Agent” subfolder, for those of you who never had to wander there.)

2.       Locate the “Health Service Heartbeat Failure” monitor and open its “State Change Events” tab.

3.       The context of the topmost state change to critical carries the data type that caused the state change, and its “Date and Time” value shows when the runtime recognized that the heartbeat was missing.

The following screenshot is from the next version of OpsMgr. We made some improvements in unavailability recognition and changed the internal plumbing for some of the “Health Service Watcher” monitors (that is outside the scope of this post, and I may write another one describing the changes once the release date approaches). It shows that the data type used still contains the same information about when the runtime recognized that the health service was not heartbeating, and that this information is present in “Date and Time” within the context of the state change:

 

[Screenshot: unavailability as recognized through Health Explorer]

 

As you can see, this is highly ineffective when multiple health services are not heartbeating and you want a quick view showing when each missed heartbeat was recognized and approximately when the given health service last contacted its server.

Before we build such a view, I need to explain a little about how availability is stored in our Operational Database. There is a table called “Availability”. One of its columns is “LastModified”. Its value is the time when the runtime notified the SDK service about the availability change. That is not the time when the runtime was last contacted by the health service, though. The last contacted time can be calculated from the heartbeat interval and the number of heartbeats that must be missed before the SDK is notified that the heartbeat is missing. The values for the interval and the number of misses are stored in global settings. That gives us the opportunity to create the following SQL script:

use OperationsManager

-- Window to subtract from the recognition time, plus the two global settings it is built from.
declare @substract float
declare @numberOfMissing float
declare @interval float

-- Number of heartbeats that must be missed before the health service is reported as unavailable.
select @numberOfMissing = SettingValue from GlobalSettings GS
    join ManagedTypeProperty MTP with(nolock) on GS.ManagedTypePropertyId = MTP.ManagedTypePropertyId
    where MTP.ManagedTypePropertyName = 'NumberOfMissingHeartBeatsToMarkMachineDown'

-- Heartbeat interval, in seconds.
select @interval = SettingValue from GlobalSettings GS
    join ManagedTypeProperty MTP with(nolock) on GS.ManagedTypePropertyId = MTP.ManagedTypePropertyId
    where MTP.ManagedTypePropertyName = 'HeartbeatInterval'

 

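-- A datetime cast to float is expressed in days, so convert seconds to days (86400 seconds per day).
select @substract = (@numberOfMissing * @interval)/86400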

 

-- For every unavailable, non-deleted entity, pick its most recent availability history entry.
declare availCursor cursor
for
    select B.DisplayName, AH.TimeStarted from Availability A
    join BaseManagedEntity B with(nolock) on B.BaseManagedEntityId = A.BaseManagedEntityId
    join AvailabilityHistory AH with(nolock) on AH.BaseManagedEntityId = A.BaseManagedEntityId
    join (
        select MAX(AHTMP.TimeStarted) AS MaxTimeStarted, BME.BaseManagedEntityId
        from AvailabilityHistory AHTMP
        join BaseManagedEntity BME with(nolock) on BME.BaseManagedEntityId = AHTMP.BaseManagedEntityId
        where BME.IsDeleted = 0
        group by BME.BaseManagedEntityId
    -- Match on the entity as well as the timestamp, so a row is never paired with another entity's latest entry.
    ) TMP on TMP.BaseManagedEntityId = AH.BaseManagedEntityId and AH.TimeStarted = TMP.MaxTimeStarted
    where A.IsAvailable = 0 and B.IsDeleted = 0

 

open availCursor

declare @name nvarchar(255)
declare @time datetime
declare @approxTime datetime

fetch next from availCursor into @name, @time
while @@FETCH_STATUS = 0
begin
    -- Subtract the missed-heartbeat window (in fractional days) from the recognition time.
    select @approxTime = cast((cast(@time as float) - @substract) as datetime)

    -- The -7 hour offset is Pacific Daylight Time; use -8 for Pacific Standard Time.
    select @name as Name, @time as Recognized, @approxTime as 'ApproxLastContactedTime (UTC)', dateadd(hh, -7, @approxTime) as 'ApproxLastContactedTime (Pacific)'
    fetch next from availCursor into @name, @time
end

close availCursor
deallocate availCursor

 

[Screenshot: results for last contacted]

 

Comparing the result with the Health Explorer data, we can see that the recognized time is “equal” and the approximate last contacted time is calculated (by default it should be around 3 minutes before recognition). Maybe I will create a report in the future that displays this information in a more unified manner, but that is not my intent right now ...
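
The cursor in the script above returns one result set per health service, which is awkward to read or export when many agents are down. As a rough sketch only (not tested against a real OperationsManager database), the same calculation can be written as a single set-based query. It assumes the same tables, columns and global setting names used in the script above, and that HeartbeatInterval is stored in seconds; the result column aliases are my own choice.

```sql
use OperationsManager

declare @numberOfMissing float
declare @interval float

-- Same two global settings as in the cursor-based script.
select @numberOfMissing = SettingValue from GlobalSettings GS
    join ManagedTypeProperty MTP with(nolock) on GS.ManagedTypePropertyId = MTP.ManagedTypePropertyId
    where MTP.ManagedTypePropertyName = 'NumberOfMissingHeartBeatsToMarkMachineDown'

select @interval = SettingValue from GlobalSettings GS
    join ManagedTypeProperty MTP with(nolock) on GS.ManagedTypePropertyId = MTP.ManagedTypePropertyId
    where MTP.ManagedTypePropertyName = 'HeartbeatInterval'

-- One row per unavailable entity: latest history entry minus the missed-heartbeat window.
-- Assumption: the interval is in seconds, so dateadd(second, ...) can be used directly.
select B.DisplayName as Name,
       TMP.MaxTimeStarted as Recognized,
       dateadd(second, -cast(@numberOfMissing * @interval as int), TMP.MaxTimeStarted) as [ApproxLastContactedTime (UTC)]
from Availability A
join BaseManagedEntity B with(nolock) on B.BaseManagedEntityId = A.BaseManagedEntityId
join (
    -- Most recent availability history entry for each entity.
    select AH.BaseManagedEntityId, max(AH.TimeStarted) as MaxTimeStarted
    from AvailabilityHistory AH with(nolock)
    group by AH.BaseManagedEntityId
) TMP on TMP.BaseManagedEntityId = A.BaseManagedEntityId
where A.IsAvailable = 0 and B.IsDeleted = 0
order by TMP.MaxTimeStarted desc
```

Using dateadd with seconds avoids the float conversion entirely. The local-time column is left out on purpose, because a fixed offset from UTC is only right for part of the year.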