Just some shameless personal plug here, pointing out that I recently wrote two technical posts on the momteam blog about the APM feature in Operations Manager 2012 – maybe you want to check them out:
Hope you find them useful – if you are one of my “OpsMgr readers”
Go read the announcement at http://blogs.technet.com/b/server-cloud/archive/2011/11/10/system-center-operations-manager-2012-release-candidate-from-the-datacenter-to-the-cloud.aspx
This is the first public release since I joined the team (I started in this role the day after the team had shipped Beta) and this is the first release that contains some direct output of my work. It feels so good!
Documentation has also been refreshed – it starts here http://technet.microsoft.com/en-us/library/hh205987.aspx
The part specifically about the APM feature is here http://technet.microsoft.com/en-us/library/hh457578.aspx
Enjoy!
I just saw that my former colleague (PFE) Tristan has posted an interesting note about the use of SetSPN “–A” vs SetSPN “–S”. I normally don’t repost other people’s content, but I thought this would be useful as there are a few SPNs used in OpsMgr and it is not always easy to get them all right… and you can find a few tricks I was not aware of, by reading his post.
Check out the original post at http://blogs.technet.com/b/tristank/archive/2011/10/10/psa-you-really-need-to-update-your-kerberos-setup-documentation.aspx
Hey, I have just realized that I have been in my new PM role for a month already – time flies!
If you are one of my OpsMgr readers, in case you haven’t noticed, I have been silent here but I have published a post on the momteam blog – check it out: http://blogs.technet.com/b/momteam/archive/2011/08/12/application-performance-monitoring-in-opsmgr-2012-beta.aspx
If you are one of those few readers interested in following what I do, instead – I can tell you that I am loving the new job. Lots to do, of course, and that also applies to the private sphere – did you know that relocating to another continent takes some energy and effort? - but we are settling in nicely and things are going very smoothly overall.
I have been in Premier Field Engineering for nearly 7 years (it was not even called PFE when I joined - it was just "another type of support"...) and I have to admit that it has been a fun, fun ride: I worked with awesome people and managed to make a difference with our products and services for many customers - directly working with some of those customers, as well as indirectly thru the OpsMgr Health Check program - the service I led for the last 3+ years, which nowadays gets delivered hundreds of times a year around the globe by my other fellow PFEs.
But it is time to move on: I have decided to go thru a big life change for me and my family, and I won't be working as a Premier Field Engineer anymore as of next week.
But don't panic - I am staying at Microsoft!
I have actually never been closer to Microsoft than now: we are packing and moving to Seattle the coming weekend, and on July 18th I will start working as a Program Manager in the Operations Manager product team, in Redmond. I am hoping this will enable me to make a difference with even more customers.
Exciting times ahead - wish me luck!
That said – PFE is hiring! If you are interested in working for Microsoft – we have open positions (including my vacant position in Italy) for almost all the Microsoft technologies. Simply visit http://careers.microsoft.com and search on “PFE”.
As for the OpsMgr Health Check, don't you worry: it will continue being improved - I left it in the hands of some capable colleagues: Bruno Gabrielli, Stefan Stranger and Tim McFadden - and they have a plan and commitment to update it to OpsMgr 2012.
This has been sitting on my hard drive for a long time. Long story short, the report I posted in “Audit Collection Services Database Partitions Size Report” had a couple of bugs:
I fixed both bugs, but I don’t have a machine with SQL 2005 and Visual Studio 2005 anymore… so I can’t rebuild my report – but I don’t want to distribute one that only works on SQL 2008 because I know that SQL2005 is still out there. This is partially the reason that held this post back.
So, rather than waiting any longer, I decided to just give you the fixed query. Enjoy!
--Query to get the Partition Table
--for each partition we launch the sp_spaceused stored procedure to determine the size and other info

--partition list
select PartitionId, Status, PartitionStartTime, PartitionCloseTime
into #t1
from dbo.dtPartition with (nolock)
order by PartitionStartTime desc

--sp_spaceused holder table for dtEvent
create table #t2 (
    PartitionId nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    rows nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    reserved nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    data nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    index_size nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    unused nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS
)

--sp_spaceused holder table for dtString
create table #t3 (
    PartitionId nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    rows nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    reserved nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    data nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    index_size nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS,
    unused nvarchar(MAX) Collate SQL_Latin1_General_CP1_CI_AS
)

set nocount on

--vars used for building Partition GUID and main table name
declare @partGUID nvarchar(MAX)
declare @tblName nvarchar(MAX)
declare @tblNameComplete nvarchar(MAX)
declare @schema nvarchar(MAX)
DECLARE @vQuery NVARCHAR(MAX)

--cursor
declare c cursor for select PartitionID from #t1
open c
fetch next from c into @partGUID

--start cursor usage
while @@FETCH_STATUS = 0
begin
    --tblName - first usage for dtEvent
    set @tblName = 'dtEvent_' + @partGUID
    --retrieve the schema name
    SET @vQuery = 'SELECT @dbschema = TABLE_SCHEMA from INFORMATION_SCHEMA.tables where TABLE_NAME = ''' + @tblName + ''''
    EXEC sp_executesql @vQuery, N'@dbschema nvarchar(max) out, @dbtblName nvarchar(max)', @schema out, @tblname
    --tblNameComplete
    set @tblNameComplete = @schema + '.' + @tblName
    INSERT #t2 EXEC sp_spaceused @tblNameComplete

    --tblName - second usage for dtString
    set @tblName = 'dtString_' + @partGUID
    --retrieve the schema name
    SET @vQuery = 'SELECT @dbschema = TABLE_SCHEMA from INFORMATION_SCHEMA.tables where TABLE_NAME = ''' + @tblName + ''''
    EXEC sp_executesql @vQuery, N'@dbschema nvarchar(max) out, @dbtblName nvarchar(max)', @schema out, @tblname
    --tblNameComplete
    set @tblNameComplete = @schema + '.' + @tblName
    INSERT #t3 EXEC sp_spaceused @tblNameComplete

    fetch next from c into @partGUID
end
close c
deallocate c

--select * from #t2
--select * from #t3

--results (sp_spaceused returns sizes as text such as '1234 KB', so the trailing ' KB' is cut off before casting)
select #t1.PartitionId, #t1.Status, #t1.PartitionStartTime, #t1.PartitionCloseTime, #t2.rows,
    (CAST(LEFT(#t2.reserved,LEN(#t2.reserved)-3) AS NUMERIC(18,0)) + CAST(LEFT(#t3.reserved,LEN(#t3.reserved)-3) AS NUMERIC(18,0))) as 'reservedKB',
    (CAST(LEFT(#t2.data,LEN(#t2.data)-3) AS NUMERIC(18,0)) + CAST(LEFT(#t3.data,LEN(#t3.data)-3) AS NUMERIC(18,0))) as 'dataKB',
    (CAST(LEFT(#t2.index_size,LEN(#t2.index_size)-3) AS NUMERIC(18,0)) + CAST(LEFT(#t3.index_size,LEN(#t3.index_size)-3) AS NUMERIC(18,0))) as 'indexKB',
    (CAST(LEFT(#t2.unused,LEN(#t2.unused)-3) AS NUMERIC(18,0)) + CAST(LEFT(#t3.unused,LEN(#t3.unused)-3) AS NUMERIC(18,0))) as 'unusedKB'
from #t1
join #t2 on #t2.PartitionId = ('dtEvent_' + #t1.PartitionId)
join #t3 on #t3.PartitionId = ('dtString_' + #t1.PartitionId)
order by PartitionStartTime desc

--cleanup
drop table #t1
drop table #t2
drop table #t3
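If the string handling in the final SELECT looks odd: sp_spaceused hands back its sizes as text with a " KB" suffix, so the query strips the last three characters before adding the dtEvent and dtString figures together. Here is the same idea as a tiny shell sketch, just for illustration; the function names and the sample values are made up and are not part of the report.

```shell
#!/bin/sh
# Strip the trailing " KB" that sp_spaceused appends to its size columns.
kb_value() {
    printf '%s\n' "${1% KB}"
}

# Add the dtEvent and dtString figures of one partition (both given as "N KB").
partition_total_kb() {
    echo $(( $(kb_value "$1") + $(kb_value "$2") ))
}

# Made-up sample values, only to show the call.
partition_total_kb "1024 KB" "256 KB"
```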
The following article by Jimmy Harper explains very well how to set up agents and gateways’ failover paths thru Powershell http://blogs.technet.com/b/jimmyharper/archive/2010/07/23/powershell-commands-to-configure-gateway-server-agent-failover.aspx . This is the approach I also recommend, and that article is great – I encourage you to check it out if you haven’t done it yet!
Anyhow, when checking for the actual failover paths that have been configured, the use of Powershell suggested by Jimmy is rather slow – especially if your agent count is high. In the Operations Manager Health Check tool I was also using that technique at the beginning, but eventually moved to the use of SQL queries just for performance reasons. Since then, we have been using these SQL queries quite successfully for about 3 years now.
But this is the season of giving... and I guess SQL queries can be a gift, right? Therefore I am now donating them as a Christmas gift to the OpsMgr community.
Enjoy – and Merry Christmas!
--GetAgentForWhichServerIsPrimary
SELECT SourceBME.DisplayName as Agent, TargetBME.DisplayName as Server
FROM Relationship R WITH (NOLOCK)
JOIN BaseManagedEntity SourceBME ON R.SourceEntityID = SourceBME.BaseManagedEntityID
JOIN BaseManagedEntity TargetBME ON R.TargetEntityID = TargetBME.BaseManagedEntityID
WHERE R.RelationshipTypeId = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceCommunication()
AND SourceBME.DisplayName not in (
    select DisplayName from dbo.ManagedEntityGenericView WITH (NOLOCK)
    where MonitoringClassId in (
        select ManagedTypeId from dbo.ManagedType WITH (NOLOCK)
        where TypeName = 'Microsoft.SystemCenter.GatewayManagementServer')
    and IsDeleted ='0')
AND SourceBME.DisplayName not in (
    select DisplayName from dbo.ManagedEntityGenericView WITH (NOLOCK)
    where MonitoringClassId in (
        select ManagedTypeId from dbo.ManagedType WITH (NOLOCK)
        where TypeName = 'Microsoft.SystemCenter.ManagementServer')
    and IsDeleted ='0')
AND R.IsDeleted = '0'

--GetAgentForWhichServerIsFailover
SELECT SourceBME.DisplayName as Agent, TargetBME.DisplayName as Server
FROM Relationship R WITH (NOLOCK)
JOIN BaseManagedEntity SourceBME ON R.SourceEntityID = SourceBME.BaseManagedEntityID
JOIN BaseManagedEntity TargetBME ON R.TargetEntityID = TargetBME.BaseManagedEntityID
WHERE R.RelationshipTypeId = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceSecondaryCommunication()
AND SourceBME.DisplayName not in (
    select DisplayName from dbo.ManagedEntityGenericView WITH (NOLOCK)
    where MonitoringClassId in (
        select ManagedTypeId from dbo.ManagedType WITH (NOLOCK)
        where TypeName = 'Microsoft.SystemCenter.GatewayManagementServer')
    and IsDeleted ='0')
AND SourceBME.DisplayName not in (
    select DisplayName from dbo.ManagedEntityGenericView WITH (NOLOCK)
    where MonitoringClassId in (
        select ManagedTypeId from dbo.ManagedType WITH (NOLOCK)
        where TypeName = 'Microsoft.SystemCenter.ManagementServer')
    and IsDeleted ='0')
AND R.IsDeleted = '0'

--GetGatewayForWhichServerIsPrimary
SELECT SourceBME.DisplayName as Gateway, TargetBME.DisplayName as Server
FROM Relationship R WITH (NOLOCK)
JOIN BaseManagedEntity SourceBME ON R.SourceEntityID = SourceBME.BaseManagedEntityID
JOIN BaseManagedEntity TargetBME ON R.TargetEntityID = TargetBME.BaseManagedEntityID
WHERE R.RelationshipTypeId = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceCommunication()
AND SourceBME.DisplayName in (
    select DisplayName from dbo.ManagedEntityGenericView WITH (NOLOCK)
    where MonitoringClassId in (
        select ManagedTypeId from dbo.ManagedType WITH (NOLOCK)
        where TypeName = 'Microsoft.SystemCenter.GatewayManagementServer')
    and IsDeleted ='0')
AND R.IsDeleted = '0'

--GetGatewayForWhichServerIsFailover
SELECT SourceBME.DisplayName As Gateway, TargetBME.DisplayName as Server
FROM Relationship R WITH (NOLOCK)
JOIN BaseManagedEntity SourceBME ON R.SourceEntityID = SourceBME.BaseManagedEntityID
JOIN BaseManagedEntity TargetBME ON R.TargetEntityID = TargetBME.BaseManagedEntityID
WHERE R.RelationshipTypeId = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceSecondaryCommunication()
AND SourceBME.DisplayName in (
    select DisplayName from dbo.ManagedEntityGenericView WITH (NOLOCK)
    where MonitoringClassId in (
        select ManagedTypeId from dbo.ManagedType WITH (NOLOCK)
        where TypeName = 'Microsoft.SystemCenter.GatewayManagementServer')
    and IsDeleted ='0')
AND R.IsDeleted = '0'

--xplat agents
select bme2.DisplayName as XPlatAgent, bme.DisplayName as Server
from dbo.Relationship r with (nolock)
join dbo.RelationshipType rt with (nolock) on r.RelationshipTypeId = rt.RelationshipTypeId
join dbo.BasemanagedEntity bme with (nolock) on bme.basemanagedentityid = r.SourceEntityId
join dbo.BasemanagedEntity bme2 with (nolock) on r.TargetEntityId = bme2.BaseManagedEntityId
where rt.RelationshipTypeName = 'Microsoft.SystemCenter.HealthServiceManagesEntity'
and bme.IsDeleted = 0
and r.IsDeleted = 0
and bme2.basemanagedtypeid in (
    SELECT DerivedTypeId FROM DerivedManagedTypes with (nolock)
    WHERE BaseTypeId = (select managedtypeid from managedtype where typename = 'Microsoft.Unix.Computer')
    and DerivedIsAbstract = 0)
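One thing you might want to do with the output of the first two queries is spot the agents that have a primary server but no failover configured at all. The following is only a minimal sketch of that comparison, assuming you have saved each result set as a headerless "Agent,Server" CSV file; the file names, the function name and the sample data are made up for illustration.

```shell
#!/bin/sh
# List agents that appear in the "primary" result set but not in the "failover" one.
# $1 = primary CSV, $2 = failover CSV; both contain "Agent,Server" lines, no header.
agents_without_failover() {
    awk -F, '
        # First file on the command line is the failover list: remember its agents.
        FILENAME == ARGV[1] { has_failover[$1] = 1; next }
        # Second file is the primary list: print agents never seen in the failover list.
        !($1 in has_failover) { print $1 }
    ' "$2" "$1"
}

# Made-up sample data, only to show the call.
tmp=$(mktemp -d)
cat > "$tmp/primary.csv" <<'EOF'
agent1.domain.dom,ms1.domain.dom
agent2.domain.dom,ms1.domain.dom
agent3.domain.dom,ms2.domain.dom
EOF
cat > "$tmp/failover.csv" <<'EOF'
agent1.domain.dom,ms2.domain.dom
agent3.domain.dom,ms1.domain.dom
EOF
agents_without_failover "$tmp/primary.csv" "$tmp/failover.csv"
rm -rf "$tmp"
```

The same comparison works for the two gateway queries, since their output has the same two-column shape.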
Have you ever wondered what would happen if, in Operations Manager, you deleted a Management Server or Gateway that manages objects (such as network devices) or has agents pointing uniquely to it as their primary server?
The answer is simple, but not very pleasant: you get ORPHANED objects, which will linger in the database but you won’t be able to “see” or re-assign anymore from the GUI.
So the first thing I want to share is a query to determine IF you have any of those orphaned agents. Or even if you know, since you are not able to "see" them from the console, you might have to dig their name out of the database. Here's a query I got from a colleague in our reactive support team:
-- Check for orphaned health services (e.g. agent).
declare @DiscoverySourceId uniqueidentifier;
SET @DiscoverySourceId = dbo.fn_DiscoverySourceId_User();
SELECT TME.[TypedManagedEntityid], HS.PrincipalName
FROM MTV_HealthService HS
INNER JOIN dbo.[BaseManagedEntity] BHS WITH(nolock)
    ON BHS.[BaseManagedEntityId] = HS.[BaseManagedEntityId]
-- get host managed computer instances
INNER JOIN dbo.[TypedManagedEntity] TME WITH(nolock)
    ON TME.[BaseManagedEntityId] = BHS.[TopLevelHostEntityId]
    AND TME.[IsDeleted] = 0
INNER JOIN dbo.[DerivedManagedTypes] DMT WITH(nolock)
    ON DMT.[DerivedTypeId] = TME.[ManagedTypeId]
INNER JOIN dbo.[ManagedType] BT WITH(nolock)
    ON DMT.[BaseTypeId] = BT.[ManagedTypeId]
    AND BT.[TypeName] = N'Microsoft.Windows.Computer'
-- only with missing primary
LEFT OUTER JOIN dbo.Relationship HSC WITH(nolock)
    ON HSC.[SourceEntityId] = HS.[BaseManagedEntityId]
    AND HSC.[RelationshipTypeId] = dbo.fn_RelationshipTypeId_HealthServiceCommunication()
    AND HSC.[IsDeleted] = 0
INNER JOIN DiscoverySourceToTypedManagedEntity DSTME WITH(nolock)
    ON DSTME.[TypedManagedEntityId] = TME.[TypedManagedEntityId]
    AND DSTME.[DiscoverySourceId] = @DiscoverySourceId
WHERE HS.[IsAgent] = 1
AND HSC.[RelationshipId] IS NULL;
Once you have identified the agent you need to re-assign to a new management server, this is doable from the SDK. Below is a powershell script I wrote which will re-assign it to the RMS. It has to run from within the OpsMgr Command Shell. You still need to change the logic which chooses which agent - this is meant as a starting base... you could easily expand it into accepting parameters and/or consuming an input text file, or using a different Management Server than the RMS... you get the point.
Similarly, you might get orphaned network devices. The script below is used to re-assign all Network Devices to the RMS. This script is actually something I have had even before the other one (yes, it has been sitting in my "digital drawer" for a couple of years or more...) and uses the same concept - only you might notice that the relation's source and target are "reversed", since the relationships are different:
With a bit of added logic it should be easy to have it work for specific devices.
Disclaimer
The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
I have had the following in my notes for a while… and I have not blogged in a while (been too busy) so I decided to blog it today, before the topic gets too old and starts stinking
It all started when a customer showed me an Alert he was seeing in his environment from some XPlat workflow. The alert looks like the following:
Generic Performance Mapper Module Failed Execution
Alert Description
Source: RLWSCOM02.domain.dom
Module was unable to convert parameter to a double value
Original parameter: '$Data///*[local-name()="BytesPerSecond"]$'
Parameter after $Data replacement: ''
Error: 0x80020005
Details: Type mismatch.
One or more workflows were affected by this.
Workflow name: Microsoft.Linux.RHEL.5.LogicalDisk.DiskBytesPerSecond.Collection
Instance name: /
Instance ID: {4F6FA8F5-C56F-4C9B-ED36-12DAFF4073D1}
Management group: DataCenter
Path: RLWSCOM02.domain.dom\RLWSCOM02.domain.dom
Alert Rule: Generic Performance Mapper Module Runtime Failure
Created: 6/28/2010 11:30:28 PM
First I stumbled into this forum post which mentions the same symptom http://social.technet.microsoft.com/Forums/en-US/crossplatformgeneral/thread/62e0bf3e-be6f-4218-a37b-f1e66f02aa49 - but when looking at the resolution, the locale on the customer machine was good (== set to US settings), so I concluded that it was not the same root cause.
Then I looked at what that rule was supposed to do, and queried the same CIM class both remotely thru WS-Man and locally via CIM, and concluded that my issue was that certain values were returning as NULL while we were expecting to see a number on the Management Server – therefore the Type Mismatch!
I have explained previously how to run CIM queries against the XPlat agent; in this case it was the following one:
winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_FileSystemStatisticalInformation?__cimnamespace=root/scx -username:scomuser -password:password -r:https://rllspago01.domain.dom:1270/wsman -auth:basic -skipCACheck -skipCNCheck
SCX_FileSystemStatisticalInformation
AverageDiskQueueLength = null
AverageTransferTime = null
BytesPerSecond = null
Caption = File system information
Description = Performance statistics related to a logical unit of secondary storage
ElementName = null
FreeMegabytes = 4007
IsAggregate = false
IsOnline = true
Name = /
PercentBusyTime = null
PercentFreeSpace = 55
PercentIdleTime = null
PercentUsedSpace = 45
ReadBytesPerSecond = null
ReadsPerSecond = null
TransfersPerSecond = null
UsedMegabytes = 3278
WriteBytesPerSecond = null
WritesPerSecond = null
See the NULLs ? Those are our issue.
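With a class that returns this many properties, it can be handy to pull out just the ones that came back empty. Below is a minimal sketch that does so, assuming you saved the winrm output to a text file and have a Unix-like shell with awk at hand; the function name is made up, and the sample input is simply a few lines of the output shown above.

```shell
#!/bin/sh
# Print the names of all properties whose value is "null".
# Reads "Name = value" lines (as printed by winrm enumerate) on standard input.
null_properties() {
    awk -F' = ' '{
        name = $1
        value = $2
        # Trim indentation and any trailing blanks or carriage returns.
        gsub(/^[ \t]+|[ \t\r]+$/, "", name)
        gsub(/^[ \t]+|[ \t\r]+$/, "", value)
        if (value == "null") print name
    }'
}

# A few lines taken from the output above.
null_properties <<'EOF'
SCX_FileSystemStatisticalInformation
    BytesPerSecond = null
    FreeMegabytes = 4007
    Name = /
    PercentBusyTime = null
EOF
```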
Now, before you continue reading, I will tell you that I have also investigated this internally, and apparently we have just (in Cumulative Update 3) changed this behaviour in our XPlat modules, so that when NULL is returned, we consider it to be ZERO. Good or bad as that may be, it will at least take care of the error. But if you don’t get any data from the Unix system… well, you are not getting any data – so that might cause a surprise later on when you go and look at those charts and expect to see your disk “performance counters” but in fact all you have is a bunch of ZEROs (how very interesting!). So, basically, the fix in CU3 suppresses the symptom, but does not address the cause.
So, let’s see what is actually causing this, as you might well want to get those statistics, or probably you would not be monitoring that server!
I looked at the Cimd.log (set to verbose), but it only says the following (basically not much: it is getting info for 3 partitions… and the provider code is working)
2010-09-01T08:38:32,796Z Trace [scx.core.providers.diskprovider:5964:3086830480] BaseProvider::EnumInstances()
2010-09-01T08:38:33,359Z Trace [scx.core.providers.diskprovider:5964:3086830480] Object Path = //rllspago01.domain.dom/root/scx:SCX_FileSystemStatisticalInformation
2010-09-01T08:38:33,359Z Trace [scx.core.providers.diskprovider:5964:3086830480] BaseProvider::EnumInstances() - Calling DoEnumInstances()
2010-09-01T08:38:33,359Z Trace [scx.core.providers.diskprovider:5964:3086830480] DiskProvider DoEnumInstances
2010-09-01T08:38:33,359Z Trace [scx.core.providers.diskprovider:5964:3086830480] DiskProvider GetDiskEnumeration - type 3
2010-09-01T08:38:33,360Z Trace [scx.core.providers.diskprovider:5964:3086830480] BaseProvider::EnumInstances() - DoEnumInstances() returned - 3
2010-09-01T08:38:33,360Z Trace [scx.core.providers.diskprovider:5964:3086830480] BaseProvider::EnumInstances() - Call ReturnDone
2010-09-01T08:38:33,360Z Trace [scx.core.providers.diskprovider:5964:3086830480] BaseProvider::EnumInstances() - return OK
2010-09-01T08:38:33,360Z Trace [scx.core.provsup.cmpibase.singleprovider.DiskProvider:5964:3086830480] SingleProvider::EnumInstances() - Returning - 0
but it still did not give me an idea as to why we would not get data for those “counters”. At this point I stopped using complex troubleshooting techniques, simply turned intuition on, and tried with some help from a search engine: http://www.bing.com/search?q=How+do+I+find+out+Linux+Disk+utilization
the results I got all mentioned that on Linux you would use the “iostat” command.
So I tried to use it and… lo and behold: the iostat command was NOT INSTALLED on that machine!
Guess what? We installed it (it is included in the “sysstat” package for RedHat linux, so a simple “yum install sysstat” took care of this) and the counters started working!
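If you want to check for this up front on other machines, a quick test in the shell is enough. This is just a small sketch; the helper name is made up, and the hint it prints only repeats what worked on the RedHat box above.

```shell
#!/bin/sh
# Succeeds if the given command can be found by the shell, fails otherwise.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command iostat; then
    echo "iostat found"
else
    echo "iostat missing: on RedHat it is part of the sysstat package (yum install sysstat)"
fi
```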
Hope that is useful to some.
In the last couple of weeks we have been driving thru America from the east coast (New York) to the west coast (Seattle).
I figured out I needed to show my family the Microsoft campus too. Of course they know I work at Microsoft... but having only seen the office of a subsidiary - the one in Rome, with about 250 people at its max - might not have given them (especially the kids) an idea of the actual size of the company.
I work in support (mostly with System Center Operations Manager, as you know), and I work with event logs every day. The following are typical situations:
Getting to the point: I, like everyone else, don’t have every OpsMgr event memorized.
This is why I thought of building this spreadsheet, and I hope it might come in handy to more people.
The spreadsheet contains an “AllEvents” list – and then the same events are broken down by event source as well:
When you want to search for an event (in one of the situations described above) just open up the spreadsheet, go to the “AllEvents” tab, hit CTRL+F (“Find”) and type in the Event ID you are searching for:
And this will take you to the row containing the event, so you can look up its description:
The description shows the event standard text (which is in the message DLL, therefore is the part you will not see if opening an EVT on another machine that does not have OpsMgr installed), and where the event parameters are (%1, %2, etc – which will be the strings you see in the EVT anyway).
That way you can get an understanding of what the original message would have looked like on the original machine.
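To make the idea concrete, here is a minimal sketch of how a template and its parameter strings combine into the full message. It is only an illustration of the %1, %2 substitution described above, not how the event log itself does it; the function name and the sample template are made up.

```shell
#!/bin/sh
# Rebuild an event message from its template and the parameter strings.
# usage: expand_event TEMPLATE PARAM1 [PARAM2 ...]
expand_event() {
    awk 'BEGIN {
        msg = ARGV[1]
        # Walk from the highest index down so that %10 is handled before %1.
        for (i = ARGC - 1; i >= 2; i--) {
            key = "%" (i - 1)
            out = ""
            # Literal (non-regex) replacement of every occurrence of key.
            while ((p = index(msg, key)) > 0) {
                out = out substr(msg, 1, p - 1) ARGV[i]
                msg = substr(msg, p + length(key))
            }
            msg = out msg
        }
        print msg
    }' "$@"
}

# Made-up template and parameters, only to show the call.
expand_event 'Parameter one is %1, parameter two is %2.' 'first value' 'second value'
```

The template is what you would copy from the description column of the spreadsheet, and the parameters are the strings you still see in the EVT on a machine without OpsMgr installed.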
This is just one possible usage pattern of this reference. It can also be useful to just read/study the events, learning about new ones you have never encountered, or remembering those you HAVE seen in the past but did not quite remember. And of course you can also find other creative ways to use it.
You can get it from here.
A few last words to give due credit: this spreadsheet has been compiled by using Eventlog Explorer (http://blogs.technet.com/momteam/archive/2008/04/02/eventlog-explorer.aspx ) to extract the event information out of the message DLLs on an OpsMgr 2007 R2 installation. That info was then copied and pasted into Excel in order to have an “offline” reference. Also I would like to thank Kevin Holman for pointing me to Eventlog Explorer first, and then for insisting I should not keep this spreadsheet in my drawer, as it could be useful to more people!
In an earlier post I had shown how I got the Xplat agent running on Ubuntu. I perfected the technique over time, and what follows is a step-by-step process on how to convert and change the RedHat package to run on Debian/Ubuntu. Of course this is still a hack… but some people asked me to detail it a bit more. At the same time, the cross platform team is working to update the source code on codeplex with extra bits that will make it more straightforward to grab it, modify it and re-compile it than it is today. Until then, here is how I got it to work.
I assume you have already copied the right .RPM package off the OpsMgr server’s /AgentManagement directory to the Linux box here. The examples below refer to the 32bit package, but of course the same identical technique would work for the 64bit version.
We start by converting the RPM package to DEB format:
root# alien -k scx-1.0.4-258.rhel.5.x86.rpm --scripts
scx_1.0.4-258_i386.deb generated
Then we need to create a folder where we will extract the content of the package, modify stuff, and repackage it:
root# mkdir scx_1.0.4-258_i386
root# cd scx_1.0.4-258_i386
root# ar -x ../scx_1.0.4-258_i386.deb
root# mkdir debian
root# cd debian
root# mkdir DEBIAN
root# cd DEBIAN
root# cd ../..
root# rm debian-binary
root# mv control.tar.gz debian/DEBIAN/
root# mv data.tar.gz debian/
root# cd debian
root# tar -xvzf data.tar.gz
root# rm data.tar.gz
root# cd DEBIAN/
root# tar -xvzf control.tar.gz
root# rm control.tar.gz
Now we have the “skeleton” of the package easily laid out on the filesystem and we are ready to modify the package and add/change stuff to and in it.
First, we need to add some stuff to it, which is expected to be found on a redhat distro, but is not present in debian. In particular:
1. You should copy the file “functions” (that you can get from a redhat/centos box under /etc/init.d) under the debian/etc/init.d folder in our package folder. This file is required/included by our startup scripts, so it needs to be deployed too.
Then we need to change some of the package behavior by editing files under debian/DEBIAN:
2. edit the “control” file (a file describing what the package is, and does):
3. edit the “preinst” file (pre-installation instructions): we need to add instructions to copy the “issue” file onto “redhat-release” (as the SCX_OperatingSystem class will look into that file, and this is hard-coded in the binary, we need to let it find it):
these are the actual command lines to add for both packages (DEBIAN or UBUNTU):
# symbolic links for libraries called differently on Ubuntu and Debian vs. RedHat
ln -s /usr/lib/libcrypto.so.0.9.8 /usr/lib/libcrypto.so.6
ln -s /usr/lib/libssl.so.0.9.8 /usr/lib/libssl.so.6
the following bit would be Ubuntu-specific:
#we need this file because the OS provider relies on it, so we convert what we have in /etc/issue
#this is ok for Ubuntu (“Ubuntu 9.04 \n \l” becomes “Ubuntu 9.04”)
cat /etc/issue | awk '/\\n/ {print $1, $2}' > /etc/redhat-release
while the following bit is Debian-specific:
#this is ok for Debian (“Debian GNU/Linux 5.0 \n \l” becomes “Debian GNU/Linux 5.0”)
cat /etc/issue | awk '/\\n/ {print $1, $2, $3}' > /etc/redhat-release
4. Then we edit/modify the “postinst” file (post-installation instructions) as follows:
a. remove the 2nd and 3rd lines which look like the following
RPM_INSTALL_PREFIX=
export RPM_INSTALL_PREFIX
as they are only useful for the RPM system, not DEB/APT, so we don’t need them.
b. change the following 2 functions which contain RedHat-specific commands:
configure_pegasus_service() {
    /usr/lib/lsb/install_initd /etc/init.d/scx-cimd
}
start_pegasus_service() {
    service scx-cimd start
}
c. Change them to the Debian equivalents for registering a service with init and starting it:
configure_pegasus_service() {
    update-rc.d scx-cimd defaults
}
start_pegasus_service() {
    /etc/init.d/scx-cimd start
}
5. Modify the “prerm” file (pre-removal instructions):
a. Just like in “postinst”, remove the two RPM_INSTALL_PREFIX lines
b. Locate the two functions stopping and un-installing the service
stop_pegasus_service() {
    service scx-cimd stop
}
unregister_pegasus_service() {
    /usr/lib/lsb/remove_initd /etc/init.d/scx-cimd
}
c. Replace those two functions with the Debian-equivalent command lines
stop_pegasus_service() {
    /etc/init.d/scx-cimd stop
}
unregister_pegasus_service() {
    update-rc.d -f scx-cimd remove
}
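Before baking the /etc/issue conversion from step 3 into “preinst”, you can try the two awk one-liners on sample strings and see what would end up in /etc/redhat-release. This is only a sketch: the function names are made up for the test, and the sample strings mimic the typical content of /etc/issue on the two distributions.

```shell
#!/bin/sh
# Same awk expressions as in the preinst additions, wrapped so they read from stdin.
ubuntu_release() {
    awk '/\\n/ {print $1, $2}'
}
debian_release() {
    awk '/\\n/ {print $1, $2, $3}'
}

# Sample /etc/issue contents (the \n and \l are literal characters in that file).
printf '%s\n' 'Ubuntu 9.04 \n \l' | ubuntu_release
printf '%s\n' 'Debian GNU/Linux 5.0 \n \l' | debian_release
```

The only difference between the two is how many leading words are kept, which is why the Ubuntu and Debian packages need their own variant.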
At this point the changes we needed have been put in place, and we can re-build the DEB package.
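Since a lot of files have been moved around by hand, a quick sanity check of the folder before building can save a failed install later. The following is a minimal sketch, assuming the layout created in the steps above (control files under debian/DEBIAN and the copied “functions” file under debian/etc/init.d); the function name is made up.

```shell
#!/bin/sh
# Check that the files edited or added in the steps above are where dpkg-deb expects them.
# $1 = package root folder (the "debian" folder in this walkthrough).
check_package_tree() {
    root=$1
    status=0
    for f in control preinst postinst prerm; do
        if [ ! -f "$root/DEBIAN/$f" ]; then
            echo "missing: DEBIAN/$f"
            status=1
        fi
    done
    if [ ! -f "$root/etc/init.d/functions" ]; then
        echo "missing: etc/init.d/functions"
        status=1
    fi
    return "$status"
}

# Example call, to be run from the scx_1.0.4-258_i386 folder:
#   check_package_tree debian && dpkg-deb --build debian
```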
Move back up to the main folder of the package (the scx_1.0.4-258_i386 folder, two levels up from debian/DEBIAN):
root# cd ../..
Create the package starting from the folders
root# dpkg-deb --build debian
dpkg-deb: building package `scx' in `debian.deb'.
Rename the package (for Ubuntu)
root# mv debian.deb scx_1.0.4-258_Ubuntu_9_i386.deb
Rename the package (for Debian)
root# mv debian.deb scx_1.0.4-258_Debian_5_i386.deb
Install it
root# dpkg -i scx_1.0.4-258_Platform_Version_i386.deb
All done! It should install and work!
Next step would be creating a Management Pack to monitor Debian and Ubuntu. It is pretty similar to what Robert Hearn has described step by step for CentOS, but with some different replacements of strings, as you can imagine. I have done this but have not written down the procedure yet, so I will post another article on how to do this as soon as I manage to get it standardized and reliable. There is a bit more work involved for Ubuntu/Debian… as some of the daemons/services have different names, and certain files too… but nothing terribly difficult to change so you might want to try it already and have a go at it!
In the meantime, as a teaser, here’s my server’s (http://www.muscetta.com) performance, being monitored with this “hack”:
THIS WORK IS NOT ENDORSED AND NOT EVEN CHECKED, AUTHORIZED, SCRUTINIZED NOR APPROVED BY MY EMPLOYER, AND IT ONLY REPRESENT SOMETHING WHICH I'VE DONE IN MY FREE TIME. NO GUARANTEE WHATSOEVER IS GIVEN ON THIS. THE AUTHOR SHALL NOT BE MADE RESPONSIBLE FOR ANY DAMAGE YOU MIGHT INCUR WHEN USING THIS INFORMATION. The solution presented here IS NOT SUPPORTED by Microsoft.
A number of people I have talked to liked my previous post on ACS sizing. One thing that was not extremely easy or clear to them in that post was *how* exactly I did one thing I wrote:
[…] use the dtEvent_GUID table to get the number of events for that day, and use the stored procedure “sp_spaceused” against that same table to get an overall idea of how much space that day is taking in the database […]
To be completely honest, I do not expect people to do this manually a hundred times if they have a hundred partitions. In fact, I have been doing this for a while with a script which will do the looping for me and run that sp_spaceused for me a number of times. I cannot share that script, but I do realize that this automation is very useful, therefore I wrote a “stand-alone” SQL query which, using a couple of temporary tables, produces a similar type of output. I also went a step further and packaged it into a SQL Server Reporting Services Report for everyone’s consumption. The report should look like the following screenshot, featuring a chart and the table with the numerical information about each and every partition in the database:
You can download the report from here.
You need to upload it to your report server and change its data source to the shared Data Source that the built-in ACS reports also use, and it should work.
People were already collecting logs with MOM, so why not the security log? Some people were doing that, but it did not scale well enough; for this reason, a few years ago Eric Fitzgerald announced that he was working on Microsoft Audit Collection System. The tool as it was had no interface… and the rest is history: it has been integrated into System Center Operations Manager. Still, ACS remains a lesser-known component of OpsMgr.
There are a number of resources on the web that are worth mentioning and linking to:
and, of course, many more, I cannot link them all.
As for myself, I have been playing with ACS since those early beta days (before I joined Microsoft and before going back to MOM, when I was working in Security), but I never really blogged about this piece.
Since I have been doing quite a lot of work around ACS lately, again, I thought it might be worth consolidating some thoughts about it, hence this post.
What I would like to explain here is the strategy and process I go through when analyzing the data stored in an ACS database in order to determine a filtering strategy: what to keep and what not to keep, by applying a filter on the ACS Collector.
So, the first thing I usually start with is using one of the many “ACS sizer” Excel spreadsheets around… which usually tell you that you need more space than is really necessary, basically giving you a “worst case” scenario. I don’t know how some people can actually do this from a purely theoretical point of view; I usually prefer a bottom-up approach: I look at the actual data that ACS is collecting without filters, and start from there for a more accurate sizing.
In the case of a new install this is easy – you just turn ACS on, set the retention to a few days (one or two weeks maximum), give the DB plenty of space to make sure it will make it, add all your forwarders… sit back and wait.
Then you come back 2 weeks later and start looking at the data that has been collected.
First of all, if we have not changed the default settings, the grooming and partitioning algorithm will create new partitioned tables every day. So my first step is to see how big each “partition” is.
But… what is a partition, anyway? A partition is a set of 4 tables joined together:
where GUID is a new GUID every day, and of course the 4 tables that make up a daily partition will have the same GUID.
The dtPartition table contains a list of all partitions and their GUIDs, together with their start and closing time.
Just to get a rough estimate we can ignore the space used by the last three tables – which are usually very small – and only use the dtEvent_GUID table to get the number of events for that day, and use the stored procedure “sp_spaceused” against that same table to get an overall idea of how much space that day is taking in the database.
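Running sp_spaceused by hand against every daily table gets tedious quickly, so the statements can be generated instead. What follows is only a minimal Python sketch of that idea: the function name is made up for the example, and the table names passed in are placeholders (it assumes you already have the list of your real dtEvent_GUID table names).

```python
def spaceused_statements(event_tables):
    """Build one 'EXEC sp_spaceused' statement per daily event table name."""
    statements = []
    for table in event_tables:
        # Double any single quote so the name is safe inside a T-SQL string literal.
        quoted = table.replace("'", "''")
        statements.append(f"EXEC sp_spaceused N'{quoted}';")
    return statements


# Placeholder names: substitute the dtEvent_GUID tables of your own database.
for statement in spaceused_statements(["dtEvent_GUID1", "dtEvent_GUID2"]):
    print(statement)
```

The generated statements can then be pasted into a query window and run against the ACS database.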
By following this process, I come up with something like the following:
If you have just installed ACS and let it run without filters with your agents for a couple of weeks, you should get some numbers like those above for your “couple of weeks” of analysis. If you graph your numbers in Excel (both size and number of rows/events per day) you should get some similar lines that show a pattern or trend:
So, in my example above, we can clearly observe a “weekly” pattern (Monday-to-Friday being busier than the weekend) and we can see that – for that environment – the biggest partition is roughly 17GB. If we round this up to 20GB per day, even though the weekends are much quieter, we can forecast 20*7 = 140GB per week. This has an excess “buffer” which will let the system survive event storms, should they happen. We also always recommend having some free space to allow for re-indexing operations.
In fact, especially when collecting everything without filters, the daily size is a lot less predictable: imagine worms “trying out” administrator account’s passwords, and so on… those things can easily create event storms.
Anyway, in the example above, the customer would have liked to keep 6 MONTHS (180 days) of data online, which would become 20*180 = 3600GB = roughly THREE and a HALF TERABYTES! Therefore we need a filtering strategy – and badly – to reduce this size.
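The arithmetic above is simple enough to sketch in a few lines of Python. This is just an illustration of the worst-case estimate: the function name and the choice of rounding the peak up to the next multiple of 10GB are assumptions made for the example, while the 17GB peak and the 7 and 180 day periods are the figures from the text.

```python
import math


def forecast_gb(peak_daily_gb, retention_days, round_to_gb=10):
    """Worst case: round the biggest daily partition up, then multiply by the retention."""
    padded_daily_gb = math.ceil(peak_daily_gb / round_to_gb) * round_to_gb
    return padded_daily_gb * retention_days


print(forecast_gb(17, 7))    # one week of data:  140
print(forecast_gb(17, 180))  # six months online: 3600
```

Because every day is counted at the padded peak, quiet weekends end up as extra buffer rather than as savings.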
[edited on May 7th 2010 - if you want to automate the above analysis and produce a table and graphs like those just shown, you should look at my following post.]
Ok, then we need to look at WHAT actually comprises that amount of events we are collecting without filters. As I wrote above, I usually run queries to get this type of information.
I will not get into HOW TO write a filter here – a collector’s filter is a WMI notification query and it is already described pretty well elsewhere how to configure it.
Here, instead, I want to walk through the process and the queries I use to understand where the noise comes from and what could be filtered – and get an estimate of how much space we could save if we filter one way or another.
Number of Events per User
--event count by User (with Percentages)
declare @total float
select @total = count(HeaderUser) from AdtServer.dvHeader
select count(HeaderUser),HeaderUser, cast(convert(float,(count(HeaderUser)) / (convert(float,@total)) * 100) as decimal(10,2))
from AdtServer.dvHeader
group by HeaderUser
order by count(HeaderUser) desc
In our example above, over the 14 days we were observing, we obtained percentages like the following ones:
Just by looking at this, it is pretty clear that by filtering out the events tracked by the accounts “SYSTEM”, “LOCAL SERVICE” and “ANONYMOUS”, we would save over 45% of the disk space!
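Once the per-user counts are out of the database, estimating what a filter would save is a matter of adding up shares. Here is a rough Python sketch: the counts below are invented purely for illustration (they are NOT the numbers from the environment described here), the function names are hypothetical, and the estimate assumes that all events take roughly the same space.

```python
def share_by_user(counts):
    """Percentage of the total event count for each user, rounded to 2 decimals."""
    total = sum(counts.values())
    return {user: round(count * 100 / total, 2) for user, count in counts.items()}


def saved_share(counts, filtered_users):
    """Percentage of events that would disappear if these users were filtered out."""
    total = sum(counts.values())
    dropped = sum(counts.get(user, 0) for user in filtered_users)
    return round(dropped * 100 / total, 2)


# Invented sample data, only to show the calculation.
sample_counts = {"SYSTEM": 300, "LOCAL SERVICE": 100, "ANONYMOUS": 60, "someuser": 540}

print(share_by_user(sample_counts))
print(saved_share(sample_counts, ["SYSTEM", "LOCAL SERVICE", "ANONYMOUS"]))
```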
Number of Events by EventID
Similarly, we can look at how different Event IDs have different weights on the total amount of events tracked in the database:
--event count by ID (with Percentages)
declare @total float
select @total = count(EventId) from AdtServer.dvHeader
select count(EventId),EventId, cast(convert(float,(count(EventId)) / (convert(float,@total)) * 100) as decimal(10,2))
from AdtServer.dvHeader
group by EventId
order by count(EventId) desc
We would get some similar information here:
Also, do not forget that ACS provides some reports to do this type of analysis out of the box, even if in my experience they are generally slower – on large datasets – than the queries provided here. Also, a number of reports have been buggy over time, so I just prefer to run queries and be on the safe side.
Below is an example of such a report (even if run against a different environment – just in case you were wondering why the numbers were not the same ones :-)):
The numbers and percentages we got from the two queries above should already point us in the right direction about what we might want to adjust in either our auditing policy directly on Windows and/or decide if there is something we want to filter out at the collector level (here you should ask yourself the question: “if they aren’t worth collecting are they worth generating?” – but I digress).
Also, a permutation of the above two queries should let you see which user is generating the most “noise” in regards to some events and not other ones… for example:
--event distribution for a specific user (change the @user) - with percentages for the user and compared with the total #events in the DB
declare @user varchar(255)
set @user = 'SYSTEM'
declare @total float
select @total = count(Id) from AdtServer.dvHeader
declare @totalforuser float
select @totalforuser = count(Id) from AdtServer.dvHeader where HeaderUser = @user
select count(Id), EventID, cast(convert(float,(count(Id)) / convert(float,@totalforuser) * 100) as decimal(10,2)) as PercentageForUser, cast(convert(float,(count(Id)) / (convert(float,@total)) * 100) as decimal(10,2)) as PercentageTotal
from AdtServer.dvHeader
where HeaderUser = @user
group by EventID
order by count(Id) desc
The above is particularly important, as we might want to filter out a number of events for the SYSTEM account (i.e. logons that occur when starting and stopping services) but we might want to keep other events that are tracked by the SYSTEM account too, such as an administrator having wiped the Security Log clean – which might be something you want to keep:
Of course the events with EventID 517 will be only a small share of the total events tracked by the SYSTEM account, so we can keep them and still filter the other ones out.
Number of Events by EventID and by User
We could also combine the two approaches above – by EventID and by User:
select count(Id),HeaderUser, EventId
from AdtServer.dvHeader
group by HeaderUser, EventId
order by count(Id) desc
This will produce a table like the following one
which can be easily copied/pasted into Excel in order to produce a pivot Table:
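If you would rather not go through Excel, the same pivot (users as rows, Event IDs as columns) can be built in a few lines of Python. This is only a sketch: the function name is made up, and the sample rows are invented to show the shape of the data, in the same column order that the query above returns (count, user, event id).

```python
from collections import defaultdict


def pivot_by_user(rows):
    """Turn (count, user, event_id) rows into {user: {event_id: count}}."""
    table = defaultdict(dict)
    for count, user, event_id in rows:
        # Add up counts in case the same user/event pair shows up more than once.
        table[user][event_id] = table[user].get(event_id, 0) + count
    return {user: dict(columns) for user, columns in table.items()}


# Invented sample rows, only to show the shape of the result.
sample_rows = [(500, "SYSTEM", 538), (450, "SYSTEM", 540), (20, "someuser", 540)]

for user, columns in pivot_by_user(sample_rows).items():
    print(user, columns)
```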
One more aspect that is less widely known, but I think is worth showing, is the way that clusters behave when in ACS. I don’t mean all clusters… but if you keep the “eventlog replication” feature of clusters enabled (you should disable it also from a monitoring perspective, but I digress), each cluster node’s security eventlog will have events not just for itself, but for all other nodes as well.
I have not found a reliable way to filter these duplicates out, other than disabling eventlog replication altogether.
Anyway, just to get an idea of how much this type of “duplicate” event weighs on the total, I use the following query, which tells you how many events for each machine are tracked by another machine:
--to spot machines that are cluster nodes with eventlog replication and write duplicate events (slow)
select Count(Id) as Total,replace(right(AgentMachine, (len(AgentMachine) - patindex('%\%',AgentMachine))),'$','') as ForwarderMachine, EventMachine
from AdtServer.dvHeader
--where ForwarderMachine <> EventMachine
group by EventMachine,replace(right(AgentMachine, (len(AgentMachine) - patindex('%\%',AgentMachine))),'$','')
order by ForwarderMachine,EventMachine
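The string manipulation in that query (keep what follows the first backslash of AgentMachine, drop the “$”) is easier to read outside of T-SQL. Here is a small Python sketch of the same logic, applied to rows you have already exported. The function names and sample rows are invented for the example, and it assumes that EventMachine is written in the same form as the stripped AgentMachine, which may not be true in every environment.

```python
from collections import Counter


def forwarder_name(agent_machine):
    """Keep what follows the first backslash and drop '$', like the SQL expression."""
    _, separator, tail = agent_machine.partition("\\")
    name = tail if separator else agent_machine
    return name.replace("$", "")


def cross_logged(rows):
    """Count events that were forwarded by a machine other than the one that logged them."""
    counts = Counter()
    for agent_machine, event_machine in rows:
        forwarder = forwarder_name(agent_machine)
        if forwarder.lower() != event_machine.lower():
            counts[(forwarder, event_machine)] += 1
    return counts


# Invented sample rows: (AgentMachine, EventMachine)
sample_rows = [
    ("MYDOMAIN\\NODE1$", "NODE1"),
    ("MYDOMAIN\\NODE1$", "NODE2"),
    ("MYDOMAIN\\NODE1$", "NODE2"),
]
print(cross_logged(sample_rows))
```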
Those presented above are just some of the approaches I usually look into at first. Of course there are a number more. Here I am including the same queries already shown in action, plus a few more that can be useful in this process.
I have even considered building a page with all these queries – a bit like those that Kevin is collecting for OpsMgr (we actually wrote some of them together when building the OpsMgr Health Check)… shall I move the queries below to such a page? I thought I’d list them here and give some background on how I normally use them, to start off with.
--top event ids
select count(EventId), EventId
from AdtServer.dvHeader
group by EventId
order by count(EventId) desc

--which machines have ever written event 538
select distinct EventMachine, count(EventId) as total
from AdtServer.dvHeader
where EventID = 538
group by EventMachine

--machines
select * from dtMachine

--machines (more readable)
select replace(right(Description, (len(Description) - patindex('%\%',Description))),'$','')
from dtMachine

--events by machine
select count(EventMachine), EventMachine
from AdtServer.dvHeader
group by EventMachine

--rows where EventMachine field not available (typically events written by ACS itself for checkpointing)
select *
from AdtServer.dvHeader
where EventMachine = 'n/a'

--event count by day
select convert(varchar(20), CreationTime, 102) as Date, count(EventMachine) as total
from AdtServer.dvHeader
group by convert(varchar(20), CreationTime, 102)
order by convert(varchar(20), CreationTime, 102)

--event count by day and by machine
select convert(varchar(20), CreationTime, 102) as Date, EventMachine, count(EventMachine) as total
from AdtServer.dvHeader
group by EventMachine, convert(varchar(20), CreationTime, 102)
order by convert(varchar(20), CreationTime, 102)

--event count by machine and by date (distinguishes between AgentMachine and EventMachine)
select convert(varchar(10),CreationTime,102),Count(Id),EventMachine,AgentMachine
from AdtServer.dvHeader
group by convert(varchar(10),CreationTime,102),EventMachine,AgentMachine
order by convert(varchar(10),CreationTime,102) desc ,EventMachine

--event count by User
select count(Id),HeaderUser
from AdtServer.dvHeader
group by HeaderUser
order by count(Id) desc

--to spot machines that write duplicate events (such as cluster nodes with eventlog replication enabled)
select Count(Id),EventMachine,AgentMachine
from AdtServer.dvHeader
group by EventMachine,AgentMachine
order by EventMachine

--to spot machines that are cluster nodes with eventlog replication and write duplicate events (better but slower)
select Count(Id) as Total,replace(right(AgentMachine, (len(AgentMachine) - patindex('%\%',AgentMachine))),'$','') as ForwarderMachine, EventMachine
from AdtServer.dvHeader
--where ForwarderMachine <> EventMachine
group by EventMachine,replace(right(AgentMachine, (len(AgentMachine) - patindex('%\%',AgentMachine))),'$','')
order by ForwarderMachine,EventMachine

--which user and from which machine is target of elevation (network service doing "runas" is a 552 event)
select count(Id),EventMachine, TargetUser
from AdtServer.dvHeader
where HeaderUser = 'NETWORK SERVICE' and EventID = 552
group by EventMachine, TargetUser
order by count(Id) desc

--by hour, minute and user
--(change the timestamp)... this query is useful to search which users are active in a given time period...
--helpful to spot "peaks" of activities such as password brute force attacks, or other activities limited in time.
select datepart(hour,CreationTime) as Hours, datepart(minute,CreationTime) as Minutes, HeaderUser, count(Id) as total
from AdtServer.dvHeader
where CreationTime < '2010-02-22T16:00:00.000' and CreationTime > '2010-02-22T15:00:00.000'
group by datepart(hour,CreationTime), datepart(minute,CreationTime),HeaderUser
order by datepart(hour,CreationTime), datepart(minute,CreationTime),HeaderUser
The following technique should already be understood by any powersheller. Here we focus on Operations Manager log entries, even if the data mining technique shown is entirely possible – and encouraged :-) - with any other event log.
Let’s start by getting our eventlog into a variable called $evt:
PS >> $evt = Get-Eventlog "Operations Manager"
The above only works locally in POSH v1.
In POSH v2 you can go remotely by using the “-computername” parameter:
PS >> $evt = Get-Eventlog "Operations Manager" -computername RMS.domain.com
However, you can also get to it remotely in POSH v1 with this other, more “dotNET-ish” syntax, passing the remote machine name as a second constructor argument:
PS >> $evt = (New-Object System.Diagnostics.Eventlog -ArgumentList "Operations Manager","RMS.domain.com").get_Entries()
you could even export this (or any of the above) to a CLIXML file:
PS >> (New-Object System.Diagnostics.Eventlog -ArgumentList "Operations Manager").get_Entries() | export-clixml -path c:\evt\Evt-OpsMgr-RMS.MYDOMAIN.COM.xml
and then you could reload your eventlog on another machine:
PS >> $evt = import-clixml c:\evt\Evt-OpsMgr-RMS.MYDOMAIN.COM.xml
whatever way you used to populate your $evt variable, be it from a “live” eventlog or by re-importing it from XML, you can then start analyzing it:
PS >> $evt | where {$_.Entrytype -match "Error"} | select EventId,Source,Message | group eventid
Count Name   Group
----- ----   -----
 1510 4509   {@{EventID=4509; Source=HealthService; Message=The constructor for the managed module type "Microsoft.EnterpriseManagement.Mom.DatabaseQueryModules.GroupCalculatio.
   15 20022  {@{EventID=20022; Source=OpsMgr Connector; Message=The health service {7B0E947B-2055...
    3 26319  {@{EventID=26319; Source=OpsMgr SDK Service; Message=An exception was thrown while p...
    1 4512   {@{EventID=4512; Source=HealthService; Message=Converting data batch to XML failed w...
the above is functionally identical to the following:
PS >> $evt | where {$_.Entrytype -eq 1} | select EventID,Source,Message | group eventid
Count Name   Group
----- ----   -----
 1510 4509   {@{EventID=4509; Source=HealthService; Message=The constructor for the managed modul...
   15 20022  {@{EventID=20022; Source=OpsMgr Connector; Message=The health service {7B0E947B-2055...
    3 26319  {@{EventID=26319; Source=OpsMgr SDK Service; Message=An exception was thrown while p...
    1 4512   {@{EventID=4512; Source=HealthService; Message=Converting data batch to XML failed w...
Note that Eventlog Entries’ type is an ENUM that has values of 0,1,2 – similarly to OpsMgr health states – but beware that their order is not the same, as shown in the following table:
Let’s now look at Information Events (Entrytype –eq 0)
PS >> $evt | where {$_.Entrytype -eq 0} | select EventID,Source,Message | group eventid
Count Name   Group
----- ----   -----
 4135 2110   {@{EventID=2110; Source=HealthService; Message=Health Service successfully transferr...
 1548 21025  {@{EventID=21025; Source=OpsMgr Connector; Message=OpsMgr has received new configura...
 4644 7026   {@{EventID=7026; Source=HealthService; Message=The Health Service successfully logge...
 1548 7023   {@{EventID=7023; Source=HealthService; Message=The Health Service has downloaded sec...
 1548 7025   {@{EventID=7025; Source=HealthService; Message=The Health Service has authorized all...
 1548 7024   {@{EventID=7024; Source=HealthService; Message=The Health Service successfully logge...
 1548 7028   {@{EventID=7028; Source=HealthService; Message=All RunAs accounts for management gro...
   16 20021  {@{EventID=20021; Source=OpsMgr Connector; Message=The health service {7B0E947B-2055...
   13 7019   {@{EventID=7019; Source=HealthService; Message=The Health Service has validated all ...
    4 4002   {@{EventID=4002; Source=Health Service Script; Message=Microsoft.Windows.Server.Logi...
And “Warning” events (Entrytype –eq 2):
PS >> $evt | where {$_.Entrytype -eq 2} | select EventID,Source,Message | group eventid
Count Name   Group
----- ----   -----
 1511 1103   {@{EventID=1103; Source=HealthService; Message=Summary: 1 rule(s)/monitor(s) failed ...
  501 20058  {@{EventID=20058; Source=OpsMgr Connector; Message=The Root Connector has received b...
    5 29202  {@{EventID=29202; Source=OpsMgr Config Service; Message=OpsMgr Config Service could ...
  421 31501  {@{EventID=31501; Source=Health Service Modules; Message=No primary recipients were ...
   18 10103  {@{EventID=10103; Source=Health Service Modules; Message=In PerfDataSource, could no...
    1 29105  {@{EventID=29105; Source=OpsMgr Config Service; Message=The request for management p...
Ok now let’s see those event 20022, for example… so we get an idea of which healthservices they are referring to (20022 indicates “heartbeat failure”, btw):
PS >> $evt | where {$_.eventid -eq 20022} | select message
Message
-------
The health service {7B0E947B-2055-C12A-B6DB-DD6B311ADF39} running on host webapp3.domain1.mydomain.com and s...
The health service {E3B3CCAA-E797-4F08-860F-47558B3DA477} running on host SERVER1.domain2.mydomain.com and serving...
The health service {E3B3CCAA-E797-4F08-860F-47558B3DA477} running on host SERVER1.domain2.mydomain.com and serving...
The health service {E3B3CCAA-E797-4F08-860F-47558B3DA477} running on host SERVER1.domain2.mydomain.com and serving...
The health service {52E16F9C-EB1A-9FAF-5B9C-1AA9C8BC28E3} running on host DC4WK3.domain1.mydomain.com and se...
The health service {F96CC9E6-2EC4-7E63-EE5A-FF9286031C50} running on host VWEBDL2.domain1.mydomain.com and s...
The health service {71987EE0-909A-8465-C32D-05F315C301CC} running on host VDEVWEBPROBE2.domain2.mydomain.com....
The health service {BAF6716E-54A7-DF68-ABCB-B1101EDB2506} running on host XP2SMS002.domain2.mydomain.com and serving mana...
The health service {30C81387-D5E0-32D6-C3A3-C649F1CF66F1} running on host stgweb3.domain3.mydomain.com and...
The health service {3DCDD330-BBBB-B8E8-4FED-EF163B27DE0A} running on host VWEBDL1.domain1.mydomain.com and s...
The health service {13A47552-2693-E774-4F87-87DF68B2F0C0} running on host DC2.domain4.mydomain.com and ...
The health service {920BF9A8-C315-3064-A5AA-A92AA270529C} running on host FSCLU2 and serving management group Pr...
The health service {FAA3C2B5-C162-C742-786F-F3F8DC8CAC2F} running on host WEBAPP4.domain1.mydomain.com and s...
The health service {3DCDD330-BBBB-B8E8-4FED-EF163B27DE0A} running on host WEBDL1.domain1.mydomain.com and s...
The health service {3DCDD330-BBBB-B8E8-4FED-EF163B27DE0A} running on host WEBDL1.domain1.mydomain.com and s...
or let’s look at some warnings for the Config Service:
PS >> $evt | where {$_.Eventid -eq 29202}
Index   Time         EntryType Source               InstanceID Message
-----   ----         --------- ------               ---------- -------
5535065 Dec 07 21:18 Warning   OpsMgr Config Ser... 2147512850 OpsMgr Config Service could not retrieve a cons...
5543960 Dec 09 16:39 Warning   OpsMgr Config Ser... 2147512850 OpsMgr Config Service could not retrieve a cons...
5545536 Dec 10 01:06 Warning   OpsMgr Config Ser... 2147512850 OpsMgr Config Service could not retrieve a cons...
5553119 Dec 11 08:24 Warning   OpsMgr Config Ser... 2147512850 OpsMgr Config Service could not retrieve a cons...
5555677 Dec 11 10:34 Warning   OpsMgr Config Ser... 2147512850 OpsMgr Config Service could not retrieve a cons...
Having seen those, can you remember any particular load you had on those days that would justify the instance space changing so quickly that the Config Service couldn’t keep up?
Or let’s group those events with ID 21025 by day, so we know how many Config recalculations we’ve had (which, if many, might indicate Config Churn):
PS >> $evt | where {$_.Eventid -eq 21025} | select TimeGenerated | % {$_.TimeGenerated.ToShortDateString()} | group
Count Name        Group
----- ----        -----
   39 12/7/2009   {12/7/2009, 12/7/2009, 12/7/2009, 12/7/2009...}
  203 12/8/2009   {12/8/2009, 12/8/2009, 12/8/2009, 12/8/2009...}
  217 12/9/2009   {12/9/2009, 12/9/2009, 12/9/2009, 12/9/2009...}
  278 12/10/2009  {12/10/2009, 12/10/2009, 12/10/2009, 12/10/2009...}
  259 12/11/2009  {12/11/2009, 12/11/2009, 12/11/2009, 12/11/2009...}
  224 12/12/2009  {12/12/2009, 12/12/2009, 12/12/2009, 12/12/2009...}
  237 12/13/2009  {12/13/2009, 12/13/2009, 12/13/2009, 12/13/2009...}
   91 12/14/2009  {12/14/2009, 12/14/2009, 12/14/2009, 12/14/2009...}
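The “group by day” trick is not specific to Powershell: if you have exported the event timestamps somewhere else, the same count takes a couple of lines of Python. This is only a sketch for comparison; the function name and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime


def count_by_day(timestamps):
    """Count events per calendar day, like grouping on ToShortDateString()."""
    return Counter(ts.date() for ts in timestamps)


# Invented sample timestamps, only to show the grouping.
sample = [
    datetime(2009, 12, 7, 21, 18),
    datetime(2009, 12, 7, 23, 0),
    datetime(2009, 12, 8, 1, 6),
]
for day, total in sorted(count_by_day(sample).items()):
    print(total, day.isoformat())
```

A day whose count is much higher than its neighbours is the one worth looking at first.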
Event ID 21025 shows that there is a new configuration for the Management Group.
Event ID 29103 has a similar wording, but shows that there is a new configuration for a given Healthservice. These should normally be many more events, unless your only health Service is the RMS, which is unlikely…
If we look at the event description (“message”) in search of the name (or even the GUID, as both are present) of our RMS, as follows, then the numbers should be the same as for the 21025 events above:
PS >> $evt | where {$_.Eventid -eq 29103} | where {$_.message -match "myrms.domain.com"} | select TimeGenerated | % {$_.TimeGenerated.ToShortDateString()} | group
Going back to the initial counts of events by their IDs, when showing the errors the counts above had spotted the presence of a lonely 4512 event, which might have gone undetected if just browsing the eventlog with the GUI, since it only occurred once.
Let’s take a look at it:
PS >> $evt | where {$_.eventid -eq 4512}
Index   Time         EntryType Source        InstanceID Message
-----   ----         --------- ------        ---------- -------
5560756 Dec 12 11:18 Error     HealthService 3221229984 Converting data batch to XML failed with error ...
Now, when it is about counts, Powershell is great. But sometimes Powershell makes it difficult to actually READ the (long) event messages (descriptions) in the console. For example, our event ID 4512 is difficult to read in its entirety and gets truncated with trailing dots…
we can of course increase the window size and/or select only THAT one field to read it better:
PS >> $evt | where {$_.eventid -eq 4512} | select message
Message
-------
Converting data batch to XML failed with error "Not enough storage is available to complete this operation." (0x8007000E) in rule "Microsoft.SystemCenter.ConfigurationService.CollectionRule.Event.ConfigurationChanged" running for instance "RMS.MYDOMAIN.COM" with id:"{04F4ADED-2C7F-92EF-D620-9AF9685F736F}" in management group "SCOMPROD"
Or, worst case, if it still does not fit, we can still go and search for it in the actual, usual eventlog application… but at least we will have spotted it!
The above is meant to give you an idea of what is easily accomplished with some simple one-liners, and how it can be a useful aid in analyzing/digging into Eventlogs.
All of the above is ALSO possible with Logparser, and it would actually be even lighter on memory usage and quicker, to be honest!
I just like Powershell syntax a lot more, and its ubiquity, which makes it a better option for me. Your mileage may vary, of course.
So I was testing other stuff tonight, to be honest, but I got pinged on Instant Messenger by my geek friend and colleague Stefan Stranger who pointed me at his request for help here http://friendfeed.com/sstranger/4571f39b/help-needed-on-winrs-or-winrm-and-openwsman-to
He wanted to use WINRM or any other command line utility to interact with the Xplat agent, and call methods on the Unix machine from windows. This could be very useful to – for example – restart a service (in fact it is what the RECOVERY actions in the Xplat Management Packs do, btw).
At first I told him I had only tested enumerations, such as in this other post http://www.muscetta.com/2009/06/01/using-the-scx-agent-with-wsman-from-powershell-v2/ … but the question intrigued me, so I checked out the help for winrm’s INVOKE verb:
Which told me that you can pass in the parameters for the method to be called/invoked either as a hashtable @{KEY="value";KEY2="value"}, or as an input XML file. I first tried the XML file but I could not get its format right.
After a few more minutes of trying, I figured out the right syntax.
This one works, for example:
winrm invoke ExecuteCommand http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx @{command="ps";timeout="60"} -username:root -password:password -auth:basic -r:https://virtubuntu.huis.dom:1270/wsman -skipCACheck -encoding:UTF-8
Happy remote management of your unix systems from Windows :-)
I printed a tshirt for Sara with a baby-friendly Powershell cmdlet ("Get-Milk"). She already seems to be wondering what script she can write with it.
During the OpsMgr Health Check engagement we use custom code to assess the customer’s Management group, as I wrote here already. Given that the customer tells us which machine is the RMS, one of the very first things that we do in our tool is to connect to the RMS’s registry, and check the values under HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Setup to see which machine holds the database. It is a rather critical piece of information for us, as we run a number of queries afterward… so we need to know where the db is, obviously :-)
I learned from here http://mybsinfo.blogspot.com/2007/01/powershell-remote-registry-and-you-part.html how to access registry remotely thru powershell, by using .Net classes. This is also one of the methods illustrated in this other article on Technet Script Center http://www.microsoft.com/technet/scriptcenter/resources/qanda/jan09/hey0105.mspx
Therefore the “core” instructions of the function I was using to access the registry looked like the following
[Note: the actual function is bigger, and contains error handling, and logging, and a number of other things that are unnecessary here]
Therefore, the function was called as follows:
GetValueFromRegistry $RMS "SOFTWARE\\Microsoft\\Microsoft Operations Manager\\3.0\\Setup" "DatabaseServerName"
Now so far so good.
In theory.
Now for some reason that I could not immediately explain, we had noticed that this piece of code performing registry access, while working most of the time, was giving errors on SOME occasions about not being able to open the registry value…
When you are onsite with a customer conducting an assessment, the PFE engineer does not always have the time to troubleshoot the error… as time is critical, we have usually resorted to just running the assessment from ANOTHER machine, and this “solved” the issue… but it always left me wondering WHY this was giving an error. I had suspected an issue with permissions at first, but that could not be it, as the permissions were obviously right: performing the assessment from another machine with the same user was working!
A few days ago my colleague and buddy Stefan Stranger figured out that this was related to the platform architecture:
You don’t need to use our custom code to reproduce this, REGEDIT shows the behavior as well.
If, from a 64-bit server, you open a remote registry connection to a 64-bit RMS server, you can see all OpsMgr registry keys:
If, however, from a 32-bit server, you open a remote registry connection to a 64-bit RMS server, you don’t see ALL – but only SOME – OpsMgr registry keys:
So here’s the reason! This is what was happening! How could I not think of this before? It was nothing related to permissions, but to registry redirection! The issue was happening because the 32bit machine is using the 32bit registry editor, which, when accessing a 64bit machine, defaults to the Wow6432Node location in the registry. Not all OpsMgr data is in the WOW64 location on a 64bit machine, only some.
So, just like regedit, the 32bit powershell and the 32bit .Net framework were being redirected to the 32bit-compatibility registry keys… not finding the stuff we needed, whereas a 64bit application could find that. Any 32bit application by default gets redirected to a 32bit-safe registry.
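To make the redirection concrete, here is a deliberately simplified model of it in Python. It only illustrates the behaviour described above (a 32bit caller asking for a key under SOFTWARE on a 64bit machine ends up under SOFTWARE\Wow6432Node); the function name is made up, and the real Windows rules are more complex than this, so treat it as a mental model and not as a reference.

```python
SOFTWARE_PREFIX = "SOFTWARE\\"
WOW_PREFIX = "SOFTWARE\\Wow6432Node\\"


def redirected_path(subkey, caller_is_32bit, target_is_64bit):
    """Return the HKLM subkey a caller would actually land on (simplified model)."""
    upper = subkey.upper()
    redirect = (
        caller_is_32bit
        and target_is_64bit
        and upper.startswith(SOFTWARE_PREFIX)
        and not upper.startswith(WOW_PREFIX.upper())
    )
    if redirect:
        # The 32bit view of SOFTWARE lives under the Wow6432Node subkey.
        return WOW_PREFIX + subkey[len(SOFTWARE_PREFIX):]
    return subkey


key = "SOFTWARE\\Microsoft\\Microsoft Operations Manager\\3.0\\Setup"
print(redirected_path(key, caller_is_32bit=True, target_is_64bit=True))
print(redirected_path(key, caller_is_32bit=False, target_is_64bit=True))
```

In this model the 64bit caller lands on the key it asked for, while the 32bit caller is sent to the Wow6432Node copy.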
So, after finally UNDERSTANDING what the issue was, I started wondering: ok... but how can I access the REAL “HKLM\SOFTWARE\Microsoft” key on a 64bit machine when running this FROM a 32bit machine – WITHOUT being redirected to “HKLM\SOFTWARE\Wow6432Node\Microsoft” ? What if my application CAN deal just fine with those values and actually NEEDs to access them?
The answer wasn’t as easy as the question. I did a bit of digging on this, and still I have NOT yet found a way to do this with the .Net classes. It seems that in a lot of situations, Powershell or even .Net classes are nice and sweet wrappers on the underlying Windows APIs… but for all their sweetness and ease, they are very often not very complete wrappers – letting you do just about enough for most situations, but not quite everything you would or could with the API underneath. But I digress, here...
The good news is that I did manage to get this working, but I had to resort to using dear old WMI StdRegProvider… There are a number of locations on the Internet mentioning the issue of accessing 32bit registry from 64bit machines or vice versa, but all examples I have found were using VBScript. But I needed it in Powershell. Therefore I started with the VBScript example code that is present here, and I ported it to Powershell.
Handling the WMI COM object from Powershell was slightly less intuitive than in VBScript, and it took me a couple of hours to figure out how to change some stuff, especially this bit that sets the parameters collection:
Set Inparams = objStdRegProv.Methods_("GetStringValue").Inparameters
Inparams.Hdefkey = HKLM
Inparams.Ssubkeyname = RegKey
Inparams.Svaluename = RegValue
Set Outparams = objStdRegProv.ExecMethod_("GetStringValue", Inparams,,objCtx)
INTO this:
$Inparams = ($objStdRegProv.Methods_ | where {$_.name -eq "GetStringValue"}).InParameters.SpawnInstance_()
($Inparams.Properties_ | where {$_.name -eq "Hdefkey"}).Value = $HKLM
($Inparams.Properties_ | where {$_.name -eq "Ssubkeyname"}).Value = $regkey
($Inparams.Properties_ | where {$_.name -eq "Svaluename"}).Value = $value
$Outparams = $objStdRegProv.ExecMethod_("GetStringValue", $Inparams, "", $objNamedValueSet)
I have only done limited testing at this point and, even if the actual work now requires nearly 15 lines of code to be performed vs. the previous 3 lines in the .Net implementation, it at least seems to work just fine.
What follows is the complete code of my replacement function, in all its ugly glory:
which can be called similarly to the previous one:
GetValueFromRegistryThruWMI $RMS "SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Setup" "DatabaseServerName"
[Note: unlike with the .Net implementation, you don’t need doubled (escaped) backslashes here]
Enjoy your cross-architecture registry access: from 32bit to 64bit - and back!
During the beta of the Cross-Platform extensions and of System Center Operations Manager 2007 R2, the product team had promised to eventually release the SCX Providers' source code.
Now that this promise has been kept, and the SCX providers have been released on Codeplex at http://xplatproviders.codeplex.com/, it should finally be possible to build your own unsupported agent package entirely from source code, without having to modify the original package as I have shown earlier on this blog. Of course this will still be unsupported by Microsoft Product Support, but it should work just fine! This is an extraordinary event in my opinion, as it is not common for Microsoft to release code as open source, especially when it is part of one of the products it sells. I suspect we will see more of this going forward.
Also, at R2 release time, some official documentation about building Cross-Platform Management Packs was published on Technet.
Anyway, I have in the past published a number of posts on my blog under this tag http://www.muscetta.com/tag/xplat/ (I will continue to use that tag going forward) which show how I hacked both the existing MPs AND the SCX agent package to let it run on unsupported distributions. I think they are still useful, as they show a number of techniques for testing, understanding and troubleshooting the Xplat agent. In fact, I first learned how to understand and modify the RedHat MPs to monitor CentOS, and eventually even modified the RPM package to run on Ubuntu (which also works on Debian 5/Lenny), as you can see from the fact that I am now using it to monitor - from home, across the Internet - the machine running this blog:
Or even, with or without OpsMgr 2007 R2, you could write your own scripts to interact with those providers, by using your favourite Scripting Language.
After all, those experiments with Xplat earned me a reputation as a "Unix expert at Microsoft" (this expression still makes me laugh), as I was tweeting here:
But really, I have never hidden my interest for interoperability and the fact that I have been using Linux quite a bit in the past, and still do.
Also, one more related piece of information: the fine people at Xandros have released their Bridgeways Management Packs and at the same time started their own blog at http://blog.xplatxperts.com/, where they discuss some troubleshooting techniques for the Xplat agent, both similar to what I have been writing about here and also - of course - specific to their own providers, which live in their XSM namespace.
So Powershell v2 adds a nice bunch of Ws-Man related cmdlets. Let’s see how we can use them to interact with OpenPegasus’s WSMan on a SCX Agent.
PS C:\maint> test-wsman -computer virtubuntu.huis.dom -port 1270 -authentication basic -credential (get-credential) -usessl
cmdlet Get-Credential at command pipeline position 1
Supply values for the following parameters:
Credential
But we do get this error:
Test-WSMan : The server certificate on the destination computer (virtubuntu.huis.dom:1270) has the following errors:
The SSL certificate could not be checked for revocation. The server used to check for revocation might be unreachable.
The SSL certificate is signed by an unknown certificate authority.
At line:1 char:11
+ test-wsman <<<< -computer virtubuntu.huis.dom -port 1270 -authentication basic -credential (get-credential) -usessl
+ CategoryInfo : InvalidOperation: (:) [Test-WSMan], InvalidOperationException
+ FullyQualifiedErrorId : WsManError,Microsoft.WSMan.Management.TestWSManCommand
The credentials above have to be a unix login. Which we typed correctly. But we still can't get thru, as the certificate used by the agent is not trusted by our workstation. This seems to be the “usual” issue I first faced when testing SCX with WINRM in beta1. At the time I simply dismissed it with the following sentence
[…] Of course you have to solve some other things such as DNS resolution AND trusting the self-issued certificates that the agent uses, first. Once you have done that, you can run test queries from the Windows box towards the Unix ones by using WinRM. […]
and I sincerely thought that it would explain pretty well… but eventually a lot of people got confused by this and did not know what to do, especially for the part that goes about trusting the certificate. Anyway, in the following posts I figured out you could pass the –skipCACheck parameter to WINRM… which solved the issue with having to trust the certificate (which is fine for testing, but I would not use that for automations and scripts running in production… as it might expose your credentials to man-in-the-middle attacks).
So it seems that with the Powershell cmdlets we are back to that issue, as I can’t find a parameter to skip the CA check. Maybe it is there, but with PSv2 not having been released yet, I don't know everything about it, and the CTP documentation is not yet complete. Therefore, back to trusting the certificate.
Trusting the certificate is actually very simple, but it can be a bit tricky when passing those certs back and forth from unix to windows. So let's make the process a bit clearer.
All of the SCX-agents certificates are ultimately signed by a key on the Management server that has discovered them, but I don't currently know where that certificate/key is stored on the management server. Anyway, you can get it from the agent certificate - as you only really need the public key, not the private signing key.
Use WinSCP or any other utility to copy the certificate off one of the agents. You can find that in the /etc/opt/microsoft/scx/ssl location:
that scx-host-computername.pem is your agent certificate.
Copy it to the Management server and change its extension from .pem to .cer. Now Windows will be happy to show it to you with the usual Certificate interface:
We need to go to the “Certification Path” tab, select the ISSUER certificate (the one called “SCX-Certificate”):
then go to the “Details” tab, and use the “Copy to File” button to export the certificate.
After you have the certificate in a .CER file, you can add it to the “trusted root certification authorities” store on the computer you are running your powershell tests from.
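Before copying files back and forth, it can help to check on the Unix side who actually issued the agent certificate, and to produce a binary .cer in one go instead of renaming the file. The following is only a minimal sketch using the stock openssl command-line tool; the function names `show_cert_info` and `pem_to_cer` are made up for illustration, and I am assuming openssl is installed on the box.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the SCX agent.

# Print the issuer and the subject of a PEM-encoded certificate.
show_cert_info() {
    openssl x509 -in "$1" -noout -issuer -subject
}

# Write a DER-encoded copy of a PEM certificate (Windows opens it as .cer).
pem_to_cer() {
    openssl x509 -in "$1" -outform DER -out "$2"
}
```

For example, `show_cert_info /etc/opt/microsoft/scx/ssl/scx-host-computername.pem` should print the issuer and subject names of the agent certificate.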
So after you have trusted it, the same command as above actually works now:
wsmid : http://schemas.dmtf.org/wbem/wsman/identify/1/wsmanidentity.xsd
lang :
ProtocolVersion : http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
ProductVendor : Microsoft System Center Cross Platform
ProductVersion : 1.0.4-248
Ok, we can talk to it! Now we can do something more fun, like actually returning instances and/or calling methods:
PS C:\maint> Get-WSManInstance -computer virtubuntu.huis.dom -authentication basic -credential (get-credential) -port 1270 -usessl -enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx
This is far from exhaustive, but should get you started on a world of possibilities about automating diagnostics and responses with Powershell v2 towards the OpsMgr 2007 R2 Cross-Platform machines. Enjoy!
As you know, since beta1 of Xplat I have been busy modifying the RedHat management pack to monitor CentOS with OpsMgr. Now, CentOS is a distribution that is pretty similar to RedHat, so the RPM package just runs, and it is only a matter of hacking a modified MP.
I never went really further in my experiments, mostly due to lack of time… but then yesterday I got a comment to this older post asking about Ubuntu. Of course I know about Ubuntu, and have been using Debian-based distributions for years. I actually even prefer them over RPM-based distributions such as RedHat or SuSE (personal preference). Heck, even this weblog is running on Debian!
Anyway, I never really tried to see if one of the existing RPM packages for RedHat or SuSE could be modified to run on Ubuntu. I will eventually test this on Debian too, but for now I used Ubuntu which tends to have slightly newer packages and libraries, overall. The machine I tested on is a Ubuntu Server 8.04.2. Older/newer versions might slightly differ.
BEWARE THAT ALL THAT FOLLOWS BELOW IS NOT SUPPORTED BY MICROSOFT. It is only described here for EXPERIMENTAL (==fun) purpose. DO NOT USE THIS IN A PRODUCTION ENVIRONMENT.
So, you are warned. Now let’s hack it.
The first thing to do is to copy the Redhat agent’s RPM package off your OpsMgr2007 R2 server in the “usual” path “C:\Program Files\System Center Operations Manager 2007\AgentManagement\UnixAgents”. Let’s grab the RHEL5 agent, which is called scx-1.0.4-248.rhel.5.x86.rpm in R2 RTM.
First we need to CONVERT the RPM package to the DEB package format used by Ubuntu, by using the ALIEN package:
sudo apt-get update
sudo apt-get install alien
sudo bash
alien -k scx-1.0.4-248.rhel.5.x86.rpm --scripts
dpkg -i scx_1.0.4-248_i386.deb
The converted package will install… but the script execution will fail in a few places – most notably in the generation of the certificate, as it is not able to locate the right openssl libraries, as shown in the screenshot above.
If the libssl.so.6 file cannot be found, you might be missing the “libssl-dev” package, which you can install as follows:
apt-get install libssl-dev
But even once it is installed, the files will still appear to be missing. That is not really true: the files are there, but on Ubuntu they have different names than on RedHat, that’s all. You can therefore create symbolic links with the “right” names, so that the libraries are aliased and get found afterwards:
cd /usr/lib
ln -s libcrypto.so.0.9.8 libcrypto.so.6
ln -s libssl.so.0.9.8 libssl.so.6
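The aliasing step above can also be scripted so that it is safe to run more than once. What follows is only a minimal sketch: the function name `link_openssl_compat` is my own invention, and it assumes the 0.9.8 libraries live in the directory you pass in (/usr/lib on the Ubuntu 8.04 box described here).

```shell
#!/bin/sh
# Hypothetical helper (not part of the SCX package): create the RedHat-style
# library names as symbolic links to the Ubuntu ones, skipping anything that
# is missing or already present.
link_openssl_compat() {
    libdir="${1:-/usr/lib}"
    for name in libcrypto libssl; do
        target="$name.so.0.9.8"
        link="$name.so.6"
        if [ ! -e "$libdir/$target" ]; then
            echo "skip: $libdir/$target not found" >&2
            continue
        fi
        if [ -e "$libdir/$link" ] || [ -L "$libdir/$link" ]; then
            echo "skip: $libdir/$link already exists" >&2
            continue
        fi
        # Relative link, same as running "ln -s" from inside the directory.
        ln -s "$target" "$libdir/$link"
        echo "linked: $libdir/$link -> $target"
    done
}
```

Run as root, `link_openssl_compat /usr/lib` should leave you with the same two links as the manual commands.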
So now when installing the package, the certificate generation will work:
You are nearly ready to go. You have to start the service by using the init script directly: the “service” command is RedHat-specific, so it will fail here.
/etc/init.d/scx-cimd start is the “standard” way of starting daemons from init on Unix.
But it still fails, as it seems that the init script provided in the RedHat package is really searching for a file called “functions” which is present on RedHat and on CentOS, which provides re-usable functions for startup scripts to include:
How do you fix this? I just copied the /etc/init.d/functions file from a CentOS box to my Ubuntu box.
I copied it via SCP from the CentOS box I have:
cd /etc/init.d
scp root@centos.huis.dom:/etc/init.d/functions .
You can probably also find and fetch the file from the Internet (both CentOS and RedHat should have accessible repositories with all the files in their distributions, since it is open sourced).
After you have the file in place, the init script will be able to include it, will find the functions it needs, and the daemon/service will now start (even if with minor errors I have not investigated for now, but that don’t seem to be causing troubles):
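If you want to know up front which include files an init script expects, before it fails on you, something like the following could help. It is only a rough sketch: the function name `missing_sourced_files` is made up, and it assumes the script pulls in its helpers with a plain `. /absolute/path` or `source /absolute/path` line, which is how RedHat-style init scripts usually do it but which I have not verified for every script.

```shell
#!/bin/sh
# Hypothetical helper: print every file that a script sources by absolute
# path and that does not exist on this machine.
missing_sourced_files() {
    script="$1"
    # Extract the path from lines like ". /path" or "source /path".
    sed -n \
        -e 's/^[[:space:]]*\.[[:space:]][[:space:]]*\(\/[^[:space:];]*\).*/\1/p' \
        -e 's/^[[:space:]]*source[[:space:]][[:space:]]*\(\/[^[:space:];]*\).*/\1/p' \
        "$script" |
    while read -r path; do
        [ -e "$path" ] || echo "$path"
    done
}
```

On my Ubuntu box, `missing_sourced_files /etc/init.d/scx-cimd` would be the call to try before copying the functions file over.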
and here you can see it is finally running:
So let’s try to issue a few queries as shown in a previous post:
IT WORKS!!!
But… there is a “but”: not all classes actually return instances and values just yet. Most notably the “SCX_OperatingSystem” class does not seem to return anything right away. That is a very important class, because it is the one we would use to first discover the Operating System object in the Management Packs. So we need to fix it. The reason why the class does not return anything is that the SCX provider looks into the /etc/redhat-release file to determine what OS version/distribution the machine is running. And that file is obviously not there on Ubuntu.
On all Linuxes there is a similar file, called /etc/issue... which again, we can copy with the other name and trick the provider into working:
cd /etc
cp issue redhat-release
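The same trick can be wrapped in a few lines that refuse to overwrite a real release file, in case you ever run it on the wrong machine. This is just a sketch under my own assumptions: the function name `fake_redhat_release` is hypothetical, and the directory argument only exists so that it can be tried out somewhere other than /etc.

```shell
#!/bin/sh
# Hypothetical helper: give the provider the file it looks for by copying
# "issue" to "redhat-release", but never overwrite an existing release file.
fake_redhat_release() {
    etcdir="${1:-/etc}"
    if [ -e "$etcdir/redhat-release" ]; then
        echo "kept existing $etcdir/redhat-release" >&2
        return 0
    fi
    if [ ! -r "$etcdir/issue" ]; then
        echo "cannot read $etcdir/issue" >&2
        return 1
    fi
    cp "$etcdir/issue" "$etcdir/redhat-release"
}
```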
And NOW, the SCX_OperatingSystem Class also returns an instance:
The next step would be “cooking” an MP to discover Ubuntu. More on this on a later post (maybe). I did not test all classes and their implementation… you can try to poke at them by following the instructions and commands on my previous post here. But this should get you started.
I make heavy use of WMI.
But when using it to gather information from customer’s machines for assessments, I sometimes find the occasional broken WMI repository. There are a number of ways in which WMI can become corrupted and return weird results. Most of the times you would just get errors, such as “Class not registered” or “provider load failure”. I can handle those errors from within scripts.
But there are some more subtle - and annoying - ways in which the WMI repository can get corrupted. The situations I am talking about are the ones where WMI will accept your query… will say it is executing it… but will never actually return any error, and will just stay stuck performing your query forever. Until your client application decides to time out. Which in some cases does not happen.
Now that was my issue: when my assessment script (which was using the handy Powershell Get-WmiObject cmdlet) would hit one of those machines… the whole script would hang forever and never finish its job. Ok, sure, the solution to this would be actually FIXING the WMI repository and then trying again. But remember I am talking of an assessment: if the information I am getting is just one piece of a bigger puzzle, and I don’t necessarily care about it and can continue without it, I want to be able to skip that info, maybe the whole section, report an error saying I was not able to get that information, and continue to get the remaining info. I can still fix the issue on the machine afterward AND then run the assessment script again, but in the first place I just want to get a picture of what the system looks like. With the good and with the bad things. Especially, I do want to take that whole picture, not just a piece of it.
Unfortunately, the Get-WmiObject cmdlet does not let you specify a timeout. Therefore I cooked my own function which has a compatible behaviour to that of Get-WmiObject, but with an added “-timeout” parameter which can be set. I dubbed it “Get-WmiCustom”
Function Get-WmiCustom([string]$computername,[string]$namespace,[string]$class,[int]$timeout=15)
{
    $ConnectionOptions = new-object System.Management.ConnectionOptions
    $EnumerationOptions = new-object System.Management.EnumerationOptions

    $timeoutseconds = new-timespan -seconds $timeout
    $EnumerationOptions.set_timeout($timeoutseconds)

    $assembledpath = "\\" + $computername + "\" + $namespace
    #write-host $assembledpath -foregroundcolor yellow

    $Scope = new-object System.Management.ManagementScope $assembledpath, $ConnectionOptions
    $Scope.Connect()

    $querystring = "SELECT * FROM " + $class
    #write-host $querystring

    $query = new-object System.Management.ObjectQuery $querystring
    $searcher = new-object System.Management.ManagementObjectSearcher
    $searcher.set_options($EnumerationOptions)
    $searcher.Query = $querystring
    $searcher.Scope = $Scope

    trap { $_ }
    $result = $searcher.get()

    return $result
}
You can call it as follows, which is similar to how you would call get-WmiObject
get-wmicustom -class Win32_Service -namespace "root\cimv2" -computername server1.domain.dom
or, of course, specifying the timeout (in seconds):
get-wmicustom -class Win32_Service -namespace "root\cimv2" -computername server1.domain.dom -timeout 1
and obviously, since the function returns objects just like the original cmdlet, it is also possible to pipe them to other commands:
get-wmicustom -class Win32_Service -namespace "root\cimv2" -computername server1.domain.dom -timeout 1 | Format-Table
You have heard it all over the place, System Center Operations Manager 2007 R2 has reached the Release Candidate milestone and the RC bits have been made available on connect.microsoft.com.
As it is becoming a tradition for me with each new release, I want to take a look at the Unix Monitoring stuff like I did since beta1 of Xplat, passing thru beta2. I am an integration freak and I have always insisted that interoperability is key. I will leave the most obvious “release notes” kind of things out of here, such as saying that there are now agents for the x64 version of linux distro’s, and so on…. you can read this stuff in the release notes already and in a zillion of other places.
Let’s instead look at my first impression ( = I am amazed: this product is really getting awesome) and let’s do a bit of digging, mostly to note what changed since my previous posts on Xplat (which, by the way, is the MOST visited post on this blog I ever published) – of course there is A LOT more that has changed under the hood… but those are code changes, improvements, polishing of the product itself… while that would be interesting from a code perspective, here I am more interested in what the final user (the System Administrator) will ultimately interact with directly, and what he might need to troubleshoot and understand how the pieces fit together to realize Unix Monitoring in OpsMgr.
After having hacked the RedHat MP to work on my CentOS box (as usual), I started to take a look at what is installed on the Linux box. Here are the new services:
You will notice the daemons have changed names and get launched with new parameters.
Of course when you see who uses port 1270 everything becomes clearer:
Therefore I can place the two new names and understand that SCXCIMSERVER is the WSMAN implementation, while SCXCIMPROVAGT is the CIM/WBEM implementation.
There is one more difference at the “service” (or “daemon”) level: the fact that there is only ONE init script now: /etc/init.d/scx-cimd
So basically the SCX “Agent” will start and stop as a single thing, even if it is composed of multiple executables that will spawn various processes.
Another difference: if we look in “familiar” locations like /etc/opt/microsoft/scx/bin/tools/ we see that a number of configuration files are either empty (0 bytes) or missing (like the one described on Ander’s blog to enable verbose logging of WSMan requests), when compared to earlier versions:
But that is because I have been told we now have a nice new tool called scxadmin under /opt/microsoft/scx/bin/tools/ , which will let you configure those things:
Therefore you would enable VERBOSE logging for all components by issuing the command
./scxadmin -log-set all verbose
and you will bring it back to a less noisy setting of logging only errors with
./scxadmin -log-set all errors
the logs will be written under /var/opt/microsoft/scx/log just like they did before.
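Since verbose logging is noisy, I find it handy to switch it on only around the one command I am troubleshooting. The wrapper below is a minimal sketch of that idea, not something that ships with the agent: the function name `with_verbose_scx_logging` and the `SCX_TOOLS` variable are my own inventions, and the only real commands it relies on are the two scxadmin invocations shown above.

```shell
#!/bin/sh
# Hypothetical wrapper: raise SCX logging to verbose, run the given command,
# then drop back to errors-only even if that command fails.
with_verbose_scx_logging() {
    tools="${SCX_TOOLS:-/opt/microsoft/scx/bin/tools}"
    "$tools/scxadmin" -log-set all verbose || return 1
    status=0
    "$@" || status=$?
    "$tools/scxadmin" -log-set all errors
    # Hand back the exit status of the wrapped command.
    return "$status"
}
```

For example, from the tools directory: `with_verbose_scx_logging ./scxcimcli xq "select * from SCX_OperatingSystem" -n root/scx`, and then go and read what was written under /var/opt/microsoft/scx/log.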
Other than this, a lot of the troubleshooting techniques I showed in one of my previous posts, like how to query CIM classes directly or thru WSMAN remotely by using winrm – they should really stay the same. I will mention them again here for reference.
SCXCIMCLI is a useful and simple tool used to query CIM directly. You can roughly compare it to wbemtest.exe in the Windows world (other than not having a UI). This utility can also be found in /opt/microsoft/scx/bin/tools
A couple of examples of the most common/useful things you would do with scxcimcli:
1) Enumerate all Classes whose name contains “SCX_” in the root/scx namespace (the classes our Management packs use):
./scxcimcli nc -n root/scx -di |grep SCX_ | sort
2) Execute a Query
./scxcimcli xq "select * from SCX_OperatingSystem" -n root/scx
Also another thing that you might want to test when troubleshooting discoveries, is running the same queries through WS-Man (possibly from the same Management Server that will or should be managing that unix box). I already showed this in the past, it is the following command:
winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx -username:root -password:password -r:https://linuxbox.mydomain.com:1270/wsman -auth:basic -skipCACheck
but if you launch it that way it will now return an error like the following (or at least it did in my test lab):
Error number: -2144108468 0x8033804C
The WS-Management service does not support the character set used in the request. Change the request to use UTF-8 or UTF-16.
the error message is pretty self explanatory: you need to specify the UTF-8 Character set. You can do it by adding the “-encoding” qualifier:
winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx -username:root -password:password -r:https://linuxbox.mydomain.com:1270/wsman -auth:basic -skipCACheck -encoding:UTF-8
Hope the above is useful to figure out the differences between the earlier beta releases of the System Center CrossPlatform extensions and the version built in OpsMgr 2007 R2 Release Candidate.
There are obviously a million of other things in R2 worth writing about (either related to the Unix monitoring or to everything else) and I am sure posts will start to appear on the many, more active, blogs out there (they have already started appearing, actually). I have not had time to dig further, but will likely do so AFTER Easter – as the next couple of weeks I will be travelling, working some of the time (but without my test environment and good connectivity) AND visiting relatives the rest of the time.
One last thing I noticed about the Unix/Cross Platform Management Packs in R2 Release Candidate… their current “release date” exposed by the MP Catalog Web Service is the 20th of March…
…which happens to be my Birthday - therefore they must be a present for me! :-)
One of the cool new features of System Center Operations Manager 2007 R2 is the possibility to check and update Management Packs from the catalog on the Internet directly from the Operators Console:
Even if the backend for this feature is not yet documented, I was extremely curious to see how this had actually been implemented. Especially since it took a while to have this feature available for OpsMgr, I had the suspicion that it could not be as simple as one downloadable XML file, like the old MOM2005's MPNotifier had been using in the past.
Therefore I observed the console's traffic through the lens of my proxy, and got my answer:
So that was it: a .Net Web Service.
I tried to ask the web service itself for discovery information, but failed:
Since there is no WSDL available, but I badly wanted to interact with it, I had to figure out: what kind of requests would be allowed to it, how should they be written, what methods could they call and what parameters should I pass in the call. In order to get started on this, I thought I could just observe its network traffic. And so I did... I fired up Network Monitor and captured the traffic:
Microsoft Network Monitor is beautiful and useful for this kind of stuff, as it lets you easily identify which application a given stream of traffic belongs to, just like in the picture above. After I had isolated just the traffic from the Operations Console, I then saved those captures packets in CAP format and opened it again in Wireshark for a different kind of analysis - "Follow TCP Stream":
This showed me the reassembled conversation, and what kind of request was actually done to the Web Service. That was the information I needed.
Ready to rock at this point, I came up with this Powershell script (to be run in OpsMgr Command Shell) that will:
1) connect to the web service and retrieve the complete MP list for R2 (this part is also useful on its own, as it shows how to interact with a SOAP web service in Powershell, invoking a method of the web service by issuing a specially crafted POST request. To give due credit, for this part I first looked at this PERL code, which I then adapted and ported to Powershell);
2) loop through the results of the "Get-ManagementPack" opsmgr cmdlet and compare each MP found in the Management Group with those pulled from the catalog;
3) display a table of all imported MPs with both the version imported in your Management Group AND the version available on the catalog:
Remember that this is just SAMPLE code: it is not meant to be used in a production environment, and it is worth mentioning again that OpsMgr2007 R2 is BETA software at the time of writing, therefore this functionality (and its implementation) might change at any time, and the script will break. Also, at present, the MP Catalog web service still returns slightly older MP versions and is not yet kept in sync with MP releases, but it will be ready, with complete and updated content, by the time R2 gets released.
Here we go again. Now that the OpsMgr2007 R2 beta is out, with an improved and revamped version of the System Center Cross Platform Extensions, I faced the issue of how to upgrade my test lab.
I have to say that OpsMgr2007 R2 beta release notes explain the known issues, and I had no trouble whatsoever upgrading the windows part. It just took its time (I am running virtual machines in my test lab, that don't have the best performance), but it went smoothly and without a glitch. In a couple of hours I had everything upgraded: databases, RMS, reporting, agents, gateway. All right then. The new purple icons in System Center look cute, and the new UI has some great stuff, such as a long-awaited way to update your management packs directly from the Internet, better display of Overrides (kind of what we used to rely on Override Explorer for)... and A LOT more new stuff that I won't be wasting my Sunday writing about since everybody else has already done it two days ago:
Therefore let's get back to my upgrade, which is a lot more interesting (to me) than the marketing tam-tam :-)
As part of the upgrade to R2, I had to first uninstall the Xplat beta refresh bits, which I had installed, including all Unix Management Packs. Including my CentOS Management Pack I had improvised.
So this is the new start page of the integrated Discovery Wizard:
Looks nice and integrates the functionality of discovering and deploying Windows machines, SNMP Devices, and Unix/Linux machines.
Of course, my CentOS machine would not be discovered, and showed up as an unsupported platform. Of course my old Management Pack I had hacked together in XPlat Beta 1 did not work anymore. Therefore, I figured out I had to see what changes were there, and how to make it work again (of course it IS possible - It is NOT SUPPORTED, but I don't care, as long as it works).
Since the existing agent could not be discovered, the first step I took was logging on to the Linux box, uninstalling the old agent, and installing the new one:
There I tried to discover again, but of course it still failed.
At that point I started taking a look at the new layout of things on the unix side. Most stuff is located in the same directories where beta1 was installed, and there are a bunch of useful commands under /opt/microsoft/scx/bin/tools. You can check out the Open Pegasus version used:
[root@centos tools]# ./scxcimconfig --version
Version 2.7.0
Let's take a look at what SCX classes we have available:
./scxcimcli nc -n root/scx -di |grep SCX | sort
Nice. That's the stuff we will be querying over WS-Man from the Management Server.
So let's look at the OS Discovery, and we test it from the OpsMgr 2007 box:
winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx -username:root -password:password -r:https://centos:1270/wsman -auth:basic -skipCACheck
it returns results:
At first I assumed this worked like in Beta1, therefore I exported the RedHat management pack and made my own version of it, replacing the strings it expects to find so that it discovers CentOS instead of RedHat.
While the MP was syntactically correct and would import fine, the Discovery wizard still didn't work.
I took one more look at the discoveries in the MP, and I found there are two more, targeted to Management Server, which is probably what gets used by the Discovery Wizard to understand what kind of agent kit needs to be deployed.
So basically this discovery checks for the returned value from the module to determine if the discovered platform is a supported one:
But how does the module get its data?
Look at the layout of the /AgentManagement/UnixAgents folder on the Management Server:
That's it: GetOSVersion.sh - a shell script. A nice, open, clear text, hackable shell script. Let's take a look at it:
So that's it, and that is what my modification looks like. What probably happens during the discovery wizard is that the script gets copied over SCP to the box and executed; it looks at a number of things and returns the discovery data we need.
If you do those steps manually, you see how the script returns something very similar to a PropertyBag, just like discoveries done by VBScript on Windows machines:
So after modifying the script... here we go. The Wizard now thinks CentOS is Red Hat, and can install an agent on it:
Only once the Management Server discovery finally considers the CentOS machine worth managing do the other discoveries that use WS-Man queries start kicking in, like the old one did, and find the OS objects and all the other hosted objects. For this to work you don't only need to hack the shell script, you also need a hacked MP - the "regular" Red Hat one won't find CentOS, which is and remains an UNSUPPORTED platform.