This Blog provides information about running SAP Applications on the Microsoft Platform. The Blog is written by people who are working with SAP on the Microsoft Platform for decades.
Well, let’s go to the last part of our series about SAP DBACockpit and associated SQL Server scripts.
Another area which usually is investigated is to detect long running queries against the database. What can be from interests are questions like whether a requests is blocked in any way or experiences wait situations on some kind of resources and more important how the statement looks like. Especially in the case of SAP applications running against SQL Server, the interesting question as well is to trace these requests back to a specific SAP application server and the specific process the request is issued from. A query to get this information from would be:
select r.session_id, s.host_name, s.program_name, s.host_process_id,
(case when r.statement_end_offset = -1
then len(convert(nvarchar(max), qt.text)) * 2
else r.statement_end_offset end -r.statement_start_offset)/2)
as stmt_executing,r.blocking_session_id, r.cpu_time,r.total_elapsed_time,r.reads,r.writes,
from sys.dm_exec_requests r
cross apply sys.dm_exec_sql_text(sql_handle) as qt, sys.dm_exec_sessions s
where r.session_id > 50 and r.session_id=s.session_id
order by r.session_id, s.host_name, s.program_name, r.status
This query will answer most of the questions above and will show the statement as well as wait or blocking states and the SAP process issuing the query. Additionally you’ll find the plan handle at the end of the row. This allows getting to the execution plan as described in the first part of the series. What are we expecting to see as results? In the column status, we will see quite a few requests which are in the status ‘suspended’ and some others (like Query1 above) as ‘running’. The state ‘suspended’ simply means that the request currently is not active because it is waiting on a resource. The resource can be an I/O for reading a page, it can be communication on the network, or it simply that other requests are using the CPUs and the request is in the list of the scheduler to become ‘running’ again. So no real worries to see status suspended here. The resource the request is waiting on is described in the column wait_type. Expectations are that one sees some resources listed like PAGEIOLATCH_SH indicating disk I/O activity or ASYNCH_NETWORK_IO indicating network communication. It is normal to see those resources. To differentiate the good from the bad in disk I/O activity, one would need to look at the column wait_time. The lower this time, the better since this time mainly reflects the performance of the I/O subsystem. The higher the number the slower the I/O subsystem seems to be. Well or the more data is read per I/O. Therefore one should not take the wait_time as a direct time in milliseconds an I/O needs to be executed. To get the real average time an I/O takes one should use Windows performance Monitor. But the number gives a good indication and direction in which direction to investigate. For ASYNC_NETWORK_IO, we usually see higher numbers happening quite often. These higher numbers indicate as well that larger amounts of data are being transferred from the server to the client. This is normal even in a SAP ERP environment and wait_times of a few seconds here and there can happen, however should happen with the majority of requests. Blocking Lock situations also will show up with a wait_type of LCK….. In such a case the wait_time gives the length the request got blocked. More information about the different wait types can be found in a paper my colleague Tom Davidson wrote
We hardly expect to see Shared Locks on SQL Server while used by SAP Applications since SAP applications are reading mostly in uncommitted isolation level. Hence most of the locks we get to see are locks which indicate data modifications in form of update, inserts or deletes. Also the lock granularity mostly experienced is row level locks either on the clustered index (mostly on the data level) or row locks on non-clustered index rows. Running Query2 as below will show such locks in case they exist:
SELECT w.session_id, w.blocking_session_id, w.wait_type, w.wait_duration_ms, w.resource_description, substring(qt.text,(r.statement_start_offset/2) +1, (case when r.statement_end_offset = -1 then len(convert(nvarchar(max), qt.text)) * 2 else r.statement_end_offset end -r.statement_start_offset)/2) as stmt_executing FROM sys.dm_os_waiting_tasks w, sys.dm_exec_requests r cross apply sys.dm_exec_sql_text(sql_handle) as qt WHERE w.session_id > 49 and w.wait_type like 'LCK%' and w.session_id=r.session_id
The result of Query 2 could - in a more readable form since the row length is too long to present it nicely - look like:
Blocking Session_id: 639
Wait_time in ms: 1812
Resource description: keylock hobtid=72057594095730688 dbid=5 id=lock470e280 mode=U associatedObjectId=72057594095730688
Statement: SELECT TOP 1 "MANDT" AS c ,"OBJNR" AS c ,"OBTYP" AS c ,"STSMA" AS c,"CHGKZ" AS c ,"CHGNR" AS c FROM "JSTO" WITH (UPDLOCK REPEATABLEREAD) WHERE "MANDT" = @P000 AND "OBJNR" = @P001 /* R3:SAPLBSVU:954 T:JSTO */
The wait_type already is showing that we look at an Update lock which is requested by session 212. Session 639 seems to block this request. In the Resource description field you can get the fact that we are looking at a ‘keylock’ which is a row lock in the clustered index and in a most of the cases a row on the data level. You can also get the database ID and the object ID out of the resource description. The last column then gives you the statement which is getting blocked in execution at this point in time.
Query2 is nice in order to look at blocking locks situations which are rather simple, but for more complex situations and in order to work with SAP support, you should use a little set of queries and T-SQL procedures collected under the term ‘hangman’. Please check OSS notes #948633. Hangman gives even more data to analyze the situation, especially with intense lock chains, which can develop easily if longer locks are taken on critical resources as row in the number range table NRIV. Also SAP support uses the ‘hangman’ outputs frequently for investigating issues and hence is familiar with its output.
A lot of different resources where wait situations can happen in SQL Server are monitored. DMV sys.dm_os_wait_stats shows where all those waits are accumulated. A good query to look at this data as a whole would be the following:
select wait_type, wait_time_ms/waiting_tasks_count as 'Avg Wait in ms',
wait_time_ms/ signal_wait_time_ms 'Wait to Signal Wait Ratio',
waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms
from sys.dm_os_wait_stats where waiting_tasks_count > 0 and signal_wait_time_ms >0
The list can be rather confusing and a bit overwhelming. An interesting wait event to look at would be WRITELOG. This event indicates the write performance to the transaction log of the database(s). The counter on this event is incremented when the transaction log buffer in SQL Server gets flushed until an acknowledge of this synchronous I/O comes back from the I/O subsystem. The wait_types PAGEIOLATCH_SH and PAGEIOLATCH_EX give indications on the performance of read I/O from the data files. Be aware that in SQL Server 2005 some issues around accumulating these numbers got improved. Therefore the numbers show up higher than usually experienced in SQL Server 2000. But the numbers now pretty much match what is measured with Windows Performance monitor.
Nevertheless in cases where you want to analyze the next 30 or 60min workload you would need to run the query several times and build deltas since the numbers are accumulated since the start of SQL Server. Or you might get the idea to reset the values as meanwhile described in Books Online. This is a bad idea since it destroys the efforts SAP put into their database monitor to give you exactly the ability to investigate certain time ranges or even online observe own defined time frames. SAP DBA Cockpit offers you to look at all the wait types listed. It offers to reset them all to 0 when you desire and then displays the counts and values since you pressed ‘Reset’ it. The difference is that SAP DBA Cockpit will not reset the values on the database, but will simply start to perform delta reporting from the time of your reset on in DBA Cockpit. Therefore strictly avoid resetting those values on the database with the command that I purposely don’t name here, but use DBA Cockpit in order to monitor these wait statistics over a specific time frame.
Another approach to analyze the performance of a system or investigate improvement potentials is to look where SQL Server is waiting on resources and then try to get those resource bottlenecks resolved. Since we meanwhile have 200+ wait_types one certainly can get overwhelmed by getting clear indications what area the bottleneck is in. Tom already indicated some categories in his whitepaper. SQL Server Performance Dashboard which is a free download also materialized a categorization of wait_types. The query batch below will do so as well. Not very sophisticated using some temporary tables, but it works in principle. It basically reads the values of the 200+ wait_types, groups it in categories and displays the average wait time and the percentage of wait time. Any wait types that indicate some sleeping activity due to low usage of the server are not being calculated.
create table ##aggr_temp (wait_type nvarchar(60), waiting_tasks_count bigint, wait_time_ms bigint, max_wait_time_ms bigint, signal_wait_time_ms bigint,
set nocount on
declare @category nvarchar(60), @wait_type nvarchar(60), @waiting_tasks_count bigint, @wait_time_ms bigint, @max_wait_time_ms bigint, @signal_wait_time_ms bigint
declare c1 cursor for select * from sys.dm_os_wait_stats where waiting_tasks_count > 0
fetch c1 into @wait_type, @waiting_tasks_count, @wait_time_ms, @max_wait_time_ms, @signal_wait_time_ms
while @@fetch_status <>-1
select @category =
when @wait_type like N'LCK_%' then N'Locking'
when @wait_type like N'SOS%' or @wait_type in (N'THREADPOOL', N'LOWFAIL_MEMMGR_QUEUE') then N'SOS'
when @wait_type like N'LATCH_%' then N'Latching'
when @wait_type in (N'SLEEP_BPOOL_FLUSH') then N'Checkpoint'
when @wait_type like N'BROKER_%' then N'Service Broker or Database Mirroring Foundations'
when @wait_type like N'BACKUP%' or @wait_type in (N'ASYNC_DISKPOOL_LOCK', N'REQUEST_DISPENSER_PAUSE') then N'Backup or Restore'
when @wait_type like N'PAGELATCH_%' then N'Buffer Latch'
when @wait_type like N'PAGEIOLATCH_%' or @wait_type in (N'IO_COMPLETION', 'ASYNC_IO_COMPLETION', N'CHKPT', N'DISKIO_SUSPEND', N'IMPPROV_IOWAIT') then N'I/O Performance'
when @wait_type like N'QUERY%' or @wait_type like N'RESOURCE_%' or @wait_type like N'SQLSORT_%' or @wait_type like N'DEADLOCK_%'
or @wait_type in (N'RESOURCE_SEMAPHORE', N'RESOURCE_SEMAPHORE_MUTEX', N'RESOURCE_SEMAPHORE_QUERY_COMPILE', N'RESOURCE_SEMAPHORE_SMALL_QUERY',
N'CMEMTHREAD', N'SOS_RESERVEDMEMBLOCKLIST', N'BAD_PAGE_PROCESS', N'CURSOR', N'CURSOR_ASNC', N'EE_PMOLOCK',
N'EE_SPECPROC_MAP_INIT', N'QRY_MEM_GRANT_INFO_MUTEX', N'OLEDB', N'SQLSORT_NORMMUTEX', N'SQLSORT_SORTMUTEX', N'VIEW_DEFINITION_MUTEX') then N'Query Execution'
when @wait_type in (N'CXPACKET', N'EXCHANGE', N'EXECSYNC') then 'Parallel Query Execution'
when @wait_type = N'SOS_SCHEDULER_YIELD' then N'Scheduler Yield'
when @wait_type in (N'LOGMGR', N'LOGBUFFER', N'LOGMGR_RESERVE_APPEND', N'LOGMGR_FLUSH', N'WRITELOG') then N'Logging'
when @wait_type in (N'ASYNC_NETWORK_IO', N'NET_WAITFOR_PACKET') or @wait_type like N'SNI_%' then N'Network IO'
when @wait_type like N'CLR_%' or @wait_type like N'SQLCLR%' then N'CLR'
when @wait_type like N'DTC_%' then N'Distributed Transaction Controller'
when @wait_type like N'DBMIRROR%' or @wait_type = N'MIRROR_SEND_MESSAGE' then N'Mirroring'
when @wait_type like N'XACT%' or @wait_type like N'TRAN_MARKLATCH_%' or @wait_type like N'MSQL_XACT_%' or @wait_type = N'TRANSACTION_MUTEX' then N'Transaction'
when @wait_type like N'SLEEP_%' or @wait_type in(N'LAZYWRITER_SLEEP', N'WAITFOR', N'WAIT_FOR_RESULTS' ) then N'Sleep'
when @wait_type in (N'FCB_REPLICA_READ', N'FCB_REPLICA_WRITE', N'REPLICA_WRITES') then N'Database Snapshot'
when @wait_type like N'FT%' or @wait_type in (N'MSSEARCH') then N'Fulltext Search'
when @wait_type like N'REPL%' then 'Replication'
when @wait_type in (N'IO_AUDIT_MUTEX') or @wait_type like N'SQLTRACE%' then N'SQL Trace Infrastructure'
if @category <> 'Sleep'
insert into ##aggr_temp values (@wait_type, @waiting_tasks_count, @wait_time_ms, @max_wait_time_ms, @signal_wait_time_ms, @category)
fetch c1 into @wait_type, @waiting_tasks_count, @wait_time_ms, @max_wait_time_ms, @signal_wait_time_ms
select distinct a.category, sum(a.wait_time_ms)/sum(a.waiting_tasks_count) as 'Avg Wait Time', sum(a.waiting_tasks_count) as 'Wait Counts',
sum(a.wait_time_ms) as 'Wait Time', convert(decimal(7,5),(sum(a.wait_time_ms)*100)/b.sum_wait) as 'Percentage Wait'
from ##aggr_temp a, (select sum(wait_time_ms) as 'sum_wait' from ##aggr_temp) b
group by a.category, b.sum_wait
order by 'Percentage Wait' desc
select * from ##aggr_temp where category = 'Other'
drop table ##aggr_temp
The result of this query against a productive SAP ERP system of 4.2TB which shows good performance and doesn’t experience any problems could look like:
Avg Wait Time
Service Broker or Database Mirroring Foundations
Backup or Restore
SQL Trace Infrastructure
So basically 28% of the waits are falling into areas of I/O whereas the average (avg) time of a wait with 6ms seems to be reasonable. Means there might be some improvement possible, but it might not carry that far. The next one is a bit surprising since SAP indeed is not using SQL Server 2005 Service Broker. However the SAP ERP system the query was running against is using synchronous Database Mirroring. Since Database Mirroring is using the same communication foundation as Service Broker, this category might come up as well. The impact synchronous mirroring on performance is extremely small; therefore we are not concerned about these 21% too much for this particular system. Backup or Restore which is the next category at least in the backup case should not interfere with its wait queue with query requests being processed, except in the area of I/O requests to read data. The resource queues in that category mostly cover area where data is getting exchanged between SQL Server and the backup target. Network I/O would cover such wait_types as ASYNC_NETWORK_IO and hence besides indications of slow network also could indicate that a lot of data is constantly exchanged between the client and the server. Mirroring would then cover Mirroring specific resource queues beyond the foundations mentioned earlier. Another interesting category to look at could be ‘Locking’. SAP functionality usually is designed to avoid lock situations. Standard SAP reports are extremely sophisticated in trying to avoid database locks over extended periods of times. Nevertheless in some cases smaller blocking lock situations like reading a document number out of the number range table NRIV often are unavoidable. However poor ABAP/JAVA written reports might cause extended locking. In the results above locking was contributing with 3% to wait situations only. This definitely is at the lower end of a system with around 600 interactive users and over a dozen batch jobs working on the system concurrently. Other systems where we observed more blocking lock situations showed up with higher percentages up to 10%. Query Execution is another interesting category to look at. In times where queries are getting in a wait for memory resources, this category could show significant higher participation. One category which we won’t see here ‘Parallel Query Execution’. Since this is a SAP ERP system, as recommended the degree of parallelism is set to 1. Hence we don’t expect to see any indications of queries being executed in parallel.
The usual way of looking at a system with a relatively high I/O rate and sub-standard Buffer cache hit ratio always was to look for queries which were long running due to either reading a lot of data or not being supported by optimal indexes. In part 1 of this series I introduced queries looking at sys.dm_exec_query_stats where one can find the queries which are causing those physical I/O’s. Another interesting angle to look at some of those systems might be to check the population of the buffer cache. Purpose: Find the tables that keep the most space in the buffer pool. This would cover cases where a few tables indeed keep bigger parts of the cache populated. However not only out of the reason of queries scanning those tables, but also because the whole bandwidth of data in those larger tables is so heavily used. This is a problem, that cannot be solved by adding additional indexes. It would either require changes in ABAP/Java reports to limit the amount of data being accessed or the simple acceptance that one needs more memory for the buffer pool of SQL Server. A query batch to get a list of table populating the cache and the percentage of cache taken would be (Caution: Dependent on the size of the buffer pool this query can run one minute or more):
create table ##allocation_jth (alloc_unit bigint, Number_of_Pages bigint, Table_Name sysname, Percentage_of_Buffer_Pool decimal(7,5))
declare @alloc bigint, @nopage bigint, @tname sysname, @sp bigint, @percentage decimal(7,5)
select @sp=count(page_id) from sys.dm_os_buffer_descriptors
insert into ##allocation_jth select distinct b.allocation_unit_id, count(b.page_id), object_name(p.object_id), (convert(decimal,(count(b.page_id)))/@sp)*100
from sys.dm_os_buffer_descriptors b, sys.allocation_units c, sys.partitions p
where b.allocation_unit_id=c.allocation_unit_id and c.container_id=p.hobt_id and c.type in (1,3)
group by b.allocation_unit_id, p.object_id -- order by count(b.page_id) desc
create index allocation_jth_i1 on ##allocation_jth(Table_Name)
declare c1 cursor for select distinct b.allocation_unit_id, count(b.page_id), object_name(p.object_id), (convert(decimal,(count(b.page_id)))/@sp)*100
where b.allocation_unit_id=c.allocation_unit_id and c.container_id=p.PARTITION_id
and c.type =2
group by b.allocation_unit_id, p.object_id
fetch c1 into @alloc, @nopage, @tname, @percentage
if not exists (select Table_Name from ##allocation_jth where Table_Name = @tname)
insert into ##allocation values (@alloc, @nopage, @tname, @percentage)
update ##allocation_jth set Number_of_Pages = Number_of_Pages+@nopage where Table_Name=@tname
fetch c1 into @alloc, @nopage, @tname, @percentage
select Table_Name, Number_of_Pages, Percentage_of_Buffer_Pool from ##allocation_jth order by Number_of_Pages desc
drop table ##allocation_jth
There certainly might be nicer ways to code this up, but this was a shot to handle a rather unfortunate case in our Metadata in SQL Server 2005 where allocation_unit IDs are different foreign keys dependent on the page type. Please notice the difference between the two queries in the Query batch where the first query handles type 1 and 3, whereas the second one handles type 2.
The result of the query could look like:
Number of Pages
These results could look very different throughout different phases of the day. But the common observation of a few productive SAP systems points to the fact that the 50% of the buffer pool often is occupied by less than 20 tables. The statement above also gives you a good idea on how many different tables got touched and have data in SQL Server Buffer Pool. Dependent on the modules used in a SAP ERP system the number of different tables having pages loaded in the buffer pool at a given point in time can be between several hundred up to several thousand.
Another task could be to check whether a certain table does have data in buffer. This could be done with this kind of query
select distinct page_type as ‘Page Type’, count(page_id) as ‘Number of Pages’ from sys.dm_os_buffer_descriptors where allocation_unit_id in (select a.allocation_unit_id
from sys.partitions p, sys.allocation_units a where p.object_id=object_id('<table name>') and a.container_id=p.hobt_id)
group by sys.dm_os_buffer_descriptors.page_type
order by count(page_id) desc
The name of the table is marked in Red and Italic
The result set for the table VBAK in a productive system could look like:
In this case we look at a table having around 790K data pages in the buffer pool. With table REPOLOAD which contains LOB columns, the result could look like:
These were the most interesting queries so far. As soon as I find new interesting queries being used to investigate issues in SQL Server implementations running SAP applications, I will post them in this blog.
PingBack from http://outdoorceilingfansite.info/story.php?id=5113