SAP DBACockpit and some related SQL Scripts (Part 3)

SAP DBACockpit and some related SQL Scripts (Part 3)

Rate This
  • Comments 1

Well, let’s go to the last part of our series about SAP DBACockpit and associated SQL Server scripts.

Actual Running Requests

Another area which usually is investigated is to detect long running queries against the database. What can be from interests are questions like whether a requests is blocked in any way or experiences wait situations on some kind of resources and more important how the statement looks like. Especially in the case of SAP applications running against SQL Server, the interesting question as well is to trace these requests back to a specific SAP application server and the specific process the request is issued from. A query to get this information from would be:

--Query1

select r.session_id, s.host_name, s.program_name, s.host_process_id,

r.status, r.wait_time,wait_type,r.wait_resource,

substring(qt.text,(r.statement_start_offset/2) +1,

(case when r.statement_end_offset = -1

then len(convert(nvarchar(max), qt.text)) * 2

else r.statement_end_offset end -r.statement_start_offset)/2)

as stmt_executing,r.blocking_session_id, r.cpu_time,r.total_elapsed_time,r.reads,r.writes,

r.logical_reads, r.plan_handle

from sys.dm_exec_requests r

cross apply sys.dm_exec_sql_text(sql_handle) as qt, sys.dm_exec_sessions s

where r.session_id > 50 and r.session_id=s.session_id

order by r.session_id, s.host_name, s.program_name, r.status

 

This query will answer most of the questions above and will show the statement as well as wait or blocking states and the SAP process issuing the query. Additionally you’ll find the plan handle at the end of the row. This allows getting to the execution plan as described in the first part of the series. What are we expecting to see as results? In the column status, we will see quite a few requests which are in the status ‘suspended’ and some others (like Query1 above) as ‘running’. The state ‘suspended’ simply means that the request currently is not active because it is waiting on a resource. The resource can be an I/O for reading a page, it can be communication on the network, or it simply that other requests are using the CPUs and the request is in the list of the scheduler to become ‘running’ again. So no real worries to see status suspended here. The resource the request is waiting on is described in the column wait_type. Expectations are that one sees some resources listed like PAGEIOLATCH_SH indicating disk I/O activity or ASYNCH_NETWORK_IO indicating network communication. It is normal to see those resources. To differentiate the good from the bad in disk I/O activity, one would need to look at the column wait_time. The lower this time, the better since this time mainly reflects the performance of the I/O subsystem. The higher the number the slower the I/O subsystem seems to be. Well or the more data is read per I/O. Therefore one should not take the wait_time as a direct time in milliseconds an I/O needs to be executed. To get the real average time an I/O takes one should use Windows performance Monitor. But the number gives a good indication and direction in which direction to investigate. For ASYNC_NETWORK_IO, we usually see higher numbers happening quite often. These higher numbers indicate as well that larger amounts of data are being transferred from the server to the client. This is normal even in a SAP ERP environment and wait_times of a few seconds here and there can happen, however should happen with the majority of requests. Blocking Lock situations also will show up with a wait_type of LCK….. In such a case the wait_time gives the length the request got blocked. More information about the different wait types can be found in a paper my colleague Tom Davidson wrote

 

Monitoring requests waiting on locks

We hardly expect to see Shared Locks on SQL Server while used by SAP Applications since SAP applications are reading mostly in uncommitted isolation level. Hence most of the locks we get to see are locks which indicate data modifications in form of update, inserts or deletes. Also the lock granularity mostly experienced is row level locks either on the clustered index (mostly on the data level) or row locks on non-clustered index rows. Running Query2 as below will show such locks in case they exist:

--Query2:

SELECT w.session_id, w.blocking_session_id, w.wait_type, w.wait_duration_ms, w.resource_description, substring(qt.text,(r.statement_start_offset/2) +1, 
(case when r.statement_end_offset = -1  then len(convert(nvarchar(max), qt.text)) * 2 
else r.statement_end_offset end -r.statement_start_offset)/2)  as stmt_executing
FROM    sys.dm_os_waiting_tasks w, sys.dm_exec_requests r  cross apply sys.dm_exec_sql_text(sql_handle) as qt               
WHERE w.session_id > 49 and w.wait_type like 'LCK%' and w.session_id=r.session_id

The result of Query 2 could - in a more readable form since the row length is too long to present it nicely - look like:

Session_id:                         212        

Blocking Session_id:       639        

Wati_type:                         LCK_M_U           

Wait_time in ms:              1812      

Resource description:    keylock hobtid=72057594095730688 dbid=5 id=lock470e280 mode=U
                                                associatedObjectId=72057594095730688              

Statement:                         SELECT TOP 1 "MANDT" AS c ,"OBJNR" AS c ,"OBTYP" AS c ,"STSMA" AS c,"CHGKZ" AS c ,"CHGNR" AS c FROM "JSTO" WITH (UPDLOCK REPEATABLEREAD)  WHERE "MANDT" = @P000 AND "OBJNR" = @P001 /* R3:SAPLBSVU:954 T:JSTO */

The wait_type already is showing that we look at an Update lock which is requested by session 212. Session 639 seems to block this request. In the Resource description field you can get the fact that we are looking at a ‘keylock’ which is a row lock in the clustered index and in a most of the cases a row on the data level. You can also get the database ID and the object ID out of the resource description. The last column then gives you the statement which is getting blocked in execution at this point in time.

Query2 is nice in order to look at blocking locks situations which are rather simple, but for more complex situations and in order to work with SAP support, you should use a little set of queries and T-SQL procedures collected under the term ‘hangman’. Please check OSS notes #948633. Hangman gives even more data to analyze the situation, especially with intense lock chains, which can develop easily if longer locks are taken on critical resources as row in the number range table NRIV. Also SAP support uses the ‘hangman’ outputs frequently for investigating issues and hence is familiar with its output.

Wait Statistics

A lot of different resources where wait situations can happen in SQL Server are monitored. DMV  sys.dm_os_wait_stats shows where all those waits are accumulated. A good query to look at this data as a whole would be the following:

 

--Query 3:

select wait_type, wait_time_ms/waiting_tasks_count as 'Avg Wait in ms',

wait_time_ms/ signal_wait_time_ms 'Wait to Signal Wait Ratio',

waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms

from sys.dm_os_wait_stats where waiting_tasks_count > 0 and signal_wait_time_ms >0

 

The list can be rather confusing and a bit overwhelming. An interesting wait event to look at would be WRITELOG. This event indicates the write performance to the transaction log of the database(s). The counter on this event is incremented when the transaction log buffer in SQL Server gets flushed until an acknowledge of this synchronous I/O comes back from the I/O subsystem. The wait_types PAGEIOLATCH_SH and PAGEIOLATCH_EX give indications on the performance of read I/O from the data files. Be aware that in SQL Server 2005 some issues around accumulating these numbers got improved. Therefore the numbers show up higher than usually experienced in SQL Server 2000. But the numbers now pretty much match what is measured with Windows Performance monitor.

Nevertheless in cases where you want to analyze the next 30 or 60min workload you would need to run the query several times and build deltas since the numbers are accumulated since the start of SQL Server. Or you might get the idea to reset the values as meanwhile described in Books Online. This is a bad idea since it destroys the efforts SAP put into their database monitor to give you exactly the ability to investigate certain time ranges or even online observe own defined time frames. SAP DBA Cockpit offers you to look at all the wait types listed. It offers to reset them all to 0 when you desire and then displays the counts and values since you pressed ‘Reset’ it. The difference is that SAP DBA Cockpit will not reset the values on the database, but will simply start to perform delta reporting from the time of your reset on in DBA Cockpit. Therefore strictly avoid resetting those values on the database with the command that I purposely don’t name here, but use DBA Cockpit in order to monitor these wait statistics over a specific time frame.

 

Another approach to analyze the performance of a system or investigate improvement potentials is to look where SQL Server is waiting on resources and then try to get those resource bottlenecks resolved. Since we meanwhile have 200+ wait_types one certainly can get overwhelmed by getting clear indications what area the bottleneck is in. Tom already indicated some categories in his whitepaper. SQL Server Performance Dashboard which is a free download also materialized a categorization of wait_types. The query batch below will do so as well. Not very sophisticated using some temporary tables, but it works in principle. It basically reads the values of the 200+ wait_types, groups it in categories and displays the average wait time and the percentage of wait time. Any wait types that indicate some sleeping activity due to low usage of the server are not being calculated.

 

--Query4

create table ##aggr_temp (wait_type nvarchar(60), waiting_tasks_count bigint, wait_time_ms bigint, max_wait_time_ms bigint, signal_wait_time_ms bigint,

category nvarchar(60))

set nocount on

declare @category nvarchar(60), @wait_type nvarchar(60), @waiting_tasks_count bigint, @wait_time_ms bigint, @max_wait_time_ms bigint, @signal_wait_time_ms bigint

declare c1 cursor for select * from sys.dm_os_wait_stats where waiting_tasks_count > 0

open c1

fetch c1 into @wait_type, @waiting_tasks_count, @wait_time_ms, @max_wait_time_ms, @signal_wait_time_ms

while @@fetch_status <>-1

begin

                select @category =

                                case

                                                when @wait_type like N'LCK_%' then N'Locking'

                                                when @wait_type like N'SOS%' or @wait_type in (N'THREADPOOL', N'LOWFAIL_MEMMGR_QUEUE') then N'SOS'

                                                when @wait_type like N'LATCH_%' then N'Latching'

                                                when @wait_type in (N'SLEEP_BPOOL_FLUSH') then N'Checkpoint'

                                                when @wait_type like N'BROKER_%' then N'Service Broker or Database Mirroring Foundations'

                                                when @wait_type like N'BACKUP%' or @wait_type in (N'ASYNC_DISKPOOL_LOCK', N'REQUEST_DISPENSER_PAUSE') then N'Backup or Restore'

                                                when @wait_type like N'PAGELATCH_%' then N'Buffer Latch'

                                                when @wait_type like N'PAGEIOLATCH_%' or @wait_type in (N'IO_COMPLETION', 'ASYNC_IO_COMPLETION', N'CHKPT', N'DISKIO_SUSPEND', N'IMPPROV_IOWAIT') then N'I/O Performance'

                                                when @wait_type like N'QUERY%' or @wait_type like N'RESOURCE_%' or @wait_type like N'SQLSORT_%' or @wait_type like N'DEADLOCK_%'

                                                or @wait_type in (N'RESOURCE_SEMAPHORE', N'RESOURCE_SEMAPHORE_MUTEX', N'RESOURCE_SEMAPHORE_QUERY_COMPILE', N'RESOURCE_SEMAPHORE_SMALL_QUERY',

                        N'CMEMTHREAD', N'SOS_RESERVEDMEMBLOCKLIST', N'BAD_PAGE_PROCESS', N'CURSOR', N'CURSOR_ASNC', N'EE_PMOLOCK',

                                                N'EE_SPECPROC_MAP_INIT', N'QRY_MEM_GRANT_INFO_MUTEX', N'OLEDB', N'SQLSORT_NORMMUTEX', N'SQLSORT_SORTMUTEX', N'VIEW_DEFINITION_MUTEX') then N'Query Execution'

                                                when @wait_type in (N'CXPACKET', N'EXCHANGE', N'EXECSYNC') then 'Parallel Query Execution'

                                                when @wait_type = N'SOS_SCHEDULER_YIELD' then N'Scheduler Yield'

                                                when @wait_type in (N'LOGMGR', N'LOGBUFFER', N'LOGMGR_RESERVE_APPEND', N'LOGMGR_FLUSH', N'WRITELOG') then N'Logging'

                                                when @wait_type in (N'ASYNC_NETWORK_IO', N'NET_WAITFOR_PACKET') or @wait_type like N'SNI_%' then N'Network IO'

                                                when @wait_type like N'CLR_%' or @wait_type like N'SQLCLR%' then N'CLR'

                                                when @wait_type like N'DTC_%' then N'Distributed Transaction Controller'

                                                when @wait_type like N'DBMIRROR%' or @wait_type = N'MIRROR_SEND_MESSAGE' then N'Mirroring'

                                                when @wait_type like N'XACT%' or @wait_type like N'TRAN_MARKLATCH_%' or @wait_type like N'MSQL_XACT_%' or @wait_type = N'TRANSACTION_MUTEX' then N'Transaction'

                                                when @wait_type like N'SLEEP_%' or @wait_type in(N'LAZYWRITER_SLEEP', N'WAITFOR', N'WAIT_FOR_RESULTS' ) then N'Sleep'

                                                when @wait_type in (N'FCB_REPLICA_READ', N'FCB_REPLICA_WRITE', N'REPLICA_WRITES') then N'Database Snapshot'

                                                when @wait_type like N'FT%' or @wait_type in (N'MSSEARCH') then N'Fulltext Search'

                                                when @wait_type like N'REPL%' then 'Replication'

                                                when @wait_type in (N'IO_AUDIT_MUTEX') or @wait_type like N'SQLTRACE%' then N'SQL Trace Infrastructure'

                                                else N'Other'

                                end

                if @category <> 'Sleep'

                                insert into ##aggr_temp values (@wait_type, @waiting_tasks_count, @wait_time_ms, @max_wait_time_ms, @signal_wait_time_ms, @category)

                fetch c1 into @wait_type, @waiting_tasks_count, @wait_time_ms, @max_wait_time_ms, @signal_wait_time_ms

end

go

deallocate c1

select distinct a.category, sum(a.wait_time_ms)/sum(a.waiting_tasks_count) as 'Avg Wait Time', sum(a.waiting_tasks_count) as 'Wait Counts',

sum(a.wait_time_ms) as 'Wait Time', convert(decimal(7,5),(sum(a.wait_time_ms)*100)/b.sum_wait) as 'Percentage Wait'

from ##aggr_temp a, (select sum(wait_time_ms) as 'sum_wait' from ##aggr_temp) b

group by a.category, b.sum_wait

order by 'Percentage Wait' desc

select * from ##aggr_temp where category = 'Other'

drop table ##aggr_temp

 

 

The result of this query against a productive SAP ERP system of 4.2TB which shows good performance and doesn’t experience any problems could look like:

 

Category

Avg Wait Time

Wait Counts

Wait Time

Percentage Wait

I/O Performance

6

1178288536

7556375654

28.00

Service Broker or Database Mirroring Foundations

14

392964017

5845522421

21.00

Backup or Restore

10

389324951

4206180842

15.00

Network IO

3

986752687

3407236484

12.00

Mirroring

9

199557746

1901074483

7.00

SQL Trace Infrastructure

3999

427110

1708141421

6.00

Locking

1903

443802

844982386

3.00

SOS

1

300588972

356338999

1.00

Transaction

55

3873951

216409937

0.00

Logging

2

72535263

154614249

0.00

Other

282

718783

202713531

0.00

Query Execution

0

427123366

8718840

0.00

Buffer Latch

6

37372481

252618046

0.00

Checkpoint

0

7720735

7636500

0.00

Latching

6

768209

5289937

0.00

 

So basically 28% of the waits are falling into areas of I/O whereas the average (avg) time of a wait with 6ms seems to be reasonable. Means there might be some improvement possible, but it might not carry that far. The next one is a bit surprising since SAP indeed is not using SQL Server 2005 Service Broker. However the SAP ERP system the query was running against is using synchronous Database Mirroring. Since Database Mirroring is using the same communication foundation as Service Broker, this category might come up as well.  The impact synchronous mirroring on performance is extremely small; therefore we are not concerned about these 21% too much for this particular system. Backup or Restore which is the next category at least in the backup case should not interfere with its wait queue with query requests being processed, except in the area of I/O requests to read data. The resource queues in that category mostly cover area where data is getting exchanged between SQL Server and the backup target. Network I/O would cover such wait_types as ASYNC_NETWORK_IO and hence besides indications of slow network also could indicate that a lot of data is constantly exchanged between the client and the server. Mirroring would then cover Mirroring specific resource queues beyond the foundations mentioned earlier. Another interesting category to look at could be ‘Locking’. SAP functionality usually is designed to avoid lock situations. Standard SAP reports are extremely sophisticated in trying to avoid database locks over extended periods of times. Nevertheless in some cases smaller blocking lock situations like reading a document number out of the number range table NRIV often are unavoidable. However poor ABAP/JAVA written reports might cause extended locking. In the results above locking was contributing with 3% to wait situations only. This definitely is at the lower end of a system with around 600 interactive users and over a dozen batch jobs working on the system concurrently. Other systems where we observed more blocking lock situations showed up with higher percentages up to 10%. Query Execution is another interesting category to look at. In times where queries are getting in a wait for memory resources, this category could show significant higher participation. One category which we won’t see here ‘Parallel Query Execution’. Since this is a SAP ERP system, as recommended the degree of parallelism is set to 1. Hence we don’t expect to see any indications of queries being executed in parallel.

 

How is the SQL Buffer Pool being used

The usual way of looking at a system with a relatively high I/O rate and sub-standard Buffer cache hit ratio always was to look for queries which were long running due to either reading a lot of data or not being supported by optimal indexes. In part 1 of this series I introduced queries looking at sys.dm_exec_query_stats where one can find the queries which are causing those physical I/O’s. Another interesting angle to look at some of those systems might be to check the population of the buffer cache. Purpose: Find the tables that keep the most space in the buffer pool. This would cover cases where a few tables indeed keep bigger parts of the cache populated. However not only out of the reason of queries scanning those tables, but also because the whole bandwidth of data in those larger tables is so heavily used. This is a problem, that cannot be solved by adding additional indexes. It would  either require changes in ABAP/Java reports to limit the amount of data being accessed or the simple acceptance that one needs more memory for the buffer pool of SQL Server. A query batch to get a list of table populating the cache and the percentage of cache taken would be (Caution: Dependent on the size of the buffer pool this query can run one minute or more):

--Query5

create table ##allocation_jth (alloc_unit bigint, Number_of_Pages bigint, Table_Name sysname, Percentage_of_Buffer_Pool decimal(7,5))

go

declare @alloc bigint, @nopage bigint, @tname sysname, @sp bigint, @percentage decimal(7,5)

select @sp=count(page_id) from sys.dm_os_buffer_descriptors

insert into ##allocation_jth select distinct b.allocation_unit_id, count(b.page_id), object_name(p.object_id), (convert(decimal,(count(b.page_id)))/@sp)*100

from sys.dm_os_buffer_descriptors b, sys.allocation_units c, sys.partitions p

where b.allocation_unit_id=c.allocation_unit_id and c.container_id=p.hobt_id and c.type in (1,3)

group by b.allocation_unit_id, p.object_id -- order by count(b.page_id) desc

create index allocation_jth_i1 on ##allocation_jth(Table_Name)

 

declare c1 cursor for select distinct b.allocation_unit_id, count(b.page_id), object_name(p.object_id), (convert(decimal,(count(b.page_id)))/@sp)*100

from sys.dm_os_buffer_descriptors b, sys.allocation_units c, sys.partitions p

where b.allocation_unit_id=c.allocation_unit_id and c.container_id=p.PARTITION_id

and c.type =2

group by b.allocation_unit_id, p.object_id

open c1

fetch c1 into @alloc, @nopage, @tname, @percentage

while @@fetch_status<>-1

begin

                if not exists (select Table_Name from ##allocation_jth where Table_Name = @tname)

                                insert into ##allocation values (@alloc, @nopage, @tname, @percentage)

                else

                                update ##allocation_jth set Number_of_Pages = Number_of_Pages+@nopage where Table_Name=@tname

                fetch c1 into @alloc, @nopage, @tname, @percentage

end

deallocate c1

select Table_Name, Number_of_Pages, Percentage_of_Buffer_Pool from ##allocation_jth order by Number_of_Pages desc

drop table ##allocation_jth

 

There certainly might be nicer ways to code this up, but this was a shot to handle a rather unfortunate case in our Metadata in SQL Server 2005 where allocation_unit IDs are different foreign keys dependent on the page type. Please notice the difference between the two queries in the Query batch where the first query handles type 1 and 3, whereas the second one handles type 2.

The result of the query could look like:

Table Name

Number of Pages

Percentage_of_Buffer_Pool

ZXUM4XS

813588

16.99073

ZXUM4XSLOG

516540

10.78727

REGUH

453015

9.46063

VBRK

407574

8.51166

LIKP

354814

7.40983

ZXUM2PL

247153

5.16147

PA0008

237939

4.96905

ANLC

185510

3.87414

SOST

159460

3.33012

LIPS

94667

1.977

MARC

66501

1.38879

AUFK

66468

1.3881

VBRK

57900

1.20917

…..
…..

These results could look very different throughout different phases of the day.  But the common observation of a few productive SAP systems points to the fact that the 50% of the buffer pool often is occupied by less than 20 tables. The statement above also gives you a good idea on how many different tables got touched and have data in SQL Server Buffer Pool. Dependent on the modules used in a SAP ERP system the number of different tables having pages loaded in the buffer pool at a given point in time can be between several hundred up to several thousand.

 

Another task could be to check whether a certain table does have data in buffer. This could be done with this kind of query

--Query6

select distinct page_type as ‘Page Type’, count(page_id) as ‘Number of Pages’ from sys.dm_os_buffer_descriptors where allocation_unit_id in (select a.allocation_unit_id

from sys.partitions p, sys.allocation_units a where p.object_id=object_id('<table name>') and a.container_id=p.hobt_id)

group by sys.dm_os_buffer_descriptors.page_type

order by count(page_id) desc

 

The name of the table is marked in Red and Italic

The result set for the table VBAK in a productive system could look like:


 

Page Type

Number of Pages

DATA_PAGE

790732

INDEX_PAGE

10118

IAM_PAGE

34

 

In this case we look at a table having around 790K data pages in the buffer pool. With table REPOLOAD which contains LOB columns, the result could look like:

 

Page Type

Number of Pages

TEXT_MIX_PAGE

132

DATA_PAGE

21

INDEX_PAGE

17

TEXT_TREE_PAGE

8

IAM_PAGE

1

 

 

These were the most interesting queries so far. As soon as I find new interesting queries being used to investigate issues in SQL Server implementations running SAP applications, I will post them in this blog.

Leave a Comment
  • Please add 6 and 1 and type the answer here:
  • Post