Spat's WebLog (Steve Patrick)

When things go wrong...

Paged Pool, DFS and other ramblings..


An interesting case came across my desk the other day. The initial symptom was an unstable system that hung, and it was quickly narrowed down to a memory resource issue.

A quick rundown of the environment.

The servers which were having issues were Server 2003 machines, primarily used as web servers for a photo upload site. Users posted images for viewing or later retrieval. These IIS servers then stored the images on a back-end 3rd party file server as well as a 2003 cluster solution.

After a reboot the servers would work fine for about 12 hours. The resource issue was quickly narrowed down to a shortage of paged pool memory.

Event Type: Error
Event Source: Srv
Event Category: None
Event ID: 2020
Date: 12/14/2006
Time: 2:26:46 AM
User: N/A
Computer: Server18
Description:
The server was unable to allocate from the system paged pool because the pool was empty.

 

Here is what it looks like under a kernel debugger:

 

0: kd> !vm 1
*** Virtual Memory Usage ***
Physical Memory: 524014 ( 2096056 Kb)
Page File: \??\C:\pagefile.sys
Current: 2197504 Kb Free Space: 2194524 Kb
Minimum: 2197504 Kb Maximum: 4190208 Kb
Available Pages: 351447 ( 1405788 Kb)
ResAvail Pages: 479699 ( 1918796 Kb)
Locked IO Pages: 149 ( 596 Kb)
Free System PTEs: 12469 ( 49876 Kb)
Free NP PTEs: 16454 ( 65816 Kb)
Free Special NP: 0 ( 0 Kb)
Modified Pages: 340 ( 1360 Kb)
Modified PF Pages: 340 ( 1360 Kb)
NonPagedPool Usage: 2778 ( 11112 Kb)
NonPagedPool Max: 32328 ( 129312 Kb)
PagedPool 0 Usage: 3396 ( 13584 Kb)
PagedPool 1 Usage: 15774 ( 63096 Kb)
PagedPool 2 Usage: 15804 ( 63216 Kb)
PagedPool 3 Usage: 15834 ( 63336 Kb)
PagedPool 4 Usage: 15821 ( 63284 Kb)
********** Excessive Paged Pool Usage *****
PagedPool Usage: 66629 ( 266516 Kb)
PagedPool Maximum: 67072 ( 268288 Kb)
Shared Commit: 767 ( 3068 Kb)
Special Pool: 0 ( 0 Kb)

 

We could see that there was a paged pool shortage ( 266,516 KB in use out of a 268,288 KB maximum - over 99% consumed ). Pooltags helped us sort out which driver was allocating the most, and it turned out to be MUP.SYS.

Poolmon showed:

                NonPaged              Paged
Tag       Allocs      Used       Allocs        Used
Mup            5       408      3004178   316760536   Multiple UNC provider allocations, generic , Binary: mup.sys
IoNm           0         0       169056    21647120   Io parsing names , Binary: nt!io

Unfortunately, within MUP there are many, many places which end up calling ExAllocatePoolWithTag, and they all used the same tag. So we wrote a private version which broke them out individually so I could tell which paths were being taken.
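Just to illustrate what that instrumentation amounts to ( this is only a rough sketch, not the real MUP source - the function names and the second tag here are made up for the example ):

#include <ntddk.h>

// Sketch only. Pool tags are 4-byte values; with MSVC multi-character
// constants they are written reversed in source so the tools display
// them in the expected order.
#define TAG_MUP_GENERIC   ' puM'    // shows up as "Mup " in poolmon / !poolused
#define TAG_MUP_PATH_37   '73uM'    // shows up as "Mu37"

// Stock behavior: every code path allocates with the same generic tag,
// so the pool tag statistics lump all of the call sites together.
PVOID AllocPrefixEntry(SIZE_T Size)
{
    return ExAllocatePoolWithTag(PagedPool, Size, TAG_MUP_GENERIC);
}

// Instrumented build: each call site gets its own tag, so the leaking
// path stands out on its own line in the pool tag statistics.
PVOID AllocPrefixEntryInstrumented(SIZE_T Size)
{
    return ExAllocatePoolWithTag(PagedPool, Size, TAG_MUP_PATH_37);
}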

Once this was done we collected a full memory dump in order to track the usage again. We could have used Poolmon again, but perhaps we would get a chance to catch MUP.SYS in the act and determine why it was allocating resources in this way.

Unfortunately this didn't happen; the dump had one thread which was processing MUP data, and it didn't seem entirely relevant when I was looking at it ( later on, however, it did prove valuable for confirming various data structures relevant to MUP ), so I guess it was a good idea after all.

Once we got the new kernel dump we could easily sort the usage of paged pool:

0: kd> !poolused 5

Sorting by Paged Pool Consumed
Pool Used:
                 NonPaged                              Paged
Tag      Allocs    Frees     Diff     Used     Allocs     Frees     Diff       Used
Mu37 0 0 0 0 494273 46 494227 100966192 UNKNOWN pooltag 'Mu37'
Mu46 0 0 0 0 123572 14 123558 25699824 UNKNOWN pooltag 'Mu46'
Mu36 0 0 0 0 494273 46 494227 21746008 UNKNOWN pooltag 'Mu36'
Mu2c 0 0 0 0 123558 0 123558 20757744 UNKNOWN pooltag 'Mu2c'
IoNm 0 0 0 0 678531 554089 124442 15938640 Io parsing names , Binary: nt!io
Mu39 0 0 0 0 123572 14 123558 15073184 UNKNOWN pooltag 'Mu39'
Mu38 0 0 0 0 123572 14 123558 15073184 UNKNOWN pooltag 'Mu38'
Mu4c 0 0 0 0 294942 4 294938 14157024 UNKNOWN pooltag 'Mu4c'
Mu4b 0 0 0 0 294942 4 294938 11002544 UNKNOWN pooltag 'Mu4b'
Mu04 0 0 0 0 128272 4666 123606 10877328 UNKNOWN pooltag 'Mu04'

Each tag "Mu37", "Mu38" etc.. related to a particular function, which in this case turned out to all be related to DFS code paths.

What does MUP do? Briefly, it is the Multiple UNC Provider - the component that decides which redirector should handle a given UNC path. Of course, at this point I was interested in the DFS functionality of MUP, since this was Server 2003. BTW, in Vista this has changed and the DFS client functionality has been broken out into DFSC.SYS.

Assuming this machine was acting as a DFS client, I asked about the DFS configuration, but they knew nothing of it: "...there was no DFS configured."

In order to confirm this, I requested the client's ( the IIS server's ) view of its environment. To do this, we ran a tool called DFSUTIL.EXE with the /PKTINFO switch.
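For reference, the invocation itself is simple ( dfsutil ships with the Windows Server 2003 Support Tools, and this was run on the IIS server acting as the client ):

C:\Program Files\Support Tools>dfsutil /pktinfo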

The results were interesting, to say the least. Apparently the 3rd party file server they were using had its own DFS server built in, and they weren't aware of this.

Here is what came back in their dfsutil output. ( edited of course )

Entry: \MyFileServer\Root\lores\20070110\R6\335\2038335\29109072_lores.jpg
ShortEntry: \MyFileServer\Root\lores\20070110\R6\335\2038335\29109072_lores.jpg
Expires in 57377 seconds
UseCount: 0 Type:0x1 ( DFS )
0:[\MyFileServer\other.name.16616083360685226265.4-{#}-Root\20070110\R6\335\2038335\29109072_lores.jpg] State:0x31
1:[\MyFileServer.domain.local\other.name.16616083360685226265.4-{#}-Root\20070110\R6\335\2038335\29109072_lores.jpg]
2:[\MyFileServer1\other.name.16616083360685226265.4-{#}-Root\20070110\R6\335\2038335\29109072_lores.jpg] State:0x21 ( )
3:[\MyFileServer1.domain.local\other.name.16616083360685226265.4-{#}-Root\20070110\R6\335\2038335\29109072_lores.jpg] State:0x21 ( )

 

If you look at the data above, the Entry line is the PKT entry - notice that it is the full path, including the file name. The 0: line below it is the link target information, which is shown as active ( State:0x31 ). They had over 10,500 of these entries... hence the problem.

It turns out that the 3rd party file server implemented its own DFS server and was handing out these 'bad' referrals. As DFS is implemented and tested, "targets" are whole subtrees, not individual files.

 

Here is an example of what we expect: 

In this example the environment looks like this:

Server = 2k3entspat
Root = Root2
Link Name = Depth1\Depth2
Link Target(s) = \\2k3entspat\dfs_link_1 and \\spat2003-1\link1
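( As a side note, a link like this would typically be defined with something along the lines of the dfscmd commands below - the exact syntax is from memory and shown only as a sketch, using the names from this example. )

C:\>dfscmd /map \\2k3entspat\Root2\Depth1\Depth2 \\2k3entspat\dfs_link_1
C:\>dfscmd /add \\2k3entspat\Root2\Depth1\Depth2 \\spat2003-1\link1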

MORE INFO:

Here is the actual folder for one of the link targets:

C:\DFS_LINK_1>tree /F
Folder PATH listing
Volume serial number is 0006EE50 50AB:3FD5
C:.
│   t.txt
│   test.txt
│
├───DirInLink
│       text_file_in_dir.txt
│
├───New Folder
└───this is a test after rpc reg values

When you do a "start \\2k3entspat\root2\Depth1\Depth2\dirinlink\text_file_in_dir.txt"

Here is what you should see:

Standard referral response for the root :

SMB: DFS Sharename =\2k3entspat\Root2
SMB: ...............................1 = Referral Server
SMB: ..............................1. = Storage Server

Client sends Referral request

SMB: DFS Request Filename =\2k3entspat\root2\Depth1\Depth2\dirinlink\

Server sends the referral back, with one entry for each link target.


Referral1
SMB: DFS Filename =\2k3entspat\root2\Depth1\Depth2
SMB: DFS Sharename =\spat2003-1\link1
Referral2
SMB: DFS Filename =\2k3entspat\root2\Depth1\Depth2
SMB: DFS Sharename =\2k3entspat\dfs_link_1

 

However, they were doing something like this:

Environment ( simplified )

Server = Server
Root = Root
Link Name = Link1
Link Target(s) = \\otherserver\share

In \\otherserver\share there was a dir called “newdir” and a file within that directory.

In their data we could see:

Referral request

SMB: DFS Request Filename =\Server\Root\Link1\newdir

And the response from the server was:

SMB: DFS Filename =\Server\root\Link1\newdir
SMB: DFS Sharename =\OtherServer\share\newdir

Where DFS Sharename should have been just \OtherServer\share ( without the extra directory in the path ) -- this is the target for the link itself. The vendor is working on a server-side patch to correct this.
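For comparison, following the pattern from the earlier capture, a correct response for their namespace should have looked something like this:

SMB: DFS Filename =\Server\Root\Link1
SMB: DFS Sharename =\OtherServer\share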

More information about our DFS implementation is here:

http://www.microsoft.com/mind/1196/cifs.asp AND http://www.snia.org/tech_activities/CIFS

If the server determines that the DFS name actually maps to a different server share, access will fail with the distinguished error STATUS_PATH_NOT_COVERED (SMB status code 0xC0000257). On receiving this error, the DFS-enabled client should ask the server for a referral. The response to the referral request will contain a list of server and share names to try and the part of the request file name that links to the list of server shares. If the ServerType field of the referral is set to one (SMB server), then the client should resubmit the request with the original file name to one of the server shares in the list, once again setting the Flags2 bit 12 bit in the SMB. If the ServerType field is not one, then the client should strip off the part of the file name that links to the server share before resubmitting the request to one of servers in the list.
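As a rough user-mode sketch of that last rule ( purely illustrative - the function name, sample path and file name here are invented for the example; this is not the actual redirector code ):

#include <stdio.h>
#include <string.h>

/* Apply the rule quoted above: 'covered' is the length of the portion of
   the original name that the referral covers ( the DFS Filename field ). */
static const char *NameToResubmit(const char *original, size_t covered, int serverType)
{
    if (serverType == 1)
        return original;           /* SMB server: resubmit the original name */
    return original + covered;     /* otherwise strip the covered prefix     */
}

int main(void)
{
    const char *name = "\\Server\\Root\\Link1\\newdir\\somefile.jpg";
    size_t covered = strlen("\\Server\\Root\\Link1");

    printf("%s\n", NameToResubmit(name, covered, 1));  /* full original name   */
    printf("%s\n", NameToResubmit(name, covered, 0));  /* \newdir\somefile.jpg */
    return 0;
}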

 

Hrmm, this post is a bit long and rambles, but maybe it will help someone out someday.

Useful takeaways -

Perfmon is your friend; use it ( or the Event ID 2020 message ) to identify the resource shortage.
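For the paged pool specifically, the counter to watch is Memory\Pool Paged Bytes - for example, sampling it once a minute from the command line:

C:\>typeperf "\Memory\Pool Paged Bytes" -si 60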

Use Poolmon to track kernel memory issues and identify which pooltag ( and driver ) is causing the issue. Not everyone uses a tag which is easily identified, so you may have to search the driver binaries themselves to find the exact one.
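For example, assuming the tag string appears literally in the driver binary, a quick search of the drivers directory will often point you at the owner:

C:\WINDOWS\system32\drivers>findstr /m /l Mup *.sys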

Added  - Feb 9:

Or do this ( if pool tagging is enabled ); it will query the drivers for the tags:

C:\WINDDK\3790.1830\tools\other\i386>poolmon.exe -c
Poolmon: No localtag.txt in current directory
Creating local pool tag file: localtag.txt ...
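On a live system, the -b and -p switches ( sort by bytes used, show paged pool only ) also help the top consumer stand out:

C:\WINDDK\3790.1830\tools\other\i386>poolmon.exe -b -p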

In the end, had I known about the DFS config originally, I could possibly have made the connection between the Poolmon output ( knowing it was MUP ) and the DFS config.

However, since we weren't aware of the DFS implementation, it took a few more hints to get us this far. Debugging isn't a tool - it's a methodology. Use your knowledge of the system to piece things together in a logical fashion and you'll find the problem.

PS: Big thanks to Mahmood and Tony ( customer and 3rd party vendor contacts ) -- the collaboration was excellent.

Keywords:  2020 paged pool pooltag MUP DFS
