Hello NTdebuggers, I'm very impressed with the depth of the answers we are seeing from our readers. As I stated in last week's response, this week's puzzler is going to be harder. With that said let's take it up a notch. One of the things that is really cool about be an Escalation Engineer in GES/CPR is how far we go in the pursuit of solving complex problems. If we're debugging some Microsoft code in a kernel dump or user mode, and our quest takes us into a binary that we don't have code or symbols for, we don't stop, we forge on! Over the years there are members of our team that have had to port to or support Alpha, PowerPC, MIPs, IA64 and x64, myself included. As a result most of us have books for just about every mainstream processor under the sun. It's a good idea if you're going to be debugging on these platforms to have general working knowledge the CPUs you will encounter. The most common CPU's we deal with are x86, followed by x64 and IA64. Microsoft doesn't support PPC, MIPS or Alpha anymore unless you're dealing with Xbox consoles, and those are PPC. That said, this week's challenge is to tell us what the following assembly does. You can tell us in C, or break it down and comment on the various sections.
Some people like cross word puzzles, Most of us in GES/CPR love to esreveR reenignE assembler. Have FUN!
“I don’t even see the code anymore” Cypher... 0:000> uf myfun puzzler3!myfun [c:\source\puzzler\puzzler3\puzzler3\puzzler3.cpp @ 20]: 20 00cc1480 55 push ebp 20 00cc1481 8bec mov ebp,esp 20 00cc1483 81ecf0000000 sub esp,0F0h 20 00cc1489 53 push ebx 20 00cc148a 56 push esi 20 00cc148b 57 push edi 20 00cc148c 8dbd10ffffff lea edi,[ebp-0F0h] 20 00cc1492 b93c000000 mov ecx,3Ch 20 00cc1497 b8cccccccc mov eax,0CCCCCCCCh 20 00cc149c f3ab rep stos dword ptr es:[edi] 26 00cc149e 8b4508 mov eax,dword ptr [ebp+8] 26 00cc14a1 50 push eax 26 00cc14a2 e803fcffff call puzzler3!ILT+165(_strlen) (00cc10aa) 26 00cc14a7 83c404 add esp,4 26 00cc14aa 8945e0 mov dword ptr [ebp-20h],eax 28 00cc14ad 8b45e0 mov eax,dword ptr [ebp-20h] 28 00cc14b0 8945f8 mov dword ptr [ebp-8],eax 28 00cc14b3 eb09 jmp puzzler3!myfun+0x3e (00cc14be) 28 00cc14b5 8b45f8 mov eax,dword ptr [ebp-8] 28 00cc14b8 83e801 sub eax,1 28 00cc14bb 8945f8 mov dword ptr [ebp-8],eax 28 00cc14be 837df800 cmp dword ptr [ebp-8],0 28 00cc14c2 7e60 jle puzzler3!myfun+0xa4 (00cc1524) 30 00cc14c4 c745ec00000000 mov dword ptr [ebp-14h],0 30 00cc14cb eb09 jmp puzzler3!myfun+0x56 (00cc14d6) 30 00cc14cd 8b45ec mov eax,dword ptr [ebp-14h] 30 00cc14d0 83c001 add eax,1 30 00cc14d3 8945ec mov dword ptr [ebp-14h],eax 30 00cc14d6 8b45f8 mov eax,dword ptr [ebp-8] 30 00cc14d9 83e801 sub eax,1 30 00cc14dc 3945ec cmp dword ptr [ebp-14h],eax 30 00cc14df 7d41 jge puzzler3!myfun+0xa2 (00cc1522) 32 00cc14e1 8b4508 mov eax,dword ptr [ebp+8] 32 00cc14e4 0345ec add eax,dword ptr [ebp-14h] 32 00cc14e7 0fbe08 movsx ecx,byte ptr [eax] 32 00cc14ea 8b5508 mov edx,dword ptr [ebp+8] 32 00cc14ed 0355ec add edx,dword ptr [ebp-14h] 32 00cc14f0 0fbe4201 movsx eax,byte ptr [edx+1] 32 00cc14f4 3bc8 cmp ecx,eax 32 00cc14f6 7e28 jle puzzler3!myfun+0xa0 (00cc1520) 34 00cc14f8 8b4508 mov eax,dword ptr [ebp+8] 34 00cc14fb 0345ec add eax,dword ptr [ebp-14h] 34 00cc14fe 8a08 mov cl,byte ptr [eax] 34 00cc1500 884dd7 mov byte ptr [ebp-29h],cl 35 00cc1503 8b4508 mov eax,dword ptr [ebp+8] 35 00cc1506 0345ec add eax,dword ptr [ebp-14h] 35 00cc1509 8b4d08 mov ecx,dword ptr [ebp+8] 35 00cc150c 034dec add ecx,dword ptr [ebp-14h] 35 00cc150f 8a5101 mov dl,byte ptr [ecx+1] 35 00cc1512 8810 mov byte ptr [eax],dl 36 00cc1514 8b4508 mov eax,dword ptr [ebp+8] 36 00cc1517 0345ec add eax,dword ptr [ebp-14h] 36 00cc151a 8a4dd7 mov cl,byte ptr [ebp-29h] 36 00cc151d 884801 mov byte ptr [eax+1],cl 38 00cc1520 ebab jmp puzzler3!myfun+0x4d (00cc14cd) 40 00cc1522 eb91 jmp puzzler3!myfun+0x35 (00cc14b5) 41 00cc1524 5f pop edi 41 00cc1525 5e pop esi 41 00cc1526 5b pop ebx 41 00cc1527 81c4f0000000 add esp,0F0h 41 00cc152d 3bec cmp ebp,esp 41 00cc152f e820fcffff call puzzler3!ILT+335(__RTC_CheckEsp) (00cc1154) 41 00cc1534 8be5 mov esp,ebp 41 00cc1536 5d pop ebp 41 00cc1537 c3 ret
Good luck, and happy debugging.Jeff Dailey-
In response: Wow, you folks did it again. I was worried that not many of our readers would respond. Our entire team was very impressed with the number and quality of the responses we saw. Congratulations goes out to all those assembler gurus out there that figured out this was a simple bubble sort. We enjoyed seeing how various people went about solving this. Some people compiled their code as they worked on reversing the function to verify the assembler. This is a good approach. Others just seemed to work it out end to end. This is the approach I usually end up using because I’m typically in the middle of a debug and don’t actually need the source.
Great work!
Here is the answer….
void myfun(char *val)
{
00321480 push ebp
00321481 mov ebp,esp
00321483 sub esp,0F0h
00321489 push ebx
0032148A push esi
0032148B push edi
0032148C lea edi,[ebp-0F0h]
00321492 mov ecx,3Ch
00321497 mov eax,0CCCCCCCCh
0032149C rep stos dword ptr es:[edi]
int i;
int j;
int len;
char t;
len=strlen(val);
0032149E mov eax,dword ptr [val]
003214A1 push eax
003214A2 call @ILT+165(_strlen) (3210AAh)
003214A7 add esp,4
003214AA mov dword ptr [len],eax
for (i=len;i>0;i--)
003214AD mov eax,dword ptr [len]
003214B0 mov dword ptr [i],eax
003214B3 jmp myfun+3Eh (3214BEh)
003214B5 mov eax,dword ptr [i]
003214B8 sub eax,1
003214BB mov dword ptr [i],eax
003214BE cmp dword ptr [i],0
003214C2 jle myfun+0A4h (321524h)
for(j=0;j<i-1;j++)
003214C4 mov dword ptr [j],0
003214CB jmp myfun+56h (3214D6h)
003214CD mov eax,dword ptr [j]
003214D0 add eax,1
003214D3 mov dword ptr [j],eax
003214D6 mov eax,dword ptr [i]
003214D9 sub eax,1
003214DC cmp dword ptr [j],eax
003214DF jge myfun+0A2h (321522h)
if (val[j]>val[j+1])
003214E1 mov eax,dword ptr [val]
003214E4 add eax,dword ptr [j]
003214E7 movsx ecx,byte ptr [eax]
003214EA mov edx,dword ptr [val]
003214ED add edx,dword ptr [j]
003214F0 movsx eax,byte ptr [edx+1]
003214F4 cmp ecx,eax
003214F6 jle myfun+0A0h (321520h)
t=val[j];
003214F8 mov eax,dword ptr [val]
003214FB add eax,dword ptr [j]
003214FE mov cl,byte ptr [eax]
00321500 mov byte ptr [t],cl
val[j]=val[j+1];
00321503 mov eax,dword ptr [val]
00321506 add eax,dword ptr [j]
00321509 mov ecx,dword ptr [val]
0032150C add ecx,dword ptr [j]
0032150F mov dl,byte ptr [ecx+1]
00321512 mov byte ptr [eax],dl
val[j+1]=t;
00321514 mov eax,dword ptr [val]
00321517 add eax,dword ptr [j]
0032151A mov cl,byte ptr [t]
0032151D mov byte ptr [eax+1],cl
}
00321520 jmp myfun+4Dh (3214CDh)
00321522 jmp myfun+35h (3214B5h)
00321524 pop edi
00321525 pop esi
00321526 pop ebx
00321527 add esp,0F0h
0032152D cmp ebp,esp
0032152F call @ILT+335(__RTC_CheckEsp) (321154h)
00321534 mov esp,ebp
00321536 pop ebp
00321537 ret
Thank You
Jeff Dailey-
Hi NTDebuggers, I have another puzzler for you. We started crash2.exe under windbg and it crashed. Go figure! Sometimes we have a very limited amount of data available to figure out what went wrong. That being said, this week’s puzzler only gives you a few clues. Given this week’s debugger output, what do you suspect the problem is? What would you do to further isolate the issue or prove your theory?
If there is more data you need to solve it, post a comment / request and I will provide the data for you. We will post all comments during the week and provide our answer on Friday. We look forward to your comments.
CommandLine: crash2.exe
Symbol search path is: srv*C:\symbols*\\symbols\symbols
Executable search path is:
ModLoad: 00400000 00438000 crash2.exe
ModLoad: 779b0000 77b00000 ntdll.dll
ModLoad: 76180000 76290000 C:\Windows\syswow64\kernel32.dll
(15d0.1688): Break instruction exception - code 80000003 (first chance)
eax=00000000 ebx=00000000 ecx=cd7b0000 edx=00000000 esi=fffffffe edi=77a90094
eip=779c0004 esp=0017faf8 ebp=0017fb28 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
ntdll!DbgBreakPoint:
779c0004 cc int 3
0:000> g
(15d0.1688): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=0017feec ebx=7619140f ecx=0042ecc8 edx=00000000 esi=00000002 edi=00001770
eip=65732074 esp=0017ff00 ebp=6f207473 iopl=0 nv up ei pl nz ac pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010216
65732074 ?? ???
0:000> k 123
ChildEBP RetAddr
WARNING: Frame IP not in any known module. Following frames may be wrong.
0017fefc 66692065 0x65732074
0017ff3c 0041b9a3 0x66692065
*** WARNING: Unable to verify checksum for crash2.exe
0017ffa0 762019f1 crash2!_onexit+0x35
0017ffac 77a2d109 kernel32!BaseThreadInitThunk+0xe
0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23
0:000> lm
start end module name
00400000 00438000 crash2 C (private pdb symbols) D:\CPRRAMP\source\crash2\debug\crash2.pdb
76180000 76290000 kernel32 (private pdb symbols) C:\symbols\wkernel32.pdb\20F7BB5ED22344A2910B27CA7252AE792\wkernel32.pdb
779b0000 77b00000 ntdll (private pdb symbols) C:\symbols\wntdll.pdb\7099E4B6A6984FD08CBC90A4EDD40FD12\wntdll.pdb
0:000> db 66692065
66692065 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
66692075 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
66692085 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
66692095 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
666920a5 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
666920b5 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
666920c5 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
666920d5 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
0:000> db 0x65732074
65732074 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
65732084 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
65732094 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
657320a4 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
657320b4 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
657320c4 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
657320d4 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
657320e4 ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ?? ????????????????
0:000> db 0041b9a3
0041b9a3 c3 e8 42 57 ff ff c3 cc-cc cc cc cc cc cc cc cc ..BW............
0041b9b3 cc cc cc cc cc cc ff 74-24 04 e8 42 5c ff ff f7 .......t$..B\...
0041b9c3 d8 1b c0 f7 d8 59 48 c3-cc cc cc cc 8b 44 24 04 .....YH......D$.
0041b9d3 a3 78 3c 43 00 a3 7c 3c-43 00 a3 80 3c 43 00 a3 .x<C..|<C...<C..
0041b9e3 84 3c 43 00 c3 cc cc cc-cc cc cc 6a 10 68 58 16 .<C........j.hX.
0041b9f3 43 00 e8 cc 5d ff ff 33-ff 57 e8 0f 59 ff ff 59 C...]..3.W..Y..Y
0041ba03 89 7d fc 39 7d 08 75 1c-be 78 3c 43 00 ff 35 78 .}.9}.u..x<C..5x
0041ba13 3c 43 00 e8 19 5e ff ff-89 45 e4 c7 45 e0 02 00 <C...^...E..E...
0:000> db 0x66692065
0:000> dd esp
0017ff00 66692065 72756f20 61747320 69206b63
0017ff10 616c2073 72656772 6f6e6520 54686775
0017ff20 20736968 6d207369 2065726f 61746164
0017ff30 726f6620 00737520 6f76e0ae fffffffe
0017ff40 0041b9a3 0041b9c2 00411181 004146d9
0017ff50 0017ffa0 00412ac4 00000001 002620a0
0017ff60 002620d8 6f220936 00000000 00000000
0017ff70 7efde000 0017ff9c 00000000 00000006
0:000> dds esp
0017ff00 66692065
0017ff04 72756f20
0017ff08 61747320
0017ff0c 69206b63
0017ff10 616c2073
0017ff14 72656772
0017ff18 6f6e6520
0017ff1c 54686775
0017ff20 20736968
0017ff24 6d207369
0017ff28 2065726f
0017ff2c 61746164
0017ff30 726f6620
0017ff34 00737520
0017ff38 6f76e0ae
0017ff3c fffffffe
0017ff40 0041b9a3 crash2!_onexit+0x35
0017ff44 0041b9c2 crash2!atexit+0x9
0017ff48 00411181 crash2!ILT+380(__RTC_Terminate)
0017ff4c 004146d9 crash2!_cinit+0x49
0017ff50 0017ffa0
0017ff54 00412ac4 crash2!__tmainCRTStartup+0x15e
0017ff58 00000001
0017ff5c 002620a0
0017ff60 002620d8
0017ff64 6f220936
0017ff68 00000000
0017ff6c 00000000
0017ff70 7efde000
0017ff74 0017ff9c
0017ff78 00000000
0017ff7c 00000006
Good luck and happy debugging!
Jeff-
[Update: more debugger output, per reader request. Posted 4/17/2008]
Note this cashes without windbg also.
Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
ModLoad: 777b0000 77900000 ntdll.dll
ModLoad: 75a10000 75b20000 C:\Windows\syswow64\kernel32.dll
(1324.16d0): Break instruction exception - code 80000003 (first chance)
eax=00000000 ebx=00000000 ecx=73fe0000 edx=00000000 esi=fffffffe edi=77890094
eip=777c0004 esp=0017faf8 ebp=0017fb28 iopl=0 nv up ei pl zr na pe nc
777c0004 cc int 3
(1324.16d0): Access violation - code c0000005 (first chance)
eax=0017feec ebx=75a2140f ecx=0042ecc8 edx=00000000 esi=00000002 edi=00001770
0:000> db esp
0017ff00 65 20 69 66 20 6f 75 72-20 73 74 61 63 6b 20 69 e if our stack i
0017ff10 73 20 6c 61 72 67 65 72-20 65 6e 6f 75 67 68 54 s larger enoughT
0017ff20 68 69 73 20 69 73 20 6d-6f 72 65 20 64 61 74 61 his is more data
0017ff30 20 66 6f 72 20 75 73 00-9b 7d 6b d3 fe ff ff ff for us..}k.....
0017ff40 a3 b9 41 00 c2 b9 41 00-81 11 41 00 d9 46 41 00 ..A...A...A..FA.
0017ff50 a0 ff 17 00 c4 2a 41 00-01 00 00 00 98 21 03 00 .....*A......!..
0017ff60 d0 21 03 00 03 94 3f d3-00 00 00 00 00 00 00 00 .!....?.........
0017ff70 00 e0 fd 7e 9c ff 17 00-00 00 00 00 06 00 00 00 ...~............
0:000> ln 0042ecc8
(0042ecac) crash2!`string'+0x1c | (0042eccc) crash2!`string'
0:000> .formats 0017ff044
Evaluate expression:
Hex: 017ff044
Decimal: 25161796
Octal: 00137770104
Binary: 00000001 01111111 11110000 01000100
Chars: ..D
Time: Mon Oct 19 01:23:16 1970
Float: low 4.70085e-038 high 0
Double: 1.24316e-316
0:000> dd 0017ff044
017ff044 ???????? ???????? ???????? ????????
017ff054 ???????? ???????? ???????? ????????
017ff064 ???????? ???????? ???????? ????????
017ff074 ???????? ???????? ???????? ????????
017ff084 ???????? ???????? ???????? ????????
017ff094 ???????? ???????? ???????? ????????
017ff0a4 ???????? ???????? ???????? ????????
017ff0b4 ???????? ???????? ???????? ????????
0:000> dds crash2!__onexitbegin
00434354 59c96d56
00434358 00000001
0043435c 00000000
00434360 00000000
00434364 00000000
00434368 00000000
0043436c 00000000
00434370 00033760
00434374 00000000
00434378 00000000
0043437c 00000000
00434380 00000000
00434384 00000000
00434388 00000000
0043438c 00000000
00434390 00000000
00434394 00000000
00434398 00000000
0043439c 00000000
004343a0 00000000
004343a4 00000000
004343a8 00000000
004343ac 00000000
004343b0 00000000
004343b4 00000000
004343b8 00000000
004343bc 00000000
004343c0 00000000
004343c4 00000000
004343c8 00000000
004343cc 00000000
004343d0 00000000
0:000> dds crash2!__onexitend
00434350 59c94d56
[Update: our answer. Posted 4/18/2008]
This week we had a lot of people that realized this was a buffer overrun. You guys are so good I’m going to make next week’s puzzler a little harder!
Good work all! They are not all listed but some of the responses I liked the best were:
moltov
Matthieu
Doug
Tal Rosen
This week’s official response
Let’s start off with wmain. You can see here that we push a pointer to a string onto the stack and call fun1 004117a3
0:000> uf crash2!wmain
crash2!wmain [d:\cprramp\source\crash2\crash2\crash2.cpp @ 11]:
11 00412520 55 push ebp
11 00412521 8bec mov ebp,esp
11 00412523 83ec40 sub esp,40h
11 00412526 53 push ebx
11 00412527 56 push esi
11 00412528 57 push edi
12 00412529 686cec4200 push offset crash2!`string' (0042ec6c) << Pushing param onto the stack
12 0041252e e870f2ffff call crash2!ILT+1950(?fun1YAXPADZ) (004117a3) << Making call to fun1
12 00412533 83c404 add esp,4
13 00412536 33c0 xor eax,eax
14 00412538 5f pop edi
14 00412539 5e pop esi
14 0041253a 5b pop ebx
14 0041253b 8be5 mov esp,ebp
14 0041253d 5d pop ebp
14 0041253e c3 ret
Lets dump out the value we are passing.
0:000> da 0042ec6c
0042ec6c "This is a test ot see if our sta"
0042ec8c "ck is larger enough"
0:000> uf 004117a3
crash2!fun1 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 17]:
17 00412550 55 push ebp
17 00412551 8bec mov ebp,esp
17 00412553 83ec4c sub esp,4Ch
17 00412556 53 push ebx
17 00412557 56 push esi
17 00412558 57 push edi
19 00412559 8b4508 mov eax,dword ptr [ebp+8] << Here we are moving EBP+8 into eax. This is basically just loading the address of parameter one into eax
19 0041255c 50 push eax << We push it onto the stack to make our call to strcpy
19 0041255d 8d4df4 lea ecx,[ebp-0Ch] << Now we are loading the address of a local on the stack.
Note the local is ebp-c That means that we can only write 0xC bytes to
this location before we end up overwriting things on the stack like our base pointer and return address.
19 00412560 51 push ecx << Now we push our local variable address onto the stack for our call to strcpy
19 00412566 83c408 add esp,8
20 00412569 8d45f4 lea eax,[ebp-0Ch]
20 0041256c 50 push eax
19 00412561 e8c3eeffff call crash2!ILT+1060(_strcpy) (00411429) << This is where things go WRONG, within the call to strcpy we have overwritten our return address with string data.
Let’s look at the before and after.
0:000> bp 00412560
Breakpoint 2 hit
eax=0042ec6c ebx=75a2140f ecx=0017feec edx=00000000 esi=00000002 edi=00001770
eip=00412560 esp=0017fe9c ebp=0017fef8 iopl=0 nv up ei pl nz ac pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000216
crash2!fun1+0x10:
00412560 51 push ecx
0:000> k
0017fef8 00412533 crash2!fun1+0x10 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 19] << The return address is ok here!
0017ff50 00412ac4 crash2!wmain+0x13 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 12]
0017ffa0 75a919f1 crash2!__tmainCRTStartup+0x15e [f:\sp\vctools\crt_bld\self_x86\crt\src\crt0.c @ 327]
0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe
0:000> p << Lets step over the call to strcpy and look again.
eax=0017feec ebx=75a2140f ecx=0042eca0 edx=00686775 esi=00000002 edi=00001770
eip=00412569 esp=0017fea0 ebp=0017fef8 iopl=0 nv up ei pl nz ac pe nc
crash2!fun1+0x19:
00412569 8d45f4 lea eax,[ebp-0Ch]
0017fef8 65732074 crash2!fun1+0x19 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 20]
0017ff3c 0041b9a3 0x65732074 << This is not going to be pretty when we ret out of fun1. We will basically return to nowhere.
0017ffa0 75a919f1 crash2!_onexit+0x35 [f:\sp\vctools\crt_bld\self_x86\crt\src\onexit.c @ 98]
0:000> da 0017fef8 if we do a da on the location of the stack frame we can see what is at that location.
0017fef8 "st ot see if our stack is larger" << It’s our string!
0017ff18 " enough"
Finally we have the C code.
#include <windows.h>
void fun1(char * szData);
void fun2(char * szData);
int _tmain(int argc, _TCHAR* argv[])
fun1("This is a test ot see if our stack is larger enough");
return 0;
void fun1(char * szdata)
char szData1[10];
strcpy(szData1,szdata);
fun2(szData1);
void fun2(char *szData)
printf("Hello from fun2");
strcat(szData,"This is more data for us");
Have a great weekend, Good luck, and happy debugging!
Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlers so far. We invited the Debug Ninja to come up with a real challenge for this week.
This server blue screened with a Bug Check 0xD1: DRIVER_IRQL_NOT_LESS_OR_EQUAL. The challenge this week is to tell us, as specifically as possible, why it crashed. We’ll give you a hint, the complete answer is not that ebx+5441F815h points to an invalid address.
1: kd> .bugcheck
Bugcheck code 000000D1
Arguments e074281d 00000002 00000001 ba502493
1: kd> kv
ChildEBP RetAddr Args to Child
f78b6544 ba502493 badb0d00 00000001 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78b6544)
f78b65e0 ba50d9d8 00000020 8c32cab8 00000022 tcpip!GetAddrType+0x19f (FPO: [Non-Fpo]) (CONV: stdcall)
f78b6694 ba50dc56 8c32cab8 8ca71c2c 000005c8 tcpip!IPRcvPacket+0x66c (FPO: [Non-Fpo]) (CONV: stdcall)
f78b66d4 ba50dd58 00000000 8ca63440 8ca71c0a tcpip!ARPRcvIndicationNew+0x149 (FPO: [Non-Fpo]) (CONV: stdcall)
f78b6710 bada5550 8c4b53b8 00000000 f78b678c tcpip!ARPRcvPacket+0x68 (FPO: [Non-Fpo]) (CONV: stdcall)
f78b6764 ba9c614b 8cac2ad0 f78b6784 00000002 NDIS!ethFilterDprIndicateReceivePacket+0x1d2 (FPO: [Non-Fpo]) (CONV: stdcall)
1: kd> .trap f78b6544
ErrCode = 00000002
eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618
eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0 nv up ei pl zr na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00250246
tcpip!GetAddrType+0x19f:
ba502493 008b15f84154 add byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??
The remainder of this puzzler is quite long, click here to see the complete debug notes required to solve this puzzler. Post your responses to this page (and not the complete debug page), so we can keep all the comments together. Also, we won’t post any of your responses until Friday.
[Update, posted 4/30/2008]
Today is Wednesday and so far we haven’t received any complete responses; the partial answers we have received are close but no one has determined the root cause of the problem. When looking at problems like this it is often useful to ask “why?” when you think you have the answer and think about what other information may be available. In this situation it is possible to identify a specific problem and even determine appropriate action to fix the problem.
Perhaps we were a bit overzealous in posting 14 pages of assembly as a puzzler. Below is an abbreviated debug with less assembly, the link to the complete debug is still available for those of you who want to try the full challenge. Also, we are going to break our promise not to post responses until Friday and we have posted some responses that we have gotten so far, neither of these responses completely identifies the problem but they are getting close.
1: kd> dd @ebx+5441F815
e074281d ???????? ???????? ???????? ????????
e074282d ???????? ???????? ???????? ????????
e074283d ???????? ???????? ???????? ????????
e074284d ???????? ???????? ???????? ????????
e074285d ???????? ???????? ???????? ????????
e074286d ???????? ???????? ???????? ????????
e074287d ???????? ???????? ???????? ????????
e074288d ???????? ???????? ???????? ????????
1: kd> !pte e074281d
VA e074281d
PDE at 00000000C0603818 PTE at 00000000C0703A10
contains 000000021B980963 contains E154FC1000000400
pfn 21b980 -G-DA--KWEV not valid
Proto: 00000000E154FC10
1: kd> ub ba50d9d8
tcpip!IPRcvPacket+0x658:
ba50d9c4 51 push ecx
ba50d9c5 50 push eax
ba50d9c6 53 push ebx
ba50d9c7 ff75f4 push dword ptr [ebp-0Ch]
ba50d9ca ff75f8 push dword ptr [ebp-8]
ba50d9cd ff75ec push dword ptr [ebp-14h]
ba50d9d0 ff7508 push dword ptr [ebp+8]
ba50d9d3 e84a040000 call tcpip!DeliverToUser (ba50de22)
1: kd> uf tcpip!DeliverToUser
tcpip!DeliverToUser:
…
tcpip!DeliverToUser+0xc1:
ba50ded9 8b45f0 mov eax,dword ptr [ebp-10h]
ba50dedc 29451c sub dword ptr [ebp+1Ch],eax
ba50dedf 8b471c mov eax,dword ptr [edi+1Ch]
ba50dee2 ff7508 push dword ptr [ebp+8]
ba50dee5 2b4510 sub eax,dword ptr [ebp+10h]
ba50dee8 57 push edi
ba50dee9 8945fc mov dword ptr [ebp-4],eax
ba50deec e817ffffff call tcpip!UpdateIPSecRcvBuf (ba50de08)
tcpip!DeliverToUser+0xd9:
ba50def1 807e0906 cmp byte ptr [esi+9],6
ba50def5 753e jne tcpip!DeliverToUser+0x12f (ba50df35)
tcpip!DeliverToUser+0xdf:
ba50def7 8b430c mov eax,dword ptr [ebx+0Ch]
ba50defa 83b88c00000007 cmp dword ptr [eax+8Ch],7
ba50df01 0f857a510000 jne tcpip!DeliverToUser+0x124 (ba513081)
tcpip!DeliverToUser+0xeb:
ba50df07 837f1000 cmp dword ptr [edi+10h],0
ba50df0b 741a je tcpip!DeliverToUser+0x114 (ba50df27)
tcpip!DeliverToUser+0xf1:
ba50df0d 640fb61551000000 movzx edx,byte ptr fs:[51h]
ba50df15 8b8880010000 mov ecx,dword ptr [eax+180h]
ba50df1b 3bca cmp ecx,edx
ba50df1d 0f857145ffff jne tcpip!DeliverToUser+0x103 (ba502494)
tcpip!DeliverToUser+0x103:
ba502494 8b15f84154ba mov edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]
ba50249a 0fbe12 movsx edx,byte ptr [edx]
ba50249d 3bca cmp ecx,edx
ba50249f 0f8582ba0000 jne tcpip!DeliverToUser+0x114 (ba50df27)
tcpip!DeliverToUser+0x110:
ba50df23 c645f402 mov byte ptr [ebp-0Ch],2
tcpip!DeliverToUser+0x114:
ba50df27 640fb60d51000000 movzx ecx,byte ptr fs:[51h]
ba50df2f 898880010000 mov dword ptr [eax+180h],ecx
ba5024a5 e979ba0000 jmp tcpip!DeliverToUser+0x110 (ba50df23)
[Update: our answer, posted 5/2/2008]
In response: I would like to congratulate the people who figured this out and say thank you to all of the people who gave this a try and sent in a response. I’m not sure if we made a challenge that was too easy, too difficult, or too long, but we did not get as many responses as we had expected. If you looked at this and chose not to post a comment, or you have looked at our previous puzzlers but chose to skip this one, let us know why so we can post better puzzlers in the future.
Here is the answer…
Experienced assembly readers will recognize that this instruction is a valid instruction, but it takes action that no assembly writer would ever take. An offset of 5441F815 is too big to be realistic; no structure is 5441F815 bytes long.
Why is this instruction telling us to go to a crazy offset? One possible cause is that the instruction is corrupt; a good debug method for this problem would be to unassemble the code around this instruction to see if the adjacent instructions look valid, and then to compare these instructions to ones in a known good version of tcpip.sys. Because this instruction was not corrupt neither of these steps helped, so I started looking at how we got to this instruction.
Looking at the prior function on the stack, IPRcvPacket, I see that we actually called DeliverToUser and not GetAddrType. Why is our current instruction in GetAddrType instead of DeliverToUser? It is possible that the call went to the wrong place, that DeliverToUser called something else which returned to the wrong place, or that somehow the flow of execution did not go as intended.
The next step I took was to analyze the DeliverToUser assembly to see how we might have ended up in GetAddrType instead of DeliverToUser. What I found, and what several of our readers found, was that DeliverToUser contained a jump that went to 1 byte away from where we encountered the bogus instruction.
Why would eip point to ba50d9d3 instead of ba50d9d4? That is the challenging question in this exercise, because the answer is not obvious. Some engineers will conclude that the hardware must not have done what we asked it to do, and say that the hardware must be at fault. However, we don’t really have any proof at this time that the hardware has actually done something wrong. Hardware problems usually manifest themselves as a bit flip, where one bit is changed from the correct value. We have gone from a 4 (0100) to a 3 (0011); a 3 bit change is hard to blame on hardware.
To answer the question “Why would eip point to ba50d9d3 instead of ba50d9d4?” we need to carefully analyze the instruction at ba50df1d. I broke out my trusty instruction set reference and looked up the inner workings of the jne instruction. On page 3-542 of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M I found the layout for a jne to a 32 bit offset. The first two bytes are 0F 85, and the next 8 bytes are the offset we are going to jump too. This is our opcode:
0f 85 ff ff 45 71
Remember that by the time we make this jump the instruction pointer will have been moved to the next instruction, that will be at ba50df23 (ba50df1d plus the 6 bytes that are in this instruction). If we take this offset (ff ff 45 71) and add it to the next instruction we see that we’ll end up at ba502494 as expected.
1: kd> ?ba50df23+ffff4571
Evaluate expression: -1169152876 = ba502494
If we change just one bit in this opcode, if we change 71 (0111 0001) to be 70 (0111 0000), we end up at ba502493 instead. Now we can conclude that the hardware must have made a mistake, and now that we know what bit was flipped we can confidently say that we have identified the mistake that was made.
1: kd> ?ba50df23+ffff4570
Evaluate expression: -1169152877 = ba502493
We still have some unanswered questions. The most common question I get when I’ve found a bit flip is what piece of hardware is broken? Unfortunately I can’t answer this question with the data I have. In this case the jne instruction was good when we read it to put it in the dump but obviously it was bad when it was executed. Thinking about the parts needed to get the instruction from RAM and execute it I could speculate that a processor, riser board, or motherboard could be at fault but I don’t have any data that proves where the problem happened. It is unlikely that hardware diagnostics will find this type of failure because it is intermittent. Unless the problem manifests itself while the diagnostics are running, they will likely report that there is no problem. In the case of this server I wasn’t able to find out specifically what failed, uptime was more important than troubleshooting so a whole new server was put in place of this one and the problem has not been seen since.
Bit flips can be caused by any number of problems. Problems with a processor, a processor riser board, a motherboard, or RAM are common causes. As we have documented before in this blog improper voltage or overheating can also cause these types of problems. However the potential list of bit-flip causes is nearly endless; power supplies, power cables, incoming power fluctuations, radiation from nearby electronics, even sun spots, can all cause bit flips.
I should also explain why the stack showed that we were in GetAddrType instead of DeliverToUser. This is because tcpip.sys has been optimized with Basic Block Tools (BBT) optimization, blocks of code from the same function are located in different pages to reduce the cache hits in common code paths. This means that blocks of the GetAddrType code are located next to the DeliverToUser code. As demonstrated below there is a block of GetAddrType code that ends with an unconditional jmp at ba50248f and there is a block of DeliverToUser code that starts at ba502494.
1: kd> ub ba502494
tcpip!GetAddrType+0x9e:
ba502474 f6401401 test byte ptr [eax+14h],1
ba502478 0f84bd760000 je tcpip!GetAddrType+0xb4 (ba509b3b)
ba50247e 85db test ebx,ebx
ba502480 0f84bd6b0200 je tcpip!GetAddrType+0xa8 (ba529043)
ba502486 32c0 xor al,al
ba502488 e967770000 jmp tcpip!GetAddrType+0x15a (ba509bf4)
ba50248d 33c9 xor ecx,ecx
ba50248f e993770000 jmp tcpip!GetAddrType+0x19b (ba509c27)
1: kd> u
ba5024aa 90 nop
ba5024ab 90 nop
ba5024ac 90 nop
Congratulations to those who have read this far. Hopefully you now understand the cause of a problem that has stumped many engineers, and you have some new tools in your toolbox when you’re investigating a tricky problem. Remember to keep asking “why?” and not to assume anything; those are the approaches I used to find the source of this problem.
I recently came across a very interesting profiling tool that is available in Vista SP1 and Server 08 called the Windows Performance Analyzer. You can use this tool to profile and diagnose different kinds of symptoms that the machine is experiencing. This tool is built on top off the Event Tracing for Windows (ETW) infrastructure. It uses the ETW providers to record kernel events and then display them in a graphical format.
Performance Analyzer provides many different graphical views of trace data including:
Download the latest version of the Windows Performance Tools Kit, and install it on your machine. (http://www.microsoft.com/whdc/system/sysperf/perftools.mspx : Windows Performance Tools Kit, v.4.1.1 (QFE)) You will need to find the toolkit that corresponds to your processor architecture. Currently there are 3 versions available i.e. X86, IA64, X64.
After installation you should be able to see 2 new tools. The first one is Xperf, which is a command line tool that is used to capture the trace. The second is called XperfView, which graphically interprets the trace that has been collected by Xperf.
You will need to run the Xperf and XperfView from an elevated command prompt for all functionality.
For many tasks all you need for effective analysis is a kernel trace. For this example, we'll use the –on DiagEasy parameter to enable several kernel events including: image loading; disk I/O; process and thread events; hard faults; deferred procedure calls; interrupts; context switches; and, and performance counters. From an elevated command prompt launch xperf –on DiagEasy.
This starts the kernel logger in sequential mode to the default file "\kernel.etl"; uses a default buffer size of 64K, with a minimum of 64 and a maximum of 320 buffers.
To stop a trace, type xperf –d <filename>.etl at the command line. This will stop the trace and output the file.
There are 2 ways to view the trace. From an Elevated Command prompt, launch xperf <filename>.etl, or launch the XperfView tool and open the file manually. When you open the trace file, you should see something similar like this.
NOTE - While you need to run xperf from an elevated command prompt in order to record a trace you do not need an elevated command prompt in order to *analyze* a trace.
Using the Chart Selector tab, you can select all the graphs that you want to look at. To drill down in each chart, you can select the Summary table. For instance, in the CPU Sampling chart, the summary table gets you the summary of the processes that were running, with information like the amount of CPU time, CPU %, stacks (if the stacks were collected in the trace, see below). When looking at the Summary table for the Disk I/O chart, you can see which processes were writing files (the filename too!) to disk, as well as how much time it took.
You also have the ability to zoom in on a selected area. Another really cool feature is the ability to overlay multiple graphs on one frame. This way you can correlate different pieces of data together very easily.
Also, you select which counter instances you want to see in each specific chart. On the top right corner of each chart is a drop down box from where you can select the counter instances. For instance on the Disk I/O chart, you can select Disk 0, Disk 1, or a combination as well.
You can also view detailed information about the system that the trace was taken on. Click on the Trace menu item, and select System Configuration.
In the first sample Xperf command we ran, xperf –on DiagEasy. I am sure many of you were wondering what DiagEasy means. DiagEasy is a group of kernel events that are predefined by the Windows Performance Toolkit. This group includes Process, Threads, Kernel and User Image Load/Unload, Disk I/O, DPCs and Context Switch events.
When we used the xperf –on DiagEasy command, we did not specify an individual provider, so we enabled the kernel events for all the ETW providers on the system. If you want to enable events for a specific provider, you can the following format xperf -on: (GUID|KnownProviderName)[:Flags[:Level]]. For more information about ETW providers, Kernel Flags and Groups, you can run the xperf –help providers command.
One of the most powerful features in Performance Analyzer is the ability to visualize stacks. It's important to note that this requires no special instrumentation in the code – only that you have symbols for the binary components you are interested in analyzing.
When the trace is setup to collect the stacks, Performance Analyzer will display call stack summary information for the events that had stack walking enabled. Here is an example that takes a trace (with stack tracing enabled) of the entire system while running a "find string" utility.. We can use the Stack Tracing feature of Xperf to record a stack when certain events happen, or take sample at regular intervals over time. See xperf –help stackwalk output for more info.
Below, we will use the Stack Tracking feature of Xperf to take stack samples at regular intervals. With this output, we will be able to determine where the CPU is spending most of its time within a process.
xperf -on latency -stackwalk Profile
Latency is the kernel group to enable certain events, including the profile event which records the CPUs' activity every millisecond. The "-stackwalk Profile" flag tells Xperf to record stack walks on every profile event, which makes the profile information much more useful. In other words, in order to get profile information with stack walks you need to turn on the profile event, and turn on stack walking for that event.
Note that decoding of stacks requires that symbol decoding be configured. However stacks can be recorded without symbols, and can even be viewed without symbols, although they are much less useful without symbols. I only mention this in the event you're trying to record a trace of a problematic machine with little time to mess around with _NT_SYMBOL_PATH.
To get a trace with the stack information, do the following:
Click on the selector tab to bring up the column chooser list. Then select "Process name", "Process", "Stack", "Weight" and "%Weight". These are the most useful columns when looking at stacks from the sample profile event. You should get a view similar to this.
At this point I need to mention a few of the restrictions with stack walking coupled with when and how it works.
· Xperf stack walking is not available on XP
· On Vista stack walking is available for x86, and is available for x64 as of Vista SP1.
· On Windows 7 stack walking is available.
· Stack walking on x64 is complicated. You have to set DisablePagingExecutive in the registry, as documented here:
http://social.msdn.microsoft.com/Forums/en-US/wptk_v4/thread/282e5beb-0afd-411b-9a81-a0bbbf569dfe/
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD –f
Sample Case
I recently came across a case where the customer was complaining that DPC processing was taking up too much CPU time. We ran Xperf on the machine and drilled down into the DPC activity on the machine.
From the Xperf graph, I was able to confirm that the customer was actually seeing high DPC usage. I selected the Summary for this chart, and got the list of drivers that were actually taking up CPU time.
Right off the bat, I could identify the driver that had a lot of DPC activity. I also noted that the average duration for each DPC from that driver was taking 788 microseconds. This is way too high. Each DPC should be taking a maximum of 100 microseconds.
- Omer
Performance.Analyzer.QuickStart.xps – This is shipped with the performance toolkit.
From an elevated command prompt, launch xperf -help
http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
http://blogs.gotdotnet.com/pigscanfly/archive/2008/02/16/using-xperf-to-take-a-trace.aspx
http://www.microsoft.com/whdc/Driver/tips/DPC_ISR.mspx
Hello NTDebuggers, in the spirit of Click and Clack (The Tappet brothers), a favorite troubleshooting show of mine, we thought it would be fun to offer up some Debug puzzlers for our readers.
That said, this week’s Debug Puzzler is in regard to Dr. Watson. I’m sure most of you have seen Dr. Watson errors. This typically means your application has crashed due to an unhandled exception. Sometimes however the process just seems to disappear. The Just-in-Time (JIT) debugging options configured via the AEDebug key does not catch the crash… Does anyone know why this may happen?
We will post reader’s comments as they respond during the week, and next Monday will post our answer and recognize some of the best answers we received from our readers.
- Jeff Dailey
[Update: our answer, posted 4/11/2008]
Hello NTDebuggers. Let me start of by saying that we were very impressed by our reader’s answers. Our two favorite answers were submitted by Skywing and molotov.
When a thread starts, the ntdll Run Time Library (RTL) for the process inserts an exception hander before it calls the BaseThreadInit code to hand control over to the executable or DLL running in the process (notepad in the example below). If anything goes wrong with the chain of exception handlers, the process can’t make it back to the RTL exception handler and the process will simply terminate. See http://www.microsoft.com/msj/0197/Exception/Exception.aspx for details.
000ef7ac 75fbf837 ntdll!KiFastSystemCallRet
000ef7b0 75fbf86a USER32!NtUserGetMessage+0xc
000ef7cc 00b21418 USER32!GetMessageW+0x33
000ef80c 00b2195d notepad!WinMain+0xec
000ef89c 76e24911 notepad!_initterm_e+0x1a1
000ef8a8 7704e4b6 kernel32!BaseThreadInitThunk+0xe
000ef8e8 7704e489 ntdll!__RtlUserThreadStart+0x23 << Exception Handler is inserted here.
000ef900 00000000 ntdll!_RtlUserThreadStart+0x1b
Secondly, the process that crashes is actually responsible for starting the debugger via the RTL exception handler. The debugger is registered under the AeDebug registry key. Even if you are able to unwind to the RTL exception handler you may still run into trouble. If the computer is low on system resources such as desktop heap, you may not be able to create a new process and thus will not be able to launch the debugger. As SkyWing stated, it’s a relatively heavyweight operation. Applications may also call TerminateProcess from within their own code based on an error condition. If we have a customer that sees this symptom on a regular basis we typically recommend having them attach a debugger to monitor the process. Simply run via ADPLUS -crash -p (PROCESSID).
Good work folks! We’ll have another puzzler ready next Monday.
Good Luck and happy debugging!
- Jeff
We’ve been thinking about hosting a live group chat session to talk with the debugging community. If we had such a chat, the discussion would focus on debugging techniques and any questions you may have about anything we’ve previously blogged about. If you’d be interested in participating in a chat session, please answer the survey question below. Also, feel free to leave a comment on this page if there’s a particular debugging topic you would like for us to cover in the chat.
Thanks everyone for your feedback! This survey is now closed.
Hello all; my name is Scott Olson and I work as an Escalation Engineer for Microsoft Global Escalation Services team in Platforms support, and I wanted to share an interesting problem that came up recently. A co-worker was running Windows Vista Ultimate x64 on their home machine and ran into a problem where the system would get random bugchecks after upgrading the RAM from 2GB to 4GB. Any combination of the RAM with 2GB was fine; however with 4GB of RAM installed the system would bugcheck within 10 minutes of booting. Once I heard about this I wanted to look at the memory dump in kernel debugger.
Here's is what I found:
The system got the following bugcheck:
0: kd> .bugcheckBugcheck code 000000D1Arguments fffff800`03a192d0 00000000`00000002 00000000`00000000 fffff980`064aa8b6
Tip: The help file included with the Debugging Tools For Windows contains a Bug Check Code Reference that includes details on how to parse the Bug Check code and its arguments. See: Help > Debugging Techniques > Bug Checks (Blue Screens) > Bug Check Code Reference
!analyze -v provides the following information for this bugcheck:
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If kernel debugger is available get stack backtrace.Arguments:Arg1: fffff80003a192d0, memory referencedArg2: 0000000000000002, IRQLArg3: 0000000000000000, value 0 = read operation, 1 = write operationArg4: fffff980064aa8b6, address which referenced memoryDebugging Details:------------------READ_ADDRESS: fffff80003a192d0CURRENT_IRQL: 2
So with this data I can say that the system took a page fault on a read operation trying to reference the memory at fffff80003a192d0 at DISPATCH_LEVEL. OK, so let's get the trap frame so we can get into context of the system when the crashed happened:
0: kd> kv 3Child-SP RetAddr : Args to Child : Call Sitefffff800`03218f28 fffff800`0204da33 : 00000000`0000000a fffff800`03a192d0 00000000`00000002 00000000`00000000 : nt!KeBugCheckExfffff800`03218f30 fffff800`0204c90b : 00000000`00000000 fffffa80`0a3c6cf0 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x73fffff800`03219070 fffff980`064aa8b6 : 00000000`00000002 00000000`00000000 00000000`000005e0 fffff800`03219220 : nt!KiPageFault+0x20b (TrapFrame @ fffff800`03219070)
Here is the trap frame and it looks like system crashed while trying to reference memory at an offset from the stack pointer, rsp+0xD0 (highlighted above)
0: kd> .trap fffff800`03219070NOTE: The trap frame does not contain all registers.Some register values may be zeroed or incorrect.rax=0000000000000000 rbx=0000000000000010 rcx=0000000000000011rdx=0000000000000002 rsi=0000000000000000 rdi=0000000000000001rip=fffff980064aa8b6 rsp=fffff80003219200 rbp=00000000000071d6r8=fffff80003219280 r9=00000000000071d6 r10=0000000000000000r11=0000000000000000 r12=0000000000000000 r13=0000000000000000r14=0000000000000000 r15=0000000000000000iopl=0 nv up ei pl zr na po nctcpip!InetInspectReceiveDatagram+0xf6:fffff980`064aa8b6 440fb78c24d0000000 movzx r9d,word ptr [rsp+0D0h] ss:0018:fffff800`032192d0=8c13
As you can see above fffff800`032192d0 looks like valid memory and shouldn't normally cause a page fault on a read operation. At this point, I want to make sure the system did what it was told. I want to know what happened when the system trapped. To verify the faulting address I dumped the CR2 register to see what address was referenced when the page fault happened; this is also the first parameter in the bugcheck code for a stop 0xd1.
0: kd> r cr2cr2=fffff80003a192d0
Looking at this address it is clear that the trap frame does not exactly match, so let's look at how these addresses are different. Here is the stack pointer from the trap frame and the page fault converted into varying formats (focusing on the binary)
0: kd> .formats fffff800`032192d0Evaluate expression:Hex: fffff800`032192d0Decimal: -8796040490288Octal: 1777777600000310311320Binary: 11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000Chars: .....!..Time: ***** Invalid FILETIMEFloat: low 4.74822e-037 high -1.#QNANDouble: -1.#QNAN0: kd> .formats fffff800`03a192d0Evaluate expression:Hex: fffff800`03a192d0Decimal: -8796032101680Octal: 1777777600000350311320Binary: 11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000Chars: ........Time: ***** Invalid FILETIMEFloat: low 9.49644e-037 high -1.#QNANDouble: -1.#QNAN
Notice that there is a one bit difference between these 2 addresses
11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000
11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
Since the software asked the system to do one thing and it did something different this is clearly some type of hardware problem (most likely with the processor). I reported this back to the co-worker and they contacted their hardware vendor. This must have been a common problem with this vendor because I found out later that they replied back within 10 minutes of contacting them with a recommendation to change the memory voltage in the BIOS. The memory voltage was set to Auto, which is a default. They recommended it be changed from 1.85 volts to 2.1 volts. After making the change the system was stable with 4GB of RAM.
We recently dealt with an interesting issue that I would like to share, hope you enjoy. - Jason
Issue Summary
Customer reports that within their Citrix server farm (running on Windows Server 2003), when a user logs into or out of a session (seems more pronounced on logoff), ALL users connected to the server experience a ~5-20 second hang. The hang is described as follows:
Customer has found that during the timeframe of the hang, Winlogon.exe (in session 0) is showing 25% CPU usage.
Troubleshooting & Analysis
Where to start… to begin, given the details above, we can assume we are not experiencing a hard hang where the server is completely unresponsive. The user can still move windows and the connection to the server is still active. Keyboard input is working, but seems to be buffered during the issue. The windows not redrawing indicates the message pump for each of the affected applications is stuck waiting for something.
In this scenario, I focused on the CPU usage for Winlogon under the premise that the other symptoms were side effects of whatever the CPU was working so hard on. On to the 25%; this is an interesting number to spike at. Generally speaking a thread has work to do, or it doesn't. If there was work to do it would normally take the CPU to 100%. So why 25%? In this scenario, each of the servers within the server farm had 4 processors. So we pegged 1 out of the 4 @ 100% resulting in task manager displaying 25% CPU utilization for the Winlogon in question.
So now we have a CPU spike in Winlogon. Why only the Winlogon in session 0? Going back to the issue summary, the customer reported that if ANY user logs off, ALL users experience a hang, and Winlogon in session 0 spikes. First, let's talk about what winlogon does for a living.
Sidebar: Winlogon on Windows Server 2003.
Any time you ask what a feature is or how it works, you should begin your research with the Windows Technical Reference (try a Live search like "winlogon site:http://technet2.microsoft.com"). In this case I focused on the "How Interactive Logon Works" article.
This article starts with:
The Windows Server 2003 interactive logon architecture includes the following components:
So Winlogon has something to do with Interactive logon.
What else can we find out about the responsibilities of Winlogon
Looking at this list of tasks performed by Winlogon, we need to try to determine which task aligns to our symptoms (hang during logon and logoff). I decided to focus on User Profiles due to the relationship between profiles and login and logoff.
What is a User Profile? Back to Live with "user profiles site:http://technet2.microsoft.com". Out of the results, I went with the "User Profile Structure" link. The page provides the following detail:
A user profile consists of:
With this we can now look at the Winlogon that is spiking the CPU. My step when looking at a CPU spike for a process is to determine if the time spent is in User-mode or Kernel mode. In task manager on the Performance tab you can monitor CPU usage. If you select the View menu and chose to "Show Kernel Times" you can get additional detail showing you if the time is associated with a User-mode module or a Kernel module. In this case the Winlogon spike showed the spike to be in Kernel time. This means the application (Winlogon) asked the OS to do something and it is trying to get it done.
My second step is to determine what user-mode request led to the spike in Kernel time. To answer this on a production system I chose to use Sysinternal's Process Explorer. I downloaded the tool and set it up so it would be able to obtain symbols from the public MS symbols server. During the repro I monitored the thread CPU time in Winlogon. Here is a screen shot of what Process Explorer looks when looking at the thread activity of an idle Winlogon.
Figure 1 – Process Properties in Process Explorer
When the CPU spike occurred we looked at the stack for the spiked thread (double click on the thread with the high CPU time or click the "Stack" button with the thread highlighted):
ntdll.dll!KiFastSystemCall+0x3ntdll.dll!KiFastSystemCallRetntdll.dll!ZwUnloadKey+0xcADVAPI32.dll!LocalBaseRegUnLoadKey+0x51ADVAPI32.dll!RegUnLoadKeyW+0x73USERENV.dll!MyRegUnLoadKey+0x6dUSERENV.dll!CUserProfile::UnloadUserProfileP+0x2a2USERENV.dll!UnloadUserProfileI+0x198RPCRT4.dll!Invoke+0x30RPCRT4.dll!NdrStubCall2+0x299RPCRT4.dll!NdrServerCall2+0x19RPCRT4.dll!DispatchToStubInCNoAvrf+0x38RPCRT4.dll!RPC_INTERFACE::DispatchToStubWorker+0x11fRPCRT4.dll!RPC_INTERFACE::DispatchToStub+0xa3RPCRT4.dll!LRPC_SCALL::DealWithRequestMessage+0x42cRPCRT4.dll!LRPC_ADDRESS::DealWithLRPCRequest+0x127RPCRT4.dll!LRPC_ADDRESS::ReceiveLotsaCalls+0x430RPCRT4.dll!RecvLotsaCallsWrapper+0xdRPCRT4.dll!BaseCachedThreadRoutine+0x9dRPCRT4.dll!ThreadStartRoutine+0x1bkernel32.dll!BaseThreadStart+0x34
Just to make sure, we repro'd again and looked at the call stack
We can see here that both call stacks for the thread in question are RPC threads waiting for incoming requests. The request that came in was from userenv module running in Winlogon associated with the session of the user logging off and came into the Winlogon in session 0. The request is to unload the User Profile (highlighted above). This led to a call to RegUnloadKey call which removes a hive from the registry but does not modify the file containing the registry information.
We now know the issue is related to unloading the user profile and specifically the registry portion of the user profile. Now we need to determine where we are spending the time in kernel mode.
Kernrate
Kernrate is a tool included with the Windows 2003 resource kit. This will let us dig into the kernel mode side of what is going on. I collected the following data from kernrate during the hang condition (Summarized):
Parsing the output, we see that 72.66% of the Kernel time was spent in the idle process. This is attributed to the 3 idle threads that are on the 3 processors not being used by the Winlogon thread. We see that the Winlogon process accounts for 20.39% of the kernel time (the CPU spike on the 4th processor). If we move into the module summary, you can see the correlated CPU times for the modules, INTELPPM (CPU driver) running on the idle processors and NTOSKRNL running on the 4th processor (this is where we will focus). In the final drill down, we see the function summary for NTOSKRNL and we can identify the function in use 82% of the time was CmpCleanUpKCBCacheTable. The Cm function prefix lets us know this is related to Configuration Manager (the Registry – see Table 2-7 from Microsoft Windows Internals 4th Edition for Commonly used Kernel Prefixes).
What do we know? We know the issue manifests most during user logoff. We know during the logoff we are trying to unload the profile. We know that in the process of doing this we are trying to unload a registry hive. We know that leads us to spend a lot of CPU time in the Kernel doing CmpCleanUpKCBCacheTable.
Why does this hang the machine? The registry is protected / synchronized with an ERESOURCE named CmpRegistryLock. While this is held exclusively during this cleanup function, all registry access is blocked (both read and write). This explains all of our symptoms. Applications freeze and do not redraw due to operations that need registry access being done on the GUI thread of a process. Perfmon cannot report due to its dependency on the registry.
Resolution
Now that we know the exact behavior, we could align it to a known issue that was introduced with the release of Windows Sever 2003 SP1 (applies to both SP1 and SP2)
KB927182
From the KB cause section:
This problem occurs because of the way that Windows Server 2003 Service Pack 1 (SP1) cleans up user registry information. When a user logs off, there is a five-second delay before Windows Server 2003 SP1 flushes the user registry hive. If the terminal server experiences heavy registry activity, the cached registry information on the terminal server may increase significantly. For example, this behavior may occur when a program scans the user profile during the logoff process. When the terminal server experiences heavy registry activity, it takes Windows longer to flush the data.
The fix applies to Windows Server 2003 with SP1 or SP2.
Written by Jeff Dailey:
Hello NTDebuggers, one of the most important things to understand in kernel debugging hung servers is the output of !locks. There can be a lot of data and it’s not always clear what is going on. One of the things I like to do in order to better understand the output is to use a visual representation of the resources involved and the threads that are blocking on those resources. Before we can do that we need to understand what to look for so we can document it in our diagram.
It’s a good idea to understand ERESOURCEs in general l before jumping into !locks. The following MSDN article goes into lots of great detail. http://msdn2.microsoft.com/en-us/library/aa490224.aspx
Simply put, you will typically see threads either with access to or trying to gain access to resources. If a thread has access to a resource it will be marked by <*>. Threads that have access to a resource can block other threads from gaining access to said resource.
You will see threads waiting for shared access. These threads do not have the <*> and listed above the threads that are Waiting on Exclusive Access.
You will also see threads that are Waiting on Exclusive Access. These threads are typically blocked waiting for the threads that have access or ownership of the resource to release it.
Let’s take a look at one section of !locks output and annotate each thread section...
Resource @ 0x896d2a68 Shared 1 owning threads << This info is the ERESOURCE in question. Contention Count = 15292 << The amount of contention for the object.
NumberOfSharedWaiters = 1 << This is self explanatory
NumberOfExclusiveWaiters = 39 << Number of exclusive waiters in the Ex Waiter List
Threads: 89bd1234-01<*> 896d2020-01 << We have two threads here. The owner, or shared owner <*>89bd1234 and the shared Waiter 896d2020
Threads Waiting On Exclusive Access:
888ed020 87c036f8 885dc7a0 8bc538b0 << All of these threads are waiting on exclusive access.
88e8cda0 88796988 8905fda0 8974dc10
Note the following output is completely fabricated, so alignment and variable names may not be valid.
The following is some sample output from !locks. In this scenario I document any ERESOURCE that has any threads waiting on exclusive access. I also document the ERESOURCES as nodes and show the relationship to the Threads. The key point is to show the threads involved, the resources they own, and the resources they are blocked on or trying to get exclusive access to. Ultimately you need to work your way toward the head of the blocking chain of events to figure out what is holding up the entire chain of execution from moving forward.
In this case you will see that a filter driver called MYFILTER has passed an invalid object to KeWaitForSingleObject. As a result the thread blocked and all the other threads and processes related to those threads froze and could not move forward. The machine was completely hung.
1: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks......
Resource @ 0x8a50ee98 Shared 4 owning threads
Threads: 896856d0-01<*> 89686778-01<*> 896862d0-01<*> 89685da0-01<*>
KD: Scanning for held locks............................................................
Resource @ 0x896dabcd Exclusively owned
Threads: 886e5678-01<*>
KD: Scanning for held locks..
Resource @ 0x896d2a68 Shared 1 owning threads
Contention Count = 15292
NumberOfSharedWaiters = 1
NumberOfExclusiveWaiters = 39
Threads: 89bd1234-01<*> 896d2020-01
888ed020 87c036f8 885dc7a0 8bc538b0
88d78020 87a7dda0 88b85b20 87b78020
8936e8a0 87dd7ae8 886005a0 88557890
887b3680 87cc2790 87dd4050 87fad8a0
88179580 87b53d70 87cd2775 88ba0578
87b676f8 8886b560 87f68388 89681da0
88952720 888833c0
KD: Scanning for held locks................
Resource @ 0x8959c790 Exclusively owned
Contention Count = 4827
NumberOfExclusiveWaiters = 35
Threads: 89bd1234-01<*>
883e3aa0 88873020 88290020 87f5f588
888154f0 88bd4b28 88cbc448 884bd6c8
881e5da0 8935f518 87bcc978 8889e020
88cb3020 88c92178 87cf9020 88daaac0
89376020 88fe9020 887b29d0 87b6f7f0
87e12020 87b4f498 894ee730 88810020
881a8020 87dd55f0 888d3020 885f6da0
881f7da0 880742e8 87a31b50 879ffb50
88451da0 88646da0 8833a020
KD: Scanning for held locks.....................................................
Resource @ 0x88ce81ff Exclusively owned
Contention Count = 108
Threads: 87ad6f78-01<*>
KD: Scanning for held locks......................................
Resource @ 0x87da48fb Exclusively owned
Threads: 87bddda0-01<*>
KD: Scanning for held locks.
Resource @ 0x87df455c Exclusively owned
Contention Count = 2
NumberOfExclusiveWaiters = 2
89bd1234 87ad6c68
KD: Scanning for held locks............................................
Resource @ 0x87fcfe30 Shared 1 owning threads
Threads: 8a60f8a3-01<*> *** Actual Thread 8a60f8a0
KD: Scanning for held locks...........
Resource @ 0x880ef1cd Shared 1 owning threads
Threads: 8a60c3af-01<*> *** Actual Thread 8a60c3a0
27044 total locks, 9 locks currently held
Good luck and happy debugging.
Hello NTDebuggers, I’d like to announce something new for our community to share. We have decided to host an Open Source Debugger Extension project called ODbgExt on codeplex.com Right now it’s just the basic framework. This will be something we can work on together as a community. Think of it as a debugger extension by the people for the people. Jason Epperly and I (Jeff Dailey) will be the project managers for the codeplex environment. You can find the project on http://www.codeplex.com/ODbgExt
Project Description
ODbgExt is an open source debugger extension for WinDbg that is intended to be developed by the debugging community. It is based on the Windows Debugger SDK Sample ‘exts’ and uses the COM Debugger interface IDebugControl. The primary objective of the project is to provide the debugging community a central location to share their debugger extensions that help to isolate common problems in the community and make debugging both live systems and dumps easier. The initial release will be the basic framework that we expect other developers to contribute to. There will be very basic functionality in the initial version. We, GES (Global Escalation Services) intend on doing a series of blogs to talk about writing debugger extensions over the coming year. As we blog about writing extension we will include the code in ODbgExt (This project). This is the same group that runs the http://blogs.msdn.com/ntdebugging blog. We encourage you to sign up and contribute your debugger extension ideas and or code to the project. Simply create a codeplex account and request access.
Things we would like to include:
• A Graphical representation of Kernel, and User mode execution time by process and thread, while correlating to idle / non-idle time• A Graphical representation of idle time for kernel and user mode.• A Graphical representation of pool resources used on a per process bases along with handle table counts.• Hang detection, scanning for various conditions that could cause hangs in a system or user mode process.• Detailed system information such as oldest and newest binaries. • Binary info based on vendor name• A better dissembler that colorizes calls, jumps and indents to show code flow• Extensive use of DML (Debugger Mark-up Language) to enable more point and click debugging within windbg.• Support for a SQL Database backend to allow storing information about debugging sessions or binaries in a SQL Database.• Warnings when critical thresholds are exceeded such as handle counts over 10,000, Low PTE Conditions, etc.• Simplified searching for pool tags in binaries.• Support for VBA for Windbg• Dump annotation, via dump streaming; the ability to embed data into a dump via the debugger extension and later retrieve it. (Imagine embedded debug notes)• Embedding a snapshot of performance data in the dump at the time the dump is taken, ie. CPU, IO etc.We hope you are as excited about this project as we are! Please feel free to suggest more ideas for this project and by all means feel free to sign up and contribute some code!
Thank you,
Jeff Dailey
Platforms Global Escalation Services