There is an amazing white paper published on this topic which is available here: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=21820.

If you really want to have a good mess about with Kerberos, it is the white paper to read.

The purpose of this post is to record my notes taken from the white paper and tested in a lab. I find it really useful to have screenshots of the types of things to expect from different scenarios up on the blog because it saves time when there is a serious issue. I hope some of the notes are useful to others as well.

Before I even jot down the notes from the lab - there are some important Kerberos fundamentals to keep in mind.

Fundamentals

Kerberos is about authentication. We are making sure you are who you say you are. Authorisation is very different: that is where we grant access to resources. With Kerberos troubleshooting, keep in mind that just because I can get a ticket to a file server, it doesn't mean I can access the stuff on there. It sounds really simple, but it's a trap that comes up again and again: authentication doesn't equal authorisation. Just because you are who you say you are doesn't mean I'm letting you touch my stuff.

Kerberos is all about tickets. Tickets are the driver's licence of the Windows world. When you show a policeman your driver's licence, they can be satisfied you are who you say you are. When a Windows client presents a Kerberos ticket to a resource server, the resource server can be satisfied that you are who you say you are. (In Windows we also have the benefit of short-term symmetric encryption in the ticket, which makes it very difficult to forge... no McLovin here...).

Behind the tickets are keys, and Kerberos keys come in three flavours:

  • User, service, and system keys. Long-term symmetric keys generated from passwords.
  • Public keys. Long-term asymmetric keys used with smart cards. (I'm not going there in this lab, but I should later.)
  • Session keys. Short-term symmetric keys created by domain controllers.

When troubleshooting, the main focus is on two types of ticket:

  • TGT. The KDC responds to a client’s authentication service request by returning a session ticket for itself. This special session ticket is called a ticket-granting ticket (TGT). A TGT enables the authentication service to safely transport the requestor’s credentials to the ticket-granting service.
  • Session ticket. A session ticket allows the ticket-granting service (TGS) to safely transport the requestor’s credentials to the target server or service. In the lab below you can see some images with session tickets for LDAP on the DC, and also CIFS on the DC

On the domain controller side, remember there are two tasks that are implemented in the single KDC (Key Distribution Centre) process. First, the Windows KDC acts as an AS (Authentication Service). The AS is the part that gets you in the door as far as Kerberos is concerned. The AS gives you a ticket (the TGT) that can be redeemed over and over in the other part of the KDC, which is called the TGS (Ticket Granting Service). You will come back to the KDC with your TGT in hand when you want to access a resource, for example your file server. "Hey Mr KDC, I want to talk to the Ticket Granting Service please... I already have a TGT". The KDC will check the TGT and then forward the request to the Ticket Granting Service code, which will set up a ticket for the file server.
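To make the two roles concrete, here is a toy model of the AS and TGS. It is only a sketch: the class name ToyKDC, its methods and the ticket layout are all made up for illustration, the password check is left out, and tickets are merely signed with an HMAC rather than encrypted the way real Kerberos tickets are.

```python
import hashlib
import hmac
import json
import time


class ToyKDC:
    """Toy model of the two KDC roles. Hypothetical, not real Kerberos."""

    def __init__(self, realm_key: bytes):
        # Stands in for the krbtgt key. A real KDC would also seal each
        # service ticket with the target service's own key, not this one.
        self._key = realm_key

    def _seal(self, payload: dict) -> dict:
        body = json.dumps(payload, sort_keys=True)
        mac = hmac.new(self._key, body.encode(), hashlib.sha256).hexdigest()
        return {"body": body, "mac": mac}

    def _open(self, ticket: dict) -> dict:
        expected = hmac.new(self._key, ticket["body"].encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, ticket["mac"]):
            raise ValueError("ticket was not issued by this KDC")
        return json.loads(ticket["body"])

    def as_request(self, user: str, lifetime_s: int = 36000) -> dict:
        """Authentication Service: hand out a TGT (password check omitted)."""
        return self._seal({"client": user, "service": "krbtgt",
                           "expires": time.time() + lifetime_s})

    def tgs_request(self, tgt: dict, spn: str) -> dict:
        """Ticket Granting Service: swap a valid TGT for a service ticket."""
        claims = self._open(tgt)
        if claims["service"] != "krbtgt" or claims["expires"] < time.time():
            raise ValueError("TGT is invalid or expired")
        return self._seal({"client": claims["client"], "service": spn,
                           "expires": claims["expires"]})
```

The point of the sketch is the shape of the conversation: one trip to the AS for a TGT, then the same TGT is presented to the TGS each time a ticket for a new service (for example cifs/file.contoso.com) is needed.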

What's important for Kerberos to Work in Domains?
- TCP/IP connectivity. Especially ports 53 (TCP and UDP, DNS), 88 (TCP and UDP, Kerberos) and 123 (UDP, time). Also port 464 (TCP, kpasswd) if required by your clients.
- DNS
- Time
- Active Directory (we store our Service Principal Names in the Directory)
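A quick way to rule out basic connectivity from a client is to try the TCP ports directly. The sketch below is an assumption of mine rather than anything from the white paper: the function names are made up, and it only covers the TCP side (UDP 123 for time cannot be checked with a simple connect).

```python
import socket

# TCP ports from the list above; names are only labels for the report.
KERBEROS_TCP_PORTS = {53: "DNS", 88: "Kerberos", 464: "kpasswd"}


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dc(host: str) -> dict:
    """Map each service label to whether its TCP port answered."""
    return {name: tcp_port_open(host, port)
            for port, name in KERBEROS_TCP_PORTS.items()}
```

Calling check_dc with the name of a domain controller returns a small dictionary such as {"DNS": ..., "Kerberos": ..., "kpasswd": ...}; any False entry is worth chasing before looking at Kerberos itself.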

Where to Start...

I don't think I can better summarise this little list from the Kerberos troubleshooting guide, so I am just pasting it from page 11 to have it here for reference:

1. Use Kerberos Tray or Kerberos List to confirm that you have a session ticket for the server you are attempting to connect to. If you have a session ticket for the server and you are still getting an error message, consider these two possibilities:

  • You might have an issue with SPNs. For more information about SPN issues, check "Need an SPN Set" and "0x8 KDC_ERR_PRINCIPAL_NOT_UNIQUE" in the white paper.
  • You might have an authorization issue instead of an authentication issue. If this is the case, most likely Kerberos authentication is not the problem.

2. If you do not have a session ticket, then use Kerberos Tray or Kerberos List to confirm that you have a TGT.

  • If you have a TGT but no session ticket, examine the system event log. Errors logged in the system log will help you determine why you cannot get a ticket to the server.

3. If you are auditing successful logons, then you can check the security event log on the client to see if the system is using NTLM instead of Kerberos authentication. Use of NTLM can occur because:

  • The application uses NTLM. See NTLM Fallback in the white paper for an example of this condition.
  • Kerberos authentication is failing and Negotiate is using NTLM.

4. If Kerberos authentication is failing, the system event log or captured data in a network trace should contain the Kerberos error code that was returned by the KDC or the Kerberos SSP. You can also debug to get more information.
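The checklist above boils down to a small decision tree. Here is one way to write it down; the function name, its parameters and the wording of the returned hints are my own shorthand, not something defined in the white paper.

```python
def next_step(has_service_ticket: bool, has_tgt: bool,
              logon_used_ntlm: bool = False) -> str:
    """Suggest where to look next, following the four-step checklist."""
    if has_service_ticket:
        # Step 1: Kerberos did its job, so look at SPNs or authorization.
        return "Check for SPN problems or an authorization issue"
    if has_tgt:
        # Step 2: the AS worked but the TGS request did not.
        return "Examine the system event log for the service ticket failure"
    if logon_used_ntlm:
        # Step 3: the security log shows NTLM rather than Kerberos.
        return "Check whether the application uses NTLM or Negotiate fell back"
    # Step 4: nothing worked, so go after the raw Kerberos error code.
    return "Find the Kerberos error code in the event log or a network trace"
```

For example, next_step(False, True) points at the system event log, which is exactly where the too-many-groups and duplicate SPN problems later in this post show up.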

Testing In The Lab

My lab is really simple: DC, FILE & CLIENT, all running Windows Server 2008 R2. All built straight from an MSDN ISO with no special configuration. The domain is called Contoso.com. My test user is "Tim Tester", logon name is tim, or tim@contoso.com.

One of the first things I like to know is what a "normal" system looks like, so when I compare the broken system to it I can look for the differences.

When tim logs on to a Windows 2008 client he gets the following tickets in a "normal"/"vanilla" setup :

The tickets let tim use the file (CIFS) and Active Directory (LDAP) services that are required to apply his group policy. He also gets the ticket-granting ticket that will be used to request access to other services he requires.

The computer he is logged onto (called "CLIENT") also receives an initial set of tickets...

 

By the way, to access the list of tickets the computer has we need klist to target the "nt authority\system" account rather than the user. 

In my case I use psexec.exe (can download here: http://technet.microsoft.com/en-us/sysinternals/bb897553 ) with the following switches: "psexec -s -i -d cmd.exe" from an administrative command prompt. I usually do a "whoami" to confirm that the context has changed to "nt authority\system", and then run klist.exe to see what tickets the computer account has (shown in the image). Remember, the computer account is just like a normal account in most ways: it needs a ticket to access resources using Kerberos just like a user does. That is why we see tickets for LDAP and CIFS from when the computer account applied group policy.

So moving on to messing about with some of the common problems...

Situation 1:

One issue that is quite common is that a user is in too many groups. Current versions of Windows support a maximum of 1015 groups, and when you go over this number you are toast. It's documented here: http://support.microsoft.com/kb/328889 - but I think it's much nicer to see what it looks like so you know it when it happens to you.
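Nested groups count too, which is why the number creeps up on people. The sketch below walks a membership map and compares the total with the 1015 figure quoted above. It is a simplification under my own assumptions: the function names and the shape of the nested_in dictionary are made up, and a real token also carries other SIDs that this ignores.

```python
MAX_GROUPS = 1015  # limit quoted in this post (see KB 328889)


def effective_groups(direct_groups, nested_in):
    """Return every group reached through direct and nested membership.

    nested_in maps a group name to the groups it is itself a member of.
    """
    seen = set()
    stack = list(direct_groups)
    while stack:
        group = stack.pop()
        if group in seen:
            continue  # already counted, also protects against loops
        seen.add(group)
        stack.extend(nested_in.get(group, ()))
    return seen


def over_limit(direct_groups, nested_in, limit=MAX_GROUPS):
    """True when the account would exceed the group limit."""
    return len(effective_groups(direct_groups, nested_in)) > limit
```

An account that is a direct member of only one group can still be over the limit if that group is nested inside enough others.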

First thing - creating a heap of groups... I use a really simple PowerShell one-liner that gives me a bunch of global groups; you can adjust the numbers to suit your testing:

    1..1200 | ForEach { Net Group "TestGroup$_" /ADD /Domain }

This gives me 1200 global groups with names TestGroup1 - TestGroup1200.

Next I'm going to create a user and add them to all the test groups.

Now, trying to log on with "Big Group Guy", my test user for this scenario:

Pretty obvious, right? Nice descriptive error message. But in the lab I try to think of stuff that might happen where the results aren't so obvious. For example, what if the user has already logged on before all the groups are added to their account? Usually, the answer is no problem: in our testing we would log the user off as a troubleshooting step, test again and see the descriptive error message. OK, but what if the account is a service account, and now you have a big corporate application that is failing because someone was messing around with PowerShell and accidentally put the service account into a group that has 1015+ groups nested in it? That's where I think it's more tricky and gives us more to think about. So I tested it to see what the results would look like.

I created a file share on FILE and gave the test users access to it, then I created a runas session for a good user (only a few groups) and a bad user (1200 groups) to compare the difference:

A good user can successfully obtain a directory listing for the share they have access to:

But what about the bad user? The guy in 1200 groups... He is already logged on, but now we want a ticket for a service (CIFS file share) on another server... Should that fail? Should I get a ticket for the file server (and be able to do the directory listing)?

Sure can!

So wait, an account has passed the magic 1015 groups mark and Kerberos is still going to hand out tickets for this guy? The answer is yes, and when you think about it, it makes sense (but for me it required a re-read of the Kerberos guide to see when the PAC is populated and with what groups). I just like to test because often the customer says "so, we are making this change in production tonight.... what do you think?" and I'd much rather know what is going to happen than think about stuff I read in a Kerberos book years ago. The reason this works comes down to the timing of PAC construction (the PAC is the thing that holds what groups you are in and what privileges you have). Because I didn't put all the groups onto the bad user account until after he had logged on, the PAC being passed around in his tickets doesn't have the new groups in it yet.

You can inspect group membership of the account being used using whoami /groups as shown below:

What about if the account was hosting a service, and we restarted it after adding all the groups?

That's when we are in trouble, and get a slightly more cryptic error... It doesn't stay cryptic though, because all we need to do is look in the system log and we get the terminology we are familiar with for this problem:

Situation 2:

Duplicate SPNs. "Unique principal names are crucial for ensuring mutual authentication. Thus, duplicate principal names are strictly forbidden, even across multiple realms. Without unique principal names, the client has no way of ensuring that the server it is communicating with is the correct one." Straight from the Kerberos troubleshooting guide. We see it all the time on Active Directory risk assessments where multiple accounts have been set up with the same SPN; it's probably in the top ten most common issues. You can't do it. If you do it, you will probably notice KDC 11 events in the system log on your domain controllers (which are KDCs in the context of this blog entry).

So, some stuff for searching, messing with SPNs...

First step, messing it up. The easiest way is to add the SPN of the thing you want to access to another security principal. For example, I want to access the server called "FILE". So, all I do to mess up Kerberos for that server is to add the SPNs "HOST/FILE" & "HOST/FILE.contoso.com" to another security principal. In the example below, I'm going to add them to the "CLIENT" security principal, which is a workstation on my domain. Note, important note actually, this doesn't stuff up "CLIENT". I can put "HOST/Daffy Duck" as an SPN on "CLIENT" and it doesn't mess up CLIENT. It would mess up a principal on the network called "daffy duck" though. This is important because in Active Directory an admin may not have access to touch the computer object for "FILE", but they can still stuff up Kerberos by putting an SPN for "FILE" on an object they do have access to.

Now when my test user attempts to access the resource...

Well... It works. But it doesn't work as expected, and it won't work in all environments. It works because it has fallen back to NTLM. Kerberos has failed and NTLM has been used (which does not use Service Principal Names). We can confirm this in a couple of ways. The first is with klist.exe - notice there is no ticket for CIFS:

And the other thing, the thing that we should be monitoring for, is that a KDC 11 event will be logged on the KDC (Domain Controller) doing the work. This message will tell us the SPN that needs to be fixed up.

So... now we have to find the problem security principals and make sure only one of them has the SPN associated with it. There are three ways I see customers do this (there may be many more): LDP, ldifde and the SPNQuery script from TechNet (http://technet.microsoft.com/en-us/library/ee176972.aspx ).

With LDP:

  1. Click Start, and then click Run.
  2. In the Open: text box, type LDP, and then click OK.
  3. On the Connection menu, click Connect.
  4. If you are on the domain controller, leave the default settings, and then click OK. If you are not on the domain controller, type the domain controller name in the Server text box and then click OK.
  5. On the Connection menu, click Bind.
  6. Type User, Password, and Domain in the corresponding text boxes, and then click OK.
  7. On the View menu, click Tree.
  8. In the Tree View dialog box, type the base distinguished name in the BaseDN text box or select it from the pull-down menu.
  9. On the Browse menu, click Search.
  10. In the Search dialog box, type the base distinguished name in the BaseDN text box or select from the pull-down menu.
  11. In the Search dialog box, type the following in the Filter text box:

             serviceprincipalname=HOST/FILE.contoso.com

  12. Replace the value after the equals sign with the Service Principal Name that the error refers to (for example, a HOST SPN for computer accounts or an HTTP SPN for Web services).
  13. Under Scope, click the Subtree option.
  14. Click Run.

In the example above we can see that two objects contain that SPN (bad). We need to decide which one to keep; in this case FILE should have the HOST/FILE.contoso.com SPN and CLIENT should not.

Using LDIFDE (not the way I would go about it):

Some customers choose to dump certain parts of AD into a text file, then use search to locate the duplicate SPN. You could achieve this with the command:

    ldifde -f output.txt -d "DC=contoso,DC=com" -r "(objectClass=computer)" -p subtree

and then use Notepad to search as shown below:

The third option is the one I prefer, which is to use the purpose-built script from TechNet:

You simply copy the text from this webpage: http://technet.microsoft.com/en-us/library/ee176972.aspx into a text file. Save it with a .vbs extension - then run "cscript queryspn.vbs HOST/FILE.contoso.com" to get the output needed to correct the problem.

 Remember, the fix here is simply removing the duplicate SPN.
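If you already have an ldifde dump sitting in a text file, searching it by eye gets old quickly. Here is a rough sketch that scans LDIF text for SPNs that appear on more than one object. The function name is my own, and it assumes plain "dn:" and "servicePrincipalName:" lines; it does not handle base64-encoded values or long lines that LDIF has folded.

```python
from collections import defaultdict


def find_duplicate_spns(ldif_text: str) -> dict:
    """Map each SPN (lower-cased) to its owners when more than one object has it."""
    owners = defaultdict(set)
    current_dn = None
    for raw in ldif_text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("dn:"):
            current_dn = line[3:].strip()
        elif lowered.startswith("serviceprincipalname:") and current_dn:
            spn = line.split(":", 1)[1].strip()
            # SPN comparison is case-insensitive, so normalise before grouping.
            owners[spn.lower()].add(current_dn)
    return {spn: sorted(dns) for spn, dns in owners.items() if len(dns) > 1}
```

Run against the dump from my lab, this would be expected to report host/file.contoso.com as owned by both the FILE and CLIENT computer objects, which is the same answer LDP and the TechNet script give.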

Let's try to make it more drastic just to see what this would look like if NTLM hadn't kicked in and saved the day... I have disabled NTLM in the domain, and I try to use the file share on the server with the duplicate SPN (called "FILE").

Same command - simple "dir" to list the directory on the remote server:

Ouch. What about from the graphical interface?

That's a little bit more descriptive, but not completely obvious. We do pick up a KDC 11 event on the domain controller though. But be aware, if Kerberos fails due to a duplicate SPN, usually NTLM will kick in. If the environment has NTLM turned off, the problem gets more serious. Also, in a double hop situation NTLM won't work and Kerberos is essential, so the authentication will fail there too.

Just quickly, another less obvious side effect of duplicate SPN:

In this case I was just trying to do a runas to do more troubleshooting on the machine with the duplicate.

And also, if I reboot that machine (and force the system to try to get a ticket for itself):

Big ouch. Imagine this is an enterprise-wide application server... simple misconfiguration and boom. Luckily, it's really easy to fix. Use any of the three search techniques from above, find the duplicate SPN in the domain, then remove it. In the example above, I didn't even need to reboot the server to log on - just remove the duplicate SPN, let AD replicate, then log on. (I would recommend a reboot anyway though - computer policy might be in a bad way otherwise).

 Situation 3:

Here's where I get lazy and don't reproduce the issue - but this one is simple... Time. Remember Kerberos encrypts the system time in with the authentication data. The KDC opens up the encrypted data, checks the time, and if it's outside the acceptable range you will have problems. The default range is 5 minutes, but remember this can be changed and many customers use different values here.
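The check itself is nothing more than comparing two clocks against a tolerance. A minimal sketch, assuming the 5 minute default mentioned above (the function and constant names are mine, not anything Windows exposes):

```python
from datetime import datetime, timedelta, timezone

DEFAULT_MAX_SKEW = timedelta(minutes=5)  # default tolerance quoted above


def within_skew(client_time: datetime, kdc_time: datetime,
                max_skew: timedelta = DEFAULT_MAX_SKEW) -> bool:
    """True when the two clocks differ by no more than the allowed skew."""
    return abs(client_time - kdc_time) <= max_skew
```

Note that the skew works in both directions: a client that is six minutes fast fails just like one that is six minutes slow.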

Situation 4:

Again, lazy - too hard to reproduce. UDP fragmentation. Remember that early Windows Kerberos clients used UDP. This includes anything XP/2003 and older. Vista and above use TCP. The problem here is that UDP has no mechanism to retransmit and order packets, so Kerberos can end up failing when packets get lost or arrive in the wrong order. UDP is built for speed and low overhead, where TCP is a trusty checks-and-balances type of protocol. Kerberos has been moved to TCP on all modern Windows systems, so if you think you are hitting this problem you can test changing your Kerberos client to always use TCP with the technique in this KB article: http://support.microsoft.com/kb/244474

 Situation 5:

Multi-Tiered Apps, Kerberos "Double Hop" and all that nasty business:

If you want to mess around with Kerberos delegation but don't want to install any heavy multi-tiered application, this lab setup might be what you are looking for.

The basic overview diagram looks like this:

On WFE01 (web front end 01) I installed the basic IIS setup. Then on the default website I simply added a virtual directory that pointed to \\SQL01\share (literally this step is just a right click on the default web site inside IIS manager, then choose NEW, then Virtual Directory). I assumed this would involve Kerberos authentication and require some messing around with SPNs and delegation settings, which was exactly what I was looking for.

While running through the New Virtual Directory wizard you get this prompt which gives me hope that this will be a super lightweight and successful way to test delegation:

At the end of the wizard, I now have a virtual directory called “test” that is running on a server called “WFE01”. The way clients should access this is http://wfe01/test. The “test” virtual directory should be nothing more than a listing of the contents from my \\SQL01\share directory.

If I attempt to view the site on the WFE01 server itself, I would expect it to work OK without doing anything else because we don't need delegation; we just need a ticket for the SQL01 machine:

This works. And we can see Kerberos authentication in the security log on the SQL01 server.

Using klist (“klist tickets”) on the WFE01 server I can also see that I have obtained a ticket for the SQL box.

Server: cifs/sql01@CONTOSO.COM
KerbTicket Encryption Type: RSADSI RC4-HMAC(NT)
End Time: 1/2/2011 21:12:23
Renew Time: 1/9/2011 11:12:23
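When there are a lot of tickets, or a lot of machines, it can be handy to check klist output with a script rather than by eye. This is a small sketch that pulls the "Server:" lines out of saved klist text; the function names are my own and the regular expression is based only on the output format shown in this post.

```python
import re

# Matches lines like "Server: cifs/sql01@CONTOSO.COM", with or without
# spaces around the "@".
SERVER_LINE = re.compile(r"^\s*Server:\s*(\S+?)\s*@\s*(\S+)\s*$", re.MULTILINE)


def ticket_services(klist_output: str) -> list:
    """Return (spn, realm) pairs for every ticket listed in the output."""
    return [(spn.lower(), realm.upper())
            for spn, realm in SERVER_LINE.findall(klist_output)]


def has_ticket_for(klist_output: str, service_class: str, host: str) -> bool:
    """True when a ticket for service_class/host is present (case-insensitive)."""
    wanted = f"{service_class}/{host}".lower()
    return any(spn == wanted for spn, _ in ticket_services(klist_output))
```

Feeding it the text above, has_ticket_for(output, "cifs", "sql01") is true, which is the same check I was doing by reading the klist window.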

Now the interesting part will be how well this works for clients. I have a Windows XP client machine in the lab – should it be able to access the same site? I guess not since delegation will be required, remembering that the XP client will need a ticket for the WFE01 server, then it will be needing the WFE01 server to obtain a ticket on its behalf for the SQL01 server to retrieve the directory listing.

As expected. No joy. Let's look at the logs on the WFE01 server and SQL01 server to see what type of authentication is happening. We could also use netmon, but in the first instance this is much quicker.

On the WFE01 server, the web front end, everything looks ok:

Successful Network Logon:
User Name: tom.tester
Domain: CONTOSO
Logon ID: (0x0,0x135C48)
Logon Type: 3
Logon Process: Kerberos
Authentication Package: Kerberos
Workstation Name:
Logon GUID: {abb6c392-9910-601b-5b95-f426201eaa9b}
Caller User Name: -
Caller Domain: -
Caller Logon ID: -
Caller Process ID: -
Transited Services: -
Source Network Address: 10.0.0.88
Source Port: 1310

On the SQL01 server, not so nice:

Successful Network Logon:
User Name:
Domain:
Logon ID: (0x0,0x7D820)
Logon Type: 3
Logon Process: NtLmSsp
Authentication Package: NTLM
Workstation Name: WFE01
Logon GUID: -
Caller User Name: -
Caller Domain: -
Caller Logon ID: -
Caller Process ID: -
Transited Services: -
Source Network Address: 10.0.0.5
Source Port: 0

We can see that the request to display our test user the contents of the directory was sent, but the authentication that was used was NTLM, and also that it came through as an ANONYMOUS logon from NT AUTHORITY\ANONYMOUS LOGON.
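The tell-tale combination is an NTLM logon with no user name arriving at the back end. If you are collecting these events as text, a sketch like the following can flag it. The function names are hypothetical, and the parsing assumes the simple "Field: value" layout of the events pasted in this post.

```python
def parse_logon_event(text: str) -> dict:
    """Turn 'Field: value' lines from a logon event into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def looks_like_failed_delegation(event: dict) -> bool:
    """True for an NTLM logon with a blank or anonymous user name."""
    anonymous = event.get("User Name", "") in ("", "ANONYMOUS LOGON")
    return event.get("Authentication Package") == "NTLM" and anonymous
```

The event from SQL01 above trips this check, while the Kerberos logon for tom.tester on WFE01 does not.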

So what gives? This is the classic example of delegation not working, and it's expected in this case because I haven't allowed it yet.

The trick here is to allow the web server to pass the credentials of the user to the next tier. Specifically in my case, for the WFE01 box to take the credentials I used for it (tom.tester credentials) and pass them to SQL01. To do this, I make a change on the computer object of the IIS server, WFE01.

With the settings above I have chosen to allow normal delegation, not the newer “constrained delegation” available in Windows 2003+. Before I change it to constrained I want to check that it works at all.

Logging back on to the XP client the web site displays as expected:

Looking at the event logs on the SQL01 server, things look MUCH better now. Rather than seeing NT AUTHORITY\ANONYMOUS LOGON, I see the name of the user whose credentials are being passed. Also the authentication package is now Kerberos as expected, rather than NTLM. The other interesting point is that the authentication request is originating from 10.0.0.5, which is my web server WFE01, so it is doing the delegation as expected.

Successful Network Logon:
User Name: tom.tester
Domain: CONTOSO
Logon ID: (0x0,0x800FF)
Logon Type: 3
Logon Process: Kerberos
Authentication Package: Kerberos
Workstation Name:
Logon GUID: {88478d58-81e2-2cc8-1529-439df5daab93}
Caller User Name: -
Caller Domain: -
Caller Logon ID: -
Caller Process ID: -
Transited Services: -
Source Network Address: 10.0.0.5
Source Port: 0

Excellent, so I have a functioning Kerberos delegation lab that I can use to inspect packets and get into the guts of what is going on. But there is an extra step I really need to do to make it more relevant. That is adding the option for “constrained delegation”. Essentially saying to the IIS box “I trust you for delegation BUT only to the SERVERS and the SERVICES I say”. To make that change I go back to the computer object of the IIS box, and modify the delegation tab:

In this case I choose ONLY sql01.contoso.com and ONLY the CIFS service, which is the one I will need for the directory listing my web page is performing.
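Conceptually, constrained delegation is an allow list of SPNs stored on the front end's account (the msDS-AllowedToDelegateTo attribute in AD). The toy check below only illustrates that idea; the function name is made up and the real decision is made by the KDC, not by anything you write.

```python
def may_delegate(allowed_spns, target_spn: str) -> bool:
    """True when target_spn is on the front end's delegation allow list."""
    allowed = {spn.lower() for spn in allowed_spns}
    return target_spn.lower() in allowed
```

With an allow list of just cifs/sql01.contoso.com, a request to delegate to CIFS on SQL01 passes, while a request for any other service or server does not.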

A test of the site and everything is still working as expected. Good stuff. Lab ready.

Just out of interest, here is my (probably not) fail-safe guide to setting up SPNs for complex multi-tiered situations. It works for me, just to remind me what SPN goes where:

So in the example above there is a front end server and a back end server. There is an account that has been set up for SQL to run under, and an account that IIS runs under. This is common for a whole swag of Microsoft solutions. In this scenario it is a MOSS setup, but just slot in your front end and back end names, draw it up on a whiteboard, then configure it. The diagram is showing you where you need to "allow delegation" on the AD object, and where to add SPNs to make the solution work. (The FE and BE aren't real names in the setup, just showing which bits are usually front end and which are the back end.)

 Situation 6:

You are either completely out of options, bored, or just inquisitive... Attach a debugger to see Kerberos errors. This is the last resort. Between event logs, klist.exe, netmon and application errors you will solve most of the Kerberos problems you are ever likely to see. But if you get to the point where something awful is happening and it can't be solved any other way... maybe the debugger can help you. DO NOT DO THIS ON A PRODUCTION SERVER. You have been warned: debuggers have their place, but remember you are breaking in to a critical process and poking around, and accidents happen. If you think you need to use the debugger on your production server, speak to Microsoft Support Services first.

  1. Click Start, click Run, and then type regedit.exe 
  2. Open the following registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters\
  3. Create the following entry:

Value: KerbDebugLevel
Type: DWORD
Data: c0000043 (this value will print the standard set of debug messages. If you still want to see more output, set it to ffffffff).

Then:

  1. Determine the Process Identifier (PID) for the lsass.exe process from the Task Manager.
  2. Click the Processes tab.
  3. Select the View menu and choose Select Columns.
  4. Click the PID (Process Identifier) check box and then click OK.
  5. Click Start, click Run, and then type ntsd -p PID, where PID is the process ID of lsass.exe.

This will start the debugger and attach it to lsass. While the debugger loads, you might need to wait a few minutes before the system presents a prompt.

    1. At the prompt, type g. The debugger will now print out any errors that Kerberos authentication encounters.
    2. Try to authenticate using the Kerberos protocol, and then check the debug output for any error messages that might further elaborate upon the Kerberos errors seen in the event log.

Once you find the error information from the debugger, you can use the "Kerberos Troubleshooting" whitepaper from the start of the article to locate the steps for your message. Pages 30+ have steps for individual return codes.

I should also point out that when you disconnect the debugger, the box will probably bomb:

Remember, LSASS.exe is the heart of your domain controller - you just attached things to that heart and then ripped them off. The system won't be happy with you.

That's about it: a day spent going over the Kerberos troubleshooting guide, taking notes and messing about in the lab. I hope something in here is useful to you.