David Notario's WebLog

  • Gotchas with Reverse Pinvoke (unmanaged to managed code callbacks)

    One of the first things I had to do when I started working in Outlook Web Access (will call it OWA from now on) was integrating an unmanaged component that another group at Microsoft wrote. After spending some time with it, I decided the best way to bring it in would be writing a very small unmanaged library that wrapped its behavior and exposed a very simple interface to OWA. This interface would consist of only 3 or 4 APIs, and we would use them via PInvoke.

    In order to avoid keeping unnecessary state around in the unmanaged wrapper dll, one of the APIs would do callbacks to OWA in order to report incremental results. This is something I hadn't done before in managed code. This is what it looks like:

    C++ side:

    typedef void (__stdcall *PFN_MYCALLBACK)();
    int __stdcall MyUnmanagedApi(PFN_MYCALLBACK callback);

    C# side:

    public delegate void MyCallback();
    [DllImport("MYDLL.DLL")] public static extern int MyUnmanagedApi(MyCallback callback);

    public static void Main()
    {

      MyUnmanagedApi(
        delegate()
        {
          Console.WriteLine("Called back by unmanaged side");
        }
       );
    }

    The CLR will do all the magic to marshal our anonymous delegate to an unmanaged pointer that can be passed out to the C++ side. However, when we look a bit more closely at things, we may find some interesting problems:

    What happens if my callback throws, what will my unmanaged code see? For example:

    delegate()
    {
      Console.WriteLine("Called back by unmanaged side");
      throw new ApplicationException("Let's see what happens");
    }

    If you try this out you will see that your C++ code will just see an SEH exception fly through... oops, my C++ code wasn't ready to deal with SEH stuff, so it may leak. What can I do? In my case I controlled the signature of the callback, so I changed it to return an HRESULT. That way I could morph exceptions into HRESULTs, which is what the CLR does for the COM interop case:

    delegate()
    {
      int hresult = Constants.S_OK;
      try
      {
        DoWork();
      }
      catch (Exception e)
      {
        // Map the managed exception to its HRESULT instead of letting it escape
        hresult = Marshal.GetHRForException(e);
      }

      return hresult;
    }
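    For completeness, here is a minimal sketch of what the native side might do with that returned HRESULT while reporting incremental results. The names (HResult, ResultCallback, ReportItems) are made up for illustration, and the HRESULT type and constants are portable stand-ins rather than the real Windows definitions:

```cpp
#include <cstdint>
#include <vector>

// Portable stand-ins for the Windows HRESULT type and two well-known values.
// These are assumptions for the sketch; real code would use <windows.h>.
using HResult = std::int32_t;
constexpr HResult kOk = 0;                                   // S_OK
constexpr HResult kFail = static_cast<HResult>(0x80004005);  // E_FAIL

// An HRESULT is a failure when its high (sign) bit is set.
inline bool Failed(HResult hr) { return hr < 0; }

// Hypothetical callback shape: it reports one item and returns an HRESULT.
using ResultCallback = HResult (*)(int item);

// Hands each item to the callback and stops at the first failure,
// returning that failure code to the caller. On return, *reported holds
// the number of items the callback accepted.
HResult ReportItems(const std::vector<int>& items, ResultCallback callback, int* reported) {
    *reported = 0;
    for (int item : items) {
        const HResult hr = callback(item);
        if (Failed(hr)) {
            return hr;
        }
        ++*reported;
    }
    return kOk;
}
```

    The point of the shape is that a failing callback stops the reporting loop and its error code travels back to whoever called the API, instead of an exception unwinding through native frames.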

    Are we done? Nope. What are we missing? For one, a thread abort could be induced at the 'return hresult' statement, so we would be back to square 1, with the unmanaged code seeing an unexpected SEH exception. Another interesting thing to look at is what the unmanaged to managed transition looks like. First, let's think about what an unmanaged to unmanaged callback would look like:

    Let's assume a callback that adds its 2 arguments. This would be some code in the caller:

    mov ECX, [x]
    mov EDX, [y]
    call [callback]

    This is what the callback code could look like

    mov eax, ecx
    add eax, edx
    ret

    What can go wrong here? Not much, we jump directly to the callee, the only thing you could hit could be a stack overflow, if you are running out of stack when the call instruction pushes the return address. Besides that, this code shouldn't fail.

    Now, let's go back to our unmanaged to managed callback. It's definitely more complicated than the unmanaged to unmanaged one. What can go wrong? For example, marshalling an LPWSTR to a string object requires memory allocations, which may fail. Also, you may not be allowed to actually go into managed code because the appdomain the thread was running in is unloading, etc... So definitely, things can fail before we have any chance to take any action, and again we are in the hands of the CLR. For COM interop, the CLR knows the signature has an HRESULT, so it will actually trap any of these problems and return an HRESULT. However, for our callback, what can it do if the callback returns void? In Whidbey, the answer is nothing: again, a rude SEH exception will be raised against us.

    So basically, in order to be robust here, the best we can do right now is have the C++ code wrap its callback with a __try/__except block, which will handle these situations and also the thread abort on return I described above. Of course, in this case I owned the unmanaged code, so I was able to fix it, but there are a lot of these callbacks in Windows APIs, which won't do the __try/__except, so you basically are always at risk of something bad happening. Don't you hate it when you can't write reliable code?

    HRESULT __stdcall DoCallback(PFN_MYCALLBACK callback)
    {
      HRESULT hr;
      __try
      {
        hr = callback();
      }
      __except(FilterCLRExceptions(GetExceptionCode(), GetExceptionInformation()))
      {
        hr = OOPS;
      }

      return hr;
    }

    I talked with my ex-coworkers about this and they acknowledged that this is not something they are very happy about, but there just wasn't time to address this in Whidbey. Fixing this is on the TODO list for future versions of the CLR.

    Note: The one thing that drove me to write this down was that not one of the tutorials or documentation that explains 'reverse pinvoke' or unmanaged to managed callbacks mentions these problems, which is something I think we can do much better about (and we really need to do a better job, since 99.999% of people out there can't see our code to see what we really do).

  • New job!

    Another long time without posting anything. This time due to changing groups!!! Once we had wrapped Whidbey up, I spent some time thinking about what I would like to do next. One of the good things about Microsoft is that it is a huge company, you can work in a ton of software areas here, from experimental OSs or the Windows kernel and all the way to something like MSN web services, passing by games, productivity software, databases, search, etc, etc... It's just amazing.

    I had a great time in the CLR. I spent 5 years there, working with some of the most talented people I've ever met, and had the opportunity to work on cool features of the product. Working in the CLR is close to building an OS: you are building a platform upon which a ton of groups and people build their own software. Most of the time you are working at a very low level, and we must be one of the few groups left in the company that is implementing crazy x86 tricks to get some extra cycles out of some important codepath. It's cool to be at the bottom of the stack, but it also has its bad side: everything you do you have to look at from 1000 different angles, see how you will affect everybody, make sure you are meeting everybody's needs, and no matter what you do, you will be breaking or annoying somebody out there. The result of this is that everything moves slowly. It's the price of doing business when you are the platform.

    So after thinking about it, I heard about an opportunity in Exchange, in Outlook Web Access. I had some friends there, plus it's an app I use every day and like, it's mostly managed code and there are a bunch of things to work on. After 4 months here, I can say I'm happy with my decision. I get to work on a lot of interesting stuff, I'm using the platform I helped to build, which is an incredible feeling, and I get to use (dogfood) the app I'm building every day, which is a really good feeling.

    So anyways, I'll probably keep posting about the CLR, since I'm seeing a lot of things from a different perspective now. Now I'm using managed code every day in a real product, not in my home projects, and boy, it's amazing. The combination of Visual Studio + the super rich set of libraries the .NET platform has is just a massive productivity boost, and you really have fun writing code. I also get to see some ugliness we didn't really get right, and until you feel the pain, you don't realize how much a simple mistake in the platform or one of these 'postponed' features can cost everybody out there for n years.

     

  • Does the JIT take advantage of my CPU?

     

    Short answer is yes. One of the advantages of generating native code at runtime is that we know what processor you are running on and we can tune the code accordingly. Why would we do that for x86? Every generation of x86 processors has its own personality. Their personality comes usually in 2 ways:

     

    -          New instructions: For example, SSE and SSE2 instruction set

    -          New ‘moods’: For example, the Pentium 1 wanted the programmer to schedule instructions by hand (in order to fill its 2 execution pipes). Other examples are changes in branch prediction logic, the P4’s trace cache, etc, and even ‘regressions’, such as the P4 preferring ADD REG, 1 over INC/DEC instructions (which were very frequent instructions in tight loop code).

     

    Also, AMD processors have their own personality, although in my experience, AMD’s are much more predictable and ‘well behaved’, and thus, need less work.

     

    Note that we don’t just jump in and implement functionality in the JIT to take advantage of every processor difference. The process is usually identifying something that is hurting in one of the benchmarks or user scenarios we track, evaluating the cost of the fix and the risk of the fix (every time we make a processor specific optimization, we make the life of our test team a bit harder and it’s easier to have x-proc only bugs, which are a bit harder to track down) and then, once we have all that data, making a decision.

     

    Examples of some of the processor specific optimizations in x86 (none of these will be a big surprise for developers that do machine level programming, all of these optimizations are called out in big fonts in the processor optimization manuals).

     

    -          Use of the CMOV instruction when available (enables conditional moves, which are very useful in branches that are taken in a random, ie, non predictable by the processor, fashion)

    -          Use of FCOMIx family of instructions (makes floating point comparisons much cheaper)

    -          Use of SSE2 for memory copies (memory copies are fun, you would expect something that simple to be always the same, but I’ve witnessed 4 ‘recommended’ ways of doing it during the time I’ve worked with x86: use string instructions (REP/MOVS/STOS), use floating point registers for the move, use scalar instructions (to get better pairing/parallelization) and now use SSE2). Note that we don’t use SSE2 for floating point code. The reasons for this are that we don’t vectorize code (which is the real win with SSE2), SSE2 for scalar floating point is not always a win compared to the x87 (different latencies between instruction sets for adds and multiplications make each one of them better than the other depending on the scenario) and some things like converting doubles to floats were really slow on SSE2, so we decided to invest in making our x87 code better (which we were going to have to support anyways).

    -          Use of SSE2 for floating point to int conversion.

    -          Other minor instruction selection differences (such as avoiding INC and DEC instructions in hot code for P4 or to avoid store forwarding problems in P4 and Centrino processors)
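
    To make the CMOV bullet a bit more concrete, here is a small C++ sketch (not JIT code) of the kind of data-dependent select that benefits from a conditional move. The function names are made up, and whether a given compiler emits CMOV for either form is up to the compiler and its flags:

```cpp
#include <cstdint>

// Plain conditional select. When the comparison is unpredictable, a compiler
// may lower this to CMP + CMOV instead of a conditional jump.
std::int32_t SelectMax(std::int32_t a, std::int32_t b) {
    return a > b ? a : b;
}

// The same select written without any branch in the source, using a mask.
// This spells out by hand what a conditional move achieves in one instruction.
std::int32_t SelectMaxBranchless(std::int32_t a, std::int32_t b) {
    const std::int32_t mask = -static_cast<std::int32_t>(a > b);  // all ones if a > b, else 0
    return (a & mask) | (b & ~mask);
}
```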

     

    We don’t take advantage of other things, such as knowing code cache sizes, etc… One of the reasons for this is that we don’t want different code on every single machine out there. As usual, there is a trade off if we did this, we may get some extra speed in some situations, but on the other hand, in  a realistic world, it’s more likely for us to produce bugs that only repro in machines that meet n conditions, so introducing more processor specific optimizations has to be done carefully.

     

    What about NGEN?

     

    NGEN is an interesting case. In previous versions of the CLR (1.0 and 1.1), we treated NGEN compilation the same way as we treated JIT compilation, ie, given that we are compiling on the target machine, we should take advantage of it. However, in Whidbey, this is not the case. We assume a PPro instruction set and generate code as if it was targeting a P4. Why is this? There were a number of reasons:

     

    -          Increase predictability of .NET redist assemblies (makes our support life easier).

    -          OEMs and other big customers wanted a single image per platform (for servicing, building and managing   reasons)

     

    We could have had a command line option to generate platform specific code in ngen, but given that ngen is mainly there to provide better working set behavior and that this extra option would complicate some other scenarios, we decided not to go for it.

  • CLR and floating point: Some answers to common questions

    Some very common questions I get from customers regarding floating point are:

     

    -          I get different results when compiling with optimizations vs without optimizations!

    -          My == checks are failing when the expressions are the same!

     

    The answer to these questions is most of the time ‘This is by design’, but it does seem that we’re not doing a good job explaining why it is by design or why things work this way. I’ll try to cover the most common causes for this behavior in this post.

     

    X87 FPU

     

    X86’s old FPU is based on the following

     

    - 8 registers laid out in the form of a stack. The registers are 80 bit wide, although the precision at which operations happen (significant bits of the mantissa) can be modified via a control register. The range cannot be changed though (the exponent will always be 15 bit wide)

    - Some control registers that control the precision, exception modes, etc of the FPU.

    - Set of instructions to operate with stack registers and/or memory. Instructions that operate with memory can only use a register operand if the register is the top of the stack. Operations that involve 2 FP registers can access any element of the stack. This aspect of the FPU is what makes code generation for it more ‘interesting’

     

    Most of the non intuitive behavior people encounter comes from the fact that the registers are 80 bit wide. Precision is set by default in VC++ and CLR apps to ‘double precision’, which means that if you are operating with operands of type float, results of operations done with floats actually exist in the x87 stack as if they were of type double. In fact, it’s even weirder than that. They will have the mantissa of a double, but the range (exponent) of an extended double (80 bit).

     

    That means that a sequence of operations like the following

     

    -          Load float A

    -          Load float B

    -          Multiply loaded arguments

    -          Store float result  in memory

    -          load float A

    -          load float B

    -          Multiply loaded arguments

    -          Compare with stored result in memory

     

    May result in ‘not equal’. Why? If the result of the first multiplication was not exactly representable as a float, it lost bits of precision when stored in memory. Thus, when we compare what we stored in memory with what is in the FP stack, the numbers are similar, but not equal, giving the surprising result.
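
    The same effect can be modeled in portable C++ by using double as a stand-in for the wider x87 register and float as the storage location. This is only a sketch of the mechanism, not what the JIT generates, and the function name is made up:

```cpp
// Models "multiply in a wide register, store as float, compare against the
// wide value". double plays the role of the x87 register here.
bool ProductSurvivesFloatStore(float a, float b) {
    // The product of two floats always fits exactly in a double.
    const double wide = static_cast<double>(a) * static_cast<double>(b);
    // Storing to a float location rounds away any bits that do not fit.
    const float stored = static_cast<float>(wide);
    // Compare the stored (narrowed) value with the value still "in the register".
    return static_cast<double>(stored) == wide;
}
```

    With a = b = 1.1f the function returns false, because the exact product needs more mantissa bits than a float has. With 2.0f and 3.0f it returns true, since 6 is exactly representable.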

     

    Even more subtle

     

    -          load double MAX_DOUBLE

    -          load double MAX_DOUBLE

    -          add

    -          store double A

    -          load double A

    -          load double MAX_DOUBLE

    -          Subtract

     

    Is different than

     

    -          load double MAX_DOUBLE

    -          load double MAX_DOUBLE

    -          add

    -          load double MAX_DOUBLE

    -          subtract

     

     

     

     

    At the end of the first sequence we have +Infinity in the floating point stack, because 2*MAX_DOUBLE is bigger than MAX_DOUBLE, which is the biggest number that can be represented in a double, so the store to A produced Infinity. Subtracting MAX_DOUBLE from Infinity still results in Infinity. However, in the second sequence, everything happened in the floating point stack, which is operating in double extended range and can represent 2*MAX_DOUBLE, so when we subtract, we get MAX_DOUBLE instead of Infinity.

     

    What CLR has to say

     

    How does the CLR play with this weird HW? Pulled out from the ECMA spec:

     

    Storage locations for floating-point numbers (statics, array elements, and fields of classes) are of fixed size. The supported storage sizes are float32 and float64. Everywhere else (on the evaluation stack, as arguments, as return types, and as local variables) floating-point numbers are represented using an internal floating-point type. In each such instance, the nominal type of the variable or expression is either R4 or R8, but its value can be represented internally with additional range and/or precision.  The size of the internal floating-point representation is implementation-dependent, can vary, and shall have precision at least as great as that of the variable or expression being represented. An implicit widening conversion to the internal representation from float32 or float64 is performed when those types are loaded from storage. The internal representation is typically the native size for the hardware, or as required for efficient implementation of an operation.  The internal representation shall have the following characteristics:

    ·              The internal representation shall have precision and range greater than or equal to the nominal type.

    ·              Conversions to and from the internal representation shall preserve value.

    [Note: This implies that an implicit widening conversion from float32 (or float64) to the internal representation, followed by an explicit conversion from the internal representation to float32 (or float64), will result in a value that is identical to the original float32 (or float64) value. end note]

    [Rationale: This design allows the CLI to choose a platform-specific high-performance representation for floating-point numbers until they are placed in storage locations.  For example, it might be able to leave floating-point variables in hardware registers that provide more precision than a user has requested.  At the same time, CIL generators can force operations to respect language-specific rules for representations through the use of conversion instructions. end rationale]

    When a floating-point value whose internal representation has greater range and/or precision than its nominal type is put in a storage location, it is automatically coerced to the type of the storage location.  This can involve a loss of precision or the creation of an out-of-range value (NaN, +infinity, or ‑infinity). However, the value might be retained in the internal representation for future use, if it is reloaded from the storage location without having been modified.  It is the responsibility of the compiler to ensure that the retained value is still valid at the time of a subsequent load, taking into account the effects of aliasing and other execution threads (see memory model section).  This freedom to carry extra precision is not permitted, however, following the execution of an explicit conversion (conv.r4 or conv.r8), at which time the internal representation must be exactly representable in the associated type.

    [Note: To detect values that cannot be converted to a particular storage type, a conversion instruction (conv.r4, or conv.r8) can be used, followed by a check for a non-finite value using ckfinite. Underflow can be detected  by converting to a particular storage type, comparing to zero before and after the conversion. end note]

    [Note: The use of an internal representation that is wider than float32 or float64 can cause differences in computational results when a developer makes seemingly unrelated modifications to their code, the result of which can be that a value is spilled from the internal representation (e.g., in a register) to a location on the stack. end note]

     

    This spec clearly had the x87 FPU in mind. The spec is basically saying that a CLR implementation is allowed to use an internal representation (in our case, the x87 80 bit representation) as long as there is no explicit storage to a coerced location (a class or value type field) that forces narrowing. Also, at any point, the IL stream may have conv.r4 and conv.r8 instructions, which will force the narrowing to happen.

     

    Why did the spec authors decide to go down this route? Imagine they hadn’t, and said that the precision/range of FP results would always have to be that of the type of their operands. On x87 it would mean having to spill to memory (in order to narrow down to the operand precision/range) after every operation. This can become a very high price to pay for that extra predictability, which is not really cross platform, and which not everybody needs.

     

    Typical problems: It’s a language choice

     

    Note that with the current spec, it’s still a language choice to give ‘predictability’. The language may insert conv.r4 or conv.r8 instructions after every FP operation to get a ‘predictable’ behavior. Obviously, this is really expensive, and different languages make different compromises. C#, for example, does nothing: if you want narrowing, you will have to insert (float) and (double) casts by hand. On the other hand, VC++ Whidbey has a new floating point model, which by default will do narrowing on assignment boundaries (it’s more complex than that, see http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/floapoint.asp for more details).

     

    Let’s look at what can happen with current C# semantics in a recent question we got:

     

     

    int Compare(Point x, Point y)
    {
          float thetaX = Worker(pivot, x);
          float thetaY = Worker(pivot, y);

          if (thetaX == thetaY)
          {
                return (0);
          }
          else
          {
                return (-1);
          }
    }

     

    Worker was a ‘pure’ function that returned a float value, as in it depended only on its input for its result, and produced no side effects. The surprise was coming from the fact that if you called Compare with an x == y, the function would still return -1. Let’s look at the disassembly to see what may have happened:

     

     

                …. Prolog

                lea     EAX, bword ptr [ESI+4]

                cmp     ECX, dword ptr [EAX]

                sub     ESP, 8

                movq    XMM0, qword ptr [EAX]

                movq    qword ptr [ESP], XMM0

                lea     EAX, bword ptr [ESP+14H+08H]

                sub     ESP, 8

                movq    XMM0, qword ptr [EAX]

                movq    qword ptr [ESP], XMM0

                call    [Worker(struct,struct):float]

                fstp    dword ptr [ESP]

                add     ESI, 4

                sub     ESP, 8

                movq    XMM0, qword ptr [ESI]

                movq    qword ptr [ESP], XMM0

                lea     EAX, bword ptr [ESP+0CH+08H]

                sub     ESP, 8

                movq    XMM0, qword ptr [EAX]

                movq    qword ptr [ESP], XMM0

                call    [Worker(struct,struct):float]

                fld     dword ptr [ESP]

                fcomip  ST(0), ST(1)

                fstp    ST(0)

                jpe     SHORT G_M004_IG03

                jne     SHORT G_M004_IG03

                xor     EAX, EAX

                …. Epilog

     

     

     

    What is happening is that Worker returned a value in the ST(0) register of the FP stack, which was narrowed down to float precision in the fstp (memory store) following the first call. But then, for doing the comparison, it’s comparing ST(0) returned by the second call with the narrowed down to float result of the first call. So if what Worker did was not exactly representable in a float, the comparison (fcomip) would result in false.

     

    How could this be solved? By narrowing by hand the result of the Worker() calls:

     

    int Compare(Point x, Point y)
    {
          float thetaX = (float) Worker(pivot, x);
          float thetaY = (float) Worker(pivot, y);

          if (thetaX == thetaY)
          {
                return (0);
          }
          else
          {
                return (-1);
          }
    }

     

    Which generates:

     

                …prolog

                lea     EAX, bword ptr [ESI+4]

                cmp     ECX, dword ptr [EAX]

                sub     ESP, 8

                movq    XMM0, qword ptr [EAX]

                movq    qword ptr [ESP], XMM0

                lea     EAX, bword ptr [ESP+18H+08H]

                sub     ESP, 8

                movq    XMM0, qword ptr [EAX]

                movq    qword ptr [ESP], XMM0

                call    [ComputationalGeometryUtils.theta(struct,struct):float]

                fstp    dword ptr [ESP]

                add     ESI, 4

                sub     ESP, 8

                movq    XMM0, qword ptr [ESI]

                movq    qword ptr [ESP], XMM0

                lea     EAX, bword ptr [ESP+10H+08H]

                sub     ESP, 8

                movq    XMM0, qword ptr [EAX]

                movq    qword ptr [ESP], XMM0

                call    [ComputationalGeometryUtils.theta(struct,struct):float]

                fstp    dword ptr [ESP+04H] ; Narrow down

                fld     dword ptr [ESP+04H] ; and reload result of call

                fld     dword ptr [ESP] ; result of first call

                fcomip  ST(0), ST(1) ; comparison is now done on narrowed down results

                fstp    ST(0)

                jpe     SHORT G_M004_IG03

                jne     SHORT G_M004_IG03

                xor     EAX, EAX

                …epilog

     

    This will now be more in line with what the programmer expects. Everything that is happening here is in 100% compliance with the CLR ECMA specification. I would agree that this behavior is not 100% intuitive, but the problem is that being intuitive would come with a performance penalty, which not all users would be happy with. Maybe the solution is for C# to adopt FP models just like C++ has now, which, although they don’t solve the problem completely, do help in the most common situations.
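
    As a rough C++ model of the fix, with double standing in for the wide x87 register and float for the nominal type, narrowing by hand amounts to rounding both sides to float before comparing. The function name is hypothetical and this is a sketch of the idea, not of the generated code:

```cpp
// Compares two values that may carry extra precision by first narrowing
// both to float, which is what the explicit (float) casts force.
bool EqualAfterNarrowing(double x, double y) {
    return static_cast<float>(x) == static_cast<float>(y);
}
```

    A value that was computed wide and a copy of it that took a round trip through a float compare equal under this function, even though a direct comparison of the two would say they differ.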

  • Lazy init singleton

    I’ve seen some confusion in some MS internal mailing lists about when singletons are instantiated for the pattern described in:

     

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnbda/html/singletondespatt.asp

     

     

    // .NET Singleton

    sealed class Singleton

    {

        private Singleton() {}

        public static readonly Singleton Instance = new Singleton();

    }

     

    The confusion comes from trying to understand when the singleton’s instance is created. The article states:

     

    ‘What we really care about is that we get the instance created either on or just before the first call to (in this case) the Instance property, and that we have a defined initialization order for static variables in a class’

     

    With the code above, the correct answer is ‘Whenever the CLR decides so, as long as it happens before the first use of Singleton.Instance’. This is not what you really want if your singleton is bringing in some expensive resource, and the text in the article, if not wrong, may be a bit misleading.

     

    The following counterexample initializes the singleton even if we will never call Instance.

     

    using System;

    sealed class Singleton
    {
        private Singleton()
        {
                Console.WriteLine("Singleton initialized");
        }

        public void Method()
        {
                Console.WriteLine("Method");
        }

        public static readonly Singleton Instance = new Singleton();
    }

    class Test
    {
        static public bool bFalse = false;

        static public void Main()
        {
                if (bFalse)
                {
                        Singleton.Instance.Method();
                }
        }
    }

     

    Why?

     

    The compiler embeds the static initialization

     

    public static readonly Singleton Instance = new Singleton();

     

    inside the class constructor (.cctor) of the class Singleton. So the question now is ‘when does the .cctor get run’. I go over this in http://blogs.msdn.com/davidnotario/archive/2005/02/08/369593.aspx . If you really want a singleton that only gets created on first access, you want the Singleton type to have precise init. You would do it this way:

     

     

    sealed class Singleton
    {
        // An explicit static constructor removes beforefieldinit and gives precise init
        static Singleton()
        {
        }

        private Singleton()
        {
                Console.WriteLine("Singleton initialized");
        }

        public void Method()
        {
                Console.WriteLine("Method");
        }

        public static readonly Singleton Instance = new Singleton();
    }

     

    Now, note this behavior may come at the expense of runtime checks (that ask the 'did my .cctor run already?' question), whereas with beforefieldinit semantics, the .cctor gets run aggressively by the JIT (at compile time) and no checks need to be inserted (so maybe you want 2 singleton classes, Singleton and ExpensiveResourceSingleton, depending on your needs).
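
    As an aside for readers who think in C++ terms, the closest analogy to precise init is a function-local static: it is constructed on first call, and the compiler pays for that guarantee with a hidden 'already initialized?' check on every call. This is only an analogy, not how the CLR implements .cctors, and the names are made up:

```cpp
// Counts constructions so the laziness is observable.
inline int& InitCount() {
    static int count = 0;
    return count;
}

class Singleton {
public:
    // The instance is constructed the first time Instance() runs, and the
    // compiler guards that construction with a runtime check on each call.
    static Singleton& Instance() {
        static Singleton instance;
        return instance;
    }

    Singleton(const Singleton&) = delete;
    Singleton& operator=(const Singleton&) = delete;

private:
    Singleton() { ++InitCount(); }
};
```

    Until something actually calls Singleton::Instance(), InitCount() stays at 0, which is the 'created on first access' behavior the precise init version is after.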

  • Interested in working in the CLR?

    Would you like to join the CLR's development team?  We have 2 job openings for the Rotor project, the shared source CLR that brings managed code to BSD, Mac and Windows. You can find more about Rotor here:

    http://msdn.microsoft.com/msdnmag/issues/02/07/SharedSourceCLI/

    For additional information on the positions or submitting your resume, check out the following URLs

    http://members.microsoft.com/careers/search/details.aspx?JobID=be2065d1-0943-4ebf-9560-ada9673a6cd6

    http://members.microsoft.com/careers/search/details.aspx?JobID=b02cafac-c95a-46e1-b110-e3faef0c1db2

    We're looking forward to hearing from you!!!

     

    Other positions in the CLR team can be found in:

    http://members.microsoft.com/careers/search/results.aspx?FromCP=Y&JobCategoryCodeID=&JobLocationCodeID=&JobProductCodeID=10166&JobTitleCodeID=&Divisions=&TargetLevels=&Keywords=&JobCode=&ManagerAlias=&Interval=10

     

  • What is mscorsvw.exe and why is it eating up my CPU? What is this new CLR Optimization Service?

    Short version:

    mscorsvw.exe is precompiling .NET assemblies in the background. Once it's done, it will go away. Typically, after you install the .NET Redist, it will be done with the high priority assemblies in 5 to 10 minutes and then will wait until your computer is idle to process the low priority assemblies. Once it does that it will shut down and you won't see mscorsvw.exe. One important thing is that while you may see 100% CPU usage, the compilation happens in a process with low priority, so it tries not to steal the CPU from other stuff you are doing. Once everything is compiled, assemblies will be able to share pages across different processes and warm start up will typically be much faster, so we're not throwing away your cycles.

    If you really want to get rid of mscorsvw.exe from your task manager, just do:

    ngen.exe executequeueditems

    which will drain all the queued up work.

    Long version:

    If you wonder why I haven't been posting much to my blog, mscorsvw.exe is the reason ;). mscorsvw.exe doubles up as both the CLR Optimization Service (or NGEN Service, as we know it internally) and as the NGEN worker. If you don't know what NGEN is you may start by reading Reid Wilkes' excellent article about NGEN in Whidbey: http://msdn.microsoft.com/msdnmag/issues/05/04/NGen/default.aspx . Reid is one of the members of the team that has been working in NGEN and the Optimization Service.

    What problem is the CLR Optimization Service trying to solve?

    One problem with precompiled assemblies (I will refer to them as NGEN images) is that they are very dependent on the VM and on other assemblies. For example, they contain Method Tables in binary form, know which CLR helpers to call to do various tasks, may inline code from other assemblies, etc...

    The downside, of course, is that whenever the runtime (for example via a Service Pack) or one of the dependencies of your NGEN images changes (version upgrade or patch), your NGEN image becomes invalid. In v1.0 and v1.1 of the CLR we weren't really encouraging people to use NGEN, and this was one of the reasons: nobody recompiled the assemblies if they became invalid, so anybody who NGENed an assembly had to NGEN it again after servicing.

    Due to some changes in our long term schedule, we decided after shipping Beta 1 that we had to have a better story in Whidbey for this problem, hence the CLR Optimization Service was born. The CLR Optimization Service will keep track of dependencies of assemblies and will recompile them as necessary. For example, when the .NET Runtime gets serviced, the SP installer will tell the service that there have been changes in the runtime and that it should start recompiling assemblies. Of course, if you have 1000 managed applications installed, that would mean it would take some time to complete the installation of the SP, so instead of waiting to recompile the world, it will compile assemblies in the background.
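
    To make the dependency tracking idea a bit more concrete, here is a toy model in C++. It is only a sketch under my own assumptions: the names (NativeImage, IsStale, FindWorkAfterServicing) and the use of plain version strings are hypothetical and are not how the CLR actually stores this information. It just shows the shape of the check: remember what each image was compiled against, and after servicing, queue up every image whose recorded versions no longer match what is installed.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model only: hypothetical names, not the CLR's real data structures.
struct NativeImage {
    std::string assembly;
    // Version of the runtime and of each dependency recorded at compile time.
    std::map<std::string, std::string> compiledAgainst;
};

// An image is stale if any recorded dependency is missing
// or is now installed with a different version.
bool IsStale(const NativeImage& image,
             const std::map<std::string, std::string>& installed) {
    for (const auto& entry : image.compiledAgainst) {
        auto it = installed.find(entry.first);
        if (it == installed.end() || it->second != entry.second) {
            return true;
        }
    }
    return false;
}

// Returns the assemblies whose images need to be recompiled after servicing.
std::vector<std::string> FindWorkAfterServicing(
    const std::vector<NativeImage>& images,
    const std::map<std::string, std::string>& installed) {
    std::vector<std::string> work;
    for (const auto& image : images) {
        if (IsStale(image, installed)) {
            work.push_back(image.assembly);
        }
    }
    return work;
}
```

    In this model, patching a single shared dependency only produces work for the images that recorded that dependency, which is why the recompilation can be queued and done in the background instead of blocking the installer.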

    How does the CLR Optimization Service work?

    Reid's article already covers most of this, so I won't go into much detail here.

    There are two ways of waking up the service:

    - Notifying the service there have been changes that may require recompilation (via ngen.exe update): The service will start up and figure out what needs to be recompiled and will queue up the work that is necessary.

    - Queueing up new work (via ngen.exe install /queue): will add new work to the queue of assemblies that need to get compiled.

    When does the service do work?

    Work is prioritized. There are two big buckets of work:

    - High priority work: Service starts work immediately. Compilation happens in a low priority process to try to minimize impact on other apps that are running. It can still have an impact on machine performance, though, especially if the machine is doing a lot of I/O work or is very low on memory. Currently, priorities 1 and 2 are high priority, but this is subject to change.

    - Low priority work: Service starts but waits until the machine is idle. Our definition of idle is something that can change in the future, but I can tell you it looks at things like the last time the user had input, if the machine is running on batteries, screen savers, etc... Once it decides the machine is idle, it will go ahead and do the work that's left.

    Once all work is done, the service will shut itself down and won't come back until somebody wakes it up.
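
    The policy above can be sketched as a toy model in C++. Again, this is only an illustration under my own assumptions, not the service's real implementation: the class and method names are hypothetical, idle detection is reduced to a boolean the caller passes in, and priority 3 is assumed to be the low priority bucket.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy model only: hypothetical names, not the service's real implementation.
struct WorkItem {
    std::string assembly;
    int priority;  // 1 and 2 are treated as high priority, 3 as low (assumption).
};

class OptimizationQueue {
public:
    void Enqueue(const WorkItem& item) {
        assert(item.priority >= 1 && item.priority <= 3);
        (item.priority <= 2 ? high_ : low_).push_back(item);
    }

    // Picks the next item to compile. Returns false if nothing should run now.
    // High priority work always runs; low priority work runs only when idle.
    bool TryDequeue(bool machineIsIdle, WorkItem& next) {
        std::deque<WorkItem>* source = nullptr;
        if (!high_.empty()) {
            source = &high_;
        } else if (machineIsIdle && !low_.empty()) {
            source = &low_;
        }
        if (source == nullptr) {
            return false;
        }
        next = source->front();
        source->pop_front();
        return true;
    }

    // The service shuts itself down once both buckets are empty.
    bool CanShutDown() const { return high_.empty() && low_.empty(); }

private:
    std::deque<WorkItem> high_;
    std::deque<WorkItem> low_;
};
```

    With a busy machine (machineIsIdle set to false), TryDequeue keeps handing out high priority items and then returns false while low priority items are still pending, which corresponds to the state where the service is running but just waiting for the machine to become idle.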

    What is the typical lifetime of the CLR Optimization Service?

    Let's look at the .NET Redist as an example. The .NET Redist does some compilation during setup and also queues up both high and low priority work items. Typically, when the installation completes, you will see 2 instances of mscorsvw.exe running. One is the service itself and the other is the compilation worker process, which is actually doing all the interesting work. If you are curious about which one is which, as a rule of thumb, the one using more memory and CPU is the worker. If you want to know for sure, you can run 'tasklist /svc', which will show you which mscorsvw.exe is hosting the service.

    After a while, all the high priority work will get done and you will see only one mscorsvw.exe running. The service is just waiting for the machine to be idle. During this time, the service is not doing anything interesting, so it won't consume CPU and shouldn't be using much memory. Once the computer is idle (typically when you take a break or go for lunch), the service will start up new workers to complete the remaining work.

    Once all work is done, the service will set itself to manual start (meaning it won't start when you restart your computer) and will shut down.

    Should I use the CLR Optimization Service in my installer?

    The guideline I would give is the following:

    Use /queue in your installer if all of the following are true: you have assemblies that you would like NGENed, having them compiled is not necessary for you to meet your performance goals (with /queue there is no guarantee of when the compilation will actually happen), and install time is very important to you. Rationale: high priority work can impact the user experience, and with low priority work you will at least have a service hanging around. Also, idle detection is not perfect (and each person has their own definition of idle), so the best way of having no negative impact on your customer's machine is not having any pending work.

    What about 64 bit?

    The 64 bit .NET Redist actually installs 2 versions of the runtime side by side, a 32 bit one and a 64 bit one. You get 2 services, but they cooperate to make sure that they are not both working at the same time.

    Why don't you take advantage of my multiproc machine?

    Because we don't want to degrade machine performance. The CLR Optimization Service tries hard to do its work in the background, and having 2, 4 or 8 compilations happen at the same time would multiply by 2, 4 or 8 the memory, CPU and I/O resources we need.

    How do I make mscorsvw.exe go away?

    I don't want to wait until my computer is idle, compile everything now!!!

    ngen.exe executequeueditems

    will process all pending work. Once this is done, the service will shut down, as it has nothing else to do.

    How do I know what is compiled or if my deferred compilation failed?

    Look in the Application Event Log: we log there when we start and finish compiling assemblies, along with any errors that occur. We are working on improving the logging for the RTM version.

    ngen.exe display

    will also give you some status, but it is not guaranteed to be completely up to date regarding pending work.

    Feedback

    We've worked really hard to get this service done for Beta 2. One of the main reasons for doing this is getting feedback from our customers, so let us know your opinions and concerns, and if you see anything suspicious or that looks like a bug, let us know about it: http://lab.msdn.microsoft.com/productfeedback/ (we keep better track of bugs and suggestions that come from there, so while you're free to use the blog for that, I recommend doing it via the feedback site).

    I'm posting this now because I have already seen questions in the newsgroups. Let me know if you have any other questions. I'll also be updating this post as things come up.

    [Update Jun/17/2005]
    After investigating some machines that were having issues with mscorsvw.exe, we've found a problem that occurs when you have 2 builds of the CLR installed (typically a Whidbey Beta 2 and a CTP). This configuration was never supported (we don't support side by side builds of Whidbey) and other things are broken in it as well. The underlying cause of the problem is that we're not COM compatible between builds (only between major versions, for example: Everett and Whidbey's COM interfaces are compatible) and setup is not updating type libraries correctly in this scenario. I asked our Setup team to address this issue (by telling the user that side by side builds are not a supported scenario or by disallowing the installation) and it will be addressed in the future (due to schedule limitations, it's not possible for them to do this work in the Whidbey timeframe). We have also done work to ensure that even if we run into this scenario, we terminate gracefully.

    Typical symptoms of this problem include mscorsvw.exe permanently taking 100% CPU. If you try doing 'ngen.exe display' (using the Whidbey ngen.exe) and get an E_NOINTERFACE error, you are hitting this issue. I recommend cleaning your machine (as there may be other CLR stuff that is broken), but as a workaround you can go to the services panel (services.msc) and disable the service, or do it via the command line:

    Disabling NGEN Service for Beta 2 build (for other builds, substitute 50215 with the number of your build)

     

          sc.exe stop clr_optimization_v2.0.50215_32

          sc.exe config clr_optimization_v2.0.50215_32 start= disabled

          sc.exe stop clr_optimization_v2.0.50215_64

          sc.exe config clr_optimization_v2.0.50215_64 start= disabled

    I apologize for any trouble this may have caused.

    [Update Sep/19/2005]
    Some folks have also reported mscorsvw.exe problems in Vista Beta 1. There are 2 problems people are seeing here:

    - Same as the problem described above, but caused by installing Beta, CTP or RC builds on top of Vista. Vista has already installed a Whidbey build, so if you install another CLR build on top of it, you may break the Whidbey build in LH (or vice versa). Once