Vance Morrison's Weblog

Vance Morrison is currently an Architect on the .NET Runtime Team, specializing in performance issues with the runtime or managed code in general.

  • What's Coming in .NET Runtime Performance in Version V3.5 SP1

    What's Coming in .NET Runtime Performance in Version V3.5 SP1

    It certainly has been a while since I last blogged.   Most of this is laziness on my part, but I can truthfully say that it is partly because I have been busy trying to get the next servicing release of the .NET framework (called Version 3.5 Service Pack 1) out the door.    Part of the framework (runtime + libraries) servicing is an updated version of the runtime DLL (which we call the CLR).    

    How do you know if you have an updated CLR?   If the file

    • c:\windows\microsoft.net\Framework\V2.0.50727\mscorwks.dll

    has a version number greater than 50727.3000 then you have the changes I will be talking about.   Because this is just a servicing of the runtime, we had to confine ourselves to changes that we felt had a very low chance of breaking existing applications.   Nevertheless we were able to add substantial performance value.   There are two changes I would like to talk about in particular.

    • Improved Cold Startup performance

    Cold startup time is the time it takes an application to start when all the data needed to run has to be fetched from disk (rather than from the operating system disk cache).  The second time an application starts (Warm startup) is substantially faster because very few (slow) disk reads are needed to get the application running.   It is not unusual for the warm startup time to be good (< 1 sec), but for the cold startup time to be quite bad (5-10 sec).  In the new runtime we go to a lot of effort to pack all the code and runtime structures (for now only in DLLS associated with the framework itself), to reduce the amount of Disk I/O needed.   While the improvements benefit all code that uses the framework, it helps the most on code that has been precompiled using NGEN.exe and the effect is more pronounced on larger applications. 

    • Inlining of value types (C# structs).  (X86) 

    C# structs are what the .NET runtime calls value types.  They are called this because when you have fields or arrays of such data structures, the value is embedded directly in the field (not referenced through a pointer).    All the primitive types (int, char, bool) are value types, as well as a few types defined in the base class library like DateTime, Decimal, Point and Rectangle.    The previous version of the code generator (for X86) did almost no optimization on value types.   This was unfortunate, because although value types are not common (most types are classes, not structs), when they are used, they can be used heavily, and so optimization is important.  In fact the biggest single piece of feedback we got on our feedback site related to performance was concerning more agressive inlining of value type methods.    The runtime’s 64 bit code generators could already optimize value types, but the code generator for X86 could not.   Since X86 is still the dominate platform there was an unmet need.    In this servicing we included the work we did to enable this for X86.    Your own code often may not benefit from this improvement (because it does not use value classes much), but if you do use them, you tend to use them heavily, and the value could be substantial. 

    64 Bit machines: X64

    Much of the cold startup work was rearranging the data in precompiled framework executables to pack heavily used items together, and as such is applies equally well for 64 bit platforms like X64.   However there are places where 64 bit machines are quite different (how exceptions are delivered by the operating system) and we did not invest heavily in tuning (packing) these places.  The result is that cold startup wins will not be as dramatic on 64 bit machines.   There were improvements to the 64 bit code generator, but since this code generator could already inline value types, we did not need to do work in that specific area.

    64 Bit machines: Intel Itanium

    Most users of the .NET runtime run it on the X86 processor, however beginning with Version 2.0 of the runtime also supports the Intel Itanium architecture.   In fact, internally the Itanium version of the runtime came on line well before the X64 version did.   The Itanium processor tends to be used in high end server applications, where scalability is a great concern.   In the next major release of the Framework, we are continuing to investigate and improve our performance on the Itanium architecture.   Microsoft is a charter member of the Itanium Solutions Alliance (ISA) and we are excited to work with the other members of ISA to further enhance the performance of the .NET Framework on Intel Itanium Architecture.  Check out the CLR Performance Team’s blog on investigating Itanium perf  for more on this work. 

     

  • Writing approachable code: Introducing the hyperaddin for Visual Studio!

    A few years ago now, several of us on the .NET Runtime team where lamenting how unapproachable the code base was for new developers to the team.    We agreed that more commenting would certainly help the situation, but that alone was not enough, because you also need to FIND the comments WHEN they are relevent for the comments to actually be useful.   Typically the most useful comments are not about a particular line of code, but information about the purpose of a whole method, class, or subsystem, and what the global invarients are.   Finding these very useful comments is often not a trivial exercise. 

    It occured to us at the time that this problem is not so different from finding relevent data on internet, and that having hyperlinks would help alot to 'channel' people to important relevant comments.  This was the genesis of the HyperAddin.   The idea is that Visual Studio already supported having hyperlinks in comments (any strings starting with http:// becomes a link that you can Ctrl-left click on).    The idea is to add a new string called code:identifier  which turns into a hyperlink that looks up the definition to identifier.  Now it is possible for one part of the code base to refer to another by symbolic name.   

    Taking our cue from HTML, we added the capability of sub-file anchors.  Any comment of the form

    • // #SomeIdentifier

    Defines an anchor in that file called SomeIdentifier.  You can then refer to it by the hyperlink code:identifier#tag.  Like HTML, this is a hyperlink to code:identifier (which gets you to the correct file) followed by a search on that 'page' (file) for 'tag'.  The result is that it is now trivial for one comment to refer to another.  

    With this simple change, it is now possible to write a very nice overview comment and refer to it in dozens of places.  Having used it for several months now, I find it very constraining on those occasions that I am not writing code in visual studio, and thus do not have my code hyperlinks.    In retrospect I am amazed that someone did not think of this earlier (which is a hallmark of a good idea).

    There is more to say about what the hyperaddin can do for you (stay tuned), but this blog entry is mostly about getting you to try it.   Just today I published the addin on codePlex as HyperAddin.  Please click on the link and follow the installation instructions (it is an XCOPY deployment).    If you use Visual Studio, give it a try right now.

    If you like the capabilities of HyperAddin and wish that Visual Studio had this feature, you can vote on this here.  To vote, click on the 'Sign in to Rate' (and register if necessary), and then click on the number of stars in the rating box that corresponds to the value of this feature.   The more people who vote, the more likely this feature is to get into Visual Studio.

  • The windows prefetcher

    In my last blog I talked about some of the conditions than need to hold for the cold startup formula

  • ColdStartupTimeMSec = WarmStartupTimeMsec + 4 * NumberOfReads + 20 * NumberMBytes
  • To be accurate.  I mentioned that if you have overlap between the CPU and the disk then it may not be accurate (although it would be an upper bound).   It turns out that there is a feature of the OS that will actually makes the cold startup formula work all the time, even in those cases I call out above where it might not work.  That feature is the OS prefetcher and is in windows XP and beyond.

    If you look at the I/O caused by running a program at cold start, you will find that the disk tends to be accessed randomly.   You need a page here, then a page in some other part of the file, as the CPU executes code from all over the place.  This causes literally thousands of individual Disk accesses, and as we have seen, each disk seek costs us 4msec, so seek time dominates, and adds up to seconds of startup time. 

    The observation is that this 'lazy loading' of data from the disk is actually very inefficient because it loads the data in small pages and each page costs something like 4msec to move the disk head.  It would be much better if you could read the data in big chunks, even if you end up reading in data that you did not end up using. 

    The prefetcher fixes uses this observation to speed up cold startup. On EVERY program load, the OS keeps track of what data that program needed to be paged in (even if they were not 'hard' page faults).   This data is persisted in binary form in a  *.PF file in c:\windows\prefetch (go look at that directory).    When a process is created, the first thing that is done is that this *.PF file is consulted.   It then issues ALL the file I/O specified in this file, at once and blocks the program until it competes.   These I/O requests are nicely sorted and combined.  If several I/O are adjacent (or even close to one another), a large combined I/O is issued instead (which avoids a seek), the I/O are also nicely ordered so that the disk head moves the smallest possible distance between reads.  This makes the disk I/O as efficient as it can be, and because disk I/O is the slow part of cold startup time, it leads to dramatic improvements in cold startup time (often the prefetcher cuts cold startup time in half). 

    Note that the prefecher does not allow the program to begin until all the I/O is complete.  I am speculating they did this because they did not want random I/Os from the executing program to interfere with the beautify optimized disk head movement.  However this has a unfortunate consequence.  What it means is that things like splash screens (which get put up early to alert the user that something is indeed happening), do not get up as quickly as you would like (because NO code in the process runs until ALL the I/O completes).  For this reason, the prefetcher has a limit on the amount of data it will prefetch (on XP this is 16MB).  This keeps the maximum pause caused by the prefetcher to a tolerable amount (By our formula above 16 MB * 20 msec /MB = 320msec + probably about 100 seeks or so which is still < 1 sec). 

    One side effect of the prefetcher is that for programs under its data size limit, the there can be NO overlap between CPU and Disk I/O during startup (the program is not allowed to run during the initial priming phase).   So this is yet another reason that tends to keep the formula accurate.

     Finally I just wanted to close by noting that the formula's usefulness is not so much that it is accurate in all cases, but rather that it is accurate enough and very simple, and thus can be used to reason about what can be done to improve cold startup.   The formula states that you can attack cold startup by

    • Attacking warm startup (and parallelizing CPU work is one way do doing this),
    • Attacking the amount of data read of the disk (minimizing the amount of data needed to startup is the way to do this)
    • Attacking the number of seeks (packing data so that it is not scattered about within a DLL does this)

     It attaches real values to each of these components so that for a given scenario you can know how each component is contributing and thus which is the most profitable component to attack.   This is its real value.

  • Assumptions used in the cold startup formula (when is it accurate).

    After my last blog entry on cold startup a reader (dimkaz) worried that the formula would not be accurate in many cases. 

    This topic is worth discussing in some detail because it pretty common to apply formulas outside the assumptions implicit in them, and simply get wrong conclusions.  The most fundamental part of the formula is  

    • Cold Startup = WarmStartup + DiskIOTime

    Technically this is only accurate if WarmStarup time and DiskIOTime don't overlap in time.  For example, if DiskIOTime and WarmStartup time could be done completely concurrently then ColdStartup time would be the maximum, not the sum of the two times, which could be quite different.    So the question is how much overlap is there between the disk and the CPU? 

    Well in one very common case the answer is simple.  If the application only has one thread of execution (eg. you did not use System.Thread, or System.Threadpool in your startup path), and you don't do any significant explicit file I/O in your application on startup, then the times will be additive (the formula above is correct).  In this very common scenario, the vast majority of disk time is time spent fetching the program itself from the disk.  The operating system does this by doing what is called 'demand paging'.  The idea is that in one operation (the API is called LoadLibrary), the entire file is read into memory, however all the data is not actually read in at that time, but is done 'lazily' on demand.  Basically whenever the CPU tries to access memory that was supposed to be read in, but is not actually been read in, the OS takes a 'hard page fault', stops what it is doing and fetches the needed page from disk (on x86 this happens pages are 4096 bytes). 

    The upshot of this is that during cold startup the CPU is constantly trying to access some new part of the program which needs to be paged in (it takes a hard page fault).  Each time it has to stop, and get data from the disk.  Since it really needs that data (typically the next instruction to executed) to proceed, so it simply has to wait until the I/O is done.  Thus there is no overlap between the CPU and disk I/O for that thread of execution.  Hard page faults typically take 4-8msec to service, which translate into 8 to 16 million instructions on a 2Ghz machine (so a hard page fault is quite significant), which h is why cold startup times can easily be many seconds long. 

    It is possible that if there were multiple threads of execution during startup, that while one thread is waiting for I/O to complete it can be doing something useful on another thread (thus CPU and I/O can overlap), so this may invalidate the formula above, however as we will see, the even in this case the windows prefecher keeps the formula working. 

    Even single threaded programs might be able to overlap I/O and CPU if they do a lot of explicit, sequential file reads during startup.  The reason is that when a Stream.Read is executed the operating system knows that the program is very likely to want to continue reading even after it has return the data that is actually requested.  Thus it issues the disk requests to prefech the next chunks of data from the disk and allows the disk to fetch them while the program operates on the last chunk that it was given.  If the program does a lot of computation and explicit sequential reads, then the CPU tends to completely overlap the I/O and the formula is wrong (it is too pessimistic).   This can happen for example in compilers, where allot of explicit reads as done as well as lot of computation (although even there, the bulk of the computation happens later, after all the data is read in, and thus does not overlap with I/O).

    Finally there is one more interesting case of overlap.  In the formula above all disk I/O is assumed to not overlap with itself.  This is correct if there is only one disk involved, but if multiple disks are involed obviously overlap is likely to happen.   Thus the formula is for the 1-disk case and is only an upper bound for the mulit-disk case.   In practice, few applications have control over the hardware configuration of the system, and in fact most client machines have only one disk, so the formula covers what is likely to be the most interesting case, but the assumption should be called out.

    The upshot is that the formula is valid for programs single threaded programs that don't do allot of sequential file I/O.  It is an upper bound for all programs, and frankly, tends to be pretty accurate even in those cases, because events tend to conspire to prevent CPU and I/O from overlapping effectively.

    In my next blog I will talk about the windows prefetcher, which is some very cool technology that has been in the OS since windows XP, and happens to make the cold startup formula accurate for pretty much any program. 

  • A Model for cold startup time of an application on Windows.

    Well it is has been a while.    I see now that it is been getting close to 1/2 a year since I last posted.   Sigh.  It is so easy for other things to get busy and not to blog.   I have resolved to try to be a bit more methodical about and insure that I write something every week or so. 

     Over the past year I have been doing quite a bit of work with the Common Language runtime trying to improve its cold startup time.   This has been a common complaint, and we knew when we shipped version 2.0 of the runtime that there was a lot we could do.   Unfortunately, it may be a while before the wider world see the benefit of this work (our ship cycle is quite long, and the next public release of the runtime has no major changes to the 'core' part of the runtime that I work on).   Nevertheless, I learn some useful things so far in this work and I thought I would write some of them down.

     When we first seriously looked at cold startup time we had two problems

    1. Our measurements of cold startup were VERY noisy.  This makes work quite difficult. 
    2. We did not have a clear idea how to prioritize work.  We knew Disk I/O was important, but how important?  Were disk seeks or data transfer time most important?  We needed to have a good model of what contributed to cold startup time before we could come up with an attack plan and prioritize the work properly. 

    The good news is that we came up with simple model that we are pretty happy with that helped solve both of these problem that I would like to share with you. 

    First: What is cold startup time?

     Cold startup time is defined to be the time it takes an application to start up (be responsive to user input), on a machine that has just been rebooted.   The important characteristic of cold startup time is that any data on the disk (mostly the program itself), needs to be read off the disk because it is not in the operating system 's disk cache.   Thus cold startup has a very strong component of Disk I/O time associated with this data fetch.  In contrast, warm startup time is defined as application startup time on second launch.   Under normal circumstances warm startup has almost NO disk I/O because literally everything the program needs is in cache, and thus the program never needs to go to disk to get anything. 

     This leads to a very nice formula

    • ColdStartup = WarmStartup + DiskIOTime

    This formula is accurate to a very good approximation.   Moreover DiskIOTime can be broken into two parts:  The time it takes to move the disk head and for the disk to rotate so that the correct data is under the head (Seek Time), and the time it takes to actually transfer the data.   Thus

    • DiskIOTime = DiskSeekTime + DiskDataTransferTime.

    These quantities can also be modeled well.  Disk Seek time does depend on how far the disk head needs to travel, however disks are now quite large (100s of Gig), and the Operating system does try to keep files that are used together close together on the disk (More on that on a later blog if there is interest), so the amount of head movement is typically very small as a % of the total possible disk travel.   Moreover, whenever there is ANY movement of the disk head, it takes a significant amount of time for the head to 'settle' so that an accurate read can happen.  Thus to a good approximation, the seek time is independent of how far the disk head moves, and thus can be accurate modeled by a constant.   Some empirical data gathering shows that for the machines that we typically use this time was about 4msec per movement.  Thus

    •  DiskSeekTimeMsec = NumberOfSeeks * 4

    The DiskTransferTime can also be accurately be modeled.  It is completely determined by the rotational speed of the disk, the number of read heads, and the density of the bits on the disk (all of which are constant for a given disk).   Typical disks that we use give about 50MBytes / sec transfer speeds.  This means that it takes 1/50 of a second (20 msec) to transfer 1MByte of data.   Thus

    • DiskTransferTimeMsec = NumberMBytes * 20

    Putting this all together gives the formula

    • ColdStartupTimeMSec = WarmStartupTimeMsec + 4 * NumberOfReads + 20 * NumberMBytes

    This formula shows that you can attach cold startup one of three ways:  Either reduce the warm startup time (computations done at startup by the app), the number of data read from the disk (typically dominated by the amount of UNIQUE code used at startup), or the number of disk reads (typically by packing the code used in the startup path so that it is all together and can be read from the disk in one read). 

    Now the formula does depend on disk technology, but typically will not vary in huge ways (by more than a factor of 2) from one disk to another (assuming a certain quality in your hardware).    In fact one of the nice benefits of the formula is that it is MACHINE INDEPENDENT (assuming the warm startup time is hardware independent (at least approximately)).  Thus measurements on one hardware can be compared to that on different hardware (you can't do that with time).   Effectively the formula calculates the cold startup time on a theoretical piece of hardware (that happens to have the specs of the machine I used to derive the 4 msec and 20 msec/MB numbers. 

    This solved both of our original problems (noise and prioritizing dev work).  The main reason cold startup is noisy is that the disk is being shared by ALL apps in the system, and thus their activity can perturb the time.  However the operating system already tracks the metrics above (number of read I/O and the number of I/O bytes), and breaks it down PER PROCESS, which means these I/O measurements have NO NOISE associated with other processes activity.  This is GREAT and reduces the noise SUBSTANTIALLY.    It solved the prioritizing problem because it tells us the first level of breakdown of what contributes to cold startup and thus we know which one more dominate for a given scenario (typically none of the terms can be ignored), and thus where we can get the best benefit (for a given cost).

    Next time I will talk about how these numbers play out in a real scenario, and where that led us to work prioritization. 

     Its good to be back!

    Vance

     

  • Intro to Programming Excercise: Bouncing Balls

    This article is the beginning of a new 'Introduction to Programming' topic for my blog.  The BouncingBalls example.

    Basically, I have decided to try to teach programming to a set of teenagers from my son's junior high school, and this topic will be a set of instruction materials for that 'course'. 

    The basic premise is simple:  I think that the .NET Runtime is a great platform for teaching kids about programming, and I personally want to see my son and his classmates get the chance to do some 'hands-on' programming.   

    The good news is that the .NET Runtime is free for any users of the Windows operating system.  The Windows VISTA has it preinstalled.   If the directory c:\windows\Microsoft.NET\Framework\v2.0.50727 exists you already have this software on your system.  If you don't you can get the .NET Runtime is part of the 'optional downloads' from windows update.  Click on 'custom' and look for the option to download '.NET Framework 2.0'.   You can also get it from the .NET downloads site.  

    One very nice characteristic of the runtime is that it comes with EVERYTHING you need to build and run a really LARGE set of applications.   We will take advantage of that here.   To help motivate my teen-age students, my programming examples will be very visual.  We will be writing various simple game programs (and leaning some programming along the way).   I will be teaching them the C# language.

    Note that actually writing on-line content that goes with the examples here is probably too much for me to do.  It takes a lot of time to write a really good explanation.   However the code samples I will be posting are fun and useful.   For those who know something about programming, there is a good chance you can pick up and read the code I post and learn something new about System.Windows.Forms and GUI programming. 

    The first example program is my 'Bouncing Balls' example.  This program is a physics simulation of some balls bouncing around a box.  Balls can bounce off the walls and each other.   This is just an example of what can be done.  Within 10 minutes of staring the project I had something interesting on the screen (a ball bouncing off a wall),   It took me another 3 evenings to get the code to add all the bells and whistles (collisions with other balls, the buttons, color, faces for the balls, etc).   We will be starting with some simpler 'ramp up' programs shortly, but I wanted to show something compelling that people could play with immediately

    The attachment is a ZIP archive that holds ALL the files needed to create the program from scratch.   There is NOTHING missing.  To unpack the archive, first double click on it, and it will open, showing a directory called 'BouncingBalls'.   Copy this directory to your local hard drive by opening dragging it to where you want to put it (e.g., you can just drag it to the desktop if you like, or put it in your 'My Documents' folder).  Once the BouncingBalls directory is on your machine, you can open it up, and you will find all the files.  

    The 'heart' of the program is in the file 'MainForm.cs', in the 'BouncingBalls' subdirectory.  Ideally, you should have an editor designed for looking at code to look at MainForm (e.g. Visual Studio), but if you don't have one, you can look at it using the 'notepad' editor.  There are two main routines to the program.  The first is the 'timer_Tick' method that updates all the balls on the screen every 20 msec, and the other is the 'MainForm_Paint' routine that is called any time the window needs to be redrawn.    Hopefully this is enough of a starting point of people who have seen C# before to get their bearings. 

    To actually compile and run the program, you need only double click on the 'BuildAndRun.bat' file in the BouncingBalls directory.   Note that after the program has been built into an EXE you don't need to do it again, you can launch the program by simply double clicking on the EXE that was built (it gets built into BouncingBalls\bin\Release\BouncingBalls.exe).   Only if you change the code do you need to rebuild.   You can also copy the EXE anywhere you like and run it from there (for example you can drag it to your desktop so you can launch it easily). 

    Give it a try.   Play with the GUI.  After you have played with it a bit see if you can do the following changes

    • In mainform.cs you will find the line 'timer.Interval = 20;' This controls how often updates happen.  Update this so that updates happen only half as often as before, and then recompile and run the program (double click on BuildAndRun.bat).   See what happens. 
    • In the same routine you will find code for the initial number of balls.  What is the current default?  Change it to 20.  Se what happens.
    • In the 'DrawFace' method you will find a variable called 'eyeRadius' which determines the size of the eyes.  Make the eyes twice as big as they were originally.  See what happens.   
    • The colors for the balls are determined by an array called 'colors'.  Find this array, and remove any colors you don't like. 

    Next time we will go through step-by-step building up to making this application.

  • Drilling into .NET Runtime microbenchmarks: 'typeof' optimizations.

    In my last blog entry showed how to use a simple simle class called MultiSampleCodeTimer to measure the performance (time), of pretty much any fragment of CLR code quickly and easily.  That entry had a solution packaged up as a zip file TypePerfMeasurement.zip that you could use to duplicate the experiment yourself.   However I did not have time to show you what is particularly interesting about those paricular performance measurements (I pciked them for a reason).  Here I wish to do complete the explaination. 

    When I unpack and run TypePerfMeasurement.zip (please do it yourself, detailed instructions at the end of my last post), I get the following output. 

    Data units of msec resolution = 0.279365 usec
    10 typeof(string)                       : count: 10000    7.677 +- 3%    msec
    10 typeof(string).TypeHandle            : count: 10000    0.000 +- >400% msec
    10 anObj.GetType() == type              : count: 10000    7.393 +- 41%   msec
    10 Type.GetTypeHandle(obj).Equals(tHnd) : count: 10000    4.554 +- 9%    msec
    10 anObj.GetType() == typeof(string)    : count: 10000    0.103 +- 7%    msec
    10 (anObj is string)                    : count: 10000    0.594 +- 12%   msec
    

    Please take the time to run it yourself.  You don't need Visual studio do so (see readme.txt file, it is as easy as running 'buildit, and running the resulting exe).  Note that if you are running under visual studio, and you get very differrent numbers based on whether you run it under VS or outside VS, you did not set your VS settings as I describe in this post.  Go and do that now.  Note that your numbers are going to be different than mine, but the relative sizes between the various rows should be reasonably close.

    What do these numbers mean?   We need some background first.   The .NET Runtime defineds a type called System.Type, which is a representation of a type in the system.  EVERY class that has been defined in the system has a cooresponding System.Type.  The System.Type object is the gateway to exploring at runtime all the data that the users typed when he defined the type.  You can get superclasses, methods, properties fields, etc by calling the appropriate operations on System.Type.   This ability to inspect and invoke characteristics of the source code is called Reflection, and System.Type is your gateway into this capability. 

    While the System.Type class is very powerful, that power comes at a cost.  Also most programs doe NOT need this power and should NOT use it (I need to blog about that too) becuase of the perf cost.   Thus the .NET runtime has tried to provide alternatives to using System.Type for some common operations that CAN be implemented efficiently.

    One such operation is testing if an object is of a EXACTLY a certain type.  While the 'is' operator in c# can often be used for this (and is the preferred method), however that mechanism will also match any subtype (eg ("foo" is object) returns true) and also only works for literal types (if the type was a variable of type System.Type), you could not do this.   For example

        System.Type myType = typeof(string); // in real life not a literal
        // and then in other methods (and maybe in a loop)
        if (anObj.GetType() == myType) { /* some op */ }
    

    In this example we use the 'GetType() method on System.Object to see if 'anObject' is a string.   Unfortunately both the operations used above (typeof, and GetType()), are relatively expensive (as shown in my data above 5-10ms instead of .7ms  for 'is' operator). 

    Type checking operations like the one above are common enough that the .NET runtime has added a special type called 'RuntimeTypeHandle' to make them fast.  RuntimeTypeHandle is a less user friendly but 'fast' alternative to System.Type.  It happens to be simply a managed wrapper that represents the internal pointer (to a structure called a  MethodTable), that the runtime uses internally to represent a type.  As mentioned it is not very friendly, however it can handle the scenario above, and is very fast.   The code using RuntimeTypeHandles looks like

        RuntimeTypeHandle myType = typeof(string).TypeHandle;
        // and then in other methods (and maybe in a loop)
        if (Type.GetTypeHandle(anObj).Equals(myType)) { /* some op */ }
    

    This code is significantly faster than the previous code becuse no System.Type object needed to be created, only RuntimeTypeHandles, wich is simply a value type wrapping an unmanaged poitner, and thus is very lean (The code above sadly is slower than it should be because some JIT optimizations where not done in time for V2.0, but is still significantly faster than using System.Type.   The first operation

        RuntimeTypeHandle myType = typeof(string).TypeHandle;
    

    Is blazenly fast, as shown by our measurements (it is slow small it is in the noise).  This is suprising because this line does a 'typeof(string)' and we know that that operation is relatively  slow.  Setting a breakpoint in this code shows why.  That line of code compiled down to

        00000000  mov         dword ptr [ecx+10h],790FA3E0h 
    

    Basically the JIT compiler is smart enough to realize that to fetch the RuntimeTypeHandle, you don't need to fetch the System.Type and fetch its TyepHandle, you can just lookup the value at compile time and emit code to generate the literal.  This is very fast. 

    Sadly the other part of the operation

        if (Type.GetTypeHandle(anObj).Equals(myType)) { /* some op */ }

    Should be just as fast (fetch the MethodTable poitner from the the object, and test for pointer equality), but is not due to some JIT optimizations that did not kick in appropriately (you will see that we make real calls for both GetTypeHanlde and Equals).   Even so, it is faster because we avoid creating a System.Type object.

    As data point of just how fast the code above coudl be (when the JIT is fixed), consider the final experiment in the code

        result = anObj.GetType() == typeof(string);
    

    Given what we know already, we would expect this to be slow, and yet, the measurements above show it to be FASTER than using the 'is' opertator.  How can that be?   The answer is that the JIT recognises this sequence and knows that while it seems like two System.Type object need to be created and compared, all you really want is a yes-no answer on a type question that can be answered using RuntimeTypeHandles.  It thus substitutes that code

        0000000f  cmp         dword ptr [edx],790FA3E0h 
    00000015  sete        al  
    00000018  mov         byte ptr [ecx+0Ch],al

    EDX holds 'anObj' and the method table for an object is the first field of the object, and 790fa3e0h is the method table pointer (RuntimeTypeHandle) for string.  (See previous post on using !DumpMT to determine this).   Thus in one instrution we have tested whether 'anObj' is string.  The 'sete' instruction converts the procressor condition flags set by 'cmd' into a boolena value in register AL, and the last insturction sets 'result' to this value.  This is pretty lean an mean!

    Summary:

    In this entry, we have taken a look at some code generation for doing 'type reflection'.  We measured its performance and got some anomolies (some operations were much faster than we expected.   We looked into the disassembly for those operations and determined that the JIT compiler was doing some non-trivial optimizations that made certain operations very fast (in one case faster than the 'built in' 'is' operation). 

    We have learned that a System.Type object is relatively expensive compared to RuntimeTypeHandle, and we have used techniques from the last few perf blog entries to help dig into exactly why.

    I again encourage you to experiment with the TypePerfMeasurement.zip example yourself and hone your skills in measuring and investigating .NET Runtime performance.  In my next blog entry I will be doing 'inventory' on the performance characteristics of the 'basic' operations of the runtime.

     

     

     

     

  • Measuring managed code quickly and easiliy: CodeTimers

    My performance blog entries to date have been 'foundational'.  In entries so far, I talk about how to use Visual Studio to look at the native code generated for the runtime.   With this foundation, we can now start exploring what the native code for managed code looks like and what optimzations the runtime does on your behalf.

    However before we do that, we need one more foundational piece.  You really can't do performance work, without measuring, and so we need tools to help us measure.   The simplest of these is the stopwatch.   Thankfully Version 2.0 of the rutnime provides a nice, high resolution class that does this called System.Diagnostics.Stopwatch, which is ideal for measuring short intervals accurately. 

    As nice as System.Diagnositcs.Stopwatch is, I found myself wanting more.  I was repeatingly typing code to start and stop the Stopwatch, subtracting the difference, subracting out the harness overhead, and doing statistics on the resulting numbers.    This led me to write two classes to automate this, call CodeTimer, and MultiSampleCodeTimer.   I have included the source for these in the Zip file below.  Simply open it, copy the directory to your local drive, and open the .SLN file using visual studio.   You can even compile the code without having Visual Studio.  Simply run the buildIt.bat batch file  The only prerequisite is that you need V2.0 of the .NET runtime installed (available from windows update, among other places). 

    The basic usage pattern for CodeTimer is 

        CodeTimer timer = new CodeTimer();
        timer.Measure("Measurement Name", delegate {
           // Code that you want to measure
        }); 
    

    Basically I am making use the 'anonymous delegate' feature of c# to allow the client to specify the code that he wants measured inline.  One of the nice feature of anonymous delegates is that they 'capture' any local variables that are in scope.  Thus you can write things like

       
        CodeTimer timer = new CodeTimer();
        string myString = "aString";
        string outString;
        timer.Measure("Measurement Name", delegate {
           outString = myString + myString;  // measuring concatination.  
        }); 
    

    To measure the time it takes to concatinate two strings.  What is actually going on 'under the hood' is that the C# compiler generates a class with two fields 'myString' and 'outString' (a field for every local variable referenced in the body of the delegate code). An instance of this class is created and initialized to the values of locals and this object is used to create a delegate to pass to 'Measure'.  Thus the locals are 'captured' and passed to the code that 'Measure' will ultimately call.   

    This detail is imortant if you are measuring very small operations, because field access generates different code than local variable access.   Thus in the example above, I am not just measuring string concatination, but also fetching an instance field (mystring) twice, and setting an instance field (outstring).   Generally these fetches are small, and so they don't purturb the data much, but you should be aware of them. 

    When you are measuring very short times (like the case above), you are likely to wan to run the test in a loop.  CodeTimer has built-in support for this.   I can run the example above 1000 times in a loop by doing

       
        CodeTimer timer = new CodeTimer(1000);
        string myString = "aString";
        string outString;
        timer.Measure("Measurement Name", delegate {
           outString = myString + myString;  // measuring concatination.  
        }); 
    

    Which tells 'timer' to run any measurment 1000 times.  This is great for measuring micro-operations.  

    While 'CodeTimer' is great, I want still more.  I don't want just one sample, but 10 or more, so I can take the average and standard deviation and understand how 'noisy' my data is.  That is exactly what MulitSampleCodeTimer does.   What is nice is that only the defintion of timer changes. 

       
        MultiSampleCodeTimer timer = new MultiSampleCodeTimer(10, 1000);
        string myString = "aString";
        string outString;
        timer.Measure("Measurement Name", delegate {
           outString = myString + myString;  // measuring concatination.  
        }); 

    Which asks that the code be run 1000 times per measurment and that 10 measurments be taken and used to compute min, max, mean, median and standard deviation.  The results are then printed (we are reallly starting to add value). 

    I find I use MultiSampleCodeTimer alot.   I use it in the'TypePerfMeasurement' project below to measure the speed of some reflection type constructs.   More on that in my next blog entry. 

    SUMMARY:

    In this blog entry, I have introduced to two classes 'CodeTimer' and 'MultiSampleCodeTimer' that can be used to meause the performance (time), of a wide variety of code.  You are encouraged to open, copy and build the attached sample below (you don't need VS to do it just the .NET runtime), and try it for yourself.   Measuring performance has never been easier.

    Next time we will talk about the numbers that the TypePerfMeasurement experiment actually produce and talk about why the performance is the way it is. 

    HOW TO USE THE DEMO:

    I want my blogs to be 'hands-on' which means that there should always be something you can do after reading to experiement with the concept I talked about (that is there is homework!).  To make this easy I have attached a ZIP file that is a archive of a Visual Studio Solution for code to experiement with.  Note that you do NOT Need Visual Studio to compile and run the demo, so don't let that stop you.  To unpack this simply

    1. Click on the ZIP file attachement at the bottom of the post.  This will bring up a dialog asking if you wish to open the file. 
    2. Click on the 'Open' button.   This will bring up an explore window displaying the contents of the ZIP file, which is a single directory.  If you just want to look at files, you can browse the files directly from this window. 
    3. Open a target folder on your local computer.  This can be anywhere, but if you have Visual Studio installed 'My Documents\Visual Studio 2005\Projects' is a good place
    4. Drag the directory from the ZIP file window into the target folder.  This copies the directory and all its contents to the local hard drive. 
    5. Once copied, you can close the ZIP window, and click into the target directory.  There you will find a .SLN file, which you can double click on to bring up the solution in visual studio. 
    6. If you don't have Visual Studio, you can still build the solution, there is a one line batch file called 'buildIt.bat' file that does this (it simply turns around and calls MSBuild to do its work).  You do need Version 2.0 of the .NET Framework installed.  If you don't have it you can get it from www.windowsupdate.com (custom downloads), or here
    7. All the samples build and run and do something interesting 'out of the box',  They are meant to be easy to understand, commented, samples so start looking around and experimenting (Hitting F10 to start stepping into the code is a good place to start). 
  • Digging deeper into managed code with Visual Studio: Using SOS

    I have let my blog laps for too long.    I am back to blogging.   I realized reciently that we have simply not written down many interesting facts about how the runtime actually works.  I want to fix this.   Coming up in future blogs I am going to be doing a bit of a 'architectural overview' which describe the differences between managed and unmanaged code, but before I do that I realized that I have not even finished a blog entry I started in March. 

    In my blog How to use Visual Studio to investigate code generation questions in managed code, I talk about the how to configure Visual Studio so that you can actually look at optimized code in the debugger (which sadly is not as trivial as you would like), and showed how to look at the disassembly of managed code.    Unfortunately manage code is hard to read without a guide, and so in this blog I will show you some very useful tips for reading managed assembly code. 

    In this blog entry I will show you the instructions ACTUALLY need to get executed to do something as simple as assigning a string to field of a class.   Note that I am assuming a familiarity with X86 assembly code.   If you are the type who never wants to read assembly code, you should stop reading now, because most of this blog is a step-by-step explanation of it. 

    I have attached the file InspectingManageCode.zip, which contains a (trivial), project that I used for this example.  You are STRONLY encouraged to open it (you can browse it the main file is Program.cs).  Copy the files (simply drag the 'InspectingManagedCode directory inside the ZIP to a directory of your choosing), launch the InspecingManagedCode.sln file and run the example.   While the project is already set to build and run optimized code, you will still need to turn off ‘just my code’ and turn on JIT optimization as described in my previous blog to follow along.    

    The code in the attached example is pretty trivial. 

    class Program

    {

        string myString;

     

        private Program()

        {

            myString = "foo";

        }

     

        static void Main(string[] args)

        {

            Program p = new Program();

        }

    }

    If you were to follow the instructions in the previous blog to see what code was generated for the body of ‘Main’ you would find the following code. 

    00000000  push        esi 

    00000001  mov         ecx,9181F4h

    00000006  call        FFCB1264

    0000000b  mov         esi,eax

    0000000d  mov         eax,dword ptr ds:[0227307Ch]

    00000013  lea         edx,[esi+4]

    00000016  call        79222B78

    0000001b  pop         esi 

    0000001c  ret            

    At first glance this code has little similarity to the source code: the original source has a call the constructor ‘Program’ and the assembly code has two calls to strange hex addresses.  There are also references to magical numbers like 9181F4H and 0227307CH.   In this case the disassembly has not proven to be very valuable.    What can we do?    Sadly if we try to peer into these CALL instructions we cannot, the debugger comes back with the very unhelpful message ‘There is no code at the specified location’.  Actually Visual Studio is LIEING to you.  There really is code there, but it simply will not show you.  I will show you techniques to get around this. 

    The key to unlocking mysteries of managed code, is a debug helper called SOS.DLL (it is a dll that is shipped with the runtime).  The DLL is what is called a ‘debugger extension’.  Basically it implements functionality that is useful in a debugger implementing functions that are useful for debugging code associated with it (in this case the runtime).   Other bloggers have also commented on the use of this DLL (do a web search of SOS.DLL for more).

    In Visual Studio, you load SOS.DLL by opening the immediate window (Ctrl-D I) and typing

    .load SOS.dll

    If you do this you may get the message

    SOS not available while Managed only debugging.  To load SOS, enable unmanaged debugging in your project properties.

    This message is actually reasonably helpful.  By stopping the debugger (Shift F5) going to Solution Explorer (Right hand pane), right clicking on the InspectingManagedCode project file, and selecting Properties, you will get the properties pane for the project.  If you select the ‘Debug’ tab on the left side you will find 3 check boxes at the bottom, one of which is labeled ‘Enable unmanaged code debugging’  If you check this, you put the debugger into a mode where it can debug both mananged and unmanaged code, (which means you can then use SOS.DLL).    I have already done this on the InspectingManagedCode project, but you will have to repeat this any time you need to use SOS.   (Sadly the instructions for setting the debugger mode are different for C++).    Note that running the debugger to debug both managed and unmanaged code will slow the debugger down a bit (it loads the symbols for all the unmanaged DLLS), so you probably only want do this on projects like this one where you want to use SOS.DLL. 

    Now you should be able to set a breakpoint in Main(), run the program (F5), and go to the immediate window (CTRL-D I) and type

    .load SOS.dll

    And get the message

    extension C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\sos.dll loaded.

    If you are curious the SOS.DLL has reasonably good help, if you type the command

    !Help

    It will give you a list of commands, and you can get help on individual commands by specifying the name eg.

    !Help u

    It will give you help on the ‘u’ (unassembled) command.    All SOS commands need to be prefixed by a ! character so that the Visual Studio Debugger knows that it is an SOS command and not an immediate value to be interpreted (the normal meaning of text typed in the immediate window).

    The unassemble SOS command is the command we are interested in.   It will disassemble a managed routine, but do a much better job than Visual Studio presently does.   Unfortunately, we need the address of the routine we want disassemble, and Visual Studio goes to some length to hide this information.  If you look at the disassembly for the code (CTRL-ALT-D), you will see that the address of the routine is never given, only the offset from the beginning of the method. 

    The way around this is to use the ‘Registers window’ (Ctrl-D R).  I happen to like to put this window just above the immediate window and shrink it so that only the two lines that actually show values are showing.   One of the registers is ‘EIP’ which stands for Extended Instruction Pointer’.  It is the address of the current instruction pointer.   In my particular invokaction EIP has the value of 00DE0071, so I can do the command

    !u 00DE0071

    Which will disassemble the ENTIRE routine that the address 00DE0071 lives in.  I like to right click in the immediate window and select ‘Clear All’ before I do this so the only thing in that window is the disassembly.   On my machine I get the result

    Normal JIT generated code

    Program.Main(System.String[])

    Begin 00de0070, size 1d

    00DE0070 56               push        esi

    >>> 00DE0071 B904309100       mov         ecx,913004h

    00DE0076 E8A11FB2FF       call        0090201C (JitHelp: CORINFO_HELP_NEWSFAST)

    00DE007B 8BF0             mov         esi,eax

    00DE007D 8B053C302B02     mov         eax,dword ptr ds:[022B303Ch]

    00DE0083 8D5604           lea         edx,[esi+4]

    00DE0086 E8A5380979       call        79E73930

    00DE008B 5E               pop         esi

    00DE008C C3               ret

    It is not unlike the version the Visual Studio produced, but there are differences

    1.      You will note that the ‘call instruction is annoted with ‘JitHelp: CORINFO_HELP_NEWFAST’, which makes it at least a bit clearer that this helper is used to create a New object (and is the fast version, we have many variations).

    2.      It printed the whole routine that 00DE0071 lives in and prints a >>>  on the instruction corresponding to the 00DE0071 address. 

    3.      While it did not print the name for the ‘call 79E73930’, notice that the HEX value is different than the value in the Visual Studio Disassembly (79222B78).  The value in the  VS disassembly is simply WRONG (it is bug no one bothered to fix). 

    So let’s take a look at the first two instructions.

    00DE0071 B904309100       mov         ecx,913004h

    00DE0076 E8A11FB2FF       call        0090201C (JitHelp: CORINFO_HELP_NEWSFAST)

    I mentioned that this helper call creates a new object from the GC heap. To do so it needs to know that type of the object to be created. This is what the magic number 913004 does.  Internally in the runtime types are described by a structure called a MethodTable, and 913004 is the address of the MethodTable to create.  We can find out what type 913004 corresponds to by using the !DumpMT (dump Method Table) SOS command. 

    !DumpMT 913004h

    Produces the output

    EClass: 00911254

    Module: 00912c14

    Name: Program

    mdToken: 02000002  (C:\Documents and Settings\vancem\My Documents\Visual Studio 2005\Projects\InspectingManagedCode\bin\Release\InspectingManagedCode.exe)

    BaseSize: 0xc

    ComponentSize: 0x0

    Number of IFaces in IFaceMap: 0

    Slots in VTable: 6

    The only output of this that is interesting at this point is the ‘Name’ field, which as you can see, indicates that 913004 cooresponds to the ‘Program’ type.   Thus these first two instructions create a program object.   This program object comes back from the helper with all its fields zeroed, so the next instructions in the program are the body of the constructor (the Program() constructor has been inlined into the body of Main(). 

    The next instructions

    00DE007B 8BF0             mov         esi,eax

    00DE007D 8B053C302B02     mov         eax,dword ptr ds:[022B303Ch]

    00DE0083 8D5604           lea         edx,[esi+4]

    00DE0086 E8A5380979       call        79E73930

    Basically implement the statement ‘myString = "foo"’ The helper returns a pointer into the uninitialized object in the EAX register.  The mov saves this into the ESI register.  EAX is then loaded with what is at the address 022B303Ch.  This happens to be the string “foo” (more on how it go there in a later blog).   You can confirm this by going to the disassembly code, setting a breakpoing right after the eax,dword ptr ds:[022B303Ch] instruction and looking at the value of the EAX register in the ‘registers’ window.   In my example it happens to be the value 012B1D44.   You can then use the command

    !DumpObj 012B1D44

    Which will dump the managed object at this address.  This will print .

    DumpObj 012B1D44

    Name: System.String

    MethodTable: 790fa3e0

    EEClass: 790fa340

    Size: 24(0x18) bytes

     (C:\WINDOWS\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)

    String: foo

    Fields:

          MT    Field   Offset                 Type VT     Attr    Value Name

    790fed1c  4000096        4         System.Int32  0 instance        4 m_arrayLength

    790fed1c  4000097        8         System.Int32  0 instance        3 m_stringLength

    790fbefc  4000098        c          System.Char  0 instance       66 m_firstChar

    790fa3e0  4000099       10        System.String  0   shared   static Empty

        >> Domain:Value  0014c550:790d6584 <<

    79124670  400009a       14        System.Char[]  0   shared   static WhitespaceChars

        >> Domain:Value  0014c550:012b186c << Basically

     Again, most of the output is uninteresting at this point, except the Name field (which says its a string), and the ‘String’ field (which shows the string value is ‘foo’).  So we have confirmed that this instruction loads up the address of the ‘foo’ string into the EAX register.   What is left is

    00DE0083 8D5604           lea         edx,[esi+4]

    00DE0086 E8A5380979       call        79E73930

    The first instruction ‘LEA’ may not be familiar to you.