
How is good software like good science?

I'm not one who believes mainstream large-scale software development really deserves the title of "computer science" (or "software engineering" for that matter).  However, I have been thinking lately that there is an interesting analogy between good software development and good scientific theories.  Here are some examples:

 

Each aspect below is described first for a software program, then for a scientific theory.

What is it?
  Software program: A description of some desired computer behavior.
  Scientific theory: A postulated description of reality.

Who creates it?
  Software program: Programmer
  Scientific theory: Theorist

Who first validates it?
  Software program: Tester
  Scientific theory: Experimentalist

Provability
  Software program: Usually can't ever be proven 100% correct. The longer we go without finding any bugs, the more we suspect it may be bug-free.
  Scientific theory: Can never be proven correct, just proven wrong. The more it resists being proven wrong, the more we believe it is probably 100% correct.

Testability
  Software program: Good software is designed from the ground up to be tested effectively. If you build a large system first, and then later start thinking about how you might test it, you're likely to end up with a lot of bugs that are hard to find.
  Scientific theory: For something to qualify as a "scientific theory" it must be "falsifiable" - that is, there needs to be a way in theory you could prove it doesn't work. In general, the more ways you might be able to show that a theory is wrong, the better the theory is. For example, many people argue that most forms of string theory are not sufficiently testable to be seriously considered as science.

Predictive power
  Software program: If you write a program precisely to pass a certain set of tests, you don't really know if it's correct until you test it against some new test cases.
  Scientific theory: Theories are generally designed to fit known experimental results. Properly predicting the results of experiments that have already been done is important, but the real test comes when the theory predicts results which aren't yet known.

Reproducibility
  Software program: No matter how much testing you do, you ultimately need to get some real customers to try your software and see if they confirm your belief of correctness. Often the best tests come from the users who are most critical of you. For example, seeing positive comments on Slashdot about Silverlight has been a confirmation to me that my team is making some good decisions.
  Scientific theory: No matter how much one group's experiments agree with theory, little weight is attributed to the experiments unless other independent groups are able to reproduce the results. Often the best tests of a theory come from people who believe it to be false. The history of quantum mechanics is full of scientists who disbelieved it, but whose arguments and experiments ultimately strengthened it (such as Einstein and the EPR paradox).

Simplicity
  Software program: Correct software tends to have a simpler, smaller, and/or more elegant implementation. This doesn't mean its behavior needs to be simple.
  Scientific theory: All things being equal, simpler theories tend to be the correct ones. This is known as Occam's razor. There are many examples in science of theories which are conceptually and mathematically very simple, but whose implications are very complex and non-obvious. A classic example is an explanation for the complex movements of the planets in the sky. Ptolemy's model with the earth at the center was very complex, but Copernicus presented a much simpler model with the sun at the center.

Believability
  Software program: Good software tends to be well structured and documented so that a person can reason about its correctness.
  Scientific theory: Good theories tend to have a logical and believable explanation for why they should be correct.

Specificity
  Software program: Good software tends to have a concrete specification for what is considered correct behavior. We try to avoid the temptation to build something, write some tests for it, and call the results of those tests "correct". When the tests are first executed, there is ideally only one possible correct output.
  Scientific theory: Good theories tend to have fewer "free variables", which are parameters determined by experiment. For example, each planet in Ptolemy's model of the solar system had a number of concentric rings with various sizes associated with it. In the Newtonian model, the movement is determined entirely by each planet's mass, orbital radii (major and minor), and the universal gravitational constant. When new experiments are performed, there is ideally just one result that would be consistent with the theory.

Generality
  Software program: The field of software (and often individual large systems) advances by replacing special-purpose components with general-purpose frameworks. Operating systems and managed runtimes like the CLR are obvious examples here.
  Scientific theory: Good theories often supersede many previous (apparently unrelated) results, encompassing them all under one larger umbrella. A great example here is the realization that electricity, magnetism, radio waves, and light were all properties of the same electromagnetic force (and eventually even just one aspect of the electroweak force).

Reusability
  Software program: Good software tends to build on previous successes, re-using components that are known to be of high quality, but avoiding dependencies on high-risk pieces. It's extremely difficult (and wasteful) to build a completely new large system from scratch. Personally, I believe this is an area we could do better on the CLR.
  Scientific theory: The progress of science has obviously only been possible by building on previous successes.

 

So what can we learn from the study of good science (which has had a lot longer to mature) about how we should approach software?  Here are some ideas:

  • Be evidence-based - try to rely as much as possible on concrete data.  It helps you avoid the inevitable temptation to deceive yourself.
  • Have a culture of humility - accept that it's a lot easier to be wrong than it is to be right, and that your work should be assumed to be incorrect until there is enough independent evidence to suggest otherwise.  Recognize that certainty and black-and-white positions are usually overly simplistic and damaging.
  • Extraordinary claims require extraordinary evidence - there is no silver bullet, and chasing one can leave you running in circles.
  • Be willing to accept a paradigm shift when necessary - it's sometimes necessary to abandon a long-held and cherished philosophy and accept well-justified radical new ideas in order to keep making progress.
  • Strive for simplicity - adding more total lines of code (like more special cases in your theory) should be considered a last resort rather than the normal process of growth.  You can only continue to tack on new ad-hoc solutions to problems for so long before the maintenance costs become stifling.  Removing code is more important for the quality of your software than writing new code.
  • Be self-critical - it's human nature for a group of like-minded intelligent people to be blinded to the truth by their arrogance.  Recognize this and seek out opportunities to prove yourself wrong.
  • Re-architect when necessary - it's sometimes advantageous to combine a group of previously independent things into a new component which replaces them all.
  • Study the past - learn from the patterns of past mistakes and successes, and recognize how to predict the most likely avenues for success.  Often negative results (understanding why a theory or piece of software failed to be successful) are more valuable than positive results.
  • Know when to start over - sometimes we have to be willing to let go and give up on an idea or piece of software and start from scratch.  Clinging to the past can be very destructive in the long run.

I'd love to hear your comments about where this analogy works and where it doesn't.  I keep thinking about it whenever I read something about "good science", but I'm not yet sure whether it's just the natural tendency to make connections between things you know well, or whether there is some deeper underlying principle here connecting these two ideas.

Posted by rmbyers | 1 Comments

Customizing PDB lookup for source information in StackTrace

The System.Diagnostics.StackTrace class in .NET can be used to generate a textual representation of the current callstack.  This is used, for example, by Exception.ToString().  If requested by the caller, StackTrace can include source file locations (file names, line numbers, etc.) for each frame whose module has a PDB file available to the CLR.  PDB files are designed to be used primarily in development-time scenarios, so the idea is that when you're developing or testing your application and it spits out an exception (e.g. to a log file, or an unhandled exception to the console), it helps you debug the issue if you can see exactly where in the leaf method the exception was thrown, and exactly where each child function was called.  (Technically it's not always "exactly": if JIT optimizations are enabled the results may be approximate, and frames may be missing completely due to inlining.)  If you've done much .NET programming, you probably knew all this already.

The more interesting (and less well documented) question I want to address is where exactly you must place your PDB files for this to work.  The CLR will look next to the corresponding module (DLL or EXE), and also check a few other standard locations (those local paths specified by the _NT_SYMBOL_PATH environment variable for example, and I believe the Windows system directory).  In fact, it's not really the CLR controlling any of this, but the ISymUnmanagedBinder::GetReaderForFile API from diasymreader.dll, which itself is implemented on top of the IDiaDataSource::loadDataForExe API.  Since this support for source locations is designed for development time, when you're generally running binaries you've just built - the PDBs are almost always next to the binaries and this works great.

Occasionally we get requests from people who like this feature but complain that the CLR isn't flexible enough to find their PDB files where they want to put them.  Sometimes this stems from wanting to use this feature for something it wasn't designed for, such as shipping PDBs with your product and logging/reporting errors from the field.  For that scenario you're usually MUCH better off using Windows Error Reporting and minidumps.  You generally do not want to ship your PDBs to your customers (they're big, and can make it easier to reverse engineer your code - although this is a much bigger concern for unmanaged C++ code than .NET code).  In other cases, you may want to generate machine-readable stack traces (with module names, method tokens and IL offsets), and then post-process them using PDB files at your location to get source location information.
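As a rough illustration of the "machine-readable stack trace" idea, here is a minimal sketch that records the module name, method metadata token and IL offset for each frame.  The class name and the "module|token|offset" line layout are made up for this example; only the StackTrace, StackFrame and MethodBase members it calls are real.  No PDB is needed on the machine that produces this output.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;
using System.Text;

static class MachineReadableTrace
{
    // Formats each frame as "module|0xToken|0xILOffset".
    // The layout is hypothetical; pick whatever your post-processing tool expects.
    public static string Format(StackTrace trace)
    {
        var sb = new StringBuilder();
        StackFrame[] frames = trace.GetFrames();
        if (frames == null)
            return string.Empty; // older runtimes return null for an empty trace

        foreach (StackFrame frame in frames)
        {
            MethodBase method = frame.GetMethod();
            if (method == null)
                continue; // no managed method is available for this frame

            // GetILOffset() may return StackFrame.OFFSET_UNKNOWN (-1).
            sb.AppendFormat("{0}|0x{1:X8}|0x{2:X}",
                method.Module.Name, method.MetadataToken, frame.GetILOffset());
            sb.AppendLine();
        }
        return sb.ToString();
    }

    static void Main()
    {
        try
        {
            throw new InvalidOperationException("demo");
        }
        catch (Exception e)
        {
            // 'false': tokens and IL offsets do not require source info.
            Console.Write(Format(new StackTrace(e, false)));
        }
    }
}
```

The token and offset can later be resolved to a file and line by a separate tool that has access to the matching PDB.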

But there are a few scenarios where it really does make sense to want more flexibility in how PDBs are located for the StackTraces generated at runtime.  For example, I recently got a request through product support from a customer with a large test environment where they were deploying their actual product.  They keep all their PDBs (1TB+ of them!) on a symbol server, and they would like to be able to use them to generate stack traces with source info without having to deploy PDBs to all their test machines (in the proper directories).  Although the CLR doesn't support this directly, there isn't any reason you can't implement this yourself.  The CLR StackTrace class exposes StackFrame objects which have all the information you need to map back to source addresses given an ISymbolReader instance.  ISymbolReader instances can be created directly (controlling PDB location policy manually with ISymUnmanagedBinder2::GetReaderForFile2) by calling into diasymreader.dll through COM interop.  I've posted sample code for a StackTraceSymbolProvider class that does this here (using MDbg's COM interop wrappers for diasymreader.dll).

Here's an example of how this code can be used to print out a StackTrace from an Exception while explicitly controlling the directories searched (searchPath is a semi-colon separated list of directories including SRV* entries for symbol servers), and whether things like a symbol server will be checked:

            catch (System.Exception e)
            {
                StackTrace st = new StackTrace(e, true);
                StackTraceSymbolProvider stsp = new StackTraceSymbolProvider(searchPath,
                    SymSearchPolicies.AllowSymbolServerAccess |
                    SymSearchPolicies.AllowOriginalPathAccess |
                    SymSearchPolicies.AllowReferencePathAccess |
                    SymSearchPolicies.AllowRegistryAccess);

                Console.WriteLine("Custom stack trace:");
                Console.WriteLine(stsp.StackTraceToStringWithSourceInfo(st));
            }

To get this flexibility you have to re-implement some of the formatting done by StackTrace.ToString(), but you might want the flexibility to control this anyway (for example, it's easy to include column numbers in addition to line numbers).  It's non-trivial to wire this all up (especially if you're not familiar with COM-interop), but it's all plumbing really.  Hopefully this sample code will save some of you the hassle of figuring out this plumbing yourself.
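To give a feel for the formatting side, here is a minimal sketch of a custom formatter that adds column numbers.  It deliberately uses only the stock StackFrame accessors (so it relies on the CLR's default PDB lookup) rather than the StackTraceSymbolProvider sample; the FrameFormatter class and its output layout are assumptions made for this example.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;
using System.Text;

static class FrameFormatter
{
    // Produces one line per frame, similar in spirit to StackTrace.ToString(),
    // but with the column number appended when source info is available.
    public static string Format(StackTrace trace)
    {
        var sb = new StringBuilder();
        foreach (StackFrame frame in trace.GetFrames() ?? new StackFrame[0])
        {
            MethodBase method = frame.GetMethod();
            string name = "<unknown method>";
            if (method != null)
            {
                string owner = method.DeclaringType == null
                    ? string.Empty
                    : method.DeclaringType.FullName + ".";
                name = owner + method.Name;
            }
            sb.Append("   at ").Append(name);

            string file = frame.GetFileName(); // null when no PDB was found
            if (file != null)
            {
                sb.AppendFormat(" in {0}:line {1}, column {2}",
                    file, frame.GetFileLineNumber(), frame.GetFileColumnNumber());
            }
            sb.AppendLine();
        }
        return sb.ToString();
    }

    static void Main()
    {
        try
        {
            throw new InvalidOperationException("demo");
        }
        catch (Exception e)
        {
            // 'true' asks the runtime to look up source information.
            Console.Write(Format(new StackTrace(e, true)));
        }
    }
}
```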

Posted by rmbyers | 0 Comments

Using LINQ for Computational Genomics

I’ve been playing around a bit lately with computational genomics (I’m doing a project for my parallel computation class). I wanted to write some simple algorithms that operate on potentially large amounts of DNA data without using a ton of RAM. For example, the entire human genome is 3 billion base pairs – reading it all into memory is out of the question (at least on my home PC). It occurred to me that this was a perfect opportunity to spend some more time using LINQ. Computations like this are inherently stream based, and LINQ allows you to express and compose operations on streams very effectively. Here’s a simple example that I hope will help demonstrate the power of LINQ.

One of the simplest forms of statistical analysis done on a DNA sequence is producing a graph of its “GC density”. To a computer scientist, a DNA sequence is just a string of A, C, T, and G characters. The GC density with window size ‘w’ for a sequence is another (similarly sized) sequence of numbers between 0 and 1, each of which represents the ratio of Gs and Cs in the previous ‘w’ characters. For example, for the sequence “ATGCAG”, the GC density with window size 2 is “0, 0.5, 1, 0.5, 0.5”.
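To make the definition concrete, here is a deliberately naive sketch that holds the whole sequence in memory as a string and counts each window directly.  The class and method names are made up for this example, and it is not the streaming approach developed in the rest of this post; it is only a reference to check results against.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class GcDensityExample
{
    // For every window of 'windowSize' consecutive characters,
    // returns the fraction of characters that are 'G' or 'C'.
    public static List<double> NaiveGcDensity(string sequence, int windowSize)
    {
        var result = new List<double>();
        for (int start = 0; start + windowSize <= sequence.Length; start++)
        {
            int gc = 0;
            for (int i = start; i < start + windowSize; i++)
            {
                char c = sequence[i];
                if (c == 'G' || c == 'C')
                    gc++;
            }
            result.Add((double)gc / windowSize);
        }
        return result;
    }

    static void Main()
    {
        List<double> densities = NaiveGcDensity("ATGCAG", 2);
        // Prints: 0, 0.5, 1, 0.5, 0.5
        Console.WriteLine(string.Join(", ",
            densities.Select(d => d.ToString(CultureInfo.InvariantCulture))));
    }
}
```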

So, in order to calculate this, I first started with a simple function which, given a nucleotide sequence, would return an identical-length sequence of the number of Gs or Cs seen so far: 

    /// <summary>
    /// Generate a sequence that is the prefix sum of Gs and Cs in the input sequence.
    /// </summary>
    /// <param name="sequence">The input nucleotide sequence</param>
    /// <returns>A sequence of the same length for which each element indicates the
    /// count of Gs and Cs between the beginning and corresponding position in
    /// the input sequence.</returns>
    public static IEnumerable<int> GenerateGCCount(IEnumerable<Nucleotide> sequence)
    {
        int count = 0;
        foreach (Nucleotide n in sequence)
        {
            if (n == Nucleotide.G || n == Nucleotide.C)
                count++;
            yield return count;
        }
    }

Given this sequence (call it s), the density for window size w at some position i could (conceptually) be calculated with the formula d[i] = (s[i+w]-s[i])/w. However, since we might want a large window size, we don’t even necessarily want to have w nucleotides in memory at once, so we can’t use such a simple formula. Fortunately, the stream-based version is almost as simple:

    /// <summary>
    /// Compute the GC density of the specified nucleotide sequence.
    /// </summary>
    /// <param name="sequence">The sequence of nucleotides</param>
    /// <param name="windowSize">Size of the window for each density calculation</param>
    /// <returns>A sequence for which each element indicates the ratio of Gs and Cs to
    /// total nucleotides over the past windowSize elements</returns>
    public static IEnumerable<float> ComputeGCDensity(
        this IEnumerable<Nucleotide> sequence,
        int windowSize)
    {
        if (windowSize < 1)
            throw new ArgumentOutOfRangeException("windowSize");

        // First compute a sequence of the running count of G or C nucleotides.
        // A leading 0 ("nothing seen yet") is prepended so that the first
        // window is measured from a count of zero.
        var gcCounts = new[] { 0 }.Concat(GenerateGCCount(sequence));

        // Combine the sequence with a copy of itself offset by the window size.
        // The elements of the sequence are pairs of counts, with the first element
        // of the pair being the GC count just before the window begins, and the
        // second element of the pair being the GC count at the end of the window.
        var gcWindows = gcCounts.Zip(gcCounts.Skip(windowSize));

        // Now for each window, compute the gc density and return it
        return gcWindows.Select(
            p => (float)(p.Second - p.First) / windowSize);
    }

This is pretty much it. If you remove the comments this is only a couple of lines of code (or a single long query line if you prefer). Zip is just a helper function I wrote (based on list-processing functions in other functional languages like Standard ML’s ListPair.zip or OCaml’s List.combine) that returns a sequence of pairs from the two input sequences:

    /// <summary>
    /// A pair (2-tuple)
    /// </summary>
    /// <typeparam name="T1">Type of the first element</typeparam>
    /// <typeparam name="T2">Type of the second element</typeparam>
    public struct Tuple<T1,T2>
    {
        public Tuple(T1 first, T2 second)
        {
            First = first;
            Second = second;
        }

        public readonly T1 First;
        public readonly T2 Second;
    }

    /// <summary>
    /// Combine two sequences into a sequence of pairs
    /// </summary>
    /// <typeparam name="T1">Type of the elements of the first sequence</typeparam>
    /// <typeparam name="T2">Type of the elements of the second sequence</typeparam>
    /// <param name="source1">The first sequence</param>
    /// <param name="source2">The second sequence</param>
    /// <returns>A sequence that is as long as the shorter of source1 and source2.
    /// Each element is a pair of values - one from each of the input sequences
    /// at the same position.  If the input sequences are of unequal length, any extra
    /// data in the longer list is not used.</returns>
    public static IEnumerable<Tuple<T1, T2>> Zip<T1, T2>(
        this IEnumerable<T1> source1,
        IEnumerable<T2> source2)
    {
        // Conceptually we just want to foreach over both sequences, but we must
        // write the code manually since foreach works for only a single sequence.
        // foreach would also dispose the enumerators for us, so do that explicitly.
        using (var enum1 = source1.GetEnumerator())
        using (var enum2 = source2.GetEnumerator())
        {
            while (enum1.MoveNext() && enum2.MoveNext())
            {
                yield return new Tuple<T1,T2>(enum1.Current, enum2.Current);
            }
        }
    }

Together I think this is pretty simple and elegant. I create two instances of the same sequence, offset one by the window size, and then use their difference to compute the density. All of this is done on demand each time a density value is required. I’m not keeping any information about any position other than the current one in memory at any one time. One drawback is that for small window sizes, it may actually be pretty silly (from a performance perspective) to read each byte of the sequence from the file twice. But I consider this an optimization problem which the operating system’s file system cache should ideally solve for me. In practice execution still appears to be CPU bound, so I think the OS is doing a good job – the extra I/O isn’t a concern.
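For comparison, here is a sketch of a different technique, not the one used above: a single-pass sliding window that reads the input only once but remembers one flag per position in the current window.  It trades O(windowSize) memory for avoiding the second read.  The names are made up for this example, and it works on plain chars rather than the Nucleotide type.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class SlidingWindow
{
    public static IEnumerable<double> GcDensitySinglePass(
        IEnumerable<char> sequence, int windowSize)
    {
        // Validate eagerly; the iterator body below only runs on enumeration.
        if (windowSize < 1)
            throw new ArgumentOutOfRangeException("windowSize");
        return Iterate(sequence, windowSize);
    }

    private static IEnumerable<double> Iterate(IEnumerable<char> sequence, int windowSize)
    {
        var window = new Queue<bool>(windowSize + 1); // true = G or C
        int gcInWindow = 0;
        foreach (char c in sequence)
        {
            bool isGc = c == 'G' || c == 'C';
            window.Enqueue(isGc);
            if (isGc)
                gcInWindow++;

            // Drop the element that just slid out of the window.
            if (window.Count > windowSize && window.Dequeue())
                gcInWindow--;

            if (window.Count == windowSize)
                yield return (double)gcInWindow / windowSize;
        }
    }

    static void Main()
    {
        // Prints: 0, 0.5, 1, 0.5, 0.5
        Console.WriteLine(string.Join(", ",
            GcDensitySinglePass("ATGCAG", 2)
                .Select(d => d.ToString(CultureInfo.InvariantCulture))));
    }
}
```

With a window of 100000 the queue holds 100000 flags, which is small next to the input but is exactly the kind of per-window state the two-stream version avoids.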

If you’re new to this style of programming, it can sometimes get a little confusing trying to keep track of the timing in which things execute. C#’s iterator method syntax and the IEnumerable pattern in general can make the use of delayed computation a little subtle. This is a big benefit of using a functional programming style like I have where no shared state is modified and the functions have no side-effects. The order in which things execute is unimportant.
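Here is a small sketch of that timing.  It deliberately breaks the no-side-effects rule by writing to a log, purely to make the order of execution visible; the DeferredDemo class and its Log list are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    public static readonly List<string> Log = new List<string>();

    // An iterator method: none of this body runs until someone enumerates.
    public static IEnumerable<int> Numbers()
    {
        Log.Add("start");
        for (int i = 1; i <= 3; i++)
        {
            Log.Add("yield " + i);
            yield return i;
        }
    }

    static void Main()
    {
        IEnumerable<int> doubled = Numbers().Select(n => n * 2);
        Console.WriteLine(Log.Count);              // prints 0: nothing has run yet

        int first = doubled.First();               // pulls exactly one element
        Console.WriteLine(first);                  // prints 2
        Console.WriteLine(string.Join(", ", Log)); // prints: start, yield 1
    }
}
```

If Numbers() had no side effects, none of this ordering would be observable, which is the point of the paragraph above.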

To test that this actually worked on large data sources without using a lot of RAM, I ran it on the entire Human Chromosome 1 sequence (the largest one - almost 300 MB uncompressed) with a window size of 100000. For kicks, I also ran it on the Chimpanzee chromosome 1. Both runs took about 10 minutes on my relatively slow home PC, and used only a steady 10MB of RAM. Here are the resulting graphs if you’re interested.

Posted by rmbyers | 2 Comments

More on generic variance

In my entry on generic variance in the CLR, I said that you can’t convert a List<String> to a List<Object>, or even an IEnumerable<String> to IEnumerable<Object>.  I should point out however that the real-world scenarios where you’d want to do this usually involve passing an object of a more specific type to an API that (for abstraction reasons) takes a less specific type.  For example, say you have a class hierarchy like this:

 

    abstract class Shape
    {
        public abstract double ComputeArea();
    }

    class Square : Shape
    {
        public Square(double height)
        {
            m_height = height;
        }

        public override double ComputeArea()
        {
            return m_height * m_height;
        }

        private double m_height;
    }

    class Circle : Shape
    {
        public Circle(double radius)
        {
            m_radius = radius;
        }

        public override double ComputeArea()
        {
            return Math.PI * m_radius * m_radius;
        }

        private double m_radius;
    }

 

You’d like to be able to use it uniformly like this:

 

    class Program
    {
        static void Main(string[] args)
        {
            List<Square> ls = new List<Square>();
            ls.Add(new Square(2));
            ls.Add(new Square(3));
            double totS = GetTotalArea(ls); // won’t compile

            List<Circle> lc = new List<Circle>();
            lc.Add(new Circle(2));
            lc.Add(new Circle(3));
            double totC = GetTotalArea(lc); // won’t compile
        }

        public static double GetTotalArea(IEnumerable<Shape> l)
        {
            double total = 0;
            foreach (Shape s in l)
            {
                total += s.ComputeArea();
            }
            return total;
        }
    }

 

This would only work if C# supported generic variance and IEnumerable was defined as IEnumerable<+T>.  However, there is another option in this case.  You could make GetTotalArea a generic method and rely on a base-type constraint:

 

        public static double GetTotalArea<T>(IEnumerable<T> l)
            where T : Shape
        {
            double total = 0;
            foreach (Shape s in l)
            {
                total += s.ComputeArea();
            }
            return total;
        }

 

Now the above Main method will work perfectly.  You don’t even have to modify it to specify the type parameters (C#’s type inference can figure them out automatically).  So although you're not really converting an IEnumerable<Square> to an IEnumerable<Shape>, you are able to use it that way.
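A note for anyone trying this on a newer compiler: C# 4.0 and later declare the interface as IEnumerable<out T>, so the original non-generic GetTotalArea accepts a List<Square> directly.  The sketch below assumes C# 4.0 or later and trims the Shape hierarchy down to what the example needs.

```csharp
using System;
using System.Collections.Generic;

abstract class Shape
{
    public abstract double ComputeArea();
}

class Square : Shape
{
    private readonly double m_height;
    public Square(double height) { m_height = height; }
    public override double ComputeArea() { return m_height * m_height; }
}

static class CovarianceDemo
{
    // No type parameter and no constraint: just IEnumerable<Shape>.
    public static double GetTotalArea(IEnumerable<Shape> shapes)
    {
        double total = 0;
        foreach (Shape s in shapes)
            total += s.ComputeArea();
        return total;
    }

    static void Main()
    {
        var squares = new List<Square> { new Square(2), new Square(3) };

        // Compiles on C# 4.0+ because IEnumerable<T> is covariant in T.
        Console.WriteLine(GetTotalArea(squares)); // prints 13
    }
}
```

List<Shape> itself is still not a supertype of List<Square>; only the read-only IEnumerable view is covariant.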

 

This is certainly great.  And in practice, most .NET developers will find the support in .NET 2.0 for generics to be more than powerful enough.  However, if you’re trying to do some serious “generic programming” [3], or if you’re excited by programming language theory like I am, then this does still leave something to be desired.  Most notably, we can specify an upper-bound using a constraint like this (rather than require our language to support covariant type parameters), but the CLR and C# don’t have support for lower bounds (“supertype constraints”) so you can’t use this technique in place of contravariant type parameters (eg. my IComparer<-T> example).  See section 4.4 (“Comparison with Parametric Methods”) of [1] for more details.
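For the contravariant direction, here is a sketch that again assumes C# 4.0 or later, where IComparer<in T> is contravariant.  A comparer written once against Shape can then sort a List<Square>.  AreaComparer and the demo class are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

abstract class Shape
{
    public abstract double ComputeArea();
}

class Square : Shape
{
    private readonly double m_height;
    public Square(double height) { m_height = height; }
    public override double ComputeArea() { return m_height * m_height; }
}

// Written against the base type only.
class AreaComparer : IComparer<Shape>
{
    public int Compare(Shape x, Shape y)
    {
        return x.ComputeArea().CompareTo(y.ComputeArea());
    }
}

static class ContravarianceDemo
{
    public static List<Square> SortByArea(List<Square> squares)
    {
        var sorted = new List<Square>(squares);
        // List<Square>.Sort expects IComparer<Square>; an IComparer<Shape>
        // is accepted because IComparer<T> is contravariant in T.
        sorted.Sort(new AreaComparer());
        return sorted;
    }

    static void Main()
    {
        var squares = new List<Square> { new Square(3), new Square(5), new Square(2) };
        // Prints: 4, 9, 25
        Console.WriteLine(string.Join(", ",
            SortByArea(squares).Select(s => s.ComputeArea())));
    }
}
```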

 

Java 5 addresses these scenarios with “Wildcard types” [2].  E.g. you would use a “List<? extends Shape>” in Java to implement my example above, and there is also a syntax “? super T” for lower-bounds.  It’s interesting to note that the authors of the paper on wildcards indicate that the main advantage of wildcard types over using normal generic methods (as I’ve done above) is that wildcard types don’t require exact type information (see section 4.4 of [2]).  The big difference between generics in .NET and Java is that in .NET, the CLR supports generics to the core and so we have exact type information everywhere (which is why you can get information about generic types using Reflection at run-time).  In Java, generics are “erased” by the compiler, and so at run-time, the information about generic instantiations is lost.  Using erasure has the benefit of better compatibility with old code and a much simpler implementation, but extending the VM (like we did for the CLR) has several important benefits including avoiding boxing for instantiations at value types and therefore better performance (e.g. a List<int> can be very efficient), and the ability to use exact type information at run-time.
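A minimal sketch of what "exact type information at run-time" looks like in practice, using only standard Reflection calls (the helper name ElementTypeOf is made up for this example):

```csharp
using System;
using System.Collections.Generic;

static class ReifiedGenericsDemo
{
    // Returns the element type if 'obj' is a List<T>, otherwise null.
    public static Type ElementTypeOf(object obj)
    {
        Type t = obj.GetType();
        if (t.IsGenericType && t.GetGenericTypeDefinition() == typeof(List<>))
            return t.GetGenericArguments()[0];
        return null;
    }

    static void Main()
    {
        object ints = new List<int>();
        object strings = new List<string>();

        Console.WriteLine(ElementTypeOf(ints));                 // prints System.Int32
        Console.WriteLine(ElementTypeOf(strings));              // prints System.String
        Console.WriteLine(ints.GetType() == strings.GetType()); // prints False
    }
}
```

Under erasure the two lists would share a single run-time class, so the last comparison would have nothing to tell them apart.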

 

The big difference between wildcards and generic variance in the CLR is that wildcard types are an example of “usage-site variance” whereas MSIL uses “definition-site variance” (meaning it’s the type definition that specifies the variance annotation, not the user of the type).  I was reading about a cool academic programming language called Scala recently, and was pleased to see that after some experience with usage-site variance, they decided to switch to definition-site variance because they found it easier to use correctly (see “Comparison with wildcards” in [4]).  Scala is a very cool language (my programming languages professor recently mentioned it as a great example of a language on the forefront of modern academic language design), and can target .NET.  Unfortunately they haven’t built support for the V2.0 CLR yet so they aren’t actually making use of the definition-site variance support in the CLR (for the moment <grin>).

 

Anyway, I think that’s about all I have to say about generic variance.  I’ve got a programming languages exam on Tuesday (I’m working on my masters in computer science) so I think I better stop procrastinating and study for it <grin>.

 

References

[1] On variance-based subtyping for parametric types

[2] Adding Wildcards to the Java Programming Language

[3] A Comparative Study of Language Support for Generic Programming

[4] An Overview of the Scala Programming Language (2. Edition)

 

Posted by rmbyers | 18 Comments

Linq and the cost of additional language complexity

Uwe Keim posted a thought provoking comment in response to my entry about Linq.  Here is an excerpt:

I do see the benefits, but I also have a big déjà vû: The C#-language seems to go the C++-way, where I, even after 10 years of programming, don't know all of the features and sometimes still wonder "why does this thing behave this way?".  So the complexity is raised from version to version. Does everyone think, this is a good way? I think it would be better to NOT enhance the language/compiler from version to version, but to enhance functionality by enhancing the library instead.

I think the added complexity concern is a valid one which deserves careful analysis.  I'll just touch on my thoughts here.

One important difference between C++ and C#/VB is that C++ is designed by committee, where C# and VB each have a chief architect (Anders Hejlsberg and Paul Vick respectively) who doesn't have to try and satisfy everyone's favorite language request.  I know the complexity concern is something Anders and his team have taken very seriously (as have the VB folks).  I've heard Anders talk about the fact that they have to be extremely careful about this, and that he'd rather err on the side of caution.  The C# team has rejected several of my favorite language features (such as generic variance, C++ style const, etc.) because they felt the benefits for the average user didn't significantly outweigh the extra complexity (even though there were compelling benefits for SOME users).  This is where a lot of the "art" of language design comes into play.  Personally I have a lot of respect for Anders and Paul and trust their judgment and the customer research their teams have done which concluded that in this case the extra complexity was justified.  As much as I'd like to believe otherwise, there is probably a good reason here why Anders is the chief C# architect and you and I are not <grin>.

Of course they could be wrong, and it's important that Microsoft and our customers evaluate this risk seriously.  Uwe suggested a compiler plug-in model where you could enable different features on demand.  I'd love a public compiler plug-in model - there are so many cool features I'd like to add and use in my code.  I've even asked Anders about this (in the context of the fact that expression trees mean that some errors that could normally be detected at compile time now become run-time errors), and he indicated that to maintain simplicity and comprehensibility they are avoiding any such dynamic / customizable compiler behavior.  I think Anders has a point here.  In this case, just using the v1.1 or v2.0 compiler to target the latest CLR has similar benefits except that the possible feature combinations are fixed to well known sets, which avoids the confusion and complexity of the plug-in or optional feature model.

The other thing to keep in mind is that a lot of effort went into reducing complexity by adding a few powerful general language concepts, instead of many feature-specific concepts.  I think Linq did even better than Cω in this respect.  For example, unlike the Cω compiler, the C# 3.0 compiler has no knowledge of databases, SQL or XML; that all lives in the DLinq and XLinq libraries.  Personally, I find the new language features in the C# 3.0 spec to be quite simple and elegant, but then I'm a (functional) programming languages nerd so I'm not exactly the target audience.
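One way to see the "general concepts, not feature-specific ones" point: query syntax is only shorthand for ordinary method calls with lambdas, and the library supplies those methods.  The sketch below uses the in-memory System.Linq operators (not DLinq), and the class and method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class QueryTranslationDemo
{
    // Query-expression form.
    public static List<int> WithQuerySyntax(IEnumerable<int> xs)
    {
        return (from x in xs
                where x % 2 == 0
                select x * x).ToList();
    }

    // The method calls the compiler translates the query above into.
    public static List<int> WithMethodCalls(IEnumerable<int> xs)
    {
        return xs.Where(x => x % 2 == 0)
                 .Select(x => x * x)
                 .ToList();
    }

    static void Main()
    {
        var data = new[] { 1, 2, 3, 4 };
        Console.WriteLine(string.Join(", ", WithQuerySyntax(data))); // prints 4, 16
        Console.WriteLine(string.Join(", ", WithMethodCalls(data))); // prints 4, 16
    }
}
```

Which Where and Select get called depends only on what methods are in scope for the source type, which is how a library can plug in its own behavior.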

Ultimately it will be up to our customers.  If, after seriously experimenting and evaluating Linq, you think the extra complexity isn't worth the benefit, then it's important that you give us your feedback (we're getting better and better at listening to this sort of thing).  On the other hand, the response from the PDC was overwhelmingly positive, so it would likely be an uphill battle <grin>.

Posted by rmbyers | 8 Comments

Comparison of a simple select statement in DLinq (C# 3.0) vs. ADO.Net

Six months ago I posted a comparison of a simple select statement in C-omega vs. ADO.Net which some people found very exciting.  Now that Linq has been officially unveiled, I figured I should update my comparison using C# 3.0 and DLinq.  Although Linq and C-omega have some significant differences, everything I said in that post about the benefits of C-omega applies equally well to Linq (in fact, I like Linq even better, primarily due to expression trees) [Update: Erik Meijer e-mailed me to say that C-omega also has indirect support for expression trees by exposing the parse tree to compiler plug-ins.]

Here is the example ADO.Net code using strongly-typed data sets:

      SqlDataAdapter da = new SqlDataAdapter(
            "SELECT * FROM Employees WHERE City= @city", nwindConn );
      SqlParameter cityParam = da.SelectCommand.Parameters.Add("@city", SqlDbType.VarChar, 80);
      cityParam.Value = city;

      NorthwindDataSet ds = new NorthwindDataSet();
      da.Fill(ds, ds.Employees.TableName );

      foreach (NorthwindDataSet.EmployeesRow dr in ds.Employees.Rows)
      {
            string name = dr.LastName;
            int id = dr.EmployeeID;
            Console.WriteLine( id + ": " + name);
      }

And here is the equivalent C# 3.0 code:

var rows = from e in db.Employees where e.City == city select e;
foreach( var e in rows )
{
   string name = e.LastName;
   int id = e.EmployeeID;
   Console.WriteLine( id + ": " + name );
}

Or even more concisely:

foreach( var e in db.Employees.Where( e => e.City == city ) )
   Console.WriteLine( e.EmployeeID + ": " + e.LastName );
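
For anyone who wants to try the two forms without a database, here is a rough, self-contained sketch.  It is not the DLinq code above: the Employee class and the sample rows are made up for the illustration, an in-memory list stands in for db.Employees (so no SQL is generated), and it uses the System.Linq API names as they eventually shipped, which may differ from the PDC preview bits.  It checks that the query-expression form, the Where(...) form, and the same lambda captured as an expression tree all select the same rows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical stand-in for the generated Employees row type.
public class Employee
{
    public int EmployeeID { get; set; }
    public string LastName { get; set; }
    public string City { get; set; }
}

public static class QuerySketch
{
    // Made-up sample rows; a real DLinq query would read these from the database.
    static readonly List<Employee> Employees = new List<Employee>
    {
        new Employee { EmployeeID = 1, LastName = "Davolio", City = "Seattle" },
        new Employee { EmployeeID = 2, LastName = "Fuller", City = "Tacoma" },
        new Employee { EmployeeID = 8, LastName = "Callahan", City = "Seattle" },
    };

    public static List<string> NamesIn(string city)
    {
        // Query-expression form; the compiler rewrites this into a Where(...) call.
        var viaQuery = from e in Employees where e.City == city select e;

        // The same query written directly as a method call with a lambda.
        var viaMethod = Employees.Where(e => e.City == city);

        // The same lambda captured as data (an expression tree) instead of code.
        Expression<Func<Employee, bool>> predicate = e => e.City == city;
        var viaTree = Employees.AsQueryable().Where(predicate);

        // All three forms should select exactly the same rows.
        if (!viaQuery.SequenceEqual(viaMethod) || !viaQuery.SequenceEqual(viaTree))
            throw new InvalidOperationException("query forms disagree");

        return viaQuery.Select(e => e.EmployeeID + ": " + e.LastName).ToList();
    }

    public static void Main()
    {
        foreach (string line in NamesIn("Seattle"))
            Console.WriteLine(line);
    }
}
```

With the sample rows above, NamesIn("Seattle") returns the two Seattle employees.  The expression-tree variant is the interesting one: a library such as DLinq can inspect that tree and translate it, rather than just run it.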

There is already tons of information available about Linq (more so than there ever was for C-omega), and there are some great blogs by plenty of people more authoritative than myself (e.g. Matt, Dinesh, Paul, Dan, Scott, etc.), so I probably won't say much more about it myself.  But isn't it awesome that this is almost certainly going to become part of mainstream programming in a couple of years?  Are we finally on the road to eliminating SQL injection attacks (not to mention all the other great benefits)?  It's advances like this that make me proud to work for Microsoft!

Posted by rmbyers | 10 Comments

Interested in C-omega? LINQ finally announced!

Ever since I started planning for my users-group talk (and wrote this blog entry) about data access with C-omega, I've been dying to tell everyone about the plans to add similar functionality to C#.  For those of you at my talk, you'll remember this video of Anders which alluded to the work that was being done here.  Well, today at the PDC, project LINQ (formerly "Clarity", etc.) was publicly announced.  In a nutshell, C# 3.0 (and VB 9.0) will add fundamental support for seamlessly working with data, which includes several core language enhancements that are generally useful (such as type inference).  Functional programming language fans (such as myself) should find the design especially appealing.  See the LINQ website for details.

This is definitely the one thing I'm most excited about in the Orcas (3.0) release.  Not only is it cool for programming-language theory fans, but it also has the potential to significantly improve the productivity of a typical enterprise developer.

 

Posted by rmbyers | 0 Comments

DebuggingModes.IgnoreSymbolStoreSequencePoints

In my last post I gave an overview of the DebuggableAttribute, what values the C# compiler gives it, and how the CLR uses those values.  I mentioned that with /debug+, the C# compiler sets the IgnoreSymbolStoreSequencePoints DebuggingModes bit, but I didn't describe what this bit does.  Understanding sequence points and the IgnoreSymbolStoreSequencePoints bit is important for anyone writing a compiler for .NET (including one that uses Reflection.Emit).

A sequence point is used to mark a spot in the IL code that corresponds to a specific location in the original source.  For Reflection.Emit, they are emitted by the ILGenerator.MarkSequencePoint method.  Sequence points serve two purposes.  First, they allow debuggers (and the StackTrace class) to map IL offsets back to source and line information.  Second, they tell the JIT compiler that it must preserve this location uniquely in the generated native code so that a user could, for example, set a breakpoint on it (see Mike's blog entry for more details).

Since they're only needed for debugging scenarios, the location of sequence points, and their mapping back to source locations, are stored in the PDB file.  Unfortunately this means that the JIT compiler has to open the PDB file and read this data any time it compiles a method with optimizations disabled (if JIT optimizations are enabled, then it ignores sequence points in order to get the best possible codegen).  The PDB file format is quite complicated, and has a long legacy.  So as you might imagine, there is an unfortunate performance penalty here.

Now in addition to the sequence points specified in the PDB file ("explicit sequence points"), the JIT also infers additional sequence points based on the IL code ("implicit sequence points").  For example, it adds an implicit sequence point at every point where the IL evaluation stack is empty, or on any "nop" instruction.  So in Whidbey we added the IgnoreSymbolStoreSequencePoints bit to allow compilers to indicate to the JIT that they are content with the JIT's rules for implicit sequence points, and there is no need for the JIT to look in the PDB file for more.

So our official recommendation is that all compilers (including those using Reflection.Emit) set the IgnoreSymbolStoreSequencePoints bit and rely on "nop" instructions to ensure the JIT places sequence points at the right place.  This can make a nice JIT performance improvement, and also has some other benefits related to not needing to read the PDB file at JIT time (e.g. it may not be available).  The Whidbey C# and VB compilers do this, but unfortunately the managed C++ compiler does not.
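
Here is a minimal sketch of what following that recommendation might look like from Reflection.Emit.  The assembly, module, type and method names are all made up for the illustration, and the flag combination is just the one a debug build would plausibly want.  It also assumes the static AssemblyBuilder.DefineDynamicAssembly factory found in later framework versions (on Whidbey you would go through AppDomain.CurrentDomain.DefineDynamicAssembly instead), and it leaves out symbol emission and MarkSequencePoint entirely, so it only shows the attribute and the "nop" placement.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;
using System.Reflection.Emit;

public static class NopSequencePointSketch
{
    // Builds a tiny dynamic method "Add(int, int)" and invokes it.
    public static int BuildAndInvoke(int a, int b)
    {
        AssemblyBuilder asm = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("SketchAssembly"), AssemblyBuilderAccess.Run);

        // Tell the JIT not to consult the PDB for sequence points
        // (set before any module is defined).
        ConstructorInfo ctor = typeof(DebuggableAttribute).GetConstructor(
            new Type[] { typeof(DebuggableAttribute.DebuggingModes) });
        asm.SetCustomAttribute(new CustomAttributeBuilder(ctor, new object[] {
            DebuggableAttribute.DebuggingModes.Default |
            DebuggableAttribute.DebuggingModes.DisableOptimizations |
            DebuggableAttribute.DebuggingModes.IgnoreSymbolStoreSequencePoints }));

        ModuleBuilder module = asm.DefineDynamicModule("SketchModule");
        TypeBuilder type = module.DefineType("Sketch", TypeAttributes.Public);
        MethodBuilder method = type.DefineMethod("Add",
            MethodAttributes.Public | MethodAttributes.Static,
            typeof(int), new Type[] { typeof(int), typeof(int) });

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Nop);      // marks the start of the "statement" for the JIT
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Ret);

        Type created = type.CreateType();
        return (int)created.GetMethod("Add").Invoke(null, new object[] { a, b });
    }

    public static void Main()
    {
        Console.WriteLine(BuildAndInvoke(2, 3));
    }
}
```

A real compiler would emit one such "nop" at the start of each source statement it wants to be steppable, and would still write sequence points to the PDB so that debuggers can map IL offsets back to source lines.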

Posted by rmbyers | 3 Comments

DebuggableAttribute and dynamic assemblies

Mike Stall has a great little sample showing how to make your dynamically generated code debuggable.  However, there is one more detail you should be aware of.  By default the JIT compiler will enable optimizations for the module, making debugging difficult or (in the case of JMC-mode in VS) impossible.  If you run your program from within Visual Studio and have the “Suppress JIT optimizations on module load” option enabled, then it’ll work fine. However, I believe this setting in VS is now (as of the June CTP) disabled by default in order to minimize any changes to your program’s behaviour when run under the debugger.  Additionally, optimizations would also be enabled if you attached to an already running process.

Compilers like csc.exe have command line options such as /debug and /optimize that affect how the JIT generates code for the assembly.  The compiler communicates this information to the CLR by attaching a System.Diagnostics.DebuggableAttribute to the assembly.  This attribute contains a DebuggingModes value which is a bit-mask of various flags.  Here are the flags that the Whidbey C# compiler emits for the various combinations of compiler options:

/debug- (retail build): No attribute emitted (same as specifying DebuggingModes.None).  Setting /o+ or /o- doesn’t change anything when debug mode is disabled.  /debug- is the default and so can be omitted.
/debug+ (debug build): Default, IgnoreSymbolStoreSequencePoints, EnableEditAndContinue, DisableOptimizations.  /o- is the default here.
/debug+ /o+ (debug optimized): Default, IgnoreSymbolStoreSequencePoints.  In Whidbey this is essentially the same thing as /debug-.
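
If you want to check which of these flags a particular assembly actually carries, a small reflection sketch along the following lines should do it.  The DebuggableInspector class and its method names are made up for the illustration; the only framework pieces it relies on are the DebuggableAttribute.DebuggingFlags property and the DebuggingModes enum.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

public static class DebuggableInspector
{
    // Returns the DebuggingModes recorded on the assembly,
    // or None when no DebuggableAttribute was emitted.
    public static DebuggableAttribute.DebuggingModes GetModes(Assembly assembly)
    {
        object[] attrs = assembly.GetCustomAttributes(typeof(DebuggableAttribute), false);
        if (attrs.Length == 0)
            return DebuggableAttribute.DebuggingModes.None;
        return ((DebuggableAttribute)attrs[0]).DebuggingFlags;
    }

    // True when the flags ask the JIT to disable optimizations.
    public static bool IsOptimizerDisabled(DebuggableAttribute.DebuggingModes modes)
    {
        return (modes & DebuggableAttribute.DebuggingModes.DisableOptimizations) != 0;
    }

    public static void Main()
    {
        DebuggableAttribute.DebuggingModes modes = GetModes(Assembly.GetExecutingAssembly());
        Console.WriteLine("Flags: " + modes);
        Console.WriteLine("JIT optimizations disabled: " + IsOptimizerDisabled(modes));
    }
}
```

What it prints depends on how the containing assembly was compiled, so comparing a /debug+ build against a /debug+ /o+ build is an easy way to see the list above in action.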

Note that before Whidbey, the “Default” flag was used to “enable JIT tracking”.  JIT tracking meant we would track information necessary for debugging such as IL/native maps (for converting native address offsets to IL offsets and vice versa).  In Whidbey we have an efficient form of JIT tracking enabled all of the time, so this flag is now basically meaningless.  However, we need to ensure the Whidbey CLR runs Everett binaries in the same way, so for backwards compatibility we still require the Default flag to be present if we’re going to do things like disable JIT optimizations.

So, if you want to emit a dynamic assembly which is always debuggable (similar to using the C# compiler’s /debug+ option), then you need to use the following code after calling DefineDynamicAssembly:

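// Add a debuggable attribute to the assembly saying to disable optimizations.
// (Needs the System, System.Diagnostics, System.Reflection and System.Reflection.Emit namespaces.)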
Type daType = typeof(DebuggableAttribute);
ConstructorInfo daCtor = daType.GetConstructor(new Type[] { typeof(DebuggableAttribute.DebuggingModes) });
CustomAttributeBuilder daBuilder = new CustomAttributeBuilder(daCtor, new object[] {
            DebuggableAttribute.DebuggingModes.DisableOptimizations |
            DebuggableAttribute.DebuggingModes.Default });
assemblyBuilder.SetCustomAttribute(daBuilder);

Note that it’s best to do this before calling DefineDynamicModule.  Visual Studio, for example, looks at the optimization setting by calling ICorDebugModule2.GetJITCompilerFlags as soon as it gets a LoadModule event (which is triggered by DefineDynamicModule).  If you haven’t emitted the DebuggableAttribute yet, VS will think optimizations are enabled and will prevent debugging of the module if JMC is enabled.

It’s amazing how complicated the concept of “debug build” can be here, especially when you add in backwards compatibility, debugger overrides, various config file overrides, etc.  Thankfully, the story is simpler in Whidbey since we no longer have to worry about JIT tracking.

Posted by rmbyers | 6 Comments

Run-time exception checking

One of our partners asked us how a .NET program can tell what the currently active “try” blocks are on the stack.  This seemed like a dubious thing to want to do, but regardless, a colleague of mine whipped up some sample code that uses the StackTrace class and reflection to do this.  We were talking about the possible uses of this, and most of them seemed pretty evil.  Changing program behaviour based on dynamic program inspection can lead to programs that are hard to reason about and brittle due to violating abstraction boundaries and implicit coupling.  Not to mention the fact that in optimized builds, some data may be missing from the StackTrace due to in-lining or other optimizations.

However, in-process inspection mechanisms like the StackTrace class can be incredibly useful for diagnostics purposes.  Perhaps in certain situations, it would be valuable to be able to write assertions like “If I throw a FooException, it should be handled by somebody”. As you probably know, there is a lot of debate about the value of checked exceptions, but I’m not going to discuss that here; see the Artima interviews with Anders Hejlsberg and James Gosling for a start.  We all know that validating your assumptions with assertions is a critical part of writing quality software, especially for large systems.  So if you’re making some assumptions about exception behaviour, it seems logical to want to check those assumptions at run-time with assertions.  Assertions can also be useful for making you aware of changes in behaviour that might have an impact on an area you didn’t anticipate.  You could imagine more complex checks like “The only catch handler for System.Exception is the one I know about in the top loop of my application” (nothing is more annoying than having some library code swallow your exceptions).  Of course checking these sorts of things statically (with tools like FxCop) is generally preferable to run-time checks, but our managed static analysis tools are still quite primitive, and it is very difficult to do complete static analysis in the presence of dynamic control flow mechanisms like reflection and delegates.

Anyway, here is some sample code that shows how you could write such assertions (I apologize for the lack of syntax highlighting – this new community server software doesn’t properly paste formatted text).  I’m still not convinced this is necessarily a good idea, but it seems to have some intriguing possibilities nonetheless. 

// Get a list of all the try clauses that are active on the stack that would
// catch exceptions of the specified type.
static public List<ExceptionHandlingClause> GetActiveTryClauses(Type exType)
{
  List<ExceptionHandlingClause> activeTryClauses =
    new List<ExceptionHandlingClause>();

  StackTrace stackTrace = new StackTrace();
  StackFrame[] frames = stackTrace.GetFrames();

  // Start at 1 to skip the GetActiveTryClauses frame.
  for (int i = 1; i < frames.Length; i++)
  {
    StackFrame frame = frames[i];
    MethodBase method = frame.GetMethod();
    MethodBody body = method.GetMethodBody();
 
    // Only consider methods that have an IL body.
    if (body != null)
    {
      IList<ExceptionHandlingClause> ehClauses = body.ExceptionHandlingClauses;
      foreach (ExceptionHandlingClause ehClause in ehClauses)
      {
        if (ehClause.Flags == ExceptionHandlingClauseOptions.Clause)
        {
          // Only consider clauses which are active on the current stack
          int offsetInFrame = frame.GetILOffset();
          int tryStartOffset = ehClause.TryOffset;
          int tryEndOffset = tryStartOffset + ehClause.TryLength;
 
          if ((offsetInFrame >= tryStartOffset) && (offsetInFrame < tryEndOffset))
          {
            // If desired, only collect clauses that would catch the specified type
            if (exType == null || ehClause.CatchType.IsAssignableFrom(exType))
            {
              activeTryClauses.Add(ehClause);
            }
          }
        }
      }
    }
  }
 
  return activeTryClauses;
}
 
public static bool IsCaught(System.Type type)
{
  List<ExceptionHandlingClause> handlers = GetActiveTryClauses(type);
  return (handlers.Count > 0);
}

This allows you to write simple checks like the following:

// Ensure someone will handle any IO failure here
Debug.Assert(TryClauseInfo.IsCaught(typeof(System.IO.IOException)));

// Ensure no-one is catching all exceptions
Debug.Assert(!TryClauseInfo.IsCaught(typeof(System.Exception)));

Or you could even write more complex checks based on the ExceptionHandlingClause information. 

What do you think?  Are there situations in your applications where you could get value out of using assertions like this? Can you think of any other uses for this code that wouldn't be totally evil?

Posted by rmbyers | 4 Comments

Comega talk

On Thursday I gave a .NET users group talk on Comega to somewhere around 100 .NET developers.  Overall I think it went pretty well.  I was nervous at first, but once I got into talking about the cool stuff I like I forgot about the pressure and had a good time.  If you were there, please let me know what you thought about both my presentation style and the technical content.

I asked the audience how many of them wrote a lot of code for interacting with databases or XML data, and almost every hand went up.  My premise was that there was a huge potential to improve the productivity, quality and security of a lot of mainstream software development through data-oriented language features like those in Comega, and that if customers agreed, it was their responsibility to demand these features from the industry.  In my opinion, we're far too content as programmers to go about doing things the way we've always done them rather than push for (and pay for) better ways of building systems (obviously finding the right balance here is hard).  Perhaps this reflects a typically short-term-oriented reward system of the software industry (shipping the next new feature sooner), or perhaps it just reflects a cynicism of a market full of half-baked "improve productivity quickly" products that lack empirical data and research (which development shops often aren't eager to share).  

One thing that surprised me was that when I showed the