Herman Venter, programming language enthusiast

  • A bit of help compiling your next .NET hosted programming language

    As the infrequent nature of my blog postings make pretty clear, I do have a day job that does not involve full time musing about programming languages.

    One of the things I do is to write compiler after compiler. Somewhat like old king Midas, everything I touch turns into a compiler. Along the way, I've harvested pieces of code that I use frequently and I've packaged them up in a mostly incomplete, but still useful set of libraries that help with writing compilers and compiler like tools.

    These libraries have now been released by Microsoft Research as an open source project, see http://ccimetadata.codeplex.com.

     If you are interested in building a compiler, or a program development tool, please have a look. Perhaps you can use some of what is there. And if not, perhaps you can help to complete it to the point where you can use it.

  • Numbers are up

    In my previous posts I’ve wondered about the future viability of static typing, arrays and loops. Can I get any more out of touch with reality? Well, what about numbers?

    Computers are number crunchers, so how can we possibly program a computer with a language that doesn’t know numbers. Surely I jest?

    Well yes, but only a little. Numbers are actually very problematic citizens of programming languages. C# is a mess with sbyte, short, int, long, byte, ushort, ulong, float, double, decimal, and even weird pseudo numbers like char, IntPtr, UIntPtr, and pointers. C is even worse, with things such as short, long and long long and rules that do not tell you how long a long is, but do tell you that is at least as long as a short.

    And then there are the literals for numbers. C has decimal integer literals, floating point literals, octal literals (which look uncannily like decimal literals, except that 012 != 12), hexadecimal literals and all sorts of weird suffixes. C#, thankfully, gets rid of octal literals, but add suffixes galore. JavaScript had octal literals, then deprecated them, but didn’t quite follow through with that.

    Of course, numbers by themselves are somewhat meaningless. Rather than just “5” we might mean “5 meters” or “5 kilograms”. Not surprisingly, there have been attempts at making programming languages aware of the units of numbers. During the early days of the ECMAScript Editon 4 work, a lot of effort went into such a proposal, which got weirder and weirder as time went on and eventually fizzled out and disappeared.

    And then there is the small matter of precision. In much of the scientific world, there is a big difference between 5.0 and 5.00. Ada had some interesting ideas here: you could specify the precision needed for a number and then the compiler would select the best implementation that meets the required precision. No other major language picked this up, so the experiment must be considered a failure.

    Coming back to C# and its myriad of numeric types, the C# rules for how numeric values are converted from one type to another  are complex and arcane. It can take some careful reasoning and consulting the rather daunting specification to figure out what happens in some cases. There is also a rather tricky interaction between numeric promotion and method overload resolution. At least the C# rules mostly do the right thing. You’ll never accidently turn 255 into -1 or 257 into 1, but you do need to insert explicit casts in many cases where they are not actually needed (which is something you become very aware of when you translate a C program to C#.) Not that C has it right: it happily lets you treat a numeric variable as a signed number in one place and an unsigned number in another, without so much as a “by your leave”.

    A big problem these days, when unsuspecting software components are at the mercy of “evil doers” intent on exploring every corner case to see what damage can be done is that most languages provide arithmetic operations that silently overflow. C# lets you specify that particular areas of code should use overflow checking, but it does not encourage you to make use of this checking. And when you do turn it on you are confronted with significant slowdowns as well as the need to handle very costly exceptions every time a number overflows. As a result, few programmers bother to check for overflows, and those who do check, have to pay a heavy price or need to resort to arcane coding practices. A while back I heard a claim that at least 50% of security bugs these days are related to unchecked for arithmetic overflows.

    In JavaScript, things are not quite so bad. There is only one numeric type: the IEEE 64-bit floating point number (doubles). Such numbers have the nice property that they do not overflow and that all arithmetic operations are complete (defined for every value).  Yet, JavaScript programmers are not happy campers and the language design committee keeps looking for ways to improve the language. The reason, in a nutshell, is that numeric literals are (mostly) expressed in decimal, while doubles are represented with fixed length binary numbers. This can lead to expressions resulting in numbers that print slightly differently from what high school arithmetic leads us to expect. It also leads to persistent bug reports from incredulous programmers that do not understand the intricacies of fixed length binary floating point numbers.

    A persistent proposal is the replace binary floating point numbers with decimal floating point numbers, which is probably an improvement (if one could just change the meaning of existing programs, which we cannot). But this still does not solve the problem of finite precision, which rounding mode to choose and how to deal with the fact that in much of Science, 1.0 is not the same as 1.00.

    All this makes we wonder if we should remove the special status that finite precision numbers currently enjoy in programming language. Perhaps the default interpretation of a numeric literal should be: a Rational number represented as a pair of arbitrary precision decimal numbers. All other kinds of numbers should be objects with same status as any other object in the language. Compilers can have special knowledge of some of these types of objects, but this should be an implementation detail, not a language feature.

    Of course, compilers should also do an outstanding job of not making you pay the full price of using arbitrary precision Rational numbers as your default numeric type. I suspect, however, that once you remove arrays and loops from a programming language, the performance of arithmetic operations will be much less of an issue.

  • End Loops

    If anything is more fundamental to programming languages than arrays, it would be loops. You can probably count the number of programming languages that dispense with loops on the fingers of one hand.

    Yet loops are problematic. Loop constructs are hard to design, no two languages quite agree on what loops should look like and how they should behave. Programmers, too, find loops hard to get right. Non terminating loops, loops that loop one iteration too many or one too few and loops that introduce performance problems abound in programs. Needless to say, loops are bad news for automated program verification systems.

    Of all the loop constructs out there, the one that seems the most enduring is the “while condition do stuff loop”. In the evolution of programming languages, this construct is strongly preserved so it must be extremely useful. Yet this construct is hardly intuitive. Many programming novices fundamentally misunderstand its sequential nature. It takes effort to understand that “stuff” does not just happen while “condition” is true, but that control has to reach the start of the loop, test the condition, conditionally enter the loop, perform one iteration, go back to the start of the loop, and so on.

    The imperative nature of loops is particularly troublesome when you are trying to compile programs to run efficiently on parallel machines. Loops are where all the action is, but to before you can parallelize the loop, you need to know rather a lot about what goes on inside the body of the loop.

    All this makes we wonder if the time has come to end loops in programming languages. Instead of providing a way to operate on the elements of a collection, one element at a time, the programming languages of the future should embrace constructs that manipulate entire collections in a way that completely hides the control flow.

    There are many ways to think about “post loop programming”. My personal favorite is spread sheets. You don’t see too many loops in spreadsheets, yet people create amazingly powerful applications with spreadsheets.

     I can hardly wait to see if the next champion programming language manages to put an end to loops.
  • Archaic Arrays

    Should arrays get axed from future languages?

    In some ways the question is shocking. Arrays have been built into programming languages since ancient times.  Arrays provide the foundation for many data structures and form the subject of an entire chapter in most introductions to programming. Why meddle with success?

    Well, arrays have always been problematic data structures. In FORTRAN they were the source of semantic ambiguity. In Pascal they were the source of type checking problems. In their C/C++ incarnation they have wreaked all sorts of havoc. In their Java/C# incarnation they are so unsatisfactory that no self respecting API can dare use them.

    To expand a bit on their C# incarnation: arrays are always mutable, which is a big problem when you need immutable data structures. Nevertheless they always have a fixed length, which is a big problem when you need mutable data structures. They are covariant, but only for arrays of reference types. Covariance is great for read-only arrays, but those don’t exist. Covariance is of little use for mutable arrays, but to use those you must pay the price of a runtime check for every element assignment.

    In short, C# programmers should hide arrays deeps inside of fundamental data structures such as the generic List class of the CLR. Unfortunately, the exalted status conferred on arrays by their special syntax and dedicated book chapters makes programmers far too likely to sprinkle arrays all over public APIs. That alone is probably enough justification for wielding the axe.

    But merely replacing “int[]” with “List<int>” is probably not going far enough. At the very least it should be “IList<int>” to provide some flexibility for the implementation. But that is not the whole story either: the very idea of random access needs rethinking.

    For small arrays, it does not matter very much if one accesses the array elements in sequence or randomly. For large arrays, it matters a great deal. Missing the L1 cache is bad, missing the L2 cache is very bad and taking a page fault is a disaster.

    Perhaps the time has come to rediscover and reinvigorate streams as the basic data structure for collections. C# v2 has made a step in this direction with iterators and C# v3 has turned it into a leap with LINQ.

    I believe that C# v3 points the way to the future. A new champion programming language will have to embrace streams and make them a central part of its programming model. Perhaps lazy functional languages still have lot more to contribute in this respect than the rather indirect influence they have so far had on C#.
  • Evolution versus Intelligent Design in Programming Languages

    I suspect that most people involved in the creation of programming languages like to think of themselves as (intelligent) designers. We have the Programming Language Design and Implementation conference, programming language designers, design documents, and so on.

    Yet, despite all this, programming languages look more like the products of evolution than of intelligent design. We can organize programming languages into families and trace the origins and gradual evolution of every programming language feature. Given any two programming languages, there are quite likely more similarities than differences.

    Of course, this should surprise no-one. For a new language to catch on, even in a modest way, it has to leverage the familiarity its potential users have with existing languages and make sure that any new features are few, powerful and easy to learn.

    On the other hand, new languages are not simply extended versions of Fortran. New programming languages also evolve by deleting features from their predecessors. Niklaus Wirth (Pascal, Modula, Oberon, etc.) famously remarked that the designer of a new programming language should spend more time deciding on what to leave out than on what to put in.

    Yet, every feature found in a programming language got there because it enabled a scenario important enough to warrant the significant expenditure of effort needed to put the feature in the language. Sometimes the scenario goes away as time goes by and the language feature can quietly be dropped without anyone shedding any tears. Or, more usually, the scenario becomes relatively unimportant and the people who care about it simply stick with the old languages.

    If things go really well, the new language introduces an abstraction that generalizes a number of (predecessor) features that can then be dropped from the language because they are special cases of the new feature.  Well, this seems like it ought to be true, but right now I’m hard pressed to come up with a good example of this actually happening in a successful programming language. (If a reader has such an example, I’d love to see it.)

    At any rate, it is interesting to take a hard look at the features we all know (and love?) in our current successful programming languages and speculate about which of these features might not (or should not) be retained in successful new programming languages.

    I coming weeks, I’ll pick some features, more or less at random and try to poke holes in them.
  • Is there a contract in your future?

    In my previous posting I speculated that compile time type checkers as we know them today may fall by the wayside in favor of something that looks and feels a lot more like the dynamically typed languages.

    One of the statements I made, “If you store an object in field foo and then call foo.bar(x, y, z), you really don’t care what the type of the object in foo is, only that it should have a function property called bar that can handle the arguments x, y and z.”, needs further qualification.

    Yes, you do not care what the type of the object in foo is, nor do you care about the particular function value that is stored in the bar property. However, you do care about what exactly is going to happen when you call foo.bar(x, y, z). It is pretty hard to program if you do not know what your program is going to do.

    In our current champion languages you get narrow things down by being specific about what types of values you may pass to bar and what type of value you can expect to get back. But that is a far cry from knowing what the function is going to do.

    Much more useful is a precondition that states “before you call this function, the following things must be true” and a postcondition that states “if you call this function while satisfying the precondition, you can be sure that the following things will be true”.

    Our current best practice is to state these conditions in documentation. Getting this documentation to be accurate and to stay accurate is quite a challenge. And much of the time we never quite get around to creating the documentation.

    I am persuaded that a programming language that makes it really easy to write down these pre and post conditions (contracts) and that quickly and accurately checks that function implementations satisfy their contracts, will make programming easier and programmers more productive.

    Just how this will be done is not obvious, but it is a worthy goal for anyone aspiring to come up with one or more of the innovations that will unlock the next champion programming language.
  • Will the next champion be a dynamically typed language?

    In recent years there has been an upsurge of interest in dynamically typed languages, with quite a lot of buzz about JavaScript, Python and Ruby. Could it be that one of these languages will rise to be the champion of the next decade?

    I would be a bit surprised. These languages are quite old and carry a lot of baggage. Although they are quite a bit more expressive than the current champions, they also give up a lot and are much more challenging to implement efficiently. Most importantly, I do not think they adequately address the pain points I mentioned in http://blogs.msdn.com/hermanventer/archive/2008/02/28/succeeding-as-a-programming-language.aspx.

    Nevertheless, there are good reasons for the current success of dynamic languages and we should try to learn from that. The language I know best is JavaScript, so let me reflect a bit on the differences between JavaScript and Java.

    The most obvious difference is the lack of type constraints. JavaScript objects allow you to assign any kind of value to their properties. This may seem like a crazy lack of discipline and an invitation for run-time disaster, but it does enable JavaScript programmers to create highly polymorphic data structures, which makes it easier to write less code. And writing fewer lines of code is a very good thing. Granted, there is now no compile time type checker to point out some of your mistakes to you. But, if you write unit tests and develop incrementally, you find these kinds of errors pretty quickly anyway. And you’ll find many others that a compile time type checker will miss.

    Now, in Java you can make your data structures polymorphic by annotating your fields with type “Object” and casting down whenever you need to operate on the value stored in the field. But this is not only verbose and ugly looking; it is also downright hard to do. In effect, the programmer has to implement dynamic typing, instead of the compiler doing it. And doing it efficiently is hard. So, instead, Java and C# have acquired ways to parameterize types. This gets you quite a bit of polymorphism, but often too little. Also, the added expressive power comes at the expense of much more complicated type rules. Not many weeks go by in between episodes where I find myself puzzling over how to make a type error go away, or helping someone else understand the C# type rules.

    Another important difference is that the JavaScript type system is structural rather than nominal. That is to say, if you store an object in field foo and then call foo.bar(x, y, z), you really don’t care what the type of the object in foo is, only that it should have a function property called bar that can handle the arguments x, y and z. To achieve this in Java or C#, you’d have to define a separate interface for every method you write, which is very tedious and best left to a compiler. Why is this important? In a nutshell: “version resilience”.

    The runtime support needed for dynamic typing can also leveraged for all sorts of metaprogramming, which makes dynamic languages very powerful. For example, in JavaScript one can dynamically update the value of a function property on a prototype object. This makes it easy to instrument code and do all sorts of Aspect Oriented programming like things.

    To sum up: I strongly suspect that a future champion programming language will provide a type system that is virtually or completely invisible to programmers, that allows programmers to write code that can deal with highly polymorphic data and that is structural rather than nominal. I also expect that there will be first class support for metaprogramming.
  • Will there be many domain specific champions rather than a few general purpose champions?

    In previous posts, I’ve speculated that one of the necessary ingredients for the ascent of a new “champion” programming language is that the overall computing market should expand significantly.

    This seems to have been the case in the past, but will the pattern necessarily repeat itself? One could plausibly argue that as computing becomes ever more pervasive, it also becomes ever more diverse and that “one size fits all” programming languages will be displaced by languages specifically tailored to the needs of particular domains.

    The argument has merit and does not lack proponents. After all, most communities of experts quickly develop domain specific jargons that help them communicate concisely and precisely with each other, even if this comes at the expense of the ability to communicate with outsiders.

    Apart from the LISP faithful, who inevitably  can (and do) point out that every programming language idea worthy of consideration has been present in LISP since shortly after I was born, the most vocal proponents of Domain Specific Languages tend to be from the Modeling community. A typical pitch goes something like this: “I have (or am building) a system that allows you to construct graphical models from an open ended set of shapes and lines. To create a domain specific language, you merely choose a vocabulary of shapes and line and add some syntactic and semantic constraints on how these shapes and lines combine. Then you attach a particular semantics to the language, typically by combining together elements from some convenient collection of semantic constructs that come with the system. Now you have a DSL from which my system is (or will be) able to automatically generate executable code, prove correctness properties, and so on.” For those of us who are “graphically challenged”, there is usually the option to create traditional concrete syntaxes as well.

    Intriguing as these systems/proposals are, some words of caution are in order. As the Lispers point out there is not really anything new here. If we look at the domain specific jargons developed by communities of experts, we hardly ever see major departures in alphabet, pronunciation and grammar. For the most part, domain specific jargons are constructed by adding new words to English, sometimes accompanied by new symbols to help with conciseness.

    It also bears noting that jargons are seldom designed as a conscious act. Instead, they slowly evolve along with the supporting community, typically in the context of an education system that teaches the jargon along with the expertise.

    To me, this seems more like the gradual evolution of routine libraries and frameworks tailored to specific communities, than it seems to me a like new Tower of Babel.

    To conclude: programming language design is really hard. A set of innovations that address our major pain points in a coherent way is more likely to come our way in the form a general purpose language created by an inspired team of language experts. Domain specific jargons will not go away, but they will always be based on top of general purpose languages rather than collectively displacing them.

  • Succeeding as a programming language

    In my last post I speculated that we can infer from history that a necessary precondition for toppling our existing programming language champions is that there should be a significant expansion in computing.

    A number of reasons come to mind why this may be true. One is that such an expansion represents a big new market with huge potential pay-offs for companies that succeed in serving the market. This creates an appetite for investment and a greater tolerance for risk taking. It also creates demand for programmers and spreads them out. A programmer with a personal interest in a new programming language might now find himself/herself in a position of authority, exposed to many people with diverse non computing backgrounds, and saddled with the responsibility of quickly and cheaply coming up with software that address new kinds of problems.

    In a more traditional setting, our protagonist will quickly be shouted down and told to stick with practices that are known to work (kind of). In the midst of a gold rush, however, anything goes and a programming language with something new to offer is likely to be tried and if it is worthy, word will spread like wild fire. In a few exceptional cases the language goes on to become the new champion. (It certainly also helps a lot if the language has a powerful sponsor.)

    Of course, even in a gold rush most people do not take risks that they can avoid. It should also be true that the old champion languages should be significantly less suited for the new environment than the aspiring champions. Since existing languages can and do evolve (sometimes quite quickly), it should be asked why, in the past, champion languages were so slow to adapt to new environments.

    I think the answer is that it the design space for language innovations is quite huge. Existing champions have to be very careful about how they evolve and they carry a lot of baggage that must be cared for by huge and expensive armies of porters. Aspiring champions are a dime a dozen and the vast majority of them will quickly die in utter obscurity. It is really hard to know which innovations are the right ones. By the time it is obvious what is to be done, an existing champion may already have become a former champion.

    If you find all this plausible and you're a programming language enthusiast like I am, it is really nice to know that we are not perpetually condemned to put ever more layers of lipstick on the language pigs that are our current champions. It is also quite daunting to consider the really long odds against coming up with one or more of the innovations that will get us to a better place.

    In searching for such innovations, it really helps to understand the problems of the new environment.

    So, just what is the new environment? Sadly, I am less than sure that I know. While computing is still expanding at quite a clip, there is little evidence that we are in the midst of a gold rush like during the dot com boom.

    Some trends stand out, though: Computing devices are becoming ever more ubiquitous. At some point, things may reach a critical mass, precipitating a gold rush. In such a world, computing will be massively distributed, network latencies will be significant, and security will be huge concern. On the one hand, these devices will generate huge amounts of data to process in high performance data centers. On the other hand, we'll be faced with computers that are tiny and slow, that need to resort to drastic measures to conserve battery life, that are not constantly connected and that may be impossible or at least very difficult to update.

    The kinds of applications that are likely to be very important will not just be traditional IT, serving up web pages and searching them. Bioinformatics will be huge. And if Robotics really takes off, a lot of AI computing will come along for the ride.

    All in all, computing will become much more diverse and much more challenging. Along the way, our current pain points will become unbearable burdens. The winners in this brave new world will be those companies that figure out how eliminate these pain points altogether.

    So what are the current pain points? I can only speak from personal experience, but here is my list:

    • Programmers have far too much control over the execution of their programs. This may seem like a good thing, but it is a nightmare when you have to worry simultaneously about network and memory latency, energy consumption, race conditions, deadlocks and failures. It is far better if programmers need to specify only what is absolutely essential, and have automated tools figure out the rest.
    • Security is a nightmare. There are far too many ways to do things and far too few iron clad guarantees provided by the underlying platforms. The only way forward is to automate this to the point where the vast majority of programmers never need to think about security.
    • Complexity is overwhelming and getting worse at an increasing rate.
    • Versioning is still an unsolved problem.
    • It is still far too hard to reason about the correctness of our programs. And when we can't be sure what our programs do, the above problems are greatly magnified.

    Perhaps my language enthusiasm is clouding my good sense, but I can see no way in which incremental changes to our current champions can eliminate these pain points.

    Which is perhaps just another way of saying: the innovations that we need are not yet obvious.

    By the time they are obvious, we'll hopefully have new champions.

  • Will there be a new champion programming language in the next decade?

    I hope so.

    If history is anything to go by it seems inevitable. I have been in the computing field since the mid seventies, mostly in an academic/research/advanced developement context. I've have used many programming languages over the years, but some of them have been much more equal than others at various times.

    In the seventies, Fortran and Cobol where champions. In the eighties, we had Pascal and C. In the nineties, Basic and C++ and in the noughts Java and C#. Why would the teens be any different?

    Well, in the long run only death and taxes are inevitable. There has always been strong, determined and well grounded opposition to new programming languages. I remember long debates (in South African academic circles) about whether to move from Fortan to Pascal as primary teaching language. C++ sort of snuck in on the back enthusiam for Unix and C, but by the time Java came along there was a large and passionate group opposing its adoption. I also well remember hearing Anders Hejlsberg protesting in 1999: "I am not inventing a new programming language. Nobody wants a new programming language. C# is just C with a few corrections to take account of what has been learnt over the years."

    Almost ten years later C# may be a "champion" but is still something of an emerging language. It continues to evolve and shows little sign of senility and fading away. Moreover, the former champions still have their loyal followers and show no signs of impending death either. A new champion will have to have quite a lot to offer before seizing the crown. Getting to the point where there is a lot on offer will require a huge investment. Huge investments are not made often or lightly, and take a lot time. It seems hard to imagine any of the current pretenders to the crown attracting this kind of investment, or a new pretender coming along soon enough to really impact the teens.

    Looking back at the successions that did occur, what can we learn? For me a few things stand out: When Pascal and C came along, Fortran and Cobol had far too little support for procedural abstraction and data structures. They evolved, but did so at a glacial pace and events just overwhelmed them. Still, the succession might not have occurred had it not been for the advent of the mini computer and the huge expansion of computing beyond the mainframe IT shops dominated by IBM and the BUNCH.

    (Visual) Basic offered event based programming and supported GUI programming in a way that Pascal was slow to match and C never managed. C++ rode the wave of object orientation and the general concern with modularity and programming in the large. Still, the succession may not have happened had it not been for the huge expansion of computing brought on by the advent of the micro computer. The old champions where not so much dethroned as surrounded and reduced to vasal status in a much larger empire.

    Java and C# offered garbage collection, strong typing and above all, extensive standardized libraries. But it took the world wide web and a further expansion of computing to make the succession happen. And the succession is by no means complete.

    If we are to have another succession, we should probably have another expansion. Is there room for one? I think so, but we'll have to see. My guess is that ubiquitous computing and the "cloud" will provide such an expansion.

    A new champion will also need to solve problems that matter to this expanded world and that the existing champions will be slow to provide (or find impossible to provide).

    I have some ideas on what this might be, which I'll expound on in my next post.

    Herman

     


© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker