Arrays considered somewhat harmful


I got a moral question from an author of programming language textbooks the other day requesting my opinions on whether or not beginner programmers should be taught how to use arrays.

Rather than actually answer that question, I gave him a long list of my opinions about arrays, how I use arrays, how we expect arrays to be used in the future, and so on. This gets a bit long, but like Pascal, I didn't have time to make it shorter. 

Let me start by saying when you definitely should not use arrays, and then wax more philosophical about the future of modern programming and the role of the array in the coming world.

You probably should not return an array as the value of a public method or property, particularly when the information content of the array is logically immutable. Let me give you an example of where we got that horridly wrong in a very visible way in the framework.  If you take a look at the documentation for System.Type, you'll find that just looking at the method descriptions gives one a sense of existential dread. One sees a whole lot of sentences like "Returns an array of Type objects that represent the constraints on the current generic type parameter." Almost every method on System.Type returns an array it seems. 

Now think about how that must be implemented. When you call, say, GetConstructors() on typeof(string), the implementation cannot possibly do this, as sensible as it seems.

public class Type
{
    private ConstructorInfo[] ctorInfos;
    public ConstructorInfo[] GetConstructors()
    {
        if (ctorInfos == null) ctorInfos = GoGetConstructorInfosFromMetadata();
        return ctorInfos;
    }
}

Why? Because now the caller can take that array and replace the contents of it with whatever they please. Returning an array means that you have to make a fresh copy of the array every time you return it. You get called a hundred times, you’d better make a hundred array instances, no matter how large they are. It’s a performance nightmare – particularly if, like me, you are considering using reflection to build a compiler. Do you have any idea how many times a second I try to get type information out of reflection?  Not nearly as many times as I could; every time I do it’s another freakin’ array allocation!
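To make the hazard concrete, here is a minimal sketch (using a hypothetical WidgetCache class standing in for a Type-like object, not the real reflection code) of how a caller can corrupt a cached array that is handed out directly:

```csharp
using System;

// Hypothetical stand-in for a class that caches an array
// and hands the cache itself back to callers.
class WidgetCache
{
    private string[] names;

    public string[] GetNames()
    {
        if (names == null)
            names = new[] { "alpha", "beta" };
        return names; // BUG: the caller now aliases our private cache
    }
}

class Demo
{
    static void Main()
    {
        var cache = new WidgetCache();
        cache.GetNames()[0] = "corrupted"; // scribbles on the shared cache

        // Every subsequent caller now sees the corrupted "value".
        Console.WriteLine(cache.GetNames()[0]); // corrupted
    }
}
```

The only way to keep the array-returning signature and stay safe is to allocate a fresh copy on every call, which is exactly the performance problem described above.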

The framework's designers were not foolish people; unfortunately, we did not have generic types in .NET 1.0. Clearly the sensible thing now for GetConstructors() to return is IList&lt;ConstructorInfo&gt;. You can build yourself a nice read-only collection object once, and then just pass out references to it as much as you want.

What is the root cause of this malaise? It is simple to state: The caller is requesting values.  The callee fulfills the request by handing back variables.

An array is a collection of variables. The caller doesn’t want variables, but it’ll take them if that’s the only way to get the values. But in this case, as in most cases, neither the callee nor the caller wants those variables to ever vary. Why on earth is the callee passing back variables then? Variables vary. Therefore, a fresh, different variable must be passed back every time, so that if it does vary, nothing bad happens to anyone else who has requested the same values.

If you are writing such an API, wrap the array in a ReadOnlyCollection<T> and return an IEnumerable<T> or an IList<T> or something, but not an array.  (And of course, do not simply cast the array to IEnumerable<T> and think you’re done!  That is still passing out variables; the caller can simply cast back to array!  Only pass out an array if it is wrapped up by a read-only object.)
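A sketch of that fix, with the same hypothetical cache from before rewritten to build one read-only wrapper and hand out the same reference forever:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Hypothetical cache that exposes its data safely.
class SafeCache
{
    private ReadOnlyCollection<string> names;

    public IList<string> GetNames()
    {
        // Build the read-only wrapper once; every caller shares it.
        if (names == null)
            names = new ReadOnlyCollection<string>(new[] { "alpha", "beta" });
        return names;
    }
}

class Demo
{
    static void Main()
    {
        var cache = new SafeCache();
        IList<string> names = cache.GetNames();
        Console.WriteLine(names[0]); // alpha
        try
        {
            names[0] = "corrupted"; // writes are rejected
        }
        catch (NotSupportedException)
        {
            Console.WriteLine("collection is read-only");
        }
    }
}
```

Note that ReadOnlyCollection&lt;T&gt; is a distinct wrapper object, so a caller cannot cast it back to string[] to get at the underlying variables.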

That’s the situation at present. What are the implications of array characteristics for the future of programming and programming languages?

Parallelism Problems

The physics aspects of Moore’s so-called “Law” are failing, as they eventually must. Clock speeds have stopped increasing; the laws of thermodynamics and the Uncertainty Principle are seeing to that. But transistor counts are still rising and manufacturing costs per chip are still falling, which means that our only hope of Moore’s "Law" continuing to hold over the coming decades is to cram more and more processors into each box. 

We’re going to need programming languages that allow mere mortals to write code that is parallelizable to multiple cores.

Side-effecting change is the enemy of parallelization. Parallelizing in a world with observable side effects means locks, and locks mean choosing between implementing a lock ordering and dealing with random crashes or deadlocks. Lock ordering requires global knowledge of the program. Programs are becoming increasingly complex, to the point where one person cannot reasonably and confidently have global knowledge. Indeed, we prefer programming languages to have the property that programs in them can be understood one part at a time, without having to swallow the whole thing in one gulp.

Therefore we tools providers need to create ways for people to program effectively without causing observable side effects.

Of all the sort of “basic” types, arrays most strongly work against this goal. An array’s whole purpose is to be a mass of mutable state. Mutable state is hard for both humans and compilers to reason about. It will be hard for us to write compilers in the future that generate performant multi-core programs if developers use a lot of arrays.

Now, one might reasonably point out that List<T> is a mass of mutable state too. But at least one could create a threadsafe list class, or an immutable list class, or a list class that has transactional integrity, or uses some form of isolation or whatever. We have an extensibility model for lists because lists are classes. We have no ability to make an “immutable array”. Arrays are what they are and they’re never going to change.

Conceptual Problems

We want C# to be a language in which one can draw a line between code that implements a mechanism and code that implements a policy.

The “C” programming language is all about mechanisms. It lays bare almost exactly what the processor is actually doing, providing only the thinnest abstraction over the memory model. And though we want you to be able to write programs like that in C#, most of the time people should be writing code in the “policy” realm. That is, code that emphasizes what the code is supposed to do, not how it does it.

Coding which is more declarative than imperative, coding which avoids side effects, coding which emphasizes algorithms and purposes over mechanisms, that kind of coding is the future in a world of parallelism. (And you’ll note that LINQ is designed to be declarative, strongly abstract away from mechanisms, and be free of side effects.)

Arrays work against all of these factors. Arrays demand imperative code, arrays are all about side effects, arrays make you write code which emphasizes how the code works, not what the code is doing or why it is doing it. Arrays make optimizing for things like “swapping two values” easy, but destroy the larger ability to optimize for parallelism.
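As an illustrative sketch (with made-up data), compare an imperative array loop with the equivalent declarative query:

```csharp
using System;
using System.Linq;

class Demo
{
    static void Main()
    {
        int[] data = { 5, 1, 4, 2, 3 };

        // Imperative: spells out the mechanism step by step and
        // mutates state (i, squares1) as it goes.
        var squares1 = new int[data.Length];
        for (int i = 0; i < data.Length; i++)
            squares1[i] = data[i] * data[i];

        // Declarative: states the policy ("square each element");
        // no observable side effects, so a future runtime is free
        // to partition and parallelize the work.
        var squares2 = data.Select(x => x * x).ToArray();

        Console.WriteLine(squares2.Sum()); // 55
    }
}
```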

Practical Problems

And finally, given that arrays are mutable by design, the way an array restricts that mutability is deeply weird. All the contents of the collection are mutable, but the size is fixed.  What is up with that? Does that solve a problem anyone actually has?

For this reason alone I do almost no programming with arrays anymore. Arrays simply do not model any problem that I have at all well – I rarely need a collection which has the rather contradictory properties of being completely mutable, and at the same time, fixed in size. If I want to mutate a collection it is almost always to add something to it or remove something from it, not to change what value an index maps to.

We have a class or interface for everything I need. If I need a sequence I’ll use IEnumerable<T>, if I need a mapping from contiguous numbers to data I’ll use a List<T>, if I need a mapping across arbitrary data I’ll use a Dictionary<K,V>, if I need a set I’ll use a HashSet<T>. I simply don’t need arrays for anything, so I almost never use them. They don’t solve a problem I have better than the other tools at my disposal.
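A minimal sketch of that decision table, with hypothetical data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        // One collection type per job:
        IEnumerable<int> sequence = Enumerable.Range(1, 5);        // a sequence
        var byIndex = new List<string> { "a", "b", "c" };          // contiguous index -> data
        var byKey = new Dictionary<string, int> { { "a", 1 } };    // arbitrary key -> data
        var unique = new HashSet<int> { 1, 1, 2, 3 };              // a set; duplicates collapse

        Console.WriteLine(sequence.Sum());   // 15
        Console.WriteLine(byIndex[1]);       // b
        Console.WriteLine(byKey["a"]);       // 1
        Console.WriteLine(unique.Count);     // 3
    }
}
```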

Pedagogic Problems

It is important that beginning programmers understand arrays; it is an important and widely used concept. But it is also important to me that they understand the weaknesses and shortcomings of arrays. In almost every case, there is a better tool to use than an array.

The difficulty, pedagogically, is that it is hard to discuss the merits of those tools without first having covered concepts like classes, interfaces, generics, asymptotic performance, query expressions, and so on. It’s a hard problem for the writer and for the teacher. Fortunately, for me, it's not a problem that I personally have to solve.

  • While we're here, the 'idioms' you mentioned for C# are more deficient than people are willing to admit. Here we are again fighting the runtime over something we are losing control of. So do we wait for the VM to catch up?

    I am in complete agreement with russelh, especially on that long and trivial ReadLine example two blog entries back. That classic, set against an entire lecture in complication, with IEnumerable not cutting it despite all the candy:

    while ((line = reader.ReadLine()) != null)

    simply does it, and no idiom of C# can do it more elegantly. If you take it into the proper sequence world:

    while ( std::getline ...  or

    cont<T> sequence( (std::istream_iterator<T>( stream )), std::istream_iterator<T>() );

    Then you can even reverse it in a one-liner with std::reverse, transform it into anything, and do all that with any type at COMPILE-TIME. Whoaa, C# looks like 1960s tech in comparison.

    There are some serious compile-time problems that get moved into runtime; then there are patches to compilers, and finally we get LINQ. The problem is that LINQ code is around 50% slower than plain generic lists, and lists are slower than arrays, and so it goes on and on.

    The lesson I learned 20 years back is that if something isn't optimal, it isn't the type; it is the tech or abstraction over it causing the problem, usually performance. In this instance it is clearly the inferior, confused semantics of 'sequence' in the CLR.

  • Aren't arrays actually quite good when it comes to parallelism? They're really easy to divide into chunks processed by different threads, unlike linear data structures such as lists. As you point out, they're not mutable (only the elements are mutable), which helps there (or rather, it doesn't hurt). Your comments about metadata are also shifting the blame for CLI's lack of an equivalent to const& onto arrays, which hardly seems fair. On the other hand, I think you've ignored some of the *actual* pedagogical problems such as jagged vs. rectangular arrays.

    In the spirit of "considered harmful", I consider your considerations harmful!

  • I think you forgot to mention another, more important, point: having a read-only collection of references to objects won't save you from the objects themselves being modified.

    Thus IList<T> won't solve the problem you're mentioning!

    You either have to copy all objects (or use value types), or wrap every object your array or collection contains in a "readonlifier" wrapper.

    But the best thing is to design your objects (and it doesn't matter if you return them as items of an array or an collection) so that you have different interfaces for different uses.

    And C++'s const array modifier perfectly solves the problem you're mentioning.

    My 5c; I might be too sleepy and missing the whole point :)

    (I've recently replaced LINQ queries with direct array indexing to get a 5x performance boost, while the interface between the algorithm and the data store remained the same. But I am with you all the way on your arguments regarding parallelization. It's just that arrays are not the key reason why some algorithms are hard to parallelize.)

  • Developers in my team are encouraged (with threats of violence) to return interface types wherever possible, using the minimum interface required by the consumer, e.g. IEnumerable<T> if iteration is required, or IList<T> if indexed access is necessary.

    Ree: With the aforementioned move into many-core processors, the ability to parallelize is becoming more important than single-thread performance, and this is where LINQ will come into its own, especially with the PFX extensions.

    Roman: ConstructorInfo is an immutable class, in which instance the example is valid. Personally, I tend toward immutable classes for everything except business objects.

    And as for Moore's Law: what about memristors?

  • "It is important that beginning programmers understand arrays;"

    It seems to me that one of the main reasons for this was so that they could then quickly understand strings.  But that day is long gone.  In a world where strings are now first class citizens of their own, it should be possible to leave arrays as a concept to be introduced at the same time as other more appropriate collections.

  • It's much more important to understand the difference between O(1) and O(n), I agree :)

  • @danyel:  That may be, but in that case you're still talking about something that should be wrapped in something like an Image class, and not exposed to public view.

    @ree:  It is generally more important to have code that is more maintainable and more amenable to reason than it is to have code that is more performant.  The fastest code serves no purpose if it's simply the fastest to produce an incorrect result, or the fastest to befuddle.  I'll take the CLR's abstractions any day.

  • @kfarmer,

    Part of Ree's point was that, if the CLR supported const and non-mutable arrays, you wouldn't need an abstraction to expose array-like data, and you could have the best of both worlds, maintainability and performance. Which is a perfectly valid point. Whether the gains from supporting const and immutability in the CLR would be worth the effort is another question entirely.

  • I say CLR devs hate arrays because CLR does not efficiently deal with arrays.   Also, CLR memory management makes using arrays inefficient.

    Ok, simple example.   A string.

    In a language like C++, a string is an array of chars.   Why is array of chars so handy?  Because I can modify the chars inside the array and avoid HUGE performance costs of creating new strings over and over and over when doing inline textual processing.

    I wrote a simple app to process newsgroup posts and push them into SQL Server in C#.  Due to string manipulations, I was burning huge amounts of CPU in GC collecting strings and StringBuilders.

    In C++, I could read the data from NNTP directly into a byte array, then rip through the byte array and do fixups and scan for bad things, and in some cases where there were no fixups pass the byte array directly to SQL Server for upload.  Zero allocs, zero memcpys in primary code path.

    Eventually I redesigned my C# code to do the same thing, but it was painfully difficult to do. You can't just cast a byte array to char* in C#.

    Overall, the design compromises the CLR made are sensible; in C++, char arrays have probably caused 10,000 more huge horrible disasters than they have solved. But these design choices have handicapped arrays in C#; let's be honest about that at least.

  • Most of the generic collections dealing in list-type structures internally use arrays (with the exception of SortedDictionary, which uses a tree, and LinkedList, which deals with references to succeeding and preceding LinkedListNodes, and a few others). Unless you specify an initial capacity (if you are able to), adding items dynamically, whether using Add() or the indexing property of the collection, will trigger an allocation of a new array and an Array.Copy() whenever the item you are adding would exceed the current internal size of the collection. This can affect performance.

    What's even worse is that the adding/setting mechanisms in some of these collections trigger a check to determine whether the item is already in the internal array (usually using a binary search). Furthermore, this check forces invocations of methods in helper classes that aid in comparison and equality calculations, pushing extra classes and methods onto the call stack. A lot of overhead, and all of it contributes to performance strain.

    If I have a collection and know certain things at runtime, such as how big it will be, that items I'm adding are new, etc., I choose an array to avoid all the performance stresses and overhead. I like to have a great granularity of control over how I deal in collections, so although I do agree with most of your points, I'd still be hard-pressed to pick a List over an array when I don't need to do more than basic array manipulation.
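    The reallocation behavior described above can be observed directly; a minimal sketch (the growth factor is an implementation detail, so only the presized case is checked):

    ```csharp
    using System;
    using System.Collections.Generic;

    class Demo
    {
        static void Main()
        {
            // Without an initial capacity, the internal array is
            // reallocated and copied repeatedly as the list grows.
            var growing = new List<int>();
            for (int i = 0; i < 1000; i++) growing.Add(i);

            // Presizing avoids every one of those reallocations.
            var presized = new List<int>(1000);
            int before = presized.Capacity;
            for (int i = 0; i < 1000; i++) presized.Add(i);
            Console.WriteLine(presized.Capacity == before); // True: no reallocation
        }
    }
    ```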

    On another topic about arrays, I really wish that the framework would allow us to create an array with all elements initialized to a particular value of our choosing at creation time. For example, if I want an array of, say, 100 bools, where all elements should be set to true, I have to instantiate the array and then initialize it with a loop; it would be really helpful to just be able to write "bool[] arr = new bool[100](true);", or something like that.
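    For what it's worth, LINQ can already get close to the requested syntax; a sketch using Enumerable.Repeat:

    ```csharp
    using System;
    using System.Linq;

    class Demo
    {
        static void Main()
        {
            // 100 bools, all initialized to true, with no explicit loop.
            bool[] arr = Enumerable.Repeat(true, 100).ToArray();

            Console.WriteLine(arr.Length);      // 100
            Console.WriteLine(arr.All(b => b)); // True
        }
    }
    ```

    (There is still a hidden loop and an extra copy inside ToArray(), of course, so this addresses the syntax rather than the performance wish.)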

  • I am an "engineer" engineer who has been writing water resources simulation software over a 30-year career.  I have moved through every operating system that MS has deployed and nearly every language (Fortran, Basic, Pascal, C, C++ and C#) striving to overcome every less efficient improvement that has been placed in its OSs and revisions to compilers.  MFC was a 5-year trip.  Now, with managed memory and all of the good things that have been done to protect me and my users from myself, I have concluded that MS has basically abandoned the scientific community in areas where speed is important.  Arguments made for distributed (team) development and tool development are well taken.  However, people who have high performance needs that arrays used to provide are basically out of luck.

  • I agree, Eric, that "completely mutable, and at the same time, fixed in size" is not often a useful combination.

    However, an immutable array (of, typically, immutable objects) is often precisely what I want, especially where performance is an issue.

    In such cases, the following works:

    struct ConstArray<T>
    {
        public ConstArray(T[] values) { this.values = values; }
        public T this[int index] { get { return values[index]; } }
        public int Length { get { return values.Length; } }
        private T[] values;
    }

    (It would be even better if the jitter could optimise index access in loops as it does for system arrays.)

    BTW, here's yet another (forlorn, I know) plea for the C++ const modifier, which, as has been pointed out, would fix this issue.

  • Would an empty array be safe to return?

  • Very recently on a project, I was having significant issues with System.IO.Directory.GetFiles, in which
