Arrays considered somewhat harmful


I got a moral question from an author of programming language textbooks the other day, requesting my opinion on whether or not beginner programmers should be taught how to use arrays.

Rather than actually answer that question, I gave him a long list of my opinions about arrays, how I use arrays, how we expect arrays to be used in the future, and so on. This gets a bit long, but like Pascal, I didn't have time to make it shorter. 

Let me start by saying when you definitely should not use arrays, and then wax more philosophical about the future of modern programming and the role of the array in the coming world.

You probably should not return an array as the value of a public method or property, particularly when the information content of the array is logically immutable. Let me give you an example of where we got that horridly wrong in a very visible way in the framework.  If you take a look at the documentation for System.Type, you'll find that just looking at the method descriptions gives one a sense of existential dread. One sees a whole lot of sentences like "Returns an array of Type objects that represent the constraints on the current generic type parameter." Almost every method on System.Type returns an array it seems. 

Now think about how that must be implemented. When you call, say, GetConstructors() on typeof(string), the implementation cannot possibly do this, as sensible as it seems.

public class Type
{
    private ConstructorInfo[] ctorInfos;
    public ConstructorInfo[] GetConstructors()
    {
        if (ctorInfos == null)
            ctorInfos = GoGetConstructorInfosFromMetadata();
        return ctorInfos;
    }
}

Why? Because now the caller can take that array and replace the contents of it with whatever they please. Returning an array means that you have to make a fresh copy of the array every time you return it. You get called a hundred times, you’d better make a hundred array instances, no matter how large they are. It’s a performance nightmare – particularly if, like me, you are considering using reflection to build a compiler. Do you have any idea how many times a second I try to get type information out of reflection?  Not nearly as many times as I could; every time I do it’s another freakin’ array allocation!
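The copy-per-call behavior described above is easy to observe; here is a minimal sketch (any type would do, typeof(string) is just an example):

```csharp
using System;
using System.Reflection;

static class Program
{
    static void Main()
    {
        // Reflection must hand back a fresh defensive copy on every call,
        // so two successive calls yield two distinct array instances.
        ConstructorInfo[] first = typeof(string).GetConstructors();
        ConstructorInfo[] second = typeof(string).GetConstructors();
        Console.WriteLine(ReferenceEquals(first, second)); // False: fresh allocation each time
    }
}
```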

The framework's designers were not foolish people; unfortunately, we did not have generic types in .NET 1.0. Clearly the sensible thing for GetConstructors() to return now is IList&lt;ConstructorInfo&gt;. You can build yourself a nice read-only collection object once, and then just pass out references to it as much as you want.

What is the root cause of this malaise? It is simple to state: The caller is requesting values.  The callee fulfills the request by handing back variables.

An array is a collection of variables. The caller doesn’t want variables, but it’ll take them if that’s the only way to get the values. But in this case, as in most cases, neither the callee nor the caller wants those variables to ever vary. Why on earth is the callee passing back variables then? Variables vary. Therefore, a fresh, different variable must be passed back every time, so that if it does vary, nothing bad happens to anyone else who has requested the same values.

If you are writing such an API, wrap the array in a ReadOnlyCollection<T> and return an IEnumerable<T> or an IList<T> or something, but not an array.  (And of course, do not simply cast the array to IEnumerable<T> and think you’re done!  That is still passing out variables; the caller can simply cast back to array!  Only pass out an array if it is wrapped up by a read-only object.)
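As a sketch of that advice, with hypothetical names (Catalog, GetItems): build the read-only wrapper once and hand out references to it, rather than exposing the array.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

public class Catalog
{
    private readonly string[] items = { "alpha", "beta", "gamma" };
    private ReadOnlyCollection<string> readOnlyItems;

    // Dangerous: the caller can cast the result back to string[] and mutate our private state.
    public IEnumerable<string> GetItemsUnsafely()
    {
        return items;
    }

    // Safe: the wrapper is allocated once; any attempt to mutate it throws NotSupportedException.
    public IList<string> GetItems()
    {
        if (readOnlyItems == null)
            readOnlyItems = Array.AsReadOnly(items);
        return readOnlyItems;
    }
}
```

Casting the result of GetItems() back to string[] fails with an InvalidCastException, because the caller holds a ReadOnlyCollection&lt;string&gt;, not the underlying array.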

That’s the situation at present. What are the implications of array characteristics for the future of programming and programming languages?

Parallelism Problems

The physics aspects of Moore’s so-called “Law” are failing, as they eventually must. Clock speeds have stopped increasing, transistor density has stopped increasing. The laws of thermodynamics and the Uncertainty Principle are seeing to that. But manufacturing costs per chip are still falling, which means that our only hope of Moore’s "Law" continuing to hold over the coming decades is to cram more and more processors into each box. 

We’re going to need programming languages that allow mere mortals to write code that is parallelizable to multiple cores.

Side-effecting change is the enemy of parallelization. Parallelizing in a world with observable side effects means locks, and locks mean choosing between implementing lock ordering and dealing with random crashes or deadlocks. Lock ordering requires global knowledge of the program. Programs are becoming increasingly complex, to the point where one person cannot reasonably and confidently have global knowledge. Indeed, we prefer programming languages to have the property that programs in them can be understood one part at a time, without having to swallow the whole thing in one gulp.

Therefore we tools providers need to create ways for people to program effectively without causing observable side effects.

Of all the "basic" types, arrays work most strongly against this goal. An array's whole purpose is to be a mass of mutable state. Mutable state is hard for both humans and compilers to reason about. It will be hard for us to write compilers in the future that generate performant multi-core programs if developers use a lot of arrays.

Now, one might reasonably point out that List<T> is a mass of mutable state too. But at least one could create a threadsafe list class, or an immutable list class, or a list class that has transactional integrity, or uses some form of isolation or whatever. We have an extensibility model for lists because lists are classes. We have no ability to make an “immutable array”. Arrays are what they are and they’re never going to change.
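To illustrate the extensibility point, here is a sketch of an "immutable array" built on the IList&lt;T&gt; extension point (ImmutableIntList is a hypothetical name); there is no way to get int[] itself to behave this way.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// A sketch of an immutable list built on the IList<T> extensibility point.
public class ImmutableIntList : IList<int>
{
    private readonly int[] values;

    // Defensively copy so later changes to the caller's array cannot leak in.
    public ImmutableIntList(params int[] values) { this.values = (int[])values.Clone(); }

    public int this[int index]
    {
        get { return values[index]; }
        set { throw new NotSupportedException("list is immutable"); }
    }
    public int Count { get { return values.Length; } }
    public bool IsReadOnly { get { return true; } }
    public bool Contains(int item) { return Array.IndexOf(values, item) >= 0; }
    public int IndexOf(int item) { return Array.IndexOf(values, item); }
    public void CopyTo(int[] array, int index) { values.CopyTo(array, index); }
    public IEnumerator<int> GetEnumerator() { return ((IEnumerable<int>)values).GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // Mutating members exist only to satisfy the interface; all of them throw.
    public void Add(int item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public void Insert(int index, int item) { throw new NotSupportedException(); }
    public bool Remove(int item) { throw new NotSupportedException(); }
    public void RemoveAt(int index) { throw new NotSupportedException(); }
}
```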

Conceptual Problems

We want C# to be a language in which one can draw a line between code that implements a mechanism and code that implements a policy.

The “C” programming language is all about mechanisms. It lays bare almost exactly what the processor is actually doing, providing only the thinnest abstraction over the memory model. And though we want you to be able to write programs like that in C#, most of the time people should be writing code in the “policy” realm. That is, code that emphasizes what the code is supposed to do, not how it does it.

Coding which is more declarative than imperative, coding which avoids side effects, coding which emphasizes algorithms and purposes over mechanisms: that kind of coding is the future in a world of parallelism. (And you'll note that LINQ is designed to be declarative, to abstract strongly away from mechanisms, and to be free of side effects.)

Arrays work against all of these factors. Arrays demand imperative code, arrays are all about side effects, arrays make you write code which emphasizes how the code works, not what the code is doing or why it is doing it. Arrays make optimizing for things like “swapping two values” easy, but destroy the larger ability to optimize for parallelism.

Practical Problems

And finally, given that arrays are mutable by design, the way an array restricts that mutability is deeply weird. All the contents of the collection are mutable, but the size is fixed.  What is up with that? Does that solve a problem anyone actually has?

For this reason alone I do almost no programming with arrays anymore. Arrays simply do not model any problem that I have at all well – I rarely need a collection which has the rather contradictory properties of being completely mutable, and at the same time, fixed in size. If I want to mutate a collection it is almost always to add something to it or remove something from it, not to change what value an index maps to.

We have a class or interface for everything I need. If I need a sequence I’ll use IEnumerable<T>, if I need a mapping from contiguous numbers to data I’ll use a List<T>, if I need a mapping across arbitrary data I’ll use a Dictionary<K,V>, if I need a set I’ll use a HashSet<T>. I simply don’t need arrays for anything, so I almost never use them. They don’t solve a problem I have better than the other tools at my disposal.
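A sketch of that toolbox in use (the data here is made up):

```csharp
using System.Collections.Generic;

static class Toolbox
{
    // A sequence: values produced lazily, one at a time.
    static IEnumerable<int> Squares(int count)
    {
        for (int i = 0; i < count; i++)
            yield return i * i;
    }

    static void Main()
    {
        // A mapping from contiguous numbers to data.
        var names = new List<string> { "alpha", "beta", "gamma" };

        // A mapping across arbitrary keys.
        var population = new Dictionary<string, int> { { "alpha", 10 }, { "beta", 20 } };

        // A set: membership tests, no duplicates.
        var seen = new HashSet<string> { "alpha" };
    }
}
```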

Pedagogic Problems

It is important that beginning programmers understand arrays; it is an important and widely used concept. But it is also important to me that they understand the weaknesses and shortcomings of arrays. In almost every case, there is a better tool to use than an array.

The difficulty, pedagogically, is that it is hard to discuss the merits of those tools without already having covered concepts like classes, interfaces, generics, asymptotic performance, query expressions, and so on. It's a hard problem for the writer and for the teacher. Fortunately for me, it's not a problem that I personally have to solve.

  • Care to share the details of "...considering using reflection to build a compiler..."?

  • ReadOnlyCollection&lt;T&gt; returns references to objects. The objects' members can still be changed; we just can't add, remove, or replace references within the ReadOnlyCollection. So returning this collection still does not guarantee a 100% read-only collection.

  • I mean add, remove, or change the references of objects already inside the collection.

  • @Thomas,

    As I'm sure Eric would say (because he has so many times before), you cannot stop someone accessing private data if they have full security access; in that case you have to assume that they have access to all of the memory in the process. What matters is protecting data that is being passed to code that does NOT have full security access, in particular code that cannot use reflection. In that case, accessing private data inside your class would not be possible, but casting your return value from an IEnumerable to an array would be perfectly legal, and would give that untrusted code mutable access to your private data.

  • @chris

    The CLR doesn't have a comprehensive 'const' semantic like C++, and in order to maintain interoperability with other programming languages, including C#, it probably never will.  

    That said, you can apply attributes to just about everything, including return values.  You can invent 'const', or use an existing attribute for that purpose.  You could then extend each programming language to require it, and add runtime checks on the IL of callers to ensure they don't perform non-const modifications to const return values, references, and arguments.  And you could extend existing languages -- writing a C# compiler that issues errors if you violate const correctness.  

    But that's a lot of work.


    I appreciate that you'd like to return read-only proxies in many cases instead of arrays, especially in 'getters'.  But Joe Duffy pointed out at (link) that proxies, such as enumerators, are often more expensive than they seem on the current JITter, more expensive than such proxies are in compiled languages (I'm thinking of vector::const_iterator, for instance).  

    "In almost every case, there is a better tool to use than an array" -- i'd agree.  It's unfortunate that there's a perf penalty to those better tools, but often it can be justified by the benifits (in product reliability, reduced development costs, bugs found earlier) of proxies.


  • Indeed. Attributes are the best you can do in C#. If you were writing your own language what you'd probably want to do is use an optional type modifier rather than an attribute; in C++, adding "constness" changes the _signature_ of a method.  In the CLR, adding an attribute does not change a signature but adding an optional type modifier does. (This is how managed C++ does constness.)

  • > Care to share the details of "...considering using reflection to build a compiler..."?


    csc foo.cs /r:bar.dll

    Somehow the compiler has to get type information out of bar.dll in order to compile foo.cs.  Reflection seems like a good candidate.  Today the compiler uses the unmanaged metadata reader interfaces, but I would like to write any new tools we come up with in C#, which means that suddenly System.Reflection becomes a reasonable choice. Not the only choice, but a reasonable one.

  • "transistor density has stopped increasing"

    Do you mean it will stop increasing in the future?  Aren't Intel and AMD both talking about new process generations with higher transistor density?  For instance: (link) and (link)

    45 nm is now shipping and 32nm and 22nm are both planned.

    On the issue of arrays, their use can dramatically increase performance in multi-processor, shared memory architectures.  They give you locality.  You allocate them all at once, so the memory is contiguous.  This means cache hits are far more likely with array access.

    If you are designing producer/consumer queues in C++, you can align the queues to start on a processor cache line, and control the enqueue and dequeue processes such that the cache lines do not "ping pong" between processors.  So yes, that solves a problem that I had.

    Arrays are generally preferred over linked structures when performance is a concern.

    Here is a typical passage from a processor optimization manual:

    "As you can see, traversing an array gives the processor ample opportunity for concurrency in generating memory requests. Overlapping memory operations results in lower effective latency per item. In contrast, traversing a linked list is fundamentally sequential. The penalty is especially bad when the array elements are randomly scattered in memory: the full memory latency cost is incurred for accessing each individual item.

    The simple conclusion is that arrays should be preferred over linked lists, when they are capable of doing what's needed. This is especially true when the data will typically be accessed in a sequential manner."

    What I'm saying is that I've found arrays to be a critical tool to solve performance issues and of course you have to be careful to use them correctly.

  • OK, so let's suppose that transistor density in the very highest end machines quadruples in the next generation. Does anyone seriously believe that it will ever quadruple again? Maybe density is still increasing, but the end is nigh.

    And as for performance -- sure, there are plenty of applications where performance is gated on minimizing cache misses. If careful testing of your program against user-focused metrics indicates that cache misses are your hot spot AND your performance is unacceptable then sure, consider using an array.  But understand that doing so comes with costs of its own, namely, having a whole pile of mutable state which must be carefully protected from misuse.

  • @russelh:

    If you need to minimize cache misses, it's usually enough to back your IList with an array; this brings in the needed memory-management optimization without sacrificing the elegance of the interface.


  • I know this wasn't the focus of your post, but doesn't Moore's Law often taper off from time to time, only to kick back into action via some massive technological leap?  It happened with transistors coming from vacuum tubes, and it'll probably happen again with one of the newer areas of research, like quantum chips or DNA or whatever.  Maybe in the literal sense, transistor density won't continue to increase forever, but computing power in general probably will, and not just through the addition of more cores.

    On topic: as others have commented, it seems that 9 times out of 10, what I really need is a read-only, indexed sequence.  I always think, hey, let's make things as general as possible by using an IEnumerable<T>, and then realize later on that I need indexed access, and have to change it to an IList<T>, which never feels quite right because its "immutability" is just in the implementation.  And using an actual ReadOnlyCollection<T> instance never feels quite right either, I don't know why.

    'Course I realized while writing this that it would be trivially easy to write an IReadOnlyList<T>, a wrapper class, and a couple of extension methods for conversion; don't know why this never occurred to me before.  Maybe because it's a stupid idea and I haven't yet realized *that*. :-)

  • Moore's Law is badly named. It really ought to be "Moore's Observation"; just because a particular economic trend has held for a few years does not make it a law of nature.

    It is the case that every time a particular technology has been tapped out, a new technology has come along to replace it.

    Similarly, every time we've gotten close to running out of oil, either new reserves have been discovered, or new technologies for extending existing reserves have been found.  

    It might well be the case that future technologies open up massive new vistas in cheap computational power. And it might be the case that all the oil we'll ever need is available for cheap in some place that we've not looked for it yet.

    But there is no law of physics that is going to ensure that either case comes to pass, and it would therefore be unwise to assume that this will happen. If new magical technologies are invented, great. I'm pro that. But I'm going to make the conservative assumption that refinements of old technology are what we've got to work with now and going forward.  

    Or, look at it another way. I cannot possibly design tools for an unknown future that is radically different from today. I could guess, and likely be wrong, or I could design tools for the most likely future.

  • > Will we be seeing a "threads considered somewhat harmful" post soon?

    No, because I already wrote it in 2004:

  • I was probably overreacting, or focusing on "using" an array vs. "handing out" an array.  I guess StringBuilder is a decent example of encapsulating an array in a useful way.

    @Aaron G:

    When I was in graduate school in the 1991-1992 time frame, optical computing was supposed to be the next big thing.  No one (at least none of my professors) thought CMOS technology would get as far as it has.  I'm not disagreeing with Eric.  Clearly, the trend now is in the direction of more and more cores (threads) per processor with constant or decreasing throughput per thread.

  • Who says arrays must be mutable? The point is moot if they are not.

    Apart from C# being so deficient with constness (which no amount of runtime machinery will ever fix; what's needed are the compile-time checks and optimisations that are slowly catching up in compilers), it is very deficient in poor-man's sequence abstraction.

    There is no way to do reverse iteration, there is ugliness of ReadOnlyCollection, there are huge, huge perf hits across the board because 'managed' brings 'overhead'. There are hacks on foreach compilation time and time again, there are hacks on generic lists in VM implementation and IL.

    Seems to me the entire post has gone off on a tangent, criticising something the CLR adopted in memory management, only to lose itself in another area of non-coherency.

    Proper sequences (compile-time, meta-programming and more), and vectors are brilliant, mutable or not. Arrays are just a representation you can swap however you wish.

    That is the real problem, the CLR, not the array type.
