References are not addresses

[NOTE: Based on some insightful comments I have updated this article to describe more clearly the relationships between references, pointers and addresses. Thanks to those who commented.]

I review a fair number of C# books; in all of them of course the author attempts to explain the difference between reference types and value types. Unfortunately, most of them do so by saying something like "a variable of reference type stores the address of the object". I always object to this. The last time this happened the author asked me for a more detailed explanation of why I always object, which I shall share with you now:

We have the abstract concept of "a reference". If I were to write about "Beethoven's Ninth Symphony", those two-dozen characters are not a 90-minute long symphonic masterwork with a large choral section. They're a reference to that thing, not the thing itself. And this reference itself contains references -- the word "Beethoven" is not a long-dead famously deaf Romantic Period composer, but it is a reference to one.

Similarly in programming languages we have the concept of "a reference" distinct from "the referent".

The inventor of the C programming language, oddly enough, chose to not have the concept of references at all. Rather, Ritchie chose to have "pointers" be first-class entities in the language. A pointer in C is like a reference in that it refers to some data by tracking its location, but there are more smarts in a pointer; you can perform arithmetic on a pointer as if it were a number, you can take the difference between two pointers that are both in the interior of the same array and get a sensible result, and so on.

Pointers are strictly "more powerful" than references; anything you can do with references you can do with pointers, but not vice versa. I imagine that's why there are no references in C -- it's a deliberately austere and powerful language.

The down side of pointers-instead-of-references is that pointers are hard for many novices to understand, and make it very very very easy to shoot yourself in the foot.

Pointers are typically implemented as addresses. An address is a number which is an offset into the "array of bytes" that is the entire virtual address space of the process (or, sometimes, an offset into some well-known portion of that address space -- I'm thinking of "near" vs. "far" pointers in win16 programming. But for the purposes of this article let's assume that an address is a byte offset into the whole address space.) Since addresses are just numbers you can easily perform pointer arithmetic with them.

Now consider C#, a language which has both references and pointers. There are some things you can only do with pointers, and we want to have a language that allows you to do those things (under carefully controlled conditions that call out that you are doing something that possibly breaks type safety, hence "unsafe".)  But we also do not want to force anyone to have to understand pointers in order to do programming with references.

We also want to avoid some of the optimization nightmares that languages with pointers have. Languages with heavy use of pointers have a hard time doing garbage collection, optimizations, and so on, because it is infeasible to guarantee that no one has an interior pointer to an object, and therefore the object must remain alive and immobile.

For all these reasons we do not describe references as addresses in the specification. The spec just says that a variable of reference type "stores a reference" to an object, and leaves it completely vague as to how that might be implemented. Similarly, a pointer variable stores "the address" of an object, which again, is left pretty vague. Nowhere do we say that references are the same as addresses.

So, in C# a reference is some vague thing that lets you reference an object. You cannot do anything with a reference except dereference it, and compare it with another reference for equality. And in C# a pointer is identified as an address.
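As a rough sketch of those two operations: the Composer type and the method names below are invented for illustration and do not come from any library. The commented-out lines at the end are the kind of thing that does not compile, because a reference is not a number.

```csharp
using System;

// Hypothetical type used only for this illustration.
class Composer
{
    public string Name = "";
}

static class ReferenceOperations
{
    // Dereference: follow the reference to reach the object's members.
    public static string NameOf(Composer c) => c.Name;

    // Compare: ask whether two references refer to the same object.
    public static bool SameObject(object a, object b) => object.ReferenceEquals(a, b);

    static void Main()
    {
        Composer first = new Composer { Name = "Beethoven" };
        Composer alias = first;                                // copies the reference only
        Composer other = new Composer { Name = "Beethoven" };  // a distinct object

        Console.WriteLine(NameOf(alias));             // Beethoven
        Console.WriteLine(SameObject(first, alias));  // True
        Console.WriteLine(SameObject(first, other));  // False

        // Neither of these compiles; a reference supports no arithmetic
        // and no conversion to an integer:
        // Composer next = first + 1;
        // long address = (long)first;
    }
}
```

Note that nothing in this program reveals, or depends on, how the reference is represented at runtime.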

By contrast with a reference, you can do much more with a pointer that contains an address. Addresses can be manipulated mathematically; you can subtract one from another, you can add integers to them, and so on. Their legal operations indicate that they are "fancy numbers" that index into the "array" that is the virtual address space of the process.
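A minimal sketch of those "fancy number" operations, assuming the code is compiled with unsafe code enabled (for example via the AllowUnsafeBlocks project setting). The class and method names are made up for illustration. The memory comes from stackalloc, so it is not under the garbage collector's control and nothing needs to be pinned.

```csharp
using System;

static class PointerArithmetic
{
    // Returns the value reached by pointer addition and the
    // element distance between the two pointers.
    public static unsafe (int Value, long Distance) Demo()
    {
        int* numbers = stackalloc int[4];
        for (int i = 0; i < 4; i++)
        {
            numbers[i] = i * 10;          // 0, 10, 20, 30
        }

        int* first = numbers;
        int* third = first + 2;           // add an integer to a pointer
        long distance = third - first;    // subtract two pointers; counted in elements

        return (*third, distance);
    }

    static void Main()
    {
        var (value, distance) = Demo();
        Console.WriteLine(value);     // 20
        Console.WriteLine(distance);  // 2
    }
}
```

None of these operations is available on a variable of reference type.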

Now, behind the scenes, the CLR actually does implement managed object references as addresses to objects owned by the garbage collector, but that is an implementation detail. There's no reason why it has to do that other than efficiency and flexibility. C# references could be implemented by opaque handles that are meaningful only to the garbage collector, which, frankly, is how I prefer to think of them. That the "handle" happens to actually be an address at runtime is an implementation detail which I should neither know about nor rely upon. (Which is the whole point of encapsulation; the client doesn't have to know.)

I therefore have three reasons why authors should not explain that "references are addresses".

1) It's close to a lie. References cannot be treated as addresses by the user, and in fact, they do not necessarily contain an address in the implementation. (Though our implementation happens to do so.)

2) It's an explanation that explains nothing to novice programmers. Novice programmers probably do not know that an "address" is an offset into the array of bytes that is all process memory. To understand what an "address" is with any kind of depth, the novice programmer already has to understand pointer types and addresses -- basically, they have to understand the memory model of many implementations of C. This is one of those "it's clear only if it's already known" situations that are so common in books for beginners.

3) If these novices eventually learn about pointer types in C#, their confused understanding of references will probably make it harder, not easier, to understand how pointers work in C#. The novice could sensibly reason "If a reference is an address and a pointer is an address, then I should be able to cast any reference to a pointer in unsafe code, right?"  But you cannot.

If you think of a reference as actually being an opaque GC handle then it becomes clear that to find the address associated with the handle you have to somehow "fix" the object. You have to tell the GC "until further notice, the object with this handle must not be moved in memory, because someone might have an interior pointer to it". (There are various ways to do that which are beyond the scope of this screed.)
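Purely as an illustration of one such way, here is a sketch using the C# fixed statement on an array. FixedDemo and SumViaPointer are hypothetical names, and the code assumes unsafe code is enabled at compile time.

```csharp
using System;

static class FixedDemo
{
    public static unsafe int SumViaPointer(int[] values)
    {
        int sum = 0;

        // int* p = values;  // does not compile: a reference cannot be converted to a pointer

        fixed (int* p = values)  // pins the array so the GC will not move it inside this block
        {
            for (int i = 0; i < values.Length; i++)
            {
                sum += p[i];     // ordinary pointer indexing while the array is pinned
            }
        }                        // the array is unpinned here; p must not be used afterwards

        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumViaPointer(new[] { 1, 2, 3 }));  // 6
    }
}
```

The pointer is only meaningful while the object is held in place, which is exactly the distinction the "opaque handle" view makes visible.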

Basically what I'm getting at here is that an understanding of the meaning of "addresses" in any language requires a moderately deep understanding of the memory model of that language. If an author does not provide an explanation of the memory model of either C or C#, then explaining references in terms of addresses becomes an exercise in question begging. It raises more questions than it answers.

This is one of those situations where the author has the hard call of deciding whether an inaccurate oversimplification serves the larger pedagogic goal better than an accurate digression or a vague hand-wave.

In the counterfactual world where I am writing a beginner C# book, I would personally opt for the vague hand-wave.  If I said anything at all I would say something like "a reference is actually implemented as a small chunk of data which contains information used by the CLR to determine precisely which object is being referred to by the reference". That's both vague and accurate without implying more than is wise.

  • Nice explanation.  The phrase "objects owned by the garbage collector" caught my eye.  Can you recommend/suggest any resources that dig into this concept of managed objects being under the ownership of the GC?  I had always thought of the GC knowing about or managing objects rather than making the mental shift to it owning the objects as such.  This probably shows my lack of understanding about the GC.

  • @Adrian:

    MSDN magazine had a series on the GC which was pretty good:

    http://msdn.microsoft.com/en-us/magazine/bb985010.aspx

  • I actually disagree. "Reference is an address" is a simple but powerful mental model.

    It is sufficient to explain semantics of assignment, field modification, method argument passing, etc. Without it, the novice has to memorize a bunch of weird rules.

    Sure, the address is not static, and it may not even be implemented as such. By the time the user cares, they will be able to grok it.

  • I think the use of the term address is reasonable.  It's not a perfect analogy, but I'd compare it more to a street address or an IP address rather than a pointer.  Really though, there's hardly any semantic distinction: both are intrinsically meaningless short bits of identifying information whose sole purpose is to find other information.

    It's not a coincidence that references are implemented as pointers; since they both serve almost the same purpose on the same architecture, it's natural they'll be implemented almost identically.  The implementation of references as pointers isn't an implementation detail, it's inherent in what references are.

    The language tries to protect you from inadvertent mistakes via the fixed statement, but it won't prevent a fixed pointer from being (unwisely) used outside of the fixed scope; and direct usage of the GCHandle type makes this kind of conversion even less obvious.

    In terms of pedagogy, I think the value of the inaccurate oversimplification is that it is indeed a simplification.  It's hard enough as is to learn new things, but the larger the number of new concepts, the harder it becomes.  If you need to explain both references and pointers, then treating the commonalities first might be easier to grasp than focusing on the distinctions.  Then again, you probably don't need to explain _both_ to a real novice anyhow.

    I suspect most readers of (beginner) programming books aren't first time programmers, but people that have seen similar constructs in other languages.  If that's your readership, the distinction might be a useful detail.

    To a complete beginner, an address, a pointer, a reference, an identifying description are all going to appear to be very similar.  A vague hand wave might be confusing, an oversimplification might be misleading; there's no free lunch.  But when it comes to C#, which doesn't allow pointer arithmetic on references (at least not without the unsafe and fixed keywords), wouldn't those pointer-oriented keywords be a better place to clarify the distinction?  Confusing references and pointers seems harmless enough...

  • Well, would it help to point out to these authors that section 25 of the C# spec (unsafe code) is a conditional part of the standard, and that conforming implementations are not required to implement unsafe code, pointers, or any of the other paraphernalia surrounding it? From this it naturally follows that MS's particular implementation and CLR are certainly not the only targets to consider.

    Further, it should be possible to compile/run C# for target environments where pointers simply do not exist and references *are* opaque first-class objects. Example environments include the Java VM, lisp machines, Parrot, or even a javascript engine with a C# compiler equivalent to Google's GWT Java-to-Javascript compiler.[0]

    I find it somewhat scary that there are people writing books on C# who do not understand this.

    [0] http://en.wikipedia.org/wiki/Google_Web_Toolkit

  • It's all a bit of a mine-field.

    Depending on your background, you might assume "pointer" to mean the C concept of a numerical offset from the beginning of an address-space, or you might assume it to be an opaque token like in Pascal. (Although I think most popular Pascals gave in and allowed pointer arithmetic.) You might understand "reference" to be an opaque token as in .net, or "something that's maybe sort of like a dereferenced pointer" as in C++. "Address" might have a precise technical meaning or simply mean "a piece of information that can be used to unambiguously find a thing".

    You're definitely right that you can't explain references in terms of addresses without first explaining quite what you mean by addresses!

    I often wonder how I'd do at learning programming from scratch in a modern language like C#. Although I now have a good understanding of reference and value types, I feel that I got there by first understanding value types and pointers (without pointer arithmetic) in languages like Pascal, and then understanding reference types by analogy to pointers. To me reference semantics seem quite advanced to learn straight away, and I wonder how they are best explained to those who have never learned another programming language.

  • "Pointers are strictly "more powerful" than references; anything you can do with references you can do with pointers, but not vice versa. I imagine that's why there are no references in C -- it's a deliberately austere and powerful language."

    Can you give an example of something that can be done with pointers and not with references, except indexing into arrays?

    Sure. The obvious example is "pointers to pointers". In C# you can have a variable which contains a reference to an object -- one level of indirection. And you can pass the variable as an argument of a method that has a ref parameter, so that's two levels of indirection. But with pointers you can get arbitrarily deep levels of indirection; you can have a pointer to a pointer to a pointer to a pointer to an int if you want.

    Another example is being able to compare two references for reference equality. In C# you can take two references and use System.Object.ReferenceEquals to test to see if they refer to the same object. But consider the following:

    void M1(ref int x, ref int y) { ... }
    void M2(int* x, int* y) { ... }

    int abc = 123;
    M1(ref abc, ref abc);
    M2(&abc, &abc);

    In M1 there is no code that you can write in C# that can tell you whether you are in this case or not -- you have no way of knowing if x and y are refs to the same variable. In M2 you can just compare the pointers for equality and you'll know.

    -- Eric

  • > A pointer in C is little more than a number which refers to a specific index into an array the size of all memory available.

    Since we're nitpicking here already, I have to point out that this is incorrect - in ISO C & C++, a pointer is not an index into "an array the size of all memory available" - in fact, it is intimately tied to the object, or an array of objects, from which it was produced, and any use of it outside that scope usually leads to U.B., with a few exceptions. For example, you can't take addresses of two locals and compare them with < or > operators, nor can you subtract one from another - both are U.B. according to the spec. Technically, it is entirely valid for a compliant C implementation to implement pointers as "fat" objects that store some handle to the object (or array) from which the pointer was created, and the index within; and complain loudly about any operations that are U.B. The fact that pointers are plain memory addresses in most C implementations out there is also strictly an implementation detail.

    Excellent point. (Though an interesting thing, while we're nitpicking, is that saying "you cannot do this because it is undefined behaviour" is logically inconsistent. If you cannot do it then that's because the compiler is stopping you from doing it. Undefined behaviours only happen when you can do something that leads to the undefined behaviour. Really I think you meant to make the moral statement "you should not because it is undefined behaviour".) But I fully agree with your larger point. There is a difference between the memory model handed to you by most implementations of C and the memory model defined by the language spec.

  • I entirely agree with the main premise of the post. My problem is that the example I find easiest to explain the difference between objects and references involves the word "address". I like to use the example of a house - you can write its address on a piece of paper, then copy it and give it to someone else, and the house itself isn't copied. If you give someone the address of your house and they go and move the furniture, when you get home you can see the furniture has been moved too. It works in various ways.

    It's not surprising that this is the case, because "address" means more than "location in memory" - the computing term was chosen to mirror the real-world term, rather than vice versa. We use "address" for many things: email address, web address etc.

    I'm still looking for an equally good example which doesn't use the word "address".

  • There are so many explanations and definitions of references, pointers and handles that have come out over the years, but there seems to be no general consensus on which definitions are correct and comprehensive.

    References have been defined as being  "addresses". Pointers have been defined as being "volatile references to memory locations", references have been defined as being constrained pointers, handles as references to resources or opaque pointers, now references are defined as being opaque handles. Pointers have been defined as being indexes into the memory array, arrays have been defined as higher abstractions over pointers. There are clearly some circles here ... after all, all these at the base just are numbers of well defined lengths in binary format. It matters how you treat those numbers as values or as indirections to values.

    To a newcomer, "a reference is an address to an object" is clearer than "an opaque handle". What is a handle to a newcomer? Does he know what that is? Does he know what opaque really means in the context of memory management or object management?

    To an experienced programmer an "opaque handle" is a better definition, because it says a lot more about the behaviors involved, but to a newcomer explaining references as handles is very confusing; you can't reference an abstract concept (handle) to explain a reference :).

    ... my 2 cents.

  • Rereading the post, the first paragraph struck me - I think I skimmed over it on my first reading. I wondered how I'd escaped this chastisement when Eric reviewed C# in Depth. The obvious chapter which would have contained the problem is chapter 2, where I go over a few fundamentals. I've looked over that chapter again just now, and indeed I don't claim that a reference is an address. Unfortunately, I don't actually *define* a reference at all - I just give examples and analogies.

    Ho hum. Sins of omission instead of commission? I guess I could get away with it for C# in Depth. If I ever write a beginners' book, it'll be a different matter...

  • Perhaps, at least once in a lifetime, it is useful not to shoe-horn the entire managed mentality into what something is or is not. The same applies to pushing managed idioms as the only correct solution to world hunger (while it leaks so much memory for a few TextBoxes, Buttons and Tabs especially... )

    It has been well known for a few decades that references are more optimisable, btw.

  • > I'm still looking for an equally good example which doesn't use the word "address".

    Dogs and dog leads. Many people can have a lead to the same dog. If one of those people yanks the lead, the same dog moves for everyone.

    Dogs can hold (in their mouths?) the leads of other dogs.

    Garbage collecting is the stray dog van!

    This needs work I know :)

  • Very simple and clear explanation of differences between references and pointers from the classical OO point of view.

    Yet, the problem is that OOP completely hides the mechanism of references and object management while the notion of address/pointer does not exist at all. In this sense, it does not matter if we refer to an object representative as a reference, a pointer, an address, a surrogate or a handle - the only function they guarantee is providing access to the represented object.

    In a wider scope, all the above terms are used in different contexts to emphasize some special feature, function or use pattern. But theoretically, there are only two types of identifiers: 1) an object representative like a reference, surrogate or handle, and 2) a field representative like an offset. The main source of confusion is that frequently one type can be used for both purposes. Say, a name can represent either an object or a field. In the context of this post, a pointer can be used to represent an object or an offset to a field.

    Unfortunately, this difference is not emphasized in OOP just because it is not needed -- we have only primitive references and primitive fields. (Primitive means that they are generated and processed by the compiler, interpreter or run-time environment only.) Concept-oriented programming (CoP) is an emerging technology [1,2,3] which tries to fix this (and other) restrictions of OOP by making references first-class citizens of the object world. In particular, a new programming construct, called concept (http://conceptoriented.org/wiki/Concept_%28concept-oriented_programming%29), is used instead of classes and it is defined as a couple of two classes: one reference class (for describing the structure and functions of references) and one object class.

    [1] Informal Introduction into the Concept-Oriented Programming: http://conceptoriented.org/papers/CopInformalIntroduction.html

    [2] Concept-Oriented Programming wiki article: http://conceptoriented.org/wiki/Concept-oriented_programming

    [3] Concept-Oriented Programming Blog: http://conceptoriented.org/blogs/cob/

  • @Jon Skeet: the House sample is perfect, because the address is not the only way to get a reference to it. You could have standard coordinates (latitude, longitude) or custom coordinates (common start point, direction, distance).

    Address, standard and custom coordinates are three valid implementations for referencing a house.
