References are not addresses

[NOTE: Based on some insightful comments I have updated this article to describe more clearly the relationships between references, pointers and addresses. Thanks to those who commented.]

I review a fair number of C# books; in all of them of course the author attempts to explain the difference between reference types and value types. Unfortunately, most of them do so by saying something like "a variable of reference type stores the address of the object". I always object to this. The last time this happened the author asked me for a more detailed explanation of why I always object, which I shall share with you now:

We have the abstract concept of "a reference". If I were to write about "Beethoven's Ninth Symphony", those two-dozen characters are not a 90-minute long symphonic masterwork with a large choral section. They're a reference to that thing, not the thing itself. And this reference itself contains references -- the word "Beethoven" is not a long-dead famously deaf Romantic Period composer, but it is a reference to one.

Similarly in programming languages we have the concept of "a reference" distinct from "the referent".

The inventor of the C programming language, oddly enough, chose to not have the concept of references at all. Rather, Ritchie chose to have "pointers" be first-class entities in the language. A pointer in C is like a reference in that it refers to some data by tracking its location, but there are more smarts in a pointer; you can perform arithmetic on a pointer as if it were a number, you can take the difference between two pointers that are both in the interior of the same array and get a sensible result, and so on.

Pointers are strictly "more powerful" than references; anything you can do with references you can do with pointers, but not vice versa. I imagine that's why there are no references in C -- it's a deliberately austere and powerful language.

The downside of pointers-instead-of-references is that pointers are hard for many novices to understand, and make it very, very easy to shoot yourself in the foot.

Pointers are typically implemented as addresses. An address is a number which is an offset into the "array of bytes" that is the entire virtual address space of the process (or, sometimes, an offset into some well-known portion of that address space -- I'm thinking of "near" vs. "far" pointers in win16 programming. But for the purposes of this article let's assume that an address is a byte offset into the whole address space.) Since addresses are just numbers you can easily perform pointer arithmetic with them.

Now consider C#, a language which has both references and pointers. There are some things you can only do with pointers, and we want to have a language that allows you to do those things (under carefully controlled conditions that call out that you are doing something that possibly breaks type safety, hence "unsafe".)  But we also do not want to force anyone to have to understand pointers in order to do programming with references.

We also want to avoid some of the optimization nightmares that languages with pointers have. Languages with heavy use of pointers have a hard time doing garbage collection, optimizations, and so on, because it is infeasible to guarantee that no one has an interior pointer to an object, and therefore the object must remain alive and immobile.

For all these reasons we do not describe references as addresses in the specification. The spec just says that a variable of reference type "stores a reference" to an object, and leaves it completely vague as to how that might be implemented. Similarly, a pointer variable stores "the address" of an object, which again, is left pretty vague. Nowhere do we say that references are the same as addresses.

So, in C# a reference is some vague thing that lets you reference an object. You cannot do anything with a reference except dereference it, and compare it with another reference for equality. And in C# a pointer is identified as an address.
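To make that short list of operations concrete, here is a minimal sketch. The names ReferenceDemo, Box and SameObject are invented for this illustration and appear nowhere in the specification; the only library call used is object.ReferenceEquals.

```csharp
using System;

public static class ReferenceDemo
{
    // A hypothetical reference type used only for this illustration.
    public sealed class Box
    {
        public int Value;
    }

    // Returns true when both arguments refer to the very same object.
    public static bool SameObject(object a, object b)
    {
        return object.ReferenceEquals(a, b);
    }

    public static void Main()
    {
        Box first = new Box { Value = 1 };
        Box alias = first;                  // copies the reference, not the object
        Box other = new Box { Value = 1 };  // a distinct object with equal contents

        alias.Value = 42;                   // dereference: the write is visible through 'first'

        Console.WriteLine(first.Value);               // 42
        Console.WriteLine(SameObject(first, alias));  // True
        Console.WriteLine(SameObject(first, other));  // False

        // There is no 'first + 1' and no 'first - other':
        // a reference supports no arithmetic at all.
    }
}
```

Nothing in this program can observe where the Box objects live in memory, which is exactly the property that lets the implementation remain unspecified.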

By contrast with a reference, you can do much more with a pointer that contains an address. Addresses can be manipulated mathematically; you can subtract one from another, you can add integers to them, and so on. Their legal operations indicate that they are "fancy numbers" that index into the "array" that is the virtual address space of the process.
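A small sketch of those "fancy number" operations follows. It assumes the project is compiled with unsafe code enabled (for example, AllowUnsafeBlocks in the project file); PointerDemo, ElementDistance and ReadAt are names made up for this example.

```csharp
using System;

public static class PointerDemo
{
    // Subtracting two pointers of the same type yields an element count (a long).
    public static unsafe long ElementDistance(int* from, int* to)
    {
        return to - from;
    }

    // Adding an integer to a pointer steps forward by whole elements.
    public static unsafe int ReadAt(int* start, int index)
    {
        return *(start + index);
    }

    public static unsafe void Main()
    {
        // Stack memory never moves, so no pinning is needed here.
        int* buffer = stackalloc int[4];
        for (int i = 0; i < 4; i++)
        {
            buffer[i] = i * 10;
        }

        int* third = buffer + 2;

        Console.WriteLine(*third);                          // 20
        Console.WriteLine(ElementDistance(buffer, third));  // 2
        Console.WriteLine((long)third - (long)buffer);      // 8, because sizeof(int) is 4
    }
}
```

Every one of these operations is rejected by the compiler when applied to a variable of reference type, which is the practical difference between the two concepts.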

Now, behind the scenes, the CLR actually does implement managed object references as addresses to objects owned by the garbage collector, but that is an implementation detail. There's no reason why it has to do that other than efficiency and flexibility. C# references could be implemented by opaque handles that are meaningful only to the garbage collector, which, frankly, is how I prefer to think of them. That the "handle" happens to actually be an address at runtime is an implementation detail which I should neither know about nor rely upon. (Which is the whole point of encapsulation; the client doesn't have to know.)

I therefore have three reasons why authors should not explain that "references are addresses".

1) It's close to a lie. References cannot be treated as addresses by the user, and in fact, they do not necessarily contain an address in the implementation. (Though our implementation happens to do so.)

2) It's an explanation that explains nothing to novice programmers. Novice programmers probably do not know that an "address" is an offset into the array of bytes that is all process memory. To understand what an "address" is with any kind of depth, the novice programmer already has to understand pointer types and addresses -- basically, they have to understand the memory model of many implementations of C. This is one of those "it's clear only if it's already known" situations that are so common in books for beginners.

3) If these novices eventually learn about pointer types in C#, their confused understanding of references will probably make it harder, not easier, to understand how pointers work in C#. The novice could sensibly reason "If a reference is an address and a pointer is an address, then I should be able to cast any reference to a pointer in unsafe code, right?"  But you cannot.

If you think of a reference as actually being an opaque GC handle then it becomes clear that to find the address associated with the handle you have to somehow "fix" the object. You have to tell the GC "until further notice, the object with this handle must not be moved in memory, because someone might have an interior pointer to it". (There are various ways to do that which are beyond the scope of this screed.)
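For readers who want to see one of those ways, here is a hedged sketch using a pinned GCHandle from System.Runtime.InteropServices; the fixed statement in unsafe code is another route. PinningDemo and FirstByteViaAddress are hypothetical names chosen for this example, and the method assumes a non-empty array.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinningDemo
{
    // Pins 'data', reads its first byte through the raw address, then unpins.
    public static byte FirstByteViaAddress(byte[] data)
    {
        // Tell the GC: until further notice, do not move this array.
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            // Only now is there a stable address to ask for.
            IntPtr address = handle.AddrOfPinnedObject();
            return Marshal.ReadByte(address);
        }
        finally
        {
            // This is the "further notice": the GC may relocate the array again.
            handle.Free();
        }
    }

    public static void Main()
    {
        byte[] data = { 7, 8, 9 };
        Console.WriteLine(FirstByteViaAddress(data)); // 7
    }
}
```

The address is meaningful only between Alloc and Free. Holding on to it after the handle is freed is exactly the foot-shooting that references are designed to rule out.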

Basically what I'm getting at here is that an understanding of the meaning of "addresses" in any language requires a moderately deep understanding of the memory model of that language. If an author does not provide an explanation of the memory model of either C or C#, then explaining references in terms of addresses becomes an exercise in question begging. It raises more questions than it answers.

This is one of those situations where the author has the hard call of deciding whether an inaccurate oversimplification serves the larger pedagogic goal better than an accurate digression or a vague hand-wave.

In the counterfactual world where I am writing a beginner C# book, I would personally opt for the vague hand-wave.  If I said anything at all I would say something like "a reference is actually implemented as a small chunk of data which contains information used by the CLR to determine precisely which object is being referred to by the reference". That's both vague and accurate without implying more than is wise.

  • I am surprised to hear that many C# book authors say this. Really, they shouldn't be writing such a book if they don't know that references are not something as low level as memory address pointers. I never even had to be told or read there was a difference. There is no way any high level language would allow such a low level aspect to be used as often as references are used. Even if they never had to use IntPtr it still goes without saying a C# reference is of a higher level than a memory address pointer. I can understand defining such a thing is a little tricky in order to make a novice understand but that is what they signed up for when they decided to write such a book.

  • Growing up I studied both VB and C and always maintained a mental analog between ByRef in VB and pointer parameters in C. From a level of indirection standpoint these two mechanisms are equivalent (never mind all the more powerful things I could do with pointers that I never wanted to do). Maybe this is a flaw in my own mental model (having never really used C/C++, only studied them). And surely there is a lot of nuance that is either left out or, worse, implied by mixing the semantics of pointers, smart pointers, const pointers, C++ references, TypedReferences, Handles, etc, etc. And URLs and URIs are really different things.

    When I moved to VB.NET the analogy of reference types to Pointers made it pretty easy for me to understand them on some level versus value types. In this instance I think that inaccurate simplification is more useful than harmful though I can empathize with your desire to not muddle the facts. Perhaps it is better to say it backwards - that Reference is the highest level abstraction and that pointers are merely an address-based implementation of the concept and that .NET references are likewise an implementation of the pattern of referencing - that way you're defining a .NET concretization in terms of a common abstraction, rather than by another sibling concretization.

    As an aside, at this stage of the game I wonder if the .NET team made the right design choice in coupling val/ref stack/heap semantics to classes and structures and whether C++/CLI is more spot on by decoupling the kind of object from its storage semantics.

  • glad you got that off your chest? :)

  • I think what it all boils down to is this:

    A pointer, in every C and C++ implementation I've ever seen, is a memory address in the process memory space.  I suspect that pointers (unsafe as they may be) are implemented in exactly the same way in C#.

    Well, I certainly have seen such implementations. What about implementations of C that target 16 bit Windows running on x86? Do they have pointers which are "memory address in process memory space"? What does "process memory space" even mean in an operating system that doesn't have processes? And what does it mean when pointers come in different sizes?

    Remember "near" vs "far" pointers, segmented architectures, selectors and offsets? Pointers used to be weird, man. -- Eric

    A reference is a data structure (in many implementations, a simple pointer) that contains all the information necessary for the code to find the object it refers to.  In the case of C# and other CLR languages, the garbage collector may move the object, but it updates the reference to continue to describe the location of the same object, regardless of its actual location in memory.  It's essentially like a shortcut on your desktop.  The file that the shortcut refers to can be moved, but as long as the shortcut is updated with the new location, you don't need to know where the actual file is, so long as you can click the shortcut on the desktop.

  • Great article, with one important objection.

    Since this is a discussion on semantics, I would argue about the semantics of "address". The word "address" would drive only some programmers to think of a memory address. Others might think of a URI, for example. Most novices will probably just think of a street address.

    Take this definition in support:

    http://wordnetweb.princeton.edu/perl/webwn?s=address

    I do agree, the statement "references are memory addresses" is wrong. But, in general, "references are addresses" is not that inaccurate.

  • I think your questioning of C is incorrect.  There is no question to why C doesn't have references.  C is a low-level language that doesn't need the overhead of references.  You manage your own memory.  It's difficult but not horrible.  If you don't understand it, you shouldn't be using it!  There are plenty of other languages that will solve your problem most likely.

  • If we use the house analogy, a reference would be like saying "Tom's house". It's not an address, it's only known amongst the circle of friends, the "program" in which it is used. The actual address - 123 Elm Street - would be more like a pointer. So if you know Tom, you understand the reference but it doesn't give you the exact address because it's not needed. And if Tom moves, no problem for the reference.

    No analogy is perfect but for a beginner, this helps to keep things in context to know that a reference is not a pointer but it still, nonetheless points to an address.

    Thanks for the article.

  • Maybe a bit late, but I still want to share my ideas and questions on this subject.

    I wonder why everybody tries to explain the difference between a value type and a reference type by detailed discussions about what they are. I prefer to explain them by what they do.

    And then it is not as complex as it may seem.

    My explanation goes along the following lines:

    In modern programming languages (like C# and VB) we use complex data structures, like database records or classes (e.g. the employee in a personnel system). These data structures are not new; the only thing new about them is that since the Second World War we have been spending lots of time, blood, sweat and tears to put them in machines called computers. First as a carbon copy of the paper implementation, i.e. everybody who needed to know about an employee still had his/her own copy of the personnel data sheet for that person.

    Enter a GREAT NEW IDEA: let's have a single copy of the electronic sheet that we all share. At first this was only implemented in the long-term storage model (subject databases), but the way programming worked in those years (seventies - mid nineties), if I claimed the personnel data sheet for Mr. J.D., nobody else could access it. Even worse: if I had claimed it to update his address, I could not at the same time make an update to reflect his/her gender change operation.

    Enter YET ANOTHER GREAT IDEA. We should not think in terms of private access to a personnel data sheet but of shared access to that sheet, with automatic and transparent propagation of changes made by one user (or program function) to all other users. As it happens, this is not yet even close to implementation in the real world (although CIA, FBI, MI5, James Bond etc. sometimes radiate something different), but at the level of a single computer program with many functions being executed by that program at the same time we are getting the tools:

    - virtual storage, that allows us to write all those pieces of software you need to implement the following features. But that also means that you can never tell where your data is: on a hard drive, in the cache of the hard drive, on a fast memory expansion unit (aka memory stick), .... Well, it's hidden from the beginners in programming, but even those beginners will understand the need for a Data Entity Manager.

    - shared memory, that offers one program the opportunity to make another crash. So in the modern programming languages it's hidden from and inaccessible to the user. But it is an absolute must-have for a fast-responding Data Entity Manager. (By the way, it was invented with data and code sharing in mind. That you can cross-create abends is the unintended downside.)

    - Data Entity Managers, that keep a single copy of the data given to them, and give shared access to every program/function with the correct credentials

    - indexers into the data managed by the data entity managers: the users.

    And the mechanism to index into the Data Entity Manager in C# and VB is called REFERENCE TYPE. And that is all there is to the Reference Type: it offers an index into the Data Entity Manager. The only thing that is a bit strange here is the name Microsoft has given their Data Entity Manager: The Garbage Collector (Let's hope Shakespeare (What's in a name) was right and not the old Romans (Nomen est omen)).

    If it is that simple, then why is not all data reference type? Well, in good programming all meaningful data should be implemented as a reference type (even if that means creating your own data-only classes that contain datatypes that are (wrongly?) only implemented as value types by the compiler maker). Value types should only be used for program internals (like loop counters), that are meaningless outside the scope of the program.

    Wim Rozendaal

  • This is all well and good - and everyone seems happy with the numerous descriptions of pointers and addresses and dogs on leads etc... (good one that!) - but this confuses my address-based understanding of P/Invoke calls where you import an implementation using pointers and marshal it using ref... I'd always thought of references as just that, a reference, which is nice and easy and fits in with your original post of not considering them as anything other than a reference (don't need to know? why worry about it then?). However, when dealing with C++ implementations you're trying to P/Invoke to, you often come across this marshalling grief - and that's where all the stuff above leaves me confused. What exactly is going on when you use ref in your C# definition of something which in the original C++ implementation was a pointer? Is the type marshaller doing clever stuff on your behalf, getting the real address (whatever that means!! not wanting to open a can of worms there) from the GC and substituting it at runtime? Or is this functionality dependent on the implementation of the GC using addresses as the references? I'm normally quite happy not knowing this stuff, but with P/Invoke situations I've often found myself needing to know more than I want to!

  • Eric, you said

    "In M1 there is no code that you can write in C# that can tell you whether you are in this case or not -- you have no way of knowing if x and y are refs to the same variable. In M2 you can just compare the pointers for equality and you'll know."

    You are mistaken. Observe:

    void M1(ref int x, ref int y) {
     int z = x;
     x++;
     bool same = x == y;
     x = z; //side-effects are bad
     ...
    }

    I get the idea of your code, but I think it's not quite right. That would set "same" to true if x is one and y is two, for example. I think what you meant to say was something like "same = (x == y) && (++x == y)"

    That said, you still do not _know_ that x and y are the same variable. You have a good _guess_ that they are the same variable, but you do not know for sure. There could be another thread constantly watching x for changes and updating y as soon as they happen. That would give you a race condition in which sometimes "same" would incorrectly report true and sometimes would report false.

    (Of course it is a bad programming practice to pass such volatile variables by reference, and the compiler will warn you about it if you do.)

     -- Eric

     

  • @Mike Birt

    Marshalling does indeed do loads of clever stuff on your behalf. If the call signature of the function you invoke includes pointer parameters then the referenced values might potentially be changed. If you fail to specify ref then your parameter value will probably make it to the invoked function but any changes won't make it back to the caller. This would cause problems with API functions that expect you to pass them a struct to populate.

  • Managed vs. Unmanaged Code and References vs. Pointers

    first, sorry my English :)

    secondly, I am new to this subject...

    here is my Question..

    I have a DLL import

    [DllImport("ClientDll.GC.4x.dll", EntryPoint = "RPC_getImage")]
    public unsafe static extern int RPC_getImage(String imageName, byte[] content, ref UInt32 contentSize);

    ..........................................

    I have c# Code, at the c# side I use this,

    byte[] xx;
    RPC_getImage("my Image", xx, 1234);

    ..........................................

    and at the C++ side, the Code is (compacted)like this:

    RPC_getImage(const char * name,
                void *       content,
                unsigned *   contentSize)
    {
      static void * _content;
      //next line is a intern c++ function with typical pointers that fill _content)
      _getImage(g_RPCHandle, (const char*) name, (unsigned char**) &_content, contentSize);
      memcpy(content, _content, *contentSize);
      makefree_in_a_good_form(_content); 
      _content = NULL;          //ok, a c++ function make _content free
    }

    What "is" my variable xx at the c# side?

    is xx a reference to byte[] or not?

    what does the Garbage Collector do with xx?

    Thank you ! :)

     

    Obviously this is going to die horribly in multiple ways. A managed string is UTF16 -- two bytes per char, but your code expects one byte per char. You are supposed to be passing a ref to the size, instead you are passing the size. And it is completely unclear how the runtime is supposed to marshal the byte array. To make this work you need to decorate each formal parameter in the extern declaration with the appropriate marshalling attribute so that the runtime knows what to do. -- Eric

  • This is an excellent description! This is the kind of reading us techies live for. Wise choice of words! If people like this author still work for MS, then I will change my rating on MS stock to "buy" :)

  • It's good that such things get questioned; it gives a better understanding of the underlying logic of programming languages.

    I think in this post the definition of reference is also very vague. Pointer adds even more confusion.

    From a technical point of view there can be tons of things happening behind the scenes, but from a logical point of view it's pretty simple. A reference itself is an intension, and its extension is the object it refers to. The relation or function (or all the hand-waving) describing how the binding between intension and extension happens is a function (or relation, or whatever you like to call it) mapping the intension of the reference, for the current state, into its extension (in the current state).

    Thanks for article.

  • An excellent post (and followup), but one inconsistency (quotations reversed in order to make post easier...)

    "Pointers are strictly "more powerful" than references; anything you can do with references you can do with pointers, but not vice versa. I imagine that's why there are no references in C -- it's a deliberately austere and powerful language. "

    "The inventor of the C programming language, oddly enough, chose to not have the concept of references at all. Rather, Ritchie chose to have "pointers" be first-class entities in the language"

    You explain very well WHY there are no references with the first quote (which occurs later in your original post). Provided there was "one way to skin a cat" many other "features" were eliminated.

    Remember that C was originally designed to run on a DEC PDP-11 minicomputer. The maximum addressable space (directly at one time by a single process) was [actually still is - I have a few of them as well as a PDP-8 and Vaxen] 32KW....sooo many aspects of C derive from this processor's architecture....
