Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
(Eric is out camping; this posting is prerecorded. I'll be back in the office after Labour Day.)
The word "type" appears almost five thousand times in the C# 4 specification, and there is an entire chapter, chapter 4, dedicated to nothing but describing types. We start the specification by noting that C# is "type safe" and has "a unified type system" (*). We say that programs "declare" types, and that declared types can be organized by namespace. Clearly types are incredibly important to the design of C#, and incredibly important to C# programmers, so it is more than a little surprising that nowhere in the eight hundred pages of the specification do we ever actually define the word "type".
We sort of assume that the developer reading the specification already has a working understanding of what a "type" is; the spec does not aim to be either a beginner programming tutorial or a mathematically precise formal language description. But if you ask ten working line-of-business developers for a formal definition of "type", you might get ten different answers. So let's consider today the question: what exactly is this thing you call a "type"?
A common answer to that question is that a type consists of:
* A name
* A set (possibly infinite) of values
And possibly also:
* A finite list of rules for associating values not in the type with values in the type (that is, coercions; 123.45 is not a member of the integer type, but it can be coerced to 123.)
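The coercion rule in that last bullet is easy to see in C#; this is a minimal sketch, assuming the usual explicit numeric conversion:

```csharp
// A minimal sketch of the "coercion" rule above: 123.45 is not a
// value of the int type, but an explicit cast coerces it to one
// (truncating toward zero).
using System;

double d = 123.45;
int i = (int)d;       // coercion: 123.45 -> 123
Console.WriteLine(i); // prints 123
```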
Though that is not a terrible definition as a first attempt, it runs into some pretty nasty problems when you look at it more deeply.
The first problem is that of course not every type needs to have a name; C# 3 has anonymous types which have no names by definition. Is "string" the name of a type? What about "List<string>" -- does that name a type? Is the name of the string type "string" or "String", or "System.String", or "global::System.String", or all four? Does a type's name change depending on where in the source code you are looking?
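The anonymous-type case is easy to demonstrate; this sketch assumes C# 3 or later, and the compiler-generated name shown in the comment is an implementation detail that varies by compiler:

```csharp
using System;

// An anonymous type: there is no name for it that you can write in
// source code; the compiler invents an unspeakable internal one.
var point = new { X = 1, Y = 2 };
Console.WriteLine(point.GetType().Name); // e.g. "<>f__AnonymousType0`2" -- implementation detail
```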
This gets to be a bit of a mess. I prefer to think of types as logically not having names at all. Program fragments "12" and "10 + 2" and "3 * 4" and "0x0C" are not names for the number 12, they are expressions which happen to all evaluate to the number 12. That number is just a number; how you choose to notate it is a fact about your notational system, not a fact about the number itself. Similarly for types; the program fragment "List<string>" might, in its context, evaluate to refer to a particular type, but that type need not have that fragment as its name. It has no name.
The concept of a "set of values" is also problematic, particularly if those sets are potentially infinite "mathematical" sets.
To start with, suppose you have a value: how do you determine what its type is? There are infinitely many sets that can contain that value, and therefore infinitely many types that the value can belong to! And indeed, if the string "hello" is a member of the string type, clearly it is also a member of the object type. How are we to determine what "the" type of a value is?
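The "many types at once" point can be sketched directly in C#:

```csharp
using System;

// The same value is a member of many types at once;
// which one is "the" type of the value?
object o = "hello";
Console.WriteLine(o is string);      // True
Console.WriteLine(o is object);      // True
Console.WriteLine(o is IComparable); // True as well
```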
Things get even more brain-destroying when you think, oh, I know, I'll just say that every type's set of values is defined by a "predicate" that tells me whether a given value is in the type or not. That seems very plausible, but then you have to think about questions like "are types themselves values?" If so, then "the type whose values are all the types that are not members of themselves" is a valid predicate, and hey, we've got Russell's Paradox all over again. (**)
Moreover, the idea of a type being defined by a predicate that identifies whether a given value is of that type or not is quite dissimilar to how we typically conceive of types in a C# program. If it were, then we could have "predicate types" like "x is an even integer" or "x is not a prime number" or "x is a string that is a legal VBScript program" or whatever. We don't have types like that in the C# type system.
So if a type is not a name, and a type is not a set of values, and a type is not a predicate that determines membership, what is a type? We seem to be getting no closer to a definition.
Next time: I'll try to come up with a more workable definition of what "type" means to both compiler writers and developers.
(*) An unfortunate half-truth, as pointer types are not unified into the type system in the version of the C# language which includes the "unsafe" subset of functionality.
(**) When Russell discovered that his paradox undermined the entire arithmetic theory of Frege, he and Whitehead set about inventing what is now known as the "ramified theory of types", an almost impossible-to-understand theory that is immune to these sorts of paradoxes. Though the mathematical underpinnings of type theory/set theory/category theory/etc are fascinating, I do not understand them nearly well enough to go into detail here. Remember, what we're looking for is a definition of "type" that is amenable to doing practical work in a modern programming language.
"I would have thought it is most unusual to have a value without knowing a type, because in order to have a value you must have had some sort of *symbol*, be it scribbles on paper or bytes in RAM, and some system for interpreting such symbols into some possible set of conceptual values, and so we already have at least one type that we know the value belongs to."
Eric's point is not that we couldn't know *a* type if pressed but that it's hard to know *the* type -- the one we care about in our language. "42" could be an integer or a string or a pixel set or a bunch of curves, depending on my symbolism, but that doesn't really help a compiler if I don't restrict my choices. If the type "meaningless scribbles" is allowed (or "top" or whatever you want to call it) then all values are certainly of at least that type, but that's of theoretical interest only. On the other end, "32 bit machine word" is only slightly more useful if you want to have a high-level language.
If you will, Eric is using "value" in a common-sense way of "something that's undoubtedly a value in some type system, but not obviously of one particular type in the type system we're interested in (and it's our job to find out what it is)".
"Furthermore, why do we want to determine a single type for the value "hello"? Surely the only reason a type is useful is when we need to work with a value that we *don't* entirely know in advance, but about which we know *something*. The type is the information that we have that constricts our unknown value from the set of all conceivable ideas to something useful which can be meaningfully processed. It seems more important to worry about the relationships between the types we care about than to fret over the difficulty of saying anything about the type of an arbitrary value from the set of all possible values."
Good luck writing a compiler that doesn't fret about what type a value is, but does "worry about the relationships between the types we care about". At the lowest level of your translation cycle is a bunch of meaningless scribbles. You'll have to type them in some way eventually. The more flexible you make typing, the more complex type interactions get. Even languages with type inference generally "prefer" to make decisions quickly rather than leave the question open until the entire set of all operations that could possibly happen is known, not just for implementation reasons but also because extremely open-ended typing tends to lead to surprises. Anyone who's worked with a script language that has some unexpected type conversion rules (most of them have at least one) knows what I'm talking about.
That's not to say it isn't possible to take this view and stick with it, but the difficulty is a very real one worth fretting about for most compiler writers.
This has changed so much from the NT daze. Eric, what are your top five fav blogs that you follow? Just curious. :O)
Scribbles on paper and bytes in RAM were not given as examples of values, but of symbols. What I was saying is that in order to have a value at all you must *first* have some context that tells you that those four bytes are an unsigned integer or a single precision float or four ASCII characters or whatever else. If you don't have the context then you don't really have the value.
"Good luck writing a compiler that doesn't fret about what type a value is"
Oh dear, that is not what I meant at all. I said that you can't have a value without some kind of context. What I was trying to say is that it doesn't seem particularly important to try to define (say) some function that takes an utterly arbitrary value and tells you its type, because it's the context that tells you what other values it could have had, not the value itself. When talking about the value "hello", perhaps the context tells you that it could have been any sequence of characters (i.e., any string). The useful questions are then whether this type - the set of all strings - is a subset of some other type we are interested in, say the set of all objects, or whether a string can be coerced to another type, and so on. That's what I mean by "the relationships between the types we care about".
I don't think anything I said precludes any particular compiler/interpreter implementations. I'm just talking about the question at hand - what is a type; and Eric's reasons for why a "set of possible values" is not a suitable definition of a type. I'd have considered this to be a relatively abstract question about the principles behind programming languages, not a specific mechanism for implementing a compiler or interpreter.
How about this - a type T is:
1) A set of operations that can *create* distinct values ("instances of T").
2) A set of operations that can *be performed on instances* of T.
There's some subtlety in the meaning of the word "distinct" in that first item that I am not quite sure how to get across - something to the effect that the values thus created must be tagged with the identity of T so that other types cannot ever produce the same values. But I think overall this kind of definition works as a working definition of how the concept of a type is used in the CLR.
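One way to sketch that two-part definition in code; the Celsius type below is my own hypothetical illustration, not something from the CLR:

```csharp
using System;

Celsius freezing = new Celsius(0); // (1) an operation that *creates* a distinct value
Celsius warmer = freezing.Plus(5); // (2) an operation performed *on an instance*
Console.WriteLine(warmer.Degrees); // prints 5

// Hypothetical illustration: the type is, in effect, the pair of
// operation sets (1) and (2); the struct tag keeps its values
// distinct from plain doubles.
readonly struct Celsius
{
    public double Degrees { get; }
    public Celsius(double degrees) => Degrees = degrees;
    public Celsius Plus(double delta) => new Celsius(Degrees + delta);
}
```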
A type is the union of:
- a data storage specification (at least its size);
- an agreement about the meaning of that data (e.g. "this is a character");
- and, optionally, a set of operations defined for this data.
This covers C# (and its .NET underpinnings) only; Java would be similar in some ways, but the lack of user-defined value types means a bunch of things are not required.
A type is a set of capabilities and restrictions relevant (but in subtly different ways) to both storage locations and instances of objects.
These are defined in several places: in some cases the C# spec, in others your own code, and some in the CLR specification itself. They are likewise implemented in multiple places: the compiler, the runtime, and the code the compiler produces.
Types can be reified in two ways.
The first is at runtime through reflection (either by asking the compiler for it via typeof() or by asking the runtime to get it from an object instance with GetType() or its moral equivalent).
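Both reflection routes land on the same Type object, as a quick sketch shows:

```csharp
using System;

string s = "hello";
Type t1 = typeof(string); // compile time: ask the compiler
Type t2 = s.GetType();    // run time: ask the instance
Console.WriteLine(t1 == t2); // True
```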
The other was added in 2.0 with the addition of generics. A generic type parameter 'T' exists as something which can be interacted with at compile time; it is in essence a placeholder into which you can substitute a type, but once picked for a specific instance/storage location it cannot be changed.
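A minimal sketch of this second reification, using a hypothetical Box type of my own:

```csharp
using System;

// T is a compile-time placeholder; once a closed type such as
// Box<int> is constructed, the substitution is fixed for that
// storage location.
var b = new Box<int> { Value = 42 };
Console.WriteLine(b.Value); // prints 42
// b.Value = "hello"; // would not compile: T is already int here

class Box<T> { public T Value; }
```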
These reifications, precisely because they make some abstractions concrete, do not encompass all aspects of types; they are, very loosely, analogous to pictures of types (in the sense of reflection) or silhouettes of types (generics).
Thus all further definitions here will indicate precisely when they are talking about either of these reifications.
Types exist without objects: if you have a variable foo which is defined to contain an int, then the only things that know about that are the compiler (both csc and the JIT) and the runtime (so it can know whether there is a need to trace through that location when determining the live set).
Types also exist outside of storage locations (you may have a storage location defined as type "object" but the value referenced by it has type Flibble). Note that there is no requirement that the runtime actually create all the cruft associated with the type anywhere (see escape analysis and enregistration).
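The storage-location-versus-instance distinction can be sketched with the Flibble example from above:

```csharp
using System;

// The storage location "storage" has type object; the value it
// refers to has type Flibble (the hypothetical type named above).
object storage = new Flibble();
Console.WriteLine(storage.GetType().Name); // prints Flibble, not Object

class Flibble { }
```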
The capabilities come into play in a few ways:
The operations which may be performed on something contained within a storage location (the type in question being the type of the storage location), versus the code which will be used to determine the actual behaviour, which may be defined by the type of the instance present in the storage location.
What fields (themselves storage locations defined by their type and name) are present on the type.
Can instances of the type be blitted.
What conversions between types are possible (and, again, how this happens).
Some of these capabilities imply virtual dispatch.
The restrictions are far more numerous; for example:
How may a specific type be constructed.
How visible the type is (for this, the concept of a module, a collection of types, must also be defined).
Can the type be extended for implementation inheritance.
How much space is required to hold an instance of the type in a storage location. For reference types this is defined to always be done through managed references, which abstract the problem away by always being IntPtr.Size. For value types the sizes must be known at compile time.
Does assignment to another storage location require a copy (or at least require the *semantics* of a blit based copy) (ref and out parameters are aliases to the _same_ storage location).
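That last restriction, copy semantics versus aliasing, can be sketched briefly (Bump is a hypothetical helper of mine):

```csharp
using System;

int n = 10;
int copy = n;            // assignment of a value type copies the bits
Bump(ref n);             // "ref" passes an alias to the *same* storage location
Console.WriteLine(n);    // prints 11
Console.WriteLine(copy); // prints 10: the copy was unaffected

static void Bump(ref int x) => x += 1;
```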
Many of these became much more complex and subtle when generics came in.
Some types are 'special' in the sense that they only appear at certain points (say only at compile time).
dynamic is one such type. It ceases to exist at runtime (if you attempt to reflect on it this will be obvious), instead being replaced by specific instances of other types which are responsible for ensuring that the behaviours expected of it are adhered to. Generics in Java would be another example, as they apply at compile time but cease to exist at runtime (again, reflection or simply casting allows you to violate the specifications the compiler enforced).
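The erasure of dynamic is easy to observe; a minimal sketch:

```csharp
using System;

// "dynamic" exists only at compile time; at run time the value is
// just a string, and members are bound through the runtime binder.
dynamic d = "hello";
Console.WriteLine(d.GetType() == typeof(string)); // True: no trace of "dynamic"
Console.WriteLine(d.ToUpper());                   // bound at run time
```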
Another is pointers, which have many aspects akin to types (capabilities and restrictions) but are markedly different from other types, and as such cannot be reified in the same fashion (no reflection, no boxing, no use in generics).
Much of the subtlety in the specification (and compiler design) comes from which capabilities and restrictions exist at a point in time, and which order they are applied in (if at all).
@JM: Indeed there are more sophisticated type systems where you could define types like 'a string representing a valid VB program' - but either you'll have to pack more info with the string to have it be a value of that type (namely data proving that it is, in fact, a representation of a valid VB program) or you'll have to be content with your type judgement being undecidable, i.e. not being able always to tell whether value x belongs to type y or not. An undesirable trait, IMO, especially for compilers wanting to give guarantees of correctness.
I think a type is a combination of rules we put together. A type can be a sort of mental image we form when we see a value. I agree that a value should have a type, but this "type" is somehow semi-hidden.
My try at the meaning of "type", without reading through the article:
A type is an implied set of rules for the natural interpretation and manipulation of data.
@Amin: A value may not necessarily be associated with a type if our definition is not bound to the .NET realm. See C++ macro-defined values for one example, or a value referenced by an untyped pointer for another.
An interesting question: "are types themselves values?". Certainly there is the type Type, and the typeof() operator and the object.GetType() method yield values of that type. But one cannot say e.g. Type mytype = condition() ? String : Int64; -or- String.GetType(). So there is a difference between the (compile-time) type indications we use to type variables with or use the new operator on, and the (run-time) Type-valued objects representing these types. That said, by reflection we can use the latter to create new instances, so maybe types are values themselves after all.
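The reflection half of that observation can be sketched like so:

```csharp
using System;
using System.Text;

// A run-time Type value can be used to create new instances, which
// is the sense in which types behave like values after all.
Type t = typeof(StringBuilder);
object instance = Activator.CreateInstance(t);
Console.WriteLine(instance.GetType() == t); // True
```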
Russell's paradox is avoided since values seem (implicitly) to have a reference to their types. Defining a new subtype B of A that doesn't extend A in any way doesn't suddenly cause all A-typed objects to be B-objects as well; this is in contrast to a set-theoretic view where types are just sets of values.
@cheong00: What I mean is not bound to one particular programming language. Think about my words in real-life examples.
I think what you describe is something we call a "feature" of a programming language, or a "specification", or a "model of creation" of a language; but if we don't bind ourselves to some programming platform and focus on the deep meaning of "type", we come to the common rules for the natural interpretation and manipulation of data, as you said yourself.
And another interesting thing I think of: sometimes values can make new types... what do you think?
>> To start with, suppose you have a value: how do you determine what its type is?
I think this counterpoint is subtly wrong, for the simple reason that it makes certain assumptions about what a "value" is - and surely, if we're defining "types" in terms of "values", we should first define what a "value" is with similar rigor?
On the other hand, if "value" itself is defined as data tagged with a type - which resembles more what we actually see in CLR - then there's no contradiction. A type then is, indeed, just a predicate, and every value is tagged with one (and only one) type, which is how we can determine the "actual" type of any value.
@Amin: As for values that make new types, that doesn't make much difference if you're using my definition. You still have to obey the contract (memory size, properties, methods) provided by the new type, and provide a way to interpret the value it holds.
Actually, I'll agree my definition is a bit dull and doesn't stimulate the imagination; it just helps to fit together all the essential parts to think about when we talk about the term "type" in programming.