Eric Fleegal's WebLog

  • Simplifying C++ NULL terminated string handling

    I wrote the following short article several years ago.  I've reproduced it here by request.

     

    One of the curious features of the C language is its lack of an integrated string type.  Most programming languages developed in the 1960s and 1970s included a basic string type.  Strings in C, however, are just a special case of array data, and the only direct language support involves initialization of pointers and array values with string literals.

     

    Fortunately, the C++ standard library introduces a standardized string type, std::string.  Although its implementation is very well thought out, naïve use of std::string can fragment the heap, reduce performance and create unexpected bottlenecks.  However, the same can be said for naïve use of strings in languages like Java and C#, where the String type is fundamental and most of the implementation details are hidden from the programmer.  In C++ there are ways to mitigate potential performance problems. When I employ std::string, I prefer to use a memory model very similar to boost’s segregated storage; this reduces memory fragmentation and keeps all the string data within a sandbox.  It’s fairly easy to do this once you know how to write a standard allocator.  The strstream class in C++ is an efficient and elegant solution to complex string construction, and I prefer it over Java’s somewhat clumsy StringBuilder class.  If you prefer the printf way of formatting strings, the boost library offers an excellent type-safe alternative that’s built to work efficiently with standard stream classes.
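
    To illustrate the stream-based style of string construction, here is a minimal sketch using std::ostringstream, the standard library's string-backed output stream (assumed here as the modern stand-in for the strstream classes mentioned above).  The helper name DescribePlayer and its parameters are hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: builds a label such as "Ana scored 42 points"
// without any manual buffer management or format specifiers.
inline std::string DescribePlayer(const std::string& name, int score)
{
    std::ostringstream stream;
    stream << name << " scored " << score << " points";
    return stream.str();
}
```

    Each inserted value is formatted according to its static type, so there is no format string to get out of sync with the arguments.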

     

    Systems programmers and developers of high performance applications typically use C style strings.  There are a number of reasons for this, but chief amongst them are efficiency and interoperability.  C strings are efficient precisely because they are simple -- they can be allocated on the stack or as part of a larger structure or in the free-store, and operations on them can be specifically tailored for maximum efficiency.  Moreover, C strings are usually not optional when interacting with operating system APIs, drivers, low level libraries and legacy code. 

     

    Being able to program with C strings is a fairly fundamental skill for most Microsoft developers; indeed, a good number of the coding questions we ask during technical interviews involve some sort of C string manipulation.  Being conversant with C string manipulation and the concomitant standard library functions is a point of pride for many Microsoft programmers, especially among those who cut their teeth in C instead of C++.  Indeed, these C string "fanboys" are sometimes critical of those developers who prefer std::string over character arrays and raw character buffers (I am not one of them).

     

    I find that the worst part of programming with C strings is the string library itself.  Take a very basic function like strcpy, for instance.   In the early days the length of a symbol name was limited to just a few characters, so we can forgive the designers for selecting a somewhat less than human readable name.  When there was only one way to copy a string, the name strcpy wasn’t a bad choice.  However today there are dozens of different variations on the string copy function name available to Visual C++ programmers.  Here are 24 of them from <string.h> and <mbstring.h>:

    strcpy, wcscpy, _mbscpy, _tcscpy,

    strcpy_s, wcscpy_s, _mbscpy_s, _tcscpy_s,

    strncpy, wcsncpy, _mbsncpy, _tcsncpy,

    _strncpy_l, _wcsncpy_l, _mbsncpy_l, _tcsncpy_l,

    strncpy_s, wcsncpy_s, _mbsncpy_s, _tcsncpy_s,

    _strncpy_s_l, _wcsncpy_s_l, _mbsncpy_s_l, _tcsncpy_s_l

    There are versions for four different character sets: ASCII, Mbcs, Unicode and TCHAR.  There are safe and unsafe versions, locale specific versions, and versions with additional semantics.  Now multiply those semantic variations against the dozen or so basic string operations and you have hundreds of different names to try to remember!

     

    The function names in <strsafe.h> are a little more regular, but there are still dozens of names to remember.  There are 24 of them for copying a string:

    StringCbCopy, StringCbCopyA, StringCbCopyW,

    StringCbCopyEx, StringCbCopyExA, StringCbCopyExW,

    StringCbCopyN, StringCbCopyNA, StringCbCopyNW,

    StringCbCopyNEx, StringCbCopyNExA, StringCbCopyNExW,

    StringCchCopy, StringCchCopyA, StringCchCopyW,

    StringCchCopyEx, StringCchCopyExA, StringCchCopyExW,

    StringCchCopyN, StringCchCopyNA, StringCchCopyNW,

    StringCchCopyNEx, StringCchCopyNExA, StringCchCopyNExW

    To complicate matters, Strsafe.h is incomplete; it has neither multi-byte character support nor support for locale specific functions [locale specific functions have been added since the time this article was authored].  Moreover, the strsafe naming style is only available for those functions needing to prevent buffer overruns.  String operations such as comparison and collation, which are already safe, have no implementation in this library.

     

    I think there’s really only one name I should have to remember for each basic string operation.  In this case, that name would be “Copy”—overloaded for each different semantic variation of the operation, but with a uniform scheme for parameterization and return values.  The compiler should do all the work of figuring out which variation to use.  Since the name “Copy” is applicable to more than just strings, we should declare the name within a “Strings” namespace.  Since some variations are unsafe, we should introduce a counterpart namespace “UnsafeStrings” to make it very explicit when choosing to use an unsafe version of a string function.  We declare the functions in a namespace instead of a class so that the library can be extensible.  This also makes factoring the implementation code into different files a little easier.

     

    For an initial example, the basic safe Copy operations for each of three string types would be declared as follows:

     

          namespace LibraryName

          {

          namespace Strings

          {

                errno_t Copy(char* destination, size_t destinationSize, const char* source);

                errno_t Copy(unsigned char* destination, size_t destinationSize, const unsigned char* source);

                errno_t Copy(wchar_t* destination, size_t destinationSize, const wchar_t* source);

          }

          }

     

    These are the basic copy functions for ASCII, Mbcs and Unicode respectively.  Each of these overloaded versions of Copy simply dispatches to its counterpart in the standard library.  For example, the ASCII version is:

     

          inline errno_t Strings::Copy(char* destination, size_t destinationSize, const char* source)

          {

                return ::strcpy_s(destination, destinationSize, source);

          }
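
    Since strcpy_s is not available in every C runtime, the overload set can also be sketched portably.  The version below is an illustration under stated assumptions rather than the library's actual code: CopyImpl is a hypothetical helper, int stands in for errno_t, and an over-long source is assumed to leave an empty destination and return ERANGE.

```cpp
#include <cerrno>
#include <cstddef>

namespace LibraryName
{
namespace Strings
{
    // Shared implementation (hypothetical helper, not part of the article's library).
    template <typename TChar>
    int CopyImpl(TChar* destination, std::size_t destinationSize, const TChar* source)
    {
        if (!destination || destinationSize == 0)
            return EINVAL;
        if (!source)
        {
            destination[0] = TChar();
            return EINVAL;
        }
        for (std::size_t i = 0; i < destinationSize; ++i)
        {
            destination[i] = source[i];
            if (source[i] == TChar())
                return 0;           // copied the terminator: success
        }
        destination[0] = TChar();   // source did not fit: leave an empty string
        return ERANGE;
    }

    // One name, one overload per character type; the compiler picks the match.
    inline int Copy(char* destination, std::size_t destinationSize, const char* source)
    {
        return CopyImpl(destination, destinationSize, source);
    }

    inline int Copy(wchar_t* destination, std::size_t destinationSize, const wchar_t* source)
    {
        return CopyImpl(destination, destinationSize, source);
    }
}
}
```

    A caller writes Strings::Copy(buffer, sizeof(buffer), "abc") for narrow strings and Strings::Copy(wideBuffer, 8, L"abc") for wide ones, without choosing between differently named functions.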

     

    The counterpart unsafe Copy functions would be declared as follows:

     

          namespace LibraryName

          {

          namespace UnsafeStrings

          {

                using Strings::Copy;

                errno_t Copy(char* destination, const char* source);

                errno_t Copy(unsigned char* destination, const unsigned char* source);

                errno_t Copy(wchar_t* destination, const wchar_t* source);

          }

          }

     

    Notice that the safe versions of Copy are composited into the UnsafeStrings namespace with a using declaration.  This is done for convenience, so that Strings and UnsafeStrings are each a complete name-set.

     

    The Unsafe definitions of Copy will need additional parameter checking to ensure that the function semantics are uniform with the safe versions.  In practice this doesn’t usually introduce much of a performance barrier since, depending on context, the compiler can often optimize away these additional parameter checks when the function gets expanded inline.  The cost of the parameter checking is trivial compared to the cost of the copy.

     

          inline errno_t UnsafeStrings::Copy(char* destination, const char* source)

          {

                if (!destination || !source)

                      return EINVAL;

                ::strcpy(destination, source);

                return 0;

          }
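
    The effect of the using declaration can be seen in a small self-contained sketch.  The safe overload here is a simplified stand-in for the strcpy_s dispatch, and int is used in place of errno_t; both are assumptions made for portability.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>

namespace LibraryName
{
namespace Strings
{
    // Bounds-checked overload (simplified stand-in for the strcpy_s dispatch).
    inline int Copy(char* destination, std::size_t destinationSize, const char* source)
    {
        if (!destination || !source || destinationSize == 0)
            return EINVAL;
        if (std::strlen(source) >= destinationSize)
            return ERANGE;
        std::strcpy(destination, source);
        return 0;
    }
}

namespace UnsafeStrings
{
    using Strings::Copy;    // the safe overload becomes visible here too

    // Unchecked overload: the caller guarantees the buffer is large enough.
    inline int Copy(char* destination, const char* source)
    {
        if (!destination || !source)
            return EINVAL;
        std::strcpy(destination, source);
        return 0;
    }
}
}
```

    Code that qualifies its calls with UnsafeStrings can reach both forms, and the argument list alone decides which one runs.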

     

    The different semantic variations take on a very regular form.  For instance, to copy a limited number of characters we simply declare additional overloads as follows:

           

           namespace LibraryName

           {

           namespace Strings

           {

                  errno_t Copy(char* destination, size_t destinationSize, const char* source, size_t maxCount);

                  errno_t Copy(unsigned char* destination, size_t destinationSize, const unsigned char* source, size_t maxCount);

                  errno_t Copy(wchar_t* destination, size_t destinationSize, const wchar_t* source, size_t maxCount);

           }

     

           namespace UnsafeStrings

           {

                  using Strings::Copy;

                  errno_t Copy(char* destination, const char* source, size_t maxCount);

                  errno_t Copy(unsigned char* destination, const unsigned char* source, size_t maxCount);

                  errno_t Copy(wchar_t* destination, const wchar_t* source, size_t maxCount);

           }

           }

     

    As with the earlier variations, the implementations of these simply dispatch to the correct counterpart functions in the standard library.  Again, the unsafe versions will need a little additional code to perform some parameter checking.
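
    For readers without strncpy_s at hand, a count-limited safe overload might look roughly like the following.  This is a hand-written sketch, not the library's implementation; it assumes int in place of errno_t and assumes that a result which does not fit leaves an empty destination and returns ERANGE.

```cpp
#include <cerrno>
#include <cstddef>

namespace LibraryName
{
namespace Strings
{
    // Copies at most maxCount characters of source and always terminates the result.
    inline int Copy(char* destination, std::size_t destinationSize,
                    const char* source, std::size_t maxCount)
    {
        if (!destination || destinationSize == 0)
            return EINVAL;
        destination[0] = '\0';
        if (!source)
            return EINVAL;

        // Number of characters to copy: the shorter of maxCount and the source length.
        std::size_t count = 0;
        while (count < maxCount && source[count] != '\0')
            ++count;

        if (count >= destinationSize)
            return ERANGE;          // no room for the characters plus the terminator

        for (std::size_t i = 0; i < count; ++i)
            destination[i] = source[i];
        destination[count] = '\0';
        return 0;
    }
}
}
```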

     

    The locale specific variations of Copy are similarly implemented. 

     

    It’s convenient to add safe versions of Copy specifically for arrays.

     

          namespace Strings

          {

                template <size_t TSize>

                inline errno_t Copy(char (&destination)[TSize], const char* source)

                {

                      return Copy(destination, TSize, source);

                }

          }
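
    The array overload works because the compiler deduces the array length from the argument's type.  The sketch below shows the idea in a self-contained form; the pointer-and-size overload is a simplified stand-in for the strcpy_s dispatch, and int replaces errno_t (both assumptions).

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>

namespace Strings
{
    // Pointer-and-size overload (simplified stand-in for the strcpy_s dispatch).
    inline int Copy(char* destination, std::size_t destinationSize, const char* source)
    {
        if (!destination || !source || destinationSize == 0)
            return EINVAL;
        if (std::strlen(source) >= destinationSize)
            return ERANGE;
        std::strcpy(destination, source);
        return 0;
    }

    // Array overload: TSize is deduced from the declared type of the argument,
    // so the caller never has to spell out the buffer size.
    template <std::size_t TSize>
    inline int Copy(char (&destination)[TSize], const char* source)
    {
        return Copy(destination, TSize, source);
    }
}
```

    Given char name[5], the call Strings::Copy(name, "abcd") forwards a size of 5 automatically.  The deduction only applies to true arrays; a plain char* argument does not match this overload.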

     

    Other string operations are similarly easy to define.  Consider, for instance, the 28 different name variations for functions comparing two strings:

    strcmp, wcscmp, _mbscmp, _tcscmp,

    _stricmp, _wcsicmp, _mbsicmp, _tcsicmp,

    _stricmp_l, _wcsicmp_l, _mbsicmp_l, _tcsicmp_l,

    strncmp, wcsncmp, _mbsncmp, _mbsncmp_l ,

    _tcsnccmp, _tcsncmp, _tccmp,

    _strnicmp, _wcsnicmp, _mbsnicmp,

    _strnicmp_l, _wcsnicmp_l, _mbsnicmp_l,

    _tcsncicmp, _tcsnicmp, _tcsncicmp_l

    As with the copy functions, the compare functions have versions for four different character sets, ASCII, Mbcs, Unicode and TCHAR.  There are locale specific versions, versions with case insensitive comparison semantics and some versions with different names but identical semantics. 

     

    As with Strings::Copy, the string comparison operation should have only one name, Compare.  The declarations for the basic Compare functions are:

     

          namespace LibraryName

          {

          namespace Strings

          {

                int Compare(const char* string1, const char* string2);

                int Compare(const unsigned char* string1, const unsigned char* string2);

                int Compare(const wchar_t* string1, const wchar_t* string2);

          }

     

          namespace UnsafeStrings

          {

                using Strings::Compare;

                // there are no unsafe-specific versions of Compare

          }

          }

     

    Each of these overloaded versions of Compare simply dispatches to its counterpart in the standard library.  For example:

     

          inline int Strings::Compare(const char* string1, const char* string2)

          {

                return ::strcmp(string1, string2);

          }

     

    The function strcmp and the other standard string comparison functions have undefined behavior when passed bad parameters such as NULL pointers.  This makes them unsuitable as predicate operations for sorting algorithms and ordered containers whenever NULL entries are possible.  Historically this was done for performance reasons, since parameter checking was considered “expensive” due to the extra branch operations.  This defense is somewhat dubious, since comparing two strings is much more expensive than checking the parameters.  Accordingly, we redefine the Compare functions with an alternate semantic: one that is well-ordered for any two string arguments, NULL or not.

     

    // function Compare(a,b)

    // Compares two strings lexicographically

    // returns 

    //     <0 when a < b

    //      0 when a == b

    //     >0 when a > b

    // except when either a or b is NULL, then

    //     <0 when a==NULL && b!=NULL

    //      0 when a==NULL && b==NULL

    //     >0 when a!=NULL && b==NULL

    inline int Strings::Compare(const char* a, const char* b)

    {

        if (!a)

            return b ? -1 : 0; 

        if (!b)

            return +1;           

        return ::strcmp(a, b);

    }

     

    The newly added semantics are: two NULL strings are equal, and a NULL string is considered “less than” one that isn’t NULL.  This means that if the Strings::Compare function is used in a predicate operator, the NULL strings will be sorted to the front.  The compiler can sometimes optimize away the parameter checking when the function is inlined.
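
    As a sketch of how this ordering behaves in a sorting predicate, the following example sorts a vector that contains NULL entries.  The helper name SortNames is hypothetical; Compare is repeated so that the block stands on its own.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

namespace Strings
{
    // Same NULL-tolerant ordering as described above.
    inline int Compare(const char* a, const char* b)
    {
        if (!a)
            return b ? -1 : 0;
        if (!b)
            return +1;
        return std::strcmp(a, b);
    }
}

// Hypothetical helper: sorts C strings in place; NULL entries end up first.
inline void SortNames(std::vector<const char*>& names)
{
    std::sort(names.begin(), names.end(),
              [](const char* a, const char* b) { return Strings::Compare(a, b) < 0; });
}
```

    Passing the same vector to std::sort with a predicate built directly on strcmp would be undefined behavior as soon as a NULL entry was compared.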

     

    Case insensitive comparison requires an additional tag type.  It’s introduced in the following code as the enumeration enumCaseInsensitive, with the interchangeable tag names CaseInsensitive and CASE_INSENSITIVE.

     

          namespace LibraryName

          {

          namespace Strings

          {

                enum enumCaseInsensitive { CaseInsensitive, CASE_INSENSITIVE };

     

                int Compare(const char* string1, const char* string2, enumCaseInsensitive);

                int Compare(const unsigned char* string1, const unsigned char* string2, enumCaseInsensitive);

                int Compare(const wchar_t* string1, const wchar_t* string2, enumCaseInsensitive);

          }

     

          namespace UnsafeStrings

          {

                using Strings::Compare;

          }

          }

     

    As with the case sensitive version, this version of Compare simply dispatches to the appropriate counterpart in the standard library, adding the same NULL semantics as before.  For example:

     

    inline int Strings::Compare(const char* a, const char* b, enumCaseInsensitive)

    {

        if (!a)

            return b ? -1 : 0; 

        if (!b)

            return +1;           

        return ::_stricmp(a, b);

    }

     

    To use the case insensitive version of Compare, the calling code simply passes in the Strings::CaseInsensitive tag.  I usually bring the identifiers “CaseInsensitive” or “CASE_INSENSITIVE” into the current namespace with a using declaration.

     

    using Strings::CaseInsensitive;

    . . .

    if ( Strings::Compare(name1, name2, CaseInsensitive) < 0 )

    {

          . . .

    }
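
    Because _stricmp is specific to Microsoft's runtime, the tag-selected overload can also be sketched portably.  The version below is an illustration only: it compares character by character with std::tolower in the current C locale, which is an assumption and may differ from the runtime function for non-ASCII text.

```cpp
#include <cctype>

namespace Strings
{
    enum enumCaseInsensitive { CaseInsensitive };

    // Case-insensitive overload, selected purely by the type of the third argument.
    // Uses std::tolower character by character instead of the MSVC-specific _stricmp.
    inline int Compare(const char* a, const char* b, enumCaseInsensitive)
    {
        if (!a)
            return b ? -1 : 0;
        if (!b)
            return +1;
        for (;; ++a, ++b)
        {
            const int left  = std::tolower(static_cast<unsigned char>(*a));
            const int right = std::tolower(static_cast<unsigned char>(*b));
            if (left != right)
                return left < right ? -1 : +1;
            if (left == 0)
                return 0;       // both strings ended together
        }
    }
}
```

    The tag value itself is never inspected; its type is what steers overload resolution, so the extra argument costs nothing once the call is inlined.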

     

    Conclusion

    The completed Strings library contains the following functions: Append, Collate, Compare, CompareOrdinal, Convert, Copy, Find, IsEqual, IsLessThan, IsGreaterThan, Length, PrintF/VPrintF, PrintFLength, Replace, ScanF/VScanF, and Tokenize.  In all, this Strings library has hundreds of functions but only sixteen function names to remember. 

     

    Incidentally, the similarity to the function naming convention in the C# String class is no coincidence. 

  • How to do Object Properties in C++

    One of the many useful features of modern languages like C# is object properties, which provide a higher level of encapsulation than public fields.  The field-like syntax is far easier to read and write than traditional C++ GetXXX and SetXXX functions.

    It’s surprising how many people don’t know that Visual C++ has properties too.  Microsoft added property fields into C++ as a language extension back in the days when COM programming was all the rage.  As with most C++ language extensions, the syntax is a bit clumsy; this one uses a Microsoft specific __declspec compiler directive.  The syntax is:

    __declspec ( property ( get=nameOfGetFunction, put=nameOfSetFunction ) ) typeExpressing propertyName

    When the compiler sees a data member declared with this attribute on the right of a member-selection operator ("." or "->"), it converts the operation to a get or put function, depending on whether such an expression is an l-value or an r-value. In more complicated contexts, such as "+=", a rewrite is performed by doing both get and put.  A property can also be declared read-only or write-only by specifying only the get or put function respectively. 
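
    The rewrite described above can be spelled out by hand in standard C++.  In this sketch the class Volume, its accessors and the helper RaiseLevel are hypothetical; the comments show the property form that Visual C++ would accept if a property named Level were declared for these accessors.

```cpp
// Portable illustration: the accessor calls that property syntax stands for.
class Volume
{
public:
    int  GetLevel() const { return level_; }
    void SetLevel(int newLevel) { level_ = newLevel; }

private:
    int level_ = 0;
};

// Each statement is written as the explicit call; the comment on the right
// shows the equivalent property expression.
inline int RaiseLevel(Volume& volume, int amount)
{
    volume.SetLevel(10);                            // volume.Level = 10;
    volume.SetLevel(volume.GetLevel() + amount);    // volume.Level += amount;
    return volume.GetLevel();                       // return volume.Level;
}
```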

    To make life a little easier, I introduce a header file “C++ Properties.h” with the following macros:

    #define PROPERTY(TYPE, NAME) __declspec(property(get=Get##NAME,put=Set##NAME)) TYPE NAME

    #define READONLY_PROPERTY(TYPE, NAME) __declspec(property(get=Get##NAME)) TYPE NAME

    #define WRITEONLY_PROPERTY(TYPE, NAME) __declspec(property(put=Set##NAME)) TYPE NAME

    Notice that these macros use preprocessor token pasting so that the get and put functions always map to GetXXX and SetXXX, where XXX is the name of the property.  This allows us to declare classes with properties in a very readable form; for example:

    class GamePad

    {

    public:

            . . .

            READONLY_PROPERTY(bool, IsConnected);

            bool GetIsConnected() const;

            . . .

    };

    While not quite as elegant as the built in property syntax in C#, it’s not a bad substitute.

    You can declare virtual C++ properties simply by making the getter and/or setter methods virtual.  Similarly, abstract properties can be defined by using pure virtual getter and/or setter methods.  For example:

    class GamePad

    {

    public:

            . . .

            READONLY_PROPERTY(bool, IsConnected);  // virtual property

            virtual bool GetIsConnected() const;

            READONLY_PROPERTY(float, PollingRate);  // abstract property

            virtual float GetPollingRate() const = 0;

            . . .

    };

    Note that the const semantics for the property are determined by the getter or setter method.

    Array semantics are also supported. The syntax is basically the same, but with an added “[]”, as follows:

    __declspec ( property ( get=nameOfGetFunction, put=nameOfSetFunction ) ) typeExpressing propertyName[]

    The accessor function simply needs to take an index argument.  Although we can use our existing macros for arrays, as follows

    class GamePad

    {

    public:

            . . .

            READONLY_PROPERTY(ButtonState, Buttons)[];

            ButtonState GetButtons(size_t buttonIndex);

            . . .

    };

    // And used like:

    GamePad gamePad;

    . . .

    ButtonState buttonState = gamePad.Buttons[3];

    I find it somewhat less confusing to have additional macros in “C++ Properties.h”:

    #define ARRAY_PROPERTY(TYPE, NAME) __declspec(property(get=Get##NAME,put=Set##NAME)) TYPE NAME[]

    #define READONLY_ARRAY_PROPERTY(TYPE, NAME) __declspec(property(get=Get##NAME)) TYPE NAME[]

    #define WRITEONLY_ARRAY_PROPERTY(TYPE, NAME) __declspec(property(put=Set##NAME)) TYPE NAME[]

    Changing the above class into:

    class GamePad

    {

    public:

            . . .

            READONLY_ARRAY_PROPERTY(ButtonState, Buttons);

            ButtonState GetButtons(size_t buttonIndex);

            . . .

    };

    The array access functions can also be multidimensional:

    class Picture

    {

    public:

            . . .

            READONLY_ARRAY_PROPERTY(Color, Pixels);

            Color GetPixels(unsigned int x, unsigned int y);

            . . .

    };

    // And used like:

    Picture picture;

    . . .

    Color colorAt = picture.Pixels[x][y];

    Because Properties provide a higher level of encapsulation than public fields, I often find myself exposing private fields through const properties.

    class GamePad

    {

    public:

            . . .

            READONLY_PROPERTY(bool, IsConnected);

            bool GetIsConnected() const { return isConnected_; }

            . . .

    private:

            bool isConnected_;

    };

    Just like traditional accessor functions, this enables internal members to change the isConnected_ state while exposing the state to the public scope as a const property.  This pattern is so very common that I introduce an explicit property macro for it:

    #define READONLY_PROPERTY_RVALUE(TYPE, NAME, RVALUE_EXPR) \
        __declspec(property(get=Get##NAME)) TYPE NAME; \
        TYPE Get##NAME() const { return RVALUE_EXPR; }
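
    One possible way to keep such a header usable on compilers without the Microsoft extension is to guard the property declaration.  The fallback branch below is an assumption for illustration, not part of the original header; it keeps only the generated getter, so portable code calls GetIsConnected() directly.

```cpp
// On compilers without the Microsoft extension, fall back to the plain getter.
#if defined(_MSC_VER)
#define READONLY_PROPERTY_RVALUE(TYPE, NAME, RVALUE_EXPR) \
    __declspec(property(get=Get##NAME)) TYPE NAME; \
    TYPE Get##NAME() const { return RVALUE_EXPR; }
#else
#define READONLY_PROPERTY_RVALUE(TYPE, NAME, RVALUE_EXPR) \
    TYPE Get##NAME() const { return RVALUE_EXPR; }
#endif

class GamePad
{
public:
    explicit GamePad(bool isConnected) : isConnected_(isConnected) {}

    // Expands to the property declaration (where supported) plus GetIsConnected().
    READONLY_PROPERTY_RVALUE(bool, IsConnected, isConnected_)

private:
    bool isConnected_;
};
```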

    For symmetry I also add the following two macros, though admittedly they’re rarely used (and many of my colleagues hate them).

    #define PROPERTY_VALUE(TYPE, NAME, RVALUE_EXPR, LVALUE_EXPR) \
        __declspec(property(get=Get##NAME,put=Set##NAME)) TYPE NAME; \
        TYPE Get##NAME() const { return RVALUE_EXPR; } \
        void Set##NAME(TYPE newValue) { LVALUE_EXPR = newValue; }

    #define WRITEONLY_PROPERTY_LVALUE(TYPE, NAME, LVALUE_EXPR) \
        __declspec(property(put=Set##NAME)) TYPE NAME; \
        void Set##NAME(TYPE newValue) { LVALUE_EXPR = newValue; }

    Although it would be preferable for C++ properties to have a cleaner built-in syntax, these macros provide enough of an abstraction to enable use of properties without sacrificing readability.

     

  • #pragma once

    Most C++ compilers now support the non-standard #pragma once compiler directive.  This directive instructs the compiler to #include the file only once in a single compiland, and replaces the old C-style header sentinels (often called #include guards).

    The central problem with preprocessor based header sentinels is that they require the user to create a unique symbol to identify each and every header file that might be #included by a single compiland.   On very large projects, this burden becomes somewhat painful.  Consider also the distinct possibility that two libraries might contain public header files with the exact same name; using the typical __FILENAME_H__ convention, it's very possible to run into a name collision between the two libraries.   Some project teams attempt to avoid this problem by imposing a strict sentinel naming standard, usually including a file's path as part of its sentinel name.  I'm not fond of this solution because if a header file needs to be moved, as occurs when refactoring a library, it requires that the header file be edited to conform to its new location.  The #pragma once directive avoids all this nonsense entirely.
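
    For comparison, here is what a path-based sentinel looks like in a small header.  The file name, the sentinel symbol and the struct are all hypothetical; the directive form is shown only in a comment so that the fragment also compiles when pasted into a source file.

```cpp
// Input/GamePadState.h  (hypothetical header)
//
// With the directive, the whole guard below collapses to a single line:
//     #pragma once
//
// Traditional sentinel: the symbol encodes the file's path, so moving the
// header to another directory means editing the symbol as well.
#ifndef LIBRARYNAME_INPUT_GAMEPADSTATE_H
#define LIBRARYNAME_INPUT_GAMEPADSTATE_H

struct GamePadState
{
    bool isConnected;
};

#endif // LIBRARYNAME_INPUT_GAMEPADSTATE_H
```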

    A secondary problem is one of efficiency.  The #pragma once directive allows the compiler to avoid opening and preprocessing a header file after it’s been seen once.  Although it is technically possible for a compiler to implement a similar mechanism when it encounters the header sentinel pattern (GCC can do this for instance), the mechanism is a bit fragile because it depends on a user following an exact coding pattern for the optimization to work correctly.  I’ve encountered code patterns in library header files that appear to be the header-sentinel pattern, but are in fact not.  Moreover, the compiler must account for the fact that preprocessor symbols can be explicitly #undef’d.  I much prefer the explicit directive because it states exactly the intention of the programmer -- "include this header only once".

    Unfortunately for users of GCC, this compiler directive has been deprecated (although it was still supported the last time I checked).  It’s my personal opinion that this is yet another case of GNU’s pervasive NIH syndrome (they got it bad); the official reason, however, is that the construct is not portable (which I admit is technically true).  I do not understand why the standards committee hasn’t adopted it into the ISO standard.  It’s a trivial compiler feature to implement.  Although it’s not an official language feature, most C/C++ compilers support it.

  • Eschew Obfuscation

    While some may think that naming conventions are much ado about nothing, no other subject of coding standards evokes as much fervent discourse.

     

    When I first started programming for Windows back in college (ca 1990), I was baffled that all these really bright programmers at Microsoft would use cryptic symbol names like LPCWSTR and crgpcsz (called Hungarian notation).  In short, it’s a naming convention that incorporates a symbol’s type information into its name using a series of short prefix identifiers.  At the time, this seemed anathema to everything I was being taught about producing readable, maintainable code.  Indeed, this notation presented a particularly difficult barrier for me when entering the world of Windows programming, and played no small part in my choosing to be a Unix developer for many years.

     

    In 2000 I had the privilege of working in Microsoft Research for Dr. Charles Simonyi, the inventor of Microsoft Word and yes, the infamous Hungarian notation.  When I asked Charles about his popular notation and why he proposed such a confusing convention, he got this amused look on his face and told me that most people had actually missed the point entirely.  His intention, he explained, was not simply to conflate type information into the name of a symbol; rather, he wanted to free the developer from the burden of name selection, a “frustrating and time consuming task”.  His premise was that if two programmers, using the same convention, would independently choose the same name for the same program text, then both goals of readability and write-ability have been served.  Readability, he argued, becomes a natural artifact of write-ability, and thus emphasis on the latter is rightly placed.  Although I remained skeptical, I had to admit that given the historically weak type safety of C and its lack of name encapsulation, it wasn’t difficult to understand the broad appeal of Charles’ proposal amongst early Microsoft programmers.  Indeed, to this day some developers at Microsoft adhere to Hungarian notation with near religious fervor.

     

    My personal experience is that Hungarian notation tends to obfuscate rather than to illuminate; that different programmers using Hungarian do not independently choose the same names for the same program text for the same reason that different programmers don’t usually choose to write algorithms in identical ways.  There are often many ways to implement the same algorithm using different structures and types.  To make matters worse, independent teams inevitably choose subtle style variations, further increasing confusion and inhibiting long term maintainability.  Over time even a single team’s notation will evolve such that each successive generation of code will look progressively different from legacy code.  I recently joined a team at Microsoft with a continuous product line that’s more than twenty years old.  This team’s codebase includes some legacy C components so old that they look entirely different from code written in recent years.  If they had instead chosen plain English words and phrases for symbol names, their old code would be just as readable as the new code (though still in C instead of C++).  And while I agree with Dr. Simonyi that naming a symbol with just the right words can be difficult, even frustrating at times, I think the effort pays off in more readable, maintainable code.

     

    In recent years I’ve had the joy of working with more and more programmers from Generation Y.  These brilliant “kids” cut their teeth on object oriented programming, have never had reason to use ancient editors like vi or Emacs, nor have they ever programmed without the aid of basic semantic tools like Intellisense.  To them, Hungarian notation is not just an anachronism; it’s a pedantic scheme that gets in the way of their efficiency and creativity.

     

    In twenty years of programming, I’ve found one thing to be universally true:  consistency, above all else, is crucial to writing readable, maintainable code.  Consistency between programmers on the same team as well as consistency for the same programmer from year to year.

     

    Last year I proposed a simple naming convention for my team’s C++ development.  It can be summarized quite succinctly.  “Use consistent, meaningful English names that reflect the object described or action being taken.  Name types, functions, properties and namespaces LikeThis, variables and parameters likeThis, private fields likeThis_ and C++ macros LIKE_THIS.”  Notice the intentional similarity to the very practical naming convention for CLR development.  However, C++ is different enough from C# to necessitate a few changes.
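
    A short sketch may help show how the summarized rules look when applied together.  Every identifier below (GameInput, IDevice, GamePadDevice, MAX_PLAYER_COUNT and so on) is hypothetical and chosen only to illustrate the casing styles.

```cpp
#include <cstddef>
#include <string>
#include <vector>

#define MAX_PLAYER_COUNT 4                  // macros: UPPER_CASE

namespace GameInput                         // namespaces: UpperCamelCase
{
    class IDevice                           // interfaces: prefixed with I
    {
    public:
        virtual ~IDevice() {}
        virtual bool Connect() = 0;         // methods: UpperCamelCase verbs
    };

    class GamePadDevice : public IDevice    // implementing class ends with the interface name
    {
    public:
        // Enum values: camelCase, here with the optional type-name prefix.
        enum State { stateIdle, stateConnected };

        GamePadDevice() : state_(stateIdle) {}

        bool Connect()
        {
            state_ = stateConnected;
            return true;
        }

        State GetState() const { return state_; }

        void AddPlayerName(const std::string& playerName)   // parameters: camelCase
        {
            if (playerNames_.size() < MAX_PLAYER_COUNT)
                playerNames_.push_back(playerName);
        }

        std::size_t CountPlayers() const { return playerNames_.size(); }

    private:
        State state_;                           // private fields: camelCase plus trailing underscore
        std::vector<std::string> playerNames_;  // plural name for a collection field
    };
}
```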

     

    The details of my C++ naming proposal are as follows:

     

    Casing Styles Defined

     

    UpperCamelCase : the first letter in the identifier and the first letter of each subsequent, concatenated word are capitalized.  You can use UpperCamelCase for identifiers of three or more characters.  No underscores are used.  For example: DeviceLock, Scene, TabScene

     

    camelCase : The first letter of an identifier is lowercase and the first letter of each subsequent concatenated word is capitalized.  No underscores are used. For example: deviceLock, scene, tabScene

     

    UPPER_CASE : All letters in the identifier are capitalized.  Concatenated words are separated by an underscore.

     

    Type Names

    Type names in C++ include class, struct and interface identifiers, enum typenames, and typedefs.  In general, type names should be noun phrases, where the noun is the entity represented by the type.  For example, Button, Stack and File each have names that identify the entity represented by the type.  Choose names that identify the entity from the developer’s perspective; names should reflect usage scenarios.  Use these guidelines:

     

      • Use UpperCamelCase
      • Use nouns, noun phrases or occasionally adjective phrases.  Do not use verbs.
      • Consider ending the name of a derived class with the name of the base class.
      • Prefix interfaces with the letter I.  Do not prefix class names with the letter C, nor structs with the letter S, nor template class names with the letter T.
      • For a class that simply implements an interface, consider ending the class with the interface name, sans the I prefix.
      • Do not use abbreviations, except those that are commonly recognized (Io, Ctrl etc).

    Template parameter names

    Template parameter names are a bit of a special case.  Choose descriptive names for template parameters, unless a single-letter name is completely self explanatory and a descriptive name would not add value (consider using the letter T in such cases).

     

      • Use UpperCamelCase
      • Prefix the parameter name with the letter T.  Although template parameter names are usually types, it’s usually important to differentiate them from non parameterized names.  Someday our integrated development environments may provide a nice method of doing this without mangling the name; say, by displaying templated names in italics.
      • Consider indicating semantic constraints placed on a type parameter in the naming of the parameter.  For instance, a parameter constrained to the type ISignInMessageReceiver may be called TSignInMessageReceiver. 
      • Use nouns or noun phrases for object types and object instances and verbs for functor or function object parameters.
      • Do not use abbreviations except those that are commonly recognized.

    Enumeration value names

    In C#, references to enumeration value names must be prefixed by the enumeration type name.  Unfortunately, C++ has no such requirement; only the value name is referenced.  To accommodate this in our naming convention, follow these rules:

     

      • Use camelCase
      • Declare enum types within a scope appropriate to their usage
      • Use nouns, noun phrases or occasionally adjective phrases.  Do not use verbs.
      • Do not use abbreviations except those that are commonly recognized.
      • Optional: many C++ developers prefer to prefix enumeration value names with the name of the enumeration type. 
        For instance, enum State { stateIdle, stateReading, stateWriting }. 
        The prefix should not be an abbreviation of the enum typename.
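    A minimal sketch of these rules follows.  The Connection class and its members are hypothetical names chosen only for illustration: the enum is nested in the class that uses it, and each value carries the full type name as a prefix.

```cpp
#include <cassert>

// Hypothetical class used only to illustrate the naming rules.
class Connection
{
public:
    // The enum is declared in the scope that uses it; each value is camelCase
    // and prefixed with the full type name rather than an abbreviation.
    enum State { stateIdle, stateReading, stateWriting };

    Connection() : state_(stateIdle) {}

    State GetState() const { return state_; }

    void BeginRead()
    {
        assert(state_ == stateIdle);   // a read may only start from idle
        state_ = stateReading;
    }

private:
    State state_;
};
```

    Outside the class the values are referenced as Connection::stateIdle and so on, which recovers some of the qualification that C# requires.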

    Preprocessor symbol names

    Preprocessor symbols (macros) do not exist in C#, so the CLR naming convention offers little guidance.  The C++ industry standard is to declare macros LIKE_THIS.

     

      • Use UPPER_CASE
      • #undef temporary macro names
      • Do not use abbreviations except those that are commonly recognized.
      • Macro names should be at least three characters

     

    Method and function names

    Methods are actions upon an object and their names should employ verbs and verb phrases.  Do not select a name that describes how the method operates; in other words, do not use implementation details for your method names.

     

      • Use UpperCamelCase
      • Use verbs or verb phrases
      • Do not use implementation details in a method’s name
      • Consider prefixing event handling methods with “On”, such as “OnInitialize”

     

    Field names

    Although the C# naming convention proscribes exposing any fields with public or protected access, and recommends UpperCamelCase for private fields, I have found this to be a bit clumsy to employ in C++.  Moreover, it's inconsistent with C#'s variable naming rules.  Instead, I propose the following for field names.

     

      • Use camelCase
      • Use descriptive names, typically nouns (though function object fields may be verbs)
      • Use plural names for collection fields rather than suffixing with the container type.  Ex: “names” instead of “nameList”
      • Some developers prefix private class field names with “m_” (as with MFC classes). I personally prefer to suffix private class field names with an underscore likeThis_, as do Alexandrescu and other prominent C++ developers/authors.  Perhaps someday most IDEs will allow us to visually identify member field names (say, with italics) to differentiate them from other names; until then this kind of name decoration has proven useful.
      • Do not decorate field names with type information (as in Hungarian notation). 
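    Here is a short sketch of the field rules in use.  The Roster class is a hypothetical example, not code from a real project.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class illustrating the field naming rules.
class Roster
{
public:
    void AddName(const std::string& name) { names_.push_back(name); }

    const std::string& GetName(std::size_t index) const
    {
        assert(index < names_.size());   // caller must pass a valid index
        return names_[index];
    }

    std::size_t GetCount() const { return names_.size(); }

private:
    // camelCase, plural for a collection ("names_", not "nameList_"), a
    // trailing underscore to mark a private field, and no Hungarian prefix.
    std::vector<std::string> names_;
};
```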

    C++ Property Names

    Did you know that Microsoft’s C++ has properties?  Well it does, though the syntax is a bit clumsy.  It uses a special __declspec(property) directive. I usually define a set of macros that ameliorate the clunky syntax (I will write more on that in another blog posting).

     

      • Use UpperCamelCase
      • Use nouns, noun phrases or adjectives.
      • Prefix Boolean property names with Is, Can, Has etc. where it contributes to readability.
      • Prefer Boolean property names with affirmative phrases (CanSeek instead of CantSeek).
      • Properties imply simple value lookups or trivial computations so do not use a property when non-trivial computations are involved.  Instead use a method.
      • Prefer to name the "getter" and "setter" functions with GetPropertyName and SetPropertyName respectively (though this is not strictly necessary). 

    Parameters and auto/local variable names

     

      • Use camelCase
      • Use descriptive names which reflect how the variable will be used
      • Do not decorate names with type information (as in Hungarian notation)
      • Use nouns, noun phrases or an adjective, except for function objects
      • Use plural names for collection/container fields rather than suffixing with the container type.  Ex: “names” instead of “nameList”
      • Avoid declaring variables in the global scope; instead declare them as variables within an appropriate namespace and follow the naming convention for C++ Properties.
      • Do not declare a variable with the static keyword at global scope; this is a deprecated language feature.  Instead, use an anonymous namespace.
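    The last two rules can be sketched as follows.  RequestCount, NextRequestId and SubmitRequest are hypothetical names.  Note that the deprecation of file-scope static dates from C++98/03 and was later withdrawn in C++11, so treat this as a style preference rather than a language requirement.

```cpp
#include <cassert>

// Hypothetical example: names in an unnamed namespace are visible only inside
// this translation unit, which is what file-scope "static" was used for.
namespace
{
    int RequestCount = 0;      // replaces: static int RequestCount = 0;

    int NextRequestId()        // helper that is private to this file
    {
        return ++RequestCount;
    }
}

// Externally visible function that uses the file-local state.
int SubmitRequest()
{
    const int requestId = NextRequestId();
    assert(requestId > 0);     // ids start at 1 and only increase
    return requestId;
}
```

    The namespace-scope variable follows the property convention (UpperCamelCase), while the local variable requestId uses camelCase.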

     

    Namespaces

     

      • Use UpperCamelCase
      • Use nouns or noun phrases
      • Do not use generic type names that might conflict with class names (e.g. Element, Node, Log, Message)
      • Consider using plural names where appropriate: e.g. Strings instead of String
      • Do not use the same name for a namespace as a type within the namespace.
      • Do not place application-specific namespaces within the namespace of a shared library.

    Use of Acronyms

    Acronyms are generally proscribed, as they reduce readability, especially for programmers for whom English is a second language.  However, an acronym may be used if it is generally recognized by your programming community and if it doesn’t reduce readability.  Examples include DB, IO, Xml, Cpu, Gpu, Html, etc.

     

      • Capitalize both characters of two-character acronyms, except when the acronym is the first word of a camelCase identifier.  Example: DB, IO for UpperCamelCase and ioChannel for camelCase.
      • Capitalize only the first character of acronyms with three or more characters, except when the acronym is the first word of a camelCase identifier, where it is all lowercase.  Example: XmlReader for UpperCamelCase and xmlReader for camelCase.

     

  • The trouble with long double

    Back in ’04 I made a prediction that 80-bit floating point values would likely be supported in some future version of the VC++ compiler (just like we did in the 16-bit version of the compiler!).  Alas, it’s now ’07 and much to my disappointment this will not come to pass; so much for my foresight.  There are a number of technical reasons for this, not the least of which is that implementing the feature requires more than just a change to the compiler.  It is my strong impression, however, that Microsoft would have solved these issues had customers more aggressively clamored for this particular feature. 

    The point is now rapidly becoming moot.  80-bit doubles seem so “last century”.  We are approaching the day when most new Windows systems will have FPUs that easily support the 128-bit long double format.  It seems natural to expect, or rather for our customers to demand, that 128-bit long double semantics be fully supported in some future version of C++.

    So lobby away people! http://msdn2.microsoft.com/en-us/visualc/aa336397.aspx  

    128-bit long doubles would be a nice complement to the .NET 128-bit decimal floating point type.

  • Typesafe method of interfacing to DirectX shaders

    I thought this idea might be of interest to DirectX C++ programmers.

    Typesafety is perhaps the most critical feature of higher programming languages, and yet so often application programming interfaces introduce non-typesafe constructs. This is frequently the case with low-level APIs. A great example of this is DirectX’s APIs for setting GPU registers for vertex and pixel shaders. They include methods like these:

    D3DVOID SetVertexShaderConstantF(
      UINT StartRegister,
      CONST float *pConstantData,
      DWORD Vector4fCount
    );

    D3DVOID SetPixelShaderConstantF(
      UINT StartRegister,
      CONST float *pConstantData,
      DWORD Vector4fCount
    );

    StartRegister specifies the base register number; pConstantData should point to the value(s) to be loaded into the registers, where each register is four floats; and Vector4fCount specifies the number of registers to which the API should write.

    This API is intentionally generic.  But it lacks most of the type safety C++ offers. Consider the following simple cases representing aberrant uses of the API:

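    const float fValue = 1.5f;
    const D3DXVECTOR4 v4Value(1.0f, 2.0f, 3.0f, 1.0f);

    // case 1: a single float is passed where a four-float register is expected
    g_piDevice->SetVertexShaderConstantF(0, &fValue, 1);
    // case 2: one vector is passed, but the count claims four registers
    g_piDevice->SetVertexShaderConstantF(2, (const float*)v4Value, 4);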

    In case 1, the y, z and w components of vertex constant register 0 will be loaded with unintended values. In case 2, registers 3, 4 and 5 will be overwritten with garbage values since the Vector4fCount argument is wrong. In neither of these cases will the compiler provide an error or warning that something’s wrong.

    The easiest way to add some type safety is to provide a function that is overloaded on the value to be written. For the simple cases above, we could introduce the following:

    inline void SetVertexShaderConstantF(UINT RegisterID, float value)
    {
        // pad the scalar to a full register so y, z and w are well defined
        // note: on xbox, use XMVECTOR
        const D3DXVECTOR4 vTemp(value, 0.0f, 0.0f, 0.0f);
        g_piDevice->SetVertexShaderConstantF(RegisterID, (const float*)vTemp, 1);
    }
    inline void SetVertexShaderConstantF(UINT RegisterID, const D3DXVECTOR4& value)
    {
        // exactly one register is written, so the count cannot be wrong
        g_piDevice->SetVertexShaderConstantF(RegisterID, (const float*)value, 1);
    }
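    For readers who want to experiment without the DirectX SDK, here is a minimal sketch of the same overloading idea against a mock device.  MockDevice, Vector4 and g_device are stand-ins I made up for this illustration; they are not DirectX types.

```cpp
#include <cassert>
#include <cstring>

// Stand-in for a four-float vector (an assumption so the sketch compiles
// without DirectX).
struct Vector4
{
    float components[4];
};

// Hypothetical stand-in for the device: 256 registers of four floats each.
struct MockDevice
{
    float registers[256][4];

    MockDevice() { std::memset(registers, 0, sizeof(registers)); }

    // Same shape as the generic API: raw pointer plus a register count.
    void SetVertexShaderConstantF(unsigned startRegister,
                                  const float* constantData,
                                  unsigned vector4fCount)
    {
        assert(startRegister + vector4fCount <= 256);
        std::memcpy(registers[startRegister], constantData,
                    vector4fCount * 4 * sizeof(float));
    }
};

MockDevice g_device;

// Typesafe overloads: the caller can no longer supply a wrong pointer or count.
inline void SetVertexShaderConstantF(unsigned registerId, float value)
{
    const Vector4 padded = { { value, 0.0f, 0.0f, 0.0f } };   // y, z, w defined
    g_device.SetVertexShaderConstantF(registerId, padded.components, 1);
}

inline void SetVertexShaderConstantF(unsigned registerId, const Vector4& value)
{
    g_device.SetVertexShaderConstantF(registerId, value.components, 1);
}
```

    Because the pointer and the count are both derived from the argument type, neither of the two aberrant cases shown earlier can be written through these wrappers.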

    Some compilers, like the one on Xbox 360, are able to provide additional optimizations when parameter values to inline functions are literals. To ensure that this optimization is available when using our typesafe versions, we could add a template that passes the RegisterID as a literal instead of as a variable:

    template <UINT TRegisterID>
    inline void SetVertexShaderConstantF(float value)
    {
        // the register number is a compile-time literal, not a run-time argument
        const D3DXVECTOR4 vTemp(value, 0.0f, 0.0f, 0.0f);
        g_piDevice->SetVertexShaderConstantF(TRegisterID, (const float*)vTemp, 1);
    }

    . . .

    // used this way
    SetVertexShaderConstantF<0>(value);

    If the register traits of the target GPU are known a priori, we can further refine this idea by introducing a compile time constraint on the template parameter TRegisterID. On Xbox 360, this value must be in the range 0…255. To enforce this at compile time on the Xbox 360 we can use the _STATIC_ASSERT macro:

    template <UINT TRegisterID>
    inline void SetVertexShaderConstantF(float value)
    {
        // TRegisterID is unsigned, so only the upper bound needs checking
        _STATIC_ASSERT(TRegisterID <= 255);
        const D3DXVECTOR4 vTemp(value, 0.0f, 0.0f, 0.0f);
        g_piDevice->SetVertexShaderConstantF(TRegisterID, (const float*)vTemp, 1);
    }

    An error will now be generated at compile time if the programmer uses this function with a register id that is out of range. (NOTE: If static assertions are not available in your environment, you can use or build something like the boost library’s BOOST_STATIC_ASSERT. )
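    If your compiler supports C++11, the standard static_assert keyword does the same job as the macro.  The following is only a sketch under that assumption: WriteRegister, g_lastRegister, g_lastValue and MaxRegisterId are hypothetical stand-ins for the real device call and its limits.

```cpp
#include <cassert>

const unsigned MaxRegisterId = 255;   // assumed limit, as quoted for Xbox 360

// Hypothetical stand-in for the device call; it only records the last write.
unsigned g_lastRegister = 0;
float    g_lastValue    = 0.0f;

inline void WriteRegister(unsigned registerId, float value)
{
    assert(registerId <= MaxRegisterId);   // run-time check for direct callers
    g_lastRegister = registerId;
    g_lastValue = value;
}

template <unsigned TRegisterID>
inline void SetVertexShaderConstantF(float value)
{
    // Standard C++11 static_assert in place of the _STATIC_ASSERT macro.
    static_assert(TRegisterID <= MaxRegisterId, "register id out of range");
    WriteRegister(TRegisterID, value);
}
```

    A call such as SetVertexShaderConstantF<12>(2.5f) compiles, whereas SetVertexShaderConstantF<300>(2.5f) is rejected at compile time with the message given in the assertion.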

    Aside from type safety, it would be nice if our interface provided a simple, typesafe way to declare all the registers needed for a particular shader. Let me show you the basic way I do this for vertex shaders:

    class CVertexShader
    {
    public:
        template <class TDataType, UINT TRegisterID>
        class CConstant
        {
        public:
            inline void operator=(const TDataType& value)
            {
                SetVertexShaderConstantF<TRegisterID>(value);
            }
        };
    };

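    Notice that CVertexShader::CConstant has no data members.  C++ never gives an object a size of zero, so sizeof(CConstant<...>) is typically 1, but that single byte is all the storage a constant costs.  This is important because we don't want our strategy to impose any significant additional memory requirements.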

    Derived classes can then easily describe a typesafe program interface to a vertex shader. For example:

    class CSimpleVertexShader : public CVertexShader
    {
    public:
        CConstant< XMMATRIX, 0 > mWorld;
        CConstant< XMMATRIX, 4 > mView;
        CConstant< XMMATRIX, 8 > mProjection;
        CConstant< XMVECTOR, 12 > vEyePositionW;
    };
    
    . . .
    // used this way
    CSimpleVertexShader Simple;
    . . .
    Simple.mWorld = mWorld;
    			

    Essentially, this provides a compile-time name binding of a particular vertex shader’s registers that is both type-safe and convenient to use.
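    The whole pattern can be tried in isolation with a mock register file.  In this sketch g_registers, VertexShader, Constant, FogVertexShader and CheckFogBinding are hypothetical names, only float constants are handled, and C++11 static_assert is assumed.

```cpp
#include <cassert>

// Hypothetical stand-in for the device: remembers the last scalar written to
// each register so the binding can be observed without DirectX.
float g_registers[256] = { 0.0f };

template <unsigned TRegisterID>
inline void SetVertexShaderConstantF(float value)
{
    static_assert(TRegisterID <= 255, "register id out of range");
    g_registers[TRegisterID] = value;
}

class VertexShader
{
public:
    // Binds a data type and a register number to a name at compile time.
    template <class TDataType, unsigned TRegisterID>
    class Constant
    {
    public:
        void operator=(const TDataType& value)
        {
            SetVertexShaderConstantF<TRegisterID>(value);
        }
    };
};

// An empty class still occupies storage in C++: sizeof is never zero.
static_assert(sizeof(VertexShader::Constant<float, 0>) >= 1,
              "empty classes have nonzero size");

class FogVertexShader : public VertexShader
{
public:
    Constant<float, 20> fogStart;
    Constant<float, 21> fogEnd;
};

// Small self-check showing the intended usage.
inline void CheckFogBinding()
{
    FogVertexShader shader;
    shader.fogStart = 10.0f;   // writes register 20
    shader.fogEnd = 50.0f;     // writes register 21
    assert(g_registers[20] == 10.0f && g_registers[21] == 50.0f);
}
```

    Assigning a value of the wrong type to fogStart fails to compile unless it converts to float, which is exactly the kind of checking the raw API cannot give us.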

    I also enhance class CVertexShader to add run-time binding to a particular instance of a loaded or compiled vertex shader. This looks something like the following (I’ve omitted the runtime assertions and state checking for brevity):

    class CVertexShader
    {
    protected:
        CInterfacePtr<IDirect3DVertexShader9> m_piVertexShader;
    public:
        IDirect3DVertexShader9* operator -> () { return m_piVertexShader; }
        operator IDirect3DVertexShader9* () { return m_piVertexShader; }

        HRESULT Set() { return g_piDevice->SetVertexShader( m_piVertexShader ); }

        HRESULT Load(const char* pszFilename);   
        HRESULT Load(const wchar_t* pszFilename);   
        HRESULT Compile(const char* pszCode);   
        HRESULT Compile(const wchar_t* pszCode);   
    };

    The implementation for the concomitant CPixelShader is nearly identical. For brevity's sake I've left out the details, but I’ll be happy to post the complete code if anyone requests it.

  • Stanley110 asks...

    Sorry I haven't had time to post in quite a while.  I’ve been developing the new Xbox Live Arcade for the upcoming Xbox 360, a project that has taken considerable time and effort.

     

    Stanley110 asks:

    A) Excel uses double precision. What benefit with respect to accuracy or precision of the calculated result is there with respect to single precision?

    B) Electronic calculators use single precision? Is this true?

    C) Aside from computer speed and things like that, is there any difference in the accuracy and precision of an arithmetic calculation when it is by double precision than by single precision?

     

    Short Answer to Question A: The precision benefits of double precision over single precision are exactly as the name suggests: at least double the precision of single precision (an IEEE double carries a 53-bit significand, a single only 24 bits).  Accuracy is a different question altogether.  With carefully coded algorithms, single precision can yield very accurate results; however, most users (even most computer scientists) are not trained to devise such algorithms for all but the simplest cases.  If you perform the same computation in double and single precision, the double precision result will usually (though not always) be more accurate.  I hate to make a blanket statement like that, so PLEASE note the qualifiers before flaming me with email :-)
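    To see the difference for yourself, here is a small sketch that performs the same naive summation in both precisions.  AccumulationError is a hypothetical helper written for this illustration, and the comparison assumes an ordinary IEEE 754 platform.

```cpp
#include <cassert>
#include <cmath>

// Adds `term` to a running total `count` times in the given precision and
// returns the absolute difference from the reference answer `count * term`
// (the reference is evaluated in double, which is ample for this demo).
template <class TReal>
double AccumulationError(double term, long count)
{
    assert(count >= 0);
    TReal sum = 0;
    const TReal step = static_cast<TReal>(term);
    for (long i = 0; i < count; ++i)
        sum += step;               // rounding error accumulates on every add
    return std::fabs(static_cast<double>(sum) - term * static_cast<double>(count));
}
```

    Comparing AccumulationError<float>(0.1, 10000000L) with AccumulationError<double>(0.1, 10000000L) shows the single precision total drifting much further from the expected value.  A carefully coded algorithm (compensated summation, for example) would narrow that gap, which is the point about accuracy made above.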

     

    More info on floating point
    http://en.wikipedia.org/wiki/IEEE_floating-point_standard

    http://en.wikipedia.org/wiki/Talk:Computer_numbering_formats

     

    Accuracy and precision are not the same thing

    http://en.wikipedia.org/wiki/Accuracy

    http://en.wikipedia.org/wiki/Talk:Accuracy_and_precision

     

    (yes, I am a fan of Wikipedia; it's freaking brilliant!)

     

    Excel uses double precision mainly because it’s what’s available on most architectures.  Moreover, if care is taken to properly account for error, doubles are precise enough for many financial computations, especially the kind used by most Excel users.  However, even simple tax or interest computations can be perturbed by the use of double precision (so be careful and check your results!)

     

    Short Answer to Question B: No; well, maybe some cheap calculators given away in a box of Cap’n Crunch, but useful calculators will have at least 12 decimal digits of precision (single precision has only about 7: log10(2^24) ≈ 7.2).  Most inexpensive calculators use doubles or extended precision since the chips for it are fairly inexpensive.  Really nice calculators use extended double, quad precision or even provide decimal floating-point arithmetic with 28 or more decimal digits of precision.  Some even use rational number systems under certain circumstances to represent numbers like 1/10, 1/3 etc.  The built-in Windows calculator, for instance, provides 32 decimal digits of precision and uses rationals for certain computations.
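    The log10(2^24) estimate generalizes to any binary floating point type.  As a sketch, the hypothetical helper DecimalDigits below computes it from the significand width reported by std::numeric_limits.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Approximate number of decimal digits carried by a binary floating point
// type: log10(2^p), where p is the number of significand bits.
template <class TReal>
double DecimalDigits()
{
    assert(std::numeric_limits<TReal>::radix == 2);   // formula assumes binary
    return std::numeric_limits<TReal>::digits * std::log10(2.0);
}
```

    For IEEE single precision (24 significand bits) this comes to about 7.2 digits, and for double precision (53 bits) about 15.9, which is why a calculator showing 12 digits needs more than single precision.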

     

    Gossip: I heard a rumor that Excel may (soon?) provide computations using .NET’s decimal type.  But I haven’t been able to confirm this.  So naturally I must spread the rumor.

     

    Short Answer to Question C:  I recommend the articles above.  Keep in mind that on many systems, double precision computations are just as fast as, or even faster than, single precision computations. 

  • Visual Studio 2005 Beta

    Several people have sent technical s