I wrote the following short article several years ago. I've reproduced it here by request.
One of the curious features of the C language is its lack of an integrated string type. Most programming languages developed in the 1960s and 1970s included a basic string type. Strings in C, however, are just a special case of array data, and the only direct language support involves initialization of pointers and array values with string literals.
Fortunately, the C++ standard library introduces a standardized string type, std::string. Although its implementation is very well thought out, naïve use of std::string can fragment the heap, reduce performance and create unexpected bottlenecks. However, the same can be said for naïve use of strings in languages like Java and C#, where the String type is fundamental and most of the implementation details are hidden from the programmer. In C++ there are ways to mitigate potential performance problems. When I employ std::string, I prefer to use a memory model very similar to Boost’s segregated storage; this reduces memory fragmentation and keeps all the string data within a sandbox. It’s fairly easy to do this once you know how to write a standard allocator. The stringstream class in C++ is an efficient and elegant solution to complex string construction, and I prefer it over Java’s somewhat clumsy StringBuilder class. If you prefer the printf way of formatting strings, the Boost library offers an excellent type-safe alternative that’s built to work efficiently with standard stream classes.
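To make the allocator idea concrete, here is a minimal sketch of a std::string that draws its memory from a private arena. It is a much simpler scheme than Boost's segregated storage, offered only as an illustration: the names Arena, ArenaAllocator and ArenaString are hypothetical, the arena is a fixed 4096-byte bump allocator, and individual blocks are never reused.

```cpp
#include <cstddef>
#include <new>
#include <string>

// Hypothetical fixed-size arena: hands out memory front to back and
// gives it all back at once when the arena itself goes away.
struct Arena {
    alignas(std::max_align_t) unsigned char buffer[4096];
    std::size_t used = 0;

    void* allocate(std::size_t bytes) {
        // Round each request up so every block stays suitably aligned.
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t rounded = (bytes + align - 1) / align * align;
        if (rounded > sizeof(buffer) - used)
            throw std::bad_alloc();
        void* p = buffer + used;
        used += rounded;
        return p;
    }
};

// Minimal C++11 allocator that forwards every request to an Arena.
template <class T>
struct ArenaAllocator {
    using value_type = T;

    Arena* arena;

    explicit ArenaAllocator(Arena& a) noexcept : arena(&a) {}

    template <class U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->allocate(n * sizeof(T)));
    }

    // Individual blocks are never returned; the arena is released as a whole.
    void deallocate(T*, std::size_t) noexcept {}
};

template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena == b.arena;
}

template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}

// A std::string whose character data lives inside an Arena.
using ArenaString =
    std::basic_string<char, std::char_traits<char>, ArenaAllocator<char>>;
```

A string is then constructed with an allocator bound to a particular arena, for example ArenaString text("...", ArenaAllocator<char>(arena)). Short strings may still live in the string object's own small buffer and never touch the arena at all.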
Systems programmers and developers of high performance applications typically use C style strings. There are a number of reasons for this, but chief amongst them are efficiency and interoperability. C strings are efficient precisely because they are simple -- they can be allocated on the stack or as part of a larger structure or in the free-store, and operations on them can be specifically tailored for maximum efficiency. Moreover, C strings are usually not optional when interacting with operating system APIs, drivers, low level libraries and legacy code.
Being able to program with C strings is a fairly fundamental skill for most Microsoft developers; indeed, a good number of the coding questions we ask during technical interviews involve some sort of C string manipulation. Being conversant with C string manipulation and the concomitant standard library functions is a point of pride for many Microsoft programmers, especially those who cut their teeth on C instead of C++. Indeed, these C string "fanboys" are sometimes critical of developers who prefer std::string over character arrays and raw character buffers (I am not one of those critics).
I find that the worst part of programming with C strings is the string library itself. Take a very basic function like strcpy, for instance. In the early days the length of a symbol name was limited to just a few characters, so we can forgive the designers for selecting a somewhat less than human readable name. When there was only one way to copy a string, the name strcpy wasn’t a bad choice. However, today there are dozens of different variations on the string copy function name available to Visual C++ programmers. Here are 24 of them from <string.h> and <mbstring.h>:
strcpy, wcscpy, _mbscpy, _tcscpy,
strcpy_s, wcscpy_s, _mbscpy_s, _tcscpy_s,
strncpy, wcsncpy, _mbsncpy, _tcsncpy,
_strncpy_l, _wcsncpy_l, _mbsncpy_l, _tcsncpy_l,
strncpy_s, wcsncpy_s, _mbsncpy_s, _tcsncpy_s,
_strncpy_s_l, _wcsncpy_s_l, _mbsncpy_s_l, _tcsncpy_s_l
There are versions for four different character sets: ASCII, Mbcs, Unicode and TCHAR. There are safe and unsafe versions, locale specific versions, and versions with additional semantics. Now multiply those semantic variations against the dozen or so basic string operations and you have hundreds of different names to try to remember!
The function names in <strsafe.h> are a little more regular, but there are still dozens of names to remember. There are 24 of them for copying a string:
StringCbCopy, StringCbCopyA, StringCbCopyW,
StringCbCopyEx, StringCbCopyExA, StringCbCopyExW,
StringCbCopyN, StringCbCopyNA, StringCbCopyNW,
StringCbCopyNEx, StringCbCopyNExA, StringCbCopyNExW,
StringCchCopy, StringCchCopyA, StringCchCopyW,
StringCchCopyEx, StringCchCopyExA, StringCchCopyExW,
StringCchCopyN, StringCchCopyNA, StringCchCopyNW,
StringCchCopyNEx, StringCchCopyNExA, StringCchCopyNExW
To complicate matters, Strsafe.h is incomplete; it has neither multi-byte character support nor support for locale specific functions [locale specific functions have been added since the time this article was authored]. Moreover, the strsafe naming style is only available for those functions needing to prevent buffer overruns. String operations such as comparison and collation, which are already safe, have no implementation in this library.
I think there’s really only one name I should have to remember for each basic string operation. In this case, that name would be “Copy”—overloaded for each different semantic variation of the operation, but with a uniform scheme for parameterization and return values. The compiler should do all the work of figuring out which variation to use. Since the name “Copy” is applicable to more than just strings, we should declare the name within a “Strings” namespace. Since some variations are unsafe, we should introduce a counterpart namespace “UnsafeStrings” to make it very explicit when choosing to use an unsafe version of a string function. We declare the functions in a namespace instead of a class so that the library can be extensible. This also makes factoring the implementation code into different files a little easier.
For an initial example, the basic safe Copy operations for each of three string types would be declared as follows:
namespace LibraryName
{
namespace Strings
{
errno_t Copy(char* destination, size_t destinationSize, const char* source);
errno_t Copy(unsigned char* destination, size_t destinationSize, const unsigned char* source);
errno_t Copy(wchar_t* destination, size_t destinationSize, const wchar_t* source);
}
}
These are the basic copy functions for ASCII, Mbcs and Unicode respectively. Each of these overloaded versions of Copy simply dispatches to its counterpart in the standard library. For example, the ASCII version is:
inline errno_t Strings::Copy(char* destination, size_t destinationSize, const char* source)
{
return ::strcpy_s(destination, destinationSize, source);
}
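Since strcpy_s and errno_t are not available on every toolchain, the following is a portable sketch of the same dispatch idea rather than the library's actual code. It assumes a plain int return holding an errno value (0 on success), and the helper Detail::BoundedCopy is a hypothetical stand-in for strcpy_s and wcscpy_s. The real functions also run the CRT's invalid parameter handler on failure, which is not modeled here.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>   // std::char_traits

namespace LibraryName {
namespace Strings {
namespace Detail {   // hypothetical helper namespace, not from the article

// Bounded copy shared by every character type.
template <class Char>
int BoundedCopy(Char* destination, std::size_t destinationSize, const Char* source)
{
    typedef std::char_traits<Char> Traits;

    if (!destination || destinationSize == 0)
        return EINVAL;
    if (!source) {
        destination[0] = Char();          // leave an empty string behind
        return EINVAL;
    }
    const std::size_t length = Traits::length(source);
    if (length >= destinationSize) {
        destination[0] = Char();          // no room for the terminator
        return ERANGE;
    }
    Traits::copy(destination, source, length + 1);   // +1 copies the terminator
    return 0;
}

} // namespace Detail

// One name, overloaded per character type; the compiler picks the variant.
inline int Copy(char* destination, std::size_t destinationSize, const char* source)
{
    return Detail::BoundedCopy(destination, destinationSize, source);
}

inline int Copy(wchar_t* destination, std::size_t destinationSize, const wchar_t* source)
{
    return Detail::BoundedCopy(destination, destinationSize, source);
}

} // namespace Strings
} // namespace LibraryName
```

Calling code writes Strings::Copy(buffer, size, text) for both narrow and wide buffers, and a copy that would not fit returns ERANGE instead of overrunning the buffer.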
The counterpart unsafe Copy functions would be declared as follows:
namespace LibraryName
{
namespace UnsafeStrings
{
using Strings::Copy;
errno_t Copy(char* destination, const char* source);
errno_t Copy(unsigned char* destination, const unsigned char* source);
errno_t Copy(wchar_t* destination, const wchar_t* source);
}
}
Notice that the safe versions of Copy are composited into the UnsafeStrings namespace with a using declaration. This is done for convenience and makes both Strings and UnsafeStrings a complete name-set.
The unsafe definitions of Copy need additional parameter checking to ensure that the function semantics are uniform with the safe versions. In practice this doesn’t usually introduce much of a performance penalty since, depending on context, the compiler can often optimize away these additional parameter checks when the function gets expanded inline. In any case, the cost of the parameter checking is trivial compared to the cost of the copy. For example:
inline errno_t UnsafeStrings::Copy(char* destination, const char* source)
{
if (!destination || !source)
return EINVAL;
::strcpy(destination, source);
return 0;
}
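The effect of the using declaration is easiest to see when both overload sets sit side by side. The sketch below is an assumption-laden illustration, not the library's code: it returns int instead of errno_t, and the bounded overload is a simplified portable stand-in for the strcpy_s-based version.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>

namespace LibraryName {
namespace Strings {

// Bounded copy: simplified portable stand-in for the strcpy_s-based overload.
inline int Copy(char* destination, std::size_t destinationSize, const char* source)
{
    if (!destination || destinationSize == 0 || !source)
        return EINVAL;
    const std::size_t length = std::strlen(source);
    if (length >= destinationSize)
        return ERANGE;
    std::memcpy(destination, source, length + 1);   // includes the terminator
    return 0;
}

} // namespace Strings

namespace UnsafeStrings {

using Strings::Copy;   // the bounded overload becomes visible here as well

// Unbounded copy: the caller vouches for the size of destination.
inline int Copy(char* destination, const char* source)
{
    if (!destination || !source)
        return EINVAL;
    std::strcpy(destination, source);
    return 0;
}

} // namespace UnsafeStrings
} // namespace LibraryName
```

With this arrangement UnsafeStrings::Copy(buffer, size, text) resolves to the bounded overload and UnsafeStrings::Copy(buffer, text) to the unbounded one, while Strings::Copy(buffer, text) with a plain pointer does not compile. Code that sticks to the Strings namespace therefore cannot pick up an unbounded copy by accident.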
The different semantic variations take on a very regular form. For instance, to copy a limited number of characters we simply declare additional overloads as follows:
namespace LibraryName
{
namespace Strings
{
errno_t Copy(char* destination, size_t destinationSize, const char* source, size_t maxCount);
errno_t Copy(unsigned char* destination, size_t destinationSize, const unsigned char* source, size_t maxCount);
errno_t Copy(wchar_t* destination, size_t destinationSize, const wchar_t* source, size_t maxCount);
}
namespace UnsafeStrings
{
using Strings::Copy;
errno_t Copy(char* destination, const char* source, size_t maxCount);
errno_t Copy(unsigned char* destination, const unsigned char* source, size_t maxCount);
errno_t Copy(wchar_t* destination, const wchar_t* source, size_t maxCount);
}
}
As with the earlier variations, the implementations of these simply dispatch to the correct counterpart functions in the standard library. Again, the unsafe versions need a little additional code to perform some parameter checking.
The locale specific variations of Copy are similarly implemented.
It’s convenient to add safe versions of Copy specifically for arrays, so that the compiler deduces the destination size from the array type:
namespace Strings
{
template <size_t destinationSize>
inline errno_t Copy(char (&destination)[destinationSize], const char *source)
{
return Copy(destination, destinationSize, source);
}
}
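A self-contained sketch of the array overload follows. As before it assumes an int return in place of errno_t and a simplified portable bounded copy in place of strcpy_s. The point of interest is the template parameter, which the compiler deduces from the type of the destination array.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>

namespace LibraryName {
namespace Strings {

// Bounded copy: simplified portable stand-in for the strcpy_s-based overload.
inline int Copy(char* destination, std::size_t destinationSize, const char* source)
{
    if (!destination || destinationSize == 0 || !source)
        return EINVAL;
    const std::size_t length = std::strlen(source);
    if (length >= destinationSize) {
        destination[0] = '\0';
        return ERANGE;
    }
    std::memcpy(destination, source, length + 1);   // includes the terminator
    return 0;
}

// Array overload: destinationSize is deduced, so the caller cannot pass
// a size that disagrees with the actual array.
template <std::size_t destinationSize>
inline int Copy(char (&destination)[destinationSize], const char* source)
{
    return Copy(&destination[0], destinationSize, source);
}

} // namespace Strings
} // namespace LibraryName
```

Given char name[4], the call Strings::Copy(name, "abc") succeeds and Strings::Copy(name, "abcd") reports ERANGE. The overload only applies to true arrays; a char* argument still requires the explicit size.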
Other string operations are similarly easy to define. Consider, for instance, the 28 different name variations for functions comparing two strings:
strcmp, wcscmp, _mbscmp, _tcscmp,
_stricmp, _wcsicmp, _mbsicmp, _tcsicmp,
_stricmp_l, _wcsicmp_l, _mbsicmp_l, _tcsicmp_l,
strncmp, wcsncmp, _mbsncmp, _mbsncmp_l ,
_tcsnccmp, _tcsncmp, _tccmp,
_strnicmp, _wcsnicmp, _mbsnicmp,
_strnicmp_l, _wcsnicmp_l, _mbsnicmp_l,
_tcsncicmp, _tcsnicmp, _tcsncicmp_l
As with the copy functions, the compare functions have versions for four different character sets: ASCII, Mbcs, Unicode and TCHAR. There are locale specific versions, versions with case insensitive comparison semantics, and some versions with different names but identical semantics.
As with Strings::Copy, the string comparison operation should have only one name, Compare. The declarations for the basic Compare functions are:
namespace LibraryName
{
namespace Strings
{
int Compare(const char* string1, const char* string2);
int Compare(const unsigned char* string1, const unsigned char* string2);
int Compare(const wchar_t* string1, const wchar_t* string2);
}
namespace UnsafeStrings
{
using Strings::Compare;
// there are no unsafe specific versions of Compare
}
}
Each of these overloaded versions of Compare simply dispatches to its counterpart in the standard library. For example:
inline int Strings::Compare(const char* string1, const char* string2)
{
return ::strcmp(string1, string2);
}
The function strcmp and the other standard string comparison functions have undefined behavior when passed bad parameters such as NULL pointers. This makes them unsuitable as predicate operations for sorting algorithms and ordered containers whose elements may be NULL. Historically this was done for performance reasons, since parameter checking was considered “expensive” due to the extra branch operations. This defense is somewhat dubious, since comparing two strings is far more expensive than the parameter checking. Accordingly, we redefine the Compare functions with an alternate semantic: one that is well-ordered for any two string arguments, NULL or not.
// function Compare(a,b)
// Compares two strings lexicographically
// returns
// <0 when a < b
// 0 when a == b
// >0 when a > b
// except when either a or b is NULL, then
// <0 when a==NULL && b!=NULL
// 0 when a==NULL && b==NULL
// >0 when a!=NULL && b==NULL
inline int Strings::Compare(const char* a, const char* b)
{
if (!a)
return b ? -1 : 0;
if (!b)
return +1;
return ::strcmp(a, b);
}
The newly added semantics are: two NULL strings are equal, and a NULL string is considered “less than” one that isn’t NULL. This means that if the Strings::Compare function is used in a predicate operator, the NULL strings sort to the front. The compiler can sometimes optimize away the parameter checking when the function is inlined.
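As an illustration of the NULL ordering, the sketch below sorts a list that contains NULL entries. IsLessThan appears in the library's list of function names, but the signature used here is an assumption, and SortNames is a hypothetical caller invented for the example.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

namespace LibraryName {
namespace Strings {

// NULL-aware comparison: NULL equals NULL and orders before any real string.
inline int Compare(const char* a, const char* b)
{
    if (!a)
        return b ? -1 : 0;
    if (!b)
        return +1;
    return std::strcmp(a, b);
}

// Assumed signature; a strict "less than" predicate built on Compare.
inline bool IsLessThan(const char* a, const char* b)
{
    return Compare(a, b) < 0;
}

} // namespace Strings
} // namespace LibraryName

// Hypothetical caller: sorts names that may contain NULL entries.
inline void SortNames(std::vector<const char*>& names)
{
    std::sort(names.begin(), names.end(), LibraryName::Strings::IsLessThan);
}
```

Sorting a list such as "pear", NULL, "apple", NULL places both NULL entries first, followed by "apple" and "pear". Passing strcmp directly as the basis of the predicate would have undefined behavior for the same input.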
Case insensitive comparison requires an additional tag type. It’s introduced in the following code as enumCaseInsensitive, with the tag names CaseInsensitive and CASE_INSENSITIVE.
namespace LibraryName
{
namespace Strings
{
enum enumCaseInsensitive { CaseInsensitive, CASE_INSENSITIVE };
int Compare(const char* string1, const char* string2, enumCaseInsensitive);
int Compare(const unsigned char* string1, const unsigned char* string2, enumCaseInsensitive);
int Compare(const wchar_t* string1, const wchar_t* string2, enumCaseInsensitive);
}
namespace UnsafeStrings
{
using Strings::Compare;
}
}
As with the case sensitive version, this version of Compare simply dispatches to the appropriate counterpart in the standard library, adding the same NULL semantics as before. For example:
inline int Strings::Compare(const char* a, const char* b, enumCaseInsensitive)
{
if (!a)
return b ? -1 : 0;
if (!b)
return +1;
return ::_stricmp(a, b);
}
To use the case insensitive version of Compare, the calling code simply passes in the Strings::CaseInsensitive tag. I usually bring the identifiers “CaseInsensitive” or “CASE_INSENSITIVE” into the current namespace with a using declaration.
using Strings::CaseInsensitive;
. . .
if ( Strings::Compare(name1, name2, CaseInsensitive) < 0 )
{
. . .
}
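The tag technique can be tried outside Visual C++ with the following sketch. Because _stricmp is specific to Microsoft's runtime, the case insensitive overload here uses a hand-written loop over std::tolower as a stand-in; that loop is an assumption of this example and follows the current C locale only.

```cpp
#include <cctype>
#include <cstring>

namespace LibraryName {
namespace Strings {

// The tag type carries no data; it only selects an overload at compile time.
enum enumCaseInsensitive { CaseInsensitive, CASE_INSENSITIVE };

// Case sensitive comparison with the NULL ordering described earlier.
inline int Compare(const char* a, const char* b)
{
    if (!a)
        return b ? -1 : 0;
    if (!b)
        return +1;
    return std::strcmp(a, b);
}

// Case insensitive comparison, selected by passing the tag as a third argument.
inline int Compare(const char* a, const char* b, enumCaseInsensitive)
{
    if (!a)
        return b ? -1 : 0;
    if (!b)
        return +1;
    for (;; ++a, ++b) {
        // Convert through unsigned char so std::tolower never sees a negative value.
        const int ca = std::tolower(static_cast<unsigned char>(*a));
        const int cb = std::tolower(static_cast<unsigned char>(*b));
        if (ca != cb)
            return ca < cb ? -1 : +1;
        if (ca == 0)
            return 0;   // both strings ended together
    }
}

} // namespace Strings
} // namespace LibraryName
```

The tag parameter is unnamed and unused, so an inlined call costs nothing extra at run time. The choice between the two comparisons is made entirely by overload resolution.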
Conclusion
The completed Strings library contains the following functions: Append, Collate, Compare, CompareOrdinal, Convert, Copy, Find, IsEqual, IsLessThan, IsGreaterThan, Length, PrintF/VPrintF, PrintFLength, Replace, ScanF/VScanF, and Tokenize. In all, this Strings library has hundreds of functions but only sixteen function names to remember.
Incidentally, the similarity to the function naming convention in the C# String class is no coincidence.