Jumping into C++

I was a professional developer for years, working initially with C in labs and an engineering firm before moving on to other languages and stints as an enterprise developer and consultant. Other languages and tools were more in demand for the projects I worked on, so I gradually moved away from native development and lost touch with the C++ story. It has been years since I developed anything in C or tried using C++.

How hard would it be to write a familiar application in modern C++?

This blog post shares my experience returning to native programming using modern C++. No printf statements were harmed in this effort. A few of my colleagues might have been annoyed by my many questions, so shout out to James McNellis for tutoring and reviewing the code and blog post drafts and to Sumit Kumar and Niklas Gustafsson for deciphering warning messages.

Any errors in the code or this post are mine.

Ground Rules

I set out to write a simple application using modern C++. I imposed the following rules and requirements:

  • Use only modern C++. I wanted to use streams, iterators, and templates, and to map how I did things in C, C# and other languages onto modern C++.
  • Use best practices. For example, "when in doubt, use vector." If a value was not meant to change, make it const. Don't use macros. Go with what makes sense and seems to have the most support.
  • Share all gotchas! I expected to run into simple problems reflecting my lack of C++ experience. Fortunately, I could wander the halls and ask my colleagues on the Visual C++ team for help/answers/advice and commiseration. Spoiler: one of my gotchas involved precompiled headers. *facepalm*
  • Keep the project simple. The application was not going to be a Windows Store application. I wanted to read and write some data, do a few loops, mess around with a collection or two, and use std::string, not build a "real" application or explore more advanced concepts. I do hope to build a more sophisticated "real" application in the future.
  • Use Visual Studio and Visual C++. Probably a no-brainer, but I did have some concerns about finding my way around VS2012. I had read a few complaints about the default editor color palette, but I had no complaints. I also found all the shortcuts I had learned years ago continue to work.

Given these ground rules, I chose a simple project, counting words across a number of files. It requires input and output, a way to map a count to a word, some string manipulation (if desired), and loops. It also works well as a command-line application so I could avoid UI issues. If written well, the code could always be ported to a fancier Windows application in the future.

Requirements

The application needs to count the words in a set of text files provided by the user. It must accept a list of files on the command line, process each file while accumulating word counts and skipping bad/missing files, and then print out the number of files processed and the list of words and counts. For this exercise, words are chunks of non-whitespace characters delimited by one or more whitespace characters—nothing too sophisticated.

For all requirements, I chose the simplest possible solution. For example, the program targets ASCII files and does not deal with wide characters (though I did mess around with wstring and wifstream without running into problems). The program avoids handling certain binary files (test files are from Project Gutenberg). Words include unwanted characters even after simple filtering, and so on.

Would you realistically implement these requirements with C++? You might if you could squeeze more performance out of file processing, needed to reuse the code on different platforms, or wanted more control. You might not if you had access to PowerShell or another scripting language and wanted a quick solution. As with all interesting choices, the answer is "it depends." If you wanted to use AppleSoft Basic on an old ][+, go for it (and share your code in the comments section below)!

The solution itself is short, but it did take a couple of iterations to whip into proper C++ shape. If you want the code, it is attached but should not be used in production code blah blah blah.

Interesting Bits

There were a few interesting bits -- interesting mostly because I was new to modern C++ and a bit rusty on using the C/C++ compiler. Some of the issues encountered were fixed using information online (the C++ community and online ecosystem are awesome); others I inflicted on my colleagues.

Processing Command-Line Arguments

Rather than looping through the command line arguments array argv, I went ahead and converted it to a vector<string>.

int main(int argc, char** argv)
{
  // bail if no files are specified (argv[0] is normally the program name)
  if(argc < 2)
    return 0;

  const vector<string> files(argv, argv + argc);
}

Iteration is straightforward, especially if (unlike me) you remember to use const auto&.

for (const auto& file : files) ...

We have no intention of modifying the file string. Defaulting to const auto& when writing a loop seems like a safe best practice.
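As a minimal sketch of the conversion above, the same iterator-range constructor can start at argv + 1 so the program name never enters the vector. This is an alternative to the approach the post takes, and the helper name collect_files is hypothetical (it is not in the attached source).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: copy the command-line arguments into a vector<string>,
// leaving out argv[0] (normally the program name).
std::vector<std::string> collect_files(int argc, const char* const* argv)
{
    if (argc < 2)
        return {};                      // no file names were supplied

    // vector's iterator-range constructor builds one string per C string
    return std::vector<std::string>(argv + 1, argv + argc);
}
```

Inside main this could be called as const auto files = collect_files(argc, argv); and then iterated with for (const auto& file : files).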

If run from the Windows command line, the first argument is usually the path/name of the executable. Rather than weed out this case when copying from argv, I did it during the file processing loop:

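// the length check keeps short names from matching through unsigned wrap-around
if(file.length() >= 4 && file.compare(file.length() - 4, 4, ".exe") == 0)
  continue;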

If the file name ends with ".exe", skip it. It could be more robust, but the basic mechanism is in there.
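One way the check might be made more robust is to ignore case, since Windows file names are usually case-insensitive and the executable could show up as "WORDS.EXE". The sketch below assumes plain ASCII names; ends_with_icase is a hypothetical helper, not something from the attached source.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `text` ends with `suffix`, ignoring ASCII case.
bool ends_with_icase(const std::string& text, const std::string& suffix)
{
    if (suffix.size() > text.size())
        return false;                   // too short to contain the suffix

    const std::size_t offset = text.size() - suffix.size();
    for (std::size_t i = 0; i < suffix.size(); ++i)
    {
        // cast to unsigned char before tolower to avoid undefined behavior
        const int a = std::tolower(static_cast<unsigned char>(text[offset + i]));
        const int b = std::tolower(static_cast<unsigned char>(suffix[i]));
        if (a != b)
            return false;
    }
    return true;
}
```

In the file loop the test would then read if (ends_with_icase(file, ".exe")) continue;.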

Reading a File

Reading a file is as easy as the different C++ tutorials claim: grab an appropriate stream, point it to the file and if it is not bad, read a word at a time until there are no more words to read and then close the file. For this project, the appropriate stream is ifstream, though I did have it working with wifstream and wide characters. If you decide to go that route, use wstring.

My first attempt explicitly checked for a bad file like so:

ifstream infile(file);
if(!infile.bad())
{
  string word;
  while(infile >> word) ...

Turns out ifstream::bad() only reports badbit, which is set when an I/O operation hits an unrecoverable error. A failed open sets failbit instead, so the stream is not "bad" even if the file is non-existent. I needed to use a different strategy, one that avoids the explicit "badness" check:
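To see the difference between the stream state flags, here is a small sketch with two hypothetical helpers (neither is from the attached source). It relies on the standard behavior that a failed open sets failbit rather than badbit.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: report whether a file could be opened for reading.
bool can_open(const std::string& path)
{
    std::ifstream infile(path);
    // A failed open sets failbit, so is_open() (or !infile.fail())
    // is the reliable way to detect a missing file.
    return infile.is_open();
}

// Hypothetical helper showing the trap from the first attempt:
// bad() stays false for a missing file because badbit was never set.
bool bad_after_open(const std::string& path)
{
    std::ifstream infile(path);
    return infile.bad();
}
```

For a path that does not exist, can_open returns false while bad_after_open also returns false, which is why the !infile.bad() test let missing files through.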

ifstream infile(file);
string word;

// if we can pull a word (file is good)
if(infile >> word)
{
  // process all of the contents
  do
  {
  } while(infile >> word);
}

This code "primes the pump" by test-reading a word from the file before processing the rest of the file. Per-file code (like counting the number of files actually processed) can be shoved after the test-word is successfully read.

If something goes horribly wrong while processing a file, we don't try to pick up the pieces. When the infile falls out of scope, the file is closed.
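The "prime the pump" loop can be sketched against any input stream, which also makes it easy to try without a file on disk. The function names below are hypothetical and the loop body (incrementing a map entry) is an assumption about what the elided body does.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: add every whitespace-delimited word in `in` to `words`.
// Returns true if at least one word was read, which a caller could use to
// decide whether the file counts as "processed".
bool count_words(std::istream& in, std::map<std::string, unsigned int>& words)
{
    std::string word;
    if (!(in >> word))
        return false;                   // unreadable or empty: nothing to count

    do
    {
        ++words[word];                  // assumed loop body: tally the word
    } while (in >> word);
    return true;
}

// Convenience wrapper for experimenting with in-memory text instead of a file.
bool count_words_in_text(const std::string& text,
                         std::map<std::string, unsigned int>& words)
{
    std::istringstream in(text);
    return count_words(in, words);
}
```

With a real file, the call would be std::ifstream infile(file); if (count_words(infile, words)) ++files_processed; where files_processed is again just an illustrative name.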

Tracking Word Counts

Words are tracked in a map using the word as key and an unsigned integer to keep track of the number of occurrences. If a word is not in the map, it is automatically added, avoiding the need for extra code.

map<string, unsigned int> words;

string word;

...

words[word]++;  

This code is on the "simple" end of the map complexity continuum; implementations can get ugly quick. I used unsigned int because word counts will not be negative – there are no "anti-words" in this exercise.
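The automatic insertion comes from map's operator[], which adds a value-initialized entry (zero for unsigned int) when the key is missing and returns a reference to it. A minimal sketch, with tally as a hypothetical helper name:

```cpp
#include <map>
#include <string>

// Hypothetical helper: count one occurrence of `word` and return the new total.
// operator[] inserts a zero count for a missing key, so no find/insert
// bookkeeping is needed before incrementing.
unsigned int tally(std::map<std::string, unsigned int>& words,
                   const std::string& word)
{
    return ++words[word];
}
```

The flip side is that merely reading words[key] also inserts the key, so lookups that should not grow the map are better done with find or count.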

I made one tweak to the program once it was working. The original version counted contiguous chunks of characters delimited by one or more spaces without worrying about whether a character was punctuation, so dog and 'dog' each counted as a unique word. It bugged me so I looked for a way to remove a set of characters from a string. What I found was:

word.erase(remove_if(word.begin(), word.end(), &::isremovable), word.end());

The inner remove_if removes characters the custom function isremovable says should be removed (i.e., returns true), shifting all non-removable characters to the left. When done, remove_if returns an iterator pointing to the new end of the word. The outer word.erase removes characters from the new end to the actual end of the word.

This looked like mumbo jumbo until James explained it. It also helped stepping through example code that split the operations into separate lines. Once I got it, it seemed obvious, an "aha" moment that would help me dissect similar statements in the future. Hopefully!
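Splitting the statement into separate lines looks roughly like the sketch below. The post does not show the body of isremovable, so the version here (flagging punctuation) is only a stand-in, and strip_removable is a hypothetical wrapper.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Stand-in for the post's isremovable, whose body is not shown:
// this version simply flags punctuation characters.
bool isremovable(char c)
{
    return std::ispunct(static_cast<unsigned char>(c)) != 0;
}

// The one-line erase/remove_if statement split into two steps.
std::string strip_removable(std::string word)
{
    // Step 1: shift the characters to keep toward the front. The string's
    // length is unchanged; new_end marks where the kept characters stop.
    const std::string::iterator new_end =
        std::remove_if(word.begin(), word.end(), isremovable);

    // Step 2: erase the leftover tail so the string actually shrinks.
    word.erase(new_end, word.end());
    return word;
}
```

Stepping through this in the debugger with a word such as 'dog' shows the string still holding five characters after step 1 and three after step 2.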

If I wanted to go fully modern C++, I'd replace the isremovable with a lambda, but then there would be too much going on in that one statement for this first attempt :-)
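For the curious, a lambda version might look like the following sketch. The predicate (remove punctuation) is an assumption, since the real isremovable is not shown, and strip_punctuation is a hypothetical name.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: the erase/remove_if idiom with an inline lambda
// in place of a named predicate function.
std::string strip_punctuation(std::string word)
{
    word.erase(
        std::remove_if(word.begin(), word.end(),
                       [](char c) {
                           // assumed rule: drop punctuation characters
                           return std::ispunct(static_cast<unsigned char>(c)) != 0;
                       }),
        word.end());
    return word;
}
```

The lambda keeps the rule next to the code that uses it, at the cost of a denser statement.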

Printing Results

My first try at printing to console netted the following:

for (pair<string, long> c : words)

    cout << c.first << ":" << c.second << endl;

It worked, but the pair declaration is wrong: the map's elements are pair<const string, unsigned int> (map keys are const), so each iteration converted an element into a temporary pair<string, long>, making the program less efficient. I updated it using what I had learned working on the file processing code:

for (const auto& c : words)

    cout << c.first << ":" << c.second << endl;

If I had defined a new type for the word map, I would have been able to use another mechanism but like lambdas, it can wait.
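One possible reading of "a new type for the word map" is a type alias, which gives the element type a name that can be used in place of auto. The sketch below is only a guess at that idea; WordCounts, print_counts and format_counts are hypothetical names.

```cpp
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical alias for the word map used in the post.
typedef std::map<std::string, unsigned int> WordCounts;

// Print each word and its count, one per line, in key order.
void print_counts(std::ostream& out, const WordCounts& words)
{
    // value_type is pair<const string, unsigned int>, so binding by
    // const reference avoids copying or converting each element.
    for (const WordCounts::value_type& entry : words)
        out << entry.first << ":" << entry.second << '\n';
}

// Convenience wrapper that returns the report as a string.
std::string format_counts(const WordCounts& words)
{
    std::ostringstream out;
    print_counts(out, words);
    return out.str();
}
```

Calling print_counts(std::cout, words) gives the same output as the loop above, except that it writes '\n' rather than endl and so does not flush after every line.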

Gotchas!

My "gotchas":

  • Selecting the wrong project template for the job. In my first go around, I chose a project type that included precompiled headers and handled Unicode. I had forgotten that includes need to go after the precompiled header include in the source file, resulting in some funky errors whose cause was not immediately obvious to me (I figured my C++ was wrong!). With Unicode came TCHAR, adding complexities around printing and manipulating strings. For small projects, start with an empty C++ project and write everything from scratch. It is easy to extend later.
  • Forgetting to include the right library. When this happened, I was certain I had the correct includes and so assumed the errors being thrown at me were from bad code. This is part of the learning curve. Double-check includes! Online docs and frequent compiles helped.
  • Getting buried in complexity. Part of my time was spent reviewing C++ information related to my task, in particular the STL and templates. It did not take too long to go from beginner content to the dragon's den in an article, discussion thread or a few "related article" clicks. Some STL code cannot be unseen. Understand there is complexity, file it away, and refocus on the immediate goal.
  • Editor squigglies. On the first version of the project, I used "for each" when looping through files and word counts. The editor "squiggled" the container in each case yet the compiler had no complaints. Turns out "for each" is a Visual Studio extension; when I used "for", the squiggles went away. The code was technically correct, but there was a better way. Verify the veracity of the squiggle – could the statement be tweaked to get rid of it?
  • Assuming the requirements and implementation were "simple". Nothing beats a friendly code review from an expert. James reviewed the code and this post and identified quite a few fundamental "oops," "d'ohs," and "ughs". Getting a program to work is one thing; making sure it is correct (and I could explain why) was much harder. Don't avoid peer code reviews!

Most of these are pretty basic. Your gotchas may vary (YGMV).

Is C++ Hard?

Not if you start with a small, familiar project, tackle a few new concepts at a time, and use all the wonderful resources in the community (including the community itself).

I hope to tackle a more complex project next time. Stay tuned!

Attachment: WordCountInFiles.cpp
  • James: neat!

    For completeness:

    KJ // Kenny & James

    stacked-crooked.com/view

    > words 4300.txt > outKJ

    Time elapsed: 0.103112 seconds

    > words 4300.txt > outKJ

    Time elapsed: 0.103627 seconds

    > words 4300.txt > outKJ

    Time elapsed: 0.104398 seconds

    // For consistency, using Boost.Chrono's implementation of steady_clock as in the other measurements and included the I/O in the measurement (as in the previous cases).

    // Boost implementation doesn't suffer from the following issue: connect.microsoft.com/.../steady-clock-class-and-clock-function-are-non-conformant-with-c-11-and-c11-standards

    Pretty good! Does great even with the overhead of threading (given that we're processing only one file, for comparability with the other measurements).

    Hey, you guys wouldn't feel like doing the memory-mapped files w/ Boost (for portability) by any chance? ;-)

    Boost.Iostreams: www.boost.org/.../mapped_file.html

    Boost.Interprocess: www.boost.org/.../sharedmemorybetweenprocesses.html

    Based on the following, I'm guessing Boost.Iostreams is a good choice for this case:

    stackoverflow.com/.../is-there-a-difference-between-boost-iostream-mapped-file-and-boost-interprocess

  • MattPD> I'm guessing regex is the main workhorse contributing to the overall cost.

    Different regex implementations have different costs. Try swapping in Boost Regex and see what happens.

    > This made me curious... how?

    I speculated that the compiler might have to emit an instruction to copy an int, but this speculation was probably wrong. There is unlikely to be an observable performance difference either way (for ints).

    > My worries are somewhat along these lines:

    That analysis is incorrect. The mistake is in "However, if you use perfect forwarding, perfect forwarding via template argument deduction will effectively work like this: void SomeFuncFwd(std::string &&str) {SomeFunc(std::move(str));}". The outer function's signature will NOT become (string&&) when the original argument is a string literal. (I am leaving aside the question of what happens when the original argument is a string literal in braces; I am 95% certain that is a non-deduced context according to N3485 14.8.2.5 [temp.deduct.type]/5, see the example g({1,2,3}).) If the call is outer("A String"), the signature becomes outer(const char (&)[9]). inner(string) then constructs a temporary string in-place as usual, with no performance penalty.

    jalf> size_t is intended to describe the size of objects.

    The Core Language emits size_t when you say sizeof(object) or sizeof(type), but libraries use it for more than describing the byte size of objects/types. The STL uses size_t to describe the sizes of containers in elements (not bytes) because they live in memory. (Strictly speaking, the STL uses size_type typedefs, but these are always size_t except for very strange custom allocators.)

    > But as always with all the different typedefs of various integers, the difference is kind of academic.

    It ceases to be academic as soon as someone gives a big file to your 32-bit program.

  • STL>Different regex implementations have different costs. Try swapping in Boost Regex and see what happens.

    Done! :-)

    /* *** */

    Stephan, std::map, Boost.Regex: // 34544 unique words.

    > WordCountInFiles.Sb.exe 4300.txt > outSb

    Time elapsed: 0.246321 seconds

    > WordCountInFiles.Sb.exe 4300.txt > outSb

    Time elapsed: 0.244597 seconds

    > WordCountInFiles.Sb.exe 4300.txt > outSb

    Time elapsed: 0.248894 seconds

    /* *** */

    Stephan, std::unordered_map, Boost.Regex: // 34544 unique words.

    > WordCountInFiles.Sub.exe 4300.txt > outSub

    Time elapsed: 0.203676 seconds

    > WordCountInFiles.Sub.exe 4300.txt > outSub

    Time elapsed: 0.203893 seconds

    > WordCountInFiles.Sub.exe 4300.txt > outSub

    Time elapsed: 0.202584 seconds

    /* *** */

    Wow! What just happened? Boost.Regex seems over 3X faster than std::regex(!), this is getting closer to the initial/Eric's implementation.

    >That analysis is incorrect.

    So, just to make sure; in this case: stacked-crooked.com/view

    We have:

    template <typename C> void println_UniversalReference(C && c);

    template <typename C> void println_Value(C c);

    We call:

    println_Value(std::string{"abc"});

    println_UniversalReference(std::string{"abc"});

    I'm guessing:

    - deduced type for println_UniversalReference would be such that "C &&" becomes "std::string const &"

    - deduced type for println_Value would be such that "C" becomes "std::string"

    Since I'm not(?) doing any further copying, there shouldn't be any difference.

    Incidentally, were I to perform copying as in the "sorted" vs. "sorted2" examples in the following, would I observe any performance difference?

    cpp-next.com/.../want-speed-pass-by-value

    >non-deduced context

    Just making sure, what does it mean for a parameter to "not have std::initializer_list or reference to possibly cv-qualified std::initializer_list type"?

    (a) a parameter of type T without an initializer-list constructor?

    (b) a parameter of type not equal to std::initializer_list<T>?

    From the example in 14.8.2.1 I'm guessing it's (b).

  • MattPD> I'm guessing:

    You don't need to guess.

    C:\Temp>type meow.cpp

    #include <stdio.h>

    #include <string>

    template <typename T> struct Meow {

       static void Print() {

           puts("Unrecognized type T.");

       }

    };

    #define MEOW_EXPLICIT_SPECIALIZATION(X) \

       template <> struct Meow<X> { static void Print() { puts(#X); } };

    MEOW_EXPLICIT_SPECIALIZATION(std::string)

    MEOW_EXPLICIT_SPECIALIZATION(std::string&)

    MEOW_EXPLICIT_SPECIALIZATION(std::string&&)

    MEOW_EXPLICIT_SPECIALIZATION(const std::string)

    MEOW_EXPLICIT_SPECIALIZATION(const std::string&)

    MEOW_EXPLICIT_SPECIALIZATION(const std::string&&)

    template <typename C> void value(C) {

       printf("value C: ");

       Meow<C>::Print();

    }

    template <typename C> void univ_ref(C&&) {

       printf("univ_ref C: ");

       Meow<C>::Print();

       printf("univ_ref C&&: ");

       Meow<C&&>::Print();

    }

    int main() {

       std::string ml("modifiable lvalue");

       value(ml);

       univ_ref(ml);

       value(std::string("modifiable rvalue"));

       univ_ref(std::string("modifiable rvalue"));

    }

    C:\Temp>cl /EHsc /nologo /W4 /MTd meow.cpp && meow

    meow.cpp

    value C: std::string

    univ_ref C: std::string&

    univ_ref C&&: std::string&

    value C: std::string

    univ_ref C: std::string

    univ_ref C&&: std::string&&

    > println_UniversalReference(std::string{"abc"});

    > deduced type for println_UniversalReference would be such that "C &&" becomes "std::string const &"

    No, because the argument is a modifiable rvalue.

    > Incidentally, were I to perform copying as in the "sorted" vs. "sorted2" examples in the following

    It sounds like you're asking about a modification to your println example "along the lines" of the cpp-next example. I need to see concrete code in order to give concrete answers, because tiny details matter.

    > Just making sure, what does it mean for a parameter to "not have std::initializer_list or reference to possibly cv-qualified std::initializer_list type"?

    N3485 14.8.2.5 [temp.deduct.type]/5 says that when the function argument is stuff in braces, and the function parameter is NOT any of "cv std::initializer_list<XYZ>", "cv std::initializer_list<XYZ>&", or "cv std::initializer_list<XYZ>&&" (see note), then this is a "non-deduced context". Such a function argument/parameter pair won't be used to deduce any of the template parameters involved. (They'll have to be deduced elsewhere or explicitly specified, otherwise a candidate can't be stamped out for overload resolution.)

    Note: Where cv is nothing, const, volatile, or const volatile, and XYZ is something that involves a template parameter, like T or T * or pair<T, U> or whatever.

    The example "template<class T> void g(T); g({1,2,3});" is fairly clear; it is saying that T will not be deduced to be std::initializer_list<int>.

    Template argument deduction never considers constructors, conversions, or anything like that. It plays a type-matching game with very little "wiggle room" permitted.

  • STL> No, because the argument is a modifiable rvalue.

    &

    function call: univ_ref(std::string("modifiable rvalue"));

    producing output: univ_ref C: std::string

    This has been very instructive, thank you!

    STL> I need to see concrete code in order to give concrete answers, because tiny details matter.

    I've been trying to prod you to produce a generic answer accounting for all the possible corner cases, but I see you're not letting me get away with *that* much prodding ;D

    OK, I will think of a concrete/special case (given the above output I'm somewhat less worried).

    STL> Template argument deduction never considers constructors, conversions, or anything like that. It plays a type-matching game with very little "wiggle room" permitted.

    Got it, thanks!

    BTW, any notes on the Boost.Regex v. std::regex performance? For instance, is the (VS2012's) implementation accounting for some additional debugging checks, even when compiled as an optimized Release (can it be turned off)?

  • MattPD> is the (VS2012's) implementation accounting for some additional debugging checks, even when compiled as an optimized Release (can it be turned off)?

    We disable all checking in release mode.
