Jumping into C++

I was a professional developer for years, working initially with C in labs and at an engineering firm before moving on to other languages and stints as an enterprise developer and consultant. Other languages and tools were more in demand for the projects I worked on, so I gradually moved away from native development and lost touch with the C++ story. It has been years since I developed anything in C or tried using C++.

How hard would it be to write a familiar application in modern C++?

This blog post shares my experience returning to native programming using modern C++. No printf statements were harmed in this effort. A few of my colleagues might have been annoyed by my many questions, so shout out to James McNellis for tutoring and reviewing the code and blog post drafts and to Sumit Kumar and Niklas Gustafsson for deciphering warning messages.

Any errors in the code or this post are mine.

Ground Rules

I set out to write a simple application using modern C++. I imposed the following rules and requirements:

  • Use only modern C++. I wanted to use streams, iterators, and templates, mapping how I did things in C, C#, and other languages onto modern C++.
  • Use best practices. For example, "when in doubt, use vector." If a value was not meant to change, make it const. Don't use macros. Go with what makes sense and seems to have the most support.
  • Share all gotchas! I expected to run into simple problems reflecting my lack of C++ experience. Fortunately, I could wander the halls and ask my colleagues on the Visual C++ team for help/answers/advice and commiseration. Spoiler: one of my gotchas involved precompiled headers. *facepalm*
  • Keep the project simple. The application was not going to be a Windows Store application. I wanted to read and write some data, do a few loops, mess around with a collection or two, and use std::string, not build a "real" application or explore more advanced concepts. I do hope to build a more sophisticated "real" application in the future.
  • Use Visual Studio and Visual C++. Probably a no-brainer, but I did have some concerns about finding my way around VS2012. I had read a few complaints about the default editor color palette, but I had no complaints. I also found all the shortcuts I had learned years ago continue to work.

Given these ground rules, I chose a simple project, counting words across a number of files. It requires input and output, a way to map a count to a word, some string manipulation (if desired), and loops. It also works well as a command-line application so I could avoid UI issues. If written well, the code could always be ported to a fancier Windows application in the future.

Requirements

The application needs to count the words in a set of text files provided by the user. It must accept a list of files on the command line, process each file while accumulating word counts and skipping bad/missing files, and then print out the number of files processed and the list of words and counts. For this exercise, words are chunks of non-whitespace characters delimited by one or more whitespace characters—nothing too sophisticated.

For all requirements, I chose the simplest possible solution. For example, the program targets ASCII files and does not deal with wide characters (though I did mess around with wstring and wifstream without running into problems). The program makes no attempt to handle binary files (the test files are from Project Gutenberg). Words may include unwanted characters even after simple filtering, and so on.

Would you realistically implement these requirements with C++? You might if you could squeeze more performance out of file processing, needed to reuse the code on different platforms, or wanted more control. You might not if you had access to PowerShell or another scripting language and wanted a quick solution. As with all interesting choices, the answer is "it depends." If you wanted to use AppleSoft Basic on an old ][+, go for it (and share your code in the comments section below)!

The solution itself is short, but it did take a couple of iterations to whip into proper C++ shape. If you want the code, it is attached but should not be used in production code blah blah blah.

Interesting Bits

There were a few interesting bits -- interesting mostly because I was new to modern C++ and a bit rusty with the C/C++ compiler. Some of the issues I encountered were fixed using information online (the C++ community and online ecosystem are awesome); others I inflicted on my colleagues.

Processing Command-Line Arguments

Rather than looping through the command line arguments array argv, I went ahead and converted it to a vector<string>.

int main(int argc, char** argv)
{
    // bail if no files are specified
    if(0 == argc)
        return 0;

    const vector<string> files(argv, argv + argc);
}

Iteration is straightforward, especially if (unlike me) you remember to use const auto&.

for (const auto& file : files) ...

We have no intention of modifying the file string. Defaulting to const auto& when writing a loop seems like a safe best practice.

If run from the Windows command line, the first argument is usually the path/name of the executable. Rather than weed out this case when copying from argv, I did it during the file processing loop:

if(file.rfind(".exe") == (file.length() - 4))

  continue;

If the file name ends with ".exe", skip it. It could be more robust, but the basic mechanism is in there.
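
For reference, a slightly more robust version of that check might look like the sketch below. It uses a hypothetical ends_with helper (not part of the attached sample) to guard against names shorter than the suffix and against rfind returning npos:

// Hypothetical helper -- not in the attached sample.
// Avoids surprises when the name is shorter than the suffix
// or when rfind() returns string::npos.
bool ends_with(const string& value, const string& suffix)
{
    return value.length() >= suffix.length() &&
           value.compare(value.length() - suffix.length(), suffix.length(), suffix) == 0;
}

// inside the file loop:
// if (ends_with(file, ".exe"))
//     continue;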

Reading a File

Reading a file is as easy as the various C++ tutorials claim: grab an appropriate stream, point it at the file, and, if the stream is not bad, read a word at a time until there are no more words to read, then close the file. For this project, the appropriate stream is ifstream, though I did have it working with wifstream and wide characters. If you decide to go that route, use wstring.

My first attempt explicitly checked for a bad file like so:

ifstream infile(file);
if(!infile.bad())
{
    string word;
    while(infile >> word) ...

It turns out ifstream::bad() is only set by a previous I/O operation. If, as in this example, there is no previous operation, the file is not "bad" even if it does not exist. I needed a different strategy, one that avoids the explicit "badness" check:

ifstream infile(file);
string word;

// if we can pull a word (file is good)
if(infile >> word)
{
    // process all of the contents
    do
    {
    } while(infile >> word);
}

This code "primes the pump" by test-reading a word from the file before processing the rest of the file. Per-file code (like counting the number of files actually processed) can be shoved after the test-word is successfully read.

If something goes horribly wrong while processing a file, we don't try to pick up the pieces. When the infile falls out of scope, the file is closed.
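
Putting the pieces together, the per-file loop ends up looking roughly like the sketch below. This is a simplified version, not the attached code verbatim; fileCount is a hypothetical counter, and the words map and the isremovable filter are covered in the next sections:

ifstream infile(file);
string word;

// prime the pump: only treat the file as processed if a word can be read
if(infile >> word)
{
    ++fileCount;   // per-file bookkeeping goes here

    // process all of the contents
    do
    {
        word.erase(remove_if(word.begin(), word.end(), &::isremovable), word.end());
        if(!word.empty())
            words[word]++;
    } while(infile >> word);
}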

Tracking Word Counts

Words are tracked in a map using the word as the key and an unsigned integer to keep track of the number of occurrences. If a word is not in the map, it is automatically added, avoiding the need for extra code.

map<string, unsigned int> words;
string word;
...
words[word]++;

This code is on the "simple" end of the map complexity continuum; implementations can get ugly quick. I used unsigned int because word counts will not be negative – there are no "anti-words" in this exercise.

I made one tweak to the program once it was working. The original version counted contiguous chunks of characters delimited by one or more spaces without worrying about whether a character was punctuation, so dog and 'dog' each counted as a unique word. It bugged me so I looked for a way to remove a set of characters from a string. What I found was:

word.erase(remove_if(word.begin(), word.end(), &::isremovable), word.end());

The inner remove_if shifts the characters to keep toward the front of the string, dropping any character for which the custom function isremovable returns true. When done, remove_if returns an iterator pointing to the new end of the word. The outer word.erase then removes everything from that new end to the actual end of the word.

This looked like mumbo jumbo until James explained it. It also helped to step through example code that split the operations onto separate lines. Once I got it, it seemed obvious, an "aha" moment that would help me dissect similar statements in the future. Hopefully!
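
For anyone else puzzling over it, the split-into-separate-lines version looks roughly like this (same idiom, just spread out so each step can be inspected in the debugger):

// remove_if shifts the characters to keep toward the front and returns
// an iterator to the new logical end of the word...
auto newEnd = remove_if(word.begin(), word.end(), &::isremovable);

// ...and erase then trims everything from that new end to the real end.
word.erase(newEnd, word.end());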

If I wanted to go fully modern C++, I'd replace isremovable with a lambda, but then there would be too much going on in that one statement for this first attempt. :)
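
For the curious, that lambda version would look something like the sketch below. It assumes the filter boils down to an ispunct-style test, which is not exactly what the attached isremovable does, so treat it as illustrative:

// Same erase/remove_if idiom, with the predicate written inline as a lambda.
word.erase(remove_if(word.begin(), word.end(),
                     [](char c) { return ispunct(static_cast<unsigned char>(c)) != 0; }),
           word.end());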

Printing Results

My first try at printing to console netted the following:

for (pair<string, long> c : words)
    cout << c.first << ":" << c.second << endl;

It worked, but because the pair declaration is wrong (map keys are const), a temporary pair was created for each element, making the program less efficient. I updated it using what I had learned working on the file-processing code:

for (const auto& c : words)
    cout << c.first << ":" << c.second << endl;

If I had defined a new type for the word map, I could have used another mechanism, but like lambdas, it can wait.
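
For example, with a typedef (a sketch only; the attached code does not define one), the loop variable could name the map's value_type directly and avoid both auto and the const pitfall:

typedef map<string, unsigned int> WordCounts;   // hypothetical alias for the word map
WordCounts words;
...
for (const WordCounts::value_type& c : words)   // value_type is pair<const string, unsigned int>
    cout << c.first << ":" << c.second << endl;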

Gotchas!

My "gotchas":

  • Selecting the wrong project template for the job. In my first go around, I chose a project type that included precompiled headers and handled Unicode. I had forgotten that includes need to go after the precompiled header include in the source file, resulting in some funky errors whose cause was not immediately obvious to me (I figured my C++ was wrong!); see the short example after this list. With Unicode came TCHAR, adding complexities around printing and manipulating strings. For small projects, start with an empty C++ project and write everything from scratch. It is easy to extend later.
  • Forgetting to include the right library. When this happened, I was certain I had the correct includes and so assumed the errors being thrown at me were from bad code. This is part of the learning curve. Double-check includes! Online docs and frequent compiles helped.
  • Getting buried in complexity. Part of my time was spent reviewing C++ information related to my task, in particular the STL and templates. It did not take too long to go from beginner content to the dragon's den in an article, discussion thread or a few "related article" clicks. Some STL code cannot be unseen. Understand there is complexity, file it away, and refocus on the immediate goal.
  • Editor squigglies. On the first version of the project, I used "for each" when looping through files and word counts. The editor "squiggled" the container in each case yet the compiler had no complaints. Turns out "for each" is a Visual Studio extension; when I used "for", the squiggles went away. The code was technically correct, but there was a better way. Verify the veracity of the squiggle – could the statement be tweaked to get rid of it?
  • Assuming the requirements and implementation were "simple". Nothing beats a friendly code review from an expert. James reviewed the code and this post and identified quite a few fundamental "oops," "d'ohs," and "ughs". Getting a program to work is one thing; making sure it is correct (and I could explain why) was much harder. Don't avoid peer code reviews!
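
To make the precompiled-header gotcha concrete, here is what the required include order looks like (a sketch only; the header names come from the default template and vary by project):

// With precompiled headers enabled, the pch include must come first in each .cpp.
// Anything above it is effectively ignored, which is what produced the funky errors.
#include "stdafx.h"    // the precompiled header include from the default template

#include <string>      // project and library includes go after it
#include <vector>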

Most of these are pretty basic. Your gotchas may vary (YGMV).

Is C++ Hard?

Not if you start with a small, familiar project, tackle a few new concepts at a time, and use all the wonderful resources in the community (including the community itself).

I hope to tackle a more complex project next time. Stay tuned!

Attachment: WordCountInFiles.cpp
  • A few additional comments:

    1) the normal "contract" for launching executables is to have argv[0] be the path to the program file. If somebody launches your program in a way such that this isn't true, they're breaking that contract, and are writing buggy code. Thus, you should bail on argc<2 and intialize by (argv+1, argv+argc), and you can then also drop the exe-checking.

    2) what you receive in argv are filenames, not files - this is perhaps a bit pedantic, but "file" and "filename" are semantically quite different :-)

    3) error opening the file sets failbit, not badbit, which is why your first attempt failed - so you could check for "infile.fail()"... or you can simply do "if(infile)" - or you could change your outer loop to "while( infile.good() )" - or you could have it as "while( std::getline(infile, line) )".

    4) STL's suggestion to use regex is nifty, and definitely easier than hand-parsing (even if simple word boundary parsing isn't too bad :-) ), but it almost feels like cheating! :-)

  • re. size_t, I'd say that no, you did the right thing. Technically, size_t is for describing the size of objects, so I'd say an int or unsigned int is more appropriate here. But both will work, and I'm sure you could come up with arguments in favor of size_t too.

    re. project type and precompiled headers, I strongly feel VS should remove them from the default project type. All projects start out small and simple, and for simplicity, give us a simple project. We know how to add the complex stuff anyway as and when we need it. I believe the only thing the pch-by-default policy achieves is to trip up newcomers, like it did with you. (while also being one more nonstandard thing MSVC users end up using in their code.)

  • "re. size_t, I'd say that no, you did the right thing. Technically, size_t is for describing the size of objects, "

    It is also used for describing the number of elements in an array, so if you're counting objects, then it makes sense.

    "I strongly feel VS should remove them from the default project type. All projects start out small and simple, and for simplicity, give us a simple project. "

    There is already an "empty project" type that I believe gives you what you want.  If you're doing an MFC project, or even a Windows project, you really want to use a precompiled header, as those two headers result in really long compile times otherwise.

  • The following check from your sample will never be true?

    argc never equals 0, as argv[0] contains the program's name.

    //bail if no files are specified

    if(0 == argc)

  • MattPD> MSDN moderation seems to hate my guts and wish me ill

    The filters are probably unhappy with anyone pasting a zillion links.

    Daniel Earwicker> (more accurately, it uses the free functions std::begin and std::end so it can also work with arrays).

    Actually, range-for directly recognizes arrays and classes with member .begin()/.end(). For everything else, it uses Argument-Dependent Lookup to find begin()/end() free functions. Interestingly, it never uses std::begin()/std::end(), which work with only arrays and classes with member .begin()/.end() - exactly the things that range-for directly recognizes. See N3485 6.5.4 [stmt.ranged] for the details.

    > we can forget that quirky pattern of using a default-constructed iterator to refer to the end position.

    Ranges will be wonderful, but note that this pattern is not new. istream_iterator has always followed it.

    > for (auto line: file_line_range(file))

    Note that this would be equivalent to "for (string line : RANGE)" and would therefore copy each line unnecessarily. Range-for should always be written as "for (auto& e : r)" or "for (const auto& e : r)" to access elements in-place unless you specifically want something else (which is rare).

    > you can pipe a range through a series of transformations using the | operator

    There is currently a problem with that technique. Range-for has been specified to keep temporary ranges alive for the duration of the entire loop, by using "auto&& __range = (expression);" internally. However, if you have a chain of temporary objects, where the last one depends on previous ones surviving (i.e. it does not copy elements but filters them in-place), then range-for cannot keep the previous temporaries alive (nothing in C++ is currently capable of that). The Committee is considering changing the specification of range-for to keep all temporaries in the expression alive for the duration of the entire loop, but that has not been voted into the Working Paper yet, nor have compilers implemented it.

    GregM> [size_t] is also used for describing the number of elements in an array, so if you're counting objects, then it makes sense.

    When counting things that don't live in memory simultaneously (e.g. things in a file), strongly consider using unsigned long long. In this case, a program compiled for x86 could easily be given an 8GB file containing 2^32 occurrences of the word "A" (and 2^32 LFs; a file with CRLFs would have to be 12GB). Actually hardening a program against enormous inputs requires more effort (e.g. enormous lines, enormous words, an enormous number of unique words), but not making the problem worse by adding arbitrary 32-bit limitations is a good start. (I speak from experience, having battled Large File Support bugs in the CRT and the STL - all of which we now believe to be fixed.)

    stefan> argc never equals 0, as argv[0] contains the programms name.

    3.6.1 [basic.start.main]/2 guarantees that "The value of argc shall be non-negative.", but it may be 0.

  • STL: A fair point ;-)

    I've been attempting to address the "is this best practice" inquiry in a fair manner, so I felt it was best to illustrate with existing practices with comments from Andrew and Scott :-)

    By the way, I have a question on range-based for loops & best practice, specifically, regarding the following:

    STL> Range-for should always be written as "for (auto& e : r)" or "for (const auto& e : r)" to access elements in-place unless you specifically want something else (which is rare).

    Wouldn't universal reference, "for (auto && e : r)", be a better choice in a generic context (e.g., templates)?

    One of the advantages is that it will bind to both, including temporaries and const objects:

    stacked-crooked.com/view

    Come to think of it, would using universal references for arguments, as in here, be a good idea, too:

    stacked-crooked.com/view

    Some context:

    stackoverflow.com/.../what-is-the-advantage-of-using-universal-references-in-range-based-for-loops

    stackoverflow.com/.../proper-style-for-declaration-in-range-based-for

    Or, is there any way I can get in trouble if that's my default?

    The only thing that comes to mind is potential aliasing concerns for types like "int" or "double" (where pass-by-value, or "for (auto e : r)", could be preferable), but I guess profile-first makes sense in this context (and I don't think either "auto&" or "auto const&" can be any better here):

    stackoverflow.com/.../is-a-reference-more-efficient-than-a-copy-for-small-types-in-a-range-based-for-l

    So, is using universal references *the* good default choice?

    // Example:

    #include <iostream>
    #include <numeric>
    #include <vector>

    template <typename C>
    void println(C const & c)
    {
        for (auto && v : c) std::cout << v << ' ';
        std::cout << '\n';
    }

    int main()
    {
        std::vector<int> c(5);
        iota(begin(c), end(c), 0);
        println(c);

        std::vector<int> const d{c};
        println(d);

        println(std::vector<int>{0, 1, 2, 3, 4});
    }

  • GregM> [size_t] is also used for describing the number of elements in an array, so if you're counting objects, then it makes sense.

    STL> When counting things that don't live in memory simultaneously (e.g. things in a file), strongly consider using unsigned long long.

    Thanks, that would be better than size_t when 32 bit is involved (I always forget that long long is actually 64 bits even on 32-bit Windows), but size_t is still better than the "int or unsigned int" recommended by the person I responded to, since int/unsigned int will never be 64 bits, unlike size_t.

  • STL: One quick note, mostly on my secondary question[*] (the one on function-arguments, not the primary one on the range-based for loop). One issue I see is missing the optimization opportunity for copy elision (and unnecessarily moving instead) when universal references are used as arg-types -- but this is only a relative disadvantage compared to pass-by-value, not compared to pass-by-(const-)reference (which shouldn't do any better), right?

    // [*] I *think* this only applies to my secondary question, since the semantics of range-based for loops are defined in terms of universal references / perfect forwarding anyway; en.cppreference.com/.../range-for

    Is this correct?

  • Thanks to all for such great comments. I will be posting a follow-up to highlight comments/feedback including the regex-based solution. It is easier to appreciate a more elegant solution after grinding through a less elegant (but viscerally educational) version :)

  • MattPD> One of the advantages is that [auto&&] will bind to both, including temporaries and const objects:

    Given vector<int>, auto& becomes int&. Given const vector<int>, auto& becomes const int&. So that's not a problem.

    The advantage of auto&& is that it works with iterators that return proxy objects, like (notoriously) vector<bool>. auto& will refuse to bind to proxy objects, while const auto& will bind to them but won't allow you to assign through them.

    Because proxy objects are sufficiently rare, I hesitate to recommend auto&& to newbies for fear of making their heads explode. However, it always works (and range-for already uses it to bind to arbitrary expressions), so feel free to use it by default.

    > The only thing that comes to mind is potential aliasing concerns for types like "int" or "double" (where pass-by-value, or for (auto e : r)" could be preferable

    The iterator's op* is already returning T& or const T&. I haven't profiled it, but I don't believe that auto& will introduce a performance penalty, and auto might (even for int).

    GregM> I always forget that long long is actually 64 bit on 32 bit Windows

    It's required to be at least 64 bits everywhere. You can include <stdint.h> if you'd like the fixed-length typedefs.

    MattPD> One issue I see is missing the optimization opportunity for copy elision (and unnecessarily moving instead) when universal references are used as arg-types

    I need a full example to reason about. I don't immediately see how perfect forwarding could introduce a penalty. (However, saying return move(local); inhibits the NRVO.)

  • GregM> I always forget that long long is actually 64 bit on 32 bit Windows

    STL> It's required to be at least 64 bits everywhere. You can include <stdint.h> if you'd like the fixed-length typedefs.

    Thanks.  I didn't phrase that very well.  "I always forget that there is a portable integer type that is at least 64 bits on 32 bit Windows."  I think that the only time I've used it has been as the MS-specific ULONGLONG for interacting with CFile.  

  • Here's some not-so-serious perf timing ;-)

    E (like Eric): liveworkspace.org/.../47DiHj$0

    M (like Matt): liveworkspace.org/.../V4c0C$0

    S (like Stephan): liveworkspace.org/.../3CgOLK$0 // on-line compiler (linker) is complaining, but tested locally on VS2012 w/o problems

    Of course, there's that tiny implementation detail of our programs actually doing rather different things (with Eric's counting "words" like "#4300]" and mine (and Eric's) also counting empty words, while Stephan's actually counting... words, as in actually solving the specified problem) but, hey, we're faster! ;D

    // I guess one could call it performance-correctness trade-off ;-)

    // And mine shows an _enormous_ boost of roughly... wait for it... 0.017 seconds (!) relative to Eric's ;-)

    I've also tried unordered_map in Stephan's program, but this didn't help a lot, so I'm guessing regex is the main workhorse contributing to the overall cost.

    Test file: Ulysses by James Joyce // www.gutenberg.org/.../4300

    www.gutenberg.org/.../4300.txt

    /* *** */

    Eric: // 38825 word(s) in 1 file(s)

    > WordCountInFiles.E.exe 4300.txt > outE

    Time elapsed: 0.215093 seconds

    > WordCountInFiles.E.exe 4300.txt > outE

    Time elapsed: 0.212425 seconds

    > WordCountInFiles.E.exe 4300.txt > outE

    Time elapsed: 0.212831 seconds

    /* *** */

    Matt: // 34566 word(s) in 1 file(s)

    > WordCountInFiles.Mur.exe 4300.txt > outM

    Time elapsed: 0.198755 seconds

    > WordCountInFiles.Mur.exe 4300.txt > outM

    Time elapsed: 0.194646 seconds

    > WordCountInFiles.Mur.exe 4300.txt > outM

    Time elapsed: 0.195544 seconds

    /* *** */

    Stephan: // 34544 unique words.

    > WordCountInFiles.S.exe 4300.txt > outS

    Time elapsed: 0.817385 seconds

    > WordCountInFiles.S.exe 4300.txt > outS

    Time elapsed: 0.815651 seconds

    > WordCountInFiles.S.exe 4300.txt > outS

    Time elapsed: 0.81715 seconds

    /* *** */

    Stephan, unordered_map: // 34544 unique words.

    > WordCountInFiles.Su.exe 4300.txt > outSu

    Time elapsed: 0.75572 seconds

    > WordCountInFiles.Su.exe 4300.txt > outSu

    Time elapsed: 0.761518 seconds

    > WordCountInFiles.Su.exe 4300.txt > outSu

    Time elapsed: 0.763557 seconds

    /* *** */

  • STL: great, thanks!

    Universal references default for the range-based for loop it is, then! :-)

    // The added benefit is simplicity, i.e., not having to consider whether to use "for (auto& e : r)" or "for (const auto& e : r)" depending on the constness of "r".

    STL> "auto might (even for int)"

    This made me curious... how? Are we talking architecture with pointers smaller than ints here or is this something else?

    STL> how perfect forwarding could introduce a penalty

    My worries are somewhat along these lines: groups.google.com/.../97f044b8bc9a48d9

    In particular, with a focus on the cost of copy-in (copy elision is often talked about in the also-copy-out context, so perhaps that's already a signal I'm barking up the wrong tree here).

    Not sure this is anything to worry about in this context, can't find any performance difference in this (admittedly contrived) example:

    stacked-crooked.com/view

  • Kenny Kerr and I collaborated on an implementation optimized for a good balance between implementation simplicity and great performance.  We wrote an article about our implementation, which you can find on his blog:

    The Great C++ Wordament:  Meet Windows (kennykerr.ca/.../the-great-c-wordament-meet-windows)

  • > It is also used for describing the number of elements in an array, so if you're counting objects, then it makes sense.

    ... which is the size of the array, like I said. It is however *not* conventionally used for simple counters. (if it were, are there any situations where you'd use a plain int/unsigned int?) Heck, size_t is defined to be big enough to represent sizeof(the largest possible object), which is a completely arbitrary limit if you're counting the number of words in a file.

    (Note that I'm not considering the size of the datatype. If an int isn't big enough, use long long or int64_t or whatever else. I'm talking about the semantics. size_t is intended to describe the size of objects. int and similar are for representing, well, integral numbers)

    But as always with all the different typedefs of various integers, the difference is kind of academic. It makes no big difference in practice if you use a size_t instead of an unsigned int, (heck, you could use ptrdiff_t too, if you liked, or time_t). And the definition is loose enough that it's generally fairly subjective whether it's "correct" to use one instead of another. But personally, I'd be surprised to see size_t used for a simple counter.

    > There is already an "empty project" type that I believe gives you what you want

    Yes, there is, but it's not the default, and this trips up newcomers again and again and again... And again.

    Whereas someone creating a big complex MFC project hopefully knows how to enable precompiled headers *anyway*. The defaults should be toggled. By default, create a simple clean project with minimal "custom" settings. And then add an option for "big heavy-duty MFC project with all the 20-year-old project cruft, pregenerated files, readme.txt and a pch". Make it easy for newcomers who aren't familiar with all the project types. Those who are can easily pick the variant that suits them anyway.
