Jumping into C++

I was a professional developer for years, working initially with C in labs and an engineering firm before moving on to other languages and stints as an enterprise developer and consultant. Other languages and tools were more in demand for the projects I worked on, so I gradually moved away from native development and lost touch with the C++ story. It has been years since I developed anything in C or tried using C++.

How hard would it be to write a familiar application in modern C++?

This blog post shares my experience returning to native programming using modern C++. No printf statements were harmed in this effort. A few of my colleagues might have been annoyed by my many questions, so shout out to James McNellis for tutoring and reviewing the code and blog post drafts and to Sumit Kumar and Niklas Gustafsson for deciphering warning messages.

Any errors in the code or this post are mine.

Ground Rules

I set out to write a simple application using modern C++. I imposed the following rules and requirements:

  • Use only modern C++. I wanted to use streams, iterators, and templates, and to map how I did things in C, C# and other languages onto modern C++.
  • Use best practices. For example, "when in doubt, use vector." If a value was not meant to change, make it const. Don't use macros. Go with what makes sense and seems to have the most support.
  • Share all gotchas! I expected to run into simple problems reflecting my lack of C++ experience. Fortunately, I could wander the halls and ask my colleagues on the Visual C++ team for help/answers/advice and commiseration. Spoiler: one of my gotchas involved precompiled headers. *facepalm*
  • Keep the project simple. The application was not going to be a Windows Store application. I wanted to read and write some data, do a few loops, mess around with a collection or two, and use std::string, not build a "real" application or explore more advanced concepts. I do hope to build a more sophisticated "real" application in the future.
  • Use Visual Studio and Visual C++. Probably a no-brainer, but I did have some concerns about finding my way around VS2012. I had read a few complaints about the default editor color palette, but I had no complaints. I also found all the shortcuts I had learned years ago continue to work.

Given these ground rules, I chose a simple project, counting words across a number of files. It requires input and output, a way to map a count to a word, some string manipulation (if desired), and loops. It also works well as a command-line application so I could avoid UI issues. If written well, the code could always be ported to a fancier Windows application in the future.

Requirements

The application needs to count the words in a set of text files provided by the user. It must accept a list of files on the command line, process each file while accumulating word counts and skipping bad/missing files, and then print out the number of files processed and the list of words and counts. For this exercise, words are chunks of non-whitespace characters delimited by one or more whitespace characters—nothing too sophisticated.

For all requirements, I chose the simplest possible solution. For example, the program targets ASCII files and does not deal with wide characters (though I did mess around with wstring and wifstream without running into problems). The program avoids handling certain binary files (the test files are from Project Gutenberg). Words include unwanted characters even after simple filtering, and so on.

Would you realistically implement these requirements with C++? You might if you could squeeze more performance out of file processing, needed to reuse the code on different platforms, or wanted more control. You might not if you had access to PowerShell or another scripting language and wanted a quick solution. As with all interesting choices, the answer is "it depends." If you wanted to use AppleSoft Basic on an old ][+, go for it (and share your code in the comments section below)!

The solution itself is short, but it did take a couple of iterations to whip into proper C++ shape. If you want the code, it is attached but should not be used in production code blah blah blah.

Interesting Bits

There were a few interesting bits -- interesting mostly because I was new to modern C++ and a bit rusty on using the C/C++ compiler. Some of the issues encountered were fixed using information online (the C++ community and online ecosystem is awesome); others I inflicted on my colleagues.

Processing Command-Line Arguments

Rather than looping through the command line arguments array argv, I went ahead and converted it to a vector<string>.

int main(int argc, char** argv)
{
    // bail if no files are specified
    if(0 == argc)
        return 0;

    const vector<string> files(argv, argv + argc);
}

Iteration is straightforward, especially if (unlike me) you remember to use const auto&.

for (const auto& file : files) ...

We have no intention of modifying the file string. Defaulting to const auto& when writing a loop seems like a safe best practice.
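A minimal sketch of the two steps above, for anyone who wants to try them in isolation. The helper names collect_args and total_length are made up for this illustration and are not part of the attached program.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the original program): copies the raw
// argument array into a vector so later code can use iterators.
std::vector<std::string> collect_args(int argc, const char* const* argv)
{
    // The iterator-range constructor builds one std::string per C string.
    return std::vector<std::string>(argv, argv + argc);
}

// Hypothetical helper: adds up the length of every argument.
// const auto& binds to each element without copying the string.
std::size_t total_length(const std::vector<std::string>& args)
{
    std::size_t total = 0;
    for (const auto& arg : args)
        total += arg.size();
    return total;
}
```

Writing the loop with plain auto instead of const auto& would still compile, but it would copy every string on each pass.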

If run from the Windows command line, the first argument is usually the path/name of the executable. Rather than weed out this case when copying from argv, I did it during the file processing loop:

if(file.rfind(".exe") == (file.length() - 4))

  continue;

If the file name ends with ".exe", skip it. It could be more robust, but the basic mechanism is in there.
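One way the check could be made more robust, sketched under the assumption that a small helper is acceptable (ends_with is a hypothetical name, not something in the attached code). The rfind comparison above misfires for names shorter than four characters: rfind returns string::npos, and length() - 4 wraps around to the same value because both are unsigned.

```cpp
#include <string>

// Hypothetical helper: true when `name` ends with `suffix` (case-sensitive).
bool ends_with(const std::string& name, const std::string& suffix)
{
    // Guard first: the subtraction below would wrap around for short names
    // because both lengths are unsigned.
    if (name.length() < suffix.length())
        return false;

    // Compare only the tail of `name` against `suffix`.
    return name.compare(name.length() - suffix.length(),
                        suffix.length(), suffix) == 0;
}
```

This sketch is still case-sensitive, so a name such as "WC.EXE" would not be skipped.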

Reading a File

Reading a file is as easy as the different C++ tutorials claim: grab an appropriate stream, point it to the file and, if it is not bad, read a word at a time until there are no more words to read and then close the file. For this project, the appropriate stream is ifstream, though I did have it working with wifstream and wide characters. If you decide to go that route, use wstring.

My first attempt explicitly checked for a bad file like so:

ifstream infile(file);
if(!infile.bad())
{
    string word;
    while(infile >> word) ...

Turns out ifstream::bad() only reports badbit, which signals an unrecoverable i/o error. A failed open sets failbit instead, so the stream is not "bad" even if the file is non-existent. I needed to use a different strategy, one that avoids the explicit "badness" check:

ifstream infile(file);
string word;

// if we can pull a word (file is good)
if(infile >> word)
{
    // process all of the contents
    do
    {
    } while(infile >> word);
}

This code "primes the pump" by test-reading a word from the file before processing the rest of the file. Per-file code (like counting the number of files actually processed) can be shoved after the test-word is successfully read.

If something goes horribly wrong while processing a file, we don't try to pick up the pieces. When the infile falls out of scope, the file is closed.

Tracking Word Counts

Words are tracked in a map using the word as the key and an unsigned integer to keep track of the number of occurrences. If a word is not in the map, operator[] automatically adds it with a count of zero, avoiding the need for extra code.

map<string, unsigned int> words;

string word;

...

words[word]++;  

This code is on the "simple" end of the map complexity continuum; implementations can get ugly quick. I used unsigned int because word counts will not be negative – there are no "anti-words" in this exercise.

I made one tweak to the program once it was working. The original version counted contiguous chunks of characters delimited by one or more spaces without worrying about whether a character was punctuation, so dog and 'dog' each counted as a unique word. It bugged me so I looked for a way to remove a set of characters from a string. What I found was:

word.erase(remove_if(word.begin(), word.end(), &::isremovable), word.end());

The inner remove_if "removes" the characters that the custom function isremovable says should be removed (i.e., those for which it returns true) by shifting all non-removable characters to the left. The string does not get any shorter at this point. When done, remove_if returns an iterator pointing to the new end of the word. The outer word.erase removes the characters from the new end to the actual end of the word.

This looked like mumbo jumbo until James explained it. It also helped stepping through example code that split the operations into separate lines. Once I got it, it seemed obvious, an "aha" moment that would help me dissect similar statements in the future. Hopefully!

If I wanted to go fully modern C++, I'd replace the isremovable with a lambda, but then there would be too much going on in that one statement for this first attempt :).
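Here is a sketch of the same erase/remove_if combination split into two statements, with a lambda standing in for isremovable. The removal rule (strip punctuation but keep hyphens) is an assumption made for this example, since the body of isremovable is not shown in the post, and strip_removable is a hypothetical name.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a copy of `word` without punctuation.
// Assumed rule for this sketch: hyphens are kept, other punctuation goes.
std::string strip_removable(std::string word)
{
    // Step 1: remove_if shifts the characters to keep toward the front and
    // returns an iterator to the new logical end. The length is unchanged.
    const auto new_end = std::remove_if(word.begin(), word.end(),
        [](unsigned char c) { return std::ispunct(c) != 0 && c != '-'; });

    // Step 2: erase chops off the leftovers between the new end and the
    // actual end of the string.
    word.erase(new_end, word.end());
    return word;
}
```

The lambda takes unsigned char because passing a negative char value to std::ispunct is undefined behavior.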

Printing Results

My first try at printing to console netted the following:

for (pair<string, long> c : words)
    cout << c.first << ":" << c.second << endl;

It worked, but the pair declaration is wrong: the map's element type is pair<const string, unsigned int> (map keys are const), so each element was converted into a temporary pair<string, long>, making the program less efficient. I updated it using what I had learned working on the file processing code:

for (const auto& c : words)
    cout << c.first << ":" << c.second << endl;

If I had defined a new type for the word map, I would have been able to use another mechanism but like lambdas, it can wait.

Gotchas!

My "gotchas":

  • Selecting the wrong project template for the job. In my first go around, I chose a project type that included precompiled headers and handled Unicode. I had forgotten that includes need to go after the precompiled header include in the source file, resulting in some funky errors whose cause was not immediately obvious to me (I figured my C++ was wrong!). With Unicode came TCHAR, adding complexities around printing and manipulating strings. For small projects, start with an empty C++ project and write everything from scratch. It is easy to extend later.
  • Forgetting to include the right library. When this happened, I was certain I had the correct includes and so assumed the errors being thrown at me were from bad code. This is part of the learning curve. Double-check includes! Online docs and frequent compiles helped.
  • Getting buried in complexity. Part of my time was spent reviewing C++ information related to my task, in particular the STL and templates. It did not take too long to go from beginner content to the dragon's den in an article, discussion thread or a few "related article" clicks. Some STL code cannot be unseen. Understand there is complexity, file it away, and refocus on the immediate goal.
  • Editor squigglies. On the first version of the project, I used "for each" when looping through files and word counts. The editor "squiggled" the container in each case yet the compiler had no complaints. Turns out "for each" is a Visual C++ language extension; when I used the standard range-based "for", the squiggles went away. The code compiled, but there was a better, portable way. Verify the veracity of the squiggle – could the statement be tweaked to get rid of it?
  • Assuming the requirements and implementation were "simple". Nothing beats a friendly code review from an expert. James reviewed the code and this post and identified quite a few fundamental "oops," "d'ohs," and "ughs". Getting a program to work is one thing; making sure it is correct (and I could explain why) was much harder. Don't avoid peer code reviews!

Most of these are pretty basic. Your gotchas may vary (YGMV).

Is C++ Hard?

Not if you start with a small, familiar project, tackle a few new concepts at a time, and use all the wonderful resources in the community (including the community itself).

I hope to tackle a more complex project next time. Stay tuned!

Attachment: WordCountInFiles.cpp
  • Welcome back! :-)

    Here's an iteration:

    liveworkspace.org/.../42OOEw$0

    It illustrates a few simple ideas/extensions to consider in the future:

    - instead (or perhaps in the implementation) of "isremovable" use one of the character classification functions (header <cctype>):

    en.cppreference.com/.../byte

    For instance, std::ispunct /* en.cppreference.com/.../ispunct */ matches your current criteria pretty well (adding a few extra on its own ;]).

    Although you may also consider (negated) std::isalnum.

    // A gotcha here is type-inference error without static_cast<int(*)(int)> (try it out with Clang or GCC; MSVC sometimes likes to be permissive).

    Instead of "unsigned int" (value type in the "words" variable) and "int" (variable "files_processed"), consider just using std::size_t (header cstddef) -- it expresses the intent (in both cases you're counting stuff) and is a bit more future proof (abstracts from having to worry on whether it should be long int, on some platforms, or long long int, on some other platforms) ;-)

    // I've also changed post-increments to pre-increments; while I don't believe there's any serious difference here, I just find it easier to read this way (if you pronounce "++" as "increment"; usually verb-noun order sounds a bit better).

    Now, if you're willing to consider Boost (which often makes C++ even easier :]), there's a few more things that may come in handy:

    - boost::algorithm::iends_with(filename, ".exe")

    www.boost.org/.../iends_with.html

    www.boost.org/.../usage.html

    - boost::algorithm::remove_erase_if

    www.boost.org/.../remove_erase_if.html

    HTH :-)

  • One more thought: instead of skipping the ".exe", we may want to consider skipping the first argument altogether // if you'll see the output of the program I've posted you'll know what inspired this thought ;-)

  • Hi MattPD, thanks for sharing your time and insights.

    The one big lesson I continue to learn is that there is always room for improvement in code, even for short and "simple" projects like this one.

    >> ...use one of the character classification functions (header ctype):

    An earlier version used ispunct but it removed legitimate hyphens (e.g., knick-knack). I almost left it that way but it bugged me so I put in a custom function. After I posted the article, I converted the function to a lambda and let the compiler sort out the call :)

    >> size_t

    I'll give it a try. Is this a best practice?

    >> Boost

    Now I know why Boost has a huge following. Do I get points for doing it the "hard way"?

    -eric

  • This is great, Eric. Reading your post was like traveling back to my early days doing what you do now. You are not alone in this rediscovering trip. Best wishes, happy coding and make sure that even the most experienced C++ developer can learn something even from back-comers like us.

  • MattPD:

    >> ..we may want to consider skipping the first argument altogether

    :)

    The first version of the program did just that but as James McNellis pointed out, on Windows the program is not guaranteed to have the path\exe as the first argument, e.g. when invoked through CreateProcess. So I fixed it.

    I *almost* narrowed the requirements to say "command line only", but did the Right Thing.

  • Typo: "be sure", not "make sure" (I'm back to English as well :-P

  • Continuing the footnote on the following (apparently, there's a character limit on the comments, how Twitter-like ;]):

    "As for std::size_t as *the* good-for-counting-type: arguably, "in general" * (containers have their own size_type, nothing wrong with using it, although it may often be just a typedef to std::size_t), yes."

    * -- qualifications & caveats reserved due to enormous variety of possible C++ applications ;-) Here, the focus is on being future-proof & cross-platform, with an extra benefit of readable, self-documenting code (expressing the intent to count). There are, however, specific situations where we'd rather be platform-specific, sacrificing generality/readability, e.g., low-level communication protocols, network card drivers (where you know how many bits are representing, say, a packet size, and want to ensure it doesn't change), structures intended for serialization (where you explicitly *don't* want the benefits of size_t scaling to match your architecture word, but would rather have a fixed-width type so as to ensure compatibility) -- in these cases you'd consider fixed width integer types /* en.cppreference.com/.../integer */ (and, obviously, "int" and "unsigned int" would be out of the running, for the same reasons) -- although, IMHO, a typedef (clarifying the intent to use one of these as a count/size type) would still be a good idea.

  • There is a better way, Jumping into C#!

  • For "best practice", "words[word]++;" should really be "++words[word]". Although it doesn't really matter in this case since the map is storing ints, post-increment results in a copy of the object which can be expensive. You should always use pre-increment unless you actually need the previous value.

  • A couple of notes: isremovable() should return bool, and return 0; is unnecessary at the end of main() in C++.

    Here's my approach, word_freq.cpp at http://sdrv.ms/UCieo5  The core functionality is:

    const regex word_regex("\\w+");

    for (string line; getline(file, line); ) {

       for (sregex_token_iterator i(line.begin(), line.end(), word_regex), end; i != end; ++i) {

           ++word_freq[*i];

       }

    }

    The outer for-loop reads the file line-by-line. The inner for-loop tokenizes each line into words, where a word is defined as one or more alphanumeric-or-underscore characters (like identifiers in C++). This automatically ignores whitespace, punctuation, and so forth.

    There are significant differences between how WordCountInFiles.cpp and word_freq.cpp parse words. Given the input "cute ... fluffy,kittens" without quotes, WordCountInFiles.cpp sees "", "cute", and "fluffykittens". This is because it consumes whitespace-separated substrings, then eliminates punctuation. word_freq.cpp sees "cute", "fluffy", and "kittens".

  • @STL - Show-off :]     Seriously though - modern C++ can be quite expressive when you learn all the goodies that make up the language and the standard library (never mind throwing Boost into the mix).

  • Eric: MSDN moderation seems to hate my guts and wish me ill /* :-( */, so I guess I've just gotta post a link to the reply instead: pastehtml.com/.../cpjyl96zr.rtxt :-)

    The STL's idea to use regex -- pretty nifty! :-)

  • Precompiled headers are specially designed to make our life easier with big projects. Using Unicode is a good idea if you write for Windows. The second check, for the EXE, is much easier: you should know argv[0] is always the exe, so you have to bail out if argc < 2.

  • If you wanted to print a specific error message when the file can't be opened, try:

       if(!infile.is_open())

    One way to make @Stephan T. Lavavej's version even neater is to use the "range" concept. In simple terms a range is an object with begin and end member functions that return iterators indicating the first and one-past-end positions in a sequence.

    So std::vector is already a range. The new for (e : r) syntax is a consumer of ranges (more accurately, it uses the free functions std::begin and std::end so it can also work with arrays).

    Ideally C++ would also have a sregex_token_range type, which would just be a thin wrapper around sregex_token_iterator.

    Unfortunately C++ doesn't have that type, but we can define our own in a handful of lines, and then we can forget that quirky pattern of using a default-constructed iterator to refer to the end position. The almost-standard directory_iterator, for looping through the contents of a file-system directory, also uses that pattern.

    And why not define a file_line_range type as well, which encapsulates opening a stream (throwing if it fails) and then iterating through the lines in it? Then iteration becomes totally consistent at both levels: just use the new for (e : r) syntax.

    const regex word_regex("\\w+");

    for (auto line: file_line_range(file))

      for (auto word: sregex_token_range(line, word_regex))

          ++word_freq[word];

    If you search for boost range, you'll find out how to follow this idea to the point where you can pipe a range through a series of transformations using the | operator, e.g.:

    using namespace boost::adaptors;

    // loop through words backwards!

    for (auto word : words | reversed)

       ...

    Just like with LINQ-to-objects, the computation is "lazy", because it just composes a range that operates on each element as you iterate through the sequence.

  • Another vote for using tested libraries (like MattPD's suggestion of boost). What will your test do if file contains the string "abc"?

    if(file.rfind(".exe") == (file.length() - 4))

    if(string::npos == 3 - 4)

    if(-1 == -1)
