When localizing native code applications, it's not unusual to come across strings resembling this:
Couldn't find path "%s", error #%d.
This should be pretty straightforward. There are clearly two placeholders in this string, %s and %d, and it's also quite clear what they will contain at runtime: a path and an error number.
The type of placeholders used tells us a lot. Just by looking at the string, we ought to know that the code is probably using printf (or similar), and therefore we know that we need to be very careful to get the format specifiers right.
Take a gander at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclib/html/_crt_format_specification_fields_.2d_.printf_and_wprintf_functions.asp. This page has an exhaustive explanation of what valid format specifiers look like. Did you notice that happy note at the bottom of the page? The one that says "Security Note - Ensure that format specification strings are not user-defined"? "User-defined" here also includes "localizer-tampered" - if the localizer makes a mistake, you have a problem.
For the string above, let's pretend* that the code is something like:
const char *path = "c:\\temp";
const char *message = "Couldn't find path \"%s\", error #%d.";
printf(message, path, 132);
Here, printf would do something along these lines: it walks through the message string from start to finish, replacing each format specification with the next of the remaining arguments, and then prints out the result. The function will replace %s with the string in path, and %d with the value 132. Worth noting here is that the argument "path" isn't actually the text itself, but the address where the text to be printed can be found. The output from this code would be:
Couldn't find path "c:\temp", error #132.
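The walk described above can be sketched in a few lines of C. This is only an illustration, not how the real printf is implemented: mini_format and append are hypothetical helpers, they write into a buffer instead of printing, and they understand nothing but %s, %d and %%. The point to notice is that the format string alone decides how each argument is read.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Append text to buf without overflowing; returns the new length. */
static size_t append(char *buf, size_t size, size_t used, const char *text)
{
    while (*text != '\0' && used + 1 < size)
        buf[used++] = *text++;
    return used;
}

/* Hypothetical helper (not the CRT's printf): understands only %s, %d, %%. */
static size_t mini_format(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    size_t used = 0;

    assert(buf != NULL && fmt != NULL && size > 0);
    va_start(args, fmt);
    for (const char *p = fmt; *p != '\0'; p++) {
        char piece[32] = { *p, '\0' };  /* default: copy one literal character */
        if (*p == '%') {
            p++;
            if (*p == '\0')
                break;                  /* lone '%' at the end of the string */
            if (*p == 's') {
                /* Trust the format string: read the next argument as an
                   address of text, whatever the caller really passed. */
                used = append(buf, size, used, va_arg(args, const char *));
                continue;
            }
            if (*p == 'd') {
                /* Trust the format string: read the next argument as an int. */
                snprintf(piece, sizeof piece, "%d", va_arg(args, int));
            } else {
                piece[0] = *p;          /* "%%" or unknown: copy it literally */
                piece[1] = '\0';
            }
        }
        used = append(buf, size, used, piece);
    }
    va_end(args);
    buf[used] = '\0';
    return used;
}
```

Calling mini_format(out, sizeof out, message, path, 132) with the message and path from the example fills out with the same text as shown above. Nothing in the function checks that the arguments really have the types the specifiers claim.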
Very good. Now, let's pretend that the localizer needs to switch the word order around a little bit, and we end up with the equivalent of:
const char *path = "c:\\temp";
const char *message = "Error #%d: couldn't find path \"%s\".";
printf(message, path, 132);
What happens here? Well, printf again walks through the string, trying to replace the format specifiers with the supplied arguments. Going by the translated string, it expects first to find an integer, then the address of a string. Only, because the translation is messed up, the arguments don't actually match the specifiers. The first placeholder - %d - means that the argument "path" is interpreted as a number. This kinda works since an address actually is a number, but it means that instead of the text, the memory address of the text will be printed. The second placeholder on the other hand - %s - means that printf will interpret the third argument as the address of some text. The third argument here is 132. When this is interpreted as an address, Windows will typically notice that the application is trying to poke around in some part of the memory where it doesn't belong, and promptly shut down the process. Cue Dr. Watson.
This should make a few things clear:
The good news is that it is fairly straightforward to avoid or catch these bugs. Assuming that the source text is correct, it should be easy to create a check that compares each source-translation pair and makes sure that the same format specifiers occur in the same order.
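A minimal sketch of such a check, in C, might look like the following. The function names next_spec and same_format_specifiers are made up for this illustration, and the parsing is deliberately simplified: it skips %%, accepts the common flag, width, precision and size characters, and treats the next character as the type. A production check would need to follow the full format specification grammar.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the next format specifier at or after *cursor
   into out and advances *cursor past it. Returns 0 when none is left. */
static int next_spec(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;

    while ((p = strchr(p, '%')) != NULL) {
        if (p[1] == '%') {              /* "%%" is a literal percent sign */
            p += 2;
            continue;
        }
        /* Simplified grammar: flags, width, precision and size prefixes... */
        size_t len = 1 + strspn(p + 1, "-+ #0123456789.*hlLjztI");
        if (p[len] != '\0')
            len++;                      /* ...followed by the type, e.g. 's' */
        size_t copy = len < out_size ? len : out_size - 1;
        memcpy(out, p, copy);
        out[copy] = '\0';
        *cursor = p + len;
        return 1;
    }
    return 0;
}

/* Returns 1 when both strings contain the same specifiers in the same
   order, 0 otherwise. */
static int same_format_specifiers(const char *source, const char *translation)
{
    char a[32], b[32];

    assert(source != NULL && translation != NULL);
    for (;;) {
        int has_a = next_spec(&source, a, sizeof a);
        int has_b = next_spec(&translation, b, sizeof b);
        if (!has_a || !has_b)
            return has_a == has_b;      /* both must run out together */
        if (strcmp(a, b) != 0)
            return 0;                   /* different specifier or order */
    }
}
```

Run against the two messages from this post, same_format_specifiers returns 0 for the reordered translation, which is exactly the case that would have crashed at runtime. A translation that keeps %s before %d returns 1.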
That's it for now. Next time I'll talk just a little bit more about how to author & translate messages with printf format specifiers.
* My examples hard-code all strings for clarity. In reality, if the path and error number were known at compile time, there wouldn't be any reason not to include them in the message.