Larry Osterman's WebLog

Confessions of an Old Fogey
Book Review: Find the Bug by Adam Barr

Several months ago, I mentioned Adam Barr's book "Find the Bug".  I pre-ordered it from Amazon, but have only now finally finished reading it, so it took me somewhat longer to get this book review out :)

This is the first book review I've done since middle school, so please bear with the somewhat free-form style :)

 

First off, I liked the book, and I'd heartily recommend it to groups that are doing code reviews as a primer.

In "Find the Bug", Adam presents 50 vaguely real world examples of code bugs that are likely to be found only by debugging or careful code inspection.  He spreads the bugs across five different languages (C, Java, Python, Perl, and x86 assembly), providing ten different bugs in each.

The book is organized into five sections, one for each language.  Each section opens with a brief primer on the language, basically enough information for a reader unfamiliar with it to be able to find the bugs (for instance, I don't know Python, but was able to find most of the bugs in the Python section after reading the language overview).

For each bug, Adam follows a basic template.  He starts with a brief description of the purpose of the function (sorting entries, searching for values, playing Life, playing Go Fish, etc), then gives the code for the function, a series of questions about the code, and a series of hints to aid in the process of finding the bug.  Finally, he explains the origin of the bug and presents the corrections needed to fix it.

Adam does a good job of presenting a fairly broad set of potential bugs, and given the diverse languages that he presents, there are opportunities for people to develop their code reviewing skills in a series of different languages.

Adam has a good writing style (as shown in his previous book Proudly Serving my Corporate Masters), and it carries forward in this book - the examples are well written and read easily.

One thing I did not like about the book was the bug classification system.  Adam categorizes all the bugs in the book according to a taxonomy of bugs based roughly on the taxonomy that Donald Knuth used for TeX.  Personally, I found the taxonomy rather distracting, especially since the classification of some of the bugs wasn't clear - for instance, bug 7.2 (assembly language, multiply two numbers using shifts) could be either F.location or A.logic depending on how you wanted to classify it.

It's also clear that Adam's not an assembly language programmer.  I was cringing reading some of the assembly language sections (especially the multiply using shifts example) because the code looked "wrong" to my eyes.  Essentially, the root of the problem was that Adam didn't use some of the "standard" tricks that a seasoned assembly language programmer would use.  For instance, he had a loop where he used TEST ECX, ECX/JNZ <destination> - an assembly language programmer would use JECXZ <temporary label just past the JMP>/JMP <destination> instead (because JECXZ doesn't change the flags register).  He also zeroed registers using MOV EAX, 0 instead of XOR EAX, EAX (which results in slightly smaller code).  These are total nits however :).

Find the Bug's not a good general purpose textbook, but, as I said, it's a really good primer to help hone code reviewing techniques.  It's going to stay on the bookshelf.

I give it a thumbs up :)

 

Oh, and I've got to point out the Kanji Backspace problem (sorry Adam, I had to point it out).  This is problem 3.10 (C language, Kanji Backspace).  Adam's blogged about this one, but the upshot is that we had a disagreement about the Kanji Backspace algorithm as presented in his book - the algorithm does work, but the code as presented doesn't correctly identify Kanji lead byte characters.  The algorithm specified also only works for DBCS character sets like Kanji - it doesn't work for general purpose multi-byte character sets.

 

Details:

Title: Find the Bug
Author: Adam Barr
ISBN: 0-321-22391-8

 

Full Disclosure: Adam's a friend of mine; we've known each other for at least 12 years now.

 

Edit: Removed nonexistent JECXNZ instruction, thanks Benny.

 

  • Perhaps he's an ASM programmer from the days of the 386? I used to do assembly programming, but I've only ever heard of JCXZ, not JCXNZ. Was it added to the instruction set of Pentiums, or is there some 'shortcut' for certain assemblers that just reverses the logic to use JCXZ instead?
  • @#$ You're right, there's no JCXNZ, for some reason I thought there was... Sigh...

    Correcting the article, thanks Benny.
  • Thank you for the review, Larry. You heard the man, go buy the book!

    I admit to not being an assembly-language programmer, but I'll defend the two examples you gave. I mention (at the bottom of p. 218/top of p. 219) the "xor eax,eax" way of zero-ing out a register, and explain that it is historical from the days when xor was faster than an immediate move. Since as far as I know this is no longer true (they are both 2 cycles on any >=386 processor), I mostly used mov eax,0 for clarity, although in a few cases I did xor eax,eax to get readers used to that. For the test ecx,ecx one, I don't remember what in particular I was thinking when I wrote the code, but I would assume I wanted to use the test opcode somewhere so I chose that place (I was also trying to use the various loop, rep, scas, mov, etc. operators at least once, which led to a bit of shoehorning).

    For Kanji backspace, you should note that I finally posted an errata section on the book's website:

    http://www.findthebug.com/errata.htm

    Larry, I think you also pointed out errata #2 to me, but someone else did also, so I wanted to share the glory.

    Thanks.

    - adam
  • Cool - Actually I hadn't seen the errata page :) And I think I did point out #2, and you explained why it occurred quite reasonably over lunch :)

    There was actually a HUGE discussion when the 386 first came out about how best to zero a register - whether xor eax, eax, or mov eax, 0 or sub eax, eax was the fastest. We eventually settled on xor eax, eax, I believe because the instruction was shorter (2 bytes, versus 5 for mov eax, 0 in 32-bit code), and that's what the compiler uses.

    As I mentioned, it's one of those x86 assembly language tricks - every assembly language person knows it. Your code (mov eax, 0) IS much cleaner :)
  • What is this Kanji Backspace problem ? Can't find a description of it.
  • The problem involves DBCS-encoded strings. In DBCS, a "character" can occupy one or two bytes (for the purposes of this problem, anyway). The 256 possible values for a byte are split into those that are "lead bytes" (that's lead as in "be in front", not the metal) and those that are "trailing bytes". The function IsDBCSLeadByte() will tell you if a given byte is in the lead byte range or not.

    A double-byte character always has a lead byte in the first byte and then any byte in the second byte. A single-byte character will consist of one trailing byte. We'll abbreviate the lead bytes as "H" and the trailing bytes as "l" (using those designations for historical reasons, since the lead bytes usually have the high bit on -- the mistake I made in the book was to claim that lead bytes ALWAYS had the high bit on and trailing bytes ALWAYS had it off).

    So a double-byte character is of the form HH or Hl and a single-byte character is of the form l. Going forward one character in a string is easy because you can check if the current byte is H or l, but going backwards is tricky because an l byte could be a single-byte character or the second byte of a double-byte character. So the "Kanji backspace" problem involves writing the code to do backspace.

    - adam
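The backspace step described above can be sketched in C. This is only an illustration of one common approach, not the code from the book: `is_lead_byte`, `dbcs_next` and `dbcs_prev` are hypothetical names, and `is_lead_byte` is a stand-in for IsDBCSLeadByte() that assumes the Shift-JIS lead-byte ranges 0x81-0x9F and 0xE0-0xFC. It also assumes the string is well formed and that the pointer passed in sits on a character boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByte(); ASSUMES the Shift-JIS
 * lead-byte ranges 0x81-0x9F and 0xE0-0xFC. */
bool is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Forward step: a lead byte means the character is two bytes long. */
const unsigned char *dbcs_next(const unsigned char *p)
{
    return p + (is_lead_byte(*p) ? 2 : 1);
}

/* Backward step. `p` must be a character boundary in a well-formed
 * string that begins at `start`. */
const unsigned char *dbcs_prev(const unsigned char *start,
                               const unsigned char *p)
{
    assert(p >= start);
    if (p == start)
        return start;                 /* nothing to back up over */

    /* A lead-range byte is never a whole character, so if the previous
     * byte is one, it must be the second half of a two-byte character. */
    if (is_lead_byte(p[-1])) {
        assert(p - start >= 2);
        return p - 2;
    }

    /* Otherwise count the unbroken run of lead-range bytes before p[-1].
     * The run may reach all the way back to the start of the string. */
    size_t run = 0;
    const unsigned char *q = p - 1;
    while (q > start && is_lead_byte(q[-1])) {
        --q;
        ++run;
    }

    /* Odd run: the last lead byte pairs with p[-1] (an "Hl" character).
     * Even run: the lead bytes pair up among themselves ("HH") and
     * p[-1] is a single-byte character. */
    return (run % 2 == 1) ? p - 2 : p - 1;
}
```

Note that the scan only ever looks at bytes before the current position, but in the worst case (a long run of lead-range bytes) it walks back to the beginning of the string.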
  • Two separate observations here:

    1. Actually there is a Japanese encoding system where every byte value in the range 0 to 127 is a single-byte character and every byte value in the range 128 to 255 is part of a non-single-byte character (including Kanji and other characters, and including triple-byte characters). Microsoft only supports that encoding in Internet Explorer, not elsewhere, not even in Hyperterm, not even in Windows Services for Unix. Though even in OSes that do support that encoding, you can't just look at a byte and figure out if it's the first or second (or third) byte of a character. Regardless of the encoding, you can't just look at a single byte and figure out if it's H or l, you still have to parse the string from an earlier known point.

    2. Regardless of the chosen encoding, I can't figure out what's special about backspace. In order to display a string or do quite a number of other operations on it, at some level the program already has to know which bytes are part of each character. Backspace just strips off the bytes representing the last character, and ought to be pretty simple, just using information that must be stored already anyway.
  • 1.
    If you are talking about UTF-8 then you are wrong.
    You can always find out if the byte is first or second. More details here http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

    2.
    The Kanji is special because you don't know how many bytes the last character is. To find out you should start from the beginning of the string. Well, with long strings this could take quite some time. So to speed it up you use the current position pointer to take the previous byte and try to find out whether it is first or second in the char.
    Try it and find the catch :)
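For comparison, here is a minimal sketch of the UTF-8 case mentioned in point 1 of this comment. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so a byte can be classified on its own. The name `utf8_prev` is hypothetical, and the sketch assumes well-formed UTF-8 with `p` on a character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the start of the character that ends
 * just before `p`. ASSUMES well-formed UTF-8 beginning at `start`. */
const unsigned char *utf8_prev(const unsigned char *start,
                               const unsigned char *p)
{
    assert(p >= start);
    if (p == start)
        return start;                 /* nothing to back up over */

    --p;
    /* Continuation bytes are 10xxxxxx; skip back over them until we
     * reach a byte that starts a character. */
    while (p > start && (*p & 0xC0) == 0x80)
        --p;
    return p;
}
```

Because the longest encoding allowed by the current UTF-8 definition is four bytes, this loop backs up a bounded distance on valid input, which is the property the DBCS encodings lack.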
  • 2/13/2005 1:40 PM Ivan
    > 1.
    > If you are talking about UTF-8 then you are
    > wrong.

    Big deal. I said what I was talking about. Compare your statement with the following hypothetical and equally true statement: if you are talking about the price of kiwi fruit in Greenland then you are wrong.

    > 2.
    > The Kanji is special because you don't know
    > how many bytes the last character is.

    Bingo. Nor any other character. So suddenly you've figured out that I'm right, right?

    > So to speed it up you use the current position
    > pointer to take the previous byte and try to
    > find out whether it is first or second in the char.

    Oops, now you're wrong. My previous posting already said what was wrong with it. Looking at the previous byte by itself, you can't figure out if it's the first or second (or third). You have to start with the previous position pointer, not the single byte previous to the current position pointer.
  • 1. UTF-8 fits the description of the system you talk about, but doesn't have the defect you attribute to it :)
    Maybe the one you have in mind is called Kiwi? ;)
    BTW you cannot have a variable-size charset without some way to encode the size. Obviously, bytes being in the range 128-255 for a multibyte char is not a sufficient condition to define the code set. You need some way to mark the beginning or the end of the char, e.g. if you have 6 bytes in the 128-255 range, are they 2 chars or 3 chars?
    So please find the name of the charset you are talking about and its real definition.

    2. I had put a lot of effort into not SPOILING how you find out whether the byte is first or second. Obviously you do that by examining the previous bytes. You do that by parsing the string backwards.
  • 1. EUC. I did say it was a Japanese encoding system not a universal (planetary or galactic or whatever) coding system, and I think it's obvious enough that I wasn't talking about either shift-JIS or raw JIS, so how many others are there?

    1. and 2. You can't find it by parsing backwards. (Well you can sometimes, but it's not something you should depend on.) You have to parse forwards from a known point, which would be the start of the string the first time that you parse the string, but which will normally be other information that you maintain about each character. If you're doing a text editor or command line input handler or just about anything else, you already have to maintain more information about each character than simply the byte offset where it starts in memory.
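The "parse forwards from a known point" approach argued for here can be sketched for EUC-JP as follows. This is an illustration only: the names `eucjp_len` and `prev_by_forward_scan` are hypothetical, the known point is simply the start of the string, and the input is assumed to be valid EUC-JP (ASCII bytes, two-byte characters, 0x8E plus one byte, or 0x8F plus two bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the EUC-JP character starting at `p`.
 * ASSUMES valid input; unused byte values are not rejected. */
size_t eucjp_len(const unsigned char *p)
{
    if (*p < 0x80)
        return 1;                     /* ASCII */
    if (*p == 0x8F)
        return 3;                     /* 0x8F prefix: three-byte character */
    return 2;                         /* 0x8E prefix or ordinary two-byte character */
}

/* Find the character before `p` by walking forward from `start`,
 * remembering where the most recent character began. */
const unsigned char *prev_by_forward_scan(const unsigned char *start,
                                          const unsigned char *p)
{
    const unsigned char *prev = start;
    const unsigned char *cur = start;
    while (cur < p) {
        prev = cur;
        cur += eucjp_len(cur);
    }
    assert(cur == p);                 /* `p` must have been a boundary */
    return prev;
}
```

The cost is linear in the distance from the known point, which is why an editor would keep a nearer known point (or per-character information) rather than rescanning from the start of the buffer on every backspace.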
  • 1. EUC seems to be 1 and 2 bytes per char. No 3.
    2. Yes you can. Actually this is the Kanji backspace problem.

    Well, I guess it's time to read Adam Barr's latest book.
    Have fun.
  • 2/15/2005 5:44 PM Ivan

    > 1. EUC seems to be 1 and 2 bytes per char.
    > No 3.

    True except for the 3-byte characters. See for example
    http://www.rikai.com/library/kanjitables/kanji_codes.euc.shtml

    > 2. Yes you can.

    True in some rare cases. In general you have to be prepared to back up as far as the beginning of the string.