Today I'm going to discuss a few clever algorithms for dividing values by constant values known at compile-time, without the use of the division instruction.

In compiler optimization theory, the basic problem we attempt to solve is to create algorithms which take a specific program as input and produce a better program, for example a program which runs faster, uses less code space, or even consumes less power. An efficient standard library is an important step in creating efficient programs, but it does not provide the advantage of optimizations, which can dynamically adjust their behaviour to suit each input program. To demonstrate, today I'm going to discuss a very narrow but suprisingly interesting compiler optimization: efficiently dividing an (unsigned) integer by a fixed constant, such as 37.

In the bad old days, the division instruction on x86 machines was extremely slow, and the use of clever bit tricks was very common to speed it up, especially in high performance areas like graphics and simulation. Today we're more concerned with embedded systems, where, to save on chip cost, division is often implemented as a software library routine. Such a routine is necessary if neither argument is known until runtime, but it is extremely slow, easily taking hundreds of clock cycles. In many common applications, we know the divisor at compile-time, and we can exploit this knowledge to craft a specialized division algorithm that drastically outperforms the generic one.

Most programmers are aware that a value can be divided by 2n, truncating the remainder, by merely shifting it right by n bits. This works because the positional binary numeral system multiplies the kth digit from the right by 2k. This simple optimization is present in nearly every compiler; for this reason, overzealous programmers who convert their divisions by 4 into bit-shifts will usually discover that there is no improvement (at least if they measure the improvement, as they always should).

But what if we want to divide by a different number, say, 10? This would be quite handy to do efficiently, since division by 10 is often used in converting from binary to decimal or binary-coded decimal. Since 10 = 5×2, we really just have to shift right by 1 and then divide by 5. Dividing by 5 is the tricky part. There are several approaches; the following ideas are taken from the pre-released Chapter 10: Integer Division by Constants of the next edition of Hacker's Delight, an excellent book full of neat bit tricks like these.

First, supposing a multiplication instruction is available, let's think about how we can divide using multiplication. First notice that dividing x by y is the same as multiplying x by 1/y, and we know y at compile time, so we can find 1/y then. We use a fixed-point representation to perform the fractional arithmetic using integers. For example, in the decimal system, one can divide by 5 by simply multiplying by 2 and then moving the decimal point one place to the left (x/5 = 2x/10). In binary, however, 1/5 does not have a terminating decimal; instead we have:

1/5 = 0.001100110011...2

Now, suppose we have an instruction that multiplies two 32-bit values and produces a 64-bit result. Just as x/5 = 2x/10, we also see that:

x/5 = (858993459.2)x / 232.

We can approximate (858993459.2)x by (858993459.25)x, which is approximated in turn by 858993459*x + (x >> 2). We just do one multiply, one shift, and one 64-bit add, then take the high word (the high 32 bits). This is just a heuristic argument, but sure enough, if we exhaustively try this formula on every 32-bit value, it never fails. On many machines like the x86 this can be done with as little as 4 instructions.

But what about the machines not fortunate enough to have a 64-bit multiply result? Or what if you're writing in a high-level language like C where multiplication always has the same size operands and result? In this case the task is more difficult. The key is to notice that simulating an approximate multiply by 1/5 isn't so hard, because its bit pattern is very regular: 1/5 is about 0.00110011001100110011001100110011. Now, notice that:

0.00110011001100110011001100110011 = 0.0011001100110011 + (0.0011001100110011 >> 16)
0.0011001100110011 = 0.00110011 + (0.00110011 >> 8)
0.00110011 = 0.0011 + (0.0011 >> 4)
0.0011 = 1 >> 3 + 1 >> 4

Applying this logic in reverse, we devise the following algorithm for approximating x/5:

    q = (x >> 3) + (x >> 4);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);

Unfortunately, this algorithm isn't correct off the bat like our last one, because we're discarding bits and accumulating rounding error all over the place. However, using multiplication, we can get an idea of how far off we are:

    r = x - 5*q;

By adding a carefully devined error adjustment based on this value, to be precise 11r/32, we come out with a correct algorithm that can be verified by exhaustive testing. See the PDF linked above if you're interested in the precise details of how we come up with this adjustment. Our complete algorithm reads:

    q = (x >> 3) + (x >> 4);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    r = x - 5*q;
    return q + ((11*r) >> 5);

Of course, while being able to divide by 5 is great, ideally we'd want our compiler to come up with a clever algorithm for whatever constant we choose to divide things by. This is considerably more complicated, but there are compiler algorithms which will generate clever bit twiddling code like the above for any constant divisor and inline them right into the code performing the division. Especially for small divisors, the result is drastically more efficient than a call to a general software division function.

A much simpler version to understand in the general case is called exact division. Say you have a number that, for whatever reason, you already know is divisible by 22, and you want to divide it by 22. In this case, the solution is simple. We first shift right by 1; it remains to divide it by 11. Next, drawing from the theory of groups and modular arithmetic, we note for every odd number a less than 232, there is necessarily a multiplicative inverse b such that ab = 1 (mod 232). In other words, multiplying a by b and taking the low 32 bits, which the hardware does already, will result in the number 1. Consequently, x/a = bx/(ab) = bx (since ab=1). There are simple efficient algorithms to find this inverse, such as the extended Euclidean algorithm. Here are the inverses of several small odd integers mod 232:

3 2863311531 11 3123612579
5 3435973837 13 3303820997
7 3067833783 15 4008636143
9 954437177 17 4042322161

In our example, to divide, say, 3916 by 11, we'd just compute 3916*3123612579, which when truncated to the low 32 bits by the hardware is 356.

If you're interested in this kind of stuff, check out Hacker's Delight - it covers not only tricks for arithmetic but for other more complex operations such as counting the number of bits set in a word (the population function). I hope you guys found this interesting. As always, please feel free to leave any comments or e-mail me.