Welcome to MSDN Blogs Sign in | Join | Help
Contest Winner

So here is my answer to Monday’s contest:

unsigned __int32 factorial( unsigned __int32 n )

{

      static const unsigned __int32 values[] = {1,1,2,6,24,120,720,5040,40320,362880,3628800,39916800,479001600};

 

      if( n >= sizeof( values ) / sizeof( values[0] ) )

            return 0;

      return values[n];

}

Note how there’s no need for any loops or recursion. Why? Because 13! is greater than 2^32-1, so it can’t be calculated with unsigned 32-bit integers. And as the problem stated, if it can’t be calculated, return 0. Thus only the first 13 values (0 through 12) need to return a number that’s not 0. Pre-calculating this values and storing them in a look up table saves us a lot of time at run time.

Some things that indicate a good answer:

·         It’s complexity is O(1), so it executes in constant time.

·         The look up values are static so they aren’t initialized each time we call the function.

·         It works on any architecture.

Some things I saw in answers that made them less efficient:

·         Loops, recursion

·         The lookup table was not static, so it was pushed on the stack (or in some languages on the heap) each time the function would get called.

·         The look up table was signed. This means we need to convert to unsigned before we return which is unnecessary extra work.

·         The look up table was wrong (be sure to proof read!)

·         The code would only work on one architecture (such as x64 or x86, but not both and not anything else).

The last point wasn’t specified in the problem, so that’s ok if yours did, but it just keep that in mind when you do lots of bit work. Are you in big-endian or not? Are you on x64 or x86 or IA64 or what?

Which brings me to the winning entry submitted by ‘shaoxp’:

unsigned __int32 factorial( unsigned __int32 n )

{

    // return the factorial of n

    // if it can't be calculated return 0

    //

    //the max number of unsigned __int32 is 4294967295,which is between 12! and 13!

    //we use index 0 to store result for n >12

    //

    static unsigned __int32 fac[] = {0,1,1,2,6,24,120,720,5040,40320,362880,3628800,39916800,479001600};

 

    return fac[(((signed)n-13)>>31)&(n+1)];

    //by the way,if the function can be inline,i think, because there is no branch,the CPU can optimize the execution.

}

 

I like this one because it explicitly avoids the if statement. Why is this good (or bad)? CPUs need to wait for a branch instruction, such as an if statement, to fully execute before it knows which instruction to execute next. With no branches in the code, we never hit this case. However this solution is performing additional calculations than mine does. Also, modern day CPUs include branch prediction: the CPU will guess on which branch to take and begin executing that route. When the if statement finishes executing, it will see if its guess was right or wrong. If it was right, it’s already calculated what would have happened and continues on. If not, it wipes what it already had done and begins executing the other branch. So, depending on the program that uses this function would depend on how good the branch prediction is. Thus the extra calculations needed to avoid the branch may back fire as the CPU could avoid the branch all together. Either way, it’s a neat solution.

Thanks to everyone who submitted an answer! Stay tuned next Monday for your next chance to win the best contest on the Internet.

Posted: Friday, March 30, 2007 7:57 AM by Chris Becker

Comments

shaoxp said:

check about this. it executes faster about 10 times than my submitted solution.what i do is just translate the code to inline assembly!

static unsigned __int32 gfac[] = {0,1,1,2,6,24,120,721,5040,40320,362880,3628800,39916800,479001600};

unsigned __int32 __declspec(naked)  factorial( unsigned __int32 n )

{

   __asm{

       mov         eax,[esp+4]

       sub         eax,0Dh

       sar         eax,1Fh

       mov         ecx,[esp+4]

       add         ecx,1

       and         eax,ecx

       mov         eax,dword ptr gfac[eax*4]

       ret

   }

}

# March 30, 2007 3:34 AM

JediRy said:

It would be interesting to actually see how the two implementations compare in actual use. It's not clear to me that the "bit magic" version is actually better than the human-readable one, because I think the compiler should be able to reduce the conditional in the first version to a single subtraction plus a single jump instruction, vs. 4 arithmetic operations in the second version.

# April 6, 2007 8:01 PM

Chris Becker said:

Yes, but jumps can potentially take much longer. Where modern day CPUs can pipeline a hundred instructions at once, a jump instruction forces the CPU to wait until it knows which one to execute next at a loss of a hundred instructions. In that case 4 extra arithmetic instructions is much better.

However CPUs also have branch prediction where they will execute what they think the jump will be. If they're wrong they go back and do the other way, nothing is lost. So the question is, is the average amount of time it takes to execute a jump more or less than the arithmetic stuff? That depends a lot upon the program using the function.

I wouldn't say avoiding the jump here is 100% better, but it is a different kind of optimization.

# April 6, 2007 9:32 PM
Leave a Comment

(required) 

(required) 

(optional)

(required) 

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Page view tracker