From time to time, you see an algorithm or a technique that makes you marvel at the sheer beauty of it.

Barry Bond (who is our in-house emulation God, architect, and chief troublemaker all rolled into one :-)) posted about the improvements that he and his team made to the emulator to speed it up by a whopping 40% from its first version to the second.

As a part of the team, I've been privileged to watch check-in after check-in as the emulator team chipped away at performance hotspots. For anyone interested in profiling, or just in making their code faster, that post is definitely worth reading.

There's an innocuous little section where Barry talks about how much faster they poll for interrupts now.

The problem is simple - you have some stuff that can happen asynchronously. How do you poll (assume that polling is the only option, for a variety of other reasons) and see whether anything has happened? What is the fastest way to poll? How frequently do you poll?

The typical implementation is what Barry and his team used in V1 of the emulator. The mainline code runs this check at a pre-determined granularity (at ARM instruction boundaries, in the emulator's case) and the asynchronous event (the interrupt firing, in the emulator's case) is in charge of toggling fSomethingHappened (which is usually volatile, since this is typically a multi-threaded scenario). I've listed the assembly I got after compiling this with full optimization, and I've stopped the DoSomething function from getting inlined to make this clearer.

volatile bool fSomethingHappened = false;

void DoSomething();   // defined elsewhere; kept out-of-line for clarity

int _tmain(int argc, _TCHAR* argv[])
{
	if(fSomethingHappened)
	{
		DoSomething();
	}

	return 0;
}


  mov         al,byte ptr [fSomethingHappened (40336Ch)] 
  test        al,al 
  je          wmain+0Eh (40174Eh) 
  call        DoSomething (401730h) 
  xor         eax,eax 

You have a compare and a jump. Now, jumps are expensive (a mispredicted branch stalls the pipeline) and generally a bad thing. How do we get rid of the jump? In fact, how do we poll... without using *any* instructions at all?

Let's say I gave you a guarantee - that the stuff you need to do inside DoSomething when something happens is going to take a lot of time - so you don't need to optimize for that case. You only need to optimize for the common case - that is nothing happening and fSomethingHappened being false.

In this case, we can do something really interesting - we can make the code in the asynchronous event firing responsible for *adding* the code to handle it as well.

How? See below

__declspec(naked) void DoSomething()
{
	__asm
	{
	 ret;
	}
}



int _tmain(int argc, _TCHAR* argv[])
{
	
	DoSomething();
	
	return 0;
}

DoSomething:
 ret

wmain:
 call        DoSomething (401730h) 
 xor         eax,eax 

This seems really weird at first glance - you have empty functions that don't do anything. That's because the event-firing code patches the function with the code required to handle the interrupt.

Here's the flow

  1. Thread 1 calls DoSomething. Returns instantly.
  2. Event fires on Thread 2 - so the event firing code patches DoSomething with a jump to the right code/the right code itself.
  3. Thread 1 merrily tries to poll again - calls DoSomething and handles the event. Life is good.

Check out how low-cost the polling is in the normal case - when no event is being fired, we just do a call and a ret. But I hear some of you screaming - you're still using 2 instructions. And why a call and a ret when you are not doing anything on the stack? You said you'd use no instructions! Well, you see, we actually are using 0 instructions.

This is due to the way processors speculate and look ahead. All modern processors will look at the 'call' and see that it hits a 'ret' directly. They'll be smart enough to figure out that a call followed immediately by a ret is effectively a nop - and therefore turn the 2 instructions into a nop.

So in the normal case, the body of wmain for us is effectively this
wmain:
 nop

When the event fires and patches the DoSomething function, processors realize that they cannot speculate anymore - and so they do the right thing.

Awesome, isn't it? Polling without actually executing the code to poll. Just sheer beauty - something that totally blew me away when Barry explained how this works.

Notes:

  1. You shouldn't be worrying about optimizations like this unless you are writing an emulator :-)
  2. The code for actually patching the function with the right code is left as an exercise to the reader :-)
  3. This code doesn't deal with issues like memory fences, atomic patching, etc.
  4. Barry explained this to me when we were walking the Strip at Las Vegas. I know what you are thinking :-)