Welcome to MSDN Blogs Sign in | Join | Help

A coworker on the CLR dropped by recently, having hit a brick wall about some problem he was having with some ASM code for x64 that was trying to tail call.  He has already grep’ed through the ABI docs, and had actually hit this blog prior to knocking on my door.  The problem?  How to emit a tail call through a register such that the OS unwinder doesn’t puke.

The quick answer is to use a REX prefix extension on the JMP instruction.  For those of you that know what you’re doing, you can probably stop here.  For more details, read on…

For folks that don’t know what a tail call is, you probably haven’t ever done any functional programming before.  The Intro to CS class I took my freshman year of college was taught by a guy who really liked programming languages, and forced us (against the will of many of the guys in the class) to write a whole bunch of Scheme.  But since none of us knew Scheme, we were learning that language as we went along, and he only taught us the functional subset of scheme.  The biggest problem folks coming from the procedural programming world hit is the lack of a for loop.  Since my Scheme is really rusty, I’ll continue this exposition writing in the “functional subset of C++” :-)

The easiest way to understand the issue is to try to solve a problem.  Let’s write Factorial without using any side-effects:

BigInt Factorial(BigInt v) {
    return (v <= 1) ? 1 : v * Factorial(v - 1);
}

That looks great, right?  Now go try running Factorial(15) and watch the stack explode (I’m not kidding – BigInt is a large object).  The reason the stack explodes is because Factorial doesn’t just return the value of a recursive function call.  It gets the value from the recursive function call, then it does something with it prior to the return, so the stack frame of the caller must be left alone.  You need to be able to eliminate the need to do something with the value of the function you’re recursing on, if you want to allow the stack frame to be tossed (actually, just reused):

BigInt Factorial_TailRecurse(BigInt v, BigInt current) {
    return (v <= 1) ? current : Factorial_TailRecurse(v - 1, v * current);
}

BigInt Factorial(BigInt v) {
    return Factorial_TailRecurse(v, 1);
}

Voila – we no longer need to do anything with the return result of Factorial_TailRecurse other than just return it.  What does that code wind up looking like, once it’s pumped through a relatively intelligent compiler back end?  Almost exactly like the more traditional C++ for loop, actually.

So, now you have a perfect understanding of what tail recursion is.  In my contrived example, however, there’s no need to do anything fancy to express that the end of your function is a tail-recursive call.  Imagine, however, that the function you’re calling from has a very large stack frame, and you’re calling doing something goofy, like writing a recursive parser.  Since you don’t want your parser to be limited to the depth of the stack, you’re making sure that you’re always writing tail-recursive code.  But what if you’re trying to call a function pointer recursively?  This is where you get stuck without REX JMP.

The OS unwinder can statically look at the Factorial_TailRecurse(v-1, v*current) function call, and recognized that the offset being jumped to is the head of this function, and therefore knows that it’s actually part of a function epilog.  If, however, instead of jmp [IP-42], it saw jmp [RAX], it wouldn’t know where this thing was actually headed.  In fact, it might be a switch statement.  So, to properly flag the tail call as being part of an epilog, you throw a REX prefix on first, and now the unwinder groks the epilog, and you’re done.

-Kev

My wife just asked "What's an 'after-sales engineer'?"  I parsed the terms, and came up with "Tech Support Dude".  As a teenager, I was a Foliage Architect (I mowed lawns), System Builder (computer assembly slave), Network Administrator (very poorly paid, local 'after sales engineer') and an Information Delivery Engineer (paper boy).

I just watched the highly lame “You wouldn’t steal a car…Downloading pirated movies is stealing” ‘trailer’. How’s this for irony: I watched this trailer because it was the SECOND thing to play on the DVD of Die Hard 4 that I just purchased. The FIRST thing to play on this DVD was a menu that asked me what language I would like: English, Mandarin, Russian, and about 8 other languages that I had never heard of, or couldn’t read. The DVD cost me 7RMB – about $1 – from the guy who has a 1m by 3m (about 3’ x 9’) booth selling DVD’s that look almost legit (minus the flat plastic packaging). But this is the just ironic frosting on a very annoying DRM cake. What’s the DRM cake I’m talking about? Let’s start from the beginning. {I’m writing this while watching the movie, so I’ll include a little commentary for your entertainment}

I’m living in Shanghai for a while. It’s very fun. Great city, interesting people, very kid tolerant, very cool place. Check my personal blog for details. {DH4 commentary: How come the guy who is “a Mac” is using a system that’s obviously not Mac OS X? Isn’t he too cool for that?}

We took a trip to Beijing last week. It was a blast. We stayed at the Grand Hyatt. It was a very nice place (I won’t mention that the reason we stayed there is because when my wife was telling me how much it cost, I thought she was talking in RMB, not US$). Being a fine upstanding establishment, the Hyatt purchases legal, licensed electronic hardware for their rooms. {DH4 commentary: So hackers don’t notice that there’s a big blob of C4 installed in their computer?} Our room had a Panasonic DVD player. Being the gadget freak that I am, we had brought along a batch of DVD’s and a laptop, thinking we could watch movies on the laptop (because hotels in the US charge you $8 to watch 1 of the 13 bad movies they offer on Pay-per-view. {DH4 commentary: Hackers all play with “dolls” – real men shoot guns}

Anyway, after a busy day of wandering around The Forbidden City, the mom & dad needed a break, so we decided to pop in a movie for the kids to watch. So we popped in our 100% legit copy of “Cars” (because my 2-year-old likes “Lighting DeQueen”). I saw 5 Chinese characters that, roughly translated said “The movie industry is far more interested in pointlessly trying to squeeze dollars from good customers than allowing those customers to use the product they purchased”. My movie was Region 1. {DH4 commentary: Where is the army of hackers required to simultaneously take down every major network in the country? I don’t see this group represented anywhere.} And I’m living in Region 6. So I have a few alternatives: I can buy a legitimate Region 6 copy of the movie, I can buy pirated DVD’s, or I can download the movie. I leave it as an exercise for the reader to decide which one I prefer. {DH4 commentary: Hackers are only slightly squeamish when it comes to killing innocent civilians.} Here are a few hints: I’ve been here 5 weeks, and I still don’t know where to find legal DVD’s. There are 4 sellers of pirated DVD’s within 3 minutes walking distance of my apartment. My network connection is MUCH slower than it is in the states. {DH4 commentary: I believe destroying a helicopter with a car is more plausible than being able to make all the Anthrax alarms go off in every federal building in the capital simultaneously}

Honestly, if I’m in law enforcement, I believe I’d prefer people just download their movies, because that doesn’t pump money into any pirate cartels – there’s no illegal manufacturing or illegal distribution channels required, just some schmuck with HandBrake or DVDShrink, and a network connection. If I’m in the movie industry, I want people to re-purchase their movies, I guess. But I’m not in either. {DH4 commentary: Kevin Smith stretches his talents as a thespian to play a Star Wars obsesses nerd. Honestly, his cameo has been the best part of the movie so far.} I’m a generally legit guy with a belief that while the customer may not always be right, if what I expect my customer to do causes them to spend both more money & more time, I probably won’t have that customer for very long. {DH4 commentary: If someone who has a backup generator and a wall of video screens isn’t able to prevent some schmuck from pwning his webcam, we’re all screwed…}

I’m currently looking at Hulu and Comedy Central right now:  Between the two of them, they have almost enough to let me quit shelling out $80/month to ComCast for the privilege of watching 6 channels (3 of which are in HD). Only 2 channels left (curse you, Noggin & PBS Kids). Of course, my dependency on those two channels will be decreased to zero with a few years, but I’d rather just give my $80/month directly to ABC, NBC, Comedy Central, Noggin, and PBS, than continue to give it to ComCast. I wonder if they’ll take donations. (Yes, I know PBS does J)

{D4 commentary: When someone as big at Bruce Willis throws someone as small as the Evil Asian Kung Fu Hacker Woman through a bookshelf, the small EAKFHW does NOT stand back up for a very long time. Of course, Bruce Willis has already fallen, been beaten down, and generally been pummeled, so perhaps I’m just struggling with my “suspension of disbelief”. }

Well, my whining about the state of DRM video is about done, but this movie deserves a few more comments:

Does no one realize that systems like the natural gas pipeline have a PHYSICAL failsafe to prevent “all the gas from being routed down one pipe”? When you “have a gun to your head” you don’t spend time writing a UI for your new “encryption algorithm”. And, having been involved in software for, holy crap, almost 15 years now, even if you buy the basic premise of the movie (“A Fire Sale”), the level of complexity involved in getting that many different systems all controlled by a single central, accessible UI (they can get to a camera in a Rutger’s elevator in a few seconds – sounds like an amazingly good UI/navigation/control system to me) would take an army of people many many years to pull off. And while it’s all being designed, the systems trying to be pwned are constantly changing.

But I did learn one valuable lesson from this movie: nerdish folk who can write code are so disconnected form humanity that they must be forced to see the consequence of their theoretical actions by heroic types before they’ll begin to feel empathy for their fellow humans.  It's true.  Until I saw The Net, I was honestly dedicating my life to destroying civilization for fun.

If I had stolen a car that had this movie sitting on the passenger seat, I'd return the car to the rightful owner, and turn myself into the authorities.

The spam bots have been causing me all sorts of trouble - I had over 1000 messages informing me that some spam has been deleted over the weekend.  I'm going to kill comments, and hope that in a few months, (after I get a chance to write up something of value), the spam bots will have stopped attempting to comment on my blog.  In the mean time, if you want to ask me a question, my email address is the title of my blog [freik] except my first initial comes _before_ my last name] at m1cr0s0ft.c0m, except everything is spelled properly, with i's and o's instead of 0's and 1's :-).  I always try to reply to reasonable questions [note:  asking me to troubleshoot software from a competitor and getting upset or insulting when I'm unable to help is _not_ reasonable], but I'll be operating from a very different timezone for a few months, so turn around may be different from what you would expect.

-Kev

<RETRACTION>

Crap.  After getting a couple of comments saying they tried the same thing, and it didn't work for them, I went back and tried it myself.  And it turns out it doesn't work.  I think I had tried it with some media that I had already transcoded to WMV without realizing it.  The reason it doesn't work is actually quite simple: The managed assemblies aren't pure IL - they're IJW images (lame-speak for "they contain some native code") so they're hard-tied to the platform they're shipped on.  I also tried copying the 32 bit version and running them, to no avail.  Ugh.  Perhaps this will inspire folks to port some of their codecs?  Heck, all I really want is FFDShow for x64, and I'd be happy...

UPDATE:  FFDSHOW is available for x64 :-)  Still waiting for CoreAVC for x64... Scratch that:  I really do prefer VC-1 and WMV.  Transcoding takes a long time (at high quality) but the results are great...

</RETRACTION> 

 Let me paint a scenario for you:

  • Because the DVD was just sitting there in your Vista Ultimate box, you installed x64 Vista instead of x86 Vista.
  • One of the only reasons you installed Vista to begin with is because the Vista Media Center product is much better than the old XP based Media Center.
  • You'll even tolerate the fact that the XBox Media Center Extender doesn't work with Vista (Options: buy an XBox 360, or have an XP MC running somewhere in the house [cuz XP MCE doesn't run on Virtual PC, even just to serve to an extender] - grrrr....)
  • You have a number of home movies taken before you paid attention to the fact that WMV is a great container & codec, so there's 3 years worth of home movies in DivX form.
  • You open up Vista Media Center and start watching recent home movies - they look great!
  • 2 weeks later, your wife opens up Vista Media Center, and calls you at work to tell you that all the home movies of your oldest daughter don't play!
  • You go home, launch each of the 70+ home movies from the command line (cuz you're a command line junkie) and they play just fine
  • You assume your wife must have had network problems or something, so you tell her they seem fine
  • 3 days later, your wife calls you at work, tells you that she CHECKED the network (while she is no longer an employee of MSFT, and has spent the past decade interacting with children more than lead developers (which have a large amount in common), she is astonishingly technically savvy because she is forced to endure your own techno-centricity, bless her soul), and that you're either smoking crack, lying, or magic, because she still can't watch any videos of your oldest daughter from the media center.

Here's the problem:  That cool 10 foot media center user interface?  C#.  UI code in general, because of the whole 'when am I actually done with this resource' aspect of UI code, seems to like garbage collection.  I don't know if they use WPF (which looks pretty impressive - I've been messing with it a little, lately), but it's 99% C# code.  C# code, by default, is processor agnostic - binaries produced by the C# compiler are flagged to run on ANY cpu.  You can take identical binaries and run them on AMD64, x86, and yes, even IA64.  And they work fine, in all 3 situations.  However, if this managed code tries to use any native binaries on the machine (say, some DirectShow filters, for example), it can only use native binaries for the CPU type it is running on.  Which means Media Center is running as a 64 bit app, and can only access the 64 bit codecs available on the machine.  No ffdshow, no XviD, no DivX, no, not even any IV32.  No problem, right?  That's what corflags is for!  "corflags.exe <assembly> /32bit+" - bzzzt - wrong answer.  Windows Media Center, as part of the OS installation, is covered by a whole host of things to prevent people from screwing it up.  In summary:  You can't flag the Media Center binaries as 32bit because they won't even load (curse you, code signing and system file protection!).

There is a way to make it work, but it has the unfortunate side effect of making ALL managed code run as 32 bit.  ldr64 is a utility that is included as part of the 64 bit CLR that does a single scary, system wide thing: it makes the managed loader think it's running on a 32 bit system, even when it's not.  That's right:  ldr64 setwow makes your entire computer think, for managed code, it's a 32 bit system.  And it works pretty well.  Don't do this on your SQL Server or anything, but my Media Center likes it much better when ffdshow is available to be used to show movies of my daughter back before she was losing molars, reading Harry Potter, having sleep overs, and generally being independent enough to make me feel far older than I actually am.

But wait, you say, weren't you smoking crack, lying, or perform some magic that made it work when you tried those media files from the command line?  Nope - I was using WIndows Media Player, which, even on Vista x64, is a 32 bit native application.  So it was loading 32 bit ffdshow and happily displaying my home movies to my heart's content.

One other minor note:  I do find it slightly disconcerting that there's a flag to say IL is 32 bit only, but nothing to indicate that it's 64 bit only.  There are plenty of things you can do in managed code (I've been told even verifiable managed code) that can force your code to be 64 bit only, or 32 bit only.  The fact that you can only tie it one direction seems slightly sad to me.  Oh well - unless you need the address space, the 64 bit managed platform isn't a really compelling platform, currently (okay, the extra registers are nice, but managed code uses pointers like they're char's, so the working set of most managed apps is tremendously higher when running as a 64 bit app, than as a 32 bit one)

-Kev

I've recently been trolling the web for any sort of language-comparision benchmarks, to see how the CLR's JIT stacks up to the competition.  Dr. Dobbs has what seemed to be a pretty reasonable micro-benchmark article.  It's not particularly insightful, but hey, it is hard to come up with stuff that makes sense to compare across multiple disparate languages.  I started looking for specific C# to Java comparisons and came upon this little gem.  I'll give you a few minutes to read it.

Now that you've read it, I would like to challenge you to understand how anyone would ever believe the question of "How long does it take to do nothing 1 billion times" in any language is worth comparing.  Seriously, ignore the thoughts regarding whether the comparison itself was fair or not.  The best part of that article is that the Java JIT they seem to be pushing doesn't get rid of the empty loops in a newer version.  Then look at their 'testing methodology' (or the lack thereof).  Not only are they not actually benchmarking anything that anyone with a brain in their head would care about, but they didn't even bother to figure out how use the tools that they claim to be comparing against.  Does anyone that actually uses C++* think that if they care about performance they're not going to enable the optimizer?  Every reasonable C++ compiler on the planet will completely eliminate the loops if only told to optimize.  And then the comparison against Assembly stuck in my craw.  Assembly is there because you KNOW WHAT YOU'RE DOING!!!!  If an assembly programmer writes a loop that iterates for a billion times, it's probably because they're trying to measure some particular pattern of code on the CPU they're running.

So (still snickering at the previous stuff), I guess I'm asking for input:  Does anyone have any non-micro-benchmark relatively objective managed code comparison links they'd recommend?  I'd really like to see how the CLR's JIT stacks up against some of the JVM's out there, for more than just little tiny toy code samples.

I'm also interested in 'end-to-end' timing, which includes JIT time.  Microbenchmarks with internal clocks don't tell you how long something actually ran for.  If a JIT takes 5 minutes to figure out it can remove 2 dead loops, your runtime isn't less than 1 second - it's less than 5 minutes and 1 second.

-Kev

* I found 4 different syntax errors in the C++ source code they provided...

 

I've recently been deluged by a pile of new electronic gadgets* so that I haven't had time to do much coding.  So I was looking around yesterday and found that there's (finally) a new version of Boost available.  I was actually underwhelmed by the lack of interesting new libraries that have been added (there have been a number of interesting libraries approved, but the only one of the group that have been approved that I was anxious to use trivially was Xpressive - I might rip out "for each" from my source code to try BOOST_FOREACH instead, too).  That aside, I immediately discovered that they removed support for BJam V1, and require BJam V2.  It took me another 2 hours of digging to figure out how to build the 64 bit libraries with the new build system.  Here's how you do it:

bjam toolset=msvc address-model=64 stage

And, voila - amd64 builds.  For the discerning multi-proc engineer (most of us are now, right?) try adding a -j%NUMBER_OF_PROCESSORS% to that command line to use every CPU core your machine has [it builds quite quickly on my Core 2 Quad, and my 2x2 Opteron [faster on my Opteron, actually - the C++ compiler eats data cache for breakfast - I think the extra pipe to memory really helps].

And now, to explain the * above:

Over the past two or three months, I've splurged a lot on myself and/or my wife:

  • Sony HD-SR1 camcorder - Good video quality - shockingly better than my old JVC GZ-27, and significantly better than any of the MiniDV camcorders I've owned. AVCHD is less than a great post-processing experience, right now, though
  • Apple iPod Shuffle - Fabulous form factor, super-crappy software - one of the devs on my team said that I must be using it in a way that Steve Jobs doesn't, because that's the only way Apple tests their products.  I would have waited to get a Creative Labs Stone for half the price, but it was a gift for my wife for her birthday, and the Stone wasn't yet available.  Seriously, why do I need a $#@@ driver to use a %#@# flash device?  To be fair with my complaints, the Zune does the same thing.  Lame lame lame lame.  Everyone else on the planet calls 'em USB storage devices, what is wrong with Apple and Microsoft?
  • Roku SoundBridge M500 - Great size, great audio quality, but it took forever for me to figure out that Windows Media Connect doesn't like to share audio files that are already on a network share.  Once I migrated my music onto my media center, it actually navigated my 150+ Gigs of audio remarkably well, and handles lossless WMA like a champ!
  • Bose Companion 2 speakers - Fabulous audio quality, nice looking desing - attached to the SoundBridge in our master bathroom - All my wife needs now is a hot plate and she could live in the master bathroom
  • Final not-so-electronic not-so-gadget:  Ellsworth Epiphany mountain bike, built up with most SRAM X.9 components.  This thing climbs better than my 1997 GT aluminum hard tail.  It's much bigger than my GT, but weighs about the same, and I have no idea what I'll do with those 27 gears.  I think I only ever used 19 of the 21 available on my GT.  I picked it up on Monday night at REI [wouldn't have bought it except for their nice little test track demonstrated that it really does have no weaknesses], I've ridden it to work 2 days [in the rain, but I really didn't care], and I'll be riding it this weekend (probably in the mud) Saturday and Sunday morning.  St. Edwards park, and perhaps I'll go visit the Tape Worm in Renton.

Disclaimer:  I've never actually written an unwind personality routine, so take what's here with a grain of salt.

A few days back, I spent 30 minutes defending the C++ runtime's exception handling personality routine to the guy that has the less than enviable job of supporting the .Net Framework's exception handling personality routine.  The eventual point to be learned is that the documentation for RtlUnwindEx is abysmal, and can cause the VC 8.0 C++ runtime to leak objects left and right.

The problem revolves around the C++ team's decision (which I fully support) to cause code compiled with /EHs or /EHsc flags to only invoke destructors when a C++ exception occurs.  This means that if you take a hardware exception, call RaiseException yourself, or call longjmp none of your C++ destructors between the point of exception and the handler will be invoked. I've talked about why this is a good idea, but I'll sum it up with this:  If that scenario seems bad to you, then you should be compiling your code with /EHa.

Anyway, back to the issue:  Say I'm writing a 'personality routine' which will be inovked by the OS when an exception occurs.  If you're writing the kind that requires that your handler initiates the second phase of unwind, you need to call RtlUnwindEx.  The documentation, which is arguably bad (it's little more that a function declaration), indicates that the EXCEPTION_RECORD structure to pass in is optional.  And that's true:  the functions on the stack will still be unwound.  But the personality routines for those functions might want to actually know that they're being invoked due to an exception, and not for something like a longjmp.  So, if you have an exception record that could be passed to the RtlUnwindEx function, you should.  If you don't, another personality routine might assume that there's not an exception to examine, and might decide to do something like NOT invoke the destructors for the object on the stack.

BTW - this bug exists only on x86, and only when you're crossing a managed code -> native code boundary, in particularly weird scenarios.  And I hope it's getting fixed :-)

-Kev

I had a brief e-mail exchange with one of the devs on the optimizer team about a checkin he put up for review.  He modified the compiler so that it only aligns the stack for functions that call other functions - that's the typical definition in compiler lingo of a 'leaf function'.  My first response was "don't do that - you may still have to align for reasons A, B, & C".  To which he responded with a quote from the ABI doc that explicitly says you only have to align the stack if you're calling a function.  So I started reading.  Turns out, he's right (and so is the doc), but there are some really nasty gotchas involved.  When we initially did this stuff, we just lived with the scenario where you might be wasting a bit more stack space...

The theme of the problems revolve around restrictions on encoding information in the unwind data. If you are saving XMM registers and have an unaligned stack pointer, you must use the SAVE_XMM128_FAR opcode because the SAVE_XMM128 opcode will only accept offsets in multiples of 16 bytes.  Or I guess you could use MOVUPS instead of MOVAPS to save & restore your XMM register, but I wouldn't recommend that on current hardware. Similarly, you have to use the ALLOC_LARGE descriptor for stack allocation of sizes that aren’t [n * 8 + 8]. ALLOC_LARGE actually has 2 variants - one that multiples by 8, and the other than doesn't.  You can use the latter to allocate random amounts of data from the stack.  But then if you want to use a frame pointer, you're going to be in a pretty weird spot, because it will have to be unaligned, as well. Unwind data dictates that you can only have a frame pointer that = RSP + 16 * [1-15].

I'll probably come back to this article & add some nice hyperlinks to ABI details in the doc itself, but it's been a while since I blogged anything useful, so I figured I'd just get this out there quickly.

I recently switched from the lead of the Code Gen & Tools team for Visual C++ to the lead for the Code Gen team for the .Net Framework.  I've not used managed code much, but for some classes of code, it sure does look easier to use [I'm working on a goofy GUI app for my personal use in C# that took me forever to do in C++].  There are lots of interesting problems to solve in this space, and a dramatically different set of customer constraints & potential customer scenarios.  I have to admit the security concerns for a JIT in the ASP.NET world are a little bit unnerving...

In far more interesting news, VS 2005 SP1 just released.  OK, it actually released a few days ago, but my power didn't come back on from our nasty storm Thursday night until Monday, so I was a bit preoccupied.  I highly recommend picking this up if you're using the AMD64 compiler.  We shipped an awful lot of compiler fixes, along with a few code quality improvements, in this release.  You'll notice that prologues and epilogues of complex functions look much different, and that the compiler throughput (the amount of time it takes to compile a function) is a bit better.  There are some severe bugs in the PGO scenario {Profile Guided Optimization} that anyone using this feature really needs.  Aside from that, I hear that the VS experience is just much more pleasant than VS 2005 RTM.  Go forth and Download!

Anyway, I'll try to get some more interesting blogs posts going in the future - work seems to have calmed down quite a bit, since Vista finally shipped.

I'll be hanging out with the cool kids at the Northwest C++ Users Group tomorrow night.  If you're in the area, and want to heckle me, swing by.  We'll be in building 40 at 6:30 PM.  My talk starts at 7:00 PM.  I'm talking about the actual runtime cost of exception handling for x64 and x86 on Windows.

Update:

Check out http://www.nwcpp.org/Meetings/2006/10.html for the slides, and a video of the talk, if you're interested.  You can see my thinning hair in back, my marvelous posture, and my always entertaining speaking style.

Imagine this very lame code:

int main() {}

void BugAsm() 
{
   __asm {
        MOV     [ESP+12],OFFSET main
   }
}

void(*BugAsmPt)()=&BugAsm; // this is just to make sure the function is not removed by /OPT:REF

Now imagine your significantly less lame code doing something similar.
Now imagine that the compiler crashes with a bizzare message about "x86\code.c" something or other.

Well, you can fix this problem by changing the assembly code to this:

mov DWORD PTR[esp+12], OFFSET main

Another unfixed bug, worked around :-|

-Kev

{Disclaimer - I started with a bunch of code from boost::array - it's a great implementation of the functionality it provides}

While my day job doesn't generally allow me to goof around with advanced C++ features, I still managed to find time to do so after the kids are in bed.  Mark & I were talking about  a new feature in C99: variable length arrays.  They're incredibly useful for certain scenarios, and will, by most compilers I imagine, just get turned into an _alloca call.  This led to a discussion of the value of variable sized arrays, and to categorize them.  Here's the final result:

Type 1: Fixed size arrays:  Easy to do in C, boost::array<T, N> does it for C++ with all the bells & whistles of an STL container [tr1 has something awfully similar]
Type 2: Variable length arrays:    Hard to do in C, std::vector<T> does it for C++
Type 3: Variable sized arrays at construction time:  Easy to do in C (malloc or _alloca always produce these), doesn't exist currently in either boost or STL

When we started comparing these things, we came to the conclusion that C99 style variable length arrays would be very useful.  Using __forceinline and _alloca, you can produce an implementation that works for any scenario that stays in a single function scope.  As soon as you start working across scopes, or try to implement something as a global, it falls to pieces.  It also fails miserably if the compiler doesn't actually inline forceinline functions [which is the case for -Od, and anything with -Ob0].  And I don't have a very reliable way of detecting this scenario :-(

Ignoring the weaknesses, let's assume that the programmer is very intelligent and all-knowing, and will never accidentally abuse these routines.  Here's an initial implementation (just the part that matters) that supports stack-based allocation of both Type 1 and Type 3 arrays:

template<typename T, std::size_t N = 0>
class array : public impl::_common_array<T, array<T,N> > {
protected:
    friend impl::_common_array<T, array<T,N> >;
    T elems[N];    // fixed-size array of elements of type T

public:
    array(std::size_t n = N) { BOOST_ASSERT(n == N); }
};
 
template<typename T>
class array<T, 0> : public impl::_common_array<T, array<T,0> > {
protected:
    friend impl::_common_array<T, array<T,0> >;
    T *elems;
public:
    __forceinline array(std::size_t s) : sz(s), elems((T*)_alloca(sizeof(T)*s)
    {
        for (size_t i = 0; i < s; i++)
            T *temp = new(&elems[i]) T;
    }
    ~array()
    {
        for (size_t i = 0; i < sz; i++)
            elems[i].~T();          
    }
};

But I don't like allowing a programmer to shoot themselves in the foot accidentally (on purpose is fine :-)).  First, I added some code so that we don't try stack-allocation for DEBUG code, since the inliner is rarely enabled for that scenario.  Next I promoted a warning to an error so that if the routine ever doesn't get inlined, you'll fail.  Finally, I declared a private operator new, so that you can't construct one of these things on the heap and really screw up the world:

template<class T>
class array<T, 0> : public impl::_common_array<T, array<T,0> > {
private:
    void *operator new(std::size_t sz);
protected:
    friend impl::_common_array<T, array<T,0> >;
    T *elems;
public:
#if defined(_DEBUG) || defined(DEBUG)
 
#pragma message("Using lame implementation")
 
    array(std::size_t sz) : sz(sz), elems(new T[sz])
    {}
    ~array() {delete[] elems;}
 
#elif !defined(NODEBUG) && !defined(NDEBUG) && !defined(_NODEBUG)
 
#pragma message("Be sure that your optimized build is NOT using -Ob0 - this completely disables inlining")
#pragma error("Please define DEBUG or _DEBUG for your debug build, and NODEBUG, NDEBUG or _NODEBUG for your optimizing build, otherwise, this code won't work!")
 
#else
 
#pragma message("Using good implementation")
#pragma warning(error:4714)
#pragma optimize("b1", on)
#pragma auto_inline(on)
 
    // If this thing doesn't get inlined, the code won't work - you'll need to use the previous blob of code instead
    __forceinline array(std::size_t sz) : sz(sz), elems((T*)_alloca(sizeof(T)*sz))
    {
        for (size_t i = 0; i < sz; i++)
        {
            T* temp = new(&elems[i]) T;
        }
    }
    ~array()
    {
        for (size_t i = 0; i < sz; i++)
            elems[i].~T();          
    }
};

But I'd really like to allow these things to be a bit more general purpose!  Fact is, a tricky optimizer could turn a malloc/free into an alloca, and a fixed-size alloca into a zero-cost part of the stack frame adjustment.  But getting the compiler to decide that a variable sized malloc should be turned into an alloca seems error-prone.  What if I want an array of 4Mb?
There goes my stack - poof - in one fell swoop.

So, now, I'm headed to some sort of policy based memory allocation where the programmer could indicate that stack allocation should be used.

It's still unsafe, but not automatically so.  The programmer is the one who should understand the implications of the malloc/free -> alloca conversion.  If there is no allocated memory, we'll just new & delete the memory manually.  So, instead of auto-allocating memory on the stack, we'll allow the programmer to allocate memory that they wish to keep track of manually, or, if no memory is provided, we'll allocate it from the heap, and free it when the object is destroyed.

namespace kbf
{
    namespace impl
    {
        template <typename T, class CHILD>
        class _common_array
        {
        protected:
            __forceinline CHILD *downcast() { return (CHILD *)this; }
            __forceinline const CHILD *downcast() const { return (const CHILD *)this; }
        public:
            // type definitions
            typedef T              value_type;
            typedef T*             iterator;
            typedef const T*       const_iterator;
            typedef T&             reference;
            typedef const T&       const_reference;
            typedef std::size_t    size_type;
            typedef std::ptrdiff_t difference_type;
            typedef std::reverse_iterator<iterator> reverse_iterator;
            typedef std::reverse_iterator<const_iterator> const_reverse_iterator;

            // iterator support
            iterator                begin()        { return downcast()->elems; }
            const_iterator          begin()  const { return downcast()->elems; }
            iterator                end()          { return downcast()->elems+downcast()->size(); }
            const_iterator          end()    const { return downcast()->elems+downcast()->size(); }
            reverse_iterator        rbegin()       { return reverse_iterator(downcast()->end()); }
            const_reverse_iterator  rbegin() const { return const_reverse_iterator(downcast()->end()); }
            reverse_iterator        rend()         { return reverse_iterator(begin()); }
            const_reverse_iterator  rend()   const { return const_reverse_iterator(begin()); }

            // at() with range check
            reference       at(size_type i)       { rangecheck(i); return downcast()->elems[i]; }
            const_reference at(size_type i) const { rangecheck(i); return downcast()->elems[i]; }

            // front() and back()
            reference       front()       { return c_array()[0]; }
            const_reference front() const { return data()[0]; }
            reference       back()        { return c_array()[downcast()->size()-1]; }
            const_reference back()  const { return data()[downcast()->size()-1]; }

            // direct access to data (read-only)
            const T* data() const { return downcast()->elems; }

            // use array as C array (direct read/write access to data)
            T* c_array() { return downcast()->elems; }

            // assign one value to all elements
            void assign (const T& value) { std::fill_n(begin(),downcast()->size(),value); }

            // check range (may be private because it is static)
            void rangecheck (size_type i) const {
                if (i >= downcast()->size()) {
                    throw std::range_error("array<>: index out of range");
                }
            }

            // operator[]
            reference operator[](size_type i)
            {
                BOOST_ASSERT( i < downcast()->size() && "out of range" );
                return c_array()[i];
            }

            const_reference operator[](size_type i) const
            {    
                BOOST_ASSERT( i < downcast()->size() && "out of range" );
                return data()[i];
            }
        };
    }

    template<typename T, std::size_t N = 0>
    class array : public impl::_common_array<T, array<T,N> > {
    protected:
        friend impl::_common_array<T, array<T,N> >;
        T elems[N];    // fixed-size array of elements of type T
    public:
        array(std::size_t n = N) { BOOST_ASSERT(n == N); }
        // size is constant
        static size_type size() { return N; }
        static bool empty() { return false; }
        static size_type max_size() { return N; }

        // swap (note: linear complexity)
        void swap (array<T,N>& y)
        {
            std::swap_ranges(begin(),end(),y.begin());
        }

        // assignment with type conversion
        template <typename T2>
        array<T,N>& operator= (const array<T2,N>& rhs)
        {
            std::copy(rhs.begin(),rhs.end(), begin());
            return *this;
        }

    };

    template<typename T>
    class array<T, 0> : public impl::_common_array<T, array<T,0> > {
        // TODO:  This needs to be noncopyable, probably, or deep-copyable...
    private:
        void *operator new(size_t);
    protected:
        friend impl::_common_array<T, array<T,0> >;

        std::size_t sz:sizeof(std::size_t) * 8 - 1;
        unsigned int releaseMem : 1;
        T *elems;

    public:
        array(std::size_t s, void *memory = NULL) : sz(s), releaseMem(!memory), elems((T*)memory)
        {
            if (releaseMem)
            {
                memory = malloc(s * sizeof(T));
                elems = (T*)memory;
            }
            for (size_t i = 0; i < s; i++)
                T *temp = new(&elems[i]) T;
        }
        array(std::size_t s, const T &copyVal, void *memory = NULL) : sz(s), releaseMem(!memory), elems((T*)memory)
        {
            if (releaseMem)
            {
                memory = malloc(s * sizeof(T));
                elems = (T*)memory;
            }
            for (size_t i = 0; i < s; i++)
                T *temp = new(&elems[i]) T(copyVal);
        }
        ~array()
        {
            for (size_t i = 0; i < sz; i++)
                elems[i].~T();          
            if (releaseMem)
                free(elems);
        }

        // size is constant
        size_type size() const { return sz; }
        bool empty() const { return sz != 0; }
        size_type max_size() const { return sz; }

        // swap (note: linear complexity)
        void swap (array<T,0>& y)
        {
            // error check
            std::swap_ranges(begin(),end(),y.begin());
        }

        // assignment with type conversion
        template <typename T2>
        array<T,0>& operator= (const array<T2,0>& rhs)
        {
            BOOST_ASSERT(rhs.size() == this->size());
            std::copy(rhs.begin(),rhs.end(), begin());
            return *this;
        }

    };

    // comparisons
    template<typename T, std::size_t N>
    bool operator== (const array<T,N>& x, const array<T,N>& y) {
        return std::equal(x.begin(), x.end(), y.begin());
    }
    template<typename T, std::size_t N>
    bool operator< (const array<T,N>& x, const array<T,N>& y) {
        return std::lexicographical_compare(x.begin(),x.end(),y.begin(),y.end());
    }
    template<typename T, std::size_t N>
    bool operator!= (const array<T,N>& x, const array<T,N>& y) {
        return !(x==y);
    }
    template<typename T, std::size_t N>
    bool operator> (const array<T,N>& x, const array<T,N>& y) {
        return y<x;
    }
    template<typename T, std::size_t N>
    bool operator<= (const array<T,N>& x, const array<T,N>& y) {
        return !(y<x);
    }
    template<typename T, std::size_t N>
    bool operator>= (const array<T,N>& x, const array<T,N>& y) {
        return !(x<y);
    }

    // global swap()
    template<typename T, std::size_t N>
    inline void swap (array<T,N>& x, array<T,N>& y) {
        x.swap(y);
    }

} /* namespace kbf */

This is where I am currently. 
This seems incredibly useful, and still very performant.
What I'd really like is to be able to eliminate vectors from most of my code, and use these things unless I really do need growable vectors...

I'm working on writing something up about how to use chained unwind info next, hopefully that won't be quite so long in coming as this post...

-Kev

If you're porting your application to x64, and you use much in the way of __asm in your x86 code, you're likely to start looking at ml64 - the 64 bit version of Masm.  The reason you're likely to do this is that the x64 compiler doesn't support __asm blocks in C code.  So you can either use the compiler intrinsic functions [and there are a lot of 'em, and they're all documented relatively well], or you have to use ml64.  For folks that aren't new to Masm, you're also likely to try to use some of the nifty little time-saver features of Masm to automatically generate prologue's and epilogues for your functions.  For ML64, DO NOT DO THIS!
Here's an example:
.DATA
.CODE
testfunc PROC uses rbx
xor rbx, rbx
ret
testfunc ENDP
END
There are 2 problems with the code that ML64 generates.  #1:  There is no .xdata of any sort.  No unwind directives are emitted, despite the fact that you're allocating stack, and saving a nonvolatile register.  #2:  The epilogue is invalid - it uses the 'leave' instruction, which is ineffecient, but also just plain illegal in an x64 function epilogue.  The stack unwind routines will not properly recognize the instruction sequence as an epilogue, so the debugger and the EH routines will all fail [the EH routines will just terminate your process if an exception occurs while testfunc is on the stack].

BTW - Here's the link to the customer bug.  It came in about 3 months too soon for serious consideration for VC8, so we punted it to the next version, but now it's pushed out until we get some more resources for investment in MASM.

My role for the past 6 months (and the next several) I'm charged with effectively "going down with the ship" WRT the current generation of code gen & tools for Visual C++.  This includes the linker, Masm, ml64, DIA, c2, pdb, and a few other miscellaneous components.  The entire ball of wax is to be replaced with components from a product with the much overused codename of Phoenix over the next few product cycles.  I have a skeleton crew [trust me on this one] to do the work, and I get to manage them.  As the lead of the bunch, I wind up making many less than pleasant calls of 'won't fix'ing a bugs that, in a previous product cycle, would have been fixed in a heart-beat.  I'm going to start blogging about each & every customer bug that I mark as "Won't Fix" simply to keep a public list of what not to do until the toolset is replaced.
I'm not sure how acceptable this is, but unless my management smacks me, I'm going to keep doing this for the next many months.  The next entry is the first of many...
More Posts Next page »
 
Page view tracker