Aargh! Part Eight, plus Boring Metablogging

Q: What's a pirate's second favourite mode of transportation?

A: A caaaaargh! Preferably a Jaguaaaaargh, but an early Oldsmobile Cutlass will do.

Q: Very amusing -- but what's a pirate's favourite mode of transportation?

A: A pirate ship, silly.

Gripe #10: Don't use _alloca

_alloca allocates memory off the stack instead of the heap. It's very convenient to use when you need a temporary buffer and don't want to worry about freeing it, but I try to avoid it whenever possible for two reasons:

(1) it wrecks the compiler's ability to make optimizations based on how much stack a function uses, and

(2) when combined with the previous point, it leads to terrible, terrible crashes. If a program asks the machine for lots and lots of heap, the worst that can happen is that the allocator churns away for a while before finally throwing an exception or returning null -- not too bad, and recoverable. The worst that can happen if the stack is overallocated is that the stack guard page is hit for the second time and the process is terminated. (The exact behaviour of the Win32 stack under low-stack conditions is interesting, and I will blog about it at some point; the script engines do some fancy footwork in this department to keep the stack from blowing.)

Here's the thing -- if you have an upper bound on the amount of stack you're going to use, you might as well just ask for it. If you don't have an upper bound -- if the upper bound is determined by something at runtime and it could be arbitrarily big -- then use the heap, because really big stack allocations are too risky.

For example, I'll often see code where _alloca is used to hold a temporary copy of a string while it is being converted from eight-bit ASCII to UTF-16. Bad idea, particularly if the string can come from an untrusted internet script. The string could be a million bytes long, and blow the stack! Aaargh! It's drivin' me nuts!

The thing is, stack allocation is incredibly fast, and in many situations the common scenario is for the allocation to be small even if that's not guaranteed. Therefore I often use a "best of both worlds" strategy if I have data that is usually small but might be huge. First, I declare a stack buffer twice the size I expect to use. Then I detect whether more than that is required, and allocate it off the heap if it is. (Of course, you have to remember to clean up the heap buffer if you do that.)
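That strategy might look something like this sketch in C++; the function name, the fixed 256-byte buffer, and the memcpy standing in for real conversion work are all illustrative assumptions, not code from the post:

```cpp
#include <cstring>
#include <memory>
#include <string>

// "Best of both worlds": a fixed local buffer for the common small case,
// with a heap fallback for rare big inputs. The unique_ptr frees the heap
// buffer automatically, so there is nothing to remember to clean up.
std::string copy_via_buffer(const char* src)  // hypothetical helper
{
    std::size_t len = std::strlen(src);

    char fixed[256];                     // stack buffer, sized for the common case
    std::unique_ptr<char[]> spill;       // heap fallback, freed automatically
    char* buf = fixed;

    if (len + 1 > sizeof(fixed)) {
        spill.reset(new char[len + 1]);  // rare big input: go to the heap
        buf = spill.get();
    }

    std::memcpy(buf, src, len + 1);      // stand-in for the real conversion work
    return std::string(buf, len);
}
```

Note that the stack buffer is a plain fixed-size array, so the compiler still knows the function's maximum stack usage at compile time -- which is exactly the property _alloca destroys.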

*********************************

You'll note how few new hit television series are called "Extreme Television Editing", or "New Issues In Satellite Uplink Futures Contract Law" -- though come to think of it, either of those would be more likely to catch my interest than the vast majority of the inaptly named "reality" television. Therefore I try to avoid talking about the blog in the blog -- I personally find blogs about blogging boring. I'll keep this short then.

A number of people have asked me questions like "can I quote your blog in my blog?", "can I post your article in my wiki?", "can I use this information in my MSDN Magazine article?" etc.

I'm putting this information on the web because I want it to reach the people who can use it, so by all means, disseminate the information widely. I ask only four things:

1) Please give me a heads-up before you do.
2) Please attribute the source, preferably with a link back to the original.
3) Please don't quote me out of context.
4) Please understand that by doing so, you remove my ability to correct mistakes in your copy.

UPDATE: I have posted a follow-up to these guidelines.

  • How about making a quickalloc/quickfree routine (or macro) such that there's a global buffer preallocated per thread and only one allocation is allowed per thread from it? If the size is bigger than the fixed buffer, it resorts to the heap; otherwise it quickly returns you the pointer.

    Meta-reply: Thanks, I'll try to integrate your LCS code in the DP section of wikibooks. (It's not my own wiki and I have no connection to it, but I thought that would be a great place for such a fine article.)
  • Indeed, that's not a bad idea, though in a multi-threaded situation, getting the pointer to the Thread Local Storage might be slow enough to negate much of the perf advantage.

    Something that the script engines do in the compiler is they have a "no release allocator".

    For example: we know that the parser is going to need to build a parse tree. And we know that the parse tree is going to consist of many small allocations. And we know that the moment the codegen is done, ALL of those allocations can go away at once.

    What we do is we have a "no release allocator" that allocates some big blocks off the heap and then parcels them out.

    The per-allocation cost is much lower than calling the heap allocator. The wasted space per big block is tiny because there are no frees creating holes and there's very little "slop" left at the end, since all the tree nodes are small.

    And freeing is vastly cheaper than doing individual frees for all the nodes. We just free all the big blocks at once. No worries about memory leaks or bad refs, etc.
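    A minimal sketch of that "no release allocator" idea in C++ (the class name, block size, and alignment policy are my own assumptions, not the script engines' actual code):

```cpp
#include <cstddef>
#include <vector>

// Grab big blocks from the heap, parcel them out with a pointer bump,
// and free everything at once in the destructor. Assumes no single
// allocation exceeds kBlockSize, as with small parse-tree nodes.
class NoReleaseAllocator {
public:
    NoReleaseAllocator() = default;
    NoReleaseAllocator(const NoReleaseAllocator&) = delete;
    NoReleaseAllocator& operator=(const NoReleaseAllocator&) = delete;

    void* allocate(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);               // round up to 8-byte alignment
        if (blocks_.empty() || used_ + n > kBlockSize) {
            blocks_.push_back(new char[kBlockSize]); // start a fresh big block
            used_ = 0;
        }
        void* p = blocks_.back() + used_;            // pointer bump: very cheap
        used_ += n;
        return p;
    }

    ~NoReleaseAllocator() {                          // one cheap pass frees it all
        for (char* b : blocks_) delete[] b;
    }

private:
    static constexpr std::size_t kBlockSize = 4096;
    std::vector<char*> blocks_;
    std::size_t used_ = 0;
};
```

    Individual allocations can never be freed early, which is exactly the trade: it only works when, as with a parse tree, everything dies at the same moment.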
  • Also referred to as an "arena allocator".
  • Your favourite library, ATL, used the _alloca technique in ver. 3 for conversion between wide and ANSI strings without having to mess with buffers. There was no end to the number of bugs where people used these conversion macros (yeah, they hid it behind an innocuous-looking macro) inside loops, and suddenly ran out of stack space.

    Fortunately, they fixed that in ver. 7 using exactly the technique you describe in the nicely wrapped conversion classes. I guess the security push had something to do with that...
  • > There was no end to the number of bugs where people used these conversion macros (yeah, they hid it behind an innocuous-looking macro) inside loops, and suddenly ran out of stack space.

    Anybody who writes code that performs such conversions inside tight loops deserves what he gets. The new ATL macros won't crash, but will create hard-to-detect performance bottlenecks when used in this way. Is that better?

    > getting the pointer to the Thread Local Storage might be slow enough to negate much of the perf advantage

    Are the TLS functions so slow? One of my developers wrote code that used a global STL collection protected by a critical section. I had him change it to per-thread collections accessed through TLS (the change made sense in the context of the code's function). Was this a good change in your opinion? Thanks.

    BTW, C++ makes it fairly straightforward to create custom allocators. In his book "Modern C++ Design", Andrei Alexandrescu provides a complete implementation of a very fast small object allocator (chapter 4).
  • In your example you're trading space -- one collection per thread instead of one collection -- for time. Whether it's a performance savings or not, you'd have to measure and see. It would depend a lot on how contended the lock was -- highly contended locks are not fast, obviously.

    I don't think the TLS functions are particularly slow. But the thing about stack allocations is that they often boil down to a single increment of a single register, which is the fastest operation there is. If you're doing a stack-based allocation for performance reasons, doing anything else in there is likely to slow it down by orders of magnitude.

    Of course, that assumes that the allocation is the bottleneck in the first place, which it often is not. On the script engines though, in ASP, small heap allocations ARE the bottleneck in some fairly common scenarios, so we did a lot of work to eliminate some of them.
  • > In your example you're trading space -- one collection per thread instead of one collection -- for time.

    Actually this is not exactly correct. Unless a collection contains very few items, its size is predominantly determined by the number of items it contains. Thus there should not be a significant difference between one collection of 100 items and 10 collections of 10 items.

    > highly contended locks are not fast, obviously.

    The biggest advantage of using TLS over a lock in this context is that the threads become independent of each other. If the lock is highly contended, and if the lock owner takes its time, the time threads spend in a blocked state could become significant.

    But what if the lock is not highly contended and the amount of time spent inside the protected section is short (say equal to inserting/removing/searching for an element in a small/medium STL map)? So that in most cases the overhead is simply that of acquiring and releasing the critical section - how would that compare to TLS?

    Another potential factor is that using multiple collections means that each collection is smaller, and thus faster. However, as I've mentioned above, the collections in this particular case don't contain a lot of elements anyway so that difference should be negligible.

    > If you're doing a stack-based allocation for performance reasons, doing anything else in there is likely to slow it down by orders of magnitude.

    Actually we don't use stack allocations in this code. My question was triggered by your comments about Thread Local Storage.
  • I've never done a head-to-head of TLS vs uncontended critsec, so my answer would have to be "try it and see". I do know that the OS guys have carefully tuned uncontended critsecs to be as fast as possible, but I don't know how fast that is offhand.
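    One way to "try it and see" today, sketched with portable C++ thread_local and std::mutex as stand-ins for Win32 TLS slots and CRITICAL_SECTIONs (the post predates both; every name below is my own assumption, and real numbers depend entirely on the hardware and OS primitives):

```cpp
#include <chrono>
#include <mutex>

// Time many increments of a shared counter behind an uncontended lock,
// versus many increments of a thread-local counter with no lock at all.
int shared_counter = 0;
std::mutex shared_lock;
thread_local int tls_counter = 0;

long long bench_locked(int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        std::lock_guard<std::mutex> g(shared_lock); // uncontended acquire/release
        ++shared_counter;
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

long long bench_tls(int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        ++tls_counter;                              // thread-local, no lock
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

    A single-threaded run measures only the uncontended overhead; to see the contended case you would have to run bench_locked on several threads at once.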