<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Developing for Developers : Data structures</title><link>http://blogs.msdn.com/devdev/archive/tags/Data+structures/default.aspx</link><description>Tags: Data structures</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Cache-oblivious data structures</title><link>http://blogs.msdn.com/devdev/archive/2007/06/12/cache-oblivious-data-structures.aspx</link><pubDate>Tue, 12 Jun 2007 23:54:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:3256830</guid><dc:creator>dcoetzee</dc:creator><slash:comments>10</slash:comments><comments>http://blogs.msdn.com/devdev/comments/3256830.aspx</comments><wfw:commentRss>http://blogs.msdn.com/devdev/commentrss.aspx?PostID=3256830</wfw:commentRss><description>&lt;P&gt;In most data structure and algorithms classes, the model used for basic analysis is the traditional RAM model: we assume that we have a large, random-access array of memory, and count the number of simple reads/writes needed to perform the algorithm. For example, selection sort takes about n(n-1)/2 reads and 2n writes to sort an array of n numbers. This model was relatively accurate up through the 1980s, when CPUs were slow enough and RAM small enough that reading and writing the RAM directly didn't create a significant bottleneck. Nowadays, however, typical desktop CPUs possess deep cache hierarchies - at least a sizable&amp;nbsp;L1 and L2 cache - to prevent runtime from being dominated by RAM accesses. Data structures and algorithms that efficiently exploit the cache can perform dramatically more quickly than those that don't, and our analysis needs to take this into account.&lt;/P&gt;
&lt;P&gt;For example, suppose you have a list of bytes stored in both an array and a linked list. Provided the linked list nodes are allocated at random positions, iterating over the array will be dramatically faster, exhibiting optimal locality of reference, while iterating over the linked list will trigger a cache miss for nearly every element. We can formalize this: assume a cache line - the size of the block read into the cache at one time - has size B bytes. Then traversing the linked list requires up to n cache misses, while traversing the array requires only ceiling(n/B) cache misses when it starts on a cache line boundary (and at most one more otherwise), roughly a factor of B fewer.&lt;/P&gt;
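&lt;P&gt;To make the count concrete, here is a minimal sketch of the idealized model, assuming every cache line touched for the first time costs exactly one miss. The class and method names are made up for this illustration.&lt;/P&gt;
&lt;PRE&gt;using System;

static class CacheMissModel {
    // Cache lines touched when scanning n contiguous bytes that begin at
    // byte address start, with a line size of B bytes (idealized model).
    public static long ArrayMisses(long start, long n, long B) {
        if (n == 0) return 0;
        long firstLine = start / B;
        long lastLine = (start + n - 1) / B;
        return lastLine - firstLine + 1;
    }

    // A linked list with randomly placed nodes can miss once per element.
    public static long ListMisses(long n) {
        return n;
    }

    static void Main() {
        Console.WriteLine(ArrayMisses(0, 1000, 64));   // 16 = ceiling(1000/64)
        Console.WriteLine(ArrayMisses(63, 1000, 64));  // 17: an unaligned start costs one more line
        Console.WriteLine(ListMisses(1000));           // 1000
    }
}
&lt;/PRE&gt;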
&lt;P&gt;It is possible to design data structures that specifically take advantage of the cache line size B. For example, &lt;A class="" href="http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx" target=_blank mce_href="http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx"&gt;in a previous post&lt;/A&gt; I described an unrolled linked list data structure where each node fills up a cache line, achieving similar cache performance to arrays while still allowing efficient insertions and deletions. Such a data structure is termed "cache-aware", because it &lt;EM&gt;explicitly&lt;/EM&gt; takes knowledge about the cache model into account in its construction.&lt;/P&gt;
&lt;P&gt;Unfortunately, this approach has a serious disadvantage: one has to tune the data structure based on the cache line size B. This is a problem for two reasons. First, programs, even those compiled to native machine code, are often run on many different processors with different cache line sizes. A Pentium 4 can run the same machine code as a Pentium II, but has a very different cache model. Second, even when running on a fixed machine, these simple cache-tuned data structures can't take advantage of multilevel caches, because each cache level has a different size and a different block size. One can attempt to build a data structure for a specific cache hierarchy, but this is even less robust against a change of machine.&lt;/P&gt;
&lt;P&gt;In 1999, in his master's thesis, Harald Prokop came up with an interesting solution to this problem: the idea of a &lt;A class="" href="http://citeseer.ist.psu.edu/prokop99cacheobliviou.html" mce_href="http://citeseer.ist.psu.edu/prokop99cacheobliviou.html"&gt;&lt;EM&gt;cache-oblivious&lt;/EM&gt; algorithm&lt;/A&gt;. This does not mean that the algorithm does not take advantage of the cache; to the contrary, it does so quite effectively. What it means is that the algorithm does not need to know the cache line size; it works effectively for &lt;EM&gt;all&lt;/EM&gt; cache line sizes B simultaneously. This allows the algorithm to robustly exploit caches across many machines without machine-specific tuning. More importantly, it allows the algorithm to exploit multilevel caches effectively: because it works for all B, it applies between each two adjacent levels of cache, and so every level is well-utilized.&lt;/P&gt;
&lt;P&gt;In fact, the multilevel advantage extends beyond simply taking advantage of both the L1 and L2 caches - such algorithms perform well even when the storage hierarchy is expanded to encompass much slower media such as slow memory, Flash, compressed memory, disks, or even network resources, without knowledge of block sizes or access patterns. For example, cache-oblivious B-trees can be used to implement huge database indexes that simultaneously minimize disk reads and cache misses.&lt;/P&gt;
&lt;P&gt;Note that just because the algorithm doesn't depend on B doesn't mean the &lt;EM&gt;analysis&lt;/EM&gt; of the algorithm doesn't depend on B. Let's take an example: iterating over a simple array. As noted earlier, this requires about n/B cache misses, which is optimal. But neither the array's structure nor the iteration algorithm explicitly takes B into account. Consequently, it works for all B. In a multilevel cache, all data prefetched into every cache is used. This is the theoretical benefit of cache-oblivious algorithms: we can reason about the algorithm using a very simple two-level cache model, and it automatically generalizes to a complex, deep cache hierarchy with no additional work.&lt;/P&gt;
&lt;P&gt;While cache-oblivious algorithms are clearly useful, at first it's not clear that there even exist any other than simple array iteration. Thankfully, extensive recent research has revealed cache-oblivious data structures and algorithms for a multitude of practical problems: &lt;A class="" href="http://supertech.csail.mit.edu/cacheObliviousBTree.html" mce_href="http://supertech.csail.mit.edu/cacheObliviousBTree.html"&gt;searching (binary trees)&lt;/A&gt;, &lt;A class="" href="http://portal.acm.org/citation.cfm?id=1227161.1227164" mce_href="http://portal.acm.org/citation.cfm?id=1227161.1227164"&gt;sorting&lt;/A&gt;, &lt;A class="" href="http://citeseer.ist.psu.edu/bender02localitypreserving.html" mce_href="http://citeseer.ist.psu.edu/bender02localitypreserving.html"&gt;associative arrays&lt;/A&gt;, &lt;A class="" href="http://citeseer.ist.psu.edu/prokop99cacheobliviou.html" mce_href="http://citeseer.ist.psu.edu/prokop99cacheobliviou.html"&gt;FFT computation&lt;/A&gt;, &lt;A class="" href="http://www5.in.tum.de/~bader/publikat/matmult.pdf" mce_href="http://www5.in.tum.de/~bader/publikat/matmult.pdf"&gt;matrix multiplication&lt;/A&gt; and &lt;A class="" href="http://www.springerlink.com/content/1cg4uvtb45mmyq3n/" mce_href="http://www.springerlink.com/content/1cg4uvtb45mmyq3n/"&gt;transposition&lt;/A&gt;, &lt;A class="" href="http://citeseer.ist.psu.edu/arge02cacheoblivious.html" mce_href="http://citeseer.ist.psu.edu/arge02cacheoblivious.html"&gt;priority queues&lt;/A&gt;, &lt;A class="" href="http://www.cs.utexas.edu/~vlr/papers/spaa04.ps" mce_href="http://www.cs.utexas.edu/~vlr/papers/spaa04.ps"&gt;shortest path&lt;/A&gt;, &lt;A class="" href="http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p05.pdf" mce_href="http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p05.pdf"&gt;query processing&lt;/A&gt;, &lt;A class="" href="http://portal.acm.org/citation.cfm?id=1137883" mce_href="http://portal.acm.org/citation.cfm?id=1137883"&gt;orthogonal range 
searching&lt;/A&gt;, and more emerging every year. Practical comparison studies have shown a significant performance gain on real processors compared to traditional algorithms, although carefully-tuned cache-aware algorithms still enjoy an advantage on the specific machines they are tuned for (see e.g. &lt;A class="" href="http://www.brics.dk/~gerth/pub6.html" mce_href="http://www.brics.dk/~gerth/pub6.html"&gt;[1][2]&lt;/A&gt;).&lt;/P&gt;
&lt;P&gt;Many of the cache-oblivious data structures and algorithms that have been published are relatively complex, but here I'll describe a simple one just to give you a feel for it. This discussion is based on parts of &lt;A class="" href="http://supertech.csail.mit.edu/papers/cobtree.pdf" mce_href="http://supertech.csail.mit.edu/papers/cobtree.pdf"&gt;this paper&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Consider the following simple search problem: we have a static list of records, and we wish to find the record with a particular key. Traditionally, this problem is solved with either an array and binary search, or a binary search tree. Both of these approaches exhibit dismal cache behavior. Database applications rely on B-trees, which group several records into each block. This greatly improves performance, but requires knowledge of the block size, and does not work across all levels of the cache hierarchy.&lt;/P&gt;
&lt;P&gt;The key is the &lt;EM&gt;van Emde Boas&lt;/EM&gt; &lt;EM&gt;layout&lt;/EM&gt;, named after the van Emde Boas tree data structure conceived in 1977 by Peter van Emde Boas. Suppose for simplicity that the number of elements is a power of 2. We create a complete binary tree containing our records (where by "complete" we mean that all leaves are at the same level). This tree will have height &lt;EM&gt;h&lt;/EM&gt; = log&lt;SUB&gt;2&lt;/SUB&gt;&lt;EM&gt;n&lt;/EM&gt;.&lt;/P&gt;
&lt;P&gt;Take a look at the first &lt;EM&gt;h&lt;/EM&gt;/2 levels of the tree. This subtree has 2&lt;SUP&gt;&lt;EM&gt;h&lt;/EM&gt;/2&lt;/SUP&gt; leaves, each itself the root of a tree of height &lt;EM&gt;h&lt;/EM&gt;/2. In the van Emde Boas layout, we first lay out the subtree of height &lt;EM&gt;h&lt;/EM&gt;/2 rooted at the root, followed by the subtrees rooted at each leaf of this tree from left to right. Each subtree is recursively laid out using the van Emde Boas layout. An example diagram is shown in Figure 1 of the paper.&lt;/P&gt;
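&lt;P&gt;The recursive layout can be sketched in a few lines. This is only an illustration under my own assumptions, not code from the paper: nodes are named by heap-style indices (the root is 1 and the children of node k are 2k and 2k + 1), heights are counted in levels, and when the number of levels is odd the top subtree gets the smaller half.&lt;/P&gt;
&lt;PRE&gt;using System;
using System.Collections;

static class VebLayoutSketch {
    // Appends, in van Emde Boas order, the heap indices of the subtree
    // that is rooted at root and spans the given number of levels.
    public static void Layout(int root, int levels, ArrayList order) {
        if (levels == 1) {
            order.Add(root);
            return;
        }
        int top = levels / 2;        // levels in the top subtree (assumed: round down)
        int bottom = levels - top;   // levels in each bottom subtree
        Layout(root, top, order);    // the top subtree is laid out first

        int width = 1;               // 2^top, the number of bottom subtrees
        for (int k = 0; k != top; k++) width *= 2;

        // Bottom subtrees from left to right; their roots sit top levels below root.
        for (int j = 0; j != width; j++) {
            Layout(root * width + j, bottom, order);
        }
    }

    static void Main() {
        ArrayList order = new ArrayList();
        Layout(1, 4, order);         // complete tree with 4 levels (15 nodes)
        foreach (int index in order) Console.Write(index + " ");
        Console.WriteLine();
        // prints: 1 2 3 4 8 9 5 10 11 6 12 13 7 14 15
    }
}
&lt;/PRE&gt;
&lt;P&gt;In the output, each run of three indices (for example 4 8 9) is one small subtree stored contiguously, which is what lets a single block transfer bring in several consecutive levels of a search path.&lt;/P&gt;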
&lt;P&gt;The analysis proceeds like this: at each step of the recursion, the size of the subtrees&amp;nbsp;being laid out is the square root of the size of the subtrees laid out at the previous step. Consequently, as we recurse,&amp;nbsp;at some point we will be laying out subtrees between sqrt(B) and B in size (let's call it C). Each of these subtrees can be retrieved in a single memory transfer, and this layout partitions the tree into subtrees of this size.&lt;/P&gt;
&lt;P&gt;The search procedure works like this: we access the root, which pulls in the first log&lt;SUB&gt;2&lt;/SUB&gt;C levels of the tree. We have only cache hits until we access a leaf of the tree, which pulls in the next log&lt;SUB&gt;2&lt;/SUB&gt;C levels of the tree rooted at that leaf. This continues until we reach the bottom of the tree. Because only 1 out of every log&lt;SUB&gt;2&lt;/SUB&gt;C accesses causes a cache miss, the total number of misses is log&lt;SUB&gt;2&lt;/SUB&gt;n/log&lt;SUB&gt;2&lt;/SUB&gt;C = log&lt;SUB&gt;C&lt;/SUB&gt;n. Because C is at least sqrt(B), log&lt;SUB&gt;2&lt;/SUB&gt;C is at least (log&lt;SUB&gt;2&lt;/SUB&gt;B)/2, and the total number of misses is at most 2log&lt;SUB&gt;B&lt;/SUB&gt;n.&lt;/P&gt;
&lt;P&gt;We made one small assumption above: that each subtree is aligned with the cache block boundaries; if this is not true, each subtree can require two accesses, bringing the final maximum to 4log&lt;SUB&gt;B&lt;/SUB&gt;n. Probabilistic analysis shows that with a random starting point in memory and sufficiently large block size, the real constant is closer to 2. &lt;/P&gt;
&lt;P&gt;This is the same asymptotic performance as B-trees, and is faster than ordinary binary trees by a factor of about log&lt;SUB&gt;2&lt;/SUB&gt;n/(2log&lt;SUB&gt;B&lt;/SUB&gt;n) = (log&lt;SUB&gt;2&lt;/SUB&gt;B)/2. For example, for a disk block size of 2048 elements, log&lt;SUB&gt;2&lt;/SUB&gt;B = 11, so this would be a speedup of about 5.5 times. The advantage is more than theoretical: to quote the paper, "preliminary experiments have shown, surprisingly, that CO [cache-oblivious] B-trees can outperform traditional B-trees, sometimes by factors of more than 2". &lt;/P&gt;
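&lt;P&gt;As a quick sanity check on these bounds, the following sketch plugs in numbers. The record count of 2&lt;SUP&gt;20&lt;/SUP&gt; is an arbitrary value chosen for illustration, and the figures are the upper bounds from the analysis above, not measurements.&lt;/P&gt;
&lt;PRE&gt;using System;

static class MissBounds {
    // Ratio of binary-tree misses (about log2 n) to the aligned bound 2 log_B n.
    public static double Speedup(double n, double B) {
        return Math.Log(n, 2) / (2 * Math.Log(n, B));
    }

    static void Main() {
        double n = 1048576;   // 2^20 records (illustrative)
        double B = 2048;      // elements per block
        Console.WriteLine(Math.Log(n, 2));       // about 20 misses for a plain binary tree
        Console.WriteLine(2 * Math.Log(n, B));   // about 3.6 with block-aligned subtrees
        Console.WriteLine(4 * Math.Log(n, B));   // about 7.3 in the unaligned worst case
        Console.WriteLine(Speedup(n, B));        // about 5.5, that is (log2 B)/2
    }
}
&lt;/PRE&gt;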
&lt;P&gt;But as practical as the research is in cache-oblivious algorithms, many applications and libraries have yet to take advantage of them. The data structures supplied with .NET, Java, Lisp, and so on are not cache-oblivious. We need to start putting this research into practice and reaping the benefits. I&amp;nbsp;hope this gives you some insight into cache-oblivious algorithms, and I hope you will consider taking advantage of it the next time you develop a product whose performance critically depends on memory access patterns.&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=3256830" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/devdev/archive/tags/Data+structures/default.aspx">Data structures</category></item><item><title>Succinct data structures</title><link>http://blogs.msdn.com/devdev/archive/2005/12/05/500171.aspx</link><pubDate>Mon, 05 Dec 2005 21:43:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:500171</guid><dc:creator>dcoetzee</dc:creator><slash:comments>4</slash:comments><comments>http://blogs.msdn.com/devdev/comments/500171.aspx</comments><wfw:commentRss>http://blogs.msdn.com/devdev/commentrss.aspx?PostID=500171</wfw:commentRss><description>&lt;P&gt;Sorry for the long hiatus, everyone. Today I'm going to talk about succinct data structures, which are informally data structures that use very close to the absolute minimum possible space. This material is largely based on lectures by MIT professor Erik Demaine, as transcribed by 6.897 students Wayland Ni and Gautam Jayaraman (&lt;A href="http://theory.csail.mit.edu/classes/6.897/spring03/scribe_notes/L12/lecture12.pdf"&gt;link&lt;/A&gt;). Also see additional notes from this year by students &lt;A href="http://theory.csail.mit.edu/classes/6.897/spring05/lec/lec21.pdf"&gt;Yogo Zhou&lt;/A&gt; and &lt;A href="http://theory.csail.mit.edu/classes/6.897/spring05/lec/lec22.pdf"&gt;Nick Harvey&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;One big problem with classical data structures that is&amp;nbsp;too often overlooked is how horribly inefficient their space usage is. For example, suppose you have a linked list of characters. Each node contains a character and a pointer; on a 32-bit platform, each pointer takes 4 bytes, so that's already 5n bytes for n elements. Additionally, it's unlikely with most standard allocators that the 3 bytes following the character will be used; these are normally reserved for padding for alignment. On top of this, unless some kind of memory pool is created for the nodes, most simple allocators will reserve 4 to 12 bytes of space &lt;EM&gt;per node&lt;/EM&gt; for allocation metadata. We end up using something like 16 times as much space as the actual data itself requires. Similar problems appear in many simple pointer-based data structures, including binary search trees, adjacency lists, red-black trees, and so on, but the problem isn't limited to these: dynamic arrays of the sort implemented by Java and .NET's ArrayList classes can also waste linear space on reserved space for growing the array which may never be used.&lt;/P&gt;
&lt;P&gt;A natural explanation for this state of affairs would be to say that all of this extra data is necessary to facilitate fast operations, such as insertion, removal, and search, but in fact this isn't true: there are data structures which can perform many of the same operations using very close to the minimum possible space. What is the minimum possible space? For a list of data, it's simply the space taken by a static array, which is just the space taken by the data itself. But what about, say, a binary tree, or a set of integers?&lt;/P&gt;
&lt;P&gt;This is a little trickier, but the idea is to use combinatorics to count the number of possible values of size &lt;EM&gt;n&lt;/EM&gt;, and then take the logarithm (base 2) of this. Because we need a distinct representation for each value (otherwise we couldn't tell which one we have), this gives us the minimum number of bits that any encoding needs in the worst case to represent an individual value. For example, suppose we have a binary tree with &lt;EM&gt;n&lt;/EM&gt; nodes; we are ignoring data for now and just assuming each node has a left and right child reference. How many bits do we need to store this tree? Well, the number of rooted, ordered, binary trees with &lt;EM&gt;n&lt;/EM&gt; nodes is described by &lt;EM&gt;C&lt;SUB&gt;n&lt;/SUB&gt;&lt;/EM&gt;, the &lt;EM&gt;n&lt;/EM&gt;th &lt;A href="http://mathworld.wolfram.com/CatalanNumber.html"&gt;Catalan number&lt;/A&gt;, which can be defined by this formula:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;C&lt;SUB&gt;n&lt;/SUB&gt;&lt;/EM&gt; = (2&lt;EM&gt;n&lt;/EM&gt;)! / ((&lt;EM&gt;n&lt;/EM&gt; + 1)! &lt;EM&gt;n&lt;/EM&gt;!)&lt;/P&gt;
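&lt;P&gt;Now, &lt;A href="http://mathworld.wolfram.com/StirlingsApproximation.html"&gt;Stirling's approximation&lt;/A&gt; tells us that lg(&lt;EM&gt;n&lt;/EM&gt;!) is &lt;EM&gt;n&lt;/EM&gt;lg &lt;EM&gt;n&lt;/EM&gt; − &lt;EM&gt;n&lt;/EM&gt;lg &lt;EM&gt;e&lt;/EM&gt; plus terms asymptotically smaller than &lt;EM&gt;n &lt;/EM&gt;(written o(n)), where lg is the log base 2. The terms linear in &lt;EM&gt;n&lt;/EM&gt; cancel between the numerator and denominator of &lt;EM&gt;C&lt;SUB&gt;n&lt;/SUB&gt;&lt;/EM&gt;, so keeping only the &lt;EM&gt;n&lt;/EM&gt;lg &lt;EM&gt;n&lt;/EM&gt; terms, we can see that the base 2 logarithm of &lt;EM&gt;C&lt;SUB&gt;n&lt;/SUB&gt;&lt;/EM&gt; is asymptotically approximated by:&lt;/P&gt;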
&lt;P&gt;log &lt;EM&gt;C&lt;SUB&gt;n&lt;/SUB&gt;&lt;/EM&gt; ≈ (2&lt;EM&gt;n&lt;/EM&gt;)lg(2&lt;EM&gt;n&lt;/EM&gt;) −&amp;nbsp;(&lt;EM&gt;n&lt;/EM&gt; + 1)lg(&lt;EM&gt;n&lt;/EM&gt; + 1)&amp;nbsp;− &lt;EM&gt;n&lt;/EM&gt;lg(&lt;EM&gt;n&lt;/EM&gt;)&lt;BR&gt;log &lt;EM&gt;C&lt;SUB&gt;n&lt;/SUB&gt;&lt;/EM&gt; ≈ (2&lt;EM&gt;n&lt;/EM&gt;)(lg(2)&lt;EM&gt; + &lt;/EM&gt;lg(&lt;EM&gt;n&lt;/EM&gt;)) −&amp;nbsp;2&lt;EM&gt;n &lt;/EM&gt;lg(&lt;EM&gt;n&lt;/EM&gt;)&lt;BR&gt;log &lt;EM&gt;C&lt;SUB&gt;n&lt;/SUB&gt;&lt;/EM&gt; ≈ 2&lt;EM&gt;n&lt;/EM&gt;&lt;/P&gt;
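&lt;P&gt;The approximation is easy to check numerically. The sketch below (names are mine) avoids computing huge factorials by using the equivalent product form &lt;EM&gt;C&lt;SUB&gt;n&lt;/SUB&gt;&lt;/EM&gt; = product over &lt;EM&gt;k&lt;/EM&gt; = 2..&lt;EM&gt;n&lt;/EM&gt; of (&lt;EM&gt;n&lt;/EM&gt; + &lt;EM&gt;k&lt;/EM&gt;)/&lt;EM&gt;k&lt;/EM&gt;, and summing logarithms instead of multiplying.&lt;/P&gt;
&lt;PRE&gt;using System;

static class CatalanBits {
    // lg(C_n), computed as the sum of lg((n + k) / k) for k = 2..n.
    // Assumes n is at least 1.
    public static double Log2Catalan(int n) {
        double bits = 0;
        for (int k = n; k != 1; k--) {
            bits += Math.Log((double)(n + k) / k, 2);
        }
        return bits;
    }

    static void Main() {
        int[] sizes = { 10, 1000, 100000 };
        foreach (int n in sizes) {
            // Bits per node required by the counting argument.
            Console.WriteLine(n + ": " + (Log2Catalan(n) / n));
        }
        // The ratio starts near 1.4 for n = 10 and rises toward 2 as n grows.
    }
}
&lt;/PRE&gt;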
&lt;P&gt;In other words, we have to use at least 2 bits per node - this is the best we can possibly do. Surprisingly, there is a very simple encoding that does so. We traverse the tree in preorder, outputting 1 for each node and outputting 0 whenever we hit a &lt;EM&gt;null&lt;/EM&gt;. Here's some C# code which implements the conversion back and forth between this compact bit string and a full in-memory tree structure, preserving both the tree's exact shape and its data:&lt;/P&gt;&lt;PRE&gt;        // Requires: using System.Collections; (for BitArray and ArrayList)
        class BinaryTreeNode {
            public object data;
            public BinaryTreeNode left, right;
        }

        // Preorder walk: appends 1 for each node (and its data), 0 for each null.
        static void EncodeSuccinct(BinaryTreeNode node, ref BitArray bits, ref ArrayList data) {
            bits.Length++;
            bits[bits.Length - 1] = (node != null);
            if (node != null) {
                data.Add(node.data);
                EncodeSuccinct(node.left, ref bits, ref data);
                EncodeSuccinct(node.right, ref bits, ref data);
            }
        }

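        // Rebuilds the tree; both indexes start at -1 because the helper
        // increments them before each read.
        static BinaryTreeNode DecodeSuccinct(BitArray bits, ArrayList data) {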
            int bitsIndex = -1, dataIndex = -1;
            return DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
        }

        static BinaryTreeNode DecodeSuccinctHelper(BitArray bits, ref int bitsIndex,
                                                   ArrayList data, ref int dataIndex) {
            BinaryTreeNode result;
            bitsIndex++;
            if (bits[bitsIndex])
            {
                result = new BinaryTreeNode();
                dataIndex++;
                result.data  = data[dataIndex];
                result.left  = DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
                result.right = DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
            } else {
                result = null;
            }
            return result;
        }
&lt;/PRE&gt;
&lt;P&gt;The original representation uses about 3 times as much space as the data, while the succinct representation uses only 2&lt;EM&gt;n&lt;/EM&gt; + 1 extra bits, an overhead of 6% on a 32-bit machine. There is some constant overhead, but because this is asymptotically less than &lt;EM&gt;n&lt;/EM&gt;,&amp;nbsp;the data structure is still considered succinct. A number of applications are evident: trees can be compactly stored and transmitted across networks in this form, embedded and portable devices can incorporate pre-built tree data structures in the application's image for fast startup, and so on.&lt;/P&gt;
&lt;P&gt;On the other hand, it would seem that you can't really &lt;EM&gt;do&lt;/EM&gt; much with trees in this form - in limited memory environments, you might not be able to expand the tree back out to its full form. An alternative encoding with the same space requirement is to traverse the tree in breadth-first order instead of depth-first (preorder) order, again outputting 1 for each node and 0 for each &lt;EM&gt;null&lt;/EM&gt;. If we do this, the resulting bit string has some useful properties. Define the &lt;EM&gt;Rank&lt;/EM&gt; and &lt;EM&gt;Select&lt;/EM&gt; functions as:&lt;/P&gt;
&lt;P&gt;Rank(&lt;EM&gt;i&lt;/EM&gt;) = the number of 1 bits at positions 1 through &lt;EM&gt;i&lt;/EM&gt; of the bit string (numbering positions from 1)&lt;BR&gt;Select(&lt;EM&gt;i&lt;/EM&gt;) = the position of the &lt;EM&gt;i&lt;/EM&gt;th 1 bit in the bit string&lt;/P&gt;
&lt;P&gt;The following can be proven: if we have the breadth-first search bit string for the tree, the children of the node represented by bit &lt;EM&gt;i&lt;/EM&gt; are always represented by bits 2(Rank(&lt;EM&gt;i&lt;/EM&gt;)) and 2(Rank(&lt;EM&gt;i&lt;/EM&gt;)) + 1. The parent of position &lt;EM&gt;i&lt;/EM&gt; is always located at position Select(floor(&lt;EM&gt;i&lt;/EM&gt;/2)). Finally, the data for the node at position &lt;EM&gt;i &lt;/EM&gt;is the Rank(&lt;EM&gt;i&lt;/EM&gt;)th entry of the data array, which stores the data in the same breadth-first order. Using these lemmas, along with an efficient implementation of Rank and Select, we can traverse the tree and perform all the usual operations on it in this compact form. I won't get into implementing Rank and Select here, but suffice it to say that by adding only O(&lt;EM&gt;n&lt;/EM&gt;/log &lt;EM&gt;n&lt;/EM&gt;) additional bits (which again is asymptotically less than &lt;EM&gt;n&lt;/EM&gt;) they can be implemented in constant time; if the hardware supports a &lt;EM&gt;find-first-zero&lt;/EM&gt; and/or &lt;EM&gt;population&lt;/EM&gt; (bit-count) operation, these can be quite efficient.&lt;/P&gt;
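&lt;P&gt;Here is a small sketch of navigating a tree in this form. It assumes positions in the bit string are numbered from 1 (so position &lt;EM&gt;p&lt;/EM&gt; lives at array index &lt;EM&gt;p&lt;/EM&gt; - 1), and it uses deliberately naive linear-time Rank and Select purely for clarity; they are placeholders for the constant-time versions, and all names are mine.&lt;/P&gt;
&lt;PRE&gt;using System;
using System.Collections;

static class SuccinctTreeNavigation {
    // Number of 1 bits at positions 1..i (linear scan, for clarity only).
    public static int Rank(BitArray bits, int i) {
        int count = 0;
        for (int p = 1; p != i + 1; p++) {
            if (bits[p - 1]) count++;
        }
        return count;
    }

    // Position of the k-th 1 bit, or 0 if there is none (e.g. the root's parent).
    public static int Select(BitArray bits, int k) {
        int count = 0;
        for (int p = 1; p != bits.Length + 1; p++) {
            if (bits[p - 1]) {
                count++;
                if (count == k) return p;
            }
        }
        return 0;
    }

    public static int LeftChild(BitArray bits, int i)  { return 2 * Rank(bits, i); }
    public static int RightChild(BitArray bits, int i) { return 2 * Rank(bits, i) + 1; }
    public static int Parent(BitArray bits, int i)     { return Select(bits, i / 2); }

    static void Main() {
        // Tree: A has children B and C; B has only a right child D.
        // Breadth-first bits: 1 (A), 1 1 (B, C), 0 1 (null, D), 0 0, 0 0.
        bool[] raw = { true, true, true, false, true, false, false, false, false };
        BitArray bits = new BitArray(raw);
        string[] data = { "A", "B", "C", "D" };   // node data in breadth-first order

        int b = LeftChild(bits, 1);    // position 2
        int d = RightChild(bits, b);   // position 5
        Console.WriteLine(data[Rank(bits, d) - 1]);                 // D
        Console.WriteLine(bits[LeftChild(bits, b) - 1]);            // False: B has no left child
        Console.WriteLine(data[Rank(bits, Parent(bits, d)) - 1]);   // B
    }
}
&lt;/PRE&gt;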
&lt;P&gt;This same idea can be applied to all sorts of data structures: for example, there are &lt;EM&gt;m choose n&lt;/EM&gt; (that is, &lt;EM&gt;m&lt;/EM&gt;!/(&lt;EM&gt;n&lt;/EM&gt;!(&lt;EM&gt;m&lt;/EM&gt;-&lt;EM&gt;n&lt;/EM&gt;)!)) different sets of &lt;EM&gt;n&lt;/EM&gt; integers between 1 and &lt;EM&gt;m&lt;/EM&gt;, and there are data structures which store such a set using a number of bits close to the logarithm of this quantity. The number of partitions of a set of objects is given by the &lt;A href="http://mathworld.wolfram.com/BellNumber.html"&gt;Bell numbers&lt;/A&gt;, and there are data structures using a number of bits close to the logarithm of these numbers. A variety of papers address succinct graph representations, and there are even schemes for representing graphs with specific properties that use no more bits than the logarithm of the number of graphs having those properties.&lt;/P&gt;
&lt;P&gt;One of the most interesting succinct data structures, created by Franceschini and Grossi in 2003 in their &lt;A href="http://www.springerlink.com/index/GQ8AMCEQ88BE4YW2.pdf"&gt;&lt;EM&gt;Optimal Worst-Case Operations for Implicit Cache-Oblivious Search Trees&lt;/EM&gt;&lt;/A&gt;, is a search tree which permits all the usual insert, delete, and search operations in O(log &lt;EM&gt;n&lt;/EM&gt;) time, but with only constant space overhead! It stores the data in an array, and nearly all the information it needs to search it is contained in the order of the elements alone (similar to the heap used in heapsort, but with different operations). On top of this, the structure is &lt;EM&gt;cache-oblivious&lt;/EM&gt;, meaning that it maximally utilizes every level of the memory hierarchy (within a constant factor) without any machine-specific tuning (I'll talk more about cache-oblivious data structures another time). I haven't read this paper yet, but for those who believe data structures are a "solved problem", this kind of result makes it clear that there are many wonders yet to behold.&lt;/P&gt;
&lt;P&gt;Finally, I wanted to come back to the problem with ArrayList. Although the simple array-growing scheme used by typical dynamic arrays can waste linear space (sometimes as much unused space is reserved as is used for data), there are alternative dynamic array data structures which remain efficient but reserve only O(sqrt(&lt;EM&gt;n&lt;/EM&gt;)) space, which is asymptotically optimal (there is a sqrt(&lt;EM&gt;n&lt;/EM&gt;) lower bound). These are described in Brodnik et al.'s &lt;EM&gt;&lt;A href="http://portal.acm.org/citation.cfm?id=673194"&gt;Resizeable Arrays in Optimal Time and Space&lt;/A&gt;.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Thanks for reading, everyone. Please ask any questions that you have, and don't hesitate to e-mail me if you want to.&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=500171" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/devdev/archive/tags/Data+structures/default.aspx">Data structures</category></item><item><title>Persistent data structures</title><link>http://blogs.msdn.com/devdev/archive/2005/11/08/490480.aspx</link><pubDate>Tue, 08 Nov 2005 22:41:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:490480</guid><dc:creator>dcoetzee</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/devdev/comments/490480.aspx</comments><wfw:commentRss>http://blogs.msdn.com/devdev/commentrss.aspx?PostID=490480</wfw:commentRss><description>&lt;P&gt;When learning to program in a functional language such as Lisp, &lt;A href="http://caml.inria.fr/"&gt;Ocaml&lt;/A&gt;, or &lt;A href="http://www.haskell.org/"&gt;Haskell&lt;/A&gt;, one of the most difficult aspects of the paradigmic shift is that data in these languages is almost entirely immutable, or read-only. In purely functional programs, there is no assignment, and data structures cannot be modified. But how are we to do anything useful with a data structure we can't modify? The answer is that instead of modifying it, we create new data structures that incorporate the old structure as a part.&lt;/P&gt;
&lt;P&gt;The simplest example of this is the singly-linked list. Say you have a pointer to the head of a linked list A. I can take that list and add three new nodes to the beginning to get a new list B. Yet, due to the structure of the list,&amp;nbsp;the old pointer to the first node of the list A will still point to the same data structure as before; you wouldn't even notice I prepended the new elements. We have retained access to both the previous version and updated version of the list simultaneously. Conversely, if I take a pointer to the third node of A, I get a new list C consisting of A with the first two elements removed (called a &lt;EM&gt;tail &lt;/EM&gt;of the list), but existing pointers to A still see the whole list. A data structure with this special property, the ability to retain access to the old version during updates, is called a &lt;EM&gt;&lt;A href="http://en.wikipedia.org/wiki/Persistent_data_structure"&gt;persistent data structure&lt;/A&gt; &lt;/EM&gt;(not to be confused with a &lt;EM&gt;disk-based&lt;/EM&gt; data structure, which might also be inferred from the word "persistent"). Why might this be useful? There are several reasons:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;We can treat all lists as immutable, or read-only. As a result, a function can be sure that no one else is going to modify its list. It can safely pass it to untrusted functions, and annoying phenomena like aliasing and&amp;nbsp;race conditions cannot arise. 
&lt;LI&gt;Because the data is immutable, the compiler can make assumptions that allow it to produce faster code, and the runtime and garbage collector can also make simplifying assumptions. 
&lt;LI&gt;It allows us to efficiently "undo" a sequence of operations - just retain a pointer to the old version you want to return to later, and when you're done with the new version, throw it away (assuming you have garbage collection). This is particularly useful for rolling back a failed operation.&lt;/LI&gt;&lt;/UL&gt;
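&lt;P&gt;The list example above (lists A, B, and C) can be sketched as follows. This is a minimal illustration with names of my own choosing; the fields are read-only, so no one holding a pointer into the list can change it.&lt;/P&gt;
&lt;PRE&gt;using System;

class ListNode {
    public readonly int Value;
    public readonly ListNode Next;   // null marks the end of the list
    public ListNode(int value, ListNode next) { Value = value; Next = next; }
}

static class PersistentListSketch {
    public static string Show(ListNode node) {
        string text = "";
        while (node != null) {
            text += node.Value + " ";
            node = node.Next;
        }
        return text.Trim();
    }

    static void Main() {
        ListNode a = new ListNode(1, new ListNode(2, new ListNode(3, null)));
        // B prepends three nodes to A; the nodes of A are shared, not copied.
        ListNode b = new ListNode(-2, new ListNode(-1, new ListNode(0, a)));
        // C is the tail of A that starts at its third node.
        ListNode c = a.Next.Next;

        Console.WriteLine(Show(a));   // 1 2 3
        Console.WriteLine(Show(b));   // -2 -1 0 1 2 3
        Console.WriteLine(Show(c));   // 3
        Console.WriteLine(object.ReferenceEquals(b.Next.Next.Next, a));   // True
    }
}
&lt;/PRE&gt;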
&lt;P&gt;On the other hand, other operations on the singly-linked list, such as node removal or insertion in the middle, would overwrite existing pointers and destroy the old version. We can make any data structure (even arrays) persistent by simply making a complete copy of it and then modifying it to perform each operation, but this approach is obviously very expensive in time and space. An active area of programming language research is to find new, more clever data structures that can perform more operations in an efficient, persistent way.&lt;/P&gt;
&lt;P&gt;Following is a simple example, taken from Chris Okasaki's excellent and readable PhD thesis, &lt;A href="http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf"&gt;&lt;EM&gt;Purely Functional Data Structures&lt;/EM&gt;&lt;/A&gt; (you can also purchase a &lt;A href="http://www.amazon.com/exec/obidos/tg/detail/-/0521663504/"&gt;book version&lt;/A&gt;). Say we wish to implement a simple queue. Our queue will support three main operations: add an element to the front, add an element to the back, and remove an element from the front. A singly linked list supports only two of these persistently, adding or removing an element from the front. A simple solution to this problem is to maintain two singly-linked lists, call them L and R. The queue's contents are simply the contents of L followed by the contents of the reverse of R. For example, if L={1,2} and R={4,3}, then our queue contains the values {1,2,3,4} in that order. To add to the beginning or end of the queue, we simply add to the beginning of L or R, respectively. To remove an element from the front, we remove an element from the front of L. However, it may so happen that L is empty. If this happens, we take R, reverse it, and make it the new L. In other words, we create a new queue {L,R} where L is the reverse of the old R and R is the empty list.&lt;/P&gt;
&lt;P&gt;But isn't this operation, reversing a potentially long list, expensive? Not in an amortized sense. Each element added to R will only get moved to L one time ever. Therefore, a sequence of &lt;EM&gt;n&lt;/EM&gt; insertions and &lt;EM&gt;n&lt;/EM&gt; removals will still take only O(&lt;EM&gt;n&lt;/EM&gt;) time, which is an amortized time of O(1) per operation. We say that we can "charge" each of the original&amp;nbsp;&lt;EM&gt;n&lt;/EM&gt; insertions a cost of O(1) to "pay for" the O(&lt;EM&gt;n&lt;/EM&gt;) reversal cost.&lt;/P&gt;
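&lt;P&gt;Here is a minimal C# sketch of the two-list queue. It is my own rendering of the idea rather than code from the thesis; the class and member names are made up, and the elements are plain integers to keep it short.&lt;/P&gt;
&lt;PRE&gt;using System;

class Cell {
    public readonly int Value;
    public readonly Cell Next;
    public Cell(int value, Cell next) { Value = value; Next = next; }
}

// Immutable queue whose contents are L followed by the reverse of R.
class TwoListQueue {
    public static readonly TwoListQueue Empty = new TwoListQueue(null, null);
    readonly Cell l, r;
    TwoListQueue(Cell l, Cell r) { this.l = l; this.r = r; }

    public TwoListQueue AddFront(int x) { return new TwoListQueue(new Cell(x, l), r); }
    public TwoListQueue AddBack(int x)  { return new TwoListQueue(l, new Cell(x, r)); }

    // Returns a new queue without the front element and reports that element.
    public TwoListQueue RemoveFront(out int front) {
        Cell left = l, right = r;
        if (left == null) {
            // L is empty: the reverse of R becomes the new L, and R becomes empty.
            while (right != null) {
                left = new Cell(right.Value, left);
                right = right.Next;
            }
        }
        if (left == null) throw new InvalidOperationException("queue is empty");
        front = left.Value;
        return new TwoListQueue(left.Next, right);
    }
}

static class QueueDemo {
    static void Main() {
        TwoListQueue q = TwoListQueue.Empty.AddBack(1).AddBack(2).AddBack(3);
        int x;
        TwoListQueue q2 = q.RemoveFront(out x);
        Console.WriteLine(x);   // 1
        q2 = q2.AddFront(0).RemoveFront(out x);
        Console.WriteLine(x);   // 0
        q.RemoveFront(out x);
        Console.WriteLine(x);   // 1 again: q itself was never modified
    }
}
&lt;/PRE&gt;
&lt;P&gt;One caveat: the amortized argument counts each version of the queue being used once. In this sketch, removing from the same old version repeatedly (as the last lines do with q) repeats the reversal each time.&lt;/P&gt;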
&lt;P&gt;Binary search trees and even self-balancing trees like red-black trees can also be made fully persistent: when we add a new node N, we create a new version of N's parent,&amp;nbsp;call it P,&amp;nbsp;with N as a child. We must also create a new version of P's parent with P as a child, and so on up to the root. For a balanced tree, O(log n) space is required for such an insertion operation - more than the O(1) required for an in-place update, but much less than the O(n) required for a complete copy-and-update. These are very useful for building map data structures that need the advantages of persistence.&lt;/P&gt;
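&lt;P&gt;The path-copying idea looks like this for a plain binary search tree. This is a sketch only: it does no rebalancing, so the O(log n) bound holds only while the tree happens to stay balanced, and the names are mine.&lt;/P&gt;
&lt;PRE&gt;using System;

class TreeNode {
    public readonly int Key;
    public readonly TreeNode Left, Right;
    public TreeNode(int key, TreeNode left, TreeNode right) {
        Key = key; Left = left; Right = right;
    }
}

static class PersistentTreeSketch {
    // Returns the root of a new version containing key. Only the nodes on the
    // search path are copied; every other subtree is shared with the old version.
    public static TreeNode Insert(TreeNode node, int key) {
        if (node == null) return new TreeNode(key, null, null);
        switch (Math.Sign(key.CompareTo(node.Key))) {
            case -1:   // key is smaller: copy this node with a new left subtree
                return new TreeNode(node.Key, Insert(node.Left, key), node.Right);
            case 1:    // key is larger: copy this node with a new right subtree
                return new TreeNode(node.Key, node.Left, Insert(node.Right, key));
            default:   // key already present: share the existing subtree
                return node;
        }
    }

    static void Main() {
        TreeNode v1 = Insert(Insert(Insert(null, 5), 2), 8);
        TreeNode v2 = Insert(v1, 1);
        Console.WriteLine(v1.Left.Left == null);                         // True: old version unchanged
        Console.WriteLine(v2.Left.Left.Key);                             // 1
        Console.WriteLine(object.ReferenceEquals(v1.Right, v2.Right));   // True: untouched subtree is shared
        Console.WriteLine(object.ReferenceEquals(v1.Left, v2.Left));     // False: nodes on the path were copied
    }
}
&lt;/PRE&gt;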
&lt;P&gt;One key observation to building more clever persistent data structures is that, really, we &lt;EM&gt;can&lt;/EM&gt; modify the old version, as long as it &lt;EM&gt;behaves&lt;/EM&gt; the same as it did before. For example, if we have a binary search tree, we can rebalance the tree without changing its contents; it only becomes more efficient to access. Objects holding existing pointers to the tree won't notice that it's been rebalanced as long as they access it only through its abstract interface and can't touch its internal representation. This can be exploited to maintain an efficient representation as more and more versions are created. On the other hand, this also sacrifices the advantage of avoiding race conditions; we must ensure the data structure is not accessed in an intermediate&amp;nbsp;state where it would not behave the same.&lt;/P&gt;
&lt;P&gt;Persistent data structures aren't strictly limited to functional languages. For example, the &lt;A href="http://www.moo.mud.org/"&gt;MOO&lt;/A&gt; and &lt;A href="http://cold.org/coldc/"&gt;ColdMUD&lt;/A&gt; virtual environment languages use immutable data structures for&amp;nbsp;built-in string, list, and map types, but have mutable objects. Nevertheless, implementations of persistent data structures are, today, largely limited to functional languages. But they're useful in every language - perhaps one day we will see them in standard libraries. Meanwhile, take a look at &lt;A href="http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf"&gt;Okasaki's thesis&lt;/A&gt;, and when thinking about your next design, give a thought to how persistent data structures could simplify your own code.&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=490480" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/devdev/archive/tags/Data+structures/default.aspx">Data structures</category></item><item><title>Unrolled linked lists</title><link>http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx</link><pubDate>Tue, 23 Aug 2005 00:39:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:454887</guid><dc:creator>dcoetzee</dc:creator><slash:comments>9</slash:comments><comments>http://blogs.msdn.com/devdev/comments/454887.aspx</comments><wfw:commentRss>http://blogs.msdn.com/devdev/commentrss.aspx?PostID=454887</wfw:commentRss><description>&lt;P&gt;Today I'll be discussing &lt;A href="http://en.wikipedia.org/wiki/Unrolled_linked_list"&gt;unrolled linked lists&lt;/A&gt;, a simple variant of the linked list which has many of its desirable properties but exploits the cache to yield considerably better performance on modern PCs.&lt;/P&gt;
&lt;P&gt;In elementary data structures classes, students are often taught about arrays and linked lists and the algorithmic tradeoffs between the two. In an array, you can get the kth element quickly, but adding elements to the middle of the array is slow; linked lists are the opposite. However, arrays have a number of practical benefits that this simple treatment doesn't mention:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Arrays are compact;&amp;nbsp;an array&amp;nbsp;stores almost entirely data, while&amp;nbsp;linked lists (especially doubly linked) can require many times as much space for bookkeeping (pointers, allocation metadata, alignment) as for actual data, particularly for small elements.&lt;/LI&gt;
&lt;LI&gt;Modern PCs have multi-level cache hierarchies that make traversing an array (visiting the elements in order) very fast. Cache hits are so fast that in cache-sensitive analysis they are considered "free"; we only count cache misses. If a cache line has size B, then the number of cache misses is about n/B.&amp;nbsp;A linked list, on the other hand,&amp;nbsp;requires a cache miss for every node access in the worst case. Even in the best case, when the nodes are allocated consecutively in order, because linked list nodes are larger, it can require several times more cache misses to traverse the list.&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;Let's make this concrete: using Gentoo Linux and gcc&amp;nbsp;on a 3.5 GHz Pentium 4 machine, I constructed a linked list of 60 million integers and created an array of the same 60 million integers. I compiled with full optimization. Traversing the linked list required 0.48 seconds, while traversing the array required 0.04 seconds, 12 times faster. Moreover, when I introduced code to fragment the memory pool, the advantage increased dramatically to 50 times or more. The linked list also required 4 times as much memory - twice as much for next pointers, and twice as much again for allocation metadata.&lt;/P&gt;
&lt;P&gt;Don't throw out your linked lists just yet, though. If we build the 60 million element array above by inserting elements into the middle, it would take weeks, despite the cache advantage. In applications where insertions and deletions in the middle are frequent, arrays just don't work. Also, arrays have the distinct disadvantage that they are slow to grow, and in a fragmented memory pool, it may not be possible to allocate a large array at all. We need something with the cache advantages of an array but the quick insertions and incremental growth of a linked list.&lt;/P&gt;&lt;IMG src="http://moonflare.com/blogfiles/devdev/Unrolled_linked_list.png" align=right&gt; 
&lt;P&gt;Enter the unrolled linked list. In simple terms, an unrolled linked list is a linked list of small arrays, all the same size N. Each is small enough that inserting or deleting from it is quick, but large enough so that it fills a cache line. An iterator pointing into the list consists of both a pointer to a node and an index into that node.&lt;/P&gt;
&lt;P&gt;Let's consider insertion first. If there is space left in the array of the node in which we wish to insert the value, we simply insert it, requiring only O(N) time. If the array already contains N values, we create a new node, insert it&amp;nbsp;after&amp;nbsp;the current one, and move half the elements to that node, creating room for the new value. Again, total time is O(N). Deletion is similar; we simply remove the value from the array. If the number of elements in the array drops below N/2, we take elements from a neighboring array to fill it back up. If the neighboring array holds only N/2 values itself, we merge the two arrays into a single node instead. Those familiar with &lt;A href="http://www.bluerwhite.org/btree/"&gt;B-Trees&lt;/A&gt; may note some similarities in these operations.&lt;/P&gt;&lt;IMG src="http://moonflare.com/blogfiles/devdev/Unrolled_linked_list_insert.png" align=left&gt; 
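&lt;P&gt;The insertion rule can be sketched in Python as follows. This is only an illustration, not the code used for the measurements in this post: the &lt;CODE&gt;Node&lt;/CODE&gt; class and the &lt;CODE&gt;insert&lt;/CODE&gt; and &lt;CODE&gt;to_list&lt;/CODE&gt; helpers are hypothetical, and a Python list stands in for the fixed-size array inside each node.&lt;/P&gt;
&lt;PRE&gt;
class Node:
    """One node of a hypothetical unrolled linked list."""

    def __init__(self, capacity):
        self.capacity = capacity   # N in the text
        self.items = []            # holds at most capacity values
        self.next = None


def insert(node, index, value):
    """Insert value before position index within node, splitting if full."""
    if len(node.items) == node.capacity:
        # Node is full: create a new node after this one and move the
        # upper half of the elements into it.
        half = node.capacity // 2
        new_node = Node(node.capacity)
        new_node.items = node.items[half:]
        node.items = node.items[:half]
        new_node.next = node.next
        node.next = new_node
        if index > half:
            # The insertion point moved into the new node.
            node, index = new_node, index - half
    node.items.insert(index, value)


def to_list(head):
    """Collect all values in order by walking the nodes."""
    out = []
    while head is not None:
        out.extend(head.items)
        head = head.next
    return out


head = Node(4)
for i, v in enumerate([1, 2, 3, 4]):
    insert(head, i, v)
insert(head, 2, 99)            # head is full, so it splits first
print(to_list(head))           # [1, 2, 99, 3, 4]
print(len(head.items), len(head.next.items))   # 3 2
&lt;/PRE&gt;
&lt;P&gt;Deletion with borrowing and merging would follow the same pattern in reverse and is omitted here.&lt;/P&gt;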
&lt;P&gt;Your first reaction may be that it seems wasteful to have every array being potentially half-empty. On average, though, each array will be about 70% full. Even in the worst case, this can be competitive with ordinary linked lists, which require at least one and as many as four or five words of overhead per element. By amortizing space overhead over several elements, an unrolled linked list can achieve very close to the compactness of an ordinary array. The overhead per element is proportional to 1/N, and in practice is at most a few bits per element. We can adjust N to trade off compactness and operation time. The space advantage becomes much greater when we consider lists of small values like characters or bits. Finally, we can set a higher low-water mark than 50% if necessary, at the cost of more frequent node splits and merges.&lt;BR&gt;&lt;/P&gt;
&lt;P&gt;Although space is critical in many applications, the primary advantage of unrolled linked lists is their cache behavior. If the size of each node is roughly a multiple of the cache line size B and all nodes are full, we can traverse each node optimally (N/B cache misses), and traverse the list with about (n/N+1)(N/B) cache misses, very close to the optimal n/B cache misses for realistic values of N.&amp;nbsp;In the average case we need about 40% more cache misses than traversing an array, and in the worst case we need about twice as many - which sounds a lot better than 12 to 50 times as many.&lt;/P&gt;
&lt;P&gt;In practice there's a bit of extra overhead for dealing with the added complexity. I tried out a simple unrolled linked list in our original experiment with a list of 60 million integers. I placed&amp;nbsp;124 integers in each node. With all nodes full, and a fragmented memory pool, it created the list in 0.64 seconds and traversed it in 0.10 seconds, about 2.5 times slower than the array (we may be observing some L2 cache effect here). With the nodes 50% full, it required about 0.15 seconds. Total space usage with all nodes full was 17% more than the array; with 50% usage it was about twice this, of course. Memory pooling could narrow the space gap.&lt;/P&gt;
&lt;P&gt;One small issue with unrolled linked lists is keeping track of the number of elements in each array. One easy and space-efficient way to deal with this, where applicable,&amp;nbsp;is to use some reserved "null" value to fill unused array slots. By aligning the nodes, sometimes the number of elements can be stored in the low bits of the pointers. Otherwise it adds a bit of overhead per node.&lt;/P&gt;
&lt;P&gt;I hope some of you found this illuminating. In future posts I hope to cover more cache-sensitive data structures, as well as cache-oblivious ones. Please ask any questions you have. Thanks for reading.&lt;/P&gt;
&lt;P&gt;Derrick Coetzee&lt;BR&gt;&lt;I&gt;&lt;FONT size=2&gt;This posting is provided "AS IS" with no warranties, and confers no rights.&lt;/FONT&gt;&lt;/I&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=454887" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/devdev/archive/tags/Data+structures/default.aspx">Data structures</category></item><item><title>Bloom filters</title><link>http://blogs.msdn.com/devdev/archive/2005/08/17/452827.aspx</link><pubDate>Wed, 17 Aug 2005 22:58:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:452827</guid><dc:creator>dcoetzee</dc:creator><slash:comments>11</slash:comments><comments>http://blogs.msdn.com/devdev/comments/452827.aspx</comments><wfw:commentRss>http://blogs.msdn.com/devdev/commentrss.aspx?PostID=452827</wfw:commentRss><description>&lt;P&gt;Imagine you're writing a simple spell checker in C. You've already collected a dictionary of 100,000 words, and you want to process the document a word at a time and find any words that aren't in the dictionary. You don't care about providing suggestions for wrong words. How would you go about this?&lt;/P&gt;
&lt;P&gt;Clearly, comparing each word in the document to every word in the dictionary won't do. You could use some kind of trie or binary search tree, but these have quite a bit of space overhead for pointers. The most obvious solution is to use a hash table containing all the words in the dictionary. Since hash tables with good hash functions and linear probing are fairly efficient up to about 80% utilization, you only need about 125,000 entries in the table. It seems hard to beat this.&lt;/P&gt;
&lt;P&gt;Now, imagine you're writing this spell checker in 1985. You have 640K of memory. There's no way the hash table above could fit; the words alone would take at least 500K. Are we sunk? Actually, we're not, and I'll explain why - but first let's review hash tables a bit.&lt;/P&gt;
&lt;P&gt;Recall how lookup works in a hash table. You use the hash function to transform the key into an index into the table. You go there, and if the table entry is empty, the key is not present. Otherwise, you must compare the key at that location to the desired key to make sure it's the right one, and not just another key that happened to map to the same location. This is too bad, because this comparison is expensive. For example, if I look up the misspelled "opulant", I may get the hash table entry containing "cheeseburger", by sheer coincidence. But really, what are the chances?&lt;/P&gt;
&lt;P&gt;It can be shown that if the hash function is good, the chance of such a collision is roughly&amp;nbsp;1 - e&lt;SUP&gt;-n/m&lt;/SUP&gt;, where n is the number of words in the dictionary and m is the size of the table. If we choose m=32n, this is about a 3% chance, which is pretty low. How can we use this? It allows us to remove the key comparison. If we look up a key and find that the table entry is in use, we simply claim that the word is spelled correctly. About 3% of misspelled words will be missed, but no spell checker catches every error anyway - the user may be able to tolerate this.&lt;/P&gt;
&lt;P&gt;Unfortunately, trying to fit a hash table with 3,200,000 entries in 640K doesn't seem any easier. The key is to notice that we no longer need to have the keys in the table at all - we only care whether or not each entry is in use. We can store this information in a single bit, and 3,200,000 bits is 400K. We fit!&lt;/P&gt;
&lt;P&gt;This idea of exploiting the rarity of collisions to decrease time and space requirements, at the expense of occasionally being wrong,&amp;nbsp;is the same idea behind &lt;A href="http://en.wikipedia.org/wiki/Bloom_filter"&gt;Bloom filters&lt;/A&gt;, a data structure invented by Burton H. Bloom in 1970 (back when space &lt;EM&gt;really&lt;/EM&gt; mattered). In fact, what we have described above is already a simple Bloom filter, and is said to have been used by real spell-checkers in the bad old days. However, Bloom filters also generalize this idea in a way that shrinks space requirements even further.&lt;/P&gt;
&lt;P&gt;At NASA, one of the ways they avoid having (too many) horrible accidents is by having many independent backup systems. Even if some of the systems fail, the chance that they'll &lt;EM&gt;all&lt;/EM&gt; simultaneously fail is minuscule. We can use the same idea here: instead of mapping a key into a single table entry, use k independent hash functions to map it to k locations. Set all these locations' bits&amp;nbsp;to 1. Then, when we look up a word, we say that it's spelled correctly&amp;nbsp;only if all&amp;nbsp;k of its locations are set to 1. Here's some pseudocode:&lt;/P&gt;&lt;PRE&gt;function insert(key) {
    for each hash function f
        table[f(key)] = 1;
}

function contains(key) {
    for each hash function f
        if table[f(key)] == 0 then return false;
    return true;
}
&lt;/PRE&gt;
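&lt;P&gt;For readers who want something runnable, the pseudocode above might translate to Python roughly as follows. This is a minimal sketch under stated assumptions: the &lt;CODE&gt;BloomFilter&lt;/CODE&gt; class is a made-up name, the k hash functions are simulated by salting SHA-256 with the function number (an arbitrary choice for this example), and bit masks are written as powers of two.&lt;/P&gt;
&lt;PRE&gt;
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch with m bits and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)   # m bits, packed 8 per byte

    def _positions(self, key):
        # Assumption: derive k hash functions by salting SHA-256 with the
        # function number. Real implementations often use cheaper hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key):
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            self.bits[byte] |= 2 ** bit   # 2 ** bit is the mask for that bit

    def contains(self, key):
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if (self.bits[byte] | 2 ** bit) != self.bits[byte]:
                return False   # a required bit is 0: definitely not present
        return True            # all k bits are 1: probably present


words = BloomFilter(m=8 * 1000, k=6)
for w in ("opulent", "cheeseburger"):
    words.insert(w)
print(words.contains("opulent"))   # True: inserted keys are always found
print(words.contains("opulant"))   # False, barring a rare false positive
&lt;/PRE&gt;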
&lt;P&gt;If the word is not in the dictionary, the chance of &lt;CODE&gt;contains&lt;/CODE&gt; returning true is only about:&lt;/P&gt;
&lt;P&gt;(1 - e&lt;SUP&gt;-kn/m&lt;/SUP&gt;)&lt;SUP&gt;k&lt;/SUP&gt;&lt;/P&gt;
&lt;P&gt;Recall that n is the number of words in the dictionary, and m is the table size. For a given m and n, the optimal k is about (ln 2)m/n.&amp;nbsp;If we set k=22, the optimal value for our table above, the chance of a misspelled word getting through shrinks to about 2 x 10^(-7), small enough to make it nearly as reliable as the hash table approach. If we set k=6 and m=8n, this gives about a 2.2% chance, even less than before, yet our table only requires a byte per word, or 100K. There are &lt;A href="http://www.cap-lore.com/code/BloomTheory.html"&gt;various other enhancements&lt;/A&gt; that can squish this down even further.&lt;/P&gt;
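&lt;P&gt;These figures are easy to check by plugging numbers into the two formulas. The helper names &lt;CODE&gt;false_positive_rate&lt;/CODE&gt; and &lt;CODE&gt;optimal_k&lt;/CODE&gt; below are invented for this sketch, and the results are only as good as the approximation itself.&lt;/P&gt;
&lt;PRE&gt;
import math


def false_positive_rate(n, m, k):
    """Approximate chance that contains() wrongly returns true."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(n, m):
    """Number of hash functions that minimizes the rate, before rounding."""
    return math.log(2) * m / n


n = 100_000
print(false_positive_rate(n, 32 * n, 1))    # about 0.031
print(optimal_k(n, 32 * n))                 # about 22.2
print(false_positive_rate(n, 32 * n, 22))   # about 2.1e-07
print(false_positive_rate(n, 8 * n, 6))     # about 0.022
&lt;/PRE&gt;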
&lt;P&gt;Of course, nowadays we don't have this sort of memory constraint, and spellcheckers can afford to load their massive dictionaries into main memory. So what are Bloom filters still good for? The answer is that there are still many problems where the sets that we deal with are far too large to fit in any memory. For example, one way of detecting infinite loops in a function is to simulate it while keeping a set of all its previous states, including local variable values and the current location. If it ever reaches a state it's already been in, it's in a loop. Unfortunately, maintaining a set of all its states after every instruction could quickly fill any memory. Instead, we use Bloom filters to store the previous states; we may get a false positive for an infinite loop, but the chance is small, and multiple runs can exponentially shrink this chance. Ideas like this are used in some modern probabilistic model checking algorithms for verifying systems; for example, see &lt;A href="http://www.cc.gatech.edu/fac/Pete.Manolios/research/bloom-filters-verification.html"&gt;Bloom Filters in Probabilistic Verification&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;More generally, any time you want to remember a large number of chunks of data, particularly large chunks, Bloom filters are handy to have around. The error is not really an issue, since you can tweak the parameters to get the error as low as your application requires. They're particularly important on PDAs and embedded systems, which still have orders of magnitude less memory than desktops.&lt;/P&gt;
&lt;P&gt;Unfortunately, implementations are lacking. No language that I know of has Bloom filters in its standard library, and sample code is hard to find. When I get a bit of time I'll write up a little implementation in C++ and/or C#&amp;nbsp;for demonstration and reuse. Please ask if you have any questions, and I'll try to explain things further. Thanks for reading.&lt;/P&gt;
&lt;P&gt;Derrick Coetzee&lt;BR&gt;&lt;I&gt;&lt;FONT size=2&gt;This posting is provided "AS IS" with no warranties, and confers no rights.&lt;/FONT&gt;&lt;/I&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=452827" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/devdev/archive/tags/Data+structures/default.aspx">Data structures</category></item></channel></rss>