Fast Switching with LINQ


Jomo Fisher—I’ve run into a performance problem several times in my career that I’ve never found a truly satisfying answer for until now. The problem can be distilled down to this:

Look up a value based on a string key. The set of strings is fixed and no unknown key values can occur.

Consider my personal rankings of the Seven Dwarfs on a scale of 1-10:

                happy=9, sneezy=2, doc=7, sleepy=8, dopey=9, grumpy=2, bashful=6

I need a function that takes a name and returns a ranking. The default solution for this problem should be a Dictionary or Hashtable. The implementations that ship with the .NET Framework are extremely fast.

However, these container classes offer more features than I need for this problem. New keys can be added but my set is fixed. They can tell you whether a key is valid or not but I know a priori that the keys are limited to a fixed set. They can enumerate the set of keys or the set of values but I don’t need this capability.

Any time I see a data structure with a capability I’m not using it makes me wonder whether I can trade that capability for something I do need—in this case a speed boost.

If I were really desperate for a performance gain, I could write a function like this:

        static int Lookup(string s) {
            if (s.Length < 6) {
                if (s.Length < 5) return 7;
                else return 9;
            } else {
                if (s[0] < 's') {
                    if (s[0] < 'g') return 6;
                    else return 2;
                } else {
                    if (s[1] < 'n') return 8;
                    else return 2;
                }
            }
        }

Notice that we only have to look at the length and the first two characters of the string to make the decision. This function is fast but nearly impossible to maintain or debug. Also, I have to know the set of strings at compile time. All of these drawbacks make this approach significantly less useful.

Enter the LINQ expression compiler that will be available in .NET 3.5. With the new expression support, it's possible to build an expression at runtime and then compile it into straight IL (which will be JIT-compiled into machine code). It turned out to be pretty easy to write a switch compiler that generates an appropriate lookup expression on the fly. I wrote a class called SwitchCompiler to do the heavy lifting. Here's how it's called:

            var dict = new Dictionary<string, int> {
                    {"happy", 9}, {"sneezy", 2},
                    {"doc", 7}, {"sleepy", 8},
                    {"dopey", 9}, {"grumpy", 2},
                    {"bashful", 6}
                };
            var Lookup = SwitchCompiler.CreateSwitch(dict);
            var v = Lookup("doc");
            Debug.Assert(v == 7);

SwitchCompiler first builds up an expression that looks like this:

s.Length < 6 ? (s.Length < 5 ? 7 : 9) :
    (s[0] < 's' ? (s[0] < 'g' ? 6 : 2) : (s[1] < 'n' ? 8 : 2));

It then compiles this expression into a delegate, which is returned and assigned to Lookup. (It should also be possible to generate a minimal perfect hash in the Compile method; I haven't tried this.)

The actual implementation of SwitchCompiler is about 65 lines of code. You can see it in the attachment.
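To give a flavor of the technique without opening the attachment, here is a minimal, hypothetical sketch. The class name MiniSwitchCompiler and its structure are my illustration only, not the attached SwitchCompiler (which branches on lengths and individual characters rather than whole strings): it sorts the keys and emits a binary search as a tree of nested conditional expressions, then hands that tree to the expression compiler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

static class MiniSwitchCompiler {
    // Illustrative sketch only -- not the attached SwitchCompiler.
    // Builds a nested ?: expression over the sorted keys and compiles it to IL.
    public static Func<string, int> CreateSwitch(IDictionary<string, int> dict) {
        var s = Expression.Parameter(typeof(string), "s");
        var pairs = dict.OrderBy(kv => kv.Key, StringComparer.Ordinal).ToArray();
        var body = Build(pairs, 0, pairs.Length, s);
        return Expression.Lambda<Func<string, int>>(body, s).Compile();
    }

    // Emit a binary search over pairs[lo..hi) as a conditional expression tree.
    static Expression Build(KeyValuePair<string, int>[] pairs, int lo, int hi,
                            ParameterExpression s) {
        if (hi - lo == 1)
            return Expression.Constant(pairs[lo].Value);  // leaf: the ranking
        int mid = (lo + hi) / 2;
        // string.CompareOrdinal(s, pivotKey) < 0 ? <left half> : <right half>
        var compare = Expression.Call(
            typeof(string).GetMethod("CompareOrdinal",
                new[] { typeof(string), typeof(string) }),
            s, Expression.Constant(pairs[mid].Key));
        return Expression.Condition(
            Expression.LessThan(compare, Expression.Constant(0)),
            Build(pairs, lo, mid, s),
            Build(pairs, mid, hi, s));
    }
}
```

Like the hand-written Lookup above, the compiled delegate assumes the key is one of the known strings; feed it an unknown key and it will silently return one of the existing values.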

On my machine, looking up each value in a tight loop 1 million times takes 70ms while the corresponding loop using Dictionary takes 697ms. That’s close to a 900% performance improvement.

This posting is provided "AS IS" with no warranties, and confers no rights.

Attachment: SwitchCompiler.cs
  • That's pretty impressive performance for a VM-based language. The C++ equivalent, 1 million iterations, runs in about 55 ms on a Pentium 4, 3.4 GHz. Here's the code:

    #include "stdafx.h"  // precompiled header; presumably pulls in <windows.h>, <algorithm>, <cstring>, and <iostream>

    struct KeyValue {
      const char* key;
      int value;
      inline bool operator < (const KeyValue& that) const {
        return strcmp(this->key, that.key) < 0;
      }
    };

    static const KeyValue kvs[] = {
      {"bashful", 6},
      {"doc", 7},
      {"dopey", 9},
      {"grumpy", 2},
      {"happy", 9},
      {"sleepy", 8},
      {"sneezy", 2},
    }, * const kvs_end = kvs + (sizeof(kvs) / sizeof(0[kvs]));

    inline int Lookup(const char* s) {
      KeyValue key = { s };
      return std::lower_bound(kvs, kvs_end, key)->value;
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
      LARGE_INTEGER begin, end, frequency;
      QueryPerformanceCounter(&begin);
      const KeyValue* k = kvs;
      int total = 0;  // Just so the compiler doesn't completely optimize out the for loop.
      for (int i = 0; i < 1000000; ++i) {
        total += Lookup(k->key);
        if (++k == kvs_end)
          k = kvs;
      }
      QueryPerformanceCounter(&end);
      QueryPerformanceFrequency(&frequency);
      __int64 ms = (end.QuadPart - begin.QuadPart) * 1000 / frequency.QuadPart;
      cout << "Total:\t" << total << endl;
      cout << "Milliseconds:\t" << ms << endl;
    }

    int _tmain(int argc, _TCHAR* argv[])

    {

     LARGE_INTEGER begin, end, frequency;

     QueryPerformanceCounter(&begin);

     const KeyValue* k = kvs;

     int total = 0;  // Just so the compiler doesn't completely optimize out the for loop.

     for (int i = 0; i < 1000000; ++i) {

       total  += Lookup(k->key);

       if (++k == kvs_end)

         k = kvs;

     }

     QueryPerformanceCounter(&end);

     QueryPerformanceFrequency(&frequency);

     __int64 ms = (end.QuadPart - begin.QuadPart) * 1000 / frequency.QuadPart;

     cout << "Total:\t" << total << endl;

     cout << "Milliseconds:\t" << ms << endl;

    }

    One minor benefit of the C++ code is that there is no overhead of creating a lookup table before the first call to Lookup().

  • Hi Jeff,

    I tried your code on my machine and amazingly got exactly 55ms. That makes your orange look a little more like my apple.

    Your comment made me realize my implementation has one more feature that I could possibly throw away: there's an array bounds check on the string index operations. I think I could get rid of this by emitting 'unsafe' code. Unfortunately, I think I'd have to bypass the expression compiler to do this.

    I seem to be within spitting distance of optimized C++, that's pretty cool considering how amazing the VS C++ optimizer is.

    Jomo

  • Here's my test code in case anyone wants to reproduce my result:

    /*
    Use of included script samples are subject to the terms specified at http://www.microsoft.com/resources/sharedsource/licensingbasics/permissivelicense.mspx
    Written by Jomo Fisher
    */
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Reflection;
    using FastSwitch;

    namespace ConsoleApplication1 {
        class Program {
            static void Main(string[] args) {
                var dict = new Dictionary<string, int> {
                        {"happy", 9}, {"sneezy", 2},
                        {"doc", 7}, {"sleepy", 8},
                        {"dopey", 9}, {"grumpy", 2},
                        {"bashful", 6}
                    };
                const int count = 1000000;
                DateTime start = DateTime.Now;
                for (int i = 0; i < count; ++i) {
                    var v = dict["happy"];
                    v = dict["sneezy"];
                    v = dict["doc"];
                    v = dict["sleepy"];
                    v = dict["dopey"];
                    v = dict["grumpy"];
                    v = dict["bashful"];
                }
                TimeSpan elapsed = DateTime.Now - start;
                Console.WriteLine("Dictionary: {0} ms", elapsed.TotalMilliseconds);

                var compiled = SwitchCompiler.CreateSwitch(dict);
                start = DateTime.Now;
                for (int i = 0; i < count; ++i) {
                    var v = compiled("doc");
                    System.Diagnostics.Debug.Assert(v == 7);
                    v = compiled("happy");
                    System.Diagnostics.Debug.Assert(v == 9);
                    v = compiled("sneezy");
                    System.Diagnostics.Debug.Assert(v == 2);
                    v = compiled("sleepy");
                    System.Diagnostics.Debug.Assert(v == 8);
                    v = compiled("dopey");
                    System.Diagnostics.Debug.Assert(v == 9);
                    v = compiled("grumpy");
                    System.Diagnostics.Debug.Assert(v == 2);
                    v = compiled("bashful");
                    System.Diagnostics.Debug.Assert(v == 6);
                }
                elapsed = DateTime.Now - start;
                Console.WriteLine("Compiled: {0} ms", elapsed.TotalMilliseconds);
            }
        }
    }


  • Looks like you want a perfect hash or a minimal perfect hash? See e.g. http://burtleburtle.net/bob/hash/perfect.html

    I don't know about a C# implementation, though.

  • The C++ code and the C# code are timing two different things.

    The C++ code cycles through the keys for a total of 1,000,000 lookups; the C# code looks up each key 1,000,000 times for a total of 7,000,000 lookups.

  • Justin,

    You're right. I should have noticed the C++ code was doing 1/7th as many lookups as the C# code. When I adjust the C++ code, it dutifully reports 385 ms, which is pretty close to what the plain old .NET Dictionary was doing.

    So I guess the LINQ algorithm is in the catbird seat, at least for the time being.

    I also noticed the C++ answer requires that the set of keys be sorted alphabetically before lookups can commence, so there *is* a little preparation time.

  • Tom,

    I had considered a minimal perfect hash (and I had even looked at the exact page you sent me). I still think the LINQ algorithm should perform better in many cases: that perfect hashing algorithm is O(n) in the number of bytes of the input key, while the LINQ algorithm's decision tree examines only a few characters, making it O(log N) in the number of keys. In particular, I think the LINQ algorithm will outpace the other when the keys are long and their lengths are non-uniform.

    Of course, big-O notation can hide a potentially large constant factor, but I don't see why it would in this case.

    Also, keep in mind that that particular minimal perfect hash implementation throws away a feature I wanted to keep: it can't easily accommodate key sets that are available only at run time.

  • The C++ compiler optimized out the entire loop in my first version.  I added the total variable, and it still optimized out the whole loop.  I added the cout statement at the bottom and it finally executed the code.

    Any sign the C# compiler is optimizing out the whole loop in the C# code?

  • The C# compiler is definitely not removing the loop--for one, it has no way to determine whether the delegate has side-effects that need to be preserved.

    This can be seen by increasing the number of iterations--2M loops run in twice the time as 1M.

  • Wow, your switch compiler and the .NET VM are very impressive indeed. My fastest run of the following C++ code was 88 ms. Note also that I kind of cheated and used std::find instead of std::lower_bound, because when there are only 1 or 2 elements in the vector, std::find runs much faster.

    #include "stdafx.h"

    struct KeyValue {
      const char* key;
      int value;
    };

    struct CharValue {
      int char_;
      bool terminal_;
      union {
        int value_;
        size_t next_index_;
        vector<CharValue>* next_;
      };
      ptrdiff_t char_step_;
    };

    inline bool operator < (const CharValue& value, int c) {
      return value.char_ < c;
    }
    inline bool operator < (int c, const CharValue& value) {
      return c < value.char_;
    }
    inline bool operator == (const CharValue& value, int c) {
      return value.char_ == c;
    }

    static const KeyValue kvs[] = {
      {"bashful", 6},
      {"doc", 7},
      {"dopey", 9},
      {"grumpy", 2},
      {"happy", 9},
      {"sleepy", 8},
      {"sneezy", 2},
    }, * const kvs_end = kvs + (sizeof(kvs) / sizeof(0[kvs]));

    class KnownSetTrie {
    public:
      typedef vector<CharValue> CharMap;
      vector<CharMap> tree_;
      ptrdiff_t first_step_;

      template <typename Iter>
      KnownSetTrie(Iter begin, Iter end) {
        first_step_ = Compile(begin, end, 0) - 1;
        // Now that the tree_ is stable, convert indices to node pointers.
        for (vector<CharMap>::iterator i = tree_.begin(); i != tree_.end(); ++i) {
          for (CharMap::iterator j = i->begin(); j != i->end(); ++j)
            if (!j->terminal_)
              j->next_ = &tree_[j->next_index_];
        }
      }

      template <typename Iter>
      ptrdiff_t Compile(Iter begin, Iter end, ptrdiff_t char_index) {
        CharMap map;
        Iter start = begin, i = begin + 1;
        while (true) {
          CharValue value;
          value.char_ = start->key[char_index];
          if (i != end && i->key[char_index] == start->key[char_index]) {
            do {
              ++i;
            } while (i != end && i->key[char_index] == start->key[char_index]);
            value.terminal_ = false;
            value.char_step_ = Compile(start, i, char_index + 1);
            value.next_index_ = tree_.size() - 1;
          } else {
            value.terminal_ = true;
            value.value_ = start->value;
          }
          map.push_back(value);
          if (i == end)
            break;
          start = i++;
        }
        if (1 == map.size() && !map.front().terminal_) {
          // Pass-through node:
          return map.front().char_step_ + 1;
        }
        tree_.push_back(map);
        return 1;
      }

      inline int Lookup(const char* s) const {
        const CharMap* node = &tree_.back();
        s += first_step_;
        while (true) {
          const CharMap::const_iterator value = find(node->begin(), node->end(), *s);
          if (value->terminal_)
            return value->value_;
          s += value->char_step_;
          node = value->next_;
        }
      }
    };

    int _tmain(int argc, _TCHAR* argv[])
    {
      KnownSetTrie trie(kvs, kvs_end);
      LARGE_INTEGER begin, end, frequency;
      QueryPerformanceCounter(&begin);
      int total = 0;  // Just so the compiler doesn't completely optimize out the for loop.
      for (int i = 0; i < 1000000; ++i) {
        for (const KeyValue* k = kvs; k != kvs_end; ++k)
          total += trie.Lookup(k->key);
      }
      QueryPerformanceCounter(&end);
      QueryPerformanceFrequency(&frequency);
      __int64 ms = (end.QuadPart - begin.QuadPart) * 1000 / frequency.QuadPart;
      cout << "Total:\t" << total << endl;
      cout << "Milliseconds:\t" << ms << endl;
    }

  • It seems the LINQ guys have done a great job of optimization inside of the LINQ namespace (System.Linq)....

  • I got the C++ code down to 70 ms.  It took a __fastcall and some fun struct packing, though.

    I can't wait to use C# .Net!

    #include "stdafx.h"

    struct KeyValue {
      const char* key;
      int value;
    };

    struct CharValue {
      int char_ : 8;
      int terminal_ : 1;
      int char_step_ : 23;
      union {
        int value_;
        size_t next_index_;
        vector<CharValue>* next_;
      };
    };

    inline bool operator < (const CharValue& value, int c) {
      return value.char_ < c;
    }
    inline bool operator < (int c, const CharValue& value) {
      return c < value.char_;
    }
    inline bool operator == (const CharValue& value, int c) {
      return value.char_ == c;
    }

    static const KeyValue kvs[] = {
      {"bashful", 6},
      {"doc", 7},
      {"dopey", 9},
      {"grumpy", 2},
      {"happy", 9},
      {"sleepy", 8},
      {"sneezy", 2},
    }, * const kvs_end = kvs + (sizeof(kvs) / sizeof(0[kvs]));

    class KnownSetTrie {
    public:
      typedef vector<CharValue> CharMap;
      vector<CharMap> tree_;
      ptrdiff_t first_step_;
      CharMap* root_;

      template <typename Iter>
      KnownSetTrie(Iter begin, Iter end) {
        first_step_ = Compile(begin, end, 0) - 1;
        // Now that the tree_ is stable, convert indices to node pointers.
        for (vector<CharMap>::iterator i = tree_.begin(); i != tree_.end(); ++i) {
          for (CharMap::iterator j = i->begin(); j != i->end(); ++j)
            if (!j->terminal_)
              j->next_ = &tree_[j->next_index_];
          root_ = &tree_.back();
        }
      }

      template <typename Iter>
      ptrdiff_t Compile(Iter begin, Iter end, ptrdiff_t char_index) {
        CharMap map;
        Iter start = begin, i = begin + 1;
        while (true) {
          CharValue value;
          value.char_ = start->key[char_index];
          if (i != end && i->key[char_index] == start->key[char_index]) {
            do {
              ++i;
            } while (i != end && i->key[char_index] == start->key[char_index]);
            value.terminal_ = false;
            value.char_step_ = Compile(start, i, char_index + 1);
            value.next_index_ = tree_.size() - 1;
          } else {
            value.terminal_ = true;
            value.value_ = start->value;
          }
          map.push_back(value);
          if (i == end)
            break;
          start = i++;
        }
        if (1 == map.size() && !map.front().terminal_) {
          // Pass-through node:
          return map.front().char_step_ + 1;
        }
        tree_.push_back(map);
        return 1;
      }

      int __fastcall Lookup(const char* s) const {
        const CharMap* node = root_;
        s += first_step_;
        while (true) {
          const CharMap::const_iterator value = find(node->begin(), node->end(), *s);
          if (value->terminal_)
            return value->value_;
          s += value->char_step_;
          node = value->next_;
        }
      }
    };

    int _tmain(int argc, _TCHAR* argv[])
    {
      KnownSetTrie trie(kvs, kvs_end);
      LARGE_INTEGER begin, end, frequency;
      QueryPerformanceCounter(&begin);
      int total = 0;  // Just so the compiler doesn't completely optimize out the for loop.
      for (int i = 0; i < 1000000; ++i) {
        for (const KeyValue* k = kvs; k != kvs_end; ++k)
          total += trie.Lookup(k->key);
      }
      QueryPerformanceCounter(&end);
      QueryPerformanceFrequency(&frequency);
      __int64 ms = (end.QuadPart - begin.QuadPart) * 1000 / frequency.QuadPart;
      cout << "Total:\t" << total << endl;
      cout << "Milliseconds:\t" << ms << endl;
    }

  • Hey Jeff,

    That's a really nice solution. I was wondering whether a trie could help here.
