In this blog post I am going to share a string matching sample with you and describe the inner workings of the algorithm. On my machine with an NVIDIA GTX 580, the C++ AMP algorithm that I present below achieves up to a 35X speedup over the std::strstr algorithm from the standard C++ library.

In the first section I will clarify the string search problem that this algorithm was designed to solve. In the second, I will briefly describe the baseline algorithm: the CPU version that I used to check the correctness of the results and to compare performance against the C++ AMP version. Next, I will describe the approach taken for the C++ AMP algorithm, focusing on the most interesting design challenges that had to be solved, like finding the right algorithm or working with the “char” type in C++ AMP. Finally, I will end with a link to a zip file that contains the sample’s source code for you to play with.

#### The problem

The input to the algorithm is a set of *M* strings called “patterns” and a set of *N* strings called “records”. Each record is longer than each pattern. Each pattern is searched for within each record to find an exact match of the pattern within the record.

Note: Working with a set of patterns and a set of records rather than one pattern and one record makes this problem adequately sized for data parallel accelerators.

If the pattern is not found within the record, then the result for that string search is -1. Otherwise the result is the index of the first occurrence of the pattern within the record. The output of the algorithm is therefore defined as an array of *M*×*N* elements. The first *M* elements in the output array represent the results for each pattern against the first record, the next *M* elements contain the results for the second record, and so on.

For simplicity, all the patterns have the same length. Similarly, all the records have the same length. Finally, each pattern and record in the input array ends with 4 null terminating characters (I will explain why later in the post).

Here is an example:

Patterns: kitty\0\0\0\0puppy\0\0\0\0

Records: kitty and puppy\0\0\0\0puppy, elephant\0\0\0\0

Output: 0, 10, -1, 0

The first pattern, “kitty”, can be found in the first record at index 0. The second pattern, “puppy”, can be found at index 10 within the first record. The pattern “kitty” cannot be found in the second record, which gives us the result -1, and the first occurrence of “puppy” in the second record is located at index 0.

#### CPU Solution with strstr

The std::strstr algorithm from the standard C++ library was picked as the performance baseline and for verifying the correctness of the C++ AMP algorithm. std::strstr has been carefully optimized (including acceleration with SSE instructions) and rigorously tested, which makes it a solid baseline.

Since we are dealing with a set of patterns and a set of records, all we need to do is introduce two nested loops: the outer one iterates over all records and the inner one over all patterns. In the loop body we check whether the record from the outer loop contains the pattern from the inner loop. The code for that would be:

```cpp
for (int r = 0; r < num_records; ++r)
    for (int p = 0; p < num_patterns; ++p) {
        const char* hit = strstr(records[r], patterns[p]); // NULL when not found
        result[r][p] = hit ? (int)(hit - records[r]) : -1; // pointer to index
    }
```

#### Solution with C++ AMP

##### Selecting the right algorithm

The C++ AMP solution is based on the Rabin-Karp algorithm. We compute the hash value of the pattern and then compute hashes for all possible positions within the record. If the hash of the pattern matches the hash of the record at some position, then we have a potential match. We cannot be certain, because different strings can produce the same hash value. That is why all potential matches have to be fully verified by a character-by-character comparison.
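To make the idea concrete, here is a minimal, CPU-only sketch of Rabin-Karp in plain C++. The hash parameters and function names are illustrative; the sample's actual rolling hash may differ.

```cpp
#include <cstdint>
#include <string>

// Illustrative polynomial hash base (the sample may use different constants).
const uint32_t BASE = 257;

// Hash of the first len characters of s (arithmetic is implicitly mod 2^32).
uint32_t hash_of(const char* s, int len) {
    uint32_t h = 0;
    for (int i = 0; i < len; ++i)
        h = h * BASE + static_cast<uint8_t>(s[i]);
    return h;
}

// Return the index of the first verified occurrence of pattern in record,
// or -1 if the pattern does not occur.
int rabin_karp(const std::string& record, const std::string& pattern) {
    const int m = (int)pattern.size(), n = (int)record.size();
    if (m == 0 || m > n) return -1;
    uint32_t ph = hash_of(pattern.c_str(), m);
    uint32_t rh = hash_of(record.c_str(), m);
    uint32_t high = 1;                       // BASE^(m-1), to drop the leading char
    for (int i = 1; i < m; ++i) high *= BASE;
    for (int i = 0; i + m <= n; ++i) {
        // Potential match: verify character by character before reporting.
        if (rh == ph && record.compare(i, m, pattern) == 0)
            return i;
        if (i + m < n)                       // roll: remove record[i], append record[i+m]
            rh = (rh - static_cast<uint8_t>(record[i]) * high) * BASE
                 + static_cast<uint8_t>(record[i + m]);
    }
    return -1;
}
```

Note how the rolling update costs only a handful of arithmetic operations per position, instead of re-hashing the whole substring.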

There are two reasons why Rabin-Karp is a good algorithm for the C++ AMP version:

1) String patterns converted to hash values (32-bit integers) have a smaller memory footprint, so if we have fewer than 4096 of them, we can fit them into constant memory and benefit from the fast on-chip constant caches. Occupancy will improve too, as read-only data in constant memory can be accessed directly without occupying registers. As an added bonus we also save a tiny bit on data transfer, since we have less data to copy in.

2) The computation in Rabin-Karp is compute bound rather than memory bound, as in the naïve string search. Let me explain. In the innermost loop of the naïve string search algorithm, we read a character of a record and a character of a pattern and compare them. Since the records are stored in global memory, most of our time would be spent on read accesses. In Rabin-Karp, on the other hand, the innermost loop compares a hash value of the record against all the hash values of the patterns, so the latency of reading the record from global memory is hidden by the many comparison operations that follow it.

##### Working with the “char” type in C++ AMP

The next important decision is how to transfer the records to the accelerator for processing. We have two basic approaches:

1) Use textures of “char” type, and extract each character inside the kernel as an integer value.

2) Use an array or array_view of “unsigned int”, and perform bit manipulation to extract the individual characters inside the kernel. One of our earlier blog posts discussed the technique of supporting character types by emulating them with integers.

Both approaches would do well in terms of the kernel’s computational performance. I decided to go with the latter approach, as it allows us to generate the initial records for our string problem in a staging array and then easily adapt that staging array with an array_view. By using a staging array for the records, we cut down the data transfer time, as the copy operation can be done directly from the staging array to the accelerator’s memory. As a consequence of taking this approach, I decided to use 4 null terminating characters after each record, which simplifies the unpacking logic (no two records occupy the same integer). For consistency I kept the same format for the patterns, although it was not necessary, since we compute the hash values for them and they are never copied to the accelerator’s memory.
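As a rough illustration of the packed layout (the helper names here are hypothetical; the sample's staging array code differs), each record can be packed into 32-bit words and an individual character recovered with shifts and masks, assuming little-endian byte order:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Pack a string into 32-bit words, appending 4 null terminators so that
// no two records ever share a word (illustrative helper, not the sample's API).
std::vector<uint32_t> pack(const char* s) {
    const size_t n = std::strlen(s);
    const size_t padded = n + 4;                      // 4 trailing '\0' bytes
    std::vector<uint32_t> words((padded + 3) / 4, 0u);
    std::memcpy(words.data(), s, n);                  // tail stays zeroed
    return words;
}

// Extract the character at byte index i, as the kernel would, using only
// bit manipulation (little-endian byte order assumed).
inline uint32_t char_at(const std::vector<uint32_t>& words, int i) {
    return (words[i / 4] >> ((i % 4) * 8)) & 0xFFu;
}
```

Inside the kernel the same shift-and-mask expression runs against the array_view of unsigned ints, so no character type is ever needed on the accelerator.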

##### Dealing with false-positives

Due to the nature of the Rabin-Karp algorithm, we need an extra post-processing stage in which we verify all tentatively identified matches. Assuming a small percentage of matches, this step should be a snap; therefore, the algorithm performs this computation on the CPU.

Notice that as the number of tentative matches increases (most of the patterns tentatively match most of the records), the Rabin-Karp stage on the accelerator becomes impractical, as all tentatively identified matches need to be rechecked in the post-processing stage on the CPU. Carefully crafting the rolling hash function used inside the Rabin-Karp algorithm can reduce the number of false positives. If, however, you expect a high percentage of positive matches, then the algorithm presented in this blog post will be ineffective.

Building on the assumption that there is a low number of matches, we can slightly change the output of the Rabin-Karp algorithm. Instead of storing the result for each pattern, we can store only whether a record matches any pattern. This way, there is a smaller output array to transfer back from the accelerator (an array with a single integer for every record). More importantly, the innermost loop of the Rabin-Karp algorithm makes a store to a register when a hash value of a pattern is equal to the hash value of the record; otherwise we would have to introduce an array in shared or global memory to hold the result for each pattern. The only downside of this change is that the host-side post-processing stage now has to do a little more work: if a record matches any pattern, we have to check that record against all the patterns. I decided to make this change because in testing it yielded performance gains for my inputs.
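A sketch of what the host-side verification stage might look like under this scheme (names are hypothetical): only records flagged by the accelerator are checked against all the patterns with std::strstr, while unflagged records are filled with -1 immediately.

```cpp
#include <cstring>
#include <vector>

// For every flagged record (flags[r] == 1), verify against every pattern and
// record the match index; unflagged records are known to contain no matches.
std::vector<int> verify(const std::vector<const char*>& records,
                        const std::vector<const char*>& patterns,
                        const std::vector<int>& flags) {
    const size_t M = patterns.size();
    std::vector<int> result(records.size() * M, -1);
    for (size_t r = 0; r < records.size(); ++r) {
        if (flags[r] == 0) continue;          // certainly no match, skip strstr
        for (size_t p = 0; p < M; ++p) {
            const char* hit = std::strstr(records[r], patterns[p]);
            if (hit)
                result[r * M + p] = static_cast<int>(hit - records[r]);
        }
    }
    return result;
}
```

In the sample this loop is additionally parallelized across CPU cores, since each record's verification is independent.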

##### Putting it all together

Let's recap all the steps of the C++ AMP solution:

- The C++ AMP algorithm computes the hashes for all patterns and places them in the constant memory of the accelerator.
- The records are copied to the accelerator as an array_view of integers.
- Inside the kernel, each thread is responsible for matching a single record against all patterns.
- Bit manipulation is used to unpack the characters of the record from the integers.
- The outer loop walks over every index inside the record and computes a rolling hash of successive substrings of the pattern length.
- In the innermost loop, the hash for that index is checked against all pattern hashes.
- The outer loop continues until all indices of the record are checked.
- The Rabin-Karp computation on the accelerator returns an array with a 1 or 0 for each record: 1 signifies a potential match with at least one pattern, and 0 gives us certainty that the record contains no match for any pattern.
- For each potential match detected on the accelerator, we run multicore std::strstr on the host side to double-check the identified match and to expand our results array with match/no-match results for the individual patterns.
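The per-thread work described above can be sketched in plain C++ as follows (hypothetical names; the real kernel operates on the packed integers and reads the pattern hashes from constant memory):

```cpp
#include <cstdint>
#include <vector>

// CPU sketch of one accelerator thread's work: roll a hash over the record
// and return 1 if any pattern hash matches some substring hash, else 0.
int match_record(const uint8_t* record, int record_len,
                 const std::vector<uint32_t>& pattern_hashes,
                 int pattern_len, uint32_t base) {
    uint32_t h = 0, high = 1;
    for (int i = 0; i < pattern_len; ++i) h = h * base + record[i];
    for (int i = 1; i < pattern_len; ++i) high *= base;  // base^(len-1)
    for (int i = 0; i + pattern_len <= record_len; ++i) {
        // Innermost loop: compare this substring hash to every pattern hash.
        for (uint32_t ph : pattern_hashes)
            if (h == ph) return 1;                       // tentative match
        if (i + pattern_len < record_len)                // roll the hash forward
            h = (h - record[i] * high) * base + record[i + pattern_len];
    }
    return 0;
}
```

Because a thread stores only a single 0/1 result, the tentative match fits in a register and no per-pattern array in shared or global memory is needed.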

The most critical steps that boost the performance of the C++ AMP version:

- Algorithmic efficiency of the Rabin-Karp approach and its compute bound nature.
- Use of constant memory for the pattern hashes.
- Large problem size that kept the accelerator busy (many records and many patterns).
- Use of staging array for the records.
- Data transfer of the records in packed form, rather than with each ‘char’ represented as a 4-byte integer.

#### Download the sample

I encourage you to download the Visual Studio sample project and study the algorithm that I described in this blog post. Please remember to use the “Release” configuration when running the benchmark. Any feedback or comments are welcome, either below or on the forum.