If you haven’t read previous posts in this series about auto-vectorization, you may want to begin at the beginning.
In the last post, we explained the rules for C++ loop headers that make your code amenable to auto-vectorization; we might say, make it auto-vectorizer-friendly. Prompted by an online question from NitPick (our astute reader, who seems to enjoy dual citizenship in the blog world, and in the real world), let’s tidy off that discussion with a further, fourth rule:
Consider the following, proto-typical loop:
This will vectorize. And likewise, if we change int to size_t, it will also vectorize. (In the latter case, you would get a C4018 warning that n < U is a signed/unsigned mismatch, unless you were careful to define U as an unsigned type)
[NitPick asks: What about other data types as loop counter? Answer: we have char, short, int, long and long long; together with their 5 unsigned cousins. Some of these do indeed inhibit auto-vectorization in this first release, related to the indexing operations a[n], b[n] and c[n] within the loop body. You can run the experiment quite easily, testing all 10 cases. You will find that any loop variable, which is unsigned, and narrower than the platform’s natural integer size, will inhibit vectorization. So, on x86, where the natural integer is 32 bits wide, unsigned char or unsigned short, don’t work. On x64, where the natural integer is 64 bits wide, unsigned char, unsigned short, unsigned long don’t work. (Recall that long is just 32 bits wide). But it’s simpler to use int or size_t and avoid the headache]
OK, if you follow these four simple rules for the loop header, the auto-vectorizer can start its next step of working out how to vectorize the loop body. So let’s now explore the patterns of loop body the auto-vectorizer recognizes.
Remember, a couple of posts ago, I left you with a puzzle? – a list of six, one-line loops, and asked you to work out which ones might vectorize and which not? In fact, the auto-vectorizer hits all but one of them. Let’s explain the examples, one-by-one.
This loop is very simple – in the literature of parallel processing, it’s called “embarrassingly parallel”; I’ve also come across the description “delightfully parallel”. Each iteration of the loop operates upon memory locations that are different from every other iteration. There is no overlap, or interference between iterations. For example, when n = 8, we have a = b + c. And there is no other value of n for which this loop accesses those same three locations.
[NitPick objects that the array called a could overlap array b in memory. This would cause iterations to interfere with each other. This problem – where the same memory locations can be accessed by different names in the source program – is called aliasing. It’s a bane in the life of every optimizing compiler. The auto-vectorizer includes runtime checks against such aliasing. We will explain those in a later post. For the moment, let’s assume that our examples do not include any aliasing. Thank you NitPick, for keeping us on the right path]
The auto-vectorizer will transform this loop into its equivalent (we are using the same ‘slicing’ pseudo-code introduced in previous posts):
This new loop will execute 250 iterations, calculating 4 results each time. On the first iteration it will calculate a[0:3] = b[0:3] + c[0:3]. On the next, a[4:7] = b[4:7] + c[4:7]. And so on.
The only change from Example 1 is that the loop’s upper bound is 999, rather than 1000. It’s not divisible by 4 (remember, we are assuming our array elements are floats or ints, where each SSE vector register can hold 4 at-a-time). The auto-vectorizer transforms this loop as before, but is smart enough to know it must add scalar additions for the last 3 iterations. In effect, it transforms the code as follows:
This example is interesting because it includes a call to a math library function (sin, in this case). The auto-vectorizer includes vectorized versions of all the math functions defined in “math.h”. The vector library calculates four sins on each call. We thereby vectorize this loop too.
It might appear that this loop would pose a problem for the auto-vectorizer. Just think for a moment: how would you make use of vector instructions to implement this loop? . . .
In fact, the auto-vectorizer defines a variable, let’s call it seq for the sake of argument, as a short, 4-element array in memory, holding the values [0,1,2,3]
[NitPick asks what this notation means – a list of numbers enclosed in square brackets. Answer: we just invented it, to simplify our explanations. It’s not standard C++. In this blog post, it denotes a list of numbers, stored in consecutive locations – either in memory, or as adjacent fields within a vector register. NitPick: but, doesn’t this mean that  is ambiguous – since it could denote a vector with a single element, or the operation of indexing into element number 3 of an array? Yes, but we promise never to use a vector that has less than 2 elements in order to avoid this flaw in our notation, honest]
The auto-vectorizer transforms the original loop to look like this:
Line 1 is where our variable seq is set up to hold the four values [0,1,2,3]. Line 2 sets up the SSE register called XMM0 to hold the vector [0,0,0,0]. In line 4, we set a[n:n+3] to hold the vector addition of seq with XMM0. In line 5, we vector-add [4,4,4,4] into XMM0 (there’s an SSE instruction for that!). So we see that the expression seq + XMM0 will take on successive values [0,1,2,3], [4,5,6,7], etc. Quite neat, right?
[NitPick again: why did you call this example “Iota”? Answer: see iota]
Suppose that we have populated the array a by setting a[n] = n throughout. So a = [0,1,2,3,4…999]. The loop above, run on a single core, with no vectorization, will therefore compute the result a = [1,2,3,4…500]. Does this vectorize safely? In other words, does the following pseudo-code calculate the correct answer:
Let’s work it out, on paper. On the first iteration, we calculate a[0:3] = a[1:4]. In pseudo-code, this step is performed as:
In other words, we copy the values of a[1:4] into XMM0, then write those 4 values into a[0:3]. Similarly for the next iteration. A little doodling with pencil and paper should convince you that this transformation is safe. The auto-vectorizer reaches the same conclusion and performs this transformation.
Note that this calculation involved dependence between two different iterations of the same loop. To explain what this means, consider the original loop, and its first two iterations:
Notice how a is read on the first iteration, and then updated on the second. So we have to perform the calculations in this order. We cannot switch their order like this:
Because this would result in making a = 2 and then also making a = 2. There is dependence between the original first two iterations (in fact, between each iteration and its neighbor): the second original iteration Writes the value of a after the previous original iteration has Read that same value. Hardware engineers call this a Write-After-Read, or WAR, dependence. Compiler writers call it an anti-dependence. In the same way that we cannot switch the order of these two steps and get the right answer, neither can we run them in parallel on different processors. But we can run them in parallel in a vector register. We will touch upon the subject of dependence more, in a future post. For the moment, just be grateful that the auto-vectorizer takes care of these worries on your behalf.
This example differs from the previous only in that the right hand side of the assignment indexes [n-1] rather than [n+1]. The consequences are profound: this one does not vectorize safely! I’ve brought it into the picture, in order to break the illusion that the auto-vectorizer can conquer any old loop it comes across in your code.
Let’s look at what the simple scalar loop does. Again we’ll assume that we have populated a = [0,1,2,3,4…999]. We get the sequence of assignments:
So the loop will set all the elements 0 thru 499 to the value of a. A strange thing to do perhaps, but that is what the original code specified. What would happen if we blindly vectorize this loop? – if we transform it to become:
On the first iteration, we calculate a[1:4] = a[0:3]. Recall that, in pseudo-assembly-code, this step is performed as:
So, starting with a = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14...999], we obtain, after this first vectorized iteration, a = [0,0,1,2,3,5,6,7,8,9,10,11,12,13,14...999]. Already our answer differs wildly from the correct answer, which, at this stage, should be a = [0,0,0,0,0,5,6,7,8,9,10,11,12,13,14...999]. The following diagram depicts what’s going on:
The first row depicts the original a – the ‘before’ picture. The 4-element vector [0,1,2,3] shown in bold-red, is used to stomp over the original entries for a[1:4], yielding the second row – the after picture. The answer is already wrong. And it just gets worse as we complete successive vector iterations. Here we see the next step in the evolving carnage:
Anyhow, you get the idea: if the auto-vectorizer were to blindly vectorize this loop, it would produce wrong answers. So it doesn’t! Auto-vectorization is safe – it will never vectorize a loop if that would produce wrong answers – answers that differ from the single-thread, scalar calculation implied by the original C++ code.
At this point, you might ask how the auto-vectorizer can examine a C++ loop, and tell whether it’s safe to vectorize. However, that would lead us into a topic called “loop dependence analysis”. Not sure whether anyone is interested. Please add a comment below if you’d like a post explaining what’s going on.
Meantime, let’s keep with practical concerns: in the next post, I’ll explain how you find out which loops the auto-vectorizer vectorized, and which not.