I assume you have read previous posts in this blog about auto-vectorization, starting with auto-vectorization-overview.
In this post, we will look at how the auto-vectorizer works.
Here is a tiny program that calculates the cube of each element in array b, setting the answer into a matching array called a. Of course, this program makes no real sense, because we have not initialized b. But for this exercise, we are interested in the code generated by the compiler, which does not depend on the values held in b.
On line 2, we create the two C-arrays, called a and b, each holding 1000 floats, each 32 bits wide. On line 5, in the main function, we cube b into a, element-by-element. What kind of machine code will the compiler generate for this loop?
In essence, prior to Visual Studio 2012, each iteration of the loop – in particular, iteration number n – carries out 7 operations:
[The astute reader – let’s call him NitPick – might ask why I said “in essence” in the last sentence. Answer: the compiler will optimize this loop, using constant-folding, 6-fold loop-unroll and strength reduction. It will also insert multi-byte NOPs to align branch targets. But this simplified version is near enough to help understand what’s going on. In a later blog, we might dig deeper into the truth]
That was pre-Visual Studio 2012. But in Visual Studio 2012, the compiler will generate very different code for this loop, making use of the chip’s SSE instruction set, and its XMM vector registers. (For a reminder on what this means, see an earlier post on What Is Vectorization.
We can denote the scalar operation (where the compiler, previously, would slavishly calculate one element of the result at a time) as: a = b3
We can denote the vector operation (where the compiler calculates four elements, in parallel, at a time) as: a[4:7] = b[4:7]3
Here, a[4:7] means the elements a thru a taken all together in one chunk. Note that a[4:7] is not C++ syntax; it’s a neat notation, called array slicing, supported in several other languages, including FORTRAN. If a and b each holds 1000 elements, we must carry out this operation – not 1000 times, as before – but just 250 times, since we are chomping thru the arrays, 4 elements at a time.
So (again, in essence), each iteration of the vectorized loop – in particular, iteration number n – carries out 7 operations, but each does 4 times as much work as before:
[NitPick asks why I called the registers reg0 and reg1 in the scalar case. Answer: I was being coy: the real-life registers used are the low 32-bits of vector registers XMM0 and XMM1. I thought this might be confusing, so held back with the truth. But note that the operations performed are as I described – just one arithmetic multiplication on one element of b at-a-time]
So much for the theory. How much faster does the code run, in practice? On my PC, this example loop runs 2.9x faster after auto-vectorization. Not as much as the theoretical maximum of 4x, but still a welcome boost in speed.
[NitPick again: why is the text sprinkled with italic items like: floats, a[n] and so on? Answer: they denote items that would normally appear in code, rather than in English prose. Books normally use a typewriter font for this purpose, as in a[n+1], but I find these hard to spot, and often ugly. So I am using italic instead]
That’s it for the moment. As always, please feel free to ask questions in the comments below.
Meantime, let me leave you with a puzzle:
Auto-vectorization performs detailed analyses in order to work out whether it’s safe to vectorize any given loop. By “safe” we mean that a vectorized version of the loop yields the same answers as a scalar version of that same loop. Not all loops can be vectorized. Here are a few examples. See if you can work out which ones vectorize or not. (Think whether one iteration of a loop would interfere with other iterations of that loop). In each case, imagine that a, b and c are declared as 1000-element arrays of float: