Matrix multiplication is common and the algorithm is easy to implementation. Here is one example:
Version 1:template<typename T>void SeqMatrixMult1(int size, T** m1, T** m2, T** result){ for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { result[i][j] = 0; for (int k = 0; k < size; k++) { result[i][j] += m1[i][k] * m2[k][j]; } } }}
This implementation is straight-forward and you can find it in text book and many online samples.
Version 2:template<typename T>void SeqMatrixMult2(int size, T** m1, T** m2, T** result){ for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { T c = 0; for (int k = 0; k < size; k++) { c += m1[i][k] * m2[k][j]; } result[i][j] = c; } }}
This version will use a temporary to store the intermediate result. So we can save a lot of unnecessary memory write. Notice that the optimizer can not help here because it doesn't know whether "result" is an alias of "m1" or "m2".
Version 3:template<typename T>void Transpose(int size, T** m){ for (int i = 0; i < size; i++) { for (int j = i + 1; j < size; j++) { std::swap(m[i][j], m[j][i]); } }}template<typename T>void SeqMatrixMult3(int size, T** m1, T** m2, T** result){ Transpose(size, m2); for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { T c = 0; for (int k = 0; k < size; k++) { c += m1[i][k] * m2[j][k]; } result[i][j] = c; } } Transpose(size, m2);}
This optimization is tricky. If you profile the function, you'll find a lot of data cache miss. We transpose the matrix so that both m1[i] and m2[i] can be accessed sequentially. This can greatly improve the memory read performance.
Version 4:template<typename T>void SeqMatrixMult4(int size, T** m1, T** m2, T** result);// assume size % 2 == 0// assume m1[i] and m2[i] are 16-byte aligned// require SSE3 (haddpd)template<>void SeqMatrixMult4(int size, double** m1, double** m2, double** result){ Transpose(size, m2); for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { __m128d c = _mm_setzero_pd(); for (int k = 0; k < size; k += 2) { c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(&m1[i][k]), _mm_load_pd(&m2[j][k]))); } c = _mm_hadd_pd(c, c); _mm_store_sd(&result[i][j], c); } } Transpose(size, m2);}// assume size % 4 == 0// assume m1[i] and m2[i] are 16-byte aligned// require SSE3 (haddps)template<>void SeqMatrixMult4(int size, float** m1, float** m2, float** result){ Transpose(size, m2); for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { __m128 c = _mm_setzero_ps(); for (int k = 0; k < size; k += 4) { c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&m1[i][k]), _mm_load_ps(&m2[j][k]))); } c = _mm_hadd_ps(c, c); c = _mm_hadd_ps(c, c); _mm_store_ss(&result[i][j], c); } } Transpose(size, m2);}
For float types, we can use SIMD instruction set to parallel the data processing.
Parallel version using PPL (Parallel Patterns Library) and lambda in VC2010 CTP:template<typename T>void ParMatrixMult1(int size, T** m1, T** m2, T** result){ using namespace Concurrency; for (int i = 0; i < size; i++) { parallel_for(0, size, 1, [&](int j) { result[i][j] = 0; for (int k = 0; k < size; k++) { result[i][j] += m1[i][k] * m2[k][j]; } }); }}
Result
Here are the test results (what really matters is the relative time between different version):
Matrix size = 500 (Intel Core 2 Duo T7250, 2 cores, L2 cache 2MB)
Matrix size = 500 (Intel Xeon E5430, 4 cores, L2 cache 12MB)
From the results, you can find that: