What is the fastest way to transpose a matrix in C++?

This is a good question. There are many reason you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.

First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)

void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
        dst[n] = src[M*j + i];
    }
}

Now let’s see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.

for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}

That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first

transpose(B);
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*j+l];
        }
        C[K*i + j] = tmp;
    }
}
transpose(B);

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that’s much more complicated.

I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don’t know if that will be helpful in this case but I could image gathering a column and writing out a row. Maybe it will make the transpose unnecessary.

For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is

Smear image horizontally
transpose output 
Smear output horizontally
transpose output

Here is a paper by Intel explaining that
http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use

void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n=0; n<M*N; n++) {
        int k = vec_size*(n/N/vec_size);
        int i = (n/vec_size)%N;
        int j = n%vec_size;
        B[n] = A[M*i + k + j];
    }
}

EDIT:

I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking – see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).

inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<block_size; i++) {
        for(int j=0; j<block_size; j++) {
            B[j*ldb + i] = A[i*lda +j];
        }
    }
}

inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}

The values lda and ldb are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000×1001 matrix I do something like this

#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);

float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000×1001 this returns ldb = 3008 and lda = 1008

Edit:

I found an even faster solution using SSE intrinsics:

inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
     _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
     _mm_store_ps(&B[0*ldb], row1);
     _mm_store_ps(&B[1*ldb], row2);
     _mm_store_ps(&B[2*ldb], row3);
     _mm_store_ps(&B[3*ldb], row4);
}

inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}

Leave a Comment