MPI hangs on MPI_Send for large messages

The details of the Cube class aren’t relevant here: consider a simpler version:

#include <mpi.h>
#include <cstdlib>
#include <iostream>

using namespace std;

int main(int argc, char *argv[]) {
    int size, rank;
    const int root = 0;

    int datasize = atoi(argv[1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != root) {
        int nodeDest = rank + 1;
        if (nodeDest > size - 1) {
            nodeDest = 1;
        }
        int nodeFrom = rank - 1;
        if (nodeFrom < 1) {
            nodeFrom = size - 1;
        }

        MPI_Status status;
        int *data = new int[datasize];
        for (int i=0; i<datasize; i++)
            data[i] = rank;

        cout << "Before send" << endl;
        MPI_Send(data, datasize, MPI_INT, nodeDest, 0, MPI_COMM_WORLD);
        cout << "After send" << endl;
        MPI_Recv(data, datasize, MPI_INT, nodeFrom, 0, MPI_COMM_WORLD, &status);

        delete [] data;

    }

    MPI_Finalize();
    return 0;
}

where running gives

$ mpirun -np 4 ./send 1
Before send
After send
Before send
After send
Before send
After send
$ mpirun -np 4 ./send 65000
Before send
Before send
Before send

If in DDT you looked at the message queue window, you’d see everyone is sending, and no one is receiving, and you have a classic deadlock.

MPI_Send’s semantics are, weirdly, not precisely defined: the call is allowed to block until “the receive has been posted”. MPI_Ssend is clearer in this regard; it always blocks until the matching receive has been posted. Details about the different send modes are described in the MPI standard.

The reason it worked for smaller messages is an accident of the implementation: for “small enough” messages (in your case, apparently under 64 kB), your MPI_Send implementation uses an “eager send” protocol and doesn’t block on the receive. For larger messages, where it isn’t necessarily safe to keep buffered copies of the message kicking around in memory, the send waits for the matching receive (which it is always allowed to do anyway).

There are a few ways to avoid this; the key is to make sure not everyone calls a blocking MPI_Send at the same time. You could (say) have even-ranked processes send first and then receive, and odd-ranked processes receive first and then send. You could use nonblocking communications (Isend/Irecv/Waitall). But the simplest fix in this case is MPI_Sendrecv, a blocking (send + receive), rather than a blocking send followed by a blocking receive: the send and receive execute concurrently, and the call blocks until both are complete. So this works:

#include <mpi.h>
#include <cstdlib>
#include <iostream>

using namespace std;

int main(int argc, char *argv[]) {
    int size, rank;
    const int root = 0;

    int datasize = atoi(argv[1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != root) {
        int nodeDest = rank + 1;
        if (nodeDest > size - 1) {
            nodeDest = 1;
        }
        int nodeFrom = rank - 1;
        if (nodeFrom < 1) {
            nodeFrom = size - 1;
        }

        MPI_Status status;
        int *outdata = new int[datasize];
        int *indata  = new int[datasize];
        for (int i=0; i<datasize; i++)
            outdata[i] = rank;

        cout << "Before sendrecv" << endl;
        MPI_Sendrecv(outdata, datasize, MPI_INT, nodeDest, 0,
                     indata, datasize, MPI_INT, nodeFrom, 0, MPI_COMM_WORLD, &status);
        cout << "After sendrecv" << endl;

        delete [] outdata;
        delete [] indata;
    }

    MPI_Finalize();
    return 0;
}

Running gives

$ mpirun -np 4 ./send 65000
Before sendrecv
Before sendrecv
Before sendrecv
After sendrecv
After sendrecv
After sendrecv
