Tuesday, October 13, 2009

About MPI Send/Receive Deadlocks

In this post I will try to describe potential problems with the synchronous Send/Receive communication primitives in MPI and how to resolve them. In particular, I will talk about the MPI.NET implementation, but the lessons can be applied to any implementation.

Consider a simple MapReduce-style application. In this application the main process sends N messages to M available worker processes and waits for the results. There are no cyclic dependencies and no reentrancy - just a flat model. So the first intention is to use the Send/Receive methods to organize the message passing. To keep things simple, assume there are only two processes (one main and one worker). The MPI program may look like the following:


// Requires "using MPI;" at the top of the file.
var taskCount = 10;   // number of task messages to send
var msgLen = 1024;    // payload size in bytes; try 1024 * 1024 to reproduce the hang

using (var env = new Environment(ref args))
{
    var comm = Communicator.world;

    if (comm.Rank == 0)
    {
        // Main process: send all tasks first, then collect the results.
        for (var i = 0; i < taskCount; i++) comm.Send(new byte[msgLen], 1, 1);
        for (var i = 0; i < taskCount; i++) comm.Receive<byte[]>(1, 1);
    }
    else if (comm.Rank == 1)
    {
        // Worker process: receive a task, send a result back.
        for (var i = 0; i < taskCount; i++)
        {
            comm.Receive<byte[]>(0, 1);
            comm.Send(new byte[msgLen], 0, 1);
        }
    }
}
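
To reproduce the behavior, the program needs exactly two processes. Assuming the compiled executable is named MpiSample.exe (a name chosen purely for illustration), it would typically be launched like this:

    mpiexec -n 2 MpiSample.exe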


Surprisingly, with small messages this program works fine. However, setting msgLen to, for example, 1 MB makes the program hang. Actually, this is expected behavior, because the Send and Receive functions are supposed to be synchronous: Send should wait for the corresponding Receive in the other process. The biggest question is why it behaves differently for different message sizes. The answer comes from the MPI specification:

"The send call described in Section 3.2.1 uses the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver."

So, even if the Send function is considered synchronous, there are more details hidden inside it. Interprocess communication is not divided into just synchronous and asynchronous; there are several different modes, and the MPI Send function covers two of them at once. It is left to the implementation to decide which mode to use in each concrete case. The MPI.NET documentation for the Send method says the following:

"The basic Send operation will block until this message data has been transferred from value. This might mean that the Send operation will return immediately, before the receiver has actually received the data. However, it is also possible that Send won't return until it matches a Receive<(Of <(T>)>)(Int32, Int32) operation. Thus, the dest parameter should not be equal to Rank, because a send-to-self operation might never complete."

So, in some cases the Send method is partially asynchronous, and these are the cases of small messages. An interesting question is the threshold at which Send becomes fully synchronous, and whether this threshold is fixed or environment-specific. It is pretty hard to figure out the exact algorithm, because MPI.NET delegates to native code. After some experiments, I found out that in my environment it is a bit less than 128000 bytes of payload.
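
One way to probe this threshold empirically is to time how long Send takes while the receiver deliberately delays posting its Receive: if Send returns well before the delay elapses, the message was buffered; if it takes roughly as long as the delay, Send really waited for the matching receive. The sketch below follows this idea; the message sizes, the tag, and the delay are arbitrary choices for the experiment, not anything prescribed by MPI or MPI.NET.

using (var env = new Environment(ref args))
{
    var comm = Communicator.world;
    var delayMs = 2000; // how long the worker waits before posting its Receive

    foreach (var size in new[] { 1024, 64 * 1024, 128 * 1024, 256 * 1024 })
    {
        comm.Barrier(); // start each probe from a common point

        if (comm.Rank == 0)
        {
            var watch = System.Diagnostics.Stopwatch.StartNew();
            comm.Send(new byte[size], 1, 1);
            watch.Stop();

            // Much less than delayMs => message was buffered; close to delayMs => truly synchronous.
            System.Console.WriteLine("{0} bytes: Send returned after {1} ms", size, watch.ElapsedMilliseconds);
        }
        else if (comm.Rank == 1)
        {
            System.Threading.Thread.Sleep(delayMs);
            comm.Receive<byte[]>(0, 1);
        }
    }
}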

Fixing this sample is easy and straightforward - just replace Send with ImmediateSend and wait for the asynchronous requests to complete. However, it is important not to rely on the partially asynchronous behavior of the Send primitive. Always test your MPI applications with different message sizes and read the documentation carefully.
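
A minimal sketch of the fixed main-process branch might look like the following, assuming the same taskCount, msgLen, and tag as above. Here each outstanding request is drained with Request.Wait(); collecting them in a RequestList and waiting on the whole batch should work as well. The worker branch can stay exactly as it was.

if (comm.Rank == 0)
{
    // Post all sends without blocking, then receive the results.
    var requests = new System.Collections.Generic.List<Request>();
    for (var i = 0; i < taskCount; i++)
        requests.Add(comm.ImmediateSend(new byte[msgLen], 1, 1));

    for (var i = 0; i < taskCount; i++)
        comm.Receive<byte[]>(1, 1);

    // Make sure every outstanding send has actually completed.
    foreach (var request in requests)
        request.Wait();
}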
