Tuesday, October 13, 2009

About MPI Send/Receive Deadlocks

In this post I will try to describe potential problems with the synchronous Send/Receive communication primitives in MPI and how to resolve them. In particular, I will talk about the MPI.NET implementation, but the lessons can be applied to any implementation.

Consider a simple MapReduce application, in which the main process sends N messages to M available worker processes and waits for the results. There are no cyclic dependencies and no reentrancy - just a flat model. So, the first intention is to use the Send/Receive methods to organize message passing. To make things simple, assume there are only two processes (one main and one worker). The MPI program may look like the following:

using (var env = new Environment(ref args))
{
    var comm = Communicator.world;

    if (comm.Rank == 0)
    {
        for (var i = 0; i < taskCount; i++) comm.Send(new byte[msgLen], 1, 1);
        for (var i = 0; i < taskCount; i++) comm.Receive<byte[]>(1, 1);
    }
    else if (comm.Rank == 1)
    {
        for (var i = 0; i < taskCount; i++)
        {
            comm.Receive<byte[]>(0, 1);
            comm.Send(new byte[msgLen], 0, 1);
        }
    }
}

Surprisingly, with small messages this program works fine. However, setting msgLen to, for example, 1 MB will make the program hang. Actually, this is expected behavior, because the Send and Receive functions are supposed to be synchronous: the Send function should wait for the corresponding Receive in the other process. The big question is why it behaves differently for different message sizes. The answer comes from the MPI specification:

"The send call described in Section 3.2.1 uses the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver."

So, even though the Send function is considered synchronous, there are more details hidden inside it. Interprocess communication is not divided into just synchronous vs. asynchronous. There are more modes, and the MPI Send function covers two of them at once. It is left to the implementation to decide which mode to use in each concrete case. The MPI.NET documentation of the Send method says the following:

"The basic Send operation will block until this message data has been transferred from value. This might mean that the Send operation will return immediately, before the receiver has actually received the data. However, it is also possible that Send won't return until it matches a Receive<T>(Int32, Int32) operation. Thus, the dest parameter should not be equal to Rank, because a send-to-self operation might never complete."

So, in some cases the Send method is partially asynchronous, and these are the cases of small messages. An interesting question is the threshold at which Send becomes fully synchronous, and whether this threshold is fixed or environment-specific. It is pretty hard to find out the exact algorithm, because MPI.NET delegates to native code. After some experiments, I found that in my environment it is a bit less than 128,000 bytes of payload.

Fixing this sample is easy and straightforward - just replace Send with ImmediateSend and wait for the asynchronous requests to complete. However, it is important to remember not to rely on the partially asynchronous behavior of the Send primitive. Always test your MPI applications with different message sizes and read the documentation carefully.
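A minimal sketch of the fix for the main process, using MPI.NET's ImmediateSend together with a RequestList to complete the outstanding sends. This is an untested illustration, reusing the taskCount and msgLen variables from the sample above:

```csharp
if (comm.Rank == 0)
{
    // Post all sends asynchronously; ImmediateSend returns a Request
    // immediately, regardless of the message size.
    var requests = new RequestList();
    for (var i = 0; i < taskCount; i++)
        requests.Add(comm.ImmediateSend(new byte[msgLen], 1, 1));

    // Receive the results, then make sure every send has completed.
    for (var i = 0; i < taskCount; i++)
        comm.Receive<byte[]>(1, 1);
    requests.WaitAll();
}
```

With the sends posted asynchronously, the worker's Receive calls can always be matched, so neither process ends up blocked on an unmatched large-message Send.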

Saturday, August 1, 2009

Distributed Excel or Excel on the Grid

It is a well-known fact that Excel supports multithreaded workbook recalculation starting from version 2007. However, even before that, it was possible to execute a workbook in a distributed way. Distributed calculation of Excel workbooks is not well integrated with Office and is not a seamless process, but the possibility exists. Since it is not integrated with Office, external applications and frameworks are required. Typically, these external applications are computational grids, like Microsoft HPC Server or Platform Symphony.

There are two approaches to executing an Excel workbook on a grid in a distributed fashion. The first is to use Excel as a UI for entering input data and displaying computation results. All computation logic is located in User Defined Functions (UDFs) in external XLL or automation add-in libraries. The libraries can be implemented in any appropriate language and platform: for XLLs this can be C or C++, for automation add-ins anything that supports COM, like C++ or .NET/C#/VB/F#. The approach with UDFs is pretty straightforward to implement. The developer has an API to the grid and can easily implement the logic of splitting a big job into small tasks and submitting them to the grid. To run a distributed computation, the user of the Excel worksheet just invokes the UDF with the job's input parameters. This approach is good for the UDF developer, but very inconvenient for the Excel worksheet user. The main disadvantages are limited flexibility, the lack of a generic framework, and the fact that an Excel/VBA user cannot implement the distributed computation herself, since it lives in C/C++/.NET.

Luckily, there is another approach, where grids support executing the Excel workbooks themselves on the cluster. In this case, a workbook contains both the UI and the computation logic in cell formulas or VBA. Almost all solutions support distributing a workbook by data. For example, assume there is a Monte Carlo simulation that accepts a number of iterations or a long list of values as input. The framework may split these values into equal parts, copy Excel documents with the partial inputs to the cluster, execute the documents on compute nodes, and return the results to the main document, which initiated the computation. The advantage is obvious - the computation logic can be implemented in the Excel workbook itself with formulas or VBA macros. Unfortunately, there are no general-purpose frameworks that fully implement this approach. Existing solutions are implemented for some well-defined use cases, like parametric sweeps or simple Monte Carlo computations.

These two approaches are implemented by many computational and data grid vendors, which are listed below. Most of the grids support only the first approach, because it simply doesn't require any special effort from the grid vendor. They just create a couple of demos demonstrating distributed Excel with UDFs. Generally, this approach can be implemented on top of any grid, but only those with demos are mentioned:

Microsoft HPC Server
There are several solutions from Microsoft for running distributed Excel on HPC Server. They implement both approaches discussed above.

The first approach, with UDFs, is represented by demos for sample models. There is no generic framework, because every user is supposed to implement her own UDF using a preferred language. As has been mentioned, custom UDFs have many advantages: effective parallelization, direct use of the HPC Server API, and rich programming environments (for example, .NET instead of Excel's VBA). Examples of this approach include the AsianOptions sample Monte Carlo model, the ADAPTER demo (blog post, video, discussion), and HPC and Excel Services, but with UDFs (only marketing information without any technical description - maybe this actually describes another approach/framework).

The second approach, which allows computing Excel workbooks on the cluster, is represented by HPC Excel Services. To make this solution work, SharePoint Excel Services must be installed on every compute node of the cluster. Unfortunately, it seems to support only parametric sweep-like models. The solution uses an Excel add-in to specify the job information and the input and output data mappings. HPC Excel Services supports only lists as input values for parallelization and only lists as output values. The documentation is pretty extensive, and the sample models are described very well. An interesting feature of the solution is that it supports submitting models from a SharePoint portal.

Platform Symphony
Platform supports both approaches as well - using UDFs, which implement the actual parallelization logic, and computing actual Excel workbooks on compute nodes. There is no general-purpose framework for this, and everything has to be customized for a concrete Excel model. There is a very good demo describing this approach. A sample Excel workbook with a description can be found here.

GigaSpaces XAP
GigaSpaces supports Excel models in a very interesting fashion. It supports running only UDFs, but since GigaSpaces is more an in-memory data grid than an HPC solution, the approach is oriented more toward continuous computations than toward batch job processing. Excel is treated as a UI for the GigaSpaces grid. All interaction with the grid and all parallelization work is done by UDFs. This approach resembles Excel Real-Time Server functionality. A demo called "Excel that Scales" demonstrates this approach. Obviously, the orientation toward sessions and continuous computations doesn't prevent users from implementing UDFs with batch jobs.

DataSynapse also supports Excel in some way, but there is not much information about it. Most likely, it is a demo of the first approach.

There is high demand for running heavy Excel calculations, so it is no surprise that many grid vendors support distributed Excel to some extent. There are several approaches to running Excel workbooks on a grid, from using Excel just as a UI and accessing the grid via an API from UDFs, to calculating actual workbooks with VBA code on compute nodes. Unfortunately, there is still no general-purpose, production-ready framework for seamlessly parallelizing workbooks with VBA, but there is hope that it will appear sooner or later.

Monday, March 23, 2009

Reflexil.NET - Ultimate Win in the Fight with Third-Party Libraries

It is a pretty common situation that third-party libraries do not work as expected. And it is also common that the most annoying artifacts are very small and easy to fix. However, without access to the source code, it is not possible to fix them. Reflection can help in some situations, but not always. The usual solution is to ask the library developers to fix the issue. However, that can take a lot of time, and sometimes these "defects" are not actually defects and exist by design to keep users from shooting themselves in the foot. These restrictions may be good, but that does not make them less disappointing. After all, in some cases users know better, and in those cases such restrictions become defects.

Recently, I found such a restriction in the SDK for Microsoft HPC Server. The SDK does not allow the developer to specify custom credentials when connecting to the cluster. The IScheduler.Connect method just uses the Windows credentials of the connecting process and throws a "The server has rejected the client credentials." exception if anything goes wrong. This means that a developer cannot easily experiment with the cluster from a remote development machine unless she is in the same domain. The issue is described in this topic. Unfortunately, in my situation, the answer did not work. The prospect of deploying applications to the cluster after each change and debugging them remotely was not very appealing. So I hoped to solve this issue somehow.

The first thing I decided to investigate was how the SDK connects to the cluster. With the help of .NET Reflector that was easy. I found out that the SDK uses .NET Remoting with a TCP channel. The only thing that separated me from connecting with custom credentials was adding them to the channel sink properties:

Pic.1. Original assembly - connecting with process credentials.

It was really disappointing. So close to painless remote access.

Everybody knows about Reflector, but a relatively small number of people know about a brilliant Reflector plug-in called Reflexil. This great tool allows developers to modify assemblies in a very easy and straightforward way. So, a bit of luck and here it is:

Pic.2. Modified assembly - connecting with custom credentials.

A couple of lines of code, and remote access with custom credentials worked perfectly. Of course, this version of the library should not be used in production, but at least a lot of pain at the development stage went away.
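For the curious, the change amounts to something like the following hypothetical sketch. The property names follow the documented .NET Remoting TCP channel sink properties, but the surrounding code and variable names are reconstructed for illustration, not copied from the modified assembly:

```csharp
// Hypothetical reconstruction: a .NET Remoting TCP client channel
// configured with explicit credentials via channel sink properties.
var props = new Hashtable();
props["secure"] = true;          // use authenticated connections
props["username"] = userName;    // assumed variables holding the
props["password"] = password;    // custom credentials
props["domain"] = domain;
ChannelServices.RegisterChannel(new TcpClientChannel(props, null), false);
```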

P.S.: A good Reflexil tutorial with pictures can be found on CodeProject.
P.P.S.: It took me another 20 minutes to get rid of the hardcoded password and change the IScheduler.Connect method to accept credentials. Reflexil is indeed very easy to use.

Friday, February 13, 2009

Distributed OSGi: Levels of Transparency

At the moment, RFC 119 defines 4 levels of transparency of Distributed OSGi for a developer. They range from completely transparent behavior, where no code has to be changed to use a remote OSGi container, to the possibility of handling distribution-software-specific exceptions. All of them, if I understood them correctly, can be described in the following way:

Picture 1. Transparency levels 1-4

There is no doubt that in most cases these levels will satisfy all developer needs. However, since the distributed communication technology is hidden from the developer, such a model may have a number of limitations. For example, it will be hard to make asynchronous method calls or to use communication-technology-specific features.

So, it may be worth considering support for less transparent models. For example, one where the client can obtain a remote endpoint or a reference to the proxy created by the distribution software:

Picture 2. Transparency level 5

This model can also be extended to a level where the OSGi proxy on the client side is created by the OSGi service developer, not by the framework. The current version of Riena may serve as an example of this case. Upd.: Newton, and hence Infiniflow, also support writing a kind of custom proxy.

Moreover, a fully non-transparent model can also be useful:

Picture 3. Transparency level 7

In this model, the service is exposed not through OSGi, but through the chosen distributed communication technology, like RMI. The service can also be registered in the OSGi service registry, but it can't be used as an OSGi service. Every client that wants to use it can track the service and, when it comes online, extract the required data and establish a direct connection to the service using a communication-framework-specific proxy.

This model suits legacy applications, where components already exist and communicate with each other using some distributed communication framework. The model can also be useful when the developer wants to use some communication-framework-specific features and doesn't want to give up OSGi's convenience in the field of module management and service tracking.
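As an illustration, here is a self-contained sketch of the direct-connection part of this model, using plain Java RMI. The OSGi side - publishing and tracking the marker service that carries the endpoint data - is only hinted at in the comments, and the Echo interface, service name, and port are made up for the example:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

public class DirectRmi {
    // A hypothetical service interface, exposed through RMI rather than OSGi.
    public interface Echo extends Remote {
        String echo(String s) throws RemoteException;
    }

    static class EchoImpl extends UnicastRemoteObject implements Echo {
        EchoImpl() throws RemoteException { }
        public String echo(String s) { return "echo: " + s; }
    }

    public static void main(String[] args) throws Exception {
        // Server side: expose the implementation via RMI. In the model above,
        // the endpoint data (host, port, name) would additionally be published
        // as a marker service in the OSGi registry.
        Registry registry = LocateRegistry.createRegistry(38471);
        EchoImpl impl = new EchoImpl();
        registry.rebind("echo", impl);

        // Client side: having tracked the marker service and extracted the
        // endpoint data, connect directly with the RMI proxy.
        Echo proxy = (Echo) LocateRegistry.getRegistry("localhost", 38471).lookup("echo");
        System.out.println(proxy.echo("hi"));

        // Clean up so the JVM can exit.
        UnicastRemoteObject.unexportObject(impl, true);
        UnicastRemoteObject.unexportObject(registry, true);
    }
}
```

The point of the sketch is that the call goes through an RMI stub, not through an OSGi service proxy, so RMI-specific features remain fully accessible to the client.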

P.S. It worth noting that similar ideas have already appeared in the comments on Riena project goals. However, I am not sure about exact levels of transparency mentioned there.

Friday, February 6, 2009

Traps on the Way from OSGi to Distributed OSGi

OSGi is a powerful technology for application life-cycle management. It deals with configuration and change management and helps to build quality service-oriented systems. OSGi is a pretty popular framework with broad applicability and many mature implementations. The Eclipse IDE is maybe the most commonly used example of an application built with OSGi.

This sounds very good, but the current OSGi standard deals only with applications running in a single JVM. It is obvious that the latest trend of scaling everything out diminishes the value of this great technology to a large degree. So, it is no surprise that third-party frameworks built on top of OSGi have appeared on the horizon. Newton and R-OSGi are maybe the most well known. On the OSGi side, the latest drafts of the standard include a specification of Distributed OSGi, also known as RFC 119. As far as I know, there is no production-ready implementation of this RFC, but there have been successful demos, and the RFC itself is very mature at the moment. All of these approaches to distributing OSGi are really interesting and worth discussing, but for now I am not going to dive deep into them.

What concerns me at the moment is that OSGi is essentially just a framework for module (or component) management. When it deals with components in one JVM, everything is good. But moving to a distributed environment brings many new problems, like the need for a communication medium between components. The desire (or need) to implement yet another RMI as part of OSGi can be very strong. However, in my opinion, this should be avoided altogether. OSGi is not a communication technology - there are lots of them on the market, covering different areas and created for different purposes. Implementing a new one that covers all required ways of communication is hard. Distributed OSGi, on the contrary, should not cover this topic at all, dealing only with what OSGi does best - managing module lifecycles, but in a distributed fashion. Ideally, developers should have a free choice of the communication medium between modules, according to the requirements of the project.

This problem concerns not only me, and the good news is that the creators of RFC 119 seem to be on the right track. Other technologies for distributing OSGi are not so good from this point of view: R-OSGi introduces a custom communication medium, and Newton makes use of Jini and RMI. This is not exactly bad for them, because they may solve other problems better and they are ready to use right now, unlike RFC 119.

Actually, to compare these technologies correctly and objectively, additional research is required, and that is not the goal of this post. If somebody is interested in the topic, Hal Hildebrand's blog is a great source of interesting information about RFC 119, OSGi in general, and, in particular, a comparison of R-OSGi and Distributed OSGi.

Tuesday, February 3, 2009

Problems with Manifest in JAR

It is widely known that JAR files are built on the ZIP file format. So any jar file can be opened with a zip archiver, and zip tools can be used to create jar files. The only major difference is that jar files may have an optional META-INF directory with MANIFEST.MF and other files related to meta-information.

Everything looks very simple. However, today I faced a problem with finding the manifest file in a jar. I had created a zip file from a directory with a simple structure:

After the extension was changed to jar, the file was used by an external application, which tried to locate the manifest in it. It was a great surprise when the application was not able to find it. I then tried the jar utility to prepare the package. When I fed the resulting file to the application, it successfully found the manifest.

The situation was pretty strange - the contents of both archives were identical, and I knew that JAR files should not store any additional meta-information about manifests, but experience showed the opposite. There were some doubts about possibly buggy logic in the client application, which read my jars. I looked through its source code and found that it just uses the standard JarInputStream from java.util.jar.

However, when I investigated this class, I noticed one thing that could be the cause of such behavior: the logic responsible for finding the manifest in the JAR assumes that the META-INF directory is located at the beginning of the archive. I examined both of my archives, and it turned out that the one created with the zip tool placed this directory at the end, while the archive created with jar stored META-INF at the beginning, as expected.

So, if you still want to create jar files with zip tools, do not forget to place META-INF at the beginning of the archive. Also, do not forget to add two empty lines at the end of the manifest file, just in case.
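The behavior is easy to reproduce with a small sketch: build the same archive twice in memory with ZipOutputStream, once with the manifest as the first entry and once with it last, and check what JarInputStream.getManifest() returns (the entry names here are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.JarInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ManifestOrder {
    static void writeManifest(ZipOutputStream zos) throws IOException {
        zos.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
        zos.write("Manifest-Version: 1.0\r\n\r\n".getBytes("UTF-8"));
        zos.closeEntry();
    }

    static byte[] buildJar(boolean manifestFirst) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            if (manifestFirst) writeManifest(zos);
            zos.putNextEntry(new ZipEntry("com/example/App.class"));
            zos.closeEntry();
            if (!manifestFirst) writeManifest(zos); // what many zip tools do
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (boolean first : new boolean[] { true, false }) {
            try (JarInputStream jis =
                    new JarInputStream(new ByteArrayInputStream(buildJar(first)))) {
                // getManifest() only looks at the first entries of the stream.
                System.out.println("manifest first: " + first
                        + " -> found: " + (jis.getManifest() != null));
            }
        }
    }
}
```

With the manifest first, getManifest() returns a Manifest object; with the manifest at the end, it returns null, exactly matching the difference between the jar-built and zip-built archives.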

Thursday, January 29, 2009

On Architecture in IT

Definitions of architect and architecture:
An architect is a person who has a vision of how the system should be built.
Architecture is a specification describing how the system should be built.

Architecture checklist (the things an architect should, at least partially, care about):
  • Domain
  • External interfaces, including UI
  • Integration with other systems
  • Security
  • Performance
  • Scalability
  • Configuration and change management
  • Monitoring
  • Technology zoo