Tuesday, October 13, 2009

About MPI Send/Receive Deadlocks

In this post I will try to describe potential problems with the synchronous Send/Receive communication primitives from MPI and how to resolve them. In particular, I will talk about the MPI.NET implementation, but the lessons can be applied to any implementation.

Consider a simple MapReduce application. In this application the main process sends N messages to M available worker processes and waits for results. There are no cyclic dependencies and no reentrancy - just a flat model. So the first intention is to use the Send/Receive methods to organize message passing. To make things simple, assume there are only two processes (one main and one worker). So, the MPI program may look like the following:


using (var env = new Environment(ref args))
{
    var comm = Communicator.world;
    const int taskCount = 16;    // number of tasks to distribute
    const int msgLen = 1024;     // payload size in bytes; try 1 MB to reproduce the hang

    if (comm.Rank == 0)
    {
        // Main process: send all tasks first, then collect all results.
        for (var i = 0; i < taskCount; i++) comm.Send(new byte[msgLen], 1, 1);
        for (var i = 0; i < taskCount; i++) comm.Receive<byte[]>(1, 1);
    }
    else if (comm.Rank == 1)
    {
        // Worker process: receive a task, send back a result, repeat.
        for (var i = 0; i < taskCount; i++)
        {
            comm.Receive<byte[]>(0, 1);
            comm.Send(new byte[msgLen], 0, 1);
        }
    }
}


Surprisingly, with small messages this program works fine. However, setting msgLen to, for example, 1 MB makes the program hang. Actually, this is expected behavior, because the Send and Receive functions are supposed to be synchronous: the Send function should wait for the corresponding Receive in the other process. The biggest question is why it behaves differently for different message sizes. The answer comes from the MPI specification:

"The send call described in Section 3.2.1 uses the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver."

So, even if the Send function is considered synchronous, there are more details hidden inside it. Interprocess communication is not divided into just synchronous vs. asynchronous; there are several modes, and the MPI Send function covers two of them at once. It is left to the implementation to decide which mode to use in every concrete case. The MPI.NET documentation of the Send method says the following:

"The basic Send operation will block until this message data has been transferred from value. This might mean that the Send operation will return immediately, before the receiver has actually received the data. However, it is also possible that Send won't return until it matches a Receive<(Of <(T>)>)(Int32, Int32) operation. Thus, the dest parameter should not be equal to Rank, because a send-to-self operation might never complete."

So, in some cases the Send method is partially asynchronous, and these are the cases of small messages. A very interesting question is the threshold at which Send becomes fully synchronous, and whether this threshold is fixed or environment-specific. It is pretty hard to work out the exact algorithm, because MPI.NET delegates to native code. After some experiments, I found out that in my environment it is a bit less than 128,000 bytes of payload.
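
One way to probe this threshold is to time Send calls while the matching Receive is deliberately delayed: if Send returns almost immediately, the message was buffered; if it takes roughly as long as the delay, it waited for the receiver. Below is a minimal sketch of such an experiment; it assumes MPI.NET's Send/Receive/Barrier signatures as shown, so treat the exact overloads as an assumption and check the documentation of your version.

// Hedged sketch: find the payload size at which Send stops returning before
// a matching Receive is posted. Rank 1 delays its Receive; rank 0 times Send.
using System;
using System.Diagnostics;
using System.Threading;
using MPI;

class SendThresholdProbe
{
    static void Main(string[] args)
    {
        using (new MPI.Environment(ref args))
        {
            var comm = Communicator.world;
            const int delayMs = 2000;

            for (var msgLen = 1024; msgLen <= (1 << 20); msgLen *= 2)
            {
                if (comm.Rank == 0)
                {
                    var watch = Stopwatch.StartNew();
                    comm.Send(new byte[msgLen], 1, 0);
                    watch.Stop();
                    Console.WriteLine("{0} bytes: Send took {1} ms", msgLen, watch.ElapsedMilliseconds);
                }
                else if (comm.Rank == 1)
                {
                    Thread.Sleep(delayMs);          // post the matching Receive late
                    comm.Receive<byte[]>(0, 0);
                }
                comm.Barrier();                     // keep iterations independent
            }
        }
    }
}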

Fixing this sample is very easy and straightforward - just replace Send with ImmediateSend and wait for the asynchronous requests. However, it is important to remember not to rely on the partially asynchronous behavior of the Send primitive. Always test your MPI applications with different message sizes and read the documentation carefully.
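
For illustration, the main-process branch rewritten with non-blocking operations might look roughly like the following sketch (assuming MPI.NET's ImmediateSend/ImmediateReceive methods and Request.Wait; the worker branch can stay as it is):

// Sketch of the main process using immediate (non-blocking) operations, so
// correctness no longer depends on whether MPI buffers the outgoing messages.
// Requires: using System.Collections.Generic; using MPI;
if (comm.Rank == 0)
{
    var sends = new List<Request>();
    var receives = new List<ReceiveRequest>();

    for (var i = 0; i < taskCount; i++)
        sends.Add(comm.ImmediateSend(new byte[msgLen], 1, 1));
    for (var i = 0; i < taskCount; i++)
        receives.Add(comm.ImmediateReceive<byte[]>(1, 1));

    foreach (var request in sends) request.Wait();
    foreach (var request in receives) request.Wait();
}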

Saturday, August 1, 2009

Distributed Excel or Excel on the Grid

It is a well-known fact that Excel supports multithreaded workbook recalculation starting from version 2007. However, even before that, there was a possibility to execute a workbook in a distributed way. Distributed calculation of Excel workbooks is not well integrated with Office and is not a seamless process, but the possibility exists nonetheless. Since it is not integrated with Office, external applications and frameworks are required. Typically, these external applications are computational grids, like Microsoft HPC Server or Platform Symphony.

There are two approaches to executing an Excel workbook on the grid in a distributed fashion. The first approach is to use Excel as a UI for entering input data and displaying computation results. All computation logic is located in User Defined Functions (UDFs) in external XLL or automation add-in libraries. The libraries are implemented in any appropriate language and platform: for XLL this can be C or C++; for automation add-ins this can be anything that supports COM, like C++ or .NET/C#/VB/F#. The approach with UDFs is pretty straightforward to implement. The developer has an API to a grid and can easily implement the logic of splitting a big job into small tasks and submitting the tasks to the grid. To run a distributed computation, the user of the Excel worksheet just invokes the UDF with the job input parameters. This approach is good for the UDF developer, but rather inconvenient for the Excel worksheet user. The main disadvantages are limited flexibility, the lack of a generic framework, and the fact that the Excel/VBA user cannot implement the distributed computation herself, since it has to be written in C/C++/.NET.
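
To illustrate the first approach, here is a rough sketch of what such a UDF might look like in a .NET automation add-in. The IGridClient interface and its methods are purely hypothetical placeholders for a concrete vendor API (HPC Server, Symphony, etc.), and the COM registration details are simplified:

using System;
using System.Linq;
using System.Runtime.InteropServices;

// Hypothetical grid client - stands in for the vendor-specific API.
public interface IGridClient
{
    void SubmitTask(string taskName, double[] input);   // hypothetical call
    double[] WaitForResults();                           // hypothetical call
}

[ComVisible(true)]
[ClassInterface(ClassInterfaceType.AutoDual)]
public class GridAddIn
{
    private readonly IGridClient grid;

    public GridAddIn(IGridClient grid) { this.grid = grid; }

    // Called from a worksheet cell, e.g. =RunOnGrid(A1:A1000, 8).
    // Splits the input into chunks, submits each chunk as a grid task,
    // and aggregates the partial results.
    public double RunOnGrid(double[] inputs, int taskCount)
    {
        var chunkSize = (inputs.Length + taskCount - 1) / taskCount;
        for (var i = 0; i < taskCount; i++)
            grid.SubmitTask("simulate", inputs.Skip(i * chunkSize).Take(chunkSize).ToArray());

        return grid.WaitForResults().Sum();
    }
}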

Luckily, there is another approach, where grids support executing the Excel workbooks themselves on the cluster. In this case, a workbook contains both the UI and the computation logic in cell formulas or VBA. Almost all solutions support distributing a workbook by data. For example, assume there is a Monte Carlo simulation that accepts a number of iterations or a long list of values as input. The framework may split these values into equal parts, copy Excel documents with the partial inputs to the cluster, execute the documents on compute nodes of the grid, and return the results to the main document that initiated the computation. The advantage is obvious - the computation logic can be implemented in the Excel workbook itself with formulas or VBA macros. Unfortunately, there are no general-purpose frameworks that fully implement this approach. Existing solutions are implemented for some well-defined use cases, like parametric sweeps or simple Monte Carlo computations.

These two approaches are implemented by many computational and data grid vendors, which will be listed below. Most of the grids support only the first approach, because it simply doesn't require any special effort from grid vendors. They just create a couple of demos demonstrating distributed Excel with UDFs. Generally, this approach can be implemented on top of any grid, but only those with demos will be mentioned:

Microsoft HPC Server
There are several solutions from Microsoft to run distributed Excel on HPC. These implement both approaches discussed above.

The first approach with UDFs is represented by demos for sample models. There is no generic framework, because every user should implement her own UDF using her preferred language. As has been mentioned, custom UDFs have many advantages: effective parallelization, using the API of HPC Server directly, and rich programming environments (for example .NET instead of Excel's VBA). Examples of this approach include the AsianOptions sample Monte Carlo model, the ADAPTER demo (blog post, video, discussion), and HPC and Excel Services, but with UDFs (only marketing information without any technical description; maybe this is a description of another approach/framework).

The second approach, which allows computing Excel workbooks on the cluster, is represented by HPC Excel Services. To make this solution work, SharePoint Excel Services must be installed on every compute node of the cluster. Unfortunately, it seems it supports only parametric sweep-like models. This solution uses an Excel add-in to specify job information and input and output data mappings. HPC Excel Services supports only lists as input values for parallelization and only lists as output values. The documentation is pretty extensive, and the sample models are described very well. An interesting feature of the solution is that it supports submitting models from a SharePoint portal.

Platform Symphony
Platform supports both approaches as well - using UDFs, which implement the actual parallelization logic, and computing actual Excel workbooks on compute nodes. There is no general-purpose framework for this, and everything should be customized for the concrete Excel model. There is a very good demo describing this approach. A sample Excel workbook with a description can be found here.

GigaSpaces XAP
GigaSpaces supports Excel models in a very interesting fashion. It supports running only UDFs, but since GigaSpaces is an in-memory data grid more than an HPC solution, the approach is oriented more toward continuous computations than toward batch job processing. Excel is treated as a UI for the GigaSpaces grid. All interaction with the grid and all parallelization work is done by UDFs. This approach resembles Excel Real-Time Server functionality. A demo called "Excel that scales" demonstrates this approach. Obviously, the orientation toward sessions and continuous computations doesn't prevent users from implementing UDFs that submit batch jobs.

DataSynapse
DataSynapse also supports Excel in some form, but there is not much information about it. Most likely, this is a demo of the first approach.

Conclusion
There is high demand for running heavy Excel calculations, so it is no surprise that many grid vendors support distributed Excel to some extent. There are several approaches to running Excel workbooks on a grid, from using Excel just as a UI and accessing the grid via an API from UDFs, to calculating actual workbooks with VBA code on compute nodes. Unfortunately, there is still no general-purpose, production-ready framework for seamlessly parallelizing workbooks with VBA, but there is hope that it will appear sooner or later.

Monday, March 23, 2009

Reflexil.NET - Ultimate Win in the Fight with Third-Party Libraries

It is a pretty usual situation that third-party libraries do not work as expected. And it is also usual that the most annoying artifacts are very small and easy to fix. However, without access to the source code, it is not possible to fix them. Reflection can help in some situations, but not always. The common solution is to ask the library developers to fix the issue. However, that can take a lot of time, and sometimes these "defects" are not actually defects and exist by design, to keep users from shooting themselves in the foot. Such restrictions may be good, but that does not make them less disappointing. After all, in some cases users know better, and in those cases such restrictions become defects.

Recently, I found such a restriction in the SDK for Microsoft HPC Server. The SDK does not allow the developer to specify custom credentials when connecting to the cluster. The IScheduler.Connect method just uses the Windows credentials of the connecting process and throws a "The server has rejected the client credentials." exception if anything goes wrong. This means that a developer cannot easily experiment with the cluster from a remote development machine unless she is in the same domain. The issue is described in this topic. Unfortunately, in my situation, the answer did not work. The prospect of deploying applications to the cluster after each change and debugging them remotely was not very appealing. So I hoped to solve this issue somehow.

The first thing I decided to investigate was how the SDK connects to the cluster. With the help of .NET Reflector that was easy. It turned out that the SDK uses .NET Remoting with a TCP channel. The only thing that separated me from connecting with custom credentials was adding them to the channel sink properties:

Pic.1. Original assembly - connecting with process credentials.

It was really disappointing - so close to painless remote access.

Everybody knows about Reflector, but a relatively small number of people know about a brilliant Reflector plug-in called Reflexil. This great tool allows developers to modify assemblies in a very easy and straightforward way. So, with a bit of luck, here it is:

Pic.2. Modified assembly - connecting with custom credentials.
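
In code terms, the patch amounts to something like the sketch below. This is the standard .NET Remoting way of attaching credentials to a channel sink, not the exact decompiled method from the SDK; "scheduler" stands for the SDK's internal remoting proxy, and the hard-coded values are only for experiments:

using System.Collections;
using System.Runtime.Remoting.Channels;

// Attach explicit credentials to the remoting proxy instead of relying on
// the credentials of the connecting process.
IDictionary sinkProperties = ChannelServices.GetChannelSinkProperties(scheduler);
sinkProperties["username"] = "clusteruser";   // hard-coded for the experiment only
sinkProperties["password"] = "secret";
sinkProperties["domain"] = "CLUSTERDOMAIN";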

A couple of lines of code, and remote access with custom credentials worked perfectly. Of course, this version of the library should not be used in production, but at least a lot of pain at the development stage went away.

P.S.: A good Reflexil tutorial with pictures can be found on CodeProject.
P.P.S.: It took me another 20 minutes to get rid of the hardcoded password and to change the IScheduler.Connect method to accept credentials. Reflexil is indeed very easy to use.

Friday, February 13, 2009

Distributed OSGi: Levels of Transparency

At the moment, RFC 119 defines 4 levels of transparency of Distributed OSGi for a developer. They range from completely transparent behavior, where no code should be changed to use a remote OSGi container, to the possibility of handling distribution-software-specific exceptions. All of them, if I understood them correctly, can be described in the following way:

Picture 1. Transparency levels 1-4

There is no doubt that in most cases these levels will satisfy all developer needs. However, since the distributed communication technology is hidden from the developer, such a model may have a number of limitations. For example, it will be hard to make asynchronous method calls or to use communication-technology-specific features.

So, it may be worth considering support for less transparent models. For example, one where the client can obtain a remote endpoint or a reference to the proxy created by the distribution software:

Picture 2. Transparency level 5

This model can also be extended to the level where the OSGi proxy on the client side is created by the OSGi service developer, not the framework. The current version of Riena may serve as an example of this case. Upd.: Newton, and hence Infiniflow, also support writing a kind of custom proxy.

Moreover, a fully non-transparent model can also be useful:

Picture 3. Transparency level 7

In this model, the service is exposed not through OSGi, but through the chosen distributed communication technology, like RMI. This service can also be registered in the OSGi service registry, but it can't be used as an OSGi service. Every client that wants to use it can track this service, and when it comes online, extract the required data and establish a direct connection to the service using the communication-framework-specific proxy.

This model can suit legacy applications, where components already exist and communicate with each other using some distributed communication framework. The model can also be useful when the developer wants to use some communication-framework-specific features and doesn't want to give up the convenience of OSGi in the field of module management and service tracking.

P.S. It is worth noting that similar ideas have already appeared in the comments on the Riena project goals. However, I am not sure about the exact levels of transparency mentioned there.

Friday, February 6, 2009

Traps on the Way from OSGi to Distributed OSGi

OSGi is a powerful technology for application life cycle management. It deals with configuration and change management and helps to build quality service-oriented systems. OSGi is a pretty popular framework with broad applicability and many mature implementations. Eclipse IDE is maybe the most commonly used example of an application built with OSGi.

This sounds very good, but the current OSGi standard deals only with applications running in a single JVM. It is obvious that the latest trend of scaling everything out diminishes the value of this great technology to a large degree. So, there is no surprise that third-party frameworks built on top of OSGi have appeared on the horizon. Newton and R-OSGi are maybe the most well-known ones. From the OSGi side, the latest drafts of the standard include the specification of Distributed OSGi, also known as RFC 119. As far as I know, there is no production-ready implementation of this RFC, but there have been successful demos, and the RFC itself is very mature at the moment. All of these approaches to distributing OSGi are really interesting and worth discussing, but for now I am not going to dive deep into them.

The point I want to make at the moment is that OSGi is essentially just a framework for module (or component) management. When it deals with components in one JVM, everything is good. But moving to a distributed environment will bring many new problems, like the need for a communication medium between components. The desire (or need) to implement yet another RMI as a part of OSGi can be very strong. However, in my opinion, this should be avoided altogether. OSGi is not a communication technology - there are a lot of those on the market, covering different areas and created for different purposes. Implementing a new one that covers all required ways of communication is hard. Distributed OSGi, on the contrary, should not cover this topic at all, dealing only with what OSGi does best - managing module lifecycles, but in a distributed fashion. In the ideal situation, developers should have a free choice of the communication medium between modules, according to the requirements of the project.

This problem concerns not only me, and the good news is that the creators of RFC 119 seem to be on the right track. Other technologies for distributing OSGi are not so good from this point of view: R-OSGi introduces a custom communication medium, and Newton makes use of Jini and RMI. This is not exactly bad for them, because they may solve other problems better, and they are ready to use right now, unlike RFC 119.

Actually, to compare these technologies correctly and objectively, additional research is required, and that is not the goal of this post. If somebody is interested in the topic, Hal Hildebrand's blog is a great source of interesting information about RFC 119, OSGi in general and, in particular, a comparison of R-OSGi and Distributed OSGi.

Tuesday, February 3, 2009

Problems with Manifest in JAR

It is widely known that JAR files are built on the ZIP file format. So any jar file can be opened with a zip archiver, and zip tools can be used to create jar files. The only major difference is that jar files may have an optional META-INF directory with MANIFEST.MF and other files related to meta-information.

Everything looks very simple. However, today I faced a problem with finding the manifest file in a jar. I had created a zip file from a directory with the following simple structure:
-META-INF
--MANIFEST.MF
-A.class
-B.class

After the extension was changed to jar, this file was used by an external application, which tried to locate the manifest in it. It was a great surprise when the application was not able to find it. Then I tried using the jar utility to prepare the package. When I pushed the resulting file to the application, it successfully found the manifest.

The situation was pretty strange - the contents of both archives were identical, and I knew that JAR files should not store any additional meta-information about manifests, but experience showed the opposite. There were some doubts about possibly buggy logic in the client application which read my jars. I looked through its source code and found that it just uses the standard JarInputStream from java.util.jar.

However, when I investigated this class, I noticed one thing that could be the cause of such behavior: the logic responsible for finding the manifest in the JAR assumes that the META-INF directory is located at the beginning of the archive. I examined both of my archives, and it turned out that the one created with the zip tool placed this directory at the end, while the archive created with jar stored META-INF at the beginning, as expected.

So, if somebody still wants to create jar files with zip tools, do not forget to place META-INF at the beginning of the archive. Also, do not forget to add two empty lines at the end of the manifest file, just in case.

Thursday, January 29, 2009

On Architecture in IT

Definitions of architect and architecture:
An architect is a person who has a vision of how the system should be built.
An architecture is a specification describing how the system should be built.

Architecture checklist (the things an architect should, at least partially, care about):
  • Domain
  • External interfaces, including UI
  • Integration with other systems
  • Security
  • Performance
  • Scalability
  • Configuration and change management
  • Monitoring
  • Technology zoo

Wednesday, January 28, 2009

ICE - Object Grid

Grid computing was born to solve extreme problems using the combined power of many servers. Later it came to the enterprise in two major forms: computational grids and in-memory data grids. The former were aimed at solving heavy computational problems; the goal of the latter was to provide fast and convenient access to large amounts of data by storing it in memory.

But it turned out that people often needed to solve complex computational problems which required fast access to a large amount of data, so they needed to combine these technologies. This was pretty clear to grid providers, so many popular compute grids, like GridGain and DataSynapse, provided functionality to store data in memory in a distributed way. On the other side, many popular data grids, like Oracle Coherence and GigaSpaces, provided features to run parallel computations. However, each still played its own role better: a compute grid had better functionality for running distributed computations than a data grid, and vice versa. In some cases this problem was solved by maintaining two grid installations: a compute grid, running the parallel computations, used a data grid, where the data for these computations was stored.

Anyway, one problem remained: a computation may run on a server different from the server where the data for this computation is stored. Each vendor tries to solve this problem by providing its own data-aware routing techniques. When multiple grid tools are used, this requires additional effort.

Recently, one interesting framework, called ICE, has appeared on the horizon. It is a pretty general framework; in a couple of words, it looks like CORBA on steroids. Based on this tool, there is a grid solution called IceGrid. This grid stores data and computations in the same place in the form of objects (remember, this is just like CORBA). From the documentation it looks like load balancing, replication, and other important grid-related features are in place. This product also has a number of significant installations in real heavily loaded and highly scalable environments. So, at the very least, it is worth learning.

I do not think that the object-oriented approach is a killer feature for the world of large-scale systems. For many heavy tasks it is better and clearer to have logic and data separated. But maybe for some tasks IceGrid's approach will be better.

Tuesday, January 27, 2009

JMX for .NET

JMX is a Java technology which enables management and monitoring of Java applications. JMX is widely supported among software vendors. Many Java frameworks, including almost all application servers, provide access to monitoring and management information via JMX, both as providers and consumers. JMX consumer tools include JConsole, Hyperic HQ, and Zenoss. But the main advantage of this technology is the simplicity of using it to manage and monitor a custom application.

Having this great technology in Java, it is not unusual to hear the question of which technology in .NET provides the same functionality. Generally, there is no exact .NET clone of JMX; however, there are some technologies which can be used as an equivalent of JMX in the .NET world. These technologies include Windows Management Instrumentation (WMI), Performance Counters, and the .NET Profiling API. To be honest, the first two technologies belong to Windows rather than to .NET specifically, but they can be used from .NET too, as usually happens. Let's have a look at these tools:

.NET Profiling API
Unlike JMX, which covers both monitoring and management, the .NET Profiling API deals only with monitoring. The API is pretty complex, but this complexity pays off, because it allows developers to track every moment of an application's life. Obviously, it is not a full equivalent of JMX in .NET, but it will cover use cases where fine-grained and extensive monitoring is required.

Performance Counters
Performance Counters also deal only with monitoring application performance. Each counter is registered globally in Windows and can be used by applications which fill the counter with performance information and by applications which track this information. Consumers of performance counters include Performance Monitor (similar to JConsole to some extent), Hyperic HQ with the required plugins, etc. If the developer's main goal is monitoring, Performance Counters can be freely used as an equivalent in .NET applications.
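
For reference, publishing a custom counter from a .NET application takes only a few lines. The category and counter names below are just examples, and creating a category requires administrative rights:

using System.Diagnostics;

// Register a custom counter category once, then publish values from the application.
if (!PerformanceCounterCategory.Exists("MyApp"))
{
    var counters = new CounterCreationDataCollection
    {
        new CounterCreationData("Requests processed",
            "Total number of requests handled by MyApp",
            PerformanceCounterType.NumberOfItems64)
    };
    PerformanceCounterCategory.Create("MyApp", "MyApp counters",
        PerformanceCounterCategoryType.SingleInstance, counters);
}

var requests = new PerformanceCounter("MyApp", "Requests processed", false /* writable */);
requests.Increment();   // consumers like Performance Monitor will see the new value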

Windows Management Instrumentation
WMI is a Microsoft technology used for monitoring and management of devices and applications running on Windows. Of the previous options, WMI resembles JMX the most. It looks a bit more complex than JMX from the architecture and .NET end-user points of view, but it also provides more features. WMI was initially based on COM, so the first implementation for .NET was pretty complex in terms of WMI provider development, and the entire functionality was limited to monitoring. However, the WMI provider extensions in .NET 3.5 removed these limitations and made writing a WMI provider easier. Like JMX, WMI is used by many monitoring tools, so it can be treated as almost equal to JMX in the Windows and .NET environment.
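
With the .NET 3.5 provider extensions (System.Management.Instrumentation), exposing management data can look roughly like the sketch below. The attribute and method names are given as I recall them, so verify them against the documentation; the namespace, class, and values are just examples:

using System.Management.Instrumentation;

[assembly: WmiConfiguration(@"root\MyApp", HostingModel = ManagementHostingModel.Decoupled)]

[ManagementEntity]
public class QueueStatus
{
    [ManagementKey]
    public string Name = "orders";

    [ManagementProbe]                     // read-only value visible to WMI consumers
    public int Length;
}

public static class QueueStatusPublisher
{
    public static void Publish(QueueStatus status)
    {
        InstrumentationManager.Publish(status);   // make the instance available via WMI
    }
}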

Custom Implementation
The fourth option can suit developers who use JMX in their applications but don't use other JMX-enabled tools. This can happen when both the JMX provider and the consumer applications are home-grown and the power of JMX being a standard is not exploited. In this case, custom objects exposed through WCF, Remoting, ASP.NET web services, or other communication means can be used.
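
For this home-grown case, a management endpoint can be just an ordinary WCF service; a rough sketch (all names are examples) might look as follows:

using System.ServiceModel;

[ServiceContract]
public interface IManagementService
{
    [OperationContract]
    int GetQueueLength();             // example monitoring operation

    [OperationContract]
    void SetLogLevel(string level);   // example management operation
}

// Hosting inside the application being managed might look like:
// var host = new ServiceHost(typeof(ManagementService), new Uri("net.tcp://localhost:8000/management"));
// host.Open();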

So, if a developer migrates from Java to .NET and is searching for an equivalent of JMX, he has a number of options. The exact choice will, as always, depend on the concrete use case.

Friday, January 23, 2009

Getting Started with .NET Profiling API

In a couple of words, the .NET Profiling API allows writing a program which can monitor another CLR program's execution. Tools like JetBrains dotTrace and similar ones can be written using this API.

There is a lot of detailed information about the .NET Profiling API on the Internet, so I will just try to list the existing sources.
  1. .NET Internals: The Profiling API. An introduction to monitoring .NET applications using performance counters and the .NET 1.0 Profiling API. Contains theory and source code samples. Unfortunately, the samples are written in Delphi and there is no ready-to-compile-and-run project.
  2. Under the Hood: .NET Profiling API and the DNProfiler Tool. A little theory with a couple of practical hints on implementing a profiler with version 1.0 of the API. Unfortunately, the link to DNProfiler is broken.
  3. Inspect and Optimize Memory Usage with the .NET Profiler API. An article about memory monitoring with a sample application and source code, written with version 1.0 of the API. The sample builds well, but doesn't work in my environment, maybe because I tried to build the samples with .NET 2.0 while the profiler was implemented against version 1.0 of the profiling API.
  4. Creating a Custom .NET Profiler. A couple of hints with a working profiler sample in compiled and source code form. The profiler is written using the .NET 2.0 version of the API.
  5. No Code Can Hide from the Profiling API in the .NET Framework 2.0. A description of the new features in version 2.0 of the API and a working piece of sample code.
  6. Profiler Stack Walking in the .NET Framework 2.0: Basics and Beyond. A piece of theory about the Profiling API without a working sample. Version 2.0 is covered.
  7. Rewrite MSIL Code on the Fly with the .NET Framework Profiling API, Modifying IL at Runtime, Modifying IL at Runtime (step II), Modifying IL at Runtime (step II+), Modifying IL at Runtime (step III). A set of articles on using the .NET Profiling API to modify code at runtime. Covers version 1.0 of the API.