Saturday, August 1, 2009

Distributed Excel or Excel on the Grid

It is a well known fact that Excel supports multithreading workbook recalculation from the version 2007. However, even at an earlier time, there was a possibility to execute a workbook in a distributed way. Distributed calculation of Excel workbooks is not well integrated with Office and is not a seamless process, but the possibility still exists. Since it is not integrated with Office, external applications and frameworks are required. Typically, these external applications are computational grids, like Microsoft HPC Server or Platform Symphony.

There are two approaches to execute Excel workbook on the grid in a distributed fashion. One approach is to use Excel as a UI for entering input data and displaying computation results. All computation logic is located in User Defined Functions (UDF) in external XLL or automation add-in libraries. Libraries are implemented in any appropriate language and platform. For XLL this can be C or C++, for automation add-ins this can be anything that supports COM, like C++ or .NET/C#/VB/F#. The approach with UDFs is pretty straitforward to implement. Developer has an API to a grid and she can easilly implement a logic of splitting big job into small tasks and submitting the tasks to a grid. To run a distributed computation, user of Excel worksheet just invokes the UDF with job input parameters. This approach is good for UDF developer, but very inconvinient for Excel worksheet user. The main disadvantage is narrow flexibility, lack of generic framework and inability to implement distributed computation in C/C++/.NET for Excel/VBA user.

Luckily, there is another approach, when grids support executing Excel worksheets themselves on the cluster. In this case, a workbook contains both UI and computation logic in cell formulas or VBA. Almost all solutions support distributing a workbook by data. For example, assume there is a MonteCarlo simulation that accepts a number of iterations or a long list of values as an input. Framework may split these values into an equal parts, copy Excel documents with partial inputs to a cluster, execute documents on compute nodes on a grid, and return results to a main document, which initiated computation. Advantage is obvious - computation logic can be implemented in Excel workbook itself with formulas or VBA macros. Unfortunately, there are no generic-purpose frameworks that fully implement this approach. Existing solutions are implemented for some well-determined use cases, like parametric sweep or simple MonteCarlo computations.

These two approaches are implemented by many computational and data grid vendors, which will be listed below. Most of the grids support only the first approach, because it simply doesn't require any special effort from grid vendors. They just create a couple of demos demonstrating distributed Excel with UDFs. Generally, this approach can be implemented on top of any grid, but only those with demos will be mentioned:

Microsoft HPC Server
There are several solutions from Microsoft to run distributed Excel on HPC. These implement both approaches discussed above.

The first approach with UDFs is represented by demos for sample models. There is no generic framework, because every user should implement her own UDF using preferred language. As it has been mentioned, custom UDFs have many advantages: effective parallelization, using API of HPC Server directly and rich programming environments (for example .NET instead Excel's VBA). Examples of this approach include AsianOptions sample Monte-Carlo model, ADAPTER demo (blog post, video, discussion), HPC and Excel Services, but with UDFs (only marketing information without any technical description. Maybe this is a description of another approach/framework).

The second approach, which allows computing Excel workbooks on the cluster, is represented by HPC Excel Services. To make this solution work, SharePoint Excel Services must be installed on every compute node on the cluster. Unfortunately, it seems it supports only parametric sweep-like models. This solution uses Excel add-in to specify job information, input data, and output data mappings. HPC Excel services support only lists as input values for parallelization and only lists for output values. Documentation is pretty extensive, sample models are described very well. Interesting feature of the solution is that it supports submitting models from SharePoint portal.

Platform Symphony
Platform supports both approaches as well - using UDFs, which implement actual parallelization logic, and computing actual Excel workbooks on compute nodes. There is no generic-purpose framework for this and everything should be customized for concrete Excel model. There is a very good demo describing this approach. Sample Excel workbook with description can be found here.

GigaSpaces XAP
GigaSpaces supports Excel models in a very interesting fashion. It supports running only UDFs, but since GigaSpaces is an in memory data grid more than HPC solution, the approach is oriented more on continuous computations rather than on batch jobs processing. Excel is considered like a UI for GigaSpaces grid. All interaction with the grid and all parallelization work is being done by UDFs. This approach resembles Excel Real-Time Server functionality. Demo called Excel that scales demonstrates this approach. Obviously, orientation on sessions and continuous computations doesn't limit users from implementing UDFs with batch jobs.

DataSynapse
DataSynapse also supports Excel somehow, but there is no much information about it. Most likely, this is a demo of the first approach.

Conclusion
There is a high demand in running heavy Excel calculations and there is no surprise that many grid vendors support distributed Excel to a some extent. There are several approaches to run Excel workbooks on a grid, from using Excel just as UI and accessing a grid via API from UDFs to calculating actual workbooks with VBA code on compute nodes. Unfortunately, there is still no generic-purpose production ready framework for seamless parallelizing workbooks with VBA, but there is a hope that it will appear sooner or later.

No comments: